High performance computing has entered the Exascale Age. Capable of performing over 10^18 floating point operations per second, exascale computers, such as El Capitan, the National Nuclear Security Administration's first, have the potential to revolutionize the detailed, in-depth study of highly complex science and engineering systems. However, in addition to these kinds of whole-machine “hero” simulations, exascale systems could also enable new paradigms in digital design by making petascale hero runs routine. Currently untenable problems in complex system design, optimization, model exploration, and scientific discovery could all become possible. Motivated by the challenge of uncovering the next generation of robust high-yield inertial confinement fusion (ICF) designs, project ICECap (Inertial Confinement on El Capitan) attempts to integrate multiple advances in machine learning (ML), scientific workflows, high performance computing, GPU acceleration, and numerical optimization to prototype such a future. Built on a general framework, ICECap is exploring how these technologies could broadly accelerate scientific discovery on El Capitan. In addition to our requirements, system-level design, and challenges, we describe some of the key technologies in ICECap, including ML replacements for multiphysics packages, tools for human-machine teaming, and algorithms for multifidelity design optimization under uncertainty. As a test of our prototype pre-El Capitan system, we advance the state of the art for ICF hohlraum design by demonstrating the optimization of a 17-parameter National Ignition Facility experiment and show that our ML-assisted workflow makes design choices consistent with physics intuition, but in an automated, efficient, and mathematically rigorous fashion.
I. INTRODUCTION
It is a common feature of modern science and engineering to study complex systems with many parameters, many variables, many knowns, and many unknowns. It is also increasingly common to use computer models to study such systems. These models can vary in complexity from simple analytic equations, explored via digital notebooks, to integrated multiphysics models, requiring the use of high performance computing (HPC).
As compute power has grown, accelerated by advances in both hardware (e.g., graphics processing units, GPUs) and software (e.g., high-order methods for solving partial differential equations), so too has the capacity to simulate increasingly complex systems. Leadership HPC systems are entering the Exascale Age, with the current top computer on the Top500 list (Frontier) able to perform 1.68 × 10^18 floating point operations per second, or 1.68 ExaFLOPs.1 Additional machines are following suit: for instance, El Capitan, the National Nuclear Security Administration's first exascale system, is projected to top 2 ExaFLOPs2 when fully deployed.
A fundamental question, however, is how best to use such vast compute resources for practical science and engineering.
On one side, researchers could continue to model increasingly complex systems, with higher resolutions and additional physics. However, as the complexity (and fidelity) of these models increases, so often does the number of parameters in them: to model more of the world requires modeling more processes, each of which comes with its own settings that must be determined, scanned over, or optimized. The problem is that as the number of parameters to study grows, so does the computational cost of running the model, which makes it harder to perform the very studies, scans, and optimizations for which the complex model was built.
Conversely, one could accept simplified versions of these models and instead of running only a handful of heroic full-system simulations, perform large studies with models that are “good enough” to infer the behavior of reality, for instance, to identify model trends for comparison to experiment, or to digitally engineer an optimized system. In this world view, one can leverage not the depth afforded by exascale systems, but the breadth. The “heroic” calculations of the petascale are the “common” calculations of the exascale. If the petascale heroic models were “necessary” for detailed study of complex systems, like fusion experiments, it stands to reason that they would be good enough to perform high-fidelity design studies at the exascale.
However, what if the petascale models are not good enough to capture everything? What if we still need to use the entire exascale system to run one simulation? Are we stuck between having to choose between running a few high-fidelity models or several low-fidelity models?
There may be a way forward, thanks to advances in machine learning (ML) and artificial intelligence (AI).3
First, it could be possible to replace embedded subroutines within a multiphysics code with learned models that sacrifice some model fidelity for speed. These “in-the-loop” speedups might mean that one could execute high-fidelity-like models at reduced cost. (Of course, such models bring with them challenges like trustworthiness and physicality that must be overcome for practical applications.4)
Additionally, machine learning models could help with the “out-of-the-loop” problem of deciding which simulations (parameters and fidelities) to perform to achieve an objective. In other words, they could help answer the question of “Do I need a hero run here, or would a simpler model show me the same result?” Contemporary machine learning models can efficiently learn the behavior of complex systems from data of different fidelities. For instance, one could perform a “warm start” by training an algorithm first on lots of synthetic data generated in a low-fidelity “gym” and then retraining it with fewer observations of high-fidelity real-world data.5 Or one could treat the model fidelity as a model parameter itself and produce a machine learning model that predicts the response of a system at arbitrary fidelity (that is, it has fidelity as a knob that can be adjusted).
Modern black-box optimization algorithms can use combinations of observations at different fidelities to efficiently optimize a high-fidelity response surface. Since many problems in science and engineering can be recast as optimization problems (e.g., calibrating to experimental observations, searching and scanning for a particular response), multifidelity design optimization is an exciting application of exascale computing. Instead of having to choose between running lots of simple models or a few complex models, one could instead do both and take a multifidelity approach to digital design.6,7 This middle ground allows for the combination of low and high fidelity models to efficiently march toward an objective.
Additionally, it could be possible to perform optimization under uncertainty since exascale might allow the breadth of compute needed to quantify the uncertainty in model performance (e.g., via Monte Carlo sampling about a design point).
One complex system that could benefit from a multifidelity exascale design under uncertainty is that of nuclear fusion. As an example, an inertial confinement fusion (ICF) experiment,8 such as those performed at the National Ignition Facility (NIF),9 requires the definition of at least a few dozen parameters (e.g., laser powers and target geometries). The system is highly non-linear, since the fusion ignition process is a runaway reaction, and inherently multiphysical, involving laser–matter interactions, radiation hydrodynamics, atomic physics, thermodynamics, and nuclear physics, to name a few.10,11 Low-fidelity models capture some of the trends seen in experiments, but not all of them. The highest-fidelity simulations of ICF experiments can take petascale resources. Furthermore, with the recent achievement of ignition,12 the community is pushing toward the next goal of robust high-yield. For both energy and stockpile stewardship applications, ICF platforms need to be able to reliably deliver high yield, even when there exists uncertainty in system reproducibility, target manufacturability and model settings.
As such, there has been significant work in ICF on a number of technologies relevant for exascale design. These include surrogate-based design optimization,13 single-fidelity iterative optimization,14 two-step multifidelity optimization,15 iterative multifidelity optimization,16 iterative large-scale HPC workflows,17 and “in-the-loop” replacement of multiphysics packages with learned AI models.18,19
In this manuscript and in anticipation of El Capitan's arrival, we describe an effort to combine several of these independent technologies into a single unified framework for exascale design in general, and robust ICF design in particular. In relation to prior art, the novelty herein lies in the integration of an entire system that could function robustly at exascale. We take a systems-level approach to exploit and build off these works, in the process anticipating, uncovering, and addressing many of the challenges in reducing these technologies to practice at scale. In other words, technologies developed in isolation and tested on hundreds to thousands of datapoints might not perform well when integrated into a system designed for a million simulations. As one example, Ref. 16 developed a Gaussian Process multifidelity algorithm and tested its performance on hundreds of simulations of an ICF test problem. While it showed improvement over a single-fidelity approach, it is unlikely to work with the millions of real simulations possible on exascale systems, because the algorithm is based on a standard Gaussian Process model, whose training requires a dense matrix solve that scales as O(N^3) and becomes untenable once the number of simulations, N, exceeds a few thousand. Actually using an iterative multifidelity algorithm like that proposed in Ref. 16 for exascale design therefore necessitates at least some modification.
In this work, we take a systems-level approach to the problem of exascale design optimization, examining and integrating many of the key technologies into a unified framework, capable of performing at exascale. We also test and demonstrate the feasibility of our system on a 17-dimensional multifidelity hohlraum design problem, a record-sized parameter space for a hohlraum optimization study.
The rest of the paper is outlined as follows. Section II describes both system-level requirements for exascale design and specific requirements for ICF. Section III dives into key technologies under development that build off state-of-the-art ICF design methods. Section IV looks toward El Capitan and discusses some anticipated challenges and risk mitigation strategies, and we conclude in Sec. V.
II. REQUIREMENTS FOR EXASCALE DESIGN
In this section, we overview some high-level requirements for exascale design workflows in general, and for ICF in particular.
In its simplest form, an automated design workflow is an iterative workflow that first runs a batch of simulations on an HPC system, processes the raw data from those simulations into a relevant quantity of interest (QOI), and uses the aggregation of those QOIs to select new simulations to run. In many cases, a centralized database collects the QOIs, and the workflow might leverage an ML model to suggest the next batch (a minimal sketch of this loop is shown below). Peterson et al.17 elaborate on the specific requirements to execute this at exascale, but in summary, these design workflows need to operate in an HPC environment, handle and process data from hundreds of thousands to millions of simulations, support both on-node and concurrent analysis, and work across multi-machine and multi-batch-slot allocations with multiple executable types. They also need to effectively and efficiently manage heterogeneous systems (e.g., CPUs and GPUs).
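The sketch below illustrates that loop in its simplest form. The function names and the random "simulations" are hypothetical placeholders standing in for the HPC launcher, the post-processing step, and the ML suggestion model; they are not ICECap APIs.

```python
# Minimal sketch of the iterative design loop described above.
# run_batch, extract_qoi, suggest_batch, and db are illustrative placeholders.
import random

def extract_qoi(raw_result):
    """Reduce a simulation's raw output to a scalar quantity of interest."""
    return raw_result.get("yield", 0.0)

def run_batch(designs):
    """Stand-in for launching a batch of simulations on the HPC system."""
    return [{"design": d, "yield": random.random()} for d in designs]

def suggest_batch(history, batch_size):
    """Stand-in for an ML model proposing the next designs to run."""
    return [[random.uniform(0.0, 1.0) for _ in range(3)] for _ in range(batch_size)]

db = []  # centralized store of (design, QOI) records
designs = suggest_batch(db, batch_size=4)  # initial (random) batch
for iteration in range(10):
    results = run_batch(designs)                                 # 1. run simulations
    db.extend((r["design"], extract_qoi(r)) for r in results)    # 2. process and store QOIs
    designs = suggest_batch(db, batch_size=4)                    # 3. select the next batch
```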
In addition to these infrastructure-level requirements, exascale design workflows need to be able to coordinate and execute simulations of different fidelities (and match the available resources accordingly) and efficiently search high-dimensional parameter spaces for both constrained and unconstrained optimization. To promote trustworthiness, the final design should be based on actual simulation results (at the highest fidelity possible), instead of an approximation (e.g., the output of a surrogate model). Finally, the large data volumes and degree of automation necessitate a “human-near-the-loop” monitoring and control system that efficiently pushes relevant data for consumption by the design subject matter expert, e.g., regarding the best design found, the spaces searched, and the resources consumed.
The fundamental challenge is that we need our digital designs to represent the state-of-the-art simulations, but we cannot afford to optimize directly with a state-of-the-art model. The design spaces are moderately large (perhaps dozens of parameters) and the models are expensive, effectively black box simulations that can require several hours to several days to complete. Leveraging lower-fidelity models to bootstrap the design search might take hundreds of thousands to millions of simulations.
As a concrete example, we consider the optimization of NIF ICF experiments. With the recent achievement of ignition,12 the physics community is turning toward the design challenge of robust high yield: a search for designs that can, with high confidence and repeatability, ignite and produce tens of MJ of energy. Designing one such experiment requires the engineering definition of at least dozens of parameters. Table I lists a simplified set of parameters needed to define a variation about NIF shot N210808 (the first experiment to produce over 1 MJ of energy).
The minimum 27 parameters (15 laser, 8 capsule, and 4 hohlraum) needed to define an indirect drive NIF ICF experiment (more complicated design studies could have more). Also shown are the permissible (physically legal) values for those parameters, along with the nominal (non-optimized) settings for a design similar to NIF shot N210808. Permissible ranges are larger than those typically chosen for design studies, but an interesting test of the ICECap system will be how well it can narrow that search without pre-conditioned bias.
Parameter name | Description | Permissible values | Nominal (non-optimized) value
---|---|---|---
picket-power | Laser picket: power (TW) | 0–100 | 50.4 |
picket-length | Laser picket: length (ns) | 0–5 | 1.4 |
picket-cone-fraction | Laser picket: inner cone fraction | 0–1 | 0.1 |
trough-rise | Laser trough: rise time to power (ns) | 0.3–3 | 0.4 |
trough-power | Laser trough: power (TW) | 0–100 | 30 |
trough-length | Laser trough: length (ns) | 0–5 | 0.648 |
trough-cone-fraction | Laser trough: inner cone fraction | 0–1 | 0.155 |
second-rise | Laser second pulse: rise time to power (ns) | 0.3–3 | 0.835 |
second-power | Laser second pulse: power (TW) | 0–100 | 80 |
second-length | Laser second pulse: length (ns) | 0–5 | 0.462 |
second-cone-fraction | Laser second pulse: inner cone fraction | 0–1 | 0.14 |
third-rise | Laser third pulse: rise time to power (ns) | 0.3–3 | 1.60 |
third-power | Laser third pulse: power (TW) | 100–500 | 430 |
third-length | Laser third pulse: length (ns) | 0–5 | 2.29 |
third-cone-fraction | Laser third pulse: inner cone fraction | 0–1 | 0.33 |
layer1-thickness | Capsule ablator layer 1: thickness (μm) | 5–100 | 6 |
layer1-dopant | Capsule ablator layer 1: W dopant (percentage) | 0–0.5 | 0 |
layer2-thickness | Capsule ablator layer 2: thickness (μm) | 5–100 | 19 |
layer2-dopant | Capsule ablator layer 2: W dopant (percentage) | 0–0.5 | 0.42 |
layer3-thickness | Capsule ablator layer 3: thickness (μm) | 5–100 | 55 |
layer3-dopant | Capsule ablator layer 3: W dopant (percentage) | 0–0.5 | 0 |
ice-thickness | Capsule ice layer: thickness (μm) | 10–80 | 66 |
gas-thickness | Capsule gas layer: thickness (μm) | 500–1500 | 983 |
hohlraum-leh-diameter | Hohlraum LEH diameter (cm) | 0.01–1 | 0.03 |
hohlraum-diameter | Hohlraum diameter (cm) | 0–1 | 0.64 |
hohlraum-length | Hohlraum length (cm) | 0–1.5 | 1.13 |
hohlraum-gas-density | Hohlraum He gas fill density (g/cc) | 1–5 | 3
The three-shock laser pulse can be parameterized by fifteen parameters (each pulse requires, at minimum, a turn-on time, a new power level, a duration at that power level, and a ratio of the power striking the equator of the hohlraum relative to the total). Additionally, there is a time-zero “picket” that is similar but turns on right away (there is no transition time, so it needs no rise-time parameter). Figure 1 shows both the original simplified N210808-like pulse and a variation that alters the third pulse's rise time, power, and duration. Each of the laser pulses can be modified in a similar fashion.
An original NIF shot N210808-like three-shock laser pulse and a modified version that alters the third pulse's rise rate, power level, and duration.
In addition to the laser, the capsule requires at least eight parameters (two for each of the three dopant layers and one for each of the inner gas and ice fuel layers), and the hohlraum another four (three for geometry and a fourth for the fill density of the gas inside). Note, however, that these 27 parameters represent a bare minimum with a simplified piecewise-constant laser pulse. Additional complexity is certainly possible (for instance, by adding intra-pulse cone-fraction variations to the laser pulse, as the actual N210808 shot did). In any case, any ICF design will need to specify at least dozens of parameters.
In addition to the design space, a complexity arises from the physics itself. Unlike the analytic multifidelity benchmark functions in Ref. 20, the optima of a multifidelity multiphysics problem need not lie near each other in parameter space. That is, the high-fidelity function is not necessarily the low-fidelity function plus a simple transformation or extra terms. The functions themselves are black boxes whose structure can differ. For instance, consider Fig. 2, which shows yield contours over two design parameters for capsule simulations of N210808, where the “fidelity” is altered by turning a physics submodule on or off (in this case, thermonuclear burn; see Ref. 16 for details and related discussion). In this case, the optimal (high-yield, red) location actually changes: the low-fidelity optimum sits on the design boundary, while the high-fidelity optimum lies in the interior. As another concrete example, a low-fidelity 1D design cannot be degraded by 2D or 3D hydrodynamic mix, so the “best” 1D design might perform poorly in 2D and 3D.
Contours of yield from a two-parameter search of N210808 capsule simulations, turning off (left) and turning on (right) the physics of thermonuclear burn. The design variables x1 and x2 are a delay of the second shock (ns) and a power multiplier on the peak of the radiation drive, respectively. The location of the optimal design for the low-fidelity model [lo-fi target, left] is different than for the higher-fidelity model [hi-fi target, right]. Since the physics changes with fidelity, so does the optimal design location. This behavior implies that optimizing low-fidelity models to define search regions for high-fidelity models might not succeed, especially in higher dimensions.
The multiphysics nature of the problem also brings an interesting computer science consideration: unless the problem is perfectly balanced at all scales (which is highly unlikely in practice), much of the computational run time is dominated by a few key (and possibly expensive) physics packages, so accelerating those few packages can pay outsized dividends.
Also, in the case of ICF, a full-system simulation is not strictly necessary to design an experiment: a lower-fidelity design can be “close enough” to be experimentally viable (the ignition design was created through a series of moderately expensive integrated hohlraum-capsule simulations, each taking days to weeks on only tens of nodes, not thousands). However, as noted, these workhorse models do not exactly match experimental results; including additional physics and resolution brings the models closer to the experiment (e.g., by resolving enough fine-scale capsule features to effectively model hydrodynamic instability growth and mix). In all, this says that there is some information sharing between model fidelity levels, but they are not perfectly correlated, which further motivates a multifidelity modeling approach.
Finally, the problem of “robust ignition” requires a methodology to consider design under uncertainty. The laser, hohlraum, and capsule are physical objects manufactured to specific tolerances. Even with a perfect simulation model of the highest fidelity, performance variations within these tolerances contribute to uncertainty in actual design performance. Furthermore, imperfect models have at least some uncertainty in their performance (e.g., from a numerical parameter that is used to adjust the model). For instance, hohlraum simulations can match experimental results by applying multipliers to the laser power.21 However, a Bayesian approach to matching the data22 leads to a family of possible multipliers, with an associated distribution function of likely values. Each simulation, therefore, should be treated as a draw from a distribution of possible “valid” simulations. This implies that a prediction of a new experiment should come with uncertainty in it, to account for the likely variations due to model form error and the uncertainty therein.
In other words, a properly “robust ignition” design should achieve a desired level of reproducibility, given both physical (e.g., tolerances) and model (e.g., parameter) uncertainties. Therefore, each design candidate may require hundreds of Monte Carlo variations to assess the system's reproducibility: it is not enough to make a point prediction of a new design.
In all, these requirements lead to a large number of simulations. For instance, it is easy to conceive of needing one million simulations, if optimizing a 30-dimensional space requires ten thousand function evaluations, and each design requires one hundred variations for uncertainty estimation. Since multiphysics codes can produce gigabytes of simulation data, the data rate requirements on exascale design workflows would be immense. Even if these raw data are processed into a simplified QOI and deleted on the fly, the data rates could still be large. In addition to the strain on the computer hardware, this becomes a challenge for a human to monitor system performance and behavior.
In summary, exascale design optimization workflows need to be able to work with large volumes of data and data rates, while operating on complex multiphysics problems of varying fidelity and complexity. The underlying workflow technology needs to have low enough overhead not to dominate system performance when coordinating the execution of millions of simulations in a heterogeneous HPC environment. Finally, humans need to be able to monitor and interact with the system in real time.
III. DEVELOPING TECHNOLOGY
The goal of project ICECap (or Inertial Confinement on El Capitan) is to advance and integrate a number of the technologies useful for exascale design optimization workflows. Motivated by the requirements set out in Sec. II, we focus on the problem of designing for robust high yield ICF experiments at NIF, but note that many of these technologies (and indeed the problem itself) are more general. In this section, we explain the system level design of the ICECap workflow and summarize the key technologies under development for eventual integration when El Capitan arrives.
Our main motivating physics problem is one of robust high yield for ICF, which requires design under uncertainty. In particular, uncertainty exists in the reproducibility of the system (e.g., the laser power delivered, the surface finish of the capsule, and the precision of the engineering), as well as the physics. For instance, even high-fidelity simulations use “multipliers” to reduce the simulated radiation drive to match key observations.21 Furthermore, these multipliers are not universal: they change with every design campaign and even between shots within the same campaign. This lack of generalization means that the performance of a new design is highly uncertain, especially one that is a drastic departure from existing experiments.
To design a new experiment under uncertainty, one should perform a series of random trials about each proposed design and then optimize a function of the resultant population of simulations. In other words, one needs to estimate the probability distribution of achieving a certain result and then optimize a function of that distribution (for instance, its mean). For ICECap, we choose to optimize the conditional value at risk,23 which comes from financial portfolio optimization and is popular for risk-aware optimization in general. The conditional value at risk, or CVaR, is a tunable risk-based metric of a population that corresponds to the average of the population below a certain threshold, which makes it both robust to noise and sensitive to extreme events in the tail of the population. For instance, CVaR at 5% is defined as the average of the lowest 5% of values: if we randomly sample 100 variations about a design, CVaR(5%) is the average of the five lowest-performing simulations. By optimizing this portion of the population, we raise the entire population, producing a distribution of expected yields whose low-end tail performs well.
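Concretely, for a population of Monte Carlo yield samples about a design, the lower-tail CVaR described above can be computed in a few lines. The snippet below is a minimal sketch on mock data; the sample generator is illustrative only.

```python
import numpy as np

def cvar_lower_tail(samples, alpha=0.05):
    """Average of the worst (lowest) alpha fraction of a population.

    Matches the usage in the text: for 100 yield samples and alpha=0.05,
    this is the mean of the five lowest-performing simulations.
    """
    samples = np.sort(np.asarray(samples))
    k = max(1, int(np.ceil(alpha * samples.size)))
    return samples[:k].mean()

yields = np.random.lognormal(mean=1.0, sigma=0.5, size=100)  # mock capsule yields
print(cvar_lower_tail(yields, alpha=0.05))
```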
For ICF, both the physics and reproducibility variations can be modeled by capsule-only simulations (which use the radiation field from a companion hohlraum simulation as a boundary condition), with variations and physics uncertainties manifested as perturbations to the capsule drive, surface roughness, or geometry. Another advantage of using capsule-only simulations to study reproducibility is that capsule simulations, with their increased resolution, can better resolve small-scale hydrodynamic perturbations. However, the design of an experiment that can be fielded at NIF requires hohlraum simulations to specify the laser and hohlraum parameters. Hence, every proposed hohlraum design candidate needs to spawn a number of capsule-only variations, the results of which are then fed into a quantity of interest (QOI) to optimize.
For a QOI, we opt for the barrier method,24 which combines the main outcome (e.g., yield) with any constraints (e.g., total laser power or velocity) into a single scalar function that penalizes regions violating the constraints. The barrier QOI is an easy framework for combining “hard” constraints (which cannot be violated) and “soft” constraints (which can, but should not, be violated). It also works with a wide variety of optimization algorithms, rather than only those built strictly for constrained optimization.
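As an illustration only, one common way to assemble such a scalar QOI combines a log barrier for hard constraints with a quadratic penalty for soft ones; the exact functional form, weights, and constraint definitions used in ICECap (Ref. 24) may differ from this sketch.

```python
import math

def barrier_qoi(yield_mj, hard_slacks, soft_violations, mu=1e-2, rho=10.0):
    """Combine an objective with constraints into one scalar to maximize.

    hard_slacks:     g_i(x) > 0 required (e.g., remaining laser-power margin);
                     a log barrier drives the QOI toward -inf as a hard limit
                     is approached, and the QOI is -inf if it is crossed.
    soft_violations: max(0, h_j(x)) terms that are discouraged but allowed,
                     handled with a quadratic penalty.
    """
    for g in hard_slacks:
        if g <= 0.0:
            return -math.inf          # infeasible: hard constraint violated
    barrier = mu * sum(math.log(g) for g in hard_slacks)
    penalty = rho * sum(v * v for v in soft_violations)
    return yield_mj + barrier - penalty

# Example: 5 MJ design, 20 TW of laser-power margin, 0.02 over a soft velocity target
print(barrier_qoi(5.0, hard_slacks=[20.0], soft_violations=[0.02]))
```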
The optimization framework itself presents a few options. However, given the high-dimensional design space, the most practical are either genetic algorithms or Bayesian optimization. We choose Bayesian optimization25–27 for its robustness, its success in diverse scientific applications (such as additive manufacturing28 and laser wakefield design29), and the recent extensions to a multifidelity framework.6,30 In this system, at each iteration we simulate a batch of concurrent designs, sampling about each one to account for uncertainty, and then build a multifidelity surrogate model to optimize, selecting a new batch of simulation designs and fidelities for the next iteration.
The multifidelity approach, which has been recently applied to ICF design problems,15,16 allows for an acceleration of the design process. For ICECap, we choose our fidelities to correspond to physical dimensions (1D/2D/3D), and optimize the 3D response. In other words, we have 1D hohlraums drive 1D capsules for the lowest fidelity, 2D hohlraums drive 2D capsules for the middle tier, and 3D hohlraums drive 3D capsules for the highest fidelity. Since hydrodynamic mix and shape perturbations are considered major sources of ICF degradation, including the 2D and 3D levels in the optimization process means that we can search for designs that perform well, even when allowed to mix or deform. Since many of the design parameters are well modeled by 1D physics (e.g., gross shock timing), including the much-cheaper 1D simulations as the lowest tier allows the algorithm to explore many of the design variables with cheap simulations and automatically spot-check and exploit with 2D and 3D simulations. Multifidelity Bayesian optimization automatically selects batches of designs and fidelities to efficiently march toward a goal. It has been shown16 to optimize ICF design problems faster than a single fidelity approach. In practice, this means that we can optimize for 3D performance without having to exclusively run 3D problems.
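ICECap's actual selection uses a multifidelity surrogate with a knowledge-gradient acquisition (see Fig. 5); as a simplified illustration of the underlying idea, the sketch below scores candidate (design, fidelity) pairs by an upper-confidence-bound value discounted by an assumed relative cost per fidelity. The surrogate, cost values, and acquisition form here are placeholders, not the ICECap implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
FIDELITY_COST = {"1D": 1.0, "2D": 50.0, "3D": 2000.0}  # assumed relative costs

def surrogate_predict(x, fidelity):
    """Placeholder multifidelity surrogate returning (mean, std) of the QOI."""
    mean = -np.sum((x - 0.5) ** 2) + {"1D": 0.0, "2D": 0.1, "3D": 0.15}[fidelity]
    std = {"1D": 0.3, "2D": 0.2, "3D": 0.1}[fidelity]
    return mean, std

def acquisition(x, fidelity, best_so_far):
    """Upper-confidence-bound style score, discounted by simulation cost."""
    mean, std = surrogate_predict(x, fidelity)
    return (mean + 2.0 * std - best_so_far) / FIDELITY_COST[fidelity]

best_so_far = -0.1
candidates = rng.random((256, 17))               # random 17-parameter candidates
scores = [(acquisition(x, f, best_so_far), f, x)
          for x in candidates for f in FIDELITY_COST]
score, fidelity, x = max(scores, key=lambda s: s[0])
print(f"next run: fidelity={fidelity}")
```

In this toy scoring, cheap 1D runs win whenever the surrogate is uncertain everywhere, while expensive 3D runs are only selected when the predicted payoff justifies their cost, which mirrors the explore-with-1D, spot-check-with-3D behavior described above.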
Figure 3 shows the system-level design of the ICECap workflow. At each iteration, we run in parallel a batch of potential NIF design baselines. Each baseline, which consists of an integrated hohlraum and capsule simulation, then spawns a number of capsule-only variations to account for uncertainty. The results of these simulations are codified as a CVaR metric of the population and combined with any constraints into a QOI, which is optimized with the help of a multifidelity surrogate model. Each iteration launches a new batch of potential designs (and fidelities).
System-level representation of the ICECap workflow for uncertainty-aware ICF design. In each iteration, a batch of designs (both physical parameters and fidelities) are simulated concurrently, beginning with a baseline integrated hohlraum and capsule simulation. From each integrated calculation, the radiation drive is extracted as a boundary condition for a series of capsule-only simulations, which vary to account for design uncertainty in both the physics of the problem (e.g., the spectral content of the radiation drive) and the reproducibility of the experiment (e.g., the laser delivery specification and capsule manufacturability). The surrogate model and optimizer then select the next iteration in order to optimize a risk-based metric of the variations around each baseline (e.g., the conditional value at risk, CVaR, of the yield of each capsule simulation).
We iterate for N_I iterations, launching N_H hohlraums per iteration and N_C capsules per hohlraum, for a total of N_I × N_H hohlraum simulations and N_I × N_H × N_C capsule simulations. If we estimate roughly 10^4 hohlraum baselines (N_I × N_H) with N_C ≈ 100 capsule variations each, the total process would execute 10^4 hohlraum and 10^6 capsule simulations.
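For bookkeeping, only the products matter for these totals (the split between N_I and N_H is not specified here):

```latex
N_{\text{hohlraum}} = N_I N_H \approx 10^4, \qquad
N_{\text{capsule}} = N_I N_H N_C \approx 10^4 \times 10^2 = 10^6 .
```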
At such scales, any slight acceleration can have a meaningful impact on reducing the total time-to-science. In addition to multifidelity optimization, ICECap is exploring a number of potential accelerators. At the workflow level, we choose as our driving engine merlin,17 a cloud+HPC workflow technology that scales efficiently with low overhead to millions of samples. At exascale, we need to consider the (likely) possibility that some aspect of the system will break and that humans need to be able to monitor and intervene in the workflow. To that end, we have developed a lightweight real-time visualization framework (Fig. 4) that can connect with a persistent database and provide up-to-date information as to the state of the workflow, the best point found, the design search history, the accuracy of the surrogate models, etc.
The “splashboard” real-time visualization engine is a light-weight interface with few dependencies to the simulation database, presenting the user with high-level real-time updates on the performance of the workflow, the computer system, the optimization process, simulation health, and machine learning model accuracy. In addition to interactive capability, it operates in a headless mode that automatically produces key scientific plots at each iteration.
Moving inward from the workflow, speed-ups can be made at the hardware level, for instance via GPU acceleration. Significant effort has been made to port the key hohlraum physics in HYDRA31 to the El Capitan GPUs. We also chose marbl32 to run our capsule simulations, which has been built from the ground up to be performant on GPUs. Moreover, El Capitan's APUs, with unified GPU/CPU memory, are likely to speed up even CPU-only calculations for both codes. El Capitan also has “Rabbit” nodes,33 which offer fast read/write access from the compute nodes but limited storage capacity. For ICECap, we will explore using these to stage the raw simulation results from which we calculate our relatively lightweight QOIs.
The most expensive physics package within an ICF hohlraum simulation is the atomic physics calculation, which can comprise over half of the overall runtime of certain problems. Recent work has investigated embedding neural networks inside HYDRA to replace this package.18,34,35 This approach involves pre-training a neural network on data from real atomic physics calculations and then calling the neural network inline instead of the relatively expensive physics package. Such an embedding is attractive from a multiphysics standpoint, and ICECap is developing it into an operational capability. However, several challenges arise. First, the neural network could be highly inaccurate, either due to poor training or because the optimization process has led it into an unforeseen region of physics space. In these cases, we need to fall back to the actual atomic physics package, as any pathologies from the neural network could have unforeseen consequences on the rest of the physics. To do so, we need a measure of confidence in the neural network that, if violated, triggers the fallback to the real physics. We would also like to flag such an evaluation as “out of distribution” and feed it back into the neural network for later retraining, so that the model improves as the optimization process progresses (a sketch of this fallback logic is given below). As the measure of confidence, we use Δ-UQ,36 which employs data anchoring for scalable estimates of neural network uncertainty and can account for both epistemic and aleatoric uncertainty. In addition to the extra workflow complexity, a practical consideration is that one needs to replace not one but two physics packages: the atomic physics and the equation of state. This dual replacement is necessary because both are calculated by the same sub-package, and merely replacing one without the other would not actually accelerate the overall problem, since the sub-package would have to be called in any case. The unified model replacement for atomic physics is known as HERMIT, and ICECap will test its full implementation in HYDRA.
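A schematic of this in-the-loop fallback logic is sketched below. The object and function names (hermit_model, physics_package, retrain_queue) and the uncertainty threshold are hypothetical placeholders for illustration, not HYDRA or HERMIT interfaces.

```python
# Schematic of the learned-model fallback described above.
# All names are hypothetical stand-ins; the threshold is an assumed tunable.
UNCERTAINTY_THRESHOLD = 0.05  # assumed relative-uncertainty tolerance

def opacity_and_eos(state, hermit_model, physics_package, retrain_queue):
    """Return atomic-physics/EOS data, preferring the learned model when trusted."""
    prediction, uncertainty = hermit_model.predict_with_uq(state)  # Delta-UQ style estimate
    if uncertainty < UNCERTAINTY_THRESHOLD:
        return prediction                 # in distribution: use the fast learned model
    # Out of distribution: fall back to the full physics package and queue the
    # state/result pair so the network can be retrained on it later.
    truth = physics_package.evaluate(state)
    retrain_queue.append((state, truth))
    return truth
```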
Of course, the fastest simulations are the ones that are never run, so in addition to preprocessing steps that discard infeasible simulations ahead of time (e.g., those that violate a laser energy constraint), an efficient optimization algorithm is key. For this, we have chosen a Bayesian optimization approach driven by neural network surrogates, which can handle the large number of samples we expect. It has been shown36 that Δ-UQ can outperform other methods, such as Gaussian Process regression or ensembling techniques, on design optimization problems, particularly in higher dimensions. Figure 5 shows a proof-of-concept extension of Δ-UQ to multifidelity optimization, on an 8-parameter surrogate model trained on a 1D HYDRA capsule simulation database.16 Over several random trials, the Δ-UQ neural network performs similarly to Gaussian Process regression. Also of note is the run-to-run variability of both approaches, which suggests that performance can depend strongly on the starting location of the optimization process. For ICECap, we can launch several parallel chains simultaneously to guard against any single chain getting stuck in a local optimum. Since each chain pushes to a central database, this also allows for asynchronous learning, optimization, and exploration.
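To make the anchoring idea concrete, below is a minimal, self-contained sketch (on mock data) of our reading of Δ-UQ-style uncertainty estimation: a single network is trained on (anchor, input minus anchor) pairs with randomly drawn anchors, and at inference the spread of predictions across different anchors serves as the uncertainty estimate. The network size, training settings, and data here are illustrative assumptions, not the ICECap configuration; see Ref. 36 for the actual method.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.rand(512, 17)                                 # mock 17-parameter designs
y = -(X - 0.5).pow(2).sum(dim=1, keepdim=True)          # mock scalar QOI

net = nn.Sequential(nn.Linear(2 * 17, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):                                # train on (anchor, x - anchor) pairs
    anchors = X[torch.randint(0, X.shape[0], (X.shape[0],))]
    pred = net(torch.cat([anchors, X - anchors], dim=1))
    loss = nn.functional.mse_loss(pred, y)
    opt.zero_grad(); loss.backward(); opt.step()

def predict_with_uq(x, n_anchors=32):
    """Mean/std over predictions obtained with different random anchors."""
    with torch.no_grad():
        anchors = X[torch.randint(0, X.shape[0], (n_anchors,))]
        xs = x.expand(n_anchors, -1)
        preds = net(torch.cat([anchors, xs - anchors], dim=1))
    return preds.mean().item(), preds.std().item()

print(predict_with_uq(torch.rand(1, 17)))
```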
Proof of concept for multifidelity Bayesian optimization with neural networks. Plotted is a comparison of the running maximum yield found after each iteration from optimizing an 8-parameter HYDRA surrogate with Gaussian processes (left) and neural networks (right). Neural network uncertainty comes from the Δ-UQ method. The acquisition function in both cases is knowledge gradient. Several independent random seeds are shown, illustrating inherent variability in the optimization process.
Figure 6 shows a logical flow diagram of the entire workflow, including the complications from HERMIT, the orchestration with the central databases, and the visualization. Each iteration pulls the current HERMIT model to use within its HYDRA simulations. If a simulation fails the neural network's in-distribution test, it falls back to the full physics and flags the result for retraining. After each simulation is complete and post-processed, the raw data are deleted to save disk space, and the relevant QOIs are pushed to a central database, which can be visualized in real time. At the end of each iteration, the multifidelity surrogate model and optimization routine suggest a new batch. Figure 6 represents one iteration “chain”; as mentioned, for both scalability and robustness, ICECap will have several of these executing in parallel.
Flow diagram of the workflow. The iterative workflow begins with an initial set of simulations randomly spanning the design space. HYDRA simulations that violate the HERMIT in-distribution acceptance criteria are flagged for ML re-training, which happens continuously in a separate workflow. As simulations finish, the data are processed and the salient features pushed to a database, while the raw simulation data are erased. After the round of simulations has finished, all of the results, thus far, get collected from the database and used to train a multifidelity deep learning model, which is then optimized to pick the next batch of simulation parameters and fidelities to run. This new batch gets pushed to the central merlin queue server to run as soon as resources become available. In parallel, the heads-up visualization display splashboard pulls data from the database and allows humans to monitor the system's health in real-time.
As a test of the ICECap system, we have run a smaller version of the workflow (only 17 of the parameters) using only 1D hohlraum simulations, with thermonuclear burn toggled as the fidelity parameter and without the capsule simulations for uncertainty optimization. We also set the cost ratio of the two fidelities to 2:1, which is artificial but serves as a test. Figure 7 shows the results. The top right panel shows that the algorithm quickly finds a high-yield (10 MJ) design after an effective total simulation cost of only a few hundred (where the effective cost is the equivalent number of low-fidelity simulations, i.e., N_lo + 2 N_hi for the 2:1 cost ratio). The bottom right shows the running total of both low- and high-fidelity simulations per iteration, with a roughly 2:1 slope ratio, as would be expected. The left panel shows the final best design, as measured by the 17 design parameters and ordered by percent variation from the baseline N210808 case. After about ten iterations, the workflow converges.
Test of the multifidelity (without uncertainty) workflow for one-dimensional hohlraum simulations, with the fidelity setting turning on/off thermonuclear burn (at a 2:1 cost ratio) and varying 17 parameters simultaneously. The left figure shows the optimal design (as measured by change from the baseline). The right top shows the total yield of the best design as a function of total simulation cost (measured by the effective number of low-fidelity simulations), and the right bottom shows the total number of each simulation chosen per iteration. The workflow converges after about ten iterations, at around a total cost of 400 low-fidelity simulations, settling on a design that lengthens the second shock, shrinks the hohlraum size, thickens the ablator, and thins the DT ice, all while maintaining shock timing for high yield.
Of note, the algorithm finds design changes that align with the physics understanding of the 1D performance of the N210808 design. Since N210808, design changes have led to the achievement of ignition. The first shot to do so, N221204,12 was a modification of N210808 with a thicker ablator. A parallel campaign also achieved target gain on shot N230605, but at a lower laser energy than N221204, by lengthening the time prior to the second shock.37,38 Both key changes explored in these campaigns align with what our model suggests. Our optimization algorithm suggests that the most significant adjustments are a lengthening of the second shock, a thickening of the ablator, a shrinking of the hohlraum, and a thinning of the DT ice. ICF physics suggests that all of these design choices would maximize velocity and lower the adiabat (by fixing the timing of the second shock). Furthermore, the other variables are adjusted simultaneously, likely to help maintain a well-tuned implosion. Figure 7 demonstrates that the workflow behaves as expected on a realistic ICF design problem. This test is also the largest 1D hohlraum design study ever conducted, making it interesting in its own right. It also supports the design choices made by humans in the campaigns since N210808, albeit arrived at in a fully automated and mathematically rigorous optimization framework.
IV. CHALLENGES GOING FORWARD
The ICECap workflow, as demonstrated by the system test in Fig. 7, is able to automate a tedious high-dimensional multifidelity design problem. Even absent El Capitan, the ability to simultaneously optimize 17 ICF design parameters in only a few hundred simulations is a significant technological advance. Nonetheless, challenges remain for realizing the full ICECap vision. In this section, we discuss some of those challenges and our mitigation strategies.
The first major challenge involves the scale of El Capitan. Since the system is a significant leap in compute power, the number of concurrent simulations, the data volumes, and the network traffic will all operate at a much higher level than current workflow systems support, even those that have been tested and perform well at petascale, like merlin. The constraints of a new leadership system mean that few opportunities exist to test at this scale before the machine arrives. Our mitigation strategy is to perform smoke tests with fake data and map out performance scalability: by replacing HYDRA and marbl with a fake data generator, we can test the network and system infrastructure at our expected high data rates. Another major challenge is that the hardware on El Capitan will be new and will initially arrive only in small quantities. However, we are performing component-level tests on early-access hardware, which enable us to develop our workflow pieces (e.g., simulations, surrogate models) on isolated but realistic hardware and fine-tune their performance ahead of time. Early access to the hardware (and operating system) of El Capitan allows us to build the software stack and perform small few-node integration tests of the workflow and algorithms, much as with the 17-parameter test in Fig. 7. In essence, we plan to test the infrastructure via smoke tests on simulated data and the components on early-access hardware.
There also exists risk in the simulations themselves, from a physics standpoint: the simulations could crash should they wander into a strange part of parameter space. While this cannot be universally guarded against, we can perform a series of “corner-case” studies and design our problem setups to run without human intervention, a technique used previously to design ICF ensemble studies.13 Nonetheless, a simulation might still fail, in which case we can leverage merlin to mark failed simulations as bad points in design space (e.g., assigning zero yield) and let the optimization algorithm automatically steer the system away from these regions (see the sketch below). In essence, we work under the ansatz that a simulation that crashes (e.g., from mesh tangling, the most common failure mode for ICF simulations) sits near a sensitive design point that we would want to avoid anyway; indeed, if the mesh had to be untangled by hand, two different designers could untangle it in different ways and produce different simulation results. Incidentally, since mesh tangling arises from vortical hydrodynamic motion, which is thought to harm ICF capsules anyway, this approach should automatically push us toward designs that are unlikely to produce hydrodynamic turbulence and mix.
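A minimal sketch of this failure-handling convention follows; the record fields and the zero-yield sentinel are assumed for illustration, not merlin or ICECap data formats.

```python
FAILURE_QOI = 0.0  # assumed sentinel: a crashed simulation scores as zero yield

def qoi_from_run(run_record):
    """Convert a finished (or crashed) simulation record into a training point.

    A crash (e.g., from mesh tangling) is treated as evidence of an undesirable
    region, so it enters the surrogate's training set with the worst score
    rather than being silently dropped.
    """
    if run_record.get("status") != "success":
        return FAILURE_QOI
    return run_record["yield_mj"]

# Example records from two hypothetical runs
print(qoi_from_run({"status": "success", "yield_mj": 10.2}))
print(qoi_from_run({"status": "crashed"}))
```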
Finally, there are certain to be unforeseen consequences. The production environment will differ from anything we can test, because we cannot exercise the full system until El Capitan arrives. When running at exascale, we will need to be able to diagnose and respond to problems quickly. Here, the splashboard heads-up real-time display needs to provide users the proper insight into the system's performance, to help determine whether intervention is necessary. Intervention itself (e.g., stopping jobs, starting new jobs, launching new or stopping existing sample optimization chains) is straightforward with merlin, since the work to be done is decoupled from the workers doing it. In other words, canceling all the jobs and turning the machine off does not disrupt the workflow, since the task and sampling instructions live on the external queue server; ICECap is effectively, continuously, and automatically checkpointing with every simulation.
With unforeseen consequences almost certain, our best bet is to prepare the scientist to react by creating the infrastructure to provide timely data streams for quick diagnosis and tools for intervention. In this sense, ICECap is also prototyping how humans can effectively interface with exascale simulation workflows.
V. CONCLUSIONS
Exascale computing presents a unique opportunity for digital design. Due to its sheer computational power, exascale can enable large-scale studies of sufficient breadth and depth to optimize complex systems. Beyond pure computational power, digital design at the exascale could benefit from advances in machine learning and artificial intelligence, GPU-accelerated multiphysics codes, converged HPC+cloud computing workflows, and multifidelity optimization algorithms.
Motivated by the problem of moving beyond ICF ignition to a new era of robust high yield, ICECap is an ongoing project to integrate some of these technologies into a unified design optimization ecosystem on El Capitan. In this paper, we outlined the design of the ICECap system, our underlying technologies, our progress in developing those technologies, and the requirements that drive our decisions. We demonstrated a working prototype workflow by optimizing a 17-parameter 1D ICF hohlraum. We showed that the physics choices made by the algorithm in only a few hundred simulations (a significant technological advance for automated ICF design) align well with physical expectations. Finally, we discussed some of the challenges (and mitigation strategies) we anticipate as El Capitan comes online and we begin to test at scale.
At the heart of ICECap is the growing interest in the digital design of complex systems under model and performance uncertainty. While we are developing around the specific problem of ICF experimental design, there exists an abundance of complex technologies that have yet to be digitally optimized, such as the nanometer-thick ICF capsule fill tube (which is itself difficult to model and build, with several free parameters such as material composition, geometry, and insertion angle). The near-term confluence of advanced manufacturing and robotic laboratories will only increase the demand for tools that enable model-based design for manufacturability. We are building and testing our tools in a modular framework to enable flexibility and portability to other challenging science and engineering problems. For instance, applying ICECap to the fill tube problem would require swapping out the simulation model and instrumenting it to expose the design parameters and QOI to optimize, while leaving the rest of the infrastructure largely intact.
By attempting to integrate several technologies at the intersection of machine learning and scientific high performance computing, ICECap is a prototype of how we might harness both the capacity and capability of exascale machines to revolutionize the digital design, engineering, and exploration of complex systems.
ACKNOWLEDGMENTS
The authors would like to gratefully acknowledge the advice and support of the LLNL Weapons Simulation and Computing program; Livermore Computing; the El Capitan Center of Excellence; the marbl, HYDRA, and cretin developers; the LLNL ICF program; and the Autonomous Multiscale project.
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract No. DE-AC52-07NA27344 and the LLNL-LDRD Program under Project Tracking No. 21-ERD-028 (Release No. LLNL-JRNL-860716).
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
J. Luc Peterson: Conceptualization (lead); Data curation (lead); Investigation (equal); Methodology (lead); Project administration (lead); Software (equal); Supervision (lead); Visualization (equal); Writing – original draft (lead); Writing – review & editing (lead). Tim Bender: Investigation (supporting); Software (supporting). Robert Blake: Investigation (supporting); Software (supporting). Nai Yuan Chiang: Investigation (supporting); Software (supporting); Visualization (supporting). M. Giselle Fernández-Godino: Investigation (supporting); Software (supporting); Visualization (supporting); Writing – review & editing (equal). Brian Garcia: Investigation (supporting); Methodology (supporting); Software (supporting). Andrew Gillette: Investigation (supporting); Methodology (supporting); Software (supporting); Supervision (supporting). Brian Gunnarson: Investigation (supporting); Software (supporting); Writing – review & editing (supporting). Cooper Hansen: Investigation (supporting); Software (supporting). Judy Hill: Funding acquisition (equal); Project administration (equal); Resources (equal); Supervision (equal). Kelli Humbird: Conceptualization (supporting); Methodology (supporting); Supervision (supporting); Writing – review & editing (equal). Bogdan Kustowski: Investigation (supporting); Software (supporting); Validation (supporting). Irene Kim: Methodology (supporting); Software (supporting). Joseph Koning: Conceptualization (supporting); Investigation (equal); Software (equal); Validation (equal). Eugene Kur: Investigation (supporting); Methodology (supporting); Software (supporting); Visualization (supporting). Steven Langer: Investigation (supporting); Supervision (supporting); Validation (supporting); Writing – review & editing (supporting). Ryan Lee: Methodology (equal); Software (equal); Visualization (equal). Katie Lewis: Funding acquisition (equal); Methodology (equal); Project administration (equal); Resources (equal); Software (equal); Supervision (equal). Alister Maguire: Investigation (supporting); Methodology (supporting); Software (supporting). Jose Milovich: Investigation (supporting); Methodology (supporting); Software (supporting); Validation (supporting). Yamen Mubarka: Methodology (supporting); Software (supporting). Renee Olson: Methodology (supporting); Software (supporting). Jay Salmonson: Software (supporting); Validation (supporting). Chris Schroeder: Investigation (supporting); Methodology (supporting); Software (supporting). Brian Spears: Conceptualization (equal); Funding acquisition (lead); Project administration (equal); Resources (equal); Supervision (equal). Jayaraman Thiagarajan: Methodology (supporting); Software (supporting). Ryan Tran: Investigation (equal); Methodology (equal); Software (lead); Validation (equal); Visualization (equal). Jingyi Wang: Methodology (supporting); Software (supporting). Christopher Weber: Investigation (supporting); Methodology (supporting); Software (supporting); Validation (supporting).
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.