The high-energy-density (HED) physics community is moving toward a new paradigm of high-repetition-rate (HRR) operation. To fully leverage the scientific power of HRR HED facilities, all components of each subsystem (laser, targetry, and performance diagnostics) must be connected and synchronized in a reliable and robust manner while the acquired data are tagged and archived in real time. To this end, GA has begun developing a generalized NoSQL-database framework, the MOngodb Repository for Information and Archiving (MORIA). An organizational strategy has been developed that shifts HED data organization from a shot-based to a diagnostic-based approach, increasing archival and retrieval efficiency in a way that lends itself to optimization applications. This work is a first step in pushing HRR HED science toward data management solutions that emphasize machine actionability, and it aims to stimulate community engagement to define data standards in HED science.
I. INTRODUCTION
Over the next decade, many new capabilities are expected to be unveiled at existing and new high-energy-density (HED) physics facilities to leverage high-repetition-rate (HRR) operation (∼0.01 to 10 Hz).1,2 These facilities will utilize high-power lasers to perform statistical studies of basic plasma science in extreme environments, develop new radiation sources, and advance missions relevant to national security. In addition, any future application of HRR, high-power laser systems, including inertial fusion energy (IFE)3,4 and laser-based secondary radiation sources (SRSs),5–7 will require the implementation of rapid data acquisition, analysis, and control-feedback systems.8 Along the path to developing these societal applications, present and future HRR laser facilities researching relevant technologies must evolve from a single-shot mindset to a large-scale, big-data paradigm to accelerate HED science by fully realizing the potential of HRR laser systems.
An HRR laser experiment or application will require the integration of multiple subsystems and control-feedback loops. A generic HRR laser-plasma experiment starts with the laser system and diagnostics (1), wherein the control system and feedback loops determine if the applied settings are safe for operation and then send the safe laser pulse to the chamber, where HRR targetry hardware (2) deploys a target at the desired location for the laser to produce an HED plasma (3). Target and performance diagnostics (4) archive their raw data and metadata from each shot to a storage location (5) where their fidelity and robustness can be verified. Necessary changes to settings (e.g., the gain on a camera) can be fed back to the diagnostics for the next shot, and if the data are verified, integrated analysis algorithms (6) process performance diagnostic data, laser diagnostic data, and target specifications to determine the performance of the shot. This information can then be sent to algorithms that model the HED plasma (7) that will work with an optimization algorithm (8) to determine the laser and target settings for the next shot. The exact details of each of these subsystems will depend on the experiment, and the control system architectures implemented will vary from facility to facility.8
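The eight-step cycle above can be sketched as a simple control loop. In the sketch below, every function name is a hypothetical placeholder standing in for a facility-specific subsystem (safety interlocks, targetry hardware, analysis pipelines, etc.); it illustrates the flow of the cycle, not any particular facility's control system.

```python
# Illustrative sketch of the generic HRR shot cycle described above.
# All functions are hypothetical stand-ins for facility-specific subsystems.

def laser_settings_safe(settings):            # (1) laser + safety interlock
    return 0.0 < settings["energy_J"] <= 10.0

def deploy_target(settings):                  # (2) HRR targetry
    return {"type": "gas_jet", "backing_psi": settings.get("backing_psi", 500)}

def acquire_diagnostics(shot, target):        # (3)-(4) HED plasma + diagnostics
    return {"shot": shot, "espec_counts": 1000 + shot}

def archive(shot, raw, settings):             # (5) tag and store raw data
    pass

def analyze(raw):                             # (6) integrated analysis
    return {"shot": raw["shot"], "yield": raw["espec_counts"]}

def update_plasma_model(metrics):             # (7) model of the HED plasma
    return {"predicted_yield": metrics["yield"]}

def optimize(settings, model):                # (8) settings for the next shot
    settings = dict(settings)
    settings["energy_J"] = min(10.0, settings["energy_J"] * 1.01)
    return settings

def run_shot_cycle(settings, n_shots=3):
    """Run the simplified feedback loop for a fixed number of shots."""
    history = []
    for shot in range(n_shots):
        if not laser_settings_safe(settings):
            break
        target = deploy_target(settings)
        raw = acquire_diagnostics(shot, target)
        archive(shot, raw, settings)
        metrics = analyze(raw)
        model = update_plasma_model(metrics)
        settings = optimize(settings, model)
        history.append(metrics)
    return history
```

In a real system, each stand-in would be replaced by the corresponding subsystem interface, and steps (5)–(8) would run concurrently with the next shot's acquisition rather than serially.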
To truly leverage the scientific power of HRR HED facilities, all components of the system will need to be synchronized across the different subsystems in a reliable and robust manner. Rep-rated laser technology has been available for many decades, but in the last 5–10 years, advancements in thermal management have allowed lasers with high peak power to move toward higher average power by increasing the repetition rate, making them more attractive for future commercial applications. Along with these technological developments, feedback loops have been integrated into the laser systems to stabilize the energy content and pointing of the beam over long timescales. While more development is needed to continue increasing the stability and average power of these systems, the laser is the most developed subsystem in terms of HRR operation and feedback control, and current laser technologies are already sufficient for a commercial laser-based SRS. However, other associated subsystems are not as advanced in HRR operation and require more development.
Three main subsystems accompanying the laser need advancement to achieve HRR operation: diagnostics, targetry, and data handling. The specifics of these subsystems will depend on the application, and there have been some developments in recent years. On the targetry side, continuous-flow systems such as gas jets/gas cells9–10 or liquid jets/droplets11 naturally lend themselves to HRR operation, and many of the technical challenges of fielding these types of targets in a vacuum have been solved. On the diagnostics side, the most recent efforts within the community have focused on adapting single-shot particle diagnostics for HRR operation. This has resulted in HRR versions of electron spectrometers,10 proton beam diagnostics,12–14 and x-ray spectrometers.15–17 Many other typical HED diagnostics, such as probe-beam diagnostics (interferometry, shadowgraphy, schlieren imaging, etc.) and those utilizing streak cameras or other immediately digitizing systems, can often be operated at higher repetition rates, depending on hardware limitations. While some diagnostics can be operated at HRR, automated analysis routines for performance diagnostics are still under development and will require leveraging machine learning (ML) and artificial intelligence (AI) algorithms to process data between shots18 in order to provide feedback for maintaining laser-target performance over long timescales.
To fully realize the potential of HRR HED experiments and/or applications, all measurements from each subsystem (laser, target, and diagnostics) need to be appropriately labeled and archived in real time (∼10 to 100 MB/s, ∼1 to 10 PB/year). As advancements are made in ML/AI analysis routines, metrics derived from analyzed data will be used to provide feedback control to the laser and targetry subsystems and to aid in developing and improving models of the system's performance. To achieve these ends in the long term, the HED community will need to address the challenges of data storage and access across a wide range of facilities and users. To this end, a diverse set of stakeholders addressed the broad usage of scientific data in general by developing the FAIR guiding principles19 in 2016. These principles provide a good trajectory for the HED community to address scientific data management. A brief summary of each principle is provided below for reference.
Findable: Data and associated metadata are assigned a globally unique and persistent identifier. Metadata are comprehensive and include the identifier of the data they describe with multiple searchable indices.
Accessible: Data and metadata are retrievable by their identifier using open, free, and universally implementable communication protocols that allow for authentication when needed.
Interoperable: Data and metadata use a formal, accessible language using vocabularies that follow FAIR principles and include references to other data or metadata.
Reusable: Data and metadata are comprehensively described with accurate and relevant attributes that are released with a clear and accessible data usage license and meet community standards.
These principles provide guidance on best practices for data management with an emphasis on machine-actionability, which is required as the HED community moves toward HRR operation and automation of experimental scans and future commercial applications.
II. DATA MANAGEMENT WITH MORIA
Given the overarching goals of moving toward a FAIR-guided data organization solution for HRR HED experiments and applications, a NoSQL approach was chosen to provide flexibility and scalability. The transition from relational SQL databases to so-called NoSQL databases for large scientific experiments is apparent from new NASA missions, such as the James Webb Space Telescope.20 NoSQL databases are geared toward processing large amounts of varied and unstructured data. Data generated in HED experiments are multimodal, taking the form of scalars, vectors, and images; such an application functions more effectively under a generalizable NoSQL format. However, NoSQL databases are generally not fully ACID-compliant, meaning that they do not guarantee data atomicity, consistency, isolation, and durability. This flexibility comes at the cost of more complex data structures and queries. Defining a sufficiently general data structure with associated query functions that will meet current and future needs is difficult, as the HED community has not yet defined database standards. To meet the needs of the HRR HED community at this stage, flexibility is key, so the MongoDB NoSQL-database management system has been selected for this work, primarily for the following two key attributes:
Horizontal scaling: As data volumes increase (with the addition of diagnostics and increasingly long rep-rate experiments), the capacity of the database system can be increased by simply adding more computational nodes.
Integration with GridFS: The GridFS file system allows the storage of arbitrary, non-scalar data, such as images, along with associated metadata. MongoDB exposes an Application Programming Interface (API) for reading such data directly from the database, eliminating the need to store files separately, thereby preventing issues with data integrity and consistency.
In addition, MongoDB is open source, and there are many officially supported libraries for accessing MongoDB databases in Python, MATLAB, and C++. While no supported integration libraries currently exist for other control systems, such as LabVIEW, the Experimental Physics and Industrial Control System (EPICS), or Tango Controls, MongoDB databases may be accessed and controlled through those control systems using appropriate Python or C++ wrappers for the supported libraries. The versatility of MongoDB provides users with the tools to adapt the framework for their specific laboratory and application. Since the ultimate goal is for many facilities to use a similar data organizational structure to increase the accessibility and reusability of HRR HED experimental data, MongoDB's flexibility and breadth of support allow this database framework to be implemented at any facility, regardless of control-system architecture.
Access to GridFS allows file storage to be directly integrated into the database, allowing for seamless management of files and metadata. This is particularly advantageous when simplicity and close integration between data and files are required. The primary trade-off with GridFS is the potential for increased resource consumption, particularly in terms of storage and memory, which may lead to performance bottlenecks. For HRR HED datasets, which may include large file payloads, this could result in slower access times and higher operational costs. Therefore, if current or future datasets are expected to involve significant volumes of large files, external storage options optimized for such use cases could enhance performance and scalability. Prototype testing and development of the database framework and API, the MOngodb Repository for Information and Archiving (MORIA), is under way at the GA Laboratory for Developing Rep-rated Instrumentation and Experiments with Lasers (GALADRIEL)21 in an effort to advance data management strategies for HRR HED experiments guided by the FAIR principles.
The general organizational strategy for MORIA is shown in Fig. 1. MORIA is organized and generalized at the diagnostic level to lead the paradigm shift from shot-based to diagnostic-based data archiving. Multiple diagnostic collections can be queried at the same time, but only the diagnostic information of interest to a specific application needs to be opened at any one time, allowing for more efficient use of computational resources. The main disadvantage of the diagnostic-first approach is the additional complexity of the data structures and the requirement that the metadata, including associated shot numbers, be stored and tagged properly for each instrument so that data across different diagnostics within a single shot are accurately correlated. To store data and metadata, MongoDB uses a generic collection structure that contains a list of attributes with associated values of an arbitrary type. This type of structure is required because data acquired in HRR HED experiments take multiple forms: single numbers (e.g., trigger timing and shot number), 1D arrays (e.g., diode voltage traces), and images (e.g., CMOS or CCD images). In this framework, every data-generating piece of hardware is classified as an instrument within MORIA, and its data and metadata are archived on every shot. Each instrument record includes the diagnostic attribute it belongs to, so that as specific pieces of hardware change, continuity is maintained at the diagnostic level.
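As a concrete illustration, a non-image acquisition in such a diagnostic-first layout might be assembled as the document below. The field names (shot, diagnostic, instrument, etc.) are assumptions chosen for illustration, not the published MORIA schema.

```python
from datetime import datetime, timezone

def make_acquisition_doc(shot, diagnostic, instrument, data, metadata=None):
    """Build a diagnostic-first acquisition document (illustrative schema).

    The `diagnostic` field is the stable abstraction level used for queries,
    while `instrument` records the specific hardware that produced the data.
    """
    doc = {
        "shot": shot,                           # global shot counter
        "timestamp": datetime.now(timezone.utc),
        "diagnostic": diagnostic,               # e.g., "vacuum_system"
        "instrument": instrument,               # e.g., "gauge_chamber_01"
        "data": data,                           # scalar or 1D-array payload
    }
    if metadata is not None:
        doc["metadata"] = metadata
    return doc
```

Swapping a gauge for a new model would change only the instrument value; queries keyed on the diagnostic attribute would continue to return a continuous record across the hardware change.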
The diagnostic abstraction level shown in Fig. 1 provides the means to retrieve data and metadata associated with a measurement archived by an instrument. Data and metadata are stored in a collection within acquisitions (for non-image data) or within fs.files/fs.chunks (for image file data). Image files can be large (∼5 to 20 MB) and are broken into smaller pieces using the GridFS functionality of MongoDB to increase archival and retrieval speeds. In the example shown in Fig. 1, instruments C and D must archive image files, so their data go to fs.chunks and their metadata to fs.files, whereas instruments A and B produce scalars or arrays of data (and associated metadata) that are archived as acquisitions. As instruments are replaced or changed over time, as long as they are associated with the same diagnostic attribute, all data and associated metadata from the diagnostic remain queryable without having to specify the hardware, thus addressing many of the FAIR principles. The database also tracks a growing list of all the diagnostics and instruments included, and a global shot counter coupled with time stamps addresses findability across the database.
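The chunking that GridFS performs internally can be illustrated in a few lines. The dictionaries below mirror the shape of fs.chunks documents (parent file id plus a sequence number); a real MongoDB driver handles this automatically, so this sketch is purely conceptual.

```python
CHUNK_SIZE = 255 * 1024  # GridFS default chunk size (255 KiB)

def split_into_chunks(files_id, payload, chunk_size=CHUNK_SIZE):
    """Split a file payload into fs.chunks-style documents.

    Each chunk carries the parent file's id and a sequence number `n`,
    which is how GridFS reassembles the original file on retrieval.
    """
    return [
        {"files_id": files_id, "n": i, "data": payload[off:off + chunk_size]}
        for i, off in enumerate(range(0, len(payload), chunk_size))
    ]
```

A ∼10 MB camera image would therefore be stored as ∼40 such chunks, with a single companion fs.files document holding the image's metadata.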
To access and retrieve data from MORIA, an API has been developed in parallel with database implementation and testing. Figure 2 illustrates how the different components of the MORIA API, built on top of the MongoDB API, interact with the control system, the database, and user requests. The connection API manages the core database connection and is invoked by each of the other APIs. The admin API provides an interface for managing the database, allowing users to modify the collections in MORIA by adding instruments, defining experimental settings, controlling the shot counter, etc. The storage API provides an interface for storing data in the database, allowing users to insert data into the acquisitions collection and GridFS. The query API provides an interface for querying the database using filters defined by a user to specify the datasets of interest. This separation of duties is designed primarily to improve the maintainability of the API. The MORIA API serves as a framework that facilitates interaction with the database for both internal and external users, effectively abstracting much of the complex MongoDB syntax and supporting the new diagnostic-first approach.
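A minimal sketch of the kind of filter building a query API performs is shown below. The collection layout and field names are assumptions based on the diagnostic-first structure described above, not the actual MORIA interface.

```python
def build_query(diagnostic, shot_range=None, instrument=None):
    """Compose a MongoDB filter document for an acquisitions-style collection.

    Only the diagnostic attribute is required; a shot-number window and a
    specific instrument are optional refinements.
    """
    query = {"diagnostic": diagnostic}
    if shot_range is not None:
        lo, hi = shot_range
        query["shot"] = {"$gte": lo, "$lte": hi}  # inclusive shot window
    if instrument is not None:
        query["instrument"] = instrument          # pin to specific hardware
    return query
```

With pymongo, such a filter would be passed directly to a find() call, e.g., db.acquisitions.find(build_query("espec", (100, 200))), so the user never writes raw MongoDB operator syntax.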
To date, most developmental efforts have focused on using simulated data streams for multiple image- and non-image-based instruments to check theoretical performance. Archiving these simulated data streams has shown that MongoDB can readily sustain 10 Hz operation with many simultaneous image streams, indicating that write speed would be limited by the communication bandwidth of the network of instruments connected to MORIA and, ultimately, by the hard-drive write speed. Recent testing on GALADRIEL demonstrated the ability to archive one image-based instrument and more than ten non-image instruments (stages, pressure controllers, vacuum gauges, etc.) at a 1 Hz repetition rate. The total number of shots stored in MORIA since the current framework was finalized in late 2023 is shown in Fig. 3. These data originated from gas-jet experiments, tape-drive experiments, and laser-only experiments wherein the diagnostics used were different for each campaign, although all data are queryable from the same centralized location. More testing is planned to address limitations and trade-offs of data handling and archival repetition rate for the GALADRIEL system, which will guide changes and advancements in the framework to maximize total data output.
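The bandwidth-limited regime described above follows from a back-of-the-envelope data-rate estimate; the numbers below are illustrative assumptions, not measured GALADRIEL values.

```python
def shot_data_rate_MBps(n_images, image_MB, rep_rate_hz, scalar_MB=0.01):
    """Estimate the sustained write bandwidth (MB/s) an HRR experiment demands.

    Assumes each shot produces n_images image payloads of image_MB each,
    plus a small fixed budget (scalar_MB) for scalar/array records.
    """
    per_shot_MB = n_images * image_MB + scalar_MB
    return per_shot_MB * rep_rate_hz
```

For example, two hypothetical 5 MB cameras at 1 Hz demand roughly 10 MB/s, at the lower edge of the ∼10 to 100 MB/s range quoted in Sec. I; the same diagnostics at 10 Hz approach the capacity of a gigabit network link, which is why network and disk throughput, rather than database insert speed, set the practical limit.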
III. SUMMARY
The HED community is in a transition phase, moving from single-shot (∼1 shot/h) to HRR (≳0.01 Hz) operation. Work has been done in the community to address technology development in lasers, targetry, and performance diagnostics, but a consensus on data management has not yet been reached. It is unrealistic to expect that control systems can be normalized across all facilities within our community, but a generalized database framework that can be implemented regardless of control-system architecture provides the basis for the community to begin addressing these issues with the FAIR principles in mind.
General Atomics has begun developing a generalized database framework with a NoSQL approach using MongoDB. From a big-data perspective, a NoSQL database is preferred over a traditional SQL approach because the NoSQL framework is designed for workloads geared toward the rapid processing and analysis of vast amounts of varied and unstructured data. An organizational strategy has been developed that shifts HED data organization from a shot-based to a diagnostic-based approach in order to make archival and retrieval more efficient, which lends itself to optimization applications. Prototype testing and development of the MORIA framework and associated API for query building are under way at GALADRIEL. While being developed using on-site storage resources, the flexibility of the MORIA framework also allows for deployment on cloud platforms such as AWS or Azure, ensuring scalability and reliability as database needs increase in size and scope. MORIA sets the stage and provides a starting point for community-wide discussions through its accessibility and utility to HED facilities of any size.
ACKNOWLEDGMENTS
This work was supported by internal research and development funds at General Atomics. Continued development to provide a publicly released version of the MORIA framework is supported by the Department of Energy Office of Science, Fusion Energy Sciences, under Contract No. DE-SC0025083.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
M. J.-E. Manuel: Conceptualization (equal); Funding acquisition (lead); Project administration (lead); Writing – original draft (lead); Writing – review & editing (lead). A. Keller: Conceptualization (equal); Data curation (equal); Methodology (equal); Software (equal); Visualization (equal). E. Linsenmayer: Conceptualization (equal); Methodology (equal); Software (lead). G. W. Collins IV: Conceptualization (equal); Data curation (equal); Investigation (equal); Methodology (equal); Software (supporting); Validation (equal); Visualization (equal); Writing – review & editing (equal). B. Sammuli: Conceptualization (equal); Funding acquisition (equal); Methodology (equal); Software (equal); Validation (equal); Writing – review & editing (equal). M. Margo: Conceptualization (equal); Methodology (equal); Software (equal).
DATA AVAILABILITY
The goal of this project is for the MORIA framework to be posted on an open-source, open-license code repository, such as GitHub. Until it is made available online, interested users may contact the author to request the current working version of the code.