In August the European Synchrotron Radiation Facility (ESRF) in Grenoble, France, opened the Extremely Brilliant Source (EBS), a newly upgraded light source that is 100 times as brilliant as its predecessor and some 100 billion times as powerful as the average hospital x-ray source.
There’s something else EBS can do better than its predecessor: produce data.
The light source has a theoretical capacity to produce 1 petabyte of data per day, says Harald Reichert, ESRF’s director of research in physical sciences. “We don’t have the capacity to analyze this data, or even store it, if that machine fires continuously, day after day.”
Since the 1980s, both beamline photon fluxes and detector data rates have grown far faster than Moore’s law would predict. Even as scientists turn to automation, the sheer variety of synchrotron applications makes automating the handling of their output especially complicated.
In the early 2000s, three months’ worth of data from a detector could fit into a 100-megabyte archive, says Stefan Vogt, an x-ray scientist at Argonne National Laboratory’s Advanced Photon Source (APS). “These days, for these specific types of experiments, it’s several terabytes.” APS will soon be capable of producing around 200 petabytes per year. A single beamline can produce as much as 12 gigabytes per second.
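To put those volumes on a common footing, the short calculation below converts the quoted totals into sustained rates. It is a back-of-the-envelope sketch only: it assumes decimal units (1 petabyte = 10^15 bytes), a 365-day year, and a machine that runs without pause, none of which the facilities themselves specify. The function name is made up for illustration.

```python
# Rough conversion of the quoted data volumes into sustained rates.
# Assumptions (not from the facilities): SI decimal units, a 365-day year,
# and continuous, uninterrupted operation.
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365
GB = 10**9   # bytes in a gigabyte (decimal)
PB = 10**15  # bytes in a petabyte (decimal)


def sustained_rate_gb_per_s(total_bytes: float, seconds: float) -> float:
    """Average rate, in gigabytes per second, needed to produce total_bytes in the given time."""
    return total_bytes / seconds / GB


# EBS: theoretical capacity of 1 petabyte per day.
ebs_rate = sustained_rate_gb_per_s(1 * PB, SECONDS_PER_DAY)

# APS: around 200 petabytes per year.
aps_rate = sustained_rate_gb_per_s(200 * PB, DAYS_PER_YEAR * SECONDS_PER_DAY)

print(f"EBS, 1 PB/day:    {ebs_rate:.1f} GB/s")
print(f"APS, 200 PB/year: {aps_rate:.1f} GB/s")
```

Under these assumptions, 1 petabyte per day averages out to roughly 11.6 gigabytes per second, close to the 12 gigabytes per second quoted for a single beamline at its peak, while 200 petabytes per year corresponds to an average of about 6.3 gigabytes per second around the clock.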
The data transfer and storage infrastructure can and does fail to keep up, resulting in lags that can stretch up to days, according to James Holton, a biophysicist at Lawrence Berkeley National Laboratory’s Advanced Light Source.
The rapid inflation of data also makes it difficult to future-proof new beamlines. “You have to be making choices now against what the computing infrastructure is going to look like in about five years’ time,” says Graeme Winter, an x-ray crystallographer at the Diamond Light Source in the UK.
Upgrading the storage infrastructure only shifts the bottleneck further downstream. There, automation can pick up the slack. Not only can AI, machine learning, and neural networks help in analysis, but they can make data much more manageable by throwing away poor-quality data. They can also reduce excess data by stopping collection in mid-experiment when certain conditions have been reached.
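The two reduction strategies described here, discarding poor-quality data and halting collection once a condition is met, can be illustrated with a toy example. The sketch below is not any facility's actual pipeline: the Frame class, the quality score between 0 and 1, and the stopping rule (enough good frames collected) are all hypothetical stand-ins for whatever a real beamline's software or a trained model would supply.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Frame:
    """A hypothetical detector frame with a precomputed quality score."""
    index: int
    quality: float  # assumed to lie in [0, 1]; higher is better


def collect(frames: Iterable[Frame], min_quality: float, frames_needed: int) -> List[Frame]:
    """Keep frames at or above min_quality and stop once frames_needed are kept."""
    kept: List[Frame] = []
    for frame in frames:
        if frame.quality < min_quality:
            continue  # discard poor-quality data instead of storing it
        kept.append(frame)
        if len(kept) >= frames_needed:
            break  # stopping condition reached: end collection early
    return kept


# Simulated stream of frames; a generator, so unread frames are never produced.
scores = [0.9, 0.2, 0.8, 0.1, 0.7, 0.95, 0.6]
stream = (Frame(i, q) for i, q in enumerate(scores))

kept = collect(stream, min_quality=0.5, frames_needed=3)
print([frame.index for frame in kept])  # prints [0, 2, 4]
```

In this toy run, two of the first five frames are thrown away and the last two are never read at all, which is the point: less is stored, and the experiment ends as soon as it has what it needs. In practice the hard part is the quality score itself, which differs for every technique.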
Indeed, the Large Hadron Collider (LHC), which CERN claims can produce around 25 gigabytes per second recording the aftermath of particle collisions, relies on a worldwide computing grid to handle its data deluge. But Reichert says that the LHC’s detectors capture each collision in essentially the same way, producing predictable data sets that lend themselves well to automation. In contrast, a synchrotron facility can host multiple beamlines with a far more diverse array of applications, such as determining protein crystal structures, charting the brain’s neural connections, and watching chemical catalysis and additive manufacturing in real time. “They have nothing in common,” says Reichert. “They use different detectors. The data is completely different.”
Consequently, it’s often left to the users of each beamline and application to develop their own specialized firmware and algorithms. When large synchrotrons like APS host dozens of beamlines, some of which are deeply customizable, the volume of specialized use cases renders a streamlined system like CERN’s impractical.
When users do take up the challenge of building tools, the programmers who make them often approach the problem differently from the scientists who need them. Earlier this year, crystallographers used Diamond to aid drug discovery efforts by scanning a protease in the virus responsible for COVID-19. Collecting the data took mere days, but it took longer to convert the observational data into structures detailed enough for chemists to use. The researchers’ tools also provided output that was difficult for chemists to interpret. Without an effective means of communicating and checking their results with chemists, the project was delayed by weeks.
Frank von Delft, a macromolecular crystallographer at Diamond, says that programmers should focus on making their tools easier to use. “When that’s achieved,” he says, “your whole platform suddenly becomes powerful.” In particular, he cites Phenix, a crystallography tool that can help determine molecular structures. Phenix is one of the most popular tools of its kind, in large part thanks to a graphical user interface designed by a biologist rather than a career programmer.
Fortunately, the future seems to be pointing toward greater streamlining, including at the synchrotron end. Traditionally, facilities left the data-handling part of their science to the users, but the enormous data volumes, as well as other factors such as more computation shifting to the cloud, are changing that.
Reichert believes each synchrotron facility should help provide scientists the tools they need and assist with the computation. “When we give [a scientist] beam time,” he says, “we’d better ask the question: What do you do with the data, and what kind of help do you need to actually get an answer to your scientific problem and put the answer out in the open?”