Rapidly growing data volumes at light sources demand increasingly automated data collection, distribution, and analysis processes to enable new scientific discoveries without overwhelming finite human capabilities. We present the case for automating and outsourcing light source science using cloud-hosted data automation and enrichment services, institutional computing resources, and high-performance computing facilities to provide cost-effective, scalable, and reliable implementations of such processes. We discuss three specific services that accomplish these goals for data distribution, automation, and transformation. In the first, Globus cloud-hosted data automation services are used to implement data capture, distribution, and analysis workflows for Advanced Photon Source and Advanced Light Source beamlines, leveraging institutional storage and computing. In the second, such services are combined with cloud-hosted data indexing and institutional storage to create a collaborative data publication, indexing, and discovery service, the Materials Data Facility (MDF), built to support a host of informatics applications in materials science. The third integrates components of the previous two projects with machine learning capabilities provided by the Data and Learning Hub for Science (DLHub) to enable on-demand access to machine learning models from light source data capture and analysis workflows, and provides simplified interfaces for training new models on data from sources such as MDF on leadership-scale computing resources. We draw conclusions about best practices for building next-generation data automation systems for future light sources.
