Non-negative matrix factorization (NMF) is an appealing class of methods for performing unsupervised learning on streaming spectral data, particularly in time-sensitive applications such as in situ characterization of materials. These methods seek to decompose a dataset into a small number of components and weights that can compactly represent the underlying signal while effectively reconstructing the observations with minimal error. However, canonical NMF methods have no underlying requirement that the reconstruction uses components or weights that are representative of the true physical processes. In this work, we demonstrate how constraining a subset of the NMF weights or components as rigid priors, provided as known or assumed values, can provide significant improvement in revealing true underlying phenomena. We present a PyTorch-based method for efficiently applying constrained NMF and demonstrate its application to several synthetic examples. Our implementation allows an expert researcher-in-the-loop to provide and dynamically adjust the constraints during a live experiment involving streaming spectral data. Such interactive priors allow researchers to specify known or identified independent components, as well as functional expectations about the mixing or transitions between the components. We further demonstrate the application of this method to measured synchrotron x-ray total scattering data from in situ beamline experiments. In such a context, constrained NMF can result in a more interpretive and scientifically relevant decomposition than canonical NMF or other decomposition techniques. The details of the method are provided, along with general guidance for employing constrained NMF in the extraction of critical information and insights during time-sensitive experimental applications.

High-throughput experiments, which involve measurements exceeding the rate of conventional analysis,1–6 and in situ experiments, which involve measurements made during a dynamic process or sample change,7,8 have both become ubiquitous in the discovery of modern materials. Automation and engineering have increased the capabilities for data streaming and in situ experimentation in many settings, including university laboratories,9 industrial research,10 and central facilities.11 These technological advancements have afforded new opportunities for interrogating the dynamic processes of systems, including phase transitions12 and reactive processes.7 However, such opportunities bear the burden of an increased rate of data production, as well as a commensurate increase in the volume of data to analyze.

This is abundantly clear at central user facilities such as synchrotron light sources, where available photon flux for beamline measurements has been increasing at a rate even faster than Moore's law.13 Largely driven by advances in accelerator and detector technologies, such facilities are generating more data now than ever before. For example, the National Synchrotron Light Source II (NSLS-II) produced over 1 PB of raw data in the last year, and that rate is expected to increase as the facility matures.11 This measured raw data often cannot be immediately converted into material knowledge, but must first be processed and analyzed to extract key information. Unfortunately, the speed of conventional data analysis methods has not kept pace with the increased measurement speeds. This mismatch has led to a scenario where scientists performing in situ studies are effectively “flying-blind” while at the beamline, unable to fully understand their data until well after their experiment has completed.

This information bottleneck is particularly pronounced during total scattering experiments, where both the local and average atomic structure of a sample can be measured in fractions of a second.14 These experiments—combining both x-ray powder diffraction (XRD) and pair distribution function (PDF) methods—are often critical to understanding the origin of structure/property relationships in energy materials such as batteries,15,16 catalysis,17,18 photovoltaics,13 and multi-ferroics.19 Conventional analysis in total scattering studies typically involves iterative fitting of models to reciprocal20 and real space21 data in order to extract relevant physical properties (e.g., crystallographic phase, lattice constants, atomic positions, site occupancies, and atomic displacement parameters). These refinements may take minutes to hours for a skilled expert to perform, but can be difficult for novices and do not scale well to large datastreams.22 For amorphous, disordered, or nanoscale hierarchical materials, it is common to employ even more complex “large-box” refinements on total scattering data,23,24 but these methods of analysis take even longer (often days to weeks of compute) and are thus completely unsuitable for rapid analysis.

Machine learning (ML) tools offer great promise for monitoring and analyzing large volumes of streaming data.11 These tools have been demonstrated for classifying diffraction patterns,25–32 or decomposing large diffraction datasets over state variables into fundamental components.12,33–37 Particularly, unsupervised learning is useful for offering rapid diagnostics and analysis without prior training.15,34,38 Because unsupervised decomposition or data reduction algorithms do not require any prior data labeling, they are exceptionally useful for interpreting datasets in which there are potentially new or unexpected features or phases emerging. Two such algorithms, principal component analysis (PCA)39,40 and non-negative matrix factorization (NMF),38,41 have been employed to segregate distinct components from diffraction datasets. The algorithms aim to describe a dataset of m experimental patterns (XRD or PDF) as a weighted linear composition of k fundamental components. Of these, NMF is particularly well suited to the analysis of x-ray scattering or diffraction data because its decomposition aligns with the underlying physics: the component signals are enforced to remain greater than or equal to zero, and are strictly additive to one another in the dataset reconstruction.

Despite the attractive properties of non-negativity and additive superposition, NMF as canonically implemented suffers from several significant drawbacks in its application to scattering or diffraction experiments. First, it is incapable of interpreting translational variance: as the canonical NMF model uses only a linear combination of components, a given component cannot effectively reproduce data that is identical except for translation. Unfortunately, this is a common occurrence for XRD and PDF experiments due to material thermal expansion, changes in nanostructure, and lattice strain due to Vergad's law.34 Efforts to overcome this translation have included discarding translationally redundant components, although this requires additional hyperparameters for detecting the redundancy that are not effective across different experimental challenges.25,34 A second challenge of canonical NMF decomposition is that the optimization is Non-deterministic Polynomial-time hard (NP-hard), and thus most non-trivial applications will lack a unique solution. This means that the results of the model fit may vary depending on initialization, and are not guaranteed to converge to a global minimum for reconstruction error. Finally, though non-negativity is a sensible constraint for x-ray diffraction data, it is not sufficient to ensure that the output components correspond to physically interpretable entities.

It can be argued that there exist a set of components and weights that are most physically meaningful in the analysis of diffraction datasets.38 From the standpoint of the researcher seeking physical or chemical information, ideally the derived NMF components would correspond to the fundamental physical components in the material and respective derived weights would correspond to the volume fractions of those physical components. It is common in such studies to have some prior knowledge about these physical components. For example, in situ experiments commonly follow a trajectory across a state variable where at the start of the experiment, the sample begins as a known phase that will later undergo a transition into other known or unknown phases.42 In this instance, a physically meaningful analysis would include components that correspond to the initial, known phase. In other instances, both the end-member states of an in situ study may be well characterized before the measurement, with unknown intermediate phases or mixtures that need identification.12,38 Alternatively, there can be no knowledge of the component phase identity, but an expectation for the underlying composition. Metropolis matrix factorization attempts to address these problems by constraining select components during optimization, but remains too computationally costly to run in real time during an experiment involving streaming data.15 

Therefore, an ideal decomposition method for total scattering and diffraction datasets must resolve components and weights that are physically meaningful, and operate efficiently on the timescale of the measurement. The resulting model should not necessarily be the one that best fits the data but one that is most likely within the physical constraints of the system. By increasing the likelihood that the learned components and weights represent physically relevant features and process occurring within the sample, these results are more likely to be relevant and interpretable to the researcher performing the measurement. Moreover, an effective technique should operate quickly enough to provide real-time insights to the researcher, thus enabling a feedback loop between knowledge extraction and human action. Therefore, in order to be productive and insightful, decomposition methods for such data should be able to incorporate known weights or components as informative constraints on the problem, at the time of the measurement.

In this contribution, we present our advances in developing constrained NMF methods for use on spectral data, specifically in situ studies of total scattering and powder diffraction beamline data. The algorithms were developed using the PyTorch43 framework for speed and scalability, allowing constrained NMF to behave as a companion agent in concert with the researcher, leveraging the expertise of the scientist while providing real-time insights via physically interpretable decompositions of streaming datasets. We first demonstrate the capabilities of constrained NMF on several synthetic examples, in which the limitations of canonical NMF are explored. We then use constrained NMF to decompose temperature-dependent total scattering datasets of a ferroelectric perovskite and a molten salt, wherein there is no a priori knowledge of components in the latter. Finally, we provide best practices on how to apply these methods during in situ experiments and interpret the results as they relate to commonly encountered scenarios in the analysis of energy materials. These developments are extensible to applications in spectroscopy and microscopy, and can also be deployed at university and industrial laboratories.

Given a non-negative data matrix X of dimensions m × n—where m is the number of independent measurements and n is the length of the feature dimension—the goal of NMF is to find a factorization such that

XWH,
(1)

where W and H are non-negative matrices of dimension m × k and k × n, respectively, and k is the user-specified number of components. As such, the problem reduces to solving

minWm×c,Hc×nD(X,WH),s.t.W0,H0,
(2)

where D is a separable, positive measure of fit and the notation M0 indicates non-negativity of the entries of matrix M.

We employ the β-divergence Dβ as a loss function during minimization. The β-divergence is a popular cost function for NMF minimization parameterized by a single shape parameter β.44 For the purposes of this study, we consider two special cases of the β-divergence—squared Euclidean distance, also known as the squared error (SE) (β = 2) given by

D2(X,X)=12i=1mj=1n(Xi,jXi,j)2,
(3)

and Kullback–Leibler divergence (β = 1) given by

D1(X,X)=i=1mj=1n(Xi,jlogXi,jXi,jXi,j+Xi,j),
(4)

though the methods presented here may generalize to any choice of β.

We employ the alternating non-negative least squares (ANLS) algorithm in order to minimize the β-divergence subject to NMF constraints. Briefly, this algorithm operates by alternately minimizing Dβ(X,WH) with respect to W and H under the constraints that all matrices remain non-negative. This can be solved by gradient descent with non-negativity constraints on parameter updates, as described in Algorithm 1 as follows:

Algorithm 1.

Alternating non-negative least squares.

1: procedure ANLS(X, W0,H0, β, λ1, λ2, iter, tol) 
2:   i0,  Lprev,  Δ 
3:   WW0,  HH0 
4:   whileiiter and Δ>toldo 
5:     X THRESHOLDWH, 0, 108 
6:     LDβ(X,X) 
7:      μW{j=1nH.,jifβ=1X(β1)Wotherwise 
8:      WW·max(μWLW,0)/(μW+λ1+λ2W)   ▷ Non-negative W update 
9:      X THRESHOLDWH, 0, 108 
10:   LDβ(X,X) 
11:   μH{i=1mWi,.ifβ=1WX(β1)otherwise 
12:   HH·max(μHLH,0)/(μH+λ1+λ2H)   ▷ Non-negative H update 
13:   ii+1,  ΔLLprev,  LprevL 
14:   end while 
15:   return W, H 
16: end procedure 
17: function Threshold(M, t, ε)   ▷ Threshold each element in matrix M 
18:   for all Mi,j in Mdo 
19:      Mi,j{Mi,jifMi,j>tεotherwise 
20:   end for 
21: end function 
1: procedure ANLS(X, W0,H0, β, λ1, λ2, iter, tol) 
2:   i0,  Lprev,  Δ 
3:   WW0,  HH0 
4:   whileiiter and Δ>toldo 
5:     X THRESHOLDWH, 0, 108 
6:     LDβ(X,X) 
7:      μW{j=1nH.,jifβ=1X(β1)Wotherwise 
8:      WW·max(μWLW,0)/(μW+λ1+λ2W)   ▷ Non-negative W update 
9:      X THRESHOLDWH, 0, 108 
10:   LDβ(X,X) 
11:   μH{i=1mWi,.ifβ=1WX(β1)otherwise 
12:   HH·max(μHLH,0)/(μH+λ1+λ2H)   ▷ Non-negative H update 
13:   ii+1,  ΔLLprev,  LprevL 
14:   end while 
15:   return W, H 
16: end procedure 
17: function Threshold(M, t, ε)   ▷ Threshold each element in matrix M 
18:   for all Mi,j in Mdo 
19:      Mi,j{Mi,jifMi,j>tεotherwise 
20:   end for 
21: end function 

To allow for import of prior knowledge, Algorithm 1 allows for the user to supply initial values for W and H in the form of inputs W0 and H0. To fix certain columns of W or rows of H throughout optimization in order to reflect ground-truth knowledge, one may fix the gradients for known components to zero in the following manner:

DβMi,j=0Mi,jSM,
(5)

where SM is the set of all known components of matrix M.

During the analysis of experimental results, we presented the MSE alongside the mean R-factor given by

R=i=1n|XWH|i=1n|X|,
(6)

as a commonly used metric in crystallographic and scattering methods.

Algorithm 1 has been implemented in PyTorch,43 and is publicly available at https://github.com/bnl/pub-Maffettone_2021_04.

For the measured BaTiO3 dataset, the as fit canonical and constrained NMF components and weights are shown in Figs. 7 and 8, respectively. For the measured molten salt dataset, the as fit canonical and constrained NMF components and weights are shown in Figs. 9 and 11, respectively, including how the model varies as a function of the number of components.

Total scattering measurements were performed at the PDF beamline at the National Synchrotron Light Source II at Brookhaven National Laboratory. The wavelength of incident x-rays for both measurements was found to be 0.1671 Å (74.21 keV). Data were collected in transmission mode with a flat-panel Perkin–Elmer detector. Sample-to-detector distance and detector orientation were calibrated using Ni powder standard. All raw 2D images were background subtracted using a dark image collected immediately prior to each scan. Beamstop and detector aberrations were removed with masking prior to radial integration that was performed using pyFAI,45 with PDFs then being generated using PDFgetX3.46 

For the molten salt measurements, the sample-to-detector distance was found to be 282.74 mm. The sample was held in 1 mm quartz capillaries resistively heated using Nichrome wire, with temperatures measured via a K-type thermocouple. Measurements were collected with 10 s data collection time per temperature. For the BaTiO3 measurements, samples were held in 1 mm quartz capillaries, with temperature controlled through a liquid nitrogen cryostream directed at the sample. Diffraction (XRD) measurements were collected at 1 m from the sample with 60 s data collection time per temperature.

The goal of NMF is to find an m × k non-negative matrix W of weights and a k × n non-negative matrix H of components, to approximate a dataset XWH, where m is the number of patterns in a dataset, n is the number of points in a pattern, and k is the chosen number of components. The details of this optimization are presented in Sec. II. We augmented this approach to allow the user—or an algorithm—to initialize any column of W or row of H, and to subsequently constrain those vectors from being updated during the optimization. It is worth noting that the algorithm is often presented with the contents of W and H reversed; however, we find it more intuitive to have W refer to weights and have done so herein.

To compare the capabilities of canonical and constrained NMF, we assembled two synthetic datasets, which are shown in Fig. 1. The first highlights a common case where two unique, overlapping, components mix linearly with coefficients that sum to one [Fig. 1(a)]. This occurs in binary mixtures when the measured response function is a weighted sum of the pure response functions, as is the case for diffraction.

FIG. 1.

Waterfall plots showing the full synthetic datasets used to demonstrate the capabilities of canonical and constrained NMF. (a) Two noisy Gaussian curves centered at 2.5 and −2.5 are mixed in linear combinations with coefficients that vary approximately, but not exactly, linearly across the dataset. (b) A Lorentzian centered at 3.0, a box function of width two centered at 5.0, and a Gaussian centered at 6.0 are mixed as linear combination with functionally varying weights across the dataset.

FIG. 1.

Waterfall plots showing the full synthetic datasets used to demonstrate the capabilities of canonical and constrained NMF. (a) Two noisy Gaussian curves centered at 2.5 and −2.5 are mixed in linear combinations with coefficients that vary approximately, but not exactly, linearly across the dataset. (b) A Lorentzian centered at 3.0, a box function of width two centered at 5.0, and a Gaussian centered at 6.0 are mixed as linear combination with functionally varying weights across the dataset.

Close modal

The second synthetic dataset contains three unique functions that are mixed non-monotonically as the dataset index progresses [Fig. 1(b)]. The three functions, chosen to be generic and visually clear, are a Lorentzian, a Gaussian, and a box function. The Gaussian and box function overlap to create a non-differentiable result. This dataset explores a limiting and challenging use case where underlying physical components originate from mixed contributions of both smoothly varying and discontinuous physical features in measured data, such as aberrations introduced by data processing procedures. The weights represent some commonly encountered ways that physical processes—and therefore components—can evolve naturally in chemical and physical processes (gradual decay, transitory existence, sudden onset). This family of challenges highlights the scenario wherein a priori knowledge of one or all components exists, but not necessarily knowledge of the weights.

We next examined two distinct experimental systems in order to validate our approach to constrained NMF, the full datasets of which are shown in Fig. 2. First, we analyzed the subtle temperature-driven phase transitions of barium titanate, BaTiO3 [Fig. 2(a)]. BaTiO3 is a well-studied material that has been used to probe the mechanism of ferroelectricity because its phase transitions exist at relatively low temperature.47,48 At high temperature, BaTiO3 exists as a paraelectric cubic phase, where on average the titanium (Ti) atoms are octahedrally coordinated by six oxygen (O) atoms. At room temperature, the average displacement of the Ti atom elongates along [001] crystal axis, resulting in a ferroelectric tetragonal phase. Further cooling results in an orthorhombic phase with polarization along the [011] axis. At low temperature, the material reaches a rhombohedral ground state, with all Ti atoms polarized along the [111] axis. This symmetry breaking transformation of the crystal lattice results in subtle peak splitting of the XRD pattern. It is further obfuscated by thermal expansion, which drives peaks in the XRD pattern to shift. The differences between patterns is so subtle that classification across these temperature-induced phase transitions is not possible without expert intervention during analysis.47 

FIG. 2.

Waterfall plots showing the full variable temperature diffraction datasets used herein to demonstrate constrained NMF. The plots of the normalized intensity vs the magnitude of the reciprocal lattice vector, Q, is shown for the datasets. (a) BaTiO3 undergoes three phase transitions across the temperature range 150–450 K, which are imperceptible even to conventional refinement techniques. (b) NaCl:CrCl3 diffraction measurements were carried out in a temperature range of 300–963 K that included a melting transition and an anomalous high temperature solid phase (shown by the remaining sharp peaks above 650 K).

FIG. 2.

Waterfall plots showing the full variable temperature diffraction datasets used herein to demonstrate constrained NMF. The plots of the normalized intensity vs the magnitude of the reciprocal lattice vector, Q, is shown for the datasets. (a) BaTiO3 undergoes three phase transitions across the temperature range 150–450 K, which are imperceptible even to conventional refinement techniques. (b) NaCl:CrCl3 diffraction measurements were carried out in a temperature range of 300–963 K that included a melting transition and an anomalous high temperature solid phase (shown by the remaining sharp peaks above 650 K).

Close modal

In our final experimental system, we examined the structural evolution of a near-eutectic mixture of sodium chloride and chromium chloride, NaCl:CrCl3, when it is heated beyond its melting temperature [Fig. 2(b)]. Such molten salts are of interest in the development of accident-tolerant nuclear fuels.42 Unlike the changes observed in BaTiO3, the crystalline–amorphous transitions in this salt are abrupt and easily detectable. We used constrained NMF to understand the behavior of NaCl:CrCl3 across a temperature range of 300–963 K, which included its melting transition. In this experiment, there was a residual solid phase after the melting transition. Without any additional analysis, it remained unclear to the research team whether that solid was a distinct high-temperature phase or an experimental artifact. Taken together with the barium titanate dataset and the two previously outlined synthetic datasets, these four datasets present the clear utility of constrained NMF for on-the-fly and in situ analysis of diffraction data.

Through examination of the first synthetic dataset [Fig. 1(a)], we demonstrated the challenge of retaining physically relevant components from NMF even when the weights are known a priori. This is a common situation in diffraction studies of variable composition or phase mixing, where complementary experimental information (preparation conditions or spectroscopy) may indicate the concentration of individual phases without specifying the diffraction patterns of the phases. We considered a continuous mixture of two Gaussian functions, where noise is added both to the composition and to the final function.

The results of using canonical NMF to model this dataset is summarized in Fig. 3, which shows the modeled components, weights, and three representative reconstructions from the full dataset—a presentation format consistent for all synthetic examples discussed herein. In decomposing the Gaussian dataset using canonical NMF, the reconstruction loss as defined by SE [Eq. (3)] achieved a mean value (MSE) of 5.788 over the dataset. The common crystallographic R-factor metric [Eq. (6)] of the reconstruction achieved a mean value of 0.066. The decomposition into two components created a model of the mixtures that fits the data well. Even though the learned components are similar to the true functions, their respective magnitudes are not obviously linked to the magnitudes of those functions [Fig. 3(d)]. Furthermore, the weights must compensate for this increased predicted magnitude, and inevitably over-fit the noise in the dataset [Fig. 3(e)]. This is the expected behavior of canonical NMF, since it is naive to any prior knowledge or physical constraints we can impose on the decomposition. From this, we can see the baseline utility of canonical NMF for dimensionality reduction; however, even in this simple dataset we see how it is possible to obtain physically implausible solutions.

FIG. 3.

The reconstruction of the (a) first, (b) median, and (c) last pattern of a dataset of mixed Gaussian functions using canonical NMF. The full dataset is shown in Fig. 1(a). (d) Both the ordering and amplitude of the learned components fails to match the ground truth functions. (e) By plotting the learned and true weights against the dataset index, it is shown that the learned weights compensate for this over-estimated amplitude of the learned components with a magnitude less than that of the ground truth. The learned weights also over-fit the non-linear noise in the true weights.

FIG. 3.

The reconstruction of the (a) first, (b) median, and (c) last pattern of a dataset of mixed Gaussian functions using canonical NMF. The full dataset is shown in Fig. 1(a). (d) Both the ordering and amplitude of the learned components fails to match the ground truth functions. (e) By plotting the learned and true weights against the dataset index, it is shown that the learned weights compensate for this over-estimated amplitude of the learned components with a magnitude less than that of the ground truth. The learned weights also over-fit the non-linear noise in the true weights.

Close modal

We next sought to inform the NMF model with prior knowledge of weights in the form of optimization constraints. In this use case, the weights were limited to the coordinates of a simplex (that is, they sum to unity) and expected to vary linearly. The results of this constrained NMF model are shown in Fig. 4. By initializing the NMF with the “true” weights, and constraining their update in the optimization algorithm, we find a comparably performing decomposition (MSE of 6.903 and mean R-factor 0.0722) that was much more likely given the physical constraints of the system [Fig. 4(d)]. By design, the performance metrics (MSE and R-factor) are slightly worse than those for the canonical NMF because we do not allow the decomposition to over-fit the noise in the weights (which is not exactly linear). This forces the weights to follow the prior and keeps the reconstruction robust to noise or outliers. Importantly, the resulting components from the constrained NMF have the same amplitude as the “true” function. In a crystallographic or spectroscopic context, this is important because peak intensity carries physical significance.

FIG. 4.

The reconstruction of the (a) first, (b) median, and (c) last pattern of a dataset of mixed Gaussian functions using constrained NMF. The weights of the decomposition were constrained to be linear with respect to the dataset index. The full dataset is shown in Fig. 1(a). (d) Both the ordering and amplitude of the learned components successfully match the ground truth functions. (e) The learned and true weights are plotted against the dataset index. By fixing the weights, they cannot fit the noise in the ground truth but serve as a physically meaningful approximation.

FIG. 4.

The reconstruction of the (a) first, (b) median, and (c) last pattern of a dataset of mixed Gaussian functions using constrained NMF. The weights of the decomposition were constrained to be linear with respect to the dataset index. The full dataset is shown in Fig. 1(a). (d) Both the ordering and amplitude of the learned components successfully match the ground truth functions. (e) The learned and true weights are plotted against the dataset index. By fixing the weights, they cannot fit the noise in the ground truth but serve as a physically meaningful approximation.

Close modal

The results from this constrained NMF are obtained rapidly, in a matter of seconds on a personal laptop, and can be produced on-the-fly during measurement. The output components offer immediate insight into the underlying dataset, while depending on a predicted approximate prior. These components can be used as starting conditions for exact refinement,20 of each pattern or spectrum in a dataset. When analysis is less time constrained, other methods for further tuning the optimization of constrained NMF can be explored with continual human input. Some such methods include using non-rigid constraints and Markov chain Monte Carlo optimization,38 which would enable a flexible, non-exact, linear prior.

As a second controlled example, we created a synthetic dataset from three diverse functions with smooth, non-monotonically varying weights in a linear combination [Fig. 1(b)]. Those functions included an overlapping box and Gaussian function, and a Lorentz function that does not overlap either. In this scenario, prior knowledge of any of these components significantly constrains the NP-hard optimization problem of NMF.

As can be seen in Fig. 5, this dataset was not effectively decomposed by canonical NMF. The resulting reconstruction MSE was 25.441 with a mean R-factor of 0.034. One reason for this failure was the non-differentiable box function. This property makes decomposition particularly challenging for the gradient descent methods used in the optimization. As shown in Fig. 5(d), the model found a local minimum in the optimization where the learned components were partial mixtures of the true components. This is both physically inaccurate and results in poor reconstruction. In a case such as this, the decomposition from canonical NMF results in a weak, non-physical, dimensionality reduction, which offers no new knowledge to the scientist.

FIG. 5.

The reconstruction of the (a) first, (b) median, and (c) last pattern of a dataset of mixed Gaussian, Lorentzian, and box functions using canonical NMF. The full dataset is shown in Fig. 1(b). (d) Canonical NMF fails to successfully decompose this dataset and produce an accurate reconstruction. The learned components are mixtures of the underlying functions. (e) The learned and true weights are plotted against the dataset index. The learned weights produce the best possible reconstruction given the inaccurate components, albeit the reconstruction in this case is worse than that produced by constrained NMF (Fig. 6).

FIG. 5.

The reconstruction of the (a) first, (b) median, and (c) last pattern of a dataset of mixed Gaussian, Lorentzian, and box functions using canonical NMF. The full dataset is shown in Fig. 1(b). (d) Canonical NMF fails to successfully decompose this dataset and produce an accurate reconstruction. The learned components are mixtures of the underlying functions. (e) The learned and true weights are plotted against the dataset index. The learned weights produce the best possible reconstruction given the inaccurate components, albeit the reconstruction in this case is worse than that produced by constrained NMF (Fig. 6).

Close modal

Corollary to the previous example (where prior knowledge of the weights enabled a physically meaningful extraction of components), we sought to inform NMF of the components as a prior to decompose these composite functions. The results of modeling this dataset with constrained NMF are shown in Fig. 6. Despite the non-trivial nature of the weights, prior knowledge of the component functions enabled the model to learn those weights to a high degree of accuracy [Fig. 6(e)]. Again, we see the addition of priors on the components guiding the model away from overfitting. In this instance, the constrained NMF outperformed canonical NMF even by the standard metrics (MSE = 1.336, Rfactor= 0.013)—improving in the MSE by nearly a factor of 20. This approach is also applicable if only some of the components are known, albeit less robust for this use case (Figs. S1–S6). This suggests that a procedure that allows for iterative inclusion of more priors through human or automatic intervention would allow for even the most complex datasets to be successfully resolved using constrained NMF.

FIG. 6.

The reconstruction of the (a) first, (b) median, and (c) last pattern of a dataset of mixed Gaussian, Lorentzian, and box functions using constrained NMF. The full dataset is shown in Fig. 1(b). (d) Prior knowledge of the functions fixes the underlying components. (e) The learned and true weights are plotted against the dataset index. The learned weights match the underlying mixing function and create an accurate reconstruction of the dataset.

FIG. 6.

The reconstruction of the (a) first, (b) median, and (c) last pattern of a dataset of mixed Gaussian, Lorentzian, and box functions using constrained NMF. The full dataset is shown in Fig. 1(b). (d) Prior knowledge of the functions fixes the underlying components. (e) The learned and true weights are plotted against the dataset index. The learned weights match the underlying mixing function and create an accurate reconstruction of the dataset.

Close modal

Following this validation of constrained NMF in controlled systems, we sought to test its ability to construct a meaningful decomposition in a subtle and challenging crystallographic problem with experimentally measured data. The BaTiO3 dataset presented in Fig. 2 contains four distinct crystallographic phases (Pm3¯m,P4mm, Amm2, and R3m) across the temperature range measured. These transitions are subtle and not easily discernible using conventional analysis methods without expert intervention. Between these order–disorder phase transitions, there is also continuous peak shifting due to thermal expansion.

As seen in Fig. 7, canonical NMF is poorly suited to address translational variance (peak shifting) in decomposing a dataset into composite patterns.34,36,37 Additionally, it was also incapable of identifying the clear second order phase transitions within the data. Instead, it learned mixtures of the “true” components as its fundamental components in the output decomposition. From Fig. 7(b), one might falsely conclude that the high temperature phase (blue) increases approximately linearly in composition with temperature, or that there are three components that contribute significantly to the low temperature phase. A fastidious materials expert inspecting the NMF fit may disregard the analysis outright as it violates the Gibbs phase rule by suggesting the coexistence of four phases. Despite this, the reconstruction loss from NMF was quite low with an MSE of 0.150 and a mean R-factor of 0.039. This demonstrates that canonical NMF is not well suited for crystallographic datasets with subtle transitions, a common case for diffraction studies.

FIG. 7.

Canonical NMF fails to produce a physically meaningful decomposition of the variable temperature BaTiO3 dataset [Fig. 2(a)]. (a) The learned components are randomized mixtures of real patterns. (b) The resulting weights carry no significance regarding the composition or any other property of the physical system. They do not sum to 1, and suggest the presence of all four components throughout—a clear violation of the Gibbs phase rule.

FIG. 7.

Canonical NMF fails to produce a physically meaningful decomposition of the variable temperature BaTiO3 dataset [Fig. 2(a)]. (a) The learned components are randomized mixtures of real patterns. (b) The resulting weights carry no significance regarding the composition or any other property of the physical system. They do not sum to 1, and suggest the presence of all four components throughout—a clear violation of the Gibbs phase rule.

Close modal

We next considered constraining the components of NMF to extract a better, more physically realistic, decomposition of the BaTiO3 dataset. We sought to approach this in a way that made no presumption of a specific phase, and only operated according to the knowledge that the dataset was measured along a single state variable (temperature). While our synthetic experiment mixing diverse functions demonstrated that constraining all of the components led to the optimal solution, such complete system knowledge may not be expected in all experimental applications. Previous studies on diffraction data have demonstrated that partial constraints may enable the extraction of physically sensible intermediates.38,49 To construct a physically meaningful solution without explicit knowledge of the underlying phases, we designed an iterative algorithm that wraps our constrained NMF to handle datasets over state variables such as temperature or composition.

The iterative application of constrained NMF automatically selects components to constrain, thus depending only on the user-selected number of components. It begins by assuming the end members of a dataset (in this case lowest and highest temperature patterns) are unique, fixing those members as components, and optimizing a partially constrained NMF. In the limiting case of a rank-two NMF, this amounts to combinatorial decomposition, which has been successful in interpreting the total scattering data.12 When more components are used in the decomposition, the algorithm then selects the variable component corresponding to the highest weights, then finds the location where this component is most prevalent in the dataset. The selected component is then fixed to the dataset member at this location, and the NMF optimization proceeds with one additional constraint. This process continues until all components of the NMF are constrained. Critically for total scattering datasets, this results in weights that are bounded between 0 and 1, thus possessing a meaningful mapping to fractional composition.

The results of applying this iterative application of constrained NMF to the barium titanate dataset is shown in Fig. 8. While this approach yielded results with similar reconstruction performance to those of canonical NMF (MSE = 0.349, Rfactor=0.051), the resulting components and weights can be unambiguously interpreted. As the fixed components were chosen directly from the measured dataset, they can be used directly as reference diffraction patterns of pure phases for further refinement [Fig. 8(a)]. Furthermore, the weights can be interpreted as molar fractions, showing phase mixtures existing in the sample during transitions [Fig. 8(b)]. The applications of constraints—even without prior knowledge of what those constraints should be—enabled a substantial improvement over canonical NMF, providing both physically meaningful components and weights.

FIG. 8.

Automated constrained NMF produces a physically meaningful decomposition of the variable temperature BaTiO3 dataset [Fig. 2(a)]. (a) The learned components correspond to the pure phases at the median temperatures where they are observed. (b) The resulting weights approximate mixing between phases and thermal expansion causing translational variance in the XRD patterns.

FIG. 8.

Automated constrained NMF produces a physically meaningful decomposition of the variable temperature BaTiO3 dataset [Fig. 2(a)]. (a) The learned components correspond to the pure phases at the median temperatures where they are observed. (b) The resulting weights approximate mixing between phases and thermal expansion causing translational variance in the XRD patterns.

Close modal

There is still some room for improvement of this method. The transitions provided by the constrained NMF are still too gradual for second order phase transitions. While they occur in approximately the correct location, they indicate more mixing than is actually observed. This is a direct result of translation dependence of NMF: peak shifts in pure phases are likely cast as mixtures between minimally and maximally shifted components that are otherwise identical. Supervised or semi-supervised convolutional deep learning methods have exceeded this performance.28,29,50 Nonetheless, to the authors' knowledge, this is the best performance of any fully unsupervised ML approach at segregating the phases in such a dataset.

We then used NMF to monitor an experimental transition of solid crystalline NaCl:CrCl3 into a molten amorphous state wherein the number of component phases is unknown. Using the elbow method51 to examine the MSE and Kullback–Leibler (KL)-divergence loss of canonical NMF, we noted diminishing returns in the reconstruction at around four components (Fig. 9). As shown in Fig. 10, canonical NMF using four components produces a decomposition with low reconstruction error (MSE = 3.355, R-factor = 0.020) that captures the solid to amorphous phase boundary with a clear second order phase transition shown at 676 K. However, as in our study of BaTiO3, the results are physically unrealistic, because both the high temperature components are not consistent with real diffraction patterns, and the weights suggest a 3–4 phase mixture throughout the temperature series.

FIG. 9.

The loss curve of NMF with respect to the number of components shows diminishing returns around four components. This heuristic identification is subjective in nature, but multiple instances of NMF can easily be run within a range of components to test applicability to a dataset.

FIG. 9.

The loss curve of NMF with respect to the number of components shows diminishing returns around four components. This heuristic identification is subjective in nature, but multiple instances of NMF can easily be run within a range of components to test applicability to a dataset.

Close modal
FIG. 10.

Canonical NMF produces a confusing decomposition, offering limited clarity to the scientist. (a) The learned components are non-physical and cannot be interpreted as XRD patterns. (b) The learned weights provide little information beyond an abrupt change around 400 K.

FIG. 10.

Canonical NMF produces a confusing decomposition, offering limited clarity to the scientist. (a) The learned components are non-physical and cannot be interpreted as XRD patterns. (b) The learned weights provide little information beyond an abrupt change around 400 K.

Close modal

During the time-sensitive conditions experienced when performing such an experiment at a central facility, it is important to extract meaningful scientific insight in real time. Integrating the NMF into data acquisition pipeline52 enables the researcher to note the transition temperature on-the-fly; however, the canonical approach is unable to extract meaningful components. Without additional processing, the solid material that existed past the transition temperature was unable to be characterized. It is crucial to know if this phase is unique and mandates further use of the limited measurement time available. In the case of any experiment that is irreversible, destructive, or expensive to reproduce, immediate access to this type of information is paramount.

The constrained NMF approach offered this immediate analysis capability. The elbow method produced similar learning curves, with a four-component MSE of 9.193 and a mean R-factor of 0.0334. Due to the uncertain effects of peak shifting and thermal expansion, we chose to examine a range of values around this inflection point, and performed constrained NMF using 3–5 components in parallel as shown in Fig. 11.

FIG. 11.

Automated constrained NMF was run using 3–5 components asynchronously on the molten salt dataset. The learned components for the (a) three-, (c) four-, and (e) five-component decompositions provide insight into thermal expansion, and the phases present during the temperature ramp. The learned weights for the (b) three-, (d) four-, and (f) five-component decompositions demonstrate the second order phase transition, the characteristic response to thermal expansion, and persistence of a solid phase after 400 K.

FIG. 11.

Automated constrained NMF was run using 3–5 components asynchronously on the molten salt dataset. The learned components for the (a) three-, (c) four-, and (e) five-component decompositions provide insight into thermal expansion, and the phases present during the temperature ramp. The learned weights for the (b) three-, (d) four-, and (f) five-component decompositions demonstrate the second order phase transition, the characteristic response to thermal expansion, and persistence of a solid phase after 400 K.

Close modal

Using the same automatic iterative constraining procedure from our study of BaTiO3, the constrained NMF elucidated a significantly more physically meaningful decomposition than canonical NMF, readily discovering the difference between a component unique to a phase and shifting features due to thermal expansion. In the rank-three constrained NMF, two solid components and one amorphous component arose in the decomposition [Figs. 11(a) and 11(b)]. The solid components were nearly identical and showed a linear variation in weight (composition), which was 0.5/0.5 at the midpoint in the temperature range prior to the first order phase transition. A small and decaying amount of the high temperature solid phase was present early after the transition. This linear exchange between the two similar components below 400 K arises from the translational dependence of NMF: the algorithm compensates by considering two identical [besides shift] patterns at both ends of the relevant temperature range. From the rank-three constrained NMF alone, we can see the gradual impact of thermal expansion followed by a second order transition where some of the material remained solid.

Considering higher orders of constrained NMF showed the uniqueness of the high temperature solid material, and further validated the assertion that smooth, linear variations of similar patterns are an indication of peak shifting. The rank-four NMF extracts similar components as the rank-three NMF, albeit with a new high temperature component to describe the fine structure observed after the transition temperature [Figs. 11(c) and 11(d)]. This pattern [orange in Fig. 11(c)] was a result of an amorphous background (later fully pronounced at high temperature in dark gray) unexpected presence of persistent, strained, transitory crystallites of the original solid that had a preferred orientation with respect to the x-ray beam. Finally, the rank-five NMF [Figs. 11(e) and 11(f)] decomposed the dataset into the same components with an additional subdivision of the low temperature solid phase. This subdivision occurred at the midpoint of the previous two end members, demonstrating the behavior of the constrained NMF when compensating for pattern shift. This response to the imperceptible shift in the data is a useful and predictable benefit to the iterative and constrained approach: without translational invariance, this behavior enables the user to confidently note when components belong to the same—but shifting—phase.

The results from the decomposition using constrained NMF on a variable temperature diffraction dataset are immediately more interpretable than a similar decomposition using canonical NMF. We quantify this assertion by performing a Le Bail refinement in TOPAS53 of the true crystal structures against each of the components produced by NMF. The R-weighted pattern (Rwp) from these refinements is summarized in Fig. 12. The resulting error is consistently smaller (by a factor of four) for the constrained components, confirming that these components can be better interpreted as diffraction patterns from the physical system.

FIG. 12.

A comparison of violin plots showing distributions of error from La Bail refinements performed on all output NMF components fit with different numbers of components. The plots show the distribution of the canonical (orange) and constrained (blue) results, with the mean shown by a white dot and quartiles shown in the black box. Note the significant improvement in fit quality of the constrained components in all scenarios.

FIG. 12.

A comparison of violin plots showing distributions of error from La Bail refinements performed on all output NMF components fit with different numbers of components. The plots show the distribution of the canonical (orange) and constrained (blue) results, with the mean shown by a white dot and quartiles shown in the black box. Note the significant improvement in fit quality of the constrained components in all scenarios.

Close modal

As with the BaTiO3 example, the weights correspond to interpretable compositions. Without solid–solid phase transitions as seen with BaTiO3, the linear response to pattern shifting is apparent. While this ameliorates the lack of translational invariance of NMF, a more ideal solution would be one that maps only one component to a single phase regardless of the translational variations occurring during an experiment. Moreover, the recognition of persistent solids by the NMF is useful for adjusting an experimental protocol on-the-fly. Through user or algorithm intervention, the identification of a second order transition could be used to slow the rate of temperature change, allowing the system time to equilibrate. In such experiments, there is often a lag between the measured temperature and the material temperature, especially around a transition that involves latent heat. Indicating this transition on-the-fly enables a more robust and informative experiment.

Here, we developed a constrained NMF approach that interprets streaming spectral data on-the-fly, promptly providing critical insight and analysis from time-sensitive experiments such as XRD and PDF. This tool is now regularly deployed at the PDF beamline at National Synchrotron Light Source II. We have continually refined how we utilize and interpret constrained NMF with our user community and outline some best practices here.

We have considered several approaches for constraining components and weights. Constraints can be specified a priori based on system knowledge, such as known material states, or in a data-driven manner during the experiment.

For diffraction, one a priori approach is to use simulated patterns of known or anticipated phases as constraints, forcing the unconstrained components to describe new or unknown phases. Alternatively, experimental patterns drawn directly from the dataset can be used as fundamental components. This is a viable approach for a variable temperature series where the end members are expected to be distinct (Fig. 11). For such cases, we have implemented a completely data-driven approach to derive the constrained components. This approach operates iteratively, seeking out points in the dataset that are maximally described by a single component (high weight) and constraining the value of the said component at that that point, thus reducing the degrees of freedom for NMF to explore in subsequent iterations. This method was illustrated in our study of BaTiO3, where it recovered the known pure phases and approximate transition temperatures (Fig. 8). We have also considered the case where the mixture weights (i.e., volume fraction) are known, but not the components themselves, and demonstrated this capability on synthetic data (Fig. 4). Future work in this area should allow for soft constraints, such as enforcing monotonicity of weights or allowing the initial constraints to adjust with a penalty.

Because the decomposition is hyperparameter dependent, multiple instances of NMF can be run in parallel to efficiently explore the parameter space, enabling the scientist to investigate different constraints or numbers of components while the measurement is ongoing. This is particularly advisable when the user is uncertain about the number of fundamental components and allows the NMF agent to generate elbow plots as seen in Fig. 9. This also allows for real-time comparisons between constrained NMF and canonical NMF as shown in Figs. 7 and 8. Conditioning and re-plotting the NMF outputs occurs on the order of seconds to minutes depending on the size of the dataset. This allows for a feedback loop between the scientist, the analysis, and the experiment that is beneficial for a variety of studies. We have found the speed of this implementation sufficient for all encountered experiments by performing the recalculation and plotting using only the CPU; however, the decision to implement in PyTorch enables facile Graphical Processing Unit (GPU) acceleration using Compute Unified Device Architecture (CUDA) libraries if required by future applications.

While there are certain properties of NMF that encourage physicality in the output for x-ray scattering data (e.g., non-negativity and additive components), these characteristics alone do not assure physical interpretability. Although the features that make a decomposition immediately interpretable will depend on the nature of the experiment and discretion of the user employing the method, we have described some emergent features that make XRD decomposition readily interpretable: satisfying Gibbs phase rule, weights which sum to unity, and components that seemingly represent unique phases and not mixtures. From our synthetic examples (Figs. 3–6) and the elbow method (Fig. 9), we can see that simply reducing the reconstruction error is not sufficient in creating a more interpretable decomposition. It is also not informative to compare the reconstruction error between different datasets. In practice, we compare the order of magnitude of the reconstruction error between different instances of NMF (canonical, k-components, etc.) on the same dataset. However, this metric only informs the user if the method is capturing the information of the full dataset, and not if the output itself is physically interpretable.

The constrained NMF approach processes large datasets in an unsupervised fashion with an interactive prior: the intuition and decision-making of the human-in-the-loop. These constraints can be chosen explicitly by the researcher, or implicitly from the dataset by an algorithm. This gives the streaming analysis added value that cannot be expected from canonical NMF. We demonstrated the advantages of using constrained NMF to aid in revealing subtle phase transitions of a ferroelectric perovskite and the identification of experimental complexities in molten salts. This approach efficiently handles the drastic increase in data generation rates—critical at modern beamlines—where conventional analysis approaches cannot keep up with the experimental pace required for optimized data collection. Some overall limitations of NMF that still apply to constrained approaches are the inability to retain the partial translational invariance of XRD and PDF patterns,34,41 as well as other common aberrations that impact a pattern without impacting the underlying phase.28,50 Constrained NMF can be extended to a broad range of spectral data applications such as x-ray photoelectron spectra, x-ray absorption near edge spectra, photoluminescence spectra, nuclear magnetic resonance spectra, or mass spectra.49 These agents can be deployed as part of a federation that includes multiple NMF instances asynchronous with supervised ML, and be potentially informative of adaptive experiments.52 This maintains the agency of the researcher through leveraging of expertise, while producing real-time analysis and actionable results during an experiment.

See the supplementary material for examples of partially constrained NMF components in synthetic datasets and demonstrations of applying constrained NMF to pair-distribution function (PDF) data from in situ melting salt experiments.

We acknowledge financial support from the BNL Laboratory Directed Research and Development (LDRD) projects 20–032 “Accelerating materials discovery with total scattering via machine learning” (P.M.M. and D.O.), and the Simons Foundation Center for Computational Biology (A.C.D). This research utilized the PDF (28-ID-1) Beamline and resources of the National Synchrotron Light Source II, a U.S. Department of Energy (DOE) Office of Science User Facility operated for the DOE Office of Science by Brookhaven National Laboratory under Contract No. DE-SC0012704.

The authors have no conflicts to disclose.

The data that support the findings of this study are available from the corresponding author upon reasonable request.

1.
A. G.
Kusne
,
T.
Gao
,
A.
Mehta
,
L.
Ke
,
M. C.
Nguyen
,
K.-M.
Ho
,
V.
Antropov
,
C.-Z.
Wang
,
M. J.
Kramer
,
C.
Long
, and
I.
Takeuchi
, “
On-the-fly machine-learning for high-throughput experiments: Search for rare-earth-free permanent magnets
,”
Sci. Rep.
4
,
6367
(
2014
).
2.
R.
Gómez-Bombarelli
,
J.
Aguilera-Iparraguirre
,
T. D.
Hirzel
,
D.
Duvenaud
,
D.
Maclaurin
,
M. A.
Blood-Forsythe
,
H. S.
Chae
,
M.
Einzinger
,
D.-G.
Ha
,
T.
Wu
,
G.
Markopoulos
,
S.
Jeon
,
H.
Kang
,
H.
Miyazaki
,
M.
Numata
,
S.
Kim
,
W.
Huang
,
S. I.
Hong
,
M.
Baldo
,
R. P.
Adams
, and
A.
Aspuru-Guzik
, “
Design of efficient molecular organic light-emitting diodes by a high-throughput virtual screening and experimental approach
,”
Nat. Mater.
15
,
1120
1127
(
2016
).
3.
R. L.
Greenaway
,
V.
Santolini
,
M. J.
Bennison
,
B. M.
Alston
,
C. J.
Pugh
,
M. A.
Little
,
M.
Miklitz
,
E. G. B.
Eden-Rump
,
R.
Clowes
,
A.
Shakil
,
H. J.
Cuthbertson
,
H.
Armstrong
,
M. E.
Briggs
,
K. E.
Jelfs
, and
A. I.
Cooper
, “
High-throughput discovery of organic cages and catenanes using computational screening fused with robotic synthesis
,”
Nat. Commun.
9
,
2849
(
2018
).
4.
S.
Langner
,
F.
Häse
,
J. D.
Perea
,
T.
Stubhan
,
J.
Hauch
,
L. M.
Roch
,
T.
Heumueller
,
A.
Aspuru-Guzik
, and
C. J.
Brabec
, “
Beyond ternary OPV: High-throughput experimentation and self-driving laboratories optimize multicomponent systems
,”
Adv. Mater.
32
,
1907801
(
2020
).
5.
A.
Ludwig
, “
Discovery of new materials using combinatorial synthesis and high-throughput characterization of thin-film materials libraries combined with computational methods
,”
NPJ Comput. Mater.
5
,
70
(
2019
).
6.
P.
Decker
,
D.
Naujoks
,
D.
Langenkämper
,
C.
Somsen
, and
A.
Ludwig
, “
High-throughput structural and functional characterization of the thin film materials system Ni-Co-Al
,”
ACS Comb. Sci.
19
,
618
624
(
2017
).
7.
H. H.-M.
Yeung
,
Y.
Wu
,
S.
Henke
,
A. K.
Cheetham
,
D.
O'Hare
, and
R. I.
Walton
, “
In-situ observation of successive crystallizations and metastable intermediates in the formation of metal–organic frameworks
,”
Angew. Chem. Int. Ed.
55
,
2012
2016
(
2016
).
8.
B. P.
MacLeod
,
F. G. L.
Parlane
,
T. D.
Morrissey
,
F.
Häse
,
L. M.
Roch
,
K. E.
Dettelbach
,
R.
Moreira
,
L. P. E.
Yunker
,
M. B.
Rooney
,
J. R.
Deeth
,
V.
Lai
,
G. J.
Ng
,
H.
Situ
,
R. H.
Zhang
,
M. S.
Elliott
,
T. H.
Haley
,
D. J.
Dvorak
,
A.
Aspuru-Guzik
,
J. E.
Hein
, and
C. P.
Berlinguette
, “
Self-driving laboratory for accelerated discovery of thin-film materials
,”
Sci. Adv.
6
,
eaaz8867
(
2020
).
9.
Z.
Li
,
M. A.
Najeeb
,
L.
Alves
,
A. Z.
Sherman
,
V.
Shekar
,
P.
Cruz Parrilla
,
I. M.
Pendleton
,
W.
Wang
,
P. W.
Nega
,
M.
Zeller
,
J.
Schrier
,
A. J.
Norquist
, and
E. M.
Chan
, “
Robot-accelerated perovskite investigation and discovery
,”
Chem. Mater.
32
,
5650
5663
(
2020
).
10.
H.
Fleischer
,
D.
Baumann
,
S.
Joshi
,
X.
Chu
,
T.
Roddelkopf
,
M.
Klos
, and
K.
Thurow
, “
Analytical measurements and efficient process generation using a dual–arm robot equipped with electronic pipettes
,”
Energies
11
,
2567
(
2018
).
11.
S. J.
Campbell
,
D. B.
Allan
,
A. M.
Barbour
,
D.
Olds
,
M. S.
Rakitin
,
R.
Smith
, and
S. B.
Wilkins
, “
Outlook for artificial intelligence and machine learning at the NSLS-II
,”
Mach. Learn.: Sci. Technol.
2
,
013001
(
2020
).
12.
D.
Olds
,
P. F.
Peterson
,
M. K.
Crawford
,
J. R.
Neilson
,
H.-W.
Wang
,
P. S.
Whitfield
, and
K.
Page
, “
Combinatorial appraisal of transition states for in situ pair distribution function analysis
,”
J. Appl. Crystallogr.
50
,
1744
1753
(
2017
).
13.
D.
Olds
, “
Synchrotron x-ray diffraction for energy and environmental materials: The current role and future directions of total scattering beamlines in the functional material scientific ecosystem
,”
Synchrotron Radiat. News
33
,
4
10
(
2020
).
14.
T.
Egami
and
S. J.
Billinge
,
Underneath the Bragg Peaks: Structural Analysis of Complex Materials
(
Elsevier
,
2003
).
15.
X.
Hua
,
A. S.
Eggeman
,
E.
Castillo-Martínez
,
R.
Robert
,
H. S.
Geddes
,
Z.
Lu
,
C. J.
Pickard
,
W.
Meng
,
K. M.
Wiaderek
,
N.
Pereira
,
G. G.
Amatucci
,
P. A.
Midgley
,
K. W.
Chapman
,
U.
Steiner
,
A. L.
Goodwin
, and
C. P.
Grey
, “
Revisiting metal fluorides as lithium-ion battery cathodes
,”
Nat. Mater.
20
,
841
850
(
2021
).
16.
Z.
Shadike
,
H.
Lee
,
O.
Borodin
,
X.
Cao
,
X.
Fan
,
X.
Wang
,
R.
Lin
,
S.-M.
Bak
,
S.
Ghose
,
K.
Xu
,
C.
Wang
,
J.
Liu
,
J.
Xiao
,
X.-Q.
Yang
, and
E.
Hu
, “
Identification of LiH and nanocrystalline LiF in the solid–electrolyte interphase of lithium metal anodes
,”
Nat. Nanotechnol.
16
,
549
554
(
2021
).
17.
Z.
Wang
,
C.
Li
, and
K.
Domen
, “
Recent developments in heterogeneous photocatalysts for solar-driven overall water splitting
,”
Chem. Soc. Rev.
48
,
2109
2125
(
2019
).
18.
T.
Löffler
,
A.
Savan
,
A.
Garzón-Manjón
,
M.
Meischein
,
C.
Scheu
,
A.
Ludwig
, and
W.
Schuhmann
, “
Toward a paradigm shift in electrocatalysis using complex solid solution nanoparticles
,”
ACS Energy Lett.
4
,
1206
1214
(
2019
).
19.
G.
Clarke
,
C.
Ablitt
,
J.
Daniels
,
S.
Checchia
, and
M. S.
Senn
, “
In situ X-ray diffraction investigation of electric-field-induced switching in a hybrid improper ferroelectric
,”
J. Appl. Crystallogr.
54
,
533
540
(
2021
).
20.
H. M.
Rietveld
, “
Line profiles of neutron powder-diffraction peaks for structure refinement
,”
Acta Crystallogr.
22
,
151
152
(
1967
).
21.
C.
Farrow
,
P.
Juhas
,
J.
Liu
,
D.
Bryndin
,
E.
Božin
,
J.
Bloch
,
T.
Proffen
, and
S.
Billinge
, “
PDFfit2 and PDFgui: Computer programs for studying nanostructure in crystals
,”
J. Phys.: Condens. Matter
19
,
335219
(
2007
).
22.
C.
Giacovazzo
, ed.,
Fundamentals of Crystallography
, 3rd ed. (
Oxford University Press
,
2011
).
23.
T.
Proffen
and
R.
Neder
, “
Discus: A program for diffuse scattering and defect-structure simulation
,”
J. Appl. Crystallogr.
30
,
171
175
(
1997
).
24.
M. G.
Tucker
,
D. A.
Keen
,
M. T.
Dove
,
A. L.
Goodwin
, and
Q.
Hui
, “
Rmcprofile: Reverse Monte Carlo for polycrystalline materials
,”
J. Phys.: Condens. Matter
19
,
335218
(
2007
).
25.
Y.
Iwasaki
,
A. G.
Kusne
, and
I.
Takeuchi
, “
Comparison of dissimilarity measures for cluster analysis of x-ray diffraction data from combinatorial libraries
,”
NPJ Comput. Mater.
3
,
4
(
2017
).
26.
Z.
Xiong
,
Y.
He
,
J. R.
Hattrick-Simpers
, and
J.
Hu
, “
Automated phase segmentation for large-scale x-ray diffraction data using a graph-based phase segmentation (GPhase) algorithm
,”
ACS Comb. Sci.
19
,
137
144
(
2017
).
27.
C. J.
Long
,
D.
Bunker
,
X.
Li
,
V. L.
Karen
, and
I.
Takeuchi
, “
Rapid identification of structural phases in combinatorial thin-film libraries using x-ray diffraction and non-negative matrix factorization
,”
Rev. Sci. Instrum.
80
,
103902
(
2009
).
28.
F.
Oviedo
,
Z.
Ren
,
S.
Sun
,
C.
Settens
,
Z.
Liu
,
N. T. P.
Hartono
,
S.
Ramasamy
,
B. L.
DeCost
,
S. I. P.
Tian
, and
G.
Romano
, “
Fast and interpretable classification of small x-ray diffraction datasets using data augmentation and deep neural networks
,”
NPJ Comput. Mater.
5
,
1
9
(
2019
).
29.
J.-W.
Lee
,
W. B.
Park
,
J. H.
Lee
,
S. P.
Singh
, and
K.-S.
Sohn
, “
A deep-learning technique for phase identification in multiphase inorganic compounds using synthetic XRD powder patterns
,”
Nat. Commun.
11
,
86
(
2020
).
30.
A.
Ziletti
,
D.
Kumar
,
M.
Scheffler
, and
L. M.
Ghiringhelli
, “
Insightful classification of crystal structures using deep learning
,”
Nat. Commun.
9
,
2775
(
2018
).
31.
J. A.
Aguiar
,
M. L.
Gong
,
R. R.
Unocic
,
T.
Tasdizen
, and
B. D.
Miller
, “
Decoding crystallography from high-resolution electron imaging and diffraction datasets with deep learning
,”
Sci. Adv.
5
,
eaaw1949
(
2019
).
32.
D.
Chen
,
Y.
Bai
,
W.
Zhao
,
S.
Ament
,
J. M.
Gregoire
, and
C. P.
Gomes
, “
Deep reasoning networks for unsupervised pattern de-mixing with constraint reasoning
,” in
Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research
, edited by
F.
Bach
and
D.
Blei
(
PMLR
,
2020
), Vol.
119
.
33.
I.
Takeuchi
,
C. J.
Long
,
O. O.
Famodu
,
M.
Murakami
,
J.
Hattrick-Simpers
,
G. W.
Rubloff
,
M.
Stukowski
, and
K.
Rajan
, “
Data management and visualization of x-ray diffraction spectra from thin film ternary composition spreads
,”
Rev. Sci. Instrum.
76
,
062223
(
2005
).
34.
V.
Stanev
,
V. V.
Vesselinov
,
A. G.
Kusne
,
G.
Antoszewski
,
I.
Takeuchi
, and
B. S.
Alexandrov
, “
Unsupervised phase mapping of x-ray diffraction data by nonnegative matrix factorization integrated with custom clustering
,”
NPJ Comput. Mater.
4
,
43
(
2018
).
35.
J.
Bai
,
J.
Bjorck
,
Y.
Xue
,
S. K.
Suram
,
J.
Gregoire
, and
C.
Gomes
, “
Relaxation methods for constrained matrix factorization problems: Solving the phase mapping problem in materials discovery
,” in
Integration of AI and or Techniques in Constraint Programming
, edited by
D.
Salvagnin
and
M.
Lombardi
(
Springer International Publishing
,
Cham
,
2017
), pp.
104
112
.
36.
J.
Bai
,
Y.
Xue
,
J.
Bjorck
,
R. L.
Bras
,
B.
Rappazzo
,
R.
Bernstein
,
S. K.
Suram
,
R. B.
van Dover
,
J. M.
Gregoire
, and
C. P.
Gomes
, “
Phase mapper: Accelerating materials discovery with AI
,”
AI Mag.
39
,
15
26
(
2018
).
37.
S. K.
Suram
,
Y.
Xue
,
J.
Bai
,
R. L.
Bras
,
B.
Rappazzo
,
R.
Bernstein
,
J.
Bjorck
,
L.
Zhou
,
R. B.
van Dover
,
C. P.
Gomes
, and
J. M.
Gregoire
, “
Automated phase mapping with AgileFD and its application to light absorber discovery in the V–Mn–Nb oxide system
,”
ACS Comb. Sci.
19
,
37
46
(
2017
).
38.
H. S.
Geddes
,
H.
Blade
,
J. F.
McCabe
,
L. P.
Hughes
, and
A. L.
Goodwin
, “
Structural characterisation of amorphous solid dispersions via metropolis matrix factorisation of pair distribution function data
,”
Chem. Commun.
55
,
13346
13349
(
2019
).
39.
D.
Chernyshov
,
I.
Dovgaliuk
,
V.
Dyadkin
, and
W.
van Beek
, “
Principal component analysis (PCA) for powder diffraction data: Towards unblinded applications
,”
Crystals
10
,
581
(
2020
).
40.
E.
Conterosito
,
M.
Lopresti
, and
L.
Palin
, “
In situ x-ray diffraction study of Xe and CO2 adsorption in y zeolite: Comparison between Rietveld and PCA-based analysis
,”
Crystals
10
,
483
(
2020
).
41.
C.-H.
Liu
,
C. J.
Wright
,
R.
Gu
,
S.
Bandi
,
A.
Wustrow
,
P. K.
Todd
,
D.
O'Nolan
,
M. L.
Beauvais
,
J. R.
Neilson
,
P. J.
Chupas
,
K. W.
Chapman
, and
S. J. L.
Billinge
, “
Validation of non-negative matrix factorization for assessment of atomic pair-distribution function (pdf) data in a real-time streaming context
,” arXiv preprint arXiv:2010.11807 (
2020
).
42.
Q.-J.
Li
,
D.
Sprouster
,
G.
Zheng
,
J. C.
Neuefeind
,
A. D.
Braatz
,
J.
Mcfarlane
,
D.
Olds
,
S.
Lam
,
J.
Li
, and
B.
Khaykovich
, “
Complex structure of molten NaCl–CrCl3 salt: Cr–Cl octahedral network and intermediate-range order
,”
ACS Appl. Energy Mater.
4
,
3044
3056
(
2021
).
43.
A.
Paszke
,
S.
Gross
,
F.
Massa
,
A.
Lerer
,
J.
Bradbury
,
G.
Chanan
,
T.
Killeen
,
Z.
Lin
,
N.
Gimelshein
,
L.
Antiga
,
A.
Desmaison
,
A.
Kopf
,
E.
Yang
,
Z.
DeVito
,
M.
Raison
,
A.
Tejani
,
S.
Chilamkurthy
,
B.
Steiner
,
L.
Fang
,
J.
Bai
, and
S.
Chintala
, “
Pytorch: An imperative style, high-performance deep learning library
,” in
Advances in Neural Information Processing Systems 32
, edited by
H.
Wallach
,
H.
Larochelle
,
A.
Beygelzimer
,
F.
d'Alché Buc
,
E.
Fox
, and
R.
Garnett
(
Curran Associates, Inc
.,
2019
), pp.
8024
8035
.
44.
C.
Févotte
and
J.
Idier
, “
Algorithms for nonnegative matrix factorization with the beta-divergence
,”
Neural Comput.
23
,
2421
2456
(
2011
).
45.
G.
Ashiotis
,
A.
Deschildre
,
Z.
Nawaz
,
J. P.
Wright
,
D.
Karkoulis
,
F. E.
Picca
, and
J.
Kieffer
, “
The fast azimuthal integration python library: Pyfai
,”
J. Appl. Crystallogr.
48
,
510
519
(
2015
).
46.
P.
Juhás
,
T.
Davis
,
C. L.
Farrow
, and
S. J.
Billinge
, “
Pdfgetx3: A rapid and highly automatable program for processing powder diffraction data into total scattering pair distribution functions
,”
J. Appl. Crystallogr.
46
,
560
566
(
2013
).
47.
K.
Page
,
T.
Proffen
,
M.
Niederberger
, and
R.
Seshadri
, “
Probing local dipoles and ligand structure in batio3 nanoparticles
,”
Chem. Mater.
22
,
4386
4391
(
2010
).
48.
M.
Wegner
,
H.
Gu
,
R. D.
James
, and
E.
Quandt
, “
Correlation between phase compatibility and efficient energy conversion in zr-doped barium titanate
,”
Sci. Rep.
10
,
3496
(
2020
).
49.
R.
Gu
,
Q.
Du
, and
S. J. L.
Billinge
, “
A fast two-stage algorithm for non-negative matrix factorization in streaming data
,” arXiv preprint arXiv:2101.08431 (
2021
).
50.
P. M.
Maffettone
,
L.
Banko
,
P.
Cui
,
Y.
Lysogorskiy
,
M. A.
Little
,
D.
Olds
,
A.
Ludwig
, and
A. I.
Cooper
, “
Crystallography companion agent for high-throughput materials discovery
,”
Nat. Comput. Sci.
1
,
290
297
(
2021
).
51.
R. L.
Thorndike
, “
Who belongs in the family?
,”
Psychometrika
18
,
267
276
(
1953
).
52.
D.
Allan
,
T.
Caswell
,
S.
Campbell
, and
M.
Rakitin
, “
Bluesky's ahead: A multi-facility collaboration for an a la carte software project for data acquisition and management
,”
Synchrotron Radiat. News
32
,
19
22
(
2019
).
53.
A. A.
Coelho
, “
TOPAS and TOPAS-academic: An optimization program integrating computer algebra and crystallographic objects written in C++
,”
J. Appl. Crystallogr.
51
,
210
218
(
2018
).

Supplementary Material