This Tutorial provides a step-by-step guide on how to apply supervised machine-learning techniques to analyze diffraction and spectroscopy data. It details four models (a reconstruction-focused model, a regression-focused model, a hybrid reconstruction/regression model, and a multimodal model) that use x-ray diffraction profiles and vibrational density of states spectra to predict various microstructural descriptors. We cover data pre-processing steps, construction of the models via dimensionality reduction and regression, training, and analysis of these models. Comparisons of the models' performance are provided, highlighting the strengths and weaknesses of the various approaches.
I. INTRODUCTION
Spectroscopy and diffraction techniques have a long history of use for materials characterization in a wide range of scientific fields, including their use for the identification of the composition1,2 or phase3–5 of materials or the measurement of the stress state of a material,6–9 among many other applications. Advances in experimental tools and automation have made the collection of large amounts of diffraction10–13 or spectroscopy14–16 data simpler than ever before, enabling high-throughput materials screening and characterization.
However, traditional analysis methods for these types of data, notably peak analysis, can become quite complicated when dealing with noise due to instrumental factors or material- and sample-related overlapping features. These complications can lead to biased data interpretation17,18 or limit the analysis to only simple problems where consistent outcomes are possible.19,20 When these methods need human intervention to correct for any of these biases and simplifications or to fine-tune the analysis, they become too slow to use for in-line analysis (which may be desired for accelerating time-sensitive beam-line experiments21) or for handling large volumes of data from high-throughput experimental techniques. Machine-learning methods provide a systematic approach for analyzing and interpreting spectroscopy and diffraction data that avoids the limitations of traditional methods driven by human-identifiable features. These techniques offer significant advantages in automation and efficiency and are capable of handling complex, noisy, or overlapping data. Numerous studies have demonstrated that machine-learning models can not only match the accuracy of traditional analysis methods,21–28 but also reveal additional, hidden materials information that is not easily identifiable by conventional human analysis.21,29,30
In this Tutorial, we provide a step-by-step guide on how to apply supervised machine-learning techniques to analyze diffraction and/or spectroscopy data for the purpose of extracting microstructural state descriptors from an observed x-ray diffraction (XRD) profile. In Sec. II, we introduce the dataset used for this Tutorial. This publicly available dataset consists of simulated XRD profiles and vibrational density of states (VDoS) spectra. These profiles and spectra were generated from atomistic structures that had been subjected to different amounts of disorder insertion and mechanical loading.31,32 For this Tutorial, we chose simulation data because they provide a controlled environment where the underlying patterns are well-understood, allowing us to focus on demonstrating the core concepts and methodologies of the machine-learning methods illustrated hereafter. In Sec. II B, we discuss how to perform data preparation and pre-processing for spectroscopy and diffraction data to improve model performance while preserving the integrity of the data. Section III provides a summary of the essential machine-learning building blocks to be used for diffraction and spectroscopy data analysis. In Sec. IV, we discuss how to design and select a model architecture by defining and comparing four distinct machine-learning models. The first three models operate exclusively on a single input modality (XRD profiles). Each of these models emphasizes different aspects of the machine-learning model’s building blocks, namely, dimensionality reduction, regression, or a hybrid approach that balances both tasks. The fourth model consists of a multimodal approach, which incorporates both XRD profiles and VDoS spectra as inputs. This last model is meant to demonstrate a straightforward method for integrating multiple data types to improve the performance of the analysis. 
These four models are presented to give examples of prioritizing specific aspects of the analysis based on data characteristics and the objective of the analysis. Finally, in Sec. V, we compare the performance of these different models and provide a discussion around model selection. Commented code snippets are included to serve as a bridge between theoretical concepts and practical implementation, enabling the reader to directly apply the discussed techniques to their own research. Overall, this Tutorial aims to equip researchers and practitioners with the knowledge and tools necessary to leverage supervised machine learning for advanced materials characterization, enabling more efficient and insightful analysis of complex microstructural data.
II. DATASET DESCRIPTION
This Tutorial utilizes a subset of a publicly available dataset, which includes simulated spectroscopy and diffraction profiles derived from molecular-dynamics simulations of mechanically deformed and disordered atomic structures.31,32 In this Tutorial, we primarily utilize XRD profiles and the associated microstructural descriptors of bulk single crystal gold (Au). One machine-learning model in this Tutorial also incorporates VDoS spectra available in the dataset.
A. Atomic structures and associated diffraction and spectroscopy data
We performed molecular-dynamics simulations of Au structures using the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS).33 We used an embedded atom method (EAM) interatomic potential34 to describe interactions between Au atoms and created various atomic structures with different strain (either uniaxial or hydrostatic compression or tension) and defect states. Details on how these structures were created are provided elsewhere.31 From these structures, we extracted seven microstructural descriptors: three stress tensor components (σXX, σYY, and σZZ); the face-centered cubic (fcc), hexagonal close-packed (hcp), and disordered phase fractions; and the total dislocation density. These microstructural descriptors serve as regression targets for the machine-learning models designed to predict microstructural information from observed XRD profiles.
XRD profiles were simulated using the LAMMPS diffraction package on all Au atomic structures, with specific parameters for wavelength (1.54 Å), range ( – ), and reciprocal lattice spacing (0.005 Å⁻¹). VDoS spectra were obtained by measuring the velocity autocorrelation function (VACF) in the molecular-dynamics simulations and then computing its Fourier transform. The resulting VDoS spectra span a frequency range from 0 to . This dataset exemplifies how machine learning can link microstructural states to spectroscopy or diffraction data. The methods in this Tutorial are broadly applicable to other materials from the complete dataset31,32 or to other systems with diverse characterization techniques, microstructural properties, and modifications. Importantly, the chosen property and microstructural descriptors must be encoded in the selected characterization technique.
B. Data pre-processing
Data pre-processing is often necessary to enable or improve model training. Key considerations include data consistency, the range and distribution of the data values, normalization of the data, and noise filtering.
Regarding data consistency, input data size generally must be uniform across all samples (some machine-learning architectures can handle disparate input data sizes, but most cannot). In the case of the 1D XRD line profiles, each profile must be adjusted such that it spans the same range of 2θ values. Downsampling is a convenient method for achieving consistency by matching the range and point density of the least dense sample(s) in the dataset. This can be done by using a subset of the points in the high-density sample(s) (e.g., dropping every other point or keeping every fourth point) or by averaging sets of points in bins. Both of these approaches maintain the integrity of the original data, although high-frequency features that exist below the downsampled resolution will be lost. Using spline or other interpolation techniques to increase the density of sparse spectra can seem like an attractive alternative. However, this approach can introduce artificial distortions to the data in regions with sharp features, as seen in Fig. 1(a), where interpolating between the sparsely distributed points creates a spectrum that deviates from the original near sharper features and results in shifts in peak positions and changes in relative feature intensities. For the purposes of this Tutorial, the raw XRD profiles collected from the original dataset were downsampled from an initial vector size of 6000 points down to 150 points by summing 40 points per bin across the range. VDoS spectra were instead truncated to contain a total of 5997 points covering a frequency range from 0 to approximately .
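The bin-based downsampling described above can be sketched as follows. This is a minimal illustration using NumPy, not the code used to prepare the dataset: the function name `downsample_by_binning` and the synthetic single-peak profile are assumptions introduced here, while the 6000-point input, the 40 points per bin, and the 150-point output follow the text.

```python
import numpy as np

def downsample_by_binning(profile, points_per_bin, reduce="sum"):
    """Reduce a 1D profile by combining consecutive points into bins."""
    profile = np.asarray(profile, dtype=float)
    n_bins = profile.size // points_per_bin
    # Drop any trailing points that do not fill a complete bin.
    trimmed = profile[: n_bins * points_per_bin]
    bins = trimmed.reshape(n_bins, points_per_bin)
    return bins.sum(axis=1) if reduce == "sum" else bins.mean(axis=1)

# Synthetic stand-in for a raw 6000-point XRD profile (one Gaussian peak).
index = np.arange(6000)
raw_profile = np.exp(-0.5 * ((index - 3000) / 50.0) ** 2)

binned = downsample_by_binning(raw_profile, points_per_bin=40)  # sum per bin
strided = raw_profile[::40]  # alternative: keep every 40th point
print(binned.shape, strided.shape)  # (150,) (150,)
```

Summing (or averaging) within bins uses every original point, whereas the strided alternative discards 39 of every 40 points; both avoid inventing values between measured points, which is the risk with interpolation.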
FIG. 1. (a) Comparison between a smoothed VDoS spectrum, a sparse sampling of this spectrum, and an interpolation of the sparse sampling. (b) Comparison between two XRD profiles (solid and dotted lines) that are normalized based on the maximum value of XRD1 only (top panel) or based on the maximum values of each profile (bottom panel). (c) Comparison between a noisy VDoS spectrum and several spectra that have been filtered using different filtering parameters.
Normalization is another crucial step in preparing data, as it can significantly enhance training speed and increase model sensitivity to relevant data ranges. However, it must be done carefully in order to prevent information loss or the addition of spurious information. When dealing with arbitrary or uncontrolled magnitudes (such as those found in experimental XRD profiles, which can vary with measurement time and can be difficult to control), one can normalize each profile by dividing it by its own maximum value. This approach can prevent the model from learning spurious correlations arising from inconsistent magnitudes across the dataset. Conversely, if the magnitudes are controlled, normalization can be performed identically across all samples (i.e., division by the same value for all samples). This ensures that the relative differences in magnitude between samples are preserved. Figure 1(b) illustrates both normalization approaches: the two profiles in the top panel are normalized by the maximum value of the XRD1 profile, while the two profiles in the bottom panel are normalized individually by their respective maxima. For this Tutorial, both the VDoS spectra and the XRD profiles were normalized by dividing each spectrum or profile by the maximum intensity observed across all of the spectra or profiles in the dataset. Since these spectra and profiles are the result of well-controlled simulations, normalization by the maximum intensity preserves differences in relative intensities that may contain information relevant to the regression task.
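The two normalization strategies can be compared with a short NumPy sketch. The function names and the toy two-profile array are assumptions made for illustration; in the second profile, every intensity is half that of the first, so that the effect on relative magnitudes is easy to see.

```python
import numpy as np

def normalize_per_profile(profiles):
    """Divide each profile (row) by its own maximum value."""
    profiles = np.asarray(profiles, dtype=float)
    return profiles / profiles.max(axis=1, keepdims=True)

def normalize_global(profiles):
    """Divide every profile by the single maximum found in the whole dataset."""
    profiles = np.asarray(profiles, dtype=float)
    return profiles / profiles.max()

# Toy dataset: the second profile has half the intensity of the first.
profiles = np.array([[0.0, 2.0, 4.0, 1.0],
                     [0.0, 1.0, 2.0, 0.5]])

per_profile = normalize_per_profile(profiles)  # both rows now peak at 1.0
global_norm = normalize_global(profiles)       # rows peak at 1.0 and 0.5

print(per_profile.max(axis=1), global_norm.max(axis=1))
```

Per-profile normalization makes the two rows indistinguishable, which is desirable when the intensity difference is an artifact and undesirable when it carries physical information; global normalization keeps the factor of two.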
Noise filtering may also be necessary depending on the quality of the input data and the chosen machine-learning model. For the purposes of discussion, we consider two types of noise: zero-mean and non-zero-mean noise. Zero-mean noise refers to noise such as Gaussian or Poisson noise, where each point in the measurement is displaced from its true value by a random positive or negative amount drawn from a known distribution with a mean of zero. Non-zero-mean noise has a consistent positive or negative bias and is commonly found in experimental data.
When addressing zero-mean noise, if the amplitude of the noise is consistent across the dataset and significantly lower than the feature magnitudes, it can typically be ignored without negatively affecting model performance. However, if the amplitude of the noise varies significantly between samples or if it is comparable in magnitude to the features present in the data, removing it may be beneficial or even essential for achieving good model performance. When applying filtering or smoothing algorithms for zero-mean noise removal, it is crucial to ensure that these processes do not distort or eliminate important features in the data. For example, Fig. 1(c) demonstrates how increasingly strong filtering of a noisy VDoS spectrum can initially remove local noise but eventually distorts the spectrum near inflection points. In this Tutorial, no noise removal or filtering was applied to the VDoS or XRD data. The decision to use noise filtering should be based on a careful evaluation of the data characteristics and the specific requirements of the machine-learning task.
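The trade-off between noise removal and feature distortion can be illustrated with a simple moving-average filter. This is only a sketch: the moving average is chosen here for brevity and is not necessarily the filter used for Fig. 1(c), and the synthetic signal, noise level, and window sizes are all assumed values.

```python
import numpy as np

def moving_average(signal, window):
    """Smooth a 1D signal with a centered moving average (odd window length)."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 501)
clean = np.exp(-0.5 * ((x - 0.5) / 0.01) ** 2)      # one sharp feature, height 1
noisy = clean + rng.normal(0.0, 0.02, size=x.size)  # zero-mean Gaussian noise

results = {}
for window in (5, 21, 81):
    smoothed = moving_average(noisy, window)
    flat_noise = smoothed[50:150].std()  # featureless region, away from the edges
    peak_height = smoothed[250]          # value at the center of the feature
    results[window] = (flat_noise, peak_height)
    print(f"window={window:3d}  noise std={flat_noise:.4f}  peak height={peak_height:.3f}")
```

Wider windows lower the residual noise in the flat region but also flatten and broaden the sharp feature, so the filter strength should be chosen against the narrowest feature that matters for the task.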
Experimental measurements often exhibit non-zero-mean noise due to background signals inherent to sensors and instrumentation, which contribute additional noise to the collected data. If the magnitude of non-zero-mean noise remains consistent across all samples relative to the features of interest, its removal may be unnecessary, as machine-learning models can learn to disregard such background signals. However, when the noise magnitude varies between samples, removal becomes critical, especially if the noise intensity is comparable to key features or if it correlates with the target descriptors (e.g., a higher noise floor in samples with specific features). Non-zero-mean noise removal typically involves subtracting or adjusting the background signal, which may vary across the data range (e.g., higher background at lower 2θ values in XRD measurements) and between samples, necessitating careful manual or automated pre-processing.
Finally, we note that, for regression tasks such as the one illustrated in this Tutorial, particularly in multi-output scenarios, normalizing the target outputs (here, the microstructural descriptors) is often advantageous. This is especially important when the different outputs have significantly different value ranges. In our dataset, which includes seven microstructural descriptors as target outputs, each descriptor has its own distinct range of values and associated order of magnitude, as listed in Table I. To address this disparity in range, we normalized each descriptor by mapping its minimum value to 0 and its maximum value to 1, so that all values fall between 0 and 1. This normalization was part of the data pre-processing conducted before model training. By standardizing the ranges of the microstructural descriptors, we ensure that the machine-learning model treats each output variable with equal importance during the training process, regardless of its original scale.
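The min-max scaling of the regression targets can be sketched as follows. The helper names are assumptions introduced here, and the toy array contains only two of the seven descriptors (fcc phase fraction and σXX in GPa), with end points taken from Table I and the middle row invented for illustration. Keeping the minima and maxima allows predictions to be mapped back to physical units.

```python
import numpy as np

def fit_min_max(targets):
    """Return the per-descriptor (column-wise) minimum and maximum."""
    targets = np.asarray(targets, dtype=float)
    return targets.min(axis=0), targets.max(axis=0)

def scale_targets(targets, t_min, t_max):
    """Map each descriptor to [0, 1]; assumes t_max > t_min for every column."""
    return (np.asarray(targets, dtype=float) - t_min) / (t_max - t_min)

def unscale_targets(scaled, t_min, t_max):
    """Invert the scaling to recover values in physical units."""
    return np.asarray(scaled, dtype=float) * (t_max - t_min) + t_min

# Toy targets: [fcc phase fraction, sigma_XX (GPa)] for three samples.
targets = np.array([[0.354, -204.22],
                    [0.700,  -50.00],
                    [1.000,   14.48]])

t_min, t_max = fit_min_max(targets)
scaled = scale_targets(targets, t_min, t_max)
recovered = unscale_targets(scaled, t_min, t_max)
print(scaled)
```

A common practice, assumed here rather than stated in the text, is to compute the minima and maxima on the training set only and reuse them for the test set.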
TABLE I. Range of microstructural descriptor values in the dataset.
Property | Minimum to maximum value | Unit |
---|---|---|
fcc phase fraction | 0.354 to 1.0 | Unitless |
hcp phase fraction | 0.0 to 0.353 | Unitless |
Disordered phase fraction | 0.0 to 0.302 | Unitless |
σXX | −204.22 to 14.48 | GPa |
σYY | −205.43 to 14.48 | GPa |
σZZ | −207.32 to 14.48 | GPa |
Dislocation density | 0.0 to 4.14 × 10⁻³ | Å⁻² |
C. Considerations for experimental data
The dataset described in Sec. II A is derived from computational simulations, ensuring that the data are free from spurious artifacts and contain only noise that is well-characterized and accurately modeled. Consequently, minimal pre-processing was required to prepare the data for training the machine-learning models. In contrast, experimental data often involve complex noise and artifacts that are harder to characterize, necessitating additional care during preparation for model training or inference.
Let us consider two situations in which experimental data are used with a machine-learning model. In the first case, a machine-learning model is trained directly on experimental data and then used for inference, as done by Desai et al.35 with experimental XRD measurements; here, the consistency of noise and artifacts across the training and inference datasets is important. Methods for noise removal, normalization, and standardization, as detailed in Sec. II B, are generally applicable and appropriate, with careful attention paid to potential variations in noise floors and signal amplitudes. In the second case, training a machine-learning model with simulated data (or a combination of simulated and experimental data) for inference on experimental data introduces challenges due to potential inconsistencies in noise and artifacts between the two data types. A model trained solely on simulated data is unlikely to perform accurately on experimental data without addressing these differences. For instance, the XRD profiles in the dataset described in Sec. II A are generated through simulations and lack experimentally observed noise sources, such as instrument broadening or sensor background noise. Using a model trained on these profiles for inference on experimental data would likely result in inaccurate predictions.
To address discrepancies between simulated training data and experimental inference data, two approaches can be employed. The first involves pre-processing experimental data to remove noise and artifacts absent in simulated data. This method is feasible when the experimental data volume is low or pre-processing is straightforward, but caution must be exercised to avoid altering relevant features. The second approach entails modifying simulated data to emulate experimental noise and artifacts, which requires a thorough understanding of experimental noise characteristics. For instance, Natinsky et al.36 added artificial noise and corruptions to experimental atomic force microscopy (AFM) images for data augmentation. However, inadequate characterization of experimental noise and artifacts may lead to poor model performance on real-world datasets whose characteristics are not represented in the training set.
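The second approach, modifying simulated data to emulate experimental effects, might look like the sketch below. It is a hypothetical illustration and not a validated instrument model: the function name, the Gaussian form of the broadening, the exponentially decaying background, and every parameter value are assumptions that would need to be replaced by characteristics measured on the actual instrument.

```python
import numpy as np

def emulate_experimental_profile(profile, rng, broadening_sigma=2.0,
                                 background_level=0.02, noise_sigma=0.01):
    """Add broadening, a smooth background, and zero-mean noise to a profile."""
    profile = np.asarray(profile, dtype=float)
    n = profile.size
    # Instrument broadening: convolution with a normalized Gaussian kernel.
    half_width = int(4 * broadening_sigma)
    offsets = np.arange(-half_width, half_width + 1)
    kernel = np.exp(-0.5 * (offsets / broadening_sigma) ** 2)
    kernel /= kernel.sum()
    broadened = np.convolve(profile, kernel, mode="same")
    # Non-zero-mean background that is highest at the start of the range.
    background = background_level * np.exp(-np.arange(n) / (0.3 * n))
    # Zero-mean Gaussian noise.
    noise = rng.normal(0.0, noise_sigma, size=n)
    return broadened + background + noise

rng = np.random.default_rng(1)
points = np.arange(150)
simulated = np.exp(-0.5 * ((points - 75) / 1.5) ** 2)  # sharp simulated peak

augmented = emulate_experimental_profile(simulated, rng)
broadened_only = emulate_experimental_profile(
    simulated, rng, background_level=0.0, noise_sigma=0.0)
print(simulated.max(), broadened_only.max())
```

Drawing the parameters from ranges rather than fixing them, so that each training profile receives a different realization, is one way to make a model less sensitive to the exact noise level.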
III. REGRESSION WITH DIFFRACTION OR SPECTROSCOPY DATA
A. Dimensionality reduction
Dimensionality reduction transforms high-dimensional XRD (or VDoS) data into a more manageable, lower-dimensional representation. This process, mathematically expressed in Eq. (2), converts an XRD profile I, consisting of M data points across a specific range, into a vector Z with significantly fewer dimensions N, where N is much smaller than M. Traditionally, a subject matter expert performing a diffraction analysis intuitively performs a form of dimensionality reduction by focusing on specific human-identified features such as peak positions, widths, and intensities. In the context of machine learning, these features are instead captured by the vector Z, which is intended to retain the essential information of the original XRD profile and therefore to represent the microstructural states encoded in the diffraction data.
In this Tutorial, we employed convolutional autoencoders for dimensionality reduction of diffraction data. An autoencoder consists of two primary components: an encoder and a decoder. The encoder compresses the input data, reducing its dimension from M to a smaller latent representation of dimension N. The decoder reconstructs the original input from this latent encoding. A generalized diagram showing the structure of the autoencoder is provided in Fig. 2. The autoencoder is trained by minimizing a loss function that compares the original input to its reconstruction. While traditionally optimized for accurate reconstruction, autoencoders can be modified to encode specific information into the latent representation through additional loss terms.39 Both the encoder and decoder are independent neural networks, which may have symmetric or asymmetric structures and can incorporate various neural network components. In the convolutional autoencoder defined in Fig. 2, convolutional blocks raster a kernel across the input data (XRD profile) capturing stationary features (peak shapes and characteristics) in the input data. These convolutional blocks are followed by a linear layer that is capable of learning the spatial relationship of the stationary features within the input. This approach allows for efficient dimensionality reduction while preserving the essential features of the diffraction data, making it particularly useful for analyzing complex materials systems and their microstructural states.
FIG. 2. Top row: dimensionality reduction via a convolutional autoencoder. The autoencoder is composed of an encoder that compresses an input spectrum of dimension M into a latent space of dimension N, and a decoder that reconstructs the input from its encoding. Middle row: regression using a multi-layer perceptron (MLP). The MLP is composed of several linear layers, which can be followed by various activation functions or other layer types such as Dropout or normalization layers. Bottom row: legend for the symbols used in the neural network schematics.
When performing dimensionality reduction, selecting the appropriate size of the latent space N is crucial. Some dimensionality reduction methods, such as Principal Component Analysis (PCA),40–43 provide some guidance by quantifying the variance captured by each latent variable. Autoencoders, on the other hand, provide reconstruction error metrics, allowing users to adjust N until satisfactory reconstruction accuracy is achieved. It is generally advisable to conduct a sensitivity analysis to evaluate how varying N affects both the reconstruction error and performance in subsequent tasks utilizing the latent encoding. This approach ensures an optimal balance between data compression and information retention.
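The PCA-based guidance mentioned above can be sketched with NumPy alone. The function names and the synthetic dataset (200 profiles of 150 points built from three hidden factors plus weak noise) are assumptions made for illustration; with real profiles, `data` would be the array of pre-processed XRD profiles with one profile per row.

```python
import numpy as np

def cumulative_explained_variance(data):
    """Cumulative variance fraction captured by the leading principal components."""
    centered = data - data.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to the variance per component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    return np.cumsum(variance) / variance.sum()

def smallest_n_for_variance(data, threshold=0.98):
    """Smallest number of components whose cumulative variance reaches threshold."""
    cumulative = cumulative_explained_variance(data)
    return int(np.searchsorted(cumulative, threshold) + 1)

# Synthetic dataset: three hidden factors mixed into 150-point profiles.
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 150))
data = factors @ mixing + 0.01 * rng.normal(size=(200, 150))

cumulative = cumulative_explained_variance(data)
n_components = smallest_n_for_variance(data, threshold=0.98)
print(n_components, cumulative[:4])
```

The resulting value is a starting point for the latent size of a nonlinear model such as an autoencoder rather than a guarantee, which is why the sensitivity analysis on N remains advisable.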
B. Regression
The regression model aims to map the latent vector Z to a set of material state descriptors s of dimension J, extracted from the XRD profiles. Just as there are many dimensionality reduction techniques, there is a broad choice of regression techniques. For a comprehensive survey, the reader is referred to books and review articles that describe the various classes of regression methods, along with specific examples applicable to a wide variety of tasks.44–48 The choice of method depends on the specific requirements of the materials characterization task at hand.
C. Model architecture selection
The selection of a specific model architecture involves many decisions, including the general model architecture, the hyperparameters defining network depth and layer composition, and the training parameters. These choices significantly impact model performance, affecting prediction accuracy, training difficulty, and data requirements. They frequently involve trade-offs: for example, a deeper network may perform better at the desired task while also being more prone to overfitting and requiring more data to train. Often, the rationale behind these impactful decisions remains unclear. When selecting model architectures for a specific dataset and machine-learning task, it is best practice to try many different combinations of model architectures and hyperparameters.
When optimizing model architectures and hyperparameters, it is crucial to assess how much optimization is necessary for the task at hand. While sequential fine-tuning and optimization can yield incremental performance gains, they come at the expense of significant time, computational resources, and potentially reduced model generalizability. Establishing a reasonable accuracy threshold for the task and stopping optimization once that threshold is met is a practical way to limit the otherwise boundless scope of model selection and tuning.
IV. MODELS
This Tutorial explores four machine-learning models for predicting material properties from XRD data and, in one case, from a combination of XRD and VDoS data. The first three models differ in how the process of dimensionality reduction is handled and how the dimensionality reduction and regression portions of the model are trained. These models are (i) a reconstruction-focused model using a traditional autoencoder/MLP model; (ii) a regression-focused model using an encoder/MLP model; and (iii) a hybrid reconstruction/regression model based on concurrent autoencoder-MLP training. The reconstruction-focused model uses an autoencoder for dimensionality reduction, optimized for high-quality reconstruction of the XRD profile. The resulting latent representation is then used to train an MLP to regress the microstructural descriptors. In the regression-focused model, the decoder is removed, allowing simultaneous training of the encoder and the MLP. This produces a latent representation optimized specifically for the regression task. The third model, the hybrid reconstruction/regression model, trains the autoencoder and MLP concurrently using a combined loss function, creating a latent representation that balances the reconstruction and regression tasks. Finally, the fourth model combines information from two latent spaces, derived from XRD profiles and VDoS spectra, respectively, demonstrating a simple method for fusing multiple data modalities. Code examples are provided as part of the Tutorial, which demonstrate the construction and use of these models in Python using the PyTorch library.49 The included code snippets and model training results shown in Sec. V were created using PyTorch v. 2.0.1.
A. Reconstruction-focused model
This model implements a sequential approach to analyze XRD data. Initially, we train a convolutional autoencoder to perform dimensionality reduction [see Eq. (2)], targeting a latent dimension of N = 30. This specific dimension was chosen based on PCA results, which indicated that 30 components capture over 98% of the total variance in the XRD dataset. This approach ensures that we retain the most significant information while substantially reducing data complexity. The architecture of the convolutional autoencoder used for this dimensionality reduction task is detailed in Code Listing 1. The code shown in Code Listing 1 assumes that the XRD profiles that are input into the autoencoder have a length of M = 150. This input dimension results in a 2208-point tensor after three convolutional layers, as shown in lines 10 and 17 of Code Listing 1. This value is derived from the input dimension (M = 150), the channel expansion (from 1 input channel to 16 channels after the third convolutional layer), and the kernel size: each of the three kernel-size-5 convolutions shortens a channel by four points (150 → 146 → 142 → 138), giving 16 × 138 = 2208. If the autoencoder is to be used with XRD profiles of a different length or if changes are made to the hyperparameters of the convolutional layers, these lines will need to be modified with the correct dimension. Additionally, the value 138 on line 19 requires updating as it depends on the convolutional layers' output tensor size.
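The dimensions quoted above can be checked, and recomputed for other input lengths, with a small helper. This is a sketch under stated assumptions: the function name is introduced here, the convolutions are taken to be unpadded with a stride of 1, and the final channel count of 16 is inferred from 2208/138 rather than read from Code Listing 1. The formula is the standard output-length relation for a 1D convolution.

```python
def conv1d_output_length(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1D convolution for a given input length."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

input_length = 150    # downsampled XRD profile length
final_channels = 16   # assumed: inferred from 2208 / 138

length = input_length
for _ in range(3):    # three convolutional layers with kernel size 5
    length = conv1d_output_length(length, kernel_size=5)

flattened_size = final_channels * length
print(length, flattened_size)  # 138 2208
```

Calling the same helper with a different `input_length` or `kernel_size` gives the values that the linear layers and the unflatten operation would need after such a change.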
This autoencoder mimics the structure shown in Fig. 2. The encoding portion of the model is composed of three 1D convolutional layers, each with a kernel size of 5, with the number of channels increasing through each consecutive layer. The output of these layers is then flattened, and a single dense linear layer reduces it to the desired number of latent variables. Schematically, the encoder architecture can be described as follows: $I_{1\times150} \rightarrow \mathrm{Conv}^{5}_{C_1\times146} \rightarrow \mathrm{Conv}^{5}_{C_2\times142} \rightarrow \mathrm{Conv}^{5}_{16\times138} \rightarrow \mathrm{Flat}_{2208} \rightarrow \mathrm{Linear}_{30}$, where the nomenclature “Conv” describes a convolutional layer, “Linear” describes a linear layer, and “Flat” describes a flattening operation. The superscript of the “Conv” operations indicates the convolution kernel size, while the subscripts of all operations represent the dimensionality of the data after passing through each operation, with the first number indicating the number of channels and the second number indicating the size of each channel. Here, $C_1$ and $C_2$ stand for the channel counts of the first two convolutional layers, which are set in Code Listing 1. The decoding portion of the model mirrors the encoding portion, using the same number of layers, the same channel counts, and the same kernel sizes, in reverse order. When the forward operation of the model is called, it returns both the latent projection of the data passed to the model and the reconstruction of the input data. For this architecture, the number of convolutional layers, the kernel size in each of these convolutions, the number of channels, the number and size of linear layers after the convolutions, and the symmetry or asymmetry of the encoder and decoder are all adjustable parameters that can be modified to tune the model's performance. Again, the code shown in Code Listing 1 assumes that the XRD profiles that are input into the autoencoder have a length of M = 150; modifications to the dimension of the linear layers and the unflatten operation will be necessary if the input changes in dimension.
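For readers without access to Code Listing 1, an autoencoder matching the description above might be sketched as follows. This is not the listing itself: the intermediate channel counts (4 and 8), the ReLU activations, and the class name are assumptions, while the three kernel-size-5 convolutions, the 16 × 138 = 2208 flattened size, the 30-dimensional latent space, the mirrored decoder, and the two return values follow the text.

```python
import torch
from torch import nn

class ConvAutoencoder(nn.Module):
    """Convolutional autoencoder sketch for 1D profiles of fixed length."""

    def __init__(self, input_length=150, latent_dim=30,
                 channels=(4, 8, 16), kernel_size=5):
        super().__init__()
        c1, c2, c3 = channels  # 4 and 8 are assumed values
        # Each unpadded convolution shortens a channel by (kernel_size - 1).
        conv_length = input_length - 3 * (kernel_size - 1)  # 138 for 150 points
        flat_size = c3 * conv_length                         # 2208
        self.encoder = nn.Sequential(
            nn.Conv1d(1, c1, kernel_size), nn.ReLU(),
            nn.Conv1d(c1, c2, kernel_size), nn.ReLU(),
            nn.Conv1d(c2, c3, kernel_size), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(flat_size, latent_dim),
        )
        # The decoder mirrors the encoder in reverse order.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, flat_size), nn.ReLU(),
            nn.Unflatten(1, (c3, conv_length)),
            nn.ConvTranspose1d(c3, c2, kernel_size), nn.ReLU(),
            nn.ConvTranspose1d(c2, c1, kernel_size), nn.ReLU(),
            nn.ConvTranspose1d(c1, 1, kernel_size),
        )

    def forward(self, x):
        latent = self.encoder(x)
        reconstruction = self.decoder(latent)
        return latent, reconstruction

model = ConvAutoencoder()
batch = torch.rand(8, 1, 150)  # (batch, channels, points)
latent, reconstruction = model(batch)
print(latent.shape, reconstruction.shape)
```

Computing `conv_length` and `flat_size` from the constructor arguments, rather than hard-coding 138 and 2208, is a design choice made in this sketch so that a different input length does not require editing the layer definitions.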
For the purposes of this Tutorial, the data that are used for training and testing the models are stored in a PyTorch Dataset object, defined in Code Listing 2. In this object, the XRD profiles and the regression targets (i.e., the vector of microstructural descriptors, s) are stored together. This is done to make it easier to create shuffled testing and training datasets that are shared between the autoencoder and the MLP. The code shown in Code Listing 2 creates a 70%–30% training and test split and stores these datasets in PyTorch DataLoader objects, which handle shuffling and batching the data. Different test–train splits can be created by modifying the random seed number in the definition of the gen1 random number generator object. Using the code snippet requires storing your own dataset in the PyTorch tensors xrd_tensor and target_tensor. This and subsequent code snippets do not store or print the training and test losses during the training process; modifications will be required if that is desired.
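A dataset and data-loader setup of the kind described above might look like the following sketch. It is not Code Listing 2: the class name `XRDDataset`, the seed value, the batch size, and the random placeholder tensors are assumptions, while the names `xrd_tensor`, `target_tensor`, and `gen1` and the 70%–30% split follow the text.

```python
import torch
from torch.utils.data import Dataset, DataLoader, random_split

class XRDDataset(Dataset):
    """Stores XRD profiles together with their regression targets."""

    def __init__(self, xrd_tensor, target_tensor):
        assert len(xrd_tensor) == len(target_tensor)
        self.xrd = xrd_tensor
        self.targets = target_tensor

    def __len__(self):
        return len(self.xrd)

    def __getitem__(self, idx):
        return self.xrd[idx], self.targets[idx]

# Placeholder data: replace with your own pre-processed tensors.
xrd_tensor = torch.rand(100, 1, 150)  # 100 profiles, 1 channel, 150 points
target_tensor = torch.rand(100, 7)    # 7 normalized microstructural descriptors

dataset = XRDDataset(xrd_tensor, target_tensor)

# A seeded generator makes the split reproducible; change the seed for a new split.
gen1 = torch.Generator().manual_seed(42)
n_train = round(0.7 * len(dataset))
train_set, test_set = random_split(
    dataset, [n_train, len(dataset) - n_train], generator=gen1)

train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
test_loader = DataLoader(test_set, batch_size=16, shuffle=False)

profiles, targets = next(iter(train_loader))
print(len(train_set), len(test_set), profiles.shape, targets.shape)
```

Because the profiles and targets are returned together, the same split can feed both the autoencoder (which uses only the profiles) and the MLP (which also needs the targets).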
The process of training the autoencoder involves several key steps, provided in Code Listing 3. First, the desired loss function, which in this case is the Mean Squared Error (MSE) function, is initialized. Next, the autoencoder is instantiated and its learnable parameters are passed to the optimizer, specifically AdamW50 for the purposes of this Tutorial. AdamW is a modified version of the classical Adam optimizer51 that incorporates a built-in method for handling weight decay during training. The effectiveness and speed of the model's training are significantly influenced by the choice of optimizer and its hyperparameters. As indicated in Code Listing 3, the training of the autoencoder is performed with a learning rate (lr) of 0.001 and a weight decay of 0.001 for a total of 500 training epochs. These hyperparameter values were selected after performing a manual hyperparameter search with the objective of minimizing the autoencoder reconstruction error.
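The training steps listed above can be sketched in a few lines. This is not Code Listing 3: to stay short and self-contained, the sketch trains a small fully connected stand-in (`TinyAutoencoder`, a hypothetical class) on random placeholder data with full-batch updates and only 20 epochs, whereas the Tutorial trains the convolutional autoencoder for 500 epochs. The MSE loss, the AdamW optimizer, the learning rate of 0.001, and the weight decay of 0.001 follow the text.

```python
import torch
from torch import nn

torch.manual_seed(0)
profiles = torch.rand(64, 150)  # placeholder for normalized XRD profiles

class TinyAutoencoder(nn.Module):
    """Fully connected stand-in that returns (latent, reconstruction)."""

    def __init__(self, input_length=150, latent_dim=30):
        super().__init__()
        self.encoder = nn.Linear(input_length, latent_dim)
        self.decoder = nn.Linear(latent_dim, input_length)

    def forward(self, x):
        latent = self.encoder(x)
        return latent, self.decoder(latent)

model = TinyAutoencoder()
loss_fn = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.001)

n_epochs = 20  # shortened for illustration; the Tutorial uses 500
losses = []
for epoch in range(n_epochs):
    model.train()
    optimizer.zero_grad()                     # clear gradients from the last step
    _, reconstruction = model(profiles)       # forward pass
    loss = loss_fn(reconstruction, profiles)  # reconstruction error
    loss.backward()                           # backpropagate
    optimizer.step()                          # update the weights
    losses.append(loss.item())                # keep the loss for later inspection

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

In practice, the inner step would iterate over mini-batches from a DataLoader, and the test loss would be evaluated under `torch.no_grad()` after each epoch to monitor overfitting.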
Sequential training of the autoencoder and MLP optimizes the latent representation for input reconstruction. This approach can enhance performance and generalizability of the learned representation under specific conditions, such as when multiple models utilize the same latent space (for example, training a regression model and a classification model with the same latent representation). It is particularly advantageous when the intended machine-learning task is challenging or only possible with low accuracy, as concurrent training would be dominated by the more difficult task’s loss.
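The second stage of sequential training can be sketched as below, under the assumption that the encoder weights are held fixed while the MLP is trained on the latent vectors. The linear `encoder` is only a stand-in for an already-trained encoder, and the layer sizes, data, learning rate, and epoch count are hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for the encoder half of an autoencoder trained in a previous stage.
encoder = nn.Linear(64, 4)
mlp = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 7))

# Freeze the encoder so the latent representation stays tuned for reconstruction.
for p in encoder.parameters():
    p.requires_grad_(False)
encoder.eval()

xrd = torch.rand(32, 64)      # placeholder profiles
targets = torch.rand(32, 7)   # placeholder normalized descriptors

# Only the MLP parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(mlp.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
encoder_before = encoder.weight.clone()

for epoch in range(20):
    with torch.no_grad():
        z = encoder(xrd)          # fixed latent projection
    loss = loss_fn(mlp(z), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

encoder_unchanged = torch.equal(encoder.weight, encoder_before)
print(encoder_unchanged)
```

Because the latent vectors do not change in this stage, a second model (for example, a classifier) could be trained on the same `z` without affecting the regression model.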
As discussed in Sec. III, the performance of the autoencoder and MLP described in this section is closely tied to the selection of the model architectures as well as the training hyperparameters. The specific architectures and hyperparameters described above were selected after performing manual architecture and hyperparameter optimization tests for the specific dataset described in Sec. II A. Other model architectures and training hyperparameter sets may perform better with this dataset, and different architectures and hyperparameters may be required to obtain good performance with other datasets. When developing machine-learning workflows, it is always recommended that architecture and hyperparameter optimization be performed, either with manual tuning or using optimization tools such as Optuna,52 Neptune,53 and Weights and Biases.54
B. Regression-focused model
The previous architecture focused on creating a latent representation that best reconstructed the original data. While this approach should theoretically provide a good basis for property regression, it is not guaranteed. The autoencoder can prioritize features that minimize reconstruction loss while overlooking subtle, regression-relevant information. In this section, we introduce a machine-learning model specifically designed for the regression task by removing the decoder and training the encoder and MLP concurrently, without considering reconstruction of the XRD profile.
The process of training the regression-focused model is similar to the process of training the reconstruction-focused model, following the same stages of initializing the loss function, the model, and the optimizer, then entering a two-stage loop that iterates through training and testing epochs, each broken into sets of batches. The procedure defined in Code Listing 7 differs from the previous approaches defined in Code Listings 3 and 5 in how the model is handled and how the loss is computed. Since the present model does not have a decoder, no reconstructions of the input XRD profiles are produced, and no optimization occurs based on the reconstruction error. Instead, the model directly predicts microstructural descriptors s from an input XRD profile I, which are compared against the known microstructural descriptors using Eq. (6) to compute the loss for updating the model parameters. As indicated in Code Listing 7, training of the encoder and MLP was performed with a learning rate of 0.0001 and a weight decay of 0.0001 for a total of 10 000 epochs.
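The concurrent training of an encoder and MLP described above can be sketched as follows. The learning rate and weight decay of 0.0001 follow the text; the small `EncoderRegressor` class, its layer sizes, the random data, the reduced epoch count, and the use of `nn.MSELoss` as a stand-in for Eq. (6) are assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)


class EncoderRegressor(nn.Module):
    """Encoder followed directly by an MLP; there is no decoder."""

    def __init__(self, input_len=128, latent_dim=8, n_targets=7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 4, kernel_size=5, padding=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(4 * input_len, latent_dim),
        )
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, n_targets)
        )

    def forward(self, x):
        # Predict the descriptors s directly from the profile I.
        return self.mlp(self.encoder(x))


xrd = torch.rand(32, 1, 128)   # placeholder XRD profiles
targets = torch.rand(32, 7)    # placeholder normalized descriptors

model = EncoderRegressor()
loss_fn = nn.MSELoss()  # assumed form of the regression loss
# One optimizer updates encoder and MLP together.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

losses = []
for epoch in range(50):  # the Tutorial trains for 10 000 epochs
    optimizer.zero_grad()
    pred = model(xrd)
    loss = loss_fn(pred, targets)  # regression error only
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])
```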
Concurrent training of dimensionality reduction and regression models can offer advantages over sequential training in certain scenarios. For instance, autoencoders often require high-dimensional latent representations for accurate reconstruction. By removing the decoder and training the encoder and regression models simultaneously, lower-dimensional latent representations can be used, as input reconstruction is no longer a factor. This reduction can eliminate extraneous information, potentially improving regression accuracy. Additionally, it is advantageous when using an MLP for regression under memory constraints, as the MLP’s parameter count and memory requirements scale with the input size.
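To make the memory argument concrete, the short sketch below counts the trainable parameters of an MLP for two latent sizes. The hidden width of 64 and the two-layer layout are hypothetical choices; only the first layer depends on the latent dimension.

```python
from torch import nn


def mlp_param_count(latent_dim, hidden=64, n_targets=7):
    """Count trainable parameters of a two-layer MLP (assumed layout)."""
    mlp = nn.Sequential(
        nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_targets)
    )
    return sum(p.numel() for p in mlp.parameters())


small = mlp_param_count(8)    # 8*64 + 64 + 64*7 + 7
large = mlp_param_count(64)   # 64*64 + 64 + 64*7 + 7
print(small, large)
```

The difference between the two counts comes entirely from the first linear layer, whose weight matrix grows linearly with the size of the latent vector.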
C. Hybrid reconstruction/regression model
Sections IV A and IV B described models with different priorities: one focused on reconstructing the input XRD data, while the other emphasized creating a latent representation optimized for the desired regression task. In this section, we introduce a hybrid approach that considers both objectives. This hybrid model consists of an autoencoder and an MLP that are trained concurrently, ensuring that the latent representation of the input XRD data is optimized both for reconstructing the input data and for providing useful features for the regression task.
Hybrid Reconstruction/Regression Model Architecture.
Code Listing 8 shows the model architecture of the hybrid model. This model combines the autoencoder defined in Code Listing 1 with the MLP defined in Code Listing 4 in a single PyTorch model class. When this model is called, it outputs predictions of the regression targets provided by the MLP as well as the learned latent representation and the reconstruction of the input data that is created by the autoencoder. The architecture defined in Code Listing 8 expects the input XRD profiles to have a length of . If a different input length is utilized or if changes are made to the convolutional layers in the autoencoder, the numbers 2208 on lines 9 and 16 as well as the number 138 on line 18 will need to be modified.
This hybrid model is intended to combine the benefits of the two previous models in the creation of the latent representation. Optimization of the reconstruction loss drives the learned latent representation to encode as much of the input data as possible, which should in theory maximize the generalizability of the latent representation for different machine-learning tasks. Optimization of the regression loss drives the learned latent representation to specifically consider the information necessary to maximize the regression accuracy. The aim of these two driving forces is to create a latent representation that is tuned specifically for the regression task while still being generalizable for other tasks.
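The two driving forces described above can be sketched as a single model with a combined loss. The text does not specify here how the reconstruction and regression losses are combined, so the weighted sum and the `recon_weight` parameter below are assumptions, as are the small fully connected layers used in place of the convolutional autoencoder.

```python
import torch
from torch import nn

torch.manual_seed(0)


class HybridModel(nn.Module):
    """Autoencoder and MLP sharing one latent representation (assumed sizes)."""

    def __init__(self, n_in=128, latent_dim=8, n_targets=7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_in, 32), nn.ReLU(), nn.Linear(32, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, n_in)
        )
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, n_targets)
        )

    def forward(self, x):
        z = self.encoder(x)
        # Predictions, latent representation, and reconstruction.
        return self.mlp(z), z, self.decoder(z)


def hybrid_loss(pred, target, x_hat, x, recon_weight=0.5):
    """Weighted sum of reconstruction and regression MSE (assumed form)."""
    mse = nn.functional.mse_loss
    return recon_weight * mse(x_hat, x) + (1.0 - recon_weight) * mse(pred, target)


x = torch.rand(16, 128)   # placeholder profiles
y = torch.rand(16, 7)     # placeholder normalized descriptors

model = HybridModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

pred, z, x_hat = model(x)
loss = hybrid_loss(pred, y, x_hat, x)
optimizer.zero_grad()
loss.backward()   # gradients from both terms reach the shared encoder
optimizer.step()

recon_only = hybrid_loss(pred, y, x_hat, x, recon_weight=1.0)
print(loss.item(), recon_only.item())
```

Setting `recon_weight` to 1 or 0 recovers a purely reconstruction-driven or purely regression-driven update in this sketch, which is one simple way to tune training toward either task.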
D. Multimodal model
This multimodal workflow takes the general architecture defined in Sec. IV C and modifies it to consider two input modalities, those being XRD profiles and VDoS spectra. Both of these modalities share the same general form of 1D line profiles, but they have distinct characteristics both in terms of the number of features that exist within the 1D line profiles as well as the form of those features. Due to these differences, different autoencoder architectures and latent dimension sizes are needed to produce good reconstructions of the two modalities. For the purposes of this Tutorial, manual architecture and hyperparameter optimization was performed to determine how the two autoencoders should be defined. The specific differences in the architectures of the two autoencoders used to encode these two modalities are defined in supplementary material I.
Schematic of the structure for the multimodal model. Two autoencoders create separate latent representations (XRD: Z , VDoS: Z ) of the two input modalities (I , I ) and an MLP operates on the concatenated vector of these latent representations (Z ) to create predictions (s).
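The multimodal structure can be sketched in a few lines: each modality has its own encoder and decoder, and the MLP operates on the concatenation of the two latent vectors. The linear layers below are stand-ins for the two convolutional autoencoders, and all sizes and names (`MultimodalModel`, `z_xrd`, `z_vdos`) are hypothetical.

```python
import torch
from torch import nn


class MultimodalModel(nn.Module):
    """Two autoencoders feeding one MLP through concatenated latents."""

    def __init__(self, xrd_len=128, vdos_len=64, z_xrd=8, z_vdos=4, n_targets=7):
        super().__init__()
        # Separate encoders/decoders allow a different latent size per modality.
        self.xrd_enc = nn.Linear(xrd_len, z_xrd)
        self.xrd_dec = nn.Linear(z_xrd, xrd_len)
        self.vdos_enc = nn.Linear(vdos_len, z_vdos)
        self.vdos_dec = nn.Linear(z_vdos, vdos_len)
        self.mlp = nn.Sequential(
            nn.Linear(z_xrd + z_vdos, 32), nn.ReLU(), nn.Linear(32, n_targets)
        )

    def forward(self, xrd, vdos):
        zx = self.xrd_enc(xrd)
        zv = self.vdos_enc(vdos)
        z = torch.cat([zx, zv], dim=1)  # concatenated latent vector
        return self.mlp(z), z, self.xrd_dec(zx), self.vdos_dec(zv)


model = MultimodalModel()
xrd = torch.rand(16, 128)   # placeholder XRD profiles
vdos = torch.rand(16, 64)   # placeholder VDoS spectra
pred, z, xrd_hat, vdos_hat = model(xrd, vdos)
print(pred.shape, z.shape, xrd_hat.shape, vdos_hat.shape)
```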
V. DISCUSSION
As shown in Table I, each microstructural descriptor that is included as a regression target varies across ranges of varying magnitudes, and, therefore, the range for each descriptor was normalized from 0 to 1 for model training. In Secs. V A–V C, when comparing the performance of the previously defined machine-learning models, model errors will be presented using un-normalized values. To examine the impact of the test–train split on model performance, each of the four workflows presented in this Tutorial was trained with 10 different test–train splits. Model accuracies reported in Secs. V A–V C will be the averages and optimal performances observed across these 10 different trained models. Model architectures and all hyperparameters besides the seed used to create the test–train split were kept constant during these runs.
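A short sketch of this normalization, and of how an error computed on normalized targets maps back to descriptor units, is given below. The target values are made-up placeholders, and the helper names are hypothetical. For min–max scaling, the RMSE in original units is the normalized RMSE multiplied by the descriptor range.

```python
import numpy as np


def minmax_fit(targets):
    """Per-descriptor minimum and maximum over the dataset."""
    return targets.min(axis=0), targets.max(axis=0)


def normalize(targets, lo, hi):
    return (targets - lo) / (hi - lo)


def denormalize(scaled, lo, hi):
    return scaled * (hi - lo) + lo


# Placeholder targets: two descriptors with very different ranges.
targets = np.array([[0.0, -10.0], [0.5, 0.0], [1.0, 30.0]])
lo, hi = minmax_fit(targets)
scaled = normalize(targets, lo, hi)

# Pretend model output in normalized space with a constant error of 0.05.
pred_scaled = scaled + 0.05

rmse_scaled = np.sqrt(np.mean((pred_scaled - scaled) ** 2, axis=0))
pred_physical = denormalize(pred_scaled, lo, hi)
rmse_physical = np.sqrt(np.mean((pred_physical - targets) ** 2, axis=0))

print(rmse_scaled, rmse_physical)
```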
A. Comparison of single modality workflow performance
Tables II, III, and IV present performance metrics for the reconstruction-focused, regression-focused, and hybrid single-modality models in predicting microstructural descriptors from XRD profiles. For each descriptor, these tables provide three key values: the average and standard deviation of the root mean squared error (RMSE) calculated from 10 models trained on different test–train splits, the RMSE for each descriptor from the individual model that had the lowest average test loss (MSE) during training, and the lowest test error achieved for that specific descriptor across all 10 of the trained models. All values are reported in the units of their respective descriptors, as specified in Table I. Supplementary material II includes additional tables with the MSE for normalized regression targets and R² scores.
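The three quantities reported in each table can be computed from a matrix of per-split, per-descriptor RMSE values as sketched below. The random numbers are placeholders rather than results from this Tutorial, and the use of the population standard deviation (NumPy's default, `ddof=0`) is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_splits, n_descriptors = 10, 7

# Placeholder results: rmse[i, j] is the test RMSE of split i for descriptor j,
# and test_loss[i] is the overall test MSE of the model trained on split i.
rmse = rng.uniform(0.5, 1.5, size=(n_splits, n_descriptors))
test_loss = rng.uniform(0.01, 0.02, size=n_splits)

# "Average RMSE": mean and standard deviation across the 10 splits.
avg_rmse = rmse.mean(axis=0)
std_rmse = rmse.std(axis=0)

# "Best model": per-descriptor RMSE of the single model with the lowest test loss.
best_model_rmse = rmse[test_loss.argmin()]

# "Best RMSE": lowest RMSE for each descriptor over all 10 models.
best_rmse = rmse.min(axis=0)

for j in range(n_descriptors):
    print(f"{avg_rmse[j]:.3f} ± {std_rmse[j]:.3f} | "
          f"{best_model_rmse[j]:.3f} | {best_rmse[j]:.3f}")
```

By construction, the "Best RMSE" value can never exceed the "Best model" value for the same descriptor, which is a quick consistency check when reading the tables.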
TABLE II. Reconstruction-focused workflow RMSE.
Target descriptor | Average RMSE | Best model | Best RMSE |
---|---|---|---|
fcc phase fraction | 0.014 90 ± 0.005 74 | 0.009 50 | 0.009 43 |
hcp phase fraction | 0.007 14 ± 0.000 527 | 0.006 26 | 0.006 26 |
Disordered phase fraction | 0.006 39 ± 0.000 994 | 0.005 86 | 0.005 24 |
σXX | 2.272 ± 0.315 | 1.645 | 1.645 |
σYY | 2.326 ± 0.293 | 1.673 | 1.673 |
σZZ | 2.376 ± 0.334 | 1.733 | 1.733 |
Total dislocation density | 1.557 × 10⁻⁴ ± 1.917 × 10⁻⁵ | 1.382 × 10⁻⁴ | 1.365 × 10⁻⁴ |
TABLE III. Regression-focused workflow RMSE.
Target descriptor | Average RMSE | Best model | Best RMSE |
---|---|---|---|
fcc phase fraction | 0.013 39 ± 0.003 24 | 0.008 80 | 0.008 80 |
hcp phase fraction | 0.006 69 ± 0.001 27 | 0.005 90 | 0.005 74 |
Disordered phase fraction | 0.007 58 ± 0.001 01 | 0.005 96 | 0.005 96 |
σXX | 2.798 ± 0.735 | 2.583 | 1.805 |
σYY | 2.750 ± 0.727 | 2.470 | 1.846 |
σZZ | 2.792 ± 0.708 | 2.658 | 1.902 |
Total dislocation density | 1.605 × 10⁻⁴ ± 2.154 × 10⁻⁵ | 1.427 × 10⁻⁴ | 1.415 × 10⁻⁴ |
TABLE IV. Hybrid workflow RMSE.
Target descriptor | Average RMSE | Best model | Best RMSE |
---|---|---|---|
fcc phase fraction | 0.014 04 ± 0.005 56 | 0.008 45 | 0.007 79 |
hcp phase fraction | 0.005 89 ± 0.000 30 | 0.006 19 | 0.005 43 |
Disordered phase fraction | 0.006 59 ± 0.001 02 | 0.005 78 | 0.005 32 |
σXX | 2.393 ± 0.471 | 2.036 | 1.839 |
σYY | 2.344 ± 0.471 | 1.950 | 1.810 |
σZZ | 2.401 ± 0.443 | 1.968 | 1.874 |
Total dislocation density | 1.476 × 10⁻⁴ ± 1.593 × 10⁻⁵ | 1.394 × 10⁻⁴ | 1.314 × 10⁻⁴ |
A comparison of the performance metrics in Tables II, III, and IV reveals that no single model consistently outperforms the others in predicting all microstructural descriptors. Each model shows strengths in different areas. The reconstruction-focused model performs particularly well in predicting the three stress components. It achieves the lowest average error across the 10 test–train splits and produces the most accurate results for stress predictions. The regression-focused model improves predictions for the fcc and hcp phase fractions compared to the reconstruction-focused model but performs similarly or slightly worse for the other descriptors. The hybrid model performs best in predicting various phase fractions and total dislocation density but shows lower accuracy than the reconstruction-focused model for stress values. Ultimately, the selection of the most suitable model should be based on the particular needs of the application and the relative importance of accurately predicting each microstructural descriptor.
Figure 4 shows parity plots that compare the regression accuracy of the best-performing model from each single-modality workflow (reconstruction-focused, regression-focused, and hybrid). These parity plots reveal the similarity across the three models, not only in overall model accuracy, but also in specific areas where model predictions suffer. For example, the disordered phase fraction plots for all models show a consistent pattern: the data appear clustered into three distinct groups. The majority of prediction errors stem from misclassifying values into the wrong clusters. Also, while the cluster with the highest value is correctly centered, the cluster’s shape indicates that the models struggle to accurately distribute values within this parameter space. Similar patterns of errors are evident in other descriptors. The total dislocation density plots show errors concentrated around the zero-value point, while the fcc phase fraction plots display inaccuracies in the region close to a value of one. These consistent error patterns across different models suggest that the challenges in accurate prediction may be inherent to the data structure or the specific relationships between XRD profiles and certain microstructural descriptors, rather than limitations of any particular modeling approach.
Note that the models presented here are not fully optimized. The model architectures and training methods used in this Tutorial were standardized for educational purposes, but they have potential for improvement. Modifications to the models’ architectures or to the way in which they are trained could significantly improve their performance and potentially alter which model yields the highest accuracy for various regression targets. For example, the regression-focused model exhibited some training instability, occasionally experiencing sudden increases in test loss. These fluctuations negatively impacted the model’s final accuracy and could have been mitigated by stopping the training process when the test loss was lowest. This highlights the importance of fine-tuning in real-world applications. In practical scenarios, it would be essential to customize the model architecture and training procedure to suit the specific regression task at hand.
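One simple way to guard against the late increases in test loss mentioned above is to keep a copy of the model weights from the epoch with the lowest test loss and restore it after training. The sketch below illustrates this idea; it is not part of the Tutorial's listings, and the model, data, learning rate, and epoch count are hypothetical. In practice, a separate validation set is commonly used for this selection so that the test set remains untouched.

```python
import copy

import torch
from torch import nn

torch.manual_seed(0)

# Placeholder regression problem.
x = torch.rand(64, 16)
y = x.sum(dim=1, keepdim=True)
x_train, y_train, x_test, y_test = x[:48], y[:48], x[48:], y[48:]

model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

best_loss, best_state, test_history = float("inf"), None, []
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        test_loss = loss_fn(model(x_test), y_test).item()
    test_history.append(test_loss)

    # Keep a snapshot of the weights whenever the test loss improves.
    if test_loss < best_loss:
        best_loss = test_loss
        best_state = copy.deepcopy(model.state_dict())

# Restore the best snapshot instead of the final-epoch weights.
model.load_state_dict(best_state)
with torch.no_grad():
    restored_loss = loss_fn(model(x_test), y_test).item()

print(best_loss, restored_loss)
```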
FIG. 4. Parity plots from the test portion of the dataset for the seven microstructural descriptors predicted by the three single-modality models. Values shown have been normalized to be between 0 and 1 based on the total range of descriptor values from the dataset. R² and MSE for each parity plot are provided.
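For reference, the two statistics attached to each parity plot can be computed as in the sketch below, assuming that the reported R score is the coefficient of determination. The function name and the five example points are hypothetical.

```python
import numpy as np


def parity_metrics(y_true, y_pred):
    """Return (R^2, MSE) for one descriptor."""
    residual = y_true - y_pred
    mse = np.mean(residual ** 2)
    ss_res = np.sum(residual ** 2)                    # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    return 1.0 - ss_res / ss_tot, mse


# Made-up normalized values: perfect except at the two ends of the range.
y_true = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y_pred = np.array([0.1, 0.25, 0.5, 0.75, 0.9])

r2, mse = parity_metrics(y_true, y_pred)
print(round(r2, 3), round(mse, 4))  # 0.968 0.004
```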
B. Comparison between single-modality and multimodal models
Table V presents performance metrics for the multimodal model, which uses both XRD and VDoS profiles as inputs. A comparison of these metrics with those shown in Tables II, III, and IV reveals that the multimodal model offers some improvements in regression accuracy over the single-modality methods. However, these improvements are generally marginal, amounting to a small reduction in RMSE for most descriptors. This trend suggests that the multimodal workflow exhibits greater stability across different training runs, demonstrating reduced variance in model performance. This increased consistency indicates that the multimodal approach may be more robust and reliable, even if the absolute improvements in accuracy are not dramatic.
TABLE V. Multimodal hybrid workflow RMSE.
Target descriptor | Average RMSE | Best model | Best RMSE |
---|---|---|---|
fcc phase fraction | 0.012 30 ± 0.005 93 | 0.006 97 | 0.006 97 |
hcp phase fraction | 0.006 27 ± 0.000 29 | 0.006 05 | 0.005 91 |
Disordered phase fraction | 0.005 65 ± 0.001 09 | 0.004 57 | 0.004 48 |
σXX | 1.982 ± 0.436 | 1.906 | 1.454 |
σYY | 1.956 ± 0.441 | 1.882 | 1.406 |
σZZ | 2.064 ± 0.454 | 2.017 | 1.420 |
Total dislocation density | 1.344 × 10⁻⁴ ± 2.198 × 10⁻⁵ | 1.092 × 10⁻⁴ | 1.092 × 10⁻⁴ |
C. Selecting an appropriate model
The analysis of the proposed models reveals that they achieve comparable levels of accuracy overall but with distinct strengths and weaknesses. Some models perform well at predicting certain descriptors while underperforming on others. Other models demonstrate more consistent accuracy across all descriptors. It is important to note that these models have potential for further optimization through adjustments to model architectures and training hyperparameters. The choice of model for a specific application should be guided by the particular requirements of that application. For instance, if sample generation is needed, variational autoencoders could replace the deterministic convolutional autoencoders used in this Tutorial. For applications requiring multiple output types (e.g., both regression and classification), these models can be adapted to allow multiple models to operate on the learned latent representation of the input data. Ultimately, the selection of the most appropriate model depends on a careful consideration of the specific needs and constraints of the task at hand.
VI. CONCLUSIONS
This Tutorial is primarily intended for researchers and practitioners interested in leveraging machine-learning techniques for advanced materials characterization. It provides a comprehensive and practical guide for applying supervised machine-learning techniques to analyze diffraction and spectroscopy data for extracting microstructure information. The key elements of such approaches include data preparation, composed of consistent data sampling, careful normalization, and noise filtering; and a machine-learning model, composed of a dimensionality reduction element to capture the salient features encoded in the XRD data and a regression model to efficiently map this low-dimensional representation of the diffraction data to an extensive set of microstructural descriptors. In this Tutorial, we covered four distinct machine-learning workflows:
A reconstruction-focused model, which relies on sequentially training a dimensionality reduction model of the XRD data, followed by training of the regression model. The learned reduced representation is optimized for the reconstruction of the input XRD profile.
A regression-focused model, in which the decoder is removed, creating a latent representation optimized specifically for regression of the microstructural descriptors.
A hybrid model that focuses on joint optimization of both reconstruction and regression, allowing for tunable training toward either task.
A multimodal model that includes multiple input data modalities, enabling the regression model to use latent representations from various sources.
All four workflows successfully extract microstructural state descriptors from the given data. Comparisons show comparable regression accuracy across workflows, with some performing better for specific descriptors or showing more consistency across different test–train splits. The choice of workflow depends on the user’s specific needs and the nature of the data. Testing multiple workflows for a given task is recommended to identify the best-performing approach. These workflows are flexible and can be modified with different machine-learning architectures for dimensionality reduction or regression tasks.
While the machine-learning models illustrated in this Tutorial demonstrated their ability to reveal hidden materials information not easily identifiable by conventional human analysis, several avenues for future improvements remain. For instance, the development of more interpretable models55–57 would help practitioners and researchers better understand the relationship between materials characterization data (whether diffraction, spectroscopy, or other) and the underlying microstructure and associated physical and chemical processes at play. Another area for improvement is related to transfer learning58,59 and how models trained on simulated data can be adapted to actual experimental data, potentially reducing the need for large experimental datasets.60,61 Finally, the methods illustrated in this Tutorial could be improved to better estimate the uncertainty associated with the model predictions, leading to more reliable materials characterization.62,63
SUPPLEMENTARY MATERIAL
The supplementary material provides details on network architectures used in this Tutorial and convergence studies.
ACKNOWLEDGMENTS
The authors would like to thank Andreas Robertson and Lane Schultz from Sandia National Laboratories for insightful comments and suggestions during the preparation of this manuscript. The authors would also like to thank the anonymous reviewers for their constructive feedback resulting in the final version of this Tutorial. Machine-learning capabilities and computational resources used to create this Tutorial are supported by the Center for Integrated Nanotechnologies, an Office of Science user facility operated for the U.S. Department of Energy. This article has been authored by an employee of National Technology & Engineering Solutions of Sandia, LLC under Contract No. DE-NA0003525 with the U.S. Department of Energy (DOE). The employee owns all right, title, and interest in and to the article and is solely responsible for its contents. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this article or allow others to do so, for United States Government purposes. The DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan https://www.energy.gov/downloads/doe-public-access-plan.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
D. Vizoso: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). R. Dingreville: Conceptualization (equal); Funding acquisition (lead); Investigation (equal); Project administration (lead); Resources (lead); Supervision (lead); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal).
DATA AVAILABILITY
The data that support the findings of this study are openly available on the Materials Data Facility website, Ref. 32.