This tutorial systematically introduces the foundational concepts undergirding the recently formulated AI (artificial intelligence)-based materials knowledge system (AI-MKS) framework. More specifically, these concepts deal with the feature engineering of the heterogeneous material internal structure to obtain low-dimensional representations that can then be combined with machine learning models to establish low-computational cost surrogate models for capturing the process–structure–property linkages over a hierarchy of material structure/length scales. Generally referred to as materials knowledge systems (MKS), this framework synergistically leverages the emergent AI/ML (machine learning) toolsets in conjunction with the modern experimental and physics-based simulation toolsets currently employed by the domain experts in the materials field. The primary goal of this tutorial is to present to the domain expert the foundations needed to understand and take advantage of the impending opportunities arising from a synergistic integration of AI/ML tools into current materials innovation efforts, while identifying a specific path forward for accomplishing this goal.

## INTRODUCTION

Data analytics and machine learning tools have been shown to be invaluable in establishing low-computational cost surrogate models for a broad variety of applications ranging from recommendation systems (e.g., Ref. 1) to cancer detection (e.g., Ref. 2) to self-driving cars (e.g., Ref. 3). They are being increasingly explored for addressing challenges related to accelerated materials discovery and development—the central focus of the Materials Genome Initiative (MGI).^{4–13} Some of the notable successes in this direction have included the automated extraction of materials data from published reports (e.g., Refs. 14 and 15), automated identification of salient features in images (e.g., Refs. 16 and 17), and the identification/prediction of material chemistries and internal structures with unusual/superior combinations of properties (e.g., Refs. 18–29). Although these successes have identified many exciting new research avenues, they have also identified a central gap in the field today. It is generally observed that a simple brute-force application of the established machine learning tools to problems and challenges encountered in advanced materials development and deployment often fails to deliver the expected benefits. This is mainly because such explorations fail to leverage the vast amount of the previously accumulated domain knowledge in the materials field. Therefore, in the materials innovation efforts, there is a critical need and a tremendous opportunity for the development and deployment of novel frameworks^{12,30} that facilitate synergistic use of the emergent data analytic tools in conjunction with the established toolsets used in the materials science and engineering domain. The latter includes a broad suite of sophisticated physics-based multiscale materials modeling tools^{31} and multiresolution materials structure and response characterization protocols (e.g., Refs. 32–36).

It is highly desirable to develop integrated mathematical frameworks that take advantage of the relative strengths of the different approaches mentioned above (i.e., data analytics, physics-based computations, multiresolution experiments). For example, most machine learning tools are inherently aimed at the *interpolation* of the available high-dimensional data (usually comprising a small number of data points), with some of them paying special attention to uncertainty quantification (e.g., Bayesian approaches^{37–41}). On the other hand, physics-based models and simulations provide the only avenues for building models with the potential for high fidelity *extrapolation*. It is also important to recognize that experiments provide the only avenue for collecting *ground truth* data. Therefore, it is clear that the best strategy for accelerated materials innovation lies in our ability to develop and deploy novel frameworks capable of exploiting the relative strengths of all the different classes of toolsets mentioned above. It should further be recognized that vast differences exist in the cost and fidelity of these different classes of tools. For example, multiresolution experiments spanning multiple material length scales often require significant investments of time and money, while physics-based simulation tools are generally less expensive (on a relative basis). However, there is currently no framework for providing objective guidance to the researcher on where one should invest their time and effort (e.g., Should one do more experiments or physics-based simulations? Which ones?) in order to optimally reach their targets in materials innovation (e.g., attain a specified combination of material properties or performance metrics).
The above discussion points to the critical need for a foundational framework that can systematically and comprehensively extract the embedded knowledge in the physics-based simulations and multimodal multiresolution experimental datasets, and express this core knowledge in forms that can objectively support and guide the accelerated materials innovation envisioned by MGI.

A central tenet in the field of materials science and engineering is that the *processing* history controls the material's internal *structure* over a hierarchy of length scales, which, in turn, controls the effective (macroscale) *properties* or performance characteristics exhibited by the material. The core *materials knowledge* needed to drive materials innovation is therefore most conveniently captured in the form of process–structure–property (PSP) linkages.^{13,42–47} Such PSP linkages can be formulated at salient material length/structure scales^{13,48} to facilitate computationally efficient scale-bridging in both directions (i.e., homogenization and localization^{13,49}). Briefly, homogenization focuses on aggregating information from the lower material structure/length scale to the next higher scale, while localization addresses the spatial distribution (i.e., partitioning) of imposed quantities at the higher material structure/length scale to the next lower scale. The envisioned network of linkages is presented schematically in Fig. 1, where it is implied that a large library of low-computational cost, reduced-order (surrogate), PSP linkages could drive optimally the materials innovation effort. One of the salient aspects of the aggregated and curated PSP linkages shown in this figure is that their uncertainty will be rigorously modeled in a suitable Bayesian framework. Consequently, at any given time, one will be able to answer instantly any materials-related queries arising from design/manufacturing experts, while also quantifying the confidence levels in the provided answers. Simultaneously, the queries themselves could be used to prioritize and streamline future efforts aimed at refinement and/or expansion of the PSP linkages in order to provide a better answer at a future time.

The vision and paradigm presented in Fig. 1 are distinctly different from the current practices in multiscale materials innovation efforts, which largely design and launch experiments/simulations in response to the specific needs articulated by the design/manufacturing end-user. Since most multiscale materials experiments and simulations demand significant time and effort, the data collection efforts are invariably slow (often requiring several months or even years). Most importantly, the decisions made by the domain specialists regarding which specific experiments or simulations are to be performed to obtain the required insights are often made in an ad hoc manner, relying largely on their own individual analysis of the data/information accessible to them. Consequently, the decisions made in the current workflows do not usually lead to optimal learning of the critical knowledge needed to drive the targeted materials innovation.

The novel concept presented in Fig. 1 fundamentally argues that it would actually be much more beneficial to pursue the aggregation and curation of the PSP linkages in a variety of materials classes and structure/length scales in a fully de-coupled manner. The overall scheme presented in this figure allows the different experts engaged in the many different aspects of materials science and engineering to generate and contribute their datasets for community-level curation of the underlying materials knowledge (e.g., see the data repositories aggregated by Materials Project,^{10} AFLOW,^{50} PRISMS,^{51} MDF,^{52,53} MDMC,^{54} and MEAD^{55}). In this context, it is important to recognize that any single dataset (from either experiments or simulations) is unlikely to produce the PSP linkages depicted in Fig. 1. This is because any dataset produced from a single source (i.e., either an experimental setup or a software framework) is likely to provide only a partial clue to the overall puzzle. It should be further recognized that even this partial clue comes with inherent uncertainty that can be attributed to many factors, including (i) insufficient knowledge of the governing physics and (ii) the limitations of the tools and machines used to generate the data.

The discussion above emphasizes the critical need for a rigorous framework for the objective (i.e., data-driven) extraction of knowledge from disparate, incomplete, and uncertain datasets produced by the different experts in the materials science and engineering field. The strategy outlined in Fig. 1 effectively de-couples the data generation and aggregation tasks (these can be broadly referred to as *materials data management* tasks) from the knowledge extraction tasks (these can be broadly referred to as *materials data analytics* tasks).^{12,13,56,57} This fundamental separation of tasks should allow the materials community to pursue the envisioned materials knowledge in a highly systematic and organized manner. Potentially, a community-level organization of the overall effort involved can lead to a highly optimized exploration of the unimaginably large materials space spanning a hierarchy of material length/structure scales. Indeed, the proposed transformation of the current practices in materials innovation efforts could streamline the efforts of the broader materials community into building systematically and optimally the core materials knowledge of high value sought by the design/manufacturing stakeholders. The novel mathematical framework and the associated toolsets needed to address the grand challenge depicted in Fig. 1 are referred to as the *AI-based Materials Knowledge Systems* (AI-MKS) in this paper. This tutorial expounds the foundational concepts and frameworks needed to pursue these knowledge systems.

## MATERIALS KNOWLEDGE SYSTEMS

As already expounded above, the core materials knowledge needed to drive objectively (and optimally) materials innovation efforts is best expressed as a very large and highly organized collection (i.e., a library) of PSP linkages whose uncertainties have been quantified. Figure 2 depicts examples of PSP linkages that could be targeted for the envisioned materials knowledge systems. It is emphasized that neither the lists of variables nor the set of PSP linkages depicted in this figure are intended to be comprehensive. These are intended to serve merely as examples. It is further noted that comprehensive and standardized lists of variables and the corresponding PSP linkages that hold high value for materials innovation efforts have not yet been identified or compiled systematically by the materials experts. Indeed, the lack of a standardized and broadly adopted taxonomy for the PSP linkages constitutes an important current gap and a significant hurdle for the realization of the vision depicted in Fig. 1. Addressing this critical need will impact positively the labeling or structuring of all materials data through the development and adoption of standardized schemas [e.g., the Materials Data Curation System (MDCS)^{58} developed at the National Institute of Standards and Technology (NIST)].

Before undertaking the immense task of creating a suitable taxonomy (or equivalently an ontology) for the aggregation and curation of the desired library of PSP linkages, it is important to establish suitable criteria/guidelines for the same. The following offers an initial list (to be further tweaked and refined by the materials research community): (i) *Comprehensive*—the list of variables identified for PSP linkages needs to cover all materials classes and the entire hierarchy of material structure/length scales. (ii) *Versatile*—the selected variables should be able to represent the broad diversity of the features involved in the PSP linkages with the most economical representations (i.e., allow the use of the smallest number of dominant features). This is particularly important for the variables selected to represent the details of the material's internal structure (see the middle column of Fig. 2). (iii) *Interoperable*—the selected variables should maximize the interoperability of the formulated PSP linkages (in both homogenization and localization directions; see Fig. 1). This criterion is the central key to ensuring that all of the relevant PSP linkages can be utilized in providing the most valuable responses to the queries from the design/manufacturing end-user (see Fig. 1). Note that several prior reports in the literature^{31,59,60} have emphasized the need for interoperability of the different software codes used by materials experts in simulating or predicting the macroscale response of a material or the performance of a device used in advanced technology. The concept articulated here is distinctly different in that it recognizes that we only need interoperability of the knowledge expressed at the different material length/structure scales. In other words, even if the software codes used to generate the data are not interoperable, the concept presented in Fig. 1 suggests that we can still extract high value knowledge in the form of interoperable PSP linkages. This task is indeed substantially easier than requiring the interoperability of the diverse software components employed by the materials experts. In fact, interoperable PSP linkages are much more practical and can lead to rigorous assessment of the uncertainty associated with the overall predictions made using the aggregated knowledge systems.

Next, it is important to recognize and understand clearly the expected functionality of the AI-MKS depicted schematically in Fig. 1. Broadly, most of the functionality of the AI-MKS can be envisioned to be delivered through three main components: (i) *Diagnostics Engine*—this set of computations is aimed at identifying similarities or differences between different datasets. These computations can be used to perform tasks such as outlier analysis (i.e., how similar or dissimilar is a given set of data points to another set), machine health prognosis (e.g., is a given machine producing consistent material), and quality control tests (e.g., is a specified protocol being performed consistently). (ii) *Prediction Engine*—this set of computations is aimed at answering the “what if …” questions posed by the design/manufacturing end-user. As examples, one might be interested in assessing the impact of specific changes in material chemistry and/or process history on a desired set of macroscale material properties. (iii) *Recommendation Engine*—this set of computations is aimed at providing objective decision support in materials innovation. More specifically, these computations are aimed at identifying the specific next steps (from a list of potential options) that exhibit the highest potential for information gain toward a specified target. In the context of materials innovation efforts, the potential next steps typically constitute multiple options in either multiresolution experiments or physics-based simulations, while the target is usually specified in the form of a desired combination of macroscale materials properties and performance criteria.

It should be noted that the desired functionality outlined above is highly ambitious, especially given the rich diversity of multiscale materials phenomena occurring in the broad variety of material systems of interest in advanced technologies. Given the challenges of disparate, incomplete, and uncertain materials data discussed earlier, the only practical path forward is to design an agile software platform that allows continuous computationally efficient updates of the aggregated knowledge as new data becomes available. This requires the formation of synergistic partnerships between the materials science and computer science communities. This, above all, holds the central key to the successful realization of the envisioned AI-MKS.

## AI-MKS

The overall effort involved in realizing the vision depicted in Fig. 1 faces two main technical hurdles: (i) a mathematical framework for the rigorous quantification of the material structure over a hierarchy of structure/length scales spanning from the subatomic to the macroscale (hereafter referred to as the *feature engineering* of the hierarchical material structure) and (ii) a learning framework that can objectively drive the curation of the desired PSP linkages from available disparate, uncertain, and incomplete materials data (hereafter referred to as the *machine learning* framework for PSP linkages). The other challenges anticipated in the implementation of the envisioned AI-MKS are largely non-technical (e.g., establishing productive collaborations between computer scientists, data scientists, and material scientists; community-level sharing of data and codes; training of materials specialists in the correct use of the emerging machine learning tools), and have been addressed elsewhere.^{5,57} In the rest of this tutorial, we will establish the necessary mathematical frameworks for addressing the two main technical hurdles identified above. As such, these two components constitute the foundational technical elements of the envisioned AI-MKS.

### Feature engineering of material structure

In order to appreciate the fundamental challenge of this task, it is instructive to imagine what details of the material structure need to be captured to fully describe the material state of a given sample. It should be noted that the current widely used material naming conventions are based largely on the overall chemical composition and some details of the final processing steps employed in the manufacture of the material. As an example, 7075-T6 Al^{61} generally implies that this Al metal alloy has 5.6%–6.1% Zn, 2.1%–2.5% Mg, and 1.2%–1.6% Cu, while the label “T6” implies the use of an aging treatment that results in an ultimate tensile strength of 510–572 MPa, a yield strength of 434–503 MPa, and a failure elongation of 5%–11%. Although the T6 temper is usually achieved by homogenizing the cast 7075 Al alloy at 450 °C for several hours, and then aging at 120 °C for 24 h, it does not automatically imply that this was the exact temper treatment employed. This is because it is entirely possible to obtain the set of properties specified above using other thermo-mechanical processing histories, even within the specified composition window. The main limitation of the current approach in naming the materials is that it does not capture or reflect the salient features of the hierarchical material structure that would uniquely identify the material and control its many properties or performance characteristics.

As already noted, the description of the material structure is heavily complicated by the fact that the material internal structure spans several length scales (from the subatomic to the macroscale). Over this large span of length scales covering about seven orders of magnitude, the material internal structure exhibits a very large set of salient features that can potentially influence the macroscale properties of interest. For example, if one looks into the mesoscale structures of typical structural alloys (e.g., Al alloys, Ti alloys, steels) using optical and/or scanning electron microscopes, one finds diverse multiphase polycrystalline microstructures exhibiting rich variations in grain orientation, grain/phase size, and morphology distributions. At the lower length scales, the crystalline arrangements exhibit an equally rich variety of defects (e.g., dislocation structures, solute segregation). Given this basic introduction of the material structure, it is clear that the number of features needed to define the hierarchical material structure in any given physical sample of a material is going to be extremely large (i.e., almost an infinite number of features would be required to uniquely identify each material). Since each distinct feature of the material structure can be treated as a *dimension* in its description, one can conclude that the rigorous quantification of the hierarchical material structure inherently demands an extremely high dimensional representation.

Most commonly employed approaches for the quantification of the material structure are based on highly simplified statistical measures. For example, there have been several attempts to correlate the yield strength and failure strength of metal alloys to the overall alloy composition, the phase volume fractions, and the average grain sizes in the sample.^{62–67} Although these highly simplified statistical measures of the microstructure show tremendous potential, it should also be clear that they are woefully inadequate for establishing high fidelity PSP linkages of the kind depicted in Fig. 1. In fact, established composite theories point out that one can only obtain elementary bounds^{49} on properties based on such simplified measures. Often, these elementary bounds are too widely separated to be of practical value in the rational design of materials meeting designer-specified property combinations. Indeed, more sophisticated (i.e., higher-order) statistical measures of the material structure are essential for establishing the desired higher fidelity models.^{47,68–71}
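To make the point about elementary bounds concrete, the following minimal sketch computes the Voigt (upper) and Reuss (lower) bounds on the effective Young's modulus of a hypothetical two-phase composite from volume fractions alone; all numerical values here are illustrative assumptions, not data from this tutorial.

```python
import numpy as np

# Hypothetical two-phase composite (illustrative values only).
E = np.array([210.0, 70.0])  # phase Young's moduli in GPa (assumed)
f = np.array([0.4, 0.6])     # phase volume fractions (must sum to 1)

# Voigt bound: isostrain assumption (arithmetic mean, rule of mixtures).
E_voigt = float(np.sum(f * E))

# Reuss bound: isostress assumption (harmonic mean).
E_reuss = float(1.0 / np.sum(f / E))
```

For these assumed inputs, the bounds come out to roughly 95.5 GPa (Reuss) and 126 GPa (Voigt); the wide gap illustrates why volume fractions alone rarely suffice for the high fidelity PSP linkages targeted here.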

Before we delve into the details of an advanced framework for efficient feature engineering of the material structure, we should clearly lay down our expectations from such a framework. In the context of the AI-MKS described in this tutorial (see Fig. 1), the necessary attributes of the framework include versatility (allow generalization to all relevant structure/length scales in all material systems) and extensibility (allow systematic incorporation of more detailed information as needed), while facilitating low-dimensional representations (i.e., needing only a small number of parameters) with a high ability to recover most of the original information. Needless to say, any framework adopted for material structure quantification needs to serve the primary mission of formulating low computational cost, high fidelity, PSP linkages needed for AI-MKS.

At the outset, it should be recognized that most data about the material internal structure is derived from a variety of images. For example, one obtains data about the material internal structure through images (i.e., maps) obtained using a variety of microscopy techniques (e.g., optical, scanning electron, transmission electron), sometimes aided by tomography techniques to obtain three-dimensional information.^{72–75} Any single two-dimensional or three-dimensional map obtained in this process simply represents a single instantiation of the material internal structure (at a spatial resolution and an accuracy dictated by the characterization machine and protocols employed). In other words, any such map should be treated as *a* material structure and not *the* material structure.^{76–78} This is because it is possible to obtain a large number of material structure instantiations from a single physical sample, where the different instantiations exhibit unavoidable variance in the values of the extracted statistics. Indeed, it should be recognized that *the* material structure refers to a fictitious realization that is representative of a large ensemble of material structures obtained from a single sample. Such a representative material structure is generally referred to as the representative volume element (RVE), and can be established at the different salient material structure/length scales using suitably established criteria.^{79}

The discussion above makes clear that the central notion behind the description of the material internal structure is essentially a spatial mapping of the local state of the material within the internal structure. Let *h* denote the local state of the material found at the spatial location $x$ in a suitably defined RVE of the material structure. Both of these variables need additional clarification and discussion. The spatial location $x$ has to have a certain material volume associated with it. This is because if one thinks of the spatial location as a point in space without any associated volume, then it is impossible to associate any local state descriptors with that spatial point. Note that all materials characterization techniques (e.g., diffraction, spectroscopy) need to probe a finite material volume in order to quantify reliably the local material state. Moreover, these measurements are often conducted on a fixed spatial grid to produce the desired material structure maps. Consequently, all experimentally measured material structure maps are implicitly associated with a spatial resolution dictated by the limitations of the characterization equipment and/or the protocols employed. In other words, the information captured in the structure maps reflects an averaged measure of the material structure over a very small volume that is implicitly associated with each spatial point in the structure map. Given these considerations, a practical path forward for capturing the details of the material structure at any selected length scale is to define the material local state on a uniformly discretized (2D or 3D) spatial grid, in just the same way digital images are stored on a computer. Each grid point is then associated with a material volume or a voxel (equivalently, for a 2D description, each grid point would be associated with a pixel). It should be noted that the discretized description of the material structure described here is equivalent to uniform sampling, and lends itself to computationally efficient transformations based on discrete Fourier transforms (DFTs).^{80,81}

Next, let us formalize what material local state information should be associated with each voxel. Formally, the material local state *h* can include any and all attributes needed to describe fully the local material response at the scale of the voxel. Implicitly, the variables included in *h* would depend on the specifics of the material system and the material structure length scale under consideration. For example, at the mesoscale (voxel sizes in the range of ∼1 *μ*m to ∼100 *μ*m), most metals exhibit multiphase polycrystalline microstructures. The appropriate material structure scale for the description of these material structures is the grain scale, i.e., one would employ a multitude of voxels in each grain. For these material structures, one could include the thermodynamic phase identifier, $\alpha$, the chemical composition of the phase, *c*, the crystal lattice orientation, *g*, and suitably defined defect densities such as the dislocation density, $\rho$. As already discussed, these would have to be defined as averaged quantities over the voxel volume. Based on this example, it can be seen that the complexity of the local material state in most advanced materials demands a multivariate description. For the example described above, $h = (\alpha, c, g, \rho)$. Note also that some of the variables used in the description of *h* themselves demand multivariate descriptions. In the example above, the phase composition may need to be represented by a set of chemical compositions of the individual chemical elements, the lattice orientation is usually defined by an ordered set of three rotation angles called Euler angles,^{82} and one might be interested in including additional defect and/or damage measures (e.g., microporosity, density of small cracks). At a lower material structure length scale, say the atomic structure, one needs to adopt a different definition of the local material state. If one were to voxelize the atomic structure of a material where the atoms are represented as hard spheres,^{48,83} the local material state could be described using an identifier for the chemical species.
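As a minimal sketch of the multivariate local state $h = (\alpha, c, g, \rho)$ discussed above, one could bundle the voxel-averaged attributes into a simple container; the class name, field names, and numerical values below are hypothetical illustrations, not a standardized schema.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class LocalState:
    """Voxel-averaged material local state h = (alpha, c, g, rho)."""
    phase: int                  # thermodynamic phase identifier (alpha)
    composition: np.ndarray     # mole fractions of the chemical elements (c)
    orientation: np.ndarray     # Bunge Euler angles in radians (g)
    dislocation_density: float  # dislocation density in 1/m^2 (rho)

# One illustrative voxel in a hypothetical two-phase alloy.
voxel_state = LocalState(
    phase=0,
    composition=np.array([0.90, 0.06, 0.04]),
    orientation=np.array([0.0, np.pi / 4, 0.0]),
    dislocation_density=1.0e12,
)
```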

It should be clear that the definition of the material local state *h* is quite flexible and can be used to incorporate any attributes of interest. In the examples described above, we mainly restricted our attention to attributes that describe the material structure at the lower length scale. For example, the definition $h = (\alpha, c, g, \rho)$ is exclusively based on averaged statistical measures of the material structure at the lower length scale. As an alternative, one can also define the local material state directly by the local properties exhibited at the voxel scale. As an example, one can choose to represent the local material state at the mesoscale as $h = (C, Y)$, where $C$ denotes the fourth-rank elastic stiffness tensor and $Y$ denotes the anisotropic plastic yield surface for the material in the voxel. Similarly, at the atomic scale, one can use a combination of physical parameters instead of the identifier for the chemical species. For example, one can employ a combination of parameters such as atomic mass, atomic radius, and electronegativity to describe the material local state at the atomic scale. This alternate expression of the material local state directly in terms of the physical properties of interest allows for a better interpolation of the material structure to facilitate the formulation of high fidelity PSP linkages across different chemical compositions and/or thermodynamic phases.
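As a sketch of the property-based alternative, the fourth-rank elastic stiffness tensor $C$ for an elastically isotropic voxel can be assembled from just two scalar properties; the Young's modulus and Poisson ratio used here are assumed, illustrative values.

```python
import numpy as np

def isotropic_stiffness(E, nu):
    """Fourth-rank isotropic elastic stiffness C_ijkl built from (E, nu)
    via the Lame parameters: C_ijkl = lam*d_ij*d_kl + mu*(d_ik*d_jl + d_il*d_jk)."""
    lam = E * nu / ((1.0 + nu) * (1.0 - 2.0 * nu))  # Lame's first parameter
    mu = E / (2.0 * (1.0 + nu))                     # shear modulus
    d = np.eye(3)
    return (lam * np.einsum("ij,kl->ijkl", d, d)
            + mu * (np.einsum("ik,jl->ijkl", d, d)
                    + np.einsum("il,jk->ijkl", d, d)))

# Assumed Al-like voxel properties (GPa, dimensionless).
C = isotropic_stiffness(E=70.0, nu=0.33)
```

The resulting array satisfies the usual minor and major symmetries of an elastic stiffness tensor; anisotropic voxels would instead require the full set of independent stiffness components (e.g., suitably rotated single-crystal constants).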

The voxelized representation of the material volume at a hierarchy of material structure/length scales and the assignment of suitably defined material local states to each voxel can provide a sufficiently accurate and versatile description of the highly complex material structure. However, this would result in an extremely high-dimensional representation that will be unwieldy for formulating the PSP linkages of the kind depicted in Fig. 1. Most importantly, as already described earlier, we need a stochastic formulation that allows a quantification of the uncertainty in the formulated PSP linkages. In the context of material structure quantification, uncertainty arises from the limitations in the capabilities of the materials characterization equipment, unintended changes introduced into the material structure in the preparation of samples for the characterization protocols, and the availability of limited or incomplete data (e.g., availability of only two-dimensional maps of the material structure, or an inadequate number and size of microstructure maps or scans).

A stochastic material structure can be defined by invoking a material structure function,^{84} $m(h, x)$, which reflects the probability density for finding local state *h* at the spatial location $x$. For reasons already discussed above, it is much more practical to employ a discretized version of the microstructure function defined as the array $m[h, s]$, where $s \in S$ indexes the voxels in the discretized material volume. In this notation, $s$ denotes a multi-dimensional integer index. For example, in describing 3D microstructures, it is convenient to represent the index as $s = \{s_1, s_2, s_3\}$. Furthermore, we will also restrict our attention in this tutorial to situations where the local state *h* is limited to only a finite number of discrete choices. For this special case, $m[h, s]$ simply reflects the volume fraction of local state *h* in the voxel indexed by $s$. Extensions of the concepts described in this tutorial to more complex choices of the local state descriptors can be found in recent publications.^{13,27,85,86}
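For the discrete local state case just described, the array $m[h, s]$ can be sketched as a simple one-hot encoding of a labeled voxel image; this minimal example assumes a synthetic (randomly generated) two-state, three-dimensional microstructure in which each voxel contains exactly one local state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labeled 3D voxel image: each entry is a discrete local state in {0, 1}.
labels = rng.integers(0, 2, size=(16, 16, 16))
n_states = 2

# Microstructure function m[h, s]: volume fraction of local state h in voxel s.
# With exactly one state per voxel, this reduces to a one-hot encoding of the labels.
m = np.stack([(labels == h).astype(float) for h in range(n_states)])

# The local state fractions in every voxel must sum to one.
assert np.allclose(m.sum(axis=0), 1.0)
```

Voxels straddling an interface would instead carry fractional values of $m[h, s]$ between zero and one, reflecting partial volume fractions.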

The simplest statistics that can be computed from $m[h, s]$ are the 1-point statistics, obtained by averaging the microstructure function over all voxels,

$$f[h] = \frac{1}{|S|} \sum_{s \in S} m[h, s], \qquad (1)$$

where $f[h]$ denotes the volume fraction of local state *h* in the entire structure and $|S|$ denotes the total number of voxels in the microstructure. Note that $m[h, s]$ admits fractional values (between zero and one). The deficiency in utilizing $m[h, s]$ directly for the quantification of material structure lies in the lack of a natural origin for indexing the spatial cell variable, $s$. In other words, $m[h, s]$ lacks translational invariance: material structure images taken from different locations (using different origins for indexing $s$) on the same sample can exhibit very different values of $m[h, s]$, even though their underlying material structure statistics are expected to be quite similar. A rigorous statistical treatment of the material structure as a random process suggests the use of the framework of *n*-point spatial correlations (also called *n*-point statistics).^{47,68,69,76,77,87} This formalism provides a natural approach for quantifying the material structure in ways that allow systematic addition of higher-order information (as the value of *n* is increased). For example, 1-point statistics is the simplest form of *n*-point statistics and captures information on the volume fractions of the different local states present in the material structure.

The next level of *n*-point statistics is 2-point statistics, which denotes the probability of finding local states *h* and $h'$ separated by a vector indexed by $r$. Since the vectors that can be placed in a voxelized volume are also naturally discretized (see Fig. 3), $r$ can also be treated as a vector integer index (allowed to take only integer values for its vectorial components). Indeed, the indices $r$ and $s$ share many common features. The main difference is that while the index $s$ enumerates each of the voxels, the index $r$ enumerates the vectors that can be thrown into the discretized material structure (essentially as a difference between any two values of $s$).^{69,77} With this notation, the 2-point statistics can be computed as^{76,77,88}

$$f[h, h' \mid r] = \frac{1}{|S|} \sum_{s \in S} m[h, s]\, m[h', s + r]. \qquad (2)$$

The formalism presented in Eqs. (1) and (2) represents the most comprehensive and systematic digital representation of the material structure available today (see Ref. 47 for a discussion of how it relates to other traditionally used measures of the microstructure and Ref. 89 for reconstructions of the original microstructure from the 2-point statistics). It has also been pointed out that this formalism (i) provides a comprehensive treatment of the neighborhood of a selected voxel as a stochastic variable^{76,77} and (ii) connects directly with the most sophisticated physics-derived composite theories available in the published literature.^{13,47,68,90}
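Because the 2-point statistics of Eq. (2) are (circular) cross-correlations, they can be computed efficiently with FFTs. The sketch below is a minimal Python illustration under the assumption of periodic boundaries; the microstructure is a synthetic stand-in.

```python
import numpy as np

def two_point_stats(m_h, m_hp):
    """Periodic 2-point statistics f[h, h' | r] = (1/|S|) sum_s m[h,s] m[h',s+r],
    computed as a circular cross-correlation via FFTs."""
    S = m_h.size
    F = np.fft.ifftn(np.conj(np.fft.fftn(m_h)) * np.fft.fftn(m_hp)).real / S
    return np.fft.fftshift(F)  # place the zero vector r = 0 at the array center

rng = np.random.default_rng(1)
labels = (rng.random((64, 64)) < 0.4).astype(float)  # toy two-phase structure

auto = two_point_stats(labels, labels)       # autocorrelation (h = h')
center = tuple(n // 2 for n in auto.shape)

# For a one-hot microstructure function, the autocorrelation at r = 0
# equals the volume fraction of that phase.
print(auto[center], labels.mean())
```

The value at the map center (the zero vector) matching the volume fraction is exactly the property exploited in the discussion of Fig. 4.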

In order to illustrate how the 2-point statistics capture the important attributes of microstructure morphology, it is instructive to look at a few idealized microstructures and their corresponding 2-point statistics computed using Eq. (2). Figure 4 shows a few digitally created (i.e., synthetic) microstructures from Ref. 88 in the top row and their corresponding 2-point statistics in the bottom row. The first four digital microstructures were produced by randomly placing ellipses of a selected size, while the fifth microstructure was produced by randomly placing spheres of a selected size. Furthermore, in the first two micrographs, the ellipses were oriented in a single direction (horizontal in the first one and vertical in the second one), while the third microstructure was generated by placing ellipses in two selected orientations. The fourth microstructure was generated by placing ellipses in arbitrary orientations. One of the most important statistics in an autocorrelation map is the one corresponding to the zero vector, which is seen at the center of the autocorrelation map. The value at the center of the autocorrelation maps in Fig. 4 reflects the volume fraction of the corresponding local states. The central pattern in the middle of the autocorrelation captures information on the phase morphology. Note that the central pattern in the autocorrelation clearly captures the elliptical shape and its orientation quite well in the first two microstructures. For the third microstructure, the pattern in the middle of the autocorrelation reflects a combined morphology resulting from both orientations of the ellipses in the microstructure.
Since the fourth microstructure has ellipses oriented in all directions, the average morphology is reflected in the roughly equiaxed pattern in the middle of the autocorrelation (the slight elliptical shape in this plot is a consequence of the fact that it is nearly impossible to obtain a perfectly random distribution of oriented ellipses in any finite domain; larger microstructural domains will make the central pattern in this plot approach a circle). In addition to the central feature, there are many local peaks and patterns visible from the autocorrelation maps in the bottom row of Fig. 4. These additional local peaks and patterns carry information about the spacing of the ellipses (or circles) in the corresponding microstructure. For example, the horizontal bands just outside the central ellipse in the autocorrelation of the first microstructure indicate that the ellipses are more aligned with each other in the horizontal direction compared to their alignment in the vertical direction. This indeed can be confirmed in the corresponding microstructure. It is also noted that the autocorrelation for the fourth microstructure depicts fewer discernible additional patterns or peaks outside the central pattern. This indicates that this particular microstructure exhibits a higher level of randomness (i.e., disorder) compared to the other microstructures in this example.

The *n*-point statistics defined above are indeed very high-dimensional, where the statistic for each spatial configuration (i.e., each spatial arrangement of specified local states) constitutes one dimension. It is therefore necessary to seek high-value low-dimensional representations of these statistics. The term "high-value" in the context of this discussion refers to the efficacy of the low-dimensional microstructure representation in arriving at reliable and robust PSP linkages. This is exactly where some of the data science toolsets become very valuable. Prior work^{13} has demonstrated the remarkable efficacy of principal component analysis (PCA) in obtaining low-dimensional representations of 2-point statistics and establishing high-fidelity PSP linkages. PCA essentially transforms the 2-point statistics into a new reference frame that allows the ensemble of microstructure statistics to be represented most economically in terms of the capture of the variance between the elements of the ensemble. In other words, at any selected truncation level, this representation guarantees the capture of the highest amount of the variance between the data points, within the constraints of a linear, distance-preserving, transformation. For performing PCA, it is convenient to aggregate the *n*-point statistics deemed important for a selected application into an array denoted by $f[k, r]$, where $k = 1, 2, \ldots, K$ indexes all microstructures in the ensemble. In principal component space, the desired set of spatial statistics of the *k*th microstructure can be expressed as

$$f[k, r] \approx \sum_{i=1}^{R^{*}} \alpha[k, i]\, \psi[i, r] + \bar{f}[r], \qquad (3)$$

where $\alpha[k, i]$ denote the PC scores of the *k*th microstructure in the PC space. In other words, each row of this array provides a low-dimensional representation of each microstructure in the ensemble, in the form of PC scores $\alpha[k, i]$ with $i = 1, 2, \ldots, R^{*}$. One of the main benefits of Eq. (3) is that it provides the highest capability for the reconstruction of the original spatial statistics through the use of the stored values of the basis vectors, $\psi[i, r]$, and the ensemble average, $\bar{f}[r]$.
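A minimal numpy sketch of the PCA representation in Eq. (3) is given below; the ensemble of spatial statistics is a synthetic stand-in with deliberately low-rank structure, so the array sizes and the resulting variance fraction are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical ensemble: K microstructures, each summarized by a flattened
# vector of spatial statistics f[k, r] (synthetic, nearly rank-3 data).
K, R = 40, 500
latent = rng.normal(size=(K, 3))
f = latent @ rng.normal(size=(3, R)) + 0.01 * rng.normal(size=(K, R))

# PCA via SVD of the centered ensemble, i.e., Eq. (3):
# f[k, :] ~ sum_i alpha[k, i] * psi[i, :] + f_bar
f_bar = f.mean(axis=0)
U, s, Vt = np.linalg.svd(f - f_bar, full_matrices=False)

R_star = 2                            # truncation level
alpha = U[:, :R_star] * s[:R_star]    # PC scores, one row per microstructure
psi = Vt[:R_star]                     # PC basis vectors

# Reconstruction of the statistics and the variance captured at this truncation.
f_rec = alpha @ psi + f_bar
explained = 1 - np.sum((f - f_rec) ** 2) / np.sum((f - f_bar) ** 2)
print(explained)
```

The stored quantities `psi` and `f_bar` are all that is needed, together with the scores `alpha`, to reconstruct each microstructure's spatial statistics to the chosen truncation level.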

As a simple illustration, Fig. 5 shows the PC representation in the first two dimensions for spatial statistics aggregated from an ensemble of 287 experimentally obtained segmented micrographs in superalloy samples exhibiting two-phase microstructures.^{91} In this plot, each data point represents 36 360 spatial statistics extracted from each micrograph (i.e., microstructure). It was also reported that the first two PC scores captured 93.3% of the variance among the different microstructures in this ensemble. This example shows the power of PCA in attaining a dimensionality reduction (i.e., from 36 360 to 2) with minimal loss of information. Specific microstructures corresponding to eight selected points (shown in different colors) in this plot are identified at the top and bottom of the PC plot. In this study, the microstructures were obtained by aging the superalloy samples at different temperatures for different time periods. The information on the aging temperature, the aging time, and the area fraction of $\gamma'$ precipitates for each microstructure is given at the top of each micrograph. These eight points were specifically selected to illustrate what features were captured in PC1 and PC2. The points selected in the top row have very different PC2 values compared to the points selected in the bottom row. The value of PC1 in each row increases systematically from left to right. Each set of points with the same color has similar PC1 scores but significantly different PC2 scores. A careful study of the colored points and their microstructures in Fig. 5 indicates that PC1 strongly correlates with precipitate area fraction, while PC2 appears to capture the coarsening and coalescing of precipitates. A full interpretation of the PC scores is complicated by the large dimensionality of the PC basis, which is equal to the number of collected spatial statistics (36 360 for this example).
Tools aimed at the improved interpretation of the PC basis are a topic of active research in many fields, even outside of materials research where similar tools have been deployed extensively.

Before closing this section, a few additional remarks are warranted. In the formalism presented here, the information on grain (or phase) boundary character is not explicitly incorporated; it is only incorporated indirectly. Changes in the phase or grain orientation from one voxel to the next indirectly imply the presence of a phase or grain boundary. Indeed, while the lattice misorientation across the boundary is captured to a higher fidelity in this formalism, the boundary plane itself is captured only to the level of accuracy allowed by the discretized representation of the material volume. If such material local states are deemed important for the problem, they need to be explicitly included in the descriptors of the material local state. Another salient aspect of the presented framework is that it produces an objective (unsupervised) low-dimensional representation of the material structure statistics that automatically maximizes our ability to reconstruct the statistics of the original material structure. This, of course, is a property of PCA. The reconstruction of the original material structure from the statistics, however, requires other sophisticated computational strategies.^{89,92–94} Although there exist a number of other dimensionality reduction strategies (e.g., kernel PCA,^{95} local linear embedding,^{96} local tangent space alignment^{97}), they have not thus far been found to produce an improved unsupervised classification of the material structure statistics for the many case studies explored by our research group. Further investigations are clearly needed to critically explore the utility and efficacy of these methods for AI-MKS applications.

### Machine learning for PSP linkages

Once the features have been identified for the inputs and outputs of interest, the next task in creating the foundational elements of the AI-MKS is the formulation of reduced-order low-computational cost PSP linkages. In this task, one usually starts with a suitable dataset (could come from physics-based simulation tools or from experimental protocols). As an example, one might have generated digitally a large ensemble of 3D RVEs and evaluated their effective mechanical properties using micro-mechanical finite element (FE) models that incorporate sophisticated physics about the constitutive responses of microscale constituents and their interactions. In this case, the PC scores of the microstructure (treated as inputs) and its FE predicted effective mechanical property (treated as output) would constitute a single data point, and a collection of such data points would constitute a dataset. As another example, in studies of microstructure evolution using phase-field models, the averaged chemical compositions and the process parameters driving the microstructure evolution would be treated as inputs and the time-evolving PC scores of the microstructure statistics would be treated as outputs. Note that one can establish suitably modified definitions for experimentally acquired datasets.

An important component of all reduced-order model building efforts is the validation of the model. Since modern machine learning tools employ highly sophisticated algorithms for training the desired models, they often employ a very large number of implicit model-fit parameters. Since a larger number of learnt model-fit parameters is very likely to improve the model predictions, there is always an incentive to keep increasing the complexity of the model. However, it is important to understand that the most likely outcome of unnecessarily increasing the model complexity is an over-fit, characterized by a dramatic loss in the predictive accuracy of the model for new inputs. Therefore, it is very important to design and adopt a rigorous validation strategy. In our work,^{98–100} we have generally favored a hybrid approach that utilizes (i) a hard train-test split of the dataset, where a certain fraction of the available data points are randomly selected and set aside for the critical validation of the trained model (these are referred to as test data points and are not exposed to the model training effort in any manner), and (ii) a leave-one-out cross-validation (LOOCV) during the training phase of the model. In LOOCV, one data point is excluded from the training set while building the model, and utilized to quantify the prediction error. The process is repeated by excluding each of the training data points, one at a time. Consequently, one collects as many evaluations of the model prediction errors as the number of the training data points. This set of errors is referred to as the LOOCV errors. When the LOOCV errors are higher than the errors obtained using all of the training points, it indicates an over-fit of the model. Therefore, LOOCV is valuable in guiding the model training process, while avoiding an over-fit.
Separately, the prediction errors from the test data points that have not been exposed to any aspect of the model training are used to assess critically the accuracy of the model. It is noted that a number of different variants of the protocols described above are indeed possible. In fact, one of the most commonly employed approaches is k-fold cross-validation, where the data is partitioned randomly into k equal folds, and cross-validation is performed by excluding one fold at a time. In our prior work, we found the hybrid approach described above to be much more systematic, as it formally separates the model training phase from the model testing phase. The use of LOOCV in the model training phase allows the optimal development of the surrogate model with the small training datasets that are typically available in the materials innovation problems discussed in this tutorial.
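The hybrid validation protocol described above can be sketched in a few lines of Python; the linear least-squares model and the synthetic dataset below are stand-ins for an actual surrogate model and a real (PC scores → property) dataset.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy dataset standing in for (PC scores -> effective property) pairs.
X = rng.normal(size=(30, 2))
y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + 0.05 * rng.normal(size=30)

# (i) Hard train-test split: the last 6 points are set aside and never
# touch model training in any manner.
X_tr, y_tr, X_te, y_te = X[:24], y[:24], X[24:], y[24:]

def fit(X, y):
    A = np.column_stack([np.ones(len(X)), X])   # linear model with intercept
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict(w, X):
    return np.column_stack([np.ones(len(X)), X]) @ w

# (ii) LOOCV on the training set: hold out one training point at a time
# and record the prediction error on the held-out point.
loocv_errs = []
for i in range(len(X_tr)):
    keep = np.arange(len(X_tr)) != i
    w = fit(X_tr[keep], y_tr[keep])
    loocv_errs.append(abs(predict(w, X_tr[i:i + 1])[0] - y_tr[i]))
print("mean LOOCV error:", np.mean(loocv_errs))

# Final critical assessment on the untouched test points.
w = fit(X_tr, y_tr)
print("test error:", np.mean(np.abs(predict(w, X_te) - y_te)))
```

Comparing the mean LOOCV error against the training-set residuals is what flags an over-fit during model selection, while the test error provides the final unbiased assessment.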

Reduced-order models of interest (i.e., PSP linkages) can be established using a variety of strategies. Broadly, these fall into two main categories: (i) regression approaches and (ii) Bayesian-inference based approaches. In the regression approaches, one defines an error measure and identifies the values of the adjustable model-fit parameters that minimize the average error over the training data points. The simplest of the regression approaches is least-squares regression,^{81} which can be augmented using a variety of regularization techniques (e.g., ridge regression,^{101} least absolute shrinkage and selection operator (LASSO),^{102} elastic net^{103}) to mitigate the propensity for over-fit. Modern sophisticated implementations of the regression approaches can be found in Neural Networks (NNs)^{104–106} and Convolutional Neural Networks (CNNs).^{107–109} The approaches based on Bayesian inference offer a powerful alternative to the regression techniques, especially for problems with relatively small datasets (i.e., a small number of training data points available for model building). In these approaches, regularization is accomplished by prescribing a suitable prior distribution on the unknowns (these could be parameters of an assumed model form or directly the unknown function itself). In general, one might argue that the efficacy of Bayesian inference is likely to be controlled significantly by the specific details of the assumed prior. Fortunately, in the field of materials science and engineering, there exists significant prior knowledge established by the domain experts in the form of known materials physics. Priors informed by established but uncertain physics in the materials field offer a powerful approach to building the desired surrogate PSP linkages.
Specific examples of such model building approaches include Bayesian Linear Regression (BLR)^{26,110,111} and Gaussian Process regression (GPR).^{112–115} One of the main benefits of these statistical approaches is that they provide a natural framework for the rigorous treatment of the uncertainty in the model predictions for new inputs. This, in turn, can provide objective guidance on where new training points should be generated in order to optimize the potential gain in the fidelity of the model being built. Bayesian inference approaches are therefore ideally suited for most multiscale materials design problems, where the cost of generating data points (either experimental or simulations) is very high. There exist a number of adaptive sampling strategies for addressing the optimal selection of inputs for data generation; these are generally referred to as sequential design strategies and utilize criterion such as maximum surrogate uncertainty^{116} or maximum difference from current estimates made by the surrogate model^{117,118} or maximum expected improvement in the fit of the Gaussian process model to the noisy training data.^{119}
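As a toy illustration of the maximum-uncertainty sequential design strategy mentioned above, the sketch below (using scikit-learn, with a synthetic one-dimensional "simulation" standing in for an expensive physics-based model) repeatedly queries the input where the GPR surrogate's predictive standard deviation is largest.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical expensive simulation of a structure-property response.
def simulate(x):
    return np.sin(3 * x) + 0.5 * x

X_train = np.array([[0.1], [0.9], [2.0]])        # initial (cheap) design
y_train = simulate(X_train).ravel()
X_cand = np.linspace(0.0, 3.0, 200).reshape(-1, 1)  # candidate inputs

gp = GaussianProcessRegressor(kernel=RBF(1.0) + WhiteKernel(1e-6),
                              random_state=0)

# Maximum-uncertainty sequential design: run the "simulation" where the
# surrogate is least certain, add the new point, and refit.
for _ in range(5):
    gp.fit(X_train, y_train)
    _, std = gp.predict(X_cand, return_std=True)
    x_next = X_cand[np.argmax(std)]
    X_train = np.vstack([X_train, x_next])
    y_train = np.append(y_train, simulate(x_next[0]))

print("sampled inputs:", X_train.ravel().round(2))
```

Each added point is placed in the currently least-explored region, which is the behavior that makes these strategies attractive when every new data point is expensive to generate.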

Among the different model building approaches described above, GPR offers a powerful toolset for the envisioned AI-MKS systems. In addition to allowing a formal treatment of the uncertainty in the model predictions (this feature is central to building the diagnostics, prediction, and recommendation engines described earlier), it employs a non-parametric approach (i.e., it does not invoke a specific model form). This is of tremendous value for establishing reduced-order PSP linkages of the kind depicted in Figs. 1 and 2, for which generalized model forms exhibiting high fidelity have not yet been established in the prior literature. Indeed, GPR is being increasingly utilized in the current literature for addressing a broad range of materials problems.^{91,110,113,116,117,120–126}

GPR^{127,128} starts with a prior distribution over the desired function, defined as a joint Gaussian distribution of a set of input (i.e., feature) values denoted by the vector $x$ (assumed to be of *D* dimensions) and their corresponding output (i.e., target) values denoted by *y*. Mathematically, the function of interest is assumed to be represented by a Gaussian process (GP) as

$$y(x) \sim \mathcal{GP}\big(m(x),\, k(x, x')\big), \qquad (4)$$

which is specified completely by a mean function $m(x)$ and a covariance (kernel) function $k(x, x')$, whose selection plays a critical role in the accuracy of the model predictions for test inputs. An automatic relevance determination squared exponential (ARDSE) kernel is a good choice for our needs in formulating PSP linkages, since it allows the use of different interpolation hyperparameters for the different input variables. The ARDSE kernel is mathematically expressed as

$$k(x, x') = \sigma_f^{2} \exp\!\left( -\sum_{d=1}^{D} \frac{(x_d - x'_d)^{2}}{2\, l_d^{2}} \right) + \sigma_n^{2}\, \delta_{x x'}, \qquad (5)$$

where $\sigma_f$, $l_d$, and $\sigma_n$ denote the signal variance, the interpolation (length-scale) hyperparameters, and the output noise, respectively. These hyperparameters are typically estimated by maximizing the marginal likelihood of the training data^{129} using gradient-ascent optimization algorithms.^{130} While GPR is accessible through many software packages (e.g., MATLAB,^{131} R^{132}), it is instructive to understand the roles of the different hyperparameters introduced in Eq. (5). Smaller values of the interpolation parameters $l_d$ indicate a higher sensitivity of the output *y* to the specific input variable $x_d$. In other words, as the value of $l_d$ increases, the specific input variable $x_d$ exhibits less influence on the predicted value of the output variable. However, very small values of $l_d$ would result in noisier predictions. The output noise $\sigma_n$ in Eq. (5) is assumed to be independent of the input (referred to as homoscedasticity). For high-fidelity data (such as those obtained from established simulations or highly validated experimental protocols), the value of $\sigma_n$ should be very low (it can even be taken as zero, if justified).
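In scikit-learn, the role of the ARDSE kernel can be played by an anisotropic RBF kernel (one length scale per input dimension) combined with a WhiteKernel for the output-noise term. The sketch below uses synthetic data in which only the first input matters, so the fitted length scales illustrate the sensitivity interpretation described above; all data and settings are stand-ins.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(4)

# Toy data: the output depends strongly on x0 and not at all on x1.
X = rng.uniform(-1, 1, size=(80, 2))
y = np.sin(4 * X[:, 0]) + 0.01 * rng.normal(size=80)

# Anisotropic RBF (one length scale per input) + WhiteKernel (noise term)
# together play the role of the ARDSE kernel.
kernel = RBF(length_scale=[1.0, 1.0]) + WhiteKernel(noise_level=1e-4)
gp = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X, y)

# Length scales learned by maximizing the marginal likelihood.
l = gp.kernel_.k1.length_scale
print("learned length scales:", l)
```

The irrelevant input is assigned a much larger length scale than the influential one, i.e., a large $l_d$ signals that input $x_d$ has little influence on the prediction.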

Let *N* and $N^{*}$ denote the numbers of training and test points, respectively. Let $y$ and $y^{*}$ denote the output vectors for the training and test points, respectively. Let $K(X, X)$, $k^{*}(X, X^{*})$, and $K^{*}(X^{*}, X^{*})$ denote the covariance matrices computed using the kernel function [see Eq. (5)] on the respective combinations of training and test inputs. In GPR, the predictive distribution for the test points is expressed as

$$y^{*} \mid X, y, X^{*} \sim \mathcal{N}\big( k^{*\mathsf{T}} K^{-1} y,\; K^{*} - k^{*\mathsf{T}} K^{-1} k^{*} \big). \qquad (6)$$

The main computational expense in evaluating Eq. (6) lies in the inversion of the $N \times N$ covariance matrix $K$; for large training datasets, this cost can be mitigated using localized strategies^{113,133} that utilize only the training points in the neighborhood of the test point in making the predictions. An additional complication can arise if the $K$ matrix exhibits a large condition number. It should be noted that the $\sigma_n$ term in Eq. (5) essentially regularizes the $K$ matrix. Once $K^{-1}$ is obtained, predictions for test points can be realized through much cheaper matrix operations.^{127,134}

One of the clearest demonstrations of the power of GPR in formulating PSP linkages can be found in the recent work of Yabansu *et al.*,^{135} where it was used to formulate reduced-order models relating the macroporous structure of a membrane to its effective permeability. The training data for this model were generated using physics-based simulations that explicitly solved the governing transport equations on digitally created 3D RVEs of porous microstructures. PC representations of suitably defined spatial correlations were used as low-dimensional features (i.e., inputs), while the effective permeability of the 3D RVE predicted by the simulation tool was used as the output. Figure 6 depicts the predictive capability of the GPR model developed in this study, while using only two PC scores. This example demonstrates the remarkable efficiency of the low-dimensional representations obtained from the feature engineering approaches described in this tutorial and the high fidelity of the GPR-based structure-property models produced using these features.

### Bayesian framework for fusion of experimental and simulation data

As stipulated at the start of this tutorial, one of the major impediments to accelerating the current pace of materials innovation comes from the lack of a rigorous mathematical framework for the objective fusion of incomplete and uncertain data from disparate sources (e.g., physical experiments, physics-based simulations). In recent work,^{110} I have proposed a new Bayesian framework for addressing this critical gap. This framework builds on the foundational elements presented above and is briefly summarized next.

Physical experiments and physics-based simulations conducted by materials experts provide distinctly different insights into the hierarchical (i.e., multiscale) PSP linkages needed to drive materials innovation (see Figs. 1 and 2). Most importantly, they offer completely different avenues for studying the governing physics mediating the PSP linkages of interest. In the simulations, one usually prescribes the governing physics (expressed as a suitable combination of thermodynamic and/or mechanical laws). Simulations generally allow one to explore the overall material response to any prescribed physics, even when it might be inconsistent with the governing physics realized in the actual material samples of interest. In fact, herein lies the main challenge. The governing physics in a given material sample of interest is often only known to a limited extent. Although experiments are aimed mainly at uncovering the governing physics, they cannot directly reveal the desired insights. A physics-based model is always needed to map the quantities measured in the experiment to the physics governing the material response. Sometimes, this mapping is accomplished with very simple models. However, the mapping of experimental results to the physics governing the material response gets extremely challenging as one extends the considerations to lower material length/structure scales. At the lower material length scales, experiments rarely provide direct evidence about the physics governing the material response. Additionally, they exhibit high levels of uncertainty in the acquired data and incur high cost.

In the MKS framework, the physics governing the material response can be captured compactly in a set of suitably defined influence kernels.^{13,47,90,136} These kernels can also be expressed in digital forms (either through sampling or through the use of a suitable Fourier basis).^{137} Letting $\phi$ denote a low-dimensional representation of the (uncertain) governing physics and $E$ denote the aggregated experimental observations, Bayes' rule provides a formal avenue for updating our knowledge of the governing physics,

$$p(\phi \mid E) \propto p(E \mid \phi)\, p(\phi). \qquad (7)$$

Therefore, it should be recognized that the specification of $\phi$ in high-value low-dimensional forms is an important open area of active research.

The formulation proposed in Eq. (7) offers many advantages. Most importantly, it offers the best opportunity for utilizing physics-based simulations in accelerating materials innovation. This is because the likelihood function $p( E | \phi )$ evaluates probabilities that are conditional on specified governing physics; such an evaluation can only be conducted with physics-based simulation tools (it would be impossible to evaluate the likelihood function through experiments). Equation (7) also offers new avenues for the sequential design of experiments^{39,41,138–140} where one might maximize the potential information gain with each new experiment conducted. Therefore, a rigorous framework centered around Eq. (7) offers a systematic approach for fusing optimally the knowledge extracted from data acquired in physical experiments and physics-based simulations.

The practical implementation of the fusion framework outlined above is largely hindered by the high computational cost of the physics-based simulations. A statistically meaningful evaluation of the likelihood function $p( E | \phi )$ for most multiscale materials problems requires the execution of an extremely large number of physics-based simulations spanning a sufficiently large space in the domain of $ \phi $; the corresponding computational cost would be prohibitive for most materials innovation efforts. The foundational elements described earlier in this tutorial offer the only practical way for addressing this immense challenge. It is suggested that we first establish highly reliable and robust surrogate (i.e., reduced-order) models for the physics-based multiscale simulation tools (such as those demonstrated in Fig. 6), and subsequently employ the surrogate models for evaluating the likelihood function $p( E | \phi )$. The approach outlined above has been successfully demonstrated recently in the estimation of single crystal elastic stiffness parameters from spherical indentations performed on individual grains in a polycrystalline sample^{116} and the estimation of the single ply elastic stiffness parameters from indentation measurements on a multilaminate polymer matrix composite sample.^{141}
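The suggested workflow, evaluating the likelihood $p(E \mid \phi)$ with a cheap surrogate and updating a prior on $\phi$, can be sketched as follows. Everything here is a hypothetical stand-in: the surrogate form, the single physics parameter, the measurement-noise level, and the Gaussian prior are all illustrative choices.

```python
import numpy as np

# Cheap surrogate standing in for an expensive physics-based simulation:
# predicts a measurable quantity from a hypothetical physics parameter phi.
def surrogate(phi):
    return 2.0 * phi + 0.1 * phi**2

phi_grid = np.linspace(0.0, 5.0, 501)   # candidate governing-physics values
prior = np.exp(-0.5 * ((phi_grid - 2.0) / 1.5) ** 2)   # vague Gaussian prior

# A noisy "experimental" observation E; the likelihood p(E | phi) compares
# it with the surrogate prediction under Gaussian measurement noise.
E, noise = 6.9, 0.3
lik = np.exp(-0.5 * ((E - surrogate(phi_grid)) / noise) ** 2)

# Bayes' rule on the grid: posterior proportional to likelihood times prior.
post = prior * lik
post /= post.sum()                       # normalize over the discrete grid

phi_map = phi_grid[np.argmax(post)]
print("posterior mode:", round(phi_map, 2))
```

The surrogate makes the many likelihood evaluations affordable; with the full physics-based simulation in its place, sweeping the grid of $\phi$ values would be computationally prohibitive.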

## SUMMARY

A mathematically rigorous framework has been presented for pursuing AI-based materials knowledge systems that addresses the objective fusion of disparate data gathered from multiscale physics-based simulations and multi-resolution experiments conducted by materials specialists. The presented framework is applicable to virtually all materials classes and systems. It is also applicable to virtually all steps in the materials innovation workflow that require objective decision support. Finally, it offers new avenues for optimizing the cost and effort incurred in materials innovation, by directing the effort through the objective selection of new experiments (physical or numerical) that exhibit the highest potential for information gain based on rigorous statistical analyses. Because of the features described above, the proposed AI-MKS framework offers tremendous potential for accelerating the pace of materials innovation.

## ACKNOWLEDGMENTS

The author acknowledges support from ONR Award No. N00014-18-1-2879. The author is grateful to Dr. Yuksel Yabansu for providing some of the figures used in this paper.

## DATA AVAILABILITY

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

## REFERENCES
