Materials cartography: A forward-looking perspective on materials representation and devising better maps

Machine learning (ML) is gaining popularity as a tool for materials scientists to accelerate computation, automate data analysis


I. INTRODUCTION
Machine learning (ML) as a tool is here to stay in materials science. Early gains have come from innovative applications of methods from the computer science literature, such as graph-based neural networks or computer vision, to inorganic materials contexts, such as accelerating molecular dynamics, predicting properties, or automating data analysis. 1 As ML and informatics expertise become increasingly mainstream in materials science and engineering, future progress in the field may depend on the integration of scientific domain knowledge into the fundamental building blocks of ML tools, including representations and model architecture. 2
Within materials science, practitioners are concerned with modeling time and length scales that span many orders of magnitude, presenting differing challenges across the atomistic, mesoscale, and device levels. Moreover, problems where data are scarce 3 pose challenges for applying machine learning in many scientific fields. When designing, training, and applying a model, the representation of input features can be just as important as the target and architecture of the model itself. 3,4 Finding the appropriate way to represent a material of interest is not always straightforward and remains an active area of research. In this Perspective, informed by a recent workshop held within a consortium of industry and academic researchers, we set out to articulate some of the goals and challenges faced by ML practitioners in materials science and propose paths forward focused specifically on materials representation.
We identify at least two broad kinds of supervised machine learning problems in materials science: the forward problem and the inverse problem. Both critically depend on the choice of materials representation, as the representation can be both a means and an end, and a well-chosen representation can simplify demands on the model architecture.
The forward problem is to efficiently and approximately reproduce the results of an experiment (empirical measurements, for example, optical property measurements in Fig. 1) or simulation (idealized abstraction, for example, band structure calculations in Fig. 1) from some knowledge of the material (e.g., structure or composition). The material representation used as input, depending on the available data and the task at hand, can take many forms, ranging from a precise description of the local atomic environment for an interatomic potential to high-level knowledge, such as the composition alone. This act of mapping from knowledge of the material to a resultant property encompasses the whole of composition- and structure-based property prediction, 5 scaling relations in heterogeneous catalysis, 6 interatomic potentials, 7,8 device lifetime prediction, 9 and the part of an "inverse design" loop that predicts whether a candidate material will be desirable.
The inverse problem is to predict the underlying physical attributes of materials that are correlated with material characteristics, such as spectroscopic features. 10,11 Here, a computational representation of a material is simultaneously a means and an end, and the inversion process can map into more-or-less physically motivated categories, though the problem becomes more challenging when the relationship between the underlying structure and measured output is not 1:1.
Better representations may help to bridge the gap between benchmarks and routine applications of ML in experimental contexts. 4,12 Furthermore, in improving materials representations, the end goals are not just more accurate representations but also (1) transferable ML models and (2) generalizable theories for enhanced understanding of scientific principles (i.e., knowledge extraction). For (1), presumably all benchmarked models demonstrate acceptable performance on some initial dataset or task, which means that the challenge of demonstrating utility comes from applying them to contexts beyond their initial development. For (2), we note that an example of a useful abstraction that originates from a materials representation is the very notion of periodic atom-containing unit cells that compose crystals: this idealization of the crystal structure is useful both for conceptualizing individual materials and for commencing analysis [for instance, matching x-ray diffraction (XRD) patterns with space groups] and therefore enables both understanding and useful predictions.
We structure this Perspective around proposing solutions to three identified challenges that researchers may encounter while developing representations of material systems: (A) developing tools to handle the complexity of real-world materials to enable increasing data harvesting and greater interpretability; (B) developing unified representations to combine theory and experimental data sources; and (C) developing representations that can span timescales and length scales. We differentiate between representations suitable for a machine, which we call embeddings (typically vectors of real numbers), and those for a human, which we call idealizations (which follow mathematical and logical structures, such as obeying internally consistent scientific theories). We note that ML can serve as both a guide and a tool to enable the creation of embeddings and idealizations alike.

II. CHALLENGE A: RICHER DESCRIPTION OF MATERIALS' COMPLEXITY
Understanding the interaction between four key traits of a material (structure, process, property, and performance) is a central focus of modern materials science research. Models that allow us to navigate this complexity effectively might make more experimental systems easily accessible to machine learning methodologies and therefore unlock new scientific capabilities (see Fig. 2). In this section, we summarize popular approaches to featurizing data gathered within specific length scales and timescales, such as materials composition and crystal structure, as well as tools to handle convoluted experimental observations that contain information about more than one key trait of a material, such as optical properties or device performance tests. One way to divide the body of recent work on featurizing materials is among approaches that focus exclusively on the chemical composition, 13,14 those that include some description of the atomistic structure, 15 and those that focus instead on micro- or macro-scale observable properties of the system, including images, 16 spectra, 11 or electrochemical measurements. 17
PERSPECTIVE scitation.org/journal/aml
FIG. 2. We vertically order different input data sources according to how well established certain methods of featurization are. Top: Simple idealizations of a materials system, such as the chemical formula or the crystal structure, serve as inputs to established featurization tools or frameworks, such as matminer; 18 featurizing structures is practically an entire subfield. 15 Middle: Complex idealizations of a materials system, such as physical properties of the material that are common across a wider range of systems (such as simple observables about the electronic structure or some knowledge of the defect distribution), are sometimes incorporated into featurization of a material, but standard featurization tools are not in widespread practice yet. Bottom: For systems that have not been well studied using machine learning, practitioners must make case-by-case decisions on how best to represent their data. This includes synthesis protocol, the set of laboratory conditions that accompanied synthesis, storage, or other miscellaneous measurements. The arrow on the left-hand side represents the development of new tools that can help make featurization of common experimental measurements a standard practice that can be re-used across different projects. For example, growing interest in featurizing spectroscopic data 11,19 may make it more commonplace to featurize spectra for input into machine learning models. For systems such as batteries, where there is a wide variability in the number and kind of measurements that could be made, community emphasis on developing ontologies that can be shared across different systems will help establish standards for the field.
Chemical compositions are a simple starting point to represent a material as they are easy to featurize 14,20 and often known in experiments. For input into machine learning models, common approaches include using each element's fractional prevalence within a given composition 21 as a feature in its own right or as input to a featurization step, 14 either internal to a model or via an associated toolkit, such as matminer. 18 We note that when mapping from a chemical composition to an observable property, the composition implicitly encodes structure (more or less depending on the property). This constraint arises because all measured properties rely on some underlying atomic arrangement, so composition-property mappings cannot, in general, be 1:1 without selecting a single structure for each composition (consider the diverse properties presented by pure carbon alone in forms 22 such as graphite, graphene, or diamond).
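As a minimal sketch of the fractional-prevalence featurization described above, the following example turns a flat chemical formula into a vector of element fractions. The element vocabulary, the function names, and the simple formula parser are all illustrative assumptions, not the interface of matminer or any other toolkit, and the parser does not handle parentheses or hydrates.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny element vocabulary (illustration only).
ELEMENTS = ["Li", "C", "O", "Fe", "P"]

# An element symbol followed by an optional integer or decimal amount.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def parse_formula(formula):
    """Parse a flat formula such as 'LiFePO4' into {element: amount}."""
    amounts = Counter()
    for symbol, count in _TOKEN.findall(formula):
        amounts[symbol] += float(count) if count else 1.0
    return dict(amounts)


def fractional_vector(formula, elements=ELEMENTS):
    """Return each element's fractional prevalence, ordered as `elements`."""
    amounts = parse_formula(formula)
    unknown = set(amounts) - set(elements)
    if unknown:
        raise ValueError(f"elements outside the vocabulary: {sorted(unknown)}")
    total = sum(amounts.values())
    return [amounts.get(el, 0.0) / total for el in elements]


# Polymorphs share a formula, so a composition-only embedding cannot
# distinguish them: both entries below map to the same vector.
polymorphs = {"graphite": "C", "diamond": "C"}
vectors = {name: fractional_vector(f) for name, f in polymorphs.items()}
print(vectors["graphite"] == vectors["diamond"])
```

The final comparison illustrates the point made in the text: without selecting a structure for each composition, any model built on this vector alone must assign graphite and diamond the same prediction.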
For the structural representation, we highlight several examples. For input into machine-learning-based models, ample work has been performed on the computational representation of local atomic structures; 23 notably, for use as features in interatomic potential models, we recommend a thorough review from Musil et al. 15 In these contexts, the completeness of the descriptor and the computational expense are considerations, which have subsequently given rise to many innovative ideas, such as moment tensor potentials, 24 the atomic cluster expansion, 25 or equivariant descriptors. 26 For structural descriptions, there has also been work centered on graph representations of crystalline materials [e.g., crystal graph convolutional neural networks (CGCNNs) 27–29] and their applications in predicting site-specific properties. 30
For atomistic simulations, one typically begins with some prior knowledge of the atomic structure. While macroscopic observables, such as bandgap or surface reactivity, can be very sensitive to individual phases, 31 gaining a detailed mechanistic understanding of the structure-property relationship is challenging because it is experimentally expensive to fully characterize the local atomic structure. This means that representations that correlate with material-property relationships and that can sidestep the requirement of full knowledge of the atomic structure are highly desirable. For instance, observables that can give clues to materials structure (e.g., the coordination number of a species in a measured phase) 32,33 can help yield conditions that narrow down the space of possible structures.
A well-chosen representation is itself a tool, as it enables creativity, structured thinking, and useful predictions. An example of this in the string serialization of molecules is SMILES 34,35 vs SELFIES, 36 where the latter is purpose-built for traversal of molecule representations in a latent space. Another example is periodic density-functional theory (DFT), 37 where the very idealization of a periodic material as an infinite crystal makes many problems tractable. We note that the computational formulations of representations, such as pymatgen's Structure object 38 or ASE's Atoms object, 39 are 21st-century practical advances on an established crystallographic idea in their own right. Making it efficient for researchers to rapidly generate, instantiate, and manipulate these structures on a computer saves thousands of hours of valuable researcher time and enables new feats of cheminformatic and materials informatic work. This capability highlights the serious practical benefits that come from making "human interpretable" idealizations "machine useable."

A. Moving forward: Representing disordered systems
A common adage holds that "crystals are like people: it is the defects in them that tend to make them interesting." 40 There is much interesting work to explore in ML-ready crystal representation beyond representations of average crystal structures. These structures are amenable to methods such as DFT but rely on the idealization of perfect order. The space of defective structures requires serious effort to explore tractably. An idealized single-phase bulk material cannot necessarily contain information that would be germane to experiments, such as the processing history, if it cannot be captured by defects and disorder in a relatively small unit cell. This detail proves important for observables such as electronic conductivity or catalytic activity, 41 for which very small dopant fractions can play a decisive role in altering the function of a material. 42 Zooming into any real-world material on the atomic scale, it is very likely we would find imperfections in the atomistic ordering. The long-range order of inorganic materials contains countless defects, some by design (e.g., doping in semiconductors), 43 some as a key feature of the material (such as when defects play an entropically stabilizing role in the state), 44 and some by accident, such as thermal strains 45 from unexpected temperature changes. A recent report by Chen et al. demonstrated the use of machine-learned elemental embeddings in materials graph networks to model disorder in materials and the use of multi-fidelity graph neural networks to predict bandgaps. 46
While our focus has been primarily on solid-state systems, we also present a brief case study on how non-solid-state disordered solutions, such as polymer electrolytes, are amenable to novel descriptors, where the dynamics of a polymer system are the object of study. Recent work at Toyota Research Institute 47 has found that representing trajectories in terms of combined ion clustering and time-evolving ion transport properties, as a behavior-based descriptor, can accelerate molecular dynamics (MD) simulations compared to full MD runs and also improve the accuracy of predictions compared to other commonly used descriptors, such as SMILES and molecular graphs. 48 A noteworthy feature of this work is that ML efforts that map polymer composition to the result of an MD simulation implicitly capture the full effect of all MD simulation parameters on the outcome. Prediction beyond the set of parameters used to generate the initial dataset (which had common electrolyte composition, temperature, and salt concentration) is simply not possible when the representation is "flattened" into only the identity of the polymer. This limitation calls for the development of representations that describe the behavior of the material under study (the polymer matrix) rather than simply the identity of the polymer used in simulation.

B. Moving forward: Representing processes
The sensitivity of experimental outcomes to processing parameters, combined with the expense of data acquisition, further challenges the task of evaluating individual materials. Any measurement of a material represents a "snapshot" of its state at a point in time. A holistic record of the time-evolution of a given sample measurement requires knowledge of the full chain of events imposed upon the sample until then: these events could range from individual steps in the synthesis of the sample to a destructive measurement or even simple storage on the shelf, with each event decorated by descriptive parameters (e.g., temperature of sintering or time on the shelf). One way to conceptualize this history is via a graph representation, in which a sample is entirely represented by a variable-length series of events, each of which alters its state in some way. This concept is analogous to the data structure design pattern of event-sourcing, in which a system's state is described exclusively as an ordered acyclic sequence of changes (Git being a common example). Computational tools, such as AiiDA 49,50 or StructureNL in pymatgen, 51 are designed to make it easy to navigate the full provenance of the parameters that gave rise to a calculation. For experimental workflows, tools such as DBGen 52 and ESAMP 53 are intended to facilitate data assembly with this level of exhaustive detail. 54 This form of detailed bookkeeping becoming standard practice could represent an advance in and of itself; more detailed information about the full history of a sample could make it easier to identify causal factors in, e.g., processing that could be decisive during later application. However, the routine use of complete graph-based provenance for experiments and calculations is not yet mainstream. 55 The overhead required when designing a workflow to systematically record every possible state change of the sample may be an inhibiting factor, as may the lack of a common expectation that workflows be documented in exhaustive detail.
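The event-sourcing analogy above can be illustrated with a small sketch in which a sample is nothing more than an ordered log of events, and any summary of its state is derived by replaying that log. The class, field, and parameter names here (`Event`, `replay`, `temperature_C`, `days`) are hypothetical choices for illustration and do not correspond to the schemas of AiiDA, DBGen, or ESAMP.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One step in a sample's history, decorated with descriptive parameters."""
    kind: str                                   # e.g. "synthesis", "storage"
    params: dict = field(default_factory=dict)  # e.g. {"temperature_C": 900.0}


def replay(history):
    """Derive a summary state by folding over the ordered event log.

    The log itself is the source of truth; this summary is only one of many
    views that could be computed from it.
    """
    state = {"steps": [], "max_temperature_C": None, "days_on_shelf": 0.0}
    for event in history:
        state["steps"].append(event.kind)
        temperature = event.params.get("temperature_C")
        if temperature is not None:
            previous = state["max_temperature_C"]
            state["max_temperature_C"] = (
                temperature if previous is None else max(previous, temperature)
            )
        if event.kind == "storage":
            state["days_on_shelf"] += event.params.get("days", 0.0)
    return state


# A hypothetical sample history: sinter, shelve, measure, shelve again.
history = [
    Event("synthesis", {"temperature_C": 900.0}),
    Event("storage", {"days": 30.0}),
    Event("measurement", {"technique": "XRD"}),
    Event("storage", {"days": 12.0}),
]
print(replay(history))
```

Because nothing is overwritten, the state of the sample at the time of the XRD measurement can be recovered by replaying only the events that precede it, which is the property that makes later causal analysis of processing steps possible.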

III. CHALLENGE B: UNIFYING REPRESENTATION FOR THEORY AND DIVERSE EXPERIMENTAL DATA SOURCES
Describing the complexity of a scientific challenge or the physical details of a material is complicated by the fact that we are best equipped to think in terms of idealized and abstract representations. We arrive at a central challenge: finding ways to unify representation across theory and diverse experimental data sources (see Fig. 3). Tools that allow researchers to naturally integrate information about experimental data into a representation might allow for more economical use of experimental data and acquisition of more advanced embeddings. Possible directions include combining different modes of input data sources at varying fidelities, 56,57 integration with theoretical representations or theory-generated data, 58 converting theory-generated data into a form that matches an experiment or vice versa, 59 and principled uncertainty quantification.
Despite differences in the kinds of data from computational and experimental sources, theorists and experimentalists have come up with effective schemas to better communicate with each other. 60–62 Crystallographic information files (CIFs), which have standardized file formats, are universally recognized across the scientific community. 63 CIFs can represent the output of first-principles structure optimization as well as that of refinement against single-crystal x-ray diffraction.
FIG. 3. Pathways toward closer integration of representations for experimental and computational materials.
Obtaining similar agreements in output formats is much more challenging when it comes to the representation of a material property or device performance. For example, Pourbaix diagrams are widely used as a theoretical guide to deduce thermodynamically stable phases of an aqueous electrochemical system. 64,65 To experimentally detect the degradation of a fuel-cell catalyst, 66 for example, one can measure the concentration of chemical species dissolved in the solvent via techniques such as inductively coupled plasma mass spectrometry, 67 track the deterioration of activity via electrochemical cycling tests, 68 or monitor the microscopic changes via x-ray scattering tomography in situ. 69 These three examples of experiments would track events of corrosion and serve as an experimentally measured ground truth to validate a simulated Pourbaix diagram. At the same time, we expect these three measurements to match the theory qualitatively rather than quantitatively: the experimental observations are often a convoluted sum of the property of interest plus imperfections across scales, such as defects, contamination, and the environments in which the tests were conducted, and theory itself has many limitations. In many materials science fields, it remains a challenge to develop "universal languages": schemas that effectively compare or combine information across multiple sources.

A. Moving forward: Combining heterogeneous data streams
Moving from an observation to a human-comprehensible representation requires a congruent idealization: a mental picture that agrees with the data. Tools that can automatically combine multiple data streams into a consistent microscopic/microstructural picture, possibly informed by physics, would massively accelerate the process of finding both machine-readable and human-readable representations of data. 70
Some models are so closely connected to the underlying physics that the idealization comes "for free," while others require more sophisticated analysis. Combining heterogeneous data sources will require ways to flexibly and automatically combine data from each into a representation. For example, a phenomenon such as extended x-ray absorption fine structure (EXAFS) 71 is well understood and can be approximated by an equation in which individual terms represent physical quantities in the system [see Eq. (2) of Ref. 72]. This is an example where the model and material representation are implicitly linked, and fitting a good model itself provides insights. X-ray diffraction patterns can be used to establish structural phase conditions that a candidate idealized structure must satisfy. Some forms of characterization, such as EXAFS, can be well approximated by a closed-form expression. In these contexts, the act of fitting a model to the data provides readily interpretable features of the material under observation. However, when working with data sources that have nonlinear functional forms, such as x-ray absorption near-edge structure (XANES), interpreting the data and mapping causality back to the underlying source are not straightforward, and efforts have been made to craft latent spaces that provide a physical picture for the sake of intermediate representation. 73,74
An experimentalist's physical or chemical intuition can be used to bridge the gaps among multiple, complementary forms of imaging. This process itself is complicated by epistemic (lack of knowledge) and aleatoric (random nature of events) uncertainty, as well as the fact that the sample itself can change between measurements or as a result of making a particular measurement.
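For readers unfamiliar with the EXAFS equation mentioned above, the commonly quoted single-scattering textbook form is reproduced here as an illustration of how each fitted term corresponds to a physical quantity. The notation is an assumption on our part and may differ from that of Eq. (2) of Ref. 72.

```latex
\chi(k) = \sum_j \frac{N_j \, S_0^2 \, f_j(k)}{k R_j^2}
          \exp\!\left(-2 k^2 \sigma_j^2\right)
          \exp\!\left(-\frac{2 R_j}{\lambda(k)}\right)
          \sin\!\left[2 k R_j + \delta_j(k)\right]
```

Here the sum runs over coordination shells j, N_j is the coordination number, R_j is the absorber-scatterer distance, \sigma_j^2 is the mean-square disorder in that distance, S_0^2 is the amplitude reduction factor, f_j(k) and \delta_j(k) are the backscattering amplitude and phase shift, and \lambda(k) is the photoelectron mean free path. Fitting N_j, R_j, and \sigma_j^2 therefore yields a structural idealization directly, which is the sense in which the model and the material representation are linked.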
More flexible, possibly data-driven representations could enable the combination of multi-modal data sources to inform the solution of an ideal material. 75 Toyota Research Institute's consortium has furthered efforts to make it easier to record state changes within materials, which, when multiple data sources are available, may make it easier to identify correlations between them. 53,76

B. Moving forward: Constrained algorithms and flexible theoretical representations
It is challenging to have one form of representation that can mediate among different kinds of measurement, especially when the relationships between the measurements and underlying structures are not easily determined. Representations that could accommodate imprecise knowledge of the underlying structure (that are "fuzzy"), such as a physically informed latent space, could make it easier to bridge the gap between experiment and theory. Data-driven representations of materials that can be rapidly extracted from experimental observables make this possible. Having a concrete idealization that a particular observation will map to (e.g., an XRD pattern revealing a crystal's space group) necessarily constrains the solution space. This task's difficulty also depends on the solution space, such as whether it has a discrete or a continuous representation or whether the fitting process is ill-conditioned. Even when looking within similar systems (molecules), a well-chosen representation of the structure space can enable flexible design, for example, SMILES 34,35 vs SELFIES, 36 where the latter by construction always yields a valid molecule and therefore is a more easily traversable latent space. As more data become available in the materials science community, data-driven spaces could become a viable intermediate space for materials design. For example, Mat2Vec is a space that was derived from literature-based sources, 77 which now sees common use in models such as CrabNet. 78 Furthermore, improved algorithms might flexibly incorporate constraints from experimental observables. 79
Solution spaces that are designed to be traversable (such as SELFIES) and that also can admit some uncertainty in the underlying structure could have benefits; guesses could be more easily refined in response to new information, such as different modes of characterization. In addition, "fuzzy" representations might help to address the issue of noise within experiment and theory. Already, first-principles calculations, such as DFT, owing to their quantum-mechanical and atomistic precision, require idealized unit cells. Forms of representation that describe non-idealized unit cells could aid the interaction between experimentalists and theorists. For instance, compositions are in common use for embedding because they can represent a material without precise knowledge of the structure.

C. Moving forward: Improved experimental data generation, collection, and reporting
There are fundamental tensions in the way that ML methods are practiced for training models on large datasets: some work focuses on using available materials databases 80 to train on hundreds or thousands of compounds, but it is expensive to do even one trial to study one material in great depth experimentally. Crucially, this contrasts with the high-profile achievements of ML in the commercial software space, where individual trials (e.g., for selling ads) are cost-effective and can be performed at scale. Within fundamental research, one way to rationalize the explosive success of AI in fields such as image recognition/generation 81 and protein folding 82 is the abundance of data available, where the latter is particularly inspiring due to the Protein Data Bank's centrality and importance since 1971. Improving the availability and centrality of data reporting within the materials science community could enable the development of data-driven representations and make it easier to characterize novel materials in light of what has been previously observed by other groups and, thus, to more easily move between modalities of characterization (for example, cross-referencing an XAS measurement made on a particular sample with a database of experimental measurements made in similar systems to gain more structural insight). Over the past few years, an increasing number of open databases of simulated materials structures and properties have been created within the materials community. 83 More recently, several experimental databases 84 and platforms have also become accessible, ranging from functional materials 85 to energy devices. 86

IV. CHALLENGE C: REPRESENTATIONS ACROSS SCALES, FROM MATERIAL TO DEVICE
Scientists are well trained to explain physical phenomena observed with their own eyes by using simpler abstractions as building blocks. For example, when we imagine zooming into a working battery, at a centimeter scale, engineers talk about device architecture and cell design; 87 at a nano- to micrometer scale, materials scientists study degradation using high-resolution microscopy to identify Li dendrite growth; 88 and at an atomic scale, chemists might investigate new crystal structures for a potential cathode material and strive to explain its disordered lattices from electronic structure. 89 Over the hundreds of years of development in modern science, specific languages and models have emerged at each length scale to describe the structures and mechanisms of materials. Challenges arise since models that represent a material's structure, chemistry, and function well at a specific length scale may, when zooming out, only describe what happened locally. Most models, whether a solid sphere model to represent an atom, a SMILES string to represent a molecule, 90 or a crystal graph neural network (e.g., CGCNN) to represent an inorganic material, 28 have a length scale or timescale limit within which the model reasonably represents the continuous dimensions of reality.
Materials scientists may be poised to address long-standing difficulties with trying to stitch together representations at different length scales and/or timescales using data-driven methods. Theoretical idealizations on the atomistic level tend to rely on perfect knowledge of the structure and cannot easily integrate real-world timescales and length scales. Device-level models and associated representations come with their own problems depending on the particulars of a given experiment. We may be able to draw inspiration from the multi-scale modeling community, where common representations are used to link individual length scales to the next-larger one, such as coarse-graining atoms or parameterizing individual domains of space; as in Fig. 4, machine learning may make it easier to identify and combine descriptors across length scales. In recent years, the utilization of ML techniques to extract information from large and diverse datasets, followed by generating abstract representations in the latent space, has garnered substantial attention in the field of materials science. The generated representations can be situated in a shared latent space through the integration of data from various sources, where correlations between individual measurements or calculations are captured by their relative proximity within the space. We note that this approach bears resemblance to image-to-text algorithms used in large language models, which likewise rely on a shared latent space to align textual descriptions and visual representations 91 (in one recent case, as many as six modalities). 92
Challenges in representations that can span length scales may originate in the expense of generating datasets that can be used to capture long timescale and length scale variations in the behavior of a material. Traditional representations are physics-based and thus reflect things for which we have more easily accessible idealizations. There are emerging data-driven embeddings 93 for materials and ontologies for devices, 94 which attempt to bridge the gap. Most large datasets correlate theoretical atomic structure and/or compositions with a particular property; 95 the creation of datasets that describe, e.g., battery cycling has helped to enable new fields of informatics work. 96 Closer experiment-theory connection and unified representations may help to expand the space of available data and targets.

A. Moving forward: Economical models and benchmarking
Most materials design relies on structure-property relationships that live on the same length scale. 97 For example, it is easier to connect electronic structure in a unit cell to properties such as the bandgap than to the compound's ductility. It would be desirable to find ways to decorate idealized bulk structures such that there could be some way to connect defects to the bulk, 98 possibly drawing inspiration from the field of multi-scale modeling. Since defects and short-range phenomena govern so many important performance criteria, benchmarking and accounting for these accurately are key to improving our discovery process. 99,100
There is still a need to be able to create accurate and especially transferable models from small amounts of data, as accurate materials data (especially experimental data) can be expensive to generate. 101 As the field matures, we expect to see increased use of constraints in features and models to reduce data hunger and therefore increase the scope of problems to which ML can be applied, as well as an increasing awareness of the diversity of data-economical models beyond artificial neural networks. Additionally, statisticians have known that simpler models tend to extrapolate better, and incorporating physical and chemical knowledge into model structure may help to simplify the form of models, improving generalizability and efficiency.
One challenge is that it remains unclear how multi- or cross-scale models should be benchmarked against each other and against the prior art. By contrast, there is established work that benchmarks the effectiveness of different models in active discovery against the goal of acceleration. For example, Rohr et al. 102 introduced three different metrics: active learning metrics that quantify the discovery of any "good" material, enhancement factors that quantify the improvement of the method introduced (compared to the benchmark) at a given budget, and acceleration factors that quantify the savings in the budget of the method introduced to achieve the same results as the benchmark. However, when it comes to combining representations obtained at multiple timescales or length scales into a single machine learning model to predict materials' behavior, there is a set of questions to answer before conducting a new experiment or study: (1) What is the benchmark that we are comparing against? (2) What value have we added using our method compared to the benchmark, and, specifically, how do we decide which metrics define "success"? (3) How are the materials properties in the lab or simulation connected to the actual device performance in our ML model? New benchmarks enabled by datasets, experimental datasets in particular, would be worthy targets for the community going forward.
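As a rough illustration of the enhancement and acceleration factors described above, the sketch below computes both from cumulative discovery curves. The function names, the curve format, and the example numbers are assumptions made here for illustration; they follow the verbal description in the text and are not necessarily the exact definitions used by Rohr et al.

```python
from typing import Optional, Sequence


def enhancement_factor(method: Sequence[int], baseline: Sequence[int], budget: int) -> float:
    """Ratio of "good" materials found by the method vs. the baseline at a fixed budget.

    Each curve is cumulative: entry i is the number of good materials
    found after i + 1 experiments (an assumed, illustrative format).
    """
    found_method = method[budget - 1]
    found_baseline = baseline[budget - 1]
    if found_baseline == 0:
        # Avoid division by zero; the ratio is undefined or unbounded here.
        return float("inf") if found_method > 0 else 1.0
    return found_method / found_baseline


def budget_to_reach(curve: Sequence[int], target: int) -> Optional[int]:
    """Smallest number of experiments after which `target` good materials are found."""
    for i, found in enumerate(curve):
        if found >= target:
            return i + 1
    return None  # target never reached within the recorded budget


def acceleration_factor(method: Sequence[int], baseline: Sequence[int], target: int) -> Optional[float]:
    """Budget the baseline needs divided by the budget the method needs for the same result."""
    n_method = budget_to_reach(method, target)
    n_baseline = budget_to_reach(baseline, target)
    if n_method is None or n_baseline is None:
        return None
    return n_baseline / n_method


# Hypothetical discovery curves (made-up numbers, not experimental data).
random_search = [0, 0, 1, 1, 1, 2, 2, 3]
active_learning = [0, 1, 2, 2, 3, 3, 4, 4]

ef = enhancement_factor(active_learning, random_search, budget=5)
af = acceleration_factor(active_learning, random_search, target=3)
print(f"enhancement factor at budget 5: {ef:.2f}")
print(f"acceleration factor for 3 discoveries: {af:.2f}")
```

In this toy case, the enhancement factor compares the two methods at the same budget, whereas the acceleration factor compares the budgets needed to reach the same number of discoveries; the two can rank methods differently, which is one reason the choice of "success" metric in question (2) matters.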

B. Moving forward: Embeddings, proxies, and mesoscale descriptors
Multiple recent reports have discussed the lack of mesoscale models bridging the gap between our understanding at an atomic level and at a device level. 103 Indeed, many common systems in materials science lack good physical models to fully explain complex phenomena, such as interfacial dynamics and microstructural heterogeneity in batteries. 104 Data-driven methods have recently emerged as a way to overcome this challenge of learning in a domain without decent physical models. 105,106 One way to propagate scientific laws across length scales is through embeddings. Learning the embedding of smaller constituents of a large structure followed by a combination of the embeddings provides a viable way to represent complex materials. For example, combining the learned embeddings of organic linkers 107 and inorganic nodes 23 allows us to describe hybrid organic-inorganic framework materials, such as metal-organic frameworks. 108 Inside laboratories, low-fidelity proxies are often used when high-fidelity measurements are expensive. 109 One example is the use of color change as a means of representation instead of precise bandgap measurements to track perovskite degradation under elevated temperatures and humidity. 110 Another example where tailored representations can help bridge the scale gaps is through mesoscale descriptors. Yang and Buehler have recently reported methods correlating the atomic structure with mesoscale crystal structures using large graph neural networks. 98 Through features extracted from microscopic imaging, 16 such as the shape, size, and orientation of grains in a polycrystalline alloy, one can build data-driven models to correlate compositions with the microstructural features under uniform processing conditions and to correlate microstructural features with bulk materials' properties being measured. Here, descriptors that encode microscopic information serve as an intermediate step in assisting the understanding of composition (atomic scale) and property (macroscale) relationships. The field of descriptor engineering is rapidly evolving, benefiting from advancements in high-performance computing. 111 One area of enormous opportunity lies in combining physics and data-driven representations for explainable property predictions, which may have the effect of allowing researchers to discover new empirical laws. 112
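The idea of representing a framework material by combining embeddings of its constituents can be sketched as follows. This is a minimal illustration only: the embedding tables, their values and dimensions, and the `framework_embedding` helper are hypothetical, and mean-pooling followed by concatenation is one simple choice of combination, not the method used in the cited works. In practice, the constituent embeddings would come from trained models.

```python
from typing import Dict, List, Sequence

# Hypothetical placeholder embeddings; real ones would be learned from data.
LINKER_EMBEDDINGS: Dict[str, List[float]] = {
    "BDC": [0.2, 0.7, 0.1],
    "BTC": [0.5, 0.1, 0.9],
}
NODE_EMBEDDINGS: Dict[str, List[float]] = {
    "Zn4O": [1.0, 0.0],
    "Cu2": [0.3, 0.8],
}


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a set of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def framework_embedding(linkers: Sequence[str], nodes: Sequence[str]) -> List[float]:
    """Combine constituent embeddings into one fixed-length framework descriptor.

    Linkers and nodes are pooled separately so that the result has the same
    length regardless of how many distinct constituents the framework has.
    """
    linker_part = mean_pool([LINKER_EMBEDDINGS[name] for name in linkers])
    node_part = mean_pool([NODE_EMBEDDINGS[name] for name in nodes])
    return linker_part + node_part  # list concatenation


# One linker type and one node type.
single = framework_embedding(["BDC"], ["Zn4O"])
# A hypothetical mixed-linker framework maps to a vector of the same length.
mixed = framework_embedding(["BDC", "BTC"], ["Cu2"])
print(single)
print(len(single), len(mixed))
```

Because the combined vector has a fixed length, it could be passed to any downstream property model; richer combinations (for example, ones that encode connectivity between nodes and linkers) would require a different design.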

V. CONCLUSION
In conclusion, toward the goal of improved inverse and forward models, we articulate three central challenges for representation development for ML in materials science: representations that support a richer description of materials' complexity, unifying representations for theory and diverse experimental data sources, and representations that can span multiple timescales and length scales. We emphasize that a significant benefit would be easier integration of ML into regular experimental practice, as we should not lose sight of the fact that machine learning, while an interesting object of study in its own right, still has tremendous untapped potential in enabling better materials science and engineering. In this Perspective, we identify promising directions that have emerged for each of these challenges and hope that this can serve as an inspiration for future researchers engaging with these topics.

FIG. 1. Examples of common forward and inverse problems in materials science, with a focus on structure-property relationship modeling. The forward model maps from knowledge of the material to predictions of consequent properties; the inverse model maps from observations to conceptions of the material congruent with what was measured.

FIG. 4. The goal of combining features from multiple length scales into an end-to-end machine learning framework.