Computational capability has enabled materials design to evolve from trial-and-error toward more informed, data-intensive methodologies. Expert-designed tools and their underlying databases facilitate modern high throughput computational methods. Standard data formats and communication standards increase the impact of traditional data, and applying these technologies to high throughput experimental designs provides dense, targeted materials data that are valuable for materials discovery. Integrated computational materials engineering requires both experimentally and computationally derived data. Harvesting these comprehensively requires methods with varying degrees of automation to accommodate variety and volume. Issues of data quality persist independent of data type.

The Materials Genome Initiative (MGI), established in 2011, has the goal of doubling the pace of advanced materials discovery, innovation, manufacturing, and commercialization in the United States. Six federal agencies participate, with over 400 million dollars committed to date for research across industry and academia. MGI aims to:

  • Integrate research efforts across the US.

  • Integrate experimental, computational, and theoretical research tool development.

  • Make digital data available.

  • Create a world class materials science and engineering workforce.

Advances in computing and informatics have aided materials discovery through systematic computational experiments powered by databases of first-principles calculations and experimental data. Published data in science in general are estimated to be increasing exponentially at 4.7% per year.1 It is critical for the users of reported data to have accurate measurements; however, if we assume a constant rate of errors (such as typographical errors) in what is published annually, the total amount of erroneous data is also increasing at the same rate as publications. High quality databases can provide measurements with uncertainties, provenance of the data, and control data persistence, propagating data corrections and warnings about data sets with known problems. Human errors, inevitable during document formatting and data transfer for publication, can typically be caught with basic visualization. Unfortunately, these checks are typically not performed by reviewers prior to publication.2 Data capture tools used in conjunction with comprehensive databases can catch typographic errors prior to final publication and cross-reference prior measurements against the soon-to-be-published data, providing some guarantee of data integrity. A shift in the review process at a critical mass of high impact journals is necessary to reduce the publication and propagation of erroneous data, placing more responsibility on journals to promote high quality data practices and on authors to meet those requirements.

Comprehensive properties databases also facilitate discovery of coverage gaps within the literature. Such coverage gaps can be quickly and efficiently populated via High Throughput Experimental Methods (HTEMs). HTEM experiments are well designed, automated measurements conducted over specific compositional, temperature, and pressure ranges, resulting in densely populated regions of experimental measurements. These combined results can then be fed forward into High Throughput Computational Methods (HTCMs). Such methods model large amounts of data to investigate materials space for potential new materials as determined by optimized parameters or “descriptors”3 developed for specific applications. Synergy between HTCM, HTEM, and data collection, communication, and curation is fundamental to achieving many of the goals of MGI.

This viewpoint paper focuses on experimental data in particular and the associated communication, curation, and collection issues and potential improvements.

Awareness of the importance of processing, structure, properties, and performance has existed for millennia. In the Chinese Bronze Age, composition effects were known to play a role in metals,4,5 as were the effect of carbon on the hardness of steel and the effect of mechanical deformation on strength, as illustrated by Damascus steel.6 The relationship between structure and properties was predicted to play a role by René de Réaumur7 in the 18th century. Attributing credit for the discussion of the relationship between properties and performance is difficult, as its presence throughout history is ubiquitous.

The trial-and-error method for alloy design was relied upon until the 18th or 19th century. Gibbs provided the basis for modern methods of alloy design by applying thermodynamics to explain phase equilibrium.8–10 These established equilibrium conditions remain important for developing adequate processing conditions to produce a desired material structure and resultant property set. Understanding of crystal structure grew over time and became one of four factors, along with electronegativity, valence, and atomic radius, used in the Hume-Rothery rules (1926)11 to predict solid solubility in mixtures of two metallic elements. These rules predict whether intermetallic precipitates form in a material, which are now known to be important for strength and ductility in alloys. Flow block diagrams lead to an understanding of how processing alters structure to produce the desired properties of a material. These systems-level approaches to materials design integration illustrate the complex relationships essential for designing alloys for performance.12

While phase diagrams still act as roadmaps for alloy development using temperature, pressure, and composition for two-, three-, four-, and five-component systems, computational tools have been developed to model the higher dimensional data that are not easily represented, enabling phase predictions for complex alloy systems. These tools rely on information generated from atomistic and molecular dynamics models developed from the 1930s onward, including Monte Carlo (MC), the cluster variation method, Density Functional Theory (DFT), and phase field modeling (PFM).

Early efforts to calculate binary phase diagrams were pioneered by Van Laar,13,14 while most modern efforts for binary, ternary, and multicomponent phase boundaries use the CALPHAD (CALculation of PHAse Diagrams) method developed by Kaufman and Bernstein in 1970.15 They provided descriptions of the general features of the method, including detailing the computer programs for calculating binary and ternary phase diagrams. CALPHAD databases cover a broad range of fundamental thermomechanical and thermophysical properties spanning solution thermodynamics, molar volumes, diffusivities, and elastic constants.16 In recent years, “first principles” calculations from DFT have increasingly gained importance to supplement CALPHAD databases. Today’s thermodynamic descriptions can reliably produce phase diagrams and thermophysical and thermomechanical properties within the range of experimental errors.3
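
For orientation, a standard textbook form of the kind of model underlying such thermodynamic descriptions (not reproduced from this paper) is the molar Gibbs energy of a binary substitutional solution phase with a Redlich-Kister excess term,

\[ G_{\mathrm{m}} = x_A\,{}^{\circ}G_A + x_B\,{}^{\circ}G_B + RT\left(x_A \ln x_A + x_B \ln x_B\right) + x_A x_B \sum_{\nu \ge 0} {}^{\nu}L_{A,B}\,(x_A - x_B)^{\nu}, \]

where the \({}^{\circ}G_i\) are the Gibbs energies of the pure elements in the phase of interest and the interaction parameters \({}^{\nu}L_{A,B}\) are the coefficients assessed from experimental and, increasingly, DFT data.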

The Integrated Computational Materials Engineering (ICME) method is a powerful tool. However, just as the users of a calorimeter must understand how temperature, pressure, and heat evolution are monitored, the users of ICME must understand the algorithms and the data from which the models are developed. ICME consists of two main components, the databases and the computational materials design algorithms/software, that, when combined, are used for concurrently engineered systems.17 Unfortunately, data provenance and uncertainties are lost during the development of the correlations that underpin these methods, and users often fail to consider the impact this can have upon a final design. With a free flow of information and dynamic data evaluation, new experimental data will be fed into an ecosystem that is capable of updating these recommendations transparently and without human intervention.

The diversity of software used in ICME is increasing, with various implementations continuing to pursue similar goals. Software like FactSage, Pandat, or Thermo-Calc66 is already used to predict phase diagrams for multicomponent systems. Furthermore, these systems can be integrated with diffusion data to predict phase transformations or precipitation kinetics using software such as DICTRA, Pandat, or TC-PRISMA.66 The combined kinetic and thermodynamic data can then be used to predict phase stability, which is important for finite element analysis or phase field simulations for engineering purposes. ICME approaches are used in many diverse fields, from developing welding processes18–21 (utilizing relationships between solidification structure and composition and/or mechanical strength) to multiscale materials property predictions for composite materials22 or surface activity for the promotion of chemical reactions.23

The ICME process was used in the development of Ferrium S53,16 a corrosion-resistant replacement for cadmium-coated landing gear steels. The process initiated with the development of a flow block diagram to illustrate the important processing-structure-property-performance relationships that would need to be modeled. Next, tradeoffs were determined using computational methods to optimize toward the desired properties (strength, corrosion resistance, core toughness, and fatigue resistance), and a new model was developed24 to address the passive film formed on the steel. Finally, an iterative design and prototyping process was conducted (including full-scale microstructural characterization that provided experimental data for the models), where local correction factors were fed back into the design models to increase accuracy in the next composition region of interest. The total time for alloy development was three years, comprising five iterations.

HTEMs are defined as “methodologies which allow accelerated synthesis and testing of materials for optimized performance, are uniquely suited to rapidly generate high quality experimental data, and hence represent the key enabling technologies to bring the computational materials design efforts of MGI to fruition.”23 These studies allow data regions that historically have not been measured quickly to be surveyed for the express purpose of feeding databases critical for materials discovery. An additional benefit of HTEM studies is the fundamental research opportunities (new fundamental questions and modifications to testing procedures) that develop during the data collection.

HTEM is essential to the future development of the CALPHAD method, though the methods are limited by experimental equipment; the development of experimental equipment for HTEM is imperative for validating CALPHAD databases. Controlled compositional gradients can be generated via physical vapor deposition, but the structural properties being measured are impacted.25,26 In an assessment of high entropy alloys, Miracle states, “a successful structural material must simultaneously satisfy a demanding balance of over a dozen different properties, and high throughput tools are not fully developed for all of these properties.”27 Furthermore, Miracle highlights that for CALPHAD predictions, melting temperature data are essential, yet there are no HTEM methods to measure this property. Methods also do not exist for essential engineering properties like yield and ultimate tensile strength, ductility, damage tolerance, creep strength, and fatigue, all of which require extensive microstructural characterization.

New methods capable of collecting data continue to be developed, e.g., the fairly new local electrode atom probe (LEAP) and focused ion beam (FIB) serial sectioning techniques that let microstructural characterization span from the nanometer to the millimeter scale, allowing for microstructural reconstruction and new insight and heuristics for future materials design. Presently, we are limited experimentally by the ability to extract data from imaging as a quantitative tool;28 our understanding and ability to harness experimental information have been limited, particularly in the active domains of 3D and 4D imaging. In a recent study by De Geuser et al.,29 the precipitation kinetics in Co-Cu alloys were monitored using small angle x-ray scattering at a synchrotron beamline by simultaneously monitoring a functionally graded specimen ranging in composition from 0 to 2 wt. % cobalt. From one experiment, they were able to cover a wide range of supersaturations and aging times for a range of compositions, repeating at two additional temperatures and comparing to standard precipitation models. The utility of data collected from these new techniques and from HTEM can only increase with improved data infrastructure for ICME.

HTCM is a combination of advanced thermodynamic and electronic structure methods with intelligent data mining and database construction on supercomputer architectures.3 The diversity of HTCM techniques matches the complex nature of materials problems. For future materials discovery, it is important to ensure that the algorithms and processes behind the computational methods are documented and that the future workforce is educated in these algorithms and processes. In HTCM, researchers are able to condense thousands or millions of compositions or components of interest down to a manageable number of experiments via optimization and complex search algorithms.

For example, McLinden et al.30 used evolutionary algorithms as a screening method to determine potential future organic refrigerants with low global warming potential (GWP) as well as a number of other criteria to be met. These criteria included a short atmospheric lifetime, ideal thermodynamic properties for cooling, low flammability and toxicity, and cost, among other metrics. The initial dataset was populated from a public-domain database with 100 × 10^6 potential compounds, from which 56 000 candidate molecules composed of a limited set of elements and having 15 or fewer atoms per molecule were selected. For these compounds, the GWP, flammability, critical temperature, and other thermodynamic factors were estimated and filtered to remove any compounds with toxic or unstable functional groups,31 resulting in 62 potential liquid refrigerants. Most of the resulting candidate compounds are used today, with only a few recommended potential refrigerants suggested for future research. A similar screening methodology was used in single crystal turbine alloy design, reducing the number of candidate compositions from 100 000 to 6 potential alloys to investigate, with descriptors including estimates of creep resistance, microstructural stability, density, cost, and castability.32
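
A minimal sketch of this style of descriptor-based screening is given below; the candidate records, property names, and numerical thresholds are hypothetical placeholders rather than the actual criteria or data of Refs. 30 and 32.

```python
# Hypothetical screening sketch: successively filter candidate molecules by
# estimated descriptors, in the spirit of the refrigerant study described above.
candidates = [
    {"name": "cand-1", "gwp": 4.0,    "critical_T_K": 370.0, "flammable": False, "toxic": False},
    {"name": "cand-2", "gwp": 1300.0, "critical_T_K": 345.0, "flammable": False, "toxic": False},
    {"name": "cand-3", "gwp": 2.0,    "critical_T_K": 420.0, "flammable": True,  "toxic": False},
    {"name": "cand-4", "gwp": 6.0,    "critical_T_K": 380.0, "flammable": False, "toxic": True},
]

# Each filter encodes one descriptor-based criterion; thresholds are illustrative only.
filters = [
    lambda c: c["gwp"] < 150.0,                    # low global warming potential
    lambda c: 300.0 < c["critical_T_K"] < 400.0,   # thermodynamically suitable for cooling cycles
    lambda c: not c["flammable"],
    lambda c: not c["toxic"],
]

surviving = candidates
for criterion in filters:
    surviving = [c for c in surviving if criterion(c)]

print([c["name"] for c in surviving])  # -> ['cand-1']
```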

Other researchers use genetic algorithms combined with thermodynamic modeling33 for alloy design or multiphysics modeling combined with neural networks for materials optimization.34 Discussion of neural networks in alloy design exists,34–36 where the need for undergraduate education in neural networks is proposed, given their extensive use in process control, process design, and alloy design. Bayesian methods taking advantage of ab initio and machine learning techniques have been used to predict properties; however, these methods require accurate property predictions or experimental data,37 and the quality of their output depends strongly on input data quality.34,37–41
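
As a generic illustration of the surrogate-modeling idea referenced above (not the specific methods of Refs. 33–41), the sketch below fits a Gaussian process to a handful of hypothetical composition-property measurements and reports a predictive uncertainty alongside each prediction; the training values and kernel settings are placeholders.

```python
# Minimal surrogate-model sketch: predict a property from composition and flag
# predictions with large uncertainty for human review. All data are hypothetical.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical training set: binary alloy compositions (mole fractions) and a measured property.
X_train = np.array([[0.10, 0.90], [0.25, 0.75], [0.50, 0.50], [0.75, 0.25]])
y_train = np.array([410.0, 455.0, 520.0, 480.0])  # placeholder property values

# The white-noise term encodes the claimed experimental scatter.
kernel = RBF(length_scale=0.2) + WhiteKernel(noise_level=25.0)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X_train, y_train)

# The returned standard deviation shows where the model is extrapolating and
# where new experiments would therefore be most valuable.
X_new = np.array([[0.40, 0.60], [0.90, 0.10]])
mean, std = gp.predict(X_new, return_std=True)
for x, m, s in zip(X_new, mean, std):
    print(f"composition {x}: predicted {m:.0f} +/- {s:.0f}")
```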

Infrastructure to collect, curate, and communicate heterogeneous data effectively is essential for enabling researchers to generate reliable predictions of material behaviors.17 Developing comprehensive databases with quantified reliabilities will be the keystone for automated, next-generation material development. In order to maximize impact for a given data resource, it must be considered in the context of the data ecosystem in which it operates. Considering elements such as accessibility, volume, reliability, and depth of metadata characterization is essential to guarantee value in a broader global information system. Drawing quantitative and salient connections between apparently disparate datasets fundamentally relies on communication of the full suite of parameters, including chemical compositions, processing history, and the final observed properties of the material.42 

To address matters of data accessibility for the field of molecular liquids, the National Institute of Standards and Technology (NIST) has developed a number of informatics tools to help with literature identification,43 manuscript digitization,44 data communication,45,46 and data evaluation/comparisons.47–50 These technologies are expected to be expanded to support metal-based systems in the near future. Of particular note is how this suite of technologies serves to underpin a collaborative effort between the NIST Thermodynamics Research Center (TRC) and the five major journals that publish the majority of organic liquids thermophysical property data to facilitate data quality and reliable dissemination.

In the general case, knowledge acquisition is a bottleneck for the success of expert systems and machine learning algorithms. Accumulation of data is expensive, time consuming, and error prone, especially if content quality is held to a high standard. Automation substantively alleviates many of these challenges but is subject to the same limitations in the development process, and present technologies still require non-negligible human oversight. Moreover, as data volume growth rates continue to accelerate, the critical evaluation process becomes a bottleneck unless it, too, is dynamic, accepting and propagating new information as it becomes available.

The complex and multidimensional nature of materials data requires either advanced algorithms for data collection or expert analysis, and often requires both. The publication rate of thermophysical property data doubles about every twelve years,2 while “data” in general double on a 16-year horizon.1 Consequently, errors in data, inherent to human-computer interaction and (depending on definitions) present in roughly one third of all manuscripts historically,2 are also growing exponentially. Preventing erroneous information from corrupting data resources and overwhelming the capability to discern the good from the bad requires automated and enforced data integrity checks. Data can be verified for consistency with physical laws, checked for self-consistency and smoothness relative to claimed uncertainty, and verified against the existing literature. It is essential to emphasize that failure of any consistency check does not necessarily indicate problematic data; rather, it provides the capability to invoke human review to determine whether the set is problematic, whether the uncertainties are overly optimistic, or indeed whether new physics is appropriate to describe the system. Application of HTCM in cross validation of these datasets fundamentally combines the physical law and existing literature checks, and therefore care must be taken that contradictory data are not dismissed out of hand.
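
A minimal sketch of such automated checks is given below, assuming a simple property-versus-temperature data set; the values, the smoothing model, and the rejection threshold are hypothetical and are not taken from any database described here.

```python
# Minimal sketch (assumptions, not TDE code): automated integrity checks applied
# to a reported property-vs-temperature data set with a claimed uncertainty.
import numpy as np

T = np.array([300.0, 320.0, 340.0, 360.0, 380.0, 400.0])         # K
rho = np.array([8899.0, 8885.0, 8871.0, 8980.0, 8843.0, 8829.0])  # kg/m^3, one suspect point
u_rho = np.full_like(rho, 5.0)                                    # claimed standard uncertainty

# 1) Physical-law / sanity check: density must be positive and finite.
assert np.all(np.isfinite(rho)) and np.all(rho > 0.0), "non-physical density value"

# 2) Smoothness relative to claimed uncertainty: fit a low-order polynomial and
#    flag residuals that greatly exceed the reported uncertainty for human review.
coeffs = np.polyfit(T, rho, deg=2)
residuals = rho - np.polyval(coeffs, T)
suspect = np.abs(residuals) > 4.0 * u_rho
for Ti, ri, flag in zip(T, residuals, suspect):
    if flag:
        print(f"T = {Ti} K: residual {ri:+.1f} kg/m^3 exceeds 4x claimed uncertainty -> human review")
```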

The data collection methodology adopted by TRC at NIST, used in practical workflows for the past 15 years, relies heavily on human interpretation. Initial data capture is performed using form-driven software to guarantee well-formedness and completeness of metadata, and rudimentary visualizations performed at the earliest stages of capture serve to prevent the majority of errors introduced in the capture process and provide initial sanity checks on reported uncertainties. Following independent review of the entered data for consistency with the reported data, the captured information is processed by ThermoData Engine (TDE),47–50 an expert system implementing data integrity checks of the type previously discussed. Frequent errors found in reported tabulated data include numerical typos, juxtaposed characters, column switches, fill-down errors, and copy-and-paste errors. These can often be repaired to a substantial degree at capture time and, in the worst case, can be selectively marked as irreparable. In contrast, if data are reported in the form of a regression only, any error in manuscript preparation will render the entire set unusable. If data are irreparable, the metadata surrounding the reported data are still captured so as to avoid duplication of effort in case of rediscovery of the problematic document. Performing these checks prior to publication makes it possible to engage authors while the study is still fresh in their minds, which often makes data repair possible.
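
The sketch below illustrates, under stated assumptions, capture-time screening for two of the tabulation errors named above (fill-down duplication and switched columns); it is a hypothetical illustration, not the TRC capture software, and the thresholds and expected ranges are placeholders.

```python
# Hypothetical capture-time screening for common tabulation errors.
import numpy as np

def flag_fill_down(values, max_repeats=3):
    """Flag runs of identical values longer than max_repeats (possible spreadsheet fill-down error)."""
    values = np.asarray(values)
    flags, run = [], 1
    for i in range(1, len(values)):
        run = run + 1 if values[i] == values[i - 1] else 1
        flags.append(run > max_repeats)
    return [False] + flags

def flag_column_switch(column, expected_range):
    """Flag a column whose median falls outside the physically expected range
    (e.g., a temperature column accidentally holding pressures)."""
    lo, hi = expected_range
    return not (lo <= float(np.median(column)) <= hi)

temperature = [293.15, 298.15, 303.15, 303.15, 303.15, 303.15, 313.15]  # K
print("possible fill-down rows:", [i for i, f in enumerate(flag_fill_down(temperature)) if f])
print("temperature column suspect:", flag_column_switch(temperature, (200.0, 2000.0)))
```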

Efforts have been made to automate the data collection process to varying degrees. Ultimately, interpretation of reported information requires a human decision, though this is abstracted away from the time of capture more heavily as the degree of automation is increased. These decision trees become quite complex if automated systems are expected to consume both new and historic data, as reporting standards and methods have changed substantially and repeatedly in the centuries since thermophysical properties were first reported. Frameworks completely lacking in human intervention following the programming stage3 will likely be essential in addressing issues arising from data publication growth; however, human interpretation will continue to be essential in collection of historical archives.

Multiple intermediate steps exist between collection of information from an experimental apparatus and inclusion of that data into design processes. Each step involves storing data in some intermediate format, and transformation between those states requires some mutually agreed upon communication structure. Within the context of communication of thermophysical property data, NIST and collaborators have developed standards under the auspices of the International Union of Pure and Applied Chemistry (IUPAC) for both data reporting in publication51 and machine-machine transmission of thermophysical property data.45,46

Any scientific report, such as a scientific article, should be a clear and complete communication of an experimental result and the methods utilized to obtain it. This informs the research community how to replicate the result and how to judge the quantitative reliability of the result, particularly in light of potential new information about methodological limits. In a broad sense, this means reporting experimental results must balance conforming to fixed patterns of data representation with the flexibility to offer critique and express novelty. Defining which information should be provided and defining its method of presentation not only facilitate clear communication of quantitatively important information but also allow for efficient verification of the completeness and consistency of that information.

For quantitative information that is presented in a manuscript format, the clearest disposition is a “standalone table,” where all experimental data are in a table that alone contains all the necessary experimental details to fully interpret the result.51 Examples of such tables can be found in Tables I–V. Table I illustrates characterization of material samples to be studied individually or as part of a multicomponent system. Examples are provided for an organic fluid, an organic crystalline material, and a pure metal, while Table II illustrates an example of reporting the compositions of initial materials and a synthesized alloy. While details of formats may vary between journals, minor variations are irrelevant so long as the critical data are presented: material supplier, methods of purity determination, determined impurity contents, fully defined units, the molar/mass/volume basis for any composition provided (mass/molar/volume percent/fraction/parts per million), and standard or expanded uncertainty. Any processing, as well as the initial and post-processing purity, should additionally be reported. For properties, any applicable statistical information (k-values, etc.) can also prove useful.

TABLE I.

Sample table of sample descriptions.

Chemical/material name  Source  Initial purity (mol. %/wt. %)a  Purification method/processing history  Final purity (mol. %/wt. %)  Analysis method
Heptane  Aldrich  98 mol. %  Distillation  99.7 mol. %  GCb
THAc  Synthesis  …  Recrystallization  99.98 mol. %  FMd
Aluminum  Alcoa  99.99 wt. %  Vacuum annealed 450 °C, 0.1 MPa, 1 h  99.9 wt. %  AESe

a Initial purity stated by manufacturer.
b Gas-liquid chromatography.
c THA is the abbreviation for 1,2,3,4-tetrahydroanthracene.
d FM is fractional melting.
e Auger electron spectroscopy. Impurity contents (in wppm): 60 ± 5 Cr, Ni, Mn; 10 ± 5 S, P, C, O. Standard uncertainties (k = 1).

TABLE II.

Chemical composition of the arc-melted and homogenized Fe-15 wt. % Cr alloy determined by inductively coupled plasma–optical emission spectroscopy at p = 0.1 MPa. Pure elements of 99.99 wt. % purity obtained from Sigma-Aldrich were used for making the alloys. Table from Ref. 52.

Element  Composition (wt. %)  Standard uncertainty
Cr  14.86  0.001
    30 ppm by mass  2 ppm by mass
Fe  Balance  0.001
TABLE III.

Experimental data for chemical analysis of specimens by phase. Experimentally determined phase compositions (in at. %) of the ternary Bi-In-Ni alloys at T = 100 °C and pressure p = 0.1 MPa. Compositions determined by scanning electron microscopy–energy dispersive spectroscopy and X-ray diffraction.a,b

Sample bulk composition (at. %)  Experimentally determined phases  Bi content (at. %)  In content (at. %)  Ni content (at. %)
60.5 Bi  Ni2In3  1.2 ± 0.8  59.46 ± 1.1  39.34 ± 1.2
29.8 In  Bi  98.56 ± 0.5  1.06 ± 0.7  0.38 ± 0.4
9.7 Ni  BiIn  49.22 ± 1.2  49.89 ± 0.8  0.89 ± 0.6
40.2 Bi  Bi3Ni  73.45 ± 1.9  0.71 ± 0.5  25.84 ± 0.9
19.8 In  ζ(Ni13In9)  0.32 ± 0.5  41.04 ± 0.8  58.64 ± 0.8
40.4 Ni

a Standard uncertainties (k = 1) are u(T) = 1 K and u(p) = 5 kPa. The values of the expanded uncertainty (k = 2), u(xi), are given in the table.
b The experimental data in this table were abstracted from Ref. 53.

TABLE IV.

Smoothed values of the volumetric properties of nickel in the solid and liquid states at p = 0.1 MPa.a,b

Phase  Temperature (K)  α (10−6 K−1)  β (10−5 K−1)  u(α,β) (%)  Density (ρ) (kg m−3)c  u(ρ) (%)
FCC  273.15  12.8  3.83  1.2  8899  0.05
     400  14.4  3.95  1.1  8890  0.05
TCd  628  18.2  5.43  1.6  8757  0.05
     900  17.5  5.19  0.9  8638  0.09
Tf−e  1728  25.1  7.35  0.8  8210  0.08
Tf+e  1728  …  8.81  2.5  7824  0.18
     1800  …  8.87  2.5  7774  0.02
Melt  1900  …  8.95  2.5  7705  0.22

a Expanded uncertainties (k = 2) are u(T) = 0.01 K and u(p) = 1 kPa; u(ρ), u(α), and u(β) are given in the table.
b The experimental data in this table were abstracted from Ref. 54.
c Measured with dilatometric technique and γ-ray attenuation.
d TC is the Curie point; TC = 629.6 K and u(TC) = 5 K.
e Tf− and Tf+ are below and above the freezing point; Tf = 1728.1 K and u(Tf) = 3 K.

TABLE V.

Example table for phase transition temperatures of binary Al100−xInx alloys measured at p = 0.1 MPa by differential scanning calorimetry.a,b

In content x/wt. %  Liquidus temperature TL−(Al−In)/K  Liquid mixing temperature TL−mix(Al−In)/K  Monotectic temperature TM−(Al−In)/K  Eutectic temperature TE−(Al−In)/K
    932 ± 1  …  910 ± 1  429 ± 11
10  926 ± 11  …  910 ± 11  429 ± 11
20  …  946 ± 5  910 ± 11  429 ± 11
40  …  1053 ± 5  910 ± 1  429 ± 1

a Standard uncertainties (k = 1) are u(T) = 0.01 K, u(x) = 0.3 wt. %, and u(p) = 1 kPa; uncertainties of the measured transition temperatures are given in the table.
b The experimental data in this table were abstracted from Ref. 55.

In the case of metal-based systems, many of the details important for thermophysical measurements are often omitted in the literature. For composition, “percent” or “ppm” is stated without consideration of its molar or mass basis. Grain size is not reported when resistivity measurements of metals are made, nor are dislocation densities reported in most cases, leaving the results unclear. Hence the importance of metadata, like processing history, that provide insight to data users; knowledge of the annealing or heat treatment temperatures, times, pressures, and atmospheres a material experienced allows data users to assess the state of the measured material (dislocation density, additional impurities, etc.) when the authors cannot perform the actual measurements. Another commonly unreported value is phase fraction. These measurements are important in multiphase materials; however, they are time intensive and require stereological assumptions regarding grain boundary character, etc., unless some of the newer, expensive technologies such as atom probe tomography, electron backscatter diffraction, or metallographic serial sectioning grain reconstructions are employed. Historically, the identification of phases is not well represented, discussed, or reported in data tables or plots, leaving ambiguity in archival data. Furthermore, in measurements of multiphase systems, it is important to quantify or qualify the compositions of the phases in the materials when possible. Table III provides an example of well represented composition measurements of equilibrium phases in a multicomponent system.

Communication of phase transitions can also be ambiguous. Table IV illustrates well documented measurements in which it was noted that the data were smoothed and the transitions are clearly described in the table. The clarification of which phases are present is valuable information that often goes unreported. Table V provides an example of recorded data associated with phase transitions found to occur in the Al-In system. The text in the document defines the reactions stated in the table, giving the authors’ interpretation of which measured values were associated with each transformation reaction. The pairing of Figure 1 and Table V ensures that the reactions can be accurately identified together; there is no ambiguity about which points on the figure are associated with which reaction. Again, all variables and their uncertainties are completely defined within the table.

FIG. 1.

Example of a complementary phase diagram for Table V: the binary Al-In phase diagram. The symbols represent the experimental data. •: liquidus and liquid phase separation; □: monotectic reaction; and Δ: Al-In eutectic transformation. Standard uncertainties (k = 1) are u(p) = 1 kPa and u(TL−mix) = 5 K for the liquid mixing temperature; that of the other measured characteristic temperatures is u(T) = 1 K.

Furthermore, pertinent data can be displayed in the table heading or footnotes, as illustrated in the tables, to communicate statements such as that the data are smoothed or to define the system state. All variables are clearly and completely defined within the table or its footnotes, and uncertainties are reported immediately adjacent to the experimental observations if they vary between data points or within the footnotes if they do not. Presenting the uncertainties in the vicinity of the values of the observables, rather than embedding them in the prose of an experimental methods discussion, leads readers to consider the uncertainties together with the values.

The above discussion should not be considered comprehensive but rather focuses strongly on a small number of shortfalls within the thermophysical scope. If the application of the data discussed above were taken to be accurate prediction of microstructure, understanding of unaddressed elements such as diffusion kinetics and shear transformation behavior would be essential. Development of reporting standards requires broad discussion within the community and an understanding of applications. By communicating to the experimental community which data are essential to accurate modeling and by understanding how real experimental conditions should be translated into modeling assumptions, the impact and efficiency of each experimental effort can be maximized.

In contrast to the human-to-human dynamic of manuscripts, machine-to-machine communication cannot leave room for interpretation. Any value contained in a data communication standard that is not explicitly numerical, a member of an enumerated set, or a stable, external identifier is only useful for display to a human operator. As previously discussed, the report of novel experimental data or methods contains information of a nature that will not have been encountered previously. This means that there will likely be information loss in mapping data into a rigid format; whether that information loss is significant depends heavily on the intended use cases.

In designing a communication standard between two computers, two layers must be considered: the semantic and the schematic. The semantic layer, or data serialization format, dictates the protocols by which a communicated document should be decoded into an in-memory representation. The schematic layer provides the necessary information for understanding the meaning of fields and how those fields are related. For a document to communicate bibliographic information, the schema would indicate that the document should contain exactly one year, a list of authors’ names, and the name of exactly one journal. The serialization format would indicate formalisms of text encoding (e.g., escape characters and text delimiters) within data fields and how to associate sets of characters with the schema’s fields.
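
As a concrete, hypothetical illustration of the two layers (using the bibliographic example above, not ThermoML or any published standard), the sketch below checks a record against a simple schema and then serializes the same record as both JSON and XML.

```python
# Minimal sketch of the two layers described above; the field names are the
# bibliographic example from the text, not part of any standard.
import json
import xml.etree.ElementTree as ET

record = {"year": 2015, "authors": ["A. Author", "B. Author"], "journal": "J. Chem. Eng. Data"}

# Schematic layer: the record must contain exactly one year, a list of author
# names, and exactly one journal name.
def conforms(rec):
    return (isinstance(rec.get("year"), int)
            and isinstance(rec.get("authors"), list)
            and all(isinstance(a, str) for a in rec["authors"])
            and isinstance(rec.get("journal"), str))

assert conforms(record)

# Serialization layer: the same record rendered in two common formats.
print(json.dumps(record))                       # JSON text

root = ET.Element("citation")
ET.SubElement(root, "year").text = str(record["year"])
for name in record["authors"]:
    ET.SubElement(root, "author").text = name
ET.SubElement(root, "journal").text = record["journal"]
print(ET.tostring(root, encoding="unicode"))    # XML text
```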

Examples of semantic definitions include XML (Extensible Markup Language), JSON (JavaScript Object Notation), and CSV (comma-separated values) as well as a host of proprietary formats. The most common formats at present on the World Wide Web are XML and JSON. Both of these were constructed with the goal of providing a data format that is human-legible, well defined, and capable of representing complex data relationships. XML tends to be more rigid and more verbose and has more fully developed schema communication technologies. While JSON Schema has been proposed, its standard is presently still a draft56 whereas the XML Schema Definition (XSD) has been a standard since 2001.57 

For distribution of thermophysical property data, TRC led the development of the XML-based IUPAC standard ThermoML. Since 2003 and in the context of the NIST-journal cooperation, data published in collaborating journals have been captured and serialized by NIST staff and shared in a free and open archive.58 Within the chemical process simulation community, ThermoML is a common data import format. There also exist a number of tools within the community for direct consumption and interpretation of ThermoML, such as the ThermoML Opener, which converts ThermoML files easily into spreadsheet formats,58 and the Python-based ThermoPyL parser.59,60

ThermoML was developed with thermophysical properties of molecular organics as its focus. While multiple extensions to the standard have since been published, the standard is limited in its representation of phases and material processing, as these were not significant in the initial use cases or development. Substantial revisions are presently underway to address these limitations.

As the body of knowledge continues to grow exponentially, the importance of minimizing barriers to information propagation increases as well. In traditional critical evaluation processes, the evaluator must spend valuable time reviewing the literature to collect all relevant data, including errata and corrigenda, use their best judgement as to the relative reliability of each value, and format and disseminate final results. The import and export stages can be substantially expedited by proper indexing and development of communication standards, but so long as the middle stage requires human inspection, the volume of information transfer will be throttled by the availability of competent human review.

This challenge can be obviated, at least in part, through automation of the analysis process and effective development of data review tools. Trivial examples of this automation, such as data normalization (e.g., converting all compositions to a molar basis) and plotting data from multiple literature sources against each other, are simple tasks for a human to perform in concept but are time-consuming in practice. Higher level activities such as outlier detection, regression development, and consistency testing require more sophisticated implementations, but once the initial, substantial investment in development is made, the human cost for review of outputs and maintenance of software per unit of processed information can be reduced substantially. The importance of regular human review and robust, well defined quality tests cannot be overemphasized, as machine learning algorithms will provide outputs constrained only by the instructions that have been provided and thus can and will fail in many varied ways. As discussed previously, TRC developed TDE47–50,61–65 as an evolving expert system and as the first implementation of the dynamic data evaluation concept in the domain of thermophysical properties.
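
As an example of the first of these trivial automations, the sketch below converts a mass-basis composition to a molar basis; the molar masses are standard values, and the composition echoes the Fe-15 wt. % Cr alloy of Table II purely for illustration.

```python
# Minimal normalization sketch: convert a mass-basis composition to a molar basis
# so that data reported on different bases can be compared directly.
MOLAR_MASS = {"Fe": 55.845, "Cr": 51.996}  # g/mol

def mass_to_mole_fraction(mass_fractions):
    """Convert {element: mass fraction} to {element: mole fraction}."""
    moles = {el: w / MOLAR_MASS[el] for el, w in mass_fractions.items()}
    total = sum(moles.values())
    return {el: n / total for el, n in moles.items()}

# Fe-15 wt. % Cr alloy (cf. Table II), expressed as mass fractions.
composition_wt = {"Fe": 0.85, "Cr": 0.15}
print(mass_to_mole_fraction(composition_wt))  # ~0.84 Fe / ~0.16 Cr on a molar basis
```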

The call for experimental data for validation of theoretical calculations persists even with the “first principles”-based simulations that provide the underpinning for high throughput calculations. Future development of tools and software to assist high throughput experimental methods, specifically those that deal with imaging, should be invested in and used in combination with high throughput computational methods, specifically high throughput screening processes, to decrease the time for materials development. As data production rates grow, screening technologies for the peer review process become essential to ensure that published data are of high quality, and complete metadata representations are essential to their communication to ensure proper interpretation. The use of archives of historical data for validation prior to publication has been shown to be an effective tool to increase data reliability for properties of molecular organics and could be extended to other fields. However, this method requires substantial curation efforts and can only take advantage of data available to the general community. Without these protections, erroneous data will continue to permeate data resources, requiring consumers of information to be skeptical of their inputs and thus significantly impeding progress.

This paper is a contribution of the U.S. government and is not subject to copyright in the United States.

1. P. Larsen and M. Von Ins, Scientometrics 84, 575 (2010).
2. R. D. Chirico, M. Frenkel, J. W. Magee, V. V. Diky, C. D. Muzny, A. F. Kazakov, K. Kroenlein, I. Abdulagatov, G. R. Hardin, W. E. Acree, J. F. Brenneke, P. L. Brown, P. T. Cummings, T. W. de Loos, D. G. Friend, A. R. H. Goodwin, L. D. Hansen, W. M. Haynes, N. Koga, A. Mandelis, K. N. Marsh, P. M. Mathias, C. McCabe, J. P. O’Connell, A. Pádua, V. Rives, C. Schick, J. P. M. Trusler, S. Vyazovkin, R. D. Weir, and J. Wu, J. Chem. Eng. Data 58, 2699 (2013).
3. S. Curtarolo, G. L. W. Hart, M. B. Nardelli, N. Mingo, S. Sanvito, and O. Levy, Nat. Mater. 12, 191 (2013).
4. L. Q. Xing and W. G. J. Bunk, Giesserei 84, 20 (1997).
5. D. Mu, P. Nan, J. Wang, G. Song, and W. Luo, JOM 67, 1659 (2015).
6. O. D. Sherby and J. Wadsworth, J. Mater. Process. Technol. 117, 347 (2001).
7. G. B. Olson, Science 288, 993 (2000).
8. J. W. Gibbs, Trans. Conn. Acad. Arts Sci. 2, 382 (1873).
9. J. W. Gibbs, Trans. Conn. Acad. Arts Sci. 3, 108 (1876).
10. J. W. Gibbs, Trans. Conn. Acad. Arts Sci. 3, 343 (1879).
11. W. Hume-Rothery, J. Inst. Met. 35, 319 (1926).
12. G. B. Olson, Science 277, 1237 (1997).
13. J. Van Laar, Z. Phys. Chem. 63, 216 (1908).
14. J. Van Laar, Z. Phys. Chem. 64, 257 (1908).
15. L. Kaufman and H. Bernstein, Computer Calculation of Phase Diagrams with Special Reference to Refractory Metals (Academic Press, Inc., 1970).
16. G. B. Olson and C. J. Kuehmann, Scr. Mater. 70, 25 (2014).
17. W. Xiong and G. B. Olson, MRS Bull. 40, 1035 (2015).
18. Y.-P. Yang, J. Mater. Eng. Perform. 24, 202 (2014).
19. G. Gou, Y. Yang, and H. Chen, Engineering 06, 936 (2014).
20. G. Casalino, M. Mortello, N. Contuzzi, and F. M. C. Minutolo, Procedia CIRP 33, 434 (2015).
21. J. Orsborn, R. E. A. Williams, and H. Fraser, Microsc. Microanal. 21, 1089 (2015).
22. R. Rettig, N. C. Ritter, H. E. Helmer, S. Neumeier, and R. F. Singer, Modell. Simul. Mater. Sci. Eng. 23, 035004 (2015).
23. J. Hattrick-Simpers, C. Wen, and J. Lauterbach, Catal. Lett. 145, 290 (2014).
24. C. E. Campbell and G. B. Olson, J. Comput.-Aided Mater. Des. 7, 145 (2000).
25. S. Ding, Y. Liu, Y. Li, Z. Liu, S. Sohn, F. J. Walker, and J. Schroers, Nat. Mater. 13, 494 (2014).
26. A. Ludwig, R. Zarnetta, S. Hamann, A. Savan, and S. Thienhaus, Int. J. Mater. Res. 99, 1144 (2008).
27. D. B. Miracle, Mater. Sci. Technol. 31, 1142 (2015).
28. S. V. Kalinin, B. G. Sumpter, and R. K. Archibald, Nat. Mater. 14, 973 (2015).
29. F. De Geuser, M. J. Styles, C. R. Hutchinson, and A. Deschamps, Acta Mater. 101, 1 (2015).
30. M. O. McLinden, A. F. Kazakov, J. S. Brown, and P. A. Domanski, Int. J. Refrig. 38, 80 (2014).
31. A. F. Kazakov, M. O. McLinden, and M. Frenkel, Ind. Eng. Chem. Res. 51, 12537 (2012).
32. R. Reed, T. Tao, and N. Warnken, Acta Mater. 57, 5898 (2009).
33. F. Tancret, Modell. Simul. Mater. Sci. Eng. 20, 1 (2012).
34. H. K. D. H. Bhadeshia and R. Honeycombe, Steels: Microstructure and Properties (Butterworth-Heinemann, 2011).
35. H. K. D. H. Bhadeshia, Stat. Anal. Data Min. 1, 296 (2009).
36. H. K. D. H. Bhadeshia, ISIJ Int. 39, 966 (1999).
37. P. Frazier and J. Wang, in Information Science for Materials Discovery and Design, edited by T. Lookman, F. J. Alexander, and K. Rajan (Springer International Publishing, Cham, 2016), pp. 45-75.
38. W. Carande, A. Kazakov, C. D. Muzny, and M. Frenkel, J. Chem. Eng. Data 60, 1377 (2015).
39. E. Bélisle, Z. Huang, and A. Gheribi, Databases Theory and Applications (Springer International Publishing, 2014), pp. 38-49.
40. D. L. McDowell and G. B. Olson, in Scientific Modeling and Simulations, edited by S. Yip and T. Diaz de la Rubia (Springer, Netherlands, 2009), pp. 207-240.
41. Z.-K. Liu and D. McDowell, Integr. Mater. Manuf. Innovations 3, 1 (2014).
42. A. Agrawal, P. D. Deshpande, A. Cecen, G. P. Basavarsu, A. N. Choudhary, and S. R. Kalidindi, Integr. Mater. Manuf. Innovations 3, 8 (2014).
43. K. Kroenlein, V. V. Diky, C. D. Muzny, J. W. Magee, and M. Frenkel, NIST Standard Reference Database 171, NIST, 2015.
44. V. V. Diky, R. D. Chirico, R. Wilhoit, Q. Dong, and M. Frenkel, J. Chem. Inf. Model. 43, 15 (2003).
45. M. Frenkel, V. V. Diky, R. D. Chirico, R. N. Goldberg, H. Heerklotz, J. E. Ladbury, D. P. Remeta, J. H. Dymond, A. R. H. Goodwin, K. N. Marsh, W. A. Wakeham, S. E. Stein, P. L. Brown, E. Königsberger, and P. A. Williams, J. Chem. Eng. Data 56, 307 (2011).
46. M. Frenkel, R. D. Chirico, V. V. Diky, P. L. Brown, J. H. Dymond, R. N. Goldberg, A. R. H. Goodwin, H. Heerklotz, E. Königsberger, J. E. Ladbury, K. N. Marsh, D. P. Remeta, S. E. Stein, W. A. Wakeham, and P. A. Williams, Pure Appl. Chem. 83, 1937 (2011).
47. V. V. Diky, R. D. Chirico, C. D. Muzny, A. F. Kazakov, K. Kroenlein, J. W. Magee, I. Abdulagatov, and M. Frenkel, J. Chem. Inf. Model. 53, 3418 (2013).
48. K. Kroenlein, C. D. Muzny, V. V. Diky, A. F. Kazakov, R. D. Chirico, J. W. Magee, I. Abdulagatov, and M. Frenkel, J. Chem. Inf. Model. 51, 1506 (2011).
49. M. Frenkel, R. D. Chirico, V. V. Diky, X. Yan, Q. Dong, and C. D. Muzny, J. Chem. Inf. Model. 45, 816 (2005).
50. V. V. Diky, C. D. Muzny, E. W. Lemmon, R. D. Chirico, and M. Frenkel, J. Chem. Inf. Model. 47, 1713 (2007).
51. R. D. Chirico, T. W. de Loos, J. Gmehling, A. R. H. Goodwin, S. Gupta, W. M. Haynes, K. N. Marsh, V. Rives, J. D. Olson, C. Spencer, J. F. Brennecke, and J. P. M. Trusler, Pure Appl. Chem. 84, 1785 (2012).
52. R. N. Hajra, R. Subramanian, H. Tripathy, A. K. Rai, and S. Saibaba, Thermochim. Acta 620, 40 (2015).
53. M. Premović, D. Minić, D. Manasijević, V. Ćosović, D. Živković, and I. Dervišević, Thermochim. Acta 609, 61 (2015).
54. R. N. Abdullaev and Y. M. Kozlovskii, Int. J. Thermophys. 36, 603 (2015).
55. W. Zhai and B. Wei, J. Chem. Thermodyn. 86, 57 (2015).
56. F. Galiegue, K. Zyp, and G. Court, JSON Schema Core Definition Terminology Draft, 2013, http://tools.ietf.org/html/draft.
57. D. C. Fallside and P. Walmsley, XML Schema Part 0 Primer Second Edition, W3C Recommendation, 2004, http://www.w3.org/TR/xmlschema.
58. ThermoML Opener, a tool for direct viewing of ThermoML files, 2016, http://trc.nist.gov/ThermoML_Opener.html.
59. K. A. Beauchamp, J. M. Behr, A. S. Rustenburg, C. I. Bayly, K. Kroenlein, and J. D. Chodera, J. Phys. Chem. B 119, 12912 (2015).
60. K. A. Beauchamp, J. M. Behr, A. S. Rustenburg, C. I. Bayly, K. Kroenlein, and J. D. Chodera, ThermoPyL, 2016, https://github.com/choderalab/ThermoPyL.
61. V. V. Diky, R. D. Chirico, A. F. Kazakov, C. D. Muzny, and M. Frenkel, J. Chem. Inf. Model. 49, 503 (2009).
62. V. V. Diky, R. D. Chirico, A. F. Kazakov, C. D. Muzny, and M. Frenkel, J. Chem. Inf. Model. 49, 2883 (2009).
63. V. V. Diky, R. D. Chirico, A. F. Kazakov, C. D. Muzny, J. W. Magee, I. Abdulagatov, J. W. Kang, K. Kroenlein, and M. Frenkel, J. Chem. Inf. Model. 51, 181 (2011).
64. V. Diky, R. D. Chirico, C. D. Muzny, A. F. Kazakov, K. Kroenlein, J. W. Magee, I. Abdulagatov, J. W. Kang, and M. Frenkel, J. Chem. Inf. Model. 52, 260 (2012).
65. V. V. Diky, R. D. Chirico, C. D. Muzny, A. F. Kazakov, K. Kroenlein, J. W. Magee, I. Abdulagatov, J. W. Kang, R. Gani, and M. Frenkel, J. Chem. Inf. Model. 53, 249 (2013).
66. Commercial products are identified only for purposes of technical description, and this implies no endorsement by NIST. Other products might be found that work equally well or better for the applications described.