Recent advancements in computing technologies, coupled with the need to make sense of large amounts of raw data, have renewed interest in data-driven materials design and discovery. Traditional materials science research relies heavily on experimental data to gauge the properties of materials. However, this paradigm is largely based on trial and error, and research can take decades to discover new materials. Data-driven modeling tools such as machine learning and its proven libraries can help speed up the materials discovery process through the implementation of powerful algorithms on readily available material datasets mined from the ever-increasing private- and government-funded material databases. In this Perspective, we applied various machine learning models to thousands of thermoelectric compounds obtained from density functional theory calculation results. In our preliminary analysis, we made use of pymatgen and the powerful materials science library matminer to add and explore key material features that have the propensity to accurately predict our target output. We evaluated the accuracy and performance of our models with the coefficient of determination (R2), the root mean square error, and K-fold cross-validation metrics and identified the most important descriptors for our materials. Finally, we reviewed the current state of the art in data-driven thermoelectric materials design and discovery, its current challenges, and its prospects.
I. INTRODUCTION
The technological reality of self-learning machines' abilities to produce self-driving cars, home-based voice-controlled virtual assistants (such as Apple Siri, Amazon Alexa, Google Assistant, and Microsoft Cortana), or powerful recommendation engines has revolutionized the way we live and interact with our environment. At the core of these innovations lie complex computer models that can learn through experience without being explicitly programmed to do so. This technology can be applied almost everywhere automation or predictive analytics are needed. The mitigation of high-cost materials research, the necessity to shorten the lengthy research-to-discovery lifecycle, and the low-risk advantage of data-intensive modeling have propelled growing numbers of theorists and experimentalists to include data-driven modeling in their research toolbox. Referred to by many in the scientific community as the fourth paradigm of science,1,2 this burgeoning data-driven paradigm is a logical progression following the first three highly adaptive scientific paradigms of experimentally, theoretically, and computationally oriented methods.3 From these three scientific methods, a substantial amount of raw, complex, and unprocessed data is being generated on a daily basis. This has led to terabytes of curated and uncurated material datasets stored in readily accessible online databases. Among the ever-increasing high-throughput online material databases are the Materials Project (MP),4 the Materials Genome Initiative (MGI),5 the Joint Automated Repository for Various Integrated Simulations (JARVIS),6 the Novel Materials Discovery (NOMAD) repository,7 Automatic-FLOW for Materials Discovery (AFLOW),8 the Inorganic Crystal Structure Database (ICSD),9 and the Open Quantum Materials Database (OQMD),10 just to name a few.
The establishment of data science and machine learning (ML) in materials science was born out of the need to treat, analyze, understand, and gain insights from the complexities of existing and newly generated material data. To this extent, new and innovative approaches are continuously being devised across interdisciplinary fields to find hidden patterns, understand, and predict the properties and/or the structure–property relationships of a wide range of known materials (i.e., forward modeling) and thousands of hypothesized materials yet to be discovered (i.e., inverse modeling). Whether by accident, through experience, trial-and-error, computational simulations, or mere intuition, from the stone age, bronze age, to the modern silicon era, material inquiries have greatly served humanity in its quest to tame the environment. In a world where our dependence on fossil fuels and their toxic remnants is ever increasing with rising energy demands, the quest to find novel and clean energy-related materials is ever-pressing. Concerned with uncovering hidden structure–property relationships, the exploration of inverse material design processes, and the statistical accuracy of model predictions, data-driven materials studies have garnered a slew of innovative research aimed toward a wide range of applications. In sustainable energy research alone, extensive data-driven studies have been conducted on photovoltaic systems,11–14 thermoelectricity,15–18 and Li-ion batteries,19,20 just to name a few.
Thermoelectric (TE) materials, which are lauded to play a significant role in energy sustainability, utilize the thermoelectric effect to convert waste heat into electricity. The performance of thermoelectric materials is measured by the value of the dimensionless figure of merit,

$$zT = \frac{S^{2}\sigma T}{\kappa_{e} + \kappa_{l}}, \qquad (1)$$

where S is the Seebeck coefficient or thermopower, σ is the electrical conductivity, T is the absolute temperature, and κ_e and κ_l are the electronic and lattice thermal conductivities, respectively. A high zT requires a large thermopower, high electrical conductivity, and low thermal conductivity. Due to the highly conflicting nature of the properties of thermoelectric materials, where optimizing one parameter can be detrimental to the properties of other closely interrelated parameters, data-driven processes offer a unique blend of discovering hidden physical relations between properties, material property predictions, and material optimizing schemes for the identification, selection, and discovery of optimal performing materials. Research in thermoelectricity can be divided into three main groups: power factor (PF, S²σ) maximization, thermal conductivity reduction, or both. Power factor maximization includes but is not limited to finding heavily doped semiconductors with carrier concentrations between 10¹⁹ and 10²¹ carriers per cm³ and altering the electronic band structure of the materials through various band engineering techniques to enhance carrier mobility and electrical conductivity. Thermal conductivity reduction revolves around finding ways to inhibit the lattice thermal conductivity, which results from heat-transporting phonons traveling through the crystal lattice. Common thermal conductivity reduction techniques, which fall under the umbrella of micro/nano-structuring, include high energy ball milling,21 nano-inclusion within the host materials,22 grain boundary engineering,23 and complex powder processing mechanisms24,25 to obtain micro- or nanoscale grain sizes, which serve as barriers for the transmittance of heat-carrying particles.
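The competing terms in Eq. (1) can be made concrete with a short numerical sketch; the property values below are illustrative order-of-magnitude figures for a good room-temperature thermoelectric, not entries from our dataset:

```python
def figure_of_merit(S, sigma, T, kappa_e, kappa_l):
    """zT = S^2 * sigma * T / (kappa_e + kappa_l).

    S in V/K, sigma in S/m, T in K, kappa_e and kappa_l in W/(m K).
    """
    return (S ** 2) * sigma * T / (kappa_e + kappa_l)


# Hypothetical values: S = 220 muV/K, sigma = 1e5 S/m,
# kappa_e = 0.6 and kappa_l = 0.9 W/(m K) at T = 300 K.
zT = figure_of_merit(S=220e-6, sigma=1.0e5, T=300.0, kappa_e=0.6, kappa_l=0.9)
print(zT)  # close to 0.97
```

Halving the lattice thermal conductivity in this sketch raises zT substantially, which is why micro/nano-structuring is such a prominent strategy.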
High zT has also been reported in low dimensional superlattice structures26 where thin periodic layers of different materials of the same type are superimposed one on top of the other to introduce a large density of interfaces in which phonons over a large mean free path range can be scattered more effectively, and preferentially more than electrons. These techniques are performed on a myriad of material systems such as Bi2Te3 and Bi2Te3–Sb2Te3–Bi2Se3-based nanocomposite alloys,27 skutterudites,28–30 Zintl phase compounds (clathrates,31 calcium silicide,32,33 sodium silicide34), perovskites,35 half-Heusler semiconductors,36 oxide materials,37 conducting polymers,38 and hybrid organic–inorganic materials.39,40 The list is by no means exhaustive. Owing to the large search space of potential high-performing thermoelectric materials, research in data-driven thermoelectric material systems is diverse and vast. As such, we will limit our work to machine learning guided TE materials design and discovery, which also encompasses material property prediction, as an understanding of the behavior and performance of materials properties is essential to any data-driven thermoelectric materials design and discovery.
Notwithstanding the difficulties in decoupling the fundamental thermoelectric properties, data-driven guided material discovery has yielded many successes. For example, Iwasaki et al.41 used machine learning modeling to investigate key physical parameters controlling the spin-driven thermoelectric effect (STE) and successfully used their findings to synthesize novel TE materials with thermopower higher than previously recorded. Choudhary et al.42 used the JARVIS-DFT library to identify promising 3D and 2D thermoelectric materials. Thermoelectric oxides have been known to possess high power factors but low overall zT due to their high thermal conductivities. A comprehensive data-driven study by Tewari et al.43 shed light on the properties of the different materials contributing to the categorization (low, medium, and high) of the thermal conductivities of these materials. This approach could be a good starting point in selecting low thermal conductivity oxides to aid in the design and discovery of efficient TE materials. Decoupling electronic and thermal transport is the holy grail of TE materials design and discovery. Yamawaki et al.44 investigated a multifunctional structural optimization approach on graphene nanoribbons (GNRs) by alternating transport calculations and Bayesian optimization. They reported a structural optimization efficiency five times greater than that achieved using random search. In addition, the optimized GNRs were observed to enhance the thermoelectric figure of merit to 11 times that of unprocessed GNRs.
In this study, we applied different machine learning algorithms to an 8863 × 167 input–output feature matrix to predict the thermoelectric power factor. We evaluated the performance of our models with validation metrics such as the coefficient of determination (R2), the root mean square error (RMSE), and fivefold cross-validation estimators.
II. MACHINE LEARNING IN MATERIALS SCIENCE—A BRIEF OVERVIEW
The development of machine learning in materials science has seen exponential growth in the last decade. This is partly fueled by the advent of the Open Science movement (open access, open-source software, open-source code, etc.) where wide-scale data acquisition, sharing, and exploration have brought about new and innovative approaches to conducting data-driven material studies. The main objectives in data-driven materials science research are the prediction of materials properties (forward modeling) and the design of new materials (inverse design modeling).45 Common to these two approaches is the selection and integration of reliable ML tools for effective data wrangling, featurization, hyperparameter tuning, model fitting, and results validation. This workflow can be daunting, but giant steps are being made to find the right tool for each step in the modeling pipeline. As mentioned briefly in Sec. III D, feature selection is the backbone of any successful ML model. Ghiringhelli et al.46 demonstrated the challenge of finding dominant features in predicting the crystal structures of known materials. Although there are many obstacles to selecting the right descriptors, major inroads in featurization techniques are noted. For example, Ghiringhelli et al.47 utilized a compression-based methodology in subsequent works to analyze the dominant features for materials property predictions. However, in the study of feature engineering, one must distinguish between modeling objectives when selecting relevant descriptors. As Ramprasad et al.48 noted, if the predictive power of a model is not the main objective of the study, then the application of gross-level features makes more sense. That is, the features are selected based on the attributes of the atoms (bandgap, grain size, etc.) of the materials under study.
However, if the degree of accuracy of a model is determinant in the prediction of certain materials’ properties (total energies, space group, etc.), then molecular fragment-level features may be more suited. That is, a material can be viewed as the sum of its building blocks.49 Progressively, the latter technique has morphed into what has come to be known as the quantitative structure–property relationship (QSPR) and the quantitative structure–activity relationship (QSAR) models. The QSPR and QSAR featurization models, which seek to establish the relationship between the microscopic properties of materials to their macroscopic behavior,50 are becoming widely accepted as reliable methods to find meaningful material features for a wide range of material property prediction studies.
The non-linearity of material data called for a rigorous search for alternatives to simplistic regression and classification models, which lack the statistical weight to describe any hidden patterns in the dataset. Deep learning, a subfield of machine learning that mimics the functions of the human brain, has recently gained momentum in data-driven material studies, especially on the prediction, classification, and inverse design of materials' crystal structures.51,52 As such, crystal graph convolutional neural network (CGCNN) algorithms occupy the forefront of many data-driven crystal structure queries.53,54 In addition, gradient boosting algorithms and least absolute shrinkage and selection operator (LASSO) regression are known not only to provide robust model fitting but also to utilize built-in hyperparameters to optimize the model under study.
III. METHODS
Traditionally, data-intensive studies required knowledge and expertise in data-centric programming languages such as Python, R, and Julia and related frameworks (Anaconda, Jupyter Notebook, RStudio, etc.). However, as data science becomes widespread, so does the universe of predictive analytic platforms geared toward facilitating data-driven studies. More and more such platforms are tailored to novices, who need no expertise in computer programming to conduct comprehensive, end-to-end data analytics. As an example, Automated Machine Learning (AutoML) frameworks such as Auto-SKlearn,55,56 DataRobot,57 Google Cloud AutoML,58 H2O AutoML,59 MlBox,60 Auto-Keras,61 etc., offer non-expert users step-by-step processes going from raw data, to data preprocessing, to model selection and optimization, to model prediction and validation with minimal code writing. The choice of any particular tool depends mainly on one's level of expertise in the aforementioned programming languages and knowledge or lack thereof in data analytic processes and modeling. Although AutoML frameworks are fast and reliable, fundamental data science programming tools such as Python and R are always useful to master.
The Anaconda virtual environment manager is a "one-stop-shop," free and open-source package management distribution that offers seamless options to install hundreds of powerful open-source Python-based data science libraries. In this work, all modeling and simulations were conducted using the web-based Jupyter Notebook platform, which was installed through the Anaconda package management distribution. After installation, we made use of appropriate Python libraries for materials science (pymatgen, matminer, scikit-learn, etc.) to collect, clean, and analyze our data.
A. Data collection: The dataset
The dataset was obtained from the open-source data mining Python library matminer.62 It contains more than 9000 chemical compounds calculated using the BoltzTraP software package run on GGA-PBE or GGA + U density functional theory calculation results.63,64 The original dataset, tabulated from the Materials Project database, contains fundamental thermoelectric material properties/features such as the power factor (Seebeck coefficient squared times the electrical conductivity), p-type and n-type Seebeck coefficients, p-type and n-type effective masses, chemical formula, and material crystal structure. The properties were calculated at a constant temperature of 300 K while the carrier concentration was maintained at 1.00 × 10¹⁸ cm⁻³.19 Figure 1 shows the power factor (PF) as a function of the Seebeck coefficient (S) for all n-type materials used in this study. The inset on the top left shows a word cloud representation of just a few n-type and p-type materials contained in our dataset.
PF_n (μW/cm2 K) vs S_n (μV/K) of n-type materials (few shown under formula). The inset on the top left corner represents a word cloud of n-type and p-type materials in the dataset.
The underlying physics and chemistries of thermoelectric materials properties are dictated not only by electron and phonon transport dynamics but also by empirical investigations such as materials' dopability65 and micro/nano-structuring modifications.66,67 However, most thermoelectric datasets are obtained either primarily through high-throughput computational formalisms or, to a lesser extent, through experimental investigations. The favorability of high-throughput computation stems from the fact that meaningful data analytics necessitate what are termed the various Vs of big data analytics: volume, velocity, variety, variability, veracity, validity, vulnerability, volatility, visualization, and value,68,69 the first four Vs being the dominant factors contributing to the popularity of high-throughput materials screening. ML models are more inclined to perform better when the volume of the data is substantial enough to recognize hidden patterns and gain insights from the training data. The velocity, variety, and variability at which thermoelectric materials data are generated make high-throughput computation more appealing, even though many of its formalisms are based on approximations that might not describe in full the intricate nature and dynamics of thermoelectric phenomena. In contrast, thermoelectric data coming from experimental settings, although substantial and invaluable, are disparate and buried in tens of thousands of publications and lab notebooks, making their mining and digitization slow, laborious, and in some cases ineffective for ML modeling. Recently, animated by the pressing need to include the valuable structure–property relationships and material synthesis parameters contained in experimental thermoelectric data, many efforts have been made to survey various thermoelectric studies to extract meaningful materials data.
For instance, Gaultois et al.70 collected selectively 18 000 thermoelectric data points from over 100 scientific publications and performed various visualization schemes to understand the interplay between the different properties of thermoelectric materials. Tshitoyan et al.71 used supervised natural language processing with “thermoelectric” as a keyword to text mine materials data from several published abstracts and predict potential new thermoelectric materials with promising performance.
B. Data preparation
Data wrangling is the first and one of the most important steps in machine learning modeling. It is estimated that practitioners spend a large portion of their time during the data modeling lifecycle on data mining, data cleaning, data labeling, and data management.72 In some cases, the arduous task of surveying different online database repositories is necessary to extract the thermoelectric materials properties needed for specific studies. Luckily, information-rich and noiseless thermoelectric datasets from single-source online repositories are becoming more prominent, thus reducing the need for extensive, time-consuming, and costly data wrangling. An illustration is the Citrination platform,73 where thousands of promising thermoelectric compounds can be screened.
C. Exploratory data analysis
The very first step after obtaining the dataset is to make sure the dataset does not contain any abnormalities. Exploratory data analysis (EDA) is precisely how we achieve that. It is where we apply statistical analysis and visualization to make sense of our data and to filter out any outliers still prevailing in the dataset. The dataset contained more than 9000 data entries. To make sure that outliers are not present in the dataset, we made use of the fact that the values of the Seebeck coefficient are always positive for p-type materials (holes being the majority carriers) and always negative for n-type materials (electrons being the majority carriers). We used simple Python operators to retain only positive and negative values for all p-type and n-type materials, respectively. In doing so, we decreased the dataset entries from 9036 to 8863 data points. In addition, we used different built-in Python methods (describe and info) to make sure there were no null entries in the dataset. This preliminary data analysis based on Python code alone reveals the importance of data curation and data cleaning in the ML workflow. An important EDA process is to apply data visualization techniques to uncover relationships between features in the dataset. The figure of merit, zT [Eq. (1)], is directly proportional to the power factor, which is the Seebeck coefficient squared times the electrical conductivity (S²σ). Similarly, the Seebeck coefficient S is directly proportional to the effective mass m* [Eq. (2)],

$$S = \frac{8\pi^{2} k_{B}^{2}}{3 e h^{2}}\, m^{*} T \left(\frac{\pi}{3n}\right)^{2/3}, \qquad (2)$$

where k_B is the Boltzmann constant, e is the elementary charge, h is Planck's constant, and n is the carrier concentration.
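The sign-based outlier filter and null check described above can be sketched with pandas; the column names `s_p` and `s_n` below are hypothetical stand-ins for the dataset's p-type and n-type Seebeck columns:

```python
import pandas as pd

# Toy frame mimicking the dataset's Seebeck columns (values in muV/K).
df = pd.DataFrame({
    "formula": ["A", "B", "C", "D"],
    "s_p": [150.0, 210.0, -35.0, 90.0],    # row "C" has an unphysical sign
    "s_n": [-120.0, -310.0, -80.0, 40.0],  # row "D" has an unphysical sign
})

# Keep only physically sensible rows:
# positive p-type and negative n-type Seebeck coefficients.
clean = df[(df["s_p"] > 0) & (df["s_n"] < 0)].reset_index(drop=True)

print(clean.shape)        # two outlier rows dropped
print(df.isnull().sum())  # the null check behind df.info()/df.describe()
```

Applied to the real dataset, the same boolean-mask filter is what trims the entries from 9036 down to 8863.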
As regression analysis has demonstrated a strong correlation between the power factor and carrier effective mass,74 we wish to determine whether there is a linear relationship between the two by plotting the heat map and pair plot correlation matrix between the properties of the given materials. The results are shown in Figs. 2 and 3. The pairwise correlation function shows a close to zero correlation between the power factor and the effective mass indicating no linear interdependence between the two features. The relationship between the two descriptors is more complex than a simple linear relationship would describe and will necessitate appropriate ML tools for accurate model prediction. We will see, in Sec. V, how our machine learning model highlights this important relationship with a simple “most important feature” histogram plot.
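The near-zero entry in the correlation matrix illustrates a general point: a Pearson coefficient close to zero only rules out a *linear* relationship. A minimal synthetic example (unrelated to the actual dataset) shows a variable fully determined by another while their pairwise correlation stays near zero:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 500)

# y is fully determined by x, yet the Pearson correlation is near zero
# because the dependence is symmetric rather than linear.
df = pd.DataFrame({"x": x, "y": x ** 2})
corr = df.corr()  # the same matrix that a heat map or pair plot visualizes
print(corr.loc["x", "y"])
```

This is why a nonlinear learner such as a random forest can still exploit the effective mass even though its linear correlation with the power factor is negligible.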
Materials properties correlation matrix. PF is the power factor, S is the Seebeck coefficient, m is the effective mass, subscript n indicates n-type material, and subscript p indicates p-type material.
Scatterplot matrix showing the pairwise correlation between features.
D. Feature engineering
Fundamental to the successful application of machine learning algorithms in TE materials design and discovery is the exploration, identification, and selection of meaningful, dominant materials descriptors or features that carry enough numerical weight to allow accurate model predictions. The set of independent input variables to be fed into any particular model is commonly referred to as features, fingerprints, or descriptors. Feature selection generally requires domain knowledge and a heightened sense of appreciation of hidden correlations between the input and the target output. Difficulties arise when key material descriptors are not available in the original dataset or when feature engineering of specific inputs to numerically represent said descriptors is not attainable. Features should be chosen such that each input variable's contribution to the determination of the dependent target output is substantial, even if not necessarily well-defined. In other words, success in identifying and selecting relevant features lies mainly in the realm of one's intuition, expertise, or domain knowledge. Intuition can prove to be beneficial or meaningless in feature engineering. In the former case, it can lead to new insights or, at best, to discoveries of physical laws. In the latter case, it can lead to the inclusion of irrelevant features, which is highly discouraged as they do not add any meaningful weight to the overall model prediction. For this reason, feature selection should be dictated first by proven physical laws inherent to the structure–property relationships of the materials under study. Although major inroads have been made in the development of thermoelectric materials databases, the lack of diverse datasets containing fundamental materials descriptors and materials synthesis parameters, together with the scarcity of sufficiently large volumes of experimental data, makes data-driven TE materials design and discovery a persistent challenge.
For example, the Citrine thermal conductivity dataset75 only contains the thermal conductivities of the tabulated materials. Moreover, the dataset used in this study does not include the electrical and thermal conductivities of the materials, both of which are key features in TE phenomena.
Once we clean and analyze our data, the next step in the modeling pipeline is to add as many descriptive features as possible to our DataFrame. Besides the descriptors already present in the dataset, we included descriptive mechanical, electronic, and thermal features such as melting temperature, atomic weight, and thermal conductivity. We added a total of 207 features to our DataFrame using mostly MagpieData76 and PymatgenData.77 To make sure the dataset is free of any unidentifiable or unrepresentable values (i.e., NaN values), we used Python operators to drop all feature columns containing NaN entries. This reduced the number of features from 207 to 167.
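In practice, the featurization step would call matminer featurizers backed by MagpieData/PymatgenData; the NaN-column filtering that trims the feature count can be sketched with pandas alone (the column names below are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy feature table standing in for the composition-based descriptors
# that MagpieData / PymatgenData would supply via matminer featurizers.
df = pd.DataFrame({
    "mean_atomic_weight": [63.5, 121.8, 207.2],
    "melting_T": [1358.0, 904.0, 600.6],
    "some_sparse_feature": [1.2, np.nan, 3.4],  # has unrepresentable entries
})

# Drop every feature column that contains any NaN, as described in the text.
dense = df.dropna(axis=1, how="any")
print(dense.columns.tolist())
```

On the real DataFrame, this single `dropna(axis=1)` call is what takes the feature count from 207 down to 167.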
IV. MACHINE LEARNING MODELS: SUPERVISED MACHINE LEARNING
With a plethora of machine learning models to choose from, finding the right algorithm that best describes the structure and morphology of the dataset (i.e., the ability of a model to detect patterns from training data) is paramount to a successful model prediction. This exercise is not always straightforward, as model effectiveness depends on the uniqueness, quality, and quantity of the dataset. There is no one-size-fits-all in ML modeling. Machine learning can be divided into four categories: supervised machine learning, semi-supervised machine learning, unsupervised machine learning, and reinforcement learning. We will limit our data-driven modeling to supervised learning, as most studies conducted in materials modeling employ supervised machine learning.78 Supervised machine learning is basically an input–output mapping mechanism where a model learns from labeled training data and projects predictions onto unseen or future data.79 Supervised learning can be further divided into two main subcategories: classification and regression problems. The classification task is used when the labels or output data are discrete (e.g., e-mail spam filtering,80 chemical compound orbital radii predictions,81 etc.). Regression-based modeling is used on continuous labels (e.g., thermoelectric power factor, melting point, materials bandgap predictions,82 etc.). The simplest and most used ML regression is the linear regression (LR) model.83 Previous data-driven studies in materials science have found successes with crystal graph convolutional neural network (CGCNN), support vector machine (SVM), Gaussian process regression, and neural network-based modeling. Xie and Grossman84 used crystal graph convolutional neural networks to successfully predict the properties of materials.
In this work, we explored four different types of regression-based machine learning models: linear regression (LR), kernel ridge regression (KRR) (linear, polynomial, Gaussian, and Laplacian), support vector machine (SVM), and random forest (RF).
A. Model evaluation
The performance calculation and assessment of the predictive capabilities of ML models provide a measure of the goodness-of-fit between target outputs and predicted values. Goodness-of-fit can be defined as the degree of accuracy with which independent input variables are mapped onto target outputs. The first step in evaluating a model is to partition the dataset into a training set and a testing set. As the names suggest, the training set is used to train the model while the testing set is used to evaluate the model's performance. Most of the time, the ratio between training and testing data obeys the 90/10, 80/20, or 70/30 train/test split rule. Train/test ratios tending toward unity can also be applied, but for a model to perform well, the number of chemical compounds in the training set must be much larger than in the testing set to reduce the risk of model misfit. We applied the generalized hold-out method by partitioning the dataset into 7090 training data points and 1773 testing data points in an 80/20 train/test ratio split. We evaluated the ML models with the coefficient of determination R2 [Eq. (3)] and RMSE [Eq. (4)] on the training and testing data. We further used fivefold cross-validation and applied the above metrics to evaluate the models,

$$R^{2} = 1 - \frac{\sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i=1}^{n} (y_{i} - \bar{y})^{2}}, \qquad (3)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}, \qquad (4)$$

where y_i are the target values, ŷ_i the predicted values, ȳ the mean of the target values, and n the number of data points.
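The hold-out split, R2/RMSE metrics, and fivefold cross-validation can be sketched with scikit-learn on synthetic data (the feature matrix and hyperparameters below are illustrative, not those of our study):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data standing in for the 8863 x 167 feature matrix.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 8))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + 0.1 * rng.normal(size=500)

# 80/20 hold-out split, as in the text.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Eq. (3) and Eq. (4) on the held-out test set, plus fivefold CV.
r2 = r2_score(y_te, model.predict(X_te))
rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
print(r2, rmse, cv_r2)
```

Reporting all three numbers together, as in Table I, guards against a model that scores well on one split by chance.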
V. RESULTS AND DISCUSSIONS (N-TYPE MATERIALS MODELING)
Table I shows the tabulated results of the R2 and RMSE validation metrics for the models used in this study (linear regression, linear kernel ridge regression, support vector machine with liblinear SVR kernel, and random forest) for the training, testing, and cross-validation datasets.
Validation metrics for n-type materials' power factor.
| Machine learning model | Training set R2 | Training set RMSE | Testing set R2 | Testing set RMSE | Fivefold CV R2 | Fivefold CV RMSE |
|---|---|---|---|---|---|---|
| Linear regression | 71.5% | 0.237 | 66.7% | 0.272 | 68.8% | 0.251 |
| Kernel ridge | 70.9% | 0.240 | 66.7% | 0.272 | 59.5% | 0.280 |
| SVM | 71.4% | 0.238 | 66.6% | 0.272 | 68.8% | 0.251 |
| RF | 97.6% | 0.068 | 81.3% | 0.204 | 83.9% | 0.180 |
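A side-by-side comparison of the four model families can be sketched with scikit-learn defaults on synthetic data (the hyperparameters and data below are illustrative and will not reproduce the table above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVR

# Synthetic data: mostly linear target with a small nonlinear term.
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 6))
y = X @ rng.normal(size=6) + 0.3 * X[:, 0] ** 2 + 0.1 * rng.normal(size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "LR": LinearRegression(),
    "KRR (linear)": KernelRidge(kernel="linear", alpha=1.0),
    "SVM (linear SVR)": LinearSVR(max_iter=10000),
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
}
scores = {name: r2_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
for name, s in scores.items():
    print(f"{name}: R2 = {s:.3f}")
```

The same loop-over-models pattern extends directly to the polynomial, Gaussian, and Laplacian KRR variants by changing the `kernel` argument.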
The linear kernel ridge regression (KRR) model is the least impressive among the four models studied, with only 70.9%, 66.7%, and 59.5% accuracy on the training, testing, and fivefold cross-validation data, respectively. Even with this minimal predictive power, our KRR model compares favorably with a recent study on data-driven thermoelectric power factor prediction by Laugier et al.,85 who recorded mean absolute error (MAE) values of 55.50%, 44.90%, and 20.70% for their CGCNN, XGBoost, and FCNN models, respectively. We also employed Laplacian, polynomial, and Gaussian KRR on our dataset. Apart from the linear KRR, these models performed poorly and were not deemed satisfactory enough to be included in this paper. As linear regression is the simplest regression model in many data-driven studies, it was a pleasant surprise to observe that it has higher predictive power than the SVM and linear KRR models, demonstrating that simplicity does not necessarily mean poor predictive power in ML modeling.
By far, the random forest model gives the best predictive power of the models studied. With 97.6% accuracy recorded on the training data, thoughts of overfitting may arise. Overfitting usually occurs when the model performs well on the training data but fails to capture any correlation in the testing data. However, with 81.3% and 83.9% accuracy on our testing and cross-validation data, respectively, concerns about overfitting are not warranted in this case. Furthermore, the numerically computed values of the predicted vs the target outputs of the power factor are very close to each other. To the best of our knowledge, the coefficient of determination of our random forest model in predicting the thermoelectric power factor is the highest result recorded in the literature to date. For all the models described above, the RMSE values tend to decrease with increasing predictive capability of the corresponding model. This is expected, as the more powerful a model is in describing the data, the smaller the error will be. The improved performance of our models over that of Laugier et al. can be attributed to the volume of available data for training and testing, the quantity and quality of the descriptors extracted, and/or the models' ability to better map input features to target outputs. Our models used more materials data than Laugier et al. (8863 compared to 7230) and more high-quality descriptors (167 compared to 16 for their CGCNN model and 28 for their FCNN model). The p-type power factor predictions for the RF model (97.1% for the training set, 80.4% for the testing set, and 79.9% for cross-validation) yielded similar but less dramatic results when compared to their n-type counterparts. However, we noted a significant decrease in prediction accuracy for the LR, KRR, and SVM models when compared to their n-type counterparts. Data related to the p-type power factor predictions can be found in the supplementary material.
Figure 4 shows the predicted vs the target values of the testing data for our four algorithms, namely, the random forest, the linear regression, the linear KRR, and the SVM models. With nearly identical data distributions and similar coefficient of determination and cross-validation results, the linear regression and the linear KRR models are graphically and statistically almost indistinguishable, despite the fact that the linear KRR model performed more poorly on the training data than the linear regression model. Again, as evidenced by these figures, the random forest model clearly gives the best accuracy in predicting the thermoelectric power factor, with an R2 score of 81.3% on the testing data and a correspondingly low RMSE value.
An important model performance evaluation is the determination of the residuals. Residuals, as used here, are the percentage error between the target and the predicted values; the closer the residuals are to zero, the better the model performs. In Fig. 5, we show the regression residuals of the training and testing data for our RF model. We observed that the bulk of the training data residuals is concentrated between 0% and 0.05%, while the testing data residuals are more spread out, extending beyond 0.15% error. The reverse would have been an indication of poor model performance. This exercise can serve as a preliminary analysis for selecting promising models.
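The residual analysis above can be reproduced with a few lines of NumPy. The exact normalization used in Fig. 5 is an assumption here; this sketch takes the residual as the absolute error relative to the target value, with hypothetical numbers for illustration.

```python
import numpy as np

def percentage_residuals(y_true, y_pred):
    """Residuals as relative error: |target - predicted| / |target|.
    (One plausible reading of the 'percentage error' in the text.)"""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_true - y_pred) / np.abs(y_true)

# Hypothetical target and predicted power-factor values.
y_true = np.array([2.0, 4.0, 5.0, 10.0])
y_pred = np.array([2.1, 3.8, 5.0, 9.0])
res = percentage_residuals(y_true, y_pred)
# res → [0.05, 0.05, 0.0, 0.1]
```

A histogram of `res` for the training and testing sets would then reproduce the kind of comparison shown in Fig. 5.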
Figure 6 shows the features that contributed most to our RF model's predictions. Our machine learning model clearly identifies the effective mass as the most important feature for predicting the thermoelectric power factor, as it should be. This is compelling evidence that machines do learn and gain insight when fed a substantial amount of data. This knowledge can be utilized in experimental settings to tune the effective mass toward a higher thermoelectric power factor.
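A feature-importance ranking like the one in Fig. 6 comes directly from the fitted random forest. The data below are synthetic, with the first column constructed to dominate the target so the ranking is illustrative; the feature names are those of Fig. 6, but the values are not the study's.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Descriptor names from Fig. 6 (values here are synthetic stand-ins).
names = ["m_n", "S_n", "PF_p", "m_p", "S_p"]

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5))
# Construct the target so the first feature (effective mass) dominates.
y = 3.0 * X[:, 0] + 0.3 * X[:, 1] + 0.05 * rng.normal(size=400)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances, normalized to sum to 1, ranked high to low.
ranked = sorted(zip(names, rf.feature_importances_), key=lambda t: -t[1])
```

Note that impurity-based importances can be biased toward high-cardinality features; permutation importance is a common cross-check.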
Histogram plot of the features by importance (m_n = n-type effective mass, S_n = Seebeck coefficient of n-type materials, PF_p = Power Factor of p-type materials, m_p = p-type effective mass, S_p = Seebeck coefficient of p-type materials).
VI. CURRENT CHALLENGES
As Zunger stated in one of his highly cited perspectives,86 “The properties required to realize a particular device are often known, but the specific materials that harbor such properties are generally unknown and are difficult to identify.” In these few words lies the quintessential motivating factor of material inquiries that has spanned decades of innovative research in materials design and discovery. The challenges seem insurmountable, as the number of materials yet to be discovered is estimated to be as large as a googol (10100).87 This is equally true in data-driven TE materials’ design and discovery despite the financial, technological, and intellectual investments in the field. The vastness of the hypothetically viable TE materials' chemical and structural search space is a challenge in its own right. The design and discovery of TE materials through ML modeling depend on the currently available databases, which represent a tiny fraction of this hypothetical search space. Many such databases, obtained from high-throughput (HT) computational methods or from experiments, may not contain the key physical descriptors needed for accurate materials prediction, design, and discovery. Although high-throughput datasets can accommodate a diverse and large volume of data, they suffer from issues of data veracity and accuracy. To understand the limitations of HT datasets, one needs to revisit the interplay between the different TE properties. Among other things, TE materials properties are governed by electron and phonon dynamics: electron–electron interactions, electron–phonon scatterings, and Umklapp scatterings. Current HT methods may be efficient at describing such interactions in simple TE compounds but are inadequate to fully describe transport mechanisms in complex materials systems. Moreover, one particularity of thermoelectric materials is that they operate over different temperature ranges (low, medium, and high), with many applications in the high-temperature range.
However, HT formalisms fail to describe the complexity of electron and phonon transport at high temperatures, leaving many potential materials unaccounted for in currently available datasets. While experimental datasets are on the rise, their extraction and curation remain a continuing challenge. These limitations usually result in datasets that are either large and diverse but lacking key TE materials’ features, or so small and homogeneous that they do not permit practical machine learning modeling.
Beyond the difficulty of finding the appropriate feature vectors for model representation lies the monumental task of selecting the right model, the one that maps inputs to outputs with the smallest statistical error. This input–output mapping that seeks to determine the best target function is generally called a hypothesis (h) in ML jargon, and the set of all possible hypotheses is referred to as the hypothesis space (H). Thus, finding the right model is analogous to determining the best hypothesis within the hypothesis space. This process is akin to finding a needle in a haystack when faced with a large dataset, as the hypothesis space H depends mainly on the number of input features and the various ways they can be statistically represented. With thousands of candidate models that can potentially be used, we are left to wonder whether there exist other, yet unknown, statistical learning models that can produce a better target function than the ones we are familiar with. As machine learning in materials science is progressing at a fast pace, we hope to witness in the very near future the derivation of more powerful models for better materials data representation.
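In practice, searching the hypothesis space is usually approximated by cross-validated comparison of a finite set of candidate estimators, as sketched below. The candidates, data, and scores are illustrative placeholders, not an exhaustive search.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data with a purely nonlinear (interaction) target.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=300)

# A small slice of the hypothesis space: three candidate model families.
candidates = {
    "linear": LinearRegression(),
    "svr": SVR(),
    "rf": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Score each hypothesis by mean fivefold cross-validated R2.
cv_scores = {name: cross_val_score(est, X, y, cv=5, scoring="r2").mean()
             for name, est in candidates.items()}
best = max(cv_scores, key=cv_scores.get)
```

Hyperparameter grids (e.g., via `GridSearchCV`) enlarge the searched slice of H further, at a corresponding computational cost.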
VII. PROSPECTS OF MACHINE LEARNING IN TE MATERIALS’ DESIGN AND DISCOVERY
ML-guided design and discovery of TE materials involve many challenging processes, the most important being the insufficiency of highly descriptive data that can capture the diversity and complexity of temperature-dependent TE materials and the underlying properties we seek to understand in our journey toward new compound discovery. Until we see major breakthroughs in HT computational simulations and experimental data mining, this challenge will continue to be a roadblock in our pursuit of better TE materials. Faced with this reality, researchers have recently shifted their focus toward integrated ML techniques for data acquisition, material selection modeling, and optimization. One such technique is transfer learning (TL). It is based on the notion that the properties inherent to materials’ attributes are closely interdependent. The central idea of this method is to pre-train ML models on materials data large enough that property features can be captured.88 These machine-acquired features are then transferred as input features to a model trained on a limited amount of materials data with properties similar to the newly acquired features. An analogy is human beings' ability to quickly learn new tasks based on past experience with closely related tasks. Liu et al.89 transferred knowledge learned from the electronic bandgaps of more than a thousand semiconductor compounds to models trained on only 124 materials and observed that the mean absolute errors of the TL models were reduced by 65%, 14%, and 54% for the predictions of the phonon bandgap, group velocity, and heat capacity, respectively, when compared with directly trained models. Yamada et al.90 conceived a pre-trained model library called XenonPy.MDL, covering thousands of inorganic materials, to facilitate TL modeling. TL has not yet seen large-scale adoption in TE materials design and discovery; however, with the advent of pre-trained libraries, this will likely soon change.
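One simple, feature-based flavor of transfer learning can be sketched as follows: a model pre-trained on a large source dataset supplies an extra descriptor for a small target dataset. This is only an illustration of the idea, not the network-based approaches of Refs. 89 and 90, and all data and correlations here are synthetic assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)

# Large source dataset (e.g., a well-sampled property such as the bandgap).
X_src = rng.normal(size=(1000, 6))
y_src = np.sin(X_src[:, 0]) + X_src[:, 1]
source = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_src, y_src)

# Small target dataset (124 samples, echoing Ref. 89) whose property is
# assumed to correlate strongly with the source property.
X_tgt = rng.normal(size=(124, 6))
y_tgt = 2.0 * (np.sin(X_tgt[:, 0]) + X_tgt[:, 1]) + 0.1 * rng.normal(size=124)

# Transfer step: append the source model's prediction as a new descriptor,
# then train the target model on the augmented features.
X_aug = np.column_stack([X_tgt, source.predict(X_tgt)])
target = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_aug, y_tgt)
```

When the source and target properties are genuinely related, the transferred descriptor carries most of the signal, which is why TL helps in the small-data regime.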
In addition, the lack of materials’ synthesis data (deposition pressure, temperature, thickness, grain size, etc.) limits materials’ synthesizability and performance evaluation. The National Renewable Energy Lab’s (NREL) revolutionary idea to build data mining infrastructures based on high-throughput experimental (HTE) methods stems from the need to include materials’ synthesis data in the ML workflow. The NREL HTE approach utilizes combinatorial thin-film synthesis (sputtering, pulsed laser deposition), spatially resolved characterization (composition, structure, transport properties), and automated data processing, analysis, and visualization. Currently, their database contains about 140 000 entries and is publicly accessible.91 Such highly integrated materials infrastructure will serve greatly in the discovery of new materials. For example, You et al.92 used HTE screening to optimize the Cu concentration and Se/S ratio in a Cu-doped PbSe–PbS material system, which resulted in a high thermoelectric figure of merit with zT around 1.6 at 873 K. Sasaki et al.93 used combinatorial gradient thermal annealing and ML to identify optimal strain in bismuth telluride thin films. To usher in an era of great advancement in thermoelectric materials’ design and discovery, the continued combination of high-throughput computational data with experimental data (the successful as well as the failed attempts) will be necessary.
VIII. CONCLUSION
The rise of the machines is no longer a fictitious, imaginary set of entertaining undertakings that envision machines taking over humanity, nor is it something we should fear. It is a new scientific paradigm based on machines' self-learning ability to supplement human endeavors in all aspects of life by turning data into knowledge. In materials science, such a realization has revolutionized the way we see and interact with materials’ data. In this work, we have surveyed the fundamental processes for successful end-to-end machine learning modeling in materials science. In the search for a more robust model for predicting the power factor of more than 9000 thermoelectric compounds, we applied four regression-based machine learning algorithms. Our work showed that the random forest model, with accuracies of 97.6%, 81.3%, and 83.9% on the training, testing, and fivefold cross-validation data, respectively, is best suited for predicting the thermoelectric power factor. As quantum computing continues to achieve breakthroughs, enabling the expansion of data storage capabilities and faster processing times of modern computers, prospects in data-driven thermoelectric studies can only become more exciting.
SUPPLEMENTARY MATERIAL
See the supplementary material for discussion about p-type materials' power factor predictions.
ACKNOWLEDGMENTS
This work was supported by NSF-CREST (CREAM) (Grant No. HRD 1547771) and NSF-CREST (CNBMD) (Grant No. HRD 1036494). We would like to thank Dr. Ali Abdinur from the Departments of Mathematics and Computer Science at Norfolk State University for his valuable insights.
DATA AVAILABILITY
The data that support the findings of this study are openly available in the Dryad Digital Repository at https://doi.org/10.5061/dryad.gn001, Ref. 64. Notebooks and Python source codes are also available from the corresponding author upon reasonable request.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
M.T.M., S.K.P., and M.B. conceived the idea, contributed to the theoretical discussion, and reviewed the manuscript. M.T.M. collected the data, designed the machine learning models, and wrote the manuscript. M.B. supervised this study. All authors discussed the results and commented on the manuscript.