Machine learning (ML) has emerged as an indispensable methodology to describe, discover, and predict complex physical phenomena, efficiently helping us learn underlying functional rules, especially in cases when conventional modeling approaches cannot be applied. While conventional feedforward neural networks are typically limited to performing tasks related to static patterns in data, recursive models can both work iteratively based on a changing input and discover complex dynamical relationships in the data. Deep language models can model flexible modalities of data and are capable of learning rich dynamical behaviors as they operate on discrete or continuous symbols that define the states of a physical system, yielding great potential toward end-to-end predictions. Similar to how words form a sentence, materials can be considered as a self-assembly of physically interacting building blocks, where the emerging functions of materials are analogous to the meaning of sentences. While discovering the fundamental relationships between building blocks and function emergence can be challenging, language models, such as recurrent neural networks and long short-term memory networks, and, in particular, attention models, such as the transformer architecture, can solve many such complex problems. Application areas of such models include protein folding, molecular property prediction, prediction of material failure of complex nonlinear architected materials, and generative strategies for materials discovery. We outline challenges and opportunities, especially focusing on extending the deep-rooted kinship of humans with symbolism toward generalizable artificial intelligence (AI) systems using neuro-symbolic AI, and outline how tools such as ChatGPT and DALL·E can drive materials discovery.
I. INTRODUCTION
The emergence of language in physics can be traced back several centuries BCE, when early-stage numeral notations, such as Greek numerals and counting rod numerals, were used,1 building off of earlier developments in the intellectual pursuits of explaining how the world worked through integrated theories of philosophy.2 With the development of mathematics, mathematical language, including algebra, topology, and calculus, became the fundamental language of physics. Since the mid-20th century, computational tools such as “numerics” have gained broad importance for solving problems (e.g., density functional theory),3,4 benefiting from the exquisite calculation capability of computers. More recently, as machine learning (ML) and artificial intelligence (AI) find numerous applications across many scientific fields of study, AI has the potential to become an additional indispensable type of language in physics that efficiently helps us learn complex underlying rules to build on and complement our existing mathematical frameworks, hence enriching conventional scientific exploration through theory, experiment, and computation5,6 (Fig. 1). In fact, promising AI models have emerged that can deal with complex symbolic representations of relationships, including human language,7–10 proteins,11–15 DNA,12,14,16 graphs,17,18 ontologies, and many other forms of descriptive associations.19–22
In the field of materials science, AI has gained importance owing to its power to predict material properties, design de novo materials, and discover new mechanisms beyond human intuition.5 Integrating AI, especially language models that have great potential for making end-to-end predictions, has become an important frontier in materials research. This is because conventional feedforward neural networks (FFNNs) are typically limited to performing tasks related to static patterns in data. In contrast, recursive models can both work iteratively based on a changing input and capture complex dynamical relationships across a range of scales within flexible data formats (e.g., graphs, sequences, pixel or field data, boundary conditions, or processing parameters). Based on these concepts, a class of language models has emerged as a powerful ML solution that can model flexible modalities of data and learn rich dynamical behaviors. These developments can be considered outgrowths of early forms of neural network architectures, such as Boltzmann machines,23 that describe the interrelated aspects of states of a physical system (including, possibly, metaphysical systems, in general).
More broadly, language models are powerful ML strategies to model long-range behaviors in terms of state variables and dynamical aspects, including flexible modalities of data. This is because they can easily integrate sequential, field, and other numerical inputs and learn extremely complicated dynamical behaviors across variegated forms of state variables. Modern language models that derive ultimately from original concepts, such as Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), and attention-based transformer networks, have shown a striking capability to solve complex problems in protein folding,24–26 molecular property prediction,14,20,27 and failure of complex nonlinear architected materials,28 to name a few (Table I). From a more general perspective, these problems can commonly be described as predicting the results of large numbers of interactions of elementary building blocks (atoms; molecules; amino acids; peptides; even words, musical notes, or vibrational patterns; etc.) to form more integrated, complex relationships with functions that ultimately far exceed those of individual building blocks,6,21,29 defined by their web of interrelated functors.29 Formulating such problems as attention-based problems helps with a critical task in the physical sciences, where small details can have an outsized impact on a solution (e.g., near a crack singularity, or to capture the impact of a single mutation on protein misfolding).
TABLE I.

| ML related terms | Features/definition |
|---|---|
| Feedforward neural network (FFNN) | Nodes connected with information flowing in one direction |
| Encoder–decoder framework | Two neural networks with the same or similar structure: one maps the input to representations and the other maps representations to the output |
| Convolutional Neural Network (CNN) | Captures features at different levels by calculating convolutions; takes pixel-based or voxel-based data as input |
| Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU) | Nodes connected with feedback loops; history information stored in hidden states; sequential data as input |
| Transformer | Encoder–decoder structure with multi-head attention mechanism; no requirements on sequence order |
| Natural language processing | Aimed at enabling computers to understand human language |
| Pre-training | The first training of a model on a generic task (often on an unlabeled dataset), the parameters of which are then used to adapt the model or train it toward a different task |
| Fine-tuning | Adjusting the parameters of a model to solve a certain task pertaining to a particular dataset, often based on a pre-trained model |
This is broadly important since hierarchical structuring exists in materials science across vast scales in time, length, and modalities. For instance, complex biomaterials with different functions, ranging from hair and spider silk to viruses, are all created through the assembly of universal amino acid building blocks, a process governed by molecular, quantum chemical principles but conditioned on environmental cues or gradients that are sensed by the entire system. While the protein folding rules are extremely complicated, end-to-end models such as trRosetta,30 AlphaFold2,24 OpenFold,31 and OmegaFold32 can now provide protein structure predictions with atomic accuracy, similar to the folded geometries observed in nature (Fig. 2). In a broader sense, the scientific paradigm devised by human civilizations over tens of thousands of years seeks to discover the fundamental relationships between building blocks and the emergence of function by using human intelligence.6 In this vein, end-to-end models, in which biological, chemical, physical, and other information, such as assembly conditions, is directly associated with the final material function, can be formulated using AI methods such that these relationships can be discovered and visualized autonomously from data. Important areas of application in physics also include the development of new neural network potentials to accelerate molecular dynamics modeling and even quantum-level processing.33–36
A prominent language model is the transformer network,7 based on attention mechanisms that originated in multiplicative modules introduced as sigma–pi units37,38 and higher order networks39,40 around 1990. Being one of the most widely used language models, the class of transformer models benefits from its attention architecture and is able to both learn universal truths of physical systems and solve specific tasks, forming the basis to explore a diverse set of downstream problems as it learns and develops features of a universal mathematical operator. Unlike models that are trained to solve very specific problems and, hence, may not generalize to other application areas, transformer models possess strong generalization capabilities.41 Originally based on Natural Language Processing (NLP) concepts, transformer models have been shown to be capable of learning how words relate to one another and innately include a higher level of interpretability.42–45 Applying this concept to universal building blocks of the physical world, e.g., molecules; material microstructures; and distinct hierarchical representations of materials, including biological systems, molecular building blocks, mesoscale structural components, and many others, transformer models can capture the functional relationships between constituting building blocks without a priori knowledge of what to pay attention to, or even what the functional group building blocks are.6 With attention learning, some parts of the input data are enhanced while the importance of others is diminished, reflecting a key aspect of many self-assembly and property prediction tasks: some, potentially small, features in the data require more focus to reflect their importance6 (Fig. 3). Since such architectures can deal with high-dimensional data, they can also be effectively combined with convolutional strategies to provide a direct approach to capture hierarchical phenomena in materials science.
In this perspective, we review recent progress in this emerging area of research and showcase how it can be beneficial when combined with related, earlier attempts to model physical systems using approaches such as mathematics or early-stage models in the neural network era, e.g., a traditional encoder–decoder framework that seeks to discover elementary physics via latent bottlenecks in a reduced parameter space.46–50 In terms of an outline of the work discussed in this paper, various language models are introduced in Sec. II, with a special focus on the transformer architecture. Section III discusses recent applications of deep language models in materials science, including dataset construction, structure/property prediction, and inverse material design. Finally, in Sec. IV, we summarize the key insights put forth in this perspective and provide forward-looking discussions in this area of research, including an exploration of the role that large language models, such as GPT-351 or Galactica,52 can play, especially when combined with generative models, such as DALL·E 253 or Stable Diffusion.54
II. LANGUAGE MODELS
A. Basic language models
The use of language models is inspired both by the general ontology by which we describe material processes, structures, and functions and by the quest to learn dynamical behaviors of physical systems. These aspects are critical in a variety of applications ranging from text, music, video, and robotics to predicting the behavior of soft matter systems, hierarchical material architectures, and many others. Motivated especially by the need to model sequential data, language models have emerged as a powerful machine learning solution to this class of problems, starting with the Recurrent Neural Network (RNN),55 Long Short-Term Memory (LSTM),56 and Gated Recurrent Unit (GRU)57 and, more recently, moving toward the era of attention models10,58 featuring transformer networks7 [Fig. 4(b)] that can capture a rich set of data modalities, including but not limited to sequential data.
As illustrated in Fig. 4(a), compared with a simple FFNN, an RNN55 has feedback loops43 to process new information together with the outputs from prior steps at each time step and has thus become a widely used neural network architecture for various tasks involving sequential data, such as speech recognition,59,60 language modeling,61 and image captioning.62 While an RNN suffers from limited short-term memory, the LSTM56 greatly improves on it by learning long-term dependencies. The key differences between the RNN and the LSTM are the operations performed within LSTM cells. The memory of the LSTM, shown in Fig. 4(b), which runs as a horizontal line at the top, has the ability to forget, update, and add context aided by different operations within an LSTM cell, namely, the forget gate, update gate, and output gate,63 and can thereby capture longer-range relationships. The GRU57 architecture can be considered an alternative to the LSTM since it has a similar structure with minor modifications, where the forget and input gates are merged and an extra reset gate is used to update memory with the old state at time step t − 1 and the new input at time step t. One potential issue with all three neural networks described so far is the requirement of compressing all the necessary information of a source sequence into a fixed-length vector, making it difficult for the neural network to cope with long sequences.58 This is especially true if there exist very long-range relationships (e.g., how a small nuance at the beginning of the sequence interacts with a small nuance much later in the sequence is hard to capture); yet, such relationships are common in many physics problems, such as protein folding or self-assembly. All these architectures are also limited to sequential data, whereas relevant problems in physics may involve a combination of field and temporal data, for instance, requiring a more flexible description.
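As a minimal illustration of these recurrent building blocks (a sketch in PyTorch, not taken from the cited works; the layer sizes and the use of the final hidden state for a sequence-level prediction are arbitrary choices for demonstration), the hidden and cell states that carry memory along a sequence can be exposed as follows:

```python
# Minimal sketch (PyTorch assumed): processing a batch of sequences with an LSTM and a GRU,
# illustrating the hidden/cell states that carry "memory" along the sequence.
import torch
import torch.nn as nn

batch, seq_len, n_features, n_hidden = 8, 100, 16, 64

lstm = nn.LSTM(input_size=n_features, hidden_size=n_hidden, batch_first=True)
gru = nn.GRU(input_size=n_features, hidden_size=n_hidden, batch_first=True)

x = torch.randn(batch, seq_len, n_features)   # e.g., a time series of state variables

out, (h_n, c_n) = lstm(x)   # out: hidden state at every step; (h_n, c_n): final memory
out_gru, h_gru = gru(x)     # GRU merges forget/input gates and carries a single hidden state

# A sequence-level prediction typically uses only the final hidden state,
# i.e., the fixed-length vector that must compress the whole history:
head = nn.Linear(n_hidden, 1)
y = head(h_n[-1])           # shape: (batch, 1)
```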
Unlike previous approaches that make predictions using only the final hidden state output with rather condensed information, attention models pay attention to each hidden state at every step and make predictions after learning, with a neural network, how informative each state is.58 The key to attention learning is that it successively enhances certain parts of the input data while diminishing the importance of others. This realizes a key aspect of many self-assembly and feature prediction tasks in materials science: some, potentially small, features in the data require us to dedicate more focus to reflect their importance (e.g., a point mutation in a biopolymer or DNA, or a singularity in a fracture study).6 Attention learning is often computed with deep layers, where the attention operation is carried out repeatedly to capture highly complex relationships. Usually, increasing the number of layers leads to more trainable parameters in the model and enables the model to learn more complicated relationships. For example, the well-known language model GPT-351 features 96 layers and approximately 175 × 10⁹ parameters. Compared to smaller models with 12–40 layers, such larger language models have been shown to make increasingly efficient use of in-context information and achieve better performance across different benchmarks.51 The detailed architecture of the fundamental attention model at the heart of these approaches will be discussed in Sec. II B, with the transformer architecture as an example. Besides transformer networks, other salient neural architectures used together with attention learning include the encoder–decoder framework,58 memory networks,64–66 and graph attention networks.67
Attention models have emerged as the state of the art for multiple tasks in NLP,9 computer vision,68 cross-modal tasks,69 and recommender systems,70 offering several other advantages beyond performance improvements, such as enhanced interpretability.71 The interest in interpretability stems from the fact that, unlike other “black-box” models, attention models allow a direct inspection of the internal deep neural network architecture.71 Visualizing attention weights shows how relevant a specific region of the input is for the prediction of the output at each position in a sequence. For instance, Bahdanau et al.58 visualized attention weights and showed automatic alignment of sentences in French and English despite the different locations of subject–verb–noun in the two languages. In an image captioning task, Xu et al.72 showed that image regions with high attention weights have a significant impact on the generated text. Similar studies have been carried out for protein sequences, where protein language models, such as ProtBERT,12 can be explained, with regard to how they learn the form, shape, and function of proteins during the training process, by direct visualization of the output embeddings.
B. Transformer architecture
While recurrent architectures rely on sequential processing of the input at the encoding step, which results in computational inefficiency, transformer architectures completely eliminate sequential processing and recurrent connections, relying only on attention mechanisms to capture global dependencies between input and output,7,71 in some ways resembling early architectures, such as the Boltzmann machine. The detailed transformer architecture is shown in Fig. 5(b) in contrast to a conventional encoder–decoder structure [Fig. 5(a)], where the layers of the encoder and decoder contain extra attention sublayers rather than simple FFNNs.
The transformer network employs an encoder–decoder structure, where the encoder maps an input sequence of symbol representations x to a sequence of continuous representations z and the decoder generates an output sequence y of symbols one element at a time.7 At the embedding stage, positional encoding is used to provide information about the relative or absolute position of the tokens in the sequence. The main architecture is composed of a stack of N identical encoder and decoder layers with two sublayers: a position-wise fully connected FFNN layer and a multi-head attention layer. As a fully connected FFNN applies a linear transformation to the input data, “position-wise” emphasizes that the same transformation is applied to each position in the sequence independently, straightforwardly enabling parallel processing. (That is, within the same layer, the parameters of the FFNN are shared across different positions, facilitating such parallel processing.) The decoder is similar to the encoder, except that it contains an additional third sublayer, which performs multi-head cross-attention over the output of the encoder stack. Unlike self-attention, which operates on inputs generated from the same embedding, the cross-attention calculation is performed on inputs from different embeddings, or a different signal altogether. As shown in Fig. 5(b), the cross-attention in a transformer model is fed with data K, V from the encoder stack and data Q from the output embedding. In the decoder's first multi-head self-attention sub-module, masking is applied to ensure that the predictions for position i can depend only on the known outputs at positions less than i. Finally, normalization and residual connections are applied.
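The following minimal sketch (in PyTorch, used here purely for illustration; the dimensions, vocabulary size, and layer counts are arbitrary choices) shows how these ingredients, namely embedding, positional encoding, stacked encoder/decoder layers, and causal masking, fit together:

```python
# Minimal sketch (PyTorch assumed) of an encoder-decoder transformer as described above:
# token embedding + sinusoidal positional encoding, N stacked layers, and a causal mask so
# that position i in the decoder can only attend to known outputs at positions < i.
import math
import torch
import torch.nn as nn

d_model, n_heads, n_layers, vocab = 128, 8, 4, 1000

embed = nn.Embedding(vocab, d_model)

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding as in the original transformer paper."""
    pos = torch.arange(seq_len).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

model = nn.Transformer(d_model=d_model, nhead=n_heads,
                       num_encoder_layers=n_layers, num_decoder_layers=n_layers,
                       batch_first=True)

src_tok = torch.randint(0, vocab, (2, 20))   # input sequence x (batch of 2)
tgt_tok = torch.randint(0, vocab, (2, 15))   # output sequence y generated so far

src = embed(src_tok) + positional_encoding(20, d_model)
tgt = embed(tgt_tok) + positional_encoding(15, d_model)

tgt_mask = model.generate_square_subsequent_mask(15)  # causal masking in the decoder
z = model(src, tgt, tgt_mask=tgt_mask)       # (2, 15, d_model): decoder representations
logits = nn.Linear(d_model, vocab)(z)        # next-symbol predictions
```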
Attention mechanisms. The core part of a transformer architecture is the multi-head self- and cross-attention mechanism.7 While conventional neural networks, such as the Convolutional Neural Network (CNN), fall short in capturing complex relationships because they may neglect small details when coarse-graining the data (e.g., at every convolutional layer in a conventional deep CNN, the signal is further coarse-grained), the multi-head attention mechanism allows the transformer model to carry out a discovery strategy with all details across scales taken into consideration. As illustrated in Fig. 6, for physical problems, such as understanding the hierarchical structure of spider webs,73–75 conceptually, multiple heads in the transformer architecture could help capture features from the finest scale, e.g., amino acids, to the macroscale, where topology is characterized (these features are learned and so may not exactly reflect this simplistic summary, but we may generally think of the underlying mechanisms of how relationships are understood in that way). Hence, the importance of even small details can be reflected, and intricate relationships that govern the physics and materials of interest can, indeed, be discovered. The mathematical details of the multi-head attention mechanism are as follows.
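In the standard formulation of Ref. 7 (reproduced here with the usual notation, where the queries Q, keys K, and values V are linear projections of the token embeddings and d_k is the key dimension), the scaled dot-product attention reads

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,
\]

and the multi-head variant concatenates h such attention operations computed with independently learned projection matrices W_i^Q, W_i^K, W_i^V, and W^O,

\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \quad
\mathrm{head}_i = \mathrm{Attention}\!\left(QW_i^{Q},\, KW_i^{K},\, VW_i^{V}\right).
\]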
As one of the state-of-the-art attention models, transformer networks can thereby capture long-range dependencies between input and output, support parallel processing, require minimal prior knowledge, scale to large sequences and datasets, and allow domain-agnostic processing of multiple modalities (text, images, and speech) using similar processing blocks.71 Since the inception of the transformer, there has been increasing interest in applying attention models in a wide range of research areas, which has also led to a growing number of transformer variants, such as the Vision Transformer,76 Trajectory Transformer,77 and physics-focused transformer models.78 Important developments also include linear-scaling approaches that reduce computation time and memory consumption, such as the Reformer,79 Perceiver,80 TurboTransformer,81 and Transformer-XL.82
C. Pre-trained large language models
Despite the success of neural network models for NLP, performance improvements are limited by the small curated datasets available for most supervised tasks (e.g., relationships between biological sequences and physical properties). To address this, the advent of pre-trained language models has revolutionized many applications in NLP, thanks to their training on large corpora with little or no labeling. Moreover, it has been shown that such strategies facilitate fine-tuning for downstream tasks and convenient usage. Among them, transformer-based pre-trained language models have been especially popular. There are two main types of such models finding applications in materials science: (1) models using architectures similar to Bidirectional Encoder Representations from Transformers (BERT)83 and pre-trained by corrupting and reconstructing the original sequence, such as ESM,84 MatSciBERT,85 ProtTrans,12 and ProteinBERT,86 the most direct application of which is sequence embedding; and (2) models using autoregressive training, the most famous of which is the Generative Pre-trained Transformer (GPT) series,51,87 such as ProGen,88,89 DARK,90 and ProtGPT2,91 which show great potential for protein design.
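As a hedged illustration of these two families (a sketch based on the Hugging Face transformers library; the model identifiers and preprocessing conventions shown here are examples that should be verified against the respective model documentation), a masked protein language model can be queried for embeddings, while an autoregressive one can be sampled for de novo sequences:

```python
# Hedged sketch using the Hugging Face "transformers" library (assumed available).
# Model identifiers are examples of the two families discussed above; verify them
# on the model hub before use.
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModelForCausalLM

# (1) BERT-style model, pre-trained by corrupting/reconstructing sequences:
#     its hidden states serve directly as sequence embeddings.
tok_bert = AutoTokenizer.from_pretrained("Rostlab/prot_bert")
bert = AutoModelForMaskedLM.from_pretrained("Rostlab/prot_bert")
seq = " ".join("MKTAYIAKQR")      # ProtBERT expects space-separated residues
emb = bert(**tok_bert(seq, return_tensors="pt"),
           output_hidden_states=True).hidden_states[-1]   # (1, seq_len, d_model)

# (2) GPT-style autoregressive model: sampling generates de novo sequences.
tok_gpt = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
gpt = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")
out = gpt.generate(**tok_gpt("M", return_tensors="pt"), max_length=60,
                   do_sample=True, pad_token_id=tok_gpt.eos_token_id)
print(tok_gpt.decode(out[0]))     # a sampled candidate protein sequence
```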
For a specific downstream task, these pre-trained language models can then be easily fine-tuned with significantly less labeled data, thus avoiding training a brand-new model from scratch, simply by adding a new decoder and training part or all of the model. This can be done by replacing the final few neural network layers so that the parameters of the large model are preserved and further adapted.6 Fine-tuning then requires us to train either the final head layers or the entire model, both being much easier than training from scratch. With such adaptations, pre-trained large language models that do not specifically focus on physics or materials science related knowledge can also be exploited for materials research. For instance, the recently developed large language model trained on a scientific knowledge corpus, Galactica,52 has shown its potential to act as a bridge between scientific modalities and natural language toward biological and chemical understandings of materials, with downstream tasks such as MoleculeNet classification and protein function prediction. Another AI model, ChatGPT,92 built on top of GPT-3 and adapted for dialogue, has been released recently and could also be an interesting approach for research in the physical sciences, especially in conjunction with generative text-to-image methods. For example, if we ask ChatGPT to describe the microstructure of a very compliant material and then use its answers to generate images with DALL·E 253 (DALL·E 2 is an AI system that can create realistic images and art from text), we would obtain images of structures of various compliant material designs. While this type of research is at an early stage, it could already be promising for automated dataset generation (see Fig. 7 for an example).
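A minimal sketch of this fine-tuning strategy (in PyTorch; the backbone interface is a simplifying assumption, since real pre-trained models return richer output objects) freezes the pre-trained parameters and attaches a new task head:

```python
# Minimal sketch (PyTorch assumed) of the fine-tuning strategy described above:
# freeze the pre-trained backbone and train only a small task-specific head.
import torch.nn as nn

class FineTuned(nn.Module):
    def __init__(self, pretrained_backbone, d_hidden, n_outputs):
        super().__init__()
        self.backbone = pretrained_backbone
        for p in self.backbone.parameters():
            p.requires_grad = False          # preserve the large model's parameters
        self.head = nn.Sequential(           # new decoder/head trained on the downstream task
            nn.Linear(d_hidden, 64), nn.ReLU(), nn.Linear(64, n_outputs))

    def forward(self, x):
        features = self.backbone(x)          # assumed to return a (batch, d_hidden) embedding
        return self.head(features)

# Only self.head.parameters() would be passed to the optimizer; unfreezing the backbone
# later ("full fine-tuning") is the alternative discussed in the text.
```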
III. APPLICATIONS IN MATERIALS SCIENCE
A. Dataset construction
Data are a key element of machine learning models, and sufficient, high-quality data are essential for models to work efficiently. Usually, researchers either collect data from the existing literature and databases or generate them on demand with high-throughput experiments or simulations.
Although there exist many materials databases, such as MATDAT,94 MatWeb,95 MatMatch,96 and MatNavi,97 researchers may need to mine appropriate data from numerous studies and across different databases, where text processing techniques can be utilized to replace manual labor. It has been demonstrated that NLP can not only efficiently encode materials science knowledge present in the published literature but also map the unstructured raw text onto structured database entries that allow for programmatic querying, with Matscholar as an example.98,99 Other examples of datasets gathered and curated by NLP models can be found across materials science, although progress is still at an early stage.100 Aided by NLP techniques, Kim et al. developed an automated workflow comprising article retrieval, text extraction, and database construction to build a dataset of aggregated synthesis parameters computed from the text of over 640 000 journal articles.101 To describe the temperatures of ferromagnetic and antiferromagnetic phase transitions, Court and Cole102 assembled close to 40 000 chemical compounds and associated Curie and Néel magnetic phase-transition temperatures from almost 70 000 chemistry and physics articles using ChemDataExtractor,103 an NLP toolkit for the automated extraction of chemical information.
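To make the idea of mapping raw text onto structured, queryable records concrete, the toy sketch below (plain Python with a hand-written pattern; it is not the ChemDataExtractor API, and the example sentence and compound are purely illustrative) extracts a compound/temperature pair of the kind such pipelines collect at much larger scale and with far greater robustness:

```python
# Toy illustration only (not the ChemDataExtractor API): pulling a compound/temperature
# pair from raw text with a hand-written pattern, to show the kind of structured record
# an NLP pipeline would produce automatically across thousands of articles.
import re

text = "The Curie temperature of La0.7Sr0.3MnO3 was measured to be 370 K."
pattern = re.compile(r"Curie temperature of\s+(?P<compound>\S+).*?(?P<T>\d+(?:\.\d+)?)\s*K")

for m in pattern.finditer(text):
    record = {"compound": m.group("compound"),
              "property": "Curie temperature",
              "value_K": float(m.group("T"))}
    print(record)   # -> structured database entry that allows programmatic querying
```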
Recent developments of pre-trained large language models, such as MatSciBERT,85 could also greatly benefit the dataset construction process, accelerating materials discovery and information extraction from materials science texts. Moreover, if an appropriate pre-trained language model is employed for the main research task, the amount of self-generated experimental or simulation data can be reduced to some extent, since the transferred model still preserves some of the general predictive capability of the pre-trained model.
In addition to text extraction and serving as pre-trained models, language models can help extract valuable data from images, facilitating data labeling, especially for experiments, which would otherwise require considerable human effort. Microscopy images, which characterize the microscopic- to atomic-scale structure of materials and are predominantly sourced from scanning and transmission electron microscopy (SEM and TEM, respectively), contain a wide range of quantitative data that are useful in the design and understanding of functional materials.100 Image segmentation and classification, for example, are essential steps for constructing labeled image datasets. The Vision Transformer,68 pre-trained on a large proprietary dataset, can be fine-tuned to perform such downstream recognition tasks, as benchmarked on ImageNet classification. It has also been shown that TransUNet,104 which combines the merits of transformers and U-Net, is a strong alternative for medical image segmentation, achieving superior performance to various competing methods on different medical applications, including multi-organ segmentation and cardiac segmentation.
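As a sketch of how such a pre-trained vision model could be adapted to a materials imaging task (torchvision is assumed as the dependency; the number of classes and the decision to replace only the classification head are illustrative choices, not a prescription from the cited works):

```python
# Hedged sketch (torchvision assumed): fine-tuning a pre-trained Vision Transformer
# for a downstream microscopy-image classification task by replacing its head.
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

n_classes = 5                                          # e.g., microstructure categories
model = vit_b_16(weights=ViT_B_16_Weights.DEFAULT)     # ImageNet pre-trained weights
model.heads.head = nn.Linear(model.heads.head.in_features, n_classes)

# As before, one can freeze the transformer blocks and train only the new head,
# or fine-tune end to end if enough labeled micrographs are available.
```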
B. Structure and property predictions
Structure and property predictions have long been at the frontier of materials science, and language models are now adding a new page in this area. Models such as RNNs, LSTMs, and transformers have shown the capability of solving complex problems in protein folding,16,24–26 material property prediction,14,20,27,105 and failure of complex nonlinear architected materials.28
Protein folding, that is, protein structure prediction that yields full-atomistic geometries of this class of biomolecules, has been an important research problem for large-scale structural bioinformatics and biomaterial investigation. Although progress remained stagnant over the last two decades, recent applications of deep language models, which enable end-to-end predictions, have largely solved the folding problem for single-domain proteins,106 with methods such as trRosetta,30 AlphaFold2,24 and newer variations, such as OpenFold31 or OmegaFold,32 now becoming the state-of-the-art tools. Taking solely the protein amino acid sequence as input and leveraging multiple sequence alignments (MSAs), AlphaFold2 is able to regularly predict protein structures with atomic accuracy, with the transformer architecture actively incorporated in the model24 [Fig. 8(a)]. Kandathil et al.16 developed an ultrafast deep learning-based predictor of protein tertiary structure that uses only an MSA as input, where a system of GRU layers is utilized to process and embed the input MSA and output a feature map. Predictions of protein structure are also possible without using MSAs. Chowdhury et al.107 reported the development of an end-to-end differentiable recurrent geometric network (RGN) that uses a protein language model to learn latent structural information from unaligned proteins. On average, RGN2 outperforms AlphaFold224 and RoseTTAFold108,109 on orphan proteins and classes of designed proteins while achieving up to a 10⁶-fold reduction in computation time.107
In addition, deep language models have significant potential in predicting features and properties of various materials. Regarding biological materials that have amino acids as building blocks, end-to-end models have been developed to learn molecular properties, including secondary structures,14 thermal stability,111 mechanical strength,13 and normal mode frequencies,112 providing novel avenues for protein engineering, analysis, and design. For instance, to predict the normal mode frequencies of proteins, an end-to-end transformer taking solely the amino acid sequence as input has been trained to achieve a performance (R²) as high as 0.85 on a test set112 [Fig. 9 (left)]. As shown in Fig. 9, research investigating the thermal stability of collagen triple helices suggests that an end-to-end small transformer model trained from scratch and a ProtBERT-based large model result in similar performance on test data (R² = 0.84 vs 0.79, respectively).111 In terms of properties in the general materials domain, the FieldPerceiver,78 a physics-focused transformer network built on elementary building blocks, is able to learn by categorizing interactions of the elementary material units that define the microstructure of a material and then predict the resulting material behavior, such as stress and displacement fields (Fig. 10). The relatively unlimited range of transformer models in associating input with output data enables the prediction of both the local and long-range organization of the target field. The prediction results of stress and displacement fields shown in Fig. 10 suggest that the model can start from a pre-trained state and easily transfer physical insights to cases that have a distinct solution.78
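A minimal sketch of such an end-to-end sequence-to-property model (in PyTorch; this is not the published architecture, and the vocabulary size, pooling choice, and layer dimensions are arbitrary assumptions for illustration) could look as follows:

```python
# Minimal sketch (PyTorch assumed), not the published models: an encoder-only transformer
# that maps an amino acid sequence end to end to a scalar property
# (e.g., a normal mode frequency or a stability measure).
import torch
import torch.nn as nn

class SeqToProperty(nn.Module):
    def __init__(self, vocab=25, d_model=128, n_heads=8, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.regressor = nn.Linear(d_model, 1)

    def forward(self, tokens):                  # tokens: (batch, seq_len) integer-coded residues
        h = self.encoder(self.embed(tokens))    # contextual embeddings for every residue
        return self.regressor(h.mean(dim=1))    # pool over the sequence -> scalar property

model = SeqToProperty()
y_hat = model(torch.randint(0, 25, (4, 60)))    # (4, 1) predicted property values
```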
Mechanical problems involving nonlinearities, such as plasticity, fracture, and dynamic impact, are known to be challenging and computationally expensive for conventional numerical simulation schemes.5 Machine learning techniques, especially deep language models, have provided efficient approaches to such problems. Hsu et al.28 presented an AI-based multiscale model with a convolutional LSTM for predicting fracture patterns in crystalline solids based on molecular simulations. As shown in Fig. 11, the proposed approach not only shows excellent agreement regarding the computed fracture patterns but also predicts fracture toughness values under mode I and mode II loading conditions. Lew et al.113 used similar machine-learning approaches to predict nanoscopic fracture mechanisms, including crack instabilities and branching as a function of crystal orientation, focusing on a particular technologically relevant material system, graphene. Another machine-learning model has been proposed to predict the brittle fracture of polycrystalline graphene under tensile loading, integrating a convolutional neural network (CNN), a bidirectional RNN, and fully connected layers to process the spatial and sequential features.114 Furthermore, it was demonstrated that a progressive transformer diffusion model can effectively describe the dynamics of fracture, achieving great generalization with limited training data and capturing important aspects, such as crack dynamics, instabilities, and initiation mechanisms.115 The incorporation of attention approaches into progressive diffusion methods, combined with sophisticated convolutional architectures that include ResNet blocks and skip connections, has emerged as a powerful, generalizable architecture that can capture, predict, and generalize behaviors across different physical systems and can also solve degenerate inverse problems that have multiple solutions (e.g., finding a set of material microstructure candidates that meet a certain design demand, such as a target stress–strain relationship, as shown in recent work116,117). A particularly noteworthy feature of diffusion models is the training process, where the stochastic nature of training the denoising neural network provides avenues to avoid overfitting while achieving excellent coverage during inference, including conditions that were not present in the training data (this is partly due to minimizing the Kullback–Leibler divergence for the Gaussian noise terms). This, combined with the ability to condition the input to these models to provide solutions for a variety of boundary value problems, holds great promise for a variety of applications in physics.
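To make the denoising-diffusion training idea referred to here concrete, the sketch below (in PyTorch; the linear noise schedule and the generic denoiser callable are illustrative assumptions, not the architectures of the cited works) noises a sample with the Gaussian forward process and trains a network to predict that noise:

```python
# Hedged sketch (PyTorch assumed) of a denoising-diffusion training step: noise a sample
# with the Gaussian forward process and train a denoiser to predict that noise; "denoiser"
# stands in for the attention/ResNet U-Net architectures cited in the text.
import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)                 # noise schedule (assumed linear)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(denoiser, x0):
    t = torch.randint(0, T, (x0.shape[0],))           # random timestep per sample
    eps = torch.randn_like(x0)                        # Gaussian noise
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps        # forward (noising) process q(x_t | x_0)
    return torch.nn.functional.mse_loss(denoiser(x_t, t), eps)  # simplified KL-derived objective
```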
C. Material design
As language models have demonstrated satisfactory performance in predicting the structure and properties of materials, integrating such models with generative neural networks, such as Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs), enables researchers to efficiently explore the tremendous material design space, which is intractable with conventional methods.
Hsu et al.21 have developed a machine learning based approach to generate 3D architected materials based on human-readable word input, enabling a text-to-material translation. The model combines a Vector Quantized Generative Adversarial Network (VQGAN) and Contrastive Language–Image Pre-training (CLIP) neural networks to generate images, which are then translated into 3D architectures that feature fully periodic, tileable unit cells.21 Such language-based design approaches can have a profound impact on end-to-end design environments and drive a new understanding of physical phenomena that intersect directly with human language and creativity.21 Researchers have also shown that an approach combining an RNN-based model and an evolutionary algorithm (EA) can realize inverse design of 4D-printed active composite beams. Moreover, it has been illustrated that hierarchical assemblies of building blocks, with elementary flame particles as an example, can be created using a combination of GAN and NLP models.118 In terms of protein engineering, the pre-trained ProtGPT291 was shown to generate de novo protein sequences following the principles of natural ones, paving the way for efficient high-throughput protein engineering and design.
More recently, Wu et al.110 have developed a diffusion-based generative model with a transformer architecture, to mimic the native folding process and design protein backbone structures, which are described as a series of consecutive angles capturing the relative orientation of the constituent amino acid residues [Fig. 8(b)]. The generated backbones from this model better respect protein chirality and exhibit greater designability compared to prior works that use equivariance assumptions.110 In these examples, the integration of various architectures to model and synthesize solutions, including, in particular, the use of language-based descriptors for field synthesis as done in VQGAN, is an exciting opportunity in the physical sciences that can integrate both the knowledge derived from existing theories [e.g., synthetic datasets generated from solutions with density functional theory (DFT), MD, and coarse-graining] and experimental, empirical data for which we may not yet have a theory (e.g., protein dynamics, structures, functions, and biological data). This can be particularly useful for developing predictive mesoscale theories for which we do not have closed-form theoretical frameworks but for which we can generate large datasets that capture key relationships. The examples discussed in this paper, e.g., modeling physical phenomena using the FieldPerceiver model or solving inverse problems116,117 using attention–diffusion models, can be viewed as early adaptations of this general perspective.
IV. SUMMARY, PERSPECTIVE, AND OUTLOOK
The emergence of attention models, featuring, for instance, transformer architectures and similar strategies, with advantages in predictive capability, generalization, and interpretability, has revolutionized the application areas of language paradigms beyond conventional NLP. For materials research, exciting opportunities have been created with recent advances in natural language modeling from all aspects of investigation, including dataset construction, structure and property prediction, and inverse material design. As reviewed in Sec. III, these models have shown significant potential in a wide variety of complex tasks, such as automated material information extraction,85 protein folding,16,24–26 molecular property prediction,14,27 fracture pattern investigation,20,28,113,115 and inverse design for architected materials.21
When formulating materials science problems with machine learning strategies, one challenge is to identify inputs/outputs and build appropriate datasets through autonomous experimentation, multiscale simulations, or the NLP methods mentioned before. While the data modalities are essentially unrestricted when researchers use language models, the size, quality, and diversity of the dataset significantly influence model performance. Although current language models have been shown to be capable of solving complex tasks, the challenge of achieving better accuracy, scalability, and generalization remains and should be addressed in future research.
As the transformer model and its variations have proven to be successful and powerful, an interesting direction for future work could be mining the transformer architecture, especially its attention mechanism, for interpretation to yield even broader and deeper physical insights. The availability of graph-like attention maps (akin to the schematic shown in Fig. 3) provides a variety of interpretability strategies, from mining the attention maps to reverse engineering the localization where specific types of physical principles are captured. Analyzing the multi-head attention maps could not only benefit interpretation of the model but also establish a strong foundation for model architecture improvements (by automatically optimizing the multi-head division, for instance). In addition, since the computational cost of the transformer increases quadratically with sequence length, reducing computation time and memory consumption to achieve better scalability is also an important research theme. To address this problem, researchers are actively developing linear-scaling approaches, including the Reformer,79 Perceiver,80 TurboTransformer,81 Transformer-XL,82 and many other emerging developments.
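As a starting point for such attention-map mining (a hedged sketch using the Hugging Face transformers library; the generic English model identifier is only a placeholder, and a domain-specific model, such as a protein or materials language model, would be substituted in practice):

```python
# Hedged sketch: extracting per-head attention maps for interpretation with the
# Hugging Face "transformers" library; the model identifier is a placeholder example.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tok("stress concentrates near the crack tip", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions          # tuple: one tensor per layer

layer, head = -1, 0
A = attentions[layer][0, head]                       # (seq_len, seq_len) attention weights
print(A.sum(dim=-1))                                 # rows sum to 1: a graph-like attention map
```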
Another long-standing challenge in this area is the generalization capability of models, which characterizes a model's ability to respond to unseen data. With the increasing application of deep language models in materials science, a systematic exploration of the generalization of such models on tasks in a materials context, such as predicting physical and chemical phenomena of materials, would be an essential future step. The usefulness of deep learning approaches will expand drastically if they capture generalizable truths in such a way that these functorial relationships can be extended toward new solutions distinct from the training data provided, not only unleashing predictive power but also explaining relationships in a way that the human mind can understand. Recent developments of Galactica52 and GPT-351 and the emergence of generative models that can synthesize human knowledge toward providing novel solutions (e.g., DALL·E 253,93 or Stable Diffusion54 in image generation) are first steps toward that goal. Furthermore, self-assembling and self-organizing AI, made out of highly interconnected simple building blocks such as RNNs, might be a candidate toward better robustness and generalization since the absence of any centralized control could allow it to quickly adjust to changing conditions.119–121 These and related questions are critical for the community to explore and address, in order to develop a foundational understanding of the limitations and opportunities of such approaches.
Going beyond learning phenomena in materials science applications, learning generic symbols representing fundamental physics rules could be another interesting research direction. Several attempts have been made, including transformer-based models122,123 and DeepONet,124 some of which are capable of learning various explicit and implicit operators and achieve highly accurate solutions to differential systems. Moreover, emerging neuro-symbolic AI systems,125–130 which combine neural networks with symbolic AI and, hence, benefit from a three-way interaction between neural, symbolic, and probabilistic modeling and inference, have the potential to achieve human-style comprehension [Fig. 12(a)]. The key to such emerging neuro-symbolic methodology is how to learn representations through neural nets and make them available for downstream use, symbolically.130 For instance, as shown in Fig. 12(b), using a neuro-symbolic reasoning module to bridge the learning of visual concepts, words, and semantic parsing of sentences without any explicit annotations, the neuro-symbolic concept learner is trained to reason about what it “sees,” which is analogous to human concept learning.127 Several future research directions include the exploration of symbolic knowledge extraction from large networks and efficient reasoning within neural networks about what has been learned.129,130
Looking ahead, the potential of deep language models for materials research across scales has yet to be fully exploited, and there are plenty of new opportunities as well as challenges. The future of this area is exciting and promising, with continuously improving machine learning models and perhaps, one day, even generalizable AI systems41 that can successfully capture and amalgamate a variety of information modalities and can perhaps be combined with human-readable mathematical notations. This, combined with autonomous experimentation and data collection as well as synthesis routes, offers many important challenges for researchers in the physical sciences. A particular opportunity is to devise architectures that extract mechanistic, useful information from very deep models and make it understandable to human minds when possible, to build on our deep-rooted kinship with symbolism that has been an integral part of the human experience for millennia. Once we cross the bridge between models such as GPT-3 or ChatGPT that can understand and produce human language and our well-developed tools for mathematics, numerical methods, and large-scale scientific computing, exciting things will be possible that blur the boundary between language, physics, observation, and engineering design. This may facilitate a phase of accelerated discovery that will hopefully be put to good use for the benefit of civilization.
ACKNOWLEDGMENTS
This work was funded by the National Institutes of Health (Grant Nos. U01ED014976 and 1R01AR07779), the U.S. Department of Agriculture (USDA) (Grant No. 2021-69012-35978), the U.S. Army Research Office (Grant No. W911NF2220213), the Department of Energy Strategic Environmental Research and Development Program (DOE-SERDP) (Grant No. WP22-3475), and the Office of Naval Research (ONR) (Grant Nos. N00014-19-1-2375 and N00014-20-1-2189).
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Yiwen Hu: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Investigation (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). Markus J. Buehler: Conceptualization (equal); Funding acquisition (equal); Investigation (equal); Methodology (equal); Project administration (equal); Resources (equal); Software (equal); Supervision (equal); Writing – original draft (equal); Writing – review & editing (equal).
DATA AVAILABILITY
Data sharing is not applicable to this article as no new data were created or analyzed in this study.