Machine learning (ML) has emerged as an indispensable methodology to describe, discover, and predict complex physical phenomena, helping us efficiently learn underlying functional rules, especially in cases where conventional modeling approaches cannot be applied. While conventional feedforward neural networks are typically limited to performing tasks related to static patterns in data, recursive models can both work iteratively on a changing input and discover complex dynamical relationships in the data. Deep language models can model flexible modalities of data and are capable of learning rich dynamical behaviors as they operate on discrete or continuous symbols that define the states of a physical system, yielding great potential toward end-to-end predictions. Similar to how words form a sentence, materials can be considered as a self-assembly of physically interacting building blocks, where the emerging functions of materials are analogous to the meaning of sentences. While discovering the fundamental relationships between building blocks and the emergence of function can be challenging, language models, such as recurrent neural networks and long short-term memory networks, and, in particular, attention models, such as the transformer architecture, can solve many such complex problems. Application areas of such models include protein folding, molecular property prediction, prediction of failure in complex nonlinear architected materials, and generative strategies for materials discovery. We outline challenges and opportunities, especially focusing on extending the deep-rooted kinship of humans with symbolism toward generalizable artificial intelligence (AI) systems using neuro-symbolic AI, and outline how tools such as ChatGPT and DALL·E can drive materials discovery.

The emergence of language in physics can be traced back several centuries BCE, when early-stage numeral notations, such as Greek numerals and counting rod numerals, were used,1 building on earlier developments in the intellectual pursuits of explaining how the world worked through integrated theories of philosophy.2 With the development of mathematics, mathematical language, including algebra, topology, and calculus, has become the fundamental language of physics. Since the mid-20th century, computational tools such as “numerics” have gained broad importance for solving problems (e.g., density functional theory),3,4 benefiting from the exquisite calculation capability of computers. More recently, as machine learning (ML) and artificial intelligence (AI) find numerous applications across many scientific fields of study, AI has the potential to become an additional indispensable type of language in physics that efficiently helps us learn complex underlying rules to build on and complement our existing mathematical frameworks, hence enriching conventional scientific exploration in theory, experiment, and computation5,6 (Fig. 1). In fact, promising AI models have emerged that can deal with complex symbolic representations of relationships, including human language,7–10 proteins,11–15 DNA,12,14,16 graphs,17,18 ontologies, and many other forms of descriptive associations.19–22 

FIG. 1.

Evolution of the use of language in physics as a descriptive tool for discovery, modeling, and prediction. The language in physics has evolved from early-stage numeral notations1 to traditional and modern computational mathematics,3,4 which includes algebra, topology, and calculus. Nowadays, with the fast development in machine learning and artificial intelligence (AI), we have entered the AI era,5,6 where AI becomes an indispensable type of language in physics that efficiently helps us learn complex underlying rules that complement mathematical symbolism and operations.


In the field of materials science, AI has gained importance owing to its power to predict material properties, design de novo materials, and discover new mechanisms beyond human intuition.5 Integrating AI, especially language models that have great potential in making end-to-end predictions, has become an important frontier in materials research. This is because conventional feedforward neural networks (FFNNs) are typically limited to performing tasks related to static patterns in data. In contrast, recursive models can both work iteratively on a changing input and capture complex dynamical relationships across a range of scales within flexible data formats (e.g., graphs, sequences, pixel or field data, boundary conditions, or processing parameters). Based on these concepts, a class of language models has emerged as a powerful ML solution that models flexible modalities of data and is capable of learning rich dynamical behaviors. These developments can be considered outgrowths of early forms of neural network architectures, such as Boltzmann machines,23 that describe the interrelated aspects of states of a physical system (including, possibly, metaphysical systems, in general).

More broadly, language models are powerful ML strategies to model long-range behaviors in terms of state variables and dynamical aspects, including flexible modalities of data. This is because they can easily integrate sequential, field, and other numerical input and learn extremely complicated dynamical behaviors across variegated forms of state variables. Modern language models that derive ultimately from original concepts, such as Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and attention-based transformer networks, have shown a striking capability of solving complex problems in protein folding,24–26 molecular property prediction,14,20,27 and failure of complex nonlinear architected materials,28 to name a few (Table I). Looking from a more general perspective, these problems can be commonly described as predicting the results of large numbers of interactions of elementary building blocks (atoms; molecules; amino acids; peptides; even words, musical notes, or vibrational patterns; etc.) to form more integrated, complex relationships with functions that ultimately far exceed those of individual building blocks,6,21,29 defined by their web of interrelated functors.29 Formulating such problems as attention-based problems helps with a critical task in the physical sciences, where small details can have an outsized impact on a solution (e.g., near a crack singularity, or capturing the impact of a single mutation on protein misfolding).

TABLE I.

Important machine learning related definitions included in this paper.

ML-related terms                          Features/definition
Feedforward neural network (FFNN)         Nodes connected with information flowing in one direction
Encoder–decoder framework                 Two neural networks with the same or similar structure: one maps the input to
                                          representations and the other maps representations to the output
Convolutional Neural Network (CNN)        Captures features at different levels by calculating convolutions; takes
                                          pixel-based or voxel-based data as input
Recurrent Neural Network (RNN),           Nodes connected with feedback loops; history information stored in hidden
Long Short-Term Memory (LSTM),            states; sequential data as input
and Gated Recurrent Unit (GRU)
Transformer                               Encoder–decoder structure with multi-head attention mechanism; no
                                          requirements on sequence order
Natural language processing               Aimed at enabling computers to understand human language
Pre-training                              The first training of a model on a generic task (often on an unlabeled
                                          dataset), the parameters of which are then used to adapt the model or
                                          train it toward a different task
Fine-tuning                               Adjusting the parameters of a model to solve a certain task pertaining to a
                                          particular dataset, often based on a pre-trained model

This is broadly important since hierarchical structuring exists in materials science across vast scales in time, length, and modalities. For instance, complex biomaterials with different functions, ranging from hair and spider silk to viruses, are all created through the assembly of universal amino acid building blocks, a process governed by molecular, quantum chemical principles but conditioned on environmental cues or gradients that are sensed by the entire system. While protein folding rules are extremely complicated, end-to-end models such as trRosetta,30 AlphaFold2,24 OpenFold,31 and OmegaFold32 can now provide protein structure predictions with atomic accuracy, similar to the folded geometries observed in nature (Fig. 2). In a broader sense, the scientific paradigm devised by human civilizations over tens of thousands of years seeks to discover the fundamental relationships between building blocks and the emergence of function by using human intelligence.6 In this vein, end-to-end models, in which biological, chemical, physical, and other information, such as assembly conditions, are directly associated with the final material function, can be formulated using AI methods such that these relationships can be discovered and visualized autonomously from data. Important areas of application in physics also include the development of new neural network potentials to accelerate molecular dynamics modeling and even quantum-level processing.33–36 

FIG. 2.

Visualization of universal building block diagram. (a) Visualization of building blocks and example assembly rules (here showing the results of a molecular dynamics simulation of the emergence of a beta-sheet). (b) Building block diagram in protein folding. End-to-end models such as AlphaFold224 and similar methods can now provide a highly accurate folded protein structure as seen in nature.


A prominent language model is the transformer network,7 based on attention mechanisms that originated in multiplicative modules introduced as sigma–pi units37,38 and higher order networks39,40 around 1990. Being one of the most widely used language models, the class of transformer models benefits from its attention architecture and is able to both learn universal truths of physical systems and solve specific tasks, forming the basis for exploring a diverse set of downstream problems as it learns and develops features of a universal mathematical operator. Unlike models that are trained to solve very specific problems and, hence, may not generalize to other application areas, transformer models possess strong generalization capabilities.41 Originally based on Natural Language Processing (NLP) concepts, transformer models have been shown to be capable of learning how words relate to one another and innately include a higher level of interpretability.42–45 Applying this concept to universal building blocks of the physical world, e.g., molecules; material microstructures; and distinct hierarchical representations of materials, including biological systems, molecular building blocks, mesoscale structural components, and many others, transformer models can capture the functional relationships between constituent building blocks without a priori knowledge of what to pay attention to, or even what the functional group building blocks are.6 With attention learning, some parts of the input data get enhanced while the importance of others is diminished, which reflects a key aspect of many self-assembly and property prediction tasks: some, potentially small, features in the data require more focus to reflect their importance6 (Fig. 3). Since such architectures can deal with high-dimensional data, they can also be effectively combined with convolutional architectures to provide a direct approach to capturing hierarchical phenomena in materials science.

FIG. 3.

A schematic illustrating the benefits of attention mechanisms, visualized with a fully connected graph whose significant edges are discovered in the process of training with data. Each input data element can be viewed as a graph node, and the attention mechanism takes all interactions between data into consideration, which resembles the “fully connected” feature of the graph. In each attention layer, the weight of the interaction between two nodes is a function of their input context. After training with several attention layers, the model learns the weight of each edge, removes insignificant edges, and thereby reduces the original fully connected graph to a sparsified graph featuring only the important interactions. In the figure, the dashed lines denote the removed edges of the graph.


In this perspective, we review recent progress in this emerging area of research and showcase how it can be beneficial when combined with related, earlier attempts to model physical systems using approaches such as mathematics or early-stage models of the neural network era, e.g., a traditional encoder–decoder framework that seeks to discover elementary physics via latent bottlenecks in a reduced parameter space.46–50 In terms of an outline of the work discussed in this paper, various language models are introduced in Sec. II, with a special focus on the transformer architecture. Section III discusses recent applications of deep language models in materials science, including dataset construction, structure/property prediction, and inverse material design. Finally, in Sec. IV, we summarize the key insights put forth in this perspective and provide forward-looking discussions in this area of research, including an exploration of the role that large language models, such as GPT-351 or Galactica,52 can play, especially when combined with generative models, such as DALL·E 253 or Stable Diffusion.54 

The use of language models is inspired both by the general ontology with which we describe material processes, structures, and functions and by the quest to learn the dynamical behaviors of physical systems. These aspects are critical in a variety of applications ranging from text, music, video, and robotics to predicting the behavior of soft matter systems, hierarchical material architectures, and many others. Motivated especially by the need to model sequential data, language models have emerged as a powerful machine learning solution to this class of problems, starting with the Recurrent Neural Network (RNN),55 Long Short-Term Memory (LSTM),56 and Gated Recurrent Unit (GRU)57 and, more recently, moving toward the era of attention models10,58 featuring transformer networks7 [Fig. 4(b)] that can capture a rich set of data modalities, including but not limited to sequential data.

FIG. 4.

Development of state-of-the-art neural network models for dynamical behavior learning such as NLP. (a) The advantage of language models (an RNN, for example) over an FFNN for modeling sequential data. In an RNN architecture, “A” represents an operation cell with layers of FFNNs. (b) The emergence and architectures of the Recurrent Neural Network (RNN),55 Long Short-Term Memory (LSTM),56 Gated Recurrent Unit (GRU),57 general attention model,58 and transformer network7 are shown.


As illustrated in Fig. 4(a), compared with a simple FFNN, an RNN55 has feedback loops43 that process new information together with the outputs from prior steps at each time step and, thus, has become a widely used architecture for various tasks involving the notion of sequential data, such as speech recognition,59,60 language modeling,61 and image captioning.62 While an RNN is limited by its short-term memory, an LSTM56 improves on this substantially by learning long-term dependencies. The key differences between an RNN and an LSTM are the operations performed within the LSTM cells. The memory of the LSTM, shown in Fig. 4(b), which runs as a horizontal line at the top, has the ability to forget, update, and add context aided by different operations within an LSTM cell, namely, the forget gate, update gate, and output gate,63 and can thereby capture longer-range relationships. The GRU57 architecture can be considered an alternative to the LSTM since it has a similar structure with minor modifications, where the forget and input gates are merged and an extra reset gate is used to update the memory with the old state at time step t − 1 and the new input at time step t. One potential issue with all three neural networks described so far is the requirement of compressing all the necessary information of a source sequence into a fixed-length vector, making it difficult for the neural network to cope with long sequences.58 This is especially true if there exist very long-range relationships (e.g., how a small nuance at the beginning of the input sequence interacts with a small nuance much later in the sequence is hard to capture); yet such problems are common in physics, for instance, in protein folding or self-assembly. All these architectures are also limited to sequential data, whereas relevant problems in physics may involve a combination of field and temporal data, for instance, requiring a more flexible description.
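
As a concrete illustration of the recurrent update of hidden states described above, the following minimal sketch (our own PyTorch illustration, not code from the cited works) shows how an RNN and an LSTM process a batch of sequences and return the hidden state at every time step; all tensor sizes are arbitrary.

```python
# Minimal sketch (illustrative only): processing a batch of sequences with an RNN
# and an LSTM in PyTorch, to show the recurrent update of hidden states.
import torch
import torch.nn as nn

batch, seq_len, n_features, n_hidden = 8, 100, 16, 32
x = torch.randn(batch, seq_len, n_features)           # e.g., a time series of state variables

rnn = nn.RNN(input_size=n_features, hidden_size=n_hidden, batch_first=True)
lstm = nn.LSTM(input_size=n_features, hidden_size=n_hidden, batch_first=True)

rnn_out, h_rnn = rnn(x)                # h_rnn: final hidden state only
lstm_out, (h_lstm, c_lstm) = lstm(x)   # the LSTM additionally carries a cell state c,
                                       # the "memory line" that can forget/update/add context

# rnn_out and lstm_out contain the hidden state at every time step,
# which is exactly what attention models later exploit (Sec. II B).
print(rnn_out.shape, lstm_out.shape)   # both: (8, 100, 32)
```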

Unlike previous approaches that start making predictions using only the final hidden state, which contains rather condensed information, attention models pay attention to each hidden state at every step and make predictions after learning, with a neural network, how informative each state is.58 Importantly, the key to attention learning is that it successively enhances certain parts of the input data while diminishing the importance of others. This realizes a key aspect of many self-assembly and feature prediction tasks in materials science: some, potentially small, features in the data require us to dedicate more focus to reflect their importance (e.g., a point mutation in a biopolymer or DNA, or a singularity in a fracture study).6 Attention learning is often computed with deep layers, where the attention operation is carried out repeatedly to capture highly complex relationships. Usually, increasing the number of layers leads to more trainable parameters in the model and enables the model to learn more complicated relationships. For example, the well-known language model GPT-351 features 96 layers and ∼175 × 10⁹ parameters. Compared to smaller models with 12–40 layers, such larger language models have been shown to make increasingly efficient use of in-context information and achieve better performance across different benchmarks.51 The detailed architecture of the fundamental attention model at the heart of these approaches will be discussed in Sec. II B, with the transformer architecture as an example. Besides transformer networks, some other salient neural architectures used together with attention learning include the encoder–decoder framework,58 memory networks,64–66 and graph attention networks.67 
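
The idea of scoring every hidden state with a small neural network and predicting from the attention-weighted sum can be sketched as follows; this is a hedged, minimal illustration of additive (Bahdanau-style) attention, with all layer sizes and names chosen for the example rather than taken from any specific published model.

```python
# Minimal sketch of additive attention over RNN hidden states (illustrative only).
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.score = nn.Sequential(          # small FFNN that learns how informative each state is
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, decoder_state, encoder_states):
        # decoder_state: (batch, hidden); encoder_states: (batch, seq_len, hidden)
        seq_len = encoder_states.size(1)
        query = decoder_state.unsqueeze(1).expand(-1, seq_len, -1)
        scores = self.score(torch.cat([query, encoder_states], dim=-1))  # (batch, seq_len, 1)
        weights = torch.softmax(scores, dim=1)                           # attention weights
        context = (weights * encoder_states).sum(dim=1)                  # weighted sum of states
        return context, weights.squeeze(-1)

attn = AdditiveAttention(hidden_dim=32)
context, w = attn(torch.randn(8, 32), torch.randn(8, 100, 32))
print(context.shape, w.shape)   # (8, 32) (8, 100); w sums to 1 over the 100 steps
```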

Attention models have emerged as the state-of-the-art for multiple tasks in NLP,9 computer vision,68 cross-modal tasks,69 and recommender systems,70 offering several other advantages beyond performance improvements, such as enhanced interpretability.71 The research interest in interpretability results from the fact that, unlike other “black-box” models, attention models allow a direct inspection of the internal deep neural network architecture.71 Visualizing attention weights shows how relevant a specific region of the input is for the prediction of the output at each position in a sequence. For instance, Bahdanau et al.58 visualized attention weights and showed automatic alignment of sentences in French and English despite the different locations of subject–verb–noun in the two languages. In an image captioning task, Xu et al.72 showed that image regions with high attention weights have a significant impact on the generated text. Similar studies have also been done for protein sequences, where protein language models, such as ProtBERT,12 can be interpreted, with regard to how they learn the form, shape, and function of proteins during training, by directly visualizing the output embeddings.

While recurrent architectures rely on sequential processing of the input at the encoding step, which results in computational inefficiency, transformer architectures completely eliminate sequential processing and recurrent connections, relying only on attention mechanisms to capture global dependencies between input and output,7,71 in some ways resembling early architectures, such as the Boltzmann machine. The detailed transformer architecture is shown in Fig. 5(b) in contrast to a conventional encoder–decoder structure [Fig. 5(a)], where the layers of the encoder and decoder contain extra attention sublayers rather than simple FFNNs.

FIG. 5.

Architecture of the transformer model proposed in 2017,7 visualized in contrast to a conventional encoder–decoder framework.58 (a) Architecture of a conventional encoder–decoder, with FFNN layers. (b) Detailed architecture of the transformer model, with the multi-head attention mechanism emphasized. Each layer in the encoder contains a multi-head self-attention sublayer followed by an FFNN, whereas each decoder layer includes three sublayers: masked multi-head self-attention, multi-head cross-attention, and an FFNN. While in self-attention all Q, K, and V come from either the input or output embeddings (or other sources) only, cross-attention calculations here are performed with Q from the output embedding and K and V from the encoder stack, which is essentially generated from the input embedding. The interplay of self- and cross-attention enables a deep learning capacity in sequence-to-sequence translations, for instance.


The transformer network employs an encoder–decoder structure, where the encoder maps an input sequence of symbol representations x to a sequence of continuous representations z and the decoder generates an output sequence y of symbols one element at a time.7 At the embedding stage, positional encoding is used to provide information about the relative or absolute position of the tokens in the sequence. The main architecture is composed of a stack of N identical layers of encoders and decoders, each with two sublayers: a position-wise fully connected FFNN layer and a multi-head attention layer. As the fully connected FFNN applies a linear transformation to the input data, “position-wise” emphasizes that the same transformation is applied to each position in the sequence independently, straightforwardly enabling parallel processing. (That is, in the same layer, the parameters of the FFNN are the same across different positions, facilitating such parallel processing.) The decoder is similar to the encoder, except that the decoder contains a third, inserted sublayer that performs multi-head cross-attention over the output of the encoder stack. Unlike self-attention, which is based on inputs generated from the same embedding, the cross-attention calculation is performed on inputs from different embeddings, or a different signal altogether. As shown in Fig. 5(b), the cross-attention in a transformer model is fed with data K, V from the encoder stack and data Q from the output embedding. In the first multi-head self-attention sub-module of the decoder, masking is applied to ensure that the predictions for position i can depend only on the known outputs at positions less than i. Finally, normalization and residual connections are applied.
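
A minimal sketch of this encoder–decoder layout is given below, using PyTorch's built-in nn.Transformer as a stand-in for the original architecture; the vocabulary sizes and sequence lengths are illustrative, and the positional encodings mentioned above are only noted in a comment rather than implemented.

```python
# Minimal sketch of an encoder-decoder transformer (illustrative; not the original code).
import torch
import torch.nn as nn

d_model, vocab_in, vocab_out = 128, 1000, 1000
embed_src = nn.Embedding(vocab_in, d_model)
embed_tgt = nn.Embedding(vocab_out, d_model)
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)
to_logits = nn.Linear(d_model, vocab_out)

src = torch.randint(0, vocab_in, (4, 50))    # input sequence of symbols x
tgt = torch.randint(0, vocab_out, (4, 20))   # (shifted) output sequence y

# Causal mask so that position i in the decoder attends only to positions < i.
tgt_mask = model.generate_square_subsequent_mask(tgt.size(1))

# Note: a full implementation would add positional encodings to both embeddings.
out = model(embed_src(src), embed_tgt(tgt), tgt_mask=tgt_mask)
logits = to_logits(out)                      # (4, 20, vocab_out)
```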

Attention mechanisms. The core part of a transformer architecture is the multi-head self- and cross-attention mechanism.7 While conventional neural networks, such as Convolutional Neural Networks (CNNs), fall short in capturing complex relationships because they may neglect small details when coarse-graining the data (e.g., at every convolutional layer in a conventional deep CNN, the signal is further coarse-grained), the multi-head attention mechanism allows the transformer model to carry out a discovery strategy with all details across scales taken into consideration. As illustrated in Fig. 6, for physical problems, such as understanding the hierarchical structure of spider webs,73–75 multiple heads in the transformer architecture could conceptually help capture features from the finest scale, e.g., amino acids, to the macroscale, where topology is characterized (these features are learned and so may not exactly reflect this simplistic summary, but we may generally think of the underlying mechanisms of how relationships are understood in that way). Hence, the importance of even small details can be reflected, and the intricate relationships that govern the physics and materials of interest can, indeed, be discovered. The mathematical details of the multi-head attention mechanism are described as follows.

FIG. 6.

Schematic illustrations of the multi-head attention mechanism in the transformer architecture, with a focus on its application to physical problems. (a) The hierarchical structure of spider webs73–75 can be captured with features across scales using multiple heads in the transformer model. (b) Intrinsic differences between convolutional neural networks (CNNs) and transformer neural networks due to the use of the attention mechanism. Unlike CNNs, which utilize convolutional and pooling operations in each layer to coarse-grain the data and reduce dimensions, the attention mechanism fully connects all necessary data and allows the transformer model to discover very long-range relationships at the highest level of detail.

An attention function can be considered as mapping a query and a set of key–value pairs to an output, which is computed as a weighted sum of the values. The weight assigned to each value is calculated from a function of the query with the corresponding key. Hence, the general attention function can be described as
\[
\mathrm{Attention}(Q, K, V) = f(Q, K)\, V, \tag{1}
\]
where f is a function determining how keys and queries are combined to generate attention weights. With various types of function f, the operations on keys and queries can be set up linearly or nonlinearly, allowing researchers to establish attention models of different complexities. In the 2017 transformer architecture,7 for instance, in order to obtain the weights of the values, the authors proposed scaled dot-product attention, which scales the dot product of a query with a key by the square root of the key dimension and applies a softmax function.6 Aggregating a set of queries into the matrix Q, the matrix of outputs is
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V, \tag{2}
\]
where dk is the dimension of keys. In self-attention, all Q, K, and V come from either input or output embedding only. In contrast, cross-attention calculations, in the transformer model, for instance, are performed with Q from the output embedding and K and V from the encoder stack, which is essentially generated from the input embedding.
Further, the attention calculation is typically implemented in a “multi-headed” form by stacking attention layers in parallel. Instead of computing the attention only once, the multi-head mechanism splits the input into fixed-size segments—in the dimension of the embedding—and then computes the scaled dot-product attention over each segment in parallel, allowing the model to jointly attend to information from different representation subspaces at different positions,
\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}, \tag{3}
\]
\[
\mathrm{head}_i = \mathrm{Attention}\left(Q W_i^{Q}, K W_i^{K}, V W_i^{V}\right), \tag{4}
\]
where the projections are parameter matrices $W_i^{Q} \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$, $W_i^{K} \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$, $W_i^{V} \in \mathbb{R}^{d_{\mathrm{model}} \times d_v}$, and $W^{O} \in \mathbb{R}^{h d_v \times d_{\mathrm{model}}}$. One challenge with this approach is scaling: the computational cost typically grows quadratically with the input size N (i.e., as N²), as all input elements are processed with respect to each other, resembling a fully connected graph [Fig. 3 (left)].
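
The following short sketch implements Eqs. (2)–(4) directly with plain tensor operations (all dimensions chosen purely for illustration), which also makes the O(N²) pairwise interaction between sequence elements explicit.

```python
# Direct sketch of Eqs. (2)-(4): scaled dot-product attention and its multi-head form.
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Eq. (2): softmax(Q K^T / sqrt(d_k)) V
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, h):
    # Eqs. (3) and (4): project into h subspaces, attend in each, concatenate, project back.
    heads = [scaled_dot_product_attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])
             for i in range(h)]
    return torch.cat(heads, dim=-1) @ W_o

# Illustrative dimensions: d_model = 64, h = 4 heads, d_k = d_v = d_model / h = 16.
d_model, h, d_k = 64, 4, 16
n = 10                                    # sequence length; cost scales as O(n^2)
Q = K = V = torch.randn(n, d_model)       # self-attention: Q, K, V from the same embedding
W_q = [torch.randn(d_model, d_k) for _ in range(h)]
W_k = [torch.randn(d_model, d_k) for _ in range(h)]
W_v = [torch.randn(d_model, d_k) for _ in range(h)]
W_o = torch.randn(h * d_k, d_model)
out = multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, h)
print(out.shape)                          # (10, 64)
```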

As one of the state-of-the-art attention models, transformer networks can thereby capture long-range dependencies between input and output, support parallel processing, require minimal prior knowledge, demonstrate scalability to large sequences and datasets, and allow domain-agnostic processing of multiple modalities (text, images, and speech) using similar processing blocks.71 Since the inception of the transformer, there has been increasing interest in applying attention models in a wide range of research areas, which has also led to a growing number of transformer variants, such as the Vision Transformer,76 the Trajectory Transformer,77 and physics-focused transformer models.78 Important developments also include linear-scaling approaches that reduce computation time and memory consumption, such as the Reformer,79 Perceiver,80 TurboTransformer,81 and Transformer-XL.82 

Despite the success of neural network models for NLP, performance improvements are limited by the small curated datasets available for most supervised tasks (e.g., relationships between biological sequences and physical properties). To address this, the advent of pre-trained language models has revolutionized many applications in NLP, thanks to their training on large corpora with little or no labeling. Moreover, it has been shown that such strategies facilitate fine-tuning for downstream tasks and convenient usage. Among them, transformer-based pre-trained language models have been especially popular. There are two main types of such models finding applications in materials science: (1) models using architectures similar to Bidirectional Encoder Representations from Transformers (BERT)83 and pre-trained by corrupting and reconstructing the original sequence, such as ESM,84 MatSciBERT,85 ProtTrans,12 and ProteinBERT,86 the most direct application of which is sequence embedding, and (2) models using autoregressive training, the most famous of which is the Generative Pre-trained Transformer (GPT) series,51,87 such as ProGen,88,89 DARK,90 and ProtGPT2,91 which show great potential for protein design.

For a specific downstream task, these pre-trained language models can then be easily fine-tuned with significantly less labeled data, thus avoiding training a brand-new model from scratch, simply by adding a new decoder and training part or all of the model. This can be done by replacing the final few neural network layers so that the parameters of the large model are preserved and further adapted.6 Fine-tuning then requires us to train either the final head layers or the entire model, both being much easier than training from scratch. With such adaptations, pre-trained large language models that do not specifically focus on physics or materials science related knowledge can also be exploited for materials research. For instance, the recently developed large language model trained on a scientific knowledge corpus, Galactica,52 has shown its potential to act as a bridge between scientific modalities and natural language toward biological and chemical understanding of materials, with downstream tasks such as MoleculeNet classification and protein function prediction. Another AI model, ChatGPT,92 built on top of GPT-3 and adapted for dialogue, has been released recently and could also be an interesting approach for research in the physical sciences, especially in conjunction with generative text-to-image methods. For example, if we ask ChatGPT to describe the microstructure of a very compliant material and then use its answers to generate images with DALL·E 253 (DALL·E 2 is an AI system that can create realistic images and art from text), we obtain images of various compliant material microstructures. While this type of research is at an early stage, it could already be promising for automated dataset generation (see Fig. 7 for an example).
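
A hedged sketch of this fine-tuning workflow is shown below, using the Hugging Face transformers library; the checkpoint name and the space-separated sequence format are assumptions about the ProtBERT release, and the scalar regression head and frozen encoder simply illustrate the head-replacement strategy described above rather than any specific published setup.

```python
# Hedged fine-tuning sketch: a pre-trained protein language model with a new scalar head.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "Rostlab/prot_bert"  # assumed checkpoint name of a BERT-style protein model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=1)     # num_labels=1 -> scalar regression head (e.g., stability)

# Optionally freeze the pre-trained encoder and train only the new head layers.
for p in model.base_model.parameters():
    p.requires_grad = False

sequences = ["M K T A Y I A K Q R", "G S H M A A A K K"]   # toy, space-separated sequences
labels = torch.tensor([[0.61], [0.42]])                    # toy property values
batch = tokenizer(sequences, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)
out = model(**batch, labels=labels)   # MSE loss is used automatically when num_labels == 1
out.loss.backward()
optimizer.step()
```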

FIG. 7.

Example of generating microstructure images of compliant materials using ChatGPT92 and DALL·E 2.53,93 (a) Dialogue with ChatGPT when asking about the microstructure of a very compliant material. (b) Images generated by DALL·E 2 with prompt “materials with high porosity, or a large amount of voids or pores.”


Data are a key element of machine learning models: sufficient, high-quality data are essential for models to work efficiently. Researchers can either collect data from the existing literature and databases or generate them on demand with high-throughput experiments or simulations.

Although there exist many materials databases, such as MATDAT,94 MatWeb,95 MatMatch,96 and MatNavi,97 researchers may need to mine appropriate data from numerous studies and across different databases, where text processing techniques can be utilized to replace manual labor. It has been demonstrated that NLP can not only efficiently encode materials science knowledge present in the published literature, but also map unstructured raw text onto structured database entries that allow for programmatic querying, with Matscholar as an example.98,99 Other examples of datasets gathered and curated by NLP models can be found across materials science, although progress is still at an early stage.100 Aided by NLP techniques, Kim et al. were able to develop an automated workflow comprising article retrieval, text extraction, and database construction to build a dataset of aggregated synthesis parameters computed using the text contained within over 640 000 journal articles.101 To describe the temperatures of ferromagnetic and antiferromagnetic phase transitions, Court and Cole102 assembled close to 40 000 chemical compounds and associated Curie and Néel magnetic phase-transition temperatures from almost 70 000 chemistry and physics articles using ChemDataExtractor,103 an NLP toolkit for the automated extraction of chemical information.
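
As a toy illustration of what such extraction pipelines automate at scale (and explicitly not the ChemDataExtractor API), the following sketch pulls compound and Curie temperature pairs from free text into structured records; the regular expression is deliberately simplistic and would need substantial hardening for real literature mining.

```python
# Toy sketch of rule-based property extraction from text into structured records.
import re

text = ("The Curie temperature of Fe3O4 was measured to be 858 K. "
        "For CrBr3, a Curie temperature of 37 K was reported.")

pattern = re.compile(
    r"(?:of\s+|For\s+)(?P<compound>[A-Z][A-Za-z0-9]+)[^.]*?"
    r"(?P<value>\d+(?:\.\d+)?)\s*K",
)

records = [{"compound": m["compound"], "T_curie_K": float(m["value"])}
           for m in pattern.finditer(text)]
print(records)
# [{'compound': 'Fe3O4', 'T_curie_K': 858.0}, {'compound': 'CrBr3', 'T_curie_K': 37.0}]
```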

Recent developments of pre-trained large language models, such as MatSciBERT,85 could also greatly benefit the dataset construction process, accelerating materials discovery and information extraction from materials science texts. Moreover, if an appropriate pre-trained language model is employed for the main research task, the amount of self-generated experimental or simulation data can be reduced to some extent, since the transferred model preserves some of the general predictive capability of the pre-trained model.

In addition to text extraction and serving as pre-trained models, language models can help extract valuable data from images, facilitating data labeling, especially for experiments, which otherwise requires substantial human effort. Microscopy images, which characterize the microscopic- to atomic-scale structure of materials and are predominantly sourced from scanning and transmission electron microscopies (SEM and TEM, respectively), contain a wide range of quantitative data that would be useful in the design and understanding of functional materials.100 Image segmentation and classification, for example, are essential steps for constructing labeled image datasets. The Vision Transformer,68 pre-trained on a large proprietary dataset, can be fine-tuned to perform such downstream recognition tasks and has been benchmarked on ImageNet classification. It has also been shown that TransUNet,104 which combines the merits of transformers and U-Net, is a strong alternative for medical image segmentation, achieving performance superior to various competing methods on different medical applications, including multi-organ segmentation and cardiac segmentation.
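
A hedged sketch of such fine-tuning is given below, using torchvision's ImageNet-pre-trained vit_b_16 as a stand-in and a replaced classification head for hypothetical micrograph categories; it illustrates the general transfer-learning pattern, not the setup of the cited works.

```python
# Hedged sketch: fine-tuning a pre-trained Vision Transformer for micrograph classification.
# Assumes torchvision >= 0.13 for the weights enum; the five classes are hypothetical.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

num_classes = 5                                   # e.g., five microstructure categories
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads.head = nn.Linear(model.heads.head.in_features, num_classes)  # new task head

# Freeze the pre-trained backbone; train only the new classification head.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("heads")

images = torch.randn(4, 3, 224, 224)              # placeholder for preprocessed micrographs
labels = torch.randint(0, num_classes, (4,))
loss = nn.CrossEntropyLoss()(model(images), labels)
loss.backward()
```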

Structure and property predictions have always been at the frontier of materials science, and language models are now opening a new chapter in this area. Models such as RNNs, LSTMs, and transformers have shown the capability of solving complex problems in protein folding,16,24–26 material property prediction,14,20,27,105 and failure of complex nonlinear architected materials.28 

Protein folding, that is, protein structure prediction that yields full-atomistic geometries of this class of biomolecules, has been an important research problem for large-scale structural bioinformatics and biomaterial investigation. Although progress remained stagnant over the last two decades, recent applications of deep language models, which enable end-to-end predictions, have largely solved the folding problem for single-domain proteins,106 with methods such as trRosetta,30 AlphaFold2,24 and newer variations, such as OpenFold31 or OmegaFold,32 now becoming the state-of-the-art tools. Taking the protein amino acid sequence solely as an input and leveraging multiple sequence alignments (MSAs), AlphaFold2 is able to regularly predict protein structures with atomic accuracy, with the transformer architecture actively incorporated in the model24 [Fig. 8(a)]. Kandathil et al.16 developed an ultrafast deep learning-based predictor of protein tertiary structure that uses only an MSA as input, where a system of GRU layers is utilized to process and embed the input MSA and output a feature map. Predictions of protein structure are also possible without using MSAs. Chowdhury et al.107 reported the development of an end-to-end differentiable recurrent geometric network (RGN) that uses a protein language model to learn latent structural information from unaligned proteins. On average, RGN2 outperforms AlphaFold224 and RoseTTAFold108,109 on orphan proteins and classes of designed proteins while achieving up to a 10⁶-fold reduction in computation time.107 

FIG. 8.

Protein structure prediction and generation. (a) Model architecture of AlphaFold2,24 which predicts protein 3D structure from the amino acid sequence only and leverages MSA. Reproduced with permission from Jumper et al., “Highly accurate protein structure prediction with AlphaFold,” Nature 596, 583–589 (2021). Copyright 2021 Springer Nature. (b) Protein structure generation via a denoising diffusion model with a simple transformer backbone,110 where proteins are characterized by several consecutive angles. Reproduced with permission from Wu et al., “Protein structure generation via folding diffusion,” arXiv:2209.15611 (2022). Copyright 2022 Author(s), licensed under a Creative Commons Attribution-ShareAlike (CC BY-SA 4.0) license.


In addition, deep language models have significant potential in predicting the features and properties of various materials. Regarding biological materials that have amino acids as building blocks, end-to-end models have been developed to learn molecular properties, including secondary structures,14 thermal stability,111 mechanical strength,13 and normal mode frequencies,112 providing novel avenues for protein engineering, analysis, and design. For instance, to predict normal mode frequencies of proteins, an end-to-end transformer, taking solely an amino acid sequence as input, has been trained to achieve a performance (R²) as high as 0.85 on a test set112 [Fig. 9 (left)]. As shown in Fig. 9, research investigating the thermal stability of collagen triple helices suggests that a small end-to-end transformer model trained from scratch and a ProtBERT-based large model achieve similar performance on test data (R² = 0.84 vs 0.79, respectively).111 In terms of properties in the general material domain, the FieldPerceiver,78 a physics-based, building-block-based transformer network, is able to learn by categorizing interactions of elementary material units that define the microstructure of the material and then predict the resulting material behavior, such as stress and displacement fields (Fig. 10). The relatively unlimited range of transformer models in associating input with output data enables the prediction of both the local and long-range organization of the target field. The prediction results of stress and displacement fields shown in Fig. 10 suggest that the model can build on a pre-trained model and easily transfer physical insights to cases that have a distinct solution.78 
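
A minimal sketch of such an end-to-end sequence-to-property transformer is shown below; it illustrates the general pattern (token embedding, transformer encoder, pooled regression head) rather than the published model architectures, and all hyperparameters and vocabulary sizes are arbitrary.

```python
# Minimal sketch (illustrative only): amino acid tokens in, a scalar property out.
import torch
import torch.nn as nn

class SequenceRegressor(nn.Module):
    def __init__(self, vocab_size=25, d_model=128, nhead=8, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.randn(1, 512, d_model))   # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 1)                       # scalar property prediction

    def forward(self, tokens):                                  # tokens: (batch, seq_len)
        x = self.embed(tokens) + self.pos[:, : tokens.size(1)]
        x = self.encoder(x)
        return self.head(x.mean(dim=1)).squeeze(-1)             # pool over sequence -> scalar

model = SequenceRegressor()
tokens = torch.randint(0, 25, (16, 120))       # 16 toy sequences of length 120
target = torch.rand(16)                        # toy property values (e.g., a frequency)
loss = nn.MSELoss()(model(tokens), target)
loss.backward()
```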

FIG. 9.

End-to-end predictions using a transformer model for downstream tasks such as predicting normal mode frequencies of proteins and the melting temperature of collagen, providing novel avenues for protein engineering, analysis, and design. For normal mode frequency predictions, the end-to-end transformer, taking solely the amino acid sequence as input, has been trained to achieve a performance (R²) as high as 0.85 on a test set.112 Research investigating the thermal stability of collagen triple helices suggests that an end-to-end small transformer model trained from scratch and a pre-trained ProtBERT-based large model can achieve similarly satisfying performance on test data (R² = 0.84 vs 0.79, respectively).111 Reproduced with permission from Khare et al., “Collagen transformer: End-to-end transformer model to predict thermal stability of collagen triple helices using an NLP approach,” ACS Biomater. Sci. Eng. 8, 4301–4310 (2022). Copyright 2022 American Chemical Society. Reproduced with permission from Hu and Buehler, “End-to-end protein normal mode frequency predictions using language and graph models and application to sonification,” ACS Nano 16, 20656–20670 (2022). Copyright 2022 American Chemical Society.

FIG. 10.

Workflow and sample results for field predictions with FieldPerceiver.78 (a) Illustration of how FieldPerceiver solves self-assembly problems with elementary building blocks that define the microstructure of the material. (b) Sample results of von Mises stress and displacement fields predicted by FieldPerceiver, showing an excellent accuracy compared with the ground truth. Reproduced with permission from Buehler, “FieldPerceiver: Domain agnostic transformer model to predict multiscale physical fields and nonlinear material properties through neural ologs,” Mater. Today 57, 9–25 (2022). Copyright 2022 Elsevier.


Mechanical problems involving nonlinearities, such as plasticity, fracture, and dynamic impact, are known to be challenging and computationally expensive for conventional numerical simulation schemes.5 Machine learning techniques, especially deep language models, have provided efficient approaches to such problems. Hsu et al.28 presented an AI-based multiscale model with a convolutional LSTM for predicting fracture patterns in crystalline solids based on molecular simulations. As shown in Fig. 11, the proposed approach not only shows excellent agreement regarding the computed fracture patterns but also predicts fracture toughness values under mode I and mode II loading conditions. Lew et al.113 used similar machine-learning approaches to predict nanoscopic fracture mechanisms, including crack instabilities and branching as a function of crystal orientation, focusing on a particular technologically relevant material system, graphene. Another machine-learning model has been proposed to predict the brittle fracture of polycrystalline graphene under tensile loading, integrating a convolutional neural network (CNN), a bidirectional RNN, and fully connected layers to process the spatial and sequential features.114 Furthermore, it was demonstrated that a progressive transformer diffusion model can effectively describe the dynamics of fracture, achieving great generalization with limited training data and capturing important aspects, such as crack dynamics, instabilities, and initiation mechanisms.115 The incorporation of attention approaches into progressive diffusion methods, combined with sophisticated convolutional architectures that include ResNet blocks and skip connections, now yields powerful, generalizable architectures that can capture, predict, and generalize behaviors across different physical systems and can also solve degenerate inverse problems that have multiple solutions (e.g., finding a set of material microstructure candidates that meet a certain design demand, such as a stress–strain relationship, as shown in recent work116,117). A particularly noteworthy feature of diffusion models is the training process, where the stochastic nature of training the denoising neural network provides avenues to avoid overfitting while achieving excellent coverage during inference, including conditions that were not part of the training (this is partly due to minimizing the Kullback–Leibler divergence for the Gaussian noise terms). This, combined with the ability to condition the input to these models to provide solutions for a variety of boundary value problems, holds great promise for a variety of applications in physics.
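
To illustrate the convolutional LSTM idea underlying such frame-by-frame fracture predictions, the following minimal sketch implements a ConvLSTM cell in which convolutions replace the matrix multiplications in the gates so that the hidden state retains a spatial field structure; it is a generic illustration, not the architecture of the cited work, and all sizes are arbitrary.

```python
# Minimal ConvLSTM cell sketch (illustrative only): spatial hidden and cell states.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g                  # forget/update the spatial memory field
        h = o * torch.tanh(c)
        return h, c

cell = ConvLSTMCell(in_ch=1, hid_ch=16)
frames = torch.randn(2, 10, 1, 64, 64)     # (batch, time, channel, H, W): crack field snapshots
h = torch.zeros(2, 16, 64, 64)
c = torch.zeros(2, 16, 64, 64)
for t in range(frames.size(1)):            # unroll over time, frame by frame
    h, c = cell(frames[:, t], h, c)
next_frame = nn.Conv2d(16, 1, 1)(h)        # toy decoder from the hidden field to the next pattern
```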

FIG. 11.

Workflow and prediction results of an AI-based multiscale model for predicting fracture patterns in crystalline solids.28 (a) Workflow of the multiscale fracture pattern prediction model, where the dataset is compiled from atomistic modeling and a convolutional LSTM model is employed for prediction. (b) Results of physics-based simulations of tensile-loaded mode I fracture learned by the AI-based model, with crack path, length, and energy release all showing good agreement. Reproduced with permission from Hsu et al., “Using deep learning to predict fracture patterns in crystalline solids,” Matter 3, 197–211 (2020). Copyright 2020 Elsevier.


As language models have demonstrated satisfying performance in predicting the structure and properties of materials, integrating them with generative neural networks, such as Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs), enables researchers to efficiently explore the tremendous material design space, which is intractable with conventional methods.

Hsu et al.21 have developed a machine learning based approach to generate 3D architected materials from human-readable word input, enabling a text-to-material translation. The model combines a Vector Quantized Generative Adversarial Network (VQGAN) and Contrastive Language–Image Pre-training (CLIP) neural networks to generate images, which are then translated into 3D architectures that feature fully periodic, tileable unit cells.21 Such language-based design approaches can have a profound impact on end-to-end design environments and drive a new understanding of physical phenomena that intersect directly with human language and creativity.21 Researchers have also shown that an approach combining an RNN-based model and an evolutionary algorithm (EA) can realize inverse design for 4D-printed active composite beams. Moreover, it has been illustrated that hierarchical assemblies of building blocks, with elementary flame particles as an example, can be created using a combination of GAN and NLP models.118 In terms of protein engineering, the pre-trained ProtGPT291 was shown to generate de novo protein sequences following the principles of natural ones, paving the way for efficient high-throughput protein engineering and design.

More recently, Wu et al.110 have developed a diffusion-based generative model with a transformer architecture, to mimic the native folding process and design protein backbone structures, which are described as a series of consecutive angles capturing the relative orientation of the constituent amino acid residues [Fig. 8(b)]. The generated backbones from this model better respect protein chirality and exhibit greater designability compared to prior works that use equivariance assumptions.110 In these examples, the integration of various architectures to model and synthesize solutions, including, in particular, the use of language-based descriptors for field synthesis as done in VQGAN, is an exciting opportunity in the physical sciences that can integrate both the knowledge derived from existing theories [e.g., synthetic datasets generated from solutions with density functional theory (DFT), MD, and coarse-graining] and experimental, empirical data for which we may not yet have a theory (e.g., protein dynamics, structures, functions, and biological data). This can be particularly useful for developing predictive mesoscale theories for which we do not have closed-form theoretical frameworks but for which we can generate large datasets that capture key relationships. The examples discussed in this paper, e.g., modeling physical phenomena using the FieldPerceiver model or solving inverse problems116,117 using attention–diffusion models, can be viewed as early adaptations of this general perspective.
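
The core training step of such a denoising diffusion model over backbone angles can be sketched as follows; this is a highly simplified illustration (the noise schedule, the angle representation, and the omission of timestep conditioning are all simplifying assumptions), not the published folding-diffusion implementation.

```python
# Highly simplified diffusion training step over sequences of backbone angles (illustrative).
import torch
import torch.nn as nn

T = 1000                                               # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)          # cumulative noise schedule

# A small transformer acts as the denoiser; a real model would also be conditioned on t.
denoiser = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=6, nhead=2, batch_first=True), num_layers=2)

angles = torch.rand(8, 128, 6) * 2 * torch.pi - torch.pi  # toy batch: 128 residues x 6 angles
t = torch.randint(0, T, (8,))
noise = torch.randn_like(angles)

# Forward (noising) process: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
ab = alpha_bar[t].view(-1, 1, 1)
noisy = ab.sqrt() * angles + (1 - ab).sqrt() * noise

loss = nn.MSELoss()(denoiser(noisy), noise)            # train the network to predict the noise
loss.backward()
```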

The emergence of attention models, featuring, for instance, transformer architectures and similar strategies, with advantages in predictive capability, generalization, and interpretability, has revolutionized the application areas of language paradigms beyond conventional NLP. For materials research, exciting opportunities have been created with recent advances in natural language modeling from all aspects of investigation, including dataset construction, structure and property prediction, and inverse material design. As reviewed in Sec. III, these models have shown significant potential in a wide variety of complex tasks, such as automated material information extraction,85 protein folding,16,24–26 molecular property prediction,14,27 fracture pattern investigation,20,28,113,115 and inverse design for architected materials.21 

When formulating materials science problems with machine learning strategies, one challenge is to identify inputs and outputs and build appropriate datasets, whether through autonomous experimentation, multiscale simulations, or the NLP methods mentioned before. While language models place few restrictions on data modality, the size, quality, and diversity of the dataset significantly influence model performance. Although current language models have been shown to be capable of solving complex tasks, achieving better accuracy, scalability, and generalization remains a challenge that should be addressed in future research.

As the transformer model and its variations have proven to be successful and powerful, an interesting direction for future work is mining the transformer architecture, especially its attention mechanism, for interpretation to yield even broader and deeper physical insights. The availability of graph-like attention maps (akin to the schematic shown in Fig. 3) enables a variety of interpretability strategies, from mining the attention maps to reverse engineering where specific types of physical principles are localized and captured. Analyzing the multi-head attention maps could not only benefit interpretation of the model but also establish a strong foundation for architectural improvements (for instance, by automatically optimizing the multi-head division). In addition, since the computational cost of the transformer increases quadratically with sequence length, reducing computation time and memory consumption to achieve better scalability is also an important research theme. To address this problem, researchers are actively developing more efficient architectures, including Reformer,79 Perceiver,80 TurboTransformers,81 Transformer-XL,82 and many other emerging developments.
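As an illustration of how such attention maps can be accessed in practice, the hedged Python sketch below extracts per-layer, per-head attention matrices from a generic pretrained transformer; the model checkpoint, tokenizer, and example sentence are illustrative assumptions and are not tied to any specific materials model discussed here.

```python
# Sketch of extracting per-head attention maps from a pretrained transformer for
# inspection; the checkpoint ("bert-base-uncased") and input text are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

text = "stress concentrates near the crack tip"   # hypothetical materials sentence
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len); these graph-like maps can be mined for
# interpretation, e.g., by locating which tokens receive the most attention.
attn = torch.stack(outputs.attentions)            # (layers, batch, heads, seq, seq)
received = attn.mean(dim=-2)                      # attention received per token, per layer/head
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
avg_received = received[-1, 0].mean(dim=0)        # last layer, averaged over heads
for tok, score in zip(tokens, avg_received.tolist()):
    print(f"{tok:>12s}  {score:.3f}")
```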

Another long-standing challenge in this area is the generalization capability of models, i.e., a model’s ability to respond to unseen data. With the increasing application of deep language models in materials science, a systematic exploration of how such models generalize on tasks in a materials context, such as predicting physical and chemical phenomena of materials, would be an essential next step. The usefulness of deep learning approaches will expand drastically once they capture generalizable truths and can extend these functorial relationships toward new solutions distinct from the training data provided, not only to unleash predictive power but also to explain relationships in a way that the human mind can understand. Recent developments such as Galactica52 and GPT-351 and the emergence of generative models that can synthesize human knowledge toward providing novel solutions (e.g., DALL·E 253,93 or Stable Diffusion54 in image generation) are first steps toward that goal. Furthermore, self-assembling and self-organizing AI, made of highly interconnected simple building blocks such as RNNs, might be a candidate toward better robustness and generalization, since the absence of any centralized control could allow such systems to quickly adjust to changing conditions.119–121 These and related questions are critical for the community to explore and address in order to develop a foundational understanding of the limitations and opportunities of such approaches.

Going beyond learning phenomena in materials science applications, learning generic symbols that represent fundamental physics rules could be another interesting research direction. Several attempts have been made in this direction, including transformer-based models122,123 and DeepONet,124 some of which are capable of learning various explicit and implicit operators and achieving highly accurate solutions to differential systems. Moreover, emerging neuro-symbolic AI systems,125–130 which combine neural networks with symbolic AI and, hence, benefit from a three-way interaction between neural, symbolic, and probabilistic modeling and inference, have the potential to achieve human-style comprehension [Fig. 12(a)]. The key to such a neuro-symbolic methodology is how to learn representations through neural nets and make them available symbolically for downstream use.130 For instance, as shown in Fig. 12(b), using a neuro-symbolic reasoning module to bridge the learning of visual concepts, words, and semantic parsing of sentences without any explicit annotations, the neuro-symbolic concept learner is trained to reason about what it “sees,” which is analogous to human concept learning.127 Future research directions include the extraction of symbolic knowledge from large networks and efficient reasoning within neural networks about what has been learned.129,130
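To illustrate the operator-learning idea behind DeepONet,124 the minimal PyTorch sketch below combines a branch network (encoding an input function sampled at fixed sensor points) with a trunk network (encoding a query coordinate) via an inner product; the layer widths, sensor count, and toy data are assumptions for illustration only, not the exact configuration of Ref. 124.

```python
# Minimal sketch of the DeepONet idea: branch net encodes an input function u sampled
# at m fixed sensors, trunk net encodes a query location y, and the operator output
# G(u)(y) is their inner product. Layer widths and data are illustrative assumptions.
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    def __init__(self, num_sensors: int = 100, width: int = 64, p: int = 32):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Linear(num_sensors, width), nn.Tanh(), nn.Linear(width, p)
        )
        self.trunk = nn.Sequential(
            nn.Linear(1, width), nn.Tanh(), nn.Linear(width, p)
        )

    def forward(self, u_sensors: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        b = self.branch(u_sensors)                  # (batch, p): encoding of the input function
        t = self.trunk(y)                           # (batch, p): encoding of the query coordinate
        return (b * t).sum(dim=-1, keepdim=True)    # G(u)(y), shape (batch, 1)

# Toy usage: 8 random input functions sampled at 100 sensors, one query point each.
model = DeepONet()
u = torch.randn(8, 100)
y = torch.rand(8, 1)
pred = model(u, y)
print(pred.shape)  # torch.Size([8, 1])
```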

FIG. 12.

General model system and example of neural-symbolic AI. (a) A neural-symbolic AI system features a three-way interaction between neural, symbolic, and probabilistic modeling and inference. (b) An example image–question pair and the corresponding execution trace of the neuro-symbolic concept learner.127 Reproduced with permission from Mao et al., “The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision,” International Conference on Learning Representations, 2019. Copyright 2019 Authors.


Looking ahead, the potential of deep language models for materials research across scales has yet to be fully exploited, and there are plenty of new opportunities as well as challenges ahead. The future of this area is exciting and promising, with continuously improving machine learning models and perhaps, one day, generalizable AI systems41 that can successfully capture and amalgamate a variety of information modalities and can be combined with human-readable mathematical notations. This, combined with autonomous experimentation, data collection, and synthesis routes, offers many important challenges for researchers in the physical sciences. A particular opportunity is to devise architectures that extract mechanistic, useful information from very deep models and make it understandable to human minds when possible, building on our deep-rooted kinship with symbolism that has been an integral part of the human experience for millennia. Once we bridge models such as GPT-3 or ChatGPT, which can understand and produce human language, with our well-developed tools for mathematics, numerical methods, and large-scale scientific computing, exciting things will be possible that blur the boundary between language, physics, observation, and engineering design. This may facilitate a phase of accelerated discovery that will hopefully be put to good use for the benefit of civilization.

This work was funded by the National Institutes of Health (Grant Nos. U01ED014976 and 1R01AR07779), the U.S. Department of Agriculture (USDA) (Grant No. 2021-69012-35978), the U.S. Army Research Office (Grant No. W911NF2220213), the Department of Energy Strategic Environmental Research and Development Program (DOE-SERDP) (Grant No. WP22-3475), and the Office of Naval Research (ONR) (Grant Nos. N00014-19-1-2375 and N00014-20-1-2189).

The authors have no conflicts to disclose.

Yiwen Hu: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Investigation (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). Markus J. Buehler: Conceptualization (equal); Funding acquisition (equal); Investigation (equal); Methodology (equal); Project administration (equal); Resources (equal); Software (equal); Supervision (equal); Writing – original draft (equal); Writing – review & editing (equal).

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

1. T. L. Heath, A Manual of Greek Mathematics (Clarendon Press, 1931).
2. C. B. Boyer and U. C. Merzbach, A History of Mathematics (Wiley, 2011).
3. J. von Neumann and H. H. Goldstine, "Numerical inverting of matrices of high order," Bull. Am. Math. Soc. 53, 1021–1099 (1947).
4. A. Bultheel and R. Cools, The Birth of Numerical Analysis (World Scientific, 2009).
5. K. Guo, Z. Yang, C.-H. Yu, and M. J. Buehler, "Artificial intelligence and machine learning in design of mechanical materials," Mater. Horiz. 8, 1153–1172 (2021).
6. M. J. Buehler, "Multiscale modeling at the interface of molecular mechanics and natural language through attention neural networks," Acc. Chem. Res. 55, 3387–3403 (2022).
7. A. Vaswani et al., "Attention is all you need," in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2017), Vol. 30, pp. 5998–6008.
8. Z. Yang et al., "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Association for Computational Linguistics, 2016), pp. 1480–1489.
9. A. Galassi, M. Lippi, and P. Torroni, "Attention in natural language processing," IEEE Trans. Neural Netw. Learn. Syst. 32, 4291–4308 (2021).
10. A. Parikh, O. Täckström, D. Das, and J. Uszkoreit, "A decomposable attention model for natural language inference," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (Association for Computational Linguistics, 2016), pp. 2249–2255.
11. M. S. Klausen et al., "NetSurfP-2.0: Improved prediction of protein structural features by integrated deep learning," Proteins: Struct., Funct., Bioinf. 87, 520–527 (2019).
12. A. Elnaggar et al., "ProtTrans: Toward understanding the language of life through self-supervised learning," IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).
13. F. Y. C. Liu, B. Ni, and M. J. Buehler, "PRESTO: Rapid protein mechanical strength prediction with an end-to-end deep learning model," Extreme Mech. Lett. 55, 101803 (2022).
14. C.-H. Yu et al., "End-to-end deep learning model to predict and design secondary structure content of structural proteins," ACS Biomater. Sci. Eng. 8, 1156–1165 (2022).
15. K. Guo and M. J. Buehler, "Rapid prediction of protein natural frequencies using graph neural networks," Digital Discovery 1, 277–285 (2022).
16. S. M. Kandathil, J. G. Greener, A. M. Lau, and D. T. Jones, "Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterized proteins," Proc. Natl. Acad. Sci. U. S. A. 119, e2113348119 (2022).
17. T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in 5th International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017 (OpenReview.net, 2017).
18. W. Lu, Z. Yang, and M. J. Buehler, "Rapid mechanical property prediction and de novo design of three-dimensional spider webs through graph and GraphPerceiver neural networks," J. Appl. Phys. 132, 074703 (2022).
19. Z. Yang, C.-H. Yu, K. Guo, and M. J. Buehler, "End-to-end deep learning method to predict complete strain and stress tensors for complex hierarchical composite microstructures," J. Mech. Phys. Solids 154, 104506 (2021).
20. E. L. Buehler and M. J. Buehler, "End-to-end prediction of multimaterial stress fields and fracture patterns using cycle-consistent adversarial and transformer neural networks," Biomed. Eng. Adv. 4, 100038 (2022).
21. Y. C. Hsu, Z. Yang, and M. J. Buehler, "Generative design, manufacturing, and molecular modeling of 3D architected materials based on natural language input," APL Mater. 10, 041107 (2022).
22. Z. Yang, C.-H. Yu, and M. J. Buehler, "Deep learning model to predict complex stress and strain fields in hierarchical composites," Sci. Adv. 7, 1–17 (2021).
23. D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, "A learning algorithm for Boltzmann machines," Cognit. Sci. 9, 147–169 (1985).
24. J. Jumper et al., "Highly accurate protein structure prediction with AlphaFold," Nature 596, 583–589 (2021).
25. E. Callaway, "What's next for AlphaFold and the AI protein-folding revolution," Nature 604, 234–238 (2022).
26. M. AlQuraishi, "End-to-end differentiable learning of protein structure," Cell Syst. 8, 292–301.e3 (2019).
27. C.-H. Yu, Z. Qin, F. J. Martin-Martinez, and M. J. Buehler, "A self-consistent sonification method to translate amino acid sequences into musical compositions and application in protein design using artificial intelligence," ACS Nano 13, 7471–7482 (2019).
28. Y.-C. Hsu, C.-H. Yu, and M. J. Buehler, "Using deep learning to predict fracture patterns in crystalline solids," Matter 3, 197–211 (2020).
29. T. Giesa, D. I. Spivak, and M. J. Buehler, "Category theory based solution for the building block replacement problem in materials design," Adv. Eng. Mater. 14, 810–817 (2012).
30. Z. Du et al., "The trRosetta server for fast and accurate protein structure prediction," Nat. Protoc. 16, 5634–5651 (2021).
31. G. Ahdritz et al., "OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization," bioRxiv:2022.11.20.517210 (2022).
32. R. Wu et al., "High-resolution de novo structure prediction from primary sequence," bioRxiv:2022.07.21.500999 (2022).
33. G. C. Sosso and M. Bernasconi, "Harnessing machine learning potentials to understand the functional properties of phase-change materials," MRS Bull. 44, 705–709 (2019).
34. O. T. Unke et al., "Machine learning force fields," Chem. Rev. 121, 10142–10186 (2021).
35. R. Pederson, B. Kalita, and K. Burke, "Machine learning and density functional theory," Nat. Rev. Phys. 4, 357–358 (2022).
36. S. Batzner et al., "E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials," Nat. Commun. 13, 2453 (2022).
37. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations, edited by D. E. Rumelhart and J. L. McLelland (MIT Press, 1986), Vol. 1, pp. 318–362.
38. B. Mel and C. Koch, "Sigma-Pi learning: On radial basis functions and cortical associative learning," in Advances in Neural Information Processing Systems (NIPS) (Morgan-Kaufmann, 1989), Vol. 2.
39. B. Lenze, "How to make sigma-pi neural networks perform perfectly on regular training sets," Neural Networks 7, 1285–1293 (1994).
40. C. Giles, R. Griffin, and T. Maxwell, "Encoding geometric invariances in higher-order neural networks," in Proceedings of the 1987 International Conference on Neural Information Processing Systems (American Institute of Physics, 1987), pp. 301–309.
41. S. Reed et al., "A generalist agent," arXiv:2205.06175 (2022).
42. J. Pennington, R. Socher, and C. Manning, "Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Association for Computational Linguistics, 2014), pp. 1532–1543.
43. T. Kudo and J. Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (System Demonstrations) (Association for Computational Linguistics, 2018), pp. 66–71.
44. A. Radford et al., "Learning transferable visual models from natural language supervision," in Proceedings of the 38th International Conference on Machine Learning (PMLR, 2021), Vol. 139, pp. 8748–8763.
45. M. Bates, "Models of natural language understanding," Proc. Natl. Acad. Sci. U. S. A. 92, 9977–9982 (1995).
46. L. Agostini, "Exploration and prediction of fluid dynamical systems using auto-encoder technology," Phys. Fluids 32, 067103 (2020).
47. J. Carrasquilla, G. Torlai, R. G. Melko, and L. Aolita, "Reconstructing quantum states with generative models," Nat. Mach. Intell. 1, 155–161 (2019).
48. A. Rocchetto, E. Grant, S. Strelchuk, G. Carleo, and S. Severini, "Learning hard quantum distributions with variational autoencoders," npj Quantum Inf. 4, 28 (2018).
49. I. A. Luchnikov, A. Ryzhov, P.-J. Stas, S. N. Filippov, and H. Ouerdane, "Variational autoencoder reconstruction of complex many-body physics," Entropy 21, 1091 (2019).
50. D. di Sante et al., "Deep learning the functional renormalization group," Phys. Rev. Lett. 129, 136402 (2022).
51. T. B. Brown et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2020), Vol. 33, pp. 1877–1901.
52. R. Taylor et al., "Galactica: A large language model for science," arXiv:2211.09085 (2022).
53. A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, "Hierarchical text-conditional image generation with CLIP latents," arXiv:2204.06125 (2022).
54. R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2022), pp. 10684–10695.
55. R. J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Comput. 1, 270–280 (1989).
56. S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput. 9, 1735–1780 (1997).
57. K. Cho et al., "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Association for Computational Linguistics, 2014), pp. 1724–1734.
58. D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
59. A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE, 2013), pp. 6645–6649.
60. Y. Miao, M. Gowayyed, and F. Metze, "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) (IEEE, 2015), pp. 167–174.
61. T. Mikolov, S. Kombrink, L. Burget, J. Cernocky, and S. Khudanpur, "Extensions of recurrent neural network language model," in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2011), pp. 5528–5531.
62. X. Chen and C. L. Zitnick, "Mind's eye: A recurrent visual representation for image caption generation," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2015), pp. 2422–2431.
63. Y. Yu, X. Si, C. Hu, and J. Zhang, "A review of recurrent neural networks: LSTM cells and network architectures," Neural Comput. 31, 1235–1270 (2019).
64. J. Weston, S. Chopra, and A. Bordes, "Memory networks," in 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
65. S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, "End-to-end memory networks," in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2015), Vol. 28, pp. 2440–2448.
66. A. Kumar et al., "Ask me anything: Dynamic memory networks for natural language processing," in Proceedings of the 33rd International Conference on Machine Learning (PMLR, 2016), Vol. 48, pp. 1378–1387.
67. P. Veličković et al., "Graph attention networks," in 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018 (OpenReview.net, 2018).
68. S. Khan et al., "Transformers in vision: A survey," ACM Comput. Surv. 54, 1–41 (2022).
69. T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Comput. Intell. Mag. 13, 55–75 (2018).
70. S. Zhang, L. Yao, A. Sun, and Y. Tay, "Deep learning based recommender system," ACM Comput. Surv. 52, 1–38 (2020).
71. S. Chaudhari, V. Mithal, G. Polatkan, and R. Ramanath, "An attentive survey of attention models," ACM Trans. Intell. Syst. Technol. 12, 53 (2021).
72. K. Xu et al., "Show, attend and tell: Neural image caption generation with visual attention," in Proceedings of the 32nd International Conference on Machine Learning (JMLR.org, 2015), Vol. 37, pp. 2048–2057.
73. I. Su and M. J. Buehler, "Spider silk: Dynamic mechanics," Nat. Mater. 15, 1054–1055 (2016).
74. I. Su and M. J. Buehler, "Nanomechanics of silk: The fundamentals of a strong, tough and versatile material," Nanotechnology 27, 302001 (2016).
75. I. Su, G. S. Jung, N. Narayanan, and M. J. Buehler, "Perspectives on three-dimensional printing of self-assembling materials and structures," Curr. Opin. Biomed. Eng. 15, 59–67 (2020).
76. A. Dosovitskiy et al., "An image is worth 16 × 16 words: Transformers for image recognition at scale," in 9th International Conference on Learning Representations (ICLR), Virtual Event, Austria, 3–7 May 2021 (OpenReview.net, 2021).
77. M. Janner, Q. Li, and S. Levine, "Offline reinforcement learning as one big sequence modeling problem," in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2021), Vol. 34, pp. 1273–1286.
78. M. J. Buehler, "FieldPerceiver: Domain agnostic transformer model to predict multiscale physical fields and nonlinear material properties through neural ologs," Mater. Today 57, 9–25 (2022).
79. N. Kitaev, L. Kaiser, and A. Levskaya, "Reformer: The efficient transformer," in 8th International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020 (OpenReview.net, 2020).
80. A. Jaegle et al., "Perceiver: General perception with iterative attention," in Proceedings of the 38th International Conference on Machine Learning (PMLR, 2021), Vol. 139, pp. 4651–4664.
81. J. Fang, Y. Yu, C. Zhao, and J. Zhou, "TurboTransformers: An efficient GPU serving system for transformer models," in Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Association for Computing Machinery, 2021), pp. 389–402.
82. Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, and R. Salakhutdinov, "Transformer-XL: Attentive language models beyond a fixed-length context," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Association for Computational Linguistics, 2019), pp. 2978–2988.
83. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Association for Computational Linguistics, 2019), pp. 4171–4186.
84. A. Rives et al., "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences," Proc. Natl. Acad. Sci. U. S. A. 118, e2016239118 (2021).
85. T. Gupta, M. Zaki, N. M. A. Krishnan, and Mausam, "MatSciBERT: A materials domain language model for text mining and information extraction," npj Comput. Mater. 8, 102 (2022).
86. N. Brandes, D. Ofer, Y. Peleg, N. Rappoport, and M. Linial, "ProteinBERT: A universal deep-learning model of protein sequence and function," Bioinformatics 38, 2102–2110 (2022).
87. A. Radford et al., "Language models are unsupervised multitask learners," OpenAI Blog 1, 9 (2019).
88. A. Madani et al., "ProGen: Language modeling for protein generation," arXiv:2004.03497 (2020).
89. E. Nijkamp, J. Ruffolo, E. N. Weinstein, N. Naik, and A. Madani, "ProGen2: Exploring the boundaries of protein language models," arXiv:2206.13517 (2022).
90. L. Moffat, S. M. Kandathil, and D. T. Jones, "Design in the DARK: Learning deep generative models for de novo protein design," bioRxiv:2022.01.27.478087 (2022).
91. N. Ferruz, S. Schmidt, and B. Höcker, "ProtGPT2 is a deep unsupervised language model for protein design," Nat. Commun. 13, 4348 (2022).
92. See https://chat.openai.com for OpenAI ChatGPT, 2022.
93. See https://openai.com/dall-e-2/ for OpenAI, DALL·E 2.
94. See https://www.matdat.com for MATDAT.
95. See http://www.matweb.com for MatWeb.
96. See https://matmatch.com for MatMatch.
97.
98. V. Tshitoyan et al., "Unsupervised word embeddings capture latent knowledge from materials science literature," Nature 571, 95–98 (2019).
99. L. Weston et al., "Named entity recognition and normalization applied to large-scale information extraction from the materials science literature," J. Chem. Inf. Model. 59, 3692–3702 (2019).
100. E. A. Olivetti et al., "Data-driven materials research enabled by natural language processing and information extraction," Appl. Phys. Rev. 7, 041317 (2020).
101. E. Kim et al., "Machine-learned and codified synthesis parameters of oxide materials," Sci. Data 4, 170127 (2017).
102. C. J. Court and J. M. Cole, "Auto-generated materials database of Curie and Néel temperatures via semi-supervised relationship extraction," Sci. Data 5, 180111 (2018).
103. M. C. Swain and J. M. Cole, "ChemDataExtractor: A toolkit for automated extraction of chemical information from the scientific literature," J. Chem. Inf. Model. 56, 1894–1904 (2016).
104. J. Chen et al., "TransUNet: Transformers make strong encoders for medical image segmentation," arXiv:2102.04306 (2021).
105. H. J. Logarzo, G. Capuano, and J. J. Rimoli, "Smart constitutive laws: Inelastic homogenization through machine learning," Comput. Methods Appl. Mech. Eng. 373, 113482 (2021).
106. R. Pearce and Y. Zhang, "Deep learning techniques have significantly impacted protein structure prediction and protein design," Curr. Opin. Struct. Biol. 68, 194–207 (2021).
107. R. Chowdhury et al., "Single-sequence protein structure prediction using a language model and deep learning," Nat. Biotechnol. 40, 1617–1623 (2022).
108. J. Yang et al., "Improved protein structure prediction using predicted interresidue orientations," Proc. Natl. Acad. Sci. U. S. A. 117, 1496–1503 (2020).
109. M. Baek et al., "Accurate prediction of protein structures and interactions using a three-track neural network," Science 373, 871–876 (2021).
110. K. E. Wu et al., "Protein structure generation via folding diffusion," arXiv:2209.15611 (2022).
111. E. Khare, C. Gonzalez-Obeso, D. L. Kaplan, and M. J. Buehler, "Collagen transformer: End-to-end transformer model to predict thermal stability of collagen triple helices using an NLP approach," ACS Biomater. Sci. Eng. 8, 4301–4310 (2022).
112. Y. Hu and M. J. Buehler, "End-to-end protein normal mode frequency predictions using language and graph models and application to sonification," ACS Nano 16, 20656–20670 (2022).
113. A. J. Lew, C.-H. Yu, Y.-C. Hsu, and M. J. Buehler, "Deep learning model to predict fracture mechanisms of graphene," npj 2D Mater. Appl. 5, 48 (2021).
114. M. S. R. Elapolu, M. I. R. Shishir, and A. Tabarraei, "A novel approach for studying crack propagation in polycrystalline graphene using machine learning algorithms," Comput. Mater. Sci. 201, 110878 (2022).
115. M. J. Buehler, "Modeling atomistic dynamic fracture mechanisms using a progressive transformer diffusion model," J. Appl. Mech. 89, 121009 (2022).
116. M. J. Buehler, "A computational building block approach towards multiscale architected materials analysis and design with application to hierarchical metal metamaterials," Modelling and Simulation in Materials Science and Engineering (IOP Publishing, submitted).
117. A. J. Lew and M. J. Buehler, "Single-shot forward and inverse hierarchical architected materials design for nonlinear mechanical properties using an attention-diffusion model," unpublished (2022).
118. M. J. Buehler, "DeepFlames: Neural network-driven self-assembly of flame particles into hierarchical structures," MRS Commun. 12, 257–265 (2022).
119. S. Risi, "The future of artificial intelligence is self-organizing and self-assembling," https://sebastianrisi.com/self_assembling_ai/ (2021).
120. L. Kirsch and J. Schmidhuber, "Meta learning backpropagation and improving it," in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2021), Vol. 34, pp. 14122–14134.
121. Y. Tang and D. Ha, "The sensory neuron as a transformer: Permutation-invariant neural networks for reinforcement learning," in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2021), Vol. 34, pp. 22574–22587.
122. G. Lample and F. Charton, "Deep learning for symbolic mathematics," in 8th International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020 (OpenReview.net, 2020).
123. F. Charton, A. Hayat, and G. Lample, "Learning advanced mathematical computations from examples," in 9th International Conference on Learning Representations (ICLR), Virtual Event, Austria, 3–7 May 2021 (OpenReview.net, 2021).
124. L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis, "Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators," Nat. Mach. Intell. 3, 218–229 (2021).
125. J. Wu, J. B. Tenenbaum, and P. Kohli, "Neural scene de-rendering," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2017), pp. 7035–7043.
126. K. Yi et al., "Neural-symbolic VQA: Disentangling reasoning from vision and language understanding," in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2018), Vol. 31, pp. 1031–1042.
127. J. Mao, C. Gan, P. Kohli, J. B. Tenenbaum, and J. Wu, "The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision," in 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019 (OpenReview.net, 2019).
128. C. Han, J. Mao, C. Gan, J. B. Tenenbaum, and J. Wu, "Visual concept metaconcept learning," in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2019), Vol. 32, pp. 5001–5012.
129. S. Odense and A. d'A. Garcez, "A semantic framework for neural-symbolic computing" (2022).
130. A. d'A. Garcez and L. C. Lamb, "Neurosymbolic AI: The 3rd wave," arXiv:2012.05876 (2020).