In-memory computing (IMC) using emerging nonvolatile devices has received considerable attention due to its great potential for accelerating artificial neural networks and machine learning tasks. As the basic concept and operation modes of IMC are now well established, there is growing interest in extending it to wide and general applications. In this perspective, the path that leads memristive IMC to general-purpose machine learning is discussed in detail. First, we review the development timeline of machine learning algorithms that employ memristive devices, such as resistive random-access memory and phase-change memory. Then we summarize two key aspects of realizing IMC-based general-purpose machine learning: one is a heterogeneous computing system for algorithmic completeness, and the other is configurable precision techniques that navigate the precision-efficiency dilemma. Finally, the major directions and challenges of memristive IMC-based general-purpose machine learning are proposed from a cross-level design perspective.

Large language models (LLMs), such as ChatGPT and ERNIE Bot, have recently attracted widespread attention because of their superior ability in artificial intelligence-generated content (AIGC). Machine learning (ML) advancements over the past few decades have greatly benefited the development of LLMs and significantly contributed to applications such as image processing, autopilot, and recommendation systems [Fig. 1(a)]. Initially, ML relied on traditional mathematical and probabilistic models such as regression, clustering, and Bayesian theory to identify patterns in data and apply them to real-world applications. With the explosive growth of data in the Internet era, deep learning algorithms have become increasingly popular in recent years. These algorithms allow for the creation of more complex models that can process vast amounts of data and make accurate predictions. Various ML algorithms complement each other and have demonstrated excellent performance in practice. Despite their success, ML algorithms face a significant obstacle in current computer architecture known as the von Neumann bottleneck,1,2 which hampers the utility of ML algorithms due to computational power, resource consumption, and delay constraints.3 In-memory computing (IMC) offers a revolutionary technological means of improving computing performance for ML tasks.

FIG. 1.

The concept of memristive general-purpose machine learning. (a) Diverse applications and algorithms of machine learning. (b) The concept of vector–matrix multiplication with nonvolatile memories in crossbar structure. (c) Memristive IMC-based general-purpose machine learning, which covers diverse applications, algorithms, operations, and hardware basis.


By integrating computing functions into memory, the IMC architecture minimizes time-consuming massive data movement in and out of memory. A range of memory technologies, including commercial memories such as static random-access memory (SRAM), dynamic random-access memory (DRAM), and flash memory,4–7 as well as emerging nonvolatile memory (NVM), such as resistive random-access memory (RRAM),8 phase change memory (PCM),9 ferroelectric RAM (FeRAM),11 and magnetic RAM (MRAM),12 have been applied as basic components of IMC hardware systems. Emerging memories are particularly promising for IMC implementation compared with current commercial ones, as they offer a compromise among speed, power consumption, and storage density based on their native device properties. These resistive switching devices, also called memristors, represent a new way of storing real values in resistance states rather than as charge in conventional CMOS-based memory; this is the fundamental principle behind the implementation of memristive IMC. Moreover, memristive IMC benefits greatly from the crossbar architecture [Fig. 1(b)]. A crossbar structure has the innate ability to store a weight matrix directly, utilizing the physical resistance states at every cross node. Crossbar-based IMC realizes one-step vector–matrix multiplication (VMM) with approximately constant time complexity,13,14 making memristive IMC hardware-friendly as a VMM-intensive accelerator. The process involves three steps. First, a digital-to-analog converter (DAC) encodes the input vectors in the row direction. Then, the crossbar performs the VMM based on Ohm's and Kirchhoff's laws. Third, an analog-to-digital converter (ADC) converts the accumulated analog results into digital signals, producing the corresponding VMM results.
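
As a minimal illustration of this principle, the sketch below models an ideal crossbar in Python: stored conductances form the matrix, input voltages form the vector, and the column currents obtained from Ohm's and Kirchhoff's laws are the VMM result. The names and value ranges are illustrative assumptions, and device nonidealities are ignored.

```python
import numpy as np

def ideal_crossbar_vmm(voltages, conductances):
    """Model one read cycle of an ideal NVM crossbar.

    voltages:     (rows,) input vector encoded by DACs as row voltages.
    conductances: (rows, cols) weight matrix stored as device conductances.
    Returns the column currents, i.e., the analog VMM result.
    """
    # Ohm's law: each device passes I = V * G; Kirchhoff's current law
    # sums the currents of all devices sharing a column line.
    return voltages @ conductances

rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(64, 32))  # conductances in siemens
V = rng.uniform(0.0, 0.2, size=64)          # read voltages in volts
I = ideal_crossbar_vmm(V, G)                # one-step VMM on the array
print(I.shape)  # (32,)
```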

Nowadays, memristive IMC-based general VMM accelerators are increasingly used in various ML algorithms, especially deep learning,14–19 where the NVM arrays are typically regarded as a dot-product engine. Generally, memristive IMC circuits are customized for specific applications and limited to fixed scenarios. For example, a crossbar array may accelerate the fully connected layers in feed-forward networks20 or the convolution operations in convolutional neural networks (CNNs).18 Special functions, such as diverse similarity calculations, can also be implemented on the arrays with their parallel VMM capability.16,20–23 Based on these scenarios, VMM operations using NVM crossbars can be regarded as a general and extensive process in ML models that constructs a generic acceleration operator regardless of the device type. Hence, in this perspective, memristive IMC-based general-purpose ML (GPML) is proposed, and we discuss how it can be applied to a general computing platform with broader application fields.24 Unlike highly customized memristive IMC circuits, memristive IMC-based GPML is described here as a cross-level implementation from application to hardware basis [Fig. 1(c)]. A framework that copes with diverse ML algorithms in a VMM-based manner is desired to bring memristive IMC circuits to a high quality-price ratio, considering the trade-offs among design cost, system performance, and application compatibility. The memristive IMC platform is expected to be the designated coprocessor for ML acceleration, contributing to solving the von Neumann bottleneck in large ML models. However, two inherent issues must be addressed when extending memristive IMC to GPML. The first is algorithmic completeness, which arises from the dilemma between the functionally constrained crossbar topology and the multiple computing operators used in ML algorithms; the crossbar topology enables only simple and efficient VMM with little flexibility. The second involves the limits of precision reconfigurability, given the precision-efficiency dilemma. Precision reconfigurability refers to the ability of an IMC system to execute various ML algorithms with the accuracy and precision required to achieve high efficiency; the computing precision directly influences the performance of the system.24

In this perspective, we focus on the recent development of two typical memristors, RRAM and PCM, for ML applications and summarize solutions to the two conflicts that arise when building memristive IMC-based GPML. The remainder of the paper is organized as follows: First, we review the timeline of IMC-based ML applications and explain IMC-accelerated models, including artificial neural networks (ANNs) and the similarity searches used in typical ML algorithms. Then, we identify the paths that may resolve the two inherent conflicts of IMC. One involves the development of an algorithmically complete heterogeneous computing system. The other deals with precision reconfigurability for the precision-efficiency dilemma. Finally, key challenges and opportunities are identified for memristive GPML. We hope this manuscript provides considerable sparks to bring IMC into wider application fields.

In the early years, memristive devices, e.g., RRAM, were widely used in the field of synaptic plasticity.25,26 Many studies utilized RRAM to realize typical associative learning applications, such as Pavlov's dog experiment.26–29 Using memristive devices to physically model ANN synapses enabled their application in broader fields.20 Since then, using the analog properties of memristive arrays to build efficient VMM processes has become more common in the industry for IMC-based acceleration. Applications have gradually extended from basic ANNs to the entire ML domain. Among these applications, memristive neural networks and similarity calculation methods occupy the two mainstream directions. Besides, algorithms including logistic regression15 and principal component analysis (PCA)30 also take full advantage of VMM acceleration in the NVM array. Figure 2 depicts the timeline of typical ML applications, including ANNs and typical ML methods. An important trend is that the models realized in IMC systems are gradually becoming more complex and systematic, covering most of the research areas in ML. Having developed for more than a decade, memristive IMC shows the potential to partly replace general computing systems, such as central processing units (CPUs), graphics processing units (GPUs), and field-programmable gate arrays (FPGAs), in the field of ML. This brings memristive IMC into GPML.

FIG. 2.

A timeline of IMC-based machine learning algorithms. Over the past years, a large number of algorithms have been supported in the memristive IMC-based systems. This provides the potential for bringing memristive IMC into general-purpose ML.


In 2013, a single-layer perceptron was experimentally demonstrated for a pattern classification task on a memristor crossbar.20 Subsequently, there has been a surge of interest in IMC-based ANN accelerators, which are gradually narrowing the gap with contemporary neural networks.9,19,30–33 In the following years, in situ training of fully connected layers drew much attention by utilizing the synaptic-like properties of memristive devices.9,17,19,31 Memristor-based systems have since facilitated the implementation of sophisticated applications, such as facial recognition33 and deeper ANNs like the multilayer perceptron (MLP).17,19,34 A key point that must be mentioned is the huge cost of array programming, which forces many accelerators to focus only on the inference process; exploring efficient online training methods on crossbar arrays thus becomes much more important. In 2018, a chip-level online-training MLP scheme was experimentally demonstrated on PCMs, where the PCM is employed for long-term storage of synaptic weights and capacitors integrated in the array are used for online training.9 In 2019, complex deep Q-learning35 and long short-term memory (LSTM) neural networks36 were achieved using analog–digital computing systems. During that timeframe, many simulation studies implementing CNNs on memristors for image processing were proposed.37 In 2020, a fully hardware-implemented CNN18 was demonstrated for the first time using memristors [Fig. 3(a)]. It achieved software-comparable classification accuracy and nearly a 100-fold improvement in energy efficiency compared with the Nvidia V100 platform. Furthermore, exploiting the randomness of resistive arrays, stochastic neural networks, including the Boltzmann machine and the Hopfield neural network, have also achieved much in optimization problems.37–42 A potential direction for IMC-based applications lies in currently popular models such as graph neural networks and the transformer, for which tentative works exist.38,42–45

FIG. 3.

Schematic of IMC-based applications. (a) Data deployment of mapping the weights of CNN to the hardware IMC system. Reproduced with permission from Yao et al., Nature 577, 641–646 (2020). Copyright 2020 Springer Nature Limited. (b) The concept of similarity match using Hamming distance and the 2R CAM scheme. Reproduced with permission from Yang et al., InfoMat. 5(5), e12416 (2023). Copyright 2023 Author(s), licensed under a Creative Commons Attribution 4.0 License. (c) The NeuRRAM chip and its reconfigurability for diverse models and different bit-precision. Reproduced with permission from Wan et al., Nature 608, 504–512 (2022). Copyright 2022 Author(s), licensed under a Creative Commons Attribution 4.0 License.


In addition to synapse-like weighted connections, similarity search is another crucial operation in ML. Data features of samples are stored in the crossbar to accelerate similarity calculations. In 2016, Hamming distance calculating layers were built using a 3D memristor array, which adopted VMM acceleration to realize language recognition tasks.46 Similarly, in 2018, Euclidean distance calculation, which involves nonlinear quadratic operations, was performed in crossbar architectures by mapping a specifically designed bias to store the squared terms.16 A similar approach has been adopted in later research as well.47 Additionally, cosine similarities for both binarized and analog vectors have been explored in memory-augmented neural networks and data clustering.47–50 The above schemes utilize one resistive device to store one value of the vector, as ANN-based accelerators do. Still, schemes that use multiple nonvolatile devices to build specified content addressable memories (CAMs) for efficient data searching are also attractive.
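
To make the bias trick concrete, the following sketch (an illustrative reconstruction, not the cited circuit) expands ||x − w||² = Σx² − 2x·w + Σw²: the −2w terms and a row of stored Σw² values live in the crossbar, so a single VMM with the input extended by a constant 1 yields all distances, and the input-dependent scalar Σx² is added digitally afterward.

```python
import numpy as np

def euclidean_via_vmm(x, W):
    """Squared Euclidean distances from x to all rows of W using one VMM.

    The crossbar stores [-2*W^T ; ||w||^2] so that appending a constant 1
    to the input turns the quadratic term into a linear read-out.
    """
    # Conductance matrix programmed once: shape (d + 1, n_prototypes).
    G = np.vstack([-2.0 * W.T, np.sum(W**2, axis=1, keepdims=True).T])
    v = np.append(x, 1.0)          # input vector extended with the bias row
    partial = v @ G                # one crossbar VMM: -2*x.w + ||w||^2
    return partial + np.dot(x, x)  # add the input-dependent scalar digitally

W = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]])  # stored prototypes
x = np.array([1.0, 1.0])
print(euclidean_via_vmm(x, W))  # [ 2.  0. 13.]
```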

Early CAM schemes designed specialized cells (e.g., 4-transistor 2-memristor, 4T2R, or 6T2R) for both binarized and real-valued data matching in IMC systems.23,51,52 These schemes are used for tree-based ML tasks and pattern matching. Later researchers21,52–55 proposed more compact CAM designs using a two-device structure, in which one CAM cell consists of two devices (2T2R53,54 or 2R21,55) and stores one state of data. Figure 3(b) shows the 2R CAM scheme.21,55 By applying designed opposite voltage signals to the two devices, the CAM cell indicates a state match or mismatch. The compact CAM cells are more consistent with the crossbar cells used in ANN accelerators in terms of input and storage configurations, removing the need for specially fabricated arrays. IMC-based similarity operations have been applied to various applications, including self-organized maps,47 competitive learning,22 and image retrieval,56 and they show bright prospects on the data center side. The diversity of ML-based applications expands the scenarios of IMC and ultimately leads to innovative changes in computing systems.
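
A conceptual sketch of the 2R idea is given below: each stored bit becomes a complementary conductance pair, opposite voltages encode the query bit, and a matching cell contributes a large positive current while a mismatching one contributes a negative current. This is a behavioral toy model under assumed on/off conductances, not a transcription of the cited cell designs.

```python
import numpy as np

V, G_ON, G_OFF = 0.2, 1e-4, 1e-6   # illustrative bias and conductances

def program_2r_cam(words):
    """words: (n_words, n_bits) binary matrix; each bit becomes a
    complementary conductance pair on two adjacent lines."""
    g1 = np.where(words == 1, G_ON, G_OFF)
    g2 = np.where(words == 1, G_OFF, G_ON)
    return g1, g2

def search(g1, g2, query):
    """Apply opposite voltages per query bit; matches add positive
    current, mismatches negative, so argmax finds the closest word."""
    v1 = np.where(query == 1, +V, -V)
    v2 = -v1
    currents = g1 @ v1 + g2 @ v2     # one accumulated current per word
    return int(np.argmax(currents))  # word with the most matching bits

g1, g2 = program_2r_cam(np.array([[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 1]]))
print(search(g1, g2, np.array([1, 0, 1, 1])))  # 0: exact match
```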

Cross-domain research, represented by few-shot learning, is also a focus for combining typical ML methods and deep learning models.54,56–59 Recently, a few-shot learning task was experimentally validated in which image embedding in a CNN, data hashing, and similarity matching were simultaneously realized on memristive arrays.54 However, the key issue is that such a highly customized system runs different functions on different arrays rather than providing reconfigurability automatically. Specifying the functional applications of the IMC system makes it difficult to automatically realize different applications in a large array. In this regard, pursuing GPML on IMC accelerators has become a meaningful goal, providing the possibility of a general platform for different ML methods. Many attempts have been made toward this goal. In 2019, a programmable CMOS-memristor computing system was proposed in which algorithms including PCA, sparse coding, and the single-layer perceptron could be implemented.30,60,61 In 2022, a compute-in-memory chip was realized that was capable of diverse deep-learning architectures and applications [Fig. 3(c)].62 Meanwhile, memristor-based field-programmable analog arrays were demonstrated that provided reconfigurable VMM and other analog units.63 In 2023, analog IMC chips based on PCMs were demonstrated.64,65 Digital processing units were embedded in these chips using on-chip links, and applications including ResNet and long short-term memory networks were realized experimentally. These demonstrations implement general matrix multiplications using crossbar arrays and support a variety of applications, showing the potential of memristive IMC for GPML.

Significant issues for memristive IMC-based GPML involve resolving two inherent conflicts: algorithmic completeness and precision reconfigurability. Although memristive memories have proven their potential for logical operations, an all-in-one logical machine for storage and computing is still at the proof-of-concept stage.66 The main shortcomings are processing speed,67,68 limited device endurance,67 and logic cascading.69 Currently, collaboration between digital computers and IMC systems to construct a heterogeneous computing system is a good choice for realizing algorithmic completeness. Although a heterogeneous system provides the basis for different models, it still cannot run the models with adaptive efficiency under the precision-efficiency dilemma. Precision reconfigurability can ensure that the system works with appropriate efficiency over a variety of ML scenarios.

Taking advantage of analog computing in NVM arrays, VMM-based acceleration has greatly unleashed its potential in ML applications. Encoding the data in the frequency domain has even realized matrix–matrix multiplication.70,71 Following this trend, custom-designed analog circuits, such as the closed-loop array architecture,15,42 ReLU function,72 and analog-domain data comparison,55 have been developed to process more data in the analog domain in pursuit of higher efficiency. But high efficiency in custom circuits largely sacrifices functional flexibility. Table I summarizes the proportions of typical operators in popular ML algorithms. Although VMM-involved operations (convolution, linear) form the largest category, the diversity of other operations, such as pooling, activation, and batch normalization, accounts for a large part of these algorithms and greatly impacts their effectiveness. It is impossible to realize all these functions in the analog domain; moreover, the key bottleneck of these operations is not the amount of computation but rather the control complexity in which digital systems specialize. Thus, on the way to GPML, the IMC system must strike a balance between the analog and digital computing parts, adopting the mindset of a heterogeneous system.73

TABLE I.

The number of calculations of typical operators and total params in popular deep-learning models.a–c,73–79 Entries with parenthesized percentages are the most computationally expensive operations in each model.

Image models:

| Model       | Convolution    | Linear  | Batch norm | ReLU   | Max pool | Average pool | Total params |
|-------------|----------------|---------|------------|--------|----------|--------------|--------------|
| AlexNet     | 0.66G (91.6%)  | 58.62M  | ⋯          | 0.49M  | 0.38M    | 9.22K        | 61.10M       |
| VGG16       | 15.36G (99.1%) | 123.63M | ⋯          | 13.56M | 6.12M    | 25.09K       | 138.36M      |
| ResNet50    | 4.09G (99.2%)  | 2.05M   | 21.83M     | 6.32M  | 0.80M    | 0.10M        | 25.56M       |
| DenseNet121 | 2.86G (98.2%)  | 1.02M   | 31.3M      | 15.62M | 0.80M    | 0.70M        | 7.98M        |

Sequence models:

| Model     | Convolution    | Linear          | Layer norm | Softmax | Mat-mul | GELU  | Tanh  | Total params |
|-----------|----------------|-----------------|------------|---------|---------|-------|-------|--------------|
| BERT-base | ⋯              | 21.74G (97.23%) | 12.29M     | 2.36M   | 603.98M | 0.59M | 6.14K | 109.48M      |
| GPT-2     | 77.31G (83.91%)| ⋯               | 32.11M     | 6.29M   | 1.61G   | ⋯     | ⋯     | 354.82M      |

a The image models are estimated with an input image size of (3,224,224).

b Sequence models are estimated with an input sequence length of 128.

c Mat-mul denotes the matrix multiplication operation in the attention mechanism.

Figure 4 shows a heterogeneous system used to build a general computing platform. In typical ML applications, such as image encoding and similarity calculation for recommendation80,81 in Fig. 4(a), data streams are processed in two directions according to processing complexity. One stream comprises the non-VMM operations, including scalar- and vector-based operations and logic operations, all of which are performed by conventional digital systems35 [Fig. 4(b)]. For example, the CPU and GPU are responsible for complex data control and parallel vector operations. Additionally, vector-based platforms may accelerate most parallel computing processes, leaving the traditional CPU to perform all other complex tasks. The second stream comprises the data-intensive, compute-hungry VMM-related operations, including convolution, fully connected layers, and similarity calculation [Fig. 4(c)]. The VMM-involved procedures in the heterogeneous system are assigned to memristive arrays: because all weights or samples can be stored in the resistive arrays, frequent data communication between memory and the IMC system becomes unnecessary. However, not all VMM-related operations should use memristive arrays to accelerate computation. For example, the matrices used to calculate scores in the attention model73 are intermediate results of the neural network, and implementing them on a memristive array is costly in terms of programming latency and power consumption. Therefore, SRAM-based IMC may be a good solution for accelerating the attention model, owing to its fast reading and writing.82 The heterogeneous system thus constitutes a data-hierarchical processor from scalar and vector to matrix. In this way, the diversity of ML-based applications and the acceleration of IMC can be accommodated simultaneously, making algorithmic completeness achievable.
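
A minimal sketch of this two-stream partitioning logic is shown below; the operator names and the split are illustrative assumptions rather than a fixed taxonomy.

```python
# Route each operator of an ML graph to the digital stream or the
# VMM stream of a heterogeneous IMC system (illustrative split).
VMM_OPS = {"conv2d", "linear", "similarity"}            # crossbar-friendly
DIGITAL_OPS = {"relu", "batchnorm", "pool", "concat", "softmax"}

def dispatch(op_name: str) -> str:
    if op_name in VMM_OPS:
        return "imc"        # stationary weights, one-step analog VMM
    return "cpu/gpu"        # control-heavy or scalar/vector work

for op in ["conv2d", "batchnorm", "relu", "linear", "softmax"]:
    print(op, "->", dispatch(op))
```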

FIG. 4.

A heterogeneous computing system. (a) Typical ML applications. All the operations are divided into (b) scalar and vector-based operations and the logic operations (e.g., activation, vector concatenate) and (c) VMM-based analog computing operations (including the convolution, fully connected layers, and the similarity calculation). (d) Workflow of the heterogeneous computing system.


Figure 4(d) shows one possible process for executing an IMC-based ML algorithm in a heterogeneous system. Digital systems, such as CPU-based systems, are used to develop the user-oriented top-level controlling code for the IMC-based application. During development, the IMC-supported ML library is used as the basis for building ML applications, providing both IMC-based operations (conv2D, for example) and digital platform-specified operations (Batchnorm2d, ReLU, etc.). Developed code is compiled into executable machine code by a compiler that bridges the digital and analog IMC sides. The IMC interface acts as the I/O port that receives control from the digital system and returns the results processed in the IMC system. The IMC system contains the essential working parts for VMM operations and functional controls. When the application runs, the digital system issues control to the IMC system and receives the processed data, which are then operated on in the digital system to produce results for applications like classification and data retrieval. This scheme may lower the computing speed and reduce efficiency but is still acceptable, as it provides the potential for GPML compared with customized designs.15,41,42
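
As a hypothetical illustration of such a library (IMCConv2d and the offload comment are invented names, not an existing API), a hybrid forward pass might look like the sketch below, with the VMM layer dispatched to crossbars and the rest staying in standard PyTorch:

```python
import torch
import torch.nn as nn

class IMCConv2d(nn.Conv2d):
    """Drop-in Conv2d whose VMM work would be offloaded to an IMC macro.

    Here the offload is only emulated in software; a real library would
    send the unrolled activations through the crossbar interface instead.
    """
    def forward(self, x):
        # Placeholder for: imc.write_once(self.weight); imc.vmm(x)
        return super().forward(x)  # software stand-in for the analog VMM

model = nn.Sequential(
    IMCConv2d(3, 16, 3, padding=1),  # analog VMM stream (crossbar)
    nn.BatchNorm2d(16),              # digital stream (CPU/GPU)
    nn.ReLU(),                       # digital stream
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),     # could also be IMC-mapped
)
print(model(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```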

As summarized in Fig. 5(a), higher precision in a memristive IMC system often lowers computing efficiency, which is contrary to the original purpose of IMC for computing acceleration. This is mainly due to the high resolution of DACs and ADCs,85,86 whose energy consumption grows exponentially with resolution. The processing precision of ML models is typically balanced against hardware overhead and influenced by the peripheral circuits and the storage precision of the NVM arrays. To address these issues, many quantization methods are adopted for both weights and activations in ML tasks.85–89 However, it is difficult to identify a general ML task pattern that fits all computing scenarios. For example, edge ML can operate in a low-precision domain with quantized models under constrained resources, whereas cloud AI typically uses floating-point (FP) calculations with complex datasets90 [Fig. 5(b)]. Moreover, weight precision needs differ across algorithms and applications. Classification tasks often have greater flexibility to run at low precision, whereas AI-oriented scientific computing requires FP numbers.91 Even within deep-learning methods, various IMC-based layers may need different precision levels to balance efficiency and accuracy.87,92,93 For GPML, reconfigurable precision is needed to satisfy the requirements of the precision-efficiency dilemma.24
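
As a minimal example of the weight/activation quantization these methods rely on (a generic uniform quantizer, not any specific cited scheme):

```python
import numpy as np

def uniform_quantize(x, n_bits):
    """Symmetric uniform quantization to n_bits signed integer levels."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax           # per-tensor scale factor
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q.astype(np.int32), scale           # integer codes + scale

w = np.random.randn(128, 64).astype(np.float32)
q4, s4 = uniform_quantize(w, 4)   # edge-oriented low precision
q8, s8 = uniform_quantize(w, 8)   # closer to FP accuracy
# Mean reconstruction error shrinks as precision grows.
print(np.abs(w - q4 * s4).mean(), np.abs(w - q8 * s8).mean())
```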

FIG. 5.

A description of precision completeness for various algorithms. (a) Summarized relationship between the computing precision and energy efficiency.61,92–104 (b) Diverse applications for different precision requirements. (c) The mixed-precision computing paradigm, which uses the digital processor to enhance the low-precision results of IMC. (d) Bit-slicing methods to expand hardware storage precision. (e) A format-aligned bit-slicing method.


A software-enhanced precision promotion method was investigated, whereby the analog arrays run at low precision and a digital system corrects the final precision of the results [Fig. 5(c)].105 However, this method still depends heavily on the precision of the hardware, highlighting the need to develop a reconfigurable-precision IMC system. With regard to the peripheral circuits, configurable DACs and ADCs have been proposed to achieve a better compromise between precision and efficiency.106,107 For example, DACs can be replaced by binarized signals encoded in the time domain, where the outputs are shifted and added to obtain the results.17,106 Although this method requires more time and peripheral circuits, it allows for the reconfigurability of input precision. Meanwhile, more attention has been focused on improving the precision of storage in arrays. One important solution is to increase the intermediate conductance levels of the memristive device; in fact, one such device has been reported to achieve 11-bit intermediate states.108 Downward compatibility enables high-conductance-level devices to switch between different precisions via programming. Another approach to improving the storage range at the array level is to use multiple low-precision devices to store high-precision data, known as bit-slicing.109 For instance, if an array has 3-bit storage capability, 9-bit data can be divided into three parts and assigned to three arrays individually [Fig. 5(d)]. Results from the different arrays are then transferred to a digital shift-and-add network that sums the partial products and produces the VMM result with the desired precision. The bit-slicing method can thus control storage precision by configuring the shift-and-add network.
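
The following sketch illustrates the bit-slicing arithmetic with the 9-bit/3-array example from the text (unsigned integers for simplicity; the slice width and shift-and-add recombination follow the description, everything else is illustrative):

```python
import numpy as np

SLICE_BITS, N_SLICES = 3, 3          # 3-bit arrays holding 9-bit weights

def slice_weights(w):
    """Split unsigned 9-bit integers into three 3-bit planes (LSB first)."""
    return [(w >> (SLICE_BITS * i)) & (2**SLICE_BITS - 1)
            for i in range(N_SLICES)]

def bit_sliced_vmm(x, w):
    """VMM with each 3-bit slice computed on its own array, then
    recombined by the digital shift-and-add network."""
    partials = [x @ s for s in slice_weights(w)]     # one VMM per array
    return sum(p << (SLICE_BITS * i) for i, p in enumerate(partials))

rng = np.random.default_rng(1)
w = rng.integers(0, 512, size=(8, 4))   # 9-bit weights
x = rng.integers(0, 16, size=8)
assert np.array_equal(bit_sliced_vmm(x, w), x @ w)  # exact recombination
print(bit_sliced_vmm(x, w))
```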

Although bit-slicing provides a flexible solution to the finite-state problem, the method is restricted by the linear mapping from physical conductance to real values according to the max–min conductance values.110 This makes transferring models from digital computing to IMC systems challenging. On the one hand, pre-trained backbone models like ResNet and VGGNet use standard FP or integer (INT) arithmetic, both of which are incompatible with the existing nonlinear mapping method of memristor circuits, resulting in unexpected accuracy loss.88 On the other hand, it is difficult to directly realize data communication between the analog computing in IMC and digital components like DRAM,111 raising the barrier for data communication. The solution to these problems lies in mapping standard-format data in digital space to counterparts in the pseudo-continuous storage space, which is called format-aligned mapping.72,112 In this approach, the length of the data is pre-aligned and fixed to match the standard data format, as shown in Fig. 5(e). However, format-aligned mapping cannot always make full use of each stored bit in its long tail, resulting in wasted storage space. An important solution is to develop new data formats adapted to IMC rather than using FP or INT directly. Combining precision reconfigurability with the heterogeneous system, memristive IMC can make significant strides toward higher computing precision and energy efficiency, which is crucial for enabling large-scale, high-performance GPML processing.
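
For reference, the max–min linear mapping that this discussion starts from can be sketched as follows (an idealized mapping onto an assumed conductance window; real flows add write-verify loops and nonideality compensation):

```python
import numpy as np

G_MIN, G_MAX = 1e-6, 1e-4   # illustrative device conductance window

def linear_map(w):
    """Linearly map real-valued weights onto [G_MIN, G_MAX]."""
    w_min, w_max = w.min(), w.max()
    return G_MIN + (w - w_min) / (w_max - w_min) * (G_MAX - G_MIN)

w = np.random.randn(4, 4)
g = linear_map(w)
print(g.min(), g.max())  # extremes of w land on the window edges
```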

Heterogeneous computing systems that support GPML, as one of the main future directions for IMC development, are still in their initial stages. Many present IMC systems realize only a part of GPML, such as precision or algorithm configurability. However, several key issues must be addressed at the hardware and software levels. Most hardware concerns involve system integration, which combines the requirements for heterogeneous systems and precision reconfigurability. At the software level, the goal is to introduce IMC to a broader application-oriented field with a complete IMC ecosystem.

Although many experimental achievements have been obtained in memristive IMC, hardware implementation toward memristive IMC-based GPML still faces many challenges. The main problems involve three aspects, from the basic device and array level and the functional IMC core design to heterogeneous macrointegration, and should be further explored.

1. Device optimization and NVM array fabrication

The first hardware problem is device optimization and array fabrication, which form the basis of memristive IMC. At the device level, the critical decisions are optimizing the materials and device structures. The goal of device optimization is, on the one hand, to provide more stable intermediate levels of data representation that meet the requirement of precision reconfigurability; on the other hand, it is to develop memristive devices compatible with existing CMOS processes, which enables the integration of NVM devices with mature CMOS technology.113,114 At the array level, considerable attention has been focused on increasing the storage capacity of single arrays. As shown earlier in Table I, reported models contain tens to hundreds of megabits of VMM-involved params, making them significantly larger than current array sizes.61,97,104 Size limitations increase the cost of production.115 In the crossbar topology, the size of the array is largely limited by parasitic effects, including line resistance and parasitic capacitance.116,117 In large arrays, the IR drop across the line resistance greatly interferes with the accuracy of calculation and lowers the performance of the algorithms. Reasonable solutions for large-array fabrication are to increase the on-state resistance of the device118 or to use 3D integration.119,120 But challenges still exist; for example, high device resistance causes high delay, and highly efficient device stacking methods for 3D arrays remain to be explored.
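
A first-order feel for the IR-drop issue can be obtained from the sketch below: cumulative wire resistance attenuates the voltage seen by devices far from the driver, so effective weights drift with position. This is a deliberately crude series-resistance model with invented parameter values, not a full nodal simulation.

```python
import numpy as np

R_WIRE = 0.5          # ohms per crossbar wire segment (illustrative)
G_DEV = 5e-5          # device conductance, siemens
V_IN = 0.2            # driver voltage, volts
N = 128               # devices along one row

# Crude series model: every device draws I ~= V_IN * G_DEV, and the
# current of all devices beyond position k flows through segment k.
I_dev = V_IN * G_DEV
v, drop = np.empty(N), 0.0
for k in range(N):
    drop += R_WIRE * I_dev * (N - k)  # drop across segment k
    v[k] = V_IN - drop

print(f"far-end voltage: {v[-1]:.4f} V "
      f"({100 * (V_IN - v[-1]) / V_IN:.1f}% IR drop)")
```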

2. Configurable circuit in IMC cores

The second problem is the need for a configurable IMC core based on the memristive arrays. At the circuit level, the basic problem is to define the physical connections when building computing blocks. The IMC core usually contains more than one array to provide sufficient storage capacity.61,103 It is difficult to equip each individual array with its own input and output circuits, which would result in large hardware overheads. One common scheme shares inputs and outputs among multiple arrays, using multiplexers to switch the physical connections between functional circuits and arrays. In addition, for data input, configurable DACs or binarized input signals in the time domain are used to enable adjustable input precision. However, reconfigurable inputs require adaptability in output processing, such as programmable output registers with different lengths. At the crossbar level, the precision of data storage can be adapted for diverse applications. Configuring array precision relies on programming circuits and bit-slicing methods; for example, shift-and-add circuits can concatenate partial results from different arrays under the bit-slicing method. In IMC-based hardware design, a configurable design is necessary to make the optimization of specific application designs possible. But it brings more circuit redundancy, which further results in more area, more power, and more complex control. The most representative operation is the shift-and-add, which may cost as much as the analog hardware itself.

3. Heterogeneous macrointegration

Thirdly, more attention should be paid to heterogeneous system integration of computing platforms based on IMC cores. A heterogeneous system covering algorithmic completeness requires different systems, including IMC, CPU, GPU, and FPGA, to cooperate for higher performance and energy efficiency. The conventional idea is to integrate the digital and IMC systems in one macro to avoid long communication paths between different platforms. But this is often challenging because the fabrication process of memory usually lags behind that of the processor. The memristor-CMOS system provides a solution that stacks memristors on finished functional circuits using PAD connections.61 A more feasible way is to consider heterogeneous system-on-chip integration schemes, such as chiplets and EMIB.121,122 Three-dimensional stacked hetero-integrated chips using optoelectronic device arrays and memristor crossbar arrays have been demonstrated and show exciting multimodal sensing and in-memory computing capabilities.123 These schemes allow for more flexibility in design and easier manufacturing. However, the pressure then shifts to the high-bandwidth in-chip interconnects between disparate computing systems.

Once the hardware design of the IMC system is accomplished, the next goal is to develop a top-level software ecosystem. Providing high-level programming methods for users, especially developers, is essential to encourage the use of memristive IMC-based GPML. Building a user-friendly computing environment will significantly advance the adoption of ML algorithms in IMC systems.

1. IMC-supported ML developing library

The first major issue is to develop an IMC-enabled ML acceleration library for code-level realization. For general users, the underlying hardware is of little concern because an IMC system is regarded as a black box for application purposes. The interface of the system separates developers from the underlying hardware so that they can focus on functional realization. A high-level library wraps the functional operations, such as convolution, fully connected layers, and distance calculation schemes, with VMM-accelerated IMC macros. Many CPU- and GPU-based ML applications have been created using mature ML frameworks, such as PyTorch and TensorFlow, making application development easier. However, there are still no public or commercial libraries that support the development of top-level IMC applications, for example, an IMC-supported PyTorch.

2. System technology co-optimization (STCO) tools

The second issue is the need to build system technology co-optimization (STCO) tools that support software–hardware co-optimization of the high-level code to reach the best performance. Different from digital computing platforms, IMC is an approximate computing system that contains nonidealities in the analog domain. Inevitable nonideal factors in the array, such as read and program disturbance, finite conductance levels, and conductance drift, may largely influence the performance of ML applications. Software promotion methods, such as noise injection during training,61,124,125 are taken into consideration to provide margin for various applications. These methods rely on the robustness of the application against hardware deficiencies, based on behavioral prediction of the hardware. Moreover, when finite conductance quantization and device disturbance are considered, optimization in the IMC design space is an NP-hard problem and remains to be handled; for example, neural architecture search on RRAM has been used to obtain the best CNN performance while lowering the requirements on the ADC/DAC and device precision.126 Many studies provide simulation frameworks that realize design space exploration and performance estimation of ML tasks on the approximate IMC system.90,127–129 On the way to GPML, design tools for STCO are largely required to reach the best compromise between hardware resource consumption and software performance.
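
A minimal sketch of the noise-injection idea is shown below (generic Gaussian weight noise applied during the forward pass; the noise model and its magnitude are illustrative assumptions, not those of the cited works):

```python
import torch
import torch.nn as nn

class NoisyLinear(nn.Linear):
    """Linear layer that injects conductance-like noise while training,
    so the learned weights tolerate programming/read variations."""
    def __init__(self, *args, noise_std=0.05, **kwargs):
        super().__init__(*args, **kwargs)
        self.noise_std = noise_std

    def forward(self, x):
        if self.training:
            # Relative Gaussian perturbation emulating device variation.
            w = self.weight * (1 + self.noise_std *
                               torch.randn_like(self.weight))
            return nn.functional.linear(x, w, self.bias)
        return super().forward(x)   # clean weights at inference time

layer = NoisyLinear(64, 10)
layer.train()
print(layer(torch.randn(8, 64)).shape)  # torch.Size([8, 10])
```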

3. Instruction set architecture

Third, there is the need to design an instruction set architecture (ISA) that abstracts the hardware and constructs the basis for running high-level code on IMC systems. The ISA design should consider two aspects. The first is compatibility with existing digital systems: in a heterogeneous system design, mature instructions from mainstream digital architectures are adopted to reduce complexity, so key design rules, such as length, function division, and addressing space, should be consistent with existing instructions. The second is the functional design of instructions in the analog domain. Although the ISA design depends on the architecture of the IMC cores, the central concept still focuses on the VMM operation. Three types of instructions should be further considered in the IMC system: data loading, calculation, and communication. These instructions make the configurable circuits run in the desired modes. At the same time, the computing framework of IMC supporting instruction execution needs to be defined, but there is still no consensus in this field.
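
To make the three instruction classes concrete, a toy encoding might look like the sketch below; the opcodes, field widths, and mnemonics are all invented for illustration.

```python
from enum import IntEnum

class Op(IntEnum):
    LOAD = 0x1   # data loading: program a weight tile into an array
    VMM  = 0x2   # calculation: drive inputs, read accumulated outputs
    MOVE = 0x3   # communication: transfer results to the digital side

def encode(op: Op, array_id: int, addr: int) -> int:
    """Pack a toy 32-bit instruction: [8b opcode | 8b array | 16b addr]."""
    return (op << 24) | (array_id << 16) | (addr & 0xFFFF)

program = [
    encode(Op.LOAD, array_id=2, addr=0x0100),  # map a weight tile
    encode(Op.VMM,  array_id=2, addr=0x0200),  # run one VMM on it
    encode(Op.MOVE, array_id=2, addr=0x0300),  # ship results out
]
print([hex(i) for i in program])
```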

4. General compiler

Finally, there is an urgent need to build a general compiler that converts top-level code into operational instructions on the IMC system. The general compiler automatically adopts the results of the STCO tools, and the ISA is its core control component. In a formal library module, the array size is considered infinite and unrestricted; it is not instantiated until the compiler is run. When storing a matrix in the memristive array, the compiler first automatically determines the matrix division under the precision optimized by the STCO tools. Then, the processed data matrix is mapped and stored in NVM groups according to the available space table,130 and storage maps are produced through instruction execution. The storage maps record the information of the stored matrix. But, constrained by the topology of the matrix and the arrays, the matrix deployment problem is a 2D bin-packing problem, which is NP-hard.131 Therefore, the compiler should find the best data deployment according to the user's preference, such as efficiency priority (faster VMM operation) or resource priority (smaller occupied area or lower power consumption in IMC). When realizing the wrapped functions using VMM, the input data are converted to read instructions automatically according to the storage maps. The processed results are collected by the compiler and returned to the application. Thus, the compiler bridges ML applications and IMC systems and frees IMC-based applications from their system structures.
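
As a simplified sketch of the matrix-division step, the code below naively tiles a large weight matrix onto fixed-size arrays and records a storage map; the real deployment problem, as noted above, is a bin-packing optimization, and the array dimensions here are illustrative.

```python
import numpy as np

ARRAY_ROWS, ARRAY_COLS = 256, 256   # illustrative crossbar dimensions

def tile_matrix(w):
    """Split a large weight matrix into array-sized tiles and record a
    storage map: tile_id -> (row offset, col offset, tile shape)."""
    storage_map, tiles = {}, []
    for r in range(0, w.shape[0], ARRAY_ROWS):
        for c in range(0, w.shape[1], ARRAY_COLS):
            tile = w[r:r + ARRAY_ROWS, c:c + ARRAY_COLS]
            storage_map[len(tiles)] = (r, c, tile.shape)
            tiles.append(tile)      # each tile maps to one NVM array
    return tiles, storage_map

w = np.random.randn(600, 700)
tiles, smap = tile_matrix(w)
print(len(tiles), smap[0])          # 9 tiles; (0, 0, (256, 256))
```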

In conclusion, the use of the in-memory computing paradigm to accelerate machine learning algorithms has moved into a new phase, characterized by a shift from single algorithmic implementations to diverse and systematic applications. In this work, we reviewed the diverse applications realized on memristive arrays and then summarized two typical aspects that lead the way to IMC-based general-purpose machine learning. One is to develop a heterogeneous computing system covering algorithmic completeness. The other is to develop a precision-reconfigurable IMC system in pursuit of higher efficiency. At the end of this perspective, challenges and opportunities are proposed to clarify the possible directions of memristive IMC-based GPML, which will ultimately boost the progress of the full process of in-memory computing. We believe our study offers new insight for extending IMC into new application areas.

We acknowledge financial support from the STI 2030—Major Projects (Grant No. 2021ZD0201201), the National Key Research and Development Plan of MOST of China (Grant Nos. 2019YFB2205100 and 2022YFB4500101), and the National Natural Science Foundation of China (Grant No. 92064012). This work was also supported by the Hubei Engineering Research Center on Microelectronics and Chua Memristor Institute.

The authors have no conflicts to disclose.

Houji Zhou: Conceptualization (lead); Methodology (lead); Visualization (lead); Writing – original draft (lead); Writing – review & editing (equal). Jia Chen: Investigation (equal); Methodology (equal); Writing – original draft (equal); Writing – review & editing (equal). Jiancong Li: Methodology (equal); Writing – review & editing (supporting). Ling Yang: Visualization (supporting). Yi Li: Funding acquisition (equal); Project administration (equal); Resources (equal); Writing – review & editing (equal). Xiangshui Miao: Project administration (equal); Resources (equal).

The data that support the findings of this study are available within the article.

1.
M. A.
Zidan
et al, “
The future of electronics based on memristive systems
,”
Nat. Electron.
1
(
1
),
22
29
(
2018
).
2.
T.
Zhang
et al, “
Memristive devices and networks for brain-inspired computing
,”
Phys. Status Solidi RRL
13
(
8
),
1970031
(
2019
).
3.
M.
Horowitz
, “
1.1 computing’s energy problem (and what we can do about it)
,” in
2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC)
(
IEEE
,
2014
), pp.
10
14
.
4.
S.
Yin
et al, “
XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks
,”
IEEE J. Solid-State Circuits
55
,
1733
(
2020
).
5.
N.
Verma
et al, “
In-memory computing: Advances and prospects
,”
IEEE Solid-State Circuits Mag.
11
(
3
),
43
55
(
2019
).
6.
F.
Gao
et al, “
ComputeDRAM: In-memory compute using off-the-shelf DRAMs
,” in
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture
(
ACM
,
2019
), pp.
100
113
.
7.
A.
Sebastian
et al, “
Memory devices and applications for in-memory computing
,”
Nat. Nanotechnol.
15
(
7
),
529
544
(
2020
).
8.
Q.
Huo
et al, “
A computing-in-memory macro based on three-dimensional resistive random-access memory
,”
Nat. Electron.
5
(
7
),
469
477
(
2022
).
9.
S.
Ambrogio
et al, “
Equivalent-accuracy accelerated neural-network training using analogue memory
,”
Nature
558
(
7708
),
60
67
(
2018
).
10.
J. Y.
Park
et al, “
Revival of ferroelectric memories based on emerging fluorite-structured ferroelectrics
,”
Adv. Mater.
e2204904
(published online
2022
).
11.
S.
Jung
et al, “
A crossbar array of magnetoresistive memory devices for in-memory computing
,”
Nature
601
(
7892
),
211
216
(
2022
).
12.
Z.
Sun
and
R.
Huang
, “
Time complexity of in-memory matrix-vector multiplication
,”
IEEE Trans. Circuits Syst. II
68
(
8
),
2785
2789
(
2021
).
13.
M.
Hu
et al, “
Memristor-based analog computation and neural network classification with a dot product engine
,”
Adv. Mater.
30
(
9
),
1705914
(
2018
).
14.
Z.
Sun
et al, “
One-step regression and classification with cross-point resistive memory arrays
,”
Sci. Adv.
6
(
5
),
eaay2378
(
2020
).
15.
Y.
Jeong
et al, “
K-means data clustering with memristor networks
,”
Nano Lett.
18
(
7
),
4447
4453
(
2018
).
16.
C.
Li
et al, “
Efficient and self-adaptive in-situ learning in multilayer memristor neural networks
,”
Nat. Commun.
9
(
1
),
2385
(
2018
).
17.
P.
Yao
et al, “
Fully hardware-implemented memristor convolutional neural network
,”
Nature
577
(
7792
),
641
646
(
2020
).
18.
F. M.
Bayat
et al, “
Implementation of multilayer perceptron network with highly uniform passive memristive crossbar circuits
,”
Nat. Commun.
9
(
1
),
2331
(
2018
).
19.
F.
Alibart
et al, “
Pattern classification by memristive crossbar circuits using ex situ and in situ training
,”
Nat. Commun.
4
,
2072
(
2013
).
20.
L.
Yang
et al, “
Self-selective memristor-enabled in-memory search for highly efficient data mining
,”
InfoMat
5
(
5
),
e12416
(
2023
).
21.
H.
Zhou
et al, “
Energy-efficient memristive Euclidean distance engine for brain-inspired competitive learning
,”
Adv. Intell. Syst.
3
(
11
),
2100114
(
2021
).
22.
C. E.
Graves
et al, “
In-memory computing with memristor content addressable memories for pattern matching
,”
Adv. Mater.
32
(
37
),
e2003437
(
2020
).
23.
J.-H.
Kim
et al, “
A 409.6 GOPS and 204.8 GFLOPS mixed-precision vector processor system for general-purpose machine learning acceleration
,” in
2022 IEEE Asian Solid-State Circuits Conference (A-SSCC)
(
IEEE
,
2022
), pp.
1
3
.
24.
S. H.
Jo
et al, “
Nanoscale memristor device as synapse in neuromorphic systems
,”
Nano Lett.
10
(
4
),
1297
1301
(
2010
).
25.
Y.
Li
et al, “
Ultrafast synaptic events in a chalcogenide memristor
,”
Sci. Rep.
3
,
1619
(
2013
).
26.
S.
Wen
et al, “
Associative learning of integrate-and-fire neurons with memristor-based synapses
,”
Neural Process. Lett.
38
(
1
),
69
80
(
2012
).
27.
M.
Ziegler
et al, “
An electronic version of Pavlov’s dog
,”
Adv. Funct. Mater.
22
(
13
),
2744
2749
(
2012
).
28.
S. G.
Hu
et al, “
Synaptic long-term potentiation realized in Pavlov’s dog model based on a NiOx-based memristor
,”
J. Appl. Phys.
116
(
21
),
214502
(
2014
).
29.
S.
Choi
et al, “
Experimental demonstration of feature extraction and dimensionality reduction using memristor networks
,”
Nano Lett.
17
(
5
),
3113
3118
(
2017
).
30.
M.
Prezioso
et al, “
Training and operation of an integrated neuromorphic network based on metal-oxide memristors
,”
Nature
521
(
7550
),
61
64
(
2015
).
31.
C.
Yakopcic
et al, “
Memristor based neuromorphic circuit for ex-situ training of multi-layer neural network algorithms
,” in
2015 International Joint Conference on Neural Networks (IJCNN)
(
IEEE
,
2015
).
32.
P.
Yao
et al, “
Face classification using electronic synapses
,”
Nat. Commun.
8
,
15199
(
2017
).
33.
F. M.
Bayat
et al, “
Memristor-based perceptron classifier: Increasing complexity and coping with imperfect hardware
,” in
2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)
(
IEEE
,
2017
).
34.
Z.
Wang
et al, “
Reinforcement learning with analogue memristor arrays
,”
Nat. Electron.
2
(
3
),
115
124
(
2019
).
35.
C.
Li
et al, “
Long short-term memory networks in memristor crossbar arrays
,”
Nat. Mach. Intell.
1
(
1
),
49
57
(
2019
).
36.
C.
Yakopcic
et al, “
Memristor crossbar deep network implementation based on a convolutional neural network
,” in
International Joint Conference on Neural Networks (IJCNN)
(
IEEE
,
2016
).
37.
S.
Wang
et al, “
Echo state graph neural networks with analogue random resistive memory arrays
,”
Nat. Mach. Intell.
5
(
2
),
104
113
(
2023
).
38.
M. R.
Mahmoodi
et al, “
Versatile stochastic dot product circuits based on nonvolatile memories for high performance neurocomputing and neurooptimization
,”
Nat. Commun.
10
(
1
),
5113
(
2019
).
39.
M. R.
Mahmoodi
et al, “
An analog neuro-optimizer with adaptable annealing based on 64×64 0T1R crossbar circuit
,” in
2019 IEEE International Electron Devices Meeting (IEDM)
(
IEEE
,
2019
), pp.
14.17.11
14.17.14
.
40.
F.
Cai
et al, “
Power-efficient combinatorial optimization using intrinsic noise in memristor Hopfield neural networks
,”
Nat. Electron.
3
(
7
),
409
418
(
2020
).
41.
K.
Yang
et al, “
Transiently chaotic simulated annealing based on intrinsic nonlinearity of memristors for efficient solution of optimization problems
,”
Sci. Adv.
6
(
33
),
eaba9901
(
2020
).
42.
L.
Yan
et al, “
Graph neural network based on RRAM array
,” in
2022 6th IEEE Electron Devices Technology & Manufacturing Conference (EDTM)
(
IEEE
,
2022
), pp.
403
405
.
43.
C.
Yang
et al, “
Full-circuit implementation of transformer network based on memristor
,”
IEEE Trans. Circuits Syst. I
69
(
4
),
1395
1407
(
2022
).
44.
X.
Yang
et al, “
ReTransformer: ReRAM-based processing-in-memory architecture for transformer acceleration
,” in
Proceedings of the 39th International Conference on Computer-Aided Design
(
IEEE
,
2020
), pp.
1
9
.
45.
H.
Li
et al, “
Hyperdimensional computing with 3D VRRAM in-memory kernels: Device-architecture co-design for energy-efficient, error-resilient language recognition
,” in
2016 IEEE International Electron Devices Meeting (IEDM)
(
IEEE
,
2016
), pp.
16.11.11
16.11.14
.
46.
R.
Wang
et al, “
Implementing in-situ self-organizing maps with memristor crossbar arrays for data mining and optimization
,”
Nat. Commun.
13
(
1
),
2289
(
2022
).
47.
G.
Karunaratne
et al, “
Robust high-dimensional memory-augmented neural networks
,”
Nat. Commun.
12
(
1
),
2468
(
2021
).
48.
H.
Zhou
et al, “
Memristive cosine-similarity-based few-shot learning with lifelong memory adaptation
,”
Adv. Intell. Syst.
5
(
2
),
2200173
(
2023
).
49.
H.
Zhou
et al, “
Low-time-complexity document clustering using memristive dot product engine
,”
Sci. China Inf. Sci.
65
(
2
),
122410
(
2022
).
50.
C.
Li
et al, “
Analog content-addressable memories with memristors
,”
Nat. Commun.
11
(
1
),
1638
(
2020
).
51.
G.
Pedretti
et al, “
Tree-based machine learning performed in-memory with memristive analog CAM
,”
Nat. Commun.
12
(
1
),
5806
(
2021
).
52.
Y.
Li
et al, “
Monolithic 3D integration of logic, memory and computing-in-memory for one-shot learning
,” in
2021 IEEE International Electron Devices Meeting (IEDM)
(
IEEE
,
2021
), pp.
21.25.21
21.25.24
.
53.
R.
Mao
et al, “
Experimentally validated memristive memory augmented neural network with efficient hashing and similarity search
,”
Nat. Commun.
13
(
1
),
6284
(
2022
).
54.
L.
Yang
et al, “
In-memory search with phase change device-based ternary content addressable memory
,”
IEEE Electron Device Lett.
43
(
7
),
1053
1056
(
2022
).
55.
Y.
Yu
et al, “
In-memory search for highly efficient image retrieval
,”
Adv. Intell. Syst.
5
(
3
),
2200268
(
2023
).
56.
A.
Graves
et al, “
Hybrid computing using a neural network with dynamic external memory
,”
Nature
538
(
7626
),
471
476
(
2016
).
57.
K.
Ni
et al, “
Ferroelectric ternary content-addressable memory for one-shot learning
,”
Nat. Electron.
2
(
11
),
521
529
(
2019
).
58.
M.-L.
Wei
et al, “
Analog computing in memory (CIM) technique for general matrix multiplication (GEMM) to support deep neural network (DNN) and cosine similarity search computing using 3D AND-type NOR flash devices
,” in
2022 International Electron Devices Meeting (IEDM)
(
IEEE
,
2022
), pp.
33.33.31
33.33.34
.
59.
P. M.
Sheridan
et al, “
Sparse coding with memristor networks
,”
Nat. Nanotechnol.
12
(
8
),
784
789
(
2017
).
60.
F.
Cai
et al, “
A fully integrated reprogrammable memristor–CMOS system for efficient multiply–accumulate operations
,”
Nat. Electron.
2
(
7
),
290
299
(
2019
).
61.
W.
Wan
et al, “
A compute-in-memory chip based on resistive random-access memory
,”
Nature
608
(
7923
),
504
512
(
2022
).
62.
Y.
Li
et al, “
Memristive field-programmable analog arrays for analog computing
,”
Adv. Mater.
35
,
e2206648
(
2022
).
63.
S.
Ambrogio
et al, “
An analog-AI chip for energy-efficient speech recognition and transcription
,”
Nature
620
(
7975
),
768
775
(
2023
).
64.
M.
Le Gallo
et al, “
A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference
,”
Nat. Electron.
6
,
680
693
(
2023
).
65.
L.
Cheng
et al, “
Functional demonstration of a memristive arithmetic logic unit (MemALU) for in-memory computing
,”
Adv. Funct. Mater.
29
(
49
),
1905660
(
2019
).
66.
P.
Mannocci
et al, “
In-memory computing with emerging memory devices: Status and outlook
,”
APL Mach. Learn.
1
(
1
),
010902
(
2023
).
67.
X.-D.
Huang
et al, “
Forming-free, fast, uniform, and high endurance resistive switching from cryogenic to high temperatures in W/AlOx/Al2O3/Pt bilayer memristor
,”
IEEE Electron Device Lett.
41
(
4
),
549
552
(
2020
).
68.
D.
Ielmini
and
H. S. P.
Wong
, “
In-memory computing with resistive switching devices
,”
Nat. Electron.
1
(
6
),
333
343
(
2018
).
69.
C.
Wang
et al, “
Scalable massively parallel computing using continuous-time data representation in nanoscale crossbar array
,”
Nat. Nanotechnol.
16
(
10
),
1079
1085
(
2021
).
70.
C.
Wang
et al, “
Parallel in-memory wireless computing
,”
Nat. Electron.
6
(
5
),
381
389
(
2023
).
71.
S.
Oh
et al, “
Energy-efficient Mott activation neuron for full-hardware implementation of neural networks
,”
Nat. Nanotechnol.
16
(
6
),
680
687
(
2021
).
72.
J.
Lee
, et al, “
A 13.7 TFLOPS/W floating-point DNN processor using heterogeneous computing architecture with exponent-computing-in-memory
,” in
2021 Symposium on VLSI Circuits
(
IEEE
,
2021
), pp.
1
2
.
73.
A.
Vaswani
et al, paper presented at the
Advances in Neural Information Processing Systems
,
2017
.
74.
S.
Liu
and
W.
Deng
, paper presented at the
2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR)
,
Kuala Lumpur, Malaysia
,
2015
.
75.
K.
He
et al, paper presented at the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
,
Las Vegas, NV
,
2016
.
76.
G.
Huang
et al, “
Densely connected convolutional networks
,” in
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(
IEEE Computer Society
,
2017
), pp.
2261
2269
.
77.
A.
Krizhevsky
et al,
Commun. ACM
60
,
84
90
(
2017
).
78.
J.
Devlin
et al, “
BERT: Pre-training of deep bidirectional transformers
,” in
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)
(
Association forComputational Linguistics
,
2019
), Vol. 1, pp.
4171
4186
.
79. A. Radford et al., “Language models are unsupervised multitask learners,” OpenAI Blog 1(8), 9 (2019).
80. P. Covington et al., “Deep neural networks for YouTube recommendations,” in Proceedings of the 10th ACM Conference on Recommender Systems (Association for Computing Machinery, 2016), pp. 191–198.
81. J. Wang et al., “Billion-scale commodity embedding for E-commerce recommendation in Alibaba,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Association for Computing Machinery, 2018), pp. 839–848.
82. F. Tu et al., “TranCIM: Full-digital bitline-transpose CIM-based sparse transformer accelerator with pipeline/parallel reconfigurable modes,” IEEE J. Solid-State Circuits 58(6), 1798–1809 (2023).
83. Q. Zheng et al., “Lattice: An ADC/DAC-less ReRAM-based processing-in-memory architecture for accelerating deep convolution neural networks,” in 2020 57th ACM/IEEE Design Automation Conference (DAC) (IEEE, 2020), pp. 1–6.
84. U. Saxena et al., “Towards ADC-less compute-in-memory accelerators for energy efficient deep learning,” in 2022 Design, Automation and Test in Europe Conference and Exhibition (DATE) (IEEE, 2022), pp. 624–627.
85. H. Bao et al., “Quantization and sparsity-aware processing for energy-efficient NVM-based convolutional neural networks,” Front. Electron. 3, 954661 (2022).
86. G. Yuan et al., “An ultra-efficient memristor-based DNN framework with structured weight pruning and quantization using ADMM,” in 2019 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED) (IEEE, 2019), pp. 1–6.
87. X. Ma et al., “Tiny but accurate: A pruned, quantized and optimized memristor crossbar framework for ultra efficient DNN implementation,” in 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC) (IEEE, 2020), pp. 301–306.
88. F. Tu et al., “ReDCIM: Reconfigurable digital computing-in-memory processor with unified FP/INT pipeline for cloud AI acceleration,” IEEE J. Solid-State Circuits 58(1), 243–255 (2023).
89. L. Lu et al., “Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators,” Nat. Mach. Intell. 3(3), 218–229 (2021).
90. Z. Zhu et al., “A configurable multi-precision CNN computing framework based on single bit RRAM,” in Proceedings of the 56th Annual Design Automation Conference 2019 (IEEE, 2019), pp. 1–6.
91. Q. Yang and H. Li, “BitSystolic: A 26.7 TOPS/W 2b∼8b NPU with configurable data flows for edge devices,” IEEE Trans. Circuits Syst. I 68(3), 1134–1145 (2021).
92. W.-H. Chen et al., “A 65nm 1Mb nonvolatile computing-in-memory ReRAM macro with sub-16ns multiply-and-accumulate for binary DNN AI edge processors,” in 2018 IEEE International Solid-State Circuits Conference (ISSCC) (IEEE, 2018), pp. 494–496.
93. Q. Liu et al., “33.2 A fully integrated analog ReRAM based 78.4TOPS/W compute-in-memory chip with fully parallel MAC computing,” in 2020 IEEE International Solid-State Circuits Conference (ISSCC) (IEEE, 2020), pp. 500–502.
94. C.-X. Xue et al., “24.1 A 1Mb multibit ReRAM computing-in-memory macro with 14.6ns parallel MAC computing time for CNN based AI edge processors,” in 2019 IEEE International Solid-State Circuits Conference (ISSCC) (IEEE, 2019), pp. 388–390.
95. C.-X. Xue et al., “15.4 A 22nm 2Mb ReRAM compute-in-memory macro with 121-28TOPS/W for multibit MAC computing for tiny AI edge devices,” in 2020 IEEE International Solid-State Circuits Conference (ISSCC) (IEEE, 2020), pp. 244–246.
96. W.-H. Chen et al., “CMOS-integrated memristive non-volatile computing-in-memory for AI edge processors,” Nat. Electron. 2(9), 420–428 (2019).
97. C.-X. Xue et al., “16.1 A 22nm 4Mb 8b-precision ReRAM computing-in-memory macro with 11.91 to 195.7TOPS/W for tiny AI edge devices,” in 2021 IEEE International Solid-State Circuits Conference (ISSCC) (IEEE, 2021), pp. 245–247.
98. M. Chang et al., “A 40nm 60.64TOPS/W ECC-capable compute-in-memory/digital 2.25MB/768KB RRAM/SRAM system with embedded Cortex-M3 microprocessor for edge recommendation systems,” in 2022 IEEE International Solid-State Circuits Conference (ISSCC) (IEEE, 2022), pp. 1–3.
99. W.-S. Khwa et al., “A 40-nm, 2M-cell, 8b-precision, hybrid SLC-MLC PCM computing-in-memory macro with 20.5-65.0TOPS/W for tiny-AI edge devices,” in 2022 IEEE International Solid-State Circuits Conference (ISSCC) (IEEE, 2022), pp. 1–3.
100. J.-H. Yoon et al., “A 40-nm, 64-kb, 56.67 TOPS/W voltage-sensing computing-in-memory/digital RRAM macro supporting iterative write with verification and online read-disturb detection,” IEEE J. Solid-State Circuits 57(1), 68–79 (2022).
101. J. Yue et al., “STICKER-IM: A 65 nm computing-in-memory NN processor using block-wise sparsity optimization and inter/intra-macro data reuse,” IEEE J. Solid-State Circuits 57(8), 2560–2573 (2022).
102. K. Zhou et al., “A 28 nm 81 Kb 59–95.3 TOPS/W 4T2R ReRAM computing-in-memory accelerator with voltage-to-time-to-digital based output,” IEEE J. Emerging Sel. Top. Circuits Syst. 12(4), 846–857 (2022).
103. W.-H. Huang et al., “A nonvolatile AI-edge processor with 4MB SLC-MLC hybrid-mode ReRAM compute-in-memory macro and 51.4-251TOPS/W,” in 2023 IEEE International Solid-State Circuits Conference (ISSCC) (IEEE, 2023), pp. 15–17.
104. J.-M. Hung et al., “8-b precision 8-Mb ReRAM compute-in-memory macro using direct-current-free time-domain readout scheme for AI edge devices,” IEEE J. Solid-State Circuits 58(1), 303–315 (2023).
105. M. Le Gallo et al., “Mixed-precision in-memory computing,” Nat. Electron. 1(4), 246–253 (2018).
106. F. Cai et al., “A fully integrated system-on-chip design with scalable resistive random-access memory tile design for analog in-memory computing,” Adv. Intell. Syst. 4(8), 2200014 (2022).
107. Z. Guo et al., “Algorithm/hardware co-design configurable SAR ADC with low power for computing-in-memory in 28nm CMOS,” in 2021 IEEE 14th International Conference on ASIC (ASICON) (IEEE, 2021), pp. 1–4.
108. M. Rao et al., “Thousands of conductance levels in memristors integrated on CMOS,” Nature 615(7954), 823–829 (2023).
109. M. A. Zidan et al., “A general memristor-based partial differential equation solver,” Nat. Electron. 1(7), 411–420 (2018).
110. C. Li et al., “Analogue signal and image processing with large memristor crossbars,” Nat. Electron. 1(1), 52–59 (2017).
111. M. Imani et al., “FloatPIM: In-memory acceleration of deep neural network training with high precision,” in Proceedings of the 46th International Symposium on Computer Architecture (IEEE, 2019), pp. 802–815.
112. S. S. Ensan and S. Ghosh, “FPCAS: In-memory floating point computations for autonomous systems,” in 2019 International Joint Conference on Neural Networks (IJCNN) (IEEE, 2019), pp. 1–8.
113. J.-H. Ryu et al., “Filamentary and interface switching of CMOS-compatible Ta2O5 memristor for non-volatile memory and synaptic devices,” Appl. Surf. Sci. 529, 147167 (2020).
114. Y. Li et al., “Review of memristor devices in neuromorphic computing: Materials sciences and device challenges,” J. Phys. D: Appl. Phys. 51(50), 503002 (2018).
115. M. Lanza et al., “The gap between academia and industry in resistive switching research,” Nat. Electron. 6(4), 260–263 (2023).
116. C.-W. S. Yeh and S. S. Wong, “Compact one-transistor-N-RRAM array architecture for advanced CMOS technology,” IEEE J. Solid-State Circuits 50(5), 1299–1309 (2015).
117. Y. Luo et al., “Modeling and mitigating the interconnect resistance issue in analog RRAM matrix computing circuits,” IEEE Trans. Circuits Syst. I 69(11), 4367–4380 (2022).
118. S.-G. Ren et al., “Pt/Al2O3/TaOx/Ta self-rectifying memristor with record-low operation current (<2 pA), low power (fJ), and high scalability,” IEEE Trans. Electron Devices 69(2), 838–842 (2022).
119. Q. Luo et al., “8-layers 3D vertical RRAM with excellent scalability towards storage class memory applications,” in 2017 IEEE International Electron Devices Meeting (IEDM) (IEEE, 2017), pp. 2.7.1–2.7.4.
120. S. Qin et al., “8-layer 3D vertical Ru/AlOxNy/TiN RRAM with mega-Ω level LRS for low power and ultrahigh-density memory,” in 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits) (IEEE, 2022), pp. 314–315.
121. R. Mahajan et al., “Embedded multi-die interconnect bridge (EMIB)—A high density, high bandwidth packaging interconnect,” in 2016 IEEE 66th Electronic Components and Technology Conference (ECTC) (IEEE, 2016), pp. 557–565.
122. S. Naffziger et al., “2.2 AMD chiplet architecture for high-performance server and desktop products,” in 2020 IEEE International Solid-State Circuits Conference (ISSCC) (IEEE, 2020), pp. 44–45.
123. C. Choi et al., “Reconfigurable heterogeneous integration using stackable chips with embedded artificial intelligence,” Nat. Electron. 5(6), 386–393 (2022).
124. Z. He et al., “Noise injection adaption: End-to-end ReRAM crossbar non-ideal effect adaption for neural network mapping,” in Proceedings of the 56th Annual Design Automation Conference 2019 (IEEE, 2019), pp. 1–6.
125. Y. Geng et al., “An on-chip layer-wise training method for RRAM based computing-in-memory chips,” in 2021 Design, Automation and Test in Europe Conference and Exhibition (DATE) (IEEE, 2021), pp. 248–251.
126. Z. Yuan et al., “NAS4RRAM: Neural network architecture search for inference on RRAM-based accelerators,” Sci. China Inf. Sci. 64(6), 160407 (2021).
127. X. Peng et al., “Optimizing weight mapping and data flow for convolutional neural networks on RRAM based processing-in-memory architecture,” in 2019 IEEE International Symposium on Circuits and Systems (ISCAS) (IEEE, 2019), pp. 1–5.
128. P. Chi et al., “PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA) (IEEE, 2016), pp. 27–39.
129. X. Peng et al., “DNN+NeuroSim V2.0: An end-to-end benchmarking framework for compute-in-memory accelerators for on-chip training,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 40(11), 2306–2319 (2021).
130. A. Siemieniuk et al., “OCC: An automated end-to-end machine learning optimizing compiler for computing-in-memory,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 41(6), 1674–1686 (2022).
131. H. Liu et al., “A simulation framework for memristor-based heterogeneous computing architectures,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 41(12), 5476–5488 (2022).