In-memory computing (IMC) using emerging nonvolatile devices has received considerable attention due to its great potential for accelerating artificial neural networks and machine learning tasks. As the basic concept and operation modes of IMC are now well established, there is growing interest in extending it to wide and general applications. In this perspective, the path that leads memristive IMC to general-purpose machine learning is discussed in detail. First, we review the development timeline of machine learning algorithms that employ memristive devices, such as resistive random-access memory and phase-change memory. Then we summarize two typical aspects of realizing IMC-based general-purpose machine learning. One involves a heterogeneous computing system for algorithmic completeness. The other concerns configurable precision techniques that negotiate the precision-efficiency dilemma. Finally, the major directions and challenges of memristive IMC-based general-purpose machine learning are proposed from a cross-level design perspective.
I. BACKGROUND
Large language models (LLMs), such as ChatGPT and ERNIE Bot, have recently attracted widespread attention because of their superior ability in artificial intelligence-generated content (AIGC). Machine learning (ML) advancements over the past few decades have greatly benefited the development of LLMs and significantly contributed to applications such as image processing, autopilot, and recommendation systems [Fig. 1(a)]. Initially, ML relied on traditional mathematical and probabilistic models, such as regression, clustering, and Bayesian theory, to identify patterns in data and apply them to real-world applications. With the explosive growth of data in the Internet era, deep learning algorithms have become increasingly popular in recent years. These algorithms allow for the creation of more complex models that can process vast amounts of data and make accurate predictions. Various ML algorithms complement each other and have demonstrated excellent performance in practice. Despite their success, ML algorithms face a significant obstacle in current computer architecture known as the von Neumann bottleneck,1,2 which hampers their utility due to computational power, resource consumption, and delay constraints.3 In-memory computing (IMC) offers a revolutionary means of improving computing performance for ML tasks.
By integrating computing functions into memory, IMC architecture minimizes the need for time-consuming massive data movement in and out of memory. A range of memory technologies, including commercial memories such as static random-access memory (SRAM), dynamic random-access memory (DRAM), and flash memory,4–7 as well as emerging nonvolatile memory (NVM), such as resistive random-access memory (RRAM),8 phase change memory (PCM),9 ferroelectric RAM (FeRAM),11 and magnetic RAM (MRAM),12 have been applied as the basic components of IMC hardware systems. Emerging memories are particularly promising for IMC implementation compared with current commercial ones, as they offer a compromise among speed, power consumption, and storage density based on their native device properties. These resistive switching devices, also called memristors, store real values in their resistance states rather than as charge, as conventional CMOS-based memories do. This is the fundamental principle behind the implementation of memristive IMC. Moreover, memristive IMC benefits greatly from the crossbar architecture [Fig. 1(b)]. A crossbar structure has the innate ability to store a weight matrix directly in the physical resistance states at every cross node. The crossbar-based IMC realizes one-step vector–matrix multiplication (VMM) with approximately constant time complexity,13,14 making memristive IMC hardware-friendly as a VMM-intensive accelerator. The computation is a three-step process. First, a digital-to-analog converter (DAC) encodes the input vectors in the row direction. Then, the crossbar performs the VMM based on Ohm’s and Kirchhoff’s laws. Third, an analog-to-digital converter (ADC) converts the accumulated analog results into digital signals, producing the corresponding VMM outputs.
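To make this three-step flow concrete, the following minimal NumPy sketch emulates an idealized crossbar VMM; the conductance range, read voltage, and full-scale ADC reference are illustrative assumptions rather than measured hardware parameters.

```python
import numpy as np

def crossbar_vmm(x, G, dac_bits=4, adc_bits=8, v_read=0.2):
    """Idealized crossbar VMM: quantize inputs (DAC), accumulate currents
    per Ohm's and Kirchhoff's laws, then quantize the outputs (ADC)."""
    # Step 1 (DAC): quantize the input vector and scale to read voltages
    levels = 2 ** dac_bits - 1
    x_q = np.round(np.clip(x, 0, 1) * levels) / levels
    v = x_q * v_read                      # row (word-line) voltages

    # Step 2 (crossbar): each column current is the sum of v_i * g_ij
    i_col = v @ G                         # shape: (columns,)

    # Step 3 (ADC): quantize the accumulated currents into digital codes
    i_max = v_read * G.sum(axis=0).max()  # illustrative full-scale reference
    return np.round(i_col / i_max * (2 ** adc_bits - 1))

# Example: a 4x3 conductance matrix (siemens) and a normalized input vector
G = np.random.uniform(1e-6, 1e-4, size=(4, 3))
x = np.random.rand(4)
print(crossbar_vmm(x, G))
```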
Nowadays, memristive IMC-based general VMM accelerators are increasingly used in various ML algorithms, especially deep learning,14–19 where the NVM arrays are typically regarded as a dot-product engine. Generally, memristive IMC circuits are customized for specific applications and limited to fixed scenarios. For example, a crossbar array may accelerate the fully connected layers in feed-forward networks20 or the convolution operations in convolutional neural networks (CNNs).18 Special functions such as diverse similarity calculations can also be implemented on the arrays with their parallel VMM capability.16,20–23 Based on those scenarios, VMM operations using NVM crossbars can be regarded as a general and extensive process in ML models that constructs a generic acceleration operator regardless of the device type. Hence, in this perspective, memristive IMC-based general-purpose ML (GPML) is proposed, and we discuss how it can be applied to a general computing platform with broader application fields.24 Unlike highly customized memristive IMC circuits, memristive IMC-based GPML is described here as a cross-level implementation from application down to the hardware basis [Fig. 1(c)]. A framework that copes with diverse ML algorithms in a VMM-based manner is desired to give memristive IMC circuits a favorable cost-performance ratio, considering the trade-offs among design cost, system performance, and application compatibility. The memristive IMC platform is expected to be the designated coprocessor for ML acceleration, contributing to solving the von Neumann bottleneck in large ML models. However, two inherent issues must be addressed when extending memristive IMC to GPML. The first is algorithmic completeness, which arises from the dilemma between the functionally constrained crossbar topology and the multiple computing operators used in ML algorithms; the crossbar topology enables only simple and efficient VMM with little flexibility. The second involves the limits of precision reconfigurability, given the precision-efficiency dilemma. Precision reconfigurability refers to the ability of an IMC system to execute various ML algorithms at the accuracy and precision they require while achieving high efficiency. The computing precision directly influences the performance of the system.24
In this perspective, we focus on the recent development of two typical memristors, RRAM and PCM, for ML applications and summarize the solutions to the two conflicts that arise when building memristive IMC-based GPML. The remainder of the paper is organized as follows: First, we review the timeline of IMC-based ML applications and explain IMC-accelerated models, including artificial neural networks (ANNs) and similarity searches used in typical ML algorithms. Then, we identify the paths that may resolve the two inherent conflicts of IMC. One involves the development of an algorithmically complete heterogeneous computing system. The other deals with precision reconfigurability for the precision-efficiency dilemma. Finally, key challenges and opportunities are identified for memristive GPML. We hope this perspective provides considerable sparks to bring IMC into wider application fields.
II. HISTORY OF THE IMC-BASED ML IMPLEMENTATIONS
In the early years, memristive devices, e.g., RRAM, were widely used to emulate synaptic plasticity.25,26 Many studies utilized RRAM to realize typical associative learning applications, such as Pavlov’s dog experiment.26–29 Using memristive devices to physically model ANN synapses enabled their application in broader fields.20 Since then, using the analog properties of memristive arrays to build efficient VMM processes has become common practice for IMC-based acceleration. Applications have gradually extended from basic ANNs to the entire ML domain. Among these applications, memristive neural networks and similarity calculation methods occupy the two mainstream directions. Besides, algorithms including logistic regression15 and principal component analysis (PCA)30 also take full advantage of VMM acceleration in the NVM array. Figure 2 depicts the timeline of typical ML applications, including ANNs and other representative ML methods. An important trend is that the models realized in IMC systems are gradually becoming more complex and systematic, covering most research areas in ML. After more than a decade of development, memristive IMC shows the potential to partly replace general computing systems, such as central processing units (CPUs), graphics processing units (GPUs), and field-programmable gate arrays (FPGAs), in the field of ML. This brings memristive IMC into GPML.
A. Memristive neural networks
In 2013, a single-layer perceptron was experimentally demonstrated for a pattern classification task on a memristor crossbar.20 Subsequently, there has been a surge of interest in IMC-based ANN accelerators, which are gradually narrowing the gap with contemporary neural networks.9,19,30–33 In the following years, in situ training of fully connected layers drew much attention by exploiting the synaptic-like properties of memristive devices.9,17,19,31 Memristor-based systems have since facilitated the implementation of sophisticated applications, such as facial recognition33 and deeper ANNs like the multilayer perceptron (MLP).17,19,34 A key point that must be mentioned is the huge cost of array programming, which forces many accelerators to focus only on the inference process; exploring efficient online training methods on crossbar arrays has therefore become much more important. In 2018, a chip-level online training MLP scheme was experimentally demonstrated on PCM, where the PCM is employed for the long-term storage of synaptic weights and capacitors integrated in the array are used for online training.9 In 2019, complex deep Q-learning35 and long short-term memory (LSTM) neural networks36 were achieved using analog–digital computing systems. During that period, many simulation-based studies implementing convolutional neural networks (CNNs) on memristors for image processing were proposed.37 In 2020, a fully hardware-implemented CNN18 was demonstrated for the first time using memristors [Fig. 3(a)]. It achieved software-comparable classification accuracy and nearly a 100-fold improvement in energy efficiency compared with the Nvidia V100 platform. Furthermore, exploiting the randomness of resistive arrays, stochastic neural networks, including the Boltzmann machine and the Hopfield network, have also achieved notable results on optimization problems.37–42 Currently popular models, such as graph neural networks and transformers, are further potential IMC applications for which tentative works exist.38,42–45
B. Memristive similarity calculation
In addition to synapse-like weighted connections, similarity search is another crucial operation in ML. Data features of samples are stored in the crossbar to accelerate the similarity calculation. In 2016, Hamming distance calculating layers were built using a 3D memristor array, which adopted the acceleration of VMM when realizing language recognition tasks.46 Similarly, in 2018, Euclidean distance calculation, which involves nonlinear quadratic operations, was performed in crossbar architectures by mapping specially designed bias terms that store the squared terms.16 A similar approach has been adopted in later research as well.47 Additionally, cosine similarities for both binarized and analog vectors have been explored in memory-augmented neural networks and data clustering.47–50 The above schemes use one resistive device to store one vector element, as ANN-based accelerators do.
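The bias-term trick in the Euclidean scheme above follows from expanding the squared distance, ||x - w_j||^2 = ||w_j||^2 - 2 w_j · x + ||x||^2, where the last term depends only on the query and is identical for all stored vectors. A minimal NumPy sketch under idealized assumptions follows; real hardware would realize the negative entries with differential device pairs.

```python
import numpy as np

def euclidean_search(x, W):
    """Distance ranking via one VMM, following the expansion
    ||x - w_j||^2 = ||w_j||^2 - 2 w_j . x + ||x||^2.
    The ||x||^2 term is shared by all columns, so it can be dropped
    when only the ranking matters."""
    # Columns of the (idealized) crossbar hold -2 * w_j; one extra
    # bias row driven with input "1" holds the squared terms ||w_j||^2.
    G_main = -2.0 * W                    # shape: (dim, n_samples)
    G_bias = np.sum(W ** 2, axis=0)      # shape: (n_samples,)

    scores = x @ G_main + G_bias         # one VMM pass per query
    return np.argmin(scores)             # nearest stored sample

W = np.random.rand(8, 5)   # 5 stored feature vectors of dimension 8
x = np.random.rand(8)
print("nearest sample:", euclidean_search(x, W))
```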
Still, schemes that use multiple nonvolatile devices to build specialized content-addressable memory (CAM) for efficient data searching are also attractive. Early CAM schemes designed specialized cells (e.g., 4-transistor 2-memristor, 4T2R, or 6T2R) for both binarized and real-valued data matching in IMC systems.23,51,52 These schemes are used for tree-based ML tasks and pattern matching. Later researchers21,52–55 proposed more compact CAM designs using a two-device structure. In these structures, one CAM cell consists of two devices (2T2R53,54 or 2R21,55) and stores one state of data. Figure 3(b) shows the 2R CAM scheme.21,55 By applying designed opposite voltage signals to the two devices, the CAM cell indicates a state match or mismatch (see the behavioral sketch at the end of this subsection). The compact CAM cells are more consistent with the crossbar cells used in ANN accelerators in terms of input and storage configurations, and they require no specialized array fabrication. IMC-based similarity calculation has been applied to various applications, including self-organizing maps,47 competitive learning,22 and image retrieval,56 and it shows bright prospects on the data center side. The diversity of ML-based applications expands the scenarios of IMC and ultimately leads to innovative changes in computing systems.
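The following behavioral sketch models one 2R CAM row under simplifying assumptions (two illustrative conductance states, binary search voltages, and a hand-picked current threshold); it only illustrates the complementary match/mismatch mechanism, not any specific reported cell.

```python
import numpy as np

HRS, LRS = 1e-7, 1e-4   # illustrative off/on conductances (siemens)

def cam_match(stored_bits, query_bits, threshold=1e-5):
    """Behavioral sketch of a 2R CAM row: each bit is a complementary
    device pair; complementary search voltages make a matching cell
    draw little current and a mismatching cell draw a large current."""
    match_line_current = 0.0
    for s, q in zip(stored_bits, query_bits):
        g_pair = (LRS, HRS) if s else (HRS, LRS)   # complementary storage
        v_pair = (0.0, 1.0) if q else (1.0, 0.0)   # complementary search voltages
        match_line_current += g_pair[0] * v_pair[0] + g_pair[1] * v_pair[1]
    return match_line_current < threshold          # low current => match

stored = [1, 0, 1, 1]
print(cam_match(stored, [1, 0, 1, 1]))  # True  (all bits match)
print(cam_match(stored, [1, 1, 1, 1]))  # False (one bit mismatches)
```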
C. Toward the general IMC system
Cross-domain research, represented by few-shot learning, also focuses on combining typical ML methods with deep learning models.54,56–59 Recently, a few-shot learning task was experimentally validated in which CNN-based image embedding, data hashing, and similarity matching were realized simultaneously on memristive arrays.54 However, the key issue is that such a highly customized system runs different functions on separate, purpose-built arrays rather than offering automatic reconfigurability; fixing the functional role of each part of the IMC system makes it difficult to realize different applications automatically on one large array. In this regard, pursuing GPML on IMC accelerators has become a meaningful goal, providing the possibility of a general platform for different ML methods. Many attempts have been made toward this goal. In 2019, a programmable CMOS-memristor computing system was proposed in which algorithms including PCA, sparse coding, and the single-layer perceptron could be implemented.30,60,61 In 2022, a compute-in-memory chip was realized that supports diverse deep-learning architectures and applications [Fig. 3(c)].62 Meanwhile, memristor-based field-programmable analog arrays were demonstrated that provided reconfigurable VMM and other analog units.63 In 2023, analog IMC chips based on PCM were demonstrated.64,65 Digital processing units were embedded in these chips via on-chip links, and applications including ResNet and long short-term memory networks were realized experimentally. These demonstrations implement general matrix multiplications using crossbar arrays and support a variety of applications, showing the potential of memristive IMC for GPML.
III. IMC FOR GENERAL-PURPOSE MACHINE LEARNING
Significant issues for memristive IMC-based GPML involve resolving two inherent conflicts: algorithmic completeness and precision reconfigurability. Although memristive memories have proven their potential for logical operations, an all-in-one logical machine for storage and computing is still in the proof-of-concept stage.66 The main shortcomings are processing speed,67,68 limited device endurance,67 and logic cascading.69 Currently, the collaboration between digital computers and IMC systems to construct a heterogeneous computing system is a good choice for realizing algorithmic completeness. Although a heterogeneous system provides the basis for different models, it still cannot run the models with adaptive efficiency under the precision-efficiency dilemma. Precision reconfigurability can ensure that the system works with appropriate efficiency over a variety of ML scenarios.
A. Heterogeneous computing system for algorithmic completeness
Taking advantage of analog computing in NVM arrays, VMM-based acceleration of ML applications has greatly unleashed its potential. Encoding the data in the frequency domain has even realized matrix–matrix multiplication.70,71 In this trend, custom-designed analog circuits, such as the closed-loop array architecture,15,42 the ReLU function,72 and analog signal comparison,55 have been developed to process more data in the analog domain in pursuit of higher efficiency. But the high efficiency of custom circuits largely sacrifices functional flexibility. Table I summarizes the proportions of typical operators in popular ML algorithms. Although VMM-involved operations (convolution, linear) are the largest category, the diverse remaining operations, such as pooling, activation, and batch normalization, account for a large part of these algorithms and greatly impact their effectiveness. It is impossible to implement all of these functions in the analog domain; moreover, their key bottleneck is not the amount of computation but the control complexity, which is exactly what digital systems specialize in. Thus, on the way to GPML, the IMC system should strike a balance between the analog and digital computing parts, adopting the philosophy of the heterogeneous system.73
TABLE I. Proportions of typical operators in popular ML models.

Image model:
Model | Convolution | Linear | Batch norm | ReLU | Max pool | Average pool | ⋯ | Total params
AlexNet | 0.66G (91.6%) | 58.62M | ⋯ | 0.49M | 0.38M | 9.22K | ⋯ | 61.10M
VGG16 | 15.36G (99.1%) | 123.63M | ⋯ | 13.56M | 6.12M | 25.09K | ⋯ | 138.36M
ResNet50 | 4.09G (99.2%) | 2.05M | 21.83M | 6.32M | 0.80M | 0.10M | ⋯ | 25.56M
DenseNet121 | 2.86G (98.2%) | 1.02M | 31.3M | 15.62M | 0.80M | 0.70M | ⋯ | 7.98M

Sequence model:
Model | Convolution | Linear | Layer norm | Softmax | Mat-mul | GELU | Tanh | Total params
BERT-base | ⋯ | 21.74G (97.23%) | 12.29M | 2.36M | 603.98M | 0.59M | 6.14K | 109.48M
GPT-2 | 77.31G (83.91%) | ⋯ | 32.11M | 6.29M | 1.61G | ⋯ | ⋯ | 354.82M
The image models are estimated with an input image size of (3,224,224).
Sequence models are estimated with an input sequence length of 128.
Mat-mul denotes the matrix multiplication operations in the attention mechanism.
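The numbers in Table I mix per-operator operation counts with total parameter counts. As a quick way to reproduce the parameter side of such breakdowns, the sketch below buckets a torchvision model's parameters by operator type; counting operations per operator would follow the same traversal with per-layer shape arithmetic.

```python
from collections import Counter
import torchvision.models as models

def param_breakdown(model):
    """Bucket a model's parameters by operator type to see how much of
    it is VMM-friendly (Conv2d/Linear) versus everything else."""
    counts = Counter()
    for m in model.modules():
        if len(list(m.children())) == 0:            # leaf modules only
            n = sum(p.numel() for p in m.parameters())
            counts[type(m).__name__] += n
    return counts

counts = param_breakdown(models.resnet50(weights=None))
vmm = counts["Conv2d"] + counts["Linear"]
total = sum(counts.values())
print(f"VMM-involved params: {vmm/1e6:.2f}M of {total/1e6:.2f}M "
      f"({100*vmm/total:.1f}%)")   # total matches Table I's 25.56M
```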
Figure 4 shows a heterogeneous system used to build a general computing platform. In typical ML applications, such as image encoding and similarity calculation for recommendation80,81 in Fig. 4(a), data streams are processed in two directions according to their processing complexity. One stream comprises the non-VMM operations, including scalar- and vector-based operations and logic operations, all of which are performed by conventional digital systems35 [Fig. 4(b)]. For example, the CPU and GPU are responsible for complex data control and parallel vector operations. Additionally, vector-based platforms may accelerate most parallel computing processes, leaving the traditional CPU to perform all other complex tasks. The second stream comprises the data-intensive and computation-hungry VMM-related operations, including convolution, fully connected layers, and similarity calculation [Fig. 4(c)]. The VMM-involved procedures in the heterogeneous system are assigned to memristive arrays because all weights or samples can be stored in the resistive arrays, eliminating the need for frequent data communication between memory and the IMC system. However, not all VMM-related operations should use memristive arrays to accelerate computation. For example, the matrices used to calculate scores in the attention model73 are intermediate results of the neural network; implementing them on a memristive array is costly in terms of programming latency and power consumption. Therefore, SRAM-based IMC may be a good solution for accelerating the attention model, owing to its fast reading and writing.82 The heterogeneous system thus constitutes a data-hierarchical processor spanning scalars, vectors, and matrices. In this way, the diversity of ML-based applications and the acceleration of IMC can be accommodated simultaneously, making algorithmic completeness achievable.
Figure 4(d) shows one possible process for executing an IMC-based ML algorithm in a heterogeneous system. Digital systems, such as CPU-based systems, are used to develop the user-oriented top-level control code for the IMC-based application. During development, the IMC-supported ML library serves as the basis for developing ML applications, providing both IMC-based operations (conv2D, for example) and digital platform-specific operations (Batchnorm2d, ReLU, etc.). The developed code is compiled into executable machine code by a compiler that bridges the digital system and the analog IMC system. The IMC interface acts as the I/O port that receives control from the digital system and returns the results processed in the IMC system. The IMC system contains the essential working parts for VMM operations and functional controls. When the application runs, the digital system issues control commands for IMC computing and receives the processed data, which are then operated on in the digital system to produce results for applications like classification and data retrieval. This scheme may lower the computing speed and reduce efficiency compared with customized designs,15,41,42 but it is an acceptable price for enabling GPML.
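As a code-level illustration of this division of labor, the following PyTorch sketch keeps non-VMM operators on the digital side, while a stand-in class emulates a crossbar-backed convolution by quantizing weights to finite conductance levels. The class name and the weight_bits parameter are invented for illustration; in a real system, its forward pass would be dispatched to an IMC macro through the interface described above.

```python
import torch
import torch.nn as nn

class IMCConv2d(nn.Conv2d):
    """Stand-in for a crossbar-backed convolution: it emulates finite
    conductance precision by quantizing weights in the forward pass;
    a real runtime would dispatch this call to an IMC macro instead."""
    def __init__(self, *args, weight_bits=4, **kwargs):
        super().__init__(*args, **kwargs)
        self.weight_bits = weight_bits

    def forward(self, x):
        w = self.weight
        scale = w.abs().max() / (2 ** (self.weight_bits - 1) - 1)
        w_q = torch.round(w / scale) * scale     # emulated conductance levels
        return self._conv_forward(x, w_q, self.bias)

model = nn.Sequential(
    IMCConv2d(3, 16, 3, weight_bits=4),  # VMM layer -> memristive array
    nn.BatchNorm2d(16),                  # non-VMM ops -> digital system
    nn.ReLU(),
)
out = model(torch.randn(1, 3, 32, 32))
print(out.shape)
```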
B. Precision reconfigurability to solve the precision-efficiency dilemma
As summarized in Fig. 5(a), higher precision in a memristive IMC system often lowers computing efficiency, which is contrary to the original purpose of IMC for computing acceleration. This is mainly due to the high resolution of DACs and ADCs,85,86 whose energy consumption grows exponentially with resolution. The processing precision of ML models is typically balanced against hardware overhead and influenced by the peripheral circuits and the storage precision of the NVM arrays. To address these issues, many quantization methods have been applied to both weights and activations in ML tasks.85–89 However, it is difficult to identify a general ML task pattern that fits all computing scenarios. For example, edge ML can operate in a low-precision domain with quantized models under constrained resources, whereas cloud AI typically uses floating-point (FP) calculations with complex datasets90 [Fig. 5(b)]. Moreover, weight precision needs differ across algorithms and applications. Classification tasks often have greater flexibility to run at low precision, whereas AI-oriented scientific computing requires FP numbers.91 Even within deep-learning methods, various IMC-based layers may need different precision levels to balance efficiency and accuracy.87,92,93 For GPML, reconfigurable precision is needed to navigate the precision-efficiency dilemma.24
A software-enhanced precision promotion method has been investigated whereby the analog arrays run at low precision and a digital system corrects the final precision of the results [Fig. 5(c)].105 However, this method still depends heavily on the precision of the hardware, highlighting the need to develop a reconfigurable-precision IMC system. With regard to the peripheral circuits, configurable DACs and ADCs have been proposed to achieve a better compromise between precision and efficiency.106,107 For example, DACs can be replaced by binarized signals encoded in the time domain, whose outputs are shifted and added to obtain the results.17,106 Although this method requires more time and peripheral circuitry, it allows for reconfigurable input precision. Meanwhile, much attention has been focused on improving the storage precision of the arrays. One important solution is to increase the number of intermediate conductance levels of the memristive device; indeed, one such device has been reported to achieve 11-bit intermediate states.108 Through downward compatibility, devices with many conductance levels can switch between different precisions via programming. Another approach to improving the storage range at the array level is to use multiple low-precision devices to store high-precision data, known as bit-slicing.109 For instance, if an array has 3-bit storage ability, 9-bit data can be divided into three parts and assigned to three arrays individually [Fig. 5(d)]. Results from the different arrays are transferred to a digital shift-and-add network that sums the partial products and produces the VMM results with the desired precision. The bit-slicing method can thus control storage precision by configuring the shift-and-add network.
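A minimal NumPy sketch of the 9-bit/three-array example above follows; the slice width and matrix sizes are illustrative, and the arrays are assumed ideal.

```python
import numpy as np

def bit_sliced_vmm(x, W_int, slice_bits=3, n_slices=3):
    """Bit-slicing sketch: 9-bit integer weights split into three 3-bit
    slices, each mapped to its own (ideal) array; a digital shift-and-add
    network recombines the partial VMM results."""
    base = 2 ** slice_bits
    partials = []
    W = W_int.copy()
    for _ in range(n_slices):
        W_slice = W % base            # least-significant remaining slice
        partials.append(x @ W_slice)  # one crossbar VMM per slice
        W //= base
    # Shift-and-add: weight slice k by base**k and sum the partials
    return sum(p * (base ** k) for k, p in enumerate(partials))

W_int = np.random.randint(0, 512, size=(4, 3))   # 9-bit unsigned weights
x = np.random.randint(0, 16, size=4)
assert np.array_equal(bit_sliced_vmm(x, W_int), x @ W_int)
print("bit-sliced result matches the full-precision VMM")
```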
Although bit-slicing provides a flexible solution to the finite-state problem, this method is restricted by the linear mapping from physical conductance to real values according to the max–min conductance range (a short sketch of this mapping follows at the end of this subsection).110 This makes transferring models from digital computing to IMC systems challenging. On the one hand, pre-trained backbone models like ResNet and VGGNet use standard FP or integer (INT) arithmetic, both of which are incompatible with the existing conductance mapping methods of memristor circuits, resulting in unexpected accuracy loss.88 On the other hand, it is difficult to directly realize data communication between the analog computing in IMC and digital components like DRAM,111 raising the barrier between the two domains. The solution to these problems lies in mapping standard-format data in the digital space to counterparts in the pseudo-continuous storage space, which is called format-aligned mapping.72,112 In this approach, the length of the data is pre-aligned and fixed to match the standard data format, as shown in Fig. 5(e). However, format-aligned mapping cannot always make full use of every stored bit in the long tail of the format, wasting storage space. An important direction is therefore to develop new data formats adapted to IMC rather than using FP or INT directly. Combining precision reconfigurability with the heterogeneous system, memristive IMC can make significant strides toward higher computing precision and energy efficiency, which is crucial for enabling large-scale, high-performance GPML processing.
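To make the max–min linear mapping and its quantization loss concrete, here is a short sketch assuming an illustrative conductance window and a 3-bit device; neither value represents a particular technology.

```python
import numpy as np

G_MIN, G_MAX = 1e-6, 1e-4   # illustrative conductance window (siemens)
LEVELS = 2 ** 3             # a 3-bit device: 8 programmable states

def map_weights(W):
    """Linear max-min mapping from real-valued weights to a finite set
    of conductance states. Returns the conductances plus the range
    needed to read values back."""
    w_min, w_max = W.min(), W.max()
    # Normalize to [0, 1], snap to the nearest of LEVELS states, rescale
    norm = (W - w_min) / (w_max - w_min)
    q = np.round(norm * (LEVELS - 1)) / (LEVELS - 1)
    G = G_MIN + q * (G_MAX - G_MIN)
    return G, (w_min, w_max)

W = np.random.randn(4, 4)
G, (w_min, w_max) = map_weights(W)
# Read-back: invert the mapping to recover the (quantized) weights
W_hat = w_min + (G - G_MIN) / (G_MAX - G_MIN) * (w_max - w_min)
print("max quantization error:", np.abs(W - W_hat).max())
```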
IV. CHALLENGES AND OPPORTUNITIES
Heterogeneous computing systems that support GPML, as one of the main future directions for IMC development, are still in their initial stages. Many present IMC systems realize only part of GPML, such as precision or algorithm configurability. However, several key issues must be addressed at the hardware and software levels. Most hardware concerns involve system integration, which combines the requirements of heterogeneous systems and precision reconfigurability. At the software level, the goal is to introduce IMC to a broader application-oriented field with a complete IMC ecosystem.
A. Hardware implementation
Although many experimental achievements have been obtained in memristive IMC, the hardware implementation of memristive IMC-based GPML still faces many challenges. The main problems involve three aspects, from the basic device and array level through functional IMC core design to heterogeneous macrointegration, and should be further explored.
1. Device optimization and NVM array fabrication
The first hardware problem is device optimization and array fabrication, which form the basis of memristive IMC. At the device level, the critical decisions concern optimizing materials and device structures. The goal of device optimization is, on the one hand, to provide more stable intermediate levels of data representation that meet the requirements of precision reconfigurability and, on the other hand, to develop memristive devices compatible with existing CMOS processes, enabling the integration of NVM devices with mature CMOS technology.113,114 At the array level, considerable attention has been focused on increasing the storage capacity of single arrays. As shown earlier in Table I, reported models contain tens to hundreds of millions of VMM-involved parameters, making them significantly larger than current array sizes.61,97,104 Size limitations increase the cost of production.115 In the crossbar topology, the array size is largely limited by parasitic effects, including line resistance and parasitic capacitance.116,117 In large arrays, the IR drop across line resistances greatly degrades calculation accuracy and algorithm performance. Reasonable solutions for large-array fabrication are to increase the on-state resistance of the device118 or to use 3D integration.119,120 But challenges remain: high device resistance causes longer delays, and efficient device-stacking methods for 3D arrays are still to be explored.
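A back-of-the-envelope sketch of why the on-state resistance matters for IR drop: the one-pass approximation below estimates the voltage reaching the far end of a single word line (no full nodal analysis; the segment resistance, array size, and read voltage are illustrative assumptions).

```python
import numpy as np

def ir_drop_estimate(g_cell, n_cells, r_line=1.0, v_in=0.2):
    """First-order IR-drop sketch along one word line: cell currents are
    computed at the ideal read voltage, then the cumulative drop across
    the line-segment resistances is subtracted. A one-pass approximation,
    not a full nodal analysis."""
    i_cell = np.full(n_cells, g_cell * v_in)   # ideal per-cell currents
    # Segment k carries the current of all cells at or beyond position k
    i_seg = np.cumsum(i_cell[::-1])[::-1]
    v_drop = np.cumsum(i_seg * r_line)         # accumulated voltage drop
    v_eff = v_in - v_drop                      # voltage seen by each cell
    return v_eff[-1] / v_in                    # worst case (far end)

# Higher on-state resistance (lower conductance) mitigates the drop:
for r_on in [1e4, 1e5, 1e6]:   # ohms
    ratio = ir_drop_estimate(1.0 / r_on, n_cells=128)
    print(f"R_on = {r_on:.0e} ohm -> far-end voltage ratio {ratio:.3f}")
```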
2. Configurable circuit in IMC cores
The second problem is the need for a configurable IMC core based on memristive arrays. At the circuit level, the basic problem is defining the physical connections when building computing blocks. An IMC core usually contains more than one array to provide sufficient storage ability.61,103 Equipping each individual array with its own input and output circuits results in large hardware overheads. One common scheme shares the inputs and outputs among multiple arrays, using multiplexers to switch the physical connections between functional circuits and arrays. In addition, for data input, configurable DACs or binarized input signals in the time domain are used to enable adjustable input precision. However, reconfigurable inputs require adaptability in output processing, such as programmable output registers of different lengths. At the crossbar level, the precision of data storage can be adapted for diverse applications. Configuring array precision relies on programming circuits and bit-slicing methods; for example, shift-and-add circuits can combine partial results from different arrays under the bit-slicing method. In IMC-based hardware design, configurability is necessary to make optimization for specific applications possible, but it brings circuit redundancy, which in turn costs more area and power and complicates control. The most representative example is the shift-and-add network, whose cost may be comparable to that of the analog hardware itself.
3. Heterogeneous macrointegration
Third, more attention should be paid to the heterogeneous integration of computing platforms based on IMC cores. A heterogeneous system covering algorithmic completeness requires different systems, including IMC, CPU, GPU, and FPGA, to cooperate for higher performance and energy efficiency. The conventional idea is to integrate the digital systems and the IMC ones in one macro to avoid long communication paths between platforms. But this is often challenging because the fabrication process of memory usually lags behind that of the processor. The memristor-CMOS system provides a solution that stacks the memristors on the finished functional circuits using pad connections.61 A more feasible way is to consider heterogeneous system-on-chip integration schemes, such as chiplets and EMIB.121,122 Three-dimensional stacked hetero-integrated chips using optoelectronic device arrays and memristor crossbar arrays have been demonstrated and show exciting multimodal sensing and in-memory computing capabilities.123 These schemes allow for more design flexibility and easier manufacturing. However, the pressure then shifts to the high-bandwidth interconnects between the disparate computing systems.
B. Software ecosystem
Once the hardware design of the IMC system is accomplished, the next goal is to develop a top-level software ecosystem. Providing high-level programming methods for users, especially developers, is essential to encourage the use of memristive IMC-based GPML. Building a user-friendly computing environment will significantly advance the adoption of ML algorithms in IMC systems.
1. IMC-supported ML developing library
The first major issue is to develop an IMC-enabled ML acceleration library for code-level realization. For general users, the underlying hardware is of little concern because the IMC system is regarded as a black box for application purposes. The interface of the system separates developers from the underlying hardware so that they can focus on functional realization. A high-level library wraps the functional operations, such as convolution, fully connected layers, and distance calculation schemes, with VMM-accelerated IMC macros. Many CPU- and GPU-based ML applications have been created using mature ML frameworks, such as PyTorch and TensorFlow, making application development easier. However, there are still no public or commercial libraries that support the development of top-level IMC applications, for example, an IMC-supported PyTorch.
2. System technology co-optimization (STCO) tools
The second issue is the need to build system technology co-optimization (STCO) tools that support software–hardware co-optimization of the high-level code to reach the best performance. Unlike digital computing platforms, IMC is an approximate computing system that contains nonidealities in the analog domain. Inevitable nonideal factors in the array, such as read and program disturbances, finite conductance levels, and conductance drift, may greatly influence the performance of ML applications. Software enhancement methods, such as noise injection during training,61,124,125 are considered to provide margin for various applications. These methods rely on the robustness of the application against hardware deficiencies, based on behavioral prediction of the hardware. Moreover, when finite conductance quantization and device disturbance are taken into account, optimization in the IMC design space is an NP-hard problem that remains to be handled; for example, neural architecture search on RRAM has been used to obtain the best CNN performance while lowering the requirements on the ADC/DAC and device precision.126 Many studies provide simulation frameworks that realize design space exploration and performance estimation of ML tasks in the approximate IMC system.90,127–129 On the way to GPML, STCO design tools are strongly required to reach the best compromise between hardware resource consumption and software performance.
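A minimal PyTorch sketch of the noise-injection idea mentioned above follows; the Gaussian noise model and the 5% scale are illustrative assumptions, not a calibrated device model.

```python
import torch
import torch.nn as nn

class NoisyLinear(nn.Linear):
    """Noise-injection training sketch: Gaussian perturbations are added
    to the weights in every forward pass so the learned model tolerates
    device-level conductance variation."""
    def __init__(self, *args, noise_std=0.05, **kwargs):
        super().__init__(*args, **kwargs)
        self.noise_std = noise_std

    def forward(self, x):
        if self.training:
            noise = torch.randn_like(self.weight) * self.noise_std \
                    * self.weight.abs().max()
            return nn.functional.linear(x, self.weight + noise, self.bias)
        return super().forward(x)   # clean weights at inference time

model = nn.Sequential(NoisyLinear(64, 32), nn.ReLU(), NoisyLinear(32, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(16, 64), torch.randint(0, 10, (16,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()
```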
3. Instruction set architecture
The third issue is the need to design an instruction set architecture (ISA) that abstracts the hardware and forms the basis for running high-level code on IMC systems. The ISA design should consider two aspects. The first is compatibility with existing digital systems. In heterogeneous system design, mature instructions from mainstream digital architectures are adopted to reduce complexity. Thus, key design rules, such as the length, function division, and addressing space, should be consistent with existing instructions. The second is the functional design of instructions for the analog domain. Although ISA design depends on the architecture of the IMC cores, the central concept still focuses on the VMM operation. Three types of instructions should be considered in the IMC system: data loading, calculation, and communication. These instructions set the configurable circuits into the desired modes. Meanwhile, the IMC computing framework supporting instruction execution needs to be defined, but consensus in this field is still lacking.
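To illustrate the three instruction families, a hypothetical encoding sketch follows; the mnemonics, fields, and the compiled sequence are invented for illustration and do not represent an established IMC ISA.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Op(Enum):
    """The three instruction families suggested above."""
    LOAD = auto()   # data loading: program a weight tile into an array
    VMM  = auto()   # calculation: drive inputs, read accumulated outputs
    MOVE = auto()   # communication: transfer results to the digital side

@dataclass
class Instr:
    op: Op
    array_id: int        # which crossbar tile the instruction targets
    addr: int            # buffer address for inputs/outputs
    length: int          # vector length (rows driven / columns read)
    precision: int = 8   # requested ADC/DAC precision for this call

# A fully connected layer might compile to a sequence like:
program = [
    Instr(Op.LOAD, array_id=0, addr=0x000, length=256),        # one-time
    Instr(Op.VMM,  array_id=0, addr=0x100, length=256, precision=6),
    Instr(Op.MOVE, array_id=0, addr=0x200, length=64),
]
for ins in program:
    print(ins)
```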
4. General compiler
Finally, there is an urgent need to build a general compiler that converts top-level code into operational instructions on the IMC system. The general compiler automatically adopts the results of the STCO tools and uses the ISA as its core control mechanism. In a formal library module, the array size is treated as unbounded; it is not instantiated until the compiler runs. When storing a matrix in the memristive array, the compiler first automatically determines the matrix division under the precision optimized by the STCO tools. Then, the processed data matrix is mapped and stored in the NVM groups according to the available-space table,130 and storage maps are produced through the instruction execution. The storage maps record the placement information of the stored matrix. However, constrained by the topologies of the matrix and the arrays, matrix deployment is a 2D bin-packing problem, which is NP-hard.131 Therefore, the compiler should find the best data deployment according to the user's preference, such as efficiency priority (faster VMM operation) or resource priority (smaller occupied area or lower power consumption in IMC). When the wrapped functions are realized using VMM, the input data are automatically converted into read instructions according to the storage maps, and the processed results are collected by the compiler and returned to the application. Thus, the compiler bridges ML applications and IMC systems and frees IMC-based applications from the underlying system structures.
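The following greedy sketch shows the matrix-division and storage-map step described above, assuming a fixed 128x128 array size; a real compiler would treat placement as the 2D bin-packing problem just noted and optimize against the user's priority.

```python
import numpy as np

def tile_matrix(W, rows=128, cols=128):
    """Greedy row-major tiling sketch: split a weight matrix into
    array-sized blocks and record a storage map of (array index,
    row offset, col offset) for later instruction generation."""
    storage_map, tiles = [], []
    for i in range(0, W.shape[0], rows):
        for j in range(0, W.shape[1], cols):
            tiles.append(W[i:i + rows, j:j + cols])
            storage_map.append({"array": len(tiles) - 1,
                                "row_off": i, "col_off": j})
    return tiles, storage_map

W = np.random.randn(300, 200)          # larger than a single 128x128 array
tiles, storage_map = tile_matrix(W)
print(len(tiles), "arrays used;", storage_map[0])
# A VMM on W becomes per-tile VMMs whose partial sums are recombined
# digitally according to the storage map.
```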
V. CONCLUSION
In conclusion, the use of the in-memory computing paradigm to accelerate machine learning algorithms has entered a new phase, characterized by a shift from single algorithmic implementations to diverse and systematic applications. In this work, we reviewed the diverse applications implemented on memristive arrays and then summarized two typical aspects that lead the way to IMC-based general-purpose machine learning. One is to develop a heterogeneous computing system for algorithmic completeness. The other is to develop a precision-reconfigurable IMC system in pursuit of higher efficiency. Finally, challenges and opportunities were proposed to clarify possible directions for memristive IMC-based GPML, which will ultimately boost the progress of in-memory computing. We believe our study offers new insight for extending IMC into new application areas.
ACKNOWLEDGMENTS
We acknowledge financial support from the STI 2030—Major Projects (Grant No. 2021ZD0201201), the National Key Research and Development Plan of MOST of China (Grant Nos. 2019YFB2205100 and 2022YFB4500101), and the National Natural Science Foundation of China (Grant No. 92064012). This work was also supported by the Hubei Engineering Research Center on Microelectronics and Chua Memristor Institute.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Houji Zhou: Conceptualization (lead); Methodology (lead); Visualization (lead); Writing – original draft (lead); Writing – review & editing (equal). Jia Chen: Investigation (equal); Methodology (equal); Writing – original draft (equal); Writing – review & editing (equal). Jiancong Li: Methodology (equal); Writing – review & editing (supporting). Ling Yang: Visualization (supporting). Yi Li: Funding acquisition (equal); Project administration (equal); Resources (equal); Writing – review & editing (equal). Xiangshui Miao: Project administration (equal); Resources (equal).
DATA AVAILABILITY
The data that support the findings of this study are available within the article.