In recent years, as deep neural networks (DNNs) have been widely deployed in artificial intelligence (AI) applications, the demands for energy efficiency and computational speed have continuously increased. Computing-in-memory (CIM) is a potential solution that can significantly reduce the energy consumption and latency caused by data transmission. In this paper, the application of CIM based on spintronic devices to DNN training is explored. A CIM architecture using spintronic devices can efficiently perform the computational tasks of neural networks at the memory level. It is compared with CIM training based on SRAM, RRAM, and FRAM for a standard DNN training task at the same inference accuracy.
I. INTRODUCTION
In recent years, the rapid development of artificial intelligence (AI) and machine learning (ML) has driven the widespread application of ML algorithms based on deep neural networks (DNNs) across a range of fields, including image processing, speech recognition, natural language processing, and autonomous driving. With large numbers of cascaded neurons and complex parameter structures, DNNs can automatically extract features, recognize patterns, and make highly accurate predictions from vast quantities of data.1 However, DNNs require substantial computing resources and memory bandwidth to support high-speed data access and parameter updates. With large datasets and complex models, data is frequently transmitted among the neurons in each layer, requiring a considerable number of matrix calculations and non-linear transformations. This process entails many forward and backward propagations, as well as numerous training iterations, and therefore demands substantial parallel computing power and rapid memory access.
However, in the traditional von Neumann architecture, the computational unit (CPU or GPU) is separated from the storage unit (memory). Data is frequently transferred between the two units and consumes a large amount of energy, giving rise to the problems of the “memory wall” and the “power wall.” As data volumes grow, the gap between processing speed and transmission speed widens, degrading system performance and increasing power consumption. This ultimately constrains the efficiency of the traditional architecture in DNN training and inference tasks.
To address this bottleneck, the computing-in-memory (CIM) concept has been proposed, in which the computational and storage units are integrated. Part of the computation can be performed directly in memory, significantly reducing data transfer between the memory and the computational unit. This results in a notable reduction in power consumption and a marked improvement in computational efficiency.
Magnetic Random Access Memory (MRAM) is a novel type of non-volatile memory (NVM) that is particularly well-suited to CIM architecture.2 It employs spintronic principles to provide rapid read/write speeds, high durability, and low static power consumption, making it an optimal candidate for developing CIM systems. The core element in MRAM is the Magnetic Tunnel Junction (MTJ) device, as illustrated in Fig. 1.
The MTJ device comprises two ferromagnetic layers separated by a tunneling insulating layer; one ferromagnetic layer has a fixed magnetization direction (the “reference layer”), while the magnetization of the other layer can be reversed (the “free layer”).3,4 When the magnetization directions of the two layers are parallel, the device shows low resistance (denoted as “0”); when they are antiparallel, the device exhibits high resistance (denoted as “1”). The state of the MTJ can be switched by spin transfer torque (STT),5 which enables rapid, low-energy data access. Currently, the STT-MTJ is the most common spintronic device, with the merits of high memory density and mature fabrication technology.
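As a rough illustration of this two-state encoding, the following Python sketch models an MTJ readout; the parallel-state resistance and TMR ratio are illustrative assumptions, not parameters of the devices discussed here.

```python
# Minimal sketch of MTJ state encoding. R_P and TMR are assumed,
# illustrative values, not measured device parameters.
R_P = 5e3                    # parallel-state resistance (ohms), stores "0"
TMR = 1.5                    # tunnel magnetoresistance ratio, assumed
R_AP = R_P * (1 + TMR)       # antiparallel-state resistance, stores "1"

def mtj_bit(resistance: float) -> int:
    """Read a stored bit by comparing resistance to a midpoint reference."""
    threshold = 0.5 * (R_P + R_AP)
    return 1 if resistance > threshold else 0

print(mtj_bit(R_P))          # -> 0 (parallel: magnetizations aligned)
print(mtj_bit(R_AP))         # -> 1 (antiparallel: magnetizations opposed)
```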
In CIM architectures, the utilization of MTJ devices can reduce the need for data transmission from memory to processor, as computational operations can be performed directly in memory.6 Integrating computational operations into the storage unit results in a notable reduction in power consumption during the training of deep neural networks (DNNs). Moreover, the rapid write and read capabilities of MRAM facilitate efficient data access and processing in DNN training and inference, making it well suited to edge computing and high-performance computing applications that demand low power consumption and high speed. This architectural solution could overcome the limitations of the “memory wall” and the “power wall,” thereby facilitating the efficient operation of DNN models.
This paper is organized as follows: The architecture of CIM based on various types of memories, including the CIM array module and its peripheral circuit module, is introduced in Sec. II. A comparison among CIMs based on the different memories is presented in Sec. III. Finally, the conclusions of the paper are summarized in Sec. IV.
II. DESIGN OF CIM ARCHITECTURE
The objective of this work is to present a CIM architecture based on spintronic devices in conjunction with the Neurosim platform,7 as illustrated in Fig. 2. The architecture includes multiple modules: a global buffer, an accumulation unit, an activation unit (e.g., sigmoid), a pooling unit, and a data processing module.
FIG. 2. CIM chip architecture implemented with MTJ devices in conjunction with the Neurosim platform.
The global buffer is employed for the storage and management of data transmitted between modules, ensuring efficient data transfer and processing. The data processing module is the core of the proposed architecture. It comprises several processing elements (PEs), PE buffers, and input/output buffers. Each PE contains a set of in-memory computing arrays that employ spintronic devices to enable rapid parallel computation. PE buffers are included for the temporary storage of data, further enhancing processing efficiency.
The control unit manages data movement and computational operations, coordinating the activities of the different components. The accumulation module aggregates the partial sums produced by the different computational units, ultimately yielding the desired output results. The output buffer conveys the results to subsequent processing modules or external interfaces. The objective of this architectural design is to achieve efficient in-memory computation, reduce energy consumption, and enhance processing speed through the utilization of spintronic devices, thereby providing robust support for complex AI and machine learning workloads.8,9
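To make the PE-and-accumulator dataflow concrete, the minimal sketch below (our own illustration; the tile size and matrix shapes are assumptions) splits a weight matrix across PE arrays, lets each PE produce a partial sum, and aggregates the partial sums as the accumulation module would.

```python
# Hedged sketch of the PE/accumulation dataflow: each PE computes a
# partial weighted sum over a tile of the weight matrix, and the
# accumulation module adds the partial sums. Sizes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(256)             # input activations
W = rng.standard_normal((256, 64))       # layer weight matrix

TILE = 64                                # input rows mapped to one PE (assumed)
partials = [x[r:r + TILE] @ W[r:r + TILE, :]      # per-PE partial sums
            for r in range(0, W.shape[0], TILE)]

y = np.sum(partials, axis=0)             # accumulation module output
assert np.allclose(y, x @ W)             # equals the full matrix-vector product
```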
The CIM array consists of STT-MRAM sub-arrays and peripheral circuits based on a 1T-1MTJ structure, as shown in Fig. 3. In the array, write and read operations are performed by several key components. The word line (WL) controls the gate of the access transistor and acts as a per-cell switch that precisely controls data transmission. The source line (SL) is connected to the source of the transistor and provides the necessary current path. The bit line (BL) is connected to the bottom electrode of the STT-MTJ and is responsible for transferring the data stored in the STT-MRAM. This design allows the in-memory computing arrays to perform complex computational tasks with high parallelism and excellent energy efficiency, making them particularly suitable for application scenarios such as deep learning and big data processing.
The peripheral circuitry of the in-memory computing array is primarily composed of a WL/BL switching matrix, comprising transmission gates interconnected with the rows and columns. The control signals for the matrix are stored in registers. The function of this switching matrix is to apply fully parallel voltage inputs to the rows or columns of the array. In weighted-sum operations, the input vector signals are loaded onto the bit lines and determine whether each bit line is connected to the read voltage or to ground. Subsequently, the decoder selects the cells to be programmed and provides the required voltage bias scheme for the write operation. Given that the array cells are considerably smaller than the read peripheral circuits, it is not feasible to place all the read peripheral circuits beneath the array. Consequently, a multiplexer (Mux) is employed to share the read peripheral circuits among the columns of the synaptic array.
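The weighted-sum operation just described can be sketched as follows; the read voltage, conductance values, and array size are illustrative assumptions rather than extracted device parameters.

```python
# Sketch of an analog weighted sum on a 1T-1MTJ sub-array: each binary
# input connects its line to the read voltage or to ground, and every
# column current is the conductance-weighted sum of the driven voltages
# (Kirchhoff's current law). All numbers are illustrative assumptions.
import numpy as np

V_READ = 0.1                              # read voltage (V), assumed
G = np.array([[10.0,  4.0],               # cell conductances (uS), assumed
              [ 4.0, 10.0],
              [10.0, 10.0]])
x = np.array([1, 0, 1])                   # binary input vector

v = x * V_READ                            # lines driven to V_READ or ground
i_cols = v @ G                            # column currents: sum_j v_j * G[j, k]
print(i_cols)                             # analog partial sums (uA), per column
```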
To facilitate the efficient readout of partial sums and their subsequent processing in downstream logic modules (e.g., activation and pooling), a set of flash analog-to-digital converters (ADCs) is introduced at the end of the memory layer. The ADCs perform multi-level signal sampling by adjusting the reference voltages, thereby providing accurate digital outputs. They are also capable of handling more complex data tasks, which ultimately enhances overall system performance.
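A flash ADC of this kind compares the sensed value against a ladder of reference levels in parallel; the toy model below shows the conversion step, with the resolution and full-scale range chosen as assumptions.

```python
# Toy flash-ADC model: the sampled value is compared in parallel against
# a ladder of reference levels; the output code is the number of
# comparators that trip. Resolution and full scale are assumptions.
import numpy as np

def flash_adc(sample: float, n_bits: int = 3, full_scale: float = 1.0) -> int:
    refs = np.linspace(0.0, full_scale, 2 ** n_bits + 1)[1:-1]  # reference ladder
    return int(np.sum(sample > refs))     # thermometer code -> digital output

print(flash_adc(0.40))                    # -> 3 with the assumed 3-bit, 1.0 full scale
```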
III. CIM PERFORMANCE COMPARISON
In scenarios involving the training of DNNs and the deployment of CIM applications, the performance differences among memory technologies have a marked impact on overall system energy efficiency and computational efficiency. A comprehensive comparative analysis of the performance of four distinct memory types, SRAM,10 STT-MRAM,11 RRAM,12 and FRAM,13 is listed in Table I. The memory technologies exhibit notable disparities in pivotal metrics such as storage mechanism, read/write speed, energy consumption, and leakage characteristics. These disparities directly influence their suitability and energy efficiency in CIM systems.14
TABLE I. Comparison of the key properties among different memories.
| Technology | Storage | Read/Write (ns) | Energy Read/Write | Leakage (mW) | Advantages/Drawbacks |
|---|---|---|---|---|---|
| SRAM | CMOS latch | <1 | High | 5.65 | (-) Non-volatile; (++) Endurance |
| STT-MRAM | Magnetization | 2-30 | Medium | 1.35 | (+) Non-volatile; (+) Endurance |
| RRAM | Resistance | 1-100 | Medium | 1.23 | (+) Non-volatile; (-) Endurance |
| FRAM | Polarization | 30 | Low | 1.328 | (+) Non-volatile; (+) Endurance |
Based on the CIM array structure shown in Fig. 3, the memory cells in the CIM array can be implemented with each of the four types of memories. The VGG8 network is trained on the CIFAR-10 dataset under the same inference accuracy condition. The performance of SRAM-, STT-MRAM-, RRAM-, and FRAM-based CIMs in DNN training is compared in terms of five key metrics: computational throughput, power consumption, area efficiency, accuracy, and array energy decomposition.
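For reference, a VGG8-style network of the kind used in this benchmark can be sketched in PyTorch as below; the paper does not spell out its layer configuration, so the six-convolution, two-fully-connected layout and the channel widths are assumptions.

```python
# Hedged PyTorch sketch of a VGG8-style network for 32x32 CIFAR-10 inputs.
# The exact configuration used in this work is not given; the layer widths
# below are assumptions following a common six-conv + two-FC VGG8 layout.
import torch
import torch.nn as nn

def conv_block(c_in, c_out, pool=False):
    layers = [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(2))    # halves the spatial resolution
    return layers

class VGG8(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            *conv_block(3, 128),   *conv_block(128, 128, pool=True),   # 32 -> 16
            *conv_block(128, 256), *conv_block(256, 256, pool=True),   # 16 -> 8
            *conv_block(256, 512), *conv_block(512, 512, pool=True),   # 8 -> 4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 4 * 4, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(VGG8()(torch.randn(1, 3, 32, 32)).shape)    # torch.Size([1, 10])
```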
In terms of computational throughput, the comparison results are shown in Fig. 4. The CIM based on SRAM performs best, at 8.1689 TOPS, mainly owing to its extremely fast read and write speeds (<1 ns), which allow it to excel in compute-intensive tasks. However, SRAM's high throughput is accompanied by high energy consumption and severe leakage (5.65 mW), making it unsuitable for energy-efficiency-sensitive applications.
STT-MRAM offers 2.5057 TOPS; combined with moderate read/write speeds (2-30 ns) and low leakage (1.35 mW), it strikes a balance between computational performance and energy efficiency, making it well suited to scenarios with high-frequency data access and strict energy-efficiency requirements.
RRAM, at 1.3182 TOPS, is limited by its widely varying read and write speeds (1-100 ns), but its lowest leakage power (1.23 mW) makes it suitable for low-power applications.
FRAM, at 1.9253 TOPS, has slower read and write speeds (30 ns), but its low leakage (1.328 mW) and good endurance make it suitable for low-power, long-lifetime applications.
Based on the above comparison, the CIM based on SRAM shows the best throughput but poor energy efficiency. The CIM based on STT-MRAM achieves a balance between throughput and energy efficiency and is suitable for high-frequency read/write requirements. The CIMs based on RRAM and FRAM have moderate throughput but are better suited to power-sensitive applications.
The energy efficiency comparison results are shown in Fig. 5. Among the four candidates, RRAM and FRAM exhibit excellent energy efficiency, reaching 24.2696 TOPS/W and 24.0147 TOPS/W, respectively; this significant energy-efficiency advantage makes them particularly suitable for energy-sensitive tasks such as deep learning training. In contrast, STT-MRAM has a slightly lower energy efficiency of 22.3749 TOPS/W, but offers good endurance and a balanced non-volatile storage option for high-frequency access applications. SRAM is more limited in energy-efficiency-sensitive tasks due to its high read/write energy and significant leakage, yielding an energy efficiency of only 20.7404 TOPS/W. Specifically, the high energy efficiency of RRAM and FRAM stems from their low read/write energy consumption and low leakage. FRAM's high endurance further enhances its suitability for energy-sensitive applications. MRAM balances energy efficiency and endurance, and is suitable for CIM systems that require both adequate read/write speed and endurance.
Next, the area efficiency of the four memories in the CIM architecture is analyzed, as shown in Fig. 6. RRAM and FRAM perform best at the same inference accuracy, reaching 0.0846 TOPS/mm2 and 0.0806 TOPS/mm2, respectively, indicating that they achieve the highest computational density per unit of effective area. This advantage is mainly due to the outstanding compactness of the RRAM and FRAM cells. Combined with their low power consumption, RRAM and FRAM can provide high computational throughput in a small chip area, making them particularly suitable for application scenarios that demand high storage and compute density.
The area efficiency of STT-MRAM is 0.0684 TOPS/mm2, lower than those of RRAM15 and FRAM.16 However, it offers an effective compromise between non-volatility and storage density, and holds potential for integrated storage-and-computing applications. The area efficiency of SRAM is 0.0216 TOPS/mm2, owing to its CMOS latch-based cell, which requires a greater number of transistors and thus reduces computational density.
In conclusion, RRAM and FRAM are more competitive in high-computational-density DNN training due to their higher area efficiencies, and are especially suited to improving the computational throughput of CIM architectures.17 MRAM's balance of endurance and area efficiency makes it a strong choice for high-density computing applications. In contrast, SRAM's prospects are constrained by its low area efficiency and elevated power consumption.
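As a sanity check, the compute power and active array area implied by these figures can be back-calculated from the reported throughput (TOPS), energy efficiency (TOPS/W), and area efficiency (TOPS/mm2); the snippet below performs this arithmetic, and the derived watt and mm2 values are our own back-of-envelope estimates, not reported results.

```python
# Back-calculation from the reported figures: power = TOPS / (TOPS/W)
# and area = TOPS / (TOPS/mm^2). The inputs are the numbers quoted in
# the text; the derived outputs are our own estimates.
tops    = {"SRAM": 8.1689, "STT-MRAM": 2.5057, "RRAM": 1.3182, "FRAM": 1.9253}
tops_w  = {"SRAM": 20.7404, "STT-MRAM": 22.3749, "RRAM": 24.2696, "FRAM": 24.0147}
tops_mm = {"SRAM": 0.0216, "STT-MRAM": 0.0684, "RRAM": 0.0846, "FRAM": 0.0806}

for mem in tops:
    power_w  = tops[mem] / tops_w[mem]    # implied compute power (W)
    area_mm2 = tops[mem] / tops_mm[mem]   # implied active area (mm^2)
    print(f"{mem:9s}  {power_w:5.3f} W  {area_mm2:6.1f} mm^2")
```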
The results of the accuracy comparison among the four memories are illustrated in Fig. 7. These results are further analyzed to evaluate the training performance of the four memories under the CIM architecture. When training the VGG8 model on the CIFAR-10 dataset, there is a notable discrepancy in the model accuracy achieved with the four memories, despite the inference accuracy condition being the same.18
SRAM performs particularly well, achieving the highest accuracy of 89.41%. This can be attributed to the ultra-low read/write latency of SRAM (less than 1 ns), which minimizes delays during data transmission and computation and thus maintains high training accuracy.
RRAM achieves an accuracy of 86.32%. This high accuracy is primarily attributable to its relatively low read/write latency (1-100 ns), which enables more efficient parameter updates during training and thus aids model convergence. Furthermore, the non-volatile nature of RRAM mitigates the risk of data loss due to power failure or insufficient energy during training, enhancing overall training stability. However, the limited endurance of RRAM may lead to accuracy degradation under long-term, massive write operations.
The accuracy of FRAM is 76.89%, noticeably lower than that of SRAM and RRAM. The relatively slow write time (30 ns) of FRAM may create bottlenecks in DNN training, where weights are frequently updated, thus affecting the training accuracy.
The accuracy of STT-MRAM is 71.66%. This relatively low accuracy may be attributed to its storage mechanism and read/write behavior. Although STT-MRAM is competitive with RRAM in terms of read/write latency (2-30 ns), it has relatively high read/write energy consumption due to its magnetization-based storage mechanism, as well as potential stability issues. These may result in less accurate parameter updates during training, which could negatively affect the convergence of the final model.
Table II lists the energy consumption of CIMs based on the four memories, and Fig. 8 shows the corresponding breakdown as histograms. Combining the energy decomposition histograms in Fig. 8 with the data in Table II allows a comprehensive analysis of the energy consumption of CIM arrays based on the four memory types. The energy consumption is primarily attributable to the accumulator, the ADC, the interconnect, and other components, with notable disparities among the memories in these areas.19,20
TABLE II. Energy consumption in CIM based on the four memories.
| Memory | Accumulation Energy (µJ) | Other Energy (µJ) | ADC Energy (µJ) | Interconnect Energy (µJ) |
|---|---|---|---|---|
| SRAM | 0.501 | 0.501 | 1.851 | 1.159 |
| STT-MRAM | 0.264 | 0.346 | 0.412 | 1.152 |
| RRAM | 0.268 | 0.349 | 0.389 | 1.155 |
| FRAM | 0.259 | 0.351 | 0.316 | 1.152 |
In terms of accumulator energy consumption, FRAM exhibits the lowest value (0.259 μJ), followed closely by STT-MRAM (0.264 μJ) and RRAM (0.268 μJ). In contrast, SRAM shows the highest accumulator energy consumption (0.501 μJ). This discrepancy is mainly due to the data retention capability of the non-volatile memories (STT-MRAM, RRAM, and FRAM), which require fewer refresh operations during accumulation, thus reducing accumulator energy. SRAM, being volatile, requires frequent refresh and data-hold operations, significantly increasing the energy overhead during accumulation.
With regard to the “other” energy category, the discrepancy among the four memories is relatively minor, yet a comparable pattern persists. FRAM, STT-MRAM, and RRAM fall within the 0.346-0.351 μJ range, whereas SRAM again exhibits the highest “other” energy consumption (0.501 μJ). These additional energy costs are typically associated with peripheral control circuits, signal drivers, and read/write management. Once more, the elevated energy consumption of SRAM evidences the considerable impact of its high-frequency data refresh requirements.
In terms of ADC energy consumption, SRAM is significantly higher than the other memories, reaching 1.851 μJ. In comparison, the ADC energy consumption of STT-MRAM, RRAM, and FRAM is 0.412 μJ, 0.389 μJ, and 0.316 μJ, respectively. The energy overhead of the ADC is usually closely related to the signal conversion accuracy and the sampling rate; SRAM requires high-precision, high-speed data acquisition to match its ultra-fast read/write performance, which results in considerable ADC energy consumption. In contrast, FRAM exhibits the lowest ADC energy consumption, requiring only about one-sixth of the energy of SRAM. This reflects the energy advantage of FRAM in in-memory computation, particularly in scenarios that demand low ADC energy.
The differences in interconnect energy consumption among the four memories are minimal (1.152-1.159 μJ). This indicates that interconnect energy is primarily determined by the overall connection architecture of the CIM array rather than by the memory type. Consequently, optimizing interconnects should focus on architectural design and data-routing efficiency rather than on the choice of memory.
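The component shares discussed above follow directly from Table II; the snippet below totals each memory's per-component energy and prints the percentage breakdown.

```python
# Totals and component shares computed directly from Table II (values in uJ).
components = ["Accumulation", "Other", "ADC", "Interconnect"]
table_ii = {
    "SRAM":     [0.501, 0.501, 1.851, 1.159],
    "STT-MRAM": [0.264, 0.346, 0.412, 1.152],
    "RRAM":     [0.268, 0.349, 0.389, 1.155],
    "FRAM":     [0.259, 0.351, 0.316, 1.152],
}

for mem, vals in table_ii.items():
    total = sum(vals)
    shares = ", ".join(f"{c}: {v / total:.0%}" for c, v in zip(components, vals))
    print(f"{mem:9s} total {total:.3f} uJ  ({shares})")
```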
IV. CONCLUSIONS
In this paper, a comprehensive analysis and systematic comparison of the performance of four memories in the CIM architecture for DNN training are conducted. Deploying the distinct memory technologies within CIM architectures reveals the respective advantages and disadvantages of the four memories. In selecting an appropriate memory for in-memory computing, it is essential to consider a range of factors, including energy efficiency, area utilization, accuracy, and system energy consumption in the specific application scenario, so that overall performance and energy efficiency are optimized.
STT-MRAM, with its well-balanced characteristics of non-volatility, high endurance, and moderate energy consumption, represents an optimal compromise between energy efficiency and computing power. Conversely, RRAM and FRAM exhibit more substantial advantages in terms of energy and area efficiency. From the perspectives of throughput and accuracy, the CIM based on SRAM demonstrates the best performance owing to its exceptionally high read and write speeds.
ACKNOWLEDGMENTS
This work was supported by the NSFC (No. 61774078) and the Open Project Funding of the State Key Laboratory of Processor Chip (CLQ202303).
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Shuai Zhou: Conceptualization (lead); Data curation (lead); Investigation (lead); Methodology (lead); Project administration (lead); Software (lead). Yanfeng Jiang: Conceptualization (supporting); Funding acquisition (lead); Resources (lead); Supervision (lead); Writing – review & editing (lead).
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.