As artificial intelligence calls for novel energy-efficient hardware, neuromorphic computing systems based on analog resistive switching memory (RSM) devices have drawn great attention recently. Unlike the well-studied binary RSMs, analog RSMs feature a continuous and controllable conductance-tuning ability and are thus capable of combining analog computing and data storage at the device level. Although significant research achievements on analog RSMs have been accomplished, few works have demonstrated large-scale neuromorphic systems. A major bottleneck lies in the reliability issues of the analog RSM, such as endurance and retention degradation and read/write noise and disturbance. Owing to the complexity of resistive switching mechanisms, studies on the origins of reliability degradation and the corresponding optimization methodology face many challenges. In this article, aiming at high-performance neuromorphic computing applications, we provide a comprehensive review of the status of reliability studies of analog RSMs, the reliability requirements and evaluation criteria, and an outlook for future reliability research directions in this field.
I. INTRODUCTION
By mimicking the mechanism of human brains, artificial intelligence (AI) has achieved remarkable success, with applications in image and natural-language processing,1,2 driving automation,3 big data analysis,4 and even vision-based robotic object handling.5 The rapid development of AI applications requires continuous hardware advancement, in particular high-speed and energy-efficient hardware systems. However, the traditional computing system with von Neumann architecture suffers from high energy consumption and latency due to the huge amount of data transferred between the separate memory and logic units.6 The speed gap between the two units further results in considerable latency, which is called the “memory wall.”7,8 The neuromorphic computing system has been considered a promising candidate for breaking out of this predicament.9
Neuromorphic computing systems are systems that mimic the biological brain in structure and/or working mechanism.10 The emerging in-memory architecture based on analog-type resistive switching memory (RSM) is one of the key technologies for implementing the neuromorphic computing system. Analog RSM refers to a kind of two-terminal nonvolatile memory device with multiple conductance levels. The stored information is determined by the conductance value.11 In a neuromorphic computing system, the RSM devices act as synaptic weights to store information and process input signals.12 Based on Ohm's law and Kirchhoff's law, an RSM crossbar array can naturally accomplish matrix–vector multiplication (MVM) in one step by collecting the accumulated output current.13 In this case, high parallelism can be realized to accelerate the computations without the latency and energy cost of transferring data between memory and computing units. Compared to traditional memory devices, such as static random access memory (SRAM)14 and Flash,15 analog RSM has significant advantages. Although SRAM is fast and has a mature manufacturing technology that follows CMOS scaling, its low area efficiency and high standby power are undesirable in large-scale SRAM arrays.16 In contrast, Flash is a kind of nonvolatile memory device with analog computing ability. Neuromorphic computing chips based on Flash technology have demonstrated excellent performance when compared to conventional CMOS technology.17,18 However, compared to Flash technology, analog RSMs show higher switching speed, lower programming voltage, and higher endurance.19,20 With these excellent characteristics and much higher area efficiency, the analog RSM array shows great potential for future neuromorphic computing systems.
Recently, neuromorphic computing based on analog RSM has achieved significant progress from synaptic devices to array-level demonstrations. Jo et al.12 first proposed implementing the synaptic functions by using analog RSM devices, which pioneered the development of RSM for neuromorphic computing. Prezioso et al.21 reported an array-level implementation of in situ training in a neuromorphic network based on Al2O3/TiO2−x analog RSM. Furthermore, several tasks, such as handwriting recognition,22 face classification,13 feature extraction,23 and reinforcement learning,24 have been demonstrated based on an analog RSM array. These achievements demonstrated the functional feasibility and performance advantage of analog RSM-based neuromorphic computing systems. However, to date, experimental reports have remained at the level of primitive small-array demonstrations (hundreds to thousands of cells).14,25 It is still very challenging to develop a large array or full chip to execute practical AI tasks. Furthermore, the computing accuracy of analog RSM-based systems is lower than that of a CPU.26 The key challenge lies in the reliability issues of the analog RSM. With large write/read noise and disturbance,27,28 endurance and retention degradation,29 and other reliability issues, it is difficult to achieve high performance with large arrays. This situation significantly hinders further research and applications of neuromorphic computing with analog RSMs. Therefore, it is important to provide a comprehensive analysis and summary of the reliability issues of analog RSM, which is the goal of this review.
Previously, several comprehensive reviews have discussed the research progress of neuromorphic computing with emerging nonvolatile memory devices. They covered recent efforts in materials and mechanisms of synaptic devices30,31 and memory-based experimental demonstrations with novel algorithms and circuit architectures,16,32 accompanied by an analysis of the desired device properties.33,34 However, in RSM-based neuromorphic systems for functional demonstration and practical application, reliability becomes a key challenge that limits the performance and accuracy of large-scale RSM arrays. Therefore, it is necessary to provide a comprehensive summary and discussion of the state of the art, challenges, and prospects of the reliability problems and their impacts on neuromorphic computing. With these considerations, this review offers a dedicated perspective on device reliability and its impact on neuromorphic system performance. In detail, we review the reliability issues and possible solutions of analog RSM devices for neuromorphic computing. In Sec. II, we explain why and how to implement neuromorphic computing with analog RSMs. Then, various reliability concerns of neuromorphic computing applications are discussed in Sec. III, including the basic and functional reliability metrics. In Sec. IV, state-of-the-art and representative works on the reliability of the analog RSM are reviewed. The physical mechanisms of reliability degradation and optimization methods for analog RSMs are also summarized. Finally, we provide an outlook on the unresolved reliability issues that urgently need to be addressed for analog RSMs in neuromorphic computing.
II. RSM FOR NEUROMORPHIC COMPUTING
A. Analog RSM devices
Compared to traditional memory types, RSMs have shown significant advantages in implementing neuromorphic computing systems. Hardware accelerators based on traditional memories such as SRAM show limitations for computing in terms of cell density (100–200 F² per bit cell). By contrast, analog RSM, as a synaptic device, demonstrates high storage density (4–16 F² per bit cell)20 and fast parallel computing ability.35 Recently, some three-terminal transistorlike synaptic devices have been proposed with better conductance tuning ability.36,37 However, in this article, we only focus on two-terminal resistorlike analog RSMs because they show better integration density and have been well studied on the reliability aspects.
The analog RSMs typically include filamentary-type resistive random access memory (RRAM) devices [Fig. 1(a)], non-filamentary-type RRAM devices [Fig. 1(b)], and phase change memory (PCM) devices [Fig. 1(c)]. Filamentary RRAMs can be further classified into cation type, anion type, and dual ionic type. The resistance value of the filamentary RRAM depends on the formation and rupture of conductive filaments (CFs),38 as shown in Fig. 1(a). The CFs are composed of interstitial metal atoms (cation type),39 oxygen vacancies (anion type),40 or both (dual ionic type).41 Resistive switching is dominated by the migration of ions. The resistance value of the nonfilamentary RRAM is determined by the interfacial Schottky/tunneling barrier modulated by electron trapping/detrapping or ion migration,42 as shown in Fig. 1(b). Unlike the morphology or composition change in RRAM, the resistance change in the PCM is determined by the thermally induced lattice phase change in a bulk region.43 In PCM, the active layer is a chalcogenide-based material, which can maintain a crystalline or amorphous state for a long time, as shown in Fig. 1(c). The crystalline state shows a lower resistance value, whereas the amorphous state demonstrates semiconductor characteristics corresponding to a higher resistance state. The reversible switching depends on the Joule heating caused by the voltage/current pulses in the active region. Furthermore, some charge- or spin-based memory devices also show resistive switching behaviors, such as magnetic random access memory (MRAM) devices, domain wall devices, ferroelectric devices, and charge-trapping devices.44,45 However, these types of devices still need more study to realize both analog-type resistive switching ability and long-term retention simultaneously.
Computing with the emerging analog-type RSM. (a) The structure and mechanism of filamentary RRAM. The rupture or connection of CFs represents the higher or lower resistance states, and multiple CFs contribute to the analog switching ability. (b) The structure and mechanism of nonfilamentary RRAM. The two insets illustrate the band diagrams of the interface in HRS (left) and LRS (right). (c) The structure and mechanism of PCM. The phase of the programmable region switches between the crystalline and amorphous states corresponding to the resistive switching between LRS and HRS, respectively.
To tune the conductance of analog RSM devices, an external voltage pulse is applied. If the device conductance increases with an applied pulse, we call this process “SET,” “weight increase,” or “potentiation.” Conversely, if a pulse causes a conductance decrease, we call this process “RESET,” “weight decrease,” or “depression.” Some RSMs are bipolar, which means that SET and RESET pulses must have opposite voltage polarities, and the others are unipolar, which means that SET and RESET are independent of voltage polarity. Most RSMs based on the ion-migration mechanism are bipolar. For analog RSMs, the lowest and highest resistance states are called LRS and HRS, respectively, and all intermediate resistance states are called MRS. Sometimes, when the device is switching between two MRSs, we call the pair a lower medium resistance state (L-MRS) and a higher medium resistance state (H-MRS).
B. RSM-based neuromorphic computing system
There are two approaches for implementing neuromorphic computing based on RSM with different information-encoding schemes. One is the deep neural network (DNN), which pursues high computation efficiency on data-intensive tasks. The other is the spiking neural network (SNN), which pursues excellent power efficiency by mimicking the biological neural network in the human brain with the neuron values encoded by spike timing.46 Analog RSM has been exploited for DNN chips, such as UMass's 128 × 64 reconfigurable 1T1R memristor crossbars,47 UCSB's transistor-free 12 × 12 memristor crossbars,21 and UMich's fully integrated hardware system on a 54 × 108 reprogrammable memristor chip,48 while the experimental demonstration of SNN based on the analog RSM chip remains to be studied. Because DNNs are the mainstream neural networks and have demonstrated much better performance and accuracy than SNNs,49 this review will focus on the DNN with analog RSMs.
The processing of a neural network includes two phases: inference and training. Inference is a feedforward computing process by summing the weighted inputs from the prelayer neurons and generating output signals to the postlayer neurons, as shown in Fig. 2(a). The inference in multiple layers is carried out layer by layer sequentially. In an analog RSM array, the conductance (Gij) of each RSM device acts as an analog weight (Wij). As shown in Fig. 2(b), the current of each RSM (Iij) is the product of Gij and the biased voltage Vj based on Ohm's law, whereas the total current Ii in one column is accumulated as the sum of the device current based on Kirchhoff's current law.50 For the hidden layer, an activation function unit is required to transform the output current Ih(m) of the previous layer into the input voltage Vh(m) of the subsequent layer. Therefore, the complex MVM can be naturally implemented by inputting the voltage from the word lines (WLs) and outputting the current through the bit lines (BLs), as shown in Fig. 2(b). In this case, significant amounts of energy and time are saved with the natural parallel operation without data transfer between memory and computing cells in the von Neumann architecture.12,51
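The weighted sum described above can be sketched in a few lines. This is only an idealized illustration, assuming no wire resistance, sneak paths, or device noise; the function name `crossbar_mvm` and the example conductance and voltage values are hypothetical and do not come from any of the cited chips.

```python
def crossbar_mvm(conductances, voltages):
    """Idealized crossbar read: I_i = sum_j G_ij * V_j.

    conductances: list of rows, one row per output line, in siemens (G_ij).
    voltages: input voltages V_j in volts, one per input line.
    Returns the accumulated output currents I_i in amperes.
    """
    currents = []
    for row in conductances:
        if len(row) != len(voltages):
            raise ValueError("each conductance row must match the number of inputs")
        # Ohm's law gives each device current G_ij * V_j;
        # Kirchhoff's current law sums them on the shared output line.
        currents.append(sum(g * v for g, v in zip(row, voltages)))
    return currents


# Hypothetical 2 x 3 array (values chosen only for illustration).
example_conductances = [
    [1e-6, 2e-6, 3e-6],
    [4e-6, 5e-6, 6e-6],
]
example_voltages = [0.1, 0.2, 0.3]
example_currents = crossbar_mvm(example_conductances, example_voltages)
print(example_currents)
```

In hardware, all output currents are produced at once by the physics of the array, whereas the loop here evaluates them sequentially; the sketch only reproduces the arithmetic.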
(a) Schematic diagram of a two-layer neural network. Each neuron computes a weighted sum of its inputs and applies a nonlinear activation function. (b) The schematic of an analog RSM crossbar array implementation of the most critical part of the perceptron, the weighted sum, where the conductance of analog RSM acts as the synaptic weight in a neuromorphic network.
Training of a neuromorphic system relies on a feedback algorithm, as shown in Fig. 2(a), which updates the weights in parallel according to the learning rules.52 There are two training methods for a neuromorphic system: ex situ and in situ training. For the ex situ method, training is executed in the software system first, and then the calculated weights are loaded to the analog RSM array.53 The weights stored in the RSMs are not adjusted during the weight-loading process, regardless of the existence of variation or other undesired weight changes. It is important to keep the conductance of the RSM unchanged after weight loading. Therefore, the requirements of retention, bit yield, and uniformity are very strict for analog RSMs under ex situ training.
By contrast, a neuromorphic computing system with in situ training is capable of updating weights on chip and thus has better immunity to retention degradation, state-stuck issues, and variations than ex situ training.54 Before in situ training, the weights stored in the analog RSMs can be either random or started from the values that are calculated and loaded based on the preliminary ex situ training method.55 The goal of in situ training is to maximize the inference accuracy by tuning the device conductance in the analog RSM arrays. During one training iteration, before conductance tuning, a preinference process is required to obtain the errors between the expected results and the calculated results. Then, the desired weight update value can be calculated. There are several ways to tune the conductance based on the calculated errors. The most general way is to use the conventional backpropagation (BP) learning rule. Using this method, the desired weight update values should be calculated exactly and mapped to analog RSM devices. A verification programming scheme should be introduced to ensure that the conductance of each analog RSM is tuned to the projected value.13 A simplified method is to use the sign-based BP (SBP) learning rule.56 In this case, the exact weight update values are not necessary, and only the signs of weight update values are needed. If the sign is positive, a SET pulse is applied to the corresponding analog RSM device. By contrast, if the sign is negative, a RESET pulse is applied. The SBP method can largely reduce the overhead of periphery circuits and verification time and can be done in a parallel way rather than via one-by-one verification.57 In addition, the training accuracy is strongly dependent on the characteristics of analog resistive switching behaviors, such as nonlinearity and asymmetry, which will be discussed in Sec. III. 
Recently, some novel learning rules for the trade-off between BP and SBP were also proposed.57,58 Furthermore, it should be noticed that a complete in situ training process contains many iterations, and thus, the conductance of each analog RSM is tuned thousands to millions of times according to the learning rule and learning task.51,59 Therefore, endurance becomes one of the most important concerns for analog RSMs under in situ training.
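The sign-based update described above can be illustrated with a short sketch. It assumes an idealized device whose conductance changes by a fixed step per pulse and saturates at the window edges; real analog RSMs deviate from this through nonlinearity, asymmetry, and pulse-to-pulse variation. The function name `sbp_update` and all numerical values are hypothetical.

```python
def sbp_update(conductances, weight_updates, step, g_min, g_max):
    """Apply one sign-based update to every device.

    Only the sign of each desired weight update is used:
    positive -> one SET pulse (conductance goes up by `step`),
    negative -> one RESET pulse (conductance goes down by `step`),
    zero     -> no pulse.
    The fixed `step` is an idealization, not a device model.
    """
    updated = []
    for g_row, dw_row in zip(conductances, weight_updates):
        new_row = []
        for g, dw in zip(g_row, dw_row):
            if dw > 0:
                g = g + step  # SET pulse
            elif dw < 0:
                g = g - step  # RESET pulse
            # The device cannot leave its conductance window.
            new_row.append(min(g_max, max(g_min, g)))
        updated.append(new_row)
    return updated


# Hypothetical conductances in microsiemens and desired weight updates.
example_g = [[5.0, 9.5], [1.2, 3.0]]
example_dw = [[0.3, 0.7], [-0.9, 0.0]]
example_new_g = sbp_update(example_g, example_dw, step=1.0, g_min=1.0, g_max=10.0)
print(example_new_g)
```

Because no target conductance is computed, no per-device verification loop is needed, which is the source of the reduced peripheral overhead noted above. The cost is that the update magnitude is set by the device response rather than by the algorithm.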
III. RELIABILITY CONCERNS
Notably, there has been no universal evaluation criterion for the reliability of analog RSM devices. However, computing accuracy can be considered a critical parameter for quantifying the reliability metrics. The accuracy loss of analog RSM-based neuromorphic computing can be attributed to two main causes: one is the nonideal effects of analog resistive switching, which make the hardware training accuracy lower than the software training accuracy;51 the other is the conductance drift or fluctuation of the analog RSM devices, which causes the accuracy after training to degrade with time.60 Based on this analysis, the device reliability metrics can be classified into basic reliability metrics and functional reliability metrics, as shown in Fig. 3. Basic reliability metrics are valid for both memory and computing applications and include retention, endurance, write/read disturbance, and random noise, whereas functional reliability metrics have attracted widespread attention mainly for computing applications and include nonlinearity, asymmetry, dynamic range, precision, variation, and bit yield. For analog RSMs, some basic reliability issues can degrade the functional reliability metrics and further decrease the accuracy; for example, endurance degradation usually triggers a decrease in the dynamic range and an increase in nonlinearity.61 However, both the basic and functional reliability metrics are related to the accuracy loss.62,63 Therefore, the physical mechanisms and impact of reliability issues for analog RSM devices are very complex and require comprehensive studies. More importantly, the definition of the basic reliability metrics of analog memory devices for neuromorphic computing applications is different from that for conventional memory applications. It is highly desirable to clarify their differences and provide a clear criterion for the future study of reliability physics. 
In the following part of this section, we will discuss the basic and functional reliability metrics.
Reliability metrics of the neuromorphic device. Device reliability metrics are classified into basic and functional reliability metrics, which degrade the accuracy during and after the training process.
A. Basic reliability metrics
Basic reliability metrics refer to the common and essential reliability characteristics for both memory and neuromorphic computing applications. Retention, endurance, write/read disturb, and noise are the critical reliability metrics of RSM devices. Their similarities and differences are illustrated in Fig. 4. The requirement of memory applications focuses on distinguishable resistance states, regardless of the change in the resistance value during the programming and data retention process.64 Therefore, the reliability evaluation of memory applications is mainly concerned with the resistance windows between different states. For example, window retention means that the resistance window should be maintained for more than 10 years at 85 °C according to the industry standard.34 As long as the resistance value does not cross the reference line, a small variation of the resistance value is permitted, as shown in Fig. 4(a). Cycling endurance also allows the switching window to narrow with increasing switching cycles as long as the window does not disappear, as shown in Fig. 4(c). The maximum number of endurance cycles is the most important evaluation criterion in full window switching (between HRS and LRS).19 Write/read disturb refers to the unexpected resistance change during the write or read process caused by the long-time accumulation of small voltages. Read disturb usually occurs on the selected cell in the array and is caused by continuous reads of that cell.65 Write disturb occurs on the unselected cells in the array and is caused by electrical and thermal cross talk or the sneak path effect. The occurrence of write disturb depends on the pulse voltage, array structure, location in the array,66 and program scheme.27 The criterion for disturbance is similar to that for retention, as shown in Fig. 4(e). Noise is an intrinsic characteristic of electronic devices and takes a variety of forms, such as 1/f noise67 and random telegraph noise (RTN),68 as shown in Fig. 4(g). 
The noise limit is likewise that the noisy resistance should not cross the reference line between HRS and LRS.
Schematic diagram of different basic reliability metrics of memory application for digital data storage and neuromorphic computing application for analog data processing and storage. (a) Window retention of digital memory and (b) conductance retention of analog data in the computing process. (c) Cycling endurance of digital memory and (d) incremental switching endurance of analog data in the computing process. (e) Write/read disturb in memory and (f) computing. (g) Noises in memory and (h) computing.
Compared to the reliability concerns of memory applications, the concerns of neuromorphic computing applications focus on accurate conductance values, which is a stricter requirement from the device-level view. Ideally, the conductance retention and conductance tuning process of the analog RSM remain stable, which requires the development of new mechanisms to further optimize the device. From the system-level view, however, the requirements can be relaxed according to the application; e.g., a small degree of conductance fluctuation can be tolerated during inference,69 which calls for device-system co-optimization. Specifically, the conductance retention of analog RSMs concerns the individual conductance change of each analog level, as shown in Fig. 4(b). For endurance evaluation, neuromorphic computing focuses on incremental switching, which means that the conductance of the analog RSM changes by only a small amount to mimic the weight update process.60 The incremental switching endurance describes conductance tuning within varying levels and ranges as different numbers of pulses are applied, as shown in Fig. 4(d). The conductance evolution under different training algorithms may be quite different, so the evaluation and measurement methods should also be adjusted according to the algorithm. The degradation of functional reliability metrics should also be considered during endurance tests.61 Write/read disturb behaves similarly to conductance retention, with the retention time replaced by the pulse number, as shown in Fig. 4(f). Because the operation schemes for inference and training are slightly different from the read and write schemes of memory applications,13,65 particularly in their degree of parallelism, write/read disturb of the analog RSM for neuromorphic computing also needs further study and is largely dependent on the network structure and learning rule. 
Noise has a greater impact on neuromorphic computing applications. The read current fluctuation should be limited to a safe range to ensure a high degree of confidence in the neuromorphic computing [Fig. 4(h)].70
To investigate the impact of conductance retention degradation on the performance of neuromorphic computing, a neural network simulator was developed to classify patterns from the MNIST database, as shown in Fig. 5(a). A retention model was extracted to capture the evolution of the conductance distribution with retention time.71 Apparent degradation (∼4.6%) of the recognition rate was found after 10⁴ s of baking at 175 °C.72 Based on the Arrhenius equation, this retention time was equivalent to 5.45 years at 85 °C [Fig. 5(b)]. When considering the impact of endurance degradation on the performance of neuromorphic computing, the impact of endurance on the functional metrics should be studied first. It was found that the functional reliability metrics, such as the dynamic range, nonlinearity, and asymmetry, degraded gradually with increasing numbers of incremental switching cycles [Fig. 5(c)].61 The deformed incremental switching curve of conductance vs pulse number corresponded to the weight mapping model. Therefore, the changing mapping model complicated the update process and contributed to the decrease in learning accuracy. Figure 5(d) shows that a significant accuracy loss occurred after 10⁷ cycles when considering both nonlinearity and dynamic range degradation.61 Figure 5(e) shows the impact of read noise on accuracy loss with online (in situ) and offline (ex situ) training. It was found that accuracy loss became serious when the read noise was above 15%, and the abrupt loss occurred in online training.70
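The Arrhenius conversion between bake time and use-condition time can be sketched as follows. The activation energy is not stated in the text; the value of 1.5 eV used here is an assumption, chosen because it gives roughly 5.5 years at 85 °C for a 10⁴ s bake at 175 °C, which is close to the quoted 5.45 years. The function name and constants are illustrative only.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def arrhenius_equivalent_time(stress_time_s, stress_temp_c, use_temp_c,
                              activation_energy_ev):
    """Convert a bake time at the stress temperature into the equivalent
    time at the use temperature with the Arrhenius acceleration factor
    AF = exp[(Ea / k) * (1 / T_use - 1 / T_stress)], temperatures in kelvin.
    """
    stress_temp_k = stress_temp_c + 273.15
    use_temp_k = use_temp_c + 273.15
    acceleration = math.exp(
        activation_energy_ev / BOLTZMANN_EV_PER_K
        * (1.0 / use_temp_k - 1.0 / stress_temp_k)
    )
    return stress_time_s * acceleration


# Assumed activation energy of 1.5 eV (not given in the text).
equivalent_years = arrhenius_equivalent_time(
    stress_time_s=1e4,
    stress_temp_c=175.0,
    use_temp_c=85.0,
    activation_energy_ev=1.5,
) / SECONDS_PER_YEAR
print(f"Equivalent time at 85 C: {equivalent_years:.2f} years")
```

The result is very sensitive to the activation energy because it enters the exponent, so the extracted retention model, rather than this single-number estimate, should be used for any quantitative projection.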
Impact of basic reliability metrics on accuracy loss during MNIST image learning and recognition. (a) Schematic of the multilayer standard perceptron for recognizing images from the MNIST database, where the weights are implemented with the analog RSM array. (b) Impact of retention degradation on the recognition error rate. Reprinted with permission from Huang et al., in IEEE International Electron Devices Meeting (IEDM) (2018), p. 40.4.1. Copyright 2018 IEEE. (c) On/off ratio decreased with the increasing weight update number due to endurance degradation. (d) The accuracy loss as a function of endurance cycle number. Reprinted with permission from Zhao et al., in IEEE International Electron Devices Meeting (IEDM) (2018), p. 20.2.1. Copyright 2018 IEEE. (e) The impact of noise on accuracy loss in online and offline training. Reprinted with permission from Chen et al., in IEEE International Electron Devices Meeting (IEDM) (2017), p. 6.1.1. Copyright 2017 IEEE.
B. Functional reliability metrics
Functional reliability metrics refer to the functional properties of analog RSMs for neuromorphic computing and have a direct influence on the training accuracy. Nonlinearity,73 asymmetry,74 dynamic range,75 precision,37 variation,76 and yield77 are important functional reliability metrics. The characteristics and impact of these functional metrics on training accuracy are presented in Fig. 6. The degradation of these functional metrics during the reliability test should be taken into careful consideration. Although many simulation works have been done on this topic and already provided many valuable design guidelines, experimental demonstrations on the impacts of these functional reliability metrics are still required to give more conclusive results in the future.
The characteristics and definition of the functional reliability metrics of analog RSMs and the impact of these metrics on neuromorphic computing. (a), (c), (e), (g), (i), and (k) Schematic diagram of the dynamic range, nonlinearity, asymmetry, precision, variation, and yield. (b) Simulated accuracy as a function of dynamic range. The G reduction ratio refers to the proportion of the dynamic range reduction. The error bar represents the impact of dynamic range variation on accuracy. Reprinted with permission from Chen et al., in IEEE International Reliability Physics Symposium (IRPS) (2018), p. 5C.4. Copyright 2018 IEEE. (d) Test accuracy as a function of different nonlinearity magnitudes of three realistic RRAM devices. Reprinted with permission from Chen et al., in IEEE/ACM International Conference on Computer-Aided Design (ICCAD) (2015), p. 3-A.3.194. Copyright 2015 IEEE. (f) MNIST simulation accuracy as a function of switching asymmetry. Reprinted with permission from Tang et al., in IEEE International Electron Devices Meeting (IEDM) (2018), p. 13.1.1. Copyright 2018 IEEE. (h) Simulated accuracy as a function of precision bits. Reprinted with permission from Liu et al., in European Solid-State Device Research Conference (2017), p. A3L-F.18. Copyright 2017 IEEE. (j) Simulated accuracy loss due to pulse-to-pulse variation with several different nonlinearities during in situ training. Reprinted with permission from Chen et al., IEEE International Electron Devices Meeting (IEDM) (2017), p. 6.1.1. Copyright 2017 IEEE. (l) Test accuracy as a function of stuck-on G rate and dead G rate after 20 epochs of training. Reprinted with permission from Romero et al., Faraday Discuss. 213(0), 371 (2019), Copyright 2019 RSC publishing.
The dynamic range is the conductance ratio between the LRS and the HRS, also called the on/off ratio, and ranges from 2 to 100 depending on the materials and structures of analog RSM devices [Fig. 6(a)].78 When the precision is determined, the dynamic range is directly related to the number of conductance or weight levels during the training process. Chen and Yu60 demonstrated the impact of dynamic range degradation on recognition accuracy, as shown in Fig. 6(b). The concepts of nonlinearity and asymmetry are derived from the relationship between the conductance change and the weight-update pulse number or voltage polarity. Nonlinearity corresponds to the degree of curvature of the weight-update curve of analog RSMs, i.e., how the incremental conductance change varies with increasing weight-update pulse number [Fig. 6(c)]. Several papers reported nonideal linearity in different analog RSM devices.12,79–81 Inconsistent conductance changes make it difficult to tune the conductance to the target with identical pulses, resulting in poor convergence rates during the training process. Nonlinearity therefore directly increases the training accuracy loss. Chen et al.82 showed that, for on-chip sparse coding (SC) with three realistic synaptic devices, the accuracy degraded in the region of higher long-term depression (LTD) and long-term potentiation (LTP) nonlinearity [Fig. 6(d)]. Asymmetry measures how symmetric the curves of conductance change vs pulse number are between the weight increase and decrease processes, as shown in Fig. 6(e). Tang et al.37 indicated that the accuracy mainly depended on the asymmetry, as shown in Fig. 6(f). Moreover, asymmetry influences the training accuracy together with nonlinearity.83 Li et al.84 discussed the impact of nonlinearity and asymmetry on the training accuracy. 
It was found that a bidirectionally symmetric incremental conductance change could maintain good accuracy even with relatively large nonlinearity.85 Precision indicates how many weight bits are provided by one device in the full dynamic range, as determined by the analog switching ability [Fig. 6(g)].76 For ex situ training, it has been shown that low bit precision can still deliver reasonable inference accuracy.86 However, a weight with high bit precision is necessary to realize an incremental weight-update process during in situ training.82 Liu et al.87 demonstrated the impact of weight precision on the accuracy of classifying the MNIST handwritten digits with a perceptron neural network in a 1 kb 1T1R array. The results suggested that the accuracy degraded gradually when the weight precision was lower than 4 bits for ex situ training [Fig. 6(h)]. Variations of analog RSMs for neuromorphic computing refer not only to parameter differences (e.g., operation voltage, nonlinearity, and dynamic range) from one device to another and from one full switching cycle to another but also to the pulse-to-pulse variation during a single weight increase or decrease process [Fig. 6(i)]. For an inference-only system (ex situ training), device-to-device and cycle-to-cycle variations have a significant impact on accuracy loss. Even with a verification programming scheme, these variations can cause the programed conductance to deviate from the target value on each cell. The accumulated conductance programming deviation may cause computing errors during inference.76,88 In situ training systems show better tolerance against device-to-device and cycle-to-cycle variations, thanks to their self-adaptive ability.69 However, pulse-to-pulse variation can cause unregulated conductance changes after each applied pulse. 
Because of the large number of weight-update operations during training, pulse-to-pulse variation increases the number of training iterations required and causes serious accuracy loss.82,89 Chen et al.70 explored the impact of variation on accuracy loss, as shown in Fig. 6(j). With different nonlinearity during in situ training, high accuracy was only realized with small cycle-to-cycle variation and nonlinearity, and approximately 2% variation mitigated the accuracy degradation derived from the high nonlinearity. The bit yield refers to the percentage of RSM devices with analog switching behavior in the network. A low bit yield is caused by state-stuck or abrupt switching in some RSM devices, as shown in Fig. 6(k). Romero et al.77 investigated the impact of the state-stuck effect on training accuracy. It was found that in situ training provided high tolerance to low bit yields and maintained reasonable accuracy, although some accuracy loss remained inevitable [Fig. 6(l)]. In addition, even at a small stuck ratio, there was a large gap in accuracy loss between in situ and ex situ training: the former showed better stuck-at-fault tolerance and a smaller accuracy loss than the latter owing to its self-adaptation ability. Li et al.54 reported that the presence of some nonresponsive devices leads to decreased accuracy in neuromorphic computing. It was found that multiple hidden layers in the neural network weakened the impact of stuck devices and achieved higher accuracy than single-layer networks. This was because the hidden neurons can adjust their connections accordingly to maintain the accuracy once a related device has failed.
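The link between precision and dynamic range described above can be illustrated with a minimal sketch. It assumes evenly spaced conductance levels between a minimum and a maximum conductance and a normalized weight in [0, 1]; the function names and the default conductance values (arbitrary units, on/off ratio of 10) are hypothetical and are not taken from the cited works.

```python
def quantize_weight(w, n_bits, g_min=1.0, g_max=10.0):
    """Map a normalized weight in [0, 1] to the nearest of 2**n_bits
    evenly spaced conductance levels between g_min and g_max.
    Even spacing is an assumption; real devices are rarely this ideal."""
    levels = 2 ** n_bits
    w = min(max(w, 0.0), 1.0)          # clip to the representable range
    index = round(w * (levels - 1))    # nearest level index
    return g_min + (g_max - g_min) * index / (levels - 1)


def max_quantization_error(n_bits, g_min=1.0, g_max=10.0):
    """Worst-case conductance error: half the spacing between levels."""
    levels = 2 ** n_bits
    return (g_max - g_min) / (levels - 1) / 2


if __name__ == "__main__":
    for bits in (2, 4, 6):
        print(bits, "bits -> max error", max_quantization_error(bits))
```

For a fixed dynamic range, each additional bit roughly halves the spacing between levels, which is one way to see why a shrinking dynamic range and a limited number of distinguishable levels both constrain the achievable weight resolution.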
IV. REVIEW OF THE RELIABILITY STUDY
In this section, we focus on state-of-the-art and representative works on the reliability of analog RSMs for neuromorphic computing. The studies of the basic and functional reliability metrics are reviewed, covering the measurement and characterization methods, landmark results, physical mechanisms, and optimization methods. We also review some typical works on the reliability degradation mechanisms and optimization methods of binary RSMs because these mechanisms are correlated with those of analog RSMs and can inspire in-depth studies of analog switching reliability in the future.
A. Basic reliability metrics study
1. Retention
Retention is a metric that evaluates how long the device can maintain its conductance value. For binary RSM devices, the widely accepted standard for reasonable retention is more than 10 years at 85 °C. Since testing for such a long time is unrealistic, the typical method is to accelerate the resistance drift at high temperature. The Arrhenius equation is then used to project the retention time at the desired temperature from the experimental results at the measured temperature.90 For analog RSM devices, the goal is a long retention time with the conductance of multiple resistance levels kept unchanged, together with the ability to predict the conductance distribution at variable temperatures and times. The measurement method is similar to the binary retention test, except that it uses fine sampling to explore the evolution of the conductance of the various levels.
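The Arrhenius projection mentioned above can be sketched as follows. This is a minimal illustration that assumes a single thermally activated failure mechanism with a known activation energy; the function names and the example activation energy of 1.0 eV are hypothetical and do not come from the cited measurements.

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def acceleration_factor(ea_ev, t_use_c, t_test_c):
    """Arrhenius acceleration factor between a bake temperature and a use
    temperature (both in degrees Celsius), assuming one activation energy."""
    t_use = t_use_c + 273.15
    t_test = t_test_c + 273.15
    return math.exp(ea_ev / K_B_EV * (1.0 / t_use - 1.0 / t_test))


def projected_retention_s(t_fail_test_s, ea_ev, t_use_c, t_test_c):
    """Project the retention time at the use temperature from the failure
    time observed at the elevated test temperature."""
    return t_fail_test_s * acceleration_factor(ea_ev, t_use_c, t_test_c)


if __name__ == "__main__":
    # Hypothetical example: failure observed after 3000 h at 150 C,
    # assumed activation energy 1.0 eV, projected to 85 C.
    seconds = projected_retention_s(3000 * 3600, 1.0, 85.0, 150.0)
    print("projected retention (years):", seconds / (365.25 * 24 * 3600))
```

In practice the activation energy is itself extracted from bakes at several temperatures, so the projection is only as reliable as that fit and the assumption that the same mechanism dominates at both temperatures.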
In the early exploration of memory applications, many works were devoted to the retention behavior of binary RSMs. Wei et al.91 showed the stable window-retention properties of Ta2O5-based anion-type binary RRAM in an 8-kbit array for approximately 3000 h at 150 °C. Rizzi et al.92,93 studied the retention statistics in a 1-Gb binary PCM array with Ge2Sb2Te5 chalcogenide for 10⁵ s at 160 °C. For increasing conductance levels, the mean value decreases and the relative spread increases within a population of 16k PCM cells for calculations repeated for 100 cycles.93 With the development of analog RSMs, the goal of research on retention ability is not only a long retention lifetime but also a tight conductance distribution of each level over time. Zhao et al.71 presented a statistical study of analog filamentary RRAM retention. Further work also found that the conductance distribution of each analog level followed a normal distribution at 175 °C.94 After baking for 12 000 s at 3 μA, the standard deviation was about 0.50 μA. Lin et al.62 also performed a statistical retention test in a 4-level 1 Mb 1T1R tungsten-oxide RRAM array for 2 × 10⁵ s at 150 °C. The evolution of the conductance distribution of each level with baking time was also shown. Stanisavljevic et al.95 provided statistical experimental characteristics of analog PCM with conductance drift and elevated temperatures. Over a retention time of 10⁵ s, all the 4 levels were presented by using the eM-metric at 80 °C with little overlap between each state.
The physical mechanisms of retention failure differ owing to different device structures, material stacks, and switching types. The stochastic diffusion of ions or oxygen vacancies (Vo) could result in retention degradation. Wang et al.38 directly observed the rupture and connection of CF through Ag nanoparticle migration using in situ TEM. The paper provided a direct characterization of the microscopic origin of resistive switching and the diffusion process in a cation-based RSM device. To explain complex scenarios, several simulation models were proposed based on physical principles and probability analysis, such as Monte Carlo simulation. Zhao et al.96 developed a physical compact model to explain the retention degradation of cation-type analog RRAM, as shown in Fig. 7(a). The retention degradation relied on the diffusion of metal atoms toward the lower concentration region resulting in the rupture (HRS) and connection (LRS) of the percolation path. Chen et al.97 investigated two possible physical mechanisms for retention failure of anion-type binary RRAM, as shown in Fig. 7(b). It was found that the diffusion and recombination of mobile O2− and Vo determined the change in filament morphology and further caused retention degradation. In PCM, retention degradation usually occurs in the amorphous state, resulting in tail bits and threshold voltage changes.98 Russo et al.99 proposed that the retention failure of binary PCM was derived from the spontaneous and thermally activated crystallization of the programed amorphous chalcogenide, as shown in Figs. 7(c)–7(e). A physical model was developed to obtain the resistance evolution with time in a single device and the statistical retention failure in a whole array by extracting the crystallization parameters (both geometry and electrical properties). It provided a valuable basis for the statistical prediction of PCM retention performance.
Schematics of retention degradation. (a) Retention degradation process of cation-type RRAM. In LRS, the metal atoms in CF diffuse gradually toward the low atom concentration region with the increasing time, which may result in a broken current path in the expansion region. By contrast, diffusive atoms enter the rupture region (RR) corresponding to the increasing conductance (HRS). Reprinted with permission from Zhao et al., IEEE Electron Device Lett. 40, 647 (2019). Copyright 2019 IEEE. (b) Two possible retention degradation mechanisms in anion-type RRAM. Oxygen scavenged by the Hf cap layer diffuses back into HfO2 and recombines with Vo in the filament; Vo diffusion out and dissolution of the filament. Reprinted with permission from Chen et al., in IEEE International Electron Devices Meeting (IEDM) (2013), p. 10.1.1. Copyright 2013 IEEE. (c) and (d) Retention degradation process with the morphology change with the increasing time. (c) Simulation results of phase and resistive maps with different crystalline fractions. Light gray represents the crystalline elements, whereas the amorphous phase is in black. (d) Corresponding current maps with the morphology above. Lighter gray represents a higher current density. (e) Experimental results of resistance vs retention time at 210 °C. The inset shows the mixed-phase structure of short (left) and long (right) baking times. Reprinted with permission from Russo et al., IEEE Trans. Electron Devices 53, 3032 (2006), Copyright 2006 IEEE.
To mitigate retention degradation, feasible solutions have been provided regarding the process technology, innovative materials and structures of RSM devices, and programming schemes. Chen et al.97 demonstrated a fabrication process improvement for HfO2/Hf binary RRAM by adding an annealing step for the full stack after cell patterning in the process flow. The HfO2 intermixed with Hf under the thermal effect, which caused the mobile oxygen to combine with the Hf. In this case, an HfO2 interface layer was formed to slow the oxygen movement100 and further mitigated the retention degradation. In addition, higher forming energy, provided by sources such as large current, long pulse width, or high temperature, resulted in better retention due to stronger CF formation. Huang et al.101 developed a 1-kbit array based on HfO2 anion-type binary RRAM for retention optimization. It was found that an oxygen anneal after HfO2 atomic layer deposition yielded a significant improvement in retention and uniformity. After Al2O3 was mixed into the HfO2 layer, the tail bits in retention failure were suppressed, especially for HRS, which provided a method of improving the tail-bit retention.102 Moon et al.103 adopted a Mo electrode to control the redox reaction at the interface and obtained good uniformity and retention characteristics of nonfilamentary analog RRAM based on Mo/Pr0.7Ca0.3MnO3 (PCMO) by increasing the activation energy for oxygen migration. 
They further reported another material improvement by inserting an MnOx buffer layer to realize bidirectional analog switching in nonfilamentary RRAM based on Al/Mo/PCMO with better retention and dynamic ranges.104 Several previous works reported improvements in the retention lifetime and performance of multilevel PCM-based systems through the trade-off between retention and write latency.105,106 A longer write can achieve better retention and high precision owing to its tolerance of conductance drift, but it also results in a longer write latency, whereas a shorter write scheme causes a reduced retention time and a larger number of refresh operations. Zhang et al.107 proposed a region retention monitor (RRM) to balance write latency and retention by automatically identifying the hot accesses of each device and dynamically assigning the proper write schemes. Their further work demonstrated a lightweight scheme (called quick-and-dirty) that improved performance by 30.9% with a retention lifetime of 7.85 years on the geometric mean on a 2-bit PCM chip.108
2. Endurance
Endurance is expressed as the maximum weight update number during the training process. In the endurance measurement, the conductance switches between several levels with alternating SET and RESET programming pulses. The endurance behaviors of binary and analog RSM devices have been studied extensively. Compared to the endurance requirement of binary RSMs, analog RSMs used as synaptic devices require attention not only to the sustainable endurance cycles between any two or more conductance levels, which mimic the actual weight update process, but also to device performance degradation, such as that of some functional reliability metrics.
Various demonstrations of improved endurance in RRAM and PCM have been reported. Lee et al.19 reported an asymmetric and antiserial RRAM with Pt/Ta2O5−x/TaO2−x/Pt bilayer structures that demonstrated excellent cycling endurance of over 10¹² switching cycles in binary switching mode. It also showed that the endurance lifetime increased with the resistance of the switching window and the decreasing oxygen partial-pressure conditions. Yeh et al.109 achieved about 10⁹ programming endurance cycles and 10¹¹ read endurance cycles in the binary PCM device. However, incremental switching endurance of analog RSMs is required to satisfy the number of weight updates in computing applications. To gain an intuitive understanding, research on the conductance evolution of various endurance switching pairs has been performed. Zhao et al.61 investigated the incremental endurance behaviors of analog RRAM for neuromorphic computing. Multiple conductance switching windows at different levels showed a huge gap in the weight update times. With efficient bidirectional verification, over 10¹¹ incremental switching endurance cycles at low resistance levels were performed in a 1k analog RRAM array. In an analog PCM array, Athmanathan et al.110 demonstrated an endurance of about 10⁶ cycles of 3 bits/cell PCM with a variation of in a 64k array by combining drift-immune cell-state metrics and drift-tolerant coding and detection schemes.
To investigate the physical mechanism of endurance degradation, various approaches have been applied, including observation using high-resolution microscopy and analysis using compact models. Lee et al.19 observed the metal Ta clusters (white color) in the Ta2O5−x layer after 10⁶ cycles by high-resolution TEM [Fig. 8(a)]. Chen et al.111 reported three types of cycling endurance failure behaviors in anion-type binary RRAM [Figs. 8(b)–8(d)]. It should be noted that the oxygen reservoir is important to maintain good endurance behavior. Failure type I refers to an interfacial electron barrier induced by oxidation of the metal electrode at large power/current and high temperature. As in the aforementioned switching mechanism, the interfacial barrier limits the transport of electrons and ions and further results in the endurance degradation. The reason for failure type II is that the electric field and accompanying heat lead to excess Vo generation, enlarging the radius of filaments. In this case, the device typically fails at LRS. Failure type III refers to gradual changes of HRS. The excessive consumption of O2− after frequent cycling causes the rupture of filaments due to the decreasing recombination rate. The TEM image illustrates that the atom clusters grow after 10⁶ cycles. Switching between two arbitrary conductance levels yields different numbers of endurance cycles, resulting from different physical origins. Zhao et al.112 proposed the physical mechanism of endurance degradation of analog RRAM. The impact of switching windows with different resistance levels on the endurance lifetime was explained. The morphology of multiple weak CFs in the smaller switching windows was easily maintained after the endurance cycles. Figure 8(e) shows the endurance degradation of the PCM device. The failure was explained by two modes: SET-stuck and RESET-stuck failure.113 SET-stuck failure is caused by (1) Ge depletion due to element separation [Fig. 8(f)]114 and (2) sustained void formation in the switching region near the bottom electrode. However, it has been reported that doped Ge2Sb2Te5 (GST) could defer the appearance of clusters and improve the cycle endurance lifetime, as shown by the TEM images in Fig. 8(g).115 The RESET-stuck failure may originate from the rupture and detachment of the heating electrode.
Physical mechanism of endurance degradation of RSM. (a) TEM image of Ta metal clusters formed after cycles. Reprinted with permission from Lee et al., Nat. Mater. 10, 625 (2011). Copyright 2011 Macmillan Publishers. (b)–(d) Schematic of the endurance failure mechanism of RRAM. Three failure types of endurance degradation are illustrated. Reprinted with permission from Chen et al., in IEEE International Electron Devices Meeting (IEDM) (2011), p. 12.3.1. Copyright 2011 IEEE. (e) The endurance failure mechanism of PCM. Schematic of SET-stuck failure and RESET-stuck failure. Reprinted with permission from Tavana et al., in Proceedings of the International Symposium on Memory Systems (2017), p. 385. Copyright 2017 ACM. SET-stuck failure includes two main failure modes: (f) EDX (energy-dispersive X-ray spectroscopy) images after 1000 cycles, Ge depletion. Reprinted with permission from Raoux et al., Microelectron. Eng. 85, 2330 (2008). Copyright 2008 Elsevier. (g) TEM image of undoped and doped GST. Reprinted with permission from Chen et al., in IEEE International Memory Workshop (2009), p. 1. Copyright 2009 IEEE.
Based on the analysis of the physical mechanisms, the endurance optimization methods are summarized in terms of material/structure selection, programming schemes, and circuit design. Chen et al.116 proposed that the doping effect of Ti, Si, and Al in HfO2 binary RRAM influences the cycling endurance lifetime. A single-pulse endurance of about 10⁹ cycles was obtained because the dopants influenced the formation of the oxygen exchange layer. Grossi et al.117 proposed an end-to-end approach combining the programming scheme and system resilience techniques to overcome endurance limits and temporary bit error rates for deep learning applications. They set an upper limit on the write number in each remapping period to reduce the write number of each RRAM device, thereby decreasing the possibility of irreversible breakdown/dissolution of CFs. Yamaga et al.118 proposed the highly reliable approximate-RRAM to implement real-time image recognition with pixel-to-pixel data matching (P2P-DM) and interpixel error-correction code (ECC). In this case, compared to the Bose–Chaudhuri–Hocquenghem (BCH) ECC, the acceptable retention time and the endurance of the most significant bits (7th bit), which suffered from relatively serious errors, improved by 5× and 3.3×, respectively. Recent advances have also been made to overcome the limited endurance of multilevel cell (MLC) PCM. Pan et al.119 developed a write operation selection algorithm and task scheduling to improve the endurance and energy efficiency of MLC PCM-based systems.
3. Write/read disturb
Write disturb refers to a location-dependent write error of unselected devices after certain programming pulses, especially in large-scale crossbar arrays.27 The Vdd/2 scheme is the typical solution for write disturb, where the unselected WLs and BLs are biased at half the voltage in the programming process [Fig. 9(a)].120 If the half voltage can drive some half-selected devices to wrong resistance values, a write disturb occurs. Moreover, the write disturb of PCM usually occurs in the reset operation when the neighboring PCM cells are in the amorphous state because of thermal cross talk, and the problem becomes more serious in sub-20-nm technology.121 The classical approach to avoiding write disturb is to allocate a large intercell space122 and adopt a strong ECC.123 In contrast, read disturb is a time-dependent current error of selected devices, affected by the programming and endurance.124 It has been reported that the read disturb issue inevitably exists in any circuit regime. As with the reliability concerns of retention degradation, the focus differs between applications: the largest write or read cycle number of the RSM device is the most important concern in the memory storage scenario, whereas in neuromorphic computing, the conductance fluctuation with increasing write/read pulses deserves close attention because the accumulated effect of read disturb accelerates failure and reduces the learning accuracy.
Write disturbance of (a)–(c) RRAM and (d) and (e) PCM. (a) Schematic diagram of the RRAM crossbar array with the 1/2 voltage scheme. (b) The physical mechanism of HRS disturbance of RRAM, incorrect resistive switching from HRS to LRS. (c) LRS disturbance, incorrect resistive switching from LRS to HRS. Reprinted with permission from Li et al., in IEEE International Reliability Physics Symposium (IRPS) (2014), p. MY.3.1. Copyright 2014 IEEE. (d) Calculated temperature maps of write disturbance within two adjacent PCM cells at 45-nm technology node. The left cell was being programed, and the right cell was disturbed by the increasing temperature, which was originally in the reset state (0). Reprinted with permission from Russo et al., IEEE Trans. Electron Devices 55, 515 (2008). Copyright 2008 IEEE. (e) Vulnerable cells are colored in red. Reprinted with permission from Jiang et al., in IEEE/IFIP International Conference on Dependable Systems and Networks (2014), p. 216. Copyright 2014 IEEE.
The physical mechanism of write/read disturb of RSM devices has been described through formulas derived from experimental results and through physical models covering various device materials and structures. The disturbance immunity of RRAM comes from the stability of ions in the switching region. Wang et al.125 explained the reason for the better disturbance immunity of CuxSiyO-based binary RRAM compared with CuxO. Higher activation energy suppressed the copper vacancy migration and reduced the probability of write and read disturbance. Using a physical model, Li et al.27 explained the physical mechanism of write disturbance in binary RRAM, as shown in Figs. 9(b) and 9(c). In HRS disturbances, under the electric field and thermal effect, the broken CFs grew as the generated Vo moved along the electric field, shortening the gap and disturbing the resistance state. In LRS disturbances, Vo escaped from the CF region under the thermal and electric fields, resulting in gap formation and a resistance shift. Figure 9(d) demonstrates the scenario of the write disturbance between two adjacent binary PCM cells under the thermal diffusion caused by the RESET programming pulse.126 In particular, the disturbance only occurred when the neighboring cell was in the RESET state (storing “0”), and there was basically no disturbing influence on a cell in the SET state (storing “1”), as shown in Fig. 9(e). When the left cell was programed at a high temperature, the heat spread horizontally to reach the right cell. The temperature of bit 2 caused crystallization but not melting, and thus, the cell in the RESET state turned into the crystalline state.
Chen et al.28 proposed an optimization method to precisely control the cell-location-dependent selected cells and avoid disturbance of unselected cells. It was found that inserting a thin AlOx buffer layer under the resistance switching layer (HfO2) can improve the tolerance to read disturbances in the binary RRAM array. Li et al.27 proposed that the Vdd/3 scheme showed much better data preservation ability than the Vdd/2 scheme after a certain number of write pulses. This was because a lower programming voltage was applied to the unselected devices in the crossbar RRAM array. Wang et al.127 proposed a fine-grained write method to mitigate the write disturbance by utilizing the imbalanced distribution in the binary PCM array. The imbalance referred to the fact that only a few cell groups played a decisive role in the performance degradation based on the programming regime in memory. However, few works have been reported on the physical mechanism and optimization methods of the write/read disturb of analog RSMs, and this topic remains to be explored in depth.
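The difference between the Vdd/2 and Vdd/3 schemes can be made concrete with a small sketch that computes the voltage across every cell of an ideal crossbar during a write. The line biases below follow the conventional textbook assignment for the two schemes, which is an assumption here rather than a detail given in the cited works; wire resistance and sneak currents are ignored, and the function names are hypothetical.

```python
def cell_voltages(n_rows, n_cols, sel_row, sel_col, v_write, scheme="V/2"):
    """Voltage across each cell (word line minus bit line) in an ideal
    crossbar while the cell at (sel_row, sel_col) is being written."""
    if scheme == "V/2":
        # Selected WL at V, selected BL at 0, all other lines at V/2.
        wl_sel, wl_unsel = v_write, v_write / 2
        bl_sel, bl_unsel = 0.0, v_write / 2
    elif scheme == "V/3":
        # Selected WL at V, selected BL at 0,
        # unselected WLs at V/3, unselected BLs at 2V/3.
        wl_sel, wl_unsel = v_write, v_write / 3
        bl_sel, bl_unsel = 0.0, 2 * v_write / 3
    else:
        raise ValueError("scheme must be 'V/2' or 'V/3'")
    grid = []
    for r in range(n_rows):
        wl = wl_sel if r == sel_row else wl_unsel
        grid.append([wl - (bl_sel if c == sel_col else bl_unsel)
                     for c in range(n_cols)])
    return grid


def worst_case_disturb(grid, sel_row, sel_col):
    """Largest voltage magnitude seen by any cell other than the selected one."""
    return max(abs(v) for r, row in enumerate(grid)
               for c, v in enumerate(row) if (r, c) != (sel_row, sel_col))


if __name__ == "__main__":
    for scheme in ("V/2", "V/3"):
        g = cell_voltages(4, 4, 1, 2, 3.0, scheme)
        print(scheme, "worst-case disturb voltage:", worst_case_disturb(g, 1, 2))
```

Under these assumptions the V/3 scheme lowers the worst-case stress on non-selected cells from half to one third of the write voltage, at the cost of applying a nonzero voltage to every cell in the array rather than only to the half-selected ones.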
4. Noise
Read noise is classified into three types: thermal noise,128 1/f^α-like noise,67,129 and random telegraph noise (RTN).68 As the name suggests, thermal noise derives from the carrier movement or ion migration due to the voltage-induced thermal effect. 1/f^α-like noise (α ∼ 1 for LRS and α ∼ 2 for HRS) refers to a kind of low-frequency current fluctuation. RTN is a dominant pattern of low-frequency noise (LFN) with the conductance oscillating between two states, originating from the filling or emptying of one or more traps. RTN determines the read-disturb immunity and bit precision in the analog RSM.130 It has been shown that a certain weight standard deviation due to noise results in accuracy loss in neuromorphic computing.70
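The two-state character of RTN described above can be illustrated with a minimal discrete-time sketch. It assumes a single trap that is captured or emitted with fixed per-sample probabilities and that lowers the read conductance by a fixed amount when filled; all names and parameter values are hypothetical and are not fitted to any device in the cited works.

```python
import random


def simulate_rtn(n_samples, g_base, delta_g, p_capture, p_emit, seed=0):
    """Generate a two-level conductance trace from a single trap.
    p_capture: per-sample probability that an empty trap fills.
    p_emit:    per-sample probability that a filled trap empties.
    A filled trap is assumed to reduce the conductance by delta_g."""
    rng = random.Random(seed)
    trapped = False
    trace = []
    for _ in range(n_samples):
        if trapped:
            if rng.random() < p_emit:
                trapped = False
        elif rng.random() < p_capture:
            trapped = True
        trace.append(g_base - delta_g if trapped else g_base)
    return trace


if __name__ == "__main__":
    trace = simulate_rtn(2000, g_base=10.0, delta_g=1.0,
                         p_capture=0.02, p_emit=0.05)
    occupancy = sum(1 for g in trace if g < 10.0) / len(trace)
    print("fraction of samples in the low-conductance state:", occupancy)
```

When delta_g becomes comparable to the spacing between adjacent conductance levels, a single read can return the wrong level, which is one way to picture how RTN limits the usable bit precision.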
The physical mechanisms of the various noises mentioned above have been elucidated for RRAM and PCM. Huang et al.131 provided the phenomenon and physical explanation of RTN in binary RRAM and presented a triangular programming pulse scheme to suppress the tail bits. Similar works on the physical mechanism of noise were reported on estimating the diameter of CFs,132 establishing the electron tunneling mechanism for the filamentary conductive process based on LFN behavior,67 detecting the Vo count and its properties (activation and deactivation) in the filament region,131 and so on. Based on this analysis, several optimization methods were developed to mitigate the impact of noise or, conversely, to utilize the noise as a resource for special applications. To reduce the 1/f noise, Kim et al.133 utilized a metal nitride liner in a multilevel PCM device to provide another conductive path in the amorphous region, which was proven to be the dominant source of large noise.134 It was found that a more than fourfold noise reduction can improve the multilevel performance and program-and-verify ability. Giannopoulos et al. demonstrated an 8-bit projected analog PCM. It showed remarkable immunity to 1/f noise with the introduction of the noninsulating projected segment in parallel to the phase-change segment. Noise can also be utilized to carry out probabilistic inference through sampling in the neural network135 and has been used to solve several problems. Lin et al.136 proposed a generative adversarial network (GAN) based on an analog RRAM array by utilizing the intrinsic noise as inputs to diversify the generated outputs. Cai et al.137 demonstrated improved speed and energy efficiency by using the intrinsic analog noise as the computing resource for an RRAM-based Hopfield network.
B. Functional reliability metrics study
1. Nonlinearity
Nonlinearity, which has attracted considerable attention, can be traced back to the analog RSM device acting as the electrical synapse in the neural network. It describes how the rate of conductance change varies with the number of voltage pulses; large nonlinearity degrades the training accuracy, as shown in Fig. 6(d). In addition, large nonlinearity leads to complex weight modulation and high energy and time costs in the training process. Therefore, it is necessary to reduce the nonlinearity of analog RSM for higher accuracy.
Based on the device characterization and the understanding of the physical mechanism of resistive switching, several representative works have explored novel materials and structures of RSM devices to improve linearity. Wu et al.138 introduced a methodology to improve the linearity of analog filamentary RRAM for both the SET and RESET processes by inserting an electrothermal modulation layer (ETML) over the switching layer (HfOx). The ETML was reported not only to control the distribution of the electric field, suppressing the change in the electric field in the filament region for RESET linearity, but also to control the thermal distribution, making the Vo distribution uniform for SET linearity. Chandrasekaran et al.139 introduced Al dopants to reduce the nonlinearity of the HfO2-based analog RRAM. The uneven doping method resulted in oxygen-rich and oxygen-poor regions in the switching layer to confine the filament formation, which decreased the nonlinearity by 14% in potentiation and 31% in depression, respectively. Moon et al.63 designed a 1T2R structure with the nonfilamentary analog RRAM as a synapse device to achieve linear conductance changes. With an additional serial-connected resistor for voltage division, the identical programming pulses can be converted to incremental pulses to improve the linearity. Besides, some representative works on programming schemes have been reported to improve the linearity.140 Chen et al.82 proposed a smart programming scheme for linear weight update with a pair of positive and negative pulses to mitigate the overshoot effect of the previous pulse. Furthermore, the pulse duration was controlled to vary with the conductance levels to slow the weight update at the beginning of depression and potentiation, although this inevitably burdened the peripheral circuitry. Cai et al.48 mitigated the device I-V nonlinearity by pulse modulation and a custom analog-to-digital converter (ADC) in a fully functional, hybrid memristor chip to reduce the multiplication error.
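The dependence of conductance on pulse number can be sketched with a simple saturating model. The exponential form below is a commonly used behavioral assumption, not a model given in this section; the curvature parameter `a`, the function names, and the deviation-based nonlinearity metric are likewise illustrative choices.

```python
import math

def conductance_curve(n_pulses, g_min, g_max, a):
    """Conductance after 0..n_pulses identical pulses (assumed exponential saturation).

    a sets the curvature: small a gives strong saturation, large a is nearly linear.
    """
    b = (g_max - g_min) / (1.0 - math.exp(-n_pulses / a))
    return [g_min + b * (1.0 - math.exp(-p / a)) for p in range(n_pulses + 1)]

def nonlinearity(curve):
    """Largest deviation from the straight line joining the end points,
    normalised to the conductance window (0 means perfectly linear)."""
    n = len(curve) - 1
    g0, g1 = curve[0], curve[-1]
    worst = max(abs(g - (g0 + (g1 - g0) * p / n)) for p, g in enumerate(curve))
    return worst / abs(g1 - g0)

for a in (10.0, 50.0, 500.0):
    curve = conductance_curve(100, g_min=1.0, g_max=10.0, a=a)
    print(f"a = {a:5.0f}  nonlinearity = {nonlinearity(curve):.3f}")
```

With identical pulses, a strongly saturating curve spends most of its conductance change in the first few pulses, which is one way to see why incremental pulse schemes or series resistors are used to linearize the update.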
2. Asymmetry
Unlike nonlinearity, asymmetry indicates the degree of difference in the conductance change at a given conductance level between the potentiation and depression stages, as shown in Fig. 6(e). Similar to the optimization method for nonlinearity, Li et al.54 developed a two-pulse conductance programming scheme to achieve linear and symmetric tunable analog behavior. Based on linear and symmetric weight updates, an accuracy only 2.4% lower than the ideal value was achieved in in situ training. Lee et al.141 utilized a fixed resistor connected to the analog RRAM device in series to implement compensational voltage division. In this case, the asymmetric conductance changes can be improved by controlling the induced oxide to form smoothly at the interface under identical pulse bias. Consequently, the optimized asymmetry contributed to a significant improvement in recognition accuracy from 30% to 96%. Asymmetry usually appears together with nonlinearity, and the two jointly cause accuracy loss. Ambrogio et al.51 demonstrated software-equivalent DNN accuracy using an analog memory unit of 2PCM+3T1C (“3 transistors, 1 capacitor”) devices. Compared to PCM, lower asymmetry can be obtained, enabling high training efficiency. Haensch142 decoupled asymmetry from nonlinearity, arguing that bidirectional devices used as synapses do not rely on linearity; rather, a symmetric response, i.e., mirror-image incremental conductance changes in the SET and RESET processes, is what is required.
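The distinction between asymmetry and nonlinearity can be made concrete with a small sketch. The normalized mismatch metric and the two update models below (a state-dependent "soft-bounds" update and a constant-step update) are hypothetical illustrations, not definitions from the cited works.

```python
def asymmetry(step_up, step_down, g):
    """Normalised mismatch between the potentiation and depression steps at level g.

    0 means both steps have equal magnitude; 1 means one of them vanishes.
    Assumes at least one of the two steps is nonzero at g.
    """
    up, down = abs(step_up(g)), abs(step_down(g))
    return abs(up - down) / (up + down)

G_MIN, G_MAX, RATE = 1.0, 10.0, 0.05   # illustrative values only

def soft_up(g):
    # Potentiation step shrinks as g approaches G_MAX (nonlinear update).
    return RATE * (G_MAX - g)

def soft_down(g):
    # Depression step shrinks as g approaches G_MIN (nonlinear update).
    return -RATE * (g - G_MIN)

def const_up(g):
    return 0.1      # linear update, same step at every level

def const_down(g):
    return -0.1

for g in (2.0, 5.5, 9.0):
    print(g, round(asymmetry(soft_up, soft_down, g), 3),
          asymmetry(const_up, const_down, g))
```

In this toy model the state-dependent device is symmetric only at the middle of its window, whereas the constant-step device is symmetric at every level, which illustrates why the two metrics are usually reported separately.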
3. Dynamic range
The dynamic range is the on/off ratio between the highest and lowest conductance values. A larger dynamic range can provide higher precision and better weight mapping ability,34 and thus higher accuracy. To increase the dynamic range, some innovative device structures have been developed. Moon et al.63 demonstrated an excellent dynamic range of more than 100 based on a 1T2R analog RRAM device. To achieve a large dynamic range, a parallel connection of an RRAM and a transistor forced the transistor to operate in the steep subthreshold region of the MOSFET. Then, a small voltage change in the RRAM induced a large shift in the drain current by controlling the gate voltage bias. Choi et al.143 developed a transistor-free SiGe epiRAM to control the formation of metal filaments in a customized channel. They claimed that the confined CFs dramatically enhanced uniformity and reliability with a large dynamic range, resulting in a high online learning accuracy of 95.1%. Ambrogio et al.51 introduced a 2PCM+3T1C unit cell with an increased dynamic range by applying different scale factors to the read current for two pairs of conductances. The designed unit cell can also improve the update symmetry, contributing to good training accuracy.
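A minimal sketch of why a wider conductance window helps weight mapping is given below. It assumes a linear weight-to-conductance map and a fixed absolute read error `sigma_g`; both assumptions, and all function names and numbers, are illustrative rather than taken from the cited works.

```python
def on_off_ratio(g_min, g_max):
    """Dynamic range expressed as the ratio of highest to lowest conductance."""
    return g_max / g_min

def weight_to_conductance(w, w_min, w_max, g_min, g_max):
    """Linear map of a weight in [w_min, w_max] onto the window [g_min, g_max]."""
    return g_min + (w - w_min) * (g_max - g_min) / (w_max - w_min)

def conductance_to_weight(g, w_min, w_max, g_min, g_max):
    """Inverse of weight_to_conductance."""
    return w_min + (g - g_min) * (w_max - w_min) / (g_max - g_min)

def weight_error(sigma_g, w_min, w_max, g_min, g_max):
    """Weight uncertainty produced by an absolute conductance error sigma_g."""
    return sigma_g * (w_max - w_min) / (g_max - g_min)

# Same lowest conductance and same read error, two different on/off ratios.
for g_max in (10.0, 100.0):
    err = weight_error(0.1, -1.0, 1.0, 1.0, g_max)
    print(f"on/off = {on_off_ratio(1.0, g_max):5.0f}  weight error = {err:.4f}")
```

Under these assumptions the same conductance error translates into a smaller weight error when the window is wider; if the noise instead scales with conductance, the benefit is smaller.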
4. Precision
Precision refers to the maximum number of weight bits achievable in the full switching window. High precision is required for both ex situ and in situ training in neuromorphic computing. To explain how high precision arises, Gao et al.144 investigated the physical mechanism of abrupt and analog switching using kinetic Monte Carlo simulation. It was suggested that achieving high precision in analog RSMs requires avoiding the formation of strong CFs. Several optimization methods were explored and developed to obtain high precision in RRAM and PCM, taking into account their physical principles. Stathopoulos et al.145 demonstrated a bilayer 2-terminal metal-insulator-metal (MIM) structure of the analog nonfilamentary RRAM device with up to 6.5 bits capacity based on an AlxOy/TiO2 stack. The key to high precision lies in introducing a thin interfacial barrier layer between the active layer and one electrode, which benefits the device stability146 and increases the number of conductance levels, although the conductance variations and noise of the RSM devices seriously limit the achievable precision. Giannopoulos et al.147 reported the projected analog PCM with 8-bit precision. A simple temperature compensation method for the PCM device was developed to correct for the temperature variation and noise. In this case, 100% classification accuracy was observed in a single-layer neural network using a crossbar with 30 projected PCM devices.
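The relation between bit precision and the number of distinguishable conductance levels can be sketched as follows. The helpers assume equally spaced levels across the switching window, which is an idealization; the function names and example values are illustrative.

```python
import math

def levels_from_bits(bits):
    """Number of levels implied by a (possibly fractional) bit precision."""
    return 2.0 ** bits

def bits_from_levels(levels):
    """Bit precision implied by a number of distinguishable levels."""
    return math.log2(levels)

def quantize(g, g_min, g_max, n_levels):
    """Snap g to the nearest of n_levels equally spaced conductance states."""
    step = (g_max - g_min) / (n_levels - 1)
    index = round((g - g_min) / step)
    index = min(max(index, 0), n_levels - 1)   # clamp to the switching window
    return g_min + index * step

print(levels_from_bits(6.5))   # about 90.5 levels for 6.5 bits
print(levels_from_bits(8))     # 256 levels for 8 bits
print(quantize(5.3, 0.0, 10.0, 11))
```

Under this idealization the worst-case quantization error is half a level spacing, so conductance variation and noise must stay below that spacing for the nominal bit count to be usable.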
5. Variation
Different from noise and fluctuation, variation here emphasizes the spatial variation from device to device and the temporal variation from pulse to pulse.82 Broadly speaking, variation should include the variation in functional reliability metrics, such as linearity, symmetry, dynamic range, and precision. From the perspective of physical mechanism, variation is an intrinsic instability due to stochastic ion migration. Chen et al.82 showed that the accuracy loss remains tolerable in in situ learning with ∼30% device variation. The impact of pulse-to-pulse variation on learning accuracy should be taken seriously: approximately 22% temporal variation corresponds to less than 90% learning accuracy even with the best linearity, based on the simulation results (Fig. 10). To reduce the influence of variation, Prezioso et al.21 varied the titanium dioxide compositions and layer thickness in an experimental search for the optimal parameter range in RRAM stacks to realize low device variability. Besides, Montano and Cheng148 utilized the resistance ratio to encode information using two series RRAM cells connected with a transistor. Gao et al.149 proposed a three-dimensional vertical structure with several parallel RRAM devices on the same nanopillar to suppress the intrinsic variation. In this case, the recognition accuracy was improved from 65% to 90%, based on the simulation results. Alibart et al.76 designed a simple feedback algorithm to reduce the variation by tuning the resistance state to within 1% relative accuracy of the dynamic range.
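The interplay of spatial variation, temporal variation, and feedback tuning can be sketched with a toy device model. The class `NoisyDevice`, the Gaussian step statistics, and the simple write-verify loop `tune` are assumptions made for illustration; this is not the algorithm of Alibart et al., only a generic program-and-verify loop with a 1% tolerance.

```python
import random

class NoisyDevice:
    """Toy analog cell with device-to-device and pulse-to-pulse variation."""

    def __init__(self, rng, g_min=1.0, g_max=10.0, mean_step=0.05,
                 spatial_sigma=0.3, temporal_sigma=0.3):
        self.rng = rng
        self.g_min, self.g_max = g_min, g_max
        # Spatial variation: every device draws its own nominal step size once.
        # The lower clamp keeps the step from collapsing to zero.
        drawn = rng.gauss(mean_step, spatial_sigma * mean_step)
        self.step = max(0.2 * mean_step, drawn)
        self.temporal_sigma = temporal_sigma
        self.g = g_min

    def pulse(self, direction):
        # Temporal variation: every pulse draws a fresh step around the nominal one.
        dg = self.rng.gauss(self.step, self.temporal_sigma * self.step)
        self.g = min(self.g_max, max(self.g_min, self.g + direction * dg))
        return self.g

def tune(device, target, rel_tol=0.01, max_pulses=10_000):
    """Write-verify loop: pulse up or down until g is within rel_tol of the window."""
    tol = rel_tol * (device.g_max - device.g_min)
    for n in range(max_pulses):
        error = target - device.g
        if abs(error) <= tol:
            return n                      # number of pulses that were applied
        device.pulse(+1 if error > 0 else -1)
    return None                           # did not converge

rng = random.Random(1)
devices = [NoisyDevice(rng) for _ in range(20)]
counts = [tune(d, target=5.5) for d in devices]
print("pulses needed:", min(counts), "to", max(counts))
print("final spread:", max(d.g for d in devices) - min(d.g for d in devices))
```

In this sketch the number of pulses differs from device to device because of the spatial spread in step size, while the verify step bounds the final conductance error regardless of the pulse-to-pulse noise.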
Device variation and its impact on accuracy. (a) Illustration of spatial variation and temporal variation in the weight update process. Different devices show slight nonlinear differences due to spatial variation. Temporal variation refers to the fluctuation of conductance with the incremental pulses of one device. (b) and (c) Recognition accuracy as a function of the standard deviation of device variation. The curves of different colors and shapes represent the nonlinearity baseline from (0, 0) to (6, −6) of long-term potentiation (LTP) and long-term depression (LTD). Compared to the spatial effect, the impact of the temporal effect on recognition accuracy is more critical. Reprinted with permission from Chen et al., in IEEE/ACM International Conference on Computer-Aided Design (ICCAD) (2015), p. 3-A.3.194. Copyright 2015 IEEE.
6. Bit yield
The bit yield represents the proportion of cells with normal analog resistive switching ability relative to the total number of devices in one array. It is determined by the fabrication and integration process and by undesired device reliability degradation. Tran et al.150 reported an ultrahigh yield (almost 100%) on a 6-in. wafer based on binary HfOx-RRAM devices with Si-diode selectors. However, an acceptable yield of large-scale RSM arrays is not easy to obtain owing to the immature fabrication technology and undesired reliability degradation. Since device-level optimization alone can hardly increase the bit yield substantially, some works have proposed creative write/read strategies and algorithms. Shih et al.151 improved the yield of 128-kb binary HfO2-RRAM circuits from 38.01% to 93.96% by addressing the overforming problem based on the training sequence. Xia et al.152 presented a fault-tolerance framework to reduce the number of stuck-at-fault RRAM devices. Specifically, a mapping algorithm was proposed that contributed to improved recognition accuracy on MNIST. Furthermore, hardware-level schemes combined with algorithm-level methods were explored to optimize the fault tolerance.153 Xue et al.154 developed self-adaptive write/read modes, improving the read bit yield of a 0.13-μm 8-Mb CuxSiyO binary RRAM macro from 98% to 100% at 125 °C. Although several yield optimizations have been proposed, improvements in the yield of large-scale analog RSM arrays for neuromorphic computing are still required. In addition, few publications have addressed the physical mechanism behind the yield of an analog RSM array, which deserves further exploration.
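The definition of bit yield can be written as a short calculation. The pass criterion used here (a minimum conductance window per cell) is only one possible, hypothetical reading of "normal analog resistive switching ability", and the measured values are made-up examples.

```python
def bit_yield(cells, min_window):
    """Fraction of cells whose window g_max / g_min reaches min_window.

    cells      : list of (g_min, g_max) pairs measured per cell
    min_window : smallest on/off ratio accepted as a working cell (assumed criterion)
    """
    good = sum(1 for g_min, g_max in cells
               if g_min > 0 and g_max / g_min >= min_window)
    return good / len(cells)

# Made-up measurements: two cells are stuck (no window), one has a small window.
measured = [(1.0, 12.0), (1.1, 10.5), (5.0, 5.0), (0.9, 11.0),
            (10.0, 10.0), (1.0, 3.0), (1.2, 14.0), (1.0, 10.0)]
print(f"bit yield = {bit_yield(measured, min_window=5.0):.1%}")
```

A stricter criterion, for example one that also checks linearity or the number of reachable levels, would lower the yield computed from the same array.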
V. SUMMARY AND OUTLOOK
In the past few years, neuromorphic computing based on the emerging analog RSM has made notable progress. However, the research on the reliability of analog RSMs for neuromorphic computing still faces serious challenges in three aspects: (1) the reliability concerns and characterization methods of analog RSM devices are quite different, and well-accepted evaluation criteria are still lacking; (2) because of the complexity of the physical mechanism of analog resistive switching, mechanism studies for device reliability are difficult; (3) cross-layer codesign from the device to the system/algorithm is critical for neuromorphic computing; thus, a single-device-level study is not sufficient. For these reasons, this topic requires much effort for the reliability study of analog RSM-based neuromorphic computing.
In particular, we suggest several research directions that deserve attention in the future. First, atom-level in situ characterization of the switching mechanisms is important for the reliability study. Direct observations of the dynamics of ion migration in the active region can offer critical evidence to understand the degradation mechanisms. Second, a complete reliability evaluation is required based on statistical measurements under different temperature conditions. The investigation should focus on the tail bits of a crossbar array to capture the stochastic behaviors of reliability degradation. Different reliability metrics should be studied simultaneously to determine how each metric influences the others. New techniques that can quickly complete the statistical measurement need to be developed for this purpose. Third, the reliability evaluation should be performed in close correlation with a specific algorithm and system. At the initial stage, the mainstream AI algorithms could be considered, such as the convolutional neural network and recurrent neural network. Then, other algorithms like SNN and GAN can be considered. The reliability of 3D arrays is also important for the neuromorphic computing study. Finally, physical modeling and compact modeling of reliability degradation are also key directions for future study. Physical modeling is powerful for providing guidelines for reliability optimization. The compact model of reliability degradation must be added to future system-level simulators for performance benchmarking and circuit design.
In this review, we have summarized the significant research studies on the reliability of analog RSM devices for neuromorphic computing. The landmark works involve the cross-layer reliability analysis, the physical mechanism of device reliability, and optimization methods from the device characteristics to the algorithm and system. A set of evaluation methods of device reliability has been proposed to provide a guideline for further reliability research. Neuromorphic computing has enabled complex tasks at lower cost than the von Neumann architecture. As the problems of the large-scale integration of emerging analog RSM devices are solved, massive commercialization of neuromorphic computing chips will be realized. With their excellent computing ability, neuromorphic computing chips are expected to be widely used in various applications in the medical field, aerospace, and other areas related to human life. We expect that significant breakthroughs in reliable and energy-efficient neuromorphic computing chips based on analog RSM will be achieved in the near future.
ACKNOWLEDGMENTS
This work was supported in part by the National Key R&D Program of China (No. 2017YFB0405604), NSFC (Nos. 61851404, 61874169, 61674089, 61674092, and 61674087), National Major Research Program (No. 2017ZX02315001-005), Beijing Municipal Science and Technology Project (Nos. Z181100003218001 and Z191100007519008), and Beijing Innovation Center for Future Chips (ICFC).