Inspired by the parallelism and efficiency of the brain, several candidates for artificial synapse devices have been developed for neuromorphic computing, yet a nonlinear and asymmetric synaptic response curve precludes their use for backpropagation, the foundation of modern supervised learning. Spintronic devices—which benefit from high endurance, low power consumption, low latency, and CMOS compatibility—are a promising technology for memory, and domain-wall magnetic tunnel junction (DW-MTJ) devices have been shown to implement synaptic functions such as long-term potentiation and spike-timing dependent plasticity. In this work, we propose a notched DW-MTJ synapse as a candidate for supervised learning. Using micromagnetic simulations at room temperature, we show that notched synapses ensure the non-volatility of the synaptic weight and allow for highly linear, symmetric, and reproducible weight updates using either spin transfer torque (STT) or spin–orbit torque (SOT) mechanisms of DW propagation. We use lookup tables constructed from micromagnetics simulations to model the training of neural networks built with DW-MTJ synapses on both the MNIST and Fashion-MNIST image classification tasks. Accounting for thermal noise and realistic process variations, the DW-MTJ devices achieve classification accuracy close to ideal floating-point updates using both STT and SOT devices at room temperature and at 400 K. Our work establishes the basis for a magnetic artificial synapse that can eventually lead to hardware neural networks with fully spintronic matrix operations implementing machine learning.

In-memory computing overcomes the memory wall in von Neumann architectures, where data-intensive computation is frequently bottlenecked by slow and expensive memory accesses.^{1} Foremost among these data-driven applications is the processing of artificial neural networks, modeled after biological neurons interconnected by tunable synapses. Massively parallel analog computation within a resistive memory array, where the memory devices serve as synapses, is a promising approach to lower the energy consumption of training and deploying neural networks.^{2} Nonvolatile memory devices such as resistive random access memory (ReRAM),^{3,4} phase change memory (PCM),^{5,6} conductive bridge RAM (CBRAM),^{7} and electrochemical or polymer-based memory^{8–10} have all been demonstrated to implement multi-level synaptic functionality and, in many cases, adequate cycle-to-cycle variability. However, many of these devices exhibit nonlinear and/or asymmetric responses to programming pulses, making it difficult to accurately implement the ubiquitous backpropagation algorithm for neural network training.^{11,12} These drawbacks, along with high write voltages or currents,^{5} diminish the energy benefits of nonvolatile memory-based training accelerators and limit their generalizability to complex machine learning problems.

Spintronic memory has attracted interest for its high write endurance, low power consumption, and small size. For neuromorphic applications, domain wall-magnetic tunnel junction (DW-MTJ) devices^{13,14} have previously been shown to emulate leaky integrate-and-fire (LIF) neuron functionality^{15–17} as well as long-term potentiation (LTP) and spike-timing dependent plasticity (STDP) synaptic behaviors.^{18} In contrast to two-terminal MTJs, the three-terminal DW-MTJ enables isolation of the read and write paths, contributing to reduced wear on the MgO tunnel barrier. The DW-MTJ device, shown in Fig. 1, consists of a perpendicularly magnetized ferromagnetic (FM) track containing a DW, separated from a fixed FM layer by a thin MgO barrier. Current applied between the L and R terminals propagates the DW through spin transfer torque (STT). With the inclusion of a heavy metal (HM) layer underneath the FM track, DW motion can also be induced by spin–orbit torque (SOT) at a lower current density. The conductance of the MTJ stack, which represents the synaptic weight, is determined by the DW position and can be read by a small vertical current between OUT and either L or R that does not displace the DW. To date, most work on DW synaptic neuromorphic systems has focused on small-scale implementations of bio-inspired learning rules, e.g., STDP or STDP-like rules,^{19–21} rather than on state-of-the-art deep neural networks that can be applied to complex, high-dimensional machine learning tasks.^{22} Supervised learning systems proposed so far with DW synapses^{23} have not considered the implications of room-temperature drift, stochasticity, and process variations on the feasibility of these systems.

In this Letter, we show using micromagnetic simulations that a DW-MTJ device can function effectively as a synapse with a high intrinsic linearity and symmetry. Notches are added to the track to stabilize the DW position and to improve the linearity and repeatability of synaptic updates. We build lookup-table models of device behavior, based on micromagnetic simulations of both STT- and SOT-driven DW motion, which capture thermally induced cycle-to-cycle variability of DW updates as well as process-induced device-to-device variability in the MTJ conductance. Next, we use these lookup tables to simulate the training of DW-MTJ synapses on the MNIST^{24} and Fashion-MNIST^{25} image classification datasets and evaluate their accuracy.

The magnetization dynamics of DW-MTJ artificial synapses are modeled using MuMax3, a micromagnetics simulator that solves the Landau–Lifshitz–Gilbert (LLG) equation.^{26} The free FM layer is a rectangular wire that is 1050 nm long, 50 nm wide, and 1.5 nm thick for STT (1750 nm long for SOT). The wire is bounded on both ends by 30 nm long regions of fixed magnetization (50 nm for SOT). The free layer is assumed to be CoFeB (exchange stiffness $A_{ex} = 1.3 \times 10^{-11}$ J/m, saturation magnetization $M_{sat} = 0.8 \times 10^{6}$ A/m, magnetocrystalline anisotropy $K_u = 5 \times 10^{5}$ J/m^{3} along $\hat{z}$, Gilbert damping factor $\alpha = 0.05$, non-adiabaticity factor $\xi = 0.05$, and spin polarization *P* = 0.7), while the HM layer for the SOT device is Ta [spin Hall angle $\theta_H = 0.2$ and interfacial Dzyaloshinskii–Moriya interaction (DMI) $D_{ind} = -0.5 \times 10^{-3}$ J/m^{2}]. The HM layer has the same thickness and resistivity as the FM, so that half of the injected current acts on the DW by STT and the other half by SOT. The DW position is approximated using the average magnetization of the wire along its length.

In a perfectly smooth FM wire and in the absence of a driving current or field, the DW tends to drift toward the center to minimize its interaction energy with the pinned magnetization regions. This is illustrated by the blue curve in Fig. 2(a), where the DW is initialized at the right edge ($x \approx 960$ nm) and drifts by approximately 200 nm over 50 ns. By adding semi-circular notches to the edges of the track that act as pinning sites, shown in Fig. 1(b), the DW position can be made nonvolatile. A further benefit of the notches is shown in Fig. 2(b), where the DW is driven by a sequence of 1 ns long 50 *μ*A pulses separated by 4 ns relaxation periods. Without notches, the DW updates are nonlinear, controlled both by the applied current and by the DW position along the track, which determines the rate of drift. The notches ensure linear updates independent of position and enable lithographic control of synaptic weight values. Other methods for DW pinning include interlayer exchange coupling of multiple MTJs and the introduction of defects in the shape or anisotropy of the free layer.^{23,28} However, notches were chosen as the preferred pinning method because their controllable positions guarantee update linearity, which cannot readily be obtained using randomly distributed defects. In addition, notches are less complex to define lithographically than interlayer exchange coupling and can be more easily scaled.

At non-zero ambient temperatures, thermal fluctuations can cause spontaneous depinning of the DW. As a result, the notch must provide a sufficiently deep energy well to ensure synaptic non-volatility. The notch spacing must also be greater than the DW width to prevent drift. This is particularly important for SOT-driven devices, since the tilting of the DW due to DMI during and after current injection can cause uncontrolled movement between notches. Figure 2(c) illustrates this with snapshots of the $\hat{z}$ magnetization of a section of the track. Here, a 0.5 ns long 27 *μ*A pulse is applied to an SOT device with a 30 nm notch spacing. With this spacing, the DW interacts with both adjacent notches and experiences tilting long after the current stimulus has ceased, eventually settling unpredictably to one of the notches. For the SOT device, notches with a 5 nm radius spaced 50 nm apart are necessary to suppress the effect of thermal fluctuations, while for the STT device, a notch spacing of 30 nm is sufficient. These represent lower bounds on notch spacing that provide non-volatility, predictable updates, and the availability of many states along the track. These notch spacings are attainable using electron beam lithography and ion mill etching; well-controlled nanomagnetic feature sizes down to 10 nm spacing have been demonstrated, and MTJs have been fabricated with diameters below 50 nm.^{29,30}

The synaptic functionality of DW-MTJ devices is demonstrated using both STT- and SOT-driven devices with 32 equally spaced notches (32 weight levels). Synaptic updates are characterized using a sequence of positive current pulses followed by negative pulses, which ramp the DW position along the track. To shift the DW-MTJ by one weight level, the pulse duration is fixed to 1 ns for STT devices (0.5 ns for SOT), and the amplitude is set to 50 *μ*A for STT (27 *μ*A for SOT). The ramp is repeated 30 times to quantify the cycle-to-cycle variability in the update amount, which can be induced by thermal noise.

Figure 2(d) shows the ramp of DW position at 0 K for the STT device. In the inset, the notch positions depicted by the horizontal lines show the lithographically defined levels to which the DW can settle. The linear slope of the ramp in both directions indicates that synaptic weight changes are both highly linear (minimal state-dependence) and highly symmetric (same response in both directions). This suggests that DW-MTJ synapses can implement backpropagation with high fidelity.

The size of a weight update $\Delta x$ can be linearly modulated using the magnitude or duration of the applied current pulse. This follows from the relationship between DW velocity and current density described by Beach *et al.*:^{31}

$$v = \frac{g \mu_B P}{2 e M_{sat}} J, \qquad (1)$$

where *v* is the DW velocity, *J* is the current density, *g* is the Landé factor, *μ_{B}* is the Bohr magneton, *P* is the spin polarization, and *e* is the electron charge. To validate this property, positive and negative pulses of varying duration were applied to the DW at each of the 32 notches, and the average $\Delta x$ for each duration was computed. The result, shown in Fig. 2(e), confirms that $\Delta x$ varies linearly with pulse duration and that this response is symmetric for the two update polarities.
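As a rough numerical check of this linear scaling, the sketch below evaluates the standard spin-drift velocity $u = g \mu_B P J / (2 e M_{sat})$ for the STT device parameters quoted earlier. It assumes *g* = 2, a uniform current through the full 50 nm × 1.5 nm cross-section of the track, and a DW that moves at *u* for the whole pulse (the ideal case when $\xi = \alpha$). The function and variable names are illustrative and are not taken from the micromagnetic simulations.

```python
# Physical constants (SI units)
MU_B = 9.2740100783e-24      # Bohr magneton, J/T
E_CHARGE = 1.602176634e-19   # elementary charge, C


def dw_velocity(current_a, width_m, thickness_m, polarization, m_sat, g=2.0):
    """Spin-drift velocity u = g * mu_B * P * J / (2 * e * M_sat), in m/s."""
    j = current_a / (width_m * thickness_m)   # current density, A/m^2
    return g * MU_B * polarization * j / (2.0 * E_CHARGE * m_sat)


def dw_displacement(current_a, pulse_s, **track):
    """Displacement for a rectangular pulse, assuming constant velocity."""
    return dw_velocity(current_a, **track) * pulse_s


# Track geometry and CoFeB parameters quoted in the text
track = dict(width_m=50e-9, thickness_m=1.5e-9, polarization=0.7, m_sat=0.8e6)

v = dw_velocity(50e-6, **track)
dx = dw_displacement(50e-6, 1e-9, **track)
print(f"v = {v:.1f} m/s, dx = {dx * 1e9:.1f} nm per 1 ns pulse")
```

Under these assumptions the estimate is roughly 34 m/s, or about 34 nm per 1 ns, 50 *μ*A pulse, which is of the same order as the 30 nm notch spacing of the STT device. Doubling either the pulse duration or the current doubles the estimate.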

The DW position *x* is converted to a synapse conductance by treating the MTJ as two resistors in parallel: one over the region where the free and fixed FM layers have parallel magnetizations and one over the region where they are anti-parallel. The DW positions collected from 30 ramps are used to construct a probabilistic lookup table of the change in conductance $\Delta G$ for each initial conductance *G*. Figure 3(a) shows the lookup tables for STT and SOT devices at 0 K and 300 K (at 0 K, SOT behaves similarly to STT). Temperature introduces stochasticity to the updates, and this is significantly more pronounced in the SOT devices where, due to DMI, the DW prefers a Néel geometry. Thermal fluctuations in the DW magnetization, together with its interaction with the notches, can randomly cause the DW to either not be displaced or to propagate multiple levels in order to maintain a Néel configuration. Nonetheless, when averaged over the noise, all of the lookup tables show highly linear updates, indicated by an expected value of $\Delta G$ that is nearly independent of *G*.
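The conversion from DW position to conductance and the construction of a probabilistic lookup table can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the region from 0 to *x* is parallel, uses the common definition TMR = (R_AP − R_P)/R_P, and the resistance, TMR, track length, and ramp positions are hypothetical placeholders.

```python
import random


def mtj_conductance(x_nm, length_nm, r_p, tmr):
    """Conductance of a DW-MTJ treated as two resistors in parallel.

    x_nm : DW position (nm); the region [0, x_nm] is assumed parallel.
    r_p  : resistance if the whole junction were parallel (ohm).
    tmr  : tunnel magnetoresistance ratio, (R_AP - R_P) / R_P.
    """
    if not 0 <= x_nm <= length_nm:
        raise ValueError("DW position outside the MTJ")
    g_p = 1.0 / r_p
    g_ap = g_p / (1.0 + tmr)
    frac_parallel = x_nm / length_nm
    return frac_parallel * g_p + (1.0 - frac_parallel) * g_ap


def build_lookup_table(ramps, length_nm, r_p, tmr):
    """Map each initial conductance G to the list of observed changes dG."""
    table = {}
    for positions in ramps:
        for x0, x1 in zip(positions, positions[1:]):
            g0 = mtj_conductance(x0, length_nm, r_p, tmr)
            g1 = mtj_conductance(x1, length_nm, r_p, tmr)
            table.setdefault(g0, []).append(g1 - g0)
    return table


def sample_update(table, g, rng):
    """Draw one dG from the entry whose initial conductance is closest to g."""
    nearest = min(table, key=lambda g0: abs(g0 - g))
    return rng.choice(table[nearest])


# Hypothetical values: 32 notches spaced 30 nm apart, two short ramps
LENGTH_NM, R_P, TMR = 930, 1.0e3, 1.0
ramps = [[0, 30, 60, 90], [0, 30, 90]]   # second ramp skips one notch
table = build_lookup_table(ramps, LENGTH_NM, R_P, TMR)
g_start = mtj_conductance(30, LENGTH_NM, R_P, TMR)
dg = sample_update(table, g_start, random.Random(0))
```

Because the conductance in this model is linear in *x*, equally spaced notches map to equally spaced conductance levels, and any spread in the sampled $\Delta G$ comes only from the spread in DW displacement.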

Figure 3(b) compares the simulated ramp response of DW devices at 300 K with experimentally measured ramp data from two previously published devices: electrochemical RAM (ECRAM),^{10} which is highly linear and reproducible, and TaO_{x} ReRAM,^{27} which is highly nonlinear and asymmetric. Table I compares the linearity, symmetry, and stochasticity of several published devices, where the parameters are extracted as described in Ref. 12. The DW synapse exhibits greater write noise than ECRAM, but using STT motion, the device has excellent linearity and symmetry in comparison to the best demonstrated synaptic devices.

Synapse device | Nonlinearity (+/− updates) | Cycle-to-cycle variation |
---|---|---|
Domain wall STT | +0.07/−0.15 | 0.77% |
Domain wall SOT | +0.80/−0.81 | 3.23% |
ECRAM^{10} | +0.70/−0.12 | 0.023% |
TaO_{x}/HfO_{x} ReRAM^{32} (analysis by Ref. 12) | +0.04/−0.63 | 3.70% |
TaO_{x} ReRAM^{27} | +668/−51.7 | 11.2% |

We sample the generated lookup tables to simulate the training of DW-MTJ synapse arrays using CrossSim.^{33} To model device-to-device variation within the array induced by process variations, each device is assigned a different perturbed lookup table. The perturbations are added as random variations in the MTJ parallel resistance *R_{p}* (12.8%) and the tunnel magnetoresistance ratio TMR (6.9%). We assume normally distributed variations with magnitudes obtained from Ref. 34 at a 45 nm critical dimension, which is slightly less than the track width. For computational tractability, we generate 20 perturbed lookup tables for each combination of technology (STT/SOT) and temperature and assign them randomly to devices in the array.
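The perturbation and assignment procedure can be illustrated with a short sketch. It does not use the CrossSim interface. It treats the quoted percentages as relative standard deviations of a normal distribution, which is an assumption, and the nominal resistance and TMR values are hypothetical.

```python
import random


def perturbed_device(r_p_nominal, tmr_nominal, rng,
                     sigma_rp=0.128, sigma_tmr=0.069):
    """Draw one device's (R_p, TMR) from normal distributions.

    The sigmas are interpreted as relative standard deviations (assumption).
    """
    r_p = r_p_nominal * (1.0 + rng.gauss(0.0, sigma_rp))
    tmr = tmr_nominal * (1.0 + rng.gauss(0.0, sigma_tmr))
    return r_p, tmr


def build_device_pool(n_tables, r_p_nominal, tmr_nominal, seed=0):
    """Generate a small pool of perturbed devices, one per lookup table."""
    rng = random.Random(seed)
    return [perturbed_device(r_p_nominal, tmr_nominal, rng)
            for _ in range(n_tables)]


def assign_devices(n_rows, n_cols, pool_size, seed=1):
    """Randomly pick a pool index for every position in the array."""
    rng = random.Random(seed)
    return [[rng.randrange(pool_size) for _ in range(n_cols)]
            for _ in range(n_rows)]


# 20 perturbed devices shared across a 785 x 300 array (hypothetical nominals)
pool = build_device_pool(20, r_p_nominal=1.0e3, tmr_nominal=1.0)
assignment = assign_devices(785, 300, len(pool))
```

Each pool entry would then be used to rescale the conductance axis of the nominal lookup table for the devices that reference it.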


As shown in Fig. 4(a), the matrix–vector multiplication can be executed on the DW-MTJ array during forward propagation by reading the MTJ resistance (OUT terminal). Weight updates are performed using the L and R terminals of the track, as shown in Fig. 4(b). Using backpropagation with stochastic gradient descent (SGD), each update to the weight matrix is an outer product of two vectors. A parallel outer product update can be efficiently executed in a DW-MTJ array by simultaneously driving the L terminals (rows) and the R terminals (columns) with time-coded and voltage-coded pulses, respectively, to obtain a multiplicative effect.^{35}
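The multiplicative effect of combining time-coded rows with voltage-coded columns can be sketched numerically. The example assumes an idealized device whose update is proportional to pulse duration times pulse amplitude, that the row inputs are non-negative (as for pixel values or sigmoid outputs), and that the error sign is carried by the pulse polarity. The function names and the arbitrary units are illustrative only.

```python
def encode_sgd_pulses(x, delta, lr):
    """Encode the SGD step dW = -lr * outer(x, delta) as pulse parameters.

    Rows get durations from the inputs x (assumed non-negative);
    columns get signed amplitudes from the scaled error -lr * delta.
    """
    if any(v < 0 for v in x):
        raise ValueError("this sketch assumes non-negative row inputs")
    durations = list(x)
    amplitudes = [-lr * d for d in delta]
    return durations, amplitudes


def parallel_outer_update(weights, durations, amplitudes):
    """Every device (i, j) changes by duration_i * amplitude_j, in place."""
    for i, t in enumerate(durations):
        for j, a in enumerate(amplitudes):
            weights[i][j] += t * a
    return weights


# 2 inputs (rows) x 3 outputs (columns), arbitrary units
W = [[0.0] * 3 for _ in range(2)]
t, a = encode_sgd_pulses(x=[1.0, 0.5], delta=[0.2, -0.4, 0.0], lr=0.1)
parallel_outer_update(W, t, a)
```

In hardware all devices would be updated in a single parallel step; the nested loop here only enumerates what each device receives.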

The DW-MTJ synapse is evaluated on two image classification tasks, MNIST handwritten digits [see Fig. 5(a)] and the more difficult Fashion-MNIST clothing items [see Fig. 5(b)], using the same two-layer multilayer perceptron topology with 300 hidden neurons. Each network consists of a 785 × 300 and a 301 × 10 weight matrix (including bias). A sigmoid and a softmax activation, after the first and second layers, respectively, are computed digitally at floating-point precision. Signed weights are implemented using the difference in conductance of two DW-MTJ devices: $W_{i,j} = G_{i,j}^{+} - G_{i,j}^{-}$. The complementary weight components are placed in two separate arrays, and updates are always applied to both halves of a synapse to prevent conductance saturation.^{11} SGD is used with a fixed learning rate schedule for all simulations: the learning rate begins as *α* and is reduced to $\alpha/2$, $\alpha/3$, $\alpha/4$, and $\alpha/5$ after the third, fifth, eighth, and tenth training epochs, respectively.
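The learning rate schedule and the differential weight encoding can be written compactly. The sketch assumes 1-indexed epochs, so that "after the third epoch" means epochs 4 and later, and it splits each signed update equally and oppositely between the two devices. That equal split is one simple way to update both halves and is an assumption, as are the function names and the normalized conductance range.

```python
def learning_rate(epoch, alpha):
    """Learning rate used during the 1-indexed training epoch `epoch`."""
    divisor = 1
    # alpha is divided by 2, 3, 4, 5 after epochs 3, 5, 8, 10
    for boundary, d in ((3, 2), (5, 3), (8, 4), (10, 5)):
        if epoch > boundary:
            divisor = d
    return alpha / divisor


def clip(value, lo, hi):
    return min(max(value, lo), hi)


def signed_weight(g_plus, g_minus):
    """Signed weight as the difference of two device conductances."""
    return g_plus - g_minus


def apply_signed_update(g_plus, g_minus, dw, g_min, g_max):
    """Apply dw by moving both devices in opposite directions."""
    g_plus = clip(g_plus + dw / 2.0, g_min, g_max)
    g_minus = clip(g_minus - dw / 2.0, g_min, g_max)
    return g_plus, g_minus


schedule = [learning_rate(epoch, 0.1) for epoch in range(1, 13)]
gp, gm = apply_signed_update(0.5, 0.5, dw=0.2, g_min=0.0, g_max=1.0)
```

Starting both devices at mid-range leaves equal headroom for positive and negative weight changes before either device reaches a conductance limit.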

Figures 5(a) and 5(b) show the training performance of STT DW-MTJ synapse arrays at 300 K compared to training with ideal numeric updates. We have used the MNIST and Fashion-MNIST test sets for validation. For each series, three networks are trained with random initial seeds; the data points show the average, while the colored areas signify standard deviation. If the DW-MTJ is idealized to have continuous levels without drift, the performance is very close to ideal even with cycle-to-cycle and device-to-device variations introduced by temperature and MTJ variations, respectively. The resilience to MTJ process variation arises from the high linearity of the devices: even if the same conductance maps to different DW positions in different devices, the update strength will be the same since it is largely independent of the starting state.

The geometry with 32 notches (green) has a discretizing effect on DW position: the updated conductance is rounded to the closest discrete level. For this case, the learning rate *α* is increased to prevent a large number of small updates from being reduced to zero; however, this results in inferior convergence relative to the continuous case. In both classification tasks, the notched synapses suffer a significant performance loss even with an optimized learning rate, with a greater loss (>20%) for the Fashion-MNIST task. The accuracy can be partially recovered using periodic carry, which splits the high and low significance bits of each weight into two devices with 32 notches each,^{33} increasing the effective weight resolution of each nanosynapse. With increasing dataset and neural network complexity, more weight levels (notches) are needed to obtain ideal numeric accuracy. Indeed, backpropagation using SGD typically has a clear lower bound on allowable bit resolution.^{36,37}

Figures 5(c) and 5(d) compare the training performance on Fashion-MNIST of STT and SOT devices at different temperatures, assuming ideal continuous and notched synapses with periodic carry, respectively. In the continuous case, the superior accuracy of STT devices in Fig. 5(c) results from their smaller cycle-to-cycle variability in $\Delta G$. For both device types, the accuracy is roughly the same for 300 K and 400 K, which reflects the similarity in their lookup tables. On the other hand, for the notched devices in Fig. 5(d), a higher accuracy is attained with SOT than with STT. This arises from the fact that when the desired synapse update $\Delta G$ is small, the more stochastic SOT device is more likely to yield a non-zero conductance update than the STT device, where many of the updates will be too small to move the DW.

To reduce the accuracy loss caused by discretization, an alternative to periodic carry is to use a longer FM wire with more notches to increase the available number of weight levels. Figure 5(e) shows the effect of the number of notches on Fashion-MNIST accuracy using STT and SOT devices at 300 K. Surprisingly, the SOT device attains a much higher accuracy than the STT device when the number of notches is small; with just 32 notches, an accuracy of 72% is achieved on Fashion-MNIST compared to 10% using STT at the same learning rate. We attribute this to the greater stochasticity of the SOT mechanism, which allows small device updates that would otherwise fail to move the DW to the next notch to occasionally produce a change in conductance. As with the stochastic rounding technique used in software,^{36} when this effect is averaged over thousands of updates, an effectively higher resolution is achieved for the weight updates than the actual number of notches present. Noisy updates have also been shown to reduce overfitting.^{38} This benefit can be approximated using STT with a higher learning rate (dashed curve), but with a lower accuracy at both a small and large number of notches relative to SOT.
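The analogy with stochastic rounding can be made concrete with a software-only sketch. It compares round-to-nearest with unbiased stochastic rounding when many updates, each one tenth of a level, are accumulated on a grid of 32 levels. It is not a model of the SOT device physics; the step size, update size, and number of updates are arbitrary choices for illustration.

```python
import math
import random


def round_nearest(value, step):
    """Deterministic rounding to the closest level."""
    return step * math.floor(value / step + 0.5)


def round_stochastic(value, step, rng):
    """Round up with probability equal to the fractional part (unbiased)."""
    lower = math.floor(value / step)
    frac = value / step - lower
    return step * (lower + (1 if rng.random() < frac else 0))


def accumulate(update, n_updates, step, rounder):
    """Apply the same small update repeatedly, rounding after each one."""
    w = 0.0
    for _ in range(n_updates):
        w = rounder(w + update, step)
    return w


STEP = 1.0 / 31          # 32 levels spanning a normalized range of 0 to 1
SMALL = 0.1 * STEP       # update much smaller than one level
rng = random.Random(0)

w_det = accumulate(SMALL, 200, STEP, round_nearest)
w_sto = accumulate(SMALL, 200, STEP,
                   lambda v, s: round_stochastic(v, s, rng))
print(f"nearest: {w_det / STEP:.0f} levels, stochastic: {w_sto / STEP:.0f} levels")
```

With round-to-nearest every update is lost and the weight never leaves zero, whereas stochastic rounding moves the weight by about 20 levels on average, which is the exact sum of the 200 updates.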

Material parameters may also influence the neuromorphic performance and efficiency of the DW-MTJ synapse. Assuming a free layer with perpendicular anisotropy, both the DW velocity and the DW width depend on the track's magnetic properties. Based on Eq. (1), an increase in *P* or a decrease in *M_{sat}* would increase the DW velocity for a given applied current, allowing the same weight update to be performed with a lower energy. For small notch spacings close to half the DW width, an increase in DW width can increase the stochasticity of a weight update. The expression for DW width $\delta = \pi \sqrt{A/K}$ indicates that choosing a material with increased exchange stiffness *A* or reduced perpendicular anisotropy *K* can lead to more stochastic updates with the same device geometry. Additionally, in choosing the material for the HM layer, the torque contribution from the spin Hall effect described in Ref. 39 can be used to deduce the relevant material parameters.

Choosing a material with larger spin Hall angle *θ_{H}* leads to an increased DW velocity, increasing the energy efficiency of a weight update. The HM layer also mediates the magnitude of DMI, which is a large contributor to the stochasticity found in SOT devices. By choosing a material that induces stronger or weaker DMI, the stochasticity of a weight update can be augmented or reduced. In addition to increasing the effective weight update resolution, tunable stochasticity can also enable efficient implementations of probabilistic learning algorithms.
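To give a sense of scale for the dependence of DW width on material parameters, the sketch below evaluates the standard expression $\delta = \pi \sqrt{A/K}$ with the CoFeB values quoted earlier. It uses the magnetocrystalline anisotropy constant directly as *K*, which is a simplifying assumption: an effective anisotropy reduced by the demagnetizing field would give a wider wall.

```python
import math


def dw_width(a_ex, k_anis):
    """Domain-wall width delta = pi * sqrt(A / K), in meters."""
    return math.pi * math.sqrt(a_ex / k_anis)


A_EX = 1.3e-11   # exchange stiffness, J/m (CoFeB value quoted in the text)
K_U = 5e5        # anisotropy, J/m^3 (used directly as K: an assumption)

delta = dw_width(A_EX, K_U)
delta_stiffer = dw_width(2 * A_EX, K_U)   # doubled exchange stiffness
delta_softer = dw_width(A_EX, K_U / 2)    # halved anisotropy
print(f"delta = {delta * 1e9:.1f} nm")
```

Under this assumption the width is about 16 nm, and either doubling *A* or halving *K* widens the wall by a factor of $\sqrt{2}$.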

In summary, our micromagnetics-based modeling of DW-MTJ nanosynapses with a notched geometry demonstrates their suitability for on-chip learning using backpropagation. Well-engineered notches eliminate DW drift in both STT and SOT DW-MTJs, bestowing synaptic non-volatility. Device lookup tables constructed from micromagnetics simulations of 32-level devices display highly linear and symmetric synaptic response, leading to classification accuracies approaching ideal numeric performance on the MNIST task. When taking into account the pinning of DWs to discrete notches, there is an accuracy penalty for more complex tasks such as Fashion-MNIST. This penalty could be alleviated by increasing the weight resolution (adding more notches), using multiple devices to represent the synapse bits (periodic carry), or by exploiting the stochasticity that is inherent to the physics of SOT devices. Since our results imply that discretization due to notches is the major roadblock to software-equivalent neural network performance, the effect of stochastic rounding will be investigated in future work to mitigate this drawback while retaining the increased linearity of a notched geometry. Overall, our physics-rich neural network simulations may be a foundational step in the realization of analog spintronic neuromorphic computation.

The authors acknowledge the support from Sandia's Laboratory-Directed Research and Development program, funding from the National Science Foundation CAREER under the Award No. 1940788, and computing resources from the Texas Advanced Computing Center (TACC) at the University of Texas at Austin (http://www.tacc.utexas.edu). This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in this paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government. Sandia National Laboratories is a multimission laboratory managed and operated by NTESS, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy's National Nuclear Security Administration under Contract No. DE-NA0003525.

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.