We present multiplexed gradient descent (MGD), a gradient descent framework designed to easily train analog or digital neural networks in hardware. MGD utilizes zero-order optimization techniques for online training of hardware neural networks. We demonstrate its ability to train neural networks on modern machine learning datasets, including CIFAR-10 and Fashion-MNIST, and compare its performance to backpropagation. Assuming realistic timescales and hardware parameters, our results indicate that these optimization techniques can train a network on emerging hardware platforms orders of magnitude faster than the wall-clock time of training via backpropagation on a standard GPU, even in the presence of imperfect weight updates or device-to-device variations in the hardware. We additionally describe how it can be applied to existing hardware as part of chip-in-the-loop training or integrated directly at the hardware level. Crucially, because the MGD framework is model-free it can be applied to nearly any hardware platform with tunable parameters, and its gradient descent process can be optimized to compensate for specific hardware limitations, such as slow parameter-update speeds or limited input bandwidth.
I. INTRODUCTION
As the applications of machine learning and artificial intelligence have grown in size and power, so have their implementation costs and energy usage, leading to a significant effort toward building custom hardware that can perform these tasks at high speeds with lower energy costs.1 A number of hardware platforms have emerged using analog,2 digital,3,4 or mixed-signal processing5 that will potentially offer increased operational speeds and/or reduced energy costs.6 However, many of the most promising hardware neural networks only perform inference, as integrating in situ training on these platforms has proven extremely challenging. At the same time, the largest portion of energy costs in modern applications is spent on the training process,7 usually via gradient descent, with the gradient calculated via backpropagation. While backpropagation is by far the most computationally efficient method of computing the gradient in software, this does not mean that it is the fastest, most robust, or most energy efficient technique on every hardware platform,8 as evidenced by learning in the brain.9,10
Although often conflated, training via gradient descent does not require backpropagation—backpropagation is only used to calculate the gradient. Other methods for computing the gradient in neural networks exist, but are much less efficient in software than backpropagation and so are rarely used in today’s machine learning applications. This is not generally true in hardware, where backpropagation may not only be challenging to implement but also may not be the most efficient way to compute the gradient.
Of particular interest in hardware are model-free methods, which require no knowledge of the internal structure of the network (e.g., topology, activation function, derivatives, etc.), only the ability to perturb the network's parameters and measure the network's response. The simplest example of such a method is finite-difference,11 which has been employed for chip-in-the-loop training.12 However, finite-difference has several disadvantages that prevent its widespread implementation in hardware, including requirements for extra memory at every synapse and for global synchronization. Fortunately, there are a variety of other model-free methods that overcome some of the issues associated with finite-difference.13,14
In this paper, we show that model-free perturbative methods can be used to efficiently train modern neural network architectures in a way that can be implemented natively within emerging hardware. These methods were investigated for training VLSI neural networks beginning in the 1990s,15–24 and more recently on memristive crossbars25 and photonic hardware,26 but all these demonstrations have been very limited in scale, using small datasets and networks with only a few neurons. Below we describe a framework for applying these techniques to existing hardware at much larger scales, with an emphasis on creating simple, highly localized circuits that could be implemented on-chip if desired. The framework is also extensible to training existing hardware systems via a chip-in-the-loop technique. We note that these methods have also been adapted in forward-gradient approaches using auto-differentiation, which have attracted recent interest in the machine learning literature.27–29
We show that under realistic assumptions of the operating timescales of analog and digital hardware neural networks, one can train hardware to solve modern datasets such as CIFAR-10 orders of magnitude faster than training a software network on a graphics processing unit (GPU), even in the presence of signal noise and device-to-device variations in the hardware. A major advantage of this framework is that it can be used to perform online training of hardware platforms originally designed only for inference while making minimal hardware modifications.
II. MULTIPLEXED GRADIENT DESCENT
A. Computing the gradient with perturbations
We begin with the basic assumption that we have some hardware with programmable parameters (e.g., weights and biases) that can perform inference. Our goal is to augment the hardware minimally such that it can also be trained via gradient descent. We will show how to configure the hardware such that the network as a whole automatically performs gradient descent, without backpropagation. As an example, assume we have a hardware instantiation of a feedforward multi-layer neural network as shown in Fig. 1. The hardware takes time-varying inputs x(t) and a training target ŷ(t), has variable parameters θ, outputs the inference y(t), and computes a cost C(t). To allow us to compute the gradient of such a system, we first add a small time-varying perturbation Δθ̃i(t) to each parameter base value θi [Fig. 1(a), inset]. This perturbation will slightly modulate the cost C, and that modulation will be fed back to the parameters. This process will ultimately allow us to extract the gradient of the system.
If we remove C0, we are left with a time-varying signal ΔC̃(t) corresponding only to the effects of our parameter perturbations. The amplitude ΔCi is simply the amplitude of the change in the cost due to Δθ̃i(t), the perturbation of parameter i.
The value Gi is the approximation for the partial gradient for parameter i. G approaches the exact gradient when both T → ∞ and the amplitude of the perturbation Δθi approaches zero and is only an approximation otherwise. Fortunately, even at realistic timescales and amplitudes, G contains meaningful information and can be used to perform gradient descent.13
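To make this concrete, the following minimal Julia sketch (our illustration, not the authors' simulator) demodulates a toy quadratic cost with sinusoidal perturbations to recover G per Eq. (2); the cost function, frequencies, and amplitudes are assumptions chosen for clarity.

```julia
# Estimate Gᵢ ≈ ∂C/∂θᵢ by correlating the cost modulation with each
# parameter's perturbation, as in Eq. (2). Toy quadratic cost; the true
# gradient is 2θ.
C(θ) = sum(θ .^ 2)

θ  = [0.7, -0.3, 1.1]            # parameter base values
Δθ = 1e-3                        # perturbation amplitude
f  = [11.0, 17.0, 23.0]          # distinct perturbation frequencies (Hz)
dt, T = 1e-4, 2.0                # timestep and integration time
C0 = C(θ)                        # unperturbed baseline cost

G = zeros(length(θ))
for t in 0:dt:T
    pert = Δθ .* sin.(2π .* f .* t)   # perturbations, orthogonal in time
    ΔC = C(θ .+ pert) - C0            # cost modulation they produce
    G .+= ΔC .* pert ./ Δθ^2 .* dt    # accumulate the correlation
end
G .*= 2 / T    # normalize by 1/T; factor 2 because ⟨sin²⟩ = 1/2
println(G)     # ≈ [1.4, -0.6, 2.2] = 2θ
```

With T spanning whole periods of every frequency, the cross terms between parameters average away and each Gi isolates its own partial derivative.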
Note that here and in the simulation results, Gi is being accumulated with time and is not normalized by 1/T, unlike Eq. (2). As described later, this allows us to vary the integration time without greatly impacting the rate of training—equivalently, one can integrate for a long time resulting in a large step down the gradient, or one can take a series of shorter steps instead and travel approximately the same distance along the gradient.
We discuss the effects of changing the perturbation type in Sec. III D. We also note that although many of the quantities described here are time-varying, in the following sections, we will drop the explicit time dependence notation for the sake of brevity.
B. Gradient descent in the MGD framework
Looking at this process from the hardware perspective, one must also examine several practical considerations such as when to perform the weight update, how long to integrate the gradient, and so forth. We introduce the following time constants to provide a framework for managing these considerations in the MGD context.
τp is the timescale over which perturbations occur. In a digital system, the perturbations of each parameter would be updated to new values every τp. In an analog (continuous) system, τp corresponds to the characteristic timescale of the perturbations. For instance, if using sinusoidal perturbations, it corresponds approximately to the inverse of the total bandwidth the frequencies occupy.
τθ is the gradient-integration time [i.e., T in Eq. (3)]. It sets how often parameter updates occur and also determines how accurate the gradient approximation will be. For each time period τθ, the gradient approximation G is integrated, and at the end of that period, the parameters are updated according to Eq. (4).
τx controls how often new training samples x, ŷ are shown to the hardware. After each τx period, the old sample is discarded and a new one is applied, generating a new output y and cost C.
The values of the time constants τθ and τp have a large impact on the particulars of the training, and particular choices of τθ and τp allow the implementation of certain conventional algorithms from numerical analysis. For instance, consider the implementation of the forward finite-difference algorithm within this framework. To do this, we start with discrete perturbations, such that every τp only a single parameter is perturbed by Δθ, and the parameters are perturbed sequentially as shown in Fig. 1(c). When parameter i is perturbed by Δθi, the cost changes by ΔC, and the resulting partial gradient ΔC/Δθi ≈ ∂C/∂θi is stored in Gi. Now if we set τθ = Pτp, where P is the number of parameters in the network, this is exactly equivalent to computing a finite-difference approximation of the gradient: each τp, one element of the gradient is approximated, and after Pτp, every partial gradient Gi has been measured and stored. Since τθ = Pτp, the weight update only comes after all the partials of G have been collected, just as in finite-difference. This process is shown schematically in Fig. 2(a).
Similarly, other optimization algorithms can be implemented simply by modifying the values of the time constants. For example, using the same procedure as above but with the integration time reduced to a single timestep, i.e., τθ = τp, corresponds exactly to coordinate descent. In this case, rather than storing each Gi until all the partials of the gradient are fully assembled, the weight update is applied immediately after each Gi is computed. This may be more appealing from a hardware perspective than a finite-difference approach, as Gi can be used for the weight update and immediately discarded—unlike finite-difference, it does not require a per-parameter memory to store Gi until the weight update is executed. This coordinate-descent process is shown schematically in Fig. 2(b).
As a third example, it is possible to implement simultaneous-perturbation stochastic approximation (SPSA)13 by only changing the values of the time constants and the form of the perturbation. In this case, τθ = τp and a random, discrete {+Δθ, −Δθ} perturbation is applied to every parameter every τp, as shown in Fig. 2(c). Similar to coordinate descent, this configuration avoids the need for additional per-parameter memories, as Gi values do not need to be stored. This method avoids the need for global synchronization of the parameters—the perturbations do not need to be sequential, and instead can be generated locally and randomly at the parameter.
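The following Julia sketch illustrates how these three algorithms fall out of one loop; the toy cost, learning rate, and perturbation amplitude are our own illustrative assumptions.

```julia
# One MGD loop; the choice of τθ and perturbation style selects the
# algorithm: sequential perturbations with τθ = P·τp give finite-difference,
# sequential with τθ = τp give coordinate descent, and random ±Δθ
# perturbations with τθ = τp give SPSA.
function mgd!(θ, C; η=0.05, Δθ=1e-3, τp=1, τθ=1, steps=10_000, style=:spsa)
    P, G, pert = length(θ), zero(θ), zero(θ)
    for n in 1:steps
        if n % τp == 0                           # new perturbation every τp
            if style == :sequential              # one parameter at a time
                pert .= 0
                pert[mod1(n ÷ τp, P)] = Δθ
            else                                 # :spsa — all at once
                pert .= Δθ .* rand([-1.0, 1.0], P)
            end
        end
        ΔC = C(θ .+ pert) - C(θ)                 # change in cost
        G .+= ΔC .* pert ./ Δθ^2                 # accumulate gradient approx.
        if n % τθ == 0                           # parameter update every τθ
            θ .-= η .* G
            G .= 0
        end
    end
    return θ
end

cost(θ) = sum((θ .- [1.0, -2.0, 0.5]) .^ 2)      # toy cost, minimum at target
mgd!(zeros(3), cost; style=:sequential, τθ=3)    # finite-difference (τθ = Pτp)
mgd!(zeros(3), cost; style=:sequential, τθ=1)    # coordinate descent
mgd!(zeros(3), cost; style=:spsa, τθ=1)          # SPSA
```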
In addition to being able to choose these specific optimization algorithms, by varying the time constants τp and τθ one can also implement entirely new optimization algorithms. For instance, in Fig. 2(d), we show an MGD implementation on an analog system using sinusoidal perturbations that does not correspond to any of the aforementioned methods. In this case, τp corresponds to the timescale 1/Δf, where Δf is the perturbation bandwidth, the difference between the maximum and minimum perturbation frequency. Additionally, in the analog case, there is no discrete update of the parameters, and instead, τθ is an integration time constant. Unlike the discrete case, which accumulates G for τθ time and then resets it to zero, θ is continuously updated with the output of a lowpass filter with time constant τθ (see Algorithm 2).
Because modern machine learning datasets are composed of many training examples (often tens of thousands), τx, the time constant that controls how often training samples are changed, is critical. In fact, by setting τx appropriately, mini-batching can even be implemented in hardware that only accepts one sample input at a time. The batch size is determined by τθ/τx, the ratio of the gradient integration time τθ to the sample update time τx. When τx is shorter than τθ, multiple samples are shown to the network during a single gradient integration period. As the sample changes, the gradient approximation G will then include gradient information from each of those samples. This integration-in-time process is arithmetically identical to summing multiple examples in parallel (as is often done on a GPU), as the sketch below illustrates. As a concrete example, for hardware that accepts one input example at a time, if τθ = τx, the batch size is 1, but if τθ = 4τx, the batch size is 4. This is shown in Fig. 3 for the same network as in Fig. 2 and using simultaneous discrete perturbations. In Sec. III, we demonstrate that batch sizes as large as 1000 function correctly in MGD.
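A toy check of that arithmetic, with an assumed per-sample gradient function standing in for the hardware's measured G:

```julia
# With τθ = 4τx, four samples contribute to G before one parameter update;
# integrating their contributions in time is the same sum a GPU would
# compute over a batch of four.
g(θ, x, ŷ) = 2 .* (θ .* x .- ŷ) .* x          # per-sample gradient stand-in

θ = [0.2, -0.4]
batch = [([1.0, 2.0], [0.5, 0.3]), ([0.5, 1.5], [0.1, -0.2]),
         ([2.0, 0.3], [0.9, 0.0]), ([1.2, 0.7], [0.4, 0.6])]
G = sum(g(θ, x, ŷ) for (x, ŷ) in batch)      # batch size = τθ/τx = 4
```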
In summary, setting just three time constants and the perturbation type allows the MGD framework to implement a wide range of variations of gradient descent, and depending on the desired approach, one can tailor the training to the needs of the problem and the hardware. For example, MGD can match backpropagation by making τθ arbitrarily long, as we show in Sec. III B—the MGD approximation to a partial derivative converges to the true value in this limit. Thus, the various parameter update steps presented above can be made arbitrarily close to their analytical definitions as implemented using digital arithmetic and backpropagation. At the same time, we show that short τθ values can work equally well. In practice, hardware considerations, such as write-speed limitations, may require an intermediate τθ value, as we discuss in Sec. IV. These features allow the MGD framework to leverage the same training techniques that have been so successful for training machine learning models, while also being compatible with new and emergent hardware.
III. SIMULATION OF MGD
A. Introduction
To characterize the utility of the MGD framework, we first simulated its performance on modern machine learning datasets. The goal of the simulator was not to perform gradient descent as fast as possible on a central processing unit (CPU) or GPU, but rather to emulate hardware implementing MGD and evaluate its potential performance in a hardware context. In particular, we used the simulator to estimate the speed, accuracy, and resilience to noise and fabrication imperfections. The simulator was written in the Julia language, can be run on a CPU or GPU, and is available online.30 The algorithms used in the simulation are shown in the Algorithm 1 and Algorithm 2 boxes, with the parameters and their types listed in Table I.
ALGORITHM 1. Digital (discrete-time) MGD.

1: Initialize parameters θ
2: for n in num_iterations do
3:     if (n mod τx = 0) then
4:         Input new training sample x, ŷ
5:     if (n mod τx = 0) or (n mod τθ = 0) then
6:         Set perturbations Δθ̃ to zero
7:         Update baseline cost C0 ← C
8:     if (n mod τp = 0) then
9:         Update perturbations Δθ̃
10:    Compute output y
11:    Compute cost C
12:    Compute change in cost ΔC̃ ← C − C0
13:    Compute instantaneous error signal e
14:    Accumulate gradient approximation G ← G + e
15:    if (n mod τθ = 0) then
16:        Update parameters θ ← θ − ηG
17:        Reset gradient approximation G ← 0
ALGORITHM 2. Analog (continuous-time) MGD.

1: Initialize parameters θ
2: for t = 0 to T step dt do
3:     if (t mod τx = 0) then
4:         Input new training sample x, ŷ
5:     Update perturbations Δθ̃
6:     Compute output y
7:     Compute cost C
8:     Compute discretized highpass ΔC̃ (time constant τhp)
9:     Compute instantaneous error signal e
10:    Update gradient approximation G (lowpass, time constant τθ)
11:    Update parameters θ ← θ − ηG
TABLE I. Parameters used in the MGD simulations and whether they apply to the analog and/or digital implementations.

| Description | Symbol | Analog or digital |
| --- | --- | --- |
| Change in the cost due to perturbation | ΔC̃ | Both |
| Perturbation to parameters | Δθ̃ | Both |
| Parameters | θ | Both |
| Input sample | x | Both |
| Target output | ŷ | Both |
| Network output | y | Both |
| Cost | C | Both |
| Unperturbed baseline cost | C0 | Digital |
| Gradient approximation | G | Both |
| Instantaneous error signal | e | Both |
| Learning rate | η | Both |
| Perturbation amplitude | Δθ | Both |
| Input-sample change time constant | τx | Both |
| Parameter update time constant | τθ | Both |
| Perturbation time constant | τp | Digital |
| Highpass filter time constant | τhp | Analog |
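For readers who prefer running code to pseudocode, a condensed Julia rendering of Algorithm 1 follows; the toy cost, hyperparameter values, and random ±Δθ perturbations are illustrative assumptions, not the authors' full simulator.

```julia
# Runnable sketch of Algorithm 1 (digital MGD). Note that the baseline cost
# C0 is re-measured with the perturbations zeroed whenever the sample or the
# parameters change.
function mgd_digital!(θ, cost, samples; η=0.02, Δθ=1e-3,
                      τx=4, τθ=4, τp=1, iters=50_000)
    G, pert = zero(θ), zero(θ)
    x = first(samples)
    C0 = cost(θ, x)
    for n in 1:iters
        if n % τx == 0
            x = rand(samples)                        # input new training sample
        end
        if n % τx == 0 || n % τθ == 0
            pert .= 0                                # set perturbations to zero
            C0 = cost(θ, x)                          # update baseline cost C0
        end
        if n % τp == 0                               # update perturbations
            pert .= Δθ .* rand([-1.0, 1.0], length(θ))
        end
        C = cost(θ .+ pert, x)                       # compute output and cost
        e = (C - C0) .* pert ./ Δθ^2                 # instantaneous error signal
        G .+= e                                      # accumulate G ← G + e
        if n % τθ == 0
            θ .-= η .* G                             # update parameters
            G .= 0                                   # reset G
        end
    end
    return θ
end

percost(θ, x) = sum((θ .- x) .^ 2)                   # toy per-sample cost
mgd_digital!(randn(5), percost, [randn(5) for _ in 1:4])
```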
B. Equivalence to backpropagation
We first verified that the simulation was able to minimize the cost for a sample problem and that it was equivalent to gradient descent via backpropagation with appropriate parameter choices. For this initial comparison, we chose to solve the 2-bit parity problem by training a 2-2-1 feedforward network with nine parameters (six weights, three biases). We first simulated the problem with a very large integration time and τθ = τx, such that we achieved a very good approximation of the gradient in G for each training sample. We then ran the same problem using a τθ value of 1, so that the gradient approximation G for each sample was relatively poor. We measured both the number of epochs and the amount of time (number of iterations of the simulation) for the two experiments, and the results are shown in Fig. 4. Here, an epoch was defined in the typical way—equivalent to the network being shown all training examples of a dataset.
Comparing the two scenarios in terms of epochs in Fig. 4(a), one can observe that a value of τθ = τx = 1000 resulted in the system following a nearly identical training trajectory as backpropagation. This is as expected—for each sample shown to the network, the gradient approximation G has 1000 timesteps to integrate a very accurate estimate that should be very close to the true gradient computed by backpropagation. When τθ = τx = 1, however, each sample only has a single timestep to estimate the gradient before moving on to the next sample. As a result, the samples have to be shown to the network many more times to minimize the cost, resulting in a much larger number of epochs.
However, while the τθ = τx = 1 case uses the sample data less efficiently (requiring more epochs), there is actually a trade-off between data efficiency and run time. If we plot the cost vs iterations rather than vs epochs, we get a more accurate estimate of how long it will take hardware to train in terms of real time. As shown in Fig. 4(b), one can see that the shorter τθ and τx values take approximately half the time to minimize the cost as the longer values. These examples serve to highlight that while longer integration times produce a more accurate gradient approximation, integration times as short as τp may also be used to train a network, consistent with the findings of Ref. 13. In fact, shorter integration times may even be faster to train in terms of operational time.
To quantify the effect of longer integration times on the accuracy of the gradient approximation, we measured how the gradient approximation G converged to the true gradient ∂C/∂θ (as computed by backpropagation) as a function of time. For this experiment, we ran the simulation with τθ = ∞ and τx = τp = 1 so that it continuously integrated G without ever resetting it or updating the parameters. As the simulation ran, we repeatedly computed the angle between the true gradient ∂C/∂θ and the approximation G. The results are shown in Fig. 5 for the 2-bit parity, 4-bit parity, and NIST7x7 problems. The NIST7x7 dataset is a small image recognition problem based on identifying the letters N, I, S, and T on a 7 × 7 pixel plane. The dataset has the property that it cannot be solved to greater than 93% accuracy with a linear solve for a 49-4-4 feedforward network with sigmoidal activation functions. We also chose this network and dataset because it was small enough to perform many different simulations to acquire statistics, while the problem space is large enough that random solutions are unlikely.
As expected, the angle decreased with time as G aligned with the true gradient. The time axis is in units of τp, which is the minimum discrete timestep in this system. For a real hardware platform, this timestep is approximately the inference time of the system. In general, the more parameters the network has, the longer it takes to converge to the true gradient.
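The convergence metric of Fig. 5 can be computed directly; a one-line Julia version of our reading of it:

```julia
# Angle between the accumulated approximation G and the true gradient g
# (from backpropagation): 0° means perfect alignment.
using LinearAlgebra
grad_angle(G, g) = acosd(clamp(dot(G, g) / (norm(G) * norm(g)), -1.0, 1.0))
```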
C. Details of mini-batching
We next investigated in more detail how τθ and τx affect the training time. Longer τθ values result in a more accurate gradient approximation but reduce the frequency of parameter updates. Using a fixed, low η value, we trained a 2-2-1 network to solve 2-bit parity (XOR) for 100 different random parameter initializations, varying τθ but keeping the batch size τθ/τx constant at either 4 or 1. Since the 2-bit parity dataset is composed of four (x, ŷ) pairs, τθ = 4τx is analogous to gradient descent—all four samples are integrated into the gradient approximation G before performing a weight update. When τx = τθ, the network performs stochastic gradient descent (SGD) with a batch size of 1. Figure 6(a) shows the training time as a function of τθ for these two cases. Here, training time corresponds to the time at which the total cost C dropped below 0.04, indicating the problem was solved successfully. In the case where the batch size was 1, we observed that increasing τθ increased the training time. However, when the batch size was 4, increasing τθ had little effect on the training time.
As with any training process, the training can become unstable at higher η values and fail to solve the task. Since the results in Fig. 6(a) were only for a fixed learning rate, we also wanted to examine the effect of τθ on the maximum achievable η. Here, we defined the “max η” as the maximum learning rate where the network successfully solved the 2-bit parity problem for at least 50 out of 100 random initializations. As seen in Fig. 6(b), as τθ is increased, the max η decreases, resulting in longer minimum training times. This was true whether the batch size was large or small, although the larger batch sizes had higher achievable η values overall.
From these results, we infer that a poor gradient approximation taken with respect to all training examples is more useful than an accurate gradient collected with respect to a single example. Stated another way, waiting a long time for an extremely accurate gradient and then taking a large step is less productive than taking a series of short (but less accurate) steps. This is an important conclusion for hardware, as it shows that implementing an effective gradient descent process in MGD does not necessarily require additional memory to store accurate, high-bit-depth gradient values. Note that in our implementation, G accumulates with time, and so the size of the parameter update ηG from Eq. (4) grows proportionally to the integration time. This meant that when τθ was larger, the effective step in the direction of the gradient was also larger, and so for fixed η the rate of training remained approximately constant. If this were not the case, whenever τθ was doubled, we would also need to reduce η by half to maintain the same approximate rate of training.
D. Analog and digital perturbations
The parameter perturbations can take many different forms, as long as they are low-amplitude and their time averages are pairwise orthogonal or, in a statistical setting, uncorrelated.14 We have implemented four types of perturbations: sinusoidal perturbations, sequential discrete perturbations, discrete code perturbations, and random code perturbations. Sinusoidal perturbations are like those shown in Fig. 1(a), where each parameter is assigned a unique frequency. Sequential discrete perturbations refer to the case where parameters are sequentially perturbed, one at a time, by +Δθ, as described in the finite-difference implementation of Sec. II B. "Code" perturbations are simultaneous discrete perturbations of {−Δθ, +Δθ} applied to every parameter every τp timesteps. We call them code perturbations due to their similarity to code-multiplexing (spread-spectrum) techniques used in wireless communication technologies. There are two flavors of code perturbations: the first consists of a predefined set of pairwise-orthogonal square wave functions that take the values of {−Δθ, +Δθ}. Each of these perturbation patterns is a deterministic sequence, and no two parameters have the same sequence. One example is the Walsh codes used in modern cell-phone communications. The second consists of randomly generated sequences of {−Δθ, +Δθ} that are pairwise uncorrelated. We call these "statistically orthogonal." The statistically orthogonal case is slightly less efficient than the deterministic orthogonal codes, since perturbations from multiple parameters interfere with each other more in ΔC̃—any finite sample of the perturbations will have a non-zero correlation that decreases to zero with sample size. However, the statistically orthogonal version allows the perturbations to be generated locally and randomly. These perturbations may be very useful in hardware implementations, as they are spread-spectrum, and single-frequency noise from external sources is unlikely to corrupt the feedback signals.
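The following Julia sketches show one plausible construction of each perturbation type; the amplitudes, frequency plan, and code length are assumptions, and the Sylvester–Hadamard recursion stands in for whatever orthogonal code family the hardware uses.

```julia
# Rows of a Sylvester-Hadamard matrix are pairwise-orthogonal ±1 codes
# (Walsh-like); row 1 is all ones, so codes are assigned from row 2 upward.
hadamard(k) = k == 0 ? fill(1, 1, 1) :
              (H = hadamard(k - 1); [H H; H -H])

sinusoidal(i, t; Δθ=1e-3, f0=1.0) = Δθ * sin(2π * f0 * i * t)  # unique fᵢ
sequential(i, n, P; Δθ=1e-3) = mod1(n, P) == i ? Δθ : 0.0      # one at a time
walsh(i, n; Δθ=1e-3, k=8) = Δθ * hadamard(k)[i + 1, mod1(n, 2^k)]
random_code(P; Δθ=1e-3) = Δθ .* rand([-1, 1], P)   # statistically orthogonal
```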
To compare the training performance of the different perturbation types, we applied each of them to the same 2-2-1 network described in Sec. III C in order to solve the 2-bit parity problem. In particular, we aimed to show that training can happen in either a purely analog or a purely digital way. In practice, many systems have hybrid analog–digital features, and a combination approach may be used.
Figure 7 shows the training time distributions for the 2-bit parity problem using the different perturbation types. The bandwidth for sinusoidal perturbations was set to 1/(2τp). As expected, we found that the different perturbation types are approximately equivalent in terms of speed of training. This equivalence makes sense when one considers that the feedback from ΔC̃ has a finite bandwidth that must be shared among all the parameters—no matter the encoding (perturbation) scheme, the information carried in that feedback to the parameters will be limited by that finite bandwidth.
E. Operation on noisy or imperfect hardware
The fabrication defects and signal noise present in emerging hardware platforms pose significant challenges for current training techniques in hardware. For example, weight parameters may differ from design or update non-deterministically, causing a discrepancy between the model and the actual hardware that can negatively affect training.8,31,32 In Ref. 32, a simulated network achieves 97.6% accuracy on the MNIST dataset, but the hardware diffractive optical network implementation only obtains 63.9% accuracy when the pre-trained model is transferred directly to the hardware without in situ training, while Ref. 33 shows that a 3% component variation can lead to a 7% drop in classification accuracy without error correction. Reference 8 shows how small discrepancies in modeled parameters can blow up in general hierarchical problems: in their toy problem, a 0.5% error in a model parameter leads to a 30% error in output after 20 layers. The solutions to these non-idealities can be cumbersome for offline training. For example, the actual value of the weights may have to be regularly measured during training,31 imperfections may be carefully measured and accounted for in the models,33,34 or weights may be quantized with low bit-depth, where the bit depth is chosen such that the system can be modeled and controlled accurately despite device-to-device variations. This is common in hardware platforms, such as memristive crossbar arrays35 or phase change materials,36 where bit depths of 6–8 can be achieved.
Here, we investigate the effects of three different types of imperfections and noise that could affect hardware systems: (1) stochastic noise on the output cost Cnoise, (2) stochastic noise on the parameter update θnoise, and (3) per-neuron defects in the activation function, where each neuron has a randomly scaled and offset sigmoidal activation function that is static in time. These tests were performed on the NIST7x7 dataset using the previously described 49-4-4 network with 220 parameters and with τx = τθ = 1 unless otherwise stated.
In the first test, we added Gaussian noise with mean zero and standard deviation σC to the cost, applied every timestep such that the noisy cost is C(t) = Cideal(t) + Cnoise(t; σC). In optical hardware, for example, this could be noise due to laser fluctuations. Noise in the cost is also equivalent to any other Gaussian mean-zero noise affecting the accumulated gradient G. Figure 8(a) shows the effect on the training time for different learning rates as σC was increased. For a given learning rate, there is a threshold amount of noise below which the training time is minimally changed. However, as cost noise increases, the training time eventually increases and the training ultimately stops converging.
To see how this noise would affect the minimum training time for optimized learning rates, we also measured the maximum achievable η value for a range of σC. Figure 8(b) shows this maximum η value vs cost noise, and the corresponding minimum training time. The trend indicates that the lower the cost noise σC, the higher we can make η and the faster we can train. Stated in reverse, one can compensate for large amounts of cost noise in a system simply by reducing the learning rate.
In the second test, we added Gaussian mean-zero noise θnoise with standard deviation σθ to every parameter update [Eq. (5)]. We found that larger values of σθ can prevent convergence entirely [Fig. 9(a)]. Curiously, in the presence of this noise, increasing η can actually improve the convergence of the problem, as highlighted by the σθ = 0.1 and σθ = 0.3 lines in Fig. 9(a). We believe this is likely because at very small η values, θnoise will overwhelm the very small ηG in Eq. (5). By making the η term larger, one can prevent ηG from being drowned out by θnoise. Of course, at very large η values, the usual gradient-descent instability starts to dominate, and convergence again approaches zero. For a given η, we found that small values of σθ marginally increase the training time, but the effect is less significant than simply changing the learning rate η [Fig. 9(c)].
Another way to reduce the impact of θnoise is to increase the integration time of the gradient. When τθ is increased, the value of G is accumulated for a longer time and becomes proportionally larger. For instance, when τθ is 100 times larger, the value of ηG becomes 100 times larger, meaning the relative effect of θnoise in Eq. (5) is 100 times smaller. The result can be seen in Figs. 9(b) and 9(d), where even the largest σθ value has little effect on training.
In the final test, we analyzed the effect of including "defects" in the neuronal activation functions. Here, neuronal activation functions were no longer identical sigmoid functions but had fixed random offsets and scalings that were static in time. These variations were meant to emulate device-to-device variations that may be found in hardware, for instance, in analog VLSI neurons.39 The sigmoid activation function for each neuron k was modified to a general logistic function fk(z) = αkσ(βkz + ak) + bk, where σ(z) = 1/(1 + e^−z). The variations were all Gaussian, and the scaling factors αk, βk had a standard deviation σa and a mean of 1, while the offset factors ak, bk also had a standard deviation of σa but were mean-zero.
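In Julia, drawing one such defective neuron might look like the following; the exact functional form above is our reconstruction from the stated scale and offset factors, so treat it as an assumption.

```julia
# Draw a per-neuron "defective" activation: random scales (α, β) with mean 1
# and random offsets (a, b) with mean 0, all with standard deviation σa,
# frozen for the duration of training ("static in time").
σ(z) = 1 / (1 + exp(-z))
function defective_neuron(σa)
    α, β = 1 .+ σa .* randn(2)
    a, b = σa .* randn(2)
    return z -> α * σ(β * z + a) + b
end
act = defective_neuron(0.25)   # e.g., the largest σa that still trained well
```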
As seen in Fig. 10(a), adding defects to the network's activation functions had a relatively small effect on the training time. Even with relatively large variations in the activation functions (σa = 0.25), the network only took about twice as much time to fully train on the NIST7x7 dataset. However, deviations that were too large (σa > 0.25) prevented the training from converging. This is likely due to the output neurons being so scaled and offset that it was no longer possible for them to produce the correct output.
Overall, we found that MGD is robust against different types of noise and against fabrication imperfections of a scale comparable to what would be seen in a real hardware system. In general, these non-idealities affected the training speed of the network but did not prevent the desired problem from being solved.
F. Modern dataset results
To assess the scalability and relevance of MGD for larger machine learning problems, we compared MGD and backpropagation on a variety of tasks for different network architectures and hyperparameters. Table II compares the accuracies obtained with MGD and backpropagation for different datasets and various hyperparameter choices (τθ, τp, η, batch size), with τx fixed at 1. To make an effective comparison, for the backpropagation results, we used a basic stochastic gradient descent (SGD) optimizer without momentum. Both strategies used a mean squared error (MSE) cost function. These choices allowed us to avoid confounding effects from more complex strategies, although we note that MGD is capable of implementing several of these more complicated features (e.g., momentum, dropout, etc.). In all cases, we used the same network architectures for both sets of results. The network architectures were not selected to reach state-of-the-art accuracy values; instead, smaller networks were chosen for purposes of rapid testing and statistical analysis. To improve the accuracy further, more advanced network architectures (e.g., more layers) or optimizers (e.g., adding momentum) would be needed. For the purposes of meaningful comparison, we measured the accuracy obtained in the MGD process after fixed numbers of timesteps. This allowed us to estimate actual training time for various hardware configurations in Sec. IV A.
TABLE II. Test accuracies for MGD after 10⁴–10⁷ training steps, compared with backpropagation, for several tasks and hyperparameter choices (τx = 1 in all cases).

| Task | Network | |θ| | τθ | τp | η | Batch size | 10⁴ steps | 10⁵ steps | 10⁶ steps | 10⁷ steps | Backprop |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2-Bit parity | 2-2-1 | 9 | 1 | 1 | 5 | 1 | 100% | 100% | 100% | 100% | 100% |
| N-I-S-T | 49-4-4 | 220 | 1 | 1 | 3 | 1 | 38% | 81% | 94% | 97.7% | 99.8% |
| N-I-S-T | 49-4-4 | 220 | 1 | 1 | 0.5 | 1 | 22% | 45% | 93% | 98.7% | 99.8% |
| Fashion-MNIST | 2-layer CNN | 14 378 | 1 | 1 | 9 | 1000 | 34.2% | 66.3% | 79.3% | 83.5% | 88.6% |
| Fashion-MNIST | 2-layer CNN | 14 378 | 10 | 1 | 9 | 1000 | 34.3% | 66.3% | 79.2% | 83.4% | 88.6% |
| Fashion-MNIST | 2-layer CNN | 14 378 | 100 | 1 | 9 | 1000 | 35.3% | 66.3% | 77.7% | 84.7% | 88.6% |
| Fashion-MNIST | 2-layer CNN | 14 378 | 1000 | 1 | 9 | 1000 | 35.3% | 59.6% | 79.1% | 86.1% | 88.6% |
| CIFAR-10 | 3-layer CNN | 26 154 | 1 | 1 | 9 | 1000 | 12% | 23% | 43.8% | 60.7% | 68% |
The CIFAR-10 network was composed of three convolution and max-pool layers followed by a fully connected layer. The convolutional filters were 3 × 3 with 16, 32, and 64 output channels, respectively, with stride 1 and ReLU activation functions. Each convolution was followed by a 2 × 2 max-pool layer. The final fully connected layer converted the 256 outputs of the convolutional layers to the 10 classes, and no softmax was used. The Fashion-MNIST network was similarly composed, comprising two convolution and max-pool layers followed by a (32 × 10) fully connected layer and no softmax. The networks used to solve NIST7x7 and XOR were fully connected feedforward networks with sigmoidal activation functions. To make the comparisons in Table II as fair as possible, the backpropagation accuracies of each row were maximized by training the networks with the listed batch size and a variety of η values. The backpropagation accuracy reported is the highest median accuracy obtained across all the η values tested, with the median taken over five random initializations that each ran for 2500 epochs (long after the accuracy had converged).
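As a cross-check of this description, a Flux.jl sketch of the CIFAR-10 network (our reconstruction; padding, initialization, and training details are assumptions) reproduces both the 256 features entering the final layer and the stated parameter count of 26 154:

```julia
using Flux

# 3×3 convolutions, stride 1, no padding; each followed by 2×2 max-pooling.
model = Chain(
    Conv((3, 3), 3 => 16, relu),  MaxPool((2, 2)),  # 32×32 → 30×30 → 15×15
    Conv((3, 3), 16 => 32, relu), MaxPool((2, 2)),  # 15×15 → 13×13 → 6×6
    Conv((3, 3), 32 => 64, relu), MaxPool((2, 2)),  # 6×6 → 4×4 → 2×2
    Flux.flatten,                                   # 2×2×64 = 256 features
    Dense(256 => 10),                               # 10 classes, no softmax
)
# sum(length, Flux.params(model)) == 26_154, matching |θ| in Table II
```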
By training networks on the 2-bit parity, NIST7x7, Fashion-MNIST, and CIFAR-10 datasets, we found that MGD approached the solution accuracy of backpropagation. However, even after 10⁷ steps, we observed that the MGD-trained networks still had slightly lower accuracy than could be achieved by backpropagation. This is likely because the convergence criteria of the gradient approximation were not met, owing to our use of fixed η values—for instance, the theory underlying the SPSA process only guarantees convergence with a learning rate that asymptotically approaches zero.13 In general, we observed that while larger learning rates achieved higher accuracies earlier on, lower learning rates achieved higher accuracies at longer times (see, e.g., the NIST7x7 results). Customized learning-rate schedules are therefore likely to achieve better training times and accuracies.
The simulations also indicate that, for the larger datasets shown here, changing τθ had a marginal effect on the maximum accuracy of MGD at a fixed η value, with increased values of τθ leading to slightly higher final accuracies. This may be because longer τθ values produce more accurate gradient estimates and allow the network to better optimize the weights as the system approaches a local minimum in the cost, although more experiments and statistics are needed to verify whether this effect occurs in general. Fortunately, our results show this effect is relatively small, allowing the hardware designer flexibility in how often weight updates need to be performed.
These results show explicitly that MGD is a viable technique for training emerging hardware platforms on real machine learning datasets. Previous theoretical studies have focused on convergence proofs and computation of gradient approximations to varying degrees of accuracy13,14 or on simplified problems, such as linear activation functions.40 While these theoretical results are clearly important and necessary, they are not sufficient to give insight into the applicability of these optimization techniques to real machine learning problems on experimental hardware platforms. Meanwhile, previous experimental and simulation studies have for the most part focused on small-scale problems for which 100% accuracy is trivial, such as XOR,15,16 few-bit parity,17,21,22 the iris dataset,25 or other similar toy problems.18,19,26 These are not representative of larger machine learning datasets, which typically have on the order of tens of thousands of training examples and are trained on networks with even more parameters.
These results go further than any of these previous studies, by elucidating the utility of these techniques for larger, modern, machine learning datasets, and implementing a framework that can connect different optimization techniques (SPSA, finite-difference, approximate gradient descent) via specific hardware parameters.
It is also worth noting that the random weight change (RWC) algorithm has previously been implemented in memristor networks24,41 and, while superficially similar to MGD, is functionally very different. RWC is not an approximate gradient descent technique: the weight update is not scaled by the magnitude of the change in the cost; rather, random perturbations are either kept or discarded based on whether or not they improve the cost. Because of this, it scales more poorly with the number of parameters than the optimization techniques we have implemented in the MGD framework.
IV. HARDWARE CONSIDERATIONS
A. Time constants
While the MGD framework is general, individual hardware platforms may have different practical considerations that dictate the optimal choices for the different time constants. Here, we discuss potential issues for implementation on hardware and point out trade-offs that may exist.
Perturbation speed (τp): In many hardware platforms, perturbing parameters directly may be difficult or undesirable due to either memory-endurance or update-speed concerns. However, it is not necessary to perturb parameters directly—instead, one may perturb a separate element that is in series with the parameter. For instance, in a photonic MZI array, the weights may be implemented with slow (∼ms) thermal or mechanical phase shifters, but the small perturbation could be implemented with a fast (∼ns) electro-optic modulator, such as a depletion46 or opto-mechanical47 modulator. Alternatively, in a memristor system, the perturbation could be implemented with a single sub-threshold transistor in series with the memristor.
For perturbative-style techniques such as MGD to function properly, τp should be no shorter than the system's inference time,14 which includes the cost-calculation and broadcast processes. If τp is too short, the input perturbation and the resulting cost feedback can become misaligned.
Parameter update (τθ): When performing gradient descent on hardware, parameters must be updated multiple times. For a training period of length T, the weights will be updated T/τθ times. For some hardware platforms, it may be advantageous to reduce the number of parameter updates, for instance, if the parameters take a long time to update, the update process requires a large amount of energy, or the underlying parameter values are stored in memory elements with a limited lifetime. Fortunately, Table II shows that for networks with large numbers of parameters, τθ can be increased without greatly impacting the overall training speed or final accuracy.
Changing training examples (τx): The value of τx is limited by the speed with which training examples can be input to the network. For most technologies, τθ or τx will not be limiting factors, as x and ŷ samples can be provided by a conventional computer or FPGA at nanosecond timescales. However, it should be noted that MGD will not function correctly if τx is shorter than the inference time of the hardware, and this will likely be the practical limit for τx. τx is also used as a way of controlling the batch size (given by τθ/τx). The particulars of the task and dataset may determine the desired batch size, and this may have some effect on τx, for example, if the minimum allowable value of τθ is very long.
Calculation of cost: The calculation of the cost is performed only at the output of the system, and therefore it is acceptable for the hardware used in its computation to be more expensive and complex. In integrated electronic systems, the cost can be straightforward to implement, although it may occupy relatively more chip real estate than other elements. However, for very exotic hardware platforms, this cost computation may still be difficult to perform on-chip. This can be addressed by using a non-standard cost function or by using a computer to calculate the cost off-chip. However, the input and output to an off-chip computation may limit the speed of the cost computation, and also limit the speed of the global broadcast of ΔC̃, and so may be another speed bottleneck in some hardware systems.
Based on the literature, some estimates of plausible time constants that could be implemented in hardware are shown in Table III, along with the corresponding times to solve benchmark tasks based on the step counts from Table II. Examples from the literature of hardware platforms that could implement these sets of parameters are shown in the final row. Comparing these results with the final column of Table III, we see that using realistic estimates for emerging hardware, MGD could be significantly faster than current implementations using backpropagation.
TABLE III. Plausible hardware time constants (HW1–HW3) and the corresponding training times for the benchmark tasks of Table II, compared with backpropagation on a conventional processor.

| | HW1 | HW2 | HW3 | Backprop |
| --- | --- | --- | --- | --- |
| τx (input-sample update time) | 100 ns | 1 ns | 10 ps | n/a |
| τp (perturbation time) | 1 ms | 10 ns | 200 ps | n/a |
| τθ (parameter-update time) | 1 ms | 1 µs | 200 ps | n/a |
| 2-Bit parity training time (10⁴ steps) | 20 s | 200 µs | 4 µs | 70 ms† |
| Fashion-MNIST training time (10⁶ steps) | 33 min | 20 ms | 400 µs | 54 s* |
| CIFAR-10 training time (10⁷ steps) | 5.6 h | 200 ms | 4 ms | 480 s* |
| Examples of hardware | Chip-in-the-loop, integrated photonics with thermo-optic tuning12,42 | Mem-compute devices,43 analog VLSI44 | Superconducting devices,45 athermal resonant silicon photonic modulator46 | †CPU/*GPU |
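The training-time entries in Table III appear consistent with each MGD step costing roughly two inference passes (one perturbed measurement plus one baseline measurement), i.e., time ≈ 2 · τp · steps; a quick Julia check of that reading:

```julia
# Back-of-envelope training-time estimate, assuming two inference passes
# (perturbed + baseline) per MGD step.
train_time(τp, steps) = 2 * τp * steps          # seconds
train_time(1e-3, 1e7) / 3600                    # HW1, CIFAR-10: ≈ 5.6 h
train_time(10e-9, 1e6)                          # HW2, Fashion-MNIST: 0.02 s
train_time(200e-12, 1e4)                        # HW3, 2-bit parity: 4 µs
```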
B. Analog and digital implementations
There are a few notable differences between an analog and a digital hardware implementation of the MGD framework. The discrete case requires one memory element at the network output to store C0 (a sample-and-hold), and a simple subtraction operation to compute ΔC̃ = C − C0. The network also requires some timesteps to be devoted to the measurement of C0. This means that for the simple case of τθ = τp, only a single additional memory element is required for training the entire hardware system, located at the network cost output.
However, in the case where τθ > τp, an additional memory element is required for every parameter (for instance, an analog capacitor or discrete memory) to perform the gradient integration. The analog case requires a lowpass filter at every parameter element and a single highpass filter on the network output to convert C to ΔC̃. These could be created with simple analog circuits such as RC or LR filters. However, we note that the use of a continuous highpass filter can cause implementation issues in some cases. For instance, if η is too large, rapid changes in θ can generate unwanted frequency components that mix with the perturbation input and may negatively affect the gradient approximation. Moreover, if the dataset is discrete, jumps in x can propagate high-frequency noise through C and ΔC̃, again negatively affecting the gradient approximation. The simulations demonstrate that, in principle, MGD can be used on different hardware platforms of either an analog or digital nature for network training.
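As one illustration of the analog signal path, the sketch below gives discretized one-pole filters of the kind such RC circuits would implement; the discretization and time constants are assumptions.

```julia
# One-pole RC-style filters: the highpass strips the baseline C0 from the
# cost to give ΔC̃, and a per-parameter lowpass with time constant τθ
# integrates the error signal into G.
function highpass(C::Vector{Float64}, τhp, dt)
    α = τhp / (τhp + dt)
    ΔC = similar(C)
    ΔC[1] = 0.0
    for n in 2:length(C)
        ΔC[n] = α * (ΔC[n-1] + C[n] - C[n-1])   # passes fast changes only
    end
    return ΔC
end

lowpass_step(G, e, τθ, dt) = G .+ (dt / τθ) .* (e .- G)  # leaky integrator
```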
V. DISCUSSION
An interesting analogy can be drawn between the MGD training framework and wireless communication systems. In wireless communications, cell phone users must be able to extract the signal that is relevant to their own communication from a received broadcast that contains multiple users’ signals. MGD operates similarly—each parameter is analogous to an individual cell phone user and the global cost broadcast is analogous to the cell tower transmitter. The broadcasted signal is available to all parameters, and each parameter extracts its relevant (gradient) information—analogous to voice data—from that broadcasted signal.
The encoding and multiplexing techniques used in wireless communications are broadly termed “multiple access” techniques in communications, and there are several varieties of them, including frequency, time, and code multiplexing. Similarly, in MGD, these multiplexing techniques are directly analogous to the different perturbation types discussed in Sec. III D. For instance, sinusoidal perturbations can be considered a type of frequency-multiplexing: each “user” (parameter) has a unique frequency containing information relevant to it, and must perform a time-integration to extract that information from the signal that contains many other, irrelevant signals. Likewise, as mentioned in Sec. III D, the discrete {−Δθ, +Δθ} perturbation type is analogous to the code-multiplexing techniques used in modern cell phones.
It is important to note that for a fixed-bandwidth broadcast, all multiplexing techniques have the same nominal spectral efficiency. However, the particulars of real systems can greatly affect which multiplexing techniques are most effective. For example, because code-multiplexing spreads a user's signal over a wide time window and a wide spectrum, it has been shown to have improved robustness to multipath distortion and timing errors compared to time-multiplexing.
Similarly, the various perturbation types described earlier may operate better or worse depending on the particulars of the hardware network and environment. Newly developed communications techniques may also be directly applicable to MGD training in real hardware systems. Key to its usage in hardware, each parameter can operate like a cell phone—as an independent device with a receiver, processor, and memory using only information available locally and from the global broadcast. This contrasts with most other training algorithms, where error signals must propagate through the entire network.10 While trivial in software, this type of upstream information-flow means that hardware neurons must (1) have additional memory to process the error signal, (2) perform different operations for forward-inference and reverse error propagation, and (3) operate in a synchronized fashion to propagate the error correctly.
In systolic arrays, "broadcast" is often discouraged because the time and energy of communication scale as CV², with the capacitance C growing with the number of inputs; a larger capacitance in turn demands a larger amplifier, which requires additional energy and area. However, in other types of analog or emergent hardware, broadcast may be a more energy- and space-efficient process. For example, in optical hardware, a global optical signal could be made accessible in a simple slab waveguide mode. Alternatively, in analog electrical hardware, a single electrical plane could be used to carry the signal back to the individual parameters.
The MGD framework may also be much more biologically plausible than other algorithms. A host of algorithms are being developed to address the mystery of how the brain learns and to use biology as inspiration for finding more hardware-friendly approaches.10,48,49 One of the major issues with backpropagation, from both a practical and a bio-plausibility standpoint, is the need for a separate type of operation in the forward (inference) and backward (learning) phases. This is not the case for MGD: in MGD, the training is done in the same configuration used for inference. In a biological analogy, a global cost signal could be modeled as some type of quasi-global chemical signal in the brain.
In the future, the MGD algorithm could be extended such that the cost is not global to every synapse, but rather to specific clusters of neurons/synapses. There are chemical signals in the brain with these types of properties. The time scales needed for perturbation may line up with biologically observed effects, such as short-term plasticity (min)50,51 and synaptic noise.52 MGD is also consistent with “three-factor” learning rules for which there is mounting experimental evidence.51 Additionally, although not explored in this work, MGD should work in a spiking model. Similar algorithms that use perturbations in neuron conductance,53 probabilistic synapse transmission,54 or the stochastic nature of spike-timing arrival55 for reward-based learning have been explored in small spiking networks.
VI. CONCLUSIONS
In this paper, we show that with realistic timescales for emerging hardware, training via MGD could be orders of magnitude faster than backpropagation in terms of wall-clock time-to-solution on a standard GPU/CPU. The MGD framework allows the implementation of multiple optimization algorithms using a single, global cost signal and local parameter updates. The style of algorithm used (e.g., finite-difference, coordinate descent, or SPSA) can be adjusted via the tuning of the MGD time constants and can even be adjusted during training if desired. Because it is a model-free perturbative technique (sometimes called zeroth-order optimization), it is applicable to a wide range of systems—it can be applied to both analog and digital hardware platforms, and it can be used in the presence of noise and device imperfections. This overcomes a major barrier to using hardware platforms based on emerging technologies, which are often difficult to train.
Going forward, there are many opportunities to explore the application of MGD on both large and small hardware systems. The most direct next step will be to test the training performance in existing hardware, for example, on photonic42 or memristive crossbar hardware using a chip-in-the-loop process. This can occur without even redesigning the hardware, as MGD only requires the ability to input samples, capture inference output, and modify parameters when in a chip-in-the-loop configuration. When used in this fashion, the speed will most likely be limited by system I/O. For example, perturbations can be injected directly into the hardware from an external computer, and that same computer could capture the changes in cost, perform the gradient approximation, and calculate the weight updates. This would allow testing of the algorithm without any changes to the hardware. Ultimately, to overcome I/O limitations, local circuits can be designed and implemented for autonomous online training. Although not examined in detail here, many of the hardware platforms examined would likely also gain several orders of magnitude in energy efficiency in this case.
There is also considerable room for improvement in tuning the technique and its hyperparameters. In this paper, we performed very little hyperparameter optimization or regularization, and so there are likely opportunities for further optimization by examining techniques such as dropout, momentum, and regularization. Also of interest is examining the node-perturbation version of the algorithm56 on large datasets, which could significantly reduce the number of perturbative elements and speed up the training process.
While in this paper we only examined feedforward networks, it has already been demonstrated that it is possible to use perturbative techniques to train recurrent neural networks,57 spiking networks,54 and other non-standard networks at small scale; future work remains to demonstrate their utility on problems of modern interest. This will have applicability to a wide range of neuromorphic hardware and other physical neural networks.
Ultimately, MGD is well-suited to implementation directly on-chip with local, autonomous circuits. Although significant work remains before autonomous learning can be truly implemented without the participation of an external computer, such a device would be truly enabling for remote and edge-computing applications, allowing training in-the-field on real-time data.
ACKNOWLEDGMENTS
The authors acknowledge helpful discussions with the NIST NCSG community. This research was funded by the NIST (https://ror.org/05xpvk416) and the University of Colorado Boulder (https://ror.org/02ttsq026).
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Adam N. McCaughan: Conceptualization (equal); Formal analysis (equal); Investigation (equal); Software (equal); Writing – original draft (equal). Bakhrom G. Oripov: Formal analysis (equal); Software (equal); Writing – review & editing (equal). Natesh Ganesh: Formal analysis (supporting). Sae Woo Nam: Formal analysis (equal); Funding acquisition (equal); Project administration (equal). Andrew Dienstfrey: Conceptualization (equal); Formal analysis (equal); Project administration (equal). Sonia M. Buckley: Conceptualization (equal); Formal analysis (equal); Methodology (equal); Project administration (equal); Writing – original draft (equal).
DATA AVAILABILITY
The data that support the findings of this study are openly available in multiplexed-gradient-descent-paper at https://github.com/amccaugh/multiplexed-gradient-descent-paper.30