As the size and ubiquity of artificial intelligence and computational machine learning models grow, the energy required to train and use them is rapidly becoming economically and environmentally unsustainable. Recent laboratory prototypes of self-learning electronic circuits, such as “physical learning machines,” open the door to analog hardware that directly employs physics to learn desired functions from examples at a low energy cost. In this work, we show that this hardware platform allows for an even further reduction in energy consumption by using good initial conditions and a new learning algorithm. Using analytical calculations, simulations, and experiments, we show that a trade-off emerges when learning dynamics attempt to minimize both the error and the power consumption of the solution—greater power reductions can be achieved at the cost of decreasing solution accuracy. Finally, we demonstrate a practical procedure to weigh the relative importance of error and power minimization, improving the power efficiency given a specific tolerance to error.
I. INTRODUCTION
There has been a meteoric rise in the adoption and usage of artificial intelligence (AI) and machine learning (ML) tools in just the past 15 years,1,2 accompanied by an equally spectacular rise in the sizes of ML models and the amount of computation required to train and apply them.3,4 In recent years, the energy required to train state-of-the-art ML models, as well as to use the trained models, has been rising exponentially, doubling every 4–6 months.5 This energy cost will eventually severely constrain further increases in model complexity and already constitutes significant economic and carbon costs.6–8
The field of neuromorphic computing9–13 strives to recreate the ability to learn in hardware. A major motivation for the development of neuromorphic systems is the possibility of massive energy savings compared to ML implemented on standard computers.13,14 Many proposals for synthetic “neurons” and “synapses” have been laid out over the past three decades, promising lower power consumption compared to standard computers by 2–5 orders of magnitude.15–18 While much neuromorphic computing research has been focused on the development of power-efficient hardware, usually for performing inference (applying already-trained ML models), some attention has recently been given to the study of power-efficient learning “algorithms.”19–22 However, most neuromorphic hardware implementations considered thus far specifically attempt to mimic standard ML algorithms, such as backpropagation23–25 or phenomenological neural synaptic learning processes such as STDP (spike-timing-dependent plasticity).26–30
Recently, a new avenue was opened toward realizing power-efficient neuromorphic computing, dubbed physical learning machines or self-learning physical networks.31 Rather than mimicking known learning algorithms, such as backpropagation, such systems exploit their inherent physics in order to learn, using local learning rules that modify the learning degrees of freedom based on locally available information, such that the system globally learns to perform desired tasks. A certain class of local learning rules, known as contrastive learning,32–36 describes how learning degrees of freedom should be modified in order for systems to achieve desired outputs in response to inputs supplied by observed examples of use (i.e., supervised learning).
In order to realize any power gains, such learning rules must be implemented in hardware. Coupled learning, a particular contrastive learning rule, has been realized successfully in laboratory hardware for electronic circuits of variable resistors.37–40 Such systems already consume less power than conventional computers doing inference because they are analog rather than digital.41 Here, we use analytical theory, computation, and experiments to show that the propensity of electronic circuits to minimize power dissipation enables even greater reductions in power consumption via appropriate initialization and power-efficient learning rules. We specifically demonstrate these results for regression tasks. However, it should be noted that our analysis and results should apply to other physical learning machines in different physical media (e.g., mechanical networks) if they can be developed in the lab,42 as well as to other types of problems (e.g., classification).
This paper is organized as follows: in Sec. II, we describe the physical learning approach and discuss how the power consumption of the system is modified by learning, in particular, as we change the initial conditions of the learning degrees of freedom. A judicious choice of initial conductances yields learning solutions with low power consumption, while also reducing the energy consumed in training. In Sec. III, we introduce a modification to the local physical learning rule in order to minimize both error and power consumption. We analyze this new local rule theoretically and test it in simulations and lab experiments, concluding that it leads to an error–power trade-off; lower-power solutions may be obtained at the expense of higher errors. The energy required to train the system can be reduced as well. Finally, in Sec. IV, we demonstrate how a power-efficient learning algorithm with dynamical control over the weighting of power and error optimization can lead to an efficient adaptation of low-power solutions beyond simply using good initial conditions and constant weighting.
II. POWER CONSUMPTION IN PHYSICAL LEARNING CIRCUITS
In previous work, we established theoretically and experimentally that self-learning resistor networks can be trained to perform tasks such as allostery, regression, and classification.37,39,40 Training a deep neural network corresponds to minimizing a learning cost function with respect to learning degrees of freedom (edge weights and biases). The learning landscape, described by the learning cost function as one axis in the high-dimensional space where each of the other axes corresponds to a different learning degree of freedom, remains fixed during the minimization. On the other hand, successful training of physical learning machines corresponds to the simultaneous minimization of two cost functions—the learning and physical cost functions—with respect to two different sets of degrees of freedom (DOF), the learning and physical degrees of freedom, respectively. In the case of a self-learning electrical network of variable resistors, the physical cost function is the dissipated power, the physical DOF are the node voltages, and the learning DOF are the conductances.
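To make the procedure concrete, the following is a minimal numerical sketch of coupled learning on a linear resistor network, assuming the standard contrastive update Δki ∝ (α/η)[(ΔViF)2 − (ΔViC)2]; the graph, node choices, and parameter values are illustrative rather than those of our simulations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a small random resistor network (a ring plus random chords keeps
# the graph connected). Edge conductances k are the learning DOF.
N = 16
ring = [(i, (i + 1) % N) for i in range(N)]
chords = [(i, j) for i in range(N) for j in range(i + 2, N)
          if rng.random() < 0.2]
edges = np.array(ring + chords)
k = np.ones(len(edges))

def solve_voltages(k, fixed, V_fixed):
    """Physical relaxation: unconstrained node voltages (physical DOF)
    settle so the dissipated power P = sum_i k_i (dV_i)^2 is minimized,
    i.e., Kirchhoff's current law holds at every free node."""
    L = np.zeros((N, N))                     # weighted graph Laplacian
    for (a, b), kab in zip(edges, k):
        L[a, a] += kab; L[b, b] += kab
        L[a, b] -= kab; L[b, a] -= kab
    free = np.setdiff1d(np.arange(N), fixed)
    V = np.zeros(N)
    V[fixed] = V_fixed
    V[free] = np.linalg.solve(L[np.ix_(free, free)],
                              -L[np.ix_(free, fixed)] @ V_fixed)
    return V

sources, targets = np.array([0, 1]), np.array([5, 9])
V_in = np.array([0.0, 1.0])                  # imposed input voltages
V_des = np.array([0.3, 0.7])                 # desired output voltages
alpha, eta = 0.05, 1e-3                      # learning rate, nudge amplitude

for epoch in range(2000):
    # Free state: only the inputs are imposed.
    VF = solve_voltages(k, sources, V_in)
    # Clamped state: outputs nudged slightly toward the desired values.
    V_clamp = VF[targets] + eta * (V_des - VF[targets])
    VC = solve_voltages(k, np.concatenate([sources, targets]),
                        np.concatenate([V_in, V_clamp]))
    dVF = VF[edges[:, 0]] - VF[edges[:, 1]]
    dVC = VC[edges[:, 0]] - VC[edges[:, 1]]
    # Contrastive update, kept inside conductance bounds.
    k = np.clip(k + (alpha / eta) * (dVF**2 - dVC**2), 1e-3, 1e1)
```

Each edge update depends only on that edge's own voltage drops in the two states, which is what makes the rule local and amenable to hardware implementation.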
Notably, the physical cost function, or power, depends implicitly on the learning DOF. As a result, both the learning landscape and the physical landscape evolve during training. For example, training gives rise to soft modes in the physical landscape and to stiff modes in the learning landscape, making the system more conductive and lowering its effective response dimension.43
The height of a minimum in the physical landscape corresponds to the power required to actuate the desired response (to obtain the desired outputs in response to the given inputs from training data). Due to the coupling between the learning and physical landscapes, it is possible to find and push down the minima in the physical landscape corresponding to the global minima in the learning landscape during training, thus decreasing the amount of power required to perform a given task.
FIG. 1. Effects of varying the conductance initialization scale. (a) Simulated resistor networks with edges corresponding to variable resistors. We train networks with N = 64 nodes to perform linear regression, i.e., to simulate desired linear equations with two variables (red source edges) and two results (blue target edges); see Appendix B for details. (b) The error as a function of training time for several conductance initialization values κ. The error is successfully reduced by the coupled learning rule by multiple orders of magnitude, regardless of the choice of the initialization scale κ. (c) Training time T (epochs taken for the error to drop to a set error level) as a function of initialization κ. The training time remains constant when initialization is far from the bounds but grows linearly for low initialization close to kmin. (d) Free power PF during training for different initialization κ. At the end of the training, the system finds a solution with trained power marked by the colored dots. (e) The trained power as a function of initialization κ. Decreasing the conductance initialization scale has a strong effect, reducing the trained power needed to actuate the learned solution. (f) Training energy as a function of initialization κ. Choosing an optimal initialization (here, κ ≈ 2 × 10−2) minimizes the training energy. The results are averaged over 50 realizations of networks and regression tasks.
A. Power consumption in learned solutions
We now turn our attention to the scalar physical cost (i.e., power consumption) of the free state PF. This free power is the relevant measure for the power used by the system to perform inference. As noted, this is the power associated with the application of the input voltages, allowing the output readouts. Our work is primarily concerned with minimizing this free power PF, while also achieving good learning solutions with low error. We will show how the free power can be lowered by non-hardware means, such as choosing better initialization for the learning DOF and using learning rules that minimize the free power at the same time as the error. The free power of a system trained for a specific task will henceforth be referred to as the trained free power. We will also look at the total energy required to train the system, henceforth termed the training energy, which can be estimated by integrating the free power over the training time. We show how this training energy can also be optimized by these initialization schemes and learning rules.
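As a minimal sketch (assuming a per-epoch record of free power values at uniform epoch spacing; the function names are ours), the two quantities can be estimated as follows:

```python
import numpy as np

def free_power(k, dVF):
    """Free power P_F = sum_i k_i (dV_i^F)^2 over all edges:
    the cost of actuating the trained response (inference)."""
    return float(np.sum(k * dVF**2))

def training_energy(PF_history, dt=1.0):
    """Training energy: the free power integrated over training time,
    estimated here with the trapezoidal rule."""
    PF = np.asarray(PF_history)
    return float(np.sum((PF[1:] + PF[:-1]) / 2.0) * dt)
```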
In this simple case, the learning modification is determined by the alignment of each component of the free state response with its nudge in the clamped state. In these particular models, we also know that the free power gradient with respect to each conductance is positive. We conclude that if the clamped state nudge aligns with the free state response, the free power tends to decrease. This is sensible, as the system has to decrease its conductances to achieve the stronger response required by the clamping. The opposite effect occurs when the nudged response is misaligned with the free state, resulting in increased conductances.
B. Power dependence on initial conditions
In Sec. II A, we established how physical learning affects the system’s free power PF, i.e., its power consumption in the free state. In the following, we consider how the initial conditions of the learning degrees of freedom determine the trained free power, i.e., free power of the learned solutions. We will show that judicious initialization leads to considerable savings in power consumption.
It is well recognized in the ML literature that the dynamics and obtained solutions of learning algorithms strongly depend on initialization, i.e., the initial values of the learning DOF.44–47 In the context of physical learning, the choice of initialization may not only affect the training time and accuracy of a solution but may also have important effects on the trained free power. Suppose a set of voltage drops is applied over some input edges of a resistor network, and we read out the resulting voltage drops over some other output edges. In addition, suppose that the conductance values of the network have a certain scale κ. It is known that the output voltage drops do not depend on the scale κ but only on the relative ratios of the conductance of different edges. However, reducing the conductance scale does, in fact, linearly decrease the free power [Eq. (5)]. Thus, we can, in principle, improve the trained free power indefinitely by reducing the conductance scale. Realistically, we are bound by experimental considerations: variable conductive elements have minimal conductance values (corresponding to maximal resistance). Furthermore, low conductance necessitates more precise hardware implementations as the network response becomes highly sensitive to small variations in the conductance.
The above-mentioned considerations suggest that initializing the conductance values k (learning DOF) at lower values may yield solutions with lower trained free power. To verify these ideas, we trained N = 64 node networks [Fig. 1(a)] for multiple regression tasks with two inputs and two outputs. The error for these regression tasks is given by the mean squared difference between the desired and the obtained target voltage drops over the training set (Appendix B presents details on the simulated resistor networks and regression tasks). We initialized the conductance values uniformly with different conductance scales in the range 10−3 ≤ κ ≤ 101. We note that in these simulations, the minimum conductance for any given edge is kmin = 10−3, and the maximum conductance is kmax = 101. Learning modifications that attempt to push the conductances out of this range are not performed, as sketched below. At κ = 1, the learning rate was chosen as α = 0.1, a value that typically results in relatively quick and stable learning performance for these networks and tasks. We find that as long as the learning rate is chosen to be slow enough, the learning dynamics are well behaved and our results scale as expected with the learning rate.
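A minimal sketch of the bounded update just described, assuming out-of-range proposals are skipped outright rather than clipped to the boundary:

```python
import numpy as np

kmin, kmax = 1e-3, 1e1                 # conductance bounds from the text

def bounded_update(k, dk):
    """Apply a learning update only where it keeps k inside [kmin, kmax];
    out-of-range proposals are skipped, as described above."""
    proposed = k + dk
    ok = (proposed >= kmin) & (proposed <= kmax)
    return np.where(ok, proposed, k)
```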
We set the units for these simulations such that the conductance scale [k] corresponds to k = 1, inside the allowed range of our conductors, and the voltage scale [V] corresponds to the typical highest input value chosen for our regression tasks, V = 1 (shown in Appendix B). The networks are trained for 106 learning iterations of Eq. (2). Each such iteration encompasses an epoch, i.e., the time taken for the network to observe and respond to all training examples, similar to full-batch gradient descent.48 Our units of time are scaled by the learning rate, [τepoch] = α−1. With these definitions, the units for the free power are given by [PF] = [k][V]2. The training energy is given by the free power integrated over the training time, which has units of [PF][τepoch] = [k][V]2α−1.
As expected, we find that coupled learning reduces the error by many orders of magnitude [Fig. 1(b)]. We also find that when the learning rate is scaled appropriately, α ∝ κ0.5, the scaled training time T (the number of epochs taken for the system to reach a certain error threshold, scaled by the learning rate) does not change much for relatively high initialization κ [Fig. 1(c)]. However, initialization close to the lower boundary kmin induces a linear increase in the scaled training time, scaling as κ−1. This increase in the training time is reasonable, as a large part of the training modification Δki is not performed because it would require the conductances to go below the minimum.
We find that the training energy scales linearly with the initialization at high κ [Fig. 1(f)], similar to the trained free power. However, lowering κ close to kmin actually increases the training energy. This is because we can no longer realize gains in the free power [the saturating region shown in Fig. 1(e)], while the training time increases linearly with decreasing κ [Fig. 1(c)]. As a result, the training energy in this regime increases with decreasing κ [Fig. 1(f)]. Thus, there is an optimal value of the initialization κ corresponding to the minimum training energy. We note that in this regime, as the training energy is proportional to the training time, it is inversely proportional to the learning rate, so that increasing the learning rate can reduce the training energy. However, this improvement only persists for low enough learning rates, for which the learning process is consistent and well behaved. We leave the study of the power efficiency of fast learning systems for future work.
In machine learning, however, the greatest energy cost is incurred during inference. In our case, this cost is quantified by the trained free power PF(k*). We note that training reduces the free power for high κ but increases it for low κ, next to the lower conductance limit [Fig. 1(d)]. This is sensible because for low initial conductances at or near the minimum, the network must increase some edge conductances in order to decrease its error. That being said, we conclude that initializing the network with properly low conductance values can save significant energy both during learning and when using the trained network.
III. EXPLICIT POWER MINIMIZATION
These are our key results: (1) the error induced by power minimization scales with λ2, while (2) the free power reduction relative to the unmodified (λ = 0) solution scales linearly with λ. In other words, the free power difference, i.e., the difference between the free power obtained with and without λ-minimization, scales linearly with this minimization amplitude.
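Schematically, denoting the error by 𝓔(λ) and the trained free power by PF(λ), and with a, b > 0 denoting task- and network-dependent constants that we introduce here only for illustration,

```latex
\mathcal{E}(\lambda) \;\approx\; \mathcal{E}(0) + a\,\lambda^{2},
\qquad
P_F(\lambda) \;\approx\; P_F(0) - b\,\lambda .
```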
The learned solution moves away from the λ = 0 solution at a rate linearly proportional to λ. This solution can be used to estimate both the contrast and the free power at time τ. As seen before, we find that the contrast scales as λ2, while the free power is reduced in proportion to both λ and the elapsed time τ.
Overall, these considerations suggest that under λ-modified dynamics, a trade-off emerges between the error and the trained free power. This intuition is verified in the numerical simulations shown in Fig. 2. We train a 64-node resistor network, initialized with intermediate conductance values, for a regression task as before. Here, the training proceeds with the modified power minimization learning rule [Eq. (9)], varying λ in the range 10−10 ≤ λ ≤ 10−2. We train these networks for τ = 105 steps and then measure the trained error and free power, averaging the results over 50 realizations of the network and regression tasks. We find that for small λ, the error and free power reduction scale as predicted by Eq. (16) [Fig. 2(a)]. As before, we also compute the training energy. We plot the trained free power and the training energy as functions of the minimization amplitude λ in Fig. 2(b). Both are markedly decreased when λ is increased, showing the predicted trade-off between power efficiency and error. In these settings, choosing the optimization parameter λ = 10−5 allows us to maintain a reduction of five orders of magnitude in error, while reducing the free state power and the total energy required for training by a sizable factor compared to standard learning (λ = 0).
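A hedged sketch of one plausible reading of the modified rule: a gradient step on the contrast plus λ times the free power. Since ∂PF/∂ki = (ΔViF)2, the added term remains local to each edge; the function signature below is illustrative.

```python
import numpy as np

def lambda_modified_update(k, dVF, dVC, alpha, eta, lam,
                           kmin=1e-3, kmax=1e1):
    """Gradient step on (contrast + lambda * free power). Because
    dP_F/dk_i = (dV_i^F)^2, the power term only uses each edge's
    own free-state voltage drop, so the rule stays local."""
    dk = (alpha / eta) * (dVF**2 - dVC**2)   # contrastive (error) term
    dk -= alpha * lam * dVF**2               # explicit power-minimization term
    return np.clip(k + dk, kmin, kmax)
```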
FIG. 2. Physical learning with power minimization. (a) Error (blue) and trained free power difference between learning with and without power minimization (black) for varying values of the power minimization amplitude λ. As λ is increased, the error of the learned solution increases quadratically, but the trained free power of these solutions decreases linearly. (b) The trained free power (black) and the training energy (green) decrease with λ, underscoring a trade-off between power efficiency and error. The results are averaged over 50 realizations of networks and training tasks.
A. Experimental results
So far, we have argued on theoretical grounds that error can be traded off for power efficiency by employing the learning rule in Eq. (9) and verified these arguments in simulations. Here, we verify the existence of the trade-off in laboratory experiments. We use an experimental network of variable resistors implementing coupled learning, similar to the realizations in previous studies.37–39 However, in this new implementation of the experiment, transistors replace the digital potentiometers in the role of variable resistors.41 Unlike in the previous work,37 this system is also able to learn according to the continuous coupled learning rule [Eq. (2)], as each resistance element is set by a charged capacitor on the gate of the transistor instead of by a discrete counter. Modifications to the learning rule of the form in Eq. (9) are achieved by varying the measurement amplification from the free and clamped networks. In addition, unlike previous implementations, this new network operates continuously in time, with the clamped state value updated automatically via an electronic feedback loop, and so training duration is measured in real time rather than in training steps. Because of unavoidable noise in the experiment, η → 0 is unobtainable. As the clamped state approaches the free state, their difference becomes more and more difficult to measure. Therefore, we use a finite value η = 0.22 for these experiments, with the effective learning rate set accordingly. The experiments lasted 20 s each, and the network’s resistances had completely settled by the end of each run. The network is a 4 × 4 square lattice of edges [inset in Fig. 3(c)] with periodic boundary conditions; the edges are initialized with uniform conductance in the approximate middle of their range at the start of each experiment.
FIG. 3. Experimental results for power optimization show a trade-off between error and power. (a) Error as a function of time in laboratory experiments with different optimization amplitude values λ. An adaptive network of nonlinear resistors can physically learn to adopt the desired function. This network learns to perform node allostery tasks, gradually minimizing the error down to a finite error floor. (b) Free power in physical learning experiments for different values of λ. As experiments are run with increasing λ, the learning process finds solutions with an increasing error but with improved trained free power. (c) Error vs the trained free power of experimentally learned solutions. Overall, we observe an error–power trade-off in this experimental learning machine. The inset shows a photograph of the experiment. (d) Error vs the trained free power of learned solutions in numerical simulations on the same network geometry and type of tasks. The trade-off between power efficiency and error is recapitulated in simulated learning resistor networks, where the units of time, conductance, and voltage are matched with the experiments.
The network was trained for 150 two-source, two-target node allostery tasks, wherein the sources were held at the low and high ends of the allowable range (0 and ∼0.45 V, respectively), with the two desired target outputs at either 20% and 80% or at 10% and 90% of this range, respectively. Across these experiments, λ was varied over seven values ranging from 0 to 0.055. In all cases, the network was able to lower the error, as shown for typical error vs training time curves in Fig. 3(a). For these tasks, the network also consistently lowered the free power, as shown for the complementary power curves over training time in Fig. 3(b). Consistent with theoretical predictions, the error increased and the trained power decreased with increasing λ, with their trade-off shown in Fig. 3(c). White diamonds correspond to the mean error and trained free power of all the experiments performed with the same value of λ.
To study the trade-off seen in the experiment, we simulated N = 16 node resistor networks constructed similarly to the experimental network [inset in Fig. 3(d)]. These networks are simulated for similar two-source, two-target node allostery tasks (see Appendix B). We added a Gaussian white noise term to Eq. (9) with scale δ = 10−3 V2 to approximate the noisy conditions of experimental learning; the white noise term leads to a finite error floor. A time step in our simulations is equivalent to one experimental learning step, and we set the simulated conductance and voltage scales to match the experiment as well (conductance scale [k] = 10−3 Ω−1 and voltage scale [V] = 1 V). The results for error and free power with λ in the range 10−6–10−3, averaged over 50 realizations of the network and tasks, are shown in Fig. 3(d) and qualitatively show the same error–free power trade-off. However, we note that these simulations are not intended as faithful, realistic representations of the experimental learning machine: we simulate a linear flow network and do not attempt to model the specific details of the noise and bias profiles of the experiment. The comparison is only intended to show that realized experimental learning machines display a performance-to-power-efficiency trade-off qualitatively similar to the one predicted by our theory. A minimal sketch of the noisy update appears below.
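In this sketch, the exact placement and normalization of the noise term are our assumptions; only the noise scale δ follows the text.

```python
import numpy as np

rng = np.random.default_rng(1)
delta = 1e-3                            # noise scale from the text (V^2)

def noisy_update(k, dk):
    """Learning update with additive Gaussian white noise on each edge,
    approximating the noisy conditions of the experiment."""
    return k + dk + delta * rng.standard_normal(np.shape(k))
```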
IV. DYNAMICAL CONTROL FOR GREATER POWER MINIMIZATION
In Sec. III, we showed how adding an explicit power minimization term in the contrast function leads to a new local learning rule that attempts to minimize both the error and the free power at the same time, leading to a trade-off between them controlled by the power minimization amplitude λ. We note that noisy inputs make it impossible to reach zero training error, and in any case, there is experimental noise in the self-learning circuits so there is, in practice, a non-zero error floor. Here, we use this insight to design a practical control scheme to dynamically modify λ during learning in order to attain a tolerable error with more power-efficient solutions. We will show how such a control scheme can yield even more power-efficient solutions compared to using a smart initialization (as in Sec. II) and constant λ (as in Sec. III).
Assume that we initialize the conductances of a resistor network at their minimal value (maximum resistance). This initialization leads to a free state VF(kmin) with the lowest possible free power. This state corresponds to the minimum power found by the power minimization dynamics with λ ≫ 1, which select the learning degrees of freedom resulting in the lowest free power. As seen in Fig. 2, reducing the amplitude λ from infinity toward zero monotonically decreases the error while increasing the trained free power.
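As a hedged illustration, the following is one simple controller consistent with this behavior, shrinking λ multiplicatively until the error tolerance is met and holding it fixed afterward; it is not a reproduction of Eq. (18), and only the parameter values follow those quoted below.

```python
def control_lambda(lam, error, tol, p=0.02):
    """Shrink the power-minimization amplitude at rate p while the error
    exceeds the tolerance; hold it fixed once the tolerance is met."""
    return lam * (1.0 - p) if error > tol else lam

lam = 1.0   # starting amplitude (rho = 1 in the notation of the text)
```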
To test this dynamical control scheme for learning with power minimization, we simulate the training of N = 64 node networks for regression tasks as before. We initialize the conductance values at their minimum kmin = 10−3 and set α = 0.03, ρ = 1, and p = 0.02. We find that the network trained with the λ dynamical control scheme quickly converges to the desired error tolerance [Fig. 4(a), full line and closed circle]. We compare these results with an “early stopping” algorithm, defined as follows: we consider a learning network without power minimization (λ = 0) [the dashed line in Fig. 4(a)]. The network reaches the desired error tolerance after some time [marked by the open circle on the dashed line in Fig. 4(a)], which we call the “early stopping time.” Note that our dynamical control scheme in Eq. (18) reaches the same error at a time given by the solid circle on the solid line. Evidently, the dynamical control scheme achieves a lower trained free power than early stopping [Fig. 4(b)]. Once the dynamical control scheme reaches the time indicated by the solid circle in Fig. 4(b), λ stays constant and the system keeps training at this value of λ, finally reaching a steady state at long times. As a result, the power advantage of this scheme [gray arrow in Fig. 4(b)] improves over training time until it converges at some free power value.
FIG. 4. Power-efficient solutions with dynamical control. (a) Error trajectories with our λ dynamical control scheme (full line) compared to simple learning without power minimization (broken line). The controlled learning rapidly converges to the desired error tolerance level. (b) Free power under dynamical control of λ vs free power without power minimization. The dynamically controlled system finds solutions that lower the free power compared to an early stopped training of the uncontrolled system at the same error (open dot). The gray arrow signifies the saved power. (c) The power gain of our control scheme compared to early stopping for different levels of tolerable error. We find that dynamical control can generate significantly more power-efficient solutions. (d) Power gain compared to the ratio of training energies between the dynamically controlled learning and the early stopping algorithm. To utilize the full benefit of low free power solutions, one needs to train the system for longer times, increasing the network training energy. All results are averaged over 50 realizations of networks and tasks.
However, we note that gaining the full benefit of this power reduction requires long training, possibly much longer than the early stopping time, meaning that the training energy is higher than for the early stopping algorithm. In other words, in our dynamical control scheme, there is a trade-off between the training energy and the trained free power of the solution. This is verified in Fig. 4(d), where we measure the power gain in terms of the ratio of total training energies between the dynamical control scheme and the early stopping algorithm. This trade-off also depends on the error tolerance, but we find that if one is willing to spend a modest multiple of the early-stopping training energy, the network achieves most of the benefit of the power reduction due to the dynamical control scheme. If the training energy is a major concern and constitutes a significant fraction of the energy expended by the network during its life, one should consider this trade-off when seeking the overall lowest power solutions. Finally, we note that our dynamical control scheme is not optimized. Choosing different parameters, or another dynamical control scheme altogether, may produce superior power savings at a possibly lower training energy.
V. DISCUSSION
In this work, we studied how electrical circuits can physically learn to adopt desired functions in power-efficient ways. We established that physical learning affects the free power required to actuate the circuit given input signals. This free power can be lowered by choosing better initialization schemes for the learning degrees of freedom, e.g., initializing low conductances in electronic resistor networks.
We have also introduced a modified local learning rule that attempts to minimize both the error and the free power. We showed that this learning rule indeed lowers the trained free power of the obtained learning solutions in both simulations and experiments. This learning rule weights the importance of minimizing error vs power, giving rise to a trade-off between the two. While improving power efficiency at the expense of error (performance) may seem undesirable, a very low error is typically not required and can even be infeasible in real learning situations. Therefore, one can often train learning networks for lower power solutions without much of an adverse effect (Appendix C). In our experiments, there is a natural noise floor and there is no point in striving for a lower error than the floor. For these systems, power-efficient learning rules can improve the solution power with little to no penalty in error.
Finally, we have introduced a dynamical scheme for controlling the relative importance of error and power minimization to rapidly converge on power-efficient solutions with desired error tolerance. We find that such dynamical control can lead to lower power solutions. It is likely that an optimized version of such a dynamical control scheme could further reduce both the solution power and the overall training energy. This is a subject of future study.
While we presented details of the analytical approach for the case of resistor networks, our theoretical arguments apply to other physical systems trained using coupled learning, such as mechanical spring networks (Appendix D). Neuromorphic computing often promises to improve power efficiency by embedding learning algorithms in hardware, solving a major problem in modern power-hungry computational learning algorithms. While the hardware platform discussed here, self-learning electronic circuits, does indeed improve power efficiency, our work here focuses on how to achieve power efficiency in the learning process itself. As a result, our power-efficient learning approach may be easily adaptable to other neuromorphic hardware systems that can perform self-learning to offer additional power savings compared to only using efficient hardware.
ACKNOWLEDGMENTS
We thank Purba Chatterjee, Marc Z. Miskin, and Vijay Balasubramanian for insightful discussions and feedback. This research was supported by the U.S. Department of Energy, Office of Basic Energy Sciences, Division of Materials Sciences and Engineering Award No. DE-SC0020963 (M.S.), the National Science Foundation via the UPenn MRSEC, Grant Nos. DMR-1720530 and DMR-2309043 (S.D. and D.J.D.), and DMR-2005749 (A.J.L.), and the Simons Foundation (No. 327939 to A.J.L.). D.J.D. and A.J.L. thank CCB at the Flatiron Institute as well as the Isaac Newton Institute for Mathematical Sciences under the program “New Statistical Physics in Living Matter” (EPSRC Grant No. EP/R014601/1) for support and hospitality while a portion of this research was carried out.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Menachem Stern: Conceptualization (lead); Data curation (lead); Formal analysis (lead); Methodology (equal); Software (lead); Writing – original draft (lead); Writing – review & editing (lead). Sam Dillavou: Conceptualization (supporting); Data curation (supporting); Formal analysis (supporting); Methodology (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal). Dinesh Jayaraman: Conceptualization (supporting); Data curation (supporting); Formal analysis (supporting); Methodology (supporting); Writing – original draft (supporting); Writing – review & editing (supporting). Douglas J. Durian: Conceptualization (supporting); Funding acquisition (equal); Project administration (supporting); Resources (equal); Supervision (supporting); Writing – original draft (supporting); Writing – review & editing (supporting). Andrea J. Liu: Conceptualization (supporting); Funding acquisition (lead); Project administration (equal); Resources (equal); Supervision (lead); Writing – original draft (equal); Writing – review & editing (equal).
DATA AVAILABILITY
The data that support the findings of this study are openly available in Stern, Menachem (2024). Data and codes for producing results associated with the manuscript “Training self-learning circuits for power-efficient solutions” are available at https://doi.org/10.6084/m9.figshare.24923685.v1.51
APPENDIX A: LEARNING DYNAMICS
We now discuss the modified learning dynamics that minimize both error and free power [Eq. (9)]. In the main text, we showed that these learning dynamics lead to exponentially decaying weight solutions [Eq. (12)] and to the associated error and free power dynamics given in Eq. (15). The dynamical trajectories given these error/power dynamics follow two different prototypes, depending on the sign of ϕ. For ϕ < 0 (where the contrast gradient is aligned with the free power gradient), the contrast undershoots its infinite-time limit before rebounding exponentially to it [Fig. 5(a)]. This scenario is common when initializing the network with high conductance values. For ϕ > 0 (i.e., anti-alignment of the contrast and free power gradients), the dynamics tend to increase the free power, and we analytically find regular dynamics, where the contrast smoothly decays exponentially to its terminal value [Fig. 5(b)]. This scenario is common in flow/resistor networks initialized at low conductance values. Figure 5(c) verifies the argument laid out in the main text that the learned solution shifts linearly with λ. We also show, as before, that the error grows quadratically with λ. Crucially, the arguments for the error are relevant not only for the training set (the regression examples used to train the network) but also for test examples that the network had not seen previously, whose error also scales quadratically in λ. More information about the regression tasks, as well as the training and test sets, is presented in Appendix B. Similarly to the error, the arguments about the trained free power are valid for both the training and test sets, so that our modified learning dynamics reduce both [Fig. 5(d)].
FIG. 5. Learning dynamics with power minimization. (a) Error (blue) and free state power (black) as functions of time for a case where the gradients of the error and the free power align. Both are reduced by learning, and the error undershoots its steady state value before relaxing back to it. (b) Error (blue) and free state power (black) as functions of time for a case where the gradients of the error and free power do not align. Here, learning increases the free power while smoothly reducing the error to its final value. (c) Solution shifts (orange) as well as training error (full blue line) and test error (dashed blue line) as functions of the power minimization amplitude λ. As λ is increased, the learned solution displaces linearly from the limiting solution. The error increases quadratically with λ, both for the training set and for the test set. (d) Trained free power for the training and test sets in the learned regression problem; both decrease as a function of λ. The results in panels (c) and (d) are averaged over 50 realizations of networks and tasks.
It is also interesting to combine our results for the dependence of the free power on both the power minimization amplitude λ and the initialization scale κ of the learning DOF, as discussed in Sec. II B. We simulated the learning of regression tasks on N = 64 node networks as earlier and varied λ in the range 10−10–10−2 and the initialization scale κ in the allowed range 10−3–101. We observe the emergence of two regimes of interest, one where the power minimization is weak, λ ≪ 1, and the other where λ is large [Fig. 6(a)]. The large λ regime is simpler to understand, as the learning solutions there primarily reduce free power at the expense of error. Therefore, the free power reduction is essentially the reduction from the free power at initialization to the minimal free power supported by the system, which scales as κ [at λ = 10−4, Fig. 6(b)]. For weak minimization, λ ≪ 1, the effect of the initialization is more subtle. We find that the free power reduction scales approximately as a power law κ0.5 [at λ = 10−8, Fig. 6(b)]. Since here we measure the free power reduction, we find that at lower initialization scales, less power is saved by applying power minimization. However, there is still a substantial benefit in free power reduction even for good initialization. These results also help contextualize our dynamical control scheme in Sec. IV, where we show that the dynamical scheme supports additional free power savings compared to just using good initialization. We reserve a detailed study of the interplay between initialization and free power minimization for future work.
FIG. 6. Interplay of initialization and power minimization. (a) Free power reduction depends on both the power minimization amplitude λ and the initialization scale κ. When using better initialization (lower κ), less power is saved by the power-efficient learning rule. (b) The free power reduction as a function of the initialization scale κ. For weak minimization (λ = 10−8, full line), the reduction scales approximately as κ0.5. For strong power minimization (λ = 10−4, dashed line), it scales as κ. The results are averaged over ten realizations of networks and tasks.
APPENDIX B: PHYSICAL LEARNING TASKS
We trained these networks over many realizations of the network geometry, the choice of input/output edges, and the regression task. To train each realization of the problem, we sampled ntr = 20 training examples and their corresponding desired outputs. Note that the scale of the input voltage drops determines the scale of power dissipation in the free state.
In the main text, we looked at noiseless regression problems with ϵ = 0, for which the network can find exact solutions with zero error. In Appendix C, we study a case with finite label noise ϵ = 10−3. The training examples are sampled randomly during training and used to define the free and clamped states in the iterative learning process. Apart from the ntr = 20 training examples, we also sampled nte = 100 test examples from a wider distribution, together with their associated desired outputs. The test points are not used during the learning process but help verify that the network can generalize. In our work, the test set is also interesting for showing that the power-efficient property of the solutions generalizes beyond the training set [Fig. 5(d)].
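A minimal sketch of such a task, assuming inputs drawn uniformly and desired outputs given by a fixed random linear map with optional Gaussian label noise; the ranges and the map itself are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.uniform(0.0, 0.5, size=(2, 2))     # the linear relation to learn

def make_examples(n, scale=1.0, eps=0.0):
    """Sample n (input, desired-output) pairs for a two-in, two-out
    linear regression task, with label noise of scale eps."""
    X = rng.uniform(0.0, scale, size=(n, 2))           # input voltage drops
    Y = X @ A.T + eps * rng.standard_normal((n, 2))    # (noisy) desired outputs
    return X, Y

X_train, Y_train = make_examples(20)                   # n_tr = 20, noiseless
X_test, Y_test = make_examples(100, scale=1.5)         # wider test distribution
```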
Similarly, we can compute the loss value associated with the test set. We simulate physical learning by applying the local learning rules described in the main text, picking a nudging amplitude value η = 10−3.
For a better comparison to the physical experiment of Fig. 3, we also simulated a resistor network with the same geometry and a simple allostery task as in the experiment [inset of Fig. 3(d)]. In these simulations, we randomly choose two input nodes and assign to them input voltages Vi1 = 0 V and Vi2 = 0.45 V. We further choose two random output nodes and train them such that when the inputs are applied, they have output voltage values Vo1 = 0.09 V and Vo2 = 0.36 V. In this case, we again use a mean squared error loss function, comparing the output voltages at the output nodes to the desired values Vo1,2.
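A minimal sketch of this comparison task and its loss, using the voltages quoted above:

```python
import numpy as np

V_in = np.array([0.0, 0.45])               # volts, applied at the input nodes
V_des = np.array([0.09, 0.36])             # volts, desired at the output nodes

def allostery_error(V_out):
    """Mean squared error between measured and desired output voltages."""
    return float(np.mean((np.asarray(V_out) - V_des) ** 2))
```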
APPENDIX C: POWER MINIMIZATION FOR LIMITED ACCURACY TASKS
The numerical results in the main text were limited to tasks that can, in principle, be learned perfectly by the learning machine. In such cases, there exist solutions with zero error, as discussed in Appendix A. There are, however, cases in which it is impossible to obtain solutions with zero error. The typical example is when the training set does not capture all the information contained in the broader data (or the test set). There are also cases where it is impossible to find solutions that nullify the error even on the training set. This can occur due to under-parameterization (too few learning degrees of freedom to learn the task), an insufficiently expressive model (e.g., a linear network cannot represent nonlinear relations), or noise in the learning process.48
Comparing these expressions to Eq. (16), we see that the trained free power behavior stays the same. We also see that the error shift is the same, scaling as λ2, but now there is a finite contrast floor associated with a finite error. The trade-off between error and power is still maintained, although in this case, it may be much more favorable. For a small enough λ, the λ-induced contrast shift is negligible compared to this floor, and so the contrast (and error) is nearly unaffected by the power minimization. As a result, we can apply a finite power minimization parameter λ, reducing the trained free power at nearly no penalty. Thus, power minimization is particularly useful for problems in which zero error solutions are not possible.
To verify these considerations, we simulated physical learning in N = 64 node networks on regression and classification tasks (Fig. 7). Excess noise was added to the regression labels (outputs) in the training and test sets, sampled from a distribution of scale ϵ = 10−3 (see Appendix B). The simulated networks can successfully learn these tasks, reducing the error to some finite value [Fig. 7(a)]. When adding a small power minimization λ < 10−7, the learning trajectories are almost unchanged and the error is nearly unaffected. When λ increases further, the error starts increasing beyond the error floor [Fig. 7(b)]. At the same time, we observe that the trained free power is decreased linearly at finite λ, as seen before. These results show that in noisy cases, such as those seen in physical learning experiments, free power reduction can be achieved at little expense in error, up to a certain point.
FIG. 7. Power reduction with little error/accuracy loss in regression and classification problems. (a) Error vs time trajectories for a regression task with label noise (with a finite minimum possible error) for different values of the power minimization amplitude λ. As long as λ < 10−7, the error is largely unaffected. (b) Error (blue) and free power reduction (black) as a function of λ. The trained free power is still reduced by increasing λ, as before, even in the range where the error is unaffected. The regression results are averaged over 50 realizations. (c) Classification accuracy on the Iris dataset for the training (full line) and test (dashed line) sets, as a function of λ. Similar results are obtained: increasing the power minimization amplitude λ decreases accuracy (i.e., increases error), but only beyond a finite value of λ. (d) Power gain due to power minimization for the training (full line) and test (dashed line) sets. In this case, we gain a factor of 2 in trained free power with little loss in accuracy (at λ ≈ 10−7). The classification results are averaged over 100 realizations.
Another case where this result is particularly relevant is classification problems, where we would like to assign discrete labels to inputs. A standard example of such tasks is the classification of Iris specimens based on measurements of the lengths of their petals and sepals.53 Previously, we have shown that our flow/resistor networks can successfully learn to classify the Iris dataset, as well as could be expected from linear network models, in simulations43 and experiments.37 In discrete classification tasks, we are typically not concerned with the mean squared error but with a measure of accuracy given by a discrete choice of the label based on the network response; excellent classification is possible even at relatively high values of the mean squared error. Therefore, it may be possible to induce power optimization without a penalty in classification accuracy. To test this idea, we simulated the training of our N = 64 node networks to classify the Iris dataset (a detailed description of the training protocol can be found in Ref. 37). Training at different power minimization amplitudes λ in the range 10−10 < λ < 10−2, we find that the classification accuracy (for the training and test sets) is not affected by power minimization until λ ≈ 10−7 [Fig. 7(c)]. At the same time, the solution free power is significantly reduced starting at λ > 10−8, showing that a power gain (in this case, by a factor of 2) is possible at little penalty in accuracy [Fig. 7(d)].
We find that if we know the noise scale σ, measuring the average contrast value allows us to glean information about the effective average curvature of the contrast near the learning solution. Note that the learning DOF diffuse freely in the space of zero contrast solutions, so the effective curvature is associated with the typical slopes of the contrast leaving the zero manifold. Overall, we see that the free power reduction is, on average, the same as in the case with no noise (up to second order terms in λ). However, the contrast now has a finite added term due to the exploration of values of the learning DOF beyond the minimum. This means that additive white noise has an effect similar to the finite contrast floor discussed earlier; a finite power minimization λ can reduce the trained free power while having nearly no effect on the contrast (or error), up to a certain scale.
APPENDIX D: POWER MINIMIZATION IN MECHANICAL SPRING NETWORKS
In this work, we presented general arguments on how local learning rules can balance minimizing the error and the trained free power of obtained physical learning solutions, giving rise to a trade-off between the two. However, in the main text, we only tested these ideas numerically and experimentally in resistor networks. Here, we show in simulations that these arguments apply similarly to physical learning systems governed by different physics, e.g., an elastic network of harmonic springs [Fig. 8(a)].
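Schematically, writing ei for the extension of spring i in the free (F) or clamped (C) state, the spring-network analog of the λ-modified rule takes the form below (our illustrative rendering, not the exact equations used in the simulations):

```latex
\dot{k}_i \;\propto\; \frac{1}{\eta}\Big[\big(e_i^{F}\big)^{2} - \big(e_i^{C}\big)^{2}\Big]
\;-\; \lambda\,\big(e_i^{F}\big)^{2},
\qquad
E_F = \tfrac{1}{2}\sum_i k_i \big(e_i^{F}\big)^{2},
```

with the free-state elastic energy EF playing the role that the free power plays in resistor networks.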
FIG. 8. Energy-efficient learning in mechanical spring networks. (a) A mechanical spring network, with each edge corresponding to a spring with adaptive stiffness k. Such networks are trained for allostery tasks so that prescribed strains at input edges (red) lead to desired strains at output edges (blue). (b) Error (blue) and free energy reduction (black) as functions of the power minimization amplitude λ. As seen for flow networks, including a power minimization term in the local learning rule leads to a trade-off between error and trained free energy, with the same scaling behaviors. The results are averaged over five realizations of networks and tasks.
We simulate this modified learning algorithm on an unstrained spring network with N = 27 nodes, as shown in Fig. 8(a). These networks are trained for allostery tasks, in which we apply prescribed relative strains of 0.2 at input bonds (randomly choosing contraction or extension) and desire particular strain values at another two random bonds (0.05 or 0.03, randomly choosing contraction or extension). With no energy minimization applied (λ = 0), coupled learning generally succeeds in training these networks down to a low numerical error floor. As we increase the power minimization amplitude λ, we observe that the error increases as λ2 and the trained free energy EF is reduced linearly in λ [Fig. 8(b)], as predicted by Eq. (16) and observed in simulations of resistor networks. These results show that our approach to physical learning of power-efficient solutions can be employed beyond linear resistor networks. Recent experimental progress has been made in implementing coupled learning in elastic networks,42 but we leave the experimental validation of energy reduction in such networks for future study.