As the size and ubiquity of artificial intelligence and computational machine learning models grow, the energy required to train and use them is rapidly becoming economically and environmentally unsustainable. Recent laboratory prototypes of self-learning electronic circuits, such as “physical learning machines,” open the door to analog hardware that directly employs physics to learn desired functions from examples at a low energy cost. In this work, we show that this hardware platform allows for an even further reduction in energy consumption by using good initial conditions and a new learning algorithm. Using analytical calculations, simulations, and experiments, we show that a trade-off emerges when learning dynamics attempt to minimize both the error and the power consumption of the solution—greater power reductions can be achieved at the cost of decreasing solution accuracy. Finally, we demonstrate a practical procedure to weigh the relative importance of error and power minimization, improving the power efficiency given a specific tolerance to error.

## I. INTRODUCTION

There has been a meteoric rise in the adoption and usage of artificial intelligence (AI) and machine learning (ML) tools in just the past 15 years,^{1,2} accompanied by an equally spectacular rise in the sizes of ML models and the amount of computation required to train and apply them.^{3,4} In recent years, the energy required to train state-of-the-art ML models, as well as to use the trained models, has been rising exponentially, doubling every 4–6 months.^{5} This energy cost will eventually severely constrain further increases in model complexity and already constitutes significant economic and carbon costs.^{6–8}

The field of neuromorphic computing^{9–13} strives to recreate the ability to learn in hardware. A major motivation for the development of neuromorphic systems is the possibility of massive energy savings compared to ML implemented on standard computers.^{13,14} Many proposals for synthetic “neurons” and “synapses” have been laid out over the past three decades, promising lower power consumption compared to standard computers by 2–5 orders of magnitude.^{15–18} While much neuromorphic computing research has been focused on the development of power-efficient hardware, usually for performing inference (applying already-trained ML models), some attention has recently been given to the study of power-efficient learning “algorithms.”^{19–22} However, most neuromorphic hardware implementations considered thus far specifically attempt to mimic standard ML algorithms, such as backpropagation^{23–25} or phenomenological neural synaptic learning processes such as STDP (spike-timing-dependent plasticity).^{26–30}

Recently, a new avenue was opened toward realizing power-efficient neuromorphic computing, dubbed *physical learning machines* or *self-learning physical networks*.^{31} Rather than mimicking known learning algorithms, such as backpropagation, such systems exploit their inherent physics in order to learn, using *local learning rules* that modify the *learning degrees of freedom* based on locally available information, such that the system globally learns to perform desired tasks. A certain class of local learning rules, known as *contrastive learning*,^{32–36} describes how learning degrees of freedom should be modified in order for systems to achieve desired outputs in response to inputs supplied by observed examples of use (i.e., supervised learning).

In order to realize any power gains, such learning rules must be implemented in hardware. *Coupled learning*, a particular contrastive learning rule, has been realized successfully in laboratory hardware for electronic circuits of variable resistors.^{37–40} Such systems already consume less power than conventional computers doing inference because they are analog rather than digital.^{41} Here, we use analytical theory, computation, and experiments to show that the propensity of electronic circuits to minimize power dissipation enables even greater reductions in power consumption via appropriate initialization and power-efficient learning rules. We specifically demonstrate these results for regression tasks. However, it should be noted that our analysis and results should apply to other physical learning machines in different physical media (e.g., mechanical networks) if they can be developed in the lab,^{42} as well as to other types of problems (e.g., classification).

This paper is organized as follows: in Sec. II, we describe the physical learning approach and discuss how the power consumption of the system is modified by learning, in particular, as we change the initial conditions of the learning degrees of freedom. A judicious choice of initial conductances yields learning solutions with low power consumption, while also reducing the energy consumed in training. In Sec. III, we introduce a modification to the local physical learning rule in order to minimize both error and power consumption. We analyze this new local rule theoretically and test it in simulations and lab experiments, concluding that it leads to an error–power trade-off; lower-power solutions may be obtained at the expense of higher errors. The energy required to train the system can be reduced as well. Finally, in Sec. IV, we demonstrate how a power-efficient learning algorithm with dynamical control over the weighting of power and error optimization can lead to an efficient adaptation of low-power solutions beyond simply using good initial conditions and constant weighting.

## II. POWER CONSUMPTION IN PHYSICAL LEARNING CIRCUITS

In previous work, we established theoretically and experimentally that self-learning resistor networks can be trained to perform tasks such as allostery, regression, and classification.^{37,39,40} Training a deep neural network corresponds to minimizing a learning cost function with respect to learning degrees of freedom (edge weights and biases). The learning landscape, described by the learning cost function as one axis in the high-dimensional space where each of the other axes corresponds to a different learning degree of freedom, remains fixed during the minimization. On the other hand, successful training of physical learning machines corresponds to the simultaneous minimization of *two* cost functions—the learning and physical cost functions—with respect to two different sets of degrees of freedom (DOF), the learning and physical degrees of freedom, respectively. In the case of a self-learning electrical network of variable resistors, the physical cost function is the dissipated power, the physical DOF are the node voltages, and the learning DOF are the conductances.

Notably, the physical cost function, or power, depends implicitly on the learning DOF. As a result, both the learning landscape and the physical landscape *evolve* during training. For example, training gives rise to soft modes in the physical landscape and to stiff modes in the learning landscape, making the system more conductive and lowering its effective response dimension.^{43}

The height of a minimum in the physical landscape corresponds to the power required to actuate the desired response (to obtain the desired outputs in response to the given inputs from training data). Due to the coupling between the learning and physical landscapes, it is possible to find and push down the minima in the physical landscape corresponding to the global minima in the learning landscape during training, thus decreasing the amount of power required to perform a given task.

Consider a physical system described by a physical cost function *P*(*V*; *k*) (e.g., the dissipated power), depending on a set of physical DOF *V* (e.g., the node voltages) and a set of learning DOF *k* (e.g., the edge conductances). When an input signal (e.g., a set of voltages at input nodes) is applied, the system responds by optimizing the physical DOF to minimize *P* subject to the input constraints, producing a stable free state *V*^{F} with an associated free power *P*^{F}(*V*^{F}; *k*). Training this system for specific output responses using coupled learning^{34} involves clamping the targets *T* by slightly nudging them toward the desired response, $V_T^C = V_T^F + \eta(\tilde{V}_T - V_T^F)$, with $\tilde{V}_T$ being the desired response and nudge amplitude *η* ≪ 1. The physical system then minimizes the physical cost function subject to both the inputs and this clamping, yielding a clamped state *V*^{C} with a clamped power *P*^{C}(*V*^{C}; *k*). The contrast (or contrastive function) $C$ is defined as the difference between the physical cost (powers) for the clamped and free states [Eq. (1)].^{34} In coupled learning, the learning DOF are modified along the negative partial derivative of the contrast with respect to *k* [Eq. (2)], with *α* being a scalar *learning rate*, setting the time scale for the learning dynamics. A system following these dynamics with a sufficiently low learning rate tends to minimize the learning cost function $L$ [as shown in Fig. 1(b)]. See Appendix A for more details on the learning dynamics close to a solution for the learning degrees of freedom *k**.
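The free state, clamped state, and contrast-driven update described here can be sketched numerically. The following is a minimal illustration, not the simulation code used in this work: it assumes a complete graph of five nodes, node-voltage inputs, a single target node, the power convention $P=\sum_i k_i(\Delta V_i)^2$, and the update normalization $\dot{k} = -(\alpha/\eta)\,\partial_k(P^C - P^F)$. The function names (`incidence_matrix`, `solve_voltages`, `coupled_learning_step`, `train`) and all parameter values are hypothetical.

```python
import numpy as np

def incidence_matrix(edges, n_nodes):
    """Rows are edges: +1 at one end node, -1 at the other (the sign is a convention)."""
    D = np.zeros((len(edges), n_nodes))
    for i, (a, b) in enumerate(edges):
        D[i, a], D[i, b] = 1.0, -1.0
    return D

def solve_voltages(D, k, fixed_nodes, fixed_values):
    """Node voltages minimizing sum_i k_i (D V)_i^2 with some nodes held fixed."""
    n_nodes = D.shape[1]
    lap = D.T @ (k[:, None] * D)                      # weighted graph Laplacian
    free = np.setdiff1d(np.arange(n_nodes), fixed_nodes)
    V = np.zeros(n_nodes)
    V[fixed_nodes] = fixed_values
    rhs = -lap[np.ix_(free, fixed_nodes)] @ V[fixed_nodes]
    V[free] = np.linalg.solve(lap[np.ix_(free, free)], rhs)
    return V

def coupled_learning_step(D, k, in_nodes, in_vals, out_nodes, desired,
                          eta=0.1, alpha=0.05, k_min=1e-3, k_max=10.0):
    """One update k <- k - (alpha/eta) d/dk (P^C - P^F) at fixed voltages (assumed normalization)."""
    VF = solve_voltages(D, k, in_nodes, in_vals)                 # free state
    nudged = VF[out_nodes] + eta * (desired - VF[out_nodes])     # clamped target values
    VC = solve_voltages(D, k, np.concatenate([in_nodes, out_nodes]),
                        np.concatenate([in_vals, nudged]))       # clamped state
    k_new = k - (alpha / eta) * ((D @ VC) ** 2 - (D @ VF) ** 2)
    allowed = (k_new >= k_min) & (k_new <= k_max)   # out-of-range updates are skipped
    return np.where(allowed, k_new, k), VF

def train(n_steps=2000):
    """Train one target node of a 5-node complete graph toward 0.3 V (toy task)."""
    edges = [(a, b) for a in range(5) for b in range(a + 1, 5)]
    D = incidence_matrix(edges, 5)
    k = np.ones(len(edges))
    in_nodes, in_vals = np.array([0, 1]), np.array([0.0, 1.0])
    out_nodes, desired = np.array([4]), np.array([0.3])
    errors = []
    for _ in range(n_steps):
        k, VF = coupled_learning_step(D, k, in_nodes, in_vals, out_nodes, desired)
        errors.append(float(np.mean((VF[out_nodes] - desired) ** 2)))
    return k, errors
```

Calling `train()` returns the learned conductances and the error history; in this toy setting the update only uses the voltage drops across each edge, which is what makes the rule local.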

### A. Power consumption in learned solutions

We now turn our attention to the scalar physical cost (i.e., power consumption) of the free state *P*^{F}. This *free power* is the relevant measure for the power used by the system to perform inference. As noted, this is the power associated with the application of the input voltages, allowing the output readouts. Our work is primarily concerned with minimizing this free power *P*^{F}, while also achieving good learning solutions with low error. We will show how the free power can be lowered by non-hardware means, such as choosing better initialization for the learning DOF and using learning rules that minimize the free power at the same time as the error. The free power of a system trained for a specific task will henceforth be referred to as the *trained free power*. We will also look at the total energy required to train the system $E$, henceforth termed the *training energy*, which can be estimated by integrating the free power over the training time. We show how this training energy can also be optimized by these initialization schemes and learning rules.

We first examine how the free power *P*^{F} is affected by the basic coupled learning rule of Eq. (2), and later see how it can be substantially reduced by modifying this rule. Using the chain rule in Eq. (2), we can derive an ODE for the free power during training, $\dot{P}^F = \nabla_k P^F \cdot \dot{k}$, with $\dot{k}$ given by the learning rule. Because the free state minimizes the physical cost with respect to the physical DOF, *∂*_{V}*P*^{F} vanishes exactly, and hence, $\nabla_k P^F = \partial_k P^F + \frac{dV}{dk}\partial_V P^F = \partial_k P^F$. We see that the free power tends to decrease if the gradients of the free power and the contrast with respect to *k* align and increase otherwise. Assuming the free power changes slowly with *k*, or that the learning DOF *k* are close to the learning solution *k**, we can approximate the free power using a Taylor expansion in *k*. Training then shifts the free power, starting from *P*^{F}(*k*^{0}) and ending after training with *P*^{F}(*k**). Next, we discuss the sign of this shift, determined by the alignment between the gradients of the contrast and free power.
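The alignment statement can be made explicit in one line. The following is a sketch under the assumption that the coupled learning rule of Eq. (2) takes the gradient-flow form $\dot{k} = -\alpha\,\partial_k C$; any additional positive prefactor would not change the sign argument.

```latex
\dot{P}^F \;=\; \nabla_k P^F \cdot \dot{k}
        \;=\; \partial_k P^F \cdot \dot{k}
        \;=\; -\alpha\, \partial_k P^F \cdot \partial_k C ,
\qquad
\dot{P}^F < 0 \;\Longleftrightarrow\; \partial_k P^F \cdot \partial_k C > 0 .
```

Under this assumption, the free power decreases exactly when the two gradients have a positive inner product, and increases when they are anti-aligned.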

In a resistor network, the physical DOF are the node voltages *V*_{a}, while the learning DOF are conductances *k*_{i} of edges *i* connecting pairs of nodes. An adjacency matrix Δ_{ia} is defined such that each row of the matrix corresponds to an edge, having a value of +1 at the index of the incoming node of that edge, −1 at the index of the outgoing node, and 0 elsewhere. The choice of which node is incoming or outgoing is a matter of convention and sets the direction of currents but has no physical consequence. The vector of voltage drops on the edges is given by Δ*V*_{i} = *∑*_{a}Δ_{ia}*V*_{a}. Resistor networks minimize the total power dissipation, $P = \sum_i k_i (\Delta V_i)^2$. With the ground voltage set to *V*_{G} = 0, the native state of the network (in the absence of any inputs) is where all voltage values are zero, all voltage drops are zero, and the total power dissipation is *P* = 0. When the free/clamped boundary conditions are applied, for example, by introducing currents in certain input and output edges, the free and clamped power are $P^{F,C} = \sum_i k_i (\Delta V_i^{F,C})^2$. For small nudges (*V*^{C} − *V*^{F} ∼ *η* ≪ 1), we write the contrast function $C$, neglecting the terms of order *η*^{2}, as $C \propto \sum_i k_i\, \Delta V_i^F\, (\Delta V^C - \Delta V^F)_i$. The learning rule follows from the partial derivative of the contrast with respect to *k* (as done in Ref. 35), $\dot{k}_i \propto -\Delta V_i^F\, (\Delta V^C - \Delta V^F)_i$.

In this simple case, the learning modification is determined by the alignment of each component of the free state response $\Delta V_i^F$ with its nudge in the clamped state $(\Delta V^C - \Delta V^F)_i$. In these particular models, we also know that the free power gradient is positive, $\frac{\partial P^F}{\partial k_i} = (\Delta V^F)_i^2 \geq 0$. We conclude that if the clamped state nudge aligns with the free state response, the free power tends to decrease. This is sensible as the system has to decrease its conductances to achieve a stronger response required by the clamping. The opposite effect occurs when the nudged response is misaligned with the free state, resulting in increased conductances.
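The sign argument for resistor networks can be written out step by step. This is a sketch assuming the power convention $P=\sum_i k_i(\Delta V_i)^2$ and a learning rule proportional to $-\partial_k(P^C - P^F)$ with a positive prefactor; the shorthand $\delta_i$ is introduced here only for illustration.

```latex
\begin{aligned}
\delta_i &\equiv (\Delta V^C - \Delta V^F)_i \sim \eta, \\
(\Delta V_i^C)^2 - (\Delta V_i^F)^2 &= 2\,\Delta V_i^F\,\delta_i + \delta_i^2
  \;\approx\; 2\,\Delta V_i^F\,\delta_i, \\
\dot{k}_i &\propto -\left[(\Delta V_i^C)^2 - (\Delta V_i^F)^2\right]
  \;\approx\; -2\,\Delta V_i^F\,\delta_i, \\
\dot{P}^F &= \sum_i \frac{\partial P^F}{\partial k_i}\,\dot{k}_i
  \;\propto\; -2\sum_i (\Delta V_i^F)^2 \left(\Delta V_i^F\,\delta_i\right).
\end{aligned}
```

Each edge with $\Delta V_i^F\,\delta_i > 0$ (nudge aligned with the free response) lowers its conductance and contributes a negative term to $\dot{P}^F$; misaligned edges do the opposite.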

### B. Power dependence on initial conditions

In Sec. II A, we established how physical learning affects the system’s free power *P*^{F}, i.e., its power consumption in the free state. In the following, we consider how the initial conditions of the learning degrees of freedom determine the trained free power, i.e., free power of the learned solutions. We will show that judicious initialization leads to considerable savings in power consumption.

It is well recognized in the ML literature that the dynamics and obtained solutions of learning algorithms strongly depend on initialization, i.e., the initial values of the learning DOF.^{44–47} In the context of physical learning, the choice of initialization may not only affect the training time and accuracy of a solution but may also have important effects on the trained free power. Suppose a set of voltage drops is applied over some input edges of a resistor network, and we read out the resulting voltage drops over some other output edges. In addition, suppose that the conductance values of the network have a certain scale *κ*. It is known that the output voltage drops do not depend on the scale *κ* but only on the relative ratios of the conductance of different edges. However, reducing the conductance scale does, in fact, linearly decrease the free power [Eq. (5)]. Thus, we can, in principle, improve the trained free power indefinitely by reducing the conductance scale. Realistically, we are bound by experimental considerations: variable conductive elements have minimal conductance values (corresponding to maximal resistance). Furthermore, low conductance necessitates more precise hardware implementations as the network response becomes highly sensitive to small variations in the conductance.
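The scale argument can be checked numerically on a small network. The sketch below is illustrative only: it assumes a complete graph of six nodes with randomly drawn relative conductances, inputs imposed as node voltages, and the power convention $P=\sum_i k_i(\Delta V_i)^2$. The names `free_state` and `scale_sweep` are hypothetical.

```python
import numpy as np

def free_state(D, k, fixed_nodes, fixed_values):
    """Node voltages minimizing sum_i k_i (D V)_i^2 with the fixed nodes held at given values."""
    n_nodes = D.shape[1]
    lap = D.T @ (k[:, None] * D)                      # weighted graph Laplacian
    free = np.setdiff1d(np.arange(n_nodes), fixed_nodes)
    V = np.zeros(n_nodes)
    V[fixed_nodes] = fixed_values
    V[free] = np.linalg.solve(lap[np.ix_(free, free)],
                              -lap[np.ix_(free, fixed_nodes)] @ V[fixed_nodes])
    return V

def scale_sweep(kappas=(1.0, 0.1, 0.01), n_nodes=6, seed=0):
    """Rescale all conductances by kappa and record the free state and free power."""
    rng = np.random.default_rng(seed)
    edges = [(a, b) for a in range(n_nodes) for b in range(a + 1, n_nodes)]
    D = np.zeros((len(edges), n_nodes))
    for i, (a, b) in enumerate(edges):
        D[i, a], D[i, b] = 1.0, -1.0
    k_shape = rng.uniform(0.5, 2.0, size=len(edges))  # relative conductances
    fixed_nodes, fixed_values = np.array([0, 1]), np.array([0.0, 1.0])
    results = []
    for kappa in kappas:
        k = kappa * k_shape
        V = free_state(D, k, fixed_nodes, fixed_values)
        power = float(np.sum(k * (D @ V) ** 2))       # dissipated (free) power
        results.append((kappa, V, power))
    return results
```

In this sketch the voltages returned for different `kappa` coincide, while the recorded power is proportional to `kappa`, which is the property that low-conductance initialization exploits.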

The above-mentioned considerations suggest that initializing the conductance values *k* (learning DOF) at lower values may yield solutions with lower trained free power. To verify these ideas, we trained *N* = 64 node networks [Fig. 1(a)] for multiple regression tasks with two inputs and two outputs. The *error* for these regression tasks, denoted $L$, is given by the mean squared differences between the desired and the obtained target voltage drops in the training set (Appendix B presents details on the simulated resistor networks and regression tasks). We initialized the conductance values uniformly with different conductance scales in the range 10^{−3} ≤ *κ* ≤ 10^{1}. We note that in these simulations, the minimum conductance for any given edge is *k*_{min} = 10^{−3}, and the maximum conductance is *k*_{max} = 10^{1}. Learning modifications that attempt to push the conductances out of this range are not performed. The learning rate *α* is chosen such that *α* = 0.1 at *κ* = 1, a value which typically results in a relatively quick and stable learning performance for these networks and tasks. We find that as long as the learning rate is chosen to be slow enough, the learning dynamics are well behaved and our results scale as expected with the learning rate.

We set the units for these simulations such that the conductance scale [*k*] is defined as *k* = 1 inside the allowed range of our conductors, and the voltage scale [*V*] corresponds to the typical highest input values chosen for our regression tasks *V* = 1 (shown in Appendix B). The networks are trained for 10^{6} learning iterations of Eq. (2). Each such iteration encompasses an epoch, i.e., the time taken for the network to observe and respond to all training examples, similar to full-batch gradient descent.^{48} Our units of time are scaled by the learning rate [*τ*_{epoch}] = *α*^{−1}. With these definitions, the units for the free power are given by [*P*^{F}] = [*k*][*V*]^{2}. The training energy is given by the free power, integrated over training time, which has units of $[E]=[\tau_{\mathrm{epoch}}][k][V]^2$.

As expected, we find that coupled learning reduces the error by many orders of magnitude [Fig. 1(b)]. We also find that when the learning rate is scaled appropriately *α* ∝ *κ*^{0.5}, the scaled training time *T* (number of epochs taken for the system to reach a certain error threshold $\tilde{L}=10^{-4}$, scaled by the learning rate) does not change much for relatively high initialization *κ* [Fig. 1(c)]. However, initialization close to the lower boundary *k*_{min} induces a linear increase in the scaled training time, scaling as *κ*^{−1}. This increase in the training time is reasonable as a large part of the training modification Δ*k*_{i} is not performed because it would require the conductances to go below the minimum.

We find that at lower initial conductance scales *κ*, physical learning finds lower trained free power solutions *P*^{F}(*k**) [in Figs. 1(d) and 1(e), note that the colored dots in panel d mark the trained free power]. These results clearly show the benefit of initializing the conductances of edges close to their minimal values in terms of learning power-efficient solutions. While so far we have referred to the power necessary to actuate the solution (i.e., the free power *P*^{F}), one often needs to consider the total energy required to train the network to adopt this solution, the training energy $E$. In some applications, this training energy is small compared to the total energy spent on using the system throughout its life cycle. However, when this is not the case, one should consider learning algorithms that reduce the required training energy and the free power. In our simulations, the training energy $E$ can be measured as the integral over time of the free power during training, until the error reaches a certain tolerable level (e.g., $\tilde{L}=10^{-4}$) at time *T*, $E = \int_0^T P^F(t)\,dt$.

We find that the training energy $E$ scales linearly with the initialization at high *κ* [Fig. 1(f)], similar to the trained free power. However, lowering *κ* close to *k*_{min} actually *increases* the training energy. This is because we can no longer realize gains in the free power [the saturating region shown in Fig. 1(e)], while the training time increases linearly with decreasing *κ* [Fig. 1(c)]. As a result, the training energy in this regime increases with decreasing *κ* [Fig. 1(f)]. Thus, there is an optimal value for the initialization *κ* corresponding to the minimum training energy. We note that in this regime, as the training energy is proportional to the training time, it is inversely proportional to the learning rate so that increasing the learning rate can reduce the energy $E$. However, this improvement only persists for low enough learning rates, when the learning process is consistent and well behaved. We leave the study of the power efficiency of fast learning systems for future work.
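The training energy defined as the time integral of the free power up to the error threshold can be estimated from recorded training curves. The following is a minimal sketch assuming the free power and error have been sampled at known times; the function name `training_energy` and its arguments are hypothetical, and the trapezoid rule is one of several reasonable quadrature choices.

```python
import numpy as np

def training_energy(times, free_power, error, tolerance=1e-4):
    """Integrate the free power from the first sample until the error first reaches the tolerance.

    Returns (energy, stopping_time). Uses the trapezoid rule on the sampled curve.
    """
    times = np.asarray(times, dtype=float)
    free_power = np.asarray(free_power, dtype=float)
    error = np.asarray(error, dtype=float)
    below = np.nonzero(error <= tolerance)[0]
    if below.size == 0:
        raise ValueError("error never reaches the tolerance")
    stop = below[0]                                    # first sample within tolerance
    t, p = times[: stop + 1], free_power[: stop + 1]
    energy = float(np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(t)))
    return energy, float(times[stop])
```

With time measured in the scaled units used here, the returned energy carries units of [*τ*_{epoch}][*k*][*V*]^{2}.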

In machine learning, however, the greatest energy cost is incurred during inference. In our case, this cost is quantified by the trained free power *P*^{F}(*k**). We note that training reduces the free power for high *κ* but increases it for low *κ* next to the lower conductance limit [Fig. 1(d)]. This is sensible because for low initial conductances at or near the minimum, the network must increase some edge conductances in order to decrease its error. That being said, we conclude that initializing the network with suitably low conductance values can save significant energy, both during learning and when using the trained network.

## III. EXPLICIT POWER MINIMIZATION

To minimize power explicitly, we modify the contrast function by adding a power minimization term proportional to the free power *P*^{F}, where *λ* is a tunable parameter, termed the *power minimization amplitude*, which dictates the importance of free power minimization. The learning rule, the partial derivative of the contrast, then becomes the modified rule of Eq. (9). This rule tends to reduce *P*^{F} as the modified learning dynamics lower the free power *and* the contrast in Eq. (1). If we set *λ* = 1, the free power cancels out and we recover the directed aging learning rule^{48,49} that solely tends to reduce the clamped power.

As in Sec. II A, the free power *P*^{F} tends to be reduced by these dynamics, again up to an effect determined by the alignment. We now discuss the dynamics of the contrast and free power in a simplified linear setting. First, note that in the limit *λ* → ∞, the learning rule solely minimizes the free power. We denote this free power minimum as $k_\infty^*$. Around this local minimum, the free power can be expanded to quadratic order, $P^F(k) \approx P^F(k_\infty^*) + \frac{1}{2}(k-k_\infty^*)^T H (k-k_\infty^*)$, with $H$ the Hessian of the free power. Similarly, the contrast can be expanded to quadratic order around the *λ* = 0 learning solution (the unmodified solution discussed earlier), $C(k) \approx C(k_0^*) + \frac{1}{2}(k-k_0^*)^T \mathcal{H} (k-k_0^*)$, with $\mathcal{H}$ the Hessian of the contrast at *λ* = 0 [in over-parameterized networks, the constant term $C(k_0^*)$ vanishes, see Appendix A]. If the learning solution at finite *λ*, $k_\lambda^*$, is close to the limiting solutions $k_\infty^*$ and $k_0^*$, we can express the new contrast approximately as the sum of these two quadratic forms, with the free power term weighted by *λ*. The resulting learning dynamics are a linear ODE in *k*, whose solution is exponential, $k(t) = k_\lambda^* + e^{-(\mathcal{H}+\lambda H)t}(k^0 - k_\lambda^*)$. Starting from the initial condition *k*(*t* = 0) ≡ *k*^{0}, the learning DOF exponentially decay to $k_\lambda^*$. Let us discuss the learning DOF solution $k_\lambda^*$. It is clear that when no power minimization is applied, $k_\lambda^* = k_0^*$. If both Hessians $\mathcal{H}, H$ are full rank (and invertible), the *λ* parameter would smoothly interpolate between $k_0^*$ and $k_\infty^*$. However, we know that the Hessian of the contrast in over-parameterized learning machines is low-rank (with the number of non-zero eigenvalues equal to the number of training tasks, see Appendix A for details).^{43} This means that the contrast Hessian $\mathcal{H}$ is not invertible and has vanishing eigenvalues. In the eigen-directions of these vanishing eigenvalues, the power minimization is dominant for any finite value of *λ*. Thus, the power minimization term introduces a singular perturbation so that for infinitesimal power minimization amplitude *λ* = 0^{+}, the learning solution approaches $k_{0^+}^* = \lim_{\lambda \to 0} k_\lambda^* \neq k_0^*$. The solution $k_{0^+}^*$ tends to minimize the free power, while keeping the contrast low. For over-parameterized learning with weak power minimization (*λ* ≪ 1), the learning solution is displaced from its *λ* → 0 limit $k_{0^+}^*$ by a shift *λs*. For *λ* ≪ 1, note that the vector *s* is nearly constant as the inverse matrix is dominated by $\mathcal{H}$. This means the solution shift *λs* is approximately linear in the optimization parameter *λ* (see Appendix A). Let us further denote $\Delta k^0 = k^0 - k_\lambda^*$ and introduce a time propagator $U_\lambda(t) \equiv e^{-(\mathcal{H}+\lambda H)t}$. The solution for *k* can be plugged in the equations above to express the dynamics of the contrast and free power.

These are our key results: (1) the error induced by power minimization scales with *λ*^{2}, while (2) the free power reduction compared to $P^F(k_{0^+}^*)$ scales linearly with *λ*. In other words, the *free power difference*, i.e., the difference between the free power with and without *λ*-minimization, $\Delta P_\lambda^F \equiv P_\lambda^F(k_\lambda^*) - P_{0^+}^F(k_{0^+}^*) \propto \lambda$, scales linearly with this minimization amplitude.
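The two scalings can be illustrated with a toy quadratic model. The sketch below is a simplification and not the analysis of Appendix A: it assumes that both the contrast and the free power are exactly quadratic, and that both Hessians are full rank, so that the *λ* → 0 solution coincides with the unmodified one (the singular, low-rank case discussed in the text is not captured). The names `lambda_sweep` and `loglog_slope`, the dimension, and the random matrices are all hypothetical.

```python
import numpy as np

def lambda_sweep(lams, dim=4, seed=0):
    """Minimize C(k) + lam * P(k) for quadratic C and P; return C and the change in P."""
    rng = np.random.default_rng(seed)
    A, B = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))
    H_c = A @ A.T + np.eye(dim)       # contrast Hessian (full rank here, by assumption)
    H_p = B @ B.T + np.eye(dim)       # free-power Hessian
    k0, k_inf = rng.normal(size=dim), rng.normal(size=dim)  # minima of C and of P

    def contrast(k):
        return 0.5 * (k - k0) @ H_c @ (k - k0)

    def power(k):
        return 0.5 * (k - k_inf) @ H_p @ (k - k_inf)

    C, dP = [], []
    for lam in lams:
        # Stationarity condition: H_c (k - k0) + lam * H_p (k - k_inf) = 0
        k_lam = np.linalg.solve(H_c + lam * H_p, H_c @ k0 + lam * H_p @ k_inf)
        C.append(contrast(k_lam))
        dP.append(power(k_lam) - power(k0))
    return np.array(C), np.array(dP)

def loglog_slope(x, y):
    """Least-squares slope of log10(y) against log10(x)."""
    return float(np.polyfit(np.log10(x), np.log10(y), 1)[0])

lams = np.logspace(-7, -5, 5)
C, dP = lambda_sweep(lams)
slope_error = loglog_slope(lams, C)      # exponent of the induced contrast
slope_power = loglog_slope(lams, -dP)    # exponent of the free power reduction
```

In this toy setting the induced contrast grows quadratically in *λ* while the power reduction grows linearly, so a small *λ* buys a first-order power saving at a second-order cost in error.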

In practice, training is stopped at a finite time *t* = *τ*. This training time must be large compared to the natural scale of the contrast Hessian to allow learning to occur. However, in the weak power minimization limit, this time can be much smaller than the power minimization timescale, $\mathcal{H} \gg \tau^{-1} \gg \lambda H$. In this case, the dynamics can be approximated by a fast decay toward the unmodified solution $k_0^*$, followed by a slow decay from $k_0^*$ to the power minimizing solution $k_\lambda^*$. For small *λ*, the learned solution at a finite time *τ* moves away from $k_0^*$ at a rate linearly proportional to *λ*. This solution can be used to estimate both the contrast and the free power at time *τ*. As seen before, we find that the contrast scales as *λ*^{2}, while the free power is reduced in proportion to *λ* and the elapsed time, $\Delta P_\lambda^F \sim \lambda\tau$.

Overall, these considerations suggest that under *λ*-modified dynamics, a trade-off emerges between the error and trained free power. This intuition is verified in the numerical simulations shown in Fig. 2. We train a 64 node resistor network, initialized with intermediate conductance values $k_i^0 = 1$, for a regression task as before. Here, the training proceeds with the modified power minimization learning rule [Eq. (9)], varying *λ* in the range 10^{−10} ≤ *λ* ≤ 10^{−2}. We train these networks for *τ* = 10^{5} steps and then measure the trained error and free power, averaging the results over 50 realizations of the network and regression tasks. We find that for small *λ*, the error $L$ and free power reduction $\Delta P_\lambda^F$ scale as predicted by Eq. (16) [Fig. 2(a)]. As before, we also compute the training energy $E$. We plot the trained free power $P^F(k_\lambda^*)$ and the training energy as a function of the minimization amplitude *λ* in Fig. 2(b). Both of these are markedly decreased when *λ* is increased, showing the predicted trade-off between power efficiency and error. In these settings, choosing the optimization parameter *λ* = 10^{−5} allows us to maintain a reduction of five orders of magnitude in error, while reducing the free state power and the total energy required for training by a factor $\sim 10$ compared to standard learning (*λ* = 0).

### A. Experimental results

So far, we argued on theoretical grounds that error can be traded off for power efficiency by employing the learning rule in Eq. (9) and verified these arguments in simulations. Here, we verify the existence of the trade-off in laboratory experiments. We use an experimental network of variable resistors implementing coupled learning, similar to the realizations in previous studies.^{37–39} However, in this new implementation of the experiment, transistors replace the digital potentiometers in the role of variable resistors.^{41} Unlike in the previous work,^{37} this system is also able to learn according to the continuous coupled learning rule [Eq. (2)] as each resistance element is set by a charged capacitor on the gate of the transistor instead of by a discrete counter. Modifications to the learning rule of the form in Eq. (9) are achieved by varying the measurement amplification from the free and clamped networks. In addition, unlike previous implementations, this new network operates continuously in time, with the clamped state value updated automatically via an electronic feedback loop, and so, training duration is measured in real time rather than training steps. Because of unavoidable noise in the experiment, *η* → 0 is unobtainable. As the clamped state approaches the free state, their difference becomes more and more difficult to measure. Therefore, we use a finite value *η* = 0.22 for these experiments with an effective learning rate of $\alpha = \frac{1}{24\,\mathrm{ms}}$. The experiments lasted 20 seconds each, and the network’s resistances had completely settled at the end of each run. The network is a 4 × 4 square lattice of edges [inset in Fig. 3(c)] with periodic boundary conditions; the edges are initialized with uniform conductance in the approximate middle of their range at the start of each experiment.

The network was trained for 150 two-source, two-target node allostery tasks, wherein the sources were held at the low and high ends of the allowable range (0 and ∼0.45 V, respectively), with the two desired target outputs at either 20% and 80% or at 10% and 90% of this range, respectively. Across these experiments, *λ* was varied to seven values ranging from 0 to 0.055. In all cases, the network was able to lower the error, as shown for typical error vs training time curves in Fig. 3(a). For these tasks, the network also consistently lowered the free power, as shown for the complementary power curves over training time in Fig. 3(b). Consistent with theoretical predictions, error and trained power increased and decreased, respectively, with increasing *λ*, with their trade-off shown in Fig. 3(c). White diamonds correspond to the mean error and the trained free power of all the experiments performed with the same value of *λ*.

To study this trade-off seen in the experiment, we simulated *N* = 16 node resistor networks constructed similarly to the experimental network [the inset in Fig. 3(d)]. These networks are simulated for similar two-source, two-target node allostery tasks (see Appendix B). We added a Gaussian white noise term to Eq. (9) with scale *δ* = 10^{−3}*V*^{2} to approximate the noisy conditions of experimental learning. The white noise term leads to an error floor $L \sim 10^{-6}\,\mathrm{V}^2$. A time step in our simulations is equivalent to one experimental learning step $(\sim 0.1\,\mathrm{ms})$, while we can set the simulated conductance and voltage scales to match the experiment as well (conductance scale [*k*] = 10^{−3} Ω^{−1} and voltage scale [*V*] = Volt). The results for error and free power with *λ* in the range 10^{−6}–10^{−3}, averaged over 50 realizations of the network and tasks, are shown in Fig. 3(d) and qualitatively show the same error–free power trade-off. However, we note that these simulations are not intended as faithful realistic representations of the experimental learning machine as we simulate a linear flow network and are not attempting to model the specific details regarding the noise and bias profiles of the experiment. The comparison here is only intended to show that realized experimental learning machines display a qualitatively similar performance to power efficiency trade-off as predicted by our theory.

## IV. DYNAMICAL CONTROL FOR GREATER POWER MINIMIZATION

In Sec. III, we showed how adding an explicit power minimization term in the contrast function leads to a new local learning rule that attempts to minimize both the error and the free power at the same time, leading to a trade-off between them controlled by the power minimization amplitude *λ*. We note that noisy inputs make it impossible to reach zero training error, and in any case, there is experimental noise in the self-learning circuits so there is, in practice, a non-zero error floor. Here, we use this insight to design a practical control scheme to dynamically modify *λ* during learning in order to attain a tolerable error with more power-efficient solutions. We will show how such a control scheme can yield even more power-efficient solutions compared to using a smart initialization (as in Sec. II) and constant *λ* (as in Sec. III).

Assume that we initialize the conductances of a resistor network at their minimal value (maximum resistance). This initialization leads to a free state *V*^{F}(*k*_{min}) with the lowest possible free power $P^F_{\min}$. This state corresponds to the minimum power found by the power minimization dynamics with *λ* ≫ 1, which selects the learning degrees of freedom resulting in the lowest free power $P^F_{\min}$. As seen in Fig. 2, reducing the amplitude *λ* from infinity toward zero monotonically decreases the error, while increasing the trained free power $P^F(k_\lambda^*)$.

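Given an error tolerance $\tilde{L}$, we adjust *λ* during training according to the current error: if the error is larger than the tolerance, we decrease *λ* to promote error minimization, while if the error is smaller than the tolerance, we increase *λ* to emphasize power minimization. In other words, *λ* follows the control dynamics of Eq. (18), with the parameter *ρ* setting the update timescale of *λ* and the parameter *p* controlling the rate of the control scheme (a low *p* value sets the first term in the parentheses of Eq. (18) close to 1 so that *λ* dynamics are slow).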
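A controller with the qualitative behavior described here can be sketched as follows. The functional form is an assumption made for illustration and is not taken from Eq. (18): a multiplicative update $\dot{\lambda} = (\lambda/\rho)\,[(\tilde{L}/L)^p - 1]$, discretized with an explicit Euler step. The function name `update_lambda` and the step size `dt` are hypothetical.

```python
def update_lambda(lam, error, tolerance, rho=1.0, p=0.02, dt=1.0):
    """One Euler step of an ASSUMED controller: d(lam)/dt = (lam / rho) * ((tolerance / error)**p - 1).

    error > tolerance  -> the bracket is negative, lam shrinks (favor error minimization)
    error < tolerance  -> the bracket is positive, lam grows   (favor power minimization)
    Small p keeps (tolerance / error)**p close to 1, so lam changes slowly.
    """
    if lam <= 0 or error <= 0 or tolerance <= 0:
        raise ValueError("lam, error and tolerance must be positive")
    return lam + dt * (lam / rho) * ((tolerance / error) ** p - 1.0)
```

Because the update is multiplicative and the bracket is always greater than −1, *λ* stays positive in this sketch whenever `dt / rho` does not exceed 1; other controller forms with the same sign structure would serve the same purpose.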

To test this dynamical control scheme for learning with power minimization, we simulate the training of *N* = 64 node networks for regression tasks as before. We initialize the conductance values at their minimum *k*_{min} = 10^{−3} and set *α* = 0.03, *ρ* = 1, *p* = 0.02. We find that the network trained with the *λ* dynamical control scheme quickly converges on the desired error tolerance [Fig. 4(a), full line and closed circle]. We compare these results with an “early stopping algorithm,” defined as follows: we consider a learning network without power minimization (*λ* = 0) [the dashed line shown in Fig. 4(a)]. The network reaches the desired error tolerance $\tilde{L}=10^{-3}$ after some time [marked by the open circle on the dashed line in Fig. 4(a)], which we call the “early stopping time.” Note that our dynamical control scheme in Eq. (18) reaches the same error at a time given by the solid circle on the solid line. Evidently, the dynamical control scheme achieves a lower trained free power compared with early stopping [Fig. 4(b)]. Once the dynamical control scheme reaches the time indicated by the solid circle in Fig. 4(b), *λ* stays constant but the system now keeps training at this value of *λ*, finally reaching a steady state at long times. As a result, the power advantage of this scheme [gray arrow in Fig. 4(b)] improves over training time until it converges at some free power value.

*τ* = 10^{5} in relation to the minimal free power produced by the network given for the lowest possible conductance values *k*_{min}. As higher error $\tilde{L}$ is tolerated, the dynamical control scheme improves in comparison to the simple early stopping algorithm, saving an additional fraction of power that scales as $\ln\tilde{L}$. In this case, tolerating an error level of $\tilde{L}=10^{-3}$ allows us to save $\sim 50\%$ of the trained free power *P*^{F} required for inference. We emphasize that this improvement in power is on top of using the best conductance initialization.

However, we note that gaining the full benefit of this power reduction requires long training, possibly much longer than the early stopping time, meaning that the training energy $E$ is higher compared to the early stopping algorithm. This consideration means that in our dynamical control scheme, there is a trade-off between the training energy and the trained free power of the solution. This is verified in Fig. 4(d), where we measure the power gain $G_{P^F}$ in terms of the total training energy ratio between the dynamical control scheme and early stopping algorithm, $R_E=E/E_{\mathrm{EarlyStop}}$. This trade-off also depends on the error tolerance, but we find that if one is willing to spend $\sim 5$–$10$ times the training energy compared to the early stopping algorithm, the network achieves most of the benefit of power reduction due to the dynamical control scheme. If the training energy is a major concern and constitutes a significant fraction of the energy expended by the network during its life, one should consider this trade-off for overall lowest power solutions. Finally, we note that our dynamical control scheme is not optimized. Choosing different parameters or another dynamical control scheme altogether may produce superior power saving at a possibly lower training energy.

## V. DISCUSSION

In this work, we studied how electrical circuits can physically learn to adopt desired functions in power-efficient ways. We established that physical learning affects the free power required to actuate the circuit given input signals. This free power can be lowered by choosing better initialization schemes for the learning degrees of freedom, e.g., initializing low conductances in electronic resistor networks.

We have also introduced a modified local learning rule that attempts to minimize both the error and the free power. We showed that this learning rule indeed lowers the trained free power of the obtained learning solutions in both simulations and experiments. This learning rule weights the importance of minimizing error vs power, giving rise to a trade-off between the two. While improving power efficiency at the expense of error (performance) may seem undesirable, a very low error is typically not required and can even be infeasible in real learning situations. Therefore, one can often train learning networks for lower power solutions without much of an adverse effect (Appendix C). In our experiments, there is a natural noise floor and there is no point in striving for a lower error than the floor. For these systems, power-efficient learning rules can improve the solution power with little to no penalty in error.

Finally, we have introduced a dynamical scheme for controlling the relative importance of error and power minimization to rapidly converge on power-efficient solutions with desired error tolerance. We find that such dynamical control can lead to lower power solutions. It is likely that an optimized version of such a dynamical control scheme could further reduce both the solution power and the overall training energy. This is a subject of future study.

While we presented details of the analytical approach for the case of resistor networks, our theoretical arguments apply to other physical systems trained using coupled learning, such as mechanical spring networks (Appendix D). Neuromorphic computing often promises to improve power efficiency by embedding learning algorithms in hardware, solving a major problem in modern power-hungry computational learning algorithms. While the hardware platform discussed here, self-learning electronic circuits, does indeed improve power efficiency, our work here focuses on how to achieve power efficiency in the learning process itself. As a result, our power-efficient learning approach may be easily adaptable to other neuromorphic hardware systems that can perform self-learning to offer *additional* power savings compared to only using efficient hardware.

## ACKNOWLEDGMENTS

We thank Purba Chatterjee, Marc Z. Miskin, and Vijay Balasubramanian for insightful discussions and feedback. This research was supported by the U.S. Department of Energy, Office of Basic Energy Sciences, Division of Materials Sciences and Engineering Award No. DE-SC0020963 (M.S.), the National Science Foundation via the UPenn Nos. MRSEC/DMR-1720530 and MRSEC/DMR-DMR-2309043 (S.D. and D.J.D.), and DMR-2005749 (A.J.L.), and the Simons Foundation (No. 327939 to A.J.L.). D.J.D. and A.J.L. thank CCB at the Flatiron Institute as well as the Isaac Newton Institute for Mathematical Sciences under the program “New Statistical Physics in Living Matter” (EPSRC Grant No. EP/R014601/1) for support and hospitality while a portion of this research was carried out.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Menachem Stern**: Conceptualization (lead); Data curation (lead); Formal analysis (lead); Methodology (equal); Software (lead); Writing – original draft (lead); Writing – review & editing (lead). **Sam Dillavou**: Conceptualization (supporting); Data curation (supporting); Formal analysis (supporting); Methodology (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal). **Dinesh Jayaraman**: Conceptualization (supporting); Data curation (supporting); Formal analysis (supporting); Methodology (supporting); Writing – original draft (supporting); Writing – review & editing (supporting). **Douglas J. Durian**: Conceptualization (supporting); Funding acquisition (equal); Project administration (supporting); Resources (equal); Supervision (supporting); Writing – original draft (supporting); Writing – review & editing (supporting). **Andrea J. Liu**: Conceptualization (supporting); Funding acquisition (lead); Project administration (equal); Resources (equal); Supervision (lead); Writing – original draft (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

The data that support the findings of this study are openly available in Stern, Menachem (2024). Data and codes for producing results associated with the manuscript “Training self-learning circuits for power-efficient solutions” are available at https://doi.org/10.6084/m9.figshare.24923685.v1.^{51}

### APPENDIX A: LEARNING DYNAMICS

*P*^{F} required to actuate the network response changes during learning. Before tackling the question of the free power of a learning network, let us study the dynamics of the learning DOF, *k*, and the contrast, $C$, due to the learning rule in Eq. (2). We assume that there exists a solution of the learning degrees of freedom *k** such that the contrast vanishes, $C(k^*)=0$ (this is the statement that the learning model is over-parameterized so that the learning degrees of freedom can be trained to nullify the training error). Over-parameterization implies the existence of many connected solutions in *k* space for which the contrast vanishes, and we denote by *k** the solution obtained in practice by learning. The contrast $C$ is a complicated non-convex function of the learning DOF, but we can expand it around the solution *k** to first non-vanishing order (second order),

*k**, we find that despite the explicit partial differentiation in Eq. (2), the learning dynamics are equivalent to the gradient descent on the contrast.^{34} Therefore, if we absorb the learning rate into the definition of the time unit, the weight dynamics are given by $\dot{k}=-\nabla_k C=-H(k-k^*)$. This leads to simple exponential decaying dynamics. If we set the initial condition at *k*(*t* = 0) ≡ *k*^{0}, then

*k* and contrast $C$, one complication typically arises for over-parameterized learning. We have seen before that the learning Hessian in over-parameterized learning machines tends to be low rank (with the number of non-zero eigenvalues equal to the number of training tasks).^{43} As the learning Hessian has zero eigenvalues, it is not invertible. There are no dynamics in the eigen-directions of these vanishing eigenvalues, as can be explicitly seen by rotating the frame into the coordinate system that diagonalizes $H$. The learning dynamics are agnostic to the components of *k* in the large null-space of $H$. We can plug these results in Eq. (3) to obtain the free state power dynamics,

*U*(*t*) depends on time, this ODE can be integrated to find that the free power exponentially saturates to a value

*A* is a projection matrix, projecting weight vectors into the stiff (i.e., non-null) subspace of $H$. Here, we see again that the power can increase or decrease during learning, depending on the alignment between the gradient of the free state power and the direction of weight dynamics.
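For orientation, the linearized dynamics described in this appendix can be written out explicitly. The block below is a reconstruction from the statements in the text (quadratic expansion around *k**, gradient flow with the learning rate absorbed into the time unit), not a copy of the displayed equations; the eigenvalues $h_a$ and eigenvectors $v_a$ of $H$ are notation introduced here.

```latex
\begin{aligned}
C(k) &\approx \tfrac{1}{2}\,(k-k^*)^T H\,(k-k^*), \\
\dot{k} &= -\nabla_k C = -H\,(k-k^*)
\;\;\Longrightarrow\;\;
k(t) = k^* + e^{-Ht}\,(k^0-k^*), \\
v_a^T\big(k(t)-k^*\big) &= e^{-h_a t}\, v_a^T\big(k^0-k^*\big),
\qquad H v_a = h_a v_a, \\
C(t) &\approx \tfrac{1}{2}\sum_a h_a\, e^{-2 h_a t}\,\big[v_a^T(k^0-k^*)\big]^2 .
\end{aligned}
```

Components along eigenvectors with $h_a=0$ neither evolve nor contribute to the contrast, which is the sense in which the dynamics are agnostic to the null space of $H$.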

We now discuss the modified learning dynamics that minimize both error and free power [Eq. (9)]. In the main text, we showed that these learning dynamics lead to exponentially decaying weight solutions [Eq. (12)] and associated error and free power dynamics given in Eq. (15). The dynamical trajectories given these error/power dynamics follow two different prototypes, depending on the sign of $\varphi\equiv-\partial_k C^T\partial_k P^F$. For *ϕ* < 0 (where the contrast gradient is aligned with the free power gradient), the contrast undershoots the infinite time limit, getting arbitrarily close to $C=0$ before rebounding exponentially to $C(t\to\infty)$ [Fig. 5(a)]. This scenario is common when initializing the network with high conductance values. For *ϕ* > 0 (i.e., anti-alignment of the contrast and free power gradients), the dynamics tend to increase the free power, and we see analytically regular dynamics, where the contrast smoothly decays exponentially to its terminal value $C(t\to\infty)$ [Fig. 5(b)]. This scenario is common in flow/resistor networks initialized at low conductance values.

Figure 5(c) verifies the argument laid out in the main text that the solution shifts linearly with *λ*, $k^*_\lambda-k^*_{0^+}\sim\lambda$. We also show as before that the error grows quadratically with *λ*. Crucially, the arguments for the error are relevant not only for the training set (regression examples used to train the network) but also for the test examples that the network had not seen previously, whose error also scales quadratically in *λ*. More information about the regression tasks, as well as the training and test sets, is presented in Appendix B. Similarly to the error, the arguments about the trained free power are valid for both the training and test sets so that our modified learning dynamics reduces both of them [Fig. 5(d)].
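The scalings quoted above (power reduction linear in *λ*, error quadratic in *λ*) can be checked numerically on a toy quadratic model. This is a sketch under stated assumptions, not the resistor-network simulation: the full-rank diagonal Hessian `H_DIAG`, the zero-error solution `K_STAR`, and the constant power gradient `G` are all hypothetical, and the dynamics are plain gradient flow on the contrast plus *λ* times a linear power.

```python
import math

H_DIAG = (1.0, 0.5)    # assumed full-rank diagonal contrast Hessian
K_STAR = (1.0, 1.0)    # assumed zero-error solution
G = (1.0, 2.0)         # assumed constant gradient of the free power


def steady_state(lam, dt=0.05, steps=2000):
    """Integrate k' = -H (k - k*) - lam * g until it has relaxed."""
    k = [0.1, 0.1]
    for _ in range(steps):
        for i in range(2):
            k[i] -= dt * (H_DIAG[i] * (k[i] - K_STAR[i]) + lam * G[i])
    contrast = 0.5 * sum(h * (ki - ks) ** 2 for h, ki, ks in zip(H_DIAG, k, K_STAR))
    power_drop = sum(g * (ks - ki) for g, ki, ks in zip(G, k, K_STAR))
    return contrast, power_drop


def loglog_slope(x1, y1, x2, y2):
    """Exponent of a power law passing through two points."""
    return math.log(y2 / y1) / math.log(x2 / x1)


lam_lo, lam_hi = 1e-4, 1e-2
c_lo, dp_lo = steady_state(lam_lo)
c_hi, dp_hi = steady_state(lam_hi)
error_exponent = loglog_slope(lam_lo, c_lo, lam_hi, c_hi)
power_exponent = loglog_slope(lam_lo, dp_lo, lam_hi, dp_hi)
print(f"error exponent ~ {error_exponent:.3f}, power-drop exponent ~ {power_exponent:.3f}")
```

For this toy model the fitted exponents are 2 for the error and 1 for the power reduction, because the stationary shift away from `K_STAR` is proportional to `lam`. A low-rank Hessian, as discussed in this appendix, would require restricting the argument to the stiff subspace.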

It is also interesting to combine our results for the dependence of the free power on both the power optimization *λ* and the initialization scale of the learning DOF *κ*, as discussed in Sec. II B. We simulated the learning of regression tasks on *N* = 64 networks as earlier and varied *λ* in the range 10^{−10}–10^{−2} and the initialization scale *κ* in the allowed range 10^{−3}–10^{1}. We observe the emergence of two regimes of interest, one where the power minimization is weak *λ* ≪ 1 and the other where *λ* is large [Fig. 6(a)]. The large *λ* regime is simpler to understand as the learning solutions there primarily reduce free power at the expense of error. Therefore, the free power reduction $\Delta P^F_\lambda$ is essentially the reduction from the free power at initialization to the minimal free power supported by the system, which scales as *κ* [at *λ* = 10^{−4}, Fig. 6(b)]. For weak minimization *λ* ≪ 1, the effect of the initialization is more subtle. We find that the free power reduction scales approximately as a power law *κ*^{0.5} [at *λ* = 10^{−8}, Fig. 6(b)]. Since here we measure the free power reduction, we find that at lower initialization scales, less power is saved by applying power minimization. However, there is still a substantial benefit in free power reduction even for good initialization. These results also help contextualize our dynamical control scheme in Sec. IV, where we show that the dynamical scheme supports additional free power saving compared to just using good initialization. We reserve the detailed study of the interplay between initialization and free power minimization for future study.

### APPENDIX B: PHYSICAL LEARNING TASKS

*N* = 64 whose structure is derived from jammed two-dimensional packings.^{52} We randomly choose two edges as input edges and another two as output edges [see Fig. 1(a)]. The input and output voltage drops are denoted by the vectors Δ*V*_{i} and Δ*V*_{o}, respectively. The network is trained to perform regression recovering a linear relation,

$$\Delta V_o=\sum_i\tilde{A}_{oi}\Delta V_i+\epsilon,$$

with *ϵ*, a possible addition of white noise. Since we train a linear resistor network, the functional relation between the input and output voltage drops is always linear, Δ*V*_{o} = ∑_{i} *A*_{oi}Δ*V*_{i}, and the correct matrix relation $\tilde{A}_{oi}$ is supposed to be recovered by learning. The values for the desired matrix were randomly chosen from the distribution

We trained these networks in many realizations of geometry, choice of input/output edges and $\tilde{A}_{oi}$. To train each realization of the problem, we sampled *n*_{tr} = 20 training examples $\Delta V_i^{\mathrm{Training}}\sim U(0,1)^2$ and corresponding outputs $\Delta V_o^{\mathrm{Training}}=\sum_i\tilde{A}_{oi}\Delta V_i^{\mathrm{Training}}+\epsilon$. Note that the scale of the input voltage drops determines the scale of power dissipation in the free state, $P^F\sim\overline{\Delta V_i}^2$.

In the main text, we looked at noiseless regression problems with *ϵ* = 0, for which the network can find exact solutions with zero error. In Appendix C, we study a case with finite label noise *ϵ* = 10^{−3}. The training examples are sampled randomly during training and used to define the free and clamped states in the iterative learning process. Apart from the *n*_{tr} = 20 training examples, we also sampled *n*_{te} = 100 test examples from a wider distribution $V_i^{\mathrm{Test}}\sim N(0,1)$ and their associated desired outputs. The test points are not used during the learning process but help in verifying that the network can generalize. In our work, the test set is also interesting for showing that the power-efficient property of the solutions generalizes beyond the training set [Fig. 5(d)].

Similarly, we can compute the loss value associated with the test set. We simulate physical learning by applying the local learning rules described in the main text, picking a nudging amplitude value *η* = 10^{−3}.
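The sampling protocol described in this appendix can be sketched as follows. The helper names, the fixed seed, the placeholder target matrix `A_TARGET` (the distribution used for the desired matrix is not reproduced here), the reading of the label noise *ϵ* as zero-mean Gaussian noise with standard deviation `eps`, and the normalization of the mean squared error are all assumptions made for illustration.

```python
import random

# Hypothetical 2x2 target matrix; the actual distribution is not specified here.
A_TARGET = [[0.15, 0.20],
            [0.10, 0.05]]


def linear_map(dv_in):
    """Desired noiseless outputs for a pair of input voltage drops."""
    return [sum(a * x for a, x in zip(row, dv_in)) for row in A_TARGET]


def make_examples(n, sample_input, eps, rng):
    """Return n (inputs, outputs) pairs with outputs = A_TARGET @ inputs + noise."""
    examples = []
    for _ in range(n):
        dv_in = [sample_input(rng) for _ in range(2)]
        dv_out = [y + rng.gauss(0.0, eps) for y in linear_map(dv_in)]
        examples.append((dv_in, dv_out))
    return examples


def mse(predict, examples):
    """Mean squared error, averaged over examples and output edges."""
    total = 0.0
    count = 0
    for dv_in, dv_out in examples:
        for pred, target in zip(predict(dv_in), dv_out):
            total += (pred - target) ** 2
            count += 1
    return total / count


rng = random.Random(0)
# n_tr = 20 training inputs drawn from U(0, 1)^2, noiseless labels (eps = 0)
train_set = make_examples(20, lambda r: r.uniform(0.0, 1.0), eps=0.0, rng=rng)
# n_te = 100 test inputs drawn from the wider N(0, 1) distribution
test_set = make_examples(100, lambda r: r.gauss(0.0, 1.0), eps=0.0, rng=rng)

print(len(train_set), len(test_set))
print(mse(linear_map, train_set), mse(linear_map, test_set))
```

Passing `eps=1e-3` instead of `eps=0.0` would produce the noisy labels considered in Appendix C, under the Gaussian-noise assumption stated above.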

For better comparison to the physical experiment of Fig. 3, we also simulated a resistor network with the same geometry and the same simple allostery task as the experiment [inset of Fig. 3(d)]. In these simulations, we randomly choose two input nodes and assign to them input voltages, *V*_{i1} = 0 V and *V*_{i2} = 0.45 V. We further choose two random output nodes and train them such that when the inputs are applied, they would have output voltage values *V*_{o1} = 0.09 V and *V*_{o2} = 0.36 V. In this case, we again use a mean squared error loss function and compare the output voltages at the output nodes to the desired values *V*_{o1,2}.

### APPENDIX C: POWER MINIMIZATION FOR LIMITED ACCURACY TASKS

The numerical results in the main text were limited to tasks that can, in principle, be learned perfectly by the learning machine. In such cases, there exist solutions with no error, $L(k^*)=0$, as discussed in Appendix A. There are, however, cases in which it is impossible to obtain solutions with zero error. The typical example is when the training set does not capture all the information contained in the broader data (or the test set). There are also cases where it is impossible to find solutions that nullify the error even on the training set. This can occur due to under-parameterization (too few learning degrees of freedom to learn the task), an insufficiently expressive model (e.g., a linear network cannot represent nonlinear relations), or noise in the learning process.^{48}

*k*^{†} would have finite error and contrast values $L(k^\dagger)$, $C(k^\dagger)$. Nonetheless, we can still perform a quadratic approximation around the contrast minimum *k*^{†} similar to Appendix A, where the constant term $C(k^\dagger)$ is retained,

*λ* is applied in Eq. (9),

Comparing these expressions to Eq. (16), we see that the trained free power behavior stays the same. We also see that the error shift is the same, scaling as *λ*^{2}, but now there is a finite contrast floor $C(k^\dagger)$ associated with a finite error. The trade-off between error and power is still maintained, although in this case, it may be much more favorable. For a small enough *λ*, $\frac{1}{2}\lambda^2 s^T H s\ll C(k^\dagger)$, and so the contrast (and error) is nearly unaffected by the power minimization. As a result, we can apply a finite power minimization parameter *λ*, reducing the trained free power at nearly no penalty. Thus, power minimization is particularly useful for problems in which zero error solutions are not possible.
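The condition above can be rearranged into an explicit scale for *λ*. The following is a short worked consequence of the inequality stated in the text, assuming the quadratic form holds and $s^T H s>0$; the symbol $\lambda_c$ is introduced here for convenience.

```latex
C(k^*_\lambda) \approx C(k^\dagger) + \tfrac{1}{2}\lambda^2\, s^T H s,
\qquad
\tfrac{1}{2}\lambda^2\, s^T H s \ll C(k^\dagger)
\;\;\Longleftrightarrow\;\;
\lambda \ll \lambda_c \equiv \sqrt{\frac{2\,C(k^\dagger)}{s^T H s}} .
```

Read this way, a larger contrast floor $C(k^\dagger)$ widens the range of *λ* over which power can be reduced with little effect on the error.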

To verify these considerations, we simulated physical learning in *N* = 64 networks on regression and classification tasks (Fig. 7). Excess noise was added to the regression labels (outputs) in the training and test sets, sampled from a distribution $\Delta V_o\sim\sum_i\tilde{A}_{oi}\Delta V_i+\epsilon$, with *ϵ* = 10^{−3} (see Appendix B). The simulated networks can successfully learn these tasks, reducing the error to some finite value $L\approx10^{-5}$ [Fig. 7(a)]. When adding a small power minimization *λ* < 10^{−7}, the learning trajectories are almost unchanged and the error is nearly unaffected. When *λ* increases, the error starts increasing beyond the error floor [Fig. 7(b)]. At the same time, we observe that the trained free power is decreased linearly at finite *λ*, as seen before. These results show that in noisy cases, such as those seen in physical learning experiments, free power reduction can be achieved at little expense in errors up to a certain point.

Another case where this result is particularly relevant is in classification problems, where we would like to assign discrete labels to inputs. A standard example for such tasks is the classification of Iris specimens based on the measurements of lengths of their petals and sepals.^{53} Previously, we have shown that our flow/resistor networks can successfully learn to classify the Iris dataset, as well as could be expected from linear network models, in simulations^{43} and experiments.^{37} In discrete classification tasks, we are typically not concerned with the mean squared error but with a measure of accuracy given by a discrete choice of the label based on the network response; excellent classification is possible even at relatively high values of the mean squared error. Therefore, it may be possible to induce power optimization without a penalty in classification accuracy. To test this idea, we simulated the training of our *N* = 64 node networks to classify the Iris dataset (a detailed description of the training protocol can be found in Ref. 37). Training at different power minimization amplitudes *λ* in the range 10^{−10} < *λ* < 10^{−2}, we find that the classification accuracy (for the training and test sets) is not affected by power minimization until *λ* ≈ 10^{−7} [Fig. 7(c)]. At the same time, the solution free power is significantly reduced starting at *λ* > 10^{−8}, showing that power gain (in this case, by a factor $\sim 2$) is possible at little penalty in accuracy [Fig. 7(d)].

^{54} in the space of the learning DOF. The instantaneous values of the learning DOF are then sampled from a normal distribution centered around $k^*_\lambda$ in Eq. (14), with a standard deviation scaling with the white noise amplitude *σ*,^{55}

*θ*_{λ,i} are the eigenvalues of the matrix $H+\lambda H$. In other words, the noise induces the conductances to explore a vicinity of the solution $k^*_\lambda$, whose size depends on the noise amplitude *σ*, and the curvature is given by the eigenvalues *θ*_{λ,i}. We can take this distribution of values of the learning DOF and plug it in the equation for the contrast [Eq. (11)], finding the distributions of this quantity,

We find that if we know the noise scale *σ*, measuring the average contrast value $\langle C_0(\sigma)\rangle$ allows us to glean information about the effective average curvature of the contrast near the learning solution. Note that the learning DOF diffuse freely in the space of zero contrast solutions, so the effective curvature is associated with the typical slopes of the contrast leaving the zero manifold. Overall, we see that the free power reduction is, on average, the same as in the case with no noise (up to second order terms in *λ*). However, the contrast now has a finite added term due to the exploration of values of the learning DOF beyond the minimum $k^*_\lambda$. This means that additive white noise has a similar effect to the finite contrast floor discussed earlier; finite power minimization *λ* can reduce the trained free power while having nearly no effect on the contrast (or error) up to a certain scale.
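The picture of noise-driven exploration around the solution can be illustrated with a one-dimensional toy model. The sketch below assumes a standard overdamped Langevin (Ornstein-Uhlenbeck) update with curvature `theta` and noise amplitude `sigma`; this normalization, the parameter values, and the function name are assumptions and need not match the expressions referenced in the text.

```python
import math
import random


def mean_squared_deviation(theta, sigma, dt=0.01, steps=200_000, seed=0):
    """Euler-Maruyama for dk = -theta * (k - k_sol) dt + sigma dW.

    Returns the time-averaged squared deviation of k from k_sol.
    """
    rng = random.Random(seed)
    k_sol = 1.0            # hypothetical noiseless solution
    k = k_sol
    total_sq = 0.0
    for _ in range(steps):
        kick = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        k += -theta * (k - k_sol) * dt + kick
        total_sq += (k - k_sol) ** 2
    return total_sq / steps


theta, sigma = 1.0, 1e-2
msd = mean_squared_deviation(theta, sigma)
expected_msd = sigma ** 2 / (2.0 * theta)   # stationary variance of this toy process
mean_contrast = 0.5 * theta * msd           # average of (1/2) * theta * (k - k_sol)^2
print(f"measured {msd:.3e}, stationary value about {expected_msd:.3e}")
print(f"average toy contrast {mean_contrast:.3e}")
```

In this toy model the explored region shrinks with larger curvature and grows with the noise amplitude, while the average contrast added by the noise is set by `sigma` alone, which mirrors the qualitative statement that a finite noise-induced contrast floor appears.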

### APPENDIX D: POWER MINIMIZATION IN MECHANICAL SPRING NETWORKS

In this work, we presented general arguments on how local learning rules could balance minimizing the error and the trained free power of obtained physical learning solutions, giving rise to a trade-off between the two. However, in the main text, we only tested these ideas numerically and experimentally in resistor networks. Here, we show in simulations that these arguments apply similarly to physical learning systems governed by different physics, e.g., an elastic network of harmonic springs [Fig. 8(a)].

^{48,49,56–60} Specifically, coupled learning can train spring networks to perform the desired tasks by modifying the spring constants or rest lengths.^{34} The physical cost function naturally minimized by such networks is the elastic energy *E*,

*k*_{i} is the spring constant of spring *i*, *ℓ*_{i} is its rest length, *r*_{i} the Euclidean distance between the nodes connected by the spring, and the energy is summed over all individual springs. For a spring network with adaptive spring constants, the local learning rule is

*i* in the free and clamped state, respectively. More details on the derivation of this learning rule can be found in Ref. 34. To see if spring networks can be trained to adopt low free energy solutions, i.e., spring configurations for which the desired state is easy (takes little energy) to actuate, we add a local free energy minimization term with amplitude *λ*, similarly to Eq. (9),

We simulate this modified learning algorithm on an unstrained spring network with *N* = 27 nodes, as shown in Fig. 8(a). These networks are trained for allostery tasks, in which we apply prescribed relative strains of 0.2 (randomly choosing contraction or extension) and desire particular strain values at another two random bonds (0.05 or 0.03, randomly choosing contraction or extension). With no energy minimization applied (*λ* = 0), coupled learning generally succeeds in training these networks to a numerical normalized error floor of $L\sim10^{-8}$. As we increase the power minimization amplitude *λ*, we observe that the error increases as *λ*^{2} and the trained free energy *E*^{F} is reduced as *λ* [Fig. 8(a)], as predicted by Eq. (16) and observed in simulations of resistor networks. These results show that our approach to physical learning of power-efficient solutions can be employed beyond linear resistor networks. Recent experimental progress has been achieved for implementing coupled learning in elastic networks,^{42} but we leave the experimental validation of energy reduction in such networks for future study.
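The ingredients described in this appendix can be collected in a short sketch. It assumes the standard harmonic form for the elastic energy, $E=\tfrac12\sum_i k_i (r_i-\ell_i)^2$, which matches the description of $k_i$, $\ell_i$, and $r_i$ above, and a per-spring update built from the difference between free-state and clamped-state strain energies minus a penalty proportional to the free-state strain energy. The function names, the placement of the prefactors `alpha`, `eta`, and `lam`, and the example numbers are assumptions, not the equations of the original.

```python
def elastic_energy(k, r, rest):
    """Harmonic elastic energy summed over springs: 0.5 * k_i * (r_i - l_i)^2."""
    return 0.5 * sum(ki * (ri - li) ** 2 for ki, ri, li in zip(k, r, rest))


def spring_constant_updates(r_free, r_clamped, rest, alpha, eta, lam):
    """Per-spring change of k_i (hypothetical form, see the lead-in).

    The first term compares the strain energy per unit stiffness in the free
    and clamped states; the second penalizes energy stored in the free state.
    """
    updates = []
    for rf, rc, li in zip(r_free, r_clamped, rest):
        free_term = 0.5 * (rf - li) ** 2
        clamped_term = 0.5 * (rc - li) ** 2
        updates.append(alpha * ((free_term - clamped_term) / eta - lam * free_term))
    return updates


# Three springs with made-up lengths; only the first differs between states.
rest = [1.0, 1.0, 1.0]
r_free = [1.2, 0.9, 1.0]
r_clamped = [1.1, 0.9, 1.0]
k = [1.0, 1.0, 1.0]

energy_free = elastic_energy(k, r_free, rest)
updates_plain = spring_constant_updates(r_free, r_clamped, rest, alpha=0.1, eta=1e-3, lam=0.0)
updates_penalized = spring_constant_updates(r_free, r_clamped, rest, alpha=0.1, eta=1e-3, lam=1e-2)
print(energy_free)
print(updates_plain)
print(updates_penalized)
```

With `lam = 0`, springs whose length is identical in the free and clamped states receive no update. With `lam > 0`, any spring that is strained in the free state is additionally softened, which lowers the energy needed to actuate that state.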

## REFERENCES

*Connectionist Models*