Training self-learning circuits for power-efficient solutions

As the size and ubiquity of artificial intelligence and computational machine learning models grow, the energy required to train and use them is rapidly becoming economically and environmentally unsustainable. Recent laboratory prototypes of self-learning electronic circuits, such as “physical learning machines,” open the door to analog hardware that directly employs physics to learn desired functions from examples at a low energy cost. In this work, we show that this hardware platform allows for an even further reduction in energy consumption by using good initial conditions and a new learning algorithm. Using analytical calculations, simulations, and experiments, we show that a trade-off emerges when learning dynamics attempt to minimize both the error and the power consumption of the solution—greater power reductions can be achieved at the cost of decreasing solution accuracy. Finally, we demonstrate a practical procedure to weigh the relative importance of error and power minimization, improving the power efficiency given a specific tolerance to error.


I. INTRODUCTION
There has been a meteoric rise in the adoption and usage of artificial intelligence (AI) and machine learning (ML) tools in just the past 15 years, [1,2] accompanied by an equally spectacular rise in the sizes of ML models and the amount of computation required to train and apply them. [3,4] In recent years, the energy required to train state-of-the-art ML models, as well as to use the trained models, has been rising exponentially, doubling every 4-6 months. [5][8] The field of neuromorphic computing [9][10][11][12][13] strives to recreate the ability to learn in hardware. A major motivation for the development of neuromorphic systems is the possibility of massive energy savings compared to ML implemented on standard computers. [13,14][28][29][30] Recently, a new avenue was opened toward realizing power-efficient neuromorphic computing, dubbed physical learning machines or self-learning physical networks. [31] Rather than mimicking known learning algorithms, such as backpropagation, such systems exploit their inherent physics in order to learn, using local learning rules that modify the learning degrees of freedom based on locally available information, such that the system globally learns to perform desired tasks. A certain class of local learning rules, known as contrastive learning, [32][33][34][35][36] describes how learning degrees of freedom should be modified in order for systems to achieve desired outputs in response to inputs supplied by observed examples of use (i.e., supervised learning).

ARTICLE pubs.aip.org/aip/aml
In order to realize any power gains, such learning rules must be implemented in hardware. [38][39][40] Such systems already consume less power than conventional computers doing inference because they are analog rather than digital. [41] Here, we use analytical theory, computation, and experiments to show that the propensity of electronic circuits to minimize power dissipation enables even greater reductions in power consumption via appropriate initialization and power-efficient learning rules. We specifically demonstrate these results for regression tasks. However, it should be noted that our analysis and results should apply to other physical learning machines in different physical media (e.g., mechanical networks) if they can be developed in the lab, [42] as well as to other types of problems (e.g., classification).
This paper is organized as follows: in Sec. II, we describe the physical learning approach and discuss how the power consumption of the system is modified by learning, in particular, as we change the initial conditions of the learning degrees of freedom. A judicious choice of initial conductances yields learning solutions with low power consumption, while also reducing the energy consumed in training. In Sec. III, we introduce a modification to the local physical learning rule in order to minimize both error and power consumption. We analyze this new local rule theoretically and test it in simulations and lab experiments, concluding that it leads to an error-power trade-off; lower-power solutions may be obtained at the expense of higher errors. The energy required to train the system can be reduced as well. Finally, in Sec. IV, we demonstrate how a power-efficient learning algorithm with dynamical control over the weighting of power and error optimization can lead to an efficient adaptation of low-power solutions beyond simply using good initial conditions and constant weighting.

II. POWER CONSUMPTION IN PHYSICAL LEARNING CIRCUITS
In previous work, we established theoretically and experimentally that self-learning resistor networks can be trained to perform tasks such as allostery, regression, and classification. [37,39,40] Training a deep neural network corresponds to minimizing a learning cost function with respect to learning degrees of freedom (edge weights and biases). The learning landscape, described by the learning cost function as one axis in the high-dimensional space where each of the other axes corresponds to a different learning degree of freedom, remains fixed during the minimization. On the other hand, successful training of physical learning machines corresponds to the simultaneous minimization of two cost functions, the learning and physical cost functions, with respect to two different sets of degrees of freedom (DOF), the learning and physical degrees of freedom, respectively. In the case of a self-learning electrical network of variable resistors, the physical cost function is the dissipated power, the physical DOF are the node voltages, and the learning DOF are the conductances.
Notably, the physical cost function, or power, depends implicitly on the learning DOF. As a result, both the learning landscape and the physical landscape evolve during training. For example, training gives rise to soft modes in the physical landscape and to stiff modes in the learning landscape, making the system more conductive and lowering its effective response dimension. [43] The height of a minimum in the physical landscape corresponds to the power required to actuate the desired response (to obtain the desired outputs in response to the given inputs from training data). Due to the coupling between the learning and physical landscapes, it is possible to find and push down the minima in the physical landscape corresponding to the global minima in the learning landscape during training, thus decreasing the amount of power required to perform a given task.
Consider an electrical circuit that minimizes a scalar physical cost function $P(V; k)$ (e.g., the dissipated power), depending on a set of physical DOF $V$ (e.g., the node voltages) and a set of learning DOF $k$ (e.g., the edge conductances). When an input signal (e.g., a set of voltages at input nodes) is applied, the system responds by optimizing the physical DOF to minimize $P$ subject to the input constraints, producing a stable free state $V^F$ with an associated free power $P^F(V^F; k)$. Training this system for specific output responses using coupled learning [34] involves clamping the targets $T$ by slightly nudging them toward the desired response,
$$V^C_T = V^F_T + \eta\,(\tilde{V}_T - V^F_T),$$
with $\tilde{V}_T$ being the desired response and nudge amplitude $\eta \ll 1$. The physical system then minimizes the physical cost function subject to both the inputs and this clamping, yielding a clamped state $V^C$ with a clamped power $P^C(V^C; k)$. The contrast (or contrastive function) is defined as the difference between the physical cost (powers) for the clamped and free states,
$$C \equiv P^C(V^C; k) - P^F(V^F; k), \tag{1}$$
which is intrinsically non-negative. Minima with vanishing contrast are also minima of the error (loss) function $L$ that is typically used to measure the quality of a learning solution, e.g., the mean squared difference between the desired and obtained behavior. [34] Physical learning is achieved by a learning rule that corresponds to modifying the learning degrees of freedom according to the partial derivative of the contrast,
$$\dot{k} = -\alpha\,\partial_k C. \tag{2}$$
This learning rule is local, with $\alpha$ being a scalar learning rate, setting the time scale for the learning dynamics. A system following these dynamics with a sufficiently low learning rate tends to minimize the learning cost function $L$ [as shown in Fig. 1(b)]. See Appendix A for more details on the learning dynamics close to a solution for the learning degrees of freedom $k^*$.
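The free state, clamped state, and local update described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the simulation code used in this work: the five-node fully connected network, the choice of input and target nodes, the power convention $P = \frac{1}{2}\sum_i k_i \Delta V_i^2$, and the division of the update by $\eta$ (which only rescales the learning rate) are all assumptions made for the example, and the function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def solve_state(incidence, k, fixed_nodes, fixed_values):
    """Node voltages that minimize dissipated power with some nodes held fixed."""
    n_nodes = incidence.shape[1]
    fixed_nodes = np.asarray(fixed_nodes)
    fixed_values = np.asarray(fixed_values, dtype=float)
    laplacian = incidence.T @ (k[:, None] * incidence)
    free = np.setdiff1d(np.arange(n_nodes), fixed_nodes)
    v = np.zeros(n_nodes)
    v[fixed_nodes] = fixed_values
    rhs = -laplacian[np.ix_(free, fixed_nodes)] @ fixed_values
    v[free] = np.linalg.solve(laplacian[np.ix_(free, free)], rhs)
    return v

def coupled_learning_step(incidence, k, in_nodes, in_values, out_nodes, desired,
                          eta=1e-3, alpha=0.1, k_min=1e-3, k_max=10.0):
    """One update of k along minus the partial derivative of the contrast."""
    v_free = solve_state(incidence, k, in_nodes, in_values)
    # Clamp the targets by nudging them a fraction eta toward the desired values.
    nudged = v_free[out_nodes] + eta * (desired - v_free[out_nodes])
    v_clamped = solve_state(incidence, k,
                            np.concatenate([in_nodes, out_nodes]),
                            np.concatenate([in_values, nudged]))
    dv_free, dv_clamped = incidence @ v_free, incidence @ v_clamped
    # Per-edge (local) gradient; dividing by eta is an assumed normalization.
    grad = 0.5 * (dv_clamped**2 - dv_free**2) / eta
    proposed = k - alpha * grad
    # Updates that would leave the allowed conductance range are not performed.
    allowed = (proposed >= k_min) & (proposed <= k_max)
    error = float(np.mean((v_free[out_nodes] - desired) ** 2))
    return np.where(allowed, proposed, k), error

# Hypothetical example: fully connected 5-node network, one regression target.
n_nodes = 5
edges = list(combinations(range(n_nodes), 2))
incidence = np.zeros((len(edges), n_nodes))
for row, (a, b) in enumerate(edges):
    incidence[row, a], incidence[row, b] = 1.0, -1.0

k = np.ones(len(edges))
in_nodes, in_values = np.array([0, 1]), np.array([0.0, 1.0])
out_nodes, desired = np.array([4]), np.array([0.3])

errors = []
for _ in range(1000):
    k, err = coupled_learning_step(incidence, k, in_nodes, in_values,
                                   out_nodes, desired)
    errors.append(err)
print(errors[0], errors[-1])
```

Each edge only needs its own voltage drop in the two states, which is what makes the rule local.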

A. Power consumption in learned solutions
We now turn our attention to the scalar physical cost (i.e., power consumption) of the free state $P^F$. This free power is the relevant measure for the power used by the system to perform inference. As noted, this is the power associated with the application of the input voltages, allowing the output readouts. Our work is primarily concerned with minimizing this free power $P^F$, while also achieving good learning solutions with low error. We will show how the free power can be lowered by non-hardware means, such as choosing better initialization for the learning DOF and using learning rules that minimize the free power at the same time as the error. The free power of a system trained for a specific task will henceforth be referred to as the trained free power. We will also look at the total energy required to train the system $E$, henceforth termed the training energy, which can be estimated by integrating the free power over the training time. We show how this training energy can also be optimized by these initialization schemes and learning rules.
We first study how the free power $P^F$ is affected by the basic coupled learning rule of Eq. (2), and later see how it can be substantially reduced by modifying this rule. Using the chain rule in Eq. (2), we can derive an ODE for the free power during training,
$$\dot{P}^F = \nabla_k P^F \cdot \dot{k} = -\alpha\,\nabla_k P^F \cdot \partial_k C. \tag{3}$$

Note that the free state is a minimum of the power subject to the inputs so that the derivative $\partial_V P^F$ vanishes exactly, and hence, $\nabla_k P^F = \partial_k P^F + \frac{dV}{dk}\,\partial_V P^F = \partial_k P^F$. We see that the free power tends to decrease if the gradients of the free power and the contrast with respect to $k$ align and to increase otherwise. Assuming the free power changes slowly with $k$, or that the learning DOF $k$ are close to the learning solution $k^*$, we can approximate the free power using the following Taylor expansion:
$$P^F(k) \approx P^F(k^*) + (k - k^*) \cdot \nabla_k P^F(k^*).$$
The free power changes due to the learning dynamics, starting at the initial condition $P^F(k_0)$ and ending after training with $P^F(k^*)$. Next, we discuss the sign of this shift, determined by the alignment between the gradients of the contrast and free power.
Here, we specialize to the case of learning electrical circuits, e.g., adaptive resistor networks, where the physical DOF are the voltages at nodes $V_a$, while the learning DOF are conductances $k_i$ of edges $i$ connecting pairs of nodes. An adjacency matrix $\Delta_{ia}$ is defined such that each row of the matrix corresponds to an edge, having a value of +1 at the index of the incoming node of that edge, −1 at the index of the outgoing node, and 0 elsewhere. The choice of which node is incoming or outgoing is a matter of convention and sets the direction of currents but has no physical consequence. The vector of voltage drops on the edges is given by $\Delta V_i = \sum_a \Delta_{ia} V_a$. Resistor networks minimize the total power dissipation,
$$P = \frac{1}{2}\sum_i k_i\,\Delta V_i^2. \tag{5}$$
In such networks, where one of the nodes is grounded at $V_G = 0$, the native state of the network (in the absence of any inputs) is where all voltage values are zero, all voltage drops are zero, and the total power dissipation is $P = 0$. When the free/clamped boundary conditions are applied, for example, by introducing currents in certain input and output edges, the free and clamped powers are
$$P^F = \frac{1}{2}\sum_i k_i\,(\Delta V^F_i)^2, \qquad P^C = \frac{1}{2}\sum_i k_i\,(\Delta V^C_i)^2.$$
Given weak clamping ($V^C - V^F \sim \eta \ll 1$), we write the contrast function $C$, neglecting the terms of order $\eta^2$,
$$C \approx \sum_i k_i\,\Delta V^F_i\,(\Delta V^C - \Delta V^F)_i.$$
We take the partial derivative of the contrast with respect to $k$ (as done in Ref. 35),
$$\partial_{k_i} C \approx \Delta V^F_i\,(\Delta V^C - \Delta V^F)_i.$$
In this simple case, the learning modification is determined by the alignment of each component of the free state response $\Delta V^F_i$ with its nudge in the clamped state $(\Delta V^C - \Delta V^F)_i$. In these particular models, we also know that the free power gradient is non-negative, $\partial_{k_i} P^F = \frac{1}{2}(\Delta V^F)^2_i \geq 0$. We conclude that if the clamped state nudge aligns with the free state response, the free power tends to decrease. This is sensible as the system has to decrease its conductances to achieve a stronger response required by the clamping. The opposite effect occurs when the nudged response is misaligned with the free state, resulting in increased conductances.
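The statement that the total derivative of the free power with respect to a conductance reduces to its partial derivative can be checked numerically. The sketch below is illustrative only: it assumes the convention $P = \frac{1}{2}\sum_i k_i \Delta V_i^2$, inputs applied as fixed node voltages, and a small hypothetical four-node network. It compares $\frac{1}{2}(\Delta V^F_i)^2$ with a finite-difference derivative of the relaxed free power.

```python
import numpy as np

def free_power(incidence, k, fixed_nodes, fixed_values):
    """Return (P_free, voltage drops) after relaxing the unconstrained nodes."""
    n_nodes = incidence.shape[1]
    fixed_nodes = np.asarray(fixed_nodes)
    fixed_values = np.asarray(fixed_values, dtype=float)
    laplacian = incidence.T @ (k[:, None] * incidence)
    free = np.setdiff1d(np.arange(n_nodes), fixed_nodes)
    v = np.zeros(n_nodes)
    v[fixed_nodes] = fixed_values
    rhs = -laplacian[np.ix_(free, fixed_nodes)] @ fixed_values
    v[free] = np.linalg.solve(laplacian[np.ix_(free, free)], rhs)
    drops = incidence @ v
    return 0.5 * np.sum(k * drops**2), drops

# Hypothetical 4-node network: a square with one diagonal.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
incidence = np.zeros((len(edges), 4))
for row, (a, b) in enumerate(edges):
    incidence[row, a], incidence[row, b] = 1.0, -1.0

rng = np.random.default_rng(1)
k = rng.uniform(0.5, 1.5, size=len(edges))
fixed_nodes, fixed_values = [0, 1], [0.0, 1.0]  # ground and one input node

# Analytic partial derivative at fixed voltages.
_, drops = free_power(incidence, k, fixed_nodes, fixed_values)
analytic = 0.5 * drops**2

# Total derivative by central differences, re-solving the free state each time.
eps = 1e-6
numeric = np.zeros_like(k)
for i in range(len(k)):
    step = np.zeros_like(k)
    step[i] = eps
    p_plus, _ = free_power(incidence, k + step, fixed_nodes, fixed_values)
    p_minus, _ = free_power(incidence, k - step, fixed_nodes, fixed_values)
    numeric[i] = (p_plus - p_minus) / (2 * eps)

print(np.max(np.abs(numeric - analytic)))
```

The agreement reflects the fact that the free state already minimizes the power over the voltages, so the implicit dependence of the voltages on $k$ contributes nothing at first order.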

B. Power dependence on initial conditions
In Sec. II A, we established how physical learning affects the system's free power $P^F$, i.e., its power consumption in the free state. In the following, we consider how the initial conditions of the learning degrees of freedom determine the trained free power, i.e., the free power of the learned solutions. We will show that judicious initialization leads to considerable savings in power consumption.
[45][46][47] In the context of physical learning, the choice of initialization may not only affect the training time and accuracy of a solution but may also have important effects on the trained free power. Suppose a set of voltage drops is applied over some input edges of a resistor network, and we read out the resulting voltage drops over some other output edges. In addition, suppose that the conductance values of the network have a certain scale κ. It is known that the output voltage drops do not depend on the scale κ but only on the relative ratios of the conductance of different edges. However, reducing the conductance scale does, in fact, linearly decrease the free power [Eq. (5)]. Thus, we can, in principle, improve the trained free power indefinitely by reducing the conductance scale. Realistically, we are bound by experimental considerations: variable conductive elements have minimal conductance values (corresponding to maximal resistance). Furthermore, low conductance necessitates more precise hardware implementations as the network response becomes highly sensitive to small variations in the conductance.
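The scale argument above is easy to verify numerically. The following sketch is an illustration under assumptions (inputs imposed as node voltages rather than edge voltage drops, a hypothetical five-node fully connected network, and the convention $P = \frac{1}{2}\sum_i k_i \Delta V_i^2$): it rescales all conductances by κ and compares the resulting voltages and free power.

```python
import numpy as np
from itertools import combinations

def relax(incidence, k, fixed_nodes, fixed_values):
    """Free-state node voltages for fixed input voltages."""
    n_nodes = incidence.shape[1]
    fixed_nodes = np.asarray(fixed_nodes)
    fixed_values = np.asarray(fixed_values, dtype=float)
    laplacian = incidence.T @ (k[:, None] * incidence)
    free = np.setdiff1d(np.arange(n_nodes), fixed_nodes)
    v = np.zeros(n_nodes)
    v[fixed_nodes] = fixed_values
    rhs = -laplacian[np.ix_(free, fixed_nodes)] @ fixed_values
    v[free] = np.linalg.solve(laplacian[np.ix_(free, free)], rhs)
    return v

def power(incidence, k, v):
    """Dissipated power, using the assumed 1/2 prefactor."""
    return 0.5 * np.sum(k * (incidence @ v) ** 2)

edges = list(combinations(range(5), 2))
incidence = np.zeros((len(edges), 5))
for row, (a, b) in enumerate(edges):
    incidence[row, a], incidence[row, b] = 1.0, -1.0

rng = np.random.default_rng(0)
k = rng.uniform(0.5, 1.5, size=len(edges))
in_nodes, in_values = [0, 1], [0.0, 1.0]

kappa = 1e-2  # overall conductance scale
v_ref = relax(incidence, k, in_nodes, in_values)
v_scaled = relax(incidence, kappa * k, in_nodes, in_values)
p_ref = power(incidence, k, v_ref)
p_scaled = power(incidence, kappa * k, v_scaled)

# Voltages are unchanged, while the power scales linearly with kappa.
print(np.max(np.abs(v_ref - v_scaled)), p_scaled / p_ref)
```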
The above-mentioned considerations suggest that initializing the conductance values $k$ (learning DOF) at lower values may yield solutions with lower trained free power. To verify these ideas, we trained N = 64 node networks [Fig. 1(a)] for multiple regression tasks with two inputs and two outputs. The error for these regression tasks, denoted $L$, is given by the mean squared differences between the desired and the obtained target voltage drops in the training set (Appendix B presents details on the simulated resistor networks and regression tasks). We initialized the conductance values uniformly with different conductance scales in the range $10^{-3} \leq \kappa \leq 10^{1}$. We note that in these simulations, the minimum conductance for any given edge is $k_{\min} = 10^{-3}$, and the maximum conductance is $k_{\max} = 10^{1}$. Learning modifications that attempt to push the conductances out of this range are not performed. The learning rate is chosen such that α = 0.1 at κ = 1, a value which typically results in a relatively quick and stable learning performance for these networks and tasks. We find that as long as the learning rate is chosen to be slow enough, the learning dynamics are well behaved and our results scale as expected with the learning rate.
We set the units for these simulations such that the conductance scale [k] is defined as k = 1 inside the allowed range of our conductors, and the voltage scale [V] corresponds to the typical highest input values chosen for our regression tasks, V = 1 (shown in Appendix B). The networks are trained for $10^6$ learning iterations of Eq. (2). Each such iteration encompasses an epoch, i.e., the time taken for the network to observe and respond to all training examples, similar to full-batch gradient descent. [48] Our units of time are scaled by the learning rate, $[\tau_{\mathrm{epoch}}] = \alpha^{-1}$. With these definitions, the units for the free power are given by $[P] = [k][V]^2$. The training energy is given by the free power, integrated over training time, which has units of $[E] = [k][V]^2\alpha^{-1}$. As expected, we find that coupled learning reduces the error by many orders of magnitude [Fig. 1(b)]. We also find that when the learning rate is scaled appropriately ($\alpha \propto \kappa^{0.5}$), the scaled training time $T$ (number of epochs taken for the system to reach a certain error threshold $L = 10^{-4}$, scaled by the learning rate) does not change much for relatively high initialization κ [Fig. 1(c)]. However, initialization close to the lower boundary $k_{\min}$ induces a linear increase in the scaled training time, scaling as $\kappa^{-1}$. This increase in the training time is reasonable as a large part of the training modification $\Delta k_i$ is not performed because it would require the conductances to go below the minimum.
More importantly, at lower initialization scales κ, physical learning finds lower trained free power solutions $P^F(k^*)$ [Figs. 1(d) and 1(e); note that the colored dots in panel (d) mark the trained free power]. These results clearly show the benefit of initializing the conductances of edges close to their minimal values in terms of learning power-efficient solutions. While so far, we referred to the power necessary to actuate the solution (i.e., the free power $P^F$), one often needs to consider the total energy required to train the network to adopt this solution, the training energy $E$. In some applications, this training energy is small compared to the total energy spent on using the system throughout its life cycle. However, when this is not the case, one should consider learning algorithms that reduce the required training energy and the free power. In our simulations, the training energy $E$ can be measured as the integral over time of the free power during training, until the error reaches a certain tolerable level (e.g., $L = 10^{-4}$) at time $T$,
$$E = \int_0^T P^F(t)\,dt.$$
We find that the training energy $E$ scales linearly with the initialization at high κ [Fig. 1(f)], similar to the trained free power. However, lowering κ close to $k_{\min}$ actually increases the training energy. This is because we can no longer realize gains in the free power [the saturating region shown in Fig. 1(e)], while the training time increases linearly with decreasing κ [Fig. 1(c)]. As a result, the training energy in this regime increases with decreasing κ [Fig. 1(f)]. Thus, there is an optimal value for the initialization κ corresponding to the minimum training energy. We note that in this regime, as the training energy is proportional to the training time, it is inversely proportional to the learning rate so that increasing the learning rate can reduce the energy $E$. However, this improvement only persists for low enough learning rates, when the learning process is consistent and well behaved. We leave the study of the power efficiency of fast-learning systems for future work.
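The training-energy measurement described above amounts to integrating a recorded free-power trace up to the first time the error drops below the tolerance. A minimal sketch follows; the function name, the trapezoidal quadrature, and the synthetic traces are assumptions made for illustration and are not taken from the simulations reported here.

```python
import numpy as np

def training_energy(free_power, errors, dt, tolerance):
    """Integrate the free power until the error first reaches the tolerance.

    free_power, errors: traces sampled every dt (in the chosen time units).
    Returns (energy, training_time).
    """
    free_power = np.asarray(free_power, dtype=float)
    errors = np.asarray(errors, dtype=float)
    reached = np.flatnonzero(errors <= tolerance)
    if reached.size == 0:
        raise ValueError("error tolerance was never reached")
    stop = int(reached[0])
    segment = free_power[: stop + 1]
    # Trapezoidal rule over the samples up to and including the stopping time.
    energy = dt * np.sum(0.5 * (segment[1:] + segment[:-1]))
    return float(energy), stop * dt

# Synthetic example: constant power, error falling by 10x per sample.
power_trace = [2.0] * 6
error_trace = [1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001]
energy, t_train = training_energy(power_trace, error_trace, dt=0.5, tolerance=5e-4)
print(energy, t_train)
```

With a constant power the energy is simply the power times the training time, which makes the trade-off discussed above explicit: lowering the free power helps only as long as the training time does not grow faster.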
In machine learning, however, the greatest energy cost is incurred during inference. In our case, this cost is quantified by the trained free power $P^F(k^*)$. We note that training reduces the free power for high κ but increases it for low κ next to the lower conductance limit [Fig. 1(d)]. This is sensible because for low initial conductances at or near the minimum, the network must increase some edge conductances in order to decrease its error. That being said, we conclude that initializing the network with properly low conductance values can save significant energy both during learning and when using the trained network.

III. EXPLICIT POWER MINIMIZATION
We have seen that the learning rule in Eq. (2) modifies the free power during learning. Our next step is to find a way to explicitly control the learning process to produce solutions with lower trained free power. This is possible because the learning rule in Eq. (2) is already written in terms of the free power. It is natural to modify this learning rule to locally minimize the free power and the error. Consider the addition of an explicit free power minimization term to the contrast,
$$C_\lambda \equiv C + \lambda P^F,$$
where λ is a tunable parameter, termed the power minimization amplitude, which dictates the importance of free power minimization. The learning rule, the partial derivative of the contrast, then becomes
$$\dot{k} = -\alpha\,\partial_k C_\lambda = -\alpha\,\partial_k\left[P^C - (1 - \lambda)P^F\right]. \tag{9}$$
Note that as the free power can be partitioned as a sum over the network edges, the power minimizing rule is still local and physically realizable. This modified learning rule tends to decrease the free power $P^F$ as the modified learning dynamics lower the free power and the contrast in Eq. (1). If we set λ = 1, the free power cancels out and we recover the directed aging learning rule [49,50] that solely tends to reduce the clamped power.
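For a resistor network, the power-minimizing rule can be written per edge in terms of the free and clamped voltage drops alone. The sketch below is a hedged illustration: it assumes the convention $P = \frac{1}{2}\sum_i k_i \Delta V_i^2$ and an unnormalized contrast $P^C - P^F$, and the function names and sample voltage drops are hypothetical.

```python
import numpy as np

def contrast_gradient(dv_free, dv_clamped, lam=0.0):
    """Per-edge partial derivative of P^C - (1 - lam) * P^F with respect to k."""
    dv_free = np.asarray(dv_free, dtype=float)
    dv_clamped = np.asarray(dv_clamped, dtype=float)
    return 0.5 * (dv_clamped**2 - (1.0 - lam) * dv_free**2)

def conductance_update(k, dv_free, dv_clamped, lam, alpha):
    """Local update: each edge uses only its own two voltage drops."""
    return np.asarray(k, dtype=float) - alpha * contrast_gradient(dv_free, dv_clamped, lam)

dv_free = np.array([0.5, -0.2])      # free-state drops on two edges
dv_clamped = np.array([0.51, -0.19])  # clamped-state drops on the same edges

g_plain = contrast_gradient(dv_free, dv_clamped, lam=0.0)   # coupled learning
g_power = contrast_gradient(dv_free, dv_clamped, lam=0.1)   # with power term
g_aging = contrast_gradient(dv_free, dv_clamped, lam=1.0)   # clamped power only
print(g_plain, g_power, g_aging)
```

The extra term $\frac{1}{2}\lambda(\Delta V^F_i)^2$ is never negative, so a positive λ always biases each edge toward lower conductance, and at λ = 1 only the clamped drops enter the update.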
Using the modified learning rule [Eq. (9)], one can derive ODEs for the contrast and free power, similar to Eq. (3),
$$\dot{C} = -\alpha\left[(\partial_k C)^2 + \lambda\,\partial_k C \cdot \partial_k P^F\right], \qquad \dot{P}^F = -\alpha\left[\lambda\,(\partial_k P^F)^2 + \partial_k P^F \cdot \partial_k C\right].$$
These dynamics tend to reduce the value of the contrast $C$ over time, up to interference from a term that encodes the alignment between the gradient of the contrast and the free power. Moreover, we find that the free power $P^F$ tends to be reduced by these dynamics, again up to an effect determined by the alignment. We now discuss the dynamics of the contrast and free power in a simplified linear setting. First, note that in the limit λ → ∞, the learning rule solely minimizes the free power. We denote this free power minimum as $k^*_\infty$. Around this local minimum, the free power can be expanded to quadratic order,
$$P^F(k) \approx P^F(k^*_\infty) + \frac{1}{2}(k - k^*_\infty)^T H\,(k - k^*_\infty),$$
where $H$ is the free power Hessian with respect to learning degrees of freedom. We can similarly expand the contrast in series around the λ = 0 learning solution $k^*_0$ (the unmodified solution discussed earlier),
$$C(k) \approx \frac{1}{2}(k - k^*_0)^T \tilde{H}\,(k - k^*_0),$$
with $\tilde{H}$ the contrast Hessian. We can now discuss the dynamics of the learning degrees of freedom, $\dot{k} = -\partial_k C_\lambda$. Taking the partial derivative of Eq. (11), we find a first order ODE for $k$,
$$\dot{k} = -\tilde{H}(k - k^*_0) - \lambda H (k - k^*_\infty) = -(\tilde{H} + \lambda H)(k - k^*_\lambda),$$
whose solution is exponential,
$$k(t) = k^*_\lambda + e^{-(\tilde{H} + \lambda H)t}\,(k_0 - k^*_\lambda).$$
Starting from an initial condition $k(t = 0) \equiv k_0$, the learning DOF exponentially decay to $k^*_\lambda$. Let us discuss the learning DOF solution $k^*_\lambda$. It is clear that when no power minimization is applied, $k^*_\lambda = k^*_0$. If both Hessians $\tilde{H}$, $H$ are full rank (and invertible), the λ parameter would smoothly interpolate between $k^*_0$ and $k^*_\infty$. However, we know that the Hessian of the contrast in overparameterized learning machines is low-rank (with the number of non-zero eigenvalues equal to the number of training tasks, see Appendix A for details). [43] This means that the contrast Hessian $\tilde{H}$ is not invertible and has vanishing eigenvalues. In the eigen-directions of these vanishing eigenvalues, the power minimization is dominant for any finite value of λ. Thus, the power minimization term introduces a singular perturbation so that for infinitesimal power minimization amplitude $\lambda = 0^+$, the learning solution approaches a limit denoted $k^*_{0^+}$. The solution $k^*_{0^+}$ tends to minimize the free power, while keeping the contrast low. For over-parameterized learning in the λ → 0 limit,
$$k^*_\lambda = (\tilde{H} + \lambda H)^{-1}\left(\tilde{H}\,k^*_{0^+} + \lambda H\,k^*_\infty\right).$$
The solution $k^*_\lambda$ is then a weighted average of the limiting solutions $k^*_{0^+}$, $k^*_\infty$, weighted by the Hessian matrices $\tilde{H}$, $\lambda H$. For weak power optimization (λ ≪ 1),
$$k^*_\lambda = k^*_{0^+} + \lambda s, \qquad s \equiv (\tilde{H} + \lambda H)^{-1} H\,(k^*_\infty - k^*_{0^+}).$$
For λ ≪ 1, note that the vector $s$ is nearly constant as the inverse matrix is dominated by $\tilde{H}$. This means the solution shift $\lambda s$ is approximately linear in the optimization parameter λ (see Appendix A). Let us further denote $\Delta k_0 = k_0 - k^*_\lambda$ and introduce a time propagator $U_\lambda(t) \equiv e^{-(\tilde{H} + \lambda H)t}$. The solution for $k$ can be plugged into the expansions above to express the dynamics of the contrast and free power. For both the contrast and free power, we keep the largest nonvanishing contribution at long times due to the modified learning dynamics [Eq. (16)]. These are our key results: (1) the error induced by power minimization scales with $\lambda^2$, while (2) the free power reduction compared to $P^F(k^*_{0^+})$ scales linearly with λ. In other words, the free power difference, i.e., the difference between the free power with and without λ-minimization, scales linearly with this minimization amplitude.
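The two scaling results can be checked with a short calculation. The following is a sketch under stated assumptions rather than a reproduction of the original displayed equations: it writes $\tilde{H}$ for the contrast Hessian and $H$ for the free power Hessian (the accent is an assumed notation), takes both as symmetric, uses the quadratic expansions of the contrast and the free power, assumes the contrast and its gradient vanish at $k^*_{0^+}$, and writes the weak-λ solution as $k^*_\lambda = k^*_{0^+} + \lambda s$.

```latex
\begin{align}
k^*_\lambda - k^*_{0^+} &= \lambda s, \qquad
  s = (\tilde{H} + \lambda H)^{-1} H \,(k^*_\infty - k^*_{0^+}), \\
C(k^*_\lambda) &\approx \tfrac{1}{2}\,(\lambda s)^T \tilde{H}\,(\lambda s)
  = \tfrac{\lambda^2}{2}\, s^T \tilde{H}\, s, \\
P^F(k^*_\lambda) - P^F(k^*_{0^+}) &\approx \lambda\, s^T \nabla_k P^F(k^*_{0^+})
  = -\lambda\, s^T H \,(k^*_\infty - k^*_{0^+}) \le 0 .
\end{align}
```

The last step uses $\nabla_k P^F(k^*_{0^+}) = H\,(k^*_{0^+} - k^*_\infty)$ from the quadratic expansion of the free power, and the sign follows because $s^T H (k^*_\infty - k^*_{0^+})$ is a quadratic form in the positive-definite matrix $(\tilde{H} + \lambda H)^{-1}$. With $s$ nearly constant for small λ, the contrast grows as $\lambda^2$ while the free power is lowered linearly in λ.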
Our argument considers the error and trained free power of solutions at infinite training time, but a practical learning scenario ends after some finite training time t = τ. This training time must be large compared to the natural scale of the contrast Hessian $\tilde{H}$ to allow learning to occur. However, in the weak power minimization limit, this time can be much smaller than the power minimization timescale, $\tilde{H} \gg \tau^{-1} \gg \lambda H$, with $H$ the free power Hessian. In this case, the dynamics can be approximated by a fast decay toward the unmodified solution $k^*_0$, followed by a slow decay from $k^*_0$ to the power minimizing solution $k^*_\lambda$. For small λ, the learned solution at a finite time τ moves away from $k^*_0$ at a rate linearly proportional to λ. This solution can be used to estimate both the contrast and the free power at time τ. As seen before, we find that the contrast scales as $\lambda^2$, while the free power is reduced in proportion to λ and the elapsed time, $\Delta P^F_\lambda \sim \lambda\tau$. Overall, these considerations suggest that under λ-modified dynamics, a trade-off emerges between the error and trained free power. This intuition is verified in the numerical simulations shown in Fig. 2. We train a 64 node resistor network, initialized with intermediate conductance values $k^0_i = 1$, for a regression task as before. Here, the training proceeds with the modified power minimization learning rule [Eq. (9)], varying λ in the range $10^{-10} \leq \lambda \leq 10^{-2}$. We train these networks for $\tau = 10^5$ steps and then measure the trained error and free power, averaging the results over 50 realizations of the network and regression tasks. We find that for small λ, the error $L$ and free power reduction $\Delta P^F_\lambda$ scale as predicted by Eq. (16) [Fig. 2(a)]. As before, we also compute the training energy $E$. We plot the trained free power $P^F(k^*_\lambda)$ and the training energy as a function of the minimization amplitude λ in Fig. 2(b). Both of these are markedly decreased when λ is increased, showing the predicted trade-off between power efficiency and error. In these settings, choosing the optimization parameter $\lambda = 10^{-5}$ allows us to maintain a reduction of five orders of magnitude in error, while reducing the free state power and the total energy required for training by a factor ∼10 compared to standard learning (λ = 0).

A. Experimental results
So far, we argued on theoretical grounds that error can be traded off for power efficiency by employing the learning rule in Eq. (9) and verified these arguments in simulations. Here, we verify the existence of the trade-off in laboratory experiments. [39] However, in this new implementation of the experiment, transistors replace the digital potentiometers in the role of variable resistors. [41] Unlike in the previous work, [37] this system is also able to learn according to the continuous coupled learning rule [Eq. (2)] as each resistance element is set by a charged capacitor on the gate of the transistor instead of by a discrete counter. Modifications to the learning rule of the form in Eq. (9) are achieved by varying the measurement amplification from the free and clamped networks. In addition, unlike previous implementations, this new network operates continuously in time, with the clamped state value updated automatically via an electronic feedback loop, and so, training duration is measured in real time rather than training steps. Because of unavoidable noise in the experiment, η → 0 is unobtainable. As the clamped state approaches the free state, their difference becomes more and more difficult to measure. Therefore, we use a finite value η = 0.22 for these experiments with an effective learning rate of $\alpha = 1/(24\ \mathrm{ms})$. The experiments lasted 20 seconds each, and the network's resistances had completely settled at the end of each run. The network is a 4 × 4 square lattice of edges [inset in Fig. 3(c)] with periodic boundary conditions; the edges are initialized with uniform conductance in the approximate middle of their range at the start of each experiment.
The network was trained for 150 two-source, two-target node allostery tasks, wherein the sources were held at the low and high ends of the allowable range (0 and ∼0.45 V, respectively), with the two desired target outputs at either 20% and 80% or at 10% and 90% of this range, respectively. Across these experiments, λ was varied to seven values ranging from 0 to 0.055. In all cases, the network was able to lower the error, as shown for typical error vs training time curves in Fig. 3(a). For these tasks, the network also consistently lowered the free power, as shown for the complementary power curves over training time in Fig. 3(b). Consistent with theoretical predictions, error and trained power increased and decreased, respectively, with increasing λ, with their trade-off shown in Fig. 3(c). White diamonds correspond to the mean error and the trained free power of all the experiments performed with the same value of λ.
To study this trade-off seen in the experiment, we simulated N = 16 node resistor networks constructed similarly to the experimental network [the inset in Fig. 3(d)]. These networks are simulated for similar two-source, two-target node allostery tasks (see Appendix B). We added a Gaussian white noise term to Eq. (9) with scale $\delta = 10^{-3}\ \mathrm{V}^2$ to approximate the noisy conditions of experimental learning. The white noise term leads to an error floor $L \sim 10^{-6}\ \mathrm{V}^2$. A time step in our simulations is equivalent to one experimental learning step (∼0.1 ms), while we can set the simulated conductance and voltage scales to match the experiment as well (conductance scale $[k] = 10^{-3}\ \Omega^{-1}$ and voltage scale [V] = 1 V). The results for error and free power with λ in the range $10^{-6}$ to $10^{-3}$, averaged over 50 realizations of the network and tasks, are shown in Fig. 3(d) and qualitatively show the same trade-off between error and free power. However, we note that these simulations are not intended as faithful realistic representations of the experimental learning machine as we simulate a linear flow network and are not attempting to model the specific details regarding the noise and bias profiles of the experiment. The comparison here is only intended to show that realized experimental learning machines display a trade-off between performance and power efficiency that is qualitatively similar to the one predicted by our theory.

IV. DYNAMICAL CONTROL FOR GREATER POWER MINIMIZATION
In Sec. III, we showed how adding an explicit power minimization term in the contrast function leads to a new local learning rule that attempts to minimize both the error and the free power at the same time, leading to a trade-off between them controlled by the power minimization amplitude λ. We note that noisy inputs make it impossible to reach zero training error, and in any case, there is experimental noise in the self-learning circuits so there is, in practice, a non-zero error floor. Here, we use this insight to design a practical control scheme to dynamically modify λ during learning in order to attain a tolerable error with more power-efficient solutions. We will show how such a control scheme can yield even more power-efficient solutions compared to using a smart initialization (as in Sec. II) and constant λ (as in Sec. III).
Assume that we initialize the conductances of a resistor network at their minimal value (maximum resistance). This initialization leads to a free state $V^F(k_{\min})$ with the lowest possible free power $P^F_{\min}$. This state corresponds to the minimum power found by the power minimization dynamics with λ ≫ 1, which selects the learning degrees of freedom resulting in the lowest free power $P^F_{\min}$. As seen in Fig. 2, reducing the amplitude λ from infinity toward zero monotonically decreases the error, while increasing the trained free power $P^F(k^*_\lambda)$. Here, we consider a simple dynamical control scheme. Briefly, we set a specific error tolerance as a target, $\tilde{L}$. We measure the instantaneous error $L$ while learning using the local rule in Eq. (9). If the instantaneous error is larger than the desired tolerance, we decrease λ to promote error minimization, while if the error is smaller than the tolerance, we increase λ to emphasize power minimization. In other words, λ evolves according to the control rule of Eq. (18), with ρ setting the update timescale of λ and the parameter p controlling the rate of the control scheme (a low p value sets the first term in the parentheses close to 1 so that the λ dynamics are slow).
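One way to realize a controller with the qualitative behavior described above is sketched below. The functional form is an assumption chosen only to match that description (λ decreases when the error exceeds the tolerance, increases when it is below it, and moves slowly when p is small); it is not claimed to be the control rule of Eq. (18), and the function name, argument names, and discrete time step `dt` are hypothetical.

```python
def update_lambda(lam, error, tolerance, rho=1.0, p=0.02, dt=1.0):
    """Hypothetical multiplicative update of the power minimization amplitude.

    Assumed form: lam <- lam * (1 + (dt / rho) * ((tolerance / error)**p - 1)).
    The ratio term is close to 1 for small p, so lam then changes slowly.
    """
    if error <= 0.0 or tolerance <= 0.0:
        raise ValueError("error and tolerance must be positive")
    factor = 1.0 + (dt / rho) * ((tolerance / error) ** p - 1.0)
    # Guard against negative amplitudes when dt is larger than rho.
    return lam * max(factor, 0.0)

tolerance = 1e-3
lam_too_inaccurate = update_lambda(1.0, error=1e-2, tolerance=tolerance)
lam_accurate_enough = update_lambda(1.0, error=1e-4, tolerance=tolerance)
lam_on_target = update_lambda(1.0, error=1e-3, tolerance=tolerance)
lam_slow = update_lambda(1.0, error=1e-2, tolerance=tolerance, p=0.002)
print(lam_too_inaccurate, lam_accurate_enough, lam_on_target, lam_slow)
```

Because this assumed update is multiplicative, λ must start from a positive value; in a training loop it would be called once per learning step with the error measured at that step, and the returned λ would be used in the next application of the modified learning rule.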
To test this dynamical control scheme for learning with power minimization, we simulate the training of networks with N = 64 nodes on regression tasks as before. We initialize the conductance values at their minimum $k_{\min} = 10^{-3}$ and set α = 0.03, ρ = 1, p = 0.02. We find that the network trained with the λ dynamical control scheme quickly converges on the desired error tolerance [Fig. 4(a), full line and closed circle]. We compare these results with an "early stopping algorithm," defined as follows: we consider a learning network without power minimization (λ = 0) [the dashed line shown in Fig. 4(a)]. The network reaches the desired error tolerance $L = 10^{-3}$ after some time [marked by the open circle on the dashed line in Fig. 4(a)], which we call the "early stopping time." Note that our dynamical control scheme in Eq. (18) reaches the same error at a time given by the solid circle on the solid line. Evidently, the dynamical control scheme achieves a lower trained free power compared with early stopping [Fig. 4(b)]. Once the dynamical control scheme reaches the time indicated by the solid circle in Fig. 4(b), λ stays constant but the system now keeps training at this value of λ, finally reaching a steady state at long times. As a result, the power advantage of this scheme [gray arrow in Fig. 4(b)] improves over training time until it converges at some free power value.
FIG. 4(d), continued: training energies $R_E$ between the dynamically controlled learning and the early stopping algorithm. We find that to utilize the full benefit of low free power solutions, one needs to train the system for longer times, increasing the network training energy. All results are averaged over 50 realizations of networks and tasks.
We measure the power gain fraction $G_{P^F}$ at long training times and compare to the trained free power for the early stopping algorithm for different error tolerances [Fig. 4(c)]. The power saving fraction is measured at $\tau = 10^5$ in relation to the minimal free power produced by the network for the lowest possible conductance values $k_{\min}$. As higher error L is tolerated, the dynamical control scheme improves in comparison to the simple early stopping algorithm, saving an additional fraction of power that scales as ln L. In this case, tolerating an error level of $L = 10^{-3}$ allows us to save ∼50% of the trained free power $P^F$ required for inference. We emphasize that this improvement in power is on top of using the best conductance initialization.
However, we note that gaining the full benefit of this power reduction requires long training, possibly much longer than the early stopping time, meaning that the training energy E is higher compared to the early stopping algorithm. This consideration means that in our dynamical control scheme, there is a trade-off between the training energy and the trained free power of the solution. This is verified in Fig. 4(d), where we measure the power gain $G_{P^F}$ in terms of the total training energy ratio between the dynamical control scheme and the early stopping algorithm, $R_E = E/E_{\mathrm{EarlyStop}}$. This trade-off also depends on the error tolerance, but we find that if one is willing to spend ∼5-10 times the training energy compared to the early stopping algorithm, the network achieves most of the benefit of power reduction due to the dynamical control scheme. If the training energy is a major concern and constitutes a significant fraction of the energy expended by the network during its life, one should consider this trade-off for overall lowest power solutions. Finally, we note that our dynamical control scheme is not optimized. Choosing different parameters or another dynamical control scheme altogether may produce superior power saving at a possibly lower training energy.
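The trade-off between training energy and trained free power can be made concrete with a back-of-the-envelope estimate. The sketch below assumes that the energy spent during inference is simply the trained free power multiplied by the total inference time, so that the lifetime energy is the training energy plus that product. This bookkeeping, the function names, and the numbers are illustrative assumptions, not results from this work.

```python
def lifetime_energy(training_energy, free_power, inference_time):
    """Assumed bookkeeping: training energy plus power times inference time."""
    return training_energy + free_power * inference_time


def break_even_time(e_control, p_control, e_early, p_early):
    """Inference time beyond which the costlier-to-train, lower-power solution wins.

    Returns 0.0 if the controlled scheme is already cheaper to train.
    """
    if p_control >= p_early:
        raise ValueError("controlled solution must have lower free power")
    return max(0.0, (e_control - e_early) / (p_early - p_control))


# Hypothetical numbers in arbitrary units: five times the training energy,
# half the trained free power.
e_early, p_early = 1.0, 1.0
e_control, p_control = 5.0, 0.5
t_star = break_even_time(e_control, p_control, e_early, p_early)
print(f"break-even inference time: {t_star}")
```

For inference times longer than the break-even value, the extra training energy of the controlled scheme is repaid by its lower free power.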

V. DISCUSSION
In this work, we studied how electrical circuits can physically learn to adopt desired functions in power-efficient ways. We established that physical learning affects the free power required to actuate the circuit given input signals. This free power can be lowered by choosing better initialization schemes for the learning degrees of freedom, e.g., initializing low conductances in electronic resistor networks.
We have also introduced a modified local learning rule that attempts to minimize both the error and the free power. We showed that this learning rule indeed lowers the trained free power of the obtained learning solutions in both simulations and experiments. This learning rule weights the importance of minimizing error vs power, giving rise to a trade-off between the two. While improving power efficiency at the expense of error (performance) may seem undesirable, a very low error is typically not required and can even be infeasible in real learning situations. Therefore, one can often train learning networks for lower power solutions without much of an adverse effect (Appendix C). In our experiments, there is a natural noise floor and there is no point in striving for a lower error than the floor. For these systems, power-efficient learning rules can improve the solution power with little to no penalty in error.
Finally, we have introduced a dynamical scheme for controlling the relative importance of error and power minimization to rapidly converge on power-efficient solutions with desired error tolerance. We find that such dynamical control can lead to lower power solutions. It is likely that an optimized version of such a dynamical control scheme could further reduce both the solution power and the overall training energy. This is a subject of future study.
While we presented details of the analytical approach for the case of resistor networks, our theoretical arguments apply to other physical systems trained using coupled learning, such as mechanical spring networks (Appendix D). Neuromorphic computing often promises to improve power efficiency by embedding learning algorithms in hardware, solving a major problem in modern power-hungry computational learning algorithms. While the hardware platform discussed here, self-learning electronic circuits, does indeed improve power efficiency, our work here focuses on how to achieve power efficiency in the learning process itself. As a result, our power-efficient learning approach may be easily adaptable to other neuromorphic hardware systems that can perform self-learning to offer additional power savings compared to only using efficient hardware.

ARTICLE
pubs.aip.org/aip/aml
where $H \equiv \partial_k^2 C(k^*)$ is the "learning Hessian," i.e., the Hessian of the contrast with respect to the learning DOF evaluated at the solution. Close enough to the learning solution $k^*$, we find that despite the explicit partial differentiation in Eq. (2), the learning dynamics are equivalent to the gradient descent on the contrast. 34 Therefore, if we absorb the learning rate into the definition of the time unit, the weight dynamics are given by $\dot{k} = -H(k - k^*)$. This leads to simple exponential decaying dynamics. If we set the initial condition at $k(t = 0) \equiv k_0$, then $k(t) = k^* + e^{-Ht}(k_0 - k^*)$. Setting the time propagator operator $U(t) \equiv e^{-Ht} = U^T$, we can use this result to obtain the decaying dynamics of the contrast, $C(t) = \frac{1}{2}(k_0 - k^*)^T U H U (k_0 - k^*)$.
While these results are consistent with simple exponential decay of the learning DOF k and contrast C, one complication typically arises for over-parameterized learning. We have seen before that the learning Hessian in over-parameterized learning machines tends to be low rank (with the number of non-zero eigenvalues equal to the number of training tasks). 43 As the learning Hessian has zero eigenvalues, it is not invertible. There are no dynamics in the eigendirections of these vanishing eigenvalues, as can be explicitly seen by rotating the frame into the coordinate system that diagonalizes H. The learning dynamics are agnostic to the components of k in the large null-space of H. We can plug these results into Eq. (3) to obtain the free state power dynamics. As only U(t) depends on time, this ODE can be integrated to find that the free power exponentially saturates to a value
FIG. 5.
Learning dynamics with power minimization. (a) Error L (blue) and free state power (black) as functions of time for a case where the gradients of the error and the free power align. We see that both are reduced by learning, and the error undershoots its steady state value before relaxing back to it. (b) Error L (blue) and free state power (black) as functions of time for a case where the gradients of the error and free power do not align. Here, learning increases the free power while smoothly reducing the error to its final value.

where A is a projection matrix, projecting weight vectors into the stiff (i.e., non-null) subspace of H. Here, we see again that the power can increase or decrease during learning, depending on the alignment between the gradient of the free state power and the direction of weight dynamics.
We now discuss the modified learning dynamics that minimize both error and free power [Eq. (9)]. In the main text, we showed that these learning dynamics lead to exponentially decaying weight solutions [Eq. (12)] and associated error and free power dynamics given in Eq. (15). The dynamical trajectories given these error/power dynamics follow two different prototypes, depending on the sign of $\phi \equiv -\partial_k C^T \partial_k P^F$. For ϕ < 0 (where the contrast gradient is aligned with the free power gradient), the contrast undershoots the infinite time limit, getting arbitrarily close to C = 0 before rebounding exponentially to $C(t \to \infty)$ [Fig. 5(a)]. This scenario is common when initializing the network with high conductance values. For ϕ > 0 (i.e., anti-alignment of the contrast and free power gradients), the dynamics tend to increase the free power, and we see analytically regular dynamics, where the contrast smoothly decays exponentially to its terminal value $C(t \to \infty)$ [Fig. 5(b)]. This scenario is common in flow/resistor networks initialized at low conductance values. Figure 5(c) verifies the argument laid out in the main text that the solution shift scales as $|k^*_\lambda - k^*_{0^+}| \sim \lambda$. We also show as before that the error grows quadratically with λ. Crucially, the arguments for the error are relevant not only for the training set (regression examples used to train the network) but also for the test examples that the network had not seen previously, whose error also scales quadratically in λ. More information about the regression tasks, as well as the training and test sets, is presented in Appendix B.
Similarly to the error, the arguments about the trained free power are valid for both the training and test sets so that our modified learning dynamics reduce both of them [Fig. 5(d)].
It is also interesting to combine our results for the dependence of the free power on both the power optimization amplitude λ and the initialization scale of the learning DOF κ, as discussed in Sec. II B. We simulated the learning of regression tasks on N = 64 networks as earlier and varied λ in the range $10^{-10}$ to $10^{-2}$ and the initialization scale κ in the allowed range $10^{-3}$ to $10^{1}$. We observe the emergence of two regimes of interest, one where the power minimization is weak, λ ≪ 1, and the other where λ is large [Fig. 6(a)]. The large λ regime is simpler to understand as the learning solutions there primarily reduce free power at the expense of error. Therefore, the free power reduction $\Delta P^F_\lambda$ is essentially the reduction from the free power at initialization to the minimal free power supported by the system, which scales as κ [at $\lambda = 10^{-4}$, Fig. 6(b)]. For weak minimization λ ≪ 1, the effect of the initialization is more subtle. We find that the free power reduction scales approximately as a power law $\kappa^{0.5}$ [at $\lambda = 10^{-8}$, Fig. 6(b)]. Since here we measure the free power reduction, we find that at lower initialization scales, less power is saved by applying power minimization. However, there is still a substantial benefit in free power reduction even for good initialization. These results also help contextualize our dynamical control scheme in Sec. IV, where we show that the dynamical scheme supports additional free power saving compared to just using good initialization. We reserve the detailed study of the interplay between initialization and free power minimization for future study.
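The linearized picture used in this appendix, gradient descent on a quadratic contrast whose Hessian has a null space, can be checked numerically. The sketch below integrates that gradient flow with a forward Euler step for a hypothetical two-parameter contrast whose Hessian has one zero eigenvalue. The matrix, initial condition, step size, and helper names are illustrative choices, not taken from the simulations reported here.

```python
import math


def matvec(matrix, vector):
    """Multiply a small dense matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def gradient_flow(hessian, k0, k_star, dt=1e-3, steps=5000):
    """Forward-Euler integration of dk/dt = -H (k - k_star)."""
    k = list(k0)
    for _ in range(steps):
        grad = matvec(hessian, [a - b for a, b in zip(k, k_star)])
        k = [a - dt * g for a, g in zip(k, grad)]
    return k


def contrast(hessian, k, k_star):
    """Quadratic contrast C = 0.5 (k - k_star)^T H (k - k_star)."""
    d = [a - b for a, b in zip(k, k_star)]
    return 0.5 * sum(x * y for x, y in zip(d, matvec(hessian, d)))


def stiff_component(k):
    """Projection on (1, 1)/sqrt(2), the direction with eigenvalue 2."""
    return (k[0] + k[1]) / math.sqrt(2.0)


def null_component(k):
    """Projection on (1, -1)/sqrt(2), the direction with eigenvalue 0."""
    return (k[0] - k[1]) / math.sqrt(2.0)


# Rank-1 Hessian: one stiff direction and one null direction.
H = [[1.0, 1.0], [1.0, 1.0]]
k_star = [0.0, 0.0]
k0 = [3.0, 1.0]
k_end = gradient_flow(H, k0, k_star)

print("stiff component:", stiff_component(k0), "->", stiff_component(k_end))
print("null component: ", null_component(k0), "->", null_component(k_end))
print("contrast:       ", contrast(H, k0, k_star), "->", contrast(H, k_end, k_star))
```

The stiff component and the contrast decay toward zero, while the null-space component keeps its initial value, which is the sense in which the learning dynamics are agnostic to the null space of H.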

APPENDIX B: PHYSICAL LEARNING TASKS
Here, we describe the regression tasks explored numerically in the main text. We simulated linear resistor networks with N = 64 nodes whose structure is derived from jammed two-dimensional packings. 52 We randomly choose two edges as input edges and another two as output edges [see Fig. 1(a)].
Similarly, we can compute the loss value associated with the test set. We simulate physical learning by applying the local learning rules described in the main text, picking a nudging amplitude value $\eta = 10^{-3}$.
For better comparison to the physical experiment of Fig. 3, we also simulated a resistor network with the same geometry and the same simple allostery task as the experiment [inset of Fig. 3(d)]. In these simulations, we randomly choose two input nodes and assign to them input voltages, $V_{i1} = 0$ V and $V_{i2} = 0.45$ V. We further choose two random output nodes and train them such that when the inputs are applied, they have output voltage values $V_{o1} = 0.09$ V and $V_{o2} = 0.36$ V. In this case, we again use a mean squared error loss function and compare the output voltages at the output nodes to the desired values $V_{o1,2}$.
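The sampling of the regression tasks described in this appendix can be sketched in a few lines. The code below is an illustration under stated assumptions, not the simulation code used in this work: the entries of the desired 2 × 2 matrix are drawn from a standard normal distribution (any prefactor is not assumed), ϵ is interpreted as the standard deviation of Gaussian label noise, and the names `sample_regression_task` and `mse_loss` are hypothetical. The network response is replaced by an arbitrary callable.

```python
import random


def sample_regression_task(n_train=20, n_test=100, noise=0.0, seed=0):
    """Sample a desired 2x2 linear map with training and test examples."""
    rng = random.Random(seed)
    # Assumption: standard normal entries for the desired matrix.
    target = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(2)]

    def desired_output(x):
        # Assumption: epsilon is the std of additive Gaussian label noise.
        return [
            sum(target[o][i] * x[i] for i in range(2)) + noise * rng.gauss(0.0, 1.0)
            for o in range(2)
        ]

    # Training inputs from U(0, 1)^2; test inputs from the wider N(0, 1).
    train_x = [[rng.uniform(0.0, 1.0) for _ in range(2)] for _ in range(n_train)]
    test_x = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(n_test)]
    train = [(x, desired_output(x)) for x in train_x]
    test = [(x, desired_output(x)) for x in test_x]
    return target, train, test


def mse_loss(response, examples):
    """Mean squared error, 0.5 / n * sum of squared output deviations."""
    total = 0.0
    for x, y in examples:
        out = response(x)
        total += sum((o - t) ** 2 for o, t in zip(out, y))
    return 0.5 * total / len(examples)


target, train, test = sample_regression_task()


def exact_response(x):
    """A response that implements the desired linear map exactly."""
    return [sum(target[o][i] * x[i] for i in range(2)) for o in range(2)]


print("train loss:", mse_loss(exact_response, train))
print("test loss: ", mse_loss(exact_response, test))
```

With `noise=0.0` a response that matches the desired matrix has zero loss on both sets. Passing `noise=1e-3` gives a nonzero loss even for the exact map, which corresponds to the label-noise case studied in Appendix C.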

APPENDIX C: POWER MINIMIZATION FOR LIMITED ACCURACY TASKS
The numerical results in the main text were limited to tasks that can, in principle, be learned perfectly by the learning machine. In such cases, there exist solutions with no error, $L(k^*) = 0$, as discussed in Appendix A. There are, however, cases in which it is impossible to obtain solutions with zero error. The typical example is when the training set does not capture all the information contained in the broader data (or the test set). There are also cases where it is impossible to find solutions that nullify the error even on the training set. This can occur due to under-parameterization (too few learning degrees of freedom to learn the task), an insufficiently expressive model (e.g., a linear network cannot represent nonlinear relations), and noise in the learning process. 48
First, we will consider the case where the system ends up in a local minimum with L > 0. From the definition of coupled learning, we know that if the loss is finite, L > 0, so is the contrast, C > 0. In such a case, a minimum of the coupled learning dynamics $k^\dagger$ would have finite error and contrast values $L(k^\dagger)$, $C(k^\dagger)$. Nonetheless, we can still perform a quadratic approximation around the contrast minimum $k^\dagger$ similar to Appendix A, where the constant term $C(k^\dagger)$ is retained, $C(k) \approx C(k^\dagger) + \frac{1}{2}(k - k^\dagger)^T H (k - k^\dagger)$. Using this expansion, we can redo the derivation in Sec. III to find the steady state solution error and the trained free power when a finite power minimization amplitude λ is applied in Eq. (9). Comparing these expressions to Eq. (16), we see that the trained free power behavior stays the same. We also see that the error shift is the same, scaling as $\lambda^2$, but now there is a finite contrast floor $C(k^\dagger)$ associated with a finite error. The trade-off between error and power is still maintained, although in this case, it may be much more favorable. For a small enough λ, $\frac{1}{2}\lambda^2 s^T H s \ll C(k^\dagger)$, and so the contrast (and error) is nearly unaffected by the power minimization. As a result, we can apply a finite power minimization parameter λ, reducing the trained free power at nearly no penalty. Thus, power minimization is particularly useful for problems in which zero error solutions are not possible.
To verify these considerations, we simulated physical learning in N = 64 networks on regression and classification tasks (Fig. 7). Excess noise was added to the regression labels (outputs) in the training and test sets, sampled from a distribution $\Delta V_o \sim \sum_i \tilde{A}_{oi} \Delta V_i + \epsilon$, with $\epsilon = 10^{-3}$ (see Appendix B). The simulated networks can successfully learn these tasks, reducing the error to some finite value $L \approx 10^{-5}$ [Fig. 7(a)]. When adding a small power minimization $\lambda < 10^{-7}$, the learning trajectories are almost unchanged and the error is nearly unaffected. When λ increases, the error starts increasing beyond the error floor [Fig. 7(b)]. At the same time, we observe that the trained free power is decreased linearly at finite λ, as seen before. These results show that in noisy cases, such as those seen in physical learning experiments, free power reduction can be achieved at little expense in errors up to a certain point.
Another case where this result is particularly relevant is in classification problems, where we would like to assign discrete labels to inputs. A standard example for such tasks is the classification of Iris specimens based on the measurements of lengths of their petals and sepals. 53 Previously, we have shown that our flow/resistor networks can successfully learn to classify the Iris dataset, as well as could be expected from linear network models, in simulations 43 and experiments. 37 In discrete classification tasks, we are typically not concerned with the mean squared error but with a measure of accuracy given by a discrete choice of the label based on the network response; excellent classification is possible even at relatively high values of the mean squared error. Therefore, it may be possible to induce power optimization without a penalty in classification accuracy. To test this idea, we simulated the training of our N = 64 node networks to classify the Iris dataset (a detailed description of the training protocol can be found in Ref. 37). Training at different power minimization amplitudes λ in the range $10^{-10} < \lambda < 10^{-2}$, we find that the classification accuracy (for the training and test sets) is not affected by power minimization until $\lambda \approx 10^{-7}$ [Fig. 7(c)]. At the same time, the solution free power is significantly reduced starting at $\lambda > 10^{-8}$, showing that power gain (in this case, by a factor ∼2) is possible at little penalty in accuracy [Fig. 7(d)].
Now, we turn to another case in which tasks cannot be learned perfectly, this time due to the existence of noise. In any real physical learning machine, noise in measurement and learning DOF updates will lead to a non-zero error floor associated with physical learning. This is true even for tasks that, in the absence of noise, could be learned with no error. In such a setting, the random noise pushing the system away from the zero contrast (and error) minima implies that physical learning behaves as a high dimensional Ornstein-Uhlenbeck process. 54 At long times, the learning DOF follow a normal distribution centered around $k^*_\lambda$ in Eq. (14), with a standard deviation scaling with the white noise amplitude σ, 55 so that each component $k_{\lambda,i}(\sigma)$ fluctuates around $k^*_{\lambda,i}(\sigma = 0)$, where $\theta_{\lambda,i}$ are the eigenvalues of the matrix H + λH. In other words, the noise induces the conductances to explore a vicinity of the solution $k^*_\lambda$, whose size depends on the noise amplitude σ and on the curvature given by the eigenvalues $\theta_{\lambda,i}$. We can take this distribution of values of the learning DOF and plug it into the equation for the contrast [Eq. (11)], finding the distribution of this quantity. Similarly, this can be done for the free power reduction $\Delta P^F_\lambda$. The average contrast induced by the noise, as well as the average free power reduction, can be deduced by taking the expectation value over these distributions. Here, note that the expectation values of these normal distributions are $\langle \mathcal{N}_i(0, 1) \rangle = 0$ and $\langle \mathcal{N}_i^2(0, 1) \rangle = 1$. We find that if we know the noise scale σ, measuring the average contrast value $\langle C_0(\sigma) \rangle$ allows us to glean information about the effective average curvature of the contrast near the learning solution. Note that the learning DOF diffuse freely in the space of zero contrast solutions, so the effective curvature is associated with the typical slopes of the contrast leaving the zero manifold. Overall, we see that the free power reduction is, on average, the same as in the case with no noise (up to second order terms in λ). However, the contrast now has a finite added term due to the exploration of values of the learning DOF beyond the minimum $k^*_\lambda$. This means that additive white noise has a similar effect to the finite contrast floor discussed earlier; finite power minimization λ can reduce the trained free power while having nearly no effect on the contrast (or error) up to a certain scale.
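The statement that noise makes the learning DOF explore a neighborhood whose size is set by the noise amplitude and the local curvature can be illustrated with a one-dimensional Ornstein-Uhlenbeck process. The sketch below uses the textbook convention dx = -θ x dt + σ dW, for which the stationary standard deviation is σ/√(2θ). This convention, the Euler-Maruyama discretization, and all parameter values are assumptions made for illustration; the normalization of the noise in this work may differ.

```python
import math
import random


def simulate_ou(theta, sigma, dt=0.01, steps=400_000, burn_in=5_000, seed=1):
    """Euler-Maruyama simulation of dx = -theta * x * dt + sigma * dW.

    Returns the sample mean and standard deviation after a burn-in period.
    """
    rng = random.Random(seed)
    noise_scale = sigma * math.sqrt(dt)
    x = 0.0
    total = 0.0
    total_sq = 0.0
    count = 0
    for step in range(steps):
        x += -theta * x * dt + noise_scale * rng.gauss(0.0, 1.0)
        if step >= burn_in:
            total += x
            total_sq += x * x
            count += 1
    mean = total / count
    std = math.sqrt(total_sq / count - mean ** 2)
    return mean, std


theta, sigma = 1.0, 0.1
mean, std = simulate_ou(theta, sigma)
expected_std = sigma / math.sqrt(2.0 * theta)
print(f"sample mean = {mean:.4f}")
print(f"sample std  = {std:.4f}, stationary prediction = {expected_std:.4f}")
```

A stiffer direction (larger θ) confines the fluctuations more tightly, while a larger σ widens them, in line with the qualitative picture described above.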

APPENDIX D: POWER MINIMIZATION IN MECHANICAL SPRING NETWORKS
In this work, we presented general arguments on how local learning rules could balance minimizing the error and trained free power of obtained physical learning solutions, giving rise to a trade-off between the two. However, in the main text, we only tested these ideas numerically and experimentally in resistor networks. Here, we show in simulations that these arguments apply similarly to physical learning systems governed by different physics, e.g., an elastic network of harmonic springs [Fig. 8(a)].
[57][58][59][60] Specifically, coupled learning can train spring networks to perform the desired tasks by modifying the spring constants or rest lengths. 34 The physical cost function naturally minimized by such networks is the elastic energy, $E = \frac{1}{2} \sum_i k_i (r_i - \ell_i)^2$, where $k_i$ is the spring constant of spring i, $\ell_i$ is its rest length, $r_i$ is the Euclidean distance between the nodes connected by the spring, and the energy is summed over all individual springs. For a spring network with adaptive spring constants, the local learning rule is expressed in terms of $r^F_i$ and $r^C_i$, the distances between the nodes separated by spring i in the free and clamped states, respectively. More details on the derivation of this learning rule can be found in Ref. 34. To see if spring networks can be trained to adopt low free energy solutions, i.e., spring configurations for which the desired state is easy (takes little energy) to actuate, we add a local free energy minimization term with amplitude λ, similarly to Eq. (9). We simulate this modified learning algorithm on an unstrained spring network with N = 27 nodes, as shown in Fig. 8(a). These networks are trained for allostery tasks, in which we apply prescribed relative strains of 0.2 (randomly choosing contraction or extension) and desire particular strain values at another two random bonds (0.05 or 0.03, randomly choosing contraction or extension). With no energy minimization applied (λ = 0), coupled learning generally succeeds in training these networks to a numerical normalized error floor of $L \sim 10^{-8}$. As we increase the power minimization amplitude λ, we observe that the error increases as $\lambda^2$ and the trained free energy $E^F$ is reduced as λ [Fig. 8(b)], as predicted by Eq. (16) and observed in simulations of resistor networks. These results show that our approach to physical learning of power-efficient solutions can be employed beyond linear resistor networks. Recent experimental progress has been achieved for implementing coupled learning in elastic networks, 42 but we leave the experimental validation of energy reduction in such networks for future study.
FIG. 8. Energy-efficient learning in mechanical spring networks. (a) A mechanical spring network, each edge corresponding to a spring with adaptive stiffness k. Such networks are trained for allostery tasks so that prescribed strains at input edges (red) lead to desired strains at output edges (blue). (b) Error L (blue) and free energy reduction $\Delta E^F_\lambda$ (black) as functions of the power minimization amplitude λ. As seen for flow networks, including a power minimization term in the local learning rule leads to a trade-off between error and trained free energy, also having the same scaling behaviors. The results are averaged over five realizations of networks and tasks.
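A local update of this kind can be sketched for harmonic springs. The code below is a hedged illustration, not the learning rule as printed in this work or in Ref. 34: it assumes that the contrastive term is the difference between the free and clamped elastic energies of each spring, differentiated with respect to its stiffness and divided by the nudging amplitude η, and that the energy minimization term is λ times the derivative of the free-state energy with respect to the stiffness at fixed node positions. The function names and default parameter values are hypothetical.

```python
def elastic_energy(stiffness, lengths, rest_lengths):
    """Harmonic elastic energy, 0.5 * sum_i k_i * (r_i - l_i)^2."""
    return 0.5 * sum(
        k * (r - l) ** 2 for k, r, l in zip(stiffness, lengths, rest_lengths)
    )


def stiffness_update(r_free, r_clamped, rest_lengths, alpha=0.03, eta=1e-3, lam=0.0):
    """Assumed local update for each spring constant.

    dk_i = alpha * ( (1/eta) * 0.5 * [(rF_i - l_i)^2 - (rC_i - l_i)^2]
                     - lam * 0.5 * (rF_i - l_i)^2 )

    Each term uses only the strain of spring i, so the rule stays local.
    """
    updates = []
    for rf, rc, l in zip(r_free, r_clamped, rest_lengths):
        contrastive = 0.5 * ((rf - l) ** 2 - (rc - l) ** 2) / eta
        energy_term = 0.5 * (rf - l) ** 2
        updates.append(alpha * (contrastive - lam * energy_term))
    return updates


# One spring stretched identically in the free and clamped states:
# the contrastive term vanishes and only the energy term acts.
print(stiffness_update([1.2], [1.2], [1.0], lam=0.0))
print(stiffness_update([1.2], [1.2], [1.0], lam=1.0))
```

With λ = 0 and identical free and clamped states the update is zero. With λ > 0 a strained spring is softened, which lowers the energy needed to actuate the free state.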

FIG. 1. Effects of varying the conductance initialization scale. (a) Simulated resistor networks with edges corresponding to variable resistors. We train networks with N = 64 nodes to perform linear regression, i.e., to simulate desired linear equations with two variables (red source edges) and two results (blue target edges); see Appendix B for details. (b) The error L as a function of training time for several conductance initialization values κ. The error is successfully reduced by the coupled learning rule by multiple orders of magnitude, regardless of the choice of the initialization scale κ. (c) Training time T (epochs taken for the error to drop to error level $L = 10^{-4}$) as a function of initialization κ. The training time remains constant when initialization is far from the bounds but grows linearly for low initialization close to $k_{\min}$. (d) Free power $P^F$ during training for different initialization κ. At the end of the training, the system finds a solution with trained power marked by the colored dots. (e) The trained power as a function of initialization κ. Decreasing the conductance initialization scale has a strong effect, reducing the trained power needed to actuate the learned solution. (f) Training energy E as a function of initialization κ. Choosing a suitable initialization (here, $\kappa \approx 2 \times 10^{-2}$) can optimize the training energy. The results are averaged over 50 realizations of networks and regression tasks.

FIG. 2. Physical learning with power minimization. (a) Error L (blue) and trained free power difference between learning with and without power minimization, $\Delta P^F_\lambda$ (black), for varying values of the power minimization amplitude λ. As λ is increased, the error of the learned solution increases quadratically but the trained free power of these solutions is linearly decreased. (b) The trained free power (black) as well as the training energy E (green) decrease with λ, underscoring a trade-off between power efficiency and error. The results are averaged over 50 realizations of networks and training tasks.

FIG. 3. Experimental results for power optimization show a trade-off between error and power. (a) Error L as a function of time in laboratory experiments with different optimization amplitude values λ. A network of adaptive nonlinear resistors can physically learn to adopt the desired function. This network learns to perform node allostery tasks, gradually minimizing the error down to a finite error floor. (b) Free power in physical learning experiments for different values of λ. As experiments are run with increasing λ, the learning process finds solutions with an increasing error but with improved trained free power. (c) Error vs the trained free power of experimentally learned solutions. Overall, we observe an error-power trade-off in this experimental learning machine. The inset shows a photograph of the experiment. (d) Error vs the trained free power of learned solutions in numerical simulations on the same network geometry and type of tasks. The trade-off between power efficiency and error is recapitulated in simulated learning resistor networks, where the units of time, conductance, and voltage are matched with the experiments.

FIG. 4. Power-efficient solutions with dynamical control. (a) Error trajectories with our λ dynamical control scheme (full line) compared to simple learning without power minimization (broken line). We see that the controlled learning rapidly converges to a desired tolerance error level of $L = 10^{-3}$. (b) Free power under dynamical control of λ vs free power without power minimization. The dynamically controlled system finds solutions that lower the free power compared to an early stopped training of the uncontrolled system at $L = 10^{-3}$ (open dot). The gray arrow signifies the saved power. (c) The power gain given our control scheme $G_{P^F}$ compared to early stopping for different levels of tolerable error L. We find that dynamical control can generate significantly more power-efficient solutions. (d) Power gain $G_{P^F}$ compared to the ratio of
FIG. 5 (continued). (c) Solution shifts $|\Delta k^*_\lambda|$ (orange) as well as training error (full blue line) and test error (dashed blue line) as a function of the power minimization amplitude λ. As λ is increased, the learned solution $k^*_\lambda$ linearly displaces from the limiting solution $k^*_{0^+}$. The error increases quadratically with λ, both for the training set and for the test set. (d) Trained free power for the training and test sets in the learned regression problem; both decrease as a function of λ. The results in panels (c) and (d) are averaged over 50 realizations of networks and tasks.

FIG. 6. Interplay of initialization and power minimization. (a) Free power reduction $\Delta P^F_\lambda$ depends on both the power minimization amplitude λ and the initialization scale κ. When using better initialization (lower κ), less power is saved by the power-efficient learning rule. (b) The free power reduction $\Delta P^F_\lambda$ as a function of the initialization scale κ. For weak minimization ($\lambda = 10^{-8}$, full line), $\Delta P^F_\lambda \sim \kappa^{0.5}$. For strong power minimization ($\lambda = 10^{-4}$, dashed line), $\Delta P^F_\lambda \sim \kappa$. The results are averaged over ten realizations of networks and tasks.
The input and output voltage drops are denoted by the vectors $\Delta V_i$ and $\Delta V_o$, respectively. The network is trained to perform regression recovering a linear relation, $\Delta V_o = \sum_i \tilde{A}_{oi} \Delta V_i + \epsilon$, where the 2 × 2 matrix $\tilde{A}_{oi}$ contains the desired function parameters and ϵ is a possible addition of white noise. Since we train a linear resistor network, the functional relation between the input and output voltage drops is always linear, $\Delta V_o = \sum_i A_{oi} \Delta V_i$, and the correct matrix relation $\tilde{A}_{oi}$ is supposed to be recovered by learning. The values for the desired matrix were randomly chosen from the distribution $1\,\mathcal{N}(0, 1)^{2 \times 2}$. (B2)
We trained these networks in many realizations of geometry, choice of input/output edges, and $\tilde{A}_{oi}$. To train each realization of the problem, we sampled $n_{tr} = 20$ training examples $\Delta V_i^{\mathrm{Training}} \sim U(0, 1)^2$ and corresponding outputs $\Delta V_o^{\mathrm{Training}} = \sum_i \tilde{A}_{oi} \Delta V_i^{\mathrm{Training}} + \epsilon$. Note that the scale of the input voltage drops determines the scale of power dissipation in the free state, $P^F \sim \Delta V_i^2$. In the main text, we looked at noiseless regression problems with ϵ = 0, for which the network can find exact solutions with zero error. In Appendix C, we study a case with finite label noise $\epsilon = 10^{-3}$. The training examples are sampled randomly during training and used to define the free and clamped states in the iterative learning process. Apart from the $n_{tr} = 20$ training examples, we also sampled $n_{te} = 100$ test examples from a wider distribution $V_i^{\mathrm{Test}} \sim \mathcal{N}(0, 1)$ and their associated desired outputs. The test points are not used during the learning process but help in verifying that the network can generalize. In our work, the test set is also interesting for showing the power-efficient property of the solutions that generalizes beyond the training set [Fig. 5(d)]. Once the training set is established, we can measure the error by simulating the output response of the network given input values, $\Delta V_o(\Delta V_i^{\mathrm{Training}})$, and comparing it to the sample desired output values $\Delta V_o^{\mathrm{Training}}$ by using a mean squared error loss function, $L = 0.5\, n_{tr}^{-1} \sum [\Delta V_o(\Delta V_i^{\mathrm{Training}}) - \Delta V_o^{\mathrm{Training}}]^2$.

FIG. 7. Power reduction with little error/accuracy loss in regression and classification problems. (a) Error vs time trajectories for a regression task with label noise (with minimum possible error $L \approx 10^{-5}$) for different values of the power minimization amplitude λ. As long as $\lambda < 10^{-7}$, the error is largely unaffected. (b) Error (blue) and free power reduction (black) as a function of λ. The trained free power is still reduced by increasing λ, as before, even in the range where the error is unaffected. The regression results are averaged over 50 realizations. (c) Classification accuracy of the Iris dataset for the training (full line) and test (dashed line) sets, as a function of λ. Similar results are obtained, as increasing the power minimization amplitude λ decreases accuracy (i.e., increases error), but only beyond a finite value of λ. (d) Power gain $\Delta G_{P^F}$ due to power minimization for the training (full line) and test (dashed line) sets. In this case, we gain a factor of 2 in trained free power with little loss in accuracy (at $\lambda \approx 10^{-7}$). The classification results are averaged over 100 realizations.
where $H$ is the contrast Hessian with respect to learning degrees of freedom at λ = 0 [in over-parameterized networks, the constant term $C(k^*_0)$ vanishes, see Appendix A]. If the learning solution at finite λ, $k^*_\lambda$, is close to the limiting solutions $k^*_\infty$ and $k^*_0$, we can express the new contrast approximately as