Reservoir computing was achieved by constructing a network of virtual nodes multiplexed in time and sharing a single silicon beam exhibiting a classical Duffing non-linearity as the source of nonlinearity. The delay-coupled electromechanical system performed well on time series classification tasks, with error rates below 0.1% for the 1st, 2nd, and 3rd order parity benchmarks and an accuracy of $(78\xb12)%$ for the TI-46 spoken word recognition benchmark. As a first demonstration of reservoir computing using a non-linear mass-spring system in MEMS, this result paves the way to the creation of a new class of compact devices combining the functions of sensing and computing.

## I. INTRODUCTION

The discovery of faster numerical methods to adjust the parameters of artificial neural networks (aka training) has led in the last decade to a resurgence of interest in using these networks to implement complex functions, which are constructed from a finite (albeit large) training set of examples, and which can exhibit impressive generalization capabilities when applied to inputs which were not part of the training set.^{1} Recurrent neural networks (RNN) are especially efficient at modeling time dependent data,^{2,3} as they feed information from the “top” parts of the network at a given iteration to “lower” parts of the network at the next iteration. They are universal computers (in the sense described in Ref. 4) but are considered difficult to train.^{5} Under certain conditions, RNN form a so-called reservoir computer (RC), in which case the weights of the recurrent network are initialized randomly and are left untrained, while the weights of a simple output layer are adjusted to train the network for a desired output.^{6}

The concept of RC has led to interesting numerical applications^{7–9} but, more importantly, it has been the trigger for a variety of hardware implementations of computing systems with functionalities similar to those of artificial neural networks. In these hardware implementations, the dynamics of a physical system are (often) left untrained and provide memory and non-linear computing capabilities to a simple, trainable output system. Hardware RC results have been reported for optical systems,^{10–12} mechanical devices,^{13,14} memristor arrays,^{15} and spintronic devices,^{16} for instance.

Hardware RC can be interesting as computing systems if they provide significant gains over conventional computers in terms of speed^{17,18} or energy efficiency.^{6,19,20} In addition, RC implemented in micro-mechanical devices, in particular, could serve the dual purpose of sensing and of computing for various force stimuli,^{21} to create new classes of devices in distributed sensing or in control applications where the device dimensions or energy requirements are a limiting factor. As a step toward the realization of such devices combining mechanical sensing with neural-like computing, we describe in this paper a micro-fabricated silicon beam, with non-linear dynamics which can be harnessed to perform non-trivial computing tasks.

The dynamics of our oscillating beam, which exhibit a classical Duffing non-linearity (Sec. II), are coupled to a feedback mechanism in a scheme^{22} that has been used to create hardware RC implementations from various relatively simple non-linear systems.^{16,17,23,24} The resulting MEMS RC is trained (Sec. III) to process streams of bits and to classify spoken words, to demonstrate a small, energy-efficient computing device, which encodes information in the mechanical domain and therefore has the potential to function as both a sensor and a computer (as discussed in Sec. IV).

## II. METHODS

### A. The non-linear node

The non-linear expansion of input data into a higher dimensional state space is a necessary property of RC.^{25} In this experiment, it is achieved using the non-linear dynamics of a single clamped-clamped silicon beam at large oscillation amplitudes. Figure 1 shows a typical device, which incorporates two mechanically coupled beams, although this study focuses on a single one of them (highlighted in red). The other beam is not actuated and the coupled system will be the subject of future work on a hybrid reservoir architecture.^{26}

The device is micro-fabricated on a (100) p-doped silicon on insulator (SOI) substrate with a sacrificial oxide thickness of 1.5 $\mu $m. Both device and handle layers have a resistivity of (0.003 $\xb1$ 0.002)$\Omega $m. The device layer thickness of 10 $\mu $m defines the width of the beam (normal to its displacement), while its length and in-plane thickness were chosen to be 500 $\mu $m and 4 $\mu $m, respectively, in order to satisfy previously identified requirements.^{21} These requirements include a relatively low quality factor ($\u223c$100) so that the beam “forgets” about previous inputs rapidly enough, a high natural frequency ($fn>105$ Hz) so that the characteristic time of the oscillator is short enough to allow a reasonable processing speed (as discussed in Sec. IV), and a sufficiently non-linear behavior, insuring rich reservoir dynamics. This last prerequisite demands for a high value of $\beta $ ($\u223c1023$ Hz$2$ m$\u22122$), the coefficient controlling the amount of non-linearity in the restoring force for a Duffing oscillator whose displacement is described by

where $F(t)$ is the force per unit mass driving the beam, $\omega 0=2\pi f0$ is the natural angular frequency of the oscillator in its linear regime, and $Q$ is the quality factor of the MEMS. A value of $f0$=228 kHz was obtained for the beam studied here, and its quality factor, measured in the time domain, is 100 $\xb1$ 3, independent of the oscillation amplitude.

Oscillations of the beam are excited in the plane of the substrate by a 300 $\mu $m long drive electrode (yellow in Fig. 1) placed 6 $\mu $m away from the beam. For the electrostatic actuation scheme employed here, the force acting on the beam is proportional to the square of the driving voltage. Vibrations of the beam are thus solicited at twice the sinusoidal drive frequency $fd$, which allows one to filter out the feedthrough signal stemming from parasitic capacitive coupling between the drive electrode and the measurement electrode.

Reproducing a strategy that has proven successful,^{27} identical strain gage (1.2 $\mu $m large by 12 $\mu $m long) pairs are also patterned on the device to allow for the piezoresistive transduction of the mechanical vibrations, measured at the mid-voltage point between the two gages (blue pad in Fig. 1). The gages have a nominal resistance of $Rg=50$ $\Omega $ and are biased at $\xb12.5$ V (green traces in Fig. 1). When displaced, the beam acts as a lever and causes axial tension ($Rg\u2192Rg+\Delta R$) in one gage and compression ($Rg\u2192Rg\u2212\Delta R$) in the other such that the combined effect on the readout voltage $Vo$ is additive:

where $VB=2.5$ V is the DC bias voltage.

Electrical connections to the device are established by wirebonding the chip to a printed circuit board, with the substrate grounded. The reason for grounding the handle layer is twofold: it negates the possibility of a charged substrate exerting a static downward force on the beam, modifying its dynamics, and it mitigates feedthrough by removing the main feedthrough pathway (constituted by the two large series capacitors established around the oxide layer by the drive electrode, handle layer, and sense pad). The piezoresistive displacement signal being of small amplitude ($<100\mu $V) and riding on top of a relatively large feedthrough signal, its proper digitization requires amplification with a gain of 80 dB, bandpass filtering with a center frequency of $fh>fd$ to attenuate the feedthrough component and reduce the noise contribution, and finally envelope detection to extract the oscillation amplitude.

Figure 2 shows the amplitude of the piezoresistive displacement signal $Vo$ (before its amplification) as a function of the drive amplitude $A\u2032$, for the oscillator driven at its natural frequency $fn$. For low drive amplitudes, the device behaves as a damped harmonic oscillator and the response is essentially contained in the component at twice the drive frequency. As the amplitude is increased, a component at four times the drive frequency begins to dominate, which indicates the onset of nonlinear effects. The response of the system to a drive frequency and amplitude sweep, filtered around this fourth harmonic, is shown in Fig. 3. It features a discontinuity for drive amplitudes larger than 45 V, which coincides with the amplitude at which the response at $4fd$ starts to dominate over the component at $2fd$ (see Fig. 2). The frequency at which the discontinuity appears increases as the drive amplitude is increased, which suggests that the beam indeed behaves as a stiffening Duffing oscillator. The red dashed line in Fig. 3 verifies the Duffing nature of the nonlinearity by modeling the location of the discontinuity (jump down) with^{28}

where $xmax$ is the maximum displacement of the beam (right before the jump) which, for a linear transduction scheme, is proportional to the measured signal, and $f0$ is the natural frequency of the oscillator in its linear regime. A value of $\beta =1.9\xd71023$ Hz$2$ m$\u22122$ and a transduction coefficient of 90 $\mu $V/$\mu $m reproduce the data well and are coherent with values obtained by finite element analysis and analytical methods for the dimensions of the beam.

Note that Fig. 3 was constructed by sweeping the drive signal from low to high frequency, increasing the drive amplitude between each sweep. Decreasing the drive voltage or amplitude during the sweeps changes the location of the discontinuity. This results in the hysteresis loops shown in Fig. 4, constructed by fixing the drive amplitude (frequency) and sweeping up and down the drive frequency (amplitude).

### B. The reservoir

For RC to be realized on a single physical node, virtual nodes need to be created in order to generate a rich enough reservoir state, which can then be collapsed to a usable output by a simple linear combination of the individual node states from the reservoir state. This is done through the increasingly common delay-based approach,^{22} where virtual nodes are created by time-domain multiplexing: the input signal $u(t)$ is sampled and held for a time $\tau $ while a random binary mask of length $N$ is applied at a rate $\theta \u22121$ ($\tau =N\theta $) (see Figs. 5 and 6). The mask defines the input weights and keeps the system from saturating by regularly switching between two levels. For both tasks investigated in Sec. III, good results are obtained with mask values randomly chosen to be 0.45 or 0.70, with the input signal scaled such that $u(t)$ is limited to the range [0.60, 0.75]. Since coupling between adjacent virtual nodes is done entirely in the time domain, the reservoir interconnection sparsity can be tuned to some extent by varying the virtual node length $\theta $: decreasing the ratio $\theta /T$, where $T=2Q/\omega 0$ is the decay time of the oscillator, makes the behavior of a given virtual node more dependent on the response of previous nodes since the signal is given less time to settle. For our experiments, best results were obtained for $\theta \u223cT$ (see Sec. IV). The masked signal is then used to modulate the amplitude $A$ of the drive at frequency $fd$. After amplification with a gain of 26 dB to bring the drive amplitude from $A$ to $A\u2032$, the resulting amplitude modulated waveform is used to drive the silicon beam into a non-linear state (see Figs. 3 and 4). The $N$ virtual node states are then acquired by measuring the oscillation amplitude $x(t)$ at the end of each interval of length $\theta $. This is done by bandpass filtering the amplified readout signal (gain of 80 dB) around $fh=4fd$ and sampling its envelope at a rate $\theta \u22121$ with a resolution of 16 bits. The result is a vector $X$ of length $N$ containing the output layer state at a given timestep of the input function. This readout, scaled by a factor $\alpha $, is then added to the masked input for the next timestep, closing the feedback loop. With a proper choice of $\alpha $, this feedback loop ensures that the RC has the fading memory^{29} property, which means that perturbations propagate in the reservoir for a finite time and eventually vanish. The oscillator is thus driven by a voltage signal of the form:

Finally, the readout vector is used to form a scalar output

with $w$ being a vector of weights. In order to obtain a desired output $y\u2032(t)$ for a series of inputs $u(t)$, the weights are adjusted during a supervised training period, which consists of minimizing the mean squared error between $y$ and $y\u2032$ by applying a ridge regression with Tikhonov regularization:

where $y\u2032$ and $X$ are the target vector and the data matrix, obtained by concatenating the discrete entries of $y(t)$ and $x(t)$, respectively. The regularization parameter $\lambda $ prevents overfitting, thus increasing robustness. A value of $\lambda \u223c10\u22121$ V$2$ improves the stability in our system.

## III. RESULTS

### A. Parity benchmark

The performance of the system was first assessed by means of the parity benchmark. For this benchmark, a random binary sequence $u(t)$, with values drawn from $\u22121,1$, is fed to the network at a rate $\tau \u22121$, following the feedback system described in Sec. II B. The $nth$-order parity function with delay,

where $\delta \u2208Z\u2217$ is the delay, is then computed by the reservoir computer. Equation (7) shows that this task requires the reservoir to store information about a nonlinear transform of previous inputs, except for the case $n=1$, which is only a delay line (and the identity for $\delta =0$). For $n>1$, the parity function is not linear separable^{30} and depends on the input at times $t\u2212(i+\delta )\tau $, while the readout is a linear transform of the reservoir output. This benchmark thus tests the system’s memory capacity as well as its nonlinear processing capability. To verify that a non-linear behavior of the physical node is necessary to implement the parity function, we ran a series of numerical simulations for this task with $\beta =0$. For the linear case ($n=1$), the simulations could achieve near perfect success rates, even for large values of $\delta $. However, as soon as non-linear computations were introduced ($n\u22652$), we could not find any parameter set allowing even the slightest success.

Figure 7 shows the system’s prediction overlaid on the target [Eq. (7)] for $n=2,3,4,5$ and $\delta =0$. Unless otherwise noted, all results for this benchmark were obtained by setting the acquisition parameters to $N=400$, $\theta =0.1$ ms, $\tau =40$ ms, $A\u2032=82$ V, $fd=115.1$ kHz, and $\alpha =0.5$. The success rate, displayed beside each trace of Fig. 7, represents the fraction of correct bits, computed by comparing the signs of the prediction and of the target during a test phase with 1000 input bits. Training is done offline after acquiring the reservoir states for the first 3000 timesteps. While thresholding the prediction for $P4$ still gives a near bit-perfect reproduction of the target, the raw signal becomes more noisy. The prediction effectively becomes more uncertain as the difficulty and memory requirement of the task is increased from $P1$ to $P6$.

The mutual information,

with $pn,\delta $ being the success probability for the $\delta $-delayed $nth$-order parity function, can be used to estimate the memory capacity:^{30}

Figure 8 shows the mutual information obtained for $n=1,2,3,4$ as a function of the delay $\delta $. In all cases, the mutual information becomes negligible for $\delta >8$ such that the sum of Eq. (9) can be truncated at $\delta =8$, yielding a linear memory capacity of (5.6 $\xb1$ 0.2) bits ($P1$), and non-linear memory capacities of (4.4 $\xb1$ 0.2), (3.2 $\xb1$ 0.2), and (2.1 $\xb1$ 0.2) bits for $P2$, $P3$, and $P4$, respectively, which is comparable to results for other types of reservoirs.^{21,30,31}

Figure 9 shows the memory capacity as a function of the number of virtual nodes $N$, for a fixed value of $\theta =0.1$ ms ($\tau =N\theta $ thus increases with $N$). With an increase in the number of virtual nodes, the readout function [Eq. (5)] has richer dynamics at its disposal for the reconstruction of the target such that a monotonic increase in memory capacity could be expected.^{30} On the other hand, the fixed training set size of 3000 samples may not be sufficient for the larger reservoirs (the required training set length should scale up with $N$^{32}), which could explain the saturation or even slight decline of the memory capacity observed for larger $N$. Indeed, a training set of 3000 examples was deemed sufficient (RC performance did not significantly increase past this point) for the $N=400$ reservoir studied in the rest of this paper, but this has not been studied for larger reservoirs. Nevertheless, the decrease in memory capacity for more difficult tasks can be compensated up to a certain point by increasing the reservoir size. The curves presented in Fig. 9 show lower memory capacities than curves in Fig. 8 due to a less thorough parameter optimization. Detuning the system shifts these curves vertically but does not affect the general trend mentioned above.

As shown in Fig. 10, the memory capacity is also highly correlated to the feedback strength $\alpha $. It drops rapidly for low feedback levels as the system has no way of storing information about previous inputs, while too much feedback also causes a decline in memory capacity. In the latter case, the reservoir does not exhibit the required fading memory^{33} property and becomes unstable for $\alpha \u22650.7$, affecting the results.

The success rate in the drive frequency–drive amplitude plane, interpolated from a 10 by 16 points grid search, is shown for $P2$ to $P5$ in Fig. 11. It can be seen that the area of significant success ($>$50% since a random guess will in average produce a success rate of 50%) is limited to a rather specific region of the parameter space that shrinks as the complexity of the task is increased from $P2$ to $P5$. Since the drive amplitude and the drive frequency (along with the feedback gain) essentially determine the level of non-linearity in our system, this shows that even though RCs, and more generally RNNs, can operate with a multitude of different non-linear nodes, fine tuning of the non-linearity can be necessary, especially for more demanding tasks.

Although Fig. 11 shows that at 115.1 kHz a drive amplitude of 82 V produces good results, this sinusoidal drive is modulated by the masked input such that the drive signal explores a certain range of voltages during normal operation. For this experiment, it was found that the reservoir performed well when the mask levels were adjusted such that the drive signal envelope spanned voltage values strictly larger than 49 V. From Fig. 4, it can be seen that the drive signal thus visits a non-linear part of the transfer function but never explores the lower branch of the hysteresis loop. Fully spanning the hysteresis loop likely causes the response to be chaotic, and various studies show that a RC operates best at the *edge of chaos*,^{34} in the ordered phase.

### B. Spoken word classification

The reservoir computer’s ability to classify more complex signals was assessed using a spoken word classification task. For this task, a subset consisting of the numbers zero to nine spoken by 8 different females and 8 different males is drawn from the TI-46 corpus.^{35} Utterances are then randomly selected from this subset and processed by the RC. Compared to usual approaches,^{7} the preprocessing applied here is intentionally rudimentary to anticipate implementation in a simple system where sound pressure would be directly coupled to the oscillator. It involves only the normalization of each waveform by its standard deviation, low-pass filtering, and sampling of the envelope at twice the low-pass corner frequency of 30 Hz. The output layer state for a given utterance is obtained by integrating the consecutive readouts for each individual virtual node during the whole utterance. During the training phase, a vector of weights is computed for each of the ten digits in the vocabulary. Each classifier is trained to output a value of 1 when presented with an utterance of the digit corresponding to its class, and $\u22121$ otherwise. Prediction during the test period is then carried out by choosing the digit with the highest classifier output value.

Figure 12 shows the confusion matrix for this task. The prediction probabilities were obtained with a ten-fold cross-validation: 5000 utterances were processed by the RC and divided into subgroups of 500. Each of these subgroups was used exactly once for testing, with the rest of the data used for training the RC each time. Correct classification is achieved in (78 $\xb1$ 2)% of occurrences, with digits such as {6,8} being correctly classified more often and others such as {1,3,4} being harder to discriminate. While much lower error rates are common in the literature,^{10,16,17,36} they are invariably associated with a preprocessing of the utterances that often involves some sort of spectral analysis. Our intentional lack of an elaborate preprocessing scheme certainly calls for a more thorough parameter optimization in order to obtain lower error rates.

## IV. DISCUSSION

While the use of delay systems can simplify physical reservoir implementation of RC, it has the drawback of lowering the computation rate. The equivalent $N$-node traditional reservoir can process the same input signal at $N$ times the rate of the delay-coupled reservoir because of its parallel processing configuration.

For the device studied here, low error rates are achieved for $\theta \u22650.1$ ms, which is comparable to the oscillator characteristic time $T=2Q/\omega 0=0.13$ ms. This differs from the optimal value of $\theta \u223cT/5$ presented elsewhere.^{22} While longer values of $\theta $ tend to slightly increase the success rate, they also reduce the processing rate. For $\theta =0.1$ ms and $N=200$, which give adequate performance in our system, the sample clock of the input signal cycles at a rate $\tau \u22121=N\u22121\theta \u22121=50$ Hz. While this only allows for a much slower classification rate than the $105$ to $106$ words per second rates^{17,18} achieved with photonic reservoirs, it is sufficient to classify spoken words in real time, as only the low frequency components are necessary for this task.

The device demonstrated here occupies a relatively large surface ($\u223c$0.1 mm$2$), but it is a non-issue considering a single beam is used, and the additional space and power required for the external circuitry for the actuation, readout, and feedback loop is more concerning. Gains from size reduction instead impact the processing speed: by reducing the beam dimensions, its natural frequency increases while the quality factor can be kept relatively low by inducing additional viscous damping, resulting in an increase in clock rate since $\tau \u221d\theta \u223cT\u221dQ/\omega 0$. The same piezoresistive transduction method could be applied^{27} to a nanomechanical oscillator resonating at hundreds of MHz, allowing gains of many orders of magnitude in processing speed to be made.

Although fast high resolution ADCs and DACs are nowadays relatively cheap and accessible, a purely analog implementation of our system would have its appeals, such as being less demanding on the control circuitry, or even enabling the possibility of directly coupling different external inputs such as pressure or acceleration to the input layer of the reservoir. While both pre-processing and post-processing (given that the vector of weights is known) could easily be carried out entirely in the analog domain (which has been done for a photonic system^{37}), the long delay required for the feedback loop remains nearly impossible to implement in analog electronics. Other means, such as a mechanical delay line, could eventually be employed to bypass this bottleneck.

Adjusting the system parameters can be a challenging task due to the high number of possible parameter combinations. Our approach of scanning the parameter space thus suffered from poor resolution and limited range. Nevertheless, once tuned the RC was found to be quite robust. For instance, the drive frequency can be incremented much more finely than the $\u223c$100 Hz span insuring good performance for the $P4$ task (see Fig. 11). Thus, fluctuations of device properties such as its natural frequency $fn$ and quality factor $Q$, which can happen over time or between nominally identical silicon beams due to fabrication tolerances, can be accounted for by adjusting the operating point as to negate their adverse effect on RC performance. The bottom panel of Fig. 6 shows that the measured signal contains white noise. This is essentially due to the thermal noise generated by the piezoresistive gages in a measurement bandwidth of 80 kHz, which limits the SNR to $\u223c$ 30 dB. Yet, the results obtained here for two non-trivial tasks seem to indicate that the system is robust to this level of noise and thus well suited for its projected implementation as a smart sensor.

## V. CONCLUSION

The Duffing non-linearity of a silicon beam, in conjunction with a delay-coupled reservoir architecture, allowed for the first experimental demonstration of a MEMS reservoir computer. The system is capable of predicting time series and classifying spoken words with relatively good accuracy, reliably predicting the parity of up to three consecutive input bits (the $P1$ to $P3$ tasks) with an error rate under 0.1%, and correctly classifying isolated spoken digits with an accuracy of (78 $\xb1$ 2)%. While performance for the parity function exceeds numerically obtained results,^{21} state-of-the-art performance for the spoken digit classification task would require a more thorough parameter optimization or a refined preprocessing scheme.

The system’s memory capacity could be tuned to a certain extent by adjusting the feedback strength $\alpha $ (Fig. 10), although it could not be stabilized for delays longer than $8\tau $. Adding a second feedback loop with delay $2\tau $ and strength $\alpha \u2032<\alpha $ would likely improve the memory capacity.^{38,39} An unsynchronized approach,^{10} where a given node is fed back the response of a different node from the previous time step, could also potentially increase the performance of the RC. Other parameters could benefit from a more principled optimization.

This experiment has demonstrated that a MEMS oscillator can be used to construct a system capable of time series prediction and spoken word classification. It is a first step to conceiving computing substrates capable of also sensing their environment. More in-depth study of the rich non-linear dynamics of our oscillators will be the subject of a future communication.

## Acknowledgments

The authors thank the Ministère de l’économie, de la science et de l’innovation du Québec and the Natural Sciences and Engineering Research Council of Canada for financial support.

## References

*Proceedings of the International Conference on Machine Learning (ICML)*(Omnipress, 2011), pp. 1017–1024.

*Proceedings of the 13th European Symposium on Artificial Neural Networks (ESANN)*(D-Side Publishing, 2005), pp. 435–440.

*2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)*(IEEE, 2015), pp. 4486–4491.