We describe the continuous-time dynamics of networks implemented on Field Programable Gate Arrays (FPGAs). The networks can perform Boolean operations when the FPGA is in the clocked (digital) mode; however, we run the programed FPGA in the unclocked (analog) mode. Our motivation is to use these FPGA networks as ultrafast machine-learning processors, using the technique of reservoir computing. We study both the undriven dynamics and the input response of these networks as we vary network design parameters, and we relate the dynamics to accuracy on two machine-learning tasks.
Reservoir computing1,2 is an approach to machine learning that, with the use of appropriate hardware, has the potential to perform significantly faster than software machine-learning algorithms such as deep learning. The technique uses a driven dynamical system that nonlinearly processes inputs, and a postprocessing output layer that is trained with data for the desired input-output task. Training fits a postprocessing function of the dynamical-system outputs to the desired outputs in the training data. This approach does not require a mathematical model, nor continuously adjustable parameters, for a hardware dynamical system. We describe a method to program network dynamics on Field Programable Gate Array (FPGA) hardware and use these networks for machine learning. These networks use random sparse connectivity, similar to many software implementations of reservoir computing. Unlike software networks, the form of the nonlinearities is determined by the hardware. We report the results of experiments in which we characterize and show how to modify the FPGA network dynamics. We also give two examples of machine-learning applications, and compare the accuracy of different networks to their dynamical characteristics. Our goal is to provide a dynamical-system foundation for the design of hardware-accelerated machine-learning algorithms.
I. INTRODUCTION
Today, many tasks are being performed by machine-learning models that learn from exposure to past data how to interpret and respond to future events. A prominent example of a machine-learning model is the artificial neural network, in which an input signal is passed between nodes in a network; each node performs a simple nonlinear transformation before passing the signal to others. The network as a whole can perform very complex pattern recognition tasks by adapting weights between pairs of nodes. These information processing networks are used for many applications, including computer vision,3 speech recognition,4 social network filtering, and surpassing human performance in playing board games.5
Artificial neural networks are inspired in their design by biological neural networks, which outperform machine-learning models in many ways. Biological networks are more flexible in their ability to learn, adapt, and switch between tasks, and also use much less operating power than comparable-size software networks. Unlike artificial neural networks, which have mostly feed-forward or simple recurrent structures, biological networks have very complex recurrent topologies, and how they learn and operate is not fully understood. In biological networks, each neuron operates as an analog unit at a rate set by its own internal timing, while artificial models operate, similar to conventional computing, with a global clock synchronizing the operation of the different elements to perform digital operations.
This article describes the construction and study of an artificial information-processing network that is both analog and recurrent. The network is made of conventional CMOS transistors and wires that are interconnected to process information as a network of analog units. We consider these networks both as dynamical systems and as machine-learning tools.
The topology chosen for the network is a random directed graph. Networks on random directed graphs are commonly used as mathematical models to study many real-world phenomena such as social networks,6 animal flocking,7 neural networks1 (more details below), and genetic evolution.8 In the genetic evolution model by Kauffman,8 a random Boolean network is used to describe a network of genes. The network is a random graph in which each node is a binary unit that represents one gene that can be either on or off. Each gene is also associated with a logic function and $K$ inputs from other genes. In a discrete-time numerical model, the gene states are updated synchronously. When $K = 2$ for all genes, Kauffman found that the system cycles between different states with fixed periods. Derrida and Pomeau9 showed by an approximate model that $K = 2$ is a critical value of $K$, with stable fixed states when $K < 2$ and unstable states when $K > 2$. Later work on asynchronously updated discrete-time models10 showed that stability can be lost when dynamics and updates are without a fixed clock.
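To make the update rule concrete, the following is a minimal sketch (our own illustration in Python, not code from the works cited) of a Kauffman-style random Boolean network with $K = 2$ inputs per gene, updated synchronously until a state repeats and a cycle is detected.

```python
# Minimal sketch of a synchronously updated random Boolean network (K = 2).
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 2
inputs = rng.integers(0, N, size=(N, K))        # each gene reads K random genes
tables = rng.integers(0, 2, size=(N, 2 ** K))   # a random truth table per gene
state = rng.integers(0, 2, size=N)

def step(state):
    # Index each gene's truth table by the binary word formed by its inputs.
    idx = state[inputs[:, 0]] * 2 + state[inputs[:, 1]]
    return tables[np.arange(N), idx]

# Detect the attractor period by storing visited states.
seen = {}
for t in range(10_000):
    key = state.tobytes()
    if key in seen:
        print(f"entered a cycle of period {t - seen[key]} at step {t}")
        break
    seen[key] = t
    state = step(state)
```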
A. Reservoir computing
The machine-learning results we report are based on the reservoir computing model. A reservoir computer (also known as an echo state network or liquid state machine)1,2,11–13 is a type of recurrent artificial neural network that includes a set of neurons interconnected by a random directed graph. Unlike feed-forward neural networks, in which the state of the nodes is static for each input entry, the reservoir computer is a dynamical system that is driven by the input signal. Software models of reservoir computers have been applied to many real-world tasks such as image classification,14 financial time series,15 robotics,16,17 communications,18,19 speech recognition,20 and medical applications.21,22 The dynamical properties of the model have also made it an intriguing candidate for the prediction of chaotic time series, such as Lorenz,23,24 Mackey-Glass,18 and Kuramoto-Sivashinsky.24,25
The reservoir computing model (see Fig. 1) is composed of an input layer, the random directed graph of neurons, called the reservoir, and typically a linear output layer used for regression tasks. In training, the output weights connecting the reservoir neurons to the output neurons are set by minimizing a cost function via a process of linear inversion. The dynamics of a reservoir computer modeled in software can be described as1

$$\mathbf{x}(n+1) = f\big(W\mathbf{x}(n) + W^{\mathrm{in}}\mathbf{u}(n+1)\big). \tag{1}$$

Here, $\mathbf{x}(n)$ is the state vector of the reservoir nodes at discrete time $n$, and $\mathbf{u}(n)$ is the input vector at time $n$. The nonlinear activation function of the nodes is $f$ (often chosen to be $\tanh$). The connectivity matrix $W$ has real values. The matrix entry $W_{ij}$ is the strength and sign of the connection within the reservoir from node $j$ to node $i$. Similarly, $W^{\mathrm{in}}_{ij}$ contains the input weights from input $j$ to node $i$. An additional feedback connection can be represented by a matrix $W^{\mathrm{back}}$ that couples the output back to the reservoir state (see Jaeger1).
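As a software point of reference for the hardware networks described later, the following is a minimal sketch of the update equation (1); the sizes, sparsity, and weight scaling are arbitrary choices for illustration.

```python
# Minimal echo-state-network sketch of Eq. (1), assuming tanh activation.
import numpy as np

rng = np.random.default_rng(1)
n_nodes, n_inputs = 200, 3
W = rng.normal(scale=0.1, size=(n_nodes, n_nodes))      # reservoir weights
W *= rng.random((n_nodes, n_nodes)) < 0.05              # sparse random graph
W_in = rng.uniform(-1, 1, size=(n_nodes, n_inputs))     # input weights

def reservoir_step(x, u):
    # x(n+1) = f(W x(n) + W_in u(n+1)), with f = tanh
    return np.tanh(W @ x + W_in @ u)

x = np.zeros(n_nodes)
for u in rng.uniform(-1, 1, size=(50, n_inputs)):       # drive with random input
    x = reservoir_step(x, u)
```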
Reservoir computing model. An input layer, a recurrent network with an underlying random directed graph, and a trained output layer.
The success of the software models motivated researchers to construct hardware reservoir computers, in which the internal reservoir is made of physical electronic26 or optical27,28 nodes. These hardware reservoirs operate orders of magnitude faster than software models. Unlike deep learning models of artificial neural networks, where the weights on all edges between all nodes in a network are typically modified in training, in reservoir computing, the weights within the reservoir are randomly initialized and are not further modified; only the output matrix is obtained in training. This feature simplifies the physical construction of the reservoir computer, as it removes the need to adjust physical elements within the reservoir during training.
B. Analog-gate networks
Analog-gate networks are a physical approximation to Boolean networks. These networks are built of logic circuits that are interconnected by wires. They are analog in the sense that the gates process their incoming signals and produce outputs independently, without a system clock synchronizing their actions. The analog networks have a great advantage over simulated models: they operate in parallel at fast time scales (gate delays on the order of 100 ps), are easily scalable, and are low cost. Each node in the analog-gate network can be a logic circuit that continuously processes input signals from other nodes. Unlike software models of Boolean networks, the system is unclocked and evolves continuously in time. The timing of signal transfer is determined by the delays of the logic circuits and the interconnecting wires. The output voltage of each node is a continuous analog value approximately limited to the range of outputs of the logic circuit, which corresponds to Boolean 1 at one end and Boolean 0 at the other.
A well-known simple example of an analog Boolean network is a ring oscillator, composed of an odd number $N$ of NOT gates connected in a closed loop. The ring oscillator state does not reach a fixed point; instead, the output of any node oscillates between the two voltage levels representing Boolean 0 and 1 with frequency $f = 1/(2N\tau)$, where $\tau$ is the delay of a single gate. Ring oscillators are commonly used to measure basic properties of gate circuitry, such as the delay, the output range, the durability, and the dependence of all these properties on voltage and temperature. In contrast to the regularity of the ring oscillator's state, aperiodic complex dynamics were demonstrated by Zhang et al.29 using a network composed of two XOR gates and one XNOR gate. Depending on the delays between the gates, the network exhibits periodic or chaotic dynamics. Rosin et al.30 used the chaotic dynamics of analog Boolean networks for random-number generation. In related work,31,32 the authors demonstrated coupled oscillators with controlled coupling strength as well as large networks of nonlocally coupled oscillators that lead to chimera states.33 In these works, Rosin et al. used Field Programable Gate Arrays (FPGAs) to flexibly synthesize analog networks without manufacturing integrated circuits with the desired gate topology, but instead by programing and reprograming the dedicated physical device.
In Sec. II, we describe our implementation of analog networks on FPGAs, together with methods for providing input and measuring output. In Sec. III, we describe the dynamics we observed in these networks, both without input (Sec. III A) and in response to a brief impulse input (Sec. III B). We introduce parameters that vary the proportions of different Boolean gate types that we assign randomly to the nodes of the network, and show how we can modify the dynamics by varying these parameters. We characterize the dynamics by various metrics: their self-activity level with no input, and their transient duration and type in response to impulses. We find that a function of the parameters, which we call the mean sensitivity of the nodes, is a good predictor of the self-activity level. We also explore other relationships among the metrics. In Sec. III C, we study how these metrics correlate to machine-learning accuracy for two classification tasks: MNIST image data, in which the input to the network is static; and RF signal data, in which the input is dynamic. We find that the most accurate networks have low self-activity for both tasks; for the static-input task, networks with low sensitivity and short transient duration perform best, whereas for the dynamic-input task, moderate sensitivity and moderate transient duration work best.
II. NETWORK DESIGN
This section describes an experimental random directed network in which each node is an analog gate and the network is implemented inside an FPGA. In Sec. III, we study the fundamental behavior of such a network: its self-activity, its response to external input, and its use as a tool for machine learning.
A. Internal network
The recurrent neural network in our experiments is an interconnected set of logic devices inside an FPGA (see Fig. 2). Each node in the network is programed with a lookup table (LUT) that performs a Boolean logic function on Boolean input values and has a single output value. The output of a node can be wired to the inputs of multiple other nodes in the network and to dedicated output lines of the system. Inputs to a node come from other nodes in the network and from dedicated input lines of the system.
Schematic of the experimental design. (a) Nodes in the analog network are interconnected by a random graph and are coupled to external input lines and a reset line. (b) Within the FPGA, nodes are represented by lookup tables and interconnects serve to wire them together.
For the experiments described in this article, each node receives input from either one or two randomly selected nodes, in addition to any dedicated input lines. To describe the internal connectivity, we ignore external inputs for the remainder of this paragraph. The one-input nodes are programed to compute the identity function; their purpose is to insert varying time delays as the dynamics propagate from one two-input node to another. Two-input nodes are programed with one of the 12 logical operations such as AND, OR, or XOR. Though there are 16 possible two-input lookup tables, we exclude those that depend on one input but not the other; if we call the Boolean inputs A and B, then we exclude the four lookup tables that compute A, NOT A, B, and NOT B. However, we do allow nodes whose output is always Boolean 0 or always Boolean 1; as we will see in Sec. III A, such nodes are useful in controlling how active the network dynamics are. We remark that though the in-degree of each node is 1 or 2, the out-degree can be larger or smaller, depending on the random choices of input nodes.
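The following sketch illustrates one way such a random network specification could be generated; the node counts match the networks used later in this article, while the data structure and names are ours (and, unlike a real synthesis flow, self-connections are not excluded).

```python
# Sketch of generating a random analog-gate network specification.
import itertools, random

random.seed(0)
ALL_LUTS = list(itertools.product((0, 1), repeat=4))
# A LUT is (f(0,0), f(0,1), f(1,0), f(1,1)); exclude the four tables that
# depend on exactly one input: A, NOT A, B, NOT B.
EXCLUDED = {(0, 0, 1, 1), (1, 1, 0, 0), (0, 1, 0, 1), (1, 0, 1, 0)}
TWO_INPUT_LUTS = [t for t in ALL_LUTS if t not in EXCLUDED]  # 12 tables remain

n_nodes = 2400
nodes = []
for i in range(n_nodes):
    if i < n_nodes // 2:                      # one-input identity/delay node
        nodes.append({"inputs": [random.randrange(n_nodes)], "lut": (0, 1)})
    else:                                     # two-input logic node
        nodes.append({"inputs": random.sample(range(n_nodes), 2),
                      "lut": random.choice(TWO_INPUT_LUTS)})
```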
To physically synthesize the network, we make use of the FPGA's array of programable logic blocks and the reconfigurable interconnects that wire the different logic blocks together. The FPGA is configured by a hardware description language that specifies the logic function of each node, the wiring of the network, and the input and output connections. The internal network is unclocked, thus asynchronous, and hence electrical signals pass from one node to another at the rate that the underlying electronic components allow, which is of the order of 100 ps through a single node. Because the processing is unclocked, the wires carry voltages that do not correspond exclusively to logical 0 or 1, and the full range of intermediate voltages passes between nodes. Hence, the continuous voltage transfer functions of the logic blocks inside the FPGA determine the output of each node, and the simple Boolean logic of each node is not a complete description of its functionality. For slowly varying voltages, the nonlinear processing of the inputs is described by functions called transfer characteristics. We note that the transfer characteristic of the simple NOT gate is similar to a shifted hyperbolic tangent function. A more realistic description of the transfer function includes the effect of hysteresis. Modeling the dynamics is beyond the scope of this article, and we emphasize that equations for the voltage dynamics are not necessary for the machine-learning applications we describe.
B. Input
Without any external inputs, the network will either relax to a fixed-point steady state, in which each node outputs a constant voltage, or exhibit persistent periodic or aperiodic dynamics. The behavior depends on the selected gate types and the degree of connectivity between them. Input lines enable driving the system with external signals that are specified for each experiment. A fraction of the nodes receive an input line, carrying a Boolean value representing the input signal, and are programed to perform a Boolean exclusive or (XOR) between the node's internal logic function output and the input value. A node indexed $i$ receiving input from nodes $j$ and $k$ and from external input line $\ell$, and that performs the internal logic $L_i$, will output

$$x_i = \mathrm{XOR}\big(L_i(x_j, x_k),\, u_\ell\big), \tag{3}$$

where $u_\ell$ is the Boolean value on input line $\ell$.
Another type of input line, included for all nodes, is a reset line carrying Boolean value $r$. When $r = 1$, the outputs of all nodes were set to 1, while when $r = 0$, the reset line was ignored. This was accomplished by setting the output of each node to be an OR of $r$ together with the node's output without the reset line,

$$x_i = \mathrm{OR}(\tilde{x}_i,\, r), \tag{4}$$

where $\tilde{x}_i$ denotes the node's output without the reset line [e.g., the right-hand side of Eq. (3)].
The purpose of the reset line is to set the network to a state that is independent of its previous state, unlike the input line described in Eq. (3).
Recall that the Boolean equations (3) and (4) do not completely describe the transfer characteristics of the nodes, since they include neither the continuous voltages nor the time delays of the analog network. Instead, these equations define how the network dynamics is specified to the FPGA via lookup tables.
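The composition of a node's internal logic with the input and reset lines can be illustrated as follows; the FPGA synthesis tools fold this composition into a single lookup table, and the function below (names ours) is only a Boolean-level sketch of Eqs. (3) and (4), not of the analog behavior.

```python
# Boolean-level sketch of a node's output with input and reset lines.
def node_output(lut, x_j, x_k, u=0, r=0):
    """Output of node i with internal logic `lut` (length-4 tuple ordered
    (f(0,0), f(0,1), f(1,0), f(1,1))), external input bit u, reset bit r."""
    internal = lut[2 * x_j + x_k]      # L_i(x_j, x_k)
    with_input = internal ^ u          # Eq. (3): XOR with the input line
    return with_input | r              # Eq. (4): OR with the reset line

# With r = 1, the output is forced to 1 regardless of the other arguments:
assert all(node_output((0, 1, 1, 0), a, b, u, 1) == 1
           for a in (0, 1) for b in (0, 1) for u in (0, 1))
```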
1. Push button
Once the internal network is synthesized, the simplest way to externally perturb the network is by a push button on the development board of the FPGA. When the button is manually pressed, the Boolean value of a single input line switches from 0 to 1 and back to 0 within approximately 1 ns. To achieve this, the signal from the push button (1 while pressed, 0 otherwise) was passed in parallel to two lines A and B, with B including a sequence of 10 NOT gates that delay the signal (see Fig. 3). The lines were combined at a (NOT B) AND A two-input gate, whose output was the Boolean value of the external input line passed on to the network. This input line was connected to half of the nodes in the network; the rest of the nodes received no external input. This input scheme was performed on a Cyclone III Altera development board.
2. Digital interface
More complex and carefully timed input signals can be crafted using a digital interface. Multiple parallel digital input signals were written to the FPGA memory as a two-dimensional array. We implemented this using 1024-bit-wide words in a buffer 1024 words deep, sent at one word per clock cycle. Then, by a clocked operation, typically at 200 MHz, the parallel digital signals were passed to the nodes' inputs. This means that the input values at the input lines were set to 0 or 1 once every clock cycle, that is, once every 5 ns. This scheme was performed on a Xilinx Zynq® 7000 series chip in MicroZed 7020 and PYNQ-Z1 boards and a Xilinx Virtex 7® chip in a VC707 board.
C. Output
Due to bandwidth limitations, reading all nodes simultaneously can be difficult. Instead, two different readout methods were used: probing the analog behavior of individual nodes using an oscilloscope or storing a large subset of the digital node states in a buffer in the FPGA.
1. High bandwidth oscilloscope
The outputs of up to two nodes at a time were recorded by a Tektronix MSO4104 oscilloscope with 1 GHz bandwidth at a rate of 5 GSa/s. This method of measurement allowed relatively high temporal resolution as well as a direct recording of transitional voltages between the logical 1 and logical 0 levels. The oscilloscope was coupled to the FPGA output through the SMA or the HSMC connector on the Altera Cyclone III development board. The HSMC connector was coupled to the oscilloscope via a Terasic® Debug Header Breakout Board and a 500 MHz probe. The advantage of the SMA connection was that it did not add a bandwidth restriction beyond the 1 GHz bandwidth of the scope. The advantage of the HSMC connector was that it allowed switching the output between over 30 buffered nodes in the internal network without reprograming the board.
2. Digital output
Similar to the digital input scheme, up to 2048 nodes can be digitally read in parallel once every clock cycle (typically 200 MHz). The output values are first passed to the FPGA digital memory and then transferred to a supervisory computer’s memory for analysis. With this method of output, the sampled voltage of a node is converted by the hardware to Boolean 0 or 1.
III. METRICS AND EXPERIMENTAL RESULTS
This section describes the dynamics we observed in our FPGA networks over a range of parameters representing the distribution of logic operations (such as AND, OR, XOR) that are randomly assigned to the nodes. We introduce a quantity called the mean sensitivity of a Boolean network, which is a function of the distribution parameters. We describe the dynamics of a particular randomly chosen network in several ways:
Self-activity level: measuring how dynamic the network is in the absence of external input,
Transient type: describing the qualitative response to a brief-duration input, and
Transient time: measuring how long the perturbed network takes to return close to statistical equilibrium.
We motivate and define these metrics, together with the mean sensitivity, and show experimental results about the relationships among them. In addition, we illustrate how the metrics relate to performance on two machine-learning tasks.
A. Free-running networks
We begin with a motivating example. When all nodes in the network are configured as XOR, the network exhibits self-excited aperiodic dynamics even in the absence of input data. The output of each node fluctuates throughout the full range of voltages between those corresponding to Boolean 0 and to Boolean 1. In this setting, there are no external input lines and we observe a single output line from one node. Figure 4 shows the output of one representative node in a network of 501 XOR gates implemented on an Altera Cyclone III FPGA. Each gate has two inputs from two other gates in the network that are randomly selected before synthesis. The measured spectrum is flat over 6 orders of magnitude in frequency, akin to white noise.
Disordered dynamics of the XOR network. (a) The voltage at the output of a single node in an analog-gate network of 501 XOR gates. Different time traces are vertically shifted. (b) Power spectrum of the node output.
The disordered dynamics of the fully XOR network could potentially be useful to create noise or random numbers. However, the usefulness of such a network for the processing of external input is doubtful. Any external pattern imprinted on the network will be immediately masked by its tendency to generate white noise of its own. Of course, with a different choice of gate type instead of XOR, one could design a network at the other extreme—one that tends toward a homogeneous steady state in the absence of input and may lack the sensitivity to input needed for applications. To facilitate choosing an appropriate mixture of gate types, we next consider the input sensitivity of all possible two-input gates.
1. Gate sensitivity
The XOR gates used in the previous section, as well as the complementary XNOR gate, are the most sensitive of all two-input gates. The sensitivity $S$ is defined here as the fraction of changes in one of the digital input values that result in a change in the gate's digital output. The XOR and XNOR gates have sensitivity $S = 1$, because any change to either of their input values causes a change of output [see Figs. 5(a) and 5(b)]. The always-one gate that outputs 1 regardless of its inputs, and the similar always-zero gate, have sensitivity $S = 0$, since their outputs are independent of the inputs. Besides the XOR, XNOR, always-one, and always-zero gates, one can check that all other two-input logic functions, such as AND and OR, have $S = 0.5$. This is because only half the events of changing an input value will cause the output to change. When all nodes in the network are as highly sensitive as the XOR, a change in the state of a single node is typically followed by a growing cascade of changes to the other node states. Because each node is connected to two others, on average, the cascade grows exponentially until the number of affected nodes is comparable to the system size. On the other hand, sensitivity-0.5 gates have a 50% likelihood to ignore a change in input and thus are more likely to contain small perturbations and find a fixed or periodic state. By tuning the fractions of gates of different sensitivity levels, the overall excitability of the network can be controlled.
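The sensitivity of any two-input lookup table can be computed directly from its truth table, as in the following sketch (the helper name is ours); each edge of the square in Fig. 5 corresponds to flipping exactly one input.

```python
# Sketch: sensitivity S of a two-input lookup table.
def sensitivity(lut):
    # lut = (f(0,0), f(0,1), f(1,0), f(1,1)); each edge below flips one input.
    edges = [(0b00, 0b01), (0b00, 0b10), (0b01, 0b11), (0b10, 0b11)]
    flips = sum(lut[a] != lut[b] for a, b in edges)
    return flips / len(edges)

assert sensitivity((0, 1, 1, 0)) == 1.0   # XOR
assert sensitivity((0, 1, 1, 1)) == 0.5   # OR
assert sensitivity((1, 1, 1, 1)) == 0.0   # always-one
```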
Illustration of gate sensitivity $S$. (a) The lookup table of the XOR logic function. (b) The lookup table of XOR represented on a graph. Every vertex is a pair of input values. The color of the vertex is the output. Every change in one input is represented by an edge. For XOR, $S = 1$, since every move along an edge changes the output. (c) The OR function has $S = 0.5$, since half the moves along edges result in a change of output. (d) and (e) The concept of sensitivity can be generalized to gates with more than two inputs.
The discussion above concerns the Boolean design of the gates but appears to have related consequences for their analog behavior, as we show experimentally below.
2. Parameter space and mean sensitivity
In order to adjust the average sensitivity of the nodes, we divide the possible two-input gate types into three classes according to their sensitivity ($S = 0$, 0.5, or 1). Recall (see Sec. II A) that of the 12 possible two-input gate types with sensitivity 0.5, we exclude the four that depend on only one input, leaving eight gate types in this class. Recall also that some of our networks have one-input identity/delay nodes; the parameters we describe below and our definition of mean sensitivity are based only on the gate types of the two-input nodes.
Let $p_0$ be the fraction of two-input nodes that have sensitivity 0, let $p_{1/2}$ be the fraction with sensitivity 0.5, and let $p_1$ be the fraction with sensitivity 1. We view $p_0$, $p_{1/2}$, and $p_1$ as parameters of the network, with the constraint $p_0 + p_{1/2} + p_1 = 1$, so that there are two degrees of freedom. Since in addition the three parameters must be non-negative, the resulting parameter space is a triangular region.
Having chosen values for the parameters, we then randomly assign gate types from each class to the corresponding two-input nodes. We remark that by using multiple gate types with sensitivity 0.5, we make 0 and 1 equally likely to appear as outputs in the lookup table for a node, resulting in a statistical symmetry that we consider desirable for our applications. We have found that breaking this symmetry can yield different dynamical behaviors, including intermittent bursting, but we have not systematically explored the asymmetric case.
We define the mean sensitivity $\bar{S}$ of the resulting network to be the average of the sensitivities of the two-input nodes; as a result,

$$\bar{S} = 0 \cdot p_0 + 0.5\, p_{1/2} + 1 \cdot p_1 = 0.5\, p_{1/2} + p_1.$$
We found that for a given value of $\bar{S}$, varying the values of the parameters along the corresponding line segment in the parameter triangle had no noticeable effect on any of the quantities we measure and graph below. Thus, though we sampled from the entire triangle, we show the dependence of results only on $\bar{S}$.
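A sketch of how such networks might be parametrized: draw $(p_0, p_{1/2}, p_1)$ uniformly from the triangle and compute $\bar{S}$. The Dirichlet draw is our choice of sampling method for illustration; the article does not specify how the triangle was sampled.

```python
# Sketch: sample gate-class fractions from the triangle p0 + p_half + p1 = 1.
import numpy as np

rng = np.random.default_rng(2)

def sample_fractions():
    p = rng.dirichlet([1.0, 1.0, 1.0])    # uniform on the triangle
    return {"p0": p[0], "p_half": p[1], "p1": p[2]}

def mean_sensitivity(p):
    # S_bar = 0*p0 + 0.5*p_half + 1*p1
    return 0.5 * p["p_half"] + p["p1"]

p = sample_fractions()
print(p, mean_sensitivity(p))
```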
3. Self-activity
By self-activity, we mean the dynamics of the network without any external input. The following experiments were performed on networks implemented in a PYNQ-Z1 board with digital input and output interfaces (see Secs. II B 2 and II C 2). For the digital output method, the measured state $B_i(t)$ of node $i$ at the $t$th sample time is either 0 or 1. We define the self-activity level $A$ of the network to be the average fraction of binary node states that change per time sample,

$$A = \big\langle\, |B_i(t+1) - B_i(t)| \,\big\rangle_{i,t}.$$
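In Python, the self-activity level can be computed from a binary record of node states as in the following sketch (the array layout and names are ours):

```python
# Sketch: self-activity level A from a binary record B[i, t] of node states.
import numpy as np

def self_activity(B):
    # Fraction of node states that change between consecutive samples,
    # averaged over nodes i and sample times t.
    return np.abs(np.diff(B.astype(int), axis=1)).mean()

B = (np.random.default_rng(3).random((2048, 1024)) < 0.5).astype(int)
print(self_activity(B))   # ~0.5 for i.i.d. random states
```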
Figure 6 shows the self-activity levels $A$ we measured vs the mean sensitivity $\bar{S}$ for a variety of randomly chosen networks with 2400 nodes; half the nodes have two inputs, and half are one-input delay nodes. We computed the self-activity level by averaging the outputs of 2048 of the nodes over 1024 time samples. Mean sensitivity well below 0.5 almost always yields a very low level of self-activity, whereas $\bar{S}$ well above 0.5 almost always results in high self-activity. Values of $\bar{S}$ near 0.5 can yield a wide range of self-activity levels, depending on the random choices that determine the network.
Self-activity level $A$ of the network vs mean sensitivity $\bar{S}$. The curve represents the average of $A$ in bins of $\bar{S}$ of width 0.5.
B. External driving
External channels passing signals to the internal network have the potential to allow it to temporarily store and to nonlinearly modify information for useful processing. Previous work on cellular automata showed that optimal conditions for the support of information transmission, storage, and modification are achieved in the vicinity of a phase transition between ordered and disordered states.34 Our results from Sec. III A suggest such a phase transition as the mean sensitivity $\bar{S}$ increases past a threshold near 0.5, though the transition is not sharp for the limited network size we used.
To illustrate possible responses to inputs, we perturb two networks that have no self-activity ($A = 0$). Unlike the previous experiments, which were electrically isolated from the environment, here half the nodes are coupled to an external input line that passes a short pulse upon pressing a push button (details in Sec. II B 1). In both networks, 15 node outputs were measured before and after the pulse event. All sampled nodes were silent before the pulse, but after the pulse a fraction of them show an active transient, whose length varied from node to node (see Fig. 7). In one network, transient times were generally less than 50 ns; in the other network, many nodes were active for thousands of nanoseconds, and one node remained active for over 100 μs. The specific pattern of dynamics for each node did not repeat itself between different pulses, but the qualitative characteristics of the response and the order of magnitude of its duration were repeatable.
Transient following the external pulse. Two networks of 4096 nodes at a fixed-point steady state are perturbed by an external pulse. (a) The pulse is initiated by a push button and is passed to half the nodes. (b) In the first network, the response of the nodes lasts on the order of 10 ns. (c) and (d) In the second network, the response lasts more than 100 μs at some nodes but much shorter time scales for others; both graphs show the same data but with different horizontal scales.
The long transient response displayed by the second network shows that the system can retain a type of memory of the external pulse for a time 5 orders of magnitude longer than the typical activation time of a single node. If multiple bits of information can be stored, then the long memory allows for many cycles of nonlinear processing of the input signal and potentially can serve for tasks of pattern classification based on such a processing system.
1. Transient dynamics characterization
In this section, we describe a series of experiments that exhibit a variety of responses to impulse perturbations. Here, we focus on the collective behavior of the nodes. Using a digital input and output interface (see Secs. II B 2 and II C 2), we measured the simultaneous binary output state of 896 nodes out of a total of 1120 nodes in the network. In these experiments, half the nodes were one-input identity gates, included as delay lines to slow down the analog dynamics to better match the system’s 200 MHz clock for input and output. The other half of the gates were two-input gates.
An external input line was connected to 128 randomly picked nodes (11.4% of the nodes). The value at the external input line was always 0 except when a pulse was initiated, then it switched to 1 and returned to 0 in the following clock cycle. (We note that we repeated a sample of our experiments with longer pulses with duration up to 5 clock cycles and observed no significant difference in the lengths and types of transient responses compared to the results we report below for 1-clock-cycle pulse duration.) Each experiment started with a period of input 0 to establish a (statistical) equilibrium, followed by over a thousand input pulses, and the binary state of all 896 output nodes was recorded at 200 MHz throughout the experiment. The recorded output was used to construct a three-dimensional array $B$, defined so that $B_{p,i,t}$ is the binary state of the $i$th node at clock-cycle time $t$ relative to the time of the $p$th input pulse.
We quantify the average collective response to a pulse by

$$\mu(t) = \big\langle\, |B_{p,i,t} - \bar{B}_i| \,\big\rangle_{p,i}.$$

Here, $\bar{B}_i$ is the average state of the $i$th node at equilibrium, $|\cdot|$ is the absolute value, and $\langle\cdot\rangle_{p,i}$ is an average over the indexes $p$ and $i$. The time series $\mu(t)$ indicates how close the instantaneous network state is to its average before, during, and after a perturbation.
Another measure of the network collective response is based on the standard deviation of its response with respect to different pulses,

$$\sigma(t) = \big\langle \operatorname{std}_p(B_{p,i,t}) \big\rangle_{i}.$$

Here, $\operatorname{std}_p$ is the standard deviation of $B_{p,i,t}$ over the index $p$. The time series $\sigma(t)$ indicates how much the response varies between different pulses.
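Both measures can be computed from the array $B$ as in the following sketch; how $\bar{B}_i$ is estimated from a separate equilibrium record is our assumption.

```python
# Sketch: collective-response measures mu(t) and sigma(t) from B[p, i, t]
# (pulse index, node index, clock cycle relative to pulse).
import numpy as np

def response_measures(B, baseline):
    # baseline: a B-like array recorded at equilibrium, used to estimate
    # the per-node mean states B_bar_i.
    B_bar = baseline.mean(axis=(0, 2))                        # B_bar_i
    mu = np.abs(B - B_bar[None, :, None]).mean(axis=(0, 1))   # mu(t)
    sigma = B.std(axis=0).mean(axis=0)                        # sigma(t)
    return mu, sigma
```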
2. Types of transient response
Based on examining the time series $\mu(t)$ and $\sigma(t)$ for a variety of randomly chosen networks, we divide the network response to pulses into the following classes:
fixed state,
diverging, disordered, and
converging, disordered.
In Sec. III B 3, we describe the quantitative criteria we decided to associate with these classes; first, we give qualitative descriptions with some illustrative examples.
1. Fixed-state response
We call the network response fixed-state if the network has a nearly steady state when unperturbed by an external signal and generally returns close to this fixed state after a transient response to a pulse. When the system is at its fixed state, $\mu(t) \approx 0$. Small deviations from 0 may be due to small active subnetworks with periodic or aperiodic dynamics. When a pulse arrives at the input line at $t = 0$, the value of $\mu$ increases, indicating that a fraction of the nodes change state.
In Fig. 8, after a short transient, node states return to their fixed-state values, and $\mu$ returns to near 0. The value of $\sigma$ is close to 0 far from the pulse time, indicating that the fixed state is always the same. The slight growth in $\sigma$ following the pulse indicates that the network state does not exactly repeat following every pulse, so that there is a divergence from homogeneity of state following a pulse.
External pulse. (a) The average distance from equilibrium $\mu(t)$ and the standard deviation of response $\sigma(t)$ for a network with low self-activity. Nodes quickly return to the fixed point after a pulse. (b) The dynamics of 200 of the 896 output nodes following the first five pulses.
Figure 9 shows two other fixed-state networks where the transient is much longer. The top case is more representative of long-transient fixed-state networks, in which the network always returns to essentially the same fixed state. In the bottom case, there is a small amount of variation in the long-term response to a pulse. If the variation were larger, we would classify the response as disordered. In both cases, $\sigma$ grows significantly immediately after the pulse, indicating some lack of repeatability of the transient response. This is a feature of the system that is akin to other physical networks and does not occur in digitally simulated networks unless they are purposefully designed to do so. Despite their lack of complete repeatability, long-transient networks may be desirable for signal-processing applications.
External pulse, low self-activity, and long transient. (a) The average distance from equilibrium $\mu(t)$ and the standard deviation of response $\sigma(t)$ for a network with low self-activity and a long transient. Nodes return slowly to the fixed point after a pulse. (b) The dynamics of 200 of the 896 output nodes following the first five pulses. (c) and (d) Occasionally, the network does not return to the same fixed state for all pulses, so that $\mu$ and $\sigma$ do not return all the way to 0.
2. Diverging disordered response
We call the network response disordered if, long after a pulse, a significant fraction of the nodes remains in a varying state. Often the eventual state is statistically similar to the baseline state without an external input. In these networks, the baseline values of $\mu$ and $\sigma$ (without perturbation) are usually significantly larger than 0 (see Fig. 10). We call the response diverging if $\sigma$ rises, usually temporarily, after a pulse arrives.
External pulse, disordered diverging network. (a) The average distance from equilibrium $\mu(t)$ and the standard deviation of response $\sigma(t)$ for a disordered-state network that diverges upon impulse. (b) The dynamics of 200 of the 896 output nodes following the first five pulses.
3. Converging disordered response
We say a disordered network response is converging if $\sigma$ decreases in value, often temporarily, after a pulse arrives. This indicates a higher degree of repeatability in the short-term response than in the diverging case. As in the other cases, $\mu$ still increases following a pulse. Figure 11 shows two networks with converging disordered responses. In the bottom case, the decrease in $\sigma$ persists indefinitely. It appears that many of the nodes oscillate periodically before and long after the pulses, and that the pulses partially synchronize the phases of the oscillations.
External pulse, disordered converging network. (a) The average distance from equilibrium $\mu(t)$ and the standard deviation of response $\sigma(t)$ for a disordered-state network that converges upon impulse. (b) The dynamics of 200 of the 896 output nodes following the first five pulses. (c) and (d) A permanent decrease in $\sigma$ seems to be related to phase synchronization of periodic node states.
3. Transient classification and length
For a more systematic study of the collective transient dynamics, we used networks of the same design as in Sec. III A 3; in particular, we measured the output of 2048 out of 2400 nodes. (The increase in size from the 1120-node networks does not significantly alter the distribution of transient types and lengths, but it improves the classification accuracy we report in Sec. III C.) Impulse input was provided to 10% of the nodes, and we measured the output at clock-cycle times $t$ relative to the start of a pulse for a large number of pulses per network.
We classify the network dynamics as fixed state if the mean and standard deviation of $\mu(t)$ over $t$ from 241 to 250 are each less than a small fixed threshold. Otherwise, we classify the network dynamics as disordered. For a disordered network, we define its pulse response as converging (respectively, diverging) if the mean of $\sigma(t)$ for $t$ from 1 to 20 is less than (respectively, greater than) its mean for $t$ from 241 to 250. In all cases, we define the transient length of the network pulse response to be the first time $t$ after the pulse at which $\mu(t)$ returns to within a small threshold of its mean over $t$ from 241 to 250.
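A sketch of these criteria in Python; the threshold values are left as free parameters, since the specific values are not reproduced here.

```python
# Sketch: classify the transient response and measure its length.
import numpy as np

def classify_response(mu, sigma, eps_fixed=0.01, eps_transient=0.01):
    late = slice(241, 251)                      # t from 241 to 250
    if mu[late].mean() < eps_fixed and mu[late].std() < eps_fixed:
        kind = "fixed state"
    else:
        kind = ("converging, disordered"
                if sigma[1:21].mean() < sigma[late].mean()   # t from 1 to 20
                else "diverging, disordered")
    # Transient length: first t > 0 with mu(t) within eps of its late mean.
    target = mu[late].mean()
    hits = np.nonzero(np.abs(mu - target) < eps_transient)[0]
    transient = int(hits[hits > 0][0]) if (hits > 0).any() else len(mu)
    return kind, transient
```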
We compare the transient length with the mean sensitivity $\bar{S}$ (Fig. 12) and with the self-activity level $A$ (Fig. 13). We observe that low sensitivity and high self-activity level each generally imply a short transient length; networks with long transients are generally those with moderate sensitivity and low-to-moderate self-activity level.
Mean sensitivity $\bar{S}$ vs transient length. Transient lengths longer than 60 clock cycles were truncated to 60 in order to show detail at shorter transient lengths.
Transient length vs self-activity level $A$. Transient lengths longer than 60 clock cycles were truncated to 60 in order to show detail at shorter transient lengths.
In Fig. 12, we see that fixed-state networks predominate at low $\bar{S}$, and converging disordered networks are most common at high $\bar{S}$. Diverging disordered networks occur primarily at intermediate values of $\bar{S}$, though all three types are common in this range. In Fig. 13, we see that long-transient networks tend to be fixed-state or diverging disordered, though a few converging disordered networks have very long transients.
C. Machine learning and reservoir computing
The analog-gate network can serve as part of a machine-learning system by nonlinearly casting an input pattern into the high-dimensional space of the node outputs. It can be used as a feature-generation step prior to a machine-learning regression or classification method suitable for the task. Unlike standard digital computing of nonlinear transformations, this transformation is analog and is determined by the circuitry of the FPGA and the graph of connections. A digital data entry is presented at the input lines (see Sec. II B 2), where each line fans out to the inputs of multiple nodes (see Fig. 14). Inside the network, the wiring and the nonlinear activations of the logic devices lead to high-order nonlinear transformations of the inputs. The output of the network is digitized (see Sec. II C 2) and passed as input to a machine-learning classification model. The simplest such model is multiplication by an output weight matrix obtained via linear regression, in which case the process that includes the input, the internal network, and the linear regression is akin to the reservoir computing model.12 The utility of the analog network for transforming the input data is in its speed as well as its ease of coupling to other digital processing layers within the FPGA platform.
Data flow for MNIST image classification. From left to right: image is binarized and unrolled to a vector. The analog network is allowed to settle to a fixed steady state, then it is reset, and the image is imposed on its inputs. The transient network output is used as the input to a linear model to classify the image.
To gauge the usefulness of the network for machine learning, we evaluated the accuracy of a model that includes the analog-gate network on the well-studied MNIST handwritten digits image dataset35 and on a recent dataset of RF signals from five similar transmitters.36 We use the same network design as in Secs. III A 3 and III B 3, measuring the output from 2048 of 2400 nodes, but now we provide input to all of the nodes. For the MNIST classification, we input an entire image at one time and measure the network output at a single subsequent time; for the RF classification, we input a signal to the network as a time series and measure the output at multiple times during the input (details below).
In training, the digital outputs of the network were recorded and passed to a supervisory computer in which output weights were calculated by minimizing the cost function $C$:

$$C = \sum_k C_k + \lambda \big\| W^{\mathrm{out}} \big\|^2, \qquad C_k = \big\| W^{\mathrm{out}} \mathbf{x}_k - \mathbf{y}_k \big\|^2.$$

Here, $C_k$ is the cost of a single input-output pair indexed $k$, $W^{\mathrm{out}}$ contains the output weights of a fully connected linear layer, $\lambda$ is the regularization parameter, $x_{k,j}$ (the $j$th component of $\mathbf{x}_k$) is the $j$th measured Boolean state of the network response to input $k$, and $\mathbf{y}_k$ is the one-hot encoded label of input $k$. (By one-hot encoded, we mean that if input $k$ is labeled with class number $c$ out of $m$ classes, then $\mathbf{y}_k$ is the $c$th standard basis vector of $m$-dimensional space.) The values of the matrix $W^{\mathrm{out}}$ are optimized by linear least squares regression. In a test phase, the network output was used to infer the labels of inputs that were set aside during training, and the accuracy of the inferred labels is reported. The accuracy is defined as the fraction of classifications that are correct. We remark that in this article, we choose a linear classifier not for maximum accuracy, but rather for simplicity, to allow us to compare performance across many networks.
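This minimization has the standard ridge-regression closed form, sketched below (variable names are ours):

```python
# Sketch: ridge regression of one-hot labels onto recorded network states.
import numpy as np

def train_output_weights(X, labels, n_classes, lam=1e-3):
    # X: (n_samples, n_features) Boolean network responses.
    # Minimizes sum_k ||W_out x_k - y_k||^2 + lam * ||W_out||^2.
    Y = np.eye(n_classes)[labels]                 # one-hot encoding
    A = X.T @ X + lam * np.eye(X.shape[1])
    W_out = np.linalg.solve(A, X.T @ Y).T         # (n_classes, n_features)
    return W_out

def predict_labels(W_out, X):
    return np.argmax(X @ W_out.T, axis=1)
```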
1. MNIST dataset and preprocessing
The MNIST dataset contains 70 000 images of handwritten digits ranging from 0 to 9. Each image is a 28 × 28 pixel gray scale image. Of the entire set, 30 000 images are used in training and 10 000 are used for testing. The pixels in each image were binarized by setting a threshold at 30% of the maximal brightness level. The resulting 784 bits are passed as inputs to the network, with each bit mapped to multiple nodes. Before each image was presented, a reset pulse (see Sec. II B) of duration 3 clock cycles at 200 MHz was sent in, followed by a 5-clock-cycle delay, and then the image was presented at the input for 5 clock cycles. The output was measured at the end of the image presentation interval.
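A sketch of this preprocessing step:

```python
# Sketch: binarize a 28 x 28 image at 30% of maximal brightness and unroll it.
import numpy as np

def preprocess(image):            # image: uint8 array, shape (28, 28), 0..255
    bits = (image > 0.30 * 255).astype(np.uint8)
    return bits.reshape(-1)       # 784 input bits, one per input line
```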
Figures 15–17 show the accuracy achieved by a variety of randomly generated networks vs mean sensitivity $\bar{S}$, self-activity level $A$, and transient length, respectively. Accuracy is highest for low sensitivities, with a steep drop-off as $\bar{S}$ increases toward 0.5. Furthermore, low sensitivity seems to guarantee relatively high accuracy. Indeed, low-sensitivity networks always improve upon the accuracy (0.860) of a linear classifier alone, applied directly to the images. On the other hand, high-sensitivity networks degrade the accuracy.
Mean sensitivity $\bar{S}$ vs MNIST accuracy. The horizontal line represents the accuracy of a linear classifier applied directly to the 784-pixel gray scale images.
From Fig. 6, we know that low sensitivity implies a low self-activity level. Figure 16 not only confirms that the most accurate networks have a low self-activity level but also shows that accuracy continually decreases as $A$ increases. In contrast to mean sensitivity $\bar{S}$, low values of $A$ do not guarantee relatively high accuracy.
Though networks with low $\bar{S}$ often have transient lengths of 10-20 clock cycles (see Fig. 12), Fig. 17 shows that the most accurate networks have transient lengths less than 5 clock cycles. Thus, low-sensitivity networks (which are necessarily of fixed-state type) with very short transient times achieve the best accuracy, and all such networks perform similarly well.
Transient length vs MNIST accuracy. Transient lengths longer than 60 clock cycles were truncated to 60 in order to show detail at shorter transient lengths.
2. RF dataset
We used a wideband RF dataset36 containing 23 000 bursts representing transmitted packets from five different commercial ISM modems, four of which were from the same manufacturer. Instead of recording samples of the common packet preamble only, extended transmissions of several packets were recorded to allow for traffic-related diversity among the samples. An RF shield box was used to maintain a clean signal with a high SNR while recording with a USRP X310 SDR with an RF daughter board. While recording, the SDR used a sampling rate of 100 MHz, centered in the 2.4 GHz ISM band at 2450 MHz. In postprocessing, separate bursts were created by locating and isolating packets in the extended transmissions. Bursts usually contain both the packet preamble, which we used for classification, and the data transmission following the preamble.
For classification, we used 12 000 bursts to train the FPGA network to identify the transmitter of each burst, and 1000 bursts for testing accuracy. In preprocessing, we exclude all bursts shorter than 2000 time samples or longer than 25 000 time samples, since these appear to be either incomplete or abnormal.
For each burst, we extract 300 consecutive time samples from the packet preamble. Each sample is a complex number, and we compute the phase difference between successive samples (that is, the complex argument of their ratio). The resulting real-valued time series of phase differences is then encoded and input sequentially into the network. Each encoded phase difference is a vector of length 1024 whose entries consist of 200 consecutive ones; the remaining entries are zero; the location of the center of the block of ones represents the value of the phase difference.
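A sketch of this encoding; the linear mapping of phase difference to block position is our assumption, with endpoints chosen so that the block of ones stays within the word.

```python
# Sketch: encode a burst's phase differences as 1024-bit input words.
import numpy as np

def encode_burst(samples, width=1024, block=200):
    # samples: complex-valued array (e.g., 300 consecutive preamble samples).
    dphi = np.angle(samples[1:] * np.conj(samples[:-1]))     # in (-pi, pi]
    # Map each phase difference to the center of a block of ones.
    center = (dphi + np.pi) / (2 * np.pi) * (width - block) + block / 2
    words = np.zeros((len(dphi), width), dtype=np.uint8)
    for n, c in enumerate(center):
        start = int(round(c - block / 2))
        words[n, start:start + block] = 1                    # block of ones
    return words                                             # one word per clock cycle
```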
The network inputs one value of the phase-difference time series per clock cycle, and we measure the output of 2048 nodes during each of the last 200 clock cycles of the input. Thus, we fit a linear combination of 409 600 measured output values to the class labels.
Figure 18 shows the accuracy achieved by a variety of randomly generated networks vs mean sensitivity $\bar{S}$. In contrast to MNIST classification, accuracy is highest for moderate sensitivities, near $\bar{S} = 0.5$. Values of $\bar{S}$ in this moderate range have the highest worst-case accuracy, but the accuracy is much more variable for a given value of $\bar{S}$ than for MNIST classification. Thus, to maximize accuracy in the RF classification scenario, it is more important to test a variety of random choices of networks with appropriate values of $\bar{S}$. Nonetheless, in most cases, the accuracy exceeds the best accuracy we achieved using a linear classifier, which we applied directly to the network input.
Mean sensitivity $\bar{S}$ vs RF accuracy. The horizontal line represents the accuracy of a linear classifier applied to the same binary input signals that the networks receive. (Other choices of the input to the linear classifier, including the unprocessed RF signals, yielded lower accuracy.)
Figure 19 shows that, as for MNIST classification, accuracy decreases with increasing self-activity level $A$, but low accuracy can also occur at low values of $A$ (when the mean sensitivity is also low).
In Fig. 20, we see that the most accurate network we tested has a transient length around 5 clock cycles, but the next-most accurate networks all have transient length around 10. Furthermore, accuracy generally decreases steeply for transient lengths shorter than 10. In contrast to our MNIST classification method, in which the input is presented to the network in a single clock cycle, the sequential input in our RF classification method seems to require a certain minimum amount of memory in the network in order to perform well.
Transient length vs RF accuracy. Transient lengths longer than 60 clock cycles were truncated to 60 in order to show detail at shorter transient lengths.
We observe also that the most accurate networks (those with accuracy above 0.85) all have a fixed-state or converging-disordered transient type. We speculate that the absence of high-accuracy diverging disordered networks is due to the lesser repeatability of their input responses compared to converging disordered networks.
IV. DISCUSSION
We have exhibited and quantified complex network dynamics of interconnected analog nodes that include hardware effects such as noise and manufacturing heterogeneity. These networks are implemented in silicon on chips made by FPGA manufacturers such as Xilinx® and Altera® and the hardware designs are flexibly modified using standard synthesis software tools. We use unclocked, analog processing, which allows greater speed and variety of dynamics than clocked, digital processing. The analog approach can result in varying responses to the same input signal, which may be detrimental for some applications. However, all natural learning systems, including biological neuronal systems, have inherent noise. Furthermore, even with digital processing, the output can be sensitive to small perturbations in the input, in particular, with the complex processing of machine-learning methods. Thus, robustness of the final output to noise in the input is always an issue in machine learning. For example, controlled dropout noise is artificially added to deep learning models during training.37
The analog-gate networks we designed were inspired by the software reservoir computers introduced by Jaeger1 and Maass et al.2 The main attributes we retained are the use of a sparse random graph, with recurrent connections between nodes, and a nonlinear activation function at each node. However, the hardware implementation does not allow the same control over network parameters and dynamics. For example, the spectral radius of the matrix $W$ in Eq. (1) is typically adjusted to yield appropriate dynamics. As an alternative, we program different nodes with different logic-gate types (resulting in different activation functions), and we adjust the dynamics by varying the proportions used for the various gate types (see Secs. III A 1 and III A 2). This mechanism allows us to vary continuously a quantity we call the mean sensitivity of the network, which offers some control over the free-running self-activity level and the transient response to input. The length of the transient response is akin to the duration of memory in recurrent neural networks.
We found (see Sec. III C) that for an image classification task with static input, the greatest accuracy was achieved by networks with very short transient length (minimal memory), whereas for a signal classification task with dynamic input, a moderate transient length (memory limited to an appropriate duration) was best. The most accurate networks for image classification had low mean sensitivity; for signal classification, accuracy was best for mean sensitivity near the value of 0.5, above which substantial network self-activity becomes common. However, for both classification tasks, greatest accuracy occurred for networks with low self-activity.
Because the network topology and gate types are randomly generated, there is considerable variability of the network dynamics for the same choices of the parameters $p_0$, $p_{1/2}$, $p_1$ and mean sensitivity $\bar{S}$. However, we found that the mean sensitivity has a strong influence on the dynamics (see Secs. III A 3 and III B 3). The network dynamics, in turn, affect but do not determine machine-learning accuracy. For a given machine-learning task, the mean sensitivity and the dynamical metrics we computed can be useful in determining networks that are appropriate candidates for optimizing accuracy, but it is still important to train and test many random network realizations with the available data. Recall that, here, we used a simple classifier, linear regression, in order to compare a large number of networks; in experiments with a more limited set of networks, which we will report elsewhere, we have found significant improvement in accuracy by using logistic regression. We remark also that, as is typical for machine-learning applications, accuracy increases with network size. We found that for both our classification tasks, when the mean sensitivity is chosen appropriately, accuracy showed little improvement beyond a size of 1500 nodes; recall that we used 2400-node networks in Sec. III C.
In addition to the network parameters we varied in this article, other parameters available for study include the number of nodes, the average degree of nodes, other characteristics of the degree distribution, and the distribution of output values from the lookup tables used to program the nodes. Recall that we used a symmetric distribution of gate types that resulted in an equal likelihood of Boolean 0 and 1 output, but there might be benefits to breaking this symmetry.
One limitation of our FPGA networks is that though the internal unclocked processing is analog, their native input and output interfaces are digital, resulting in some loss of information in the output. Constructing random networks in physical ASIC chips could allow a more direct access to the analog properties of the nodes. Such a system could receive analog input signals and produce analog output. This approach would make better use of the electrically analog behavior of the nodes, would greatly increase the efficiency of the processing, and possibly allow for completely new applications.
ACKNOWLEDGMENTS
This material is based in part upon work supported by the National Science Foundation (NSF) (Grant Nos. EAR 1417148 and 1909055), as well as the NSF Graduate Research Fellowship Program under Grant No. 1322106. Our research was partially supported through a DoD contract under the Laboratory of Telecommunication Sciences Partnership with the University of Maryland. We would also like to thank the Maryland Innovation Initiative for their support. We are grateful to Anthony Mautino for his assistance. We gratefully acknowledge UMD and the Office of the Vice President for Research.
I. Shani, A. Restelli, and D. Lathrop are inventors on the Patent application PCT/US2018/03290238 and D. Lathrop is a cofounder of Recurrent Computing, Inc.