Digital computers store information in the form of bits that can take on one of two values, 0 and 1, while quantum computers are based on qubits that are described by a complex wavefunction, whose squared magnitude gives the probability of measuring either 0 or 1. Here, we make the case for a probabilistic computer based on p-bits, which take on values 0 and 1 with controlled probabilities and can be implemented with specialized compact energy-efficient hardware. We propose a generic architecture for such p-computers and emulate systems with thousands of p-bits to show that they can significantly accelerate randomized algorithms used in a wide variety of applications, including but not limited to Bayesian networks, optimization, Ising models, and quantum Monte Carlo.
I. INTRODUCTION
Feynman1 famously remarked, “Nature isn't classical, dammit, and if you want to make a simulation of nature, you'd better make it quantum mechanical.” In the same spirit, we could say, “Many real-life problems are not deterministic, and if you want to simulate them, you'd better make it probabilistic.” However, there is a difference. Quantum algorithms require quantum hardware, and this has motivated a worldwide effort to develop a suitable new technology. In contrast, probabilistic algorithms can be, and are, implemented on existing deterministic hardware using pseudo-random number generators (pseudo-RNGs). Monte Carlo algorithms rank among the top ten algorithms of the 20th century2 and are used in a broad range of problems including Bayesian learning, protein folding, optimization, stock option pricing, and cryptography, just to name a few. So why do we need a p-computer?
A key element in a Monte Carlo algorithm is the RNG, which requires thousands of transistors to implement with deterministic elements, thus encouraging the use of architectures that time-share a few RNGs. Our work has shown the possibility of high-quality true RNGs using just three transistors,3 prompting us to explore a different architecture that makes use of large numbers of controlled RNGs, or p-bits. Figure 1(a)4 shows a generic vision for a probabilistic or p-computer having two primary components: an N-bit random number generator (RNG) that generates N-bit samples and a Kernel that performs deterministic operations on them. Note that each RNG-Kernel unit could include multiple RNG-Kernel sub-units (not shown) for problems that can benefit from it. These sub-units could be connected in series, as in Bayesian networks [Fig. 2(a)], or in parallel, as done in parallel tempering5,6 or for problems that allow graph coloring.7 The parallel RNG-Kernel units shown in Fig. 1(a) are intended to perform easily parallelizable operations like ensemble sums, using a data collector unit to combine all outputs into a single consolidated output.
Probabilistic computer: (a) overall architecture combining a probabilistic element (N-bit RNG) with deterministic elements (kernel and data collector). The N-bit RNG block is a collection of N 1-bit RNGs, or p-bits. (b) p-bit: desired input–output characteristics along with two possible implementations, one with CMOS technology using linear feedback shift registers (LFSRs) and lookup tables (LUTs)8 and the other using three transistors and a stochastic magnetic tunnel junction (s-MTJ).3 The first is used to obtain all the results presented here, while the second is a nascent technology with many unanswered questions.
Bayesian network for genetic relatedness mapped to a p-computer (a) with each node represented by one p-bit. With increasing Ns, the correlations (b) between different nodes are obtained more accurately.
Ideally, the Kernel and data collector are pipelined so that they can continually accept new random numbers from the RNG,4 which is assumed to be fast and available in large numbers. The p-computer can then provide $N_p f_c$ samples per second, with $N_p$ being the number of parallel units9 and $f_c$ the clock frequency. We argue that even with $N_p$ = 1, this throughput is well in excess of what is achieved with standard implementations on either a CPU (central processing unit) or a graphics processing unit (GPU) for a wide range of applications and algorithms, including but not limited to those targeted by modern digital annealers or Ising solvers.8,10–17 Interestingly, a p-computer also provides a conceptual bridge to quantum computing, sharing many characteristics that we associate with the latter.18 Indeed, it can implement algorithms intended for quantum computers, though the effectiveness of quantum Monte Carlo depends strongly on the extent of the so-called sign problem specific to the algorithm and our ability to “tame” it.19
II. IMPLEMENTATION
Of the three elements in Fig. 1, two are deterministic. The Kernel is problem-specific, ranging from simple operations like addition or multiplication to more elaborate operations that could justify special-purpose chiplets.20 Matrix multiplication, for example, could be implemented using analog options like resistive crossbars.16,21–23 The data collector typically involves addition and could be implemented with adder trees. The third element is probabilistic, namely, the N-bit RNG, which is a collection of N 1-bit RNGs, or p-bits. The behavior of each p-bit can be described by24

$$ s_i \;=\; \Theta\big(\sigma(I_i) - r\big), \qquad (1) $$
where $s_i$ is the binary p-bit output, Θ is the unit step function, σ is the sigmoid function, $I_i$ is the input to the p-bit, and r is a uniform random number between 0 and 1. Equation (1) is illustrated in Fig. 1(b). While the p-bit output is always binary, the p-bit input $I_i$ controls the mean of the output sequence, since the probability of obtaining a 1 is $\sigma(I_i)$. With $I_i$ = 0, the output is distributed 50–50 between 0 and 1, which may be adequate for many algorithms. In general, however, a non-zero $I_i$ determined by the current sample is necessary to generate the desired probability distributions from the N-bit RNG block.
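As a minimal illustration (ours, not part of the original article), the following Python sketch models Eq. (1) in software and checks that the mean of the binary output stream approaches σ(I); the input values and sample count are arbitrary demonstration choices, not hardware parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def p_bit(I, samples, rng):
    """Software model of a p-bit per Eq. (1): s = Theta(sigma(I) - r)."""
    r = rng.random(samples)                 # uniform random numbers in [0, 1)
    return (sigmoid(I) > r).astype(int)     # binary output stream

for I in (-2.0, 0.0, 2.0):
    s = p_bit(I, samples=100_000, rng=rng)
    print(f"I = {I:+.1f}: mean output = {s.mean():.3f}, sigma(I) = {sigmoid(I):.3f}")
```

The mean of the output stream tracks σ(I), consistent with the input–output characteristic sketched in Fig. 1(b).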
One promising implementation of a p-bit is based on a stochastic magnetic tunnel junction (s-MTJ), as shown in Fig. 1(b), whose resistance state fluctuates due to thermal noise. It is placed in series with a transistor, and the drain voltage is thresholded by an inverter3 to obtain a random binary output bit whose average value can be tuned through the gate voltage. It has been shown both theoretically25,26 and experimentally27,28 that s-MTJ-based p-bits can be designed to generate new random numbers in times of the order of nanoseconds. The same circuit could also be used with other fluctuating resistors,29 but one advantage of s-MTJs is that they can be built by modifying magnetoresistive random access memory (MRAM) technology, which has already reached gigabit levels of integration.30
Note, however, that the examples presented here all use p-bits implemented with deterministic CMOS elements, namely, pseudo-RNGs built from linear feedback shift registers (LFSRs) combined with lookup tables (LUTs) and thresholding elements,8 as shown in Fig. 1(b). Such random numbers are not truly random but have a repetition period that is chosen to be longer than the time range of interest. The longer the period, the more registers are needed to implement it. The number of transistors required per p-bit30 thus depends on the quality of the pseudo-RNG that is desired: a Xoshiro128+31 generator, for example, would require around four times as many transistors as a thirty-two-stage LFSR. Physics-based approaches, like s-MTJs, naturally generate true random numbers with an infinite repetition period and ideally require only three transistors and one MTJ.
A simple performance metric for p-computers is the ideal sampling rate $N_p f_c$ mentioned above. The results presented here were all obtained with a field-programmable gate array (FPGA) running on a 125 MHz clock, for which the sampling interval $1/f_c$ is 8 ns; this interval could be significantly shorter if implemented with s-MTJs, which can fluctuate on nanosecond timescales.25 Furthermore, s-MTJs are compact and energy-efficient, allowing up to a factor of 100 larger $N_p$ for a given area and power budget. With such increases in $f_c$ and $N_p$, a performance improvement of 2–3 orders of magnitude over the numbers presented here may be possible with s-MTJs or other physics-based hardware.
We should point out that such compact p-bit implementations are still in their infancy,30 and many questions remain. First is the inevitable variation in RNG characteristics that can be expected. Initial studies suggest that it may be possible to train the Kernel to compensate for at least some of these variations.32,33 Second is the quality of randomness, as measured by statistical quality tests, which may require additional circuitry as discussed, for example, in Ref. 27. Certain applications like simple integration (Sec. III A) may not need high quality random numbers, while others like Bayesian correlations (Sec. III B) or Metropolis–Hastings methods that require a proposal distribution (Sec. III C) may have more stringent requirements. Third is the possible difficulty associated with reading sub-nanosecond fluctuations in the output and communicating them faithfully. Finally, we note that the input to a p-bit is an analog quantity requiring digital-to-analog converters (DACs) unless the kernel itself is implemented with analog components.
III. APPLICATIONS
A. Simple integration
A variety of problems, such as high-dimensional integration, can be viewed as the evaluation of a sum over a very large number N of terms. The basic idea of the Monte Carlo method is to estimate the desired sum from a limited number $N_s$ of samples drawn from configurations $\alpha$ generated with probability $q_\alpha$,

$$ S \;=\; \sum_{\alpha=1}^{N} f_\alpha \;\approx\; \frac{1}{N_s} \sum_{j=1}^{N_s} \frac{f_{\alpha_j}}{q_{\alpha_j}}, \qquad (2) $$

where $\alpha_j$ denotes the configuration drawn in the j-th sample.
The distribution $\{q_\alpha\}$ can be uniform or could be cleverly chosen to minimize the standard deviation of the estimate.34 In any case, the standard deviation goes down as $1/\sqrt{N_s}$, and all such applications could benefit from a p-computer to accelerate the collection of samples.
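As a concrete sketch of Eq. (2), the following Python snippet estimates a large sum by sampling; the summand f and the uniform choice for q are our own illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical example: estimate S = sum_{alpha=0}^{N-1} f(alpha)
# by sampling configurations alpha with probability q_alpha, per Eq. (2).
N = 10**6
f = lambda alpha: np.exp(-alpha / 1e5)          # placeholder summand
q = np.full(N, 1.0 / N)                          # uniform sampling distribution

Ns = 10**4                                       # number of Monte Carlo samples
samples = rng.choice(N, size=Ns, p=q)            # draw configurations alpha ~ q
estimate = np.mean(f(samples) / q[samples])      # estimator of S from Eq. (2)

exact = f(np.arange(N)).sum()
print(f"MC estimate = {estimate:.1f}, exact = {exact:.1f}")
```

A cleverly chosen, non-uniform q that tracks the magnitude of f would reduce the spread of the estimate for the same number of samples.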
B. Bayesian network
A slightly more complicated application of a p-computer is to problems where random numbers are generated not according to a fixed distribution, but according to a distribution determined by the outputs from a previous set of RNGs. Consider, for example, the question of genetic relatedness in a family tree,35,36 with each layer representing one generation. Each generation in the network in Fig. 2(a) with N nodes can be mapped to an N-bit RNG block feeding into a Kernel, which stores the conditional probability table (CPT) relating it to the next generation. The correlation between different nodes in the network can be measured directly, and an average over the $N_s$ samples computed to yield the correct genetic correlation, as shown in Fig. 2(b). Nodes separated by p generations have a correlation of $(1/2)^p$. The measured absolute correlation between strangers goes down to zero as $1/\sqrt{N_s}$.
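The sketch below illustrates this generation-by-generation sampling with a deliberately simplified single-bit inheritance rule (each child copies one of its two parents at random, the second parent being an unrelated random bit); the rule and the lineage structure are our own assumptions chosen to reproduce the $(1/2)^p$ decay, not the CPT of Refs. 35 and 36.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simplified single-gene model: a child inherits its bit from one of its two
# parents with equal probability; the ancestor-descendant correlation
# then decays as (1/2)^p with the number of generations p.
def sample_lineage(p, Ns):
    ancestor = rng.choice([-1, 1], size=Ns)          # generation 0, Ns independent samples
    child = ancestor.copy()
    for _ in range(p):                               # p generations of inheritance
        mate = rng.choice([-1, 1], size=Ns)          # unrelated second parent
        pick = rng.random(Ns) < 0.5                  # which parent's bit is inherited
        child = np.where(pick, child, mate)
    return ancestor, child

for p in (1, 2, 3):
    a, c = sample_lineage(p, Ns=100_000)
    print(f"p = {p}: measured corr = {np.corrcoef(a, c)[0, 1]:.3f}, "
          f"expected = {0.5**p:.3f}")
```

The measured correlation converges to the expected value as more samples are collected, with the statistical error shrinking as $1/\sqrt{N_s}$.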
This is characteristic of Monte Carlo algorithms, namely, that to obtain results with accuracy ε we need $\sim 1/\varepsilon^2$ samples. The p-computer allows us to collect samples at a rate of $N_p f_c$ = 125 MSamples per second if $N_p$ = 1 and $f_c$ = 125 MHz. This is about two orders of magnitude faster than what we get running the same algorithm on an Intel Xeon CPU.
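The throughput argument can be made concrete with a little arithmetic; the accuracy targets below are illustrative, and the $1/\varepsilon^2$ sample count is the generic Monte Carlo scaling rather than a result from the text.

```python
# Rough sampling-time arithmetic for the rate quoted above (illustrative targets).
f_c, N_p = 125e6, 1                 # clock frequency (Hz) and number of parallel units
for eps in (1e-2, 1e-3, 1e-4):
    N_s = 1.0 / eps**2              # samples needed for accuracy ~ eps
    t = N_s / (N_p * f_c)           # wall time at N_p * f_c samples per second
    print(f"eps = {eps:.0e}: N_s ~ {N_s:.0e} samples, time ~ {t:.2e} s")
```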
How does this compare to deterministic algorithms run on a CPU? As Feynman noted in his seminal paper,1 deterministic algorithms for problems of this type are very inefficient compared to probabilistic ones because of the need to integrate over all the unobserved nodes in order to calculate a property related to specific nodes, say m and n,

$$ P(s_m, s_n) \;=\; \sum_{\{s_i,\; i \neq m, n\}} P(s_1, s_2, \ldots, s_N). \qquad (3) $$
In contrast, a p-computer can ignore all the irrelevant nodes and simply look at the relevant nodes m and n. We used the example of genetic correlations because it is easy to relate to. However, it is representative of a wide class of everyday problems involving nodes with one-way causal relationships extending from “parent” nodes to “child” nodes,37–39 all of which could benefit from a p-computer.
C. Knapsack problem
Let us now look at a problem that requires random numbers to be generated with a probability determined by the outcome from the last sample generated by the same RNG. Every RNG then requires feedback from the very Kernel that processes its output. This belongs to the broad class of problems that are labeled as Markov Chain Monte Carlo (MCMC). For an excellent summary and evaluation of MCMC sampling techniques, we refer the reader to Ref. 40.
The knapsack is a textbook optimization problem described in terms of a set of M items, the m-th of which has a value $v_m$ and a weight $w_m$. The problem is to figure out which items to take ($s_m$ = 1) and which to leave behind ($s_m$ = 0) such that the total value $V = \sum_m v_m s_m$ is maximized while keeping the total weight $W = \sum_m w_m s_m$ below a capacity C. We could straightforwardly map this to the architecture of Fig. 1, using the RNG to propose solutions at random and the Kernel to evaluate V and W and decide whether to accept or reject. Yet, this approach would take us toward the solution far too slowly. It is better to propose solutions intelligently, looking at the previously accepted proposal and making only a small change to it. For our examples, we proposed a change of only two items each time.
This intelligent proposal, however, requires feedback from the Kernel, which can take multiple clock cycles. One could wait between proposals, but the solution is reached faster if we instead continue to make proposals every clock cycle, in the spirit of what is referred to as multiple-try Metropolis.41 The results are shown in Fig. 3(b)4 and compared with CPU (Intel Xeon at 2.3 GHz) and GPU (Tesla T4 at 1.59 GHz) implementations using the same probabilistic algorithm. Also shown are two efficient deterministic algorithms, one based on dynamic programming (DP) and another based on the work of Pisinger and co-workers.42,43
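A minimal software sketch of this two-item-change Metropolis scheme is shown below; the random instance, the inverse temperature, and the outright rejection of over-capacity proposals are our own illustrative choices rather than the settings used for Fig. 3.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical random knapsack instance (not the benchmark instances of the paper).
M = 200
values  = rng.uniform(1, 100, size=M)
weights = rng.uniform(1, 100, size=M)
C = 0.3 * weights.sum()                           # capacity

beta = 0.5                                        # inverse "temperature" (illustrative)
s = np.zeros(M, dtype=bool)                       # start with an empty knapsack
V, W, best_V = 0.0, 0.0, 0.0

for step in range(200_000):
    flip = rng.choice(M, size=2, replace=False)   # propose changing only two items
    dV = np.sum(np.where(s[flip], -values[flip], values[flip]))
    dW = np.sum(np.where(s[flip], -weights[flip], weights[flip]))
    # accept feasible proposals with Metropolis probability min(1, exp(beta * dV))
    if W + dW <= C and rng.random() < np.exp(min(0.0, beta * dV)):
        s[flip] = ~s[flip]
        V, W = V + dV, W + dW
        best_V = max(best_V, V)

print(f"best value found = {best_V:.1f}, capacity used = {W:.1f}/{C:.1f}")
```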
Example of the MCMC—Knapsack problem: (a) mapping to the general p-computer framework. (b) Performance of the p-computer compared to a CPU implementation of the same probabilistic algorithm, along with two well-known deterministic approaches. A deterministic algorithm like that developed by Pisinger and co-workers,42,43 which is optimized specifically for the Knapsack problem, can outperform MCMC. However, for a given MCMC algorithm, the p-computer provides orders of magnitude improvement over the standard CPU implementation.
Note that the probabilistic algorithm (MCMC) gives solutions that are within 1% of the correct solution, while the deterministic algorithms give the exact solution. For the Knapsack problem, a solution that is 99% accurate should be sufficient for most real-world applications. The p-computer provides orders of magnitude improvement over a CPU implementation of the same MCMC algorithm. It is outperformed by the algorithm developed by Pisinger and co-workers,42,43 which is specifically optimized for the Knapsack problem. We note, however, that while the p-computer projection in Fig. 3(b) is based on utilizing better hardware like s-MTJs, there is also significant room for improving the p-computer by optimizing the Metropolis algorithm used here and/or by adding parallel tempering.5,6
D. Ising model
Another widely used model for optimization within MCMC is based on the concept of Boltzmann machines (BMs), defined by an energy function E from which one can calculate the synaptic function $I_i$,

$$ I_i \;=\; -\beta\,\big[\,E(s_i{=}1, \{s_{j\neq i}\}) \;-\; E(s_i{=}0, \{s_{j\neq i}\})\,\big], \qquad (4) $$
which can be used to guide the sample generation from each RNG i, updated in sequence,44 according to Eq. (1). Alternatively, the sample generation from each RNG can be kept fixed, and the synaptic function can be used to decide whether to accept or reject the new sample within a Metropolis–Hastings framework.45 Either way, samples will be generated with probabilities $P(\{s\}) \propto \exp\big(-\beta E(\{s\})\big)$. We can solve optimization problems by identifying E with the cost function that we are seeking to minimize (or with the negative of an objective we seek to maximize). Using a large β, we can ensure that the probability is nearly 1 for the configuration with the minimum value of E.
In principle, the energy function is arbitrary, but much of the work is based on quadratic energy functions defined by a connection matrix $W_{ij}$ and a bias vector $h_i$ (see, for example, Refs. 8 and 10–17),

$$ E(\{s\}) \;=\; -\sum_{i<j} W_{ij}\, s_i\, s_j \;-\; \sum_i h_i\, s_i. \qquad (5) $$
For this quadratic energy function, Eq. (4) gives $I_i = \beta\big(\sum_j W_{ij}\, s_j + h_i\big)$, so that the Kernel has to perform a multiply-and-accumulate operation, as shown in Fig. 4(a). We refer the reader to Ref. 8 for an example of the max-cut optimization problem on a two-dimensional (2D) 90 × 90 array implemented with a p-computer.
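The following sketch ties Eqs. (1), (4), and (5) together for a small random Boltzmann machine, updating the p-bits sequentially while β is ramped up; the instance and the annealing schedule are illustrative assumptions rather than any benchmark from the text.

```python
import numpy as np

rng = np.random.default_rng(4)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Small illustrative Boltzmann machine with a random symmetric W and bias h.
n = 20
W = rng.normal(0, 1, (n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
h = rng.normal(0, 1, n)

def energy(s):
    return -0.5 * s @ W @ s - h @ s               # quadratic energy, Eq. (5), s in {0, 1}

s = rng.integers(0, 2, n).astype(float)
for beta in np.linspace(0.1, 5.0, 200):           # simple annealing schedule (illustrative)
    for i in range(n):                            # sequential p-bit updates
        I_i = beta * (W[i] @ s + h[i])            # synaptic input from Eq. (4)
        s[i] = float(rng.random() < sigmoid(I_i)) # Eq. (1): s_i = Theta(sigma(I_i) - r)

print(f"final energy E = {energy(s):.2f}")
```

Each inner-loop update is exactly the multiply-and-accumulate followed by a p-bit draw that the Kernel and RNG block of Fig. 4(a) are meant to perform in hardware.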
Example of the Quantum Monte Carlo (QMC)—transverse field Ising model: (a) mapping to the general p-computer framework. (b) Solving the transverse Ising model for quantum annealing. Subfigure (b) is adapted from Sutton et al., IEEE Access 8, 157238–157252 (2020). Copyright 2020 Author(s), licensed under a Creative Commons Attribution (CC BY) license.8
Equation (4), however, is more generally applicable even if the energy expression is more complicated or is given by a table; the Kernel can be modified accordingly. For an example of an energy function with fourth-order terms implemented on an eight-p-bit p-computer, we refer the reader to Ref. 30.
A wide variety of problems can be mapped onto the BM with an appropriate choice of the energy function. For example, we could generate samples from a desired probability distribution P by choosing $E = -(1/\beta)\ln P$. Another example is the implementation of logic gates by defining E to be zero for all configurations $\{s\}$ that belong to the truth table and to have some positive value for those that do not.24 Unlike standard digital logic, such a BM-based implementation would provide invertible logic that not only provides the output for a given input but also generates all possible inputs corresponding to a specified output.24,46,47
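As an illustration of invertible logic (our own toy construction, not the specific energy functions of Refs. 24, 46, and 47), the sketch below assigns E = 0 to the truth-table rows of an AND gate and E = 1 to every other configuration, then samples the three bits with the output either free or clamped.

```python
import numpy as np
from collections import Counter
from itertools import product

rng = np.random.default_rng(5)

# Toy invertible AND gate: E = 0 on truth-table rows, E = 1 otherwise
# (the specific energy values are illustrative choices).
TRUTH = {(a, b, a & b) for a, b in product((0, 1), repeat=2)}
E = lambda a, b, c: 0.0 if (a, b, c) in TRUTH else 1.0

beta = 5.0
def sample(clamp_c=None, Ns=20_000):
    counts = Counter()
    a, b, c = 0, 0, 0
    for _ in range(Ns):
        for bit in rng.permutation(3):             # sequential p-bit updates
            if bit == 2 and clamp_c is not None:
                c = clamp_c                        # clamp the "output" bit
                continue
            s0 = [a, b, c]; s1 = list(s0)
            s0[bit], s1[bit] = 0, 1
            I = -beta * (E(*s1) - E(*s0))          # synaptic input, Eq. (4)
            (a, b, c) = tuple(s1) if rng.random() < 1 / (1 + np.exp(-I)) else tuple(s0)
        counts[(a, b, c)] += 1
    return counts

print("free running:", sample().most_common(4))
print("output clamped to 1:", sample(clamp_c=1).most_common(2))
print("output clamped to 0:", sample(clamp_c=0).most_common(3))
```

With the output clamped to 1, sampling concentrates on (1, 1, 1); with the output clamped to 0, it visits the three input combinations consistent with a 0 output, which is the invertible behavior described above.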
E. Quantum Monte Carlo
Finally, let us briefly describe the feasibility of using p-computers to emulate quantum or q-computers. A q-computer is based on qubits that are neither 0 nor 1 but are described by a complex wavefunction whose squared magnitude gives the probability of measuring either 0 or 1. The state of an n-qubit computer is described by a wavefunction with $2^n$ complex components, one for each possible configuration of the n qubits.
In gate-based quantum computing (GQC), a set of n qubits is placed in a known initial state, operated on by d quantum gates that manipulate the wavefunction through unitary transformations,

$$ \psi_{\rm final} \;=\; U_d \cdots U_2\, U_1\, \psi_{\rm initial}, \qquad (6) $$
and measurements are made to obtain results with probabilities given by the squared magnitudes of the components of the final wavefunction. From the rules of matrix multiplication, the final wavefunction can be written as a sum over a very large number of terms,

$$ \psi_{\rm final}(q_d) \;=\; \sum_{q_{d-1}} \cdots \sum_{q_1} \sum_{q_0} \big[U_d\big]_{q_d q_{d-1}} \cdots \big[U_2\big]_{q_2 q_1} \big[U_1\big]_{q_1 q_0}\, \psi_{\rm initial}(q_0), \qquad (7) $$

where each index $q_k$ labels one of the $2^n$ configurations of the n qubits.
Conceptually, we could represent a system of n qubits and d gates with a system of $n \times d$ p-bits, whose $2^{nd}$ states label the $2^{nd}$ terms in the summation in Eq. (7).48 Each of these terms is often referred to as a Feynman path, and what we want is the sum of the amplitudes of all such paths.
The essential idea of quantum Monte Carlo is to estimate this enormous sum from a few suitably chosen samples, not unlike the simple Monte Carlo stated earlier in Eq. (2). What makes it more difficult, however, is the so-called sign problem,19 which can be understood intuitively as follows. If all the path amplitudes are positive, then it is relatively easy to estimate the sum from a few samples. However, if some are positive while others are negative, with extensive cancelations, then many more samples will be required. The same is true if the amplitudes are complex quantities that cancel one another.
The matrices U that appear in GQC are unitary with complex elements, which often leads to significant cancelation of Feynman paths, except in special cases where there is complete constructive interference. In general, this can make it necessary to use very large numbers of samples for accurate estimation. A noiseless quantum computer would not have this problem, since the qubits in effect perform the entire sum exactly and yield samples according to the squared magnitude of the resulting wavefunction. However, real-world quantum computers have noise, and p-computers could be competitive for many problems.
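The effect of cancelations can be seen in a toy experiment (entirely illustrative: random magnitudes, uniform sampling) that estimates a large sum once with all-positive terms and once with alternating signs.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy sign-problem illustration: estimate S = sum_k a_k by uniform sampling
# for (i) all-positive amplitudes and (ii) amplitudes with alternating signs.
N, Ns, trials = 10**6, 10**4, 50
mags = rng.uniform(0.5, 1.5, size=N)

def mc_estimate(a):
    idx = rng.choice(N, size=Ns)                  # uniform sampling, q = 1/N
    return N * np.mean(a[idx])                    # estimator of sum_k a_k

for label, a in [("positive", mags), ("alternating sign", mags * (-1) ** np.arange(N))]:
    ests = np.array([mc_estimate(a) for _ in range(trials)])
    rel_err = np.std(ests) / max(abs(a.sum()), 1e-12)
    print(f"{label:>16}: exact sum = {a.sum():+.1f}, "
          f"relative std of estimate = {rel_err:.2e}")
```

For the same number of samples, the relative error of the signed estimate is orders of magnitude worse, which is the essence of the sign problem.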
Adiabatic quantum computing (AQC) operates on very different physical principles, but its mathematical description can also be viewed as a sum over Feynman paths, now representing the multiplication of r matrices,

$$ Z \;=\; \mathrm{Tr}\big(e^{-\beta H}\big) \;=\; \sum_{\{q_1, q_2, \ldots, q_r\}} \big[e^{-\beta H/r}\big]_{q_1 q_2}\, \big[e^{-\beta H/r}\big]_{q_2 q_3} \cdots \big[e^{-\beta H/r}\big]_{q_r q_1}. \qquad (8) $$
This is based on the Suzuki–Trotter method described in Ref. 49, where the number of replicas, r, is chosen large enough that, for a Hamiltonian with non-commuting parts $H = H_1 + H_2$, one can approximately write $e^{-\beta H/r} \approx e^{-\beta H_1/r}\, e^{-\beta H_2/r}$. The matrices $e^{-\beta H/r}$ in AQC are Hermitian, and their elements can be all positive.50 A special class of Hamiltonians H having this property is called stoquastic, and for these there is no sign problem, since the amplitudes in the Feynman sum in Eq. (8) all have the same sign.
An example of such a stoquastic Hamiltonian is the transverse field Ising model (TFIM) commonly used for quantum annealing, where a transverse field, which is quantum in nature, is introduced and then slowly reduced to zero to recover the original classical problem. Figure 4, adapted from Ref. 8, shows an n = 250 qubit problem mapped to a 2D lattice of 250 × 10 = 2500 p-bits using r = 10 replicas to calculate average correlations between the z-directed spins on lattice sites separated by a distance L. Very accurate results are obtained using $N_s = 10^5$ samples. However, these samples were suitably spaced to ensure their independence, which is an important concern in problems involving feedback.
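A minimal sketch of this quantum-to-classical mapping is given below for a small one-dimensional TFIM ring, using the standard Suzuki–Trotter replica coupling $J_\perp = (1/2\beta)\ln\coth(\beta\Gamma/r)$; the chain length, couplings, and sweep schedule are our own illustrative choices and not those of Ref. 8.

```python
import numpy as np

rng = np.random.default_rng(7)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Illustrative 1D TFIM ring mapped to r classical replicas; J_perp is the
# standard Suzuki-Trotter coupling between adjacent replicas of the same spin.
n, r = 50, 10
J, Gamma, beta = 1.0, 0.5, 2.0
J_perp = 0.5 / beta * np.log(1.0 / np.tanh(beta * Gamma / r))

m = rng.choice([-1.0, 1.0], size=(n, r))              # spins m_{i,k} = +/-1

def sweep():
    for i in range(n):
        for k in range(r):
            # local field from in-replica neighbors and the two adjacent replicas
            h = (J / r) * (m[(i - 1) % n, k] + m[(i + 1) % n, k]) \
                + J_perp * (m[i, (k - 1) % r] + m[i, (k + 1) % r])
            m[i, k] = 1.0 if rng.random() < sigmoid(2 * beta * h) else -1.0

for _ in range(100):                                   # burn-in sweeps
    sweep()

Ns, spacing = 200, 2                                   # spaced samples for independence
corr = np.zeros(6)
for _ in range(Ns):
    for _ in range(spacing):
        sweep()
    for L in range(6):
        corr[L] += np.mean(m * np.roll(m, -L, axis=0)) / Ns

print("<m_i m_{i+L}> for L = 0..5:", np.round(corr, 3))
```

The replicated lattice is sampled with exactly the same p-bit update as the classical Ising case, which is why the TFIM problem of Fig. 4 maps onto the same RNG-plus-Kernel architecture.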
Finally, we note that quantum Monte Carlo methods, for both GQC and AQC, involve selective summing of Feynman paths to evaluate matrix products. As such, we might expect conceptual overlap with the very active field of randomized algorithms for linear algebra,51,52 though the two fields seem very distinct at this time.
IV. CONCLUDING REMARKS
In summary, we have presented a generic architecture for a p-computer based on p-bits, which take on values 0 and 1 with controlled probabilities and can be implemented with specialized compact energy-efficient hardware. We emulate systems with thousands of p-bits to show that they can significantly accelerate the implementation of randomized algorithms that are widely used for many applications.53 A few prototypical examples are presented, such as Bayesian networks, optimization, Ising models, and quantum Monte Carlo.
ACKNOWLEDGMENTS
The authors are grateful to Behtash Behin-Aein for helpful discussions and advice. We also thank Kerem Camsari and Shuvro Chowdhury for their feedback on the manuscript. The contents are based on the work done over the last 5–10 years in our group, some of which has been cited here, and it is a pleasure to acknowledge all who have contributed to our understanding. This work was supported in part by ASCENT, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.
AUTHOR DECLARATIONS
Conflict of Interest
One of the authors (S.D.) has a financial interest in Ludwig Computing.
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.