This paper proposes a large-scale, energy-efficient, high-throughput, and compact tensorized optical neural network (TONN) exploiting the tensor-train decomposition architecture on an integrated III–V-on-silicon metal–oxide–semiconductor capacitor (MOSCAP) platform. By using cascaded multi-wavelength small-radix (e.g., 8 × 8) tensor cores, the proposed TONN architecture is scalable to 1024 × 1024 synapses and beyond, a scale that is extremely difficult for conventional integrated ONN architectures. Simulation experiments show that the proposed TONN uses 79× fewer Mach–Zehnder interferometers (MZIs) and 5.2× fewer cascaded stages of MZIs compared with the conventional ONN while maintaining a >95% training accuracy for Modified National Institute of Standards and Technology handwritten digit classification tasks. Furthermore, with the proven heterogeneous III–V-on-silicon MOSCAP platform, our proposed TONN can improve the footprint-energy efficiency by a factor of 1.4 × 10⁴ compared with digital electronic artificial neural network (ANN) hardware and a factor of 2.9 × 10² compared with silicon photonic and phase-change material technologies. Thus, this paper points out a road map for implementing large-scale ONNs with a number of synapses comparable to, and an energy efficiency superior to, electronic ANNs.
I. INTRODUCTION
Artificial neural networks (ANNs) have proven their remarkable capabilities in various tasks, including computer vision, speech recognition, machine translation, medical diagnosis, and the game of Go.1 Neuromorphic computing accelerators, such as IBM TrueNorth2 and Intel Loihi,3 have shown significantly superior performance compared with traditional central processing units (CPUs) for specific neural network tasks. A majority of the electrical ANN hardware's energy consumption comes from data movement in the synaptic interconnections. Optical neural networks (ONNs), also known as photonic neural networks, are expected to improve the energy efficiency and throughput significantly compared with electrical ANNs due to their capability of transmitting data at the speed of light without a length-dependent impedance.4 However, there are two main challenges that prevent ONNs from achieving competitive performance with electrical ANNs.
The first challenge is the limited scalability. While electrical ANN hardware is capable of achieving 4096 synaptic connections per neuron,5,6 the scale of the state-of-the-art ONNs is limited to 64 × 64 (Ref. 7) or smaller. This is because the synaptic interconnections in conventional ONNs typically rely on Mach–Zehnder interferometer (MZI) meshes.8 Scaling up to high-radix N × N meshes requires O(N²) MZIs and O(N) cascaded stages, which leads to an insurmountable footprint, optical loss budget, and control complexity. For example, Lightmatter's Mars device7 integrates a 64 × 64 micro-electro-mechanical systems (MEMS) MZI mesh on a 150 mm² chip. With the same architecture and device platform, the predicted chip size for 1024 × 1024 would be larger than an 8-in. wafer, and the optical insertion loss of 1024 cascaded stages of MEMS9 MZIs would be >675 dB, which is difficult to compensate with optical amplifiers. The limited scalability forces conventional ONNs to rely on pre-processing10 and convolutional layers11 to handle meaningful machine learning (ML) datasets (e.g., the ImageNet dataset). However, convolutional neural networks (CNNs) are only efficient for specific tasks (e.g., image classification and time series) and require many layers. On the other hand, some literature proposes to pursue large-scale ONNs with a scale comparable to electrical ANN hardware.12–14 However, these efforts retain the conventional ONN architecture,12 reduce the throughput by encoding the input data in the time domain,13 or require bulky free-space devices.14
The second challenge of ONNs is the lack of a device platform that can monolithically integrate optical neurons with photodetectors (PDs), electrical neuron circuits, light emitters, and synaptic interconnections on silicon. By sharing most of its fabrication process steps with the silicon complementary metal–oxide–semiconductor (CMOS) process, silicon photonics (SiPh)15,16 has proved to be a desirable platform for commercial large-volume, low-cost electronic and photonic integrated circuit (EPIC) manufacturing. However, since silicon is an indirect-bandgap material, silicon light emitters are inefficient. Aligning III–V diode laser chips to SiPh chips induces additional coupling losses and packaging complexity, limiting energy efficiency and integration density.
To mitigate these two challenges, on the architecture side, tensor-train (TT) decomposed synaptic interconnections have been proposed to realize large-scale ONNs with reduced hardware resources.17,18 As an effective approach to significantly compress the over-parameterized fully connected layers in ANNs,19,20 TT decomposition has been shown to achieve almost the same accuracy as an ensemble of deep CNNs21 on various tasks, including Markov random fields,22 image recognition,20,23,24 and video classification.24,25 On the device platform side, although a tensorized ONN (TONN) can be implemented on various photonic platforms, heterogeneous III–V-on-silicon integration is an optimal choice.26 Integration of quantum dot (QD) comb lasers27,28 and avalanche photodiodes (APDs)29 with other SiPh devices at wafer scale further improves the energy efficiency of the TONN.
This paper proposes a large-scale, energy-efficient, high-throughput, and compact tensorized ONN (TONN) architecture on a densely integrated III–V-on-silicon metal–oxide–semiconductor capacitor (MOSCAP) platform. The TT decomposition makes the proposed architecture scalable to 1024 × 1024 and beyond, which is extremely difficult for conventional integrated ONNs. The detailed device architecture of the TONN is designed based on a multi-wavelength configuration that does not require intermediate memory. Simulations show that the proposed TONNs utilize 79× fewer MZIs and 5.2× fewer cascaded stages of MZIs than the conventional ONNs while maintaining a >95% accuracy for Modified National Institute of Standards and Technology (MNIST) handwritten digit classification tasks. Moreover, the proposed monolithic III–V-on-silicon MOSCAP platform provides MOSCAP synaptic MZIs with negligible static phase tuning energy consumption, a photodetector-free hitless monitoring scheme that removes more than O(N²) control elements and pads, and dense wavelength division multiplexing (DWDM)-based neurons. With all the experimentally proven building blocks, the footprint-energy efficiency30 [(MAC/J) (MAC/s/mm²)] of the TONNs can be further improved by a factor of 2.9 × 10² compared with other photonic platforms and a factor of 1.4 × 10⁴ compared with the state-of-the-art digital electronic ANNs.
The remainder of this paper is organized as follows. Section II introduces the principle of TT decomposition and the device architecture of the TONN. Section III simulates the TONN using MNIST handwritten digit classification tasks and compares the test accuracy with the conventional ONNs. Section IV reports the implementation of the TONN on the heterogeneous III–V-on-silicon MOSCAP platform and compares the footprint-energy efficiency with other ANN technologies. Finally, Sec. V concludes this paper.
II. TENSORIZED OPTICAL NEURAL NETWORK
A. Principle of tensor-train decomposition
ONNs typically consist of an input neuron layer, many hidden neuron layers, an output neuron layer, and synaptic interconnections, which can be abstracted as arbitrary weight matrices W, as shown in Fig. 1(a). By utilizing singular value decomposition (SVD), an arbitrary weight matrix W can be decomposed into two arbitrary unitary matrices and a diagonal matrix realized by an array of additional phase shifters and amplitude modulators: W = UΣV*. In ONNs, the arbitrary unitary matrices U and V are typically realized by "loss-less" MZI meshes in a "rectangular"8 or "triangular"31 configuration. Each 2 × 2 MZI contains two phase shifters and two 50/50 optical power splitters. However, an N × N MZI mesh requires N(N − 1)/2 MZIs and N cascaded stages.8 Although MEMS32 or non-volatile33 technologies can be used to reduce the length of the phase shifters to tens of microns, MZI meshes are still difficult to scale up to high radix.
(a) Schematic of the ONN architecture with input layers, hidden layers, output layers, and synaptic interconnections. Each synaptic interconnection is a linear operation represented by an arbitrary weight matrix W. (b) Weight matrix TT decomposition for parameter compression.
Here, italic lowercase letters, italic uppercase letters, and bold italic uppercase letters are used to represent vectors, matrices, and tensors, respectively. To represent the weight matrix in the TT format, the matrix dimensions M and N are first assumed to be factored as M = M1M2⋯Md and N = N1N2⋯Nd, where d is defined as the number of factors of M and N. Let μ and ν be the natural bijections from indices (i, j) of W to indices [μ1(i), ν1(j), …, μd(i), νd(j)] of an order-2d weight tensor W. Then, W(i, j) = W(μ1(i), ν1(j), …, μd(i), νd(j)). TT decomposition can be interpreted as SVD of multi-dimensional matrices. As shown in Fig. 1(b), the TT decomposition expresses the tensor W as a series of tensor products,19,20,34

W(μ1(i), ν1(j), …, μd(i), νd(j)) = G1[μ1(i), ν1(j)]G2[μ2(i), ν2(j)]⋯Gd[μd(i), νd(j)],     (1)

where the four-way tensor Gk of size Rk−1 × Mk × Nk × Rk is the TT-core, Gk[μk(i), νk(j)] denotes its Rk−1 × Rk matrix slice, and the total number of TT-cores is d. The vector (R0, R1, …, Rd) is the TT-rank, and R0 = Rd = 1. In this way, the total number of parameters can be reduced from M × N to the sum of the parameters of the small TT-cores, i.e., R0M1N1R1 + R1M2N2R2 + ⋯ + Rd−1MdNdRd.
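The compression can be checked numerically. The following Python sketch (our own illustration, not from the paper; the choice of d = 5 factors of 4 and a uniform TT-rank R = 2 is an assumption) counts the dense and TT parameters for a 1024 × 1024 weight matrix:

```python
import numpy as np

# Parameter counts for TT-decomposing a 1024 x 1024 weight matrix,
# factored as M_k = N_k = 4 with d = 5 factors (illustrative choice).
d, R = 5, 2                      # number of TT-cores and uniform TT-rank
M = [4] * d                      # factors of M = 1024
N = [4] * d                      # factors of N = 1024

full_params = int(np.prod(M)) * int(np.prod(N))     # dense M x N matrix
ranks = [1] + [R] * (d - 1) + [1]                   # R_0 = R_d = 1
# Each TT-core holds R_{k-1} * M_k * N_k * R_k parameters
tt_params = sum(ranks[k] * M[k] * N[k] * ranks[k + 1] for k in range(d))

print(full_params)   # 1048576
print(tt_params)     # 32 + 3*64 + 32 = 256
```

With these illustrative factors, the dense layer's ~10⁶ parameters compress to 256, a ratio of about 4096×.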
B. Tensor-train rank determination
The choice of TT-ranks influences both the training accuracy and the complexity of the TONN. Smaller TT-ranks usually lead to lower training accuracy, and higher TT-ranks lead to higher computing and storage complexity. TT-rank determination is a nondeterministic polynomial time (NP)-hard problem.35 Traditional methods require tuning the rank manually,20 which often leads to suboptimal results as different TT-layers may require different TT-ranks. The authors of Ref. 36 proposed a heuristic binary search method to find the smallest TT-ranks for some predetermined accuracy threshold. The authors of Ref. 37 further provided heuristics that guide the selection of TT-ranks. The authors of Ref. 38 used a Bayesian method to optimize the TT-ranks along with the model parameters. For convenience, the TT-ranks in this paper are determined empirically.
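The binary-search idea of Ref. 36 can be sketched as follows. This is a minimal illustration, not the authors' code: `train_and_evaluate` is a hypothetical stand-in for retraining the network at a given uniform rank, stubbed here with a toy monotone accuracy curve.

```python
# Heuristic binary search for the smallest uniform TT-rank meeting an
# accuracy threshold, in the spirit of Ref. 36 (illustrative sketch).
def train_and_evaluate(rank: int) -> float:
    # Stub: toy model in which accuracy saturates as the TT-rank grows.
    # A real implementation would retrain the TONN at this rank.
    return 0.90 + 0.08 * (1 - 2.0 ** (-rank))

def smallest_rank(threshold: float, lo: int = 1, hi: int = 32) -> int:
    best = hi
    while lo <= hi:
        mid = (lo + hi) // 2
        if train_and_evaluate(mid) >= threshold:
            best, hi = mid, mid - 1   # feasible: try smaller ranks
        else:
            lo = mid + 1              # infeasible: need larger ranks
    return best

print(smallest_rank(0.95))   # 2 with this toy accuracy curve
```

Because accuracy is (approximately) monotone in the rank, the search needs only O(log R) retraining runs instead of sweeping every candidate rank.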
C. Device architecture of tensorized optical neural network
1. Tensor-train layers
To achieve the tensor-train layers, as shown in Fig. 1(b), the vectors x and y = Wx representing the input and output of the weight matrix W need to be reshaped into the tensor format by x(j) = x(ν1(j), …, νd(j)) and y(i) = y(μ1(i), …, μd(i)), where x is of length N, y is of length M, x is the order-d input tensor of size N1 × ⋯ × Nd, and y is the order-d output tensor of size M1 × ⋯ × Md. Then, in TT format, the linear transformation by the weight tensor W can be represented by20

y(μ1(i), …, μd(i)) = ∑ν1,…,νd G1[μ1(i), ν1]G2[μ2(i), ν2]⋯Gd[μd(i), νd] x(ν1, …, νd).     (2)

Based on Eq. (2), x needs to be multiplied by each TT-core in the sequence of Gd, Gd−1, …, G1, as shown in Fig. 2(a). The term Ij is used to denote the intermediate tensor: I1 = Gdx, I2 = Gd−1I1, …, y = Id = G1Id−1. Then, at TT-core k,

Id−k+1 = GkId−k,     (3)

where k = 1, 2, …, d and I0 = x. Equation (3) can be physically implemented by a two-step operation. First, the TT-core Gk multiplies the intermediate tensor Id−k and gives GkId−k. Second, GkId−k is re-indexed to Id−k+1 so that each TT layer is modular.
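The sequential core-by-core contraction can be verified numerically. The following NumPy sketch (our own illustration, with small arbitrary factor sizes) builds random TT-cores, applies them to a reshaped input in the order Gd, …, G1, and checks that the result matches the dense matrix-vector product:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
M = [2, 3, 2]            # output factors: M = 12
N = [3, 2, 2]            # input factors:  N = 12
R = [1, 2, 2, 1]         # TT-ranks, R_0 = R_d = 1

# Random TT-cores G_k of shape (R_{k-1}, M_k, N_k, R_k)
G = [rng.standard_normal((R[k], M[k], N[k], R[k + 1])) for k in range(d)]

# Reference: contract the cores into the dense M x N matrix W
W = G[0]
for k in range(1, d):
    W = np.tensordot(W, G[k], axes=([-1], [0]))    # chain over the rank axis
W = W.squeeze()                                    # drop the size-1 R_0, R_d axes
# Axes are (M1, N1, M2, N2, M3, N3); reorder to (M1, M2, M3, N1, N2, N3)
W = W.transpose(0, 2, 4, 1, 3, 5).reshape(int(np.prod(M)), int(np.prod(N)))

# TT-layer: reshape x into a tensor and apply G_d, ..., G_1 in sequence
x = rng.standard_normal(int(np.prod(N)))
Z = x.reshape(N + [1])                             # (N1, N2, N3, R_d = 1)
for p in range(d - 1, -1, -1):
    # Contract over (N_{p+1}, R_{p+1}); core axes 2,3 hit Z axes p, p+1
    Z = np.tensordot(G[p], Z, axes=([2, 3], [p, p + 1]))
    # Result: (R_p, M_{p+1}, N_1..N_p, M_{p+2}..M_d); move N axes to front
    Z = np.moveaxis(Z, list(range(2, 2 + p)), list(range(p)))
y_tt = Z.reshape(-1)

assert np.allclose(W @ x, y_tt)    # TT product matches the dense product
```

The re-indexing step in the text corresponds to the `moveaxis` call: it restores a consistent axis order so that each TT layer presents the same interface to the next.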
(a) Schematic of the products between the input data and the TT-cores in sequence. (b) N × N unitary matrix represented by a “rectangular” MZI mesh with 2 × 2 MZIs as the building blocks. (c) The proposed device architecture of the M × N TONN with a single wavelength (TONN-SW). (d) The proposed device architecture of the M × N TONN with multiple wavelengths (TONN-MWs). d is the number of factors of M and N and the number of tensor core layers. g is the total number of wavelengths for TONN-MW.
The physical implementation of TT-layers usually requires memory between each adjacent TT-core to store the intermediate data.23 Here, two “memory-free” device architectures of TONN are proposed with cascaded photonic TT-cores consisting of small-radix MZI meshes and passive cross-connects.
2. Single-wavelength implementation
To emulate the tensor product at the TT layer, the first approach is the TONN with a single wavelength (TONN-SW), as shown in Fig. 2(c). At TT-core k, where k = d, …, 1, the input tensor Id−k is represented by hk groups of NkRk optical signals at λ0, where hk = Md⋯Mk+1Nk−1⋯N1. The TT-core Gk is represented by hk identical Rk−1Mk × NkRk MZI meshes placed side by side. The MZI meshes are realized by arranging 2 × 2 MZIs in a "rectangular" mesh, as shown in Fig. 2(b). By sending the input optical signals into the MZI meshes, the product between Gk and Id−k is implemented, and an optical passive cross-connect is then used to switch the indices of Mk and Nk−1 in the output tensor GkId−k. The output of the cross-connect is Id−k+1, which is the input of the next TT layer.
3. Multi-wavelength implementation
By adding parallelism in the wavelength domain using wavelength division multiplexing (WDM) technology, the TONN with multiple wavelengths (TONN-MW) can save a considerable number of MZIs compared with TONN-SW, as shown in Fig. 2(d). At TT-core k, the input tensor Id−k is first encoded onto the WDM channels λ1, λ2, …, λg, where g = Nd/2⋯N1. For the first half of the TT-cores Gd, …, Gd/2+1, only the indices Nd, …, Nd/2+1 of the input data x are used for multiplication with the TT-cores. Similar to TONN-SW, hk groups of NkRk g-wavelength WDM signals first go through hk identical Rk−1Mk × NkRk MZI meshes. Here, hk = Md⋯Mk+1Nk−1⋯Nd/2+1 for d/2 < k < d or hk = Md/2⋯Mk+1Nk−1⋯N1 for k ≤ d/2. Then, for k ≠ d/2 + 1, an optical passive cross-connect is used to switch the indices of Mk and Nk−1 in the output tensor GkId−k and gives Id−k+1.
At TT-core [d/2 + 1], the output of the MZI meshes, Gd/2+1Id/2−1, goes through passive wavelength-space cross-connects that switch the indices between the wavelength domain (Nd/2, …, N1) and the space domain (Md, …, Md/2+1). The wavelength-space cross-connects, which can be realized by using WDM transponders, are fixed (no reconfiguration needed) once the TONN architecture is decided. As a result, the output is represented by hd/2 groups of Nd/2Rd/2 g-wavelength WDM signals, where g = Md⋯Md/2+1. In this way, for the second half of the TT-cores Gd/2, …, G1, the indices Nd/2, …, N1 of the input data x can be used for multiplication with the TT-cores. Note that d is assumed here to be an even number; for odd d, the wavelength-space cross-connects happen at TT-core [(d + 1)/2].
D. Comparison between tensorized and conventional ONNs
To evaluate the scalability of the TONN, Table I compares the total number of MZIs and the total number of cascaded stages of MZIs among the conventional ONN, TONN-SW, and TONN-MW. Letting M = N, M1 = ⋯ = Md = N1 = ⋯ = Nd = N^(1/d), and R0 = ⋯ = Rd = R, both TONN-SW and TONN-MW reduce the total number of cascaded stages from N to dRN^(1/d) compared with the conventional ONN. The TONN-SW enables all-optical tensor-core products without optical-to-electrical-to-optical (O/E/O) conversions; however, its saving in the total number of MZIs is not significant. On the other hand, the TONN-MW reduces the total number of MZIs from N(N − 1)/2 to dRN^(1/2)(RN^(1/d) − 1) at the expense of only one layer of O/E/O conversion for the wavelength-space cross-connects in TT-core [d/2 + 1].
Comparison of the total number of MZIs and the total number of cascaded stages of MZIs among conventional ONN, TONN-SW, and TONN-MW.
| | Total number of MZIs | Total number of cascaded stages of MZIs |
|---|---|---|
| Conventional ONN | N(N − 1)/2 | N |
| TONN-SW | dRN(RN^(1/d) − 1) | dRN^(1/d) |
| TONN-MW | dRN^(1/2)(RN^(1/d) − 1) | dRN^(1/d) |
Figures 3(a) and 3(b) compare the total number of MZIs and cascaded stages of MZIs between the conventional ONN and TONN-MW as a function of the radix N ≥ 128. At the radix of N = 1024, the conventional ONN has 5.2 × 10⁵ MZIs and 1024 cascaded stages. For R = 2 and N1 = ⋯ = Nd = N^(1/d) = 2, the TONN-MW requires 40 cascaded stages and 1920 MZIs, which are 25.6× and 272.8× fewer than the conventional ONN, respectively.
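The quoted numbers follow directly from the Table I formulas; the short Python check below (our own) evaluates them at N = 1024 with R = 2 and all factors equal to 2 (hence d = 10):

```python
# Hardware counts from Table I at N = 1024, with R = 2 and all factors
# N_k = 2 (so d = 10), reproducing the numbers quoted in the text.
N, R, d = 1024, 2, 10
assert 2 ** d == N                                  # N^(1/d) = 2

conv_mzis, conv_stages = N * (N - 1) // 2, N        # conventional ONN
f = 2                                               # N^(1/d)
mw_mzis = d * R * int(N ** 0.5) * (R * f - 1)       # dRN^(1/2)(RN^(1/d) - 1)
mw_stages = d * R * f                               # dRN^(1/d)

print(conv_mzis, mw_mzis)        # 523776 1920
print(conv_stages, mw_stages)    # 1024 40
```

The ratios 523776/1920 ≈ 272.8 and 1024/40 = 25.6 match the savings stated above.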
Comparison of the (a) total number of MZIs and (b) total number of cascaded stages of MZIs between conventional ONN and TONN-MW as a function of the radix N with R = 2, 4, or 8 and N1 = ⋯ = Nd = 2 or 4. The yellow star represents the current state-of-the-art 64 × 64 conventional ONN.7
III. SIMULATIONS FOR TENSORIZED OPTICAL NEURAL NETWORKS
The performance of the TONNs is evaluated by training on the MNIST handwritten digit classification task and compared with conventional ONNs and two-dimensional Fourier transform (2D-FT) pre-processed ONNs (FT-ONNs).10 The conventional ONNs have a configuration similar to that of the TONN but many more parameters (i.e., MZIs) in the synaptic interconnections. As another method to scale ONNs, the FT-ONNs have a total number of MZIs similar to that of the TONNs but fewer neurons in the input layer because the input images are pre-processed. All the compared ONNs are designed with one input layer, one hidden layer, and one output layer.
The conventional ONNs have 784 neurons in the input layer, a varied number (4–2048) of neurons in the single hidden layer, and ten neurons in the output layer. The input data are the 28 × 28 grayscale MNIST images.
The TONN-MWs have 784 neurons in the input layer, 1024 neurons in the hidden layer, and ten neurons in the output layer. The 784 × 1024 and 1024 × 10 synaptic interconnections are factorized as [4 × 7 × 7 × 4 × 4 × 8 × 8 × 4] and [4 × 8 × 8 × 4 × 1 × 5 × 2 × 1] for TT decomposition, respectively. The TT-ranks vary from 1 to 24.
The FT-ONNs have 200 neurons in the input layer, a varied number (20–800) of neurons in the hidden layer, and ten neurons in the output layer. The input grayscale MNIST images are first Fourier transformed and then cropped to 20 × 10. The absolute values of the 200 complex-valued 2D-FT coefficients are used as the input of the FT-ONN.
Training simulations are performed by using the TensorFlow and T3F39 Python libraries. The backpropagation algorithm trains each ONN with the Adam optimizer for ten epochs. Every neuron has a rectified linear unit activation function, and the categorical cross-entropy loss evaluates the network's performance. The synaptic interconnections are assumed to be ideal (error-free). The impact of hardware imperfections on the performance of the TONNs can be found in Ref. 40. For each of the three ONNs listed above, 20 different trials with different random initializations are generated. The maximum test accuracy vs the total number of MZIs and cascaded stages of MZIs is plotted in Figs. 4(a) and 4(b), respectively. Considering that the SVD of an M × N arbitrary matrix gives one M × M and one N × N arbitrary unitary matrix, the total number of MZIs and cascaded stages for an M × N arbitrary weight matrix in the simulation is calculated by M(M − 1)/2 + N(N − 1)/2 and M + N, respectively.
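The SVD-based cost formula above can be wrapped in a small helper; the sketch below (our own, with the 784 × 1024 input-to-hidden layer as the worked example) shows how the per-layer MZI and stage counts used in the simulation are computed:

```python
# MZI and cascaded-stage counts for an M x N arbitrary weight matrix,
# realized via SVD as one M x M and one N x N unitary MZI mesh.
def arbitrary_matrix_cost(M: int, N: int):
    mzis = M * (M - 1) // 2 + N * (N - 1) // 2   # MZIs in both unitary meshes
    stages = M + N                                # cascaded stages in series
    return mzis, stages

# Example: the 784 x 1024 input-to-hidden layer of the conventional ONN
mzis, stages = arbitrary_matrix_cost(784, 1024)
print(mzis, stages)   # 830712 1808
```

This makes explicit why an un-tensorized 784 × 1024 layer alone already needs nearly a million MZIs and 1808 cascaded stages.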
Comparison of the test accuracy of conventional ONN, TONN-MW, and FT-ONN as a function of the total number of (a) MZIs and (b) cascaded stages of MZIs.
To achieve >95% test accuracy, the TONN-MW requires only 3890 MZIs and 157 cascaded stages, which are 79× and 5.2× fewer than the conventional ONN, respectively. With the same total number of MZIs and cascaded stages, TONN-MWs achieve at least 4.7% higher test accuracy than FT-ONNs.
IV. TONN-MW ON HETEROGENEOUS III–V-ON-SILICON MOSCAP PLATFORM
A. Example of a 1024 × 1024 TONN-MW
Figure 5(a) shows the device architecture example of the 1024 × 1024 TONN-MW. Here, 1024 × 1024 is factorized as 8 × 4 × 4 × 8 × 8 × 4 × 4 × 8. Corresponding to Fig. 2(d), the total number of tensor core layers is d = 4, and the total number of wavelengths is g = 32. The TT-rank is set as R0 = R4 = 1 and R1 = R2 = R3 = 2. As a result, the weight matrix is decomposed into four TT-cores with dimensions (Rk−1 × Mk × Nk × Rk) of 1 × 8 × 8 × 2, 2 × 4 × 4 × 2, 2 × 4 × 4 × 2, and 2 × 8 × 8 × 1. Each TT-core contains four 8 × 8 MZI meshes side by side and cross-connects, leading to sixteen 8 × 8 MZI meshes, 448 MZIs, and 32 cascaded stages of MZIs in total. The input neuron signals are first modulated by using thirty-two 32-wavelength WDM microring modulator arrays, then multiplied by each TT-core, and finally detected by using thirty-two 32-wavelength WDM microring add-drop filter and detector arrays. The light source is provided by a 32-wavelength comb laser and power splitters. O/E/O conversions and passive electrical cross-connects enable the wavelength-space cross-connects at TT-core 3.
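The component totals quoted above follow from the rectangular-mesh counting rule; a short check (our own):

```python
# Component count for the 1024 x 1024 TONN-MW example: d = 4 TT-cores,
# each containing four 8 x 8 rectangular MZI meshes in parallel.
mesh_size, meshes_per_core, d = 8, 4, 4

mzis_per_mesh = mesh_size * (mesh_size - 1) // 2    # 28 MZIs per 8 x 8 mesh
total_mzis = d * meshes_per_core * mzis_per_mesh    # 16 meshes in total
total_stages = d * mesh_size                        # cores cascade in series

print(total_mzis, total_stages)   # 448 32
```

Parallel meshes within a TT-core add MZIs but not stages; only the d cores in series contribute to the 32-stage optical path.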
(a) Schematic of the device architecture of a 1024 × 1024 TONN-MW. (b) Schematics of the QD comb laser and MOSCAP microring modulator with a MOSCAP TEM image. (c) The optical spectrum of the QD comb laser. (d) Schematic of the Si–Ge waveguide APD. (e) MOSCAP microring modulator spectra and 28 Gb/s eye diagram. (f) Simulated plasma dispersion effect in a MOSCAP with different HfO2 gate thickness. (g) Si–Ge APD sensitivity vs gain.
B. Heterogeneous III–V-on-silicon MOSCAP platform
The III–V-on-silicon MOSCAP platform is particularly suitable for implementing the TONN-MW since it can heterogeneously integrate all the required devices, including quantum dot (QD) comb lasers,27,28 MOSCAP microring modulators,53,54 MOSCAP phase shifters (MZIs), and QD29 or potentially silicon–germanium (Si–Ge)55–57 avalanche photodetectors (APDs) at wafer scale. Figure 5(b) shows the schematics of the QD comb laser and MOSCAP microring modulator with a MOSCAP transmission electron microscopy (TEM) image. The QD comb laser has a record-wide 10-dB comb width of 25 nm, which means one laser can accommodate more than 280 wavelengths with 15.5 GHz spacing, as shown in Fig. 5(c). Thus, it is feasible to realize the 32-wavelength stream for the 1024 × 1024 TONN-MW. The experimental demonstration of a MOSCAP microring modulator shows good static and dynamic extinction ratios at 28 Gb/s,53 as shown in Fig. 5(e). The experimental results of the MOSCAP showed 10⁹× higher wavelength tuning energy efficiency (Δwavelength/Δtuning power at the nm/pW level) than thermally tuned resonators (nm/mW level), and simulations show VπL = 0.09 V cm, enabling compact 400 μm-long MZIs [Fig. 5(f)].58 Furthermore, the recent, experimentally validated technique of tapping a lock-in signal through the MOSCAP to read the optical-intensity-dependent waveguide conductance as hitless optical monitoring59 eliminates all monitoring PDs for modulators, MZIs, and filters. This results in an enormous reduction in footprint, loss, metal pads, and, subsequently, packaging complexity. Figure 5(d) shows the schematic of a high-responsivity Si–Ge waveguide APD enhanced by loop reflectors.57 Si–Ge APDs can offer an exceptional sensitivity of −30 dBm at this speed [Fig. 5(g)],55–57 lowering the total optical output power and, subsequently, the wall-plug power of the light source (laser and optional amplifier) by a factor of 33.
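The ">280 wavelengths" figure can be reproduced by converting the comb width to optical bandwidth; in the sketch below (our own estimate), the 1310 nm O-band center wavelength is an assumed value:

```python
# Wavelength budget of the QD comb laser: convert the 25 nm 10-dB comb
# width into optical bandwidth and divide by the 15.5 GHz line spacing.
# The 1310 nm center wavelength is an assumption for this estimate.
c = 2.998e8            # speed of light (m/s)
center = 1310e-9       # assumed center wavelength (m)
width = 25e-9          # 10-dB comb width (m)
spacing = 15.5e9       # comb line spacing (Hz)

bandwidth_hz = c * width / center ** 2       # ~4.4 THz of optical bandwidth
n_lines = int(bandwidth_hz / spacing)
print(n_lines)                               # > 280 comb lines
```

This leaves an order-of-magnitude margin over the 32 wavelengths required by the 1024 × 1024 TONN-MW.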
C. Footprint-energy efficiency
Here, the footprint-energy efficiency12,30 [(MAC/J) (MAC/s/mm²)], defined as the product of the workload energy efficiency (MAC/J) and the throughput per footprint area (MAC/s/mm²), is used as the figure of merit (FOM). Table II compares the synapses per neuron (i.e., the scale of the ANN), workload energy efficiency, and footprint-energy efficiency among different ANN hardware technologies, including analog electronics,2,5,41,42 digital electronics,6,43–46 and photonics.47–52 The scale of the photonic implementations is set to 1024 in order to compete with the electrical ANNs. Since 1024 × 1024 integrated conventional ONNs are impractical, the TONN architecture is assumed for all the photonic technologies.
Footprint-energy efficiency among ANN hardware technologies, including analog electronics, digital electronics, and photonic TONNs.
| Category | Technology | Synapses per neuron | MAC/J | Figure of merit ((MAC/J) (MAC/s/mm²)) |
|---|---|---|---|---|
| Analog electronics | NeuroGrid5 | 4096 | 9 × 10⁹ | 8 × 10¹⁵ |
| | HiCANN41 | 224 | 4.7 × 10⁹ | 2 × 10¹⁷ |
| | TrueNorth2 | 256 | 3.3 × 10¹² | >8 × 10¹⁸ |
| | Flash memory42 | 100 | 1.4 × 10¹⁴ | 2.6 × 10²⁷ |
| Digital electronics | NVIDIA Volta6 | 4096 | 2 × 10¹¹ | 2 × 10²² |
| | Google TPU43 | Multiple of 128 | 5.8 × 10¹¹ | 9.5 × 10²² |
| | Graphcore44 | 1216 | 2 × 10¹² | 7 × 10²³ |
| | Memristor crossbar45 | 256 | 6.9 × 10¹² | 6.9 × 10²³ |
| | Groq46 | 640 | 3.3 × 10¹² | 3 × 10²⁴ |
| Photonics | SiPh47–51 TONN | 1024 | 7.3 × 10¹¹ | 9.8 × 10²⁵ |
| | PCM52 TONN | 1024 | 1.1 × 10¹² | 1.4 × 10²⁶ |
| | MOSCAP TONN | 1024 | 6.5 × 10¹⁴ | 4.1 × 10²⁸ |
The energy consumption of the photonic TONNs under static operation mainly consists of five parts: laser wall-plug power, microring modulator power, MZI mesh power, microring add-drop filter power, and PD receiver power. The MOSCAP platform is assumed to use QD comb lasers,27,28 MOSCAP microring modulators,53,54 MOSCAP MZIs, and Si–Ge APDs.55–57 The silicon photonics (SiPh) and phase-change material (PCM) platforms are assumed to use distributed-feedback (DFB) lasers,48 silicon photonic microring modulators,49 SiPh50 (or PCM52) MZIs, and Ge PIN PDs.49,51
For the proposed heterogeneous III–V-on-silicon MOSCAP platform, the laser wall-plug power is mainly decided by the PD sensitivity (SENSPD), the power margin (Power_margin), the total optical insertion loss (ILtotal), and the laser wall-plug efficiency (η) by Plaser = SENSPD × Power_margin × ILtotal/η (with the optical quantities in linear units). The energy stored in the MOSCAP microring modulator can be calculated by E = CVpp²/2, where C is the capacitance and Vpp is the peak-to-peak voltage swing. Assuming a non-return-to-zero (NRZ) modulation pattern, the probability of charging the MOSCAP modulator is 0.25.60 The driver power consumption is assumed to be equal to that of the modulator.61 Thus, the power consumption of the MOSCAP microring modulator can be calculated by Pring = 2 × 0.25 × (CVpp²/2) × B, where B is the modulation bandwidth. With MOSCAP phase tuning, the static power consumption of the MZI meshes and microring add-drop filters is considered to be zero. The total number of multiply-accumulate (MAC) operations per second of an N × N TONN-MW is BN². For a 1024 × 1024 TONN-MW, as in Fig. 5(a), assuming B = 10 GHz, the total power consumption is calculated to be 15.79 W, which corresponds to 6.5 × 10¹⁴ MAC/J. With an estimated area of 165 mm² (a detailed breakdown can be found in the Appendix), the computing throughput per unit area is calculated to be 6.4 × 10¹³ MAC/s/mm². Thus, the footprint-energy efficiency of a 1024 × 1024 TONN-MW is 4.1 × 10²⁸ (MAC/J) (MAC/s/mm²). The detailed calculations can be found in the Appendix.
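The headline FOM arithmetic can be reproduced from the quoted totals; the sketch below (our own, using the 15.79 W power and 165 mm² area stated in the text) agrees with the quoted values to within rounding:

```python
# Footprint-energy efficiency of the 1024 x 1024 TONN-MW from the totals
# quoted in the text: B = 10 GHz, 15.79 W total power, 165 mm^2 area.
B = 10e9               # modulation bandwidth (Hz)
N = 1024               # network radix
P_total = 15.79        # total power consumption (W)
area = 165.0           # estimated chip area (mm^2)

macs_per_s = B * N * N                    # ~1.05e16 MAC/s
mac_per_j = macs_per_s / P_total          # ~6.6e14 MAC/J
density = macs_per_s / area               # ~6.4e13 MAC/s/mm^2
fom = mac_per_j * density                 # ~4.2e28 (MAC/J)(MAC/s/mm^2)
print(f"{mac_per_j:.1e} {density:.1e} {fom:.1e}")
```

The dominant lever is the BN² throughput term: at a fixed power budget, the quadratic scaling of MACs with radix is what lifts the photonic FOM past the electronic entries in Table II.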
With the footprint-energy efficiency as the FOM, the MOSCAP TONN outperforms the digital electronics6,43–46 and most of the analog electronics2,5,41 technologies by factors of 1.4 × 10⁴ and 5.1 × 10⁹, respectively. As shown in Table II, our architecture exhibits a 10× larger number of synapses per neuron and a 15.8× higher FOM compared with analog flash memory technology.42 Compared with TONNs on the other photonic technologies, including SiPh and PCM, the MOSCAP TONN improves the footprint-energy efficiency by 290× for the following reasons: 1. Heterogeneous integration of all the optical components eliminates the coupling loss between discrete chips. 2. MOSCAP MZIs and microring add-drop filters have negligible static phase tuning energy consumption, while SiPh technologies require power to maintain the phase tuning. 3. The high-sensitivity APD significantly reduces the required laser wall-plug power.
D. 3D co-package
The footprint of the proposed 1024 × 1024 MOSCAP TONN can be 27 × 6 mm² in a folded configuration. Thus, integrating the 1024 × 1024 TONN in a single photonic die is achievable, considering that the maximum field size of a deep ultraviolet (DUV) stepper is 27 × 22 mm². The total number of MZIs in the proposed 1024 × 1024 MOSCAP TONN is 4.5× smaller than in Lightmatter's 64 × 64 chip.7 The packaging of photonic tensor cores, CMOS driver chips, and memory chips can be realized by three-dimensional (3D) co-package technology. Recently, several foundries (e.g., AIM Photonics62) have started to provide SiPh interposer services that can co-package SiPh and electronic integrated circuits in a 3D stack using through-silicon vias (TSVs). The 3D co-package technology63 brings CMOS chips closer to the compute nodes (photonic TONN chips) so that the integration density and energy efficiency can be further increased.
V. CONCLUSIONS
This paper proposes a scalable, energy-efficient, and compact TONN architecture on the integrated III–V-on-silicon MOSCAP platform. Based on TT decomposition, high-radix (e.g., 1024 × 1024) synaptic interconnections can be enabled by cascaded small-radix (e.g., 8 × 8) photonic tensor-train cores. The detailed TONN device architectures with single or multiple wavelengths are discussed. Simulation experiments show that the TONN-MW uses 79× fewer MZIs and 5.2× fewer cascaded stages of MZIs compared with the conventional ONN while maintaining a >95% training accuracy for MNIST handwritten digit classification tasks. Furthermore, with the footprint-energy efficiency as the figure of merit, the proposed TONN-MW on the heterogeneous III–V-on-silicon MOSCAP platform outperforms digital electronics and other photonic technologies by factors of 1.4 × 10⁴ and 2.9 × 10², respectively. Our proposed architecture points out a road map for the future physical implementation of ONNs scaling up to 1024 × 1024 and beyond, with significantly reduced hardware requirements and ultra-high energy efficiency.
ACKNOWLEDGMENTS
This work was supported, in part, by the AFOSR under Grant No. FA9550-181-1-0186. The authors would like to thank Geza Kurczveil, Sudharsanan Srinivasan, Stanley Cheung, and Yuan Yuan from Hewlett Packard Labs for providing photonic device parameters. The authors would also like to thank Kaiqi Zhang from the University of California, Santa Barbara, for discussions on tensor-train decomposition.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.
APPENDIX: FOOTPRINT-ENERGY EFFICIENCY CALCULATION
Table III lists the parameters used for the footprint-energy efficiency calculation. The total optical loss for the first half of the 1024 × 1024 TONN-MW [Fig. 5(a)] can be calculated by summing, in dB, the insertion losses encountered along the worst-case optical path, using the parameters listed in Table III.
TABLE III. List of parameters for footprint-energy efficiency calculation.

Parameter | Symbol | MOSCAP | SiPh | PCM
---|---|---|---|---
Data rate per wavelength | B | 10 Gb/s | 10 Gb/s | 10 Gb/s
Laser wall-plug efficiency | η | 10% | 7.1%48 | 7.1%48
Laser coupling insertion loss | IL_laser_coupling | 0 dB | 3.9 dB47 | 3.9 dB47
Microring modulator insertion loss | IL_ring | 1 dB53,54 | 3.9 dB49 | 3.9 dB49
Microring modulator extinction ratio | EXT | 5.5 dB53,54 | 4.2 dB49 | 4.2 dB49
Power penalty by modulator extinction ratio | Penalty_ext | 2.5 dB | 3.5 dB | 3.5 dB
Microring modulator off-resonance through-port insertion loss | IL_ring_off | 0.1 dB | 0.1 dB | 0.1 dB
Microring modulator power consumption | P_ring | 1.3 mW53,54 | 1.54 mW49 | 1.54 mW49
MZI insertion loss | IL_MZI | 0.77 dB | 1.1 dB50 | 1 dB52
MZI phase shifter length | L_phase_shifter | 500 μm | 35 μm50 | 35 μm52
MZI static power consumption | P_MZI | 0 mW | 56 mW50 | 0 mW
Waveguide crossing insertion loss | IL_crossing | 0.017 dB64 | 0.017 dB64 | 0.017 dB64
Microring add-drop filter insertion loss | IL_ring_filter | 0.2 dB | 0.2 dB | 0.2 dB
Microring add-drop filter off-resonance through-port insertion loss | IL_ring_off | 0.1 dB | 0.1 dB | 0.1 dB
Waveguide loss | IL_wg | 2 dB | 2 dB | 2 dB
Power margin | Power_margin | 3 dB | 3 dB | 3 dB
Photodetector sensitivity at 10 Gb/s | SENS_PD | −30 dBm55–57 | −13.9 dBm51 | −13.9 dBm51
Photodetector power consumption | P_PD | 0.5 mW65 | 0.75 mW49 | 0.75 mW49
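The worst-case loss total for the first half of the link can be sketched symbolically from the per-element losses of Table III. This is a sketch only: the counts N_stage (cascaded MZI stages), N_cross (waveguide crossings along the worst-case path), and N_λ (wavelength channels passing off-resonance microrings) are placeholder symbols standing in for the actual layout counts.

```latex
\mathrm{IL_{total}} =
  \mathrm{IL_{laser\_coupling}}
  + \mathrm{IL_{ring}}
  + (N_{\lambda}-1)\,\mathrm{IL_{ring\_off}}
  + N_{\mathrm{stage}}\,\mathrm{IL_{MZI}}
  + N_{\mathrm{cross}}\,\mathrm{IL_{crossing}}
  + \mathrm{IL_{ring\_filter}}
  + \mathrm{IL_{wg}}
  \quad (\mathrm{dB})
```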
Here, the worst-case number of waveguide crossings is considered. The power penalty induced by the microring modulator extinction ratio is calculated by Penalty_ext = 10 log10((10^(EXT/10) + 1)/(10^(EXT/10) − 1)), where EXT is the extinction ratio in dB.
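As a quick sanity check (a minimal sketch, not code from the paper), the standard on-off-keying extinction-ratio penalty formula reproduces the Penalty_ext entries in Table III:

```python
import math

def extinction_ratio_penalty_db(ext_db: float) -> float:
    """Power penalty from a finite modulator extinction ratio.

    Standard OOK link-budget formula: with extinction ratio r (linear),
    penalty = 10*log10((r + 1) / (r - 1)).
    """
    r = 10 ** (ext_db / 10)  # convert dB extinction ratio to linear
    return 10 * math.log10((r + 1) / (r - 1))

# MOSCAP: EXT = 5.5 dB -> ~2.5 dB; SiPh/PCM: EXT = 4.2 dB -> ~3.5 dB,
# matching the Penalty_ext row of Table III.
```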
The required comb laser wall-plug power per wavelength is then obtained by converting the required laser output power (the photodetector sensitivity plus the total optical loss, extinction-ratio penalty, and power margin, all in dB) to milliwatts and dividing by the laser wall-plug efficiency η.
By adding the power consumption of the second half of the TONN-MW, the total power consumption per wavelength is obtained as the sum of the laser wall-plug power and the microring modulator, MZI, and photodetector power consumptions.
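The link-budget arithmetic described above can be sketched as follows. This is an illustrative sketch only: the 20 dB total optical loss used in the example call is an assumed placeholder, not a value from the paper.

```python
def laser_wallplug_power_mw(sens_dbm: float, loss_db: float,
                            penalty_db: float, margin_db: float,
                            eta: float) -> float:
    """Comb-laser wall-plug power per wavelength (mW).

    The laser must emit enough power that, after the total optical loss,
    extinction-ratio penalty, and power margin, the photodetector still
    receives its sensitivity level; dividing by the wall-plug efficiency
    converts the required optical output to electrical input power.
    """
    required_output_dbm = sens_dbm + loss_db + penalty_db + margin_db
    return (10 ** (required_output_dbm / 10)) / eta

def total_power_per_wavelength_mw(p_laser_mw: float, p_ring_mw: float,
                                  p_mzi_mw: float, p_pd_mw: float) -> float:
    """Sum of laser, modulator, MZI static, and photodetector power (mW)."""
    return p_laser_mw + p_ring_mw + p_mzi_mw + p_pd_mw

# Example with MOSCAP parameters from Table III and an assumed 20 dB loss:
p_laser = laser_wallplug_power_mw(-30.0, 20.0, 2.5, 3.0, 0.10)
p_total = total_power_per_wavelength_mw(p_laser, 1.3, 0.0, 0.5)
```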
The estimated footprint of the MOSCAP TONN-MW is 55 × 3 mm2 = 165 mm2, comprising 45 × 3 mm2 for the MZI meshes, 1 × 3 mm2 for the QD comb laser and power splitters, 4 × 3 mm2 for the MOSCAP microring modulator arrays, 4 × 3 mm2 for the MOSCAP microring add-drop filter and APD arrays, and 1 × 3 mm2 for the electrical cross-connects.
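A one-line sanity check of this footprint breakdown (segment lengths in mm, common width 3 mm):

```python
# Segment lengths: MZI meshes, comb laser + splitters, modulator arrays,
# add-drop filter + APD arrays, electrical cross-connects.
segment_lengths_mm = [45, 1, 4, 4, 1]
width_mm = 3
total_length_mm = sum(segment_lengths_mm)   # 55 mm
footprint_mm2 = total_length_mm * width_mm  # 165 mm^2
```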