This paper proposes a large-scale, energy-efficient, high-throughput, and compact tensorized optical neural network (TONN) exploiting the tensor-train decomposition architecture on an integrated III–V-on-silicon metal–oxide–semiconductor capacitor (MOSCAP) platform. By using cascaded multi-wavelength small-radix (e.g., 8 × 8) tensor cores, the proposed TONN architecture is scalable to 1024 × 1024 synapses and beyond, a scale that is extremely difficult for conventional integrated ONN architectures to reach. Simulation experiments show that the proposed TONN uses 79× fewer Mach–Zehnder interferometers (MZIs) and 5.2× fewer cascaded stages of MZIs than the conventional ONN while maintaining a >95% training accuracy for Modified National Institute of Standards and Technology (MNIST) handwritten digit classification tasks. Furthermore, with the proven heterogeneous III–V-on-silicon MOSCAP platform, our proposed TONN can improve the footprint-energy efficiency by a factor of 1.4 × 10⁴ compared with digital electronic artificial neural network (ANN) hardware and a factor of 2.9 × 10² compared with silicon photonic and phase-change material technologies. Thus, this paper charts a road map for implementing large-scale ONNs with a number of synapses similar to that of electronic ANNs and superior energy efficiency.

Artificial neural networks (ANNs) have proven their remarkable capabilities in various tasks, including computer vision, speech recognition, machine translation, medical diagnosis, and the game of Go.1 Neuromorphic computing accelerators, such as IBM TrueNorth2 and Intel Loihi,3 have shown significantly superior performance compared with traditional central processing units (CPUs) for specific neural network tasks. A majority of the energy consumption of electrical ANN hardware comes from data movement in the synaptic interconnections. Optical neural networks (ONNs), also known as photonic neural networks, are expected to significantly improve energy efficiency and throughput compared with electrical ANNs because they transmit data at the speed of light without a length-dependent impedance.4 However, two main challenges prevent ONNs from achieving performance competitive with electrical ANNs.

The first challenge is limited scalability. While electrical ANN hardware is capable of achieving 4096 synaptic connections per neuron,5,6 the scale of state-of-the-art ONNs is limited to 64 × 647 or smaller. This is because the synaptic interconnections in conventional ONNs typically rely on Mach–Zehnder interferometer (MZI) meshes.8 Scaling up to high-radix N × N meshes requires O(N²) MZIs and O(N) cascaded stages, which lead to an insurmountable footprint, optical loss budget, and control complexity. For example, Lightmatter's Mars device7 integrates a 64 × 64 micro-electro-mechanical systems (MEMS) MZI mesh on a 150 mm² chip. With the same architecture and device platform, the predicted chip size for 1024 × 1024 would be larger than an 8-in. wafer, and the optical insertion loss of 1024 cascaded stages of MEMS9 MZIs would be >675 dB, which is far too high to compensate with optical amplifiers. The limited scalability forces conventional ONNs to rely on pre-processing10 and convolutional layers11 to handle meaningful machine learning (ML) datasets (e.g., the ImageNet dataset). However, convolutional neural networks (CNNs) are only efficient for specific tasks (e.g., image classification and time series) and require many layers. On the other hand, some works propose to pursue large-scale ONNs with a scale comparable to that of electrical ANN hardware.12–14 However, these efforts retain the conventional ONN architecture,12 reduce the throughput by encoding the input data in the time domain,13 or require bulky free-space devices.14

The second challenge of ONNs is the lack of a device platform that can monolithically integrate optical neurons with photodetectors (PDs), electrical neuron circuits, light emitters, and synaptic interconnections on silicon. By sharing most of its fabrication process steps with the silicon complementary metal–oxide–semiconductor (CMOS) process, silicon photonics (SiPh)15,16 has proven to be a desirable platform for low-cost, high-volume commercial electronic and photonic integrated circuit (EPIC) manufacturing. However, since silicon is an indirect-bandgap material, silicon light emitters are inefficient. Aligning III–V diode laser chips to SiPh chips induces additional coupling losses and packaging complexity, limiting energy efficiency and integration density.

To mitigate these two challenges, on the architecture side, tensor-train (TT) decomposed synaptic interconnections have been proposed to realize large-scale ONNs with reduced hardware resources.17,18 As an effective approach to significantly compress the over-parameterized fully connected layers in ANNs,19,20 TT decomposition has been shown to achieve almost the same accuracy as an ensemble of deep CNNs21 on various tasks, including Markov random fields,22 image recognition,20,23,24 and video classification.24,25 On the device platform side, although the tensorized ONN (TONN) can be implemented in various photonic platforms, heterogeneous III–V-on-silicon integration is an optimal choice.26 Integration of quantum dot (QD) comb lasers27,28 and avalanche photodiodes (APDs)29 with other SiPh devices at wafer scale further improves the energy efficiency of the TONN.

This paper proposes a large-scale, energy-efficient, high-throughput, and compact tensorized ONN (TONN) architecture on a densely integrated III–V-on-silicon metal–oxide–semiconductor capacitor (MOSCAP) platform. The TT decomposition makes the proposed architecture scalable to 1024 × 1024 and beyond, which is extremely difficult for conventional integrated ONNs. The detailed device architecture of the TONN is designed based on a multi-wavelength configuration that does not require intermediate memory. Simulations show that the proposed TONNs utilize 79× fewer MZIs and 5.2× fewer cascaded stages of MZIs than conventional ONNs while maintaining a >95% accuracy for Modified National Institute of Standards and Technology (MNIST) handwritten digit classification tasks. Moreover, the proposed monolithic III–V-on-silicon MOSCAP platform provides MOSCAP synaptic MZIs with negligible static phase tuning energy consumption, a photodetector-free hitless monitoring scheme that eliminates more than O(N²) control elements and pads, and dense wavelength division multiplexing (DWDM)-based neurons. With all the experimentally proven building blocks, the footprint-energy efficiency30 [(MAC/J) (MAC/s/mm²)] of the TONNs can be further improved by a factor of 2.9 × 10² compared with other photonic platforms and a factor of 1.4 × 10⁴ compared with state-of-the-art digital electronic ANNs.

The remainder of this paper is organized as follows. Section II introduces the principle of TT decomposition and the device architecture of the TONN. Section III simulates the TONN using MNIST handwritten digit classification tasks and compares the test accuracy with conventional ONNs. Section IV reports the implementation of the TONN on the heterogeneous III–V-on-silicon MOSCAP platform and compares the footprint-energy efficiency with other ANN technologies. Finally, Sec. V concludes this paper.

ONNs typically consist of an input neuron layer, multiple hidden neuron layers, an output neuron layer, and synaptic interconnections, which can be abstracted as arbitrary weight matrices W, as shown in Fig. 1(a). By utilizing singular value decomposition (SVD), an arbitrary weight matrix W can be decomposed into two arbitrary unitary matrices and an array of additional phase shifters and amplitude modulators: W = UΣV*. In ONNs, the arbitrary unitary matrices U and V are typically realized by "loss-less" MZI meshes in a "rectangular"8 or "triangular"31 configuration. Each 2 × 2 MZI contains two phase shifters and two 50/50 optical power splitters. However, an N × N MZI mesh requires N(N − 1)/2 MZIs and N cascaded stages.8 Although MEMS32 or non-volatile33 technologies can reduce the length of the phase shifters to tens of microns, MZI meshes remain difficult to scale up to high radix.
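The SVD programming step and the per-unitary MZI cost above can be checked with a minimal NumPy sketch (the 8 × 8 size and random seed are illustrative assumptions, not values from the paper):

```python
import numpy as np

# A hypothetical 8 x 8 real weight matrix, factored as W = U @ diag(S) @ Vh,
# the same decomposition used to program an MZI-mesh ONN.
rng = np.random.default_rng(0)
N = 8
W = rng.standard_normal((N, N))

U, S, Vh = np.linalg.svd(W)      # U, Vh are unitary; S holds the singular values
W_rec = U @ np.diag(S) @ Vh      # reconstruction from the three optical stages
assert np.allclose(W, W_rec)

# Hardware cost of realizing each N x N unitary as a "rectangular" MZI mesh
mzis_per_unitary = N * (N - 1) // 2  # 28 MZIs for N = 8
stages_per_unitary = N               # 8 cascaded stages
print(mzis_per_unitary, stages_per_unitary)
```

The quadratic growth of `mzis_per_unitary` with N is exactly the scaling bottleneck the TT decomposition addresses in the following sections.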

FIG. 1.

(a) Schematic of the ONN architecture with input layers, hidden layers, output layers, and synaptic interconnections. Each synaptic interconnection is a linear operation represented by an arbitrary weight matrix W. (b) Weight matrix TT decomposition for parameter compression.


Here, italic lowercase letters, italic uppercase letters, and bold italic uppercase letters are used to represent vectors, matrices, and tensors, respectively. To represent the weight matrix W ∈ ℝ^{M×N} in the TT format, the matrix dimensions M and N are first assumed to factor as M = ∏_{k=1}^{d} M_k and N = ∏_{k=1}^{d} N_k, where d is the number of factors of M and N. Let μ and ν be the natural bijections from the indices (i, j) of W to the indices [μ_1(i), ν_1(j), …, μ_d(i), ν_d(j)] of an order-2d weight tensor W. Then, W(i, j) = W(μ_1(i), ν_1(j), …, μ_d(i), ν_d(j)). TT decomposition can be interpreted as an SVD of multi-dimensional matrices. As shown in Fig. 1(b), the TT decomposition expresses the tensor W as a series of tensor products,19,20,34

W(μ_1(i), ν_1(j), …, μ_d(i), ν_d(j)) = ∏_{k=1}^{d} G_k(:, μ_k(i), ν_k(j), :),
(1)

where the four-way tensor G_k ∈ ℝ^{R_{k−1}×M_k×N_k×R_k} is the TT-core and the total number of TT-cores is d. The vector (R_0, R_1, …, R_d) is the TT-rank, with R_0 = R_d = 1. In this way, the total number of parameters is reduced from M × N to the sum of the parameters of the small TT-cores, i.e., ∑_{k=1}^{d} R_{k−1} M_k N_k R_k.
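The entry-wise reconstruction in Eq. (1) can be sketched in a few lines of NumPy; the factor sizes and TT-ranks below are illustrative assumptions chosen small enough to verify by inspection:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative factorization: M = N = 8 with M_k = N_k = 2, d = 3, TT-ranks (1, 2, 2, 1).
Mk, Nk, ranks = [2, 2, 2], [2, 2, 2], [1, 2, 2, 1]
d = len(Mk)
cores = [rng.standard_normal((ranks[k], Mk[k], Nk[k], ranks[k + 1])) for k in range(d)]

def digits(idx, bases):
    """Mixed-radix digits of idx, most-significant factor first (the bijections mu/nu)."""
    out = []
    for b in reversed(bases):
        out.append(idx % b)
        idx //= b
    return out[::-1]

# Eq. (1): W(i, j) = prod_k G_k[:, mu_k(i), nu_k(j), :], a chain of small matrix products
M, N = int(np.prod(Mk)), int(np.prod(Nk))
W = np.empty((M, N))
for i in range(M):
    for j in range(N):
        mu, nu = digits(i, Mk), digits(j, Nk)
        acc = np.eye(1)
        for k in range(d):
            acc = acc @ cores[k][:, mu[k], nu[k], :]  # (R_{k-1}, R_k) slice
        W[i, j] = acc[0, 0]

full_params = M * N
tt_params = sum(ranks[k] * Mk[k] * Nk[k] * ranks[k + 1] for k in range(d))
print(W.shape, full_params, tt_params)  # (8, 8): 64 dense parameters vs 8 + 16 + 8 = 32
```

Because R_0 = R_d = 1, the chain of slice products collapses to a scalar, giving one dense matrix entry per (i, j) pair.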

The choice of TT-ranks influences both the training accuracy and the complexity of the TONN. Smaller TT-ranks usually lead to lower training accuracy, while higher TT-ranks lead to higher computing and storage complexity. TT-rank determination is a nondeterministic polynomial time (NP)-hard problem.35 Traditional methods require tuning the ranks manually,20 which often leads to suboptimal results because different TT layers may require different TT-ranks. The authors of Ref. 36 proposed a heuristic binary search method to find the smallest TT-ranks for a predetermined accuracy threshold. The authors of Ref. 37 further provided heuristics that guide the selection of TT-ranks. The authors of Ref. 38 used a Bayesian method to optimize the TT-ranks along with the model parameters. For convenience, the TT-ranks in this paper are determined empirically.

1. Tensor-train layers

To achieve the tensor-train layers, as shown in Fig. 1(b), the vectors x and y = Wx representing the input and output of the weight matrix W need to be reshaped into tensor format by x(j) = x(ν_1(j), …, ν_d(j)) and y(i) = y(μ_1(i), …, μ_d(i)), where x ∈ ℝ^N, y ∈ ℝ^M, x ∈ ℝ^{N_d×⋯×N_1}, and y ∈ ℝ^{M_d×⋯×M_1}. Then, in TT format, the linear transformation by the weight tensor W can be represented by20

y(μ_1(i), …, μ_d(i)) = ∑_j ∏_{k=1}^{d} G_k(:, μ_k(i), ν_k(j), :) x(ν_1(j), …, ν_d(j)).
(2)

Based on Eq. (2), x needs to be multiplied by each TT-core in the sequence G_d, G_{d−1}, …, G_1, as shown in Fig. 2(a). The term I_j denotes the intermediate tensor: I_1 = G_d x, I_2 = G_{d−1} I_1, …, y = I_d = G_1 I_{d−1}. Then, at TT-core k,

I_{d−k+1} = G_k I_{d−k},
(3)

where k = 1, 2, …, d. Equation (3) can be physically implemented as a two-step operation. First, G_k ∈ ℝ^{R_{k−1}×M_k×N_k×R_k} multiplies I_{d−k} ∈ ℝ^{R_k×N_k×M_d×⋯×M_{k+1}×N_{k−1}×⋯×N_1} to give G_k I_{d−k} ∈ ℝ^{R_{k−1}×M_k×M_d×⋯×M_{k+1}×N_{k−1}×⋯×N_1}. Second, G_k I_{d−k} is re-indexed to I_{d−k+1} ∈ ℝ^{R_{k−1}×N_{k−1}×M_d×⋯×M_k×N_{k−2}×⋯×N_1} so that each TT layer is modular.
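The core-by-core "multiply then re-index" operation above can be sketched numerically. The NumPy sketch below contracts the cores in the order G_1, …, G_d (the reverse of the G_d-first ordering used in the text; the multilinear summations commute, so the result is identical) and checks it against the dense matrix rebuilt from Eq. (1); all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative sizes: d = 3, M_k = N_k = 2 (so M = N = 8), TT-ranks (1, 2, 2, 1).
Mk, Nk, ranks = [2, 2, 2], [2, 2, 2], [1, 2, 2, 1]
d = len(Mk)
cores = [rng.standard_normal((ranks[k], Mk[k], Nk[k], ranks[k + 1])) for k in range(d)]
x = rng.standard_normal(int(np.prod(Nk)))

# --- TT matrix-vector product, one core at a time (the two-step operation) ---
# The intermediate keeps a rank axis in front; each step absorbs (R_{k-1}, N_k)
# and appends M_k, i.e. multiply by G_k, then re-index for the next stage.
Z = x.reshape(Nk)[None]                      # shape (R_0 = 1, N_1, N_2, N_3)
for k in range(d):
    Z = np.einsum('amnb,an...->b...m', cores[k], Z)
y_tt = Z.reshape(-1)                         # final shape (R_d = 1, M_1, M_2, M_3)

# --- Reference: rebuild the dense W from Eq. (1) and multiply directly ---
def digits(idx, bases):
    out = []
    for b in reversed(bases):
        out.append(idx % b)
        idx //= b
    return out[::-1]

M, N = int(np.prod(Mk)), int(np.prod(Nk))
W = np.empty((M, N))
for i in range(M):
    for j in range(N):
        mu, nu = digits(i, Mk), digits(j, Nk)
        acc = np.eye(1)
        for k in range(d):
            acc = acc @ cores[k][:, mu[k], nu[k], :]
        W[i, j] = acc[0, 0]

assert np.allclose(y_tt, W @ x)              # sequential core products == dense matvec
```

The `assert` is the point of the sketch: cascading small per-core contractions reproduces the full matrix-vector product without ever forming W, which is exactly what the cascaded photonic TT-cores exploit.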

FIG. 2.

(a) Schematic of the products between the input data and the TT-cores in sequence. (b) N × N unitary matrix represented by a “rectangular” MZI mesh with 2 × 2 MZIs as the building blocks. (c) The proposed device architecture of the M × N TONN with a single wavelength (TONN-SW). (d) The proposed device architecture of the M × N TONN with multiple wavelengths (TONN-MWs). d is the number of factors of M and N and the number of tensor core layers. g is the total number of wavelengths for TONN-MW.


The physical implementation of TT-layers usually requires memory between each adjacent TT-core to store the intermediate data.23 Here, two “memory-free” device architectures of TONN are proposed with cascaded photonic TT-cores consisting of small-radix MZI meshes and passive cross-connects.

2. Single-wavelength implementation

To emulate the tensor product at the TT layer, the first approach is the TONN with a single wavelength (TONN-SW), as shown in Fig. 2(c). At TT-core k, where k = d, …, 1, the input tensor I_{d−k} is represented by h_k groups of N_k R_k optical signals at λ0, where h_k = M_d ⋯ M_{k+1} N_{k−1} ⋯ N_1. The TT-core G_k is represented by h_k identical R_{k−1}M_k × N_kR_k MZI meshes placed side-by-side. The MZI meshes are realized by arranging 2 × 2 MZIs in a "rectangular" mesh, as shown in Fig. 2(b). By sending the input optical signals into the MZI meshes, the product between G_k and I_{d−k} is implemented; an optical passive cross-connect then switches the indices of M_k and N_{k−1} in the output tensor G_k I_{d−k}. The output of the cross-connect is I_{d−k+1}, which is the input of the next TT layer.

3. Multi-wavelength implementation

By adding parallelism in the wavelength domain using wavelength division multiplexing (WDM) technology, the TONN with multiple wavelengths (TONN-MW) can save a considerable number of MZIs compared with TONN-SW, as shown in Fig. 2(d). At TT-core k, the input tensor I_{d−k} is first encoded on the WDM channels λ1, λ2, …, λg, where g = N_{d/2} ⋯ N_1. For the first half of the TT-cores G_d, …, G_{d/2+1}, only the indices N_d, …, N_{d/2+1} of the input data x are used for multiplication with the TT-cores. Similar to TONN-SW, h_k groups of N_k R_k g-wavelength WDM signals first go through h_k identical R_{k−1}M_k × N_kR_k MZI meshes. Here, h_k = M_d ⋯ M_{k+1} N_{k−1} ⋯ N_{d/2+1} for d/2 < k < d or M_{d/2} ⋯ M_{k+1} N_{k−1} ⋯ N_1 for k ≤ d/2. Then, for k ≥ d/2 + 1, an optical passive cross-connect is used to switch the indices of M_k and N_{k−1} in the output tensor G_k I_{d−k} and gives I_{d−k+1}.

At TT-core [d/2 + 1], the output of the MZI meshes G_{d/2+1} I_{d/2−1} utilizes wavelength-space cross-connects to switch the indices between the wavelength domain (N_{d/2}, …, N_1) and the space domain (M_d, …, M_{d/2+1}). The wavelength-space cross-connects, which can be realized by using WDM transponders, are fixed (no reconfiguration needed) once the TONN architecture is decided. As a result, the output I_{d/2} ∈ ℝ^{R_{d/2}×N_{d/2}×M_d×⋯×M_{d/2+1}×N_{d/2−1}×⋯×N_1} is represented by h_{d/2} groups of N_{d/2}R_{d/2} g-wavelength WDM signals, where g = M_d ⋯ M_{d/2+1}. In this way, for the second half of the TT-cores G_{d/2}, …, G_1, the indices N_{d/2}, …, N_1 of the input data x can be used for multiplication with the TT-cores. Note that, here, d is assumed to be an even number. For d being an odd number, the wavelength-space cross-connects happen at TT-core [(d + 1)/2].

To evaluate the scalability of the TONN, Table I compares the total number of MZIs and the total number of cascaded stages of MZIs among the conventional ONN, TONN-SW, and TONN-MW. Letting M = N, M_1 = ⋯ = M_d = N_1 = ⋯ = N_d = N^{1/d}, and R_0 = ⋯ = R_d = R, both TONN-SW and TONN-MW reduce the total number of cascaded stages from N to dRN^{1/d} compared with the conventional ONN. The TONN-SW can enable all-optical tensor core products without optical-to-electrical-to-optical (O/E/O) conversions; however, its saving in the total number of MZIs is not significant. On the other hand, the TONN-MW can reduce the total number of MZIs from N(N − 1)/2 to dRN^{1/2}(RN^{1/d} − 1) at the expense of only one layer of O/E/O conversion for the wavelength-space cross-connects at TT-core [d/2 + 1].

TABLE I.

Comparison of the total number of MZIs and the total number of cascaded stages of MZIs among conventional ONN, TONN-SW, and TONN-MW.

                  Total number of MZIs          Total number of cascaded stages of MZIs
Conventional ONN  N(N − 1)/2                    N
TONN-SW           dRN(RN^{1/d} − 1)             dRN^{1/d}
TONN-MW           dRN^{1/2}(RN^{1/d} − 1)       dRN^{1/d}

Figures 3(a) and 3(b) compare the total number of MZIs and the total number of cascaded stages of MZIs between the conventional ONN and the TONN-MW as a function of the radix N ≥ 128. At the radix N = 1024, the conventional ONN has 5.2 × 10⁵ MZIs and 1024 cascaded stages. For R = 2 and N_1 = ⋯ = N_d = N^{1/d} = 2 (i.e., d = 10), the TONN-MW requires 40 cascaded stages and 1920 MZIs, which are 25.6× and 272.8× fewer than the conventional ONN, respectively.
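The quoted savings follow directly from the Table I formulas; a short Python check at this operating point:

```python
import math

# Table I formulas evaluated at the operating point quoted in the text:
# N = 1024 with N_k = 2 for all k (so d = 10) and TT-rank R = 2.
N, R = 1024, 2
d = int(math.log2(N))            # 10 factors of 2
nk = round(N ** (1 / d))         # N**(1/d) = 2

conv_mzis = N * (N - 1) // 2     # conventional ONN: 523,776 MZIs
conv_stages = N                  # 1024 cascaded stages
mw_mzis = d * R * round(N ** 0.5) * (R * nk - 1)   # dRN^{1/2}(RN^{1/d} - 1) = 1920
mw_stages = d * R * nk           # dRN^{1/d} = 40

print(conv_mzis / mw_mzis, conv_stages / mw_stages)  # ~272.8x and 25.6x
```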

FIG. 3.

Comparison of the (a) total number of MZIs and (b) total number of cascaded stages of MZIs between conventional ONN and TONN-MW as a function of the radix N with R = 2, 4, or 8 and N1 = ⋯ = Nd = 2 or 4. The yellow star represents the current state-of-the-art 64 × 64 conventional ONN.7 


The performance of the TONNs is evaluated by training on the MNIST handwritten digit classification task and comparing with conventional ONNs and two-dimensional Fourier transform (2D-FT) pre-processed ONNs (FT-ONNs).10 The conventional ONNs have a configuration similar to that of the TONN while having many more parameters (i.e., MZIs) in the synaptic interconnections. As another method to scale ONNs, the FT-ONNs have a total number of MZIs similar to that of the TONNs but fewer neurons in the input layer due to pre-processing of the input images. All the compared ONNs are designed with one input layer, one hidden layer, and one output layer.

  1. The conventional ONNs have 784 neurons in the input layer, a varied number (4–2048) of neurons in the single hidden layer, and ten neurons in the output layer. The input data are the 28 × 28 grayscale MNIST images.

  2. The TONN-MWs have 784 neurons in the input layer, 1024 neurons in the hidden layer, and ten neurons in the output layer. The 784 × 1024 and 1024 × 10 synaptic interconnections are factorized as [4 × 7 × 7 × 4 × 4 × 8 × 8 × 4] and [4 × 8 × 8 × 4 × 1 × 5 × 2 × 1] for TT decomposition, respectively. The TT-ranks vary from 1 to 24.

  3. The FT-ONNs have 200 neurons in the input layer, a varied number (20–800) of neurons in the hidden layer, and ten neurons in the output layer. The input grayscale MNIST images are first Fourier transformed and then cropped to 20 × 10. The absolute values of the 200 complex-valued 2D-FT coefficients are used as the input of the FT-ONN.
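As an illustration of the compression achieved by the factorization in item 2, the parameter count of the TT-decomposed 784 × 1024 layer can be computed from the core shapes; the uniform internal rank of 16 below is an assumed illustrative value inside the 1–24 range quoted above (the trained ranks vary per core):

```python
def tt_params(Mk, Nk, ranks):
    """Parameter count of a TT layer: sum over k of R_{k-1} * M_k * N_k * R_k."""
    return sum(r0 * m * n * r1 for r0, m, n, r1 in zip(ranks[:-1], Mk, Nk, ranks[1:]))

# The 784 x 1024 layer factorized as in the text: 784 = 4*7*7*4 and 1024 = 4*8*8*4,
# paired per core as (M_k, N_k) = (4,4), (7,8), (7,8), (4,4).
R = 16                                   # illustrative uniform internal TT-rank
Mk, Nk = [4, 7, 7, 4], [4, 8, 8, 4]
layer1 = tt_params(Mk, Nk, [1, R, R, R, 1])
dense1 = 784 * 1024

print(layer1, dense1, round(dense1 / layer1, 1))  # 29184 802816 27.5
```

Even at this relatively high rank, the TT layer holds roughly 27× fewer parameters than its dense counterpart.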

Training simulations are performed using the TensorFlow and T3F39 Python libraries. The backpropagation algorithm trains each ONN with the Adam optimizer for ten epochs. Every neuron has a rectified linear unit activation function, and the categorical cross-entropy loss evaluates the network's performance. The synaptic interconnections are assumed to be ideal (error-free). The impact of hardware imperfections on the performance of the TONNs can be found in Ref. 40. For each of the three ONNs listed above, 20 trials with different random initializations are generated. The maximum test accuracy vs the total number of MZIs and the total number of cascaded stages of MZIs is plotted in Figs. 4(a) and 4(b), respectively. Considering that the SVD of an arbitrary M × N matrix gives one M × M and one N × N arbitrary unitary matrix, the total number of MZIs and cascaded stages for an M × N arbitrary weight matrix in the simulation is calculated as M(M − 1)/2 + N(N − 1)/2 and M + N, respectively.

FIG. 4.

Comparison of the test accuracy of conventional ONN, TONN-MW, and FT-ONN as a function of the total number of (a) MZIs and (b) cascaded stages of MZIs.


To achieve >95% test accuracy, the TONN-MW only requires 3890 MZIs and 157 cascaded stages, which are 79× and 5.2× fewer than the conventional ONN, respectively. With the same total number of MZIs and cascaded stages, TONN-MWs achieve at least 4.7% higher test accuracy than FT-ONNs.

Figure 5(a) shows an example device architecture of the 1024 × 1024 TONN-MW. Here, 1024 × 1024 is factorized as 8 × 4 × 4 × 8 × 8 × 4 × 4 × 8. Corresponding to Fig. 2(d), the total number of tensor core layers is d = 4, and the total number of wavelengths is g = 32. The TT-rank is set as R_0 = R_4 = 1 and R_1 = R_2 = R_3 = 2. As a result, the weight matrix is decomposed into four TT-cores with the dimensions G_4 ∈ ℝ^{1×8×4×2}, G_3 ∈ ℝ^{2×4×4×2}, G_2 ∈ ℝ^{2×4×4×2}, and G_1 ∈ ℝ^{2×4×8×1}. Each TT-core contains four 8 × 8 MZI meshes side-by-side and cross-connects, leading to sixteen 8 × 8 MZI meshes, 448 MZIs, and 32 cascaded stages of MZIs in total. The input neuron signals are first modulated using thirty-two 32-wavelength WDM microring modulator arrays, then multiplied by each TT-core, and finally detected using thirty-two 32-wavelength WDM microring add-drop filter and detector arrays. The light source is provided by a 32-wavelength comb laser and power splitters. O/E/O conversions and passive electrical cross-connects enable the wavelength-space cross-connects at TT-core 3.
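The mesh counts quoted above follow from the core dimensions; a short bookkeeping check (each core's mesh size is the flattened input dimension R_in·M_k by output dimension N_k·R_out of its four-way tensor):

```python
# Bookkeeping for the 1024 x 1024 TONN-MW example in the text.
# Core shapes (r_in, M_k, N_k, r_out), listed in the order applied: G4, G3, G2, G1.
core_shapes = [(1, 8, 4, 2), (2, 4, 4, 2), (2, 4, 4, 2), (2, 4, 8, 1)]

mesh_dims = [(r_in * m, n * r_out) for (r_in, m, n, r_out) in core_shapes]
assert all(dims == (8, 8) for dims in mesh_dims)    # every core maps to 8 x 8 meshes

meshes_per_core = 4                                 # h_k = 4 copies side-by-side
total_meshes = meshes_per_core * len(core_shapes)   # 16
total_mzis = total_meshes * (8 * 7 // 2)            # 28 MZIs per 8 x 8 mesh
total_stages = len(core_shapes) * 8                 # 8 cascaded stages per core
print(total_meshes, total_mzis, total_stages)       # 16 448 32
```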

FIG. 5.

(a) Schematic of the device architecture of a 1024 × 1024 TONN-MW. (b) Schematics of the QD comb laser and MOSCAP microring modulator with a MOSCAP TEM image. (c) The optical spectrum of the QD comb laser. (d) Schematic of the Si–Ge waveguide APD. (e) MOSCAP microring modulator spectra and 28 Gb/s eye diagram. (f) Simulated plasma dispersion effect in a MOSCAP with different HfO2 gate thickness. (g) Si–Ge APD sensitivity vs gain.


The III–V-on-silicon MOSCAP platform is particularly suitable for implementing the TONN-MW since it can heterogeneously integrate all the required devices at wafer scale, including quantum dot (QD) comb lasers,27,28 MOSCAP microring modulators,53,54 MOSCAP phase shifters (MZIs), and QD29 or potentially silicon–germanium (Si–Ge)55–57 avalanche photodetectors (APDs). Figure 5(b) shows the schematics of the QD comb laser and MOSCAP microring modulator with a MOSCAP transmission electron microscopy (TEM) image. The QD comb laser has a record-wide 10-dB comb width of 25 nm, which means one laser can accommodate more than 280 wavelengths with 15.5 GHz spacing, as shown in Fig. 5(c). Thus, it is feasible to realize the 32-wavelength stream for the 1024 × 1024 TONN-MW. The experimental demonstration of a MOSCAP microring modulator shows good static and dynamic extinction ratios at 28 Gb/s,53 as shown in Fig. 5(e). Experimental results show that the MOSCAP has 10⁹× higher wavelength tuning energy efficiency (Δwavelength/Δtuning power at the nm/pW level) than thermally tuned resonators (at the nm/mW level), and simulations show VπL = 0.09 V cm, enabling compact 400 μm-long MZIs [Fig. 5(f)].58 Furthermore, the recently experimentally validated scheme of tapping a lock-in signal through the MOSCAP to read the optical-intensity-dependent waveguide conductance as hitless optical monitoring59 eliminates all monitoring PDs for modulators, MZIs, and filters, resulting in an enormous reduction in footprint, loss, metal pads, and, subsequently, packaging complexity. Figure 5(d) shows the schematic of a high-responsivity Si–Ge waveguide APD enhanced by loop reflectors.57 Si–Ge APDs can offer an exceptional sensitivity of −30 dBm at this speed [Fig. 5(g)],55–57 lowering the total optical output power and, subsequently, the wall-plug power of the light source (laser and optional amplifier) by a factor of 33.

Here, the footprint-energy efficiency12,30 [(MAC/J) (MAC/s/mm²)], defined as the product of the workload energy efficiency (MAC/J) and the throughput per footprint area (MAC/s/mm²), is used as the figure of merit (FOM). Table II compares the synapses per neuron (i.e., the scale of the ANN), workload energy efficiency, and footprint-energy efficiency among different ANN hardware technologies, including analog electronics,2,5,41,42 digital electronics,6,43–46 and photonics.47–52 The scale of the photonic implementations is set to 1024 in order to compete with the electrical ANNs. Since 1024 × 1024 integrated conventional ONNs are impractical, the TONN architecture is assumed for all the photonic technologies.

TABLE II.

Footprint-energy efficiency among ANN hardware technologies, including analog electronics, digital electronics, and photonic TONNs.

Category             Technology              Synapses per neuron   MAC/J        FOM (MAC/J MAC/s/mm²)
Analog electronics   NeuroGrid5              4096                  9 × 10⁹      8 × 10¹⁵
                     HiCANN41                224                   4.7 × 10⁹    2 × 10¹⁷
                     TrueNorth2              256                   3.3 × 10¹²   >8 × 10¹⁸
                     Flash memory42          100                   1.4 × 10¹⁴   2.6 × 10²⁷
Digital electronics  NVIDIA Volta6           4096                  2 × 10¹¹     2 × 10²²
                     Google TPU43            Multiple of 128       5.8 × 10¹¹   9.5 × 10²²
                     Graphcore44             1216                  2 × 10¹²     7 × 10²³
                     Memristor crossbar45    256                   6.9 × 10¹²   6.9 × 10²³
                     Groq46                  640                   3.3 × 10¹²   3 × 10²⁴
Photonics            SiPh47–51 TONN          1024                  7.3 × 10¹¹   9.8 × 10²⁵
                     PCM52 TONN              1024                  1.1 × 10¹²   1.4 × 10²⁶
                     MOSCAP TONN             1024                  6.5 × 10¹⁴   4.1 × 10²⁸

The energy consumption of the photonic TONNs under static operation mainly consists of five parts: laser wall-plug power, microring modulator power, MZI mesh power, microring add-drop filter power, and PD receiver power. The MOSCAP platform is assumed to use QD comb lasers,27,28 MOSCAP microring modulators,53,54 MOSCAP MZIs, and Si–Ge APDs.55–57 The silicon photonics (SiPh) and phase-change material (PCM) platforms are assumed to use distributed-feedback (DFB) lasers,48 silicon photonic microring modulators,49 SiPh50 (or PCM52) MZIs, and Ge PIN PDs.49,51

For the proposed heterogeneous III–V-on-silicon MOSCAP platform, the laser wall-plug power is mainly determined by the PD sensitivity (SENS_PD), the power margin (Power_margin), the total optical insertion loss (IL_total), and the wall-plug efficiency (η) through P_Laser = 10^{(SENS_PD + Power_margin + IL_total)/10}/η. The energy stored in the MOSCAP microring modulator can be calculated as CV_pp²/2, where C is the capacitance and V_pp is the peak-to-peak voltage swing. Assuming a non-return-to-zero (NRZ) modulation pattern, the probability of charging the MOSCAP modulator is 0.25.60 The driver power consumption is assumed to be equal to that of the modulator.61 Thus, the power consumption of the MOSCAP microring modulator can be calculated as P_Modulator = BCV_pp²/4, where B is the modulation bandwidth. With MOSCAP phase tuning, the static power consumption of the MZI meshes and microring add-drop filters is considered to be zero. The total number of multiply-accumulate (MAC) operations per second of an N × N TONN-MW is BN². For a 1024 × 1024 TONN-MW, as in Fig. 5(a), assuming B = 10 GHz, the total power consumption is calculated to be 15.79 W, which corresponds to 6.5 × 10¹⁴ MAC/J. With an estimated area of 165 mm² (a detailed breakdown can be found in the Appendix), the computing throughput per unit area is calculated to be 6.4 × 10¹³ MAC/s/mm². Thus, the footprint-energy efficiency of a 1024 × 1024 TONN-MW is 4.1 × 10²⁸ (MAC/J) (MAC/s/mm²). The detailed calculations can be found in the Appendix.
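These numbers can be reproduced from Eq. (A1) and the MOSCAP column of Table III. The sketch below takes the 15.79 W total power and 165 mm² area as given in the text; the laser formula is evaluated per the stated expression (SENS_PD in dBm, so the 10^{.../10} term yields mW) but is not summed into a system total here, since the per-channel accounting is not fully specified in the text:

```python
# Link budget per Eq. (A1), MOSCAP column of Table III (all values in dB).
il_total = (0.0            # laser coupling (heterogeneous integration)
            + 1.0          # microring modulator insertion loss
            + 2.5          # power penalty from modulator extinction ratio
            + 31 * 0.1     # off-resonance modulator through-ports
            + 16 * 0.77    # 16 cascaded MZI stages (first half of the network)
            + 36 * 0.017   # waveguide crossings
            + 0.2          # add-drop filter insertion loss
            + 31 * 0.1     # off-resonance filter through-ports
            + 2.0)         # waveguide propagation loss
print(round(il_total, 3))  # 24.832 dB

# Laser wall-plug power per the formula in the text: result of the dB term is in mW.
sens_pd, margin, eta = -30.0, 3.0, 0.10
p_laser_mw = 10 ** ((sens_pd + margin + il_total) / 10) / eta
print(round(p_laser_mw, 2))

# Figure of merit, taking the quoted 15.79 W total power and 165 mm^2 area.
B, N, area_mm2 = 10e9, 1024, 165.0
mac_per_s = B * N**2                  # ~1.05e16 MAC/s
mac_per_j = mac_per_s / 15.79         # ~6.6e14 MAC/J (quoted as 6.5e14)
fom = mac_per_j * (mac_per_s / area_mm2)
print(f"{mac_per_j:.2e} {fom:.2e}")   # consistent with the quoted 4.1e28
```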

With the footprint-energy efficiency as the FOM, the MOSCAP TONN outperforms the digital electronics6,43–46 and most of the analog electronics2,5,41 technologies by factors of 1.4 × 10⁴ and 5.1 × 10⁹, respectively. As shown in Table II, our architecture exhibits a 10× larger number of synapses per neuron and a 15.8× higher FOM compared with analog flash memory technology.42 Compared with TONNs on the other photonic technologies, including SiPh and PCM, the MOSCAP TONN improves the footprint-energy efficiency by 290× for the following reasons: (1) heterogeneous integration of all the optical components eliminates the coupling loss between discrete chips; (2) MOSCAP MZIs and microring add-drop filters have negligible static phase tuning energy consumption, while SiPh technologies require power to maintain the phase tuning; and (3) the high-sensitivity APD significantly reduces the required laser wall-plug power.

The footprint of the proposed 1024 × 1024 MOSCAP TONN can be 27 × 6 mm² in a folded configuration. Thus, integrating the 1024 × 1024 TONN in a single photonic die is achievable, considering that the maximum field size of a deep ultraviolet (DUV) stepper is 27 × 22 mm². The total number of MZIs in the proposed 1024 × 1024 MOSCAP TONN is 4.5× less than that of Lightmatter's 64 × 64 chip.7 The packaging of photonic tensor cores, CMOS driver chips, and memory chips can be realized by three-dimensional (3D) co-packaging technology. Recently, several foundries (e.g., AIM Photonics62) have started to provide SiPh interposer services that can co-package SiPh and electronic integrated circuits in a 3D stack using through-silicon vias (TSVs). The 3D co-packaging technology63 brings the CMOS chips closer to the compute nodes (photonic TONN chips) so that the integration density and energy efficiency can be further increased.

This paper proposes a scalable, energy-efficient, and compact TONN architecture on the integrated III–V-on-silicon MOSCAP platform. Based on TT decomposition, high-radix (e.g., 1024 × 1024) synaptic interconnections can be enabled by cascaded small-radix (e.g., 8 × 8) photonic tensor-train cores. The detailed TONN device architectures with single or multiple wavelengths are discussed. Simulation experiments show that the TONN-MW uses 79× fewer MZIs and 5.2× fewer cascaded stages of MZIs than the conventional ONN while maintaining a >95% training accuracy for MNIST handwritten digit classification tasks. Furthermore, with the footprint-energy efficiency as the figure of merit, the proposed TONN-MW on the heterogeneous III–V-on-silicon MOSCAP platform outperforms digital electronics and other photonic technologies by factors of 1.4 × 10⁴ and 2.9 × 10², respectively. Our proposed architecture points out the road map for the future physical implementation of ONNs scaling up to 1024 × 1024 and beyond, with significantly reduced hardware requirements and ultra-high energy efficiency.

This work was supported, in part, by the AFOSR under Grant No. FA9550-181-1-0186. The authors would like to thank Geza Kurczveil, Sudharsanan Srinivasan, Stanley Cheung, and Yuan Yuan from Hewlett Packard Labs for providing photonic device parameters. The authors would also like to thank Kaiqi Zhang from the University of California, Santa Barbara, for discussions on tensor-train decomposition.

The authors have no conflicts to disclose.

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Table III lists the parameters used for footprint-energy efficiency calculation. The total optical loss for the first half of the 1024 × 1024 TONN-MW [Fig. 5(a)] can be calculated by

IL_total = IL_laser_coupling + IL_ring + Penalty_ext + 31 × IL_ring_off + 16 × IL_MZI + 36 × IL_crossing + IL_ring_filter + 31 × IL_ring_off + IL_wg.  (A1)
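For concreteness, Eq. (A1) can be evaluated numerically with the MOSCAP column of Table III (with the worst-case waveguide-crossing count). A minimal sketch in Python, using only the parameter values listed in the table:

```python
# Loss terms (dB) taken from the MOSCAP column of Table III
il_laser_coupling = 0.0    # laser coupling insertion loss
il_ring           = 1.0    # microring modulator insertion loss
penalty_ext       = 2.5    # power penalty from the modulator extinction ratio
il_ring_off       = 0.1    # off-resonance through-port loss per microring
il_mzi            = 0.77   # insertion loss per MZI
il_crossing       = 0.017  # insertion loss per waveguide crossing
il_ring_filter    = 0.2    # microring add-drop filter insertion loss
il_wg             = 2.0    # waveguide propagation loss

# Eq. (A1): total optical loss of the first half of the 1024 x 1024 TONN-MW
il_total = (il_laser_coupling + il_ring + penalty_ext
            + 31 * il_ring_off + 16 * il_mzi + 36 * il_crossing
            + il_ring_filter + 31 * il_ring_off + il_wg)
print(f"IL_total = {il_total:.1f} dB")  # ~24.8 dB
```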
TABLE III. List of parameters for the footprint-energy efficiency calculation.

| Parameter | MOSCAP | SiPh | PCM |
| --- | --- | --- | --- |
| Data rate per wavelength B | 10 Gb/s | 10 Gb/s | 10 Gb/s |
| Laser wall-plug efficiency η | 10% | 7.1%48 | 7.1%48 |
| Laser coupling insertion loss IL_laser_coupling | 0 dB | 3.9 dB47 | 3.9 dB47 |
| Microring modulator insertion loss IL_ring | 1 dB53,54 | 3.9 dB49 | 3.9 dB49 |
| Microring modulator extinction ratio EXT | 5.5 dB53,54 | 4.2 dB49 | 4.2 dB49 |
| Power penalty by modulator extinction ratio Penalty_ext | 2.5 dB | 3.5 dB | 3.5 dB |
| Microring modulator off-resonance through-port insertion loss IL_ring_off | 0.1 dB | 0.1 dB | 0.1 dB |
| Microring modulator power consumption P_ring | 1.3 mW53,54 | 1.54 mW49 | 1.54 mW49 |
| MZI insertion loss IL_MZI | 0.77 dB | 1.1 dB50 | 1 dB52 |
| MZI phase shifter length L_phase_shifter | 500 μm | 35 μm50 | 35 μm52 |
| MZI static power consumption P_MZI | 0 mW | 56 mW50 | 0 mW |
| Waveguide crossing insertion loss IL_crossing | 0.017 dB64 | 0.017 dB64 | 0.017 dB64 |
| Microring add-drop filter insertion loss IL_ring_filter | 0.2 dB | 0.2 dB | 0.2 dB |
| Microring add-drop filter off-resonance through-port insertion loss IL_ring_off | 0.1 dB | 0.1 dB | 0.1 dB |
| Waveguide loss IL_wg | 2 dB | 2 dB | 2 dB |
| Power margin Power_margin | 3 dB | 3 dB | 3 dB |
| Photodetector sensitivity at 10 Gb/s SENS_PD | −30 dBm55–57 | −13.9 dBm51 | −13.9 dBm51 |
| Photodetector power consumption P_PD | 0.5 mW65 | 0.75 mW49 | 0.75 mW49 |

Here, the worst-case number of waveguide crossings is considered. The power penalty induced by the microring modulator extinction ratio is calculated by

Penalty_ext = 10 × log10((10^(EXT/10) + 1) / (10^(EXT/10) − 1)).  (A2)
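As a sanity check, Eq. (A2) evaluated at the extinction ratios in Table III reproduces the Penalty_ext rows of the table. A minimal sketch:

```python
import math

def extinction_penalty_db(ext_db: float) -> float:
    """Eq. (A2): power penalty (dB) from a finite modulator extinction ratio."""
    ext_lin = 10 ** (ext_db / 10)  # extinction ratio on a linear scale
    return 10 * math.log10((ext_lin + 1) / (ext_lin - 1))

print(round(extinction_penalty_db(5.5), 1))  # MOSCAP modulator (EXT = 5.5 dB): 2.5
print(round(extinction_penalty_db(4.2), 1))  # SiPh/PCM modulators (EXT = 4.2 dB): 3.5
```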

The required comb laser wall-plug power per wavelength is

P_laser_wall_plug = 10^((SENS_PD + Power_margin + IL_total)/10) / η.  (A3)

By adding up the power consumption of the second half of the TONN-MW, the total power consumption per wavelength is

P_total = 2 × (P_laser_wall_plug + P_ring + 256 × P_MZI + P_PD).  (A4)
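Combining Eqs. (A3) and (A4) with the MOSCAP column of Table III gives the per-wavelength power budget. A minimal sketch; the il_total value (≈24.8 dB) is an assumption obtained by evaluating Eq. (A1) with the MOSCAP parameters, and the modulator power in Eq. (A4) is taken to be P_ring from Table III:

```python
# MOSCAP column of Table III (per-wavelength link budget)
sens_pd      = -30.0  # dBm, APD sensitivity at 10 Gb/s
power_margin = 3.0    # dB
il_total     = 24.8   # dB, Eq. (A1) evaluated with the MOSCAP parameters
eta          = 0.10   # laser wall-plug efficiency
p_ring       = 1.3    # mW, microring modulator power consumption
p_mzi        = 0.0    # mW, MOSCAP MZI static power consumption
p_pd         = 0.5    # mW, photodetector power consumption

# Eq. (A3): required comb-laser wall-plug power per wavelength (mW)
p_laser_wall_plug = 10 ** ((sens_pd + power_margin + il_total) / 10) / eta

# Eq. (A4): total power per wavelength, counting both halves of the TONN-MW (mW)
p_total = 2 * (p_laser_wall_plug + p_ring + 256 * p_mzi + p_pd)
print(f"P_laser = {p_laser_wall_plug:.1f} mW, P_total = {p_total:.1f} mW")
```

Under these assumptions, the required laser wall-plug power is on the order of a few milliwatts per wavelength, and the zero static MZI power of the MOSCAP platform keeps the 256-MZI term from dominating the budget.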

The estimated footprint of the MOSCAP TONN-MW is 55 × 3 mm² = 165 mm², which comprises 45 × 3 mm² for the MZI meshes, 1 × 3 mm² for the QD comb laser and power splitters, 4 × 3 mm² for the MOSCAP microring modulator arrays, 4 × 3 mm² for the MOSCAP microring add-drop filter and APD arrays, and 1 × 3 mm² for the electrical cross-connects.

1. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature 529, 484–489 (2016).
2. M. V. DeBole, R. Appuswamy, P. J. Carlson, A. S. Cassidy, P. Datta, S. K. Esser, G. J. Garreau, K. L. Holland, S. Lekuch, M. Mastro, J. McKinstry, B. Taba, C. di Nolfo, B. Paulovicks, J. Sawada, K. Schleupen, B. G. Shaw, J. L. Klamo, M. D. Flickner, J. V. Arthur, D. S. Modha, A. Amir, F. Akopyan, A. Andreopoulos, W. P. Risk, J. Kusnitz, C. Ortega Otero, and T. K. Nayak, "TrueNorth: Accelerating from zero to 64 million neurons in 10 years," Computer 52, 20–29 (2019).
3. M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, Y. Liao, C.-K. Lin, A. Lines, R. Liu, D. Mathaikutty, S. McCoy, A. Paul, J. Tse, G. Venkataramanan, Y.-H. Weng, A. Wild, Y. Yang, and H. Wang, "Loihi: A neuromorphic manycore processor with on-chip learning," IEEE Micro 38, 82–99 (2018).
4. Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljačić, "Deep learning with coherent nanophotonic circuits," Nat. Photonics 11, 441–446 (2017).
5. B. V. Benjamin, P. Gao, E. McQuinn, S. Choudhary, A. R. Chandrasekaran, J.-M. Bussat, R. Alvarez-Icaza, J. V. Arthur, P. A. Merolla, and K. Boahen, "Neurogrid: A mixed-analog-digital multichip system for large-scale neural simulations," Proc. IEEE 102, 699–716 (2014).
6. J. Choquette, O. Giroux, and D. Foley, "Volta: Performance and programmability," IEEE Micro 38, 42–52 (2018).
7. C. Ramey, "Silicon photonics for artificial intelligence acceleration: HotChips 32," in 2020 IEEE Hot Chips 32 Symposium (HCS) (IEEE, 2020), pp. 1–26.
8. W. R. Clements, P. C. Humphreys, B. J. Metcalf, W. S. Kolthammer, and I. A. Walsmley, "Optimal design for universal multiport interferometers," Optica 3, 1460–1465 (2016).
9. P. Edinger, A. Y. Takabayashi, C. Errando-Herranz, U. Khan, H. Sattari, P. Verheyen, W. Bogaerts, N. Quack, and K. B. Gylfason, "Silicon photonic microelectromechanical phase shifters for scalable programmable photonics," Opt. Lett. 46, 5671–5674 (2021).
10. I. A. D. Williamson, T. W. Hughes, M. Minkov, B. Bartlett, S. Pai, and S. Fan, "Reprogrammable electro-optic nonlinear activation functions for optical neural networks," IEEE J. Sel. Top. Quantum Electron. 26, 1–12 (2020).
11. J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. Le Gallo, X. Fu, A. Lukashchuk, A. S. Raja, J. Liu, C. D. Wright, A. Sebastian, T. J. Kippenberg, W. H. P. Pernice, and H. Bhaskaran, "Parallel convolutional processing using an integrated photonic tensor core," Nature 589, 52–58 (2021).
12. M. A. Nahmias, T. F. de Lima, A. N. Tait, H.-T. Peng, B. J. Shastri, and P. R. Prucnal, "Photonic multiply-accumulate operations for neural networks," IEEE J. Sel. Top. Quantum Electron. 26, 1–18 (2020).
13. R. Hamerly, L. Bernstein, A. Sludds, M. Soljačić, and D. Englund, "Large-scale optical neural networks based on photoelectric multiplication," Phys. Rev. X 9, 021032 (2019).
14. T. Wang, S.-Y. Ma, L. G. Wright, T. Onodera, B. Richard, and P. L. McMahon, "An optical neural network using less than 1 photon per multiplication," arXiv:2104.13467 (2021).
15. G. T. Reed, G. Mashanovich, F. Y. Gardes, and D. J. Thomson, "Silicon optical modulators," Nat. Photonics 4, 518–526 (2010).
16. M. Jacques, Z. Xing, A. Samani, E. El-Fiky, X. Li, M. Xiang, S. Lessard, and D. V. Plant, "240 Gbit/s silicon photonic Mach-Zehnder modulator enabled by two 2.3-Vpp drivers," J. Lightwave Technol. 38, 2877–2885 (2020).
17. X. Xiao and S. J. B. Yoo, "Tensor-train decomposed synaptic interconnections for compact and scalable photonic neural networks," in 2020 IEEE Photonics Conference (IPC) (IEEE, 2020), pp. 1–2.
18. X. Xiao and S. J. B. Yoo, "Scalable and compact 3D tensorized photonic neural networks," in 2021 Optical Fiber Communications Conference and Exhibition (OFC) (Optical Society of America, 2021), pp. 1–3.
19. I. V. Oseledets, "Tensor-train decomposition," SIAM J. Sci. Comput. 33, 2295–2317 (2011).
20. A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov, "Tensorizing neural networks," in Advances in Neural Information Processing Systems (MIT Press, 2015), pp. 442–450.
21. J. Ba and R. Caruana, "Do deep nets really need to be deep?," in Advances in Neural Information Processing Systems (MIT Press, 2014), pp. 2654–2662.
22. A. Novikov, A. Rodomanov, A. Osokin, and D. Vetrov, "Putting MRFs on a tensor train," in International Conference on Machine Learning (PMLR, 2014), pp. 811–819.
23. H. Huang, L. Ni, and H. Yu, "LTNN: An energy-efficient machine learning accelerator on 3D CMOS-RRAM for layer-wise tensorized neural network," in 2017 30th IEEE International System-On-Chip Conference (SOCC) (IEEE, 2017), pp. 280–285.
24. C. Deng, F. Sun, X. Qian, J. Lin, Z. Wang, and B. Yuan, "TIE: Energy-efficient tensor train-based inference engine for deep neural network," in Proceedings of the 46th International Symposium on Computer Architecture, ISCA '19 (Association for Computing Machinery, New York, 2019), pp. 264–278.
25. Y. Yang, D. Krompass, and V. Tresp, "Tensor-train recurrent neural networks for video classification," in Proceedings of the 34th International Conference on Machine Learning, ICML '17 (JMLR.org, 2017), Vol. 70, pp. 3891–3900.
26. D. Liang and J. E. Bowers, "Recent progress in heterogeneous III-V-on-silicon photonic integration," Light: Adv. Manuf. 2, 59 (2021).
27. G. Kurczveil, A. Descos, D. Liang, M. Fiorentino, and R. Beausoleil, "Hybrid silicon quantum dot comb laser with record wide comb width," in Frontiers in Optics/Laser Science, OSA Technical Digest, edited by B. Lee, C. Mazzali, K. Corwin, and R. Jason Jones (Optical Society of America, Washington, DC, 2020), p. FTu6E.6.
28. G. Kurczveil, M. A. Seyedi, D. Liang, M. Fiorentino, and R. G. Beausoleil, "Error-free operation in a hybrid-silicon quantum dot comb laser," IEEE Photonics Technol. Lett. 30, 71–74 (2018).
29. B. Tossoun, G. Kurczveil, C. Zhang, A. Descos, Z. Huang, A. Beling, J. C. Campbell, D. Liang, and R. G. Beausoleil, "Indium arsenide quantum dot waveguide photodiodes heterogeneously integrated on silicon," Optica 6, 1277–1281 (2019).
30. A. R. Totovic, G. Dabos, N. Passalis, A. Tefas, and N. Pleros, "Femtojoule per MAC neuromorphic photonics: An energy and technology roadmap," IEEE J. Sel. Top. Quantum Electron. 26, 1–15 (2020).
31. M. Reck, A. Zeilinger, H. J. Bernstein, and P. Bertani, "Experimental realization of any discrete unitary operator," Phys. Rev. Lett. 73, 58–61 (1994).
32. S. Hamann, A. Ceballos, J. Landry, and O. Solgaard, "High-speed random access optical scanning using a linear MEMS phased array," Opt. Lett. 43, 5455–5458 (2018).
33. M. Miscuglio and V. J. Sorger, "Photonic tensor cores for machine learning," Appl. Phys. Rev. 7, 031404 (2020); arXiv:2002.03780.
34. C. Hawkins and Z. Zhang, "Bayesian tensorized neural networks with automatic rank selection," Neurocomputing 453, 172–180 (2021).
35. C. J. Hillar and L.-H. Lim, "Most tensor problems are NP-hard," J. ACM 60, 1 (2013).
36. A.-H. Phan, K. Sobolev, K. Sozykin, D. Ermilov, J. Gusak, P. Tichavský, V. Glukhov, I. Oseledets, and A. Cichocki, "Stable low-rank tensor decomposition for compression of convolutional neural network," in Computer Vision–ECCV 2020 (Springer International Publishing, Cham, 2020), pp. 522–539.
37. X. Cao and G. Rabusseau, "Tensor regression networks with various low-rank tensor approximations," arXiv:1712.09520 (2017).
38. K. Zhang, C. Hawkins, X. Zhang, C. Hao, and Z. Zhang, "On-FPGA training with ultra memory reduction: A low-precision tensor method," arXiv:2104.03420 (2021).
39. A. Novikov, P. Izmailov, V. Khrulkov, M. Figurnov, and I. V. Oseledets, "Tensor train decomposition on TensorFlow (T3F)," J. Mach. Learn. Res. 21, 1–7 (2020).
40. M. B. On, Y.-J. Lee, X. Xiao, R. Proietti, and S. J. Ben Yoo, "Analysis of the hardware imprecisions for scalable and compact photonic tensorized neural networks," in 2021 European Conference on Optical Communication (ECOC) (IEEE, 2021), pp. 1–4.
41. J. Schemmel, D. Brüderle, A. Grübl, M. Hock, K. Meier, and S. Millner, "A wafer-scale neuromorphic hardware system for large-scale neural modeling," in 2010 IEEE International Symposium on Circuits and Systems (ISCAS) (IEEE, 2010), pp. 1947–1950.
42. M. R. Mahmoodi and D. Strukov, "An ultra-low energy internally analog, externally digital vector-matrix multiplier based on NOR flash memory technology," in Proceedings of the 55th Annual Design Automation Conference, DAC '18 (Association for Computing Machinery, New York, 2018).
43. S. Cass, "Taking AI to the edge: Google's TPU now comes in a maker-friendly package," IEEE Spectrum 56, 16–17 (2019).
44. I. Kacher, M. Portaz, H. Randrianarivo, and S. Peyronnet, "Graphcore C2 card performance for image-based deep learning application: A report," arXiv:2002.11670 (2020).
45. Y. Kim, Y. Zhang, and P. Li, "A digital neuromorphic VLSI architecture with memristor crossbar synaptic array for machine learning," in 2012 IEEE International SOC Conference (IEEE, 2012), pp. 328–333.
46. L. Gwennap, "Groq rocks neural networks," Microprocessor Report, technical report (2020); available at http://groq.com/wp-content/uploads/2020/04/Groq-Rocks-NNs-Linley-Group-MPR-2020Jan06.pdf.
47. N. Hatori, T. Shimizu, M. Okano, M. Ishizaka, T. Yamamoto, Y. Urino, M. Mori, T. Nakamura, and Y. Arakawa, "A hybrid integrated light source on a silicon platform using a trident spot-size converter," J. Lightwave Technol. 32, 1329–1336 (2014).
48. H. Duprez, A. Descos, T. Ferrotti, C. Sciancalepore, C. Jany, K. Hassan, C. Seassal, S. Menezo, and B. Ben Bakir, "1310 nm hybrid InP/InGaAsP on silicon distributed feedback laser with high side-mode suppression ratio," Opt. Express 23, 8489–8497 (2015).
49. M. Rakowski, Y. Ban, P. D. Heyn, N. Pantano, B. Snyder, S. Balakrishnan, S. V. Huylenbroeck, L. Bogaerts, C. Demeurisse, F. Inoue, K. J. Rebibis, P. Nolmans, X. Sun, P. Bex, A. Srinivasan, J. D. Coster, S. Lardenois, A. Miller, P. Absil, P. Verheyen, D. Velenis, M. Pantouvaki, and J. V. Campenhout, "Hybrid 14 nm FinFET–silicon photonics technology for low-power Tb/s/mm² optical I/O," in 2018 IEEE Symposium on VLSI Technology (IEEE, 2018), pp. 221–222.
50. M. Mendez-Astudillo, M. Okamoto, Y. Ito, and T. Kita, "Compact thermo-optic MZI switch in silicon-on-insulator using direct carrier injection," Opt. Express 27, 899–906 (2019).
51. D. Benedikovic, L. Virot, G. Aubin, J.-M. Hartmann, F. Amar, B. Szelag, X. Le Roux, C. Alonso-Ramos, P. Crozat, and E. Cassan, "Silicon-germanium pin photodiodes with double heterojunction: High-speed operation at 10 Gbps and beyond," in 2020 European Conference on Integrated Optics (2020).
52. P. Xu, J. Zheng, J. K. Doylend, and A. Majumdar, "Low-loss and broadband nonvolatile phase-change directional coupler switches," ACS Photonics 6, 553–557 (2019).
53. S. Srinivasan, D. Liang, and R. G. Beausoleil, "Heterogeneous SISCAP microring modulator for high-speed optical communication," in 2020 European Conference on Optical Communications (ECOC) (IEEE, 2020), pp. 1–3.
54. S. Srinivasan, D. Liang, and R. G. Beausoleil, "High temperature performance of heterogeneous MOSCAP microring modulators," in Optical Fiber Communication Conference (OFC) 2021, OSA Technical Digest, edited by P. Dong, J. Kani, C. Xie, R. Casellas, C. Cole, and M. Li (Optical Society of America, Washington, DC, 2021), p. Th5A.1.
55. B. Wang, Z. Huang, Y. Yuan, D. Liang, X. Zeng, M. Fiorentino, and R. G. Beausoleil, "64 Gb/s low-voltage waveguide SiGe avalanche photodiodes with distributed Bragg reflectors," Photonics Res. 8, 1118–1123 (2020).
56. Y. Yuan, Z. Huang, B. Wang, W. V. Sorin, X. Zeng, D. Liang, M. Fiorentino, J. C. Campbell, and R. G. Beausoleil, "64 Gbps PAM4 Si-Ge waveguide avalanche photodiodes with excellent temperature stability," J. Lightwave Technol. 38, 4857–4866 (2020).
57. Y. Yuan, Z. Huang, X. Zeng, D. Liang, W. V. Sorin, M. Fiorentino, and R. G. Beausoleil, "High responsivity Si-Ge waveguide avalanche photodiodes enhanced by loop reflector," IEEE J. Sel. Top. Quantum Electron. 28, 1–8 (2022).
58. X. Huang, D. Liang, C. Zhang, G. Kurczveil, X. Li, J. Zhang, M. Fiorentino, and R. Beausoleil, "Heterogeneous MOS microring resonators," in 2017 IEEE Photonics Conference (IPC) (IEEE, 2017), pp. 121–122.
59. S. Srinivasan, D. Liang, and R. Beausoleil, "Non-invasive light monitoring for heterogeneous photonic integrated circuits," in 2021 IEEE Photonics Conference (IPC) (IEEE, 2021).
60. M. R. Watts, D. C. Trotter, R. W. Young, A. L. Lentine, and W. A. Zortman, "Limits to silicon modulator bandwidth and power consumption," Proc. SPIE 7221, 72210M (2009).
61. D. A. B. Miller, "Attojoule optoelectronics for low-energy information processing and communications," J. Lightwave Technol. 35, 346–396 (2017).
63. N. Margalit, C. Xiang, S. M. Bowers, A. Bjorlin, R. Blum, and J. E. Bowers, "Perspective on the future of silicon photonics and electronics," Appl. Phys. Lett. 118, 220501 (2021).
64. Y. Ma, Y. Zhang, S. Yang, A. Novack, R. Ding, A. E.-J. Lim, G.-Q. Lo, T. Baehr-Jones, and M. Hochberg, "Ultralow loss single layer submicron silicon waveguide crossing for SOI optical interconnect," Opt. Express 21, 29374–29382 (2013).
65. B. Wang, Z. Huang, W. V. Sorin, X. Zeng, D. Liang, M. Fiorentino, and R. G. Beausoleil, "A low-voltage Si-Ge avalanche photodiode for high-speed and energy efficient silicon photonic links," J. Lightwave Technol. 38, 3156–3163 (2020).