This paper proposes a large-scale, energy-efficient, high-throughput, and compact tensorized optical neural network (TONN) exploiting the tensor-train decomposition architecture on an integrated III–V-on-silicon metal–oxide–semiconductor capacitor (MOSCAP) platform. By using cascaded multi-wavelength small-radix (e.g., 8 × 8) tensor cores, the proposed TONN architecture is scalable to 1024 × 1024 synapses and beyond, a scale that is extremely difficult for conventional integrated ONN architectures. Simulation experiments show that the proposed TONN uses 79× fewer Mach–Zehnder interferometers (MZIs) and 5.2× fewer cascaded stages of MZIs than the conventional ONN while maintaining a >95% training accuracy for Modified National Institute of Standards and Technology (MNIST) handwritten digit classification tasks. Furthermore, with the proven heterogeneous III–V-on-silicon MOSCAP platform, our proposed TONN can improve the footprint-energy efficiency by a factor of 1.4 × 10^{4} compared with digital electronic artificial neural network (ANN) hardware and a factor of 2.9 × 10^{2} compared with silicon photonic and phase-change material technologies. Thus, this paper points out a road map for implementing large-scale ONNs with a number of synapses similar to, and an energy efficiency superior to, electronic ANNs.

## I. INTRODUCTION

Artificial neural networks (ANNs) have proven their remarkable capabilities in various tasks, including computer vision, speech recognition, machine translation, medical diagnosis, and the game of Go.^{1} Neuromorphic computing accelerators, such as IBM TrueNorth^{2} and Intel Loihi,^{3} have shown significantly superior performance compared with traditional central processing units (CPUs) for specific neural network tasks. A majority of the energy consumption of electrical ANN hardware comes from data movement in the synaptic interconnections. Optical neural networks (ONNs), also known as photonic neural networks, are expected to improve energy efficiency and throughput significantly compared with electrical ANNs because they transmit data at the speed of light without a length-dependent impedance.^{4} However, two main challenges prevent ONNs from achieving performance competitive with electrical ANNs.

The first challenge is the limited scalability. While electrical ANN hardware is capable of achieving 4096 synaptic connections per neuron,^{5,6} the scale of state-of-the-art ONNs is limited to 64 × 64^{7} or smaller. This is because the synaptic interconnections in conventional ONNs typically rely on Mach–Zehnder interferometer (MZI) meshes.^{8} Scaling up to high-radix *N* × *N* meshes requires *O*(*N*^{2}) MZIs and *O*(*N*) cascaded stages, which lead to an insurmountable footprint, optical loss budget, and control complexity. For example, Lightmatter's Mars device^{7} integrates a 64 × 64 micro-electro-mechanical systems (MEMS) MZI mesh on a 150 mm^{2} chip. With the same architecture and device platform, the predicted chip size for a 1024 × 1024 mesh would exceed an 8-in. wafer, and the optical insertion loss of 1024 cascaded stages of MEMS^{9} MZIs would be >675 dB, which is impractical to compensate with optical amplifiers. The limited scalability forces conventional ONNs to rely on pre-processing^{10} and convolutional layers^{11} to handle meaningful machine learning (ML) datasets (e.g., the ImageNet dataset). However, convolutional neural networks (CNNs) are only efficient for specific tasks (e.g., image classification and time series) and require many layers. On the other hand, some works propose to pursue large-scale ONNs with a scale comparable to electrical ANN hardware.^{12–14} However, these efforts either retain the conventional ONN architecture,^{12} reduce the throughput by encoding the input data in the time domain,^{13} or require bulky free-space devices.^{14}

The second challenge of ONNs is the lack of a device platform that can monolithically integrate optical neurons with photodetectors (PDs), electrical neuron circuits, light emitters, and synaptic interconnections on silicon. By sharing most fabrication process steps with the silicon complementary metal–oxide–semiconductor (CMOS) process, silicon photonics (SiPh)^{15,16} has proved to be a desirable platform for low-cost, high-volume commercial electronic and photonic integrated circuit (EPIC) manufacturing. However, since silicon is an indirect bandgap material, silicon light emitters are inefficient. Aligning III–V diode laser chips to SiPh chips induces additional coupling losses and packaging complexity, limiting energy efficiency and integration density.

To mitigate these two challenges, on the architecture side, tensor-train (TT) decomposed synaptic interconnections have been proposed to realize large-scale ONNs with reduced hardware resources.^{17,18} As an effective approach to significantly compress the over-parameterized fully connected layers in ANNs,^{19,20} TT decomposition has been shown to achieve almost the same accuracy as an ensemble of deep CNNs^{21} on various tasks, including Markov random fields,^{22} image recognition,^{20,23,24} and video classification.^{24,25} On the device platform side, although the tensorized ONN (TONN) can be implemented in various photonic platforms, heterogeneous III–V-on-silicon integration is an optimal choice.^{26} Integration of quantum dot (QD) comb lasers^{27,28} and avalanche photodiodes (APDs)^{29} with other SiPh devices at wafer scale further improves the energy efficiency of the TONN.

This paper proposes a large-scale, energy-efficient, high-throughput, and compact tensorized ONN (TONN) architecture on a densely integrated III–V-on-silicon metal–oxide–semiconductor capacitor (MOSCAP) platform. The TT decomposition makes the proposed architecture scalable to 1024 × 1024 and beyond, which is extremely difficult for conventional integrated ONNs. The detailed device architecture of the TONN is designed based on a multi-wavelength configuration that does not require intermediate memory. Simulations show that the proposed TONN uses 79× fewer MZIs and 5.2× fewer cascaded stages of MZIs than the conventional ONNs while maintaining a >95% accuracy for Modified National Institute of Standards and Technology (MNIST) handwritten digit classification tasks. Moreover, the proposed monolithic III–V-on-silicon MOSCAP platform provides MOSCAP synaptic MZIs with negligible static phase tuning energy consumption, a photodetector-free hitless monitoring scheme that eliminates more than *O*(*N*^{2}) control elements and pads, and dense wavelength division multiplexing (DWDM)-based neurons. With all the experimentally proven building blocks, the footprint-energy efficiency^{30} [(MAC/J) (MAC/s/mm^{2})] of the TONN can be further improved by a factor of 2.9 × 10^{2} compared with other photonic platforms and a factor of 1.4 × 10^{4} compared with state-of-the-art digital electronic ANNs.

The remainder of this paper is organized as follows. Section II introduces the principle of TT decomposition and the device architecture of TONN. Section III simulates the TONN using MNIST handwritten digit classification tasks and compares the test accuracy with the conventional ONNs. Section IV reports the implementation of TONN on the heterogeneous III–V-on-silicon MOSCAP platform and compares the energy-footprint efficiency with other ANN technologies. Finally, Sec. V concludes this paper.

## II. TENSORIZED OPTICAL NEURAL NETWORK

### A. Principle of tensor-train decomposition

ONNs typically consist of an input neuron layer, many hidden neuron layers, an output neuron layer, and synaptic interconnections, which can be abstracted as arbitrary weight matrices *W*, as shown in Fig. 1(a). By utilizing singular value decomposition (SVD), an arbitrary weight matrix *W* can be decomposed into two arbitrary unitary matrices and an array of additional phase shifters and amplitude modulators: *W* = *U*Σ*V**. In ONNs, the arbitrary unitary matrices *U* and *V* are typically realized by "loss-less" MZI meshes in a "rectangular"^{8} or "triangular"^{31} configuration. Each 2 × 2 MZI contains two phase shifters and two 50/50 optical power splitters. However, an *N* × *N* MZI mesh requires *N*(*N* − 1)/2 MZIs and *N* cascaded stages.^{8} Although MEMS^{32} or non-volatile^{33} technologies can reduce the length of the phase shifters to tens of microns, MZI meshes remain difficult to scale up to high radix.
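The SVD step can be checked numerically. The sketch below uses an illustrative 8 × 8 example (the dimensions are assumptions chosen for brevity): it verifies that the two SVD factors are unitary, and hence realizable as lossless MZI meshes, and evaluates the mesh-size formulas above.

```python
import numpy as np

# Factor an arbitrary weight matrix as W = U @ diag(s) @ Vh, where U and Vh are
# unitary and therefore map onto "loss-less" MZI meshes, and diag(s) maps onto
# the attenuator/phase-shifter array between the two meshes.
N = 8
rng = np.random.default_rng(1)
W = rng.standard_normal((N, N))

U, s, Vh = np.linalg.svd(W)

assert np.allclose(U @ U.conj().T, np.eye(N))    # U is unitary
assert np.allclose(Vh @ Vh.conj().T, np.eye(N))  # V is unitary
assert np.allclose(U @ np.diag(s) @ Vh, W)       # exact reconstruction

# A "rectangular" N x N mesh needs N(N - 1)/2 MZIs and N cascaded stages.
mzis_per_mesh = N * (N - 1) // 2   # 28 for N = 8
stages_per_mesh = N                # 8 for N = 8
```

The quadratic growth of `mzis_per_mesh` with `N` is exactly the scaling bottleneck the TT decomposition below is designed to avoid.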

Here, italic lowercase letters, italic uppercase letters, and bold italic uppercase letters are used to represent vectors, matrices, and tensors, respectively. To represent the weight matrix $W \in \mathbb{R}^{M \times N}$ in the TT format, the matrix dimensions *M* and *N* are first assumed to be factored as $M = \prod_{k=1}^{d} M_k$ and $N = \prod_{k=1}^{d} N_k$, where *d* is the number of factors of *M* and *N*. Let *μ* and *ν* be the natural bijections from indices (*i*, *j*) of *W* to indices [*μ*_{1}(*i*), *ν*_{1}(*j*), …, *μ*_{d}(*i*), *ν*_{d}(*j*)] of an order-2*d* weight tensor $\boldsymbol{W}$. Then, $W(i, j) = \boldsymbol{W}(\mu_1(i), \nu_1(j), \ldots, \mu_d(i), \nu_d(j))$. TT decomposition can be interpreted as the SVD of multi-dimensional matrices. As shown in Fig. 1(b), the TT decomposition expresses the tensor $\boldsymbol{W}$ as a series of tensor products,^{19,20,34}

$$\boldsymbol{W}\big(\mu_1(i), \nu_1(j), \ldots, \mu_d(i), \nu_d(j)\big) = \boldsymbol{G}_1[\mu_1(i), \nu_1(j)]\,\boldsymbol{G}_2[\mu_2(i), \nu_2(j)] \cdots \boldsymbol{G}_d[\mu_d(i), \nu_d(j)], \tag{1}$$

where the four-way tensor $\boldsymbol{G}_k \in \mathbb{R}^{R_{k-1} \times M_k \times N_k \times R_k}$ is the TT-core, $\boldsymbol{G}_k[\mu_k(i), \nu_k(j)] \in \mathbb{R}^{R_{k-1} \times R_k}$ is its matrix slice, and the total number of tensor cores is *d*. The vector (*R*_{0}, *R*_{1}, …, *R*_{d}) is the TT-rank, with *R*_{0} = *R*_{d} = 1. In this way, the total number of parameters can be reduced from *M* × *N* to the sum of the parameters of the small TT-cores, i.e., $\sum_{k=1}^{d} R_{k-1} M_k N_k R_k$.
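As a concrete illustration of this parameter reduction, the short calculation below evaluates the formula $\sum_{k=1}^{d} R_{k-1} M_k N_k R_k$ for the 1024 × 1024 factorization and TT-ranks used later in Sec. IV:

```python
# Parameter-count comparison for the TT format, using the 1024 x 1024 example
# of Sec. IV: M_k = N_k = (8, 4, 4, 8) and TT-ranks (R_0, ..., R_4) = (1, 2, 2, 2, 1).
Mk = (8, 4, 4, 8)
Nk = (8, 4, 4, 8)
R = (1, 2, 2, 2, 1)

full_params = 1024 * 1024  # dense weight matrix: M x N entries
# sum over k of R_{k-1} * M_k * N_k * R_k
tt_params = sum(R[k] * Mk[k] * Nk[k] * R[k + 1] for k in range(len(Mk)))

print(full_params, tt_params)  # 1048576 384
```

For this factorization, the TT format stores 384 parameters instead of 1 048 576, a compression of more than three orders of magnitude.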

### B. Tensor-train rank determination

The choice of TT-ranks influences both the training accuracy and the complexity of the TONN. Smaller TT-ranks usually lead to lower training accuracy, while higher TT-ranks lead to higher computing and storage complexity. TT-rank determination is a nondeterministic polynomial time (NP)-hard problem.^{35} Traditional methods require tuning the ranks manually,^{20} which often leads to suboptimal results because different TT-layers may require different TT-ranks. The authors of Ref. 36 proposed a heuristic binary search method to find the smallest TT-ranks for a predetermined accuracy threshold. The authors of Ref. 37 further provided heuristics that guide the selection of TT-ranks. The authors of Ref. 38 used a Bayesian method to optimize the TT-ranks along with the model parameters. For convenience, the TT-ranks in this paper are determined empirically.
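The rank-search idea can be sketched generically. The snippet below binary-searches the smallest uniform TT-rank meeting an accuracy threshold; the `accuracy_of` oracle (here a toy stand-in for a short training run) and its assumed monotonicity in the rank are illustrative simplifications, not the exact procedure of Ref. 36.

```python
def smallest_uniform_rank(accuracy_of, r_max, threshold):
    """Binary-search the smallest uniform TT-rank whose validation accuracy
    meets `threshold`, assuming accuracy_of(r) is non-decreasing in r."""
    lo, hi = 1, r_max
    while lo < hi:
        mid = (lo + hi) // 2
        if accuracy_of(mid) >= threshold:
            hi = mid          # mid is feasible; try smaller ranks
        else:
            lo = mid + 1      # mid is infeasible; need a larger rank
    return lo

# Toy stand-in for a training run: accuracy saturates as the rank grows.
toy_accuracy = lambda r: min(0.99, 0.80 + 0.03 * r)
print(smallest_uniform_rank(toy_accuracy, 24, 0.95))  # 5
```

In practice `accuracy_of` would be a (re)training run at the candidate rank, so the search trades a logarithmic number of training runs for a near-minimal rank.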

### C. Device architecture of tensorized optical neural network

#### 1. Tensor-train layers

To achieve the tensor-train layers, as shown in Fig. 1(b), the vectors *x* and *y* = *Wx* representing the input and output of the weight matrix *W* need to be reshaped into the tensor format by $x(j) = \boldsymbol{x}(\nu_1(j), \ldots, \nu_d(j))$ and $y(i) = \boldsymbol{y}(\mu_1(i), \ldots, \mu_d(i))$, where $x \in \mathbb{R}^{N}$, $y \in \mathbb{R}^{M}$, $\boldsymbol{x} \in \mathbb{R}^{N_d \times \cdots \times N_1}$, and $\boldsymbol{y} \in \mathbb{R}^{M_d \times \cdots \times M_1}$. Then, in the TT format, the linear transformation by the weight tensor $\boldsymbol{W}$ can be represented by^{20}

$$\boldsymbol{y}\big(\mu_1(i), \ldots, \mu_d(i)\big) = \sum_{\nu_1(j), \ldots, \nu_d(j)} \boldsymbol{G}_1[\mu_1(i), \nu_1(j)] \cdots \boldsymbol{G}_d[\mu_d(i), \nu_d(j)]\, \boldsymbol{x}\big(\nu_1(j), \ldots, \nu_d(j)\big). \tag{2}$$

Based on Eq. (2), $\boldsymbol{x}$ needs to be multiplied by each TT-core in the sequence $\boldsymbol{G}_d, \boldsymbol{G}_{d-1}, \ldots, \boldsymbol{G}_1$, as shown in Fig. 2(a). The term $\boldsymbol{I}_j$ is used to denote the intermediate tensor: $\boldsymbol{I}_1 = \boldsymbol{G}_d \boldsymbol{x}$, $\boldsymbol{I}_2 = \boldsymbol{G}_{d-1} \boldsymbol{I}_1$, …, $\boldsymbol{y} = \boldsymbol{I}_d = \boldsymbol{G}_1 \boldsymbol{I}_{d-1}$. Then, at TT-core *k*,

$$\boldsymbol{I}_{d-k+1} = \boldsymbol{G}_k \boldsymbol{I}_{d-k}, \qquad \boldsymbol{I}_0 = \boldsymbol{x}, \tag{3}$$

where *k* = 1, 2, …, *d*. Equation (3) can be physically implemented by a two-step operation. First, $\boldsymbol{G}_k \in \mathbb{R}^{R_{k-1} \times M_k \times N_k \times R_k}$ multiplies $\boldsymbol{I}_{d-k} \in \mathbb{R}^{R_k \times N_k \times M_d \times \cdots \times M_{k+1} \times N_{k-1} \times \cdots \times N_1}$ and gives $\boldsymbol{G}_k \boldsymbol{I}_{d-k} \in \mathbb{R}^{R_{k-1} \times M_k \times M_d \times \cdots \times M_{k+1} \times N_{k-1} \times \cdots \times N_1}$. Second, $\boldsymbol{G}_k \boldsymbol{I}_{d-k}$ is re-indexed to $\boldsymbol{I}_{d-k+1} \in \mathbb{R}^{R_{k-1} \times N_{k-1} \times M_d \times \cdots \times M_k \times N_{k-2} \times \cdots \times N_1}$ so that each TT layer is modular.

The physical implementation of TT-layers usually requires memory between each adjacent TT-core to store the intermediate data.^{23} Here, two “memory-free” device architectures of TONN are proposed with cascaded photonic TT-cores consisting of small-radix MZI meshes and passive cross-connects.
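The sequential contraction of Eq. (3) can be sketched numerically. The following minimal example uses small illustrative dimensions (*M* = *N* = 16 factored as 4 × 4, *d* = 2, TT-ranks (1, 2, 1), all assumptions chosen for brevity, not the paper's 1024 × 1024 case) and verifies that contracting the reshaped input through the TT-cores reproduces the full matrix-vector product:

```python
import numpy as np

rng = np.random.default_rng(0)
# TT-core slices with the unit boundary ranks R_0 = R_2 = 1 squeezed out:
G1 = rng.standard_normal((4, 4, 2))  # G_1: (M_1, N_1, R_1)
G2 = rng.standard_normal((2, 4, 4))  # G_2: (R_1, M_2, N_2)

# Reference: reconstruct the full 16 x 16 weight matrix W(i, j) with
# i = 4*mu_1 + mu_2 and j = 4*nu_1 + nu_2 (sum over the rank bond R_1).
W = np.einsum('pqb,bmn->pmqn', G1, G2).reshape(16, 16)

x = rng.standard_normal(16)
X = x.reshape(4, 4)  # order-2 input tensor x(nu_1(j), nu_2(j))

# Step 1 (core G_d = G_2): contract over N_2, giving I_1 with axes (R_1, N_1, M_2).
I1 = np.einsum('bmn,qn->bqm', G2, X)
# Step 2 (core G_1): contract over N_1 and the rank bond R_1, giving y(mu_1, mu_2).
Y = np.einsum('pqb,bqm->pm', G1, I1)

assert np.allclose(Y.reshape(16), W @ x)  # tensorized product equals W @ x
```

No 16 × 16 matrix is ever applied: each step touches only a small core, which is exactly what lets the photonic implementation replace one large MZI mesh with cascaded small-radix meshes.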

#### 2. Single-wavelength implementation

To emulate the tensor product at the TT layer, the first approach is the TONN with a single wavelength (TONN-SW), as shown in Fig. 2(c). At TT-core *k*, where *k* = *d*, …, 1, the input tensor $\boldsymbol{I}_{d-k}$ is represented by $h_k$ groups of $N_k R_k$ optical signals at $\lambda_0$, where $h_k = M_d \cdots M_{k+1} N_{k-1} \cdots N_1$. The TT-core $\boldsymbol{G}_k$ is represented by $h_k$ identical $R_{k-1} M_k \times N_k R_k$ MZI meshes placed side by side. The meshes can be constructed by arranging 2 × 2 MZIs in a "rectangular" configuration, as shown in Fig. 2(b). By sending the input optical signals into the MZI meshes, the product between $\boldsymbol{G}_k$ and $\boldsymbol{I}_{d-k}$ is implemented, and an optical passive cross-connect then switches the indices of $M_k$ and $N_{k-1}$ in the output tensor $\boldsymbol{G}_k \boldsymbol{I}_{d-k}$. The output of the cross-connect is $\boldsymbol{I}_{d-k+1}$, which is the input of the next TT layer.

#### 3. Multi-wavelength implementation

By adding parallelism in the wavelength domain using wavelength division multiplexing (WDM) technology, the TONN with multiple wavelengths (TONN-MW) can save a considerable number of MZIs compared with TONN-SW, as shown in Fig. 2(d). At TT-core *k*, the input tensor $\boldsymbol{I}_{d-k}$ is first encoded with the WDM channels $\lambda_1, \lambda_2, \ldots, \lambda_g$, where $g = N_{d/2} \cdots N_1$. For the first half of the TT-cores $\boldsymbol{G}_d, \ldots, \boldsymbol{G}_{d/2+1}$, only the indices $N_d, \ldots, N_{d/2+1}$ of the input data $\boldsymbol{x}$ are used for multiplication with the TT-cores. Similar to TONN-SW, $h_k$ groups of $N_k R_k$ *g*-wavelength WDM signals first go through $h_k$ identical $R_{k-1} M_k \times N_k R_k$ MZI meshes. Here, $h_k = M_d \cdots M_{k+1} N_{k-1} \cdots N_{d/2+1}$ for $d/2 < k < d$ or $h_k = M_{d/2} \cdots M_{k+1} N_{k-1} \cdots N_1$ for $k \le d/2$. Then, for $k \ne d/2 + 1$, an optical passive cross-connect is used to switch the indices of $M_k$ and $N_{k-1}$ in the output tensor $\boldsymbol{G}_k \boldsymbol{I}_{d-k}$ and gives $\boldsymbol{I}_{d-k+1}$.

At TT-core [*d*/2 + 1], the output of the MZI meshes $\boldsymbol{G}_{d/2+1} \boldsymbol{I}_{d/2-1}$ goes through passive wavelength-space cross-connects that switch the indices between the wavelength domain ($N_{d/2}, \ldots, N_1$) and the space domain ($M_d, \ldots, M_{d/2+1}$). The wavelength-space cross-connects, which can be realized by using WDM transponders, are fixed (no reconfiguration needed) once the TONN architecture is decided. As a result, the output $\boldsymbol{I}_{d/2} \in \mathbb{R}^{R_{d/2} \times N_{d/2} \times M_d \times \cdots \times M_{d/2+1} \times N_{d/2-1} \times \cdots \times N_1}$ is represented by $h_{d/2}$ groups of $N_{d/2} R_{d/2}$ *g*-wavelength WDM signals, where $g = M_d \cdots M_{d/2+1}$. In this way, for the second half of the TT-cores $\boldsymbol{G}_{d/2}, \ldots, \boldsymbol{G}_1$, the indices $N_{d/2}, \ldots, N_1$ of the input data $\boldsymbol{x}$ can be used for multiplication with the TT-cores. Note that *d* is assumed here to be an even number. For odd *d*, the wavelength-space cross-connects happen at TT-core [(*d* + 1)/2].

### D. Comparison between tensorized and conventional ONNs

To evaluate the scalability of the TONN, Table I compares the total number of MZIs and the total number of cascaded stages of MZIs among the conventional ONN, TONN-SW, and TONN-MW. Letting *M* = *N*, *M*_{1} = ⋯ = *M*_{d} = *N*_{1} = ⋯ = *N*_{d} = *N*^{1/d}, and *R*_{0} = ⋯ = *R*_{d} = *R*, both TONN-SW and TONN-MW reduce the total number of cascaded stages from *N* to *dRN*^{1/d} compared with the conventional ONN. The TONN-SW enables all-optical tensor core products without optical-to-electrical-to-optical (O/E/O) conversions; however, its saving in the total number of MZIs is not significant. On the other hand, the TONN-MW reduces the total number of MZIs from *N*(*N* − 1)/2 to *dRN*^{1/2}(*RN*^{1/d} − 1) at the expense of only one layer of O/E/O conversion for the wavelength-space cross-connects in TT-core [*d*/2 + 1].

TABLE I. Comparison of hardware complexity among the conventional ONN, TONN-SW, and TONN-MW.

| | Total number of MZIs | Total number of cascaded stages of MZIs |
|---|---|---|
| Conventional ONN | *N*(*N* − 1)/2 | *N* |
| TONN-SW | *dRN*(*RN*^{1/d} − 1) | *dRN*^{1/d} |
| TONN-MW | *dRN*^{1/2}(*RN*^{1/d} − 1) | *dRN*^{1/d} |


Figures 3(a) and 3(b) compare the total number of MZIs and cascaded stages of MZIs between the conventional ONN and TONN-MW as a function of the radix for *N* ≥ 128. At a radix of *N* = 1024, the conventional ONN has 5.2 × 10^{5} MZIs and 1024 cascaded stages. For *R* = 2 and *N*_{1} = ⋯ = *N*_{d} = *N*^{1/d} = 2, the TONN-MW requires 40 cascaded stages and 1920 MZIs, which are 25.6× and 272.8× fewer than the conventional ONN, respectively.
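The Table I formulas can be evaluated directly at the operating point quoted above (*N* = 1024, *R* = 2, and *N*^{1/d} = 2, i.e., *d* = 10), reproducing the stated reduction factors:

```python
# Hardware-count formulas from Table I at N = 1024, R = 2, N^(1/d) = 2 (d = 10).
N, R, d = 1024, 2, 10
n_root = N ** (1 / d)  # = 2

conv_mzis = N * (N - 1) // 2                        # 523,776 (~5.2e5)
conv_stages = N                                     # 1024

tonn_mw_mzis = d * R * N ** 0.5 * (R * n_root - 1)  # d*R*N^(1/2)*(R*N^(1/d) - 1) = 1920
tonn_stages = d * R * n_root                        # d*R*N^(1/d) = 40

print(round(conv_stages / tonn_stages, 1),          # 25.6x fewer stages
      round(conv_mzis / tonn_mw_mzis, 1))           # 272.8x fewer MZIs
```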

## III. SIMULATIONS FOR TENSORIZED OPTICAL NEURAL NETWORKS

The performance of the TONNs is evaluated by training on the MNIST handwritten digit classification task and comparing with conventional ONNs and two-dimensional Fourier transform (2D-FT) pre-processed ONNs (FT-ONNs).^{10} The conventional ONNs have a configuration similar to that of the TONN but many more parameters (i.e., MZIs) in the synaptic interconnections. As another method to scale ONNs, the FT-ONNs have a total number of MZIs similar to that of the TONNs but fewer neurons in the input layer because the input images are pre-processed. All the compared ONNs are designed with one input layer, one hidden layer, and one output layer.

The conventional ONNs have 784 neurons in the input layer, a varied number (4–2048) of neurons in the single hidden layer, and ten neurons in the output layer. The input data are the 28 × 28 grayscale MNIST images.

The TONN-MWs have 784 neurons in the input layer, 1024 neurons in the hidden layer, and ten neurons in the output layer. The 784 × 1024 and 1024 × 10 synaptic interconnections are factorized as [4 × 7 × 7 × 4 × 4 × 8 × 8 × 4] and [4 × 8 × 8 × 4 × 1 × 5 × 2 × 1] for TT decomposition, respectively. The TT-ranks vary from 1 to 24.

The FT-ONNs have 200 neurons in the input layer, a varied number (20–800) of neurons in the hidden layer, and ten neurons in the output layer. The input grayscale MNIST images are first Fourier transformed and then cropped to 20 × 10. The absolute values of the 200 complex-valued 2D-FT coefficients are used as the input of the FT-ONN.

Training simulations are performed using the TensorFlow and T3F^{39} Python libraries. The backpropagation algorithm trains each ONN with the Adam optimizer for ten epochs. Every neuron has a rectified linear unit activation function, and the categorical cross-entropy loss evaluates the network's performance. The synaptic interconnections are assumed to be ideal (error-free). The impact of hardware imperfections on the performance of the TONNs can be found in Ref. 40. For each of the three ONNs listed above, 20 trials with different random initializations are generated. The maximum test accuracy vs the total number of MZIs and vs the number of cascaded stages of MZIs is plotted in Figs. 4(a) and 4(b), respectively. Considering that the SVD of an *M* × *N* arbitrary matrix gives one *M* × *M* and one *N* × *N* arbitrary unitary matrix, the total number of MZIs and cascaded stages for an *M* × *N* arbitrary weight matrix in the simulation is calculated as *M*(*M* − 1)/2 + *N*(*N* − 1)/2 and *M* + *N*, respectively.

To achieve >95% test accuracy, the TONN-MW requires only 3890 MZIs and 157 cascaded stages, which are 79× and 5.2× fewer than the conventional ONN, respectively. With the same total number of MZIs and cascaded stages, TONN-MWs achieve at least 4.7% higher test accuracy than FT-ONNs.

## IV. TONN-MW ON HETEROGENEOUS III–V-ON-SILICON MOSCAP PLATFORM

### A. Example of a 1024 × 1024 TONN-MW

Figure 5(a) shows an example device architecture of the 1024 × 1024 TONN-MW. Here, 1024 × 1024 is factorized as 8 × 4 × 4 × 8 × 8 × 4 × 4 × 8. Corresponding to Fig. 2(d), the total number of tensor core layers is *d* = 4, and the total number of wavelengths is *g* = 32. The TT-rank is set as *R*_{0} = *R*_{4} = 1 and *R*_{1} = *R*_{2} = *R*_{3} = 2. As a result, the weight matrix is decomposed into four TT-cores with the dimensions $\boldsymbol{G}_4 \in \mathbb{R}^{1 \times 8 \times 4 \times 2}$, $\boldsymbol{G}_3 \in \mathbb{R}^{2 \times 4 \times 4 \times 2}$, $\boldsymbol{G}_2 \in \mathbb{R}^{2 \times 4 \times 4 \times 2}$, and $\boldsymbol{G}_1 \in \mathbb{R}^{2 \times 4 \times 8 \times 1}$. Each TT-core contains four 8 × 8 MZI meshes side by side and cross-connects, leading to sixteen 8 × 8 MZI meshes, 448 MZIs, and 32 cascaded stages of MZIs in total. The input neuron signals are first modulated by thirty-two 32-wavelength WDM microring modulator arrays, then multiplied by each TT-core, and finally detected by thirty-two 32-wavelength WDM microring add-drop filter and detector arrays. The light source is provided by a 32-wavelength comb laser and power splitters. O/E/O conversions and passive electrical cross-connects enable the wavelength-space cross-connects at TT-core 3.
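The device counts quoted above follow from the mesh-size formula of Sec. II A (a "rectangular" *K* × *K* mesh contains *K*(*K* − 1)/2 MZIs and *K* stages), as the quick cross-check below shows:

```python
# Cross-check of the counts for the 1024 x 1024 TONN-MW example:
# d = 4 TT-cores, each built from four 8 x 8 rectangular MZI meshes.
mesh_radix = 8
meshes_per_core = 4
cores = 4

mzis_per_mesh = mesh_radix * (mesh_radix - 1) // 2  # 28 MZIs per 8 x 8 mesh
total_meshes = cores * meshes_per_core              # 16 meshes
total_mzis = total_meshes * mzis_per_mesh           # 448 MZIs
total_stages = cores * mesh_radix                   # 32 cascaded stages

print(total_meshes, total_mzis, total_stages)  # 16 448 32
```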

### B. Heterogeneous III–V-on-silicon MOSCAP platform

The III–V-on-silicon MOSCAP platform is particularly suitable for implementing the TONN-MW since it can heterogeneously integrate all the required devices at wafer scale, including quantum dot (QD) comb lasers,^{27,28} MOSCAP microring modulators,^{53,54} MOSCAP phase shifters (MZIs), and QD^{29} or potentially silicon–germanium (Si–Ge)^{55–57} avalanche photodetectors (APDs). Figure 5(b) shows the schematics of the QD comb laser and MOSCAP microring modulator with a MOSCAP transmission electron microscopy (TEM) image. The QD comb laser has a record-wide 10-dB comb width of 25 nm, which means one laser can accommodate more than 280 wavelengths with 15.5 GHz spacing, as shown in Fig. 5(c). Thus, it is feasible to realize the 32-wavelength stream for the 1024 × 1024 TONN-MW. The experimental demonstration of a MOSCAP microring modulator shows good static and dynamic extinction ratios at 28 Gb/s,^{53} as shown in Fig. 5(e). Experimental results showed that the MOSCAP offers 10^{9}× higher wavelength tuning energy efficiency (Δwavelength/Δtuning power at the nm/pW level) than thermally tuned resonators (nm/mW level), and simulations show *V*_{π}*L* = 0.09 V cm, enabling compact 400 *μ*m-long MZIs [Fig. 5(f)].^{58} Furthermore, the recently experiment-validated technique of tapping a lock-in signal through the MOSCAP to read the optical-intensity-dependent waveguide conductance provides hitless optical monitoring^{59} and eliminates all monitoring PDs for modulators, MZIs, and filters. This results in an enormous reduction in footprint, loss, metal pads, and, subsequently, packaging complexity. Figure 5(d) shows the schematic of a high-responsivity Si–Ge waveguide APD enhanced by loop reflectors.^{57} Si–Ge APDs can offer an exceptional sensitivity of −30 dBm at 10 Gb/s [Fig. 5(g)],^{55–57} lowering the required total optical output power and, subsequently, the wall-plug power of the light source (laser and optional amplifier) by a factor of 33.
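The claimed wavelength count can be sanity-checked from the comb width. The ~1300 nm center wavelength assumed below (O-band operation, typical for QD comb lasers on silicon) is an assumption for the estimate, not a value stated here:

```python
# Back-of-envelope check that a 25 nm-wide comb supports >280 lines at
# 15.5 GHz spacing, assuming a ~1300 nm (O-band) center wavelength.
c = 3e8            # speed of light, m/s
lam = 1.3e-6       # assumed center wavelength, m
comb_width_m = 25e-9

# Convert the comb width from wavelength to frequency: df = c * dlam / lam^2.
comb_width_hz = c * comb_width_m / lam**2   # ~4.4 THz
n_lines = comb_width_hz / 15.5e9            # ~286 comb lines
```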

### C. Footprint-energy efficiency

Here, the footprint-energy efficiency^{12,30} [(MAC/J) (MAC/s/mm^{2})], defined as the product of the workload energy efficiency (MAC/J) and the throughput per footprint area (MAC/s/mm^{2}), is used as the figure of merit (FOM). Table II compares the synapses per neuron (i.e., the scale of the ANN), workload energy efficiency, and footprint-energy efficiency among different ANN hardware technologies, including analog electronics,^{2,5,41,42} digital electronics,^{6,43–46} and photonics.^{47–52} The scale of the photonic implementations is set to 1024 in order to compete with the electrical ANNs. Since 1024 × 1024 integrated conventional ONNs are impractical, the TONN architecture is assumed for all the photonic technologies.

TABLE II. Comparison of the scale, workload energy efficiency, and footprint-energy efficiency of ANN hardware technologies.

| Category | Technology | Synapses per neuron | MAC/J | Figure of merit [(MAC/J) (MAC/s/mm^{2})] |
|---|---|---|---|---|
| Analog electronics | NeuroGrid^{5} | 4096 | 9 × 10^{9} | 8 × 10^{15} |
| | HiCANN^{41} | 224 | 4.7 × 10^{9} | 2 × 10^{17} |
| | TrueNorth^{2} | 256 | 3.3 × 10^{12} | >8 × 10^{18} |
| | Flash memory^{42} | 100 | 1.4 × 10^{14} | 2.6 × 10^{27} |
| Digital electronics | NVIDIA Volta^{6} | 4096 | 2 × 10^{11} | 2 × 10^{22} |
| | Google TPU^{43} | Multiple of 128 | 5.8 × 10^{11} | 9.5 × 10^{22} |
| | Graphcore^{44} | 1216 | 2 × 10^{12} | 7 × 10^{23} |
| | Memristor crossbar^{45} | 256 | 6.9 × 10^{12} | 6.9 × 10^{23} |
| | Groq^{46} | 640 | 3.3 × 10^{12} | 3 × 10^{24} |
| Photonics | SiPh^{47–51} TONN | 1024 | 7.3 × 10^{11} | 9.8 × 10^{25} |
| | PCM^{52} TONN | 1024 | 1.1 × 10^{12} | 1.4 × 10^{26} |
| | MOSCAP TONN | 1024 | 6.5 × 10^{14} | 4.1 × 10^{28} |


The energy consumption of the photonic TONNs under static operation mainly consists of five parts: laser wall-plug power, microring modulator power, MZI mesh power, microring add-drop filter power, and PD receiver power. The MOSCAP platform is assumed to use QD comb lasers,^{27,28} MOSCAP microring modulators,^{53,54} MOSCAP MZIs, and Si–Ge APDs.^{55–57} The silicon photonic (SiPh) and phase-change material (PCM) platforms are assumed to use distributed-feedback (DFB) lasers,^{48} silicon photonic microring modulators,^{49} SiPh^{50} (or PCM^{52}) MZIs, and Ge PIN PDs.^{49,51}

For the proposed heterogeneous III–V-on-silicon MOSCAP platform, the laser wall-plug power is mainly determined by the PD sensitivity ($SENS_{\mathrm{PD}}$), the power margin ($Power\_margin$), the total optical insertion loss ($IL_{\mathrm{total}}$), and the wall-plug efficiency ($\eta$) through $P_{\mathrm{Laser}} = 10^{(SENS_{\mathrm{PD}} + Power\_margin + IL_{\mathrm{total}})/10}/\eta$. The energy stored in the MOSCAP microring modulator can be calculated by $CV_{pp}^{2}/2$, where *C* is the capacitance and $V_{pp}$ is the peak-to-peak voltage swing. Assuming a non-return-to-zero (NRZ) modulation pattern, the probability of charging the MOSCAP modulator is 0.25.^{60} The driver power consumption is assumed to be equal to that of the modulator.^{61} Thus, the power consumption of the MOSCAP microring modulator can be calculated by $P_{\mathrm{Modulator}} = BCV_{pp}^{2}/4$, where *B* is the modulation bandwidth. With MOSCAP phase tuning, the static power consumption of the MZI meshes and microring add-drop filters is considered to be zero. The total number of multiply-accumulate (MAC) operations per second of an *N* × *N* TONN-MW is *BN*^{2}. For the 1024 × 1024 TONN-MW in Fig. 5(a), assuming *B* = 10 GHz, the total power consumption is calculated to be 15.79 W, which corresponds to 6.5 × 10^{14} MAC/J. With an estimated area of 165 mm^{2} (a detailed breakdown can be found in the Appendix), the computing throughput per unit area is calculated to be 6.4 × 10^{13} MAC/s/mm^{2}. Thus, the footprint-energy efficiency of a 1024 × 1024 TONN-MW is 4.1 × 10^{28} (MAC/J) (MAC/s/mm^{2}). The detailed calculations can be found in the Appendix.

With the footprint-energy efficiency as the FOM, the MOSCAP TONN outperforms the digital electronics^{6,43–46} and most of the analog electronics^{2,5,41} technologies by factors of 1.4 × 10^{4} and 5.1 × 10^{9}, respectively. As shown in Table II, our architecture exhibits a 10× larger number of synapses per neuron and a 15.8× higher FOM compared with analog flash memory technology.^{42} Compared with TONNs on the other photonic technologies, including SiPh and PCM, the MOSCAP TONN improves the footprint-energy efficiency by 290× for the following reasons: (1) heterogeneous integration of all the optical components eliminates the coupling loss between discrete chips; (2) MOSCAP MZIs and microring add-drop filters have negligible static phase tuning energy consumption, while SiPh technologies require power to maintain the phase tuning; and (3) the high-sensitivity APD significantly reduces the required laser wall-plug power.

### D. 3D co-package

The footprint of the proposed 1024 × 1024 MOSCAP TONN can be 27 × 6 mm^{2} in a folded configuration. Thus, integrating the 1024 × 1024 TONN in a single photonic die is achievable, considering that the maximum field size of a deep ultraviolet (DUV) stepper is 27 × 22 mm^{2}. The total number of MZIs in the proposed 1024 × 1024 MOSCAP TONN is 4.5× smaller than in Lightmatter's 64 × 64 chip.^{7} The packaging of photonic tensor cores, CMOS driver chips, and memory chips can be realized by three-dimensional (3D) co-package technology. Recently, several foundries (e.g., AIM Photonics^{62}) have started to provide SiPh interposer services that co-package SiPh and electronic integrated circuits in a 3D stack using through-silicon vias (TSVs). The 3D co-package technology^{63} brings CMOS chips closer to the compute nodes (photonic TONN chips) so that the integration density and energy efficiency can be further increased.

## V. CONCLUSIONS

This paper proposes a scalable, energy-efficient, and compact TONN architecture on the integrated III–V-on-silicon MOSCAP platform. Based on TT decomposition, high-radix (e.g., 1024 × 1024) synaptic interconnections can be enabled by cascaded small-radix (e.g., 8 × 8) photonic tensor-train cores. The detailed TONN device architectures with single or multiple wavelengths are discussed. Simulation experiments show that the TONN-MW uses 79× fewer MZIs and 5.2× fewer cascaded stages of MZIs than the conventional ONN while maintaining a >95% training accuracy for MNIST handwritten digit classification tasks. Furthermore, with the footprint-energy efficiency as the figure of merit, the proposed TONN-MW on the heterogeneous III–V-on-silicon MOSCAP platform outperforms digital electronics and other photonic technologies by factors of 1.4 × 10^{4} and 2.9 × 10^{2}, respectively. Our proposed architecture points out a road map for the future physical implementation of ONNs scaling up to 1024 × 1024 and beyond, with significantly reduced hardware requirements and ultra-high energy efficiency.

## ACKNOWLEDGMENTS

This work was supported, in part, by the AFOSR under Grant No. FA9550-181-1-0186. The authors would like to thank Geza Kurczveil, Sudharsanan Srinivasan, Stanley Cheung, and Yuan Yuan from Hewlett Packard Labs for providing photonic device parameters. The authors would also like to thank Kaiqi Zhang from the University of California, Santa Barbara, for discussions on tensor-train decomposition.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

### APPENDIX: FOOTPRINT-ENERGY EFFICIENCY CALCULATION

Table III lists the parameters used for the footprint-energy efficiency calculation. The total optical insertion loss for the first half of the 1024 × 1024 TONN-MW [Fig. 5(a)] can be calculated by summing, along the optical path, the component insertion losses listed in Table III.

TABLE III. Parameters used for the footprint-energy efficiency calculation.

| Parameter | Symbol | MOSCAP | SiPh | PCM |
|---|---|---|---|---|
| Data rate per wavelength | B | 10 Gb/s | 10 Gb/s | 10 Gb/s |
| Laser wall-plug efficiency | η | 10% | 7.1%^{48} | 7.1%^{48} |
| Laser coupling insertion loss | IL_{laser_coupling} | 0 dB | 3.9 dB^{47} | 3.9 dB^{47} |
| Microring modulator insertion loss | IL_{ring} | 1 dB^{53,54} | 3.9 dB^{49} | 3.9 dB^{49} |
| Microring modulator extinction ratio | EXT | 5.5 dB^{53,54} | 4.2 dB^{49} | 4.2 dB^{49} |
| Power penalty by modulator extinction ratio | Penalty_{ext} | 2.5 dB | 3.5 dB | 3.5 dB |
| Microring modulator off-resonance through-port insertion loss | IL_{ring_off} | 0.1 dB | 0.1 dB | 0.1 dB |
| Microring modulator power consumption | P_{ring} | 1.3 mW^{53,54} | 1.54 mW^{49} | 1.54 mW^{49} |
| MZI insertion loss | IL_{MZI} | 0.77 dB | 1.1 dB^{50} | 1 dB^{52} |
| MZI phase shifter length | L_{phase_shifter} | 500 μm | 35 μm^{50} | 35 μm^{52} |
| MZI static power consumption | P_{MZI} | 0 mW | 56 mW^{50} | 0 mW |
| Waveguide crossing insertion loss | IL_{crossing} | 0.017 dB^{64} | 0.017 dB^{64} | 0.017 dB^{64} |
| Microring add-drop filter insertion loss | IL_{ring_filter} | 0.2 dB | 0.2 dB | 0.2 dB |
| Microring add-drop filter off-resonance through-port insertion loss | IL_{filter_off} | 0.1 dB | 0.1 dB | 0.1 dB |
| Waveguide loss | IL_{wg} | 2 dB | 2 dB | 2 dB |
| Power margin | Power_margin | 3 dB | 3 dB | 3 dB |
| Photodetector sensitivity at 10 Gb/s | SENS_{PD} | −30 dBm^{55–57} | −13.9 dBm^{51} | −13.9 dBm^{51} |
| Photodetector power consumption | P_{PD} | 0.5 mW^{65} | 0.75 mW^{49} | 0.75 mW^{49} |


Here, the worst-case number of waveguide crossings is considered. The power penalty induced by the microring modulator extinction ratio is calculated by

$$\mathrm{Penalty_{ext}} = 10\log_{10}\left(\frac{r_{ext}+1}{r_{ext}-1}\right),$$

where $r_{ext} = 10^{\mathrm{EXT}/10}$ is the linear extinction ratio.
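The extinction-ratio penalty can be checked numerically against the values in Table III; the following is a minimal sketch using the standard on-off-keying penalty formula (function name is illustrative):

```python
import math

def extinction_ratio_penalty_db(ext_db: float) -> float:
    """Power penalty (dB) from a finite modulator extinction ratio.

    Standard OOK penalty: 10*log10((r + 1) / (r - 1)),
    where r is the linear extinction ratio.
    """
    r = 10 ** (ext_db / 10)  # convert EXT from dB to a linear ratio
    return 10 * math.log10((r + 1) / (r - 1))

# Table III: MOSCAP (EXT = 5.5 dB) and SiPh/PCM (EXT = 4.2 dB)
print(round(extinction_ratio_penalty_db(5.5), 1))  # 2.5 dB
print(round(extinction_ratio_penalty_db(4.2), 1))  # 3.5 dB
```

Both results reproduce the Penalty_{ext} entries in Table III.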

The required comb laser wall-plug power per wavelength is then obtained by adding the total optical loss to the photodetector sensitivity SENS_{PD} and dividing the resulting optical power by the laser wall-plug efficiency η.
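This link-budget step can be sketched as follows; the 20 dB total link loss used in the usage line is a hypothetical placeholder, not a value from the paper:

```python
def laser_wallplug_mw(sens_pd_dbm: float, total_loss_db: float, wpe: float) -> float:
    """Required comb-laser wall-plug power per wavelength (mW).

    The laser's optical output must exceed the photodetector sensitivity
    by the total link loss; dividing that optical power by the wall-plug
    efficiency gives the electrical input power.
    """
    required_optical_dbm = sens_pd_dbm + total_loss_db  # link budget in dB
    required_optical_mw = 10 ** (required_optical_dbm / 10)  # dBm -> mW
    return required_optical_mw / wpe

# Illustrative only: assumed 20 dB total link loss with the MOSCAP
# column values (SENS_PD = -30 dBm, eta = 10%).
print(laser_wallplug_mw(-30.0, 20.0, 0.10))  # 1.0 mW
```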

By adding up the power consumption of the second half of the TONN-MW, the total power consumption per wavelength is

The estimated footprint of the MOSCAP TONN-MW is 55 × 3 mm^{2} = 165 mm^{2}, which contains 45 × 3 mm^{2} for the MZI meshes, 1 × 3 mm^{2} for the QD comb laser and power splitters, 4 × 3 mm^{2} for the MOSCAP microring modulator arrays, 4 × 3 mm^{2} for the MOSCAP microring add-drop filter and APD arrays, and 1 × 3 mm^{2} for the electrical cross-connects.

## REFERENCES
