Digital accelerators in the latest generation of complementary metal–oxide–semiconductor processes support multiply-and-accumulate (MAC) operations at energy efficiencies spanning 10–100 fJ/Op. However, the operating speed for such MAC operations is often limited to a few hundred MHz. Optical or optoelectronic MAC operations on today’s SOI-based silicon photonic integrated circuit platforms can be realized at a speed of tens of GHz, leading to much lower latency and higher throughput. In this Perspective, we study the energy efficiency of integrated silicon photonic MAC circuits based on Mach–Zehnder modulators and microring resonators. We describe the bounds on energy efficiency and scaling limits for *N* × *N* optical networks with today’s technology based on the optical and electrical link budget. We also describe research directions that can overcome the current limitations.

## I. INTRODUCTION

Vector matrix multiplication operations represent the core of artificial neural networks (ANNs) and other computing applications of hardware accelerators. ANNs are realized in digital complementary metal–oxide–semiconductor (CMOS) circuits with multiple processing elements implementing multiply and accumulate (MAC) operations, which calculate the product of two numbers and add the result to an accumulator.^{1} The processing elements can be arranged in a systolic architecture, where data are passed through connected processing elements in a rhythmic sequence, to perform MAC operations either spatially or temporally over several clock cycles.^{2}

Integrated silicon photonics (SiP) circuits have been widely employed in high-speed links to move data at a rate of tens of Gb/s, where optical modulation is more efficient than electronic switching for transmitting data over significant distances.^{3} Optical modulation can be realized using Mach–Zehnder modulators (MZMs) or microring modulators (MRMs).^{4} MZMs are broadband and easily support complex modulation schemes.^{3} MRMs have a significantly smaller footprint and lower driver power consumption.^{5} As a technology, the current generation of SiP has now matured, with high-volume shipments for datacenter transceivers from companies such as Intel and Cisco. SiP circuits comprising Mach–Zehnder interferometers (MZIs) or microring resonators (MRRs) have also been used for other applications, such as high-speed optical switches and filters.^{6–10}

SiP is also being used for computing applications,^{11–16} where devices such as MZIs, MZMs, MRRs, and MRMs are used for computation in the optical analog domain. These encompass inference and training accelerators used for machine learning and neuromorphic computing applications where convolution takes 80% of the total processing time.^{17–19} Other integrated optical configurations implemented using field-programmable photonic arrays were shown to carry out linear transformations for signal processing and control.^{20,21} Linear transformation circuits are also employed in Ising machines^{22} and photonic quantum computing processors.^{23}

In this Perspective, we describe the advantages and challenges of implementing MAC operations using SiP and comment on how to address them. This Perspective is organized as follows: Secs. II and III describe the link budget, energy efficiency, and scaling opportunities for SiP MAC operations implemented with MZM and MRM, respectively. Section IV explores the possible approaches to further improve SiP MAC systems. It introduces the ongoing research in the field of SiP that when fully realized will lead to significant changes in the field of optical computing and communication. Section V concludes this Perspective.

## II. MZM-BASED Si-PHOTONIC IMPLEMENTATION

### A. System architecture

Figure 1 illustrates an MZM-based SiP implementation of an optical accelerator. Light from off-chip lasers is guided by a polarization-maintaining (PM) single mode fiber (SMF), coupled to the SiP chip via an edge coupler, and then split into *N* parts. These parts are modulated by an array of *N* MZI modulators and fed into an *N* × *N* weight transformation (multiplication) matrix, *W*_{N×N}. The optical intensities at the MAC outputs, *Y*_{N×1}, are thus described by the multiplication product of the input vector, *V*_{N×1}, and the weight matrix, *W*_{N×N}, as

$$Y_{N\times 1} = W_{N\times N}\,V_{N\times 1}. \qquad (1)$$

In other words, the weight matrix performs linear transformation for the input vector, *V*_{N×1}, and delivers *N* outputs, *Y*_{N×1}, that are routed to an array of photodetectors (PDs). These PDs are then connected to the electrical components, including transimpedance amplifiers (TIAs), main amplifiers, and sense amplifier-based comparators, which are either built in a separate CMOS/BiCMOS chip or monolithically integrated with the SiP devices in the same process.
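As a minimal numerical sketch of the vector matrix product described above, the snippet below computes the output vector from a weight matrix and an input vector. The function name `mac_outputs` and the 2 × 2 values are illustrative assumptions and are not taken from the source.

```python
def mac_outputs(weights, inputs):
    """Return Y = W V for an N x N weight matrix and a length-N input vector."""
    n = len(inputs)
    if len(weights) != n or any(len(row) != n for row in weights):
        raise ValueError("weights must be N x N and inputs must have length N")
    # Each output is one multiply-and-accumulate chain over the N inputs.
    return [sum(w * v for w, v in zip(row, inputs)) for row in weights]


# Illustrative (hypothetical) weights and input intensities.
W = [[0.5, 0.5], [0.25, 0.75]]
V = [1.0, 2.0]
Y = mac_outputs(W, V)
print(Y)
```

Each output element corresponds to the intensity arriving at one photodetector in the architecture of Fig. 1.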

Singular value decomposition (SVD) is an effective approach to represent a given matrix as a factorization of multiple matrices.^{24,25} SVD decomposes a real matrix into a product of unitary matrices and a diagonal matrix. This is useful in the experimental realization of an *N* × *N* matrix topology in which the sequential product of rotation matrices represents the sequential arrangement of linear transformation units in the overall matrix grid.^{26} A 2 × 2 linear transformation unit in the whole grid arrangement is practically implemented using a tunable beam splitter (TBS), as seen in Fig. 1.^{6,27} A TBS comprises an MZI with a phase shifter (*θ*) in at least one of the arms, along with either an outer phase shifter (*ϕ*) or a tunable directional coupler.^{28,29} The transfer function for a single TBS can be described using the matrix representations for ideal 50:50 beam splitters and lossless phase shifters as

Thus, any arbitrary light redistribution can be obtained by changing *θ* and *ϕ*.
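The following sketch builds a 2 × 2 TBS transfer matrix from ideal 50:50 beam splitters and lossless phase shifters, and reports how input power on port 1 divides between the two outputs. The beam splitter convention, the placement of *θ* and *ϕ* on the upper arm, and the function names are assumptions chosen for illustration; the source may use a different but equivalent convention.

```python
import cmath
import math


def matmul2(a, b):
    """Multiply two 2 x 2 complex matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def tbs_matrix(theta, phi):
    """Transfer matrix BS * diag(e^{i theta}, 1) * BS * diag(e^{i phi}, 1)."""
    s = 1 / math.sqrt(2)
    bs = [[s, 1j * s], [1j * s, s]]                  # ideal 50:50 beam splitter (assumed convention)
    inner = [[cmath.exp(1j * theta), 0], [0, 1]]     # phase shifter inside the MZI
    outer = [[cmath.exp(1j * phi), 0], [0, 1]]       # outer phase shifter at the input
    return matmul2(matmul2(matmul2(bs, inner), bs), outer)


def split_ratio(theta, phi=0.0):
    """Fractions of port-1 input power leaving on output ports 1 and 2."""
    t = tbs_matrix(theta, phi)
    return abs(t[0][0]) ** 2, abs(t[1][0]) ** 2


print(split_ratio(math.pi / 2))
```

In this convention the two power fractions are sin²(*θ*/2) and cos²(*θ*/2), so they always sum to one for the lossless case, and *ϕ* changes only the relative phase.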

The SVD of the weight matrix can be described as

where *D* is a diagonal matrix and *T*_{m,n} represents the transformation matrix for a 2 × 2 node between two input terminals, *m* and *n*, within the *N* × *N* multiplication matrix,^{27} given as

### B. Optical network link budget

The photonic components shown in Fig. 1 are simulated in Cadence Spectre for 8 × 8 and 32 × 32 transformation matrix sizes to verify the optical link budget analysis. The optical components are modeled in Verilog-A to enable electronics–photonics co-simulation.^{30,31} The models of some of the components used in this Perspective are derived from Ref. 31, with some modifications to account for the laser electrical power consumption, the wall plug efficiency, *η*_{WPE}, link losses, etc. The optical power of the laser is set to 0 dBm to estimate the optical power at the output terminals of the network, incident on the PDs. Coupling light to the SiP chip introduces a loss in the range of 0.6–3 dB, depending on the coupling scheme used.^{32,33} The overall coupling loss from the laser to the SiP chip is collectively estimated as 1.6 dB, considering possible optimizations in the coupling efficiency. A value of 1.6 dB is also a realistic estimate for photonic wire bonds (PWBs), an emerging technology that involves writing three-dimensional waveguides in a photosensitive polymer. PWBs have demonstrated efficient interfacing between external sources and the silicon waveguide, with coupling losses ranging from 0.4 to 1.7 dB.^{34–38}

For an input vector size of *N*, light passes through log_{2} *N* splitters before modulation, resulting in a total insertion loss of 10 log_{10} *N* + *EL*_{splitter} · log_{2} *N* dB, where *EL*_{splitter} is the estimated excess loss for a single splitter and ranges between 0.01 and 0.5 dB.^{39–43} The estimated excess loss for beam splitters and combiners in this study is 0.01 dB. The overall attenuation in the silicon waveguide is a function of the depth of the network. The MZM-based implementation is based on Clements’ arrangement, which is composed of beam splitters and phase shifters that can be programmed to implement linear transformation.^{27} With an MZM representing one node in Clements’ arrangement,^{27} the waveguide attenuation can be approximated as *Nη*_{wg}*L*_{MZI}, where *η*_{wg} represents the optical intensity attenuation in the Si waveguide and *L*_{MZI} is the length of an MZM arm, with the chosen values of 3 dB/cm and 0.5 mm, respectively. The insertion loss introduced by an MZM’s PN phase shifter is approximated as 1 dB/mm.^{41}
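The two loss terms in this paragraph can be evaluated directly. The sketch below uses the stated values (0.01 dB splitter excess loss, 3 dB/cm = 0.3 dB/mm waveguide attenuation, 0.5 mm arm length); the function names are illustrative assumptions.

```python
import math


def splitter_tree_loss_db(n, excess_loss_db=0.01):
    """Splitting loss 10*log10(N) plus excess loss over log2(N) splitter stages."""
    return 10 * math.log10(n) + excess_loss_db * math.log2(n)


def waveguide_loss_db(n, eta_wg_db_per_mm=0.3, l_mzi_mm=0.5):
    """Waveguide attenuation approximated as N * eta_wg * L_MZI."""
    return n * eta_wg_db_per_mm * l_mzi_mm


for n in (8, 32):
    print(n, round(splitter_tree_loss_db(n), 2), round(waveguide_loss_db(n), 2))
```

The splitting term grows logarithmically with *N*, while the waveguide term grows linearly in dB, so the latter eventually dominates for deep meshes.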

The insertion loss through a node is dependent on the excess loss for cross and through states, *EL*_{cross} and *EL*_{thru}, respectively, both of which are typically <1 dB.^{44,45} For simplicity, two phase shifters connected by using 3 dB adiabatic directional couplers (*EL*_{DC} ∼ 0.1 dB)^{39} are assumed in this work for analysis.

To study the optical attenuation of the 8 × 8 MZM implementation shown in Fig. 1 and verify its functionality, all inputs, except for the uppermost terminal (*m* = 1), are driven by a *V*_{π} voltage that creates a *π* phase shift difference between their MZM arms and nulls their outputs. As light is set to propagate through the uppermost input terminal, the optical depth is defined by the route passing through the diagonal TBS nodes with *i* = *j*.

For a rectangular mesh arrangement, the matrix optical depth is equal to *N* with a total number of $N(N-1)/2$ optical crossings.^{26,27} Hence, the 8 × 8 matrix implementation shown in Fig. 1 has an optical depth of 8 with 28 crossings.

The total optical link budgets are calculated based on Eq. (5), where *P*_{SMF-att}, *P*_{EC-IL}, *P*_{Si-att}, *P*_{splitter-IL,EL}, *P*_{PS-IL}, *P*_{DC-IL}, and *P*_{penalty} represent the attenuation introduced by the SMF, fiber-to-chip coupling loss, silicon waveguide attenuation, splitter insertion and excess loss, phase shifters’ insertion loss, total adiabatic coupling insertion loss, and network penalty, respectively. The network penalty takes into account further impairments due to extinction ratio, crosstalk, intersymbol interference (ISI), and laser relative intensity noise (RIN), which is caused by the random spontaneous emission over time,^{4,46,47}

Figure 2 shows the calculated optical power throughout *N* × *N* networks with different input vector sizes. The optical power of the laser is set to 0 dBm for ease of illustration. Besides the attenuation due to the splitting, the losses introduced by the optical components in the multiplication matrix (i.e., directional couplers and phase shifters) limit the scaling of the network because the optical intensities reaching the outputs are highly attenuated. Figure 3 shows the optical intensities required at the analog front-end (AFE) to detect a signal with a resolution of *n*_{i/p} bits. This is obtained by representing the desired output signal and current noises in terms of the received optical intensity, as given in Eq. (8). The blue dotted lines represent the optical power at the matrix outputs and the corresponding bit resolution for a laser intensity of 10 dBm. It can be shown that the maximum achievable matrix size is ∼35 × 35 for binary networks operating at *DR* = 10 GS/s. We revisit this calculation in Sec. II D.
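A rough sketch of how such an output power estimate can be assembled is shown below: all losses are summed in dB and subtracted from the laser power. The per-node counts (one phase shifter and two directional couplers per node along an optical depth of *N*), the default penalty of 0 dB, and the function name are assumptions made for illustration only; they are not the exact composition of Eq. (5), and the results should not be expected to reproduce Fig. 2.

```python
import math


def output_power_dbm(n, laser_dbm=0.0, coupling_db=1.6, el_splitter_db=0.01,
                     wg_db_per_mm=0.3, l_mzi_mm=0.5, ps_db_per_mm=1.0,
                     dc_db=0.01, penalty_db=0.0):
    """Estimate the optical power at one output of an N x N MZM mesh (dBm)."""
    split = 10 * math.log10(n) + el_splitter_db * math.log2(n)
    waveguide = n * wg_db_per_mm * l_mzi_mm
    input_mzm = ps_db_per_mm * l_mzi_mm      # phase shifter of the input modulator
    mesh_ps = n * ps_db_per_mm * l_mzi_mm    # assumed: one phase shifter per node, depth N
    mesh_dc = 2 * n * dc_db                  # assumed: two couplers per node, depth N
    total_loss = (coupling_db + split + waveguide + input_mzm
                  + mesh_ps + mesh_dc + penalty_db)
    return laser_dbm - total_loss


for n in (8, 16, 32):
    print(n, round(output_power_dbm(n), 2))
```

Because the mesh terms scale linearly with *N* in dB, the received power falls off exponentially in linear units, which is the scaling limitation discussed in the text.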

### C. Energy efficiency

The total electrical power dissipation of the whole network comprises the power consumed by the laser, input modulators, thermo-optic tuning of the matrix phase shifters, and the AFE, including PDs. Accordingly, for a configuration of size *N* × *N* operating at a data rate of *DR*, the energy efficiency (J/Op) can be calculated as

where *P*_{laser}, *P*_{i/p-drivers}, *P*_{mem-interface}, *P*_{mat-tuning}, and *P*_{o/p-AFE} represent the electrical power dissipated due to the laser, input modulator drivers, data fetch interfacing circuits, matrix tuning, and the output AFE circuits, respectively. *P*_{SOA} represents the electrical power dissipated if a semiconductor optical amplifier (SOA) is used to recover the loss.

The factor *γ* refers to the energy efficiency enhancement. It can be represented as $\gamma = \rho_{\mathrm{opt}}^{2}\,\rho_{\mathrm{SOA}}$, where *ρ*_{opt} represents the energy scaling due to the loss of precision factor, and will be described later in Sec. II D. *ρ*_{SOA} represents the efficiency enhancement due to an SOA. Assuming an SOA introducing a gain of *η*_{SOA} (in dB), the corresponding enhancement is $\rho_{\mathrm{SOA}} = 10^{\eta_{\mathrm{SOA}}/10}$. The use of an SOA is discussed later in Sec. IV, but it can be inferred from Eq. (6) that the use of an SOA always degrades the overall energy efficiency.

The amount of power dissipated by the laser is represented in terms of the laser’s wall plug efficiency, *η*_{WPE}, the optical insertion losses introduced by the SMF fiber, *IL*_{SMF}, the fiber to chip coupling *IL*_{EC}, the silicon waveguide loss, *IL*_{WG}, the input MZM loss, *IL*_{i/p-MZM}, the weight phase shifter loss, *IL*_{weight-PS}, the directional coupler loss, *IL*_{DC}, and the receiver’s PD sensitivity, *P*_{PD-opt}, as given in the following equation:

The total length of the waveguide was roughly approximated as the length spanning the optical depth of the matrix only. The output sensitivity is solved based on the targeted bit resolution, *n*_{i/p}, and the total noise at the output front-end due to the photodetector shot noise, dark current *I*_{d}, thermal noise, and laser relative intensity noise (RIN), as given in Eq. (8), with values reported in Table I. The parameters *R*, *R*_{L}, *k*, and *T* represent the PD responsivity, output load resistance, Boltzmann constant, and the absolute temperature,

Parameter | Description | Value |
---|---|---|
P_{laser} | Laser power intensity | 10 dBm |
R | PD responsivity | 1 A/W^{48} |
R_{L} | Load resistance | 50 Ω |
I_{d} | Dark current | 35 nA^{48} |
T | Absolute temperature | 300 K |
DR | Data rate | 10 GS/s |
B_{o} | Optical bandwidth | 25 GHz |
B_{e} | Electrical bandwidth | DR/2 (GHz) |
λ | Wavelength | 1550 nm |
RIN | Relative intensity noise | −140 dB/Hz^{46,47} |
WPE | Wall plug efficiency | 10% |
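To give a feel for how the listed noise sources compare, the sketch below evaluates textbook expressions for shot, thermal, and RIN noise currents using the parameter values above and returns an electrical signal-to-noise ratio. These are standard receiver formulas used here as an assumption; they are not claimed to be the exact form of Eq. (8), and the function name is illustrative.

```python
import math

Q_E = 1.602176634e-19   # elementary charge (C)
K_B = 1.380649e-23      # Boltzmann constant (J/K)


def snr_db(p_opt_dbm, resp=1.0, r_load=50.0, i_dark=35e-9, temp=300.0,
           data_rate=10e9, rin_db_hz=-140.0):
    """Electrical SNR (dB) at the PD for a given received optical power (dBm)."""
    p_opt = 1e-3 * 10 ** (p_opt_dbm / 10)         # received optical power (W)
    b_e = data_rate / 2                           # electrical bandwidth, DR/2
    i_sig = resp * p_opt                          # signal photocurrent (A)
    shot = 2 * Q_E * (i_sig + i_dark) * b_e       # shot noise incl. dark current (A^2)
    thermal = 4 * K_B * temp * b_e / r_load       # thermal noise of the load (A^2)
    rin = 10 ** (rin_db_hz / 10) * i_sig ** 2 * b_e   # laser RIN (A^2)
    return 10 * math.log10(i_sig ** 2 / (shot + thermal + rin))


for p in (-30, -20, -10):
    print(p, round(snr_db(p), 1))
```

With a 50 Ω load, thermal noise dominates at low received power under these assumptions, which is one reason higher PD responsivity or gain in front of the load relaxes the required laser power.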


The MZM drivers consume power that scales linearly with *N* as *P*_{mod-driver} = *N* · *DR* · *E*_{MZM-driver}, where *E*_{MZM-driver} represents the energy efficiency of the MZM driver. For binary resolution, the power consumed by the drivers and AFEs is extracted from recent work on PAM2 and is typically in the range of ∼2 pJ/b.^{49}

The matrix weights are tuned using thermo-optic phase shifters (TO-PSs). Doped Si heaters on the SOI platform typically dissipate ∼20 mW for a *π*-shift.^{6,50,51} The efficiency of TO-PSs can be improved using other heater materials such as TiN, substrate undercut to improve insulation, and deep trenches to reduce thermal crosstalk.^{50,52} This can be shown to significantly improve the overall energy efficiency of the network, as illustrated in Fig. 5. Assuming uniformly distributed weights, the expected energy consumption of the thermo-optic phase shifter is $E_{\mathrm{TO\text{-}PS}} = \frac{1}{P_{\pi}} \int_{0}^{P_{\pi}} P_{\mathrm{heater}}\, dP_{\mathrm{heater}} = \frac{P_{\pi}}{2}$, where *P*_{π} denotes the amount of electrical power required to create a phase shift of *π*. Calculating for all the nodes in Clements’ topology, the total average tuning power is $\frac{N(N-1)}{4} P_{\pi}$.
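The averaging argument above can be checked numerically. The sketch below integrates the heater power over a uniform distribution and multiplies the mean by the N(N − 1)/2 nodes of the mesh; the function names and the choice of a midpoint rule are illustrative assumptions.

```python
def mean_heater_power_mw(p_pi_mw, steps=10_000):
    """Midpoint-rule estimate of (1/P_pi) * integral of P dP from 0 to P_pi."""
    dp = p_pi_mw / steps
    integral = sum((k + 0.5) * dp for k in range(steps)) * dp
    return integral / p_pi_mw


def mesh_tuning_power_mw(n, p_pi_mw):
    """Average tuning power: N(N-1)/2 nodes, each dissipating P_pi/2 on average."""
    nodes = n * (n - 1) // 2
    return nodes * p_pi_mw / 2


print(mean_heater_power_mw(20.0))
print(mesh_tuning_power_mw(8, 20.0), mesh_tuning_power_mw(32, 20.0))
```

The quadratic growth of this static term with *N* is why the heater efficiency has such a strong influence on the overall energy per operation.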

After optical processing, the optical data need to be converted to the electrical domain to be processed, stored, or reused in other networks. Efficient opto-electronic receivers, comprising a PD, a TIA, and main amplifiers, have been shown to have energy efficiencies of ∼0.4 to 2.4 pJ/b.^{53–58} For an AFE operating at 10 Gb/s and realized in 40 nm CMOS technology, an energy efficiency of 0.4 pJ/b^{53} is assumed for the calculation of binary resolution AFE, which scales with a factor of *N* for the whole output array. Higher AFE resolutions entail the use of linear TIAs along with analog to digital converter (ADC) circuits to recover the digital data. High-speed linear TIAs have shown efficiencies as low as 0.6 pJ/b.^{59} The energy consumption for ADCs is extracted from the energy per conversion figure of merit (FOM) such that *E*_{ADC} (J/b) = 2^{N} × *FOM*. The energy consumption values used in this work for 2*b*, 3*b*, and 4*b* are 1.7, 3.1, and 5.7 pJ/b based on a FOM of 0.335 pJ/conversion for an ADC designed to operate at a sampling rate of 28 GS/s.^{60}

Providing high-speed serial inputs to the SiP accelerator requires first-in first-out (FIFO) buffers and multiplexers to interface the data transfer with DRAM, as shown in Fig. 4. For a fair comparison to digital CMOS implementations, the power dissipation for both input and output interfacing circuits, represented by *P*_{mem-interface}, is taken into consideration in the energy efficiency calculation of SiP implementations. The power dissipated by the FIFOs, multiplexers, clock dividers, and retimers is estimated as 5.77 mW in 28 nm CMOS based on the data reported in Ref. 61 for 180 nm CMOS technology.

### D. Efficiency trade-off factors

It can be inferred from Eq. (7) that the required input laser power increases as a function of *N* as a result of the exponentially increasing optical losses in the MZM-based accelerator. However, the total energy efficiency [Eq. (6)] initially improves as *N* scales up due to the quadratic increase in the number of accelerated operations performed by the optical matrix, as shown in Fig. 5. Taking all optical losses into account shows that there is a scaling limit beyond which the optical losses dominate and the overall efficiency drops; hence, an optimal network size exists at which the energy per operation is minimized. Unfortunately, the maximum network scale, *N*_{ltd}, is limited by the rated output optical power of the laser and the signal-to-noise ratio (SNR) required for any given signal resolution, *n*_{i/p} ≥ 1*b*, as given in Eq. (8) and illustrated in Fig. 3.

Figure 6 shows the total energy efficiency and scaling limit for various input resolutions considering thermo-optic phase shifters with and without insulation. The energy efficiencies in Fig. 6 are calculated for accelerators to be operated at binary and higher resolution, *n*_{i/p} = {1, 2, 3, 4}*b*. Although the probability of transition reduces for multilevel signaling,^{5} the requirement on driver’s linearity or segmentation also increases. Furthermore, the energy consumed in the serializing and clocking remains the same.^{5} Thus, we assume similar energy efficiency for multilevel signaling as PAM2, ∼2 pJ/b, for MZM modulators.^{62} For binary resolution, the power consumed by the drivers and AFEs is extracted from recent work on PAM2 transceivers.^{5,59,62,63} Therefore, the energy efficiency for 2*b*, 3*b*, and 4*b* input MZM drivers is estimated in our calculation as ∼4, 6, and 8 pJ per symbol, respectively.

It can be concluded from Fig. 6 that opting for PDs with higher responsivities improves the energy efficiency of the network. This compensates for the optical system loss, relaxes the need to inject high optical power at the network input, and improves the overall energy efficiency. Utilizing avalanche PDs (APDs) is a possible way to significantly improve the optical sensitivity.^{64} Figure 6 also suggests that taking advantage of the loss of precision, when possible, shows minor improvement in the energy efficiency and the network scaling (Table II).

η_{WPE} | IL_{SMF} (dB) | IL_{EC} (dB) | IL_{WG} (dB/mm) | EL_{Splitter} (dB) | IL_{MZI} (dB/mm) | L_{MZI} (mm) | IL_{DC} (dB) |
---|---|---|---|---|---|---|---|
0.1 | 0 | 1.6 | 0.3 | 0.01 | 1 | 0.5 | 0.01 |


Considering a matrix that scales with *N*, the laser optical intensity should be typically scaled by a factor of *N* to account for the splitting loss in a lossless network. For mesh-like configurations similar to Fig. 1, scaling the input vector size increases the dynamic range of the output intensities. In other words, for a given matrix output, the intensity can be as low as that of a single input or as high as *N* times that amount. With the input’s digital resolution being *n*_{i/p}, the effective overall output resolution due to the network scaling is *n*_{i/p} + log_{2} *N*.

The conservative estimate of scaling the input power by *N* may not be necessary in some computational contexts, such as convolutional neural network (CNN) layers with adaptable hidden layer resolutions,^{65} where the increased output resolution might be higher than what the AFE needs to detect. Therefore, an energy scaling vs loss of precision trade-off factor, *ρ*_{opt}, can be introduced to take advantage of the network scaling,^{66,67} as illustrated in Fig. 7. Full accuracy is described by *ρ*_{opt} = 1, corresponding to a reduced output precision of log_{2}(*ρ*_{opt}) = 0, at which the input optical intensity is scaled by *N*. Generally, for a log_{2}(*ρ*_{opt}) bit reduction, the input is scaled by *N*/*ρ*_{opt}. Therefore, the maximum amount of energy saving is achieved when a log_{2} *N* bit reduction is tolerable at the optical output (AFE input).
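The relationship between dropped output bits and laser power scaling can be tabulated with a few lines. In the sketch below, `bits_dropped` stands for log2(ρ_opt); the function names are illustrative assumptions.

```python
import math


def output_resolution_bits(n_in_bits, n):
    """Effective output resolution n_i/p + log2(N) for an N-input mesh."""
    return n_in_bits + math.log2(n)


def laser_scaling(n, bits_dropped):
    """Laser power scaling N / rho_opt, with rho_opt = 2**bits_dropped."""
    rho_opt = 2 ** bits_dropped
    if rho_opt > n:
        raise ValueError("cannot drop more than log2(N) bits")
    return n / rho_opt


n = 32
print(output_resolution_bits(1, n))
for dropped in range(0, 6):
    print(dropped, laser_scaling(n, dropped))
```

For *N* = 32, dropping all log2(*N*) = 5 extra bits brings the scaling factor back to 1, which corresponds to the maximum energy saving mentioned above.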

To get a meaningful sense of the trade-off between the energy scaling and the loss of precision, *ρ*_{opt} is quantified in the following equation in terms of the probability of bit errors for binary networks (networks with binary weights) at the output such that

where the *Q* function is defined as $Q(x) = \int_{x}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-u^{2}/2}\, du$ and *i*_{irn} represents the total input referred noise at the AFE input with contributions from the PD, TIA, main amplifiers, and comparators (if applicable). Therefore, the laser power, *P*_{laser} = *NP*_{opt-o/p}, can be traded off for loss of output resolution.
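The *Q* function has no closed form, but it can be evaluated through the complementary error function using the standard identity Q(x) = erfc(x/√2)/2. A minimal sketch (the function name is an assumption):

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2))


for x in (0.0, 3.0, 6.0, 7.0):
    print(x, q_function(x))
```

As a point of reference, an argument of about 6 corresponds to a bit error probability on the order of 1e-9.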

The IL difference between the bar and cross states of a tunable beam splitter impacts the interference between the nodes in the mesh. For an MZM with intensity loss of *α*_{1} in one arm and *α*_{2} in the other, it can be shown that the output intensity at the cross state is given by $I_{\mathrm{cross}} = I_{\mathrm{in1}}\left[\alpha_{1} + \alpha_{2} + 2\sqrt{\alpha_{1}\alpha_{2}}\cos(\theta_{2})\right]$ when *I*_{in2} = 0, where *I*_{in1} and *I*_{in2} represent the MZM input intensities at its input ports 1 and 2, respectively. To get the transmission response of an MZM with equal losses, *α*_{1} on both arms, *θ* should be modified such that $\cos(\theta) = \frac{\Delta\alpha + 2\alpha_{1}\cos(\theta_{1})}{2\sqrt{\alpha_{1}\alpha_{2}}}$ in order to account for the loss difference between the two MZM arms, where Δ*α* = *α*_{1} − *α*_{2} and *θ*_{1} describes the phase shift when both arms have attenuation of *α*_{1}. This difference in insertion losses can be observed in single-arm beam splitters in which phase shifters are controlled by a single arm only. Using dual-arm tunable beam splitters, where *θ* is implemented differentially using phase shifters on both arms, introduces equal insertion losses for the bar and cross transmissions of each node. A dummy phase shifter can also be used in single-arm topologies to obtain equal losses.
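The phase correction described above can be verified numerically. The sketch below assumes the cross-state intensity has the form α1 + α2 + 2√(α1α2)cos θ (normalized to the input intensity, with any constant prefactor omitted since it cancels in the comparison), computes the corrected phase, and confirms that the unequal-loss response matches the equal-loss one. Function names and the example loss values are illustrative assumptions.

```python
import math


def cross_intensity(alpha1, alpha2, theta, i_in=1.0):
    """Cross-state intensity for arm intensity transmissions alpha1 and alpha2."""
    return i_in * (alpha1 + alpha2
                   + 2 * math.sqrt(alpha1 * alpha2) * math.cos(theta))


def compensated_theta(alpha1, alpha2, theta1):
    """Phase that makes the unequal-loss MZM mimic equal losses alpha1 at theta1."""
    delta = alpha1 - alpha2
    c = (delta + 2 * alpha1 * math.cos(theta1)) / (2 * math.sqrt(alpha1 * alpha2))
    if abs(c) > 1:
        raise ValueError("target response is not reachable with these arm losses")
    return math.acos(c)


a1, a2, th1 = 0.9, 0.8, 1.0            # hypothetical arm transmissions and phase
th = compensated_theta(a1, a2, th1)
print(cross_intensity(a1, a2, th), cross_intensity(a1, a1, th1))
```

The `ValueError` branch reflects that, for a large loss imbalance, some target transmissions near full constructive interference cannot be reached by phase adjustment alone.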

For the sake of comparison, the energy consumption for a digital MAC is estimated based on the energy consumed by multiplication and accumulation operations and register file access in a 28 nm CMOS implementation.^{68} With an estimated energy consumption of 0.046 pJ for an 8*b* MAC and 0.0117 pJ for register file access, the calculated energy consumption is ∼(0.046 + 0.0117) pJ/2 = 28.85 fJ for a single operation. Conversely, it can be observed from Fig. 6 that SiP networks based on MZMs need to be scaled down to achieve higher resolutions, which further degrades their energy efficiency. Compared to their 8*b* digital CMOS counterpart, the energy efficiencies for SiP MZM MAC operations (using low-power thermo-optic phase shifters with insulation and 1.2 A/W^{69} PD responsivity) at *n*_{i/p} = 1*b*–4*b* resolutions are 3.5× to 17.5× worse. Despite the lower energy efficiency, MZM-based MAC operations are performed at 2*N*× higher operating speed and lower latency (for a weight-stationary systolic array with an input vector size of *N* × 1) than the corresponding digital CMOS implementation.

In addition, multiple clock cycles are needed for digital multipliers and adders to provide the MAC output in systolic arrays. Operating at ∼10× lower clock speeds further decreases their throughput in comparison to optical implementations. Assuming a number *α* of clock cycles needed for a digital MAC, the latency of a digital systolic array with a 1 × *N* input vector size and an *N* × *N* matrix size is 2*Nα*/*f*_{CLK-CMOS}, where *f*_{CLK-CMOS} represents the clock frequency of the digital CMOS implementation. Hence, the throughput ratio of the optical to CMOS implementations is 2*Nαf*_{CLK-OPT}/*f*_{CLK-CMOS}, where *f*_{CLK-OPT} represents the clock speed at which an optical implementation operates.
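The digital reference numbers and the latency and throughput expressions can be evaluated as follows. The example clock frequencies (1 GHz digital, 10 GHz optical) and *α* = 1 are illustrative assumptions, as are the function names.

```python
def digital_mac_energy_fj(e_mac_pj=0.046, e_regfile_pj=0.0117):
    """Energy per operation (fJ): MAC plus register access, split over 2 ops."""
    return (e_mac_pj + e_regfile_pj) / 2 * 1e3


def systolic_latency_s(n, cycles_per_mac, f_clk_cmos_hz):
    """Latency 2*N*alpha / f_CLK-CMOS of a digital systolic array."""
    return 2 * n * cycles_per_mac / f_clk_cmos_hz


def throughput_ratio(n, cycles_per_mac, f_clk_opt_hz, f_clk_cmos_hz):
    """Optical-to-CMOS throughput ratio 2*N*alpha*f_CLK-OPT / f_CLK-CMOS."""
    return 2 * n * cycles_per_mac * f_clk_opt_hz / f_clk_cmos_hz


print(digital_mac_energy_fj())
print(systolic_latency_s(8, 1, 1e9))
print(throughput_ratio(8, 1, 10e9, 1e9))
```

Under these assumed clock rates the ratio grows linearly with both the vector size and the number of cycles a digital MAC needs.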

We also investigate the energy efficiency and network size at lower data rates in Fig. 8. Intuitively, the energy efficiency degrades at lower data rates because fewer operations are performed relative to the dissipated static power. On the other hand, the network size, shown in Fig. 8, can be increased by making use of the SNR improvement at lower data rates, as inferred from Eq. (8). If maximizing the optical throughput is not an overarching goal, opting for lower data rates (relative to 10 GS/s) leads to larger networks without sacrificing much energy efficiency in implementations incorporating phase shifters with insulation [which do not consume much static power, as shown by the dotted curves in Fig. 8(a)].

## III. MRR-BASED Si-PHOTONIC IMPLEMENTATION

### A. System architecture

The SiP MRR-based implementation of an optical accelerator is illustrated in Fig. 9. A comb CW laser source is used to provide wavelengths *λ*_{1} through *λ*_{n}, which are coupled into the chip and then modulated by an array of *N* MRMs. Unlike mesh-like topologies, implementing vector matrix multiplication in the form of dot products has the advantage of maintaining equal path loss for all the outputs. The modulated input vector, *X*_{i/p}(*λ*), is then split (broadcast) into *N* branches to be modulated by the weight bank arrays;^{70} each output represents the dot product of the input vector and one of the row arrays of the weight matrix. In order to achieve weights with positive and negative polarities, the thru and drop transmissions of the weight arrays are routed to balanced PDs in a push–pull configuration at the receiver. The current difference at the output, *Y*_{O/p}, can be represented as^{14}

where *E*_{0}(*λ*), *W*_{dt}(*λ*), and *R*(*λ*) represent the amplitude of the input optical field, the difference between the weight’s drop and thru intensity transmissions, and the PD responsivity at a wavelength *λ*, respectively.

### B. Optical network link budget

Similar to Eq. (5), the optical link budget is calculated based on the following equation:

where *P*_{MRM-I/p-IL} represents the transmission insertion loss of the MRM for the input vector, *P*_{MRM-I/p-OBL} represents the out-of-band insertion loss (*OBL*) of the MRM for the input vector when the MRM resonance wavelength does not match the input vector wavelength, *P*_{MRR-W-IL} represents transmission insertion loss of the MRR for the weight vector, and *P*_{MRR-W-OBL} represents the out-of-band insertion loss of the MRR for the weight vector. Other terms have been defined already when describing Eq. (5).

Figure 10 shows the calculated optical power throughout an MRR-based implementation with different input vector sizes, *N*, using the values shown in Table III. The optical power of the laser is set to 0 dBm for ease of illustration. Similar to MZM-based implementations, the attenuation introduced due to the splitting and cascading of microrings significantly degrades the optical power and poses a limitation on the energy efficiency, as will be discussed in Sec. III C. Figure 11 shows the optical intensities required at the AFE to detect a signal with a resolution of *n*_{i/p} bits. This is obtained by representing the desired output signal and current noises in terms of the received optical intensity, as given in Eq. (8). It can be shown that the maximum achievable matrix size is ∼85 × 85 for binary networks. We revisit this calculation in Sec. III D.

Parameter | Value |
---|---|
η_{WPE} | 0.1 |
IL_{SMF} (dB) | 0 |
IL_{EC} (dB) | 1.6 |
IL_{WG} (dB/mm) | 0.3 |
EL_{Splitter} (dB) | 0.01 |
IL_{MRM} (dB)^{71} | 4 |
OBL_{MRM} (dB) | 0.01 |
IL_{MRR} (dB) | 0.01 |
d_{MRR} (μm) | 20 |
IL_{penalty} (dB) | 4.8 |


### C. Energy efficiency

The energy efficiency (J/Op) of the MRM-based implementation with size *N* × *N* operating at a data rate of *DR* can be calculated as

Here, *P*_{laser} represents the total electrical power consumed by the optical source (either a single comb laser source or multiple sources generating all the desired input wavelengths). To better study the energy efficiency based on a targeted signal resolution, *n*_{i/p}, the power consumption of the input laser source is formulated as a function of the optical power reaching the PDs at the output AFE, as given in the following equation:

where *d*_{MRR} represents the gap between the centers of two adjacent microrings and is dictated by the thermal crosstalk, which must be taken into account in the design. A *d*_{MRR} of 15 μm has been shown to be sufficient to avoid thermal crosstalk in a photonic switch implementation.^{72} We assume a *d*_{MRR} of 20 μm in this work for an optimistic realization of the system with *R*_{MRR} = 6 μm.

Since the number of operations scales quadratically with the weight vector size, *N*, while the energy consumption of the modulators and AFE increases linearly, as can be seen in Eq. (12), the overall energy efficiency improves with scaling, as shown in Fig. 12. We assume similar MRM energy efficiency for multilevel signaling as PAM2, ∼0.3 pJ/b,^{5} which takes into account both the contribution of the modulator driver and the serializers. Therefore, the energy efficiency for 2*b*, 3*b*, and 4*b* input MRM drivers is estimated in our calculation as ∼0.6, 0.9, and 1.2 pJ per symbol, respectively.

MRMs offer a smaller footprint and lower input capacitance, which leads to a significant reduction in their driving power. However, they are also more sensitive to fabrication mismatch and thermal drift, which entails the need to use heaters for calibration across a wide spectral range. Excluding heater power, the power consumed by a closed loop controller implemented for a low-power wavelength division multiplexing (WDM) topology is ∼0.2 mW.^{73} The average energy efficiency for the state-of-the-art MRR heaters on an SOI platform is ∼20 mW/π.^{7,74,75} The scaling of the power consumed by the heaters with *O*(*N*^{2}) degrades the overall energy efficiency of the network. Figure 13 shows the total energy efficiency and scaling limit for various input resolutions considering thermo-optic phase shifters with and without insulation. We assume power consumption values, *P*_{heater}, of 2.8^{75} and 40 mW^{7} for phase shifters with and without insulation, respectively, to provide a phase shift of one free spectral range (FSR).

### D. Scaling limitations

It can be inferred from Fig. 13 that it is feasible to implement vector matrix multiplication using MRR-based networks with sizes scaling up to *N* = 85. For *n*_{i/p} = 2*b* or above, the network size is within the maximum number of microrings permitted for WDM implementations due to FSR limitations and crosstalk.^{76} Attempting to engineer the MRR’s dimensions and coupling ratio compromises the quality factor, which degrades the channel spacings. This translates to a limit in the MRR vector size of *N* < *FSR*/Δ*λ*. Channel spacings are typically set according to the amount of acceptable crosstalk between channels (interchannel interference). As an example, for a 50 nm transmission window with channel spacings of 0.8 nm, the maximum number of channels is 62.^{76,77} Thus, for *n*_{i/p} = 1*b*, FSR may set a limitation to the overall network size.
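The channel-count bound *N* < *FSR*/Δ*λ* can be checked with a few lines of Python. This is a minimal sketch using the 50 nm window and 0.8 nm spacing from the example above; the function name is an illustrative choice.

```python
import math


def max_wdm_channels(window_nm: float, spacing_nm: float) -> int:
    """Largest whole number of channels that fit in the window at the given spacing."""
    if window_nm <= 0 or spacing_nm <= 0:
        raise ValueError("window and spacing must be positive")
    return math.floor(window_nm / spacing_nm)


if __name__ == "__main__":
    # 50 nm / 0.8 nm = 62.5, so 62 whole channels fit.
    print(max_wdm_channels(50.0, 0.8))  # 62
```

Widening the spacing to tolerate more interchannel crosstalk lowers the count proportionally; for example, 1.6 nm spacing in the same window gives 31 channels.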

Using series coupling to increase the filter order has been experimentally shown to reduce both interchannel and intrachannel crosstalks, thus maximizing the filter finesse and the channel count.^{77} However, this comes at the expense of higher footprint, lower drop port transmission, and extra tuning power. To cascade several MRRs for MAC operations, it is necessary to maintain channel spacings to avoid the adjacent weight-dependent crosstalk. The number of channels that can be supported by optimized MRRs with finesse of 368 and 540 are calculated to be 108 and 148, respectively.^{76,78,79}

The two-point coupling scheme has been proposed to address the post-fabrication correction of MRM spectral features for large-scale MRM implementations.^{80} Although it mitigates the secondary resonances of an MRM and doubles the FSR, an extra micro-heater is introduced to correct for the coupling, which increases the power consumption. Another attempt to achieve an FSR-free filter has been demonstrated using tunable couplers along with modified vernier filters that use higher-order coupled MRRs.^{81} However, this topology incurs penalties in design complexity, footprint, and tuning power.

Introducing contra-directional coupling (CDC) in a microring combines the wavelength selectivity of the CDC with the compact feature size of the MRR, thus reaping the advantages of both and providing an FSR-free response.^{82} Implementing this design technique allows for the potential use of several channels in MRR-based accelerators. This comes with a trade-off of using extra heaters in the CDC and in the region of the MRR that does not include corrugated structures.

As shown in Fig. 13, the optimum energy per operation of binary SiP networks based on MRRs (∼75 fJ) is obtained at *N* = 85 for a PD responsivity of *R* = 1.2 A/W. It can also be shown that reducing the power consumption of weight tuning circuits by one order of magnitude improves the energy efficiency by roughly one order of magnitude as well. Compared to their 8*b* digital CMOS counterpart, the energy efficiencies for SiP MRM MAC operations (using low-power thermo-optic phase shifters with insulation and 1.2 A/W PD responsivity) at *n*_{i/p} = 1*b*–4*b* resolutions are 2.6× to 13× worse. In comparison to MZM-based implementations, MRR-based implementations can have a 1.8× larger network scale and achieve 1.3× lower energy consumption per operation. Similar to MZM-based implementations, MRR MAC operations are performed at a 2*Nαf*_{CLK-OPT}/*f*_{CLK-CMOS}× higher throughput than their digital CMOS counterparts.

Figure 14 shows the energy efficiency and scaling at lower data rates. As with MZM implementations, reducing the data rate degrades the energy efficiency but allows the network size to scale up, owing to the reduced noise levels at the AFE.

## IV. RESEARCH OPPORTUNITIES

As summarized in Secs. II and III, SiP accelerators operate at much higher speed and lower latency than their CMOS counterparts. Nevertheless, it remains desirable to increase the size of the MAC networks in SiP, especially for neural network applications, and to improve their energy efficiency. There have been several promising research studies in the field of SiP. Classifying the existing commercial SiP technology as the first generation, we describe several emerging technologies that will make up the next generation of SiP. Figure 15 summarizes the advancements in SiP that can be leveraged by SiP-based accelerators to reduce optical loss, improve the energy efficiency, and incorporate heterogeneous integration techniques for performance improvement.

### A. Optical loss reduction

As described in Secs. II and III, optical losses limit the scalability of the SiP 1.0 networks. Losses must be minimized at the coupling interfaces and in the components. PWB is one way to ensure efficient coupling between the chip and the optical fiber, with an insertion loss of ∼1 dB and negligible variation.^{35} Passive alignment to SMF optical fibers can be accomplished using V-groove arrays. Such fiber-to-chip self-alignment has been shown to have a coupling efficiency of ∼−1.3 dB.^{83} In another demonstration, coupling losses as low as ∼0.5 dB and ∼0.35 dB have also been reported for passive and active alignments, respectively.^{84}

For scaling up the networks, the optical signal attenuation can be compensated by using SOAs. On-chip SOAs can be utilized to pre-amplify the input signal and can also be exploited as weight matrix elements to provide weight magnitudes greater than 1.^{85} However, the non-linear gain–current curve entails a need for calibration.

Improving the responsivity of the AFE is yet another way to tolerate the optical losses. It relaxes the need to increase the laser power to compensate for the losses. The high multiplication gain and responsivity of APDs have been shown to improve the sensitivities of optoelectronic receiver front ends.^{64} Improving dark current and quantum efficiency by a careful design of the APD geometry has been projected to improve the sensitivity of Si–Ge APD receivers to as low as −29 dBm at 12.5 Gb/s^{86} as compared to −18.5 dBm for Ge PIN detectors.^{87} Limiting the bandwidth of the AFE and using equalization techniques,^{88} such as continuous time linear equalization (CTLE) and decision feedback equalization (DFE),^{89} can reduce the input referred noise of the AFE and further improve the sensitivity.
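To see what the sensitivity figures above mean for the link budget, the short Python sketch below converts the two quoted values from dBm to linear power and takes their ratio. It is only an illustration of the unit conversion. It assumes the two numbers can be compared directly, although the text gives a data rate only for the APD figure.

```python
def dbm_to_mw(p_dbm: float) -> float:
    """Convert optical power from dBm to milliwatts."""
    return 10 ** (p_dbm / 10)


APD_SENSITIVITY_DBM = -29.0   # projected Si-Ge APD receiver, from the text
PIN_SENSITIVITY_DBM = -18.5   # Ge PIN detector, from the text

apd_uw = dbm_to_mw(APD_SENSITIVITY_DBM) * 1e3  # microwatts
pin_uw = dbm_to_mw(PIN_SENSITIVITY_DBM) * 1e3  # microwatts
power_ratio = pin_uw / apd_uw                  # equals 10**(10.5 / 10)

if __name__ == "__main__":
    print(f"APD: {apd_uw:.2f} uW, PIN: {pin_uw:.2f} uW, ratio: {power_ratio:.1f}x")
```

The 10.5 dB difference corresponds to roughly an 11× reduction in the optical power required at the receiver, which can be spent on either lower laser power or additional insertion loss from a larger network.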

### B. Improving energy efficiency

Commercial CW lasers suffer from low WPE in the range of ∼1%–10%, which impacts the energy efficiency of the system.^{90,91} Hybrid-integrated silicon photonic lasers have been shown to provide ∼12.2% WPE.^{92}
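The impact of WPE on the electrical power budget can be illustrated with a short Python sketch. The 10 mW optical output used here is an assumed example value, not a figure from this work, and the function name is hypothetical.

```python
def laser_electrical_power_mw(optical_mw: float, wpe: float) -> float:
    """Electrical input power needed for a given optical output at a given WPE.

    `wpe` is the wall-plug efficiency as a fraction (0 < wpe <= 1).
    """
    if not 0.0 < wpe <= 1.0:
        raise ValueError("wpe must be a fraction in (0, 1]")
    return optical_mw / wpe


if __name__ == "__main__":
    optical_mw = 10.0  # assumed example: 10 mW (10 dBm) of optical output
    for wpe in (0.01, 0.10, 0.122):
        p_el = laser_electrical_power_mw(optical_mw, wpe)
        print(f"WPE {wpe:.1%}: {p_el:.0f} mW electrical")
```

For the assumed 10 mW output, moving from 1% to 12.2% WPE reduces the electrical draw from 1000 mW to about 82 mW.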

Although introducing on-chip CW lasers mitigates the coupling losses, the feasibility of using them, especially for networks using WDM, requires wavelength stabilization and reflection cancellation.^{93,94}

Reducing the power consumption of the phase shifters in the weight matrix is a critical requirement, given that their overall energy consumption scales quadratically with network size [Eqs. (6) and (12)]. Thermo-optic phase shifters dissipate high power, given their resistive nature. Introducing trenches, undercuts, and back-side substrate removal has been shown to improve the tuning efficiency of the rings by an order of magnitude, with a measured power consumption of ∼4 mW per FSR.^{75,95,96} However, thermal isolation and substrate removal exacerbate self-heating, which must be taken into consideration while designing a CMOS controller.^{97}

Several post-fabrication schemes have been investigated to correct for the fabrication-induced variations. Reducing the process variations was investigated by patterning SiN on top of the Si waveguide to introduce field perturbations, which effectively adjusts the optical path length.^{98} Another demonstrated technique relies on trimming using Ge ion implantation followed by laser annealing to tune the MRR resonant wavelength across the whole FSR without introducing any excess loss. Its accuracy, CMOS-compatibility, and feasibility for wafer-scale correction render it a potential technique to be utilized in optical neuromorphic implementations to reduce the tuning power.^{99}

Alternatives such as nano-opto-electro-mechanical systems (NOEMS)^{16,100} and liquid crystal on silicon (LCOS)^{101} have the potential to reduce the tuning power overhead significantly. The dynamic energy consumption of NOEMS was reported in the range of 0.13 and 0.32 fJ for digital pulse signals^{100} and is assumed to be ∼1 fJ for our study. On the other hand, LCOS has been shown to dissipate power as low as 2 nW.^{101}

Phase change materials (PCMs) such as Ge_{2}Sb_{2}Se_{4}Te (GSST) have been demonstrated as compact phase shifters in which the optical phase shift is obtained by tuning the state of the material between amorphous and crystalline.^{102,103} Being able to sustain their crystallization state in the absence of power makes them good candidates for tuning low-speed weights in SiP implementations with no static power consumption. Given their compact sizes and non-volatile nature, the efficiency of implementing them in large-scale SiP networks is investigated, as shown in Fig. 16. Although not as lossy as PN phase shifters, PCMs have *IL* = 0.32 dB, which is relatively high for cascaded phase shifters in a large-scale implementation.^{102} This limits the network sizes for computation with several bits of resolution. The resolution of the weights can be set by adjusting the level of crystallization of a PCM cell.^{104}

The pulse energy consumption for writing and erasing levels 1–7 was reported in the range of 372–601 and 562–373 pJ,^{104} respectively. Assuming the weights to be uniformly distributed, the average energy consumption, *E*_{PCM}, for setting the PCM to various weights can be calculated, as in Eq. (14), where *E*_{A} and *E*_{C} represent the pulse energy required to write (amorphization) and erase (crystallization) the first level (*L*_{1}), respectively. For levels, *L*_{i}, where *i* > 1, Δ*E*_{A} and Δ*E*_{C} represent the average amount of energy required to transition to one level higher or lower, respectively, with all levels assumed to be equally spaced for the sake of simplicity.

Assuming that *E*_{A} = 372 pJ, *E*_{C} = 373 pJ, Δ*E*_{A} = (601 − 372)/(2^{n} − 2) pJ, and Δ*E*_{C} = (562 − 373)/(2^{n} − 2) pJ, the estimated average energy consumption for a PCM phase shifter with *n* = {1, 2, 3, 4}*b* equals {186, 231, 165, 121} pJ. For phase shifters with zero static power dissipation, the dynamic energy consumption is divided by the number of times weights have been reused for vector matrix multiplication. This is done by introducing a weight reuse factor, *α*_{w}, such that *P*_{mat-tuning} = *P*_{NOEMS,PCM}/*α*_{w}, where *α*_{w} typically ranges between 2^{6} and 2^{18} in general matrix multiplications (GEMMs).^{105} A value of *α*_{w} = 4096 is chosen for the calculation of energy efficiencies in this work. For networks where the weight reuse is low, the contribution of the dynamic energy per operation can be considerably higher for PCM than all the other weight tuning alternatives, degrading the energy efficiency by orders of magnitude.
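The effect of the weight reuse factor can be sketched numerically. The Python snippet below takes the average PCM energies quoted above as given inputs (it does not re-derive them from Eq. (14)), evaluates the per-level energy steps Δ*E*_{A} and Δ*E*_{C} from the stated expressions, and divides the dynamic energy by *α*_{w}. All function and variable names are illustrative.

```python
# Average PCM programming energy per weight update (pJ), as quoted in the text.
E_PCM_PJ = {1: 186.0, 2: 231.0, 3: 165.0, 4: 121.0}
ALPHA_W = 4096  # weight reuse factor chosen in this work


def level_step_pj(e_first_pj: float, e_last_pj: float, n_bits: int):
    """Energy step between adjacent levels, (E_last - E_first) / (2**n - 2).

    Returns None for n_bits = 1, where the denominator is zero and no
    intermediate levels exist.
    """
    steps = 2 ** n_bits - 2
    if steps == 0:
        return None
    return (e_last_pj - e_first_pj) / steps


def amortized_tuning_energy_fj(n_bits: int, alpha_w: int = ALPHA_W) -> float:
    """Dynamic PCM energy per weight use (fJ) after amortizing over alpha_w reuses."""
    return E_PCM_PJ[n_bits] * 1e3 / alpha_w


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        d_ea = level_step_pj(372.0, 601.0, n)  # write (amorphization) step
        d_ec = level_step_pj(373.0, 562.0, n)  # erase (crystallization) step
        print(n, d_ea, d_ec, round(amortized_tuning_energy_fj(n), 1), "fJ")
```

With *α*_{w} = 4096, the amortized values fall in the range of roughly 30–56 fJ per weight use. At the low end of the quoted reuse range (*α*_{w} = 2^{6}), the same energies would be 64 times larger, which is why low weight reuse penalizes PCM so strongly.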

Figure 16 shows the energy efficiency breakdown for both the MZM-based and MRR-based architectures for weight tuning that rely on thermo-optic phase shifters without and with insulation,^{75} NOEMS, LCOS, and PCM. For each implementation, energy calculations are based on the network scales that can satisfy the SNR requirements to compute with bit resolutions, *n*_{i/p} = {1, 2, 3, 4}*b*. The insertion loss for a 35 *µ*m long LCOS used to realize a *π* phase shift is taken as 0.35 dB.^{101} For MZM implementations, thermo-optic phase shifters with insulation seem currently attractive for energy efficiency and network size. For a large weight reuse factor, NOEMS-based phase shifters promise further energy reduction. For MRM implementations, similar conclusions can be drawn except that LCOS-based phase shifters also seem promising.

For both MZM-based and MRR-based architectures, it is evident that opting for matrix weight tuning alternatives with almost zero power consumption significantly improves the total energy efficiency, with values approaching the sub-100 fJ/Op regime for both architectures. Further research is still needed to demonstrate the feasibility of these approaches in high volume production to realize such an energy efficiency regime.

### C. SOA cascadability

SOAs are used to amplify optical signals over a given spectrum and can be implemented off-chip or using hybrid integration. Cascading SOAs has been conventionally used to restore signal levels in interconnect links and has been recently explored in deep neural network implementations.^{85}

Although SOAs help compensate for the optical losses to increase the scale of the network, they contribute significantly to the energy consumption. In addition, they suffer from several downsides, including the noises produced due to the optical amplification, the ripples in their gain spectrum, and amplification nonlinearity.^{106–108} The major noise component is attributed to the amplified spontaneous emission (ASE) of photons toward the input and output of an SOA.^{108,109} Therefore, the build-up of ASE noise due to the cascade degrades the SNR of the optical signal reaching the output.^{106,108}

To investigate the effect of incorporating SOAs on the scalability of the network and the received signal resolution, the overall SNR is quantified in Eq. (15), with an SOA gain value as *G* = 17 dB.^{46,110} *B*_{o} and *B*_{e} stand for the optical bandwidth of the amplifier and electrical bandwidth of the AFE, respectively. *ρ*_{ASE} represents the ASE noise and is calculated as

where *N*_{SOA} and *n*_{sp} represent the number of SOAs used in the network and the spontaneous emission factor of the optical amplifier, respectively.

The parameters given in Table I are used for calculating the SNR and *N*_{SOA}. For a target network resolution, *n*_{i/p}, the number of SOAs is calculated based on the resultant SNR whose signal and noise power values vary with the network scale. The SNR is calculated as follows:

As can be inferred from Figs. 2 and 10 for an SOA-less network, the SNR degrades since the signal optical intensity at the AFE, *P*_{o/p}, is attenuated with the scale of *N*. Incorporating an SOA helps replenish the signal intensity. SOAs can be added in the network before the SNR degrades below the threshold for a given resolution due to the insertion losses. However, the signal-dependent noises are also amplified, which works against the network resolution. For calculating the resolution, the AFE is assumed to tolerate signals with maximum *P*_{opt} intensity as high as 10 dBm, beyond which the number of SOAs is limited. Regions with no SOAs at the right-hand side of Figs. 17 and 18, shown in the Appendix, represent regions where the desired SNR cannot be achieved, indicating an infeasible resolution for a given scale.

SOAs require introducing III–V or II–VI compound semiconductor materials to the SiP platform, which is non-compliant with the standard SiP CMOS foundry runs. Back propagating ASE noise from the SOA emphasizes the need to employ an optical isolator for the input laser source.^{106} Use of narrowband filters is also possible to reduce the out-of-band noise,^{85,106} but these further reduce the maximum channel count, thus limiting the scaling of the network. To be used for linear analog computations, SOAs should deliver gains that are independent of the input intensities. Cross-gain modulation (XGM) is one type of non-linearity observed in SOA amplifiers in which the combination of all the input intensities impacts the gain of a single channel.^{111,112}

The aforementioned SOA limitations should be addressed in order to maintain the linearity of the weight matrix and allow for further scaling of networks. Designing highly efficient SOAs has the potential to increase the size of the networks. However, improving the network energy efficiency requires that the SOAs have low injection current and high optical signal-to-noise ratio. Recent attempts to use SOAs for optical networks have reported a power consumption of 42 mW per SOA, which leads to an energy consumption of ∼4.2 pJ/Op at a *DR* = 10 GS/s.^{85} To carry out four weighted additions, 16 SOAs were used for the weight tuning, along with extra SOAs for optical pre-amplification and input selection. A crosstalk of 0.6 dB was reported for the SOAs even for a small-scale circuit implementation of the arrayed waveguide grating (AWG) filter, which necessitated the use of feedback loops for gain calibration. For accelerators with SOA integration, the contribution to the overall network energy efficiency scales with *O*(*N*^{2}), setting the efficiency to the pJ/Op regime.
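The ∼4.2 pJ figure quoted above is simply the SOA power divided by the sample rate. A minimal Python sketch of that conversion follows; the function name is illustrative, and the 16-SOA total uses only the count reported in the text for the weight-tuning elements.

```python
def energy_per_sample_pj(power_mw: float, rate_gsps: float) -> float:
    """Energy per sample in pJ: mW divided by GS/s gives pJ directly."""
    if rate_gsps <= 0:
        raise ValueError("sample rate must be positive")
    return power_mw / rate_gsps


P_SOA_MW = 42.0       # reported power per SOA
DATA_RATE_GSPS = 10.0  # DR = 10 GS/s

if __name__ == "__main__":
    per_soa = energy_per_sample_pj(P_SOA_MW, DATA_RATE_GSPS)
    print(f"{per_soa:.1f} pJ per SOA per sample")
    # Total power of the 16 weight-tuning SOAs in the cited demonstration.
    print(f"{16 * P_SOA_MW:.0f} mW for 16 SOAs")
```

Since one SOA per weight implies *N*^{2} SOAs, this per-SOA energy is not amortized by scaling the network, consistent with the pJ/Op regime noted above.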

## V. CONCLUSION

We describe the behavior of MZM- and MRM-based SiP implementations for MAC accelerators based on today’s SiP technology. Both MZM and MRM implementations share similar optical and electrical challenges. In comparison to digital CMOS accelerators, SiP implementations have relatively higher energy consumption and operate at lower bit resolutions. In addition, they cannot be scaled to large network sizes because of the optical losses. Implementing MAC operations using SiP has two distinct advantages:^{113}

Optical MAC operations can be scaled to frequencies at tens of GHz, whereas MAC operations in digital CMOS are limited to a few hundreds of MHz or at most GHz operating speeds. For tasks where memory access is not the bottleneck, such as inference with fixed weights, an optical implementation can reduce latency and improve the energy efficiency at such high speeds. Digital CMOS counterparts, on the other hand, are limited by the clock frequency.

Multiplication operations can be intrinsically implemented in parallel in which the analog nature of the computation allows all matrix operations to take place at the same time for each input fetch.^{114} Therefore, optical MAC implementations can increase the throughput and improve their energy efficiency at such high speeds. MAC operations implemented with digital circuits in CMOS are limited by the wiring interconnect density.

There are also some challenges of implementing MAC operations using SiP:

The losses in optical circuits severely limit the size of the MAC networks that can be physically realized, in comparison to a digital CMOS implementation where signal gain and regeneration are easily available. This, in turn, limits the applications of SiP MAC operations.

Although the power consumed in the optical MAC is less, when accounting for the losses and the power consumed by the laser and CMOS electronic circuits that drive and control the optical circuits, the energy efficiency is degraded.

Unlike digital CMOS implementations that can support 16*b*/32*b* resolution, analog photonic MAC operations have a maximum demonstrated resolution of 8.5*b*.^{115} Nevertheless, such a resolution has been shown to be adequate for many inference tasks.^{116–118}

Achieving a high throughput optical network entails accessing data at high speed. High-speed input/output (I/O) data can be streamed in/out from/to an off-chip DRAM; the corresponding energy consumption in moving the data must be considered for the overall implementation of an accelerator.^{119} The energy consumption in data fetch from off-chip DRAM is significantly large. However, for a given dataset, the DRAM associated penalty is similar for both optical and digital CMOS implementations. We exclude that penalty in our work. For weight-stationary implementations where weights do not need to change frequently, an on-chip SRAM can be used, which can be adequately large due to the limited network size of the photonic accelerators.

Most of the low-loss phase shifters have a reconfiguration speed in the range of *μ*s-to-ms, making the weight reconfiguration in photonic MAC operations significantly slower than their electronic counterparts. This limits the use of photonic MAC operations to weight-stationary systolic array implementations, where the incoming data are high-speed, but the weight does not get updated quickly. For other scenarios, a fine weight retuning can be done with high-speed plasma-dispersion phase shifters, where the loss is controlled due to the need for a fine weight tuning range only.

To carry out optoelectronic computing with several bits of resolutions at high-speed, CMOS or biCMOS drivers and transimpedance amplifiers (TIAs) are needed that must operate with multi-level pulse-amplitude modulation (PAM) signaling. Although many PAM2 (1*b*) and PAM4 (2*b*) transceivers have been demonstrated,^{62,120} higher levels of modulation require linear drivers and TIAs, which are challenging to design at high speed and good energy efficiency.^{3} However, if data rates in optical computing are limited to a few tens of GBaud, this challenge is surmountable.

Packaging considerations in optics (e.g., laser, fiber, and SOA attach) are far more challenging than the packaging considerations for electronic dies due to alignment accuracy and thermal management requirements.^{121}

The mesh-like interconnections of MZM-based implementations that extend the optical dynamic range at the AFE place a trade-off between the output resolution and the power requirement of the laser. MZM drivers also consume higher electrical driver power because MZMs cannot be made very long due to the losses associated with their larger footprints. Therefore, the power consumption of the laser and high-speed drivers is amortized with a limited network scalability. On the other hand, MRR-based topologies provide better energy efficiencies due to their small footprint and, thus, lower modulation energy consumption. The overall energy efficiency for either of the implementations experiences major degradation mainly due to the inefficiency in lasers and phase shifters, the insertion and excess losses of the optical components, and the optical to electrical and electrical to optical conversion overhead.

However, an order of magnitude higher operating speed and the inherent parallelism in conducting multiplication for analog signals render them attractive for reducing delay and enhancing throughput in comparison to digital CMOS implementations. With the emerging technologies in SiP, e.g., NOEMS and LCOS, low energy tuning schemes have the potential to significantly improve the energy efficiency of the photonic accelerators. Nonetheless, thermal PS with insulation is still an efficient weight tuning option, which is attractive for mass production. Low voltage-swing modulators with heterogeneous integration of polymers also promise improvements in energy efficiency due to the significant reduction in modulator and CMOS driver power consumption.

Scaling SiP accelerators to larger network sizes is limited by the rated output optical power of the laser and the SNR required for a given signal resolution, *n*_{i/p}. MRM-based networks can scale to larger values than their MZM-based counterparts due to the lower loss associated with cascading microrings in a WDM implementation. Incorporating high-power multi-wavelength lasers will be crucial. However, MRR-based networks are more sensitive to temperature and often require temperature control between the photonic IC and the laser. The size of MRR-based networks that have been demonstrated in prototype hardware has been limited to eight modulators^{122} or a 16 × 16 switch.^{7} Until larger MRM-based networks are demonstrated in hardware, the adoption of MZM-based networks will continue to be favored.

Heterogeneous integration of low-noise SOAs in SiP is a possible way to increase the network size, but the high-power consumption of SOAs degrades the energy efficiency significantly. Higher responsivity and low-noise APDs will also prove beneficial in scaling up the network sizes. The size can be further scaled up by reducing the insertion losses of contributors, such as directional couplers and phase shifters (if lossy). The splitters remain a significant limitation for scaling. To make efficient use of the limited optical network sizes, general matrix multiplication (GEMM) algorithms must be adopted.

Enhancing the energy efficiency can be achieved by adopting modulators with low static and dynamic power consumption, high-responsivity APDs along with TIAs with high sensitivity, and efficient SOAs and lasers with high WPE. To better address the need for high resolution, multi-level signaling significantly beyond PAM4 and PAM8 must be implemented. Controlling the temperature of the chip also helps with maintaining high resolution. In addition, the crosstalk and distortion of SOAs should be further investigated.

## ACKNOWLEDGMENTS

This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). Access to CAD tools and technology is facilitated by CMC Microsystems. The authors would like to thank Dr. Alex Tait of Queen’s University and Avilash Mukherjee of UBC for their technical comments.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors declare no conflicts of interest.

## DATA AVAILABILITY

The data that support the findings of this study are available within the article.

## APPENDIX: SCALING UP USING SOAs

This section illustrates the feasibility of incorporating SOAs in SiP networks in Mach–Zehnder and microring based implementations. Figures 17 and 18 illustrate the number of SOAs in terms of the network resolution and scale for MZM- and MRM-based implementations, respectively. It can be observed that the resolution at which a SiP network is desired to operate is dependent on the network scale.

It can be inferred that an MZM-based network can support signal resolutions of 4*b* for a network size of *N*_{ltd} = 55 with a single SOA. In comparison, incorporating SOAs into MRM implementations increases the network limited scale to *N*_{ltd} = 94 for *n*_{i/p} = 4*b*, as shown in Fig. 18. The number of SOAs that can be added to a network is limited to 1 or 2.