Electronic–photonic computing systems offer immense potential in energy-efficient artificial intelligence (AI) acceleration tasks due to the superior computing speed and efficiency of optics, especially for real-time, low-energy deep neural network inference tasks on resource-restricted edge platforms. However, current optical neural accelerators based on foundry-available devices and conventional system architecture still encounter a performance gap compared to highly customized electronic counterparts. To bridge the performance gap due to lack of domain specialization, we present a time-multiplexed dynamic photonic tensor accelerator, dubbed TeMPO, with cross-layer device/circuit/architecture customization. At the device level, we present foundry-compatible, customized photonic devices, including a slow-light electro-optic modulator with experimental demonstration, optical splitters, and phase shifters that significantly reduce the footprint and power in input encoding and dot-product calculation. At the circuit level, partial products are hierarchically accumulated via parallel photocurrent aggregation, lightweight capacitive temporal integration, and sequential digital summation, considerably relieving the analog-to-digital conversion bottleneck. We also employ a multi-tile, multi-core architecture to maximize hardware sharing for higher efficiency. Across diverse edge AI workloads, TeMPO delivers digital-comparable task accuracy with superior quantization/noise tolerance. We achieve a 368.6 TOPS peak performance, 22.3 TOPS/W energy efficiency, and 1.2 TOPS/mm2 compute density, pushing the Pareto frontier in edge AI hardware. This work signifies the power of cross-layer co-design and domain-specific customization, paving the way for future electronic–photonic accelerators with even greater performance and efficiency.
I. INTRODUCTION
Photonic computing has emerged as a promising technology for high-performance and energy-efficient computing, particularly in computation-intensive artificial intelligence (AI) tasks. Various integrated photonic tensor core (PTC) designs have been introduced and demonstrated for ultra-fast photonic analog linear operation acceleration. Coherent PTCs that leverage interference and diffraction include MZI arrays,1 butterfly-style meshes,2,3 auto-designed photonic circuits,4 coupler-crossbar arrays,5 star-coupler-based designs,6 and metalens-based diffractive PTCs.7 In addition, incoherent multi-wavelength PTCs leverage wavelength-division multiplexing (WDM), e.g., MRR weight banks,8–11 PCM crossbar arrays,12 and micro-comb-based computing engines.13,14 We emphasize three key features that general edge AI requires of efficient PTCs: versatility, dynamic reprogrammability, and domain-specific customization, as shown in Fig. 1.
Versatility, or universality, is one of the important features of photonic AI hardware to accelerate a variety of DNN workloads. A versatile/generic photonic accelerator based on universal optical linear units is capable of realizing general matrix multiplication (GEMM) and thus directly implementing a wide spectrum of pre-trained digital DNNs. Many specialized linear units are not applicable to generic tensor computation since they restrict their matrix expressivity to a subspace of specialized matrices for higher hardware efficiency, e.g., butterfly meshes3 and tensorized MZI arrays.15
Besides versatility, photonic computing requires real-time, efficient input tensor encoding with low reconfiguration costs. One example is the MZI arrays, which support arbitrary weight matrices but suffer from high weight encoding costs due to the high complexity of matrix decomposition required to encode weights. Similarly, many subspace linear unit designs can approximate GEMM operations by cascading more programmable devices but require an even more costly optimization-based approach to map the weight matrix.3,6 Such a property restricts those designs to only support weight-static linear operations, e.g., fully connected (FC) layers and convolutional (CONV) layers, where weights are pretrained and pre-encoded into the device/circuit transmissions. However, advanced AI models, e.g., Transformer16–21 based on attention operations where both matrix multiplication operands are dynamic, full-range, and general tensors, cannot be efficiently mapped to those weight-static PTCs.
The third critical feature to enable efficient, scalable PTCs is domain-specific hardware customization. At the device level, many optical computing hardware demonstrations are based on standard foundry PDK elements, which are designed for optical communications and not optimized for analog neuromorphic computing. For example, bulky electro-optic (E-O) modulators (mm-level in length)22 can serve as the transmitter module for high-speed communication but are not suitable for analog computing, because input encoding requires quadratically many such modulators and the resulting footprint is intractable. On the other hand, thermo-optic MZI modulators are usually compact but can only be modulated at kHz frequencies due to the 10 μs thermal time constant, and they are usually power-hungry. Plasmonic devices23 are compact and high-speed but show high insertion loss (>10 dB), leading to significant laser power consumption. MRRs are compact and low-loss; however, their high locking power and high sensitivity to thermal variations limit their efficiency and robustness.24 Hence, compact, low-power, low-loss, and high-speed modulators are in high demand for efficient optical computing. To bridge the gap at the device level, it is necessary to customize computing-specific optical components, e.g., multi-operand devices for compact neural computing25,26 and diffractive meta-computing systems.7 At the circuit level, customization is critical to reducing the long-standing analog-to-digital and optical-to-electrical conversion bottlenecks. At the architecture level, due to the lack of optical memory, the large spatial footprint of photonic circuits, and the high digital memory access cost, the architecture topology and dataflow also need to be customized to fully leverage temporal locality, reduce data movement cost, and maximize hardware sharing. Only with device-circuit-architecture cross-layer co-design and customization can we realize photonic computing’s advantages over its electronic counterparts.
In this work, we present a time-multiplexed dynamic photonic tensor accelerator design, dubbed TeMPO, for efficient AI acceleration. It features ultra-compact slow-light electro-optic modulators for input operand encoding, hierarchical partial product accumulation with lightweight capacitive temporal integration modules, and a multi-core architecture that maximizes sharing of data input/readout circuitry. One key innovation of this work is the use of custom-designed, foundry-fabricated slow-light MZI modulators (SL-MZMs) with enhanced light–matter interaction for size and power reduction. The SL-MZM has a phase shifter length of 150–200 μm, with a footprint about 10× greater than that of a Si MRR but an order of magnitude smaller than that of the typical foundry-offered Si Mach–Zehnder modulator (MZM) PDK elements. It is thermally robust, needs no thermal tuning/locking circuit, and can tolerate large manufacturing variations. Unlike multi-wavelength dynamic PTC designs,5 TeMPO simplifies spectral multi-wavelength encoding to high-speed temporal encoding. This eliminates the need for complex dispersion-engineered broadband devices, such as Si modulators, optical power splitters, and directional couplers, and removes the WDM MUX/DEMUX overhead.
The major contributions of this paper are as follows:
We present a compact and energy-efficient multi-core photonic AI accelerator, TeMPO, with device and architecture co-optimization and customization.
Compact and Efficient Photonic Components—To enable ultra-fast, compact, low-power input operand encoding and dot-product computing, we adopt a customized slow-light MZM device with an orders-of-magnitude smaller footprint and switching energy than the PDK MZM. We also customize optical power splitters with varying splitting ratios and an ultra-low-power π/2 phase shifter. With customized devices, TeMPO is 6.8× more compact and 9.1× more power efficient than its foundry counterparts.
Hierarchical Product Accumulation—TeMPO leverages photocurrent aggregation and temporal integration for partial product accumulation in the analog domain, significantly reducing the laser power and analog-to-digital conversion cost. We also enable input modulator sharing and output readout circuitry sharing to minimize the E-O/O-E cost.
Versatile and Robust Edge AI Evaluation—We evaluate TeMPO on both convolutional NNs and Vision Transformers on speech recognition, image classification, and advanced semantic segmentation tasks and show comparable accuracy and superior robustness to low-bit quantization and hardware noises from experimental measurement.
New Area-Energy Efficiency Pareto Frontier—We comprehensively evaluate the scalability and efficiency of our proposed TeMPO architecture and show 368.6 TOPS peak performance, 22.3 TOPS/W energy efficiency, and 1.2 TOPS/mm2 compute density, outperforming state-of-the-art electronic counterparts.
II. OVERVIEW OF TIME-MULTIPLEXED DYNAMIC PTC ARCHITECTURE DESIGN OF TeMPO
A. Dynamic photonic dot-product engine
B. TeMPO architecture overview
We have introduced one dynamic dot-product engine to realize vector dot-product. Now, we introduce a multi-core time-multiplexed photonic tensor accelerator TeMPO for parallel dot-product, shown in Fig. 3.
We have tiles in the architecture, and each tile contains PTCs. Each PTC is a crossbar of dynamic dot-product engines, which can finish a times vector outer product at each time step.
Given an times GEMM workload, we first partition the matrix into horizontal strips, each with a size of , and matrix into vertical strips, each with a size of . One block in the result matrix can be computed by accumulating vector outer product, i.e., . This length- reduction can be mapped to PTCs in a tile in parallel, and each PTC is responsible for computing vector outer products, which is formally rewritten as . Therefore, the total cycles consumed to compute is . There are of such matrix blocks in , and we mapped them to tiles in parallel. This entire matrix multiplication requires in total cycles.
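As a minimal sketch of the mapping described above, the following Python snippet forms each output block by accumulating vector outer products and splits the reduction across several cores. The names `K` (crossbar edge), `C` (PTCs per tile), and `R` (tiles), the interleaved assignment of reduction indices to cores, and the cycle estimate are illustrative assumptions; they are not the symbols or the exact cycle formula of the architecture.

```python
from math import ceil


def matmul_by_outer_products(A, B, K, C):
    """Blocked GEMM where every K x K output block is a sum of outer products.

    K and C are hypothetical names chosen for this sketch.
    """
    M, D, N = len(A), len(B), len(B[0])
    out = [[0.0] * N for _ in range(M)]
    for r0 in range(0, M, K):              # one horizontal strip of A
        for c0 in range(0, N, K):          # one vertical strip of B
            for core in range(C):          # cores share the length-D reduction
                for d in range(core, D, C):    # one outer product per cycle
                    for i in range(r0, min(r0 + K, M)):
                        for j in range(c0, min(c0 + K, N)):
                            out[i][j] += A[i][d] * B[d][j]
    return out


def cycle_estimate(M, D, N, K, C, R):
    """Assumed cycle count: blocks spread over R tiles, reduction over C cores."""
    blocks = ceil(M / K) * ceil(N / K)
    return ceil(blocks / R) * ceil(D / C)


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
result = matmul_by_outer_products(A, B, K=2, C=2)
cycles = cycle_estimate(M=8, D=12, N=8, K=4, C=2, R=2)
```

In this sketch, the innermost two loops correspond to what one crossbar would produce in a single time step, while the loop over `d` is the part that temporal accumulation absorbs.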
Each cycle is defined as (1) feeding one vector into our PTC, (2) reading out the outer product results as photocurrent, (3) converting it to the electronic domain, and (4) accumulating partial product. As we mentioned above, each PTC consumes cycles to finish one block in the matrix, which means a conventional architecture needs to convert the photocurrent as electronic digital signals through trans-impedance amplifier (TIA) and analog-to-digital converter (ADC) at every cycle for each PTC and accumulate the result digitally with adders and registers. With a high data rate, e.g., 5–10 GHz, the AD conversion and digital accumulation cost is non-trivial, becoming a bottleneck of the performance and efficiency as the ADC power is proportional to its sampling frequency.
Next, we focus on the detailed design of a time-multiplexed PTC to explain how our architecture performs dynamic matrix-matrix multiplication. For illustration simplicity, we set the matrix with an equal number of rows and columns, i.e., , while the architecture can be applied to a matrix with arbitrary dimensions. A coherent monochromatic light source is used as the input to the photonic tensor core units. The input light is first fanned out to waveguides via a splitter. Next, a slow-light Mach-Zehnder modulator (SL-MZM) is connected in each waveguide arm, serving as the input operand modulator of the PTC. Digital electrical signals carrying the matrix information are converted to analog optical signals represented by the amplitude and phase before optical signals reach the dot-product engine for computing. Let be the electric field of the input light to the SL-MZM, and the electric field of MZM output can be expressed as , allowing broadband mapping of both positive and negative values. We consider two optical routing schemes for the PTC architecture in this work, namely, a double-layer-splitters scheme TeMPO-D and an embedded-uneven-splitters scheme TeMPO-E to guide the encoding optical signals to the targeting dot-product engines. Schematics of the proposed PTC architecture are shown in Figs. 4 and 5.
1. Double-layer-splitter PTC design TeMPO-D
A double-layer-splitter PTC design consists of two layers of optical splitters to route the encoded optical signals to the targeted dot-product engines for matrix calculation, and a schematic of the architecture is shown in Fig. 4. After the first fan-out splitter, half of the optical paths (bottom paths) are used to encode matrix via an SL-MZM array, mapping to a row vector of matrix . SL-MZMs on the top arms of the 1 splitter couple data of column vectors of matrix . The second layer consists of 1 optical splitters, each of which evenly splits the optical power with encoded information into secondary output arms so that dot products between any pair of and can be calculated simultaneously at dot-product engines. Waveguide crossings are needed for this architecture. The coded optical signals may pass up to crossings to reach the dot-product engine.
2. Embedded-uneven-splitters PTC design TeMPO-E
A schematic of the embedded-uneven-splitters PTC design, TeMPO-E is illustrated in Fig. 5. Different from the TeMPO-D design, this architecture adopts a series of uneven splitters to eliminate waveguide crossings. The splitting ratios are set at , , , and . For a PTC with dot-product engines, the splitting ratios of the two optical splitters that guide light into the dot-product engine are and , respectively, to ensure identical input power to each dot-product engine. The maximum number of crossings on the optical path is .
Comparing TeMPO-D with TeMPO-E, TeMPO-D design only requires one optical splitter before reaching the DOT engine with the cost of the increased number of waveguide crossings in some waveguide paths. For the TeMPO-E design, the number of uneven power splitters and waveguide crossings needed in each path are both , while TeMPO-D design requires waveguide crossings. We anticipate lower accumulated device loss in the TeMPO-E design when is large. In the following discussion, we only focus on the embedded-uneven-splitters design TeMPO-E and simplify it as TeMPO.
III. PHOTONIC COMPONENTS FOR PTCs
A. Laser source
A PTC that uses optical phase and amplitude in time-domain processing requires only a monochromatic light source. For integrated photonic computing chip design, O-band operation offers several distinct advantages over C-band components, such as a smaller optical mode volume in the Si/SiO2 waveguide structure, higher mode confinement with a tighter bending radius, and >1.5× higher Ge PD responsivity.27,28
The second consideration pertains to the choice between an on-chip III-V integrated laser diode and an off-chip laser module. While a laser heterogeneously bonded to Si holds the promise of a miniaturized, photolithographically defined coherent on-chip light source, it has yet to mature for mass production, and the long-term reliability of on-chip lasers remains undetermined. Laser cavities are highly sensitive to temperature variations. A heterogeneously integrated on-chip laser, sitting in close vicinity to electronics that generate considerable heat, would therefore demand more complex thermal-management circuits to keep its optical mode/wavelength, polarization, and optical power stable. Integrated optical isolators on the Si platform are not yet available from SiPho foundries, although an isolator is critical for suppressing reflections that would otherwise disturb laser operation. Unstable laser operation will, in turn, degrade the PTC performance. In this work, we advocate a technological path that uses a separate, off-chip laser module and takes advantage of the latest advances in optical packaging to achieve low insertion loss at the fiber-to-chip interface.
High-power monolithic o-band lasers, capable of producing output powers as high as 150 mW,29 are commercially available now. In this work, we utilize a moderate laser power of 100 mW for system power-related analysis and evaluation. Utilizing index-matched epoxy and emerging packaging technology, such as photonic wires,30–32 one can expect 0.5–2 dB insertion loss at the fiber to chip facet.
B. Slow-light Mach–Zehnder modulator
Mach–Zehnder modulators (MZMs) play a crucial role in converting electrical signals to the optical domain in a chip-scale PTC. Si modulators, utilizing the carrier plasma effect, offer a cost-effective and high-density integration solution for on-chip PTCs. Achieving a dot-product operation for matrices of the size of requires 2 modulators for signal conversion. The physical dimension of these Si modulators is a critical design parameter that impacts the scalability of matrix operations. In this study, we adopt a 1D dielectric photonic crystal waveguide, specifically a rectangular-shaped Bragg grating,33 as a slow-light-enabled compact modulator to significantly reduce the footprint of the modulator array.34,35 Recently, we have experimentally demonstrated a Si slow-light MZM (SL-MZM) with a phase shifter length ( ) of 150 μm for optical computation applications.36 The SL-MZM reported in this work was fabricated at AIM Photonics in a multi-project wafer (MPW) run, ensuring complete foundry compatibility. The modulator output is routed to an on-chip Ge photodetector (PD), a standard AIM PDK component with a tested bit rate of 15 Gbps. The SL-MZM, driven with signals of up to 3.5 V, was characterized at up to 6 bits of resolution using both staircase and random data inputs. The readout signals from the PD are displayed on a real-time oscilloscope, as shown in Fig. 6. The averaged variance during bit-holding time is reported as and for the staircase and random signal input cases, respectively.
Reflection occurring at different junctions within the modulator device, optical absorption due to carriers in waveguides, propagation loss in the Bragg grating phase shifter due to increased group indices, and mode mismatch at the Bragg grating waveguide interfaces are the primary factors contributing to the modulator insertion loss. The measured total modulator insertion loss is 6.4 dB for and is utilized as the loss figure in the system evaluation.
To achieve high-bit resolution at a high computing clock frequency, it is imperative to optimize both the electrical bandwidth and linearity of a Si modulator. Operating under reverse bias, the speed of a Si SL-MZM is limited by its RC time constant and photon lifetime. Typically, the PN junctions are doped at an elevated level (ranging from to ) to enhance the carrier plasma effect. As the phase shifter length is reduced in an SL-MZM, the total capacitance decreases. In this work, the measured SL-MZM junction capacitance was approximately 0.75 pF. Depending on the doping level in the connecting Si bar from the ridge waveguide to the via contacts, the intrinsic resistance of a SL-MZM ranges from 5 to 10 . The estimated RC time-limited electrical bandwidth of a SL-MZM is thus in the hundreds of GHz. The slow-light effect can be viewed as a traveling wave resonant in its propagation direction, with the optical bandwidth determined by the Q-factor of the resonator. For the rectangular Bragg grating-shaped slow-light, an optical bandwidth of approximately 26 GHz is estimated.37 However, the SL-MZM of this work did not reach its maximum bandwidth potential due to impedance mismatch of the electrodes,34 mismatch of the RF signals speed with the optical wave with high group index38 and waveguide dispersion in the slow-light spectrum. Dispersion engineering techniques such as phase-shifted Bragg grating, dispersion compensation,39 and line-shift photonic crystal waveguide are all effective approaches in reducing the dispersion-induced bandwidth penalty. With careful device design and optimization, a SL-MZM operating at a 5 GHz clock frequency is feasible, as assumed for system-level performance evaluation in this study.
C. Optical power splitter
The optical splitter is a crucial passive photonic component in integrated photonic systems for splitting optical power. Various structures, such as Y-junction splitters,40 multimode interferometers (MMIs),41,42 and directional couplers,43 have been demonstrated to achieve power splitting with varying splitting ratios. Y-junction splitters are usually compact and broadband, but their sharp corners can increase reflection, resulting in unwanted Fabry–Perot (FP) resonances in a photonic system.40 The MMI-based power splitter is suitable for uniform power splitting, but the shape of its tapered input and output waveguides needs to be carefully designed.44 By adjusting the coupling length, a directional coupler can also be used to obtain varying optical power splitting.
1. 1 × 2K optical splitter
2. Optical power splitter guiding to the dot-product engine
TeMPO adopts directional couplers with varying splitting ratios to guide the coded optical signals to each DOT engine for matrix computing. A directional coupler with even splitting is often offered as a standard PDK component by SiPho foundries. With the waveguide gap held constant, one only needs to change the coupling length to adjust the splitting ratio. With a 480 nm waveguide width and a 200 nm gap between the two parallel waveguides in the coupling region, our simulation shows that coupling lengths of 14.6, 11.2, 9.2, 8, and 7 μm achieve splitting ratios of 1:1, 1:2, 1:3, 1:4, and 1:5, respectively.
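To illustrate why a set of uneven ratios can deliver identical power to every dot-product engine, the sketch below models a lossless cascade of taps using exact fractions. The ordering of the ratios along the bus (1:5 first, 1:1 last), the six-output example, and the function name `tap_chain_powers` are assumptions made for illustration; insertion loss and crossings are ignored.

```python
from fractions import Fraction


def tap_chain_powers(ratios):
    """Power at each output of a lossless cascade of uneven splitters.

    Each (a, b) entry sends a/(a+b) of the incoming power to the drop port
    and passes b/(a+b) to the next splitter. The through port of the last
    splitter is the final output.
    """
    remaining = Fraction(1)
    outputs = []
    for a, b in ratios:
        outputs.append(remaining * Fraction(a, a + b))   # dropped to an engine
        remaining *= Fraction(b, a + b)                  # continues on the bus
    outputs.append(remaining)
    return outputs


# Hypothetical six-output bus built from the five simulated ratios.
powers = tap_chain_powers([(1, 5), (1, 4), (1, 3), (1, 2), (1, 1)])
```

Under these idealized assumptions every output receives one sixth of the input power, which is the equal-power condition the uneven splitters are meant to satisfy.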
D. Dot-product engine design
The dot-product engine, which realizes the vector–vector dot product, is the key computation unit in our proposed photonic tensor core. A dot-product engine consists of a 2 × 2 optical power splitter, a π/2 phase shifter, a pair of balanced PDs, and a time integrator. They are discussed separately in this section.
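The way these four components combine into a multiplier can be sketched with the standard coherent balanced-detection identity. The snippet below assumes an ideal lossless 50:50 coupler with transfer matrix (1/√2)[[1, j], [j, 1]], a −π/2 phase shift on the second input, unit PD responsivity, and real-valued field amplitudes; these are textbook idealizations and sign conventions chosen for illustration (a +π/2 shift would flip the sign of the result), not measured device behavior.

```python
import cmath
import math


def balanced_product(x, y):
    """Balanced-PD difference signal for real input field amplitudes x and y."""
    y_shifted = y * cmath.exp(-1j * math.pi / 2)    # pi/2 phase shifter on one arm
    s = 1 / math.sqrt(2)
    out_top = s * (x + 1j * y_shifted)              # ideal 50:50 coupler, port 1
    out_bot = s * (1j * x + y_shifted)              # ideal 50:50 coupler, port 2
    # Intensities are (x + y)^2 / 2 and (x - y)^2 / 2; their difference is 2*x*y.
    return abs(out_top) ** 2 - abs(out_bot) ** 2


def dot_product(xs, ys):
    """Temporal integration modeled as a plain sum over time steps."""
    return sum(balanced_product(x, y) for x, y in zip(xs, ys)) / 2


single = balanced_product(0.5, -0.4)
total = dot_product([1.0, -0.5, 0.25], [0.5, 0.5, -1.0])
```

Because the subtraction happens in the photocurrent domain, the sign of each product is preserved, which is what allows full-range operands without a separate offset-removal step in this idealized picture.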
1. 2 × 2 Optical power splitter
in Eqs. (11) and (12) follows the same definition as Eq. (10). Three optical power splitter designs are developed, and the results are summarized in Table I. The simulated electric field profiles are illustrated in Fig. 8, where the optical power is coupled in through one input arm, and output power is measured through both output arms. Overall, the directional coupler features lower insertion loss and smaller size, while the two MMI designs have larger bandwidth near the targeting 50:50 splitting ratio. Taking the dimension, splitting ratio, and insertion loss into consideration, the directional coupler-based optical power splitter design will be utilized in the following system-level simulation study.
Splitter design | Directional coupler | MMI (paired interference) | MMI (general interference) |
---|---|---|---|
Optical coupling region dimension (W × L) | 1.2 × 14.6 μm2 | 3.6 × 14.5 μm2 | 2.2 × 18 μm2 |
Block dimension (W × L) | 6.5 × 31 μm2 | 7 × 40.5 μm2 | 7 × 44 μm2 |
Splitting ratio | 50:50 | 52.5:47.5 | 50:50 |
Bandwidth at target splitting ratio | 1550 nm | 1500–1600 nm | 1530–1570 nm |
Insertion loss at 1550 nm | 0.05 dB | 0.18 dB | 0.37 dB |
2. π/2 phase shifter
3. Photodetector responsivity and sensitivity
Meanwhile, the balanced PD’s current range determines the integrator’s design. Given the principle of time integration, i.e., , the maximum voltage with -time step integration of -frequency datarate is . To avoid saturation-induced integration error, i.e., , we must carefully design the integration time step and the capacitance given the maximum photocurrent generated by the balanced PD. Detailed integrator design specifications are introduced in the following section.
4. Temporal integrator
The proposed time-multiplexed approach requires integration of photodetector output current for the accumulation operation as in Eq. (7). This is one of the key mechanisms in TeMPO to significantly relieve the ADC power bottleneck. Integrator design and optimization—Our integrator design objective is to support a target maximum integration time step with good linearity in the voltage response and fast reset speed. We adopt a simple, compact, and foundry-compatible means of time integration using a capacitor. Capacitive elements are well suited for analog integration of current-based signals. The voltage across the terminals is proportional to the time-integral of the current from the photodiode. After each multiply-accumulate operation is complete, the capacitor integrator will need to be discharged (reset) before the next operation. By turning on field-effect transistors (FETs) in parallel to the capacitor, the charge across the capacitor can be rapidly dissipated for reset.
Now, we show the detailed integrator design with a target maximum integration time step and linearity and reset speed considerations. The proposed integration unit is shown in Fig. 10. As indicated by the insertion loss analysis and the PD responsivity, the estimated maximum photocurrent is 110 . Given a maximum targeted voltage of , the signal data rate of 5 GHz, and a target integration time step , we can derive the capacitor fF. Therefore, two foundry-compatible thin oxide capacitors with a capacitance range of 809 fF to 3.9 nF are connected to the PD’s output. Note that besides scaling up capacitors proportionally with , one can equivalently consider scaling down laser power and thus by a factor of . This can significantly reduce laser power but at the cost of a worse signal-to-noise ratio. In our design, we maintain the same laser power and include the factor in the capacitance.
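The sizing argument can be sketched with the ideal capacitor relation V = Q/C, where the charge Q is the photocurrent multiplied by the integration window. In the snippet below, the photocurrent unit is assumed to be microamperes, the 60-step window is the value adopted later in the text, and `V_MAX_PLACEHOLDER` is a hypothetical voltage chosen only to exercise the formula. Parasitics, leakage, and PD nonlinearity are ignored, so the result need not match the SPICE-level design.

```python
def integration_window_s(steps, data_rate_hz):
    """Duration of `steps` symbols at the given data rate."""
    return steps / data_rate_hz


def accumulated_charge_c(current_a, steps, data_rate_hz):
    """Worst-case charge when the maximum current flows for the whole window."""
    return current_a * integration_window_s(steps, data_rate_hz)


def required_capacitance_f(current_a, steps, data_rate_hz, v_max):
    """Smallest ideal capacitance that keeps the voltage at or below v_max."""
    return accumulated_charge_c(current_a, steps, data_rate_hz) / v_max


I_MAX_A = 110e-6            # assumption: the stripped unit is microamperes
DATA_RATE_HZ = 5e9          # 5 GHz signal data rate from the text
STEPS = 60                  # integration length used later in the text
V_MAX_PLACEHOLDER = 1.0     # hypothetical; the target voltage is not given here

window = integration_window_s(STEPS, DATA_RATE_HZ)
charge = accumulated_charge_c(I_MAX_A, STEPS, DATA_RATE_HZ)
cap = required_capacitance_f(I_MAX_A, STEPS, DATA_RATE_HZ, V_MAX_PLACEHOLDER)
```

The same relation shows the trade-off noted above: doubling the integration length at fixed laser power doubles the required capacitance, whereas halving the photocurrent keeps the capacitance fixed at the cost of signal-to-noise ratio.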
For a linear integrator response, multiple flipped capacitor pairs are connected in parallel to achieve a symmetric circuit topology. To enable fast periodic reset, ten 40 nm n-channel and p-channel FETs are connected in parallel with the capacitor to ensure sufficient current driving capability for reset within a single baud time period. This choice accounts for the possibility of both positive and negative source current flow from the balanced photodiode, ensuring effective reset regardless of signal polarity. For simplicity, only two of each type of FET are depicted in Fig. 10.
Note that we prefer this capacitor-based design to an alternative operational amplifier (op-amp) based design due to efficiency considerations. Integrators with an op-amp and a capacitive feedback loop show desired input/output impedance; however, they are more suitable for voltage integration tasks with notably increased chip space usage and power. In contrast, the capacitor-based design has near-zero power and is more suitable for our photocurrent accumulation mechanism in TeMPO.
Integrator SPICE simulation—The integrator unit’s simulation employs flipped capacitor pairs and 40 nm FETs, as previously mentioned. We simulated a maximum current of over the entire integration period ( ) to ensure saturation of the capacitor does not occur. The FET gates received 2.5 V for 120 ps, with additional rise and fall times of 40 ps, ensuring a complete reset within a time step of . The waveforms for both the current signal and the integrated voltage signal are illustrated in Fig. 11. Given the maximum anticipated current of 110 , we recorded peak voltages of approximately 240 mV.
Integrator cost analysis—Our design shows a compact footprint of , a low power consumption of 0.3 mW, and a long integration time step , with a fast reset time of 2 time steps. Note that the integrator arrays are shared across cores in a tile; the integrator area/power cost can be further amortized by a factor of , leading to marginal hardware overhead at the system level.
Integrator’s benefits to system efficiency—To justify the efficiency benefit by setting to 60, we simulate how time step impacts the system power consumption when mapping a large matrix multiplication workload onto our architecture in Fig. 12. The TIA/ADC sampling frequency can be scaled down proportionally by times, approximately leading to lower power. To keep ADC/TIA power less than 5%, we set to 60 such that the on-chip power consumption can be drastically reduced from 68 to 16 W, with the ADC/TIA bottleneck completely eliminated.
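The scaling argument can be made concrete with a back-of-the-envelope estimate. The sketch below assumes that ADC power scales linearly with sampling rate, as stated earlier, and takes the 14.8 mW at 10 GSPS reference point from Table II. The function name and the purely linear extrapolation are illustrative assumptions; real converters have a static power floor that this ignores.

```python
def scaled_adc_power_mw(ref_power_mw, ref_rate_gsps, data_rate_ghz, steps):
    """ADC power when sampling once every `steps` symbols, assuming linear scaling."""
    sampling_rate_gsps = data_rate_ghz / steps
    return ref_power_mw * sampling_rate_gsps / ref_rate_gsps


# Per-ADC power at a 5 GHz data rate, without and with 60-step integration.
per_cycle = scaled_adc_power_mw(14.8, 10.0, 5.0, steps=1)
integrated = scaled_adc_power_mw(14.8, 10.0, 5.0, steps=60)
reduction = per_cycle / integrated
```

Under this linear model the per-converter power drops by the same factor as the integration length, which is why the conversion cost stops dominating once the integration window is long enough.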
IV. EVALUATION RESULTS
In this section, we analyze the accuracy and hardware cost of our TeMPO architecture. We focus on three variants of TeMPO with different device configurations, listed in Table II. TeMPO-Custom-SL is the fully customized architecture setting used as our final design. For a comprehensive evaluation of TeMPO-Custom-SL, we also incorporate the analysis of on-chip memory, considering its area and power impact.5 Similar to Ref. 5, the architecture has a 2 MB global on-chip SRAM buffer and a 4 KB on-chip local SRAM buffer for each tile, designed to hold two matrix multiplication workloads. To summarize, TeMPO-Custom-SL consumes 321 mm2 area and 17.5 W power at 5 GHz and integration time step, and realizes 368.6 TOPS peak computing speed with 6-bit precision, 22.3 TOPS/W energy efficiency, and 1.2 TOPS/mm2 compute density.
Device | Parameter | Value | TeMPO Foundry | TeMPO Foundry-SL | TeMPO Custom-SL |
---|---|---|---|---|---|
DAC47 | Precision | 8-bit | | | |
 | Power | 50 mW (@14 GSPS) | | | |
 | Area | 11 000 μm2 | | | |
ADC48 | Precision | 8-bit | | | |
 | Power | 14.8 mW (@10 GSPS) | | | |
 | Area | 2850 μm2 | | | |
Foundry photodetector22,49 | Power | 25 nW at −1 V | | | |
 | Sensitivity | −27 dBm | | | |
 | Area | 16 × 20 μm2 | | | |
 | Bandwidth | 27 GHz | | | |
 | Responsivity | 1.1 A/W | | | |
TIA50 | Power | 3 mW | | | |
 | Area | <50 μm2 | | | |
 | Bandwidth | 40 GHz | | | |
Foundry MZM22 | Static power | 70 nW | | | |
 | IL | 3 dB | | | |
 | Area | 1600 × 460 μm2 | | | |
 | EO bandwidth | 12.5 GHz | | | |
 | Modulation efficiency | 450 fJ/bit | | | |
 | Extinction ratio | >15 dB | | | |
Customized SL-MZM34 (fabricated at AIM) | Static power | 70 nW at −3.5 V | | | |
 | IL | 6.4 dB | | | |
 | Area | 250 × 25 μm2 | | | |
 | EO bandwidth | 10 GHz (foreseeable) | | | |
 | Modulation efficiency | 50 fJ/bit | | | |
 | Extinction ratio | 6 dB | | | |
Foundry 2 × 2 50:50 MMI49 | IL | 0.11 dB | | | |
 | Area | 36 × 10 μm2 | | | |
Customized 2 × 2 50:50 directional coupler | IL | 0.05 dB | | | |
 | Area | 31 × 6.5 μm2 | | | |
Foundry TO phase shifter49 | IL | 0.03 dB | | | |
 | Area | 75 × 75 μm2 | | | |
 | Power | Pπ = 7 mW | | | |
Customized phase shifter | IL | 0.05 dB | | | |
 | Area | 0.5 × 33 μm2 | | | |
 | Power | ∼0 W | | | |
Customized 1 × 10 splitter | IL | 0.199 dB | | | |
 | Area | 34.6 × 14.1 μm2 | | | |
Foundry 1 × 2 50:50 MMI49 | IL | 0.1 dB | | | |
 | Area | 22 × 10 μm2 | | | |
Foundry waveguide crossing49 | IL | 0.23 dB | | | |
 | Area | 8 × 8 μm2 | | | |
Fiber/chip coupling | IL | 2 dB | | | |
Laser | Wavelength | 1550 nm | | | |
A. Accuracy evaluation on real-time edge AI workloads
We evaluate TeMPO on high-performance, real-time edge machine learning tasks: a Vision Transformer (ViT) DeiT-Tiny18 for image recognition on ImageNet-1k,51 a convolutional neural network (CNN) for AR/VR voice keyword spotting on the Google Speech Command dataset,52 and an FCN-ResNet5053 model for semantic segmentation on PASCAL VOC2012.54 The evaluation covers both weight-static CNNs and Transformers with dynamic self-attention operations, across speech and vision tasks, to demonstrate versatility for diverse edge ML. During model training, we adopt a hardware-aware training flow that accounts for 6-bit weight/input quantization and measured hardware noise to guarantee robust deployment on our photonic tensor cores.
Figure 13 visualizes TeMPO on three representative edge AI workloads. Table III reports the task performance of each application under 6-bit weight/activation quantization and noise perturbations. The 6-bit quantized TeMPO stays close to the FP32 baselines: accuracy drops from 0.722 to 0.712 on ImageNet-1k and from 0.957 to 0.929 on keyword spotting, and mIoU drops from 52.28 to 51.16 on semantic segmentation.
Noise robustness evaluation: To assess the robustness of our architecture against noise, we tested our speech recognition model, trained with noise-aware training, under various noise intensities injected during inference. Figure 14(a) indicates that our architecture is highly robust to random noise. Even when the relative noise intensity increases from 0 to 0.08, the accuracy drops by only 1%. Additionally, the real noise measured during chip testing, shown in Fig. 14(b), causes a negligible accuracy drop.
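The perturbation studied above can be illustrated with a minimal NumPy sketch. It assumes symmetric uniform 6-bit quantization of both operands and additive Gaussian noise whose standard deviation is the relative noise intensity times the largest output magnitude. Neither model is specified in this section, so the function names (`quantize_symmetric`, `noisy_matmul`, `relative_error`) and the noise definition are illustrative assumptions, not the training flow used in this work.

```python
import numpy as np

def quantize_symmetric(x, bits=6):
    """Uniform symmetric fake quantization (assumed scheme, for illustration)."""
    levels = 2 ** (bits - 1) - 1              # 31 positive levels for 6 bits
    max_abs = np.max(np.abs(x))
    scale = max_abs / levels if max_abs > 0 else 1.0
    return np.round(x / scale) * scale        # values snapped to the 6-bit grid

def noisy_matmul(x, w, bits=6, noise_std=0.0, rng=None):
    """Quantize both operands, multiply, then add Gaussian noise.

    Assumption: the noise standard deviation equals `noise_std` times the
    largest output magnitude ("relative noise intensity").
    """
    rng = np.random.default_rng() if rng is None else rng
    y = quantize_symmetric(x, bits) @ quantize_symmetric(w, bits)
    sigma = noise_std * np.max(np.abs(y))
    return y + rng.normal(0.0, 1.0, size=y.shape) * sigma

def relative_error(ref, out):
    """Relative Frobenius-norm error of `out` with respect to `ref`."""
    return np.linalg.norm(out - ref) / np.linalg.norm(ref)

# Sweep the relative noise intensity over the range discussed in the text.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 32))
w = rng.normal(size=(32, 16))
ref = x @ w
for intensity in (0.0, 0.02, 0.04, 0.08):
    out = noisy_matmul(x, w, bits=6, noise_std=intensity, rng=rng)
    print(f"intensity={intensity:.2f}  relative error={relative_error(ref, out):.4f}")
```

In a noise-aware training flow, a perturbation of this kind would typically be applied in the forward pass of each layer so that the learned weights tolerate it at inference time.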
Task | Dataset | Model | FP32 performance | INT6 + noise performance |
---|---|---|---|---|
Image recognition | ImageNet-1k51 | DeiT-Tiny18 | 0.722 (accuracy) | 0.712 (accuracy) |
AR/VR voice keyword spotting | Google Speech Command52 | CNN52 | 0.957 (accuracy) | 0.929 (accuracy) |
AR/VR semantic segmentation | PASCAL VOC201254 | FCN R-50-D853 | 52.28 (mIoU) | 51.16 (mIoU) |
B. System architecture-level performance analysis
As a case study, we configure our architecture with 6 × 6 PTCs ( ), and each PTC is of size ( ), working at a clock rate of 5 GHz. We give area and power estimates for this configuration.
The DAC power is derived from a reference DAC power, given at a reference bit precision and sampling rate, and from the clock frequency. Other power terms are obtained directly from the device power specifications.
We emphasize the power-efficiency benefits of our multi-tile, multi-core architecture and temporal integration mechanism:
Multi-Tile Architecture: Our multi-tile architecture reduces the MZM and DAC power for the shared operand matrix by a factor equal to the number of tiles, since its modulation components are shared across tiles before the on-chip waveguide broadcast, as shown in Fig. 3.
Core Sharing: Multiple cores per tile share the same array of integrators, TIAs, and ADCs. Meanwhile, as analyzed in Sec. III D 4, temporal integration further reduces the TIA and ADC working frequency by a factor equal to the integration length. Hence, the TIA/ADC power is reduced overall by the product of these two factors.
Figure 16(b) shows the power breakdown of the three variants of TeMPO. Compared to the foundry MZM, which takes 450 fJ to encode each symbol, our SL-MZM takes only 50 fJ per symbol, an 89% reduction in the input tensor modulation power. With temporal integration ( ), the ADC/TIA power is reduced by 60×, making it negligible (<5%) in the system power.
Overall, our optimized TeMPO-Custom-SL architecture, equipped with energy-efficient SL-MZMs, customized splitters, phase shifters, and temporal integrators, reduces the on-chip system-level power by 9.1× compared to foundry PDK variants. Figure 17(b) indicates that TeMPO-Custom-SL consumes 17.5 W, of which 7% is from DACs. As technology continues to advance, more power-efficient DACs are expected to further boost the efficiency of TeMPO.
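The two savings quoted above can be checked with a short back-of-the-envelope sketch. The 450 fJ and 50 fJ symbol energies come from the device table. The readout split is an assumption: we read the 6 × 6 configuration as 6 cores sharing one readout array per tile and pick an integration length of 10, because that product reproduces the reported 60× reduction. The function names are hypothetical.

```python
def modulation_energy_saving(e_foundry_fj=450.0, e_custom_fj=50.0):
    """Fractional energy saved per encoded symbol by the SL-MZM."""
    return 1.0 - e_custom_fj / e_foundry_fj

def readout_power_reduction(cores_per_tile, integration_steps):
    """TIA/ADC power reduction from core sharing and temporal integration.

    Core sharing divides the number of readout channels; temporal
    integration divides their working frequency. The two factors multiply.
    """
    return cores_per_tile * integration_steps

print(f"{modulation_energy_saving():.1%}")   # 88.9%, quoted as 89% in the text
print(readout_power_reduction(6, 10))        # 60 (both arguments are assumed values)
```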
C. Tensor core efficiency and scalability analysis
In this section, we analyze the scalability of one PTC with different core sizes. Besides area, insertion loss (IL), and power, we further define computing speed, energy efficiency, and compute density. To estimate the peak performance, we define the computing speed of each core as the number of operations it completes per second, with the reset overhead included as a scaling factor. To evaluate the area efficiency, we adopt the metric of peak compute density, which measures how fast the hardware can compute per unit circuit area. For a multi-core TeMPO architecture, the peak compute density is evaluated as the total peak computing speed divided by the circuit area, at a clock frequency no higher than the maximum ADC sampling rate. The energy efficiency of the hardware measures how much energy it consumes to finish one operation; we ignore the energy cost during reset, as the accelerator is idle.
Our TeMPO architecture has PTCs, and each PTC core size varies from to . Figure 18(a) shows nearly quadratic area scaling, since most of the area is attributed to the crossbar structure with quadratically many dot-product engines. Figure 18(b) shows almost linear insertion loss scaling, as the number of crossings and splitters increases linearly with the core size. Hence, an overly large core size is inefficient due to intractable insertion loss and laser power. In Fig. 18(c), we observe that power scales linearly with core size, since the hardware power is dominated by DACs and the number of DACs needed to encode input vectors grows linearly. Compared to the quadratic power scaling in electronic circuits (as the transistor count increases quadratically with a larger core size), this linear power scaling shows the advantage of photonic computing cores. Figure 18(d) shows the superior peak performance of our multi-core photonic accelerator. With a 5 GHz computing frequency and a core size of 30–40, TeMPO can potentially realize Peta operations per second (POPS)-level computing speed. Thanks to the quadratically increasing computing speed and the linear power scaling, TeMPO shows a consistent efficiency boost with a larger core size in Fig. 18(e). In terms of compute density, a larger core size yields a higher density, as indicated by Fig. 18(f). We expect a higher compute density in the future with more compact coupler and photodetector designs as technology advances.
Overall, TeMPO scales well to larger core sizes. The maximum size of the computing core is primarily constrained by the optical system's loss, which can be significantly reduced using customized low-loss optical components fabricated with current SiPho foundry technologies. For example, to scale up the system, we could employ a dedicated laser source for each mid-scale PTC and build distributed multi-core accelerators as high-performance photonic computer clusters.
Advanced photonic packaging technologies, such as photonic wire bonding and glass-silicon co-packaging platforms, are particularly effective for minimizing loss in optical interconnects between chiplets.
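The quadratic speed scaling can be made concrete with a small sketch under stated assumptions: each core performs one multiply-accumulate per crossbar node per clock cycle, each multiply-accumulate counts as two operations, and the reset overhead is folded into a `utilization` factor set to 1. The core size of 32 is our assumption, chosen because it reproduces the reported 368.6 TOPS peak for 36 cores at 5 GHz. `peak_tops` is a hypothetical helper, not a formula taken from this work.

```python
def peak_tops(num_cores, core_size, clock_hz, ops_per_mac=2, utilization=1.0):
    """Peak throughput in TOPS for `num_cores` square cores.

    Assumes core_size**2 multiply-accumulates per core per clock cycle.
    `utilization` (<= 1) stands in for the reset-overhead scaling factor.
    """
    ops_per_cycle = num_cores * ops_per_mac * core_size ** 2
    return ops_per_cycle * clock_hz * utilization / 1e12

# 6 x 6 = 36 cores at 5 GHz; throughput grows quadratically with core size.
for k in (8, 16, 32, 40):
    print(f"core size {k:2d}: {peak_tops(36, k, 5e9):7.2f} TOPS")
```

Under these assumptions a core size of 32 gives 368.64 TOPS, and doubling the core size quadruples the throughput, while the text above reports only linear growth in power.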
D. Efficiency comparison with SoTA accelerator designs
We compare our designs with state-of-the-art (SoTA) electronic accelerators, including GPUs, TPUs, ASICs, and neuromorphic processors, e.g., IBM TrueNorth, as shown in Fig. 19. We observe that TeMPO can realize competitive energy efficiency and compute density compared to state-of-the-art digital computers. However, standard foundry PDK devices are not the most efficient designs for photonic computing. By replacing the foundry MZM with our SL-MZM alone, we boost the compute density from 0.18 TOPS/mm² (TeMPO-Foundry) to 0.89 TOPS/mm² (TeMPO-Foundry-SL). With customized SL-MZMs, splitters, and phase shifters, our fully customized TeMPO-Custom-SL pushes the Pareto frontier to a record high level. It achieves 22.3 TOPS/W and 1.2 TOPS/mm², outperforming the foundry PDK variant with 9.1× higher energy efficiency and 6.8× higher compute density. Compared to NVIDIA A100 GPU and Google TPUv4, TeMPO-Custom-SL shows 13.8× higher TOPS/W and 1.7× higher compute density, respectively.
E. Efficiency comparison with edge AI device designs
We also compare the performance and efficiency of TeMPO with three NVIDIA edge GPUs: Jetson AGX Orin, Jetson AGX Xavier, and Jetson Nano.65 TeMPO shows similar levels of power consumption (∼10 W) and die area (∼100 mm²) to those edge GPUs but achieves, on average, 793× higher energy efficiency (TOPS/W) and 163× higher area efficiency (TOPS/mm²). Hence, our photonic accelerators are well suited to a variety of real-time edge AI applications that require high performance and energy efficiency, e.g., unmanned aerial vehicles (UAVs), autonomous driving, AR/VR devices, and military equipment. Note that ultra-low-power (<mW), miniaturized wearables, e.g., biomedical devices and watches, might not be target applications at the current stage, given the power and size of high-performance photonic accelerators. We envision that foreseeable technologies, including low-power DACs/ADCs, on-chip lasers, and non-volatile materials, can largely eliminate the power bottleneck from DACs/ADCs and boost the integration density, enabling broader applications in the low-power mobile computing regime.
V. CONCLUSION
In this work, we present TeMPO, a time-multiplexed dynamic photonic tensor accelerator designed for energy-efficient edge AI applications. Through careful co-design across the device, circuit, and architecture layers, TeMPO achieves significant performance improvements compared to state-of-the-art electronic accelerators. Key innovations include customized slow-light Mach–Zehnder modulators, optical splitters, and phase shifters for low-power dynamic tensor computation; analog-domain accumulation via capacitive temporal integration to relieve the analog-to-digital conversion bottleneck; and a multi-tile, multi-core architecture for efficient hardware sharing. With 6-bit quantization, TeMPO demonstrates task accuracy comparable to digital counterparts, superior noise tolerance, a peak performance of 368.6 TOPS, an energy efficiency of 22.3 TOPS/W, and a compute density of 1.2 TOPS/mm², pushing the Pareto frontier for edge AI hardware. This work establishes a new frontier in energy-efficient analog AI hardware, paving the path for future electronic-photonic accelerators in ubiquitous edge AI applications.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
M.Z. and D.Y. contributed equally to this work.
Meng Zhang: Investigation (lead); Methodology (lead); Software (equal); Validation (lead); Visualization (equal); Writing – original draft (lead); Writing – review & editing (lead). Dennis Yin: Investigation (lead); Methodology (lead); Software (lead); Validation (equal); Visualization (equal); Writing – original draft (lead); Writing – review & editing (lead). Nicholas Gangi: Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal). Amir Begović: Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal). Alexander Chen: Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal). Zhaoran Rena Huang: Conceptualization (equal); Funding acquisition (equal); Investigation (equal); Methodology (equal); Project administration (equal); Software (equal); Supervision (equal); Validation (equal); Visualization (lead); Writing – original draft (lead); Writing – review & editing (lead). Jiaqi Gu: Conceptualization (equal); Funding acquisition (equal); Investigation (equal); Methodology (equal); Project administration (equal); Software (lead); Supervision (equal); Validation (equal); Visualization (lead); Writing – original draft (lead); Writing – review & editing (lead).
DATA AVAILABILITY
The data that support the findings of this study are available within the article.