M3ICRO: Machine Learning-Enabled Compact Photonic Tensor Core Based on Programmable Multi-Operand Multimode Interference

Photonic computing shows promise for transformative advancements in machine learning (ML) acceleration, offering ultra-fast speed, massive parallelism, and high energy efficiency. However, current photonic tensor core (PTC) designs based on standard optical components hinder scalability and compute density due to their large spatial footprint. To address this, we propose an ultra-compact PTC using customized programmable multi-operand multimode interference (MOMMI) devices, named M3ICRO. The programmable MOMMI leverages the intrinsic light propagation principle, providing a single-device programmable matrix unit beyond the conventional computing paradigm of one multiply-accumulate (MAC) operation per device. To overcome the optimization difficulty of customized devices that often requires time-consuming simulation, we apply ML for optics to predict the device behavior and enable a differentiable optimization flow. We thoroughly investigate the reconfigurability and matrix expressivity of our customized PTC, and introduce a novel block unfolding method to fully exploit the computing capabilities of a complex-valued PTC for near-universal real-valued linear transformations. Extensive evaluations demonstrate that M3ICRO achieves a 3.4-9.6x smaller footprint, 1.6-4.4x higher speed, 10.6-42x higher compute density, 3.7-12x higher system throughput, and superior noise robustness compared to state-of-the-art coherent PTC designs, while maintaining close-to-digital task accuracy across various ML benchmarks. Our code is open-sourced at https://github.com/JeremieMelo/M3ICRO-MOMMI.


I. INTRODUCTION
Photonic computing has emerged as a promising technology for high-performance and energy-efficient computing, particularly in computation-intensive artificial intelligence (AI) applications [1][2][3][4][5][6][7][8][9][10]. Photonic tensor cores (PTCs) have been developed using standard optical components to enable matrix multiplication in the analog domain at the speed of light, including free-space diffractive designs 10 and integrated photonic circuit-based designs 1,5,7,8. However, concerns regarding area efficiency and scalability arise due to the large number of bulky components used in existing PTC designs, shown in Fig. 1(a-d). Based on matrix decomposition, general matrix multiplication (GEMM), i.e., universal linear operations, can be mapped to cascaded Mach-Zehnder interferometer (MZI) arrays 1, but the large number of bulky MZIs in such a tensor core limits area efficiency and scalability. Efforts have been made to reduce the circuit footprint through approaches such as butterfly-style photonic meshes [11][12][13] with logarithmic network depth, automatically searched circuit topologies 14, and low-rank MZI arrays 15. Integrated diffractive optical neural networks (DONNs) also leverage on-chip diffractive components for high-parallelism computing 9,16,17. Besides, incoherent PTCs based on microring resonator (MRR) weight banks [18][19][20][21], phase-change material (PCM) crossbar arrays 6,22, and frequency micro-combs have been proposed for compact GEMM using multiple wavelengths. However, all the above works are based on standard components designed for optical communications. Their compute density is still limited to approximately one multiply-accumulate (MAC) operation per device, which intrinsically limits their scalability and efficiency.
To address the limitations of current PTC designs and enhance area efficiency, customized photonic devices tailored for optical computing have attracted attention. Multi-operand (MO) photonic devices have been explored to increase compute density. A compact photonic neuron based on multi-operand rings (MORRs) 12,23 was proposed to squeeze a vector dot-product and a Lorentzian nonlinear transmission into a single MORR by placing multiple controllers inside the ring, i.e., y = f(Σᵢ φᵢ(wᵢxᵢ²)). Compared to single-operand MRR weight banks 18,19, MORR arrays can significantly reduce ring resonator and wavelength usage. In the multi-operand device family, another member, the multi-operand MZI (MOMZI) 24, was recently presented to partition the phase shifter in the MZI for a vector dot-product with a sinusoidal nonlinear transmission, i.e., y = cos(Σᵢ φᵢ(wᵢxᵢ)). By squeezing vector/tensor operations into a single device, multi-operand devices represent a new design paradigm for scaling up the compute density of optical computing. However, for previous multi-operand devices, the inputs and weights are encoded as electrical control signals and controller tuning coefficients, respectively. Hence, they face limited weight reconfigurability and trainability difficulties associated with the nonlinear transmission.
To achieve a breakthrough in area efficiency compared to coherent PTCs based on basic devices, while overcoming the limitations of existing multi-operand PTCs, we propose a novel coherent multi-path PTC design, M³ICRO, based on customized programmable multi-operand multimode interference (MOMMI) devices, shown in Fig. 1(e). By leveraging the principles of light propagation and interference, combined with fine-grained refractive index tuning within the multimode waveguide, MOMMIs enable ultra-compact programmable analog matrix multiplication cores. Our proposed PTC, equipped with a machine learning-enabled training flow and a block unfolding method, facilitates efficient and differentiable training of complex-valued coherent PTCs based on customized devices and supports real-valued linear operations.
The contributions of M³ICRO are summarized as follows:
• Closing the Loop of Photonics for AI and AI for Photonics - We propose the first ML-enabled programmable photonic tensor core (PTC) based on customized optical devices.
• Ultra-Compact Single-Device Optical Matrix Unit -We introduce an ultra-compact photonic tensor core based on customized programmable MOMMIs, a single-device matrix unit beyond the conventional paradigm of one MAC/device, significantly improving compute density and area efficiency.
• Superior Expressivity and Footprint Efficiency -We enhance the expressivity of MOMMIs by developing a multi-path PTC architecture called M 3 ICRO, offering superior matrix representability and improvements in footprint efficiency over previous coherent PTCs.
• ML-Assisted PTC Training Method -We propose a novel ML-assisted training method that estimates device gradients and enables differentiable optimization of MOMMIs, eliminating the need for time-consuming simulations and accelerating the training process.
• Efficient Complex PTCs with Block Unfolding -We introduce a novel block unfolding technique, achieving efficient, full-range, real-to-real linear transformations with 4x higher efficiency than previous differential photodetection approaches.

II. PROPOSED MOMMI-BASED PTC M 3 ICRO
We introduce a compact photonic tensor core (PTC) design, M³ICRO, based on customized programmable multi-operand MMI devices. We design a programmable MOMMI and investigate its matrix expressivity. Based on it, we construct the multi-path PTC M³ICRO with a compact footprint and near-universal matrix representability. We introduce an efficient ML-based training method for customized photonic devices. Additionally, we present a novel block unfolding method to overcome optimization challenges in complex coherent PTCs.

A. Initial State Design of General MMI Device
We start our PTC design from an initial MMI structure with a compact footprint, low insertion loss, and near-uniform power splitting ratios. This requires us to carefully determine the width and length of the MMI. Consider a 2-dimensional (2-D) horizontal plane of an MMI; we denote its length as L, its width as W_MMI, the effective refractive index of the multimode region as n_eff, and the index of the cladding as n_c. We define the beat length of the two lowest-order modes as L_π = π/(β₀ − β₁) ≈ 4n_eff·W_e0²/(3λ₀), where W_e0 is the effective width of the 0-th mode. Based on the dispersion equation 25, we obtain the propagation constant spacing between the 0-th and ν-th modes as β₀ − β_ν ≈ ν(ν+2)π/(3L_π). The field profile Ψ(y, z) at the output ports can be written as the superposition of all guided modes at z = L, Ψ(y, L) = Σ_{ν=0}^{m−1} c_ν ψ_ν(y) exp(jν(ν+2)πL/(3L_π)). The output field should be a multiple self-imaging of the input field Ψ(y, 0), which holds at the condition L = p(3L_π)/N for N-fold self-imaging with integer p. To obtain an MMI with the shortest length, we set p = 1 and L = 3L_π/N, which corresponds to the first N-fold self-imaging. Based on this initialization, we simulate the figure of merit (FoM) of the MMI, defined as the product of the insertion loss and the imbalance of the power splitting, while performing a hyperparameter search on L and W_MMI to optimize the FoM. Ideally, the transfer function of a general k × k MMI corresponds to a symmetric unitary matrix, given its geometric symmetry and energy conservation. For example, after device optimization, the spatial dimensions and transfer matrix of our optimized 4×4 MMI are shown in Fig. 2. We observe a nearly symmetric unitary transfer matrix and a near-uniform power splitting ratio, which is a good initial state for the MMI.
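As a quick sanity check of the sizing rule above, a short sketch (with illustrative silicon-photonics numbers, not the paper's optimized values) computes the first N-fold self-imaging length from the beat-length approximation:

```python
def mmi_self_imaging_length(n_eff, w_e0, wavelength, n_ports):
    """Shortest N-fold self-imaging length of an MMI (p = 1).

    Uses the approximations from the section above:
      L_pi ~ 4 * n_eff * W_e0^2 / (3 * lambda0)
      L    = 3 * L_pi / N
    All lengths are in micrometers.
    """
    l_pi = 4.0 * n_eff * w_e0**2 / (3.0 * wavelength)
    return 3.0 * l_pi / n_ports

# Illustrative values: a silicon multimode region at 1550 nm.
length = mmi_self_imaging_length(n_eff=2.85, w_e0=4.0, wavelength=1.55, n_ports=4)
print(f"4x4 MMI length ~ {length:.1f} um")
```

Note that halving the port count doubles the required length, so compact high-port-count MMIs are favored by this scaling.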
B. M³ICRO: Programmable MOMMI-based PTC

Now we discuss how to make an MMI reprogrammable, and then we introduce how to construct our M³ICRO tensor core using this customized device.

Programmable MOMMI. By changing the refractive index inside the multimode waveguide region, we can program the transfer matrix of the MMI. As shown in Fig. 3, we introduce d independent tuning pads on the multimode region, each locally modulating the refractive index within its segment. In this way, we can perform fine-grained manipulation of the device transmission. Discussion on the practicality of the device implementation is in Section III H. Note that the complex-valued transfer matrix W(ε) of a d-op k × k MOMMI is reparametrized by d refractive indices, leading to a reduced degree of freedom with only d real latent variables. Therefore, the representable matrices are restricted to a subspace of arbitrary complex matrices. We sweep the refractive indices of each tuning pad and visualize the simulated transfer matrices of a 4-op 3-bit 4×4 MOMMI in Fig. 4.
A clear spiral-like matrix distribution in the parameter space can be observed as we gradually increase the normalized indices from (0,0,0,0) to (1,1,1,1) with 3-bit resolution on each pad, which represents the implementable matrix subspace.
A single MOMMI itself is an ultra-compact matrix unit. However, with a reduced number of parameters (d < 2k²), it shows limited expressivity and lacks flexible controllability over the matrix norm and signs, as evidenced in Fig. 4. Therefore, we need to enhance its expressivity with a specialized tensor core architecture.

Multi-Path PTC M³ICRO. To enhance the matrix expressivity, we introduce the multi-path PTC M³ICRO in Fig. 5, constructed by cascading C blocks of interleaved MOMMIs and modulators along P parallel paths. Each MOMMI serves as an all-to-all channel mixer to create dense signal interactions. The internal diagonal complex matrix Σ ∈ C^{k×k} handles row/column scaling to modulate the matrix norm and signs. The transfer matrix of the multi-path PTC M³ICRO is the sum over the P parallel paths of the per-path cascades, W = Σ_{p=1}^{P} W_p^{(C)} Σ_p^{(C−1)} W_p^{(C−1)} ⋯ Σ_p^{(1)} W_p^{(1)}, where W_p^{(c)} is the c-th MOMMI on path p and Σ_p^{(c)} is the c-th diagonal modulator. For a k × k multi-path PTC, instead of having the 2k² real parameters of a general complex matrix, it has a reduced number of latent real variables, i.e., PCd + 2P(C − 1)k. As the architecture design variables of M³ICRO, P and C can be adjusted to trade off hardware efficiency and matrix expressivity. For example, if (k, d, P, C) = (4, 4, 2, 2), it has exactly the same parameter count as a general complex matrix unit, i.e., 32.
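A minimal sketch of how such a multi-path transfer matrix could be composed numerically; the interleaving order is taken from the cascade described above, and `multipath_ptc` is a hypothetical helper, not the paper's implementation:

```python
import numpy as np

def multipath_ptc(mommis, sigmas):
    """Compose the multi-path PTC transfer matrix.

    mommis: list of P lists, each holding C complex (k, k) MOMMI matrices.
    sigmas: list of P lists, each holding C-1 complex length-k diagonals.
    Within a path, MOMMIs and diagonal modulators are interleaved;
    the P path outputs are summed coherently.
    """
    k = mommis[0][0].shape[0]
    total = np.zeros((k, k), dtype=complex)
    for path_mmis, path_sigmas in zip(mommis, sigmas):
        w = path_mmis[0]
        for sigma, mmi in zip(path_sigmas, path_mmis[1:]):
            w = mmi @ np.diag(sigma) @ w
        total += w
    return total

# Parameter-count check for (k, d, P, C) = (4, 4, 2, 2):
k, d, P, C = 4, 4, 2, 2
n_params = P * C * d + 2 * P * (C - 1) * k
assert n_params == 2 * k**2  # matches a dense complex k x k matrix (32)
```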

C. Efficient Complex Tensor Core via Block Unfolding
A complex matrix unit seems to have higher expressivity than its real counterpart since it doubles the parameter count. However, this is often not true when applied to neural networks, which require real-valued operations, e.g., activation functions, normalization, pooling, and loss functions. Therefore, to fit into the widely used real-valued DNN paradigm, we need to construct a photonic tensor core that supports full-range real-valued inputs/outputs. Previous methods either (1) enforce a real transfer matrix that wastes the multiplication of the imaginary part, e.g., MZI arrays 1,11,[26][27][28], or (2) remove the phase information by extracting the light intensity through photodetection, which only supports non-negative outputs 13,16,17. For case (2), differential photodetection is widely used to create full-range output vectors, i.e., y = |W₊x| − |W₋x|, where W₊ and W₋ are two complex matrix units, as shown in Fig. 6(a). However, this method introduces undesired nonlinearity, which breaks the linear property and leads to optimization difficulty. Moreover, such a method is inefficient, as it uses two k × k complex matrix units while the effective computation is one k × k real matrix-vector multiplication.
To solve those problems caused by complex-valued tensor cores, we propose a block unfolding method to enable efficient, full-range, real-to-real linear transformation.
Note that unfolding the output vector is equivalent to unfolding the complex weight matrix W ∈ C^{(M/2)×N} into a 2× larger real-valued matrix W̃ ∈ R^{M×N}. With this method, we fully leverage the actual computing capability of the tensor core with only MN/(2k²)·#Params(W_ij) parameters, which is twice as efficient as enforcing a real transfer matrix and 4 times as efficient as the differential photodetection method. Note that this method is generic: any k × k complex-valued PTC, once equipped with our block unfolding, can support (2k) × k real matrix multiplication in one shot. The coherent detection of both phase and magnitude can be implemented using self-analyzers 29,30.
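The unfolding can be sketched numerically; the helper below is illustrative (the block size and interleaving order are assumed from Fig. 6(b)) and verifies that the real unfolded weight reproduces the complex PTC output:

```python
import numpy as np

def block_unfold(w_complex, block=2):
    """Unfold a complex (M/2, N) weight into a real (M, N) weight.

    Interleaves the real and imaginary parts in row blocks of size
    `block`, mirroring the block-wise unfolding of the output vector.
    """
    rows = []
    for i in range(0, w_complex.shape[0], block):
        rows.append(w_complex[i:i + block].real)
        rows.append(w_complex[i:i + block].imag)
    return np.vstack(rows)

rng = np.random.default_rng(0)
M, N = 8, 6
w = rng.normal(size=(M // 2, N)) + 1j * rng.normal(size=(M // 2, N))
x = rng.normal(size=N)               # real-valued input
z = w @ x                            # complex PTC output
y = block_unfold(w, block=2) @ x     # real-to-real linear transform
# The unfolded real output carries exactly Re(z) and Im(z), block by block.
assert np.allclose(y[0:2], z[0:2].real) and np.allclose(y[2:4], z[0:2].imag)
```

Because the map stays strictly linear, gradients flow through it without the absolute-value nonlinearity of differential detection.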

D. Machine Learning-Enabled Differentiable Optimization
Optimizing customized photonic devices is challenging since it relies on time-consuming optical simulation involving Maxwell equation solving, eigenmode decomposition, and S-parameter extraction. Such a complicated process is usually treated as a black box and cannot be embedded into the outer-loop NN training. To enable efficient optimization of the device variables ε, we employ ML for photonics by introducing a differentiable photonic hardware estimator (DPE) that predicts the transfer matrix from the device variables, W_θ(ε) = f_θ(sin(ω·Q(ε) + φ)), where f_θ(·): R^d → C^{k×k} is a multi-layer perceptron, Q(ε) is the quantized refractive index, and ω and φ are learnable parameters of the predefined sinusoidal features. The reparametrization on W_θ guarantees a symmetric transfer matrix based on prior knowledge.

As shown in Fig. 7, we build a differentiable training method with gradient replacement and straight-through estimator (STE) techniques. In the forward procedure, we quantize the refractive indices to b-bit levels and look up the ground-truth table to obtain the transfer matrix W(ε) for forward propagation. During backward, we redefine the gradient as ∂L/∂ε := (∂L/∂W)·(∂W_θ/∂ε), where ∂W_θ/∂ε is calculated by auto-differentiation through the NN predictor. All other terms during forward and backward are based on W(ε) to eliminate gradient approximation error accumulation for higher estimation fidelity. Figure 8 visualizes the predicted device behavior and shows superior fidelity with 1.1e-4 mean-square error (MSE) compared to the ground-truth targets. Most importantly, the predictor behaves as a high-quality first-order oracle with a very smooth landscape that can provide reliable and informative first-order gradient information to guide optimization.
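A toy scalar sketch of the gradient-replacement idea, with a closed-form cosine standing in for both the ground-truth lookup table and the MLP surrogate (all functions here are illustrative stand-ins, not the actual DPE):

```python
import numpy as np

def quantize(eps, bits=3):
    """Quantize a normalized index in [0, 1] to b-bit levels."""
    levels = 2 ** bits - 1
    return np.round(eps * levels) / levels

def forward_lookup(eps_q):
    """Stand-in for the ground-truth table W(eps): a toy smooth scalar map."""
    return np.cos(2 * np.pi * eps_q)

def surrogate_grad(eps):
    """Derivative of the differentiable surrogate (stand-in for the DPE MLP)."""
    return -2 * np.pi * np.sin(2 * np.pi * eps)

def ste_step(eps, target=0.0, lr=0.05, bits=3):
    """One step: forward uses the quantized lookup; backward replaces the
    non-differentiable dW/d(eps) with the surrogate's gradient (STE)."""
    w = forward_lookup(quantize(eps, bits))     # forward: table lookup
    loss = 0.5 * (w - target) ** 2
    grad = (w - target) * surrogate_grad(eps)   # backward: replaced gradient
    return eps - lr * grad, loss

eps, losses = 0.1, []
for _ in range(100):
    eps, loss = ste_step(eps)
    losses.append(loss)
```

Even though the forward output is piecewise constant in ε, the surrogate gradient steadily drives the quantized device toward the target response.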

E. Expressivity of Programmable MOMMI
To evaluate the matrix expressivity of our multi-path MOMMI-based PTC M³ICRO, we perform numerical analysis on different PTC designs in Fig. 9. We randomly generate 40k real matrices from a Gaussian distribution, train the differentiable surrogate model of each PTC design with block unfolding to approximate those random real matrices, and then evaluate the fidelity from the average relative ℓ₂ matrix distance, i.e., F = 1 − E[∥Ŵ − W∥₂/∥W∥₂], where Ŵ is the matrix implemented by the PTC and W is the target. First of all, the diagonal matrix used for norm and phase tuning is critical to the expressivity. From the expressivity colormap, we can conclude the following trade-offs. (1) Increasing the cascading depth C is more effective in boosting the expressivity, but it significantly increases the circuit depth, leading to higher delay and insertion loss. (2) Increasing the parallel path count P is less effective in boosting expressivity, since it only interpolates inside the convex hull of the subspace and introduces extra signal splitting and combining cost; however, it does not increase the critical path length. (3) With large enough photonic mesh width and depth, our M³ICRO can potentially realize 100% matrix expressivity as a universal linear unit. We also compare our M³ICRO with previous PTC designs in Fig. 10 across different matrix sizes. M³ICRO variants have comparable expressivity to the universal MZI array and significantly outperform previous compact PTC designs based on FFT 11,16 and trainable butterfly 13,31 topologies.
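The fidelity metric can be sketched as follows (the Frobenius norm is assumed here for the ℓ₂ matrix distance):

```python
import numpy as np

def fidelity(w_hat, w_target):
    """Fidelity F = 1 - E[ ||W_hat - W|| / ||W|| ], the average relative
    l2 matrix distance described above (Frobenius norm assumed).
    Accepts a single matrix or a batch with leading batch axes."""
    err = (np.linalg.norm(w_hat - w_target, ord='fro', axis=(-2, -1))
           / np.linalg.norm(w_target, ord='fro', axis=(-2, -1)))
    return float(1.0 - err.mean())
```

A perfect implementation gives F = 1; an all-zero implementation gives F = 0, so noise sensitivity in Section III D can be read off directly as 1 − F.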

F. Hardware Performance and Efficiency Analysis
Section II E discussed the trade-offs between hardware cost and matrix expressivity for different depths C and parallel path counts P. To cover two representative design points for the following discussion, we design a compact variant named M³ICRO (log) and a larger but more expressive variant M³ICRO (univ). For M³ICRO (log), we prioritize area efficiency and target ∼70% expressivity. We design M³ICRO (log) as a dual-path PTC, i.e., P = 2, with a logarithmic circuit depth C = ⌊log₂ k⌋. For M³ICRO (univ), we prioritize expressivity with >90% fidelity and design it as a near-universal PTC. We empirically set 70% of the full parameter count as a target, assume P ≈ C, d = k, and α = 0.7, and solve PCk + 2P(C − 1)k ≈ 2αk². The following analysis mainly focuses on these two variants of the M³ICRO architecture.
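Under the stated assumptions (P ≈ C, d = k), the sizing heuristic reduces to a quadratic in C; a small sketch:

```python
import math

def solve_depth(k, alpha=0.7):
    """Solve P*C*k + 2P(C-1)k ~ 2*alpha*k^2 for the depth C, assuming
    P = C and d = k (the M3ICRO (univ) sizing heuristic above).
    With P = C the left side becomes 3C^2*k - 2C*k, so after dividing
    by k we solve 3C^2 - 2C - 2*alpha*k = 0 and round to an integer."""
    c = (2 + math.sqrt(4 + 24 * alpha * k)) / 6
    return max(1, round(c))

def solve_log_depth(k):
    """M3ICRO (log) uses P = 2 and logarithmic depth C = floor(log2 k)."""
    return max(1, int(math.log2(k)))
```

For example, a 16×16 near-universal core needs a depth of about C = 3, while the log variant at k = 64 needs only C = 6.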
Footprint. We derive the total device footprint of a PTC as A_total = A_core + A_laser + (k − 1)·A_Y + k·A_MZM + k·A_PD, where the footprint of the computing core A_core is derived in Appendix A and Table IV, and A_laser, A_Y, A_MZM, and A_PD represent the footprints of the laser, the Y-branches used for on-chip channel splitting, the input modulators, and the photodetectors, respectively. We plot the total footprint A_total of different PTCs with increasing core sizes in Fig. 11(a). Our M³ICRO (log) PTC shows good footprint scalability and is 1.6∼8.9× more compact than the MZI array, 1.1∼4.8× smaller than FFT/Butterfly-style PTCs with a block size of 4, and 1∼3.5× smaller than FFT/Butterfly PTCs with a block size of 8. M³ICRO (univ) generally has a comparably compact footprint to the Butterfly-8 PTC.

Insertion Loss. As an important design metric, circuit insertion loss (IL) impacts the required laser power. High insertion loss fundamentally limits the PTC's power efficiency and scalability. The theoretical insertion loss IL_core (in dB) of different PTCs is summarized in Table IV.
Figure 11(b) shows the insertion loss scalability of different photonic computing cores, excluding signal splitting and input modulators. With a 64×64 core size, the MZI mesh has almost 97 dB insertion loss, while our M³ICRO shows less than 16 dB IL. Such a low insertion loss fundamentally enables further scaling to larger core sizes with affordable laser power.

Peak Compute Speed and Density. To estimate the peak speed, we derive the PTC delay by accumulating the delays from the electrical control to the final result readout 13,32, τ = τ_EO + τ_core + τ_PD + τ_ADC. We assume τ_EO = 10 ps for the electrical-to-optical (EO) conversion, τ_PD = 10 ps for photodetection, and τ_ADC = 200 ps for 5 GSPS analog-to-digital conversion (ADC). The optical path delay of the tensor core τ_core is derived from the total length of the cascaded devices along the critical path, which is summarized in Table IV. The peak computing speed on a k × k matrix-vector multiplication workload is defined as 2k²/τ. Note that when our block unfolding method is applied, the peak computing speed doubles, i.e., 4k²/τ, as it finishes twice the computation in one shot compared to the differential detection method. The peak computing speed (TOPS) of different PTCs with increasing core sizes is compared in Fig. 11(c). Our M³ICRO (log) has 4.4× and 1.6× faster peak computing speed than MZI arrays and Butterfly-style PTCs, respectively. The speed can scale further with a larger core size or wavelength-division multiplexing (WDM) for multi-wavelength parallel computing, thanks to the broadband property of our design.
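The delay and peak-speed model above can be sketched as follows (τ_core is design-specific and read from Table IV; the 10 ps value below is purely illustrative):

```python
def peak_speed_tops(k, tau_core_ps, unfold=True):
    """Peak compute speed (TOPS) for a k x k MVM per the delay model above:
    tau = tau_EO + tau_core + tau_PD + tau_ADC, with speed 4k^2/tau when
    block unfolding is applied and 2k^2/tau otherwise."""
    tau_ps = 10.0 + tau_core_ps + 10.0 + 200.0  # EO + core + PD + ADC, in ps
    ops = 4 * k * k if unfold else 2 * k * k
    return ops / tau_ps  # ops per picosecond == tera-ops per second

# Illustrative: a hypothetical 10 ps optical-path delay for a 64x64 core.
print(f"{peak_speed_tops(64, tau_core_ps=10.0):.1f} TOPS")
```

Because τ is dominated by the 200 ps ADC term, shortening the optical critical path mainly pays off through insertion loss and footprint rather than raw delay.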
In terms of area efficiency (compute density), shown in Fig. 11(d), since our M³ICRO is very compact in spatial footprint, it shows 38.5× and 9.9× higher TOPS/mm² than MZI arrays and butterfly structures, respectively.

Power and Energy Efficiency. The power of the photonic tensor core mainly consists of four parts, i.e., the laser, the input modulators, the weight programming in the core, and the photodetection, P_total = P_laser + P_mod + P_wt + P_PD. The weight programming power P_wt is zero if non-volatile phase shifters are used 33. The input modulation power P_mod and detection power P_PD are the same for all coherent PTCs using MZMs. Given the photodetector sensitivity S, an ADC resolution of b bits (we assume 8-bit here), and laser wall-plug efficiency η, the required wall-plug laser power is P_laser = 10^((S+IL)/10) × 2^b/η, where the total insertion loss IL includes the loss of the computing core IL_core (in Table IV) and the loss of the Y-branch splitting tree and input MZMs for k channels, i.e., IL = IL_core + ⌈log₂ k⌉·IL_Y + IL_MZM. The detailed device parameters used in the calculation are listed in Table III. With different core sizes k, we show the power consumption of different designs in Fig. 11(e).
Butterfly-style PTCs and our M³ICRO have much lower insertion loss than MZI meshes, which translates into considerably better power scalability to larger core sizes. The energy efficiency is defined as the ratio of peak computing speed to power (TOPS/W). Figure 11(f) shows that our 64×64 M³ICRO (log) and M³ICRO (univ) architectures achieve 289.9 TOPS/W and 128.4 TOPS/W energy efficiency, outperforming Butterfly-style PTCs by 2.91× and 1.39×, respectively.
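The laser power model can be sketched as follows (the wall-plug efficiency below is an assumed placeholder, not the Table III value):

```python
def laser_power_mw(sensitivity_dbm, il_db, adc_bits=8, wallplug_eff=0.2):
    """Wall-plug laser power, P = 10^((S + IL)/10) * 2^b / eta, following
    the power model above. S is the detector sensitivity in dBm and IL the
    total insertion loss in dB; eta = 0.2 here is illustrative only."""
    return 10 ** ((sensitivity_dbm + il_db) / 10.0) * 2 ** adc_bits / wallplug_eff

# A ~16 dB core IL (M3ICRO at 64x64) vs. ~97 dB for an MZI mesh: the
# exponential IL term makes the loss gap the dominant power bottleneck.
```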

III. EVALUATION
We conduct various simulation-based evaluations of our M³ICRO PTC designs in terms of expressivity, quantization tolerance, and noise robustness. We mainly compare M³ICRO with (1) the MZI array 1, (2) the FFT-based PTC with fixed optical Fourier transform modules 11,16, and (3) the Butterfly-style PTC with trainable butterfly transforms 12,13. Note that we do not compare with other multi-operand tensor cores since they are incoherent architectures with nonlinear transmissions and limited training scalability, especially on large NN models. We also show the effectiveness of our block unfolding method.

A. Training Setups
We train optical neural network models based on the open-source library TorchONN and adopt the same settings for all PTC designs. We first train a software digital NN model and use it as a teacher model T. Its optical analog version is called the student model S. As an initialization, we map the teacher's weight matrices W_ij^T blockwise to the student counterparts W_ij(ε, Σ) by solving the optimization problem min_{ε,Σ} ∥W_ij(ε, Σ) − W_ij^T∥₂². After mapping, we fine-tune the student with knowledge distillation, min_{ε,Σ} L_CE(y^S, ŷ) + ηβ²·D_KL(y^S/β, y^T/β), where L_CE is the cross-entropy loss between the student predictions and the labels, D_KL is the KL divergence between student and teacher predictions, β is the temperature (β = 2), and η is set to 0.1 to balance the two loss terms. During the 3000-step mapping stage, we use the Adam optimizer with an initial learning rate of 1e-2 for Σ and 1e-3 for ε. Cosine learning rate decay is adopted. The fine-tuning stage learning rates are set to 3e-4 for Σ and 4e-4 for ε. The NN device predictor is a 6-layer MLP: (2k)-(256)×3-(128)×2-(2k²).
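A sketch of the distillation objective in plain NumPy (the KL direction, teacher-to-student, is an assumption, since the section only names D_KL):

```python
import numpy as np

def softmax(logits, temp=1.0):
    z = logits / temp
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, labels, eta=0.1, beta=2.0):
    """Knowledge-distillation objective sketch: cross-entropy to the labels
    plus eta * beta^2 * KL(teacher || student) at temperature beta."""
    p_s = softmax(student_logits)
    ce = -np.log(p_s[np.arange(len(labels)), labels]).mean()
    p_t = softmax(teacher_logits, beta)
    p_s_temp = softmax(student_logits, beta)
    kl = (p_t * (np.log(p_t) - np.log(p_s_temp))).sum(axis=-1).mean()
    return ce + eta * beta**2 * kl
```

The β² factor keeps the soft-label gradient magnitude comparable across temperatures, which is why it appears in the objective above.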

B. Accuracy Evaluation
Table I shows a comprehensive comparison among different PTC designs on three NN models and image classification datasets. The universal MZI array represents the ideal software NN accuracy. We observe unsatisfactory accuracy from the FFT-based PTC due to its fixed Fourier transform and limited matrix expressivity. The butterfly PTC has trainable phases in the butterfly transform, showing enhanced accuracy on different tasks compared to the FFT designs. Across different MOMMI sizes, our M³ICRO (log) and M³ICRO (univ) series outperform the compact butterfly designs on all benchmarks. Our specially designed universal M³ICRO variants show the best accuracy. Even with 10×10 MOMMIs, the universal variant maintains >0.9 matrix expressivity with <0.5% accuracy degradation compared to the ideal digital software model.

C. Quantization Tolerance Evaluation
In practice, the index tuning precision inside the MOMMI is quantized for control efficiency considerations. In Fig. 12, we illustrate the impact of the ε bitwidth on the PTC matrix expressivity and the corresponding accuracy on the ResNet-20 CIFAR-10 benchmark. To stabilize the optimization of the discrete device control variables ε, we set the initial learning rate to min(α₀, α₀ × 2^(b−2)) with α₀ = 5e-6. For M³ICRO (log)-4, the expressivity drops at lower bitwidths, while the task accuracy maintains a <1% drop with 4-bit or higher resolution, which is suitable for efficient device control. For M³ICRO (univ)-5, the fidelity stays >0.8 even with a fixed MOMMI (0-bit), and the accuracy remains almost unchanged with more than 2 bits. Binary and fixed MOMMIs suffer from overly limited matrix expressivity, which further necessitates and demonstrates the superiority of the programmability of our MOMMI device over previous passive/fixed designs 11,16. Overall, our MOMMI-based PTC M³ICRO shows great task accuracy and quantization tolerance with 4- to 8-bit device controls, a range considered efficient and practical for most analog ML accelerators.

D. Device Noise Robustness Evaluation
We evaluate the noise tolerance of our proposed M³ICRO PTC design against random index perturbations from non-ideal control signals or environmental variations. We mainly compare with the butterfly PTC since it has SoTA noise tolerance due to its logarithmic network depth 12,13. In Fig. 13, we first compare the relative matrix ℓ₂ error, i.e., 1 − F, caused by various device noise intensities. The noise is sampled from ∆ε ∼ N(0, σ²) for the indices of M³ICRO with a maximum tuning range of 1, and from ∆φ ∼ N(0, (2πσ)²) for the phases in the butterfly PTC with a maximum tuning range of 2π 34. We observe significantly lower sensitivity of M³ICRO compared to the butterfly designs. We further evaluate the accuracy degradation on ResNet-20 CIFAR-10. All M³ICRO variants outperform the butterfly designs with better noise tolerance.

E. Ablation Study on Block Unfolding
Table II compares PTCs with differential photodetection and block unfolding on different benchmarks. Differential photodetection consumes 4 times the parameters and hardware cost to perform a nonlinear real-to-real transformation with a balanced output range. It casts significant optimization difficulty, leading to severe accuracy drops or even divergence on MobileNetV3. In contrast, our block unfolding achieves close-to-digital accuracy because it enables a real-to-real full-range linear transform, which is compatible with direct weight mapping.

F. Advance Compute Density vs. Efficiency Pareto Frontier

In Fig. 14, we plot different NN hardware designs in the compute density (TOPS/mm²) and energy efficiency (TOPS/W) space, including analog electronics [35][36][37], digital electronics [38][39][40][41][42], and analog photonic tensor cores. Note that the PTCs are configured with a single 64×64 core, much smaller than the electronic counterparts with multiple large-size (>1024) cores. Analog electronics have relatively high energy efficiency but low compute density. SoTA digital processors, e.g., TPUv4 40 and the A100 GPU 39, show comparable energy efficiency with around 1 TOPS/mm² area efficiency. Analog photonic tensor cores outperform SoTA digital electronics by over two orders of magnitude in energy efficiency, while their compute density is still around 1 TOPS/mm². With customized MOMMI devices, our M³ICRO designs, especially the M³ICRO (log) variant, achieve 3-10 TOPS/mm² compute density, significantly advancing the Pareto frontier. With more compact MMI designs and multiple wavelengths, the compute density of M³ICRO can potentially reach an even higher level.

G. System Throughput Comparison
We use an internal system-level photonic accelerator simulator to evaluate the throughput of different PTC designs in Fig. 15. The detailed architectural simulation setup is in Appendix B. We adjust the core configurations to maintain similar area budgets across all PTCs for a fair comparison. Our compact MOMMI-based design equipped with block unfolding allows more cores on chip while boosting the effective computing speed. Our M³ICRO variants, on average, show 3.7-12× higher throughput (FPS) than the baseline PTCs and 34.8-403× higher throughput than an NVIDIA A100 GPU.

H. Discussion on Implementation of Programmable MOMMI
The practicality of index tuning in a multimode waveguide has been widely discussed in the literature [43][44][45]. In this work, the MOMMI device is designed for weight-static linear transformation, which does not require high-speed modulation. Hence, we can use existing low-speed phase modulators as the tuning pads. For example, we can place thermal tuning pads on top of the multimode waveguide region, which has been experimentally demonstrated 44,45. To reduce power consumption, we can also use non-volatile phase-change materials (PCM) 46 or liquid crystals (LC) 43 as the tuning pads, which offer high index contrast and low static power consumption. If the application requires high-speed weight reconfiguration, we can adopt electro-optic (EO) index-tuning materials, such as thin-film lithium niobate 47.

IV. CONCLUSION
In this study, we propose the first machine learning-enabled multi-path photonic tensor core, M³ICRO, based on customized programmable multi-operand multimode interference devices. We thoroughly investigate its matrix expressivity and enable efficient PTC optimization with an ML-based training method. We further introduce a block unfolding technique to enable full-range real-to-real linear transforms for complex-valued PTCs with 4 times higher efficiency than the differential photodetection approach. Extensive evaluation shows that our customized M³ICRO PTC achieves close-to-digital task accuracy, 1.6-4.4× higher speed, 9.9-38.5× higher compute density, superior noise robustness, 3.7-12× higher system throughput than previous SoTA coherent PTCs, and 34.8-403× higher throughput than an A100 GPU. This study opens up new possibilities for device customization and strengthens the integration of photonics and machine learning, driving the scalability and efficiency of photonic ML computing.
TABLE IV: Footprint (A_core), insertion loss (IL_core), and delay (τ_core) analysis of photonic tensor cores. A is footprint, IL is insertion loss, and L is device length. W₀L₀ is the area of a reference k₀ × k₀ MMI. We assume the MMI area scales with k² based on Eq. (1). FFT/Butterfly-k′ means the PTC uses blocks of size k′. If k > k′, the matrix is chunked into (k/k′) × (k/k′) blocks of size k′ × k′. #CR(k′) and #CCR(k′) are the total crossing count and the number of cascaded crossings in the critical path. n_g is the group index and c is the free-space speed of light.

FIG. 1 :
FIG. 1: Overview of photonic tensor core designs with increasing compute density. PTCs with standard devices: (a) MZI array 1, (b) MRR weight bank 18, (c) Butterfly-style PTC 11, and (d) PCM crossbar 6. (e) Our proposed M³ICRO PTC with customized MMI devices, trained with a machine learning-based approach.

FIG. 2 :
FIG. 2: (a) Real and (b) imaginary parts of the transfer matrix of the optimized 4×4 MMI. (c) Detailed sizes of the MMI. (d) The transfer matrix of the optimized MMI is close to a unitary matrix.

FIG. 6 :
FIG. 6: (a) Compared to previous complex-valued photonic tensor core designs with differential photodetection, our proposed block unfolding method supports a purely linear transform with 4 times fewer parameters. (b) Illustration of block unfolding, which interleaves the output vector's real and imaginary parts block-wise.
Figure 6(b) illustrates the principle of block unfolding. For an N-input, M-output real linear layer, we first construct an (M/2) × N complex matrix and partition it into a series of k × k blocks. Each complex submatrix W_ij ∈ C^{k×k} is implemented by a k × k complex PTC. The real and imaginary parts of the output vector z ∈ C^{M/2} are unfolded blockwise, i.e., each length-k block z_i contributes the stacked real output [ℜ(z_i); ℑ(z_i)].

FIG. 9:
FIG. 9: Numerical analysis of the matrix expressivity (fidelity) of different MOMMI-based PTC designs. The colormap shows how the expressivity changes with different numbers of parallel paths P and cascaded blocks C.

FIG. 13 :
FIG. 13: Noise robustness evaluation for different tensor core designs on ResNet-20 CIFAR-10 with various device noise intensities. All tensor cores adopt the proposed block unfolding method. Error bars show the standard deviation of the accuracy. The proposed M³ICRO series shows superior robustness compared to previous SoTA butterfly-style PTCs.

FIG. 15 :
FIG. 15: Comparison of the system-level throughput of the single-example inference task in frames per second (FPS) among PTCs (@5 GHz clock) and the NVIDIA A100 GPU. The FPS of the A100 is measured with PyTorch's benchmarking tool using mixed precision.

TABLE I :
Compare accuracy across different PTC designs on various models and datasets.

TABLE II :
Evaluate the effectiveness of our proposed block unfolding (unfold) and the previous differential photodetection (diff) method. "div" means divergence due to instability caused by the nonlinear absolute-value operations.

TABLE III :
Adopted component parameters in M 3 ICRO.IL represents insertion loss.