With an ongoing trend in computing hardware toward increased heterogeneity, domain-specific coprocessors are emerging as alternatives to centralized paradigms. The tensor core unit has been shown to outperform graphics processing units by almost 3 orders of magnitude, enabled by a stronger signal and greater energy efficiency. In this context, photons bear several synergistic physical properties while phase-change materials allow for local nonvolatile mnemonic functionality in these emerging distributed non-von Neumann architectures. While several photonic neural network designs have been explored, a photonic tensor core to perform tensor operations is yet to be implemented. In this manuscript, we introduce an integrated photonics-based tensor core unit by strategically utilizing (i) photonic parallelism via wavelength division multiplexing, (ii) throughputs as high as 2 peta-operations per second enabled by tens of picosecond-short delays from optoelectronics and compact photonic integrated circuitry, and (iii) near-zero static power-consuming novel photonic multi-state memories based on phase-change materials featuring vanishing losses in the amorphous state. Combining these physical synergies of material, function, and system, we show, supported by numerical simulations, that the performance of this 4-bit photonic tensor core unit can be 1 order of magnitude higher for electrical data. The full potential of this photonic tensor processor is delivered for optical data being processed, where we find 2–3 orders of magnitude higher performance (operations per joule), as compared to an electrical tensor core unit, while featuring similar chip areas. This work shows that photonic specialized processors have the potential to augment electronic systems and may perform exceptionally well in network-edge devices in the looming 5G networks and beyond.

## I. INTRODUCTION

Replicating brain functionalities remains a captivating challenge, which not only inspires human feats but also has been shown to provide technological usefulness for modern societies. Indeed, Machine Learning (ML), performed by neural networks (NN), has become a popular approach to Artificial Intelligence (AI) and consists of training a system to learn how to perform unsupervised decision classifications on unseen data; once a NN is trained, it can be implemented to produce an *inference*, in other words, recognizing and classifying objects or patterns.

Most NNs unravel multiple layers of interconnected neurons/nodes. Each neuron and layer, as well as the network interconnectivity, is essential to perform the task for which the network has been trained. In their connected layer, NNs strongly rely on vector matrix math operations,^{1} in which large matrices of input data and weights are multiplied, according to the training. Complex, multi-layered deep NNs, in fact, require a sizeable amount of bandwidth and low latency to satisfy the vast operation required to perform large matrix multiplication (MM) without sacrificing efficiency and speed.^{2}

Since the dawn of the computing era, due to the ubiquity of matrix algebra, which extends to neuromorphic computing, researchers have been investigating optimized ways to efficiently multiply matrices.^{3–6} Indeed, engineering a platform that performs faster and more energy-efficient matrix multiplication enables the solving of linear algebraic problems, such as inverting matrices, solving systems of linear equations, and finding determinants. Even some basic graph algorithms^{7} are limited by the speed at which matrix multiplication is computed.

For a general-purpose processor offering high computational flexibility, these matrix operations take place serially (i.e., one-at-a-time) while requiring continuous access to the cache memory, thus generating the so-called “von Neumann bottleneck.” Specialized architectures for NNs, such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), have been engineered to reduce the effect of the von Neumann bottleneck, enabling cutting-edge machine learning models. The paradigm of these architectures is to offer domain-specificity, such as optimization for convolutions or Matrix-Vector Multiplications (MVM), performing operations in parallel, unlike CPUs, and thus allowing the deployment of a systolic algorithm.

GPUs have thousands of processing cores optimized for matrix math operations, providing tens to hundreds of TFLOPS (Tera FLoating Point OPerations per Second) of performance, which makes GPUs the obvious computing platform for deep NN-based AI and ML applications. GPUs and TPUs are particularly beneficial with respect to CPUs, but when used to implement deep NNs performing inference on large 2-dimensional datasets such as images, they are rather power-hungry and require longer computation runtime (>tens of ms). Moreover, smaller matrix multiplications for less complex inference tasks [e.g., classification of handwritten digits of the Modified National Institute of Standards and Technology database (MNIST)]^{8} are still challenged by a non-negligible latency, predominantly due to the access overhead of the various memory hierarchies and the latency in executing each instruction in the GPU.^{9}

Given this context of computational hardware for obtaining architectures that mimic efficiently the biological circuitry of the brain, it is necessary to explore and reinvent the operational paradigms of current logic computing platforms when performing matrix algebra by replacing sequential and temporized operations, and their associated continuous access to memory, with massively parallelized distributed analog dynamical units toward delivering efficient post-CMOS devices and systems summarized as non-von Neumann architectures. In this paradigm shift, the wave nature of light and related inherent operations, such as interference and diffraction, can play a major role in enhancing computational throughput and concurrently reducing the power consumption of neuromorphic platforms. In recent years, the revolutionizing impact of NNs contributed to the development of a plethora of emerging technologies, ranging from free space diffractive optics^{10} to nanophotonic processors^{11–15} aiming to improve the computational efficiency of specific tasks performed by NN. Integrated photonic platforms can indeed provide parallel, power-efficient, and low-latency computing, which is possible because analog wave chips can (a) perform the dot product inherently using light matter interactions such as via a phase shifter or modulator, (b) enable signal accumulation (summation) by either electromagnetic coherent interference or incoherent accumulation through detectors, and (c) enable parallelism strategies and higher throughput using multiplexing schemes.

Additionally, we firmly believe, assisted by a state-of-the-art theoretical framework,^{16} that future technologies should perform computing tasks in the domain in which their time-varying input signals lie, exploiting their intrinsic physical operations. In this view, photons are an ideal match for computing node-distributed networks and engines performing intelligent tasks over large data at the edge of a network (e.g., 5G), where the data signals may exist already in the form of photons (e.g., surveillance camera, optical sensor, etc.), thus pre-filtering and intelligently regulating the amount of data traffic that is allowed to proceed downstream toward data centers and cloud systems.^{17}

Here, we explore a photonic tensor core (PTC) able to perform 4 × 4 matrix multiplication and accumulation with a trained kernel in one shot (i.e., non-iteratively) and entirely passively; that is, once a NN is trained, the weights are stored in a 4-bit multilevel photonic memory directly implemented on-chip, without the need for either additional electro-optic circuitry or off-chip dynamic random-access memory (DRAM). The photonic memories feature low-loss, phase-change, nanophotonic circuits based on wires of Ge_{2}Sb_{2}Se_{5} deposited on a planarized waveguide, which can be updated using electrothermal switching and can be read completely optically. Electrothermal switching is enabled by tungsten heating electrodes, which clamp the Phase Change Memory (PCM) wire.

This work represents the first approach toward the realization of a photonic tensor processor storing data and processing in parallel, which could scale the number of multiply-accumulate (MAC) operations by several orders of magnitude while significantly suppressing power consumption and latency compared to the state-of-the-art hardware accelerators delivering real-time analytics.

## II. RESULTS AND DISCUSSION

### A. Matrix multiplication algorithms

Considering a naïve (“schoolbook”) algorithm, a multiplication between two square matrices, each with *n* × *n* entries, is characterized by a computational complexity of *O*(*n*^{3}) (Fig. 1), which means that the total operations required for performing this operation scales cubically with the size (*n*) of the square matrix. Even for optimized algorithms, such as Strassen^{18} or Winograd,^{19} the complexity of the algorithm still requires a *O*(*n*^{2.373}) [Fig. 1(a)]. Such “flat” computational complexity scaling requires long latency when computed in regular CPUs since operations are executed sequentially at each clock cycle. A tensor core unit algorithm is an interesting alternative, however, not for a reduced computational (operation) complexity, which is indeed still *O*(*n*^{3}), but because of its capability of exploiting parallel architectures and a systolic algorithm;^{20} for instance, it can implement a multiplication between two *n* × *n* matrices with an operational time complexity of *O*(*n*^{2}), primarily dominated by reading/writing the input and output matrices, even if the number of total operations is still *O*(*n*^{3}).^{21} In other words, one must distinguish between the complexities of the computational algorithm vs that of the system's execution time. In this regard, the computational complexity scaling results indeed suggest that the main focus should be placed on the fundamental improvements in the time complexity, thus diverting the focus onto hardware such as architectural, circuit, and component-level novelties, including new device physics to speed-up the matrix processor's signal “rate,” or in exploiting parallelization strategies (Fig. 1). Interestingly, both suggest to consider optics or integrated photonics given (a) the various physical multiplexing options (e.g., spectral, mode, polarization, etc.) and (b) the short delay in the range of 1–10s of picoseconds from modern optoelectronics devices with 3-dB bandwidths of tens of GHz in photonic foundries and even approaching hundreds of GHz in laboratory settings.^{22}
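The distinction between operation count and data movement can be illustrated with a short sketch. It counts the multiply-accumulate (MAC) operations of the schoolbook algorithm, which grow as *n*^{3}, and compares them with the number of matrix entries read and written, which grows only as *n*^{2}. The function names `schoolbook_matmul` and `io_elements` are hypothetical and chosen for illustration; the sketch does not model any specific tensor core hardware.

```python
def schoolbook_matmul(a, b):
    """Multiply two n x n matrices (lists of lists) and count the MACs used."""
    n = len(a)
    result = [[0] * n for _ in range(n)]
    macs = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                result[i][j] += a[i][k] * b[k][j]  # one multiply-accumulate
                macs += 1
    return result, macs


def io_elements(n):
    """Entries moved when reading both inputs and writing the output: 3 * n^2."""
    return 3 * n * n


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        identity = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
        _, macs = schoolbook_matmul(identity, identity)
        print(f"n={n}: MACs={macs}, I/O elements={io_elements(n)}")
```

Doubling *n* multiplies the MAC count by eight but the I/O count only by four, which is why a sufficiently parallel processor ends up dominated by reading and writing the matrices.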

For this reason, tensor cores are used to perform large-scale, 2-dimensional, or higher dimensional, matrix operations built up from smaller elements, namely TCUs. Each TCU operates on a 4 × 4 matrix and performs the following operation: ***D*** = ***A*** × ***B*** + ***C***, where ***A***, ***B***, ***C***, and ***D*** are 4 × 4 matrices. The matrix multiplication inputs ***A*** and ***B*** are FP16 matrices, while the accumulation matrices ***C*** and ***D*** may be FP16 or FP32 matrices [Fig. 1(b)].^{23}
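As a minimal reference for the TCU operation, the sketch below evaluates ***D*** = ***A*** × ***B*** + ***C*** for 4 × 4 matrices in plain Python. The function name `tcu_mma` is an assumption made for illustration, and Python floats are double precision, so the FP16/FP32 behavior of an actual tensor core is not reproduced here.

```python
SIZE = 4  # each TCU operates on 4 x 4 matrices


def tcu_mma(a, b, c):
    """Return D = A x B + C for 4 x 4 matrices given as lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(SIZE)) + c[i][j] for j in range(SIZE)]
        for i in range(SIZE)
    ]


if __name__ == "__main__":
    identity = [[1 if i == j else 0 for j in range(SIZE)] for i in range(SIZE)]
    b = [[4 * i + j for j in range(SIZE)] for i in range(SIZE)]
    ones = [[1] * SIZE for _ in range(SIZE)]
    for row in tcu_mma(identity, b, ones):
        print(row)
```

Larger matrix products are then assembled from many such 4 × 4 blocks, with ***C*** carrying the partial sums from one block to the next.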

In order to significantly reduce the execution time of large matrix multiplication or to avoid the non-negligible latency given by the time needed to collect data from a specific TCU, here we present a design based on silicon photonic TCUs performing matrix multiplication inherently, with latency limited only by the photon time-of-flight and the delay of the detection mechanism, which is well below tens of ps in modern high-speed photoreceivers.^{24–26}

Unlike digital electronics, which rely on logic gates, in integrated photonics, multiplication, accumulation, and more in general linear algebraic operations can be performed inherently and non-iteratively, benefiting from the intrinsic parallelism provided by the electromagnetic nature of the signals and efficient light matter interaction. In this regard, integrated photonics is an ideal platform for mapping specific complex operations one-to-one into hardware, and in some cases algorithms, achieving time complexity *O*(1). This allows processing information with a runtime that scales with neither the number of elementary operations nor the input size, with latency given only by the time-of-flight of the photon in the photonic chip and the detection mechanism, assuming sufficiently available parallel hardware, such as wavelengths. Note that this is valid when a rapidly varying input is multiplied by a fixed or seldom changing matrix (kernel), which is the case in ML when performing inference tasks. In Sec. II B, we describe an architecture, namely a PTC unit that is homomorphically mapping an (exemplary) 4 × 4 matrix multiplication with a fixed (stored) kernel into a photonic platform, with a time complexity of *O*(1) and a runtime of <20 ps (considering photonic foundry available 50 GHz photodetector).^{25}

### B. Photonic tensor core architecture

The main advantage of performing a matrix multiplication and accumulation operation in the photonic domain is that it can be performed with near-zero static power consumption (i.e., static or rarely changing kernel), while allowing for low latency, given only by the time-of-flight of the photon.^{27}

To build a PTC, we use 16 fundamental units, namely dot-product engines, which perform an element-wise multiplication while featuring a Wavelength Division Multiplexing (WDM) scheme, similar to Refs. 28–30, for parallelizing the operation [Fig. 2(a)].

The dot product engine [Fig. 2(b)] performs the multiplication between two vectors, namely, between the *i*th row of the input matrix ***A*** and the *j*th column of the kernel ***B***. In this scheme, the *i*th row of the input matrix is given by WDM signals, which, if not already in the optical domain, are modulated by high-speed (e.g., Mach–Zehnder) modulators. The *j*th column of the kernel matrix is loaded in the photonic memory by properly setting its weight states. Exploiting the light-matter interaction with the phase-change, material-based, photonic memory, the inputs, opportunely spectrally filtered by micro-ring resonators (MRR), are weighted in a seemingly quantized electro-absorption scheme (i.e., amplitude modulation), thus performing element-wise multiplication. The weighted inputs are thence incoherently summed up using a photodetector, which amounts to a MAC operation, obtaining the element ***D***_{ij} of the result matrix.

It is worth noticing that contrary to other photonic NN implementations^{13,15,29,31} based on micro-ring modulators, the transmission of the micro-rings is not actively tuned here for performing filtering, but just utilized for passively selecting a frequency to be modulated by the photonic memories. This allows us to have more control on the inter-channel crosstalk and potentially extend the number of wavelengths in a Dense WDM (DWDM) scheme without being affected by the induced quality factor variation caused by the variation of absorption coefficient. Additionally, for each pair of MRRs, our architecture comprises low loss, programmable, multi-state, photonic memories (discussed in Sec. II C), which unlike electro-optic modulators can retain information without any static power consumption and do not add considerable losses.
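The data flow of the architecture described above can be summarized in a behavioral sketch: each dot-product engine attenuates every wavelength channel by the transmission of its photonic memory and a photodetector sums the resulting powers. The sketch rests on idealized assumptions that go beyond the text (normalized, non-negative optical powers, ideal ring filters without crosstalk, a linear noiseless detector), and the function names are hypothetical.

```python
SIZE = 4  # 4 x 4 kernel, hence 16 dot-product engines


def dot_product_engine(input_powers, weight_transmissions):
    """Weight each WDM channel by its memory transmission, then sum incoherently."""
    return sum(p * t for p, t in zip(input_powers, weight_transmissions))


def photonic_tensor_core(a, b_transmission):
    """Engine (i, j) pairs row i of the input A with column j of the stored kernel B."""
    return [
        [
            dot_product_engine(a[i], [b_transmission[k][j] for k in range(SIZE)])
            for j in range(SIZE)
        ]
        for i in range(SIZE)
    ]


if __name__ == "__main__":
    kernel = [[0.5, 0.25, 1.0, 0.0] for _ in range(SIZE)]
    inputs = [[1.0, 0.0, 0.5, 0.25] for _ in range(SIZE)]
    for row in photonic_tensor_core(inputs, kernel):
        print(row)
```

In this picture the kernel enters only through fixed transmission values, which reflects why the weighting step is passive once the memories have been written.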

### C. Photonic memories

The benefits given by the intrinsic electromagnetic nature of the signals can potentially be hindered by the optoelectrical and electro-optical transductions, as well as by the repeated access to a digital and nonvolatile memory, which impacts the overall operation speed while producing considerable additional energy loss. For this reason, having a heterogeneously integrated optimized photonic memory, which retains information in a non-volatile fashion, offers a great advantage, especially when implementing NNs performing inference, where the trained weights are only rarely updated (i.e., depending on the application daily, monthly, yearly, if ever). To provide this functionality, a multi-state photonic memory device, which comprises multiple Ge_{2}Sb_{2}Se_{5} photonic memory wires, is placed in between two resonant rings [Fig. 2(b)] to select the opportune wavelength, front and back, respectively [Fig. 3(a)]. That is, once the PCM memory states are set (WRITE operation) in this photonic kernel (i.e., matrix ***B***), this architecture allows the performance of the weighting functionality to be entirely passive. Selective “writing” is achieved by changing the phase of the corresponding number of PCM wires that we deposited on the waveguides, by local electrothermal heating, which promotes crystallization or amorphization, and consequently modifies the waveguide modal refractive index in a reversible process.

We decided to implement the photonic memory kernels based on Ge_{2}Sb_{2}Se_{5}, since this material presents a broadband transparent region for telecommunication wavelengths in its amorphous state and can be used to implement high-performance nonvolatile multistate photonic memories.^{32} Ge_{2}Sb_{2}Se_{5} exhibits 3 orders of magnitude lower absorption coefficient with respect to regularly employed GST at 1550 nm, and features still a high optical (real part) index contrast Δ*n* of 0.5 across the near- to mid-IR bands and around 0.2 $\Delta \kappa $ in the C-band [Fig. 3(b)].^{33} Remarkably, the optical absorption in the amorphous state is vanishingly small and non-measurable when heterogeneously integrated in silicon photonics of ∼100-*μ*m-long lengths. Moreover, the relatively low variation of the absorption coefficient, indeed, makes it a promising material for multistate devices, avoiding the utilization of high laser power and extremely low noise equivalent power detectors. Assuming a continuous film, for the fundamental TM mode of the waveguide, the phase transition produces a variation of the effective absorption coefficient $\Delta $κ ∼ 0.01 to which corresponds 0.21 dB/*μ*m [Fig. 3(c)].

When the network is trained, the extracted weights are set by electrothermal switching of individual states of photonic memories, instead of the previously used optical pulses.^{34} Each state of the memory can reversibly be written by selective transitioning between amorphous (*a*) and crystalline (*c*) phases, using electrothermal switching induced by Joule heating, as previously demonstrated^{35} and reported in Fig. 4.

In our scheme, heat is applied to the material externally via Joule heating of a tungsten (W) metal layer in contact with the wire, as shown in Fig. 4(a), whose mode profile and thermal profile are simulated in Figs. 4(a) and 4(b). Different pulse train profiles according to the type of transition (*a-c* or *c-a*) are applied to the wire via the connections in series to the device [Figs. 4(c) and 4(d)].^{35}

The material choice for the electrodes and their placement with respect to the waveguide and the propagating mode are opportunely engineered to minimize the insertion losses while providing efficient thermal energy. This was possible because the metal of choice, W, has superior thermal properties without being affected by high optical losses, such as those of plasmonic noble metals. Moreover, the periodicity and intensity of the electric pulses applied to the tungsten electrodes used for writing the memory have to be adjusted to provide sufficient thermal energy to the Ge_{2}Sb_{2}Se_{5} wire according to which phase has to be switched. The voltage, number of pulses, and periodicity can be regulated to (i) heat up the PCM wire up to 250 °C and anneal for a few tens of microseconds to crystallize it and (ii) melt it, increasing the temperature to over 600 °C, for the amorphization.^{34,35} A resistive heater that is optimized for efficient switching and at the same time does not generate insertion losses can also be made in doped silicon or in silicide, currently used in the p-n modulator,^{36} positioned next to the waveguide, or with indium tin oxide (ITO) or graphene electrodes.^{37}

Light signals that couple with this phase-change memory probe the variation of the absorption coefficient over phase transition (READ operation).

Our photonic memories comprise 30-nm-thin and 250-nm-wide programmable PCM-wires arranged in a grating fashion (duty cycle 50%). Each wire corresponds to a quantized state. Therefore, considering as the highest state the condition in which all wires are in the amorphous conditions, 15 reprogrammable wires are sufficient for implementing a 4-bit memory for each element of the kernel (***B***_{ij}), with an overall length of just ∼8 *μ*m, excluding electrical circuitry [Fig. 5(a)]. The insertion loss, defined as $10\log_{10}(P_{0}/P_{\mathrm{input}})$, where $P_{0}$ is the optical power transmitted when all the wires are in the amorphous state, is only ∼1 dB for a 4-bit multilevel memory [Fig. 5(a-i)]. The optical power transmitted decreases when the GSSe wires are written (switching to crystalline), leading to discrete power levels for each quantized state [Figs. 5(a-ii) and 5(a-iii)]. The insertion losses for multistate memory devices with different quantization resolution (1 bit through 4 bit) are shown in Fig. 5(b) and are derived from numerical simulations reported in the supplementary material (Fig. S2), which highlights the electric field distribution. The photonic memory implemented in this configuration provides a uniform quantization. For a 4-bit photonic memory, the quantization step is 0.2 dB/state with a maximum extinction ratio of about 3.5 dB [Fig. 5(c)]. Here, the extinction ratio is computed as the ratio of the optical power transmitted in the 2 extreme configurations: all the wires in the crystalline state and all the wires in the amorphous state, $10\log_{10}(P_{1111}/P_{0000})$.
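The quantized transmission levels of such a memory can be tabulated with a short sketch. It assumes an idealized model in which each crystallized wire adds exactly the 0.2 dB step quoted above on top of the ∼1 dB insertion loss; the constant and function names are hypothetical. Under this assumption the span between the extreme states is 15 × 0.2 dB = 3.0 dB, somewhat below the roughly 3.5 dB quoted in the text, so the quoted figures should be read as rounded values. The sketch reports the extinction ratio as a positive magnitude.

```python
import math

INSERTION_LOSS_DB = 1.0  # all wires amorphous (value quoted in the text)
STEP_DB = 0.2            # assumed uniform loss added per crystallized wire
NUM_WIRES = 15           # 15 wires give 16 levels, i.e., 4 bits


def transmission(crystalline_wires):
    """Fraction of input power transmitted for a given number of crystalline wires."""
    if not 0 <= crystalline_wires <= NUM_WIRES:
        raise ValueError("number of crystalline wires must be between 0 and 15")
    loss_db = INSERTION_LOSS_DB + STEP_DB * crystalline_wires
    return 10 ** (-loss_db / 10)


def extinction_ratio_db():
    """Magnitude of the power ratio between all-amorphous and all-crystalline states."""
    return 10 * math.log10(transmission(0) / transmission(NUM_WIRES))


if __name__ == "__main__":
    for k in range(NUM_WIRES + 1):
        print(f"{k:2d} crystalline wires: transmission = {transmission(k):.4f}")
    print(f"extinction ratio = {extinction_ratio_db():.2f} dB")
```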

As an alternative to the use of individual wires corresponding to each state, it is possible to implement a multilevel photonic memory that uses a smaller number of films of different lengths. The losses induced by selectively writing these films in all the possible combinations would generate the remaining states. In detail, exploiting the linear relation between losses and length, a 4-bit memory, for instance, can be realized using only four films of variable length, whose losses in the crystalline state correspond to states “1000,” “0100,” “0010,” and “0001.” This binary-weighted approach would preserve the overall footprint while reducing the number of tungsten heaters and corresponding contact pads. However, even if this solution could be simpler for a practical implementation, it would require different writing/erasing times and voltages for each weighted state, requiring a further optimization step.
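The binary-weighted alternative can be checked with a small enumeration. The sketch assumes that the crystalline-state loss of each film is proportional to its length, so that four films with losses in the ratio 8:4:2:1 reproduce all 16 uniformly spaced levels; the 0.2 dB least-significant step is borrowed from the wire-based design as an assumption, and the names are hypothetical.

```python
from itertools import product

STEP_DB = 0.2  # assumed loss of the shortest film when crystalline (one LSB)

# Crystalline-state loss of the four films: states "1000", "0100", "0010", "0001".
FILM_LOSS_DB = [8 * STEP_DB, 4 * STEP_DB, 2 * STEP_DB, 1 * STEP_DB]


def state_loss_db(bits):
    """Added loss for a 4-bit state; a bit set to 1 means that film is crystalline."""
    return sum(loss for bit, loss in zip(bits, FILM_LOSS_DB) if bit)


def all_state_losses():
    """Map every 4-bit state string to its added loss in dB."""
    return {
        "".join(str(bit) for bit in bits): state_loss_db(bits)
        for bits in product((0, 1), repeat=4)
    }


if __name__ == "__main__":
    for state, loss in sorted(all_state_losses().items()):
        print(f"{state}: {loss:.1f} dB")
```

Four heaters therefore address the same 16 levels as 15 individual wires, at the cost of film-specific write conditions as noted above.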

The functionality of the dot product engine, which is the building block of the proposed PTC unit, along with its programmability, is assessed using circuit level interconnect simulations [Fig. 6(a)]. For evaluating the engine's performance, we simulate a dot product between two 4-element vectors: input unitary row-vector *a*_{1,i} and the column-vector *b*_{i,1}, with the latter being stored in the photonic memory. For illustrative purposes, we considered the last element of *b*_{i,1} (*i* = 4) to be updated using a pseudo-random switching (1) varying the content of the photonic memory from the lowest state to the highest [Fig. 6(b)] by parallelly switching all the PCM wires and (2) varying just the Least Significant Bit (LSB) [Fig. 6(c)]. Bearing in mind the relatively slow and asymmetric switching speed, the moderately low noise equivalent power (NEP) of the photodetector and overall low spectral noise, it is possible to discriminate (eye-diagram completely open) results of the dot product between numbers, which differ by the smallest possible increment, namely 1 bit. It is worth mentioning that the discrimination between states can be aided by varying the dynamic range of the photonic memory and trading off insertion losses, e.g., using wider PCM wires. (Further details on the interconnect simulations are discussed in the supplementary material, Sec. S3.)

### D. Performances

The PTC implemented according to the proposed scheme can perform matrix multiplication with 4-bit precision completely passively once the weights are stored in the photonic network, which is a one-time operation. The PTC does not rely on any logic architecture nor does it require transduction from off-chip memory when performing inference; therefore, it could be considered a full-fledged analog processor, as others recently developed.^{10,38} In fact, during inference tasks, our architecture performs tensor operations with a time complexity of *O*(1) and the static power consumption is approaching zero, since the system behaves as a passive filter and simply relies on light matter interactions with pre-stored states in the photonic memory (kernels have been already saved in the photonic memory in a former instance, and the inputs are readily accessible from the optical domain, assuming being situated at the edge of the network) instead of logic operations, which require optical switching.^{39,40}

An initial performance analysis is as follows: considering photonic foundry Ge-photodetectors, a micro-ring resonator (radius = 10 *μ*m), and American Institute for Manufacturing Integrated Photonics (AIM)-photonics disc-modulators, the latency of an individual PTC sub-unit (e.g., unit ***D***_{2,1}) requires $\Sigma ${Electro-Optic (E2O) + Time-of-Flight (ToF) + Detection (Rx)} = ∼65 ps for processing a 4 × 4 matrix multiplication resulting in computing 64 MACs at 4-bit precision. This delivers a total 0.5–2 peta operations per second (POPS/s) throughput (assuming layout of Fig. 2, and 1 MAC = 2 operations − OPS) for ∼250 4 × 4 PTC units, when limiting the maximum die-area to 800 mm^{2} [4-bit Digital to Analog Converters (DAC) − area = 0.05 mm^{2}], limited mainly by the electro-to-optical conversion (i.e., DACs). For an optical data input (e.g., camera), the peak throughput increases to 16 POPS/s for only a few watts of power. If pipelining could be used, the 65 ps drops to ∼20 ps latency, thus improving throughputs by another 3×, and hence one could consider sharing DAC usage among cores. Another key aspect to consider is the overall power consumption, which considering the completely passive operation of the dot product and accumulation (and zero bias in detection with high responsivity photoreceivers), accounts only for the laser power, which is below 5 mW and the bias of the photodetector in the case of the optical data input (middle column, Table I), but also the E2O DACs in the electronic-data input case (left column, Table I).
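The throughput figures for electronic data input can be retraced with simple arithmetic. The sketch below assumes that each PTC unit completes one 4 × 4 matrix product (64 MACs, i.e., 128 operations) per latency interval and that all ∼250 units run concurrently; the function names are hypothetical. With the 65 ps latency this gives about 0.49 POPS, and with the pipelined 20 ps figure about 1.6 POPS, in line with the 0.5–2 POPS range stated above. Dividing the ∼2 POPS entry of Table I by its 81 W power entry gives roughly 25 TOPS/J.

```python
MACS_PER_MATMUL = 4 * 4 * 4  # 64 MACs for one 4 x 4 by 4 x 4 product
OPS_PER_MAC = 2              # 1 MAC = 2 operations, as stated in the text
NUM_UNITS = 250              # approximate number of PTC units on the die


def throughput_pops(latency_s, units=NUM_UNITS):
    """Peak throughput in peta-operations per second for a given per-product latency."""
    ops_per_second = units * MACS_PER_MATMUL * OPS_PER_MAC / latency_s
    return ops_per_second / 1e15


def efficiency_tops_per_joule(pops, power_w):
    """Operations per joule, in units of 10^12, from throughput (POPS) and power (W)."""
    return pops * 1e3 / power_w


if __name__ == "__main__":
    print(f"65 ps latency: {throughput_pops(65e-12):.2f} POPS")
    print(f"20 ps latency: {throughput_pops(20e-12):.2f} POPS")
    print(f"2 POPS at 81 W: {efficiency_tops_per_joule(2.0, 81.0):.1f} TOPS/J")
```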

| | Electronic data PTC | Optical data PTC^{b} | NVIDIA T4^{c} | A100 |
|---|---|---|---|---|
| No. of tensor cores | 250 | 250 | 320 | 512 |
| Clock speed | 50 GHz | N.A. | <1.5 GHz | <1.5 GHz |
| Bit resolution | 4-bit | 4-bit | 4-bit | 4-bit |
| Throughput (POPS/s) | 0.5 (∼2)^{a} | ∼16 | 0.26 | 1.26 |
| Power | 81 W | <2 W | 70 W | 400 W (max) |
| Operation efficiency (TOPS/J) | 25 | ∼10^{3} | 3 | 4 |

^{a} 10:1 DAC (Digital to Analog converter) reuse.

^{b} Optical data input (no DACs).

^{c} Inference only.

It is important to mention, however, that when performing inference, thanks to the robustness of the NN achievable through opportune training, low-bit quantization of the weights is also possible, yielding efficient and accurate inference with low-resolution quantized weights.^{41,42} If the system were used to perform relatively simple inference tasks at the edge of a network, it may not require a high-bit resolution. It is worth mentioning that for GPUs and digital architectures, a great portion of the power consumption is related to other tasks performed and to off-chip memory. For this reason, we believe that PTC, and in general neuromorphic photonics, is advantageous with respect to digital electronics only for specific applications (1) that can exploit inherent operations utilizing efficient light matter interactions, such as the one proposed in this research, avoiding cumbersome domain conversion (lowering overall power consumption) and (2) that would require ns-fast inference. In this view, the proposed photonic engine can be used for accelerating inference tasks for specific applications in which obtaining a prediction (even at reduced accuracy) in near real-time (≪ns delay) is essential and has priority over obtaining a prediction with high accuracy with longer latency. We particularly envision its use for tasks in which the input data are already RF or optical signals, avoiding inefficient analog-to-digital conversion.

Additionally, we compared the potential performances of the PTC with the performance achieved by the state-of-the-art GPU (NVIDIA A100, May 2020) when performing tensor operations at 4-bit. The A100, which features a higher number of tensor cores than the T4, requires a total max power of 400 W, roughly 6× the 70 W of the T4 (Table I). Remarkably, its total throughput is still about 1 order of magnitude lower than the potential throughput provided by the photonic tensor core for optical data when performing inference (when no writing operation is performed).

## III. CONCLUSION

In summary, we propose a tensor core unit implemented in photonics that relies on photonic multiplexed (WDM) signals, weighted, after filtering, using engineered multi-state photonic memories based on Ge_{2}Sb_{2}Se_{5} wires patterned on the waveguide. The photonic memories are reprogrammed by selectively changing the phase (amorphous/crystalline) of the wires, using electrothermal switching through Joule heating induced by tungsten electrodes. The photonic memory programming can be realized in parallel (few microseconds), if needed, or alternatively, this photonic tensor core can operate as a passive system with a pre-SET kernel matrix; that is, there will be neither dynamic nor static power dissipation. The runtime complexity is thence *O*(1). An additional key technology feature of this design is that no additional losses are introduced by the photonic memories, avoiding repeaters, optical amplifiers, and cumbersome electrical-optical domain crossings and conversions. The architecture shows execution time limited only by the time of flight of the photon in the chip, which is a function of the ring size/selectivity (number of wavelengths) and the latency of the photodetector *O*(<10^{−1}ns) once the kernel matrix is programmed and optical input data are being processed. The concurrent development of new PCM materials and the advancement of the integration of photonic memories can enable the realization of engines based on the proposed scheme able to inherently perform full precision floating point matrix multiplication and accumulation, consequently opening a pathway toward the realization of all-optical photonic tensor units, which can significantly speed up intelligent tasks at the edge of the network without requiring electro-optic conversions and access to external memories.

## SUPPLEMENTARY MATERIAL

See the supplementary material for detailed numerical (Figs. S1-S2) and experimental (Fig. S3) characterization of the photonic memories, interconnect analysis of the photonics dot product unit (Fig. S4) and PTC utilization (Tab. S1).

## ACKNOWLEDGMENTS

V.S. is supported by the Presidential Early Career Award for Scientists and Engineers (PECASE) nominated by the Department of Defense through the Air Force Office of Scientific Research under Award No. FA9550-20-1-0193. The authors appreciate insightful discussions with Alexander Kildishev and Juejun Hu.

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.