The recent explosive compute growth, mainly fueled by the boost of artificial intelligence (AI) and deep neural networks (DNNs), is currently instigating the demand for a novel computing paradigm that can overcome the insurmountable barriers imposed by conventional electronic computing architectures. Photonic neural networks (PNNs) implemented on silicon integration platforms stand out as a promising candidate for neural network (NN) hardware, offering the potential for energy-efficient and ultra-fast computations through the utilization of the unique primitives of photonics, i.e., energy efficiency, THz bandwidth, and low latency. Thus far, several demonstrations have revealed the huge potential of PNNs in performing both linear and non-linear NN operations with unparalleled speed and energy consumption metrics. Transforming this potential into a tangible reality for deep learning (DL) applications requires, however, a deep understanding of the basic PNN principles, requirements, and challenges across all constituent architectural, technological, and training aspects. In this Tutorial, we initially review the principles of DNNs along with their fundamental building blocks, also analyzing the key mathematical operations needed for their computation in photonic hardware. Then, we investigate, through an intuitive mathematical analysis, the interdependence of bit precision and energy efficiency in analog photonic circuitry, discussing the opportunities and challenges of PNNs. Subsequently, a performance overview of PNN architectures, weight technologies, and activation functions is presented, summarizing their impact on speed, scalability, and power consumption.
Finally, we provide a holistic overview of the optics-informed NN training framework that incorporates the physical properties of photonic building blocks into the training process in order to improve the NN classification accuracy and effectively elevate neuromorphic photonic hardware into high-performance DL computational settings.

## I. INTRODUCTION

During the past decade, the relentless expansion of artificial intelligence (AI) through deep neural networks (DNNs) has been driving the need for high-performance computing and time-of-flight data processing. Conventional digital computing units, which are based on the well-known Von-Neumann architecture^{1} and inherently rely on serialized data processing, have faced daunting challenges in executing emerging DNN workloads. Von-Neumann architectures comprise a centralized processing unit (CPU), which is responsible for executing all operations (arithmetic, logic, and controlling) dictated by the program’s instructions, and a separate random-access memory (RAM) unit that stores all necessary data and instructions. The communication between CPU and memory is realized via a shared bus that is used to transfer all data between them, implying that they cannot be accessed simultaneously. This leads to the well-known Von-Neumann bottleneck,^{2} where the processor remains idle for a certain amount of time during memory data access. On top of that, the need for moving data between CPU and memory (via the bus) requires charging/discharging of metal wires, limiting in this way both the energy efficiency and the bandwidth due to Joule heating and capacitance,^{3} respectively.

There have been numerous demonstrations toward overcoming these effects, including, among others, caching, multi-threading, and new RAM architectures and technologies (e.g., ferroelectric RAMs^{4} and optical RAMs^{5–9}), with the ultimate target being energy-efficient and high-speed CPU-memory data movement. None of these solutions seems, however, to be capable of coping with the computational and energy demands of DNNs, revealing a need for shifting toward specialized computing hardware architectures. In this endeavor, highly parallelized accelerators have been developed, including graphics processing units (GPUs), application specific integrated circuits (ASICs), and field programmable gate arrays (FPGAs), with GPUs and ASICs being, until now, the dominant hardware computing engines for DNN implementations. Specifically, GPUs leverage their hundreds of cores toward accelerating the matrix multiplication operations of DNNs, which are the most time- and power-consuming computations.^{10} Moreover, they have dedicated non-uniform memory access architectures (e.g., video RAMs) that are (i) programmable, meaning that the stored data can be selectively accessed or deleted, (ii) faster than CPU counterparts, and (iii) located very close to their cores, reducing in this way the distance between computing and data. Yet, despite GPUs’ unrivaled parallelization ability, which ushers in exceptional computational throughput, the need for data movement still remains and sets a fundamental limit in both speed and energy efficiency.

Toward totally eradicating the constraints of data movement, recent developments in analog computing through memristive crossbar arrays^{11–13} follow an alternative approach, called in-memory computing. This scheme allows for certain DNN computational tasks (e.g., weighting) to be performed within the memory cell itself, seamlessly supporting multiplication operations without requiring any data transfer.^{14} The recent 64-core analog-in-memory compute (AiMC) research prototype of IBM^{15} and the commercial entry of Mythic’s AiMC engine have validated the energy benefits that can originate from in-memory computing compared to Von-Neumann architectures. These implementations employ computational memory devices, including resistive RAMs (RRAMs) and phase change materials (PCMs), where the application of a voltage results in a change of the material’s property, achieving in this way both data storing and computing. However, issues related to memory instability and finite resistance of the crossbar wires may lead to computational errors and crossbar size limitations, respectively, making it hard to reach the computational throughput and parallelization level of GPUs.^{12,14} Similar to in-memory computing, neuromorphic computing comprises an alternative non-Von-Neumann architecture that is inspired by the structure and function of the human brain, meaning that both memory and computing are governed by artificial neurons and synapses. Neuromorphic chips mostly employ spiking neural networks (SNNs) to emulate the behavior of biological neurons, which communicate through discrete electrical pulses called spikes. SNNs can process spatiotemporal information more efficiently and accurately than conventional neural networks^{16} as they respond to changes in the input data in real time. 
Additionally, they rely on asynchronous communication and event-driven computations, where, typically, only a small portion of the entire system is active at any given time, while the rest is idle, resulting in low-power operation.^{17} However, neuromorphic computing is not currently being used in real-world applications, and there are still a wide variety of challenges in both algorithmic and application development^{18} that need to be addressed toward outperforming conventional deep learning (DL) approaches. At the same time, the underlying electronic hardware in analog compute engines continues to rely heavily on complementary metal–oxide–semiconductor (CMOS) electronic transistors and interconnects, whose speed and energy efficiency are dictated by their size. Taking into account that transistor scaling has slowed down during the last decade, since we are approaching its fundamental physical size limits,^{19} there is no significant performance margin left to be gained. In parallel, the requirement for multiple connected neurons yields increased interconnect lengths in analog in-memory computing schemes that finally result in low line-rate operation in order to avoid an increased energy consumption. All this indicates that a radical departure from traditional electronic computing systems toward a novel computational hardware technology has to be realized in order to fully reap the benefits of the architectural shift toward non-Von-Neumann layouts.

Along this direction, integrated photonics emerged as a promising candidate for the hardware implementation of DNNs; the analog nature of light is inherently compatible with analog compute principles, while low-energy and high-bandwidth connectivity is the natural advantage of optical wires. On top of that, photonics can offer multiple degrees of freedom, such as wavelength, phase, mode, and polarization, being suitable for parallelizing data processing through multiplexing techniques^{20,21} that have been traditionally employed in optical communication systems for transferring information at enormous data rates (>Tb/s). The constantly growing deployment of optical interconnects and their rapid penetration to smaller network segments has been also the driving force for the impressive advances witnessed in photonic integration and, particularly, in silicon photonics; silicon photonic integrated circuits (PICs) with thousands of photonic components can be fabricated in a single die nowadays,^{22} forming a highly promising technology landscape for optical information processing tasks at chip-scale. Nevertheless, compared with electronic systems that host billions of transistors, thousands of photonic components may not be sufficient to build a vast universal hardware engine for generic applications. Yet, the constant progress in the field of integrated optics coupled with the rapid advances in fabrication and packaging can eventually shape new horizons in this field. 
This has raised expectations for an integrated photonic neural network (PNN) platform that can cope with the massively growing computational needs of Deep Learning (DL) engines, where computational capacity requirements double every 4–6 months.^{23} In this realm, several PNN demonstrations have been proposed,^{24–41} employing light both for data transfer and computational functions and shaping a new roadmap for orders of magnitude higher computational and energy efficiencies than conventional electronic counterparts. At the same time, they have highlighted a number of remaining challenges that have to be addressed at the technology, architecture, and training levels, designating a bidirectional interactive relationship between hardware and software: the photonic hardware substrate has to comply with existing DL models and architectures, but at the same time, the DL training algorithms have to adapt to the idiosyncrasy of the photonic hardware. Integrated neuromorphic photonic hardware extends along a pool of architectural and technology options, with the main target being the deployment of highly scalable and energy efficient setups that are compatible with conventional DL training models and suitable to safeguard high accuracy performance. In parallel, the use of light in all its basic computational blocks inevitably brings a number of new physical and mathematical quantities into NN layouts,^{41,42} such as noise and multiplication between “noisy” matrices, as well as mathematical expressions for non-typical activation responses, which are not encountered in conventional DL training models employed in the digital world. This calls for an optics-informed DL training model library; the term “optics-informed” has been recently coined by Roumpos *et al.*^{42} in order to describe the hardware-aware characteristics of DL training models and declare their alignment with the nature of optical hardware, since it takes into account the idiosyncrasy of light and photonic technology.
However, despite the advances pursued in both the hardware and software segments, the complexity of photonic processing still lags far behind electronics with respect to both algorithmic and hardware capabilities. Hence, the field of PNNs does not currently pursue the mission of replacing conventional electronic-based AI engines but aims rather to target applications where photonics can offer certain benefits over their electronic counterparts. This mainly concerns inference applications, since inference comprises the most critical process in defining the power and computational resource requirements of certain applications, such as modern natural language processing (NLP) models, where inference workloads are estimated to consume 25×–1386× more power than training.^{43} Other deployment scenarios include latency-critical applications related to cyber-security in data centers (DCs),^{37} non-linearity compensation in fiber communication systems,^{39} acceleration of DNN matrix multiplication operations at frequency update rates of tens of GHz,^{25} decentralization of the AI input layer from core AI processing for edge applications,^{44} and, finally, solutions to non-linear optimization problems in, e.g., autonomous driving and robotics.^{40}

In this Tutorial, we aim to provide a comprehensive understanding of the underlying mechanisms, technologies, and training models of PNNs, highlighting their distinctive advantages and addressing the remaining challenges when compared to conventional electronic approaches. This Tutorial forms the first attempt to address the field of PNNs for DL applications within a hardware/software co-design and co-development framework: with the emphasis being on integrated PNN deployments, we define and describe the PNN fundamentals, taking into account both the underlying chip-scale neuromorphic photonic hardware and the necessary optics-informed DL training models. This paper is structured as follows: In Sec. II, we introduce the basic definitions and requirements for NN hardware, analyzing the basic NN building blocks (artificial neuron, NN models) as well as the main mathematical operations required for the hardware implementation of NNs, i.e., multiply and accumulate (MAC) and matrix-vector multiplication (MVM) operations. The same section also provides an intuitive analysis of the bit resolution and energy efficiency trade-offs of analog photonic circuits, discussing the advantages and opportunities of PNNs. A review of the basic computational photonic hardware technologies follows, with a summary of photonic MVM architectures and weight technologies presented in Sec. III and activation functions in Sec. IV. Finally, Sec. V is devoted to the challenges and requirements in the photonic DL training sector, providing a solid definition of optics-informed DL models and summarizing the relevant state-of-the-art techniques and demonstrations.

## II. BASIC DEFINITIONS AND REQUIREMENTS FOR NEURAL NETWORK HARDWARE

Merging photonics with neuromorphic computing architectures requires a solid knowledge of the underlying NN architectures, building blocks, and mechanisms. The most basic definitions and requirements are briefly described below.

### A. Artificial neuron

An artificial neuron comprises the main operation unit in a neural network, with the operation of the basic McCulloch–Pitts neuron model^{45} being mathematically described by $y = \varphi\left(\sum_{i} w_i x_i + b\right)$, where *y* is the neuron output, *φ* is an activation (non-linear) function, *x*_{i} is the *ith* element of the input vector *x*, *w*_{i} is the weight factor for the input value *x*_{i}, and *b* is a bias. The linear term $\sum_{i} w_i x_i$ represents the weighted addition and is typically carried out by the so-called linear neuron part, which comprises (i) an array of axons, with every *ith* axon denoting the transmission line that provides a single *x*_{i} × *w*_{i} product, (ii) an array of synaptic weights, with every *ith* weight *w*_{i} located at the *ith* axon, and (iii) a summation stage. The non-linear neuron part comprises the activation function *φ*, with rectified linear unit (ReLU), sigmoid, pooling, etc., being among the most widely employed activation functions in current DL applications.^{46}

For a layer of *M* interconnected neurons, the output of these neurons can be expressed in vector form as $y = W \times x + b$, where *x* is an input vector with *N* elements, *W* is the *M* × *N* weight matrix, *b* is a bias vector with *M* elements, and *y* is a vector made of *M* outputs. Figure 1(a) depicts a schematic layout of a biological neuron that can be mathematically described via the artificial neuron shown in Fig. 1(b), where the dendrites correspond to the weight signals, the nucleus corresponds to the summation and activation function, and the axon terminals are responsible for providing the inputs to the next neuron. Figure 1(c) depicts the resulting layout when utilizing artificial neurons to structure a DNN with a single input layer, a single output layer, and one or more hidden layers.
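To make the neuron and layer equations concrete, here is a minimal NumPy sketch (function names `neuron` and `layer` are ours, chosen for illustration; ReLU is used as the example activation):

```python
import numpy as np

def neuron(x, w, b, phi=lambda z: max(z, 0.0)):
    """Single McCulloch-Pitts neuron: y = phi(sum_i w_i * x_i + b)."""
    return phi(float(np.dot(w, x)) + b)

def layer(x, W, b, phi=lambda z: np.maximum(z, 0.0)):
    """Layer of M neurons in vector form: y = phi(W @ x + b), W of shape (M, N)."""
    return phi(W @ x + b)

x = np.array([1.0, 2.0, -1.0])           # N = 3 inputs
W = np.array([[0.5, -0.2, 0.1],
              [0.3,  0.8, -0.5]])        # M = 2 neurons, shape (M, N)
b = np.array([0.1, -0.2])
y = layer(x, W, b)                       # element 0 equals neuron(x, W[0], b[0])
```

The layer call is simply the vectorized form of running `neuron` once per row of `W`, matching the *y* = *W* × *x* + *b* notation of the text.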

### B. Neural network models

Part of the unprecedented success of NNs in tackling complex computational problems can be attributed to the plethora of NN models, capable of uniquely synergizing from several hundred up to billions of artificial neurons into versatile computational building blocks. In this section, we will give an overview of several NN models based on their popularity and success in resolving standardized benchmarking problems, as well as on their compatibility with hardware implementation in silicon photonic platforms.

NN models can be broadly classified in different categories based on the following:

Data flow pattern. Considering the direction of the information flow, NN models can be grouped into two categories: In feed-forward NNs, the signals travel exclusively in one direction, usually from left to right, while in feed-back NNs, the signals travel in both directions, allowing neurons to receive data from neurons belonging to subsequent or even the same layer. Figures 2(a)–2(e) depict five popular types of NN models, grouped based on their data flow into feed-forward and feed-back implementations, with the latter being mostly utilized for resolving temporal and ordinal workloads as the network effectively retains memory of the previous samples.

Interconnectivity. The interconnection density between neurons of subsequent layers or even the same layer can be used to classify NN models into dense and sparse implementations. Figure 2(a) depicts a typical DNN model, where each neuron of the first layer is connected to all the neurons of the subsequent layer, usually denoted as a fully connected layout, while the neurons of the second layer are interconnected to only two neurons of the subsequent layer, corresponding to a sparse layout. While high interconnectivity density allows the NN to extract more complex relationships between the input data, the cost associated with the increasing number of weights, scaling with *O*(*N*^{2}) complexity for an *N* × *N* interconnectivity, promotes the use of sparse models.

Structural layout. Employing a specific layout can enhance neural network models with unique attributes. A typical example of such a model, specifically a single layer of a convolutional NN,^{47} is illustrated in Fig. 2(b). This architectural approach, widely employed in image recognition tasks due to its spatio-local feature extraction capabilities, promotes weight re-use and, as such, relaxes the computational requirements by applying the same weight kernel, i.e., a set of weight values, across the input data values. Another typical NN layout, depicted in Fig. 2(c), is the NN autoencoder, a model associated with data encryption due to its data compression layout that effectively reduces the dimensionality of the input data in its central layers; it also enjoys wide employment in non-linearity compensation in optical communications.^{42} Finally, Fig. 2(d) illustrates the most common feed-back NN model, the recurrent NN, while Fig. 2(e) depicts a special type of recurrent NN typically denoted as reservoir computing,^{48} where a fixed-connectivity recurrent layer is placed between the input and the output layer. The relaxed training requirements, as only the output layer has to be trained, along with the ease of constructing time-delayed reservoir circuitry in silicon photonic platforms, have led to impressive demonstrations in optical channel equalization applications.

### C. MAC and MVM operations

In a neural layer with *N* inputs and *N* neurons, the linear (weighted addition) and non-linear (activation) parts exhibit *O*(*N*^{2}) and *O*(*N*) complexity, respectively. As such, MAC operations comprise the most significant computational burden and are usually correlated with the computational capacity of the NN model.^{49} A single MAC operation calculates the product of two numbers and adds the result to an accumulator. Defining *a* as the accumulation variable that holds the weighted input sum $\sum_{n=1}^{i-1} w_n \times x_n$, the operation can be described by the following form:

$a_i = a_{i-1} + w_i \times x_i. \quad (1)$

A neuron with *N* inputs creates a weighted sum that can be broken down into a series of *N* parallel MAC operations, while a fully connected neural layer with *M* interconnected neurons and *N* inputs per neuron supports a total number of *M* × *N* parallel MACs. A typical digital electronic MAC unit layout that realizes the mathematical formula of (1) is depicted in Fig. 3(a), showing that the partial 2*N*-dimensional weighted sum, stored by the accumulator, is then fed back to the summation circuit to be added to the subsequent partial weighted sum produced by the next *N* input and *N* weight values. Figure 3(b) illustrates an example of a MAC operational unit when implemented in the analog electronic domain, where an input vector *x* is imprinted in the electrical domain through the use of DACs and subsequently broadcasted to an array of [i, j] synaptic weights implemented through variable resistive elements arranged in a crossbar (Xbar) configuration.^{50,51} By controlling the impedance of the variable resistive elements, the electrical current emerging at every Xbar column output provides, based on Kirchhoff’s law, the weighted summation of the column inputs,^{51} e.g., $y_1 = \sum_{k=1}^{i} x_k \times w_{k,1}$. Careful examination of the two MAC implementations can provide us with some significant insight into the differences between analog and digital computing:

Value representation and information density. Digital implementations use discrete values of physical variables, typically employing two discrete levels that are correlated with the upper and lower switching voltages of a transistor and are usually denoted as 0 and 1. On the other hand, analog computing employs values across the whole range of physical variables, allowing in this way for the representation of several equivalent bits of information in the same time unit. A direct consequence of this value representation form is the noise robustness required of the computational system, which will be discussed in more detail in Subsection II D, especially for optical implementations.

Computational primitives. While digital computing is solely based on the mathematics and respective deployments of Boolean logic-based circuitry, analog computing can employ the physical laws of the underlying hardware, e.g., capacitors and resistors,^{52} to implement a variety of mathematical operations, unlocking a quiver of functionalities described by the exploited physical phenomena.

Latency. Boolean logic requires a large number of devices to implement a specific mathematical operation; e.g., a digital computational building block implementing 8-bit parallel multiplication requires ∼3000 transistors.^{53} This forms a latency-critical computational path that is defined by the maximum register-to-register delay and effectively limits the maximum achievable operating frequency and, as such, the achieved latency.^{54} This has led to the adoption of multi-threading and multi-core setups for parallel processing in modern computing systems, investing in architectural innovations toward system acceleration. On the other hand, analog systems are inherently built as parallel computational systems, giving them a significant edge in latency-critical tasks while requiring, on average, ∼500× fewer components^{53} than digital electronic circuits for multiplication operations.
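To make Eq. (1) concrete, the short sketch below (ours, not part of the Tutorial) unrolls the MAC recurrence $a_i = a_{i-1} + w_i \times x_i$ and shows that *N* sequential MAC steps reproduce the weighted sum of an *N*-input neuron:

```python
def mac(acc, w, x):
    # One multiply-accumulate step: a_i = a_{i-1} + w_i * x_i
    return acc + w * x

def weighted_sum(w, x):
    # A neuron with N inputs = N MAC operations applied to one accumulator
    acc = 0.0
    for wi, xi in zip(w, x):
        acc = mac(acc, wi, xi)
    return acc

w = [0.5, -0.25, 1.0, 0.125]
x = [2.0, 4.0, -1.0, 8.0]
s = weighted_sum(w, x)   # equals the dot product of w and x
```

A digital MAC unit performs these steps serially per clock cycle, whereas the analog crossbar of Fig. 3(b) realizes all of them simultaneously as summed currents.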

These advantages, synergized with the primitives of photonic devices, have fueled the rise of optical MVM hardware, with an indicative example of an analog photonic dot product implementation given in Fig. 3(c). In this approach, the input and/or weight information is encoded in one of the underlying physical variables of the photonic system, i.e., the amplitude, phase, polarization, or wavelength of a light beam, while the physical primitives of optical phenomena are utilized for the mathematical operations: in this particular example, the loss experienced during the transmission of light through the weight-encoding physical system provides the multiplication operation, while interference of light waves provides the summation mechanism. Harnessing the advantages of light-based systems, i.e., multiple degrees of freedom for encoding information in time, space, and wavelength, low propagation loss, low electromagnetic interference, and high-bandwidth operation, holds the credentials to surpass analog electronic deployments in large-scale photonic accelerators.^{55} It is noteworthy, though, that both electronic and photonic analog compute engines necessitate the use of digital-to-analog converter (DAC) and analog-to-digital converter (ADC) modules for interfacing NN input and output modules with the digital world.
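A toy numerical model of such a photonic dot product, our own simplification rather than the Tutorial's, treats each weight as an optical power transmission in [0, 1] (loss-based multiplication) and the photodiode as an incoherent power summer:

```python
import numpy as np

def photonic_dot(x_power, t_weights, responsivity=1.0):
    """Toy model of Fig. 3(c)-style weighting: inputs encoded as optical
    powers (W), weights as passive power transmissions in [0, 1]; the
    photodiode current sums the attenuated beams (incoherent summation
    assumed, positive weights only)."""
    t_weights = np.asarray(t_weights, dtype=float)
    assert np.all((t_weights >= 0) & (t_weights <= 1)), "passive loss only"
    return responsivity * float(np.dot(t_weights, np.asarray(x_power)))

x = np.array([1e-3, 2e-3, 0.5e-3])   # input optical powers (W)
t = np.array([0.5, 0.25, 1.0])       # weight transmissions
i_pd = photonic_dot(x, t)            # photocurrent proportional to w.x
```

Real implementations additionally exploit coherent interference and phase to realize signed weights; this sketch only captures the loss-as-multiplication and photodetection-as-summation intuition described above.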

### D. Precision

Migrating MAC operations from digital circuitry, where high-precision (i.e., 16-, 32-, or 64-bit) floating point representations are utilized, to the analog domain, necessitates a basic understanding of the physical representation and energy-efficiency tradeoffs of analog photonic circuitry. Given the continuous nature of analog variables, as opposed to the usually two-level discretized variables in digital systems, representing high-precision numerical quantities in an analog system necessitates significantly higher signal-to-noise ratios (SNRs). This requirement shapes an optimal bit resolution/energy efficiency operational regime for analog photonic computing systems.^{56} In this subsection, we will discuss the precision limitations of analog photonic computing, outlining its optimized operational trade-offs in the shot-noise limited regime vs state-of-the-art digital MAC circuitry.

It has been shown^{53} that the power consumption of a digital MAC scales linearly with the bit resolution, such that

$P_{dig} = b_{dig} \times P_{D\text{-}single\text{-}bit}, \quad (2)$

where *P*_{D−single−bit} is the power consumption of a single-bit MAC operation and *b*_{dig} is the bit resolution. For photonic implementations, we use the correlation between the achieved bit resolution and the standard deviation of the system’s total noise level, as this has been defined in Refs. 57 and 58,

$b_{phot} = \log_2\left(\frac{I_{max} - I_{min}}{\sigma_{TOTAL}}\right), \quad (3)$

where *I*_{max} − *I*_{min} defines the range between the maximum and minimum electrical current values generated at the photodiode (PD) output and *σ*_{TOTAL} is the standard deviation of the total noise of the photonic link, under the generally valid assumption that the link is dominated by additive white Gaussian noise (AWGN). Assuming that the link operates at the shot noise limit of the photodiode, the total noise of the system equals the shot noise, and we have

$\sigma_{TOTAL} = \sigma_{shot} = \sqrt{2 h \nu B R I_{avg}}, \quad (4)$

where *h* is the Planck constant, *B* is the employed bandwidth, *ν* is the lightwave frequency (*λ* = 1550 nm or *ν* = 193.41 THz), and *I*_{avg} is the average electrical current generated at the photodiode output, which relates to the average optical power *P*_{avg} that enters the photodiode via *I*_{avg} = *RP*_{avg}, with *R* being the photodiode responsivity. It should be noted that the aforementioned calculation is an approximation of the actual shot noise of the photonic link, as its value depends on the number of photons reaching the receiver in any given computing interval.^{55,56}

Taking into account that *I*_{max} − *I*_{min} = *R*(*P*_{max} − *P*_{min}) = *R* × *OMA*, with *OMA* denoting the optical modulation amplitude, and assuming that the input signal has a duty cycle of 50% and an infinite extinction ratio (ER), the *OMA* turns out to be $OMA = 2 \times P_{avg} = 2 \times I_{avg}/R$, with *I*_{max} − *I*_{min} = 2 × *I*_{avg}. Considering an ideal optical MAC unit in order to calculate the theoretical limit of optical energy efficiency, we assume only the use of unitary and lossless layouts, where a link performs, in principle, a lossless MAC operation^{58} and is powered by a laser that consumes an average power of *P*_{laser} and has a wall-plug efficiency of *a* = 0.2; the average optical power emitted by the laser will then be *a* × *P*_{laser} and will be required to be greater than or equal to *P*_{shot} when operation at the shot-noise limit is required. Finding *σ*_{TOTAL} = *σ*_{shot} via Eq. (3) and then using the resulting expression to replace *σ*_{shot} in Eq. (4), while replacing *I*_{avg} with *R* × *P*_{avg} and requesting *P*_{avg} = *P*_{shot} due to the operation at the shot-noise limit, *P*_{shot} and, consequently, the consumed laser power *P*_{laser} can be calculated as

$P_{laser} = \frac{P_{shot}}{a} = \frac{h \nu B \times 2^{2 b_{phot}}}{2a}. \quad (5)$

Dividing Eq. (5) by the compute rate *B*, we can transform Eq. (5) into the shot-noise limited energy efficiency per MAC described in the following equation:

$E_{MAC} = \frac{P_{laser}}{B} = \frac{h \nu \times 2^{2 b_{phot}}}{2a}. \quad (6)$

This theoretical limit can be compared directly against state-of-the-art digital MAC circuitry,^{59} with the shot-noise limited energy efficiency of a single photonic circuit calculated by Eq. (6) when assuming a unity responsivity (*R* = 1). Another important metric that could be utilized for the aforementioned comparison is the Landauer limit^{60} that effectively defines the minimum energy required for a digital irreversible computation and, as such, could be employed as the theoretical minimum energy for digital computation, with an analysis and relevant metrics accessible in the supplementary material of Ref. 61. While a more detailed discussion of the achieved energy efficiency of a photonic accelerator and its related advantages will be provided in Sec. II E, it becomes evident that, in harnessing the analog-architecture derived advantages of photonic implementations, the bit resolution of the photonic accelerator will have to range in lower bitwidths than its digital equivalent. Moreover, as analog systems encode the data information along single physical variables, they have to migrate from floating-point to fixed-point representations. This is dictated both by the lack of bit-resolution depth that would allow splitting the mantissa and exponent parts of the represented number and by the nature of computation that requires physical number representations. In order to ease the understanding of the two different representation schemes, Fig. 4(b) schematically describes the two approaches.

When *N* *k*-bit signals are summated, the full digital resolution at the chip's output would be defined through Eq. (7), i.e., *Y*_{res} = *X*_{res} + log_{2} *N*, where the factor *N* is employed for compensating for the 1/*N* splitting ratio of the laser source at the circuit's ingress. However, in analog implementations, we retain the same bit-precision across the different neural layers, and as such, the output of the summation should have the same bitwidth as the input neuron values. Consequently, the required SNR at the analog photonic output would be lower than the full digital equivalent one, implying that we can keep the minimum optical power difference between adjacent bits (MOPB) constant, even when reducing the laser optical power. Defining this digital-to-analog precision loss^{55,56} as *a*_{prec}, we can highlight two interesting operational regimes, schematically illustrated in Figs. 5(a) and 5(b). Specifically, in Fig. 5(a), a light beam, originating from a laser source and consuming an optical power of *P*, is split in a 1:*N* splitter (*N* = 4) into four equivalent beams that get subsequently modulated in *X*_{1}–*X*_{4} optical modulators. With the four inputs having a bit resolution of *X*_{res} = 2, their summation, using Eq. (7), has a full digital precision of *Y*_{res} = 2 + log_{2} 4 = 4, and the system corresponds to *a*_{prec} = 1. On the other hand, in Fig. 5(b), we set the output bit resolution to *Y*_{res} = 2, and hence, assuming only positive weights and a lossless weight matrix, we can maintain the same MOPB at the output summation even when reducing the injected optical laser power to *P*′ = *P*/*N* = *P*/4, while the system now corresponds to *a*_{prec} = 4 = *N*. A more thorough analysis is given in the Appendix.
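The two regimes can be cross-checked numerically; note that the closed form of *a*_{prec} used below (a power of two set by the gap between the full digital output resolution of Eq. (7) and the retained analog one) is inferred from the Fig. 5 examples rather than stated explicitly in the text:

```python
import math

# Sketch of the digital-to-analog precision loss; this closed form is
# inferred from the Fig. 5 examples rather than stated in the text.
def a_prec(x_res, n, y_res):
    y_res_full = x_res + math.log2(n)   # Eq. (7): full digital resolution
    return 2 ** (y_res_full - y_res)

N = 4
# Regime (a): keep the full 4-bit output resolution -> no precision loss.
assert a_prec(x_res=2, n=N, y_res=4) == 1
# Regime (b): clamp the output to 2 bits -> a_prec = N, so the laser
# power can drop from P to P/N at a constant MOPB.
assert a_prec(x_res=2, n=N, y_res=2) == N
```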

In this context, NNs are uniquely suited for analog computing, as empirical research has shown that they can operate effectively with both low precision and fixed-point representation, with inference models working nearly as well with 4–8 bits of precision in both activations and weights—sometimes even down to 1–2 bits.^{62} On top of that, bit precision in analog compute engines can be improved by incorporating the idiosyncrasies and noise sources of the underlying photonic hardware into the NN training, investing in this way in so-called hardware-aware training or optics-informed DL models.^{31} Employing this approach, researchers have already showcased robust networks that can secure almost the same accuracy as noise-free digital platforms,^{63} while a more detailed discussion is included in Sec. V.
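As a toy illustration of this low-precision tolerance (synthetic data, not reproducing the cited studies), a 4-bit fixed-point weight vector tracks a full-precision dot product within a provable bound:

```python
import random

# Toy check of low-bit inference (synthetic data): a 4-bit fixed-point
# weight vector reproduces a full-precision score within a known bound.
random.seed(0)

def quantize(v, bits):
    levels = 2 ** (bits - 1) - 1              # symmetric signed grid
    scale = max(abs(x) for x in v) / levels
    return [round(x / scale) * scale for x in v]

w = [random.uniform(-1, 1) for _ in range(64)]
x = [random.uniform(0, 1) for _ in range(64)]
w4 = quantize(w, bits=4)

score_fp = sum(wi * xi for wi, xi in zip(w, x))
score_q4 = sum(wi * xi for wi, xi in zip(w4, x))
# Each weight moves by at most scale/2, so for non-negative inputs the
# score error is bounded by (scale / 2) * sum(x).
scale = max(abs(v) for v in w) / (2 ** 3 - 1)
assert abs(score_fp - score_q4) <= scale / 2 * sum(x)
```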

### E. Technology requirements for energy and area efficiency

To quantify these requirements, let us consider an *N*×*N* neural layer, pictorially represented in Fig. 6. The *N*×*N* neural layer comprises an *N*×*N* weight matrix *W*, which gets multiplied by an *N*:1 input vector *X* and yields an *N*:1 output vector *Y*. Assuming that every synaptic weight is implemented via a hardware module that consumes a power of *P*_{W} watts and an area of *A*_{W} mm^{2}, each input signal generation structure is realized by a hardware circuit that consumes a power of *P*_{X} watts and an area of *A*_{X} mm^{2}, the receiver circuitry that is employed for obtaining every output signal consumes *P*_{Y} watts and has a footprint of *A*_{Y} mm^{2}, and the optical laser source consumes *P*_{laser} watts, then the total power consumed equals *NP*_{X} + *N*^{2}*P*_{W} + *NP*_{Y} + *P*_{laser}. Assuming a *B* MAC/s compute rate per axon, the total compute rate equals *N*^{2}*B* MAC/s, leading to an energy efficiency in J/MAC (or, inversely, in MAC/s/W) of

$E_{eff} = \frac{P_X}{NB} + \frac{P_W}{B} + \frac{P_Y}{NB} + \frac{P_{laser}}{N^2 B}.$ (9)

The input signal generation power *P*_{X} is dominated by the dynamic consumption of the driving electronics,^{64} defined in J/s through the *CV*^{2}*B* product [Eq. (10)], where *C* and *V* are the capacitance and the driving voltage of the input modulator, respectively. After merging Eqs. (9) and (10) and employing typical values (*C* = 14 fF, *V*_{pp} = 2 V)^{65} for state-of-the-art electro-absorption modulators (EAMs), while excluding, at this point, their static energy consumption, it can be derived that EAM-based input imprinting consumes only a few fJ/MAC, in line with the values summarized in the table below.
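The bookkeeping of Eq. (9) can be sketched numerically as follows; the per-module powers are illustrative placeholders rather than the cited values:

```python
# A hedged numerical sketch of the Eq. (9) bookkeeping; the per-module
# powers below are illustrative placeholders, not measured values.
def energy_per_mac(n, b, p_x, p_w, p_y, p_laser):
    """Total layer power divided by its n**2 * b MAC/s compute rate."""
    p_total = n * p_x + n ** 2 * p_w + n * p_y + p_laser
    return p_total / (n ** 2 * b)

b = 10e9                                  # 10 GMAC/s per axon
kw = dict(p_x=50e-6, p_w=1e-6, p_y=20e-6, p_laser=1e-3)
small = energy_per_mac(4, b, **kw)
large = energy_per_mac(1024, b, **kw)

# Input, output, and laser powers are amortized over n**2 MACs, so for
# large radices the per-MAC energy approaches the weight term p_w / b.
assert large < small
assert abs(large - 1e-6 / b) / (1e-6 / b) < 0.1
```

The check confirms the amortization argument: the input, output, and laser terms scale as 1/*N* or 1/*N*^{2}, so at high radix the per-MAC energy collapses onto the weight term *P*_{W}/*B*.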

| Technology | Compute rate B (MAC/s) | Static consumption (W) | Efficiency (J/MAC) |
|---|---|---|---|
| TO PS^{66} | 10–50 × 10^{9} | 12 × 10^{−3} | 2.4–12 pJ/MAC |
| Insulated TO PS^{66} | 10–50 × 10^{9} | 4 × 10^{−3} | 0.4–2 pJ/MAC |
| EAMs^{65} | 10–50 × 10^{9} | 2–20 × 10^{−6} | 0.4–20 fJ/MAC |
| Non-volatile PCMs^{26} | 10–50 × 10^{9} | ≈0 | ≈0 pJ/MAC |


The static consumption of the EAMs has been calculated assuming a bias voltage of *V*_{static} = −1.5 V, corresponding to the mean value of a uniform distribution that ranges over the EAMs' operating regime, i.e., [0–3 V], a responsivity of *R* = 0.8 A/W, and *P*_{in} ranging from −15 to −5 dBm. Regarding the thermo-optic (TO) phase shifter (PS),^{66} we assume a uniform distribution of the weight values, corresponding to a power distribution in the *P*_{0}–*P*_{π} range with an average value of *P*_{TO} = *P*_{π}/2.

Finally, regarding the laser power term, the emitted optical power has to be as follows:

- Higher than the accelerator's noise energy. In this context, following the analysis of Subsection II D for the shot-noise limited optical power and considering an *N*×*N* neural layer, with (a) a power splitting ratio of *N*^{2}, implying that we have to multiply the output power by *N*^{2} to compensate the input and column splitting stages, and (b) a digital precision loss of *a*_{prec} = *N*, the shot-noise limited optical power can be calculated using the following equation, which forms a more detailed representation of the laser power calculated in Eq. (5), with the digital precision loss and the compensation loss factor also taken into account:

  $P_{laser\_shot} = 3.85\,\mathrm{aJ} \times (2^{b_a}-1)^2 \times (1/R) \times B \times (1/a_{prec}) \times N^2,$ (14)

  which makes the constituent term of Eq. (9) equal to

  $\frac{P_{laser\_shot}}{N^2 B} = \frac{3.85\,\mathrm{aJ} \times (2^{b_a}-1)^2 \times (1/R) \times B \times (1/N) \times N^2}{N^2 \times B} = 3.85\,\mathrm{aJ} \times (2^{b_a}-1)^2 \times \frac{1}{N},$ (15)

  assuming a responsivity *R* = 1 A/W.
- Sufficient to generate the minimum required electrical charge at the receiver that can drive the subsequent node of the next NN layer.^{67} With the photonic accelerator operating at 1550 nm and assuming a photodetector with *C*_{d} = 1 fF,^{68} a *C*_{i} = 200 aF, a 1 *μ*m wire with an interconnect capacitance of 200 aF/*μ*m,^{67} and a required output voltage of *V*_{out} = 0.5 V,^{69} the minimum optical power required can be calculated, following the same convention of *N*^{2} splitting loss and *N* digital precision loss:

  $V_{out} = P_{laser\_switch} \times \frac{e}{h\nu\,(C_d + C_i)} \times a_{prec} \times \frac{1}{N^2 B} \;\Rightarrow\; P_{laser\_switch} = V_{out} \times (C_d + C_i) \times \frac{h\nu}{e} \times \frac{1}{N} \times N^2 \times B,$ (16)

  which concludes for the fourth term of the energy efficiency to

  $\frac{P_{laser\_switch}}{N^2 B} = \frac{V_{out} \times (C_d + C_i) \times (h\nu/e) \times (1/N) \times N^2 \times B}{N^2 \times B} = 2.4\,\mathrm{fJ} \times \frac{1}{N}.$ (17)

  Here, it should be pointed out that this interconnect capacitance *C*_{i} suggests a monolithic integration approach or a very intimate proximity of the photonic chiplet to the respective electronic chiplet. More traditional integration approaches will enforce higher interconnect capacitances and significantly increase the required energy, with an interesting analysis provided in Ref. 70.

Combining all the terms in a single efficiency equation, we conclude to

$E_{eff} = \frac{P_X}{NB} + \frac{P_W}{B} + \frac{P_Y}{NB} + \frac{\max(P_{laser\_shot},\, P_{laser\_switch})}{N^2 B} = \frac{5\,\mathrm{fJ}}{N} + \frac{P_W}{B} + \frac{1.25\,\mathrm{fJ}}{N} + \frac{1}{N} \times \max\!\left[3.85\,\mathrm{aJ} \times (2^{b_a}-1)^2,\; 2.4\,\mathrm{fJ}\right].$ (18)

This highlights that energy efficiency improves with the following:

- Increasing *N*, implying that the energy consumed for generating and receiving the input and output signals, respectively, is optimally utilized when the same input and output signals are shared along multiple matrix multiplications or, equivalently, neural operations. With current neuromorphic architectures being radix-limited by the maximum emitted laser power,^{71} loss-optimized architectures are required for allowing high circuit scalability and harnessing the advantages of photonic implementations.
- Increasing *B*, which has a predominant effect in reducing energy consumption, especially when using high-power-consumption weight nodes; i.e., the currently widely employed thermo-optic heaters dominate the energy efficiency, reaching up to ∼1 pJ/MAC.
- Operating in an optimized bit-resolution energy regime, as highlighted in the fourth constituent of Eq. (18). As we can observe, the order-of-magnitude difference between the shot-noise limited and minimum switching energy contributions has a threshold point at around 4.5 bits, implying that a careful examination of the underlying technology blocks and an optimized operational regime can significantly improve the energy consumption.
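The ≈4.5-bit crossover between the two arguments of the max[·] term in Eq. (18) can be verified directly:

```python
import math

# Cross-check of the crossover between the shot-noise-limited term
# 3.85 aJ * (2**b_a - 1)**2 and the 2.4 fJ switching term in Eq. (18).
def shot_term(bits):
    return 3.85e-18 * (2 ** bits - 1) ** 2

switch_term = 2.4e-15
# Solve 3.85 aJ * (2**b - 1)**2 = 2.4 fJ for the crossover bit count.
b_cross = math.log2(1 + math.sqrt(switch_term / 3.85e-18))

assert shot_term(4) < switch_term < shot_term(5)
assert 4 < b_cross < 5          # ~4.7 bits, i.e., "around 4.5 bits"
```

Below this bit count the minimum switching charge dominates the laser requirement; above it, shot noise does.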

Similarly, the total area of the *N*×*N* neural layer equals *NA*_{X} + *NA*_{Y} + *N*^{2}*A*_{W} mm^{2}, suggesting an area efficiency in MAC/s/mm^{2} of *N*^{2}*B*/(*NA*_{X} + *NA*_{Y} + *N*^{2}*A*_{W}), revealing a linear dependence on *B* and relative gains for high accelerator radices.
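A short numerical sketch of this area bookkeeping (the footprints are illustrative placeholders, not values from the text):

```python
# Hedged sketch of the area bookkeeping; footprints (mm^2) are
# illustrative placeholders, not values from the text.
def area_efficiency(n, b, a_x, a_y, a_w):
    """MAC/s/mm^2: compute rate over total layer footprint."""
    total_area = n * a_x + n * a_y + n ** 2 * a_w
    return (n ** 2 * b) / total_area

b = 10e9
eff = area_efficiency(n=512, b=b, a_x=0.05, a_y=0.05, a_w=0.001)
# The n**2 weight area dominates at high radix, so the efficiency
# approaches the per-weight asymptote b / a_w, linear in b.
assert abs(eff - b / 0.001) / (b / 0.001) < 0.2
```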

## III. INTEGRATED PHOTONIC MATRIX-VECTOR-MULTIPLY ARCHITECTURES

In this section, we solely focus on photonic matrix-vector-multiply architectures that could potentially be deployed in DL environments rather than in spiking or event-based computing paradigms. This choice is motivated by the current challenges faced by spiking neural networks (SNNs), including difficulties in understanding their underlying mechanisms and a lack of standardized benchmarks.^{18} In contrast, the established success of deep learning models results from years of research and the availability of extensive datasets and benchmarks, contributing to their widespread applicability and effectiveness.

### A. Coherent MVM architectures

Herein, we initially investigate the architectural categories of integrated PNNs, and then we delve deeper into their individual building blocks, reviewing recent developments in photonic weight technologies as well as in non-linear activation function implementations. Depending on the mechanism of information encoding and the calculation of linear operations, integrated PNNs can be classified into three broad categories: coherent, incoherent, and spatial architectures.

Coherent architectures harness the effect of constructive and destructive interference for the linear combination of the inputs in the domain of electrical field amplitudes, requiring just a single wavelength for calculating the neural network linear operations. The principle of operation of coherent architectures is pictorially represented in Fig. 7(a), while Figs. 7(b)–7(d) illustrate indicative coherent layouts that have been proposed in the literature and will be comprehensively analyzed in this Tutorial. The first linear neuron realized in this manner was proposed in Ref. 29, with its core relying on the optical interference unit realized through cascaded MZIs in a singular value decomposition (SVD) arrangement,^{72} as per Fig. 8(a). The SVD approach assumes decomposition of the arbitrary weight matrix *W* into *W* = *USV*^{†}, where *U* and *V* denote unitary matrices, with *V*^{†} being the conjugate transpose of *V* and *S* being a diagonal matrix that carries the singular values of *W*. Therefore, this scheme rests upon the factorization of unitary matrices, which in the photonic domain has mainly been based on U(2) factorization techniques employing 2 × 2 MZIs.^{73}
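The SVD mapping can be sanity-checked in a few lines; this is a plain numerical factorization, not a simulation of the photonic mesh itself:

```python
import numpy as np

# The SVD mapping used by Ref. 29: W = U @ S @ V^dagger with U, V unitary
# (programmed as MZI meshes) and S diagonal (attenuation/gain stage).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))

U, s, Vh = np.linalg.svd(W)          # Vh is V^dagger
S = np.diag(s)

assert np.allclose(U @ S @ Vh, W)                 # exact reconstruction
assert np.allclose(U @ U.conj().T, np.eye(4))     # U is unitary
assert np.allclose(Vh.conj().T @ Vh, np.eye(4))   # V is unitary
```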

In this regime, back in 1994, Reck *et al.*^{74} proposed the first optical unitary matrix decomposition scheme, the so-called triangular mesh shown in Fig. 8(b), using 2 × 2 MZIs as the elementary building block, illustrated in Fig. 8(c). Recently, this layout was optimized by Clements *et al.*,^{75} introducing the rectangular mesh of 2 × 2 MZIs, depicted in Fig. 8(d), which is a more loss-balanced and error-tolerant design than Reck's architecture. Both layouts necessitate *N*(*N* − 1)/2 variable beam splitters for implementing any *N* × *N* unitary matrix, requiring, also, the same number of programming steps for realizing the decomposition. Although these U(2)-based architectures rely on a simple library of photonic components that facilitates their fabrication, they suffer from several drawbacks, with the most important being fidelity degradation. Fidelity is a measure of the closeness between the experimentally obtained and the theoretically targeted matrix values, i.e., a quantity that declares the accuracy of implementing a targeted matrix in the experimental domain. Fidelity degradation in the U(2)-based layouts originates from the differential path losses imposed by the non-ideal lossy optical components.^{76}
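A minimal sketch of the 2 × 2 building block and the mesh size count (the phase convention of the MZI below is one common choice, not necessarily that of Refs. 74 and 75):

```python
import numpy as np

# A 2x2 MZI (one internal phase theta, one external phase phi) is
# unitary for any settings; a Reck or Clements mesh needs N*(N-1)/2
# of them to realize an arbitrary NxN unitary.
def mzi(theta, phi):
    bs = np.array([[1, 1j], [1j, 1]]) / np.sqrt(2)   # 50:50 coupler
    inner = np.diag([np.exp(1j * theta), 1.0])
    outer = np.diag([np.exp(1j * phi), 1.0])
    return outer @ bs @ inner @ bs

def mesh_size(n):
    return n * (n - 1) // 2

T = mzi(0.3, 1.1)
assert np.allclose(T @ T.conj().T, np.eye(2))   # unitary for any phases
assert mesh_size(8) == 28                       # 8x8 unitary -> 28 MZIs
```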

On top of that, U(2)-based layouts cannot support any fidelity restoration mechanism without altering their architectural structure or sacrificing their universality. When transferring these layouts to an SVD scheme toward implementing arbitrary matrices, the above effects are exacerbated, as two concatenated unitary matrix layouts are required. In an attempt to counteract these issues, the authors in Ref. 77 proposed the universal generalized Mach–Zehnder interferometer (UGMZI)-based unitary architecture illustrated in Fig. 8(e) and introduced a novel U(N) unitary decomposition technique^{78} in the optical domain that migrates from the conventional U(2) factorization by employing *N* × *N* generalized MZIs (GMZIs) as the elementary building block. GMZIs serve as *N* × *N* beam splitters,^{79,80} followed by *N* PSs, with each *N* × *N* beam splitter comprising two *N* × *N* MMI couplers interconnected by *N* PSs, as depicted in Fig. 8(f). This scheme eliminates the differential path losses, and hence, it can yield 100% fidelity performance by applying a simple fidelity restoration mechanism, which incorporates *N* variable optical attenuators at the inputs of the UGMZI. Yet, this architecture heavily relies on MMI couplers with a high number of ports in order to perform transformations on large unitary matrices, which remain a rather immature integrated circuit technology still under development in current fabrication attempts. Finally, the authors in Ref. 81 proposed the slimmed SVD-based PNN, trading universality for area and loss efficiency by eliminating one of the two unitary matrices, implying that it can implement only specific weight matrices.
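Since the text does not give the fidelity formula, the sketch below assumes one common normalized-overlap definition, which equals 1 when the implemented matrix matches the target up to a global phase:

```python
import numpy as np

# Assumed normalized-overlap fidelity between an implemented matrix M
# and a target T (equal to 1 iff M matches T up to a global phase).
def fidelity(M, T):
    num = abs(np.trace(M.conj().T @ T)) ** 2
    den = np.trace(M.conj().T @ M).real * np.trace(T.conj().T @ T).real
    return num / den

# A random 4x4 unitary target (QR of a Gaussian matrix).
T = np.linalg.qr(np.random.default_rng(2).normal(size=(4, 4)))[0]
assert np.isclose(fidelity(T, T), 1.0)                    # perfect case
assert fidelity(0.9 * T + 0.1 * np.eye(4), T) < 0.9999    # perturbed case
```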

Apart from SVD-based approaches, direct-element mapping architectures also comprise coherent layouts that employ a single wavelength and interference for calculating the linear operations. The mapping of the weight values to the underlying photonic fabric is bijective, meaning that each photonic node imprints a dedicated value of the targeted weight matrix without necessitating decomposition, minimizing in this way the programming complexity. Figure 9 illustrates the first coherent direct-element mapping architecture,^{76} implemented in a crossbar (Xbar) layout. In order to support both positive and negative weight values, this architecture requires the use of two devices per weight: an attenuator for imprinting the weight magnitude, proportional to |*W*_{i}|, and a PS for controlling the phase, i.e., the sign of the weight, sign(*W*_{i}), enforcing a 0 phase shift in the case of positive and a *π* phase shift in the case of negative weights, resulting in sign(*W*_{i})|*W*_{i}| × *X*_{i}. The weighted inputs are linearly combined in an *N*:1 combiner stage, constituted of cascaded Y-junction combiners, yielding an output electrical field proportional to $\sum_{i=1}^{N} X_i W_i$, which conceals the sign information in its phase. If compatibility with electrical non-linearities is needed, the sign information of the signal emerging from the Xbar output can be translated from its phase to its magnitude by introducing an optional bias branch, which sets a constant reference power level that allows for mapping the positive/negative output field above/below the bias, as experimentally demonstrated in Ref. 82. The Xbar architecture, thanks to its loss-balanced configuration, can yield 100% fidelity performance, while its non-cascaded, one-to-one mapping connectivity significantly improves the phase-induced fidelity performance, since any phase error is restricted to a single matrix element. These benefits were experimentally verified in Refs. 83 and 84, employing a 4 × 4 silicon photonic Xbar with SiGe EAMs as computing cells, while the NN classification credentials of this architecture were experimentally validated in Refs. 24 and 25 using a 2:1 single-column Xbar layout that is capable of calculating the linear operations of the MNIST dataset at up to a 50 GHz clock frequency with a classification accuracy of >95%. In an effort to exploit the full potential of the photonic platform, the Xbar architecture can be equipped with wavelength division multiplexing (WDM) technology to further boost the throughput, as has been proposed in Refs. 85 and 86, realizing multiple output vectors in a single timeslot. Although the Xbar layout currently seems to be the optimal architectural candidate for PNNs, it requires careful and precise effort during circuit design in order to synchronize the optical signals that travel through different paths and coherently recombine at the output. Hence, optimum performance of the Xbar necessitates the employment of equal-length optical paths whenever coherent recombination is required, suggesting that the path-length difference has to be compensated during photonic chip layouting.
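A toy scalar model of the Xbar weight encoding (helper names are ours), where the 0/π phase carries the sign and coherent combining performs the signed sum:

```python
import numpy as np

# Toy model of the Xbar weight encoding: an attenuator sets |w_i|, a
# 0/pi phase shifter sets sign(w_i), and the combiner sums the fields.
def xbar_output_field(x, w):
    fields = [np.sign(wi) * abs(wi) * xi for wi, xi in zip(w, x)]
    return sum(fields)        # pi-shifted fields subtract on interference

x = [0.2, 0.5, 0.9]
w = [0.7, -0.4, 0.1]
assert np.isclose(xbar_output_field(x, w), np.dot(x, w))
```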

Finally, a recent coherent demonstration in Ref. 87 exploits vertical-cavity surface-emitting lasers (VCSELs) for encoding, in *i* time steps, both the input vector and the weight matrix, as shown in Fig. 7(d). Using the injection locking mechanism between the deployed VCSELs, phase coherency is retained over the entire circuit, allowing for the realization of coherent amplitude addition at the interference stage of each time step. Matrix-vector products are realized by the photoelectric multiplication process in homodyne detectors, while a switched-integrator charge amplifier is employed for the accumulation of the individual *i* products. Despite its simplicity, this architecture requires precise phase control over the individual VCSELs toward retaining phase coherency over the entire circuit, raising stability and scalability issues.

### B. Incoherent MVM architectures

Demarcating from coherent architectures, incoherent PNNs encode the NN parameters into different wavelengths and calculate the network linear operations by employing WDM technology principles and power addition. A pictorial representation of how incoherent architectures operate is given in Fig. 7(e), while some incoherent layouts that have been suggested in the literature and will be thoroughly examined in this Tutorial are illustrated in Figs. 7(f)–7(h). The first implementation that follows this approach has been proposed in Ref. 88, where a team from Princeton initially demonstrated the so-called broadcast-and-weight architecture, elaborated in more detail in Ref. 89. Each input *x*_{i} is imprinted at a designated wavelength *λ*_{i}, essentially making each channel *λ*_{i} a virtual axon of a linear neuron, while all *N* inputs (*λ*s) are typically multiplexed together into a single waveguide when arriving at the linear neuron, as shown in Fig. 10. The main building block of this architecture is the microring resonator (MRR) bank, consisting of *N* MRRs that are embraced by two parallel waveguides and are responsible for enforcing channel-selective weighting values. Each MRR filter is designed such that its transfer function can be continuously tuned, ideally between the values of 0 and 1, achieving controlled attenuation of the signal's power at the corresponding *λ*_{i}. The sign is encoded by exploiting path diversity and balanced photodetection (BPD); assuming that an *a*_{i} fraction of a signal at a certain wavelength exits via the THRU port of the respective MRR module and the remaining (1 − *a*_{i}) part gets forwarded to the DROP port, the subtraction of the respective photocurrents at the BPD yields the weighting value *w*_{i} = 2*a*_{i} − 1 for this specific signal, which can range between −1 and 1, given that *a*_{i} ranges between 0 and 1.
With all different wavelengths leaving through the same DROP and the same THRU port and entering the same BPD unit, the BPD output provides the total weighted sum of the WDM inputs. This architecture allows for one-to-one mapping of the weighting values onto the MRR weight bank, alleviating the programming complexity; yet, it comprises a rather challenging solution, since it necessitates the simultaneous operation and precise control of various resonant devices, raising issues regarding its scalability credentials. An alternative incoherent architecture is proposed by the authors in Ref. 26, demonstrating a PNN that follows the photonic in-memory computing paradigm, where the weighting cells are realized through PCM-based memories. This approach exploits the non-volatile characteristics of the PCM devices, consuming, in principle, zero power when inference operation is targeted, meaning that the weights of the NN do not have to be updated and, thus, are statically imprinted in the PCM weighting modules. This architecture utilizes an integrated frequency comb laser to imprint the multiple inputs of the NN, with each comb line corresponding to a dedicated NN input value. The multi-wavelength signals, after the PCM-based weighting stage that follows the layout depicted in Fig. 7(h), are incoherently combined at a photodiode (PD) in order to produce the linear summation. Although this architecture minimizes the memory movement bottleneck, it requires (i) precise design to timely synchronize the multi-wavelength signals at each PD and (ii) a broad wavelength spectrum of the frequency comb laser for implementing large-scale NNs. An additional incoherent architecture is proposed in Ref. 28 and illustrated in Fig. 11. The authors employ WDM input signals for imprinting the NN input vector, while the realization of the weight matrices is implemented via multiple semiconductor optical amplifiers (SOAs).
They adopt the cross-connect switch principles used in optical communications for constructing the PNN and arrayed waveguide gratings (AWGs) for multiplexing/demultiplexing the signals as well as for reducing the out-of-band accumulated noise of the SOAs. Although it comprises a promising solution toward implementing large-scale PNNs, the deployment of multiple SOAs as single-stage weighting elements trades the scalability credentials for increased power consumption. Finally, an alternative to the coherent/incoherent architectures has been proposed in Ref. 32, where the authors encode the *N* pixels of the classification image (NN input values) directly onto grating couplers through optical collimators, while the weight information of each NN input is imprinted through a dedicated PIN-based optical attenuator. Each weighted input is launched onto a PD, and the resulting photocurrents are combined to generate the linear weighted sum of the neurons. As opposed to the coherent and incoherent layouts, there is no requirement for the encoded signals to be in phase or at different wavelengths, respectively, since every NN input is imprinted on a designated photonic waveguide/axon. This, however, necessitates multiple waveguides/axons for implementing a high-dimensional NN, imposing scalability limitations.
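The broadcast-and-weight sign encoding described above can be sketched as follows (a toy power-domain model with our own helper names):

```python
import numpy as np

# Toy power-domain model of broadcast-and-weight: an MRR keeps an a_i
# fraction of channel lambda_i on the THRU port and sends (1 - a_i) to
# DROP; balanced detection yields the weight w_i = 2 * a_i - 1.
def bpd_weighted_sum(x, a):
    x, a = np.asarray(x), np.asarray(a)
    p_thru = a * x
    p_drop = (1 - a) * x
    return np.sum(p_thru - p_drop)    # photocurrent subtraction at the BPD

x = np.array([1.0, 0.5, 0.8])
a = np.array([1.0, 0.0, 0.75])        # -> weights +1, -1, +0.5
assert np.isclose(bpd_weighted_sum(x, a), np.dot(x, 2 * a - 1))
```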

From all previous implementations, it becomes easily evident that the main challenges and limitations of integrated PNNs relate to their scalability and, thence, to the hardware encoding of the vast number of NN parameters into a photonic chip. In this direction, the authors in Refs. 38, 90, and 91 introduced the optical tiled matrix multiplication (TMM) technique, shown in Fig. 12, which follows the principles of the general matrix multiply (GeMM) method adopted by modern digital AI engines^{92,93} and attempts to virtually increase the size of the PNN without fabricating large photonic circuits. The rationale behind this concept is the following: the weight matrix and the input vector of an NN are divided into smaller tiles, whose dimension is dictated by the available hardware neurons. The resulting tiles are unrolled in the time domain via time division multiplexing (TDM) and then sequentially imprinted into the photonic hardware, allowing in this way for the calculation of the matrix-multiplication operations of an NN layer whose dimension is higher than the one implemented on hardware. The resulting time-unfolded products, produced by the multiple tiles, need to be added together in order to form the final summation. For this reason, the authors in Refs. 44, 91, and 94 utilized a charge accumulation technique, either electro-optically using a low-bandwidth photodetector or electrically via a low-pass RC filter. Besides accumulation, this implementation allows for power-efficient and low-cost ADCs, since it relaxes their sampling rate and bandwidth requirements.
However, the employment of optical TMM and charge accumulation techniques in a PNN engenders specific requirements that need to be addressed: (i) both the input-vector-imprinting and weight-encoding modulators have to operate at the same data rate and (ii) the number of time-unfolded products that can be accumulated is dictated by the deployed capacitance of the RC filter or the bandwidth of the photodetector, implying that, after a certain period, the capacitor voltage/photodetector output should be reset in order to store (e.g., in a local memory) the first set of accumulated summations. The same process is repeated until the total linear operations of the PNN have been calculated.
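A minimal sketch of the TMM idea, with a plain running sum standing in for the charge-accumulation stage and assuming, for brevity, that the tile size divides the layer dimension:

```python
import numpy as np

# Sketch of tiled matrix multiplication (TMM): a large W @ x is executed
# on a small t x t core by unrolling tiles in time and accumulating the
# partial products (a running sum stands in for charge accumulation).
def tiled_matvec(W, x, t):
    n = W.shape[0]                           # assumes t divides n
    y = np.zeros(n)
    for i in range(0, n, t):                 # output-row tiles
        for j in range(0, n, t):             # time-unrolled input tiles
            y[i:i + t] += W[i:i + t, j:j + t] @ x[j:j + t]
    return y

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8))
x = rng.normal(size=8)
assert np.allclose(tiled_matvec(W, x, t=2), W @ x)
```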

Even when employing the proposed techniques, large NN language models, such as ChatGPT^{95} and Megatron,^{96} necessitate billions of trainable parameters, which are challenging to encode not only in silicon photonic hardware but also in current electronic computing engines. Therefore, these models are deployed on High Performance Computers (HPCs) incorporating thousands of interconnected GPUs and/or tensor processing units (TPUs), e.g., the Megatron language model deploys 4400 A100 GPUs,^{96} with each single accelerator comprising hundreds or thousands of nodes. This architectural paradigm has already been transferred to analog electronic accelerator prototypes, with recent multi-core systems already expanding to more than 50 cores,^{15} and can act as a blueprint architectural approach for multi-core photonic accelerators. Interconnection of the constituent photonic cores can benefit from the recent breakthroughs in optical chip-to-chip communications, projected to offer significant energy and latency savings compared to electronic counterparts^{97} while also paving the way for reduced opto-electronic (OE) conversions and even on-switch-fabric workload accelerations.^{98} Finally, the development of commercially viable silicon photonic accelerators has to tackle both the well documented packaging challenges of deploying very large scale integrated photonics^{99,100} and photonic-accelerator-specific packaging and interconnect requirements.^{101} Fortunately, recent breakthroughs in large scale photonic circuitry packaging highlight a feasible developmental roadmap capable of addressing the challenges of (i) laser source integration through employing either heterogeneous integration of III–V components via wafer bonding^{102} and micro-transfer printing^{103} or via photonic wire bonding,^{104} (ii) photonic/electronic system-in-package (SiP) development, with prominent approaches including monolithic integration in silicon photonics^{105} or mainstream
electronic platforms^{97} and 3D integration,^{106} and (iii) photonic accelerator memory access, where either optical interconnectivity between memory and accelerator is promoted^{107} or the novel photonic in-memory-computing paradigm^{26} is adopted, with the weight matrix being non-volatile and, as such, significantly alleviating the memory-access requirements of the accelerator.

## IV. NEUROMORPHIC PHOTONIC HARDWARE TECHNOLOGY

### A. Photonic weighting technologies

Delving deeper into the individual PNN building blocks, we provide an overview of photonic technologies that can be promising candidates toward the realization of NN weight imprinting in an integrated platform. As discussed previously, most PNN demonstrations focused on the weight matrix implementation rather than the NN input vector, since the number of weight values comprises the greatest contributing factor to the hardware encoding of the entire set of NN parameters. For example, assuming a fully connected NN with a topology of 10:10:5, the number of input values is 10, while the total number of weight values is 150, and this difference becomes more pronounced as the NN dimensions/layers increase. Hence, the selection of the photonic weight technology becomes crucial, as it implicitly dictates the size and energy efficiency of the PNN. Photonic weight technologies can be divided into two categories, depending on their volatility characteristics. Non-volatile devices can be used as memories by storing the NN weight values in a PNN, and this information can be retained by statically applying ultra-low or even zero electrical power. These devices can either use memristors heterogeneously integrated with photonic microring resonators^{108} or exploit physical phenomena such as phase change^{26} and ferroelectricity^{109} in order to store and retain the weight values. The employment of non-volatile memory elements is more suitable for equipping PNN inference engines, offering low-power weight encoding with high precision, but, in turn, they impose challenges related to reconfiguration time, fabrication maturity, compactness, and scalability. For example, PCMs, which are mostly based on GST compounds, exhibit up to 5-bit resolution,^{110} but, in turn, their reconfiguration time is restricted to the sub-MHz regime, while most demonstrations operate via optical absorption, limiting their deployment in large-scale circuits.
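The 10:10:5 example can be verified directly:

```python
# Worked count for the fully connected 10:10:5 example above.
layers = [10, 10, 5]
inputs = layers[0]
weights = sum(a * b for a, b in zip(layers, layers[1:]))  # 10*10 + 10*5
assert (inputs, weights) == (10, 150)
```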
Ferroelectric materials, such as barium titanate (BTO), have already validated their non-volatile credentials, retaining their states for over 10 h.^{109} However, to incorporate this device into a PNN, one aspect that still needs to be addressed and optimized is the footprint, since the required PS length for achieving a π phase shift is at least 1 mm,^{109} rendering the implementation of a large-scale PNN rather challenging. On the other hand, when training applications are targeted or the TMM technique has to be applied for executing a high-dimension neural layer over limited PNN hardware, volatile devices take the lead over non-volatile materials since they offer dynamic weight updates. Various TO MZI or MRR^{27,29–31,111–113} devices have been proposed for weight data encoding due to their well-established and mature fabrication process as well as their high bit precision (up to 9 bits^{113}), yet their reconfiguration time is limited to ms values.

Electro-optic devices, such as micro-electro-mechanical systems (MEMS),^{114,115} EAMs,^{36} semiconductor optical amplifiers (SOAs),^{28} ITO-based modulators,^{116} graphene-based phase shifters,^{117} and silicon p-i-n diode Mach–Zehnder modulators (MZMs),^{118} have already been demonstrated and can potentially perform weighting functions with reconfiguration times in the GHz regime, trading, however, bit-precision performance.^{41} Therefore, the selection of the photonic weight technology heavily depends on the targeted NN application (inference, training) and its bit-resolution requirements. Figure 13 juxtaposes the power consumption and footprint of different photonic technology candidates for the realization of the weighting function in PNN implementations, highlighting also their speed capabilities/reconfiguration times.

### B. Photonic activation functions

An indispensable part of the realization of an NN is the activation function, i.e., a non-linear function that is applied at the egress of the linear weighted summation. The non-linearity of the activation function allows the network to generalize better, converge faster, approximate non-linear relationships, and avoid local minima. Despite the relatively relaxed requirements on the properties of the activation functions, i.e., a certain degree of non-linearity and differentiability across the employed range,^{119} DNN implementations have been dominated, due to their higher performance credentials, by the use of the ReLU,^{120} PReLU,^{121} and variations of the sigmoid transfer function, including tanh and the logistic sigmoid.^{122} This dominance has shaped the objectives of photonic NN activation function circuitry: converging to the performance of these specific electrical baseline functions at the highest possible bit rate, as well as achieving a certain level of SNR at their output to safeguard the scalability of the neural circuitry. Previous implementations of non-linearity in photonic NNs have been streamlined across three basic axes: (i) The simplest approach relies on applying the non-linear activation in the electronic domain.
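For reference, the electrical baseline functions named above can be written compactly (standard textbook definitions):

```python
import numpy as np

# Electrical baseline activation functions (standard definitions).
def relu(z):
    return np.maximum(0.0, z)

def prelu(z, a=0.25):
    return np.where(z >= 0, z, a * z)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
assert np.allclose(relu(z), [0.0, 0.0, 2.0])
assert np.allclose(prelu(z), [-0.5, 0.0, 2.0])
assert np.allclose(np.tanh(z), 2 * logistic(2 * z) - 1)  # tanh-sigmoid link
```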
This was achieved through offline implementation in a CPU, following the opto-electrical conversion of the vector-matrix-multiplication product,^{29} by chaining an ADC to a digital multiplier and finally to a DAC,^{123} or by introducing non-linearity at the neuron's egress through a specially designed ADC.^{94} Despite the simplicity and effectiveness of digitally applying the non-linear activation function, the related unavoidable digital conversion induces, in the best case, a latency of several clock cycles for every layer of the NN that employs one.^{29,123} Transferring this induced latency to a photonic NN accelerator would significantly decrease the achieved computation capabilities and, as such, its total performance credentials. (ii) The hybrid electrical-optical approach relies on a cascade of active photonic and/or electronic components, i.e., photodiode-amplifier-modulator/laser, with the non-linear behavior provided by opto-electrical synergies, such as a transimpedance amplifier (TIA), or by the non-linear behavior of the photonic components (e.g., modulators).^{124–129} The hybrid electrical-optical approaches provide a viable alternative to digitally applied activation functions, but, in turn, the noise and latency originating from the cascaded optical-to-electrical-to-optical conversions may still impose a non-negligible overhead on the performance of the photonic NN. (iii) The all-optical approach is based on engineering the non-linearities of optical components to arrive at practical photonic activation functions.
In this context, different mechanisms and materials have been investigated, including, among others, gain saturation in SOAs,^{130,131} absorption saturation,^{132,133} reverse saturable absorption in films of buckyballs (C60),^{133} PCM non-linear characteristics,^{134,135} SiGe hybrid structures in a microring resonator,^{136,137} and periodically poled thin-film lithium niobate (PPLN) nanophotonic waveguides.^{138} All-optical approaches seem to hit the sweet spot between applicability and function, allowing time-of-flight computation and negating the need for costly conversions.

Finally, a recent trend, and probably the most promising for realizing a complete PNN, comprises the development of a programmability feature in both hybrid and all-optical approaches, whereby a single building block can realize multiple activation functions by modifying its operational conditions.^{126,128,129,133,135} These implementations have mainly relied on the different non-linear transfer functions obtained from the same component when altering its operational conditions through specific settings, e.g., the DC bias voltage for a modulator, the DC current for an SOA, the gain of a TIA, and the input optical power and pulse duration for a PCM. Therefore, enabling reconfigurability in PNNs can pave the way toward implementing different AI applications/tasks without requiring any modifications in the underlying hardware.

Yet, the programmability properties of the non-linear activation functions need to be combined with high-speed performance to comply with the frequency update rate of the execution of the linear part. Figure 14 provides an overview of the devices that have been proposed for the implementation of NN activation functions, classifying them according to their implementation (all-optical, electro-optical), their speed performance, and the number of activation functions that they can realize, while Table II summarizes the power consumption and area metrics of state-of-the-art activation function demonstrations.

TABLE II. Power consumption and area metrics of state-of-the-art activation function demonstrations.

| References | Power consumption (mW) | Area (mm^{2}) |
|---|---|---|
| TIA + MZM^{129} | 425 | 7.13 |
| TIA + non-linear cond. + MZM^{128} | 400 | 0.625 |
| SOA^{130} | 1640 | 9.1 |
| EAM^{124} | 17 | N.A. |
| Thin film LiNbO_{3}^{138} | 135 × 10^{−3} | N.A. |
| Saturable absorber^{133} | 40 × 10^{3} | 11.76 × 10^{3} |
| MRR^{126} | 0.1 | 25 |

## V. OPTICS-INFORMED DEEP LEARNING MODELS

Despite the significant energy and footprint advantages of analog photonic neuromorphic circuitry, its use for DL applications necessitates a unique software-hardware NN co-design and co-development approach to account for various factors that are absent in digital hardware and, as such, ignored in current digital electronic DL models.^{125} These include, among others, fabrication variations, optical bandwidth, optical noise, optical crosstalk, limited ER, and non-linear activation functions that deviate from the typical activation functions used in conventional DL models, all of them acting effectively as performance degradation factors.^{139} In this context, significant research effort has been invested into incorporating the photonic-hardware idiosyncrasy into NN training models,^{140} engendering a new photonic hardware-aware DL framework. This reality suggests that PNNs should eventually be better defined as the NN field that combines neuromorphic photonic hardware with optics-informed DL training models, i.e., using light for carrying out all constituent computations while, at the same time, employing DL training models that are optimally adapted to the properties of light and the characteristics of the photonic hardware technology. The research field of hardware-aware DL training models designed and deployed for neuromorphic photonic hardware has led to the introduction of optics-informed DL models,^{31,37,141,142} a term recently coined in Ref. 42, revealing a strong potential in matching and even outperforming digital NN layouts in certain applications.^{42}

Optics-informed DL models have to embed all relevant noise and physical quantities that stem from the analog nature of light and the optical properties of the computational elements into the training process. In order to ease the understanding of the noise sources and physical quantities that impact a photonic accelerator and the related NN implementation challenges, Fig. 15(a) illustrates the implementation of a single neuron axon over photonic hardware, along with the dominant signal quality degradation mechanisms. The input neuron data *x*_{i} are quantized prior to being injected into a DAC, whose bit resolution, for the tens of GSa/s sampling rates required for photonic neuromorphic computing, ranges around 4–8 bits,^{143} i.e., significantly lower than the 32-bit floating-point numbers utilized in digital counterparts.

This disparity is illustrated in Fig. 15(b), with the input NN data and the DAC having a bit resolution of 8 and 2 bits, respectively, resulting in quantization errors denoted as *Q*_{error}. Subsequently, the quantized electrical signal at the DAC's egress is used to drive an optical modulator in order to imprint the information onto the optical domain. In this case, the non-linearity and non-infinite ER of the photonic modulator will modify the incoming signal, with Fig. 15(c) indicatively illustrating the effect of limited ER on the signal representation. It should be pointed out that, in this simple analysis, we assume a weight-stationary layout and, as such, neglect the effect of weight noise. We also approximate the frequency response of the photonic axon, denoted as *T*_{f}, as a low-pass filter, a valid assumption when considering the convolution of the constituent frequency responses of the modulator and the photodiode, which are typically limited to the GHz range. The effect of this low-pass behavior is schematically captured in Fig. 15(d), showcasing the effect of limited bandwidth on the calculated weighted sum. Several noise sources also degrade the SNR of the optical signal traversing the photonic neuron, including, among others, relative intensity noise (RIN), shot noise, and thermal noise. Under the generally valid assumption that the main noise contributions can be approximated as AWGN sources, we consolidate the noise profile of the photonic axon into a single noise factor, correlated with the standard deviation of the zero-mean AWGN added to the signal. Figure 15(e) illustrates the effect of random AWGN added to the neural data that have propagated through the photonic hardware. Finally, an ADC is utilized for interfacing the signal back to the digital domain, introducing again a quantization error, as depicted in Fig. 15(f).
Comparing the finally received digital signal at the ADC output with the original NN input digital signal can clearly indicate the significant differences that may translate into degraded NN performance when relying on conventional DL training models.
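To make the degradation chain of Fig. 15 concrete, the following toy model passes a signal through DAC quantization, a finite-ER modulator, a one-pole low-pass filter, a consolidated AWGN source, and ADC quantization. This is an illustrative sketch, not the authors' simulator; all parameter values (DAC/ADC bits, ER, filter coefficient, noise level) are arbitrary assumptions.

```python
import numpy as np

def quantize(x, bits, lo=0.0, hi=1.0):
    """Uniform quantization to 2**bits levels over [lo, hi]."""
    levels = 2 ** bits - 1
    x = np.clip(x, lo, hi)
    return np.round((x - lo) / (hi - lo) * levels) / levels * (hi - lo) + lo

def photonic_axon(x, rng, dac_bits=4, adc_bits=4, er_db=20.0, alpha=0.3, sigma=0.01):
    """Toy model of the degradation chain of Fig. 15: DAC quantization,
    finite modulator ER, one-pole low-pass filtering (limited bandwidth),
    a single lumped AWGN source (RIN + shot + thermal), and ADC quantization."""
    x = quantize(x, dac_bits)                     # Q_error at the DAC
    floor = 10.0 ** (-er_db / 10.0)               # finite extinction ratio:
    x = floor + (1.0 - floor) * x                 # zero level lifted to the ER floor
    y = np.empty_like(x)
    acc = x[0]
    for i, v in enumerate(x):                     # one-pole IIR low-pass filter
        acc += alpha * (v - acc)
        y[i] = acc
    y = y + rng.normal(0.0, sigma, size=y.shape)  # consolidated AWGN source
    return quantize(y, adc_bits)                  # Q_error at the ADC

rng = np.random.default_rng(0)
x_in = np.linspace(0.0, 1.0, 32)
x_out = photonic_axon(x_in, rng)
```

Comparing `x_out` with `x_in` reproduces, in miniature, the input/output disparity discussed above: the output never returns exactly to the transmitted digital values.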

In this section, we begin by highlighting the challenges and opportunities of using photonic activation functions in NN implementations, followed by an in-depth analysis of the approach and related benefits of incorporating photonic noise, limited bandwidth, limited ER, and quantization in NN training. Finally, we provide a brief overview of related applications and discuss the potential of optics-informed DL models.

### A. Training with photonic activation functions: Challenges and solutions

A first class of approaches relies on fitting the experimentally measured transfer functions of photonic activation devices with differentiable mathematical expressions.^{125,129,130} In turn, the fitted transfer functions are employed in software-implemented neural networks during training. More specifically, the authors in Ref. 130 presented an all-optical neuron that utilizes a logistic sigmoid activation function using a WDM input and weighting scheme. The activation function is realized by means of a deeply saturated, differentially biased Semiconductor Optical Amplifier-Mach-Zehnder Interferometer (SOA-MZI)^{144} followed by an SOA-Cross-Gain-Modulation (XGM) gate. The transfer function of the photonic sigmoid activation function is defined as

$$f(z) = A_2 + \frac{A_1 - A_2}{1 + e^{(z - z_0)/d}},$$

where the parameters *A*_{1} = 0.060, *A*_{2} = 1.005, *z*_{0} = 0.154, and *d* = 0.033 are tuned to fit the experimental observations as implemented on real hardware devices.^{130} Similar fitting approaches have been followed for an electro-optic activation unit comprising a modulator^{145} that converts the data into an optical signal along with a photodiode, as well as for other photonic activation devices,^{146} with the resulting fitted transfer functions likewise employed during training.

Training NNs with such photonic activation functions, however, introduces significant difficulties.^{147,148} More precisely, such difficulties are attributed to the physical properties that force the activation to work on a smaller region of the input domain, leading to a narrow activation window and making it easily saturated. These limitations arise from the fact that physical systems operate within a specific power range, while low power consumption is also a parameter that must be taken into account during hardware design and implementation. These limitations, which are further exaggerated when recurrent architectures are used, dictate the employment of different training paradigms.

Indeed, different activation functions require the use of different initialization schemes^{63,149} to ensure that the input signal will not diminish and that the gradients will correctly back-propagate. Failing to use an initialization scheme that is correctly designed for the activation function at hand can stall the training process or lead to sub-optimal results.

Therefore, even though these photonic neuromorphic implementations can significantly improve the inference speed,^{150} further advances are required in the way that NNs are designed and trained in order to fully exploit the potential of such photonic hardware.

Motivated by the variance preserving assumption,^{147} novel initialization approaches, targeting photonic activation functions, analytically compute the optimal variance during the initialization.^{125} More advanced approaches propose activation agnostic methods applying auxiliary optimization tasks that allow initializing neural network parameters by taking into account the actual data distribution and the limited activation range of the employed transfer functions.^{151}
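As a concrete illustration, the fitted photonic sigmoid of Ref. 130 discussed above can be wrapped in a few lines. The four-parameter logistic form used here is an assumption consistent with the quoted parameter values (*A*_{1} = 0.060, *A*_{2} = 1.005, *z*_{0} = 0.154, *d* = 0.033), not an authoritative reproduction of the published fit:

```python
import numpy as np

# Parameter values quoted in the text (fitted to the hardware of Ref. 130)
A1, A2, Z0, D = 0.060, 1.005, 0.154, 0.033

def photonic_sigmoid(z):
    """Assumed four-parameter logistic fit of the SOA-MZI + XGM transfer
    function: saturates at A1 for small inputs and at A2 for large inputs,
    with the transition centered at Z0 and the slope set by D."""
    z = np.asarray(z, dtype=float)
    return A2 + (A1 - A2) / (1.0 + np.exp((z - Z0) / D))
```

Because the active region is narrow (its width is on the order of *d*), inputs only slightly away from *z*_{0} saturate the function, which is exactly the narrow-activation-window hazard discussed above.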

### B. Noise-aware training in optics-informed DL

Noise constitutes another critical factor in photonic NN implementations.^{31} In hardware-implemented neural networks, limitations also arise due to the noise that emerges from various sources, such as shot noise, thermal noise, and weight read noise,^{31,57,110} as well as due to other phenomena, such as limited bandwidth and extinction ratio. To this end, several approaches model such noise sources as AWGN. In this way, the effect of noise is introduced into the software training process, significantly improving the performance during deployment by exploiting the robustness of ANNs to noise, especially when the noise is taken into account during training.

More specifically, noise sources are simulated in order to train in a noise-aware fashion.^{31} In this way, we exploit the fact that DL models are intrinsically robust to noise, especially when they are first adequately trained to tolerate noise sources, which are modeled as AWGN. Therefore, the output of a neuron that incorporates such a source can be modeled as

$$y = f(\mathbf{w}^\top \mathbf{x} + b) + n, \quad n \sim N(0, \sigma^2),$$

where *f*(·) denotes the employed activation function. By introducing a trainable scaling factor *a*_{i} for each layer, and given that the noise follows *N*(0, *σ*^{2}), altering the scaling factors results in adjusting the initialization variance for each layer. Moreover, in order to optimize the scaling factor *a*_{i}, an auxiliary linear classification layer is required, $\mathbf{W}_i^{class} \in \mathbb{R}^{m^{(i)} \times N_C}$, where *N*_{C} is the number of classes (for a multi-class classification task) or the number of values to regress (for a regression problem). In this way, *a*_{i} and $\mathbf{W}_i^{class}$ are the terms that need to be optimized, while the actual weights and biases of the network are kept fixed. Then, the output of the classification layer can be directly used to perform the task at hand, i.e., either classification or regression. Additionally, an extra regularization penalty term, denoted by Ω(*a*_{i}), is added in an effort to penalize scaling factors that lead to saturation of the activation function. Specifically, after forward passing through the linear term, $\mathbf{z}_i = |a_i|\mathbf{W}_i\mathbf{x} + \mathbf{b}_i$, Ω(*a*_{i}) is calculated by measuring how far the pre-activations **z**_{i} fall outside the active region of the activation function, where *l* and *u* denote the lower and upper bounds of the activation region, *n* and *m* are the fan-in and fan-out, respectively, and max{·} denotes the maximum element in the set. The overall objective $\tilde{J}(\mathbf{W}_i^{class}, a_i; X, y)$ is then formulated as the task loss augmented with the penalty Ω(*a*_{i}), and the scaling factor *a*_{i} and the classification-layer weights are optimized using gradient descent. After the optimization has been completed, the weights of the *i*th layer can be re-initialized using the optimized scaling factor *a*_{i}. All layers of the network, from input to output, are iteratively initialized with the aforementioned procedure. This initialization scheme considers the modeled noise sources and can appropriately adjust the variances accordingly. After this process has been completed, the model is ready to be trained using regular back-propagation. The ability of neural networks to compensate for such phenomena by taking them into account during training has also been used to decompose different noise sources and introduce them during the training process. Indeed, such approaches have also been successfully applied to handle the limited bandwidth and extinction ratio.^{141}
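The core idea of simulating an AWGN source during training can be sketched as follows. This is an illustrative toy, not the exact procedure of the cited works: tanh stands in for the photonic activation, and the network size, noise level, and learning rate are arbitrary assumptions.

```python
import numpy as np

def noisy_forward(W, b, x, sigma, rng):
    """Neuron output with a consolidated AWGN source:
    y = f(W x + b) + n,  n ~ N(0, sigma^2)."""
    z = x @ W + b
    return np.tanh(z) + rng.normal(0.0, sigma, size=z.shape)

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))                       # toy input data
y_true = np.tanh(X @ np.array([0.5, -0.3, 0.8, 0.1]) + 0.2)

W, b = rng.normal(scale=0.1, size=4), 0.0
lr, sigma = 0.05, 0.05
for _ in range(500):                                # noise-aware training loop
    err = noisy_forward(W, b, X, sigma, rng) - y_true
    grad_z = err * (1.0 - np.tanh(X @ W + b) ** 2)  # backprop ignores the noise
    W -= lr * (X.T @ grad_z) / len(X)
    b -= lr * grad_z.mean()

mse = float(np.mean((np.tanh(X @ W + b) - y_true) ** 2))
```

Because the injected noise is zero-mean, its gradient contribution averages out over the training set, and the learned weights end up tolerant to the noise level they were trained with.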

### C. Training with quantized and mixed precision neural networks

Other operations, such as DAC and ADC conversions, have also been shown to affect the accuracy of photonic neural networks. However, considering these phenomena during training results in more robust representations and, in turn, in higher performance during deployment. More specifically, photonic computing involves DAC and ADC conversions along with parameter encoding, amplification, and processing devices, such as modulators, PDs, and amplifiers, which inevitably introduce a degradation of the analog precision during inference, as each constituent introduces a relevant noise source that impacts the electro-optic link's bit resolution properties. Thus, the introduced noise increases when higher line rates are applied, translating to lower bit resolution. Furthermore, being able to operate lower-precision networks during deployment can further improve the potential of analog computing by increasing the computational rate of the developed accelerators while keeping energy consumption low.^{53,153}

Typically, the degradation introduced to analog precision can be simulated through a quantization process that converts a continuous signal to a discrete one by mapping its continuous set to a finite set of discrete values. This can be achieved by rounding and truncating the values of the input signal. Despite the fact that quantization techniques are widely studied by the DL community,^{153–155} they generally target large CNNs containing a large number of surplus parameters with a minor contribution to the overall performance of the model.^{156,157} Furthermore, existing works in the DL community focus mainly on partially quantized models that ignore input and bias.^{154,158} These limitations, which are further exaggerated when high-slope photonic activations are used, dictate the use of different training paradigms that take into account the actual physical implementation.^{31} Indeed, neuromorphic photonics impose new challenges on the quantization of the DL model, requiring the appropriate adaptation of existing methodologies to the unique limitations of photonic substrates. Furthermore, the quantization scheme applied in neuromorphic photonics typically follows a very simple uniform quantization.^{57,159} This differs from the approaches traditionally used in trainable quantization schemes for DL models^{160} as well as mixed precision quantization.^{161}

To this end, several proposed approaches deal with the limited precision requirements before models are deployed to hardware. Some approaches calibrate networks to the limited precision requirements after training the models, namely, post-training quantization methods, offering improvements in contrast to applying the model directly to hardware without taking into account the limited-precision components.^{161} Other approaches take into account the limited precision requirements during training, namely, quantization-aware training methods.^{161,162} The latter methods significantly exceed the performance of post-training approaches, eliminating or restricting the performance degradation between the full-precision and limited-precision models.^{162}

The authors in Ref. 142 proposed an activation-agnostic, photonic-compliant, and quantization-aware training framework that does not require additional modifications of the hardware during inference, significantly improving model performance at lower bit resolution. More specifically, they proposed to train the networks with quantized parameters by applying uniform quantization to all parameters involved during the forward pass; consequently, the quantization error is accumulated, propagated through the network to the output, and affects the employed loss function. In this way, the network is adjusted to lower-precision signals, making it more robust to reduced bit resolution during inference and significantly improving the model performance. To this end, every signal involved in the response of the *i*-th layer is first quantized in a specific floating range $[h_{min}^{i}, h_{max}^{i}] \subset \mathbb{R}$. Then, during the forward pass of the network, a quantization error *ϵ* is injected to simulate the effect of rounding during quantization, while during back-propagation, the rounding is ignored and approximated with an identity function. A comprehensive mathematical analysis regarding quantization-aware training can be found in Refs. 161 and 162. Finally, more advanced approaches targeting novel dynamic precision architectures^{41,163} propose stochastic approaches to gradually reduce the precision of layers within a model, exploiting their position and tolerance to noise, based on theoretical indications and empirical evidence.^{164} More specifically, the stochastic mixed precision quantization-aware training scheme proposed in Ref. 164 adjusts the bit resolutions among layers in a mixed precision manner, based on the observed bit resolution distribution of the applied architectures and configurations. In this way, the authors are able to significantly reduce the inference execution times of the deployed NN.^{41}
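A minimal sketch of the quantization-aware forward path (uniform quantization of the weights, with the rounding that would be ignored in the backward pass, i.e., a straight-through estimator) might look as follows; the bit width, range, and weight values are illustrative assumptions, not values from the cited works.

```python
import numpy as np

def uniform_quantize(x, bits, lo=-1.0, hi=1.0):
    """Uniform quantization of x to 2**bits levels over [lo, hi]."""
    levels = 2 ** bits - 1
    xq = np.clip(x, lo, hi)
    return np.round((xq - lo) / (hi - lo) * levels) / levels * (hi - lo) + lo

def qat_forward(W, x, bits=4):
    """Forward pass with quantized weights, so the quantization error
    propagates to the loss; in the backward pass, the rounding is ignored
    (straight-through estimator), i.e., gradients are taken with respect
    to the full-precision W."""
    return x @ uniform_quantize(W, bits)

W = np.array([0.31, -0.77, 0.05])   # full-precision weights
x = np.ones(3)
y_fp = float(x @ W)                 # full-precision output: -0.41
y_q = float(qat_forward(W, x))      # output computed with 4-bit weights
```

The gap between `y_q` and `y_fp` is exactly the accumulated quantization error that the training loop learns to compensate for.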

### D. Applications

Applying the aforementioned methods allows us to employ PNNs where high frequencies with minimum energy consumption are required, extending DL techniques to a whole new spectrum of applications. Such applications include network monitoring and optical signal transmission, where the high compute rates limit the application of existing accelerators. For example, neuromorphic photonics are capable of operating at very high frequencies and can be integrated on the backplane pipeline of a modern high-end switch, which makes them an excellent choice for challenging Distributed Denial of Service (DDoS) attack detection applications, where high-speed and low-energy inference is required. More specifically, the authors in Refs. 37 and 165 build on the concept of a neuromorphic lookaside accelerator, targeting real-time traffic inspection that searches for DDoS attack patterns during the reconnaissance attack phase, when the attacker tries to determine critical information about the target's configuration. Before deploying a DDoS attack, a port scanning procedure is carried out to track open ports on a target machine. During this procedure, port scanning tools, such as Nmap, create synthetic traffic that can be captured and analyzed by the proposed network, which must keep pace with the huge number of packets arriving at the computation rates of modern high-end switches.

Another domain that can potentially benefit from neuromorphic hardware is communications. Over recent years, there has been increasing interest in employing DL in the communication domain,^{166} ranging from wireless^{167} to optical fiber communications,^{42} exploiting the robustness of ANNs to noise, especially when the noise is taken into account during the training process. Such approaches design the communication system by carrying out the optimization in a single end-to-end process, including the transmitter, receiver, and communication channel, with the ultimate goal of achieving optimal end-to-end performance by acquiring a robust representation of the input message.^{42,168} One such work introduced an end-to-end deep learning fiber communication transceiver design, emphasizing training that examines all-optical activation schemes and the respective limitations present in realistic demonstrations. The authors applied a data-driven noise-aware initialization method^{169} that is capable of initializing PNNs by taking into account the actual data distribution, the noise sources, and the unique nature of photonic activation functions. They focused on training photonic architectures that employ all-optical activation schemes^{130} by simulating their given transfer functions. This allows reducing the effect of vanishing-gradient phenomena and improving the ability of networks coupled with communication systems to withstand noise, e.g., due to the optical transmission link. As experimentally demonstrated, this method is significantly tolerant to the degradation that occurs when easily saturated photonic activations are employed and significantly improves the signal reconstruction of the all-optical intensity modulation/direct detection (IM/DD) system.

## VI. CONCLUSION

Conventional electronic computing architectures face many challenges due to the rapid growth of required compute power, driven by the rise of AI and DNNs, calling for a new hardware computing paradigm that could overcome these limitations and be capable of sustaining this ceaseless compute expansion. In this Tutorial, prompted by the ever-increasing maturity of silicon photonics, we presented the feasibility of PNNs and their potential embodiment in future DL environments. First, we discussed the essential concepts and criteria for NN hardware, examining the fundamental components of NNs and their core mathematical operations. Then, we investigated the interdependence of analog bit precision and energy efficiency of photonic circuits, highlighting the benefits and challenges of PNNs over conventional approaches. Moreover, we reviewed the state-of-the-art PNN architectures, analyzing their perspectives with respect to MVM operation execution, weight technology selection, and activation function implementation. Finally, the recently introduced optics-informed DL training framework was presented, which comprises a novel software-hardware NN co-design approach that aims to significantly improve the NN accuracy performance by incorporating the photonic hardware idiosyncrasy into NN training.

## ACKNOWLEDGMENTS

The work was, in part, funded by the EU Horizon projects PlasmoniAC (Grant No. 871391), SIPHO-G (Grant No. 101017194), and Gatepost (Grant No. 101120938).

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

A.T. and M.M.-P. contributed equally to this work.

**Apostolos Tsakyridis**: Conceptualization (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Writing – original draft (equal). **Miltiadis Moralis-Pegios**: Conceptualization (lead); Investigation (equal); Methodology (equal); Validation (equal); Writing – original draft (equal). **George Giamougiannis**: Conceptualization (equal); Methodology (equal). **Manos Kirtas**: Data curation (equal); Investigation (equal); Software (equal). **Nikolaos Passalis**: Data curation (equal); Investigation (equal); Software (equal). **Anastasios Tefas**: Conceptualization (equal); Methodology (equal); Supervision (equal). **Nikos Pleros**: Conceptualization (equal); Methodology (equal); Supervision (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

### APPENDIX: DIGITAL–ANALOG-PRECISION LOSS

*B*= log

_{2}

*M*, featuring infinite ER, and an average optical power of

*P*

_{avg}. When receiving the signal in a thermal noise dominated optical link, we can evaluate its quality using the Q factor of the outer eye diagram of PAM-M modulation, which can be expressed through

*P*

_{1}is the optical signal’s peak power,

*P*

_{m}is the optical power of the signal’s penultimate level, and

*σ*

_{t}is the standard deviation of the thermal noise. The optical power of the penultimate level of a linear PAM-M signal can be calculated through subtracting the distance between the penultimate PAM-M level from the peak power,

*P*

_{0}= 0 for an infinite ER signal. Replacing (A2) into (A1),

*B*= log

_{2}

*M*, in order to maintain the same signal quality, we need to increase the optical power of the receiver’s input signal.

*N*inputs signals of (PAM-M), and only positive weight values, the optical peak power of the signal emerging at the output can be calculated through

*P*

_{i}is the optical peak power of each constituent signal. When the signals have the same optical peak power

*P*

_{i}, we can transform (A4) to

*N*×

*N*optical matrix multiplier, the input optical signal will experience loss from the front-end optical splitter, which can be calculated from

*S*

_{e_loss}is the excess loss of the splitter, which for a cascaded tree MMI layout can be calculated through

*S*

_{e_loss}=

*MMI*

_{loss}× log

_{2}

*N*. As such, Eq. (A5) can be rewritten as

*MMI*

_{loss}= 0.06 dB

*or MMI*

_{loss}= 0.014 in natural numbers,

^{170}we can deduct that the aforementioned equation can be simplified without significant accuracy loss in

_{2}

*N*× 0.014 ≪

*N*.

*a*

_{prec}= 1, we increase the optical power of the input laser by a factor of the beam splitter loss, i.e.,

*N*, essentially compensating the optical loss of the splitter for each contributing beam as such

*M*

^{out}= 3.7.

*a*

_{prec}=

*N*, where we keep the output equivalent bit precision

*B*

^{out}the same as the input signal resolution

*B*, such as

*B*

^{out}=

*B*. In this case, the signal quality at the output defined through the Q factor remains the same as the input

*P*

_{i},

*a*

_{prec}= 1) and same input–output bit resolution (

*a*

_{prec}=

*N*).

Summing up, we defined *a*_{prec} as the analog–digital precision factor and illustrated two operational regimes:

- *a*_{prec} = 1, when we increase the output power of the laser source to compensate for the splitting loss by a factor of *N*. In this case, the output signal exhibits *M*^{out} = *N* × (2^{*B*} − 1) + 1 levels, i.e., an output bit resolution of *B*^{out} = log_{2}(*N* × (2^{*B*} − 1) + 1), which for integer values of bit resolution can be simplified to *B*^{out} ≈ *B* + log_{2} *N*.
- *a*_{prec} = *N*, where we keep the same bit precision at both the input and the output, trading off the decreased bit precision, as opposed to the full digital precision case, for an input laser power that is lower by a factor of *a*_{prec} = *N*.
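A quick numeric sanity check of the *a*_{prec} = 1 regime, assuming, for illustration, *N* = 4 summed PAM-4 signals:

```python
import math

B, N = 2, 4                  # B = log2(M): PAM-4 inputs; N summed signals
M = 2 ** B
M_out = N * (M - 1) + 1      # distinct levels of the analog sum of N PAM-M signals
B_out = math.log2(M_out)     # equivalent output bit resolution
# 13 output levels, B_out = log2(13), close to B + log2(N) = 4
```

This matches the 3.7-bit output resolution quoted in the derivation above.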

## REFERENCES
