The recent explosive compute growth, mainly fueled by the boost of artificial intelligence (AI) and deep neural networks (DNNs), is currently instigating the demand for a novel computing paradigm that can overcome the insurmountable barriers imposed by conventional electronic computing architectures. Photonic neural networks (PNNs) implemented on silicon integration platforms stand out as a promising candidate to empower neural network (NN) hardware, offering the potential for energy-efficient and ultra-fast computations through the utilization of the unique primitives of photonics, i.e., energy efficiency, THz bandwidth, and low latency. Thus far, several demonstrations have revealed the huge potential of PNNs in performing both linear and non-linear NN operations with unparalleled speed and energy consumption metrics. Transforming this potential into a tangible reality for deep learning (DL) applications requires, however, a deep understanding of the basic PNN principles, requirements, and challenges across all constituent architectural, technological, and training aspects. In this Tutorial, we initially review the principles of DNNs along with their fundamental building blocks, analyzing also the key mathematical operations needed for their computation in photonic hardware. Then, we investigate, through an intuitive mathematical analysis, the interdependence of bit precision and energy efficiency in analog photonic circuitry, discussing the opportunities and challenges of PNNs. Subsequently, a performance overview of PNN architectures, weight technologies, and activation functions is presented, summarizing their impact on speed, scalability, and power consumption. Finally, we provide a holistic overview of the optics-informed NN training framework that incorporates the physical properties of photonic building blocks into the training process in order to improve the NN classification accuracy and effectively elevate neuromorphic photonic hardware into high-performance DL computational settings.

During the past decade, the relentless expansion of artificial intelligence (AI) through deep neural networks (DNNs) has been driving the need for high-performance computing and time-of-flight data processing. Conventional digital computing units, which are based on the well-known Von-Neumann architecture1 and inherently rely on serialized data processing, face daunting challenges in executing emerging DNN workloads. Von-Neumann architectures comprise a centralized processing unit (CPU), which is responsible for executing all operations (arithmetic, logic, and control) dictated by the program’s instructions, and a separate random-access memory (RAM) unit that stores all necessary data and instructions. The communication between the CPU and memory is realized via a shared bus that is used to transfer all data between them, implying that they cannot be accessed simultaneously. This leads to the well-known Von-Neumann bottleneck,2 where the processor remains idle for a certain amount of time during memory data access. On top of that, the need for moving data between the CPU and memory (via the bus) requires charging/discharging of metal wires, limiting in this way both the bandwidth and the energy efficiency due to capacitance and Joule heating,3 respectively.

There have been numerous demonstrations toward overcoming these effects, including, among others, caching, multi-threading, and new RAM architectures and technologies (e.g., ferroelectric RAMs4 and optical RAMs5–9), with the ultimate target being energy-efficient and high-speed CPU–memory data movement. None of these solutions seems, however, to be capable of coping with the computational and energy demands of DNNs, revealing a need for shifting toward specialized computing hardware architectures. In this endeavor, highly parallelized accelerators have been developed, including graphics processing units (GPUs), application specific integrated circuits (ASICs), and field programmable gate arrays (FPGAs), with GPUs and ASICs being, until now, the dominant hardware computing engines for DNN implementations. Specifically, GPUs leverage their hundreds of cores toward accelerating the matrix multiplication operations of DNNs, which are the most time- and power-consuming computations.10 Moreover, they have dedicated non-uniform memory access architectures (e.g., video RAMs) that are (i) programmable, meaning that the stored data can be selectively accessed or deleted, (ii) faster than CPU counterparts, and (iii) located very close to their cores, reducing in this way the distance between computing and data. Yet, despite GPUs’ unrivaled parallelization ability that ushers in exceptional computational throughput, the need for data movement still remains and sets a fundamental limit on both speed and energy efficiency.

Toward totally eradicating the constraints of data movement, recent developments in analog computing through memristive crossbar arrays11–13 follow an alternative approach, called in-memory computing. This scheme allows for certain DNN computational tasks (e.g., weighting) to be performed within the memory cell itself, seamlessly supporting multiplication operations without requiring any data transfer.14 The recent 64-core analog-in-memory compute (AiMC) research prototype of IBM15 and the commercial entry of Mythic’s AiMC engine have validated the energy benefits that can originate from in-memory computing compared to Von-Neumann architectures. These implementations employ computational memory devices, including resistive RAMs (RRAMs) and phase change materials (PCMs), where the application of a voltage results in a change of the material’s property, achieving in this way both data storage and computing. However, issues related to memory instability and the finite resistance of the crossbar wires may lead to computational errors and crossbar size limitations, respectively, making it hard to reach the computational throughput and parallelization level of GPUs.12,14 Similar to in-memory computing, neuromorphic computing comprises an alternative non-Von-Neumann architecture that is inspired by the structure and function of the human brain, meaning that both memory and computing are governed by artificial neurons and synapses. Neuromorphic chips mostly employ spiking neural networks (SNNs) to emulate the behavior of biological neurons, which communicate through discrete electrical pulses called spikes. SNNs can process spatiotemporal information more efficiently and accurately than conventional neural networks16 as they respond to changes in the input data in real time. Additionally, they rely on asynchronous communication and event-driven computations, where, typically, only a small portion of the entire system is active at any given time, while the rest is idle, resulting in low-power operation.17 However, neuromorphic computing is not currently being used in real-world applications, and there are still a wide variety of challenges in both algorithmic and application development18 that need to be addressed toward outperforming conventional deep learning (DL) approaches. At the same time, the underlying electronic hardware in analog compute engines continues to rely heavily on complementary metal–oxide–semiconductor (CMOS) electronic transistors and interconnects, whose speed and energy efficiency are dictated by their size. Taking into account that transistor scaling has slowed down during the last decade, since we are approaching its fundamental physical size limits,19 there is no significant performance margin left to be gained. In parallel, the requirement for multiple connected neurons yields increased interconnect lengths in analog in-memory computing schemes, which ultimately enforce low line-rate operation in order to avoid increased energy consumption. All this indicates that a radical departure from traditional electronic computing systems toward a novel computational hardware technology has to be realized in order to fully reap the benefits of the architectural shift toward non-Von-Neumann layouts.

Along this direction, integrated photonics has emerged as a promising candidate for the hardware implementation of DNNs; the analog nature of light is inherently compatible with analog compute principles, while low-energy and high-bandwidth connectivity is the natural advantage of optical wires. On top of that, photonics can offer multiple degrees of freedom, such as wavelength, phase, mode, and polarization, being suitable for parallelizing data processing through multiplexing techniques20,21 that have been traditionally employed in optical communication systems for transferring information at enormous data rates (>Tb/s). The constantly growing deployment of optical interconnects and their rapid penetration into smaller network segments have also been the driving force for the impressive advances witnessed in photonic integration and, particularly, in silicon photonics; silicon photonic integrated circuits (PICs) with thousands of photonic components can nowadays be fabricated in a single die,22 forming a highly promising technology landscape for optical information processing tasks at chip-scale. Nevertheless, compared with electronic systems that host billions of transistors, thousands of photonic components may not be sufficient to build a vast universal hardware engine for generic applications. Yet, the constant progress in the field of integrated optics coupled with the rapid advances in fabrication and packaging can eventually shape new horizons in this field. This has raised expectations for an integrated photonic neural network (PNN) platform that can cope with the massively growing computational needs of Deep Learning (DL) engines, where computational capacity requirements double every 4–6 months.23 In this realm, several PNN demonstrations have been proposed,24–41 employing light both for data transfer and computational functions and shaping a new roadmap for orders of magnitude higher computational and energy efficiencies than conventional electronic counterparts. At the same time, they have highlighted a number of remaining challenges that have to be addressed at the technology, architecture, and training level, designating a bidirectional interactive relationship between hardware and software: the photonic hardware substrate has to comply with existing DL models and architectures, but at the same time, the DL training algorithms have to adapt to the idiosyncrasy of the photonic hardware. Integrated neuromorphic photonic hardware extends along a pool of architectural and technology options, the main target being the deployment of highly scalable and energy-efficient setups that are compatible with conventional DL training models and suitable to safeguard high accuracy performance. In parallel, the use of light in all its basic computational blocks inevitably brings a number of new physical and mathematical quantities into NN layouts,41,42 such as noise and multiplication between “noisy” matrices, as well as mathematical expressions for non-typical activation responses, which are not encountered in conventional DL training models employed in the digital world. This calls for an optics-informed DL training model library; the term “optics-informed” has been recently coined by Roumpos et al.42 in order to describe the hardware-aware characteristics of DL training models and declare their alignment with the nature of optical hardware, since such models take into account the idiosyncrasy of light and photonic technology.
However, despite the advances pursued in both the hardware and software segments, the complexity of photonic processing is still far behind electronics with respect to both its algorithmic and its hardware capabilities. Hence, the field of PNNs is not currently on a path to replace conventional electronic AI engines but rather targets applications where photonics can offer certain benefits over its electronic counterparts. This focus lies mainly on inference, since inference dominates the power and computational resource requirements of certain applications, such as modern natural language processing (NLP) models, where inference workloads are estimated to consume 25×–1386× more power than training.43 Other deployment scenarios include latency-critical applications related to cyber-security in DCs,37 non-linearity compensation in fiber communication systems,39 acceleration of DNN matrix multiplication operations at 10s of GHz frequency update rates,25 decentralization of the AI input layer from core AI processing for edge applications,44 and, finally, the solution of non-linear optimization problems in, e.g., autonomous driving and robotics.40

In this Tutorial, we aim to provide a comprehensive understanding of the underlying mechanisms, technologies, and training models of PNNs, highlighting their distinctive advantages and addressing the remaining challenges when compared to conventional electronic approaches. This Tutorial forms the first attempt toward addressing the field of PNNs for DL applications within a hardware/software co-design and co-development framework: with the emphasis being on integrated PNN deployments, we define and describe the PNN fundamentals, taking into account both the underlying chip-scale neuromorphic photonic hardware and the necessary optics-informed DL training models. This paper is structured as follows: In Sec. II, we introduce the basic definitions and requirements for NN hardware, analyzing the basic NN building blocks (artificial neuron, NN models) as well as the main mathematical operations required for the hardware implementation of NNs, i.e., multiply and accumulate (MAC) and matrix-vector-multiplication (MVM) operations. The same section also provides an intuitive analysis of the bit resolution and energy efficiency trade-offs of analog photonic circuits, discussing the advantages and opportunities of PNNs. In Secs. III and IV, a review of the basic computational photonic hardware technologies is provided, with a summary of photonic MVM architectures and weight technologies in Sec. III and activation functions in Sec. IV. Finally, Sec. V is devoted to the challenges and requirements in the photonic DL training sector, providing a solid definition of optics-informed DL models and summarizing the relevant state-of-the-art techniques and demonstrations.

Merging photonics with neuromorphic computing architectures requires a solid knowledge of the underlying NN architectures, building blocks, and mechanisms. The most basic definitions and requirements are briefly described below.

An artificial neuron comprises the main operational unit in a neural network, with the operation of the basic McCulloch–Pitts neuron model45 being mathematically described by $y = \varphi\left(\sum_i w_i x_i + b\right)$, where y is the neuron output, φ is an activation (non-linear) function, $x_i$ is the ith element of the input vector x, $w_i$ is the weight factor for the input value $x_i$, and b is a bias. The linear term $\sum_i w_i x_i$ represents the weighted addition and is typically carried out by the so-called linear neuron part, which comprises (i) an array of axons, with every ith axon denoting the transmission line that provides a single $x_i \times w_i$ product, (ii) an array of synaptic weights, with every ith weight $w_i$ located at the ith axon, and (iii) a summation stage. The non-linear neuron part comprises the activation function φ, with the rectified linear unit (ReLU), sigmoid, pooling, etc., being among the most widely employed activation functions in current DL applications.46

For a layer of M interconnected neurons, the output of these neurons can be expressed in vector form as $\mathbf{y} = W\mathbf{x} + \mathbf{b}$, where x is an input vector with N elements, W is the M × N weight matrix, b is a bias vector with M elements, and y is a vector made of M outputs. Figure 1(a) depicts a schematic layout of a biological neuron that can be mathematically described via the artificial neuron shown in Fig. 1(b), where the dendrites correspond to the weighted signals, the nucleus corresponds to the summation and activation function, and the axon terminals are responsible for providing the inputs to the next neuron, while Fig. 1(c) depicts the resulting layout when utilizing artificial neurons to structure a DNN with a single input layer, a single output layer, and one or more hidden layers.
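To make this notation concrete, the minimal NumPy sketch below (all names and values are illustrative, with a sigmoid chosen as the activation) evaluates a single McCulloch–Pitts neuron and a fully connected layer of M neurons:

```python
import numpy as np

def sigmoid(z):
    # Non-linear activation phi
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # y = phi(sum_i w_i * x_i + b): weighted addition followed by activation
    return sigmoid(np.dot(w, x) + b)

def layer(x, W, b):
    # Vector form for M interconnected neurons: y = phi(W @ x + b),
    # with x an N-element input vector and W an M x N weight matrix
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
N, M = 4, 3
x = rng.random(N)              # N inputs
W = rng.normal(size=(M, N))    # M neurons, N weights each
b = rng.normal(size=M)
print(layer(x, W, b))          # M outputs
```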

FIG. 1.

(a) Schematic representation of a biological neuron. (b) Basic model of an artificial neuron comprising a linear part (Σ) and a non-linear part (φ). Its fundamental operations are (i) the weighted addition (or synaptic operation) and (ii) the non-linear activation function. (c) Example of a multi-layer NN comprising several neurons, one input layer, several hidden layers, and one output layer.


Part of the unprecedented success of NNs in tackling complex computational problems can be attributed to the plethora of NN models, capable of synergizing anywhere from several hundred up to billions of artificial neurons into versatile computational building blocks. In this section, we will give an overview of several NN models based both on their popularity and success in resolving standardized benchmarking problems, as well as on their compatibility with hardware implementation in silicon photonic platforms.

NN models can be broadly classified into different categories based on the following:

  • Data flow pattern. Considering the direction of the information flow, NN models can be grouped into two categories: in feed-forward NNs, the signals travel exclusively in one direction, usually from left to right, while in feed-back NNs, the signals travel in both directions, allowing neurons to receive data from neurons belonging to subsequent or even the same layer. Figures 2(a)–2(e) depict five popular types of NN models, grouped based on their data flow into feed-forward and feed-back implementations, with the latter being mostly utilized for resolving temporal and ordinal workloads as the network effectively retains memory of the previous samples.

  • Interconnectivity. The interconnection density between neurons of subsequent layers or even the same layer can be used to classify NN models into dense and sparse implementations. Figure 2(a) depicts a typical DNN model, where each neuron of the first layer is connected to all the neurons of the subsequent layer, usually denoted as a fully connected layout, while the neurons of the second layer are interconnected to only two neurons of the subsequent layer, corresponding to a sparse layout. While high interconnection density allows the NN to extract more complex relationships between the input data, the cost associated with the increasing number of weights, scaling with $O(N^2)$ complexity for an N×N interconnectivity, promotes the use of sparse models.

  • Structural Layout. Employing a specific layout can endow neural network models with unique attributes. A typical example of such a model, specifically a single layer of a convolutional NN,47 is illustrated in Fig. 2(b). This architectural approach, widely employed in image recognition tasks due to its spatio-local feature extraction capabilities, promotes weight re-use and, as such, relaxes the computational requirements by applying the same weight kernel, i.e., a set of weight values, across the input data values. Another typical NN layout, depicted in Fig. 2(c), is the NN autoencoder, a model associated with data encryption due to its data-compression layout that effectively reduces the dimensionality of the input data in its central layers and enjoys wide employment in non-linearity compensation in optical communications.42 Finally, Fig. 2(d) illustrates the most common feed-back NN model, called recurrent, while Fig. 2(e) depicts a special type of recurrent NN typically denoted as reservoir computing,48 where a fixed-connectivity recurrent layer is placed between the input and the output layer. The relaxed training requirements, as only the output layer has to be trained, along with the ease of constructing time-delayed reservoir circuitry in silicon photonics platforms, have led to impressive demonstrations in optical channel equalization applications.

FIG. 2.

Different types of neural network models categorized in (a)–(c) feed-forward and (d) and (e) feed-back configurations.

As highlighted in Subsections II A and II B, two types of mathematical operations are required for the hardware implementation of multi-layer NN models: (i) the linear MAC operation that constitutes the weighted summation at a single neuron ingress and (ii) the non-linear activation function employed at the neuron’s egress. Given that an N × N neural network layout comprises N neurons with N weighting elements per neuron, the two operations scale with $O(N^2)$ and $O(N)$ complexity, respectively. As such, MAC operations comprise the most significant computational burden and are usually correlated with the computational capacity of the NN model.49 A single MAC operation calculates the product of two numbers and adds the result to an accumulator. Defining a as the accumulation variable that holds the weighted input sum $\sum_{n=1}^{i-1} w_n \times x_n$, the operation can be described by the following form:
$$a \leftarrow a + w_i \times x_i. \tag{1}$$
A single artificial neuron with N inputs creates a weighted sum that can be broken down into a series of N parallel MAC operations, while a fully connected neural layer with M interconnected neurons and N inputs per neuron supports a total number of M × N parallel MACs. A typical digital electronic MAC unit layout that realizes the mathematical formula of Eq. (1) is depicted in Fig. 3(a), showing that the partial 2N-dimensional weighted sum, stored in the accumulator, is fed back to the summation circuit to be added to the subsequent partial weighted sum produced by the next N input and N weight values; a minimal software sketch of this accumulation loop is given after the list below. Figure 3(b) illustrates an example of a MAC operational unit implemented in the analog electronic domain, where an input vector x is imprinted in the electrical domain through the use of DACs and subsequently broadcast to an array of [i, j] synaptic weights implemented through variable resistive elements arranged in a Xbar configuration.50,51 By controlling the impedance of the variable resistive elements, the electrical current emerging at every column output of the Xbar provides, based on Kirchhoff’s law, the weighted summation of the column inputs,51 e.g., $y_1 = \sum_{k=1}^{i} x_k \cdot w_{k,1}$. Careful examination of the two MAC implementations can provide us with some significant insight into the differences between analog and digital computing:
  • Value representation and information density. Digital implementations use discrete values of physical variables, typically employing two discrete levels that are correlated with the upper and lower switching voltage of a transistor and are usually denoted as 0 and 1. On the other hand, analog computing employs values across the whole range of physical variables, allowing in this way for the representation of several equivalent bits of information within the same time unit. A direct consequence of this value representation form is the noise robustness required of the computational system, which will be discussed in more detail in Subsection II D, especially for optical implementations.

  • Computational primitives. While digital computing is solely based on the mathematics and respective deployments of Boolean logic-based circuitry, analog computing can employ the physical laws of the underlying hardware (e.g., capacitors and resistors52) to implement a variety of mathematical operations, unlocking a quiver of functionalities described by the exploited physical phenomena.

  • Latency. A large number of devices is required to implement a specific mathematical operation using Boolean logic; e.g., a digital computational building block implementing 8-bit parallel multiplication requires ∼3000 transistors.53 This forms a latency-critical computational path that is defined by the maximum register-to-register delay and effectively limits the maximum achievable operating frequency and, consequently, the achievable latency.54 This has led to the adoption of multi-threading and multi-core setups for parallel processing in modern computing systems, investing in architectural innovations toward system acceleration. On the other hand, analog systems are inherently built as parallel computational systems, giving them a significant edge in latency-critical tasks while requiring ∼500× fewer components,53 on average, than digital electronic circuits for multiplication operations.
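As a point of reference, the accumulator semantics of Eq. (1) can be written out explicitly; the minimal sketch below (illustrative names throughout) computes one neuron’s weighted sum as a series of sequential MACs, mirroring what the digital unit of Fig. 3(a) realizes:

```python
def mac_dot(x, w):
    # Series of MAC operations: a <- a + w_n * x_n, per Eq. (1)
    a = 0.0
    for x_n, w_n in zip(x, w):
        a += w_n * x_n   # one multiply-accumulate
    return a

# A fully connected layer with M neurons performs M x N such MACs
x = [0.5, 0.25, 1.0]
w = [0.2, -0.4, 0.8]
print(mac_dot(x, w))  # 0.1 - 0.1 + 0.8 = 0.8
```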

FIG. 3.

Implementation of MAC operations in (a) typical digital accelerators, (b) an analog electronic approach employing Kirchhoff’s law and a Xbar configuration, and (c) an indicative analog photonic approach employing coherent interference of light beams.


These advantages, synergized with the primitives of photonic devices, have fueled the rise of optical MVM hardware, with an indicative example of an analog photonic dot-product implementation given in Fig. 3(c). In this approach, the input and/or weight information is encoded in one of the underlying physical variables of the photonic system, i.e., the amplitude, phase, polarization, or wavelength of a light beam, while the physical primitives of optical phenomena are utilized for the mathematical operations: in this particular example, the loss experienced during the transmission of light via the weight-encoding physical system provides the multiplication operation, while interference of light waves provides the summation mechanism. Harnessing the advantages of light-based systems, i.e., multiple degrees of freedom for encoding information in time, space, and wavelength, low propagation loss, low electromagnetic interference, and high-bandwidth operation, holds the credentials to surpass analog electronic deployments in large-scale photonic accelerators.55 It is noteworthy, though, that both electronic and photonic analog compute engines necessitate the use of digital-to-analog converter (DAC) and analog-to-digital converter (ADC) modules for interfacing NN input and output modules with the digital world.
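A toy numerical model of the coherent dot product of Fig. 3(c) can make these mechanics tangible; the sketch below is deliberately idealized (field-amplitude encoding, weights as transmissions in [0, 1], a lossless combiner, and no noise are all simplifying assumptions):

```python
import numpy as np

def photonic_dot_field(x, w):
    # Encode inputs x_i in the electric-field amplitude of N beams,
    # imprint weights w_i in [0, 1] as field transmissions (loss),
    # and sum the fields in a lossless N:1 interference stage.
    E = np.asarray(x, dtype=float)              # input fields
    E_weighted = np.asarray(w) * E              # multiplication via transmission
    E_out = E_weighted.sum() / np.sqrt(len(x))  # combiner (1/sqrt(N) per port)
    return E_out                                # field-domain dot product / sqrt(N)

x = [0.9, 0.4, 0.1, 0.6]
w = [0.5, 1.0, 0.25, 0.8]
E_out = photonic_dot_field(x, w)
# A square-law photodiode reads |E_out|^2; recovering the linear sum requires,
# e.g., coherent (homodyne) detection or a calibrated square-root readout.
print(E_out * np.sqrt(len(x)), np.dot(x, w))  # both ~1.355
```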

Migrating MAC operations from digital circuitry, where high-precision (i.e., 16-, 32-, or 64-bit) floating-point representations are utilized, to the analog domain necessitates a basic understanding of the physical representation and energy-efficiency trade-offs of analog photonic circuitry. Given the continuous nature of analog variables, as opposed to the usually two-level discretized variables of digital systems, representing high-precision numerical quantities in an analog system necessitates significantly higher signal-to-noise ratios (SNRs). This requirement shapes an optimal bit-resolution/energy-efficiency operational regime for analog photonic computing systems.56 In this subsection, we will discuss the precision limitations of analog photonic computing, outlining its optimized operational trade-offs in the shot-noise limited regime vs state-of-the-art digital MAC circuitry.

In digital implementations, where scaling the bit resolution is achieved by parallel circuitry operating at discrete binary levels, we can make the simplified assumption53 that the power consumption of a digital MAC scales linearly with the bit resolution, such as
$$P_{dig} = P_D^{\text{single-bit}} \times b_{dig}, \tag{2}$$
where $P_D^{\text{single-bit}}$ is the power consumption of a single-bit MAC operation and $b_{dig}$ is the bit resolution. For photonic implementations, we use the correlation between the achieved bit resolution and the standard deviation of the system’s total noise level, as this has been defined in Ref. 57,
$$b_{phot} = \log_2\!\left(\frac{I_{max} - I_{min}}{\sigma_{TOTAL}}\right), \tag{3}$$
where $I_{max} - I_{min}$ defines the range between the maximum and minimum electrical current values generated at the photodiode (PD) output and $\sigma_{TOTAL}$ is the standard deviation of the total noise of the photonic link, under the generally valid assumption that the link is dominated by additive white Gaussian noise (AWGN). Assuming that the link operates at the shot-noise limit of the photodiode, the total noise of the system equals the shot noise, and we have
$$\sigma_{TOTAL} = \sigma_{shot} = \sqrt{2 h \nu B R\, I_{avg}}, \tag{4}$$
where h is the Planck constant, B is the employed bandwidth, ν is the lightwave frequency (λ = 1550 nm or ν = 193.41 THz), and $I_{avg}$ is the average electrical current generated at the photodiode output, which relates to the average optical power $P_{avg}$ entering the photodiode via $I_{avg} = R P_{avg}$, with R being the photodiode responsivity. It should be noted that the aforementioned calculation is an approximation of the actual shot noise of the photonic link, as its value depends on the number of photons reaching the receiver in any given computing interval.55,56 Taking into account that $I_{max} - I_{min} = R(P_{max} - P_{min}) = R \times OMA$, with OMA denoting the optical modulation amplitude, and assuming that the input signal has a duty cycle of 50% and an infinite extinction ratio (ER), the OMA turns out to equal $OMA = 2 \times P_{avg} = 2 \times I_{avg}/R$, so that $I_{max} - I_{min} = 2 \times I_{avg}$. To calculate the theoretical limit of optical energy efficiency, we consider an ideal optical MAC unit, assuming only unitary and lossless layouts, where a link performs, in principle, a lossless MAC operation58 and is powered by a laser that consumes an average power of $P_{laser}$ and has a wall-plug efficiency of a = 0.2; the average optical power emitted by the laser is then $a \times P_{laser}$ and is required to be greater than or equal to $P_{shot}$ when operation at the shot-noise limit is required. Solving Eq. (3) for $\sigma_{TOTAL} = \sigma_{shot}$ and substituting the resulting expression into Eq. (4), while replacing $I_{avg}$ with $R \times P_{avg}$ and requiring $P_{avg} = P_{shot}$ due to the operation at the shot-noise limit, $P_{shot}$ and, consequently, the consumed laser power $P_{laser}$ can be calculated as
$$P_{laser} = \frac{P_{shot}}{a} = \frac{h \nu B \, 2^{2 b_{phot}}}{2a}. \tag{5}$$
Considering then a 1:1 relationship between the available bandwidth, expressed in GHz, and the obtained MAC/s performance and dividing both sides of Eq. (5) by B, we can transform Eq. (5) into the shot-noise limited energy efficiency per MAC described in the following equation:
$$E_{MAC} = \frac{P_{laser}}{B} = \frac{h \nu \, 2^{2 b_{phot}}}{2a}. \tag{6}$$
Figure 4(a) juxtaposes the energy efficiency of a digital MAC circuit obtained from Eq. (2), for a reference value of 46 fJ for an 8-bit MAC,59 with the shot-noise limited energy efficiency of a single photonic circuit calculated by Eq. (6) when assuming a unity responsivity (R = 1). Another important metric that could be utilized for this comparison is the Landauer limit,60 which effectively defines the minimum energy required for a digital irreversible computation and, as such, could be employed as the theoretical minimum energy for digital computation, with an analysis and relevant metrics accessible in the supplementary material of Ref. 61. While a more detailed discussion of the achieved energy efficiency of a photonic accelerator and its related advantages will be provided in Sec. II E, it becomes evident that, in order to harness the analog-architecture-derived advantages of photonic implementations, the bit resolution of a photonic accelerator will have to reside at lower bitwidths than its digital equivalent. Moreover, as analog systems encode the data information along single physical variables, they have to migrate from floating-point to fixed-point representations. This is dictated both by the lack of bit-resolution depth that would allow splitting the represented number into mantissa and exponent parts and by the nature of the computation, which requires physical number representations. In order to ease the understanding of the two different representation schemes, Fig. 4(b) schematically describes the two approaches.
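Equations (2) and (6) can be evaluated directly; the short script below (a sketch using the 46 fJ/8-bit reference from the text, with all other choices illustrative) reproduces the trend of Fig. 4(a):

```python
H = 6.62607015e-34      # Planck constant (J s)
NU = 193.41e12          # lightwave frequency at 1550 nm (Hz)
A_WPE = 0.2             # laser wall-plug efficiency

def digital_mac_energy(bits, e_ref=46e-15, b_ref=8):
    # Eq. (2): linear scaling from the 46 fJ reference at 8 bits
    return e_ref * bits / b_ref

def photonic_mac_energy(bits):
    # Eq. (6): shot-noise limited energy per MAC, R = 1 A/W
    return H * NU * 2.0**(2 * bits) / (2 * A_WPE)

for b in range(1, 11):
    print(f"{b:2d} bits: digital {digital_mac_energy(b)*1e15:7.1f} fJ/MAC, "
          f"photonic {photonic_mac_energy(b)*1e15:10.4f} fJ/MAC")
```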
FIG. 4.

(a) Comparison of digital MAC energy efficiency vs shot-noise limited photonic MAC. Even though this graph provides insight into the optimized bitwidth of photonic implementations, it should be noted that photonic accelerators can surpass this efficiency due to favorable scaling laws as will be discussed in Subsection II E. (b) Example of fixed vs floating point arithmetic.

Fortunately, for analog MAC implementations, we can also take advantage of the reduced, as compared to digital, precision required at the circuit’s output. More specifically, considering an analog MAC implementation where N signals of bit resolution $X_{res}$ are summated, the full digital resolution at the chip’s output would be defined through
$$Y_{res} = X_{res} + \log_2 N. \tag{7}$$
In a photonic computing scenario, assuming a layout similar to the one depicted in Fig. 3(c) with a lossless weight matrix, the full digital precision can only be retained when a multiplication stage with a gain of N is employed to compensate for the 1/N splitting ratio of the laser source at the circuit’s ingress. However, in analog implementations, we retain the same bit precision across the different neural layers, and as such, the output of the summation should have the same bitwidth as the input neuron values. Consequently, the required SNR at the analog photonic output would be lower than the full digital equivalent, implying that we can keep the minimum optical power difference between adjacent bits (MOPB) constant even when reducing the laser optical power. Defining this digital-to-analog precision loss55,56 as $a_{prec}$, we can highlight two interesting operational regimes, schematically illustrated in Figs. 5(a) and 5(b). Specifically, in Fig. 5(a), a light beam, originating from a laser source and consuming an optical power of P, is split in a 1:N splitter (N = 4) into four equivalent beams that subsequently get modulated in the X1–X4 optical modulators. With the four inputs having a bit resolution of $X_{res}$ = 2, their summation, using Eq. (7), has a full digital precision of $Y_{res} = 2 + \log_2 4 = 4$, and the system corresponds to $a_{prec}$ = 1. On the other hand, in Fig. 5(b), we set the output bit resolution to $Y_{res}$ = 2, and hence, assuming only positive weights and a lossless weight matrix, we can maintain the same MOPB at the output summation even when reducing the injected optical laser power to P′ = P/N = P/4, while the system now corresponds to $a_{prec}$ = N = 4. A more thorough analysis is given in the Appendix.
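The bit-growth bookkeeping of Eq. (7) and the power saving of the $a_{prec}$ = N regime can be captured in a few lines (a sketch assuming, as in the figure, a lossless weight matrix and a constant MOPB):

```python
import math

def full_output_bits(x_res, n_inputs):
    # Eq. (7): summing N X_res-bit values grows the width by log2(N) bits
    return x_res + math.ceil(math.log2(n_inputs))

N, Xres = 4, 2
Yres_full = full_output_bits(Xres, N)    # 4 bits, a_prec = 1
Yres_trunc = Xres                        # keep the layer bitwidth, a_prec = N
a_prec = 2 ** (Yres_full - Yres_trunc)   # = N = 4
print(Yres_full, Yres_trunc, a_prec)
# Keeping the minimum optical power step (MOPB) constant, the laser
# power can be reduced by the same factor: P' = P / a_prec = P / 4
```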
FIG. 5.

Illustrative example of the effect of output precision in photonic neural networks. Four 2-bit inputs (X1–X4) are summated for two different digital-to-analog precision losses. (a) For full digital precision, we have Yres = 4. (b) For the same input and output precision bitwidths, Yres = 2. Assuming a lossless weight matrix and the same receiver sensitivity, we can lower the input laser power by N (P′ = P/4).


In this context, NNs are uniquely suited for analog computing, as empirical research has shown that they can operate effectively with both low-precision and fixed-point representations, with inference models working nearly as well with 4–8 bits of precision in both activations and weights, and sometimes even down to 1–2 bits.62 On top of that, the bit precision in analog compute engines can be improved by incorporating into the NN training the idiosyncrasies and noise sources of the underlying photonic hardware, investing in this way in so-called hardware-aware training or optics-informed DL models.31 Employing this approach, researchers have already showcased robust networks that can secure almost the same accuracy as noise-free digital platforms,63 while a more detailed discussion is included in Sec. V.
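As a simple illustration of such fixed-point operation, the snippet below (a hypothetical helper, not tied to any particular framework) uniformly quantizes a weight tensor to k bits, the regime in which inference models reportedly retain their accuracy:

```python
import numpy as np

def quantize_fixed_point(w, bits):
    # Map w onto 2**bits uniform levels over its dynamic range (fixed point)
    levels = 2**bits - 1
    w_min, w_max = w.min(), w.max()
    step = (w_max - w_min) / levels
    return w_min + np.round((w - w_min) / step) * step

w = np.random.default_rng(1).normal(size=1000)
for k in (8, 4, 2):
    err = np.abs(w - quantize_fixed_point(w, k)).max()
    print(f"{k}-bit weights, max quantization error {err:.4f}")
```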

Energy efficiency is mainly dictated by the supported computational speed and the overall power consumed for the computations, which can be broken down into the power consumption of the laser source, fan-in, fan-out, and weighting technology. Along the same line, area efficiency is determined by the computational speed divided by the overall footprint required for the computations, which obviously depends strongly on the size of the individual fan-in, weighting, and fan-out structures. In an effort to determine the main parameters that affect the energy and area efficiency, we performed the following analysis for an N×N neural layer, pictorially represented in Fig. 6. The N×N neural layer comprises an N×N weight matrix W, which gets multiplied by an N:1 input vector X and yields an N:1 output vector Y. Assuming that every synaptic weight is implemented via a hardware module that consumes a power of $P_W$ watts and an area of $A_W$ mm², each input signal generation structure is realized by a hardware circuit that consumes a power of $P_X$ watts and an area of $A_X$ mm², the receiver circuitry employed for obtaining every output signal consumes $P_Y$ watts and has a footprint of $A_Y$ mm², and the optical laser source consumes $P_{laser}$ watts, the total power consumed equals
$$P_{total} = N P_X + N^2 P_W + N P_Y + P_{laser}. \tag{8}$$
Assuming operation at a compute rate of B MAC/s per axon, the total compute rate equals N²B MAC/s, leading to an energy efficiency in J/MAC (or in MAC/s/W) of
$$\frac{P_{total}}{N^2 B} = \frac{P_X}{N B} + \frac{P_W}{B} + \frac{P_Y}{N B} + \frac{P_{laser}}{N^2 B}. \tag{9}$$
By carefully examining the contribution of the constituents of Eq. (9), we can arrive at the basic operational trade-offs of neuromorphic photonic accelerators. The first term encompasses the driving circuit of the input modulators, where the minimum required dynamic power consumption at a computational rate B can be approximated via their switching energy,64 defined in J/s as
$$P_X = \frac{1}{4} C V_{pp}^2 B, \tag{10}$$
where C and $V_{pp}$ are the capacitance and the driving voltage of the input modulator, respectively. After merging Eqs. (9) and (10) and employing typical values (C = 14 fF, $V_{pp}$ = 2 V)65 for state-of-the-art electro-absorption modulators (EAMs), while excluding, at this point, their static energy consumption, it can be derived that
$$\frac{P_X}{N B} = \frac{C V_{pp}^2}{4 N} \approx \frac{14\ \text{fJ/MAC}}{N}. \tag{11}$$
The second term, $P_W/B$, corresponds to the driving circuit of the weight matrix, assuming stationary operation, where we can discern some typical operating regimes based on the deployed technology and compute rate, as given in Table I.
FIG. 6.

Matrix-vector multiplication of an N×N weight matrix W and an N:1 input vector X, resulting in an N:1 output vector Y.

TABLE I.

Typical weight technologies in silicon photonic accelerators.

Technology | Compute rate B (MAC/s) | Static consumption (W) | Efficiency (J/MAC)
TO PS66 | 10–50 × 10⁹ | 12 × 10⁻³ | 2.4–12 pJ/MAC
Insulated TO PS66 | 10–50 × 10⁹ | 4 × 10⁻³ | 0.4–2 pJ/MAC
EAMs65 | 10–50 × 10⁹ | 2–20 × 10⁻⁶ | 0.4–20 fJ/MAC
Non-volatile PCMs26 | 10–50 × 10⁹ | ∼0 | ∼0 pJ/MAC
The EAM static consumption is calculated through
$$P_{static}^{EAM} = |V_{static}| \times R \times P_{in}, \tag{12}$$
considering an average static bias voltage of $V_{static}$ = −1.5 V, corresponding to the mean value of a uniform distribution that ranges over the EAM operating regime, i.e., [0, −3] V, a responsivity of R = 0.8 A/W, and $P_{in}$ ranging from −15 to −5 dBm. Regarding the thermo-optic (TO) phase shifter (PS),66 we assume a uniform distribution of the weight values, corresponding to a power distribution in the $P_0$–$P_\pi$ range with an average value of $P_{TO} = P_\pi/2$.
The third term, $P_Y/(NB)$, incorporates the power consumption of the receiver circuitry. Following an analysis similar to that of the transmitter part and assuming an output voltage of 0.5 V, compatible with current CMOS transistors, we arrive at
(13)
The final term, $P_{laser}/(N^2 B)$, corresponds to the power consumption of the optical laser source. Similar to the analysis provided in Ref. 55, the input optical power reaching the receiver circuitry should be as follows:
  • Higher than the accelerator’s noise energy. In this context, following the analysis of Subsection II D for the shot-noise limited optical power and considering an N×N neural layer, with (a) a power splitting ratio of N², implying that we have to multiply the output power by N² to compensate for the input and column splitting stages, and (b) a digital precision loss of $a_{prec}$ = N, the shot-noise limited optical power can be calculated using the following equation, which forms a more detailed representation of the laser power calculated in Eq. (5), where, however, the digital precision loss and the compensation loss factor are also taken into account:
    $$P_{laser}^{shot} = \frac{N^2}{a_{prec}} \times \frac{h \nu B \, 2^{2 b_{phot}}}{2a} = N \times \frac{h \nu B \, 2^{2 b_{phot}}}{2a}, \tag{14}$$
    which makes the constituent term of Eq. (9) equal to
    $$\frac{P_{laser}^{shot}}{N^2 B} = \frac{h \nu \, 2^{2 b_{phot}}}{2 a N}, \tag{15}$$
    assuming a responsivity of R = 1 A/W.
  • Sufficient to generate the minimum required electrical charge at the receiver that can drive the subsequent node of the next NN layer.67 With the photonic accelerator operating at 1550 nm and assuming a photodetector with $C_d$ = 1 fF,68 a 1 μm wire with an interconnect capacitance of 200 aF/μm, i.e., $C_i$ = 200 aF,67 and a required output voltage of $V_{out}$ = 0.5 V,69 the minimum laser power required can be calculated, following the same convention of N² splitting loss and N digital precision loss, as
    $$P_{laser}^{min} = \frac{N^2}{N} \times \frac{(C_d + C_i)\, V_{out}\, B}{R} = N \times \frac{(C_d + C_i)\, V_{out}\, B}{R}, \tag{16}$$
    which concludes, for the fourth term of the energy efficiency, to
    $$\frac{P_{laser}^{min}}{N^2 B} = \frac{(C_d + C_i)\, V_{out}}{N R}. \tag{17}$$
    Here, it should be pointed out that this interconnect capacitance $C_i$ suggests a monolithic integration approach or a very intimate proximity of the photonic chiplet to the respective electronic chiplet. More traditional integration approaches will enforce higher interconnect capacitances and significantly increase the required energy, with an interesting analysis provided in Ref. 70.

Combining all the terms in a single efficiency equation, we can conclude to
$$\frac{P_{total}}{N^2 B} = \frac{C V_{pp}^2}{4N} + \frac{P_W}{B} + \frac{P_Y}{N B} + \max\!\left(\frac{h \nu \, 2^{2 b_{phot}}}{2 a N}, \frac{(C_d + C_i)\, V_{out}}{N R}\right). \tag{18}$$
This highlights that energy efficiency improves with the following:
  • Increasing N, implying that the energy consumed for generating and receiving the input and output signals, respectively, is optimally utilized when the same input and output signals are shared among multiple matrix multiplication or, equivalently, neural operations. With current neuromorphic architectures being radix-limited by the maximum emitted laser power,71 loss-optimized architectures are required for allowing high circuit scalability and harnessing the advantages of photonic implementations.

  • Increasing B, which has a predominant effect in reducing energy consumption, especially when using high-power-consumption weight nodes; e.g., the currently widely employed thermo-optic heaters dominate the energy efficiency, reaching up to ∼1 pJ/MAC.

  • Operating in an optimized bit-resolution energy regime, as highlighted in the fourth constituent of Eq. (18). As we can observe, the order-of-magnitude difference between the shot-noise limited and minimum switching (charge) energy contributions has a threshold point at around 4.5 bits, implying that a careful examination of the underlying technology blocks and an optimized operational regime can significantly improve the energy consumption.

Following a similar analysis for the computational area efficiency, the total area consumed by this circuit equals $N A_X + N A_Y + N^2 A_W$ mm², suggesting an area efficiency in MAC/s/mm² of
$$\frac{N^2 B}{N A_X + N A_Y + N^2 A_W}, \tag{19}$$
revealing a positive linear relation with B and relative gains for high accelerator radices.
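To make the trade-off concrete, the short script below (a sketch using the parameter values quoted in the text where available, with the default weight power taken for an insulated TO PS and everything else illustrative) evaluates the main per-MAC contributions of Eq. (18), omitting the receiver-circuit term $P_Y/(NB)$ for which no closed form is derived above, together with the area efficiency of Eq. (19):

```python
H, NU, A_WPE, R = 6.626e-34, 193.41e12, 0.2, 1.0  # Planck, frequency, WPE, responsivity

def energy_per_mac(N, B, bits, P_W=4e-3, C_mod=14e-15, Vpp=2.0,
                   C_rx=1.2e-15, Vout=0.5):
    # Per-MAC energy terms of Eq. (18); the receiver-circuit term is omitted
    e_in = 0.25 * C_mod * Vpp**2 / N                     # Eq. (11), input modulators
    e_w = P_W / B                                        # weight-driving term P_W/B
    e_shot = H * NU * 2.0**(2 * bits) / (2 * A_WPE * N)  # Eq. (15), shot-noise limit
    e_charge = (C_rx * Vout) / (R * N)                   # Eq. (17), receiver charge
    return e_in + e_w + max(e_shot, e_charge)            # J/MAC

def area_efficiency(N, B, A_X, A_Y, A_W):
    # Eq. (19): total compute rate divided by total footprint, in MAC/s/mm^2
    return N**2 * B / (N * A_X + N * A_Y + N**2 * A_W)

for N in (8, 32, 128):
    print(f"N={N:4d}: {energy_per_mac(N, B=10e9, bits=4)*1e15:7.2f} fJ/MAC")
print(f"{area_efficiency(32, 10e9, 0.01, 0.01, 0.001):.3e} MAC/s/mm^2")
```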

In this section, we solely focus on photonic matrix-vector-multiply architectures that could potentially be deployed in DL environments rather than in spiking or event-based computing paradigms. This has been motivated by the current challenges faced by SNNs, including difficulties in understanding underlying mechanisms and a lack of standardized benchmarks.18 In contrast, the established success of deep learning models results from years of research and the availability of extensive datasets and benchmarks, contributing to their widespread applicability and effectiveness.

Herein, we initially investigate the architectural categories of integrated PNNs and then delve deeper into their individual building blocks, reviewing recent developments in photonic weight technologies as well as non-linear activation function implementations. Depending on the mechanism of information encoding and the calculation of the linear operations, integrated PNNs can be classified into three broad categories: coherent, incoherent, and spatial architectures.

Coherent architectures harness the effect of constructive and destructive interference for the linear combination of the inputs in the domain of electrical field amplitudes, requiring just a single wavelength for calculating the neural network linear operations. The principle of operation of coherent architectures is pictorially represented in Fig. 7(a), while Figs. 7(b)–7(d) illustrate indicative coherent layouts that have been proposed in the literature and will be comprehensively analyzed in this Tutorial. The first linear neuron realized in this manner has been proposed in Ref. 29, with its core relying on the optical interference unit realized through cascaded MZIs in a singular value decomposition (SVD) arrangement,72 as per Fig. 8(a). The SVD approach assumes decomposition of the arbitrary weight matrix W as W = USV†, where U and V† denote unitary matrices, with V† being the conjugate transpose of V, and S is a diagonal matrix that carries the singular values of W. Therefore, this scheme rests upon the factorization of unitary matrices, which in the photonic domain has mainly been based on U(2) factorization techniques employing 2 × 2 MZIs.73
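The SVD mapping is straightforward to verify numerically; the following NumPy check (illustrative) factors an arbitrary weight matrix into the two unitary meshes and the diagonal stage that the photonic layout implements:

```python
import numpy as np

rng = np.random.default_rng(42)
W = rng.normal(size=(4, 4))          # arbitrary weight matrix

U, S, Vh = np.linalg.svd(W)          # W = U @ diag(S) @ V^dagger
# U and Vh map onto the two MZI unitary meshes; diag(S) maps onto the
# diagonal attenuation/amplification stage between them.
assert np.allclose(U @ np.diag(S) @ Vh, W)
assert np.allclose(U.conj().T @ U, np.eye(4))    # unitarity of U
assert np.allclose(Vh @ Vh.conj().T, np.eye(4))  # unitarity of V^dagger
print("singular values:", S)
```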

FIG. 7.

(a) Coherent PNN linear architectures that are based on electric field summation, (b)–(d) indicative layouts from the literature that follow the coherent principles, (e) incoherent PNN linear architectures that are based on WDM optical power addition, and (f)–(h) indicative layouts from the literature that adopt the incoherent concept.

FIG. 8.

(a) Singular value decomposition scheme, (b) triangular mesh implemented by Reck et al., (c) conventional Mach–Zehnder interferometer with an additional PS element at one of the two outputs, (d) rectangular mesh proposed by Clements et al., (e) UGMZI architecture proposed by Tsakyridis et al., and (f) individual UGMZI node that operates as an NxN beam splitter.


In this regime, back in 1994, Reck et al.74 proposed the first optical unitary matrix decomposition scheme, the so-called triangular mesh shown in Fig. 8(b), using the 2 × 2 MZI, illustrated in Fig. 8(c), as the elementary building block. Recently, this layout was optimized by Clements et al.,75 who introduced the rectangular mesh of 2 × 2 MZIs depicted in Fig. 8(d), a more loss-balanced and error-tolerant design than Reck’s architecture. Both layouts necessitate N(N − 1)/2 variable beam splitters for implementing any N × N unitary matrix, requiring, also, the same number of programming steps for realizing the decomposition. Although these U(2)-based architectures rely on a simple library of photonic components that facilitates their fabrication, they suffer from several drawbacks, with the most important being fidelity degradation. Fidelity is a measure of the closeness between the experimentally obtained and the theoretically targeted matrix values, denoting a quantity that declares the accuracy in implementing a targeted matrix in the experimental domain. Fidelity degradation in U(2)-based layouts originates from the differential path losses imposed by the non-ideal lossy optical components.76

On top of that, U(2)-based layouts cannot support any fidelity restoration mechanism without altering their architectural structure or sacrificing their universality. When transferring these layouts into an SVD scheme toward implementing arbitrary matrices, the above effects are exacerbated, as two concatenated unitary matrix layouts are required. In an attempt to counteract these issues, the authors in Ref. 77 proposed the universal generalized Mach–Zehnder interferometer (UGMZI)-based unitary architecture illustrated in Fig. 8(e) and introduced a novel U(N) unitary decomposition technique78 in the optical domain that departs from the conventional U(2) factorization by employing N × N generalized MZIs (GMZIs) as the elementary building block. GMZIs serve as N × N beam splitters79,80 followed by N PSs, with each N × N beam splitter comprising two N × N MMI couplers interconnected by N PSs, as depicted in Fig. 8(f). This scheme eliminates the differential path losses and, hence, can yield 100% fidelity performance by applying a simple fidelity restoration mechanism, which incorporates N variable optical attenuators at the inputs of the UGMZI. Yet, this architecture heavily relies on MMI couplers with a high number of ports in order to perform transformations on large unitary matrices, a rather immature integrated circuit technology that is still under development in current research fabrication efforts. Finally, the authors in Ref. 81 proposed the slimmed SVD-based PNN, where universality is traded for area and loss efficiency by eliminating one of the two unitary matrices, implying that only specific weight matrices can be implemented.

Apart from SVD-based approaches, direct-element mapping architectures also comprise coherent layouts that employ a single wavelength and interference for calculating the linear operations. The mapping of the weight values to the underlying photonic fabric is bijective, meaning that each photonic node imprints a dedicated value of the targeted weight matrix without necessitating decomposition, minimizing in this way the programming complexity. Figure 9 illustrates the first coherent direct-element mapping architecture,76 implemented in a crossbar (Xbar) layout. In order to support both positive and negative weight values, this architecture requires the use of two devices per weight: an attenuator for imprinting the weight magnitude, proportional to $|W_i|$, and a PS for controlling the phase, i.e., the sign of the weight, $\mathrm{sign}(W_i)$, enforcing a 0 phase shift in the case of positive and a π phase shift in the case of negative weights, resulting in $\mathrm{sign}(W_i)\,|W_i| \times X_i$. The weighted inputs are linearly combined in an N:1 combiner stage, constituted from cascaded Y-junction combiners, yielding an output electrical field proportional to $\sum_{i=1}^{N} X_i W_i$, which conceals the sign information in its phase. If compatibility with electrical non-linearities is needed, the sign information of the signal emerging from the Xbar output can be translated from its phase to its magnitude by introducing an optional bias branch, which sets a constant reference power level that allows for mapping the positive/negative output field above/below the bias, as experimentally demonstrated in Ref. 82. The Xbar architecture, thanks to its loss-balanced configuration, can yield 100% fidelity performance, while its non-cascaded, one-to-one mapping connectivity significantly improves the phase-induced fidelity performance since any error is restricted to a single matrix element. These benefits were experimentally verified in Refs. 83 and 84, employing a 4 × 4 silicon photonic Xbar with SiGe EAMs as computing cells, while the NN classification credentials of this architecture were experimentally validated in Refs. 24 and 25 using a 2:1 single-column Xbar layout that is capable of calculating the linear operations of the MNIST dataset at up to a 50 GHz clock frequency with a classification accuracy of >95%. In an effort to exploit the full potential of the photonic platform, the Xbar architecture can be equipped with wavelength division multiplexing (WDM) technology to further boost the throughput, as has been proposed in Refs. 85 and 86, realizing multiple output vectors in a single timeslot. Although the Xbar layout currently seems to be the optimal architectural candidate for PNNs, it requires careful and precise effort during circuit design in order to synchronize the optical signals that travel through different paths and coherently recombine at the output. Hence, optimum performance of the Xbar necessitates the employment of equal-length optical paths whenever coherent recombination is required, suggesting that the path-length difference has to be compensated during the photonic chip layouting.

FIG. 9.

Xbar architecture as a direct-element mapping coherent layout.


Finally, a recent coherent demonstration in Ref. 87 exploits vertical-cavity surface-emitting lasers (VCSELs) for encoding, in i time steps, both the input vector and the weight matrix, as shown in Fig. 7(d). Using the injection locking mechanism between the deployed VCSELs, phase coherency is retained over the entire circuit, allowing for the realization of coherent amplitude addition at the interference stage of each time step. Matrix-vector products are realized by the photoelectric multiplication process in homodyne detectors, while a switched integrator charge amplifier is employed for the accumulation of the individual i products. Despite its simplicity, this architecture requires precise phase control over the individual VCSELs toward retaining phase coherency over the entire circuit, raising stability and scalability issues.

Demarcating from coherent architectures, incoherent PNNs encode the NN parameters into different wavelengths and calculate the network linear operations by employing WDM technology principles and power addition. A pictorial representation of how incoherent architectures operate is given in Fig. 7(e), while some incoherent layouts that have been suggested in the literature and will be thoroughly examined in this Tutorial are illustrated in Figs. 7(f)–7(h). The first implementation that follows this approach was proposed in Ref. 88, where a team from Princeton initially demonstrated the so-called broadcast-and-weight architecture, subsequently elaborated in more detail in Ref. 89. Each input $x_i$ is imprinted at a designated wavelength $\lambda_i$, essentially making each channel $\lambda_i$ a virtual axon of a linear neuron, while all N inputs (λs) are typically multiplexed together into a single waveguide when arriving at the linear neuron, as shown in Fig. 10. The main building block of this architecture is the microring resonator (MRR) bank, consisting of N MRRs that are enclosed between two parallel waveguides and are responsible for enforcing channel-selective weighting values. Each MRR filter is designed such that its transfer function can be continuously tuned, ideally between the values of 0 and 1, achieving controlled attenuation of the signal’s power at the corresponding $\lambda_i$. The sign is encoded by exploiting path diversity and balanced photodetection (BPD): assuming that a fraction $a_i$ of a signal at a certain wavelength exits via the THRU port of the respective MRR module and the remaining $(1 - a_i)$ part gets forwarded to the DROP port, the subtraction of the respective photocurrents at the BPD yields the weighting value $w_i = 2a_i - 1$ for this specific signal, which can range between −1 and 1, given that $a_i$ ranges between 0 and 1. With all different wavelengths leaving through the same DROP and the same THRU port and entering the same BPD unit, the BPD output provides the total weighted sum of the WDM inputs. This architecture allows for one-to-one mapping of the weighting values into the MRR weight bank, alleviating the programming complexity; yet, it comprises a rather challenging solution, since it necessitates the simultaneous operation and precise control of multiple resonant devices, raising issues regarding its scalability credentials. An alternative incoherent architecture is proposed by the authors in Ref. 26, demonstrating a PNN that follows the photonic in-memory computing paradigm, where the weighting cells are realized through PCM-based memories. This approach exploits the non-volatile characteristics of the PCM devices, consuming, in principle, zero power when inference operation is targeted, meaning that the weights of the NN do not have to be updated and are, thus, statically imprinted in the PCM weighting modules. This architecture utilizes an integrated frequency comb laser to imprint the multiple inputs of the NN, with each comb line corresponding to a dedicated NN input value. The multi-wavelength signals, after the PCM-based weighting stage that follows the layout depicted in Fig. 7(h), are incoherently combined at a photodiode (PD) in order to produce the linear summation. Although this architecture minimizes the memory movement bottleneck, it requires (i) a precise design to timely synchronize the multi-wavelength signals at each PD and (ii) a broad wavelength spectrum of the frequency comb laser for implementing large-scale NNs. An additional incoherent architecture is proposed in Ref. 28
28 and illustrated in Fig. 11. The authors employ WDM input signals for imprinting the NN input vector, while the realization of the weight matrices is implemented via multiple semiconductor optical amplifiers (SOAs). They adopt the cross-connect switch principles used in optical communications for constructing the PNN and arrayed waveguide gratings (AWGs) for multiplexing/demultiplexing the signals as well as for reducing the out-of-band accumulated noise of the SOAs. Although it comprises a promising solution toward implementing large scale PNNs, the deployment of multiple SOAs, as single stage weighting elements, trades the scalability credentials for increased power consumption. Finally, an alternative to the coherent/incoherent architectures has been proposed in Ref. 32, where the authors encode the N pixels of the classification image (NN input values) directly to the grating couplers through optical collimators, while the weight information of each NN input is imprinted through a dedicated PIN-based optical attenuator. Each weighted input is launched to a PD, and the resulted photocurrents are combined to generate the linear weighted sum of the neurons. As opposed to the coherent and incoherent layouts, there is no requirement for the encoded signals to be in phase or in different wavelength, respectively, since every NN input is imprinted at a designated photonic waveguide/axon. This, however, necessitates multiple waveguides/axons for implementing a high-dimensional NN, imposing scalability limitations.
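The wi = 2ai − 1 mapping of the balanced photodetection stage can be captured in a minimal Python sketch, assuming ideal MRRs and photodetectors (all names below are illustrative):

```python
import numpy as np

def broadcast_and_weight(x: np.ndarray, a: np.ndarray) -> float:
    """Minimal sketch of MRR weight-bank weighting with balanced photodetection.
    x: per-wavelength input signal powers; a: fraction of each signal exiting the
    THRU port of its MRR (the remaining 1 - a exits the DROP port)."""
    i_thru = np.sum(a * x)           # photocurrent of all wavelengths on the THRU port
    i_drop = np.sum((1.0 - a) * x)   # photocurrent of all wavelengths on the DROP port
    return i_thru - i_drop           # BPD output: sum_i (2 * a_i - 1) * x_i

x = np.array([0.2, 0.5, 0.8])
a = np.array([1.0, 0.5, 0.0])        # maps to weights +1, 0, and -1
print(broadcast_and_weight(x, a))    # equals np.sum((2 * a - 1) * x)
```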

FIG. 10. Broadcast-and-weight architecture using a microring resonator bank for realizing the weighting function.

FIG. 11. Semiconductor optical amplifier-based incoherent photonic neural network.

From all previous implementations, it becomes evident that the main challenges and limitations of integrated PNNs relate to their scalability and, thence, to the hardware encoding of the vast amount of NN parameters into a photonic chip. In this direction, the authors in Refs. 38, 90, and 91 introduced the optical tiled matrix multiplication (TMM) technique, shown in Fig. 12, which follows the principles of the general matrix multiply (GeMM) method adopted by modern digital AI engines92,93 and attempts to virtually increase the size of the PNN without fabricating large photonic circuits. The rationale behind this concept is the following: the weight matrix and the input vector of an NN are divided into smaller tiles, whose dimension is dictated by the available hardware neurons. The resulting tiles are unrolled in the time domain via time division multiplexing (TDM) and then sequentially imprinted into the photonic hardware, allowing in this way for the calculation of matrix-multiplication operations of an NN layer whose dimension is higher than the one implemented on hardware. The time-unfolded products, produced by the multiple tiles, need to be added together in order to form the final summation. For this reason, the authors in Refs. 44, 91, and 94 utilized a charge accumulation technique, either electro-optically using a low-bandwidth photodetector or electrically via a low-pass RC filter. Besides accumulation, this implementation allows for power efficient and low-cost ADCs since it relaxes their sampling rate and bandwidth requirements. However, the employment of optical TMM and charge accumulation techniques in a PNN engenders specific requirements that need to be addressed: (i) both the input-vector-imprinting and weight-encoding modulators have to operate at the same data rate and (ii) the number of time-unfolded products that will be accumulated is dictated by the deployed capacitance of the RC filter or the bandwidth of the photodetector, implying that after a certain period, the capacitor voltage/photodetector power should be reset in order to store (e.g., to a local memory) the first set of accumulated summations. The same process is repeated until the total linear operations of the PNN have been calculated.
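A minimal numpy sketch of the TMM bookkeeping is given below, assuming an ideal square hardware tile and lossless charge accumulation; each per-tile product stands in for one pass through the photonic hardware, and all names are illustrative:

```python
import numpy as np

def tiled_matvec(W: np.ndarray, x: np.ndarray, tile: int) -> np.ndarray:
    """Sketch of optical tiled matrix multiplication (TMM): the weight matrix and
    the input vector are split into tile-sized blocks, each block product is
    executed on the (small) photonic hardware, and the time-unfolded partial
    products are accumulated, e.g., on a low-bandwidth photodetector/RC filter."""
    n_out, n_in = W.shape
    y = np.zeros(n_out)
    for r in range(0, n_out, tile):              # output-row tiles
        acc = np.zeros(min(tile, n_out - r))     # charge accumulator for this tile row
        for c in range(0, n_in, tile):           # time-unfolded input tiles
            acc += W[r:r + tile, c:c + tile] @ x[c:c + tile]  # one hardware pass
        y[r:r + tile] = acc                      # read out, then reset the accumulator
    return y

W, x = np.random.randn(8, 6), np.random.randn(6)
assert np.allclose(tiled_matvec(W, x, tile=4), W @ x)
```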

FIG. 12. Tiled matrix multiplication technique, following the general matrix multiply (GeMM) method adopted by modern digital AI engines.

Even when employing the proposed techniques, large NN language models, such as ChatGPT95 and Megatron,96 necessitate billions of trainable parameters, which are challenging to encode not only in silicon photonic hardware but also in current electronic computing engines. Therefore, these models are deployed on High Performance Computers (HPCs) incorporating thousands of interconnected GPUs and/or tensor processing units (TPUs); e.g., the Megatron language model deploys 4400 A100 GPUs96 distributed across hundreds or thousands of nodes. This architectural paradigm has already been transferred to analog electronic accelerator prototypes, with recent multi-core systems already expanding to more than 50 cores,15 and can act as a blueprint architectural approach for multi-core photonic accelerators. Interconnection of the constituent photonic cores can benefit from the recent breakthroughs in optical chip-to-chip communications, projected to offer significant energy and latency savings compared to electronic counterparts97 while also paving the way for reduced opto-electronic (OE) conversions and even on-switch-fabric workload accelerations.98 Finally, the development of commercially viable silicon photonic accelerators has to tackle both the well documented packaging challenges of deploying very large scale integrated photonics99,100 and the photonic accelerator specific packaging and interconnect requirements.101 Fortunately, recent breakthroughs in large scale photonic circuitry packaging highlight a feasible developmental roadmap capable of addressing the challenges of (i) laser source integration, through employing either heterogeneous integration of III–V components via wafer bonding102 and micro transfer printing103 or photonic wire bonding,104 (ii) photonic/electronic system-in-package (SiP) development, with prominent approaches including monolithic integration in silicon photonics105 or mainstream electronic platforms97 and 3D integration,106 and (iii) photonic accelerator memory access, where either optical interconnectivity between memory and accelerator is promoted107 or the novel photonic in-memory-computing paradigm26 is adopted, with the weight matrix being non-volatile and, as such, significantly alleviating the accelerator's memory-access requirements.

Delving deeper into the individual PNN building blocks, we provide an overview of photonic technologies that can be promising candidates toward the realization of the NN weight imprinting into an integrated platform. As discussed previously, most PNN demonstrations focused on the weight matrix implementation rather than the NN input vector, since the number of weight values comprises the greatest contributing factor to the hardware encoding of the entire set of NN parameters. For example, assuming a fully connected NN with a topology of 10:10:5, the number of input values is 10, while the total number of weight values is 150, and this difference becomes more pronounced as the NN dimensions/layers increase (a quick check of this count is given right after this paragraph). Hence, the selection of the photonic weight technology becomes crucial, as it implicitly determines the size and energy efficiency of the PNN. The photonic weight technologies can be divided into two categories, depending on their volatility characteristics. Non-volatile devices can be used as memories storing the NN weight values in a PNN, with this information retained by statically applying ultra-low or even zero electrical power. These devices can either use memristors heterogeneously integrated with photonic microring resonators108 or exploit physical phenomena such as phase change26 and ferroelectricity109 in order to store and retain the weight values. The employment of non-volatile memory elements is more suitable for equipping PNN inference engines, offering low-power weight encoding with high precision, but, in turn, they impose challenges related to reconfiguration time, fabrication maturity, compactness, and scalability. For example, PCMs, mostly based on GST compounds, exhibit up to 5-bit resolution,110 but, in turn, their reconfiguration time is restricted to the sub-MHz regime, while in most demonstrations they operate via optical absorption, limiting their deployment in large scale circuits. Ferroelectric materials, such as Barium Titanate (BTO), have already validated their non-volatile credentials, retaining their states over 10 h.109 However, to incorporate this device into a PNN, one aspect that still needs to be addressed and optimized is the footprint, since the required phase-shifter length for achieving a π phase shift is at least 1 mm,109 rendering the implementation of a large scale PNN rather challenging. On the other hand, when training applications are targeted or the TMM technique has to be applied for executing a high-dimensional neural layer over a limited PNN hardware, volatile devices take the lead over non-volatile materials since they offer dynamic weight updates. Various thermo-optic (TO) MZI or MRR devices27,29–31,111–113 have been proposed for weight data encoding due to their well-established and mature fabrication process as well as their high bit precision (up to 9-bit113), yet their reconfiguration time is limited to ms values.
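As a quick sanity check of the 10:10:5 example, the snippet below counts the weight values of a fully connected topology (a plain Python illustration, not tied to any specific PNN):

```python
def weight_count(topology):
    """Number of weight values in a fully connected NN, e.g., [10, 10, 5] -> 150."""
    return sum(n_in * n_out for n_in, n_out in zip(topology, topology[1:]))

print(weight_count([10, 10, 5]))  # 10*10 + 10*5 = 150 weights vs only 10 inputs
```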

Electro-optic devices, such as micro-electro-mechanical systems (MEMS),114,115 EAMs,36 semiconductor optical amplifiers (SOAs),28 ITO-based modulators,116 graphene-based phase shifters,117 and silicon p-i-n diode Mach–Zehnder modulators (MZMs),118 have already been demonstrated as potential weighting elements, exhibiting reconfiguration times in the GHz regime at the expense, however, of bit precision.41 Therefore, the selection of the photonic weight technology heavily depends on the targeted NN application (inference, training) and its bit resolution requirements. Figure 13 juxtaposes the power consumption and footprint of different photonic technology candidates for the realization of the weighting function in PNN implementations, highlighting also their speed capabilities/reconfiguration time.

FIG. 13. Power consumption vs device length of various optical technologies for realizing weighting circuits. They are grouped into two categories, depending on their reconfiguration time: blue square dots for low speed (ms/ns) and red square dots for high speed (10s of ps).

An indispensable part in the realization of an NN is the activation function, i.e., a non-linear function that is applied at the egress of the linear weighted summation. The non-linearity of the activation function allows the network to generalize better, converge faster, approximate non-linear relationships, and avoid local minima. Despite the relatively relaxed requirements on the properties of the activation functions, i.e., a certain degree of non-linearity and differentiability across the employed range,119 DNN implementations have been dominated, due to their higher performance credentials, by the use of the ReLU,120 PReLU,121 and variations of the sigmoid transfer function, including tanh and the logistic sigmoid.122 This dominance has shaped the objectives of photonic NN activation function circuitry: converging to the performance of these specific electrical baseline functions at the highest possible bit rate, as well as achieving a certain level of SNR at their output to safeguard the scalability of the neural circuitry. Previous implementations of non-linearity in photonic NNs have been streamlined across three basic axes: (i) The simplest approach relies on applying the non-linear activation in the electronic domain. This was achieved through offline implementation in a CPU following the opto-electrical conversion of the vector-matrix-multiplication product,29 by chaining an ADC to a digital multiplier and finally to a DAC,123 or by introducing non-linearity at the neuron's egress through a specially designed ADC.94 Despite the simplicity and effectiveness of digitally applying the non-linear activation function, the related unavoidable digital conversion induces, in the best case, a latency of several clock cycles for every layer of the NN that employs one.29,123 Transferring this induced latency to a photonic NN accelerator would significantly decrease the achieved computation capabilities and, as such, its total performance credentials. (ii) The hybrid electrical-optical approach relies on a cascade of active photonic and/or electronic components, i.e., photodiode–amplifier–modulator–laser, with the non-linear behavior provided by the opto-electrical synergies, such as a transimpedance amplifier (TIA), or by the non-linear behavior of the photonic components (e.g., modulators).124–129 The hybrid electrical-optical approaches provide a viable alternative to digitally applied activation functions, but, in turn, the induced noise and latency originating from the cascaded optical-to-electrical-to-optical conversions may still impose a non-negligible overhead on the performance of the photonic NN. (iii) The all-optical approach is based on engineering the non-linearities of optical components to arrive at practical photonic activation functions. In this context, different mechanisms and materials have been investigated, including among others gain saturation in SOAs,130,131 absorption saturation,132,133 reverse absorption saturation in films of buckyballs (C60),133 PCM non-linear characteristics,134,135 a SiGe hybrid structure in a microring resonator,136,137 and periodically poled thin-film lithium niobate (PPLN) nanophotonic waveguides.138 All-optical approaches seem to hit the sweet spot between applicability and function, allowing time-of-flight computation and negating the need for costly conversions.

Finally, a recent trend, and probably the most promising one for realizing a complete PNN, comprises the development of programmability features in both hybrid and all-optical approaches, where a single building block can realize multiple activation functions by modifying its operational conditions.126,128,129,133,135 These implementations have mainly relied on the different non-linear transfer functions obtained by the same component when altering its operational conditions through specific settings, e.g., the DC bias voltage for a modulator, the DC current for an SOA, the gain of a TIA, the input optical power and pulse duration for a PCM, etc. Therefore, enabling reconfigurability in PNNs can pave the way toward implementing different AI applications/tasks without requiring any modifications in the underlying hardware.

Yet, the programmability properties of the non-linear activation functions need to be combined with high-speed performance to comply with the frequency update rate of the execution of the linear part. Figure 14 provides an overview of the devices that have been proposed for the implementation of NN activation functions, classifying them according to their implementation (all-optical, electro-optical), their speed performance, and the number of activation functions that they can realize, while Table II summarizes the power consumption and area metrics of state-of-the-art activation function demonstrations.

FIG. 14. Number of activation functions produced by a single device vs operating frequency. They are classified based on their implementation: blue square dots for all-optical and red square dots for electro-optical.
TABLE II. A summary of the power consumption and area metrics for the most advanced activation function demonstrations.

References | Power consumption (mW) | Area (mm²)
TIA + MZM129 | 425 | 7.13
TIA + non-linear cond + MZM128 | 400 | 0.625
SOA130 | 1640 | 9.1
EAM124 | 17 | N.A.
Thin film LiNbO3138 | 135 × 10⁻³ | N.A.
Saturable absorber133 | 40 × 10³ | 11.76 × 10³
MRR126 | 0.1 | 25

Despite the significant energy and footprint advantages of analog photonic neuromorphic circuitry, its use for DL applications necessitates a unique software-hardware NN co-design and co-development approach in order to account for various factors that are absent in digital hardware and, as such, ignored in current digital electronic DL models.125 These include, among others, fabrication variations, optical bandwidth, optical noise, optical crosstalk, limited ER, and non-linear activation functions that deviate from the typical activation functions used in conventional DL models, with all of them acting effectively as performance degradation factors.139 In this context, significant research effort has been invested into incorporating the photonic-hardware idiosyncrasy in NN training models,140 engendering a new photonic hardware-aware DL framework. This reality has shaped a new framework for PNNs, which should eventually be better defined as the NN field that combines neuromorphic photonic hardware with optics-informed DL training models: using light for carrying out all constituent computations while, at the same time, using DL training models that are optimally adapted to the properties of light and the characteristics of the photonic hardware technology. The research field of hardware-aware DL training models designed and deployed for neuromorphic photonic hardware has led to the introduction of optics-informed DL models,31,37,141,142 a term that has been recently coined in Ref. 42, revealing a strong potential in matching and even outperforming digital NN layouts in certain applications.42

Optics-informed DL models have to embed all relevant noise and physical quantities that stem from the analog nature of light and the optical properties of the computational elements into the training process. In order to ease the understanding of the noise sources and physical quantities that impact a photonic accelerator and the related NN implementation challenges, Fig. 15(a) illustrates the implementation of a single neuron axon over photonic hardware, along with the dominant signal quality degradation mechanisms. The input neuron data xi is quantized prior to being injected into a DAC, whose bit resolution, for the tens of GSa/s sampling rates required for photonic neuromorphic computing, ranges around 4–8 bits,143 i.e., significantly lower than the 32-bit floating-point numbers utilized in digital counterparts.

FIG. 15. (a) Illustrative implementation of a single neuron axon in photonic hardware, including the main signal quality degradation mechanisms. Indicative time traces are provided across the signal path: (b) NN data before and after quantization in a DAC, (c) effect of limited ER modulator on NN data, (d) effect of limited bandwidth, (e) effect of AWGN, and (f) effect of quantization at the output ADC.

This disparity is indicatively illustrated in Fig. 15(b), with the input NN data and the DAC having a bit resolution of 8 and 2 bits, respectively, resulting in quantization errors denoted as Qerror. Subsequently, the quantized electrical signal at the DAC's egress is used to drive an optical modulator in order to imprint the information in the optical domain. In this case, the non-linearity and the non-infinite ER of the photonic modulator will modify the incoming signal, with Fig. 15(c) indicatively illustrating the effect of limited ER on the signal representation. It should be pointed out that in this simple analysis, we assume a weight-stationary layout and, as such, neglect the effect of weight noise. We also approximate the frequency response of the photonic axon, denoted as Tf, as a low-pass filter, a valid assumption when considering the convolution of the constituent frequency responses of the modulator and the photodiode that are typically limited in the GHz range. The effect of this low-pass behavior is schematically captured in Fig. 15(d), showcasing the effect of limited bandwidth on the calculated weighted sum. Several noise sources also degrade the SNR of the optical signal traversing the photonic neuron, including among others Relative Intensity Noise (RIN), shot noise, and thermal noise. Under the generally valid assumption that the main noise contributions can be approximated as AWGN sources, we consolidate the noise profile of the photonic axon into a single noise factor, correlated with the standard deviation of the zero-mean AWGN added to the signal. Figure 15(e) illustrates the effect of random AWGN added on the neural data that has propagated through the photonic hardware. Finally, an ADC is utilized for interfacing the signal back to the digital domain, introducing again quantization error, as depicted in Fig. 15(f). Comparing the finally received digital signal at the ADC output with the original NN input digital signal clearly indicates the significant differences that may translate into degraded NN performance when relying on conventional DL training models.
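To make the chain of Fig. 15 concrete, the following Python sketch emulates the degradation mechanisms in sequence, under simplifying assumptions: uniform DAC/ADC quantizers, a finite-ER floor, a first-order low-pass standing in for Tf, and a single consolidated AWGN term; all parameter values are illustrative:

```python
import numpy as np

def quantize(x, bits, lo=0.0, hi=1.0):
    """Uniform quantizer, as in a DAC/ADC with the given bit resolution."""
    levels = 2 ** bits - 1
    return np.round(np.clip((x - lo) / (hi - lo), 0, 1) * levels) / levels * (hi - lo) + lo

def photonic_axon(x, dac_bits=4, adc_bits=4, er_db=10.0, noise_std=0.02, alpha=0.6):
    """Sketch of the Fig. 15 chain: DAC quantization -> limited-ER modulator ->
    low-pass response T_f (one-pole IIR) -> consolidated AWGN -> ADC quantization."""
    s = quantize(x, dac_bits)                        # (b) DAC quantization error
    p_min = 10 ** (-er_db / 10)                      # (c) finite extinction ratio:
    s = p_min + (1 - p_min) * s                      #     zero levels are lifted to P_min
    y = np.zeros_like(s)                             # (d) limited bandwidth modeled as
    for i in range(len(s)):                          #     a first-order low-pass filter
        y[i] = alpha * s[i] + (1 - alpha) * (y[i - 1] if i else s[0])
    y += np.random.normal(0.0, noise_std, y.shape)   # (e) zero-mean AWGN term
    return quantize(y, adc_bits)                     # (f) ADC quantization error

x = np.random.rand(16)                               # 16 analog NN input samples
print(photonic_axon(x))
```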

In this section, we begin by highlighting the challenges and opportunities of using photonic activation functions in NN implementations, followed by an in-depth analysis of the approach and related benefits of incorporating photonic noise, limited bandwidth, limited ER, and quantization in NN training. Finally, we provide a brief overview of related applications and discuss the potential of optics-informed DL models.

Non-linear transfer functions provided by the photonic substrates are integrated into the training process by fitting generic functions, such as sigmoid, sinusoidal, and tanh, to the experimental observation of the physical response of the components.125,129,130 In turn, the fitted transfer functions are employed in software-implemented neural networks during training. More specifically, the authors in Ref. 130 presented an all-optical neuron that utilizes a logistic sigmoid activation function using a WDM input and weighting scheme. The activation function is realized by means of a deeply saturated, differentially biased Semiconductor Optical Amplifier-Mach-Zehnder Interferometer (SOA-MZI)144 followed by an SOA Cross-Gain-Modulation (XGM) gate. The transfer function of the photonic sigmoid activation function is defined as
f(z) = A_2 + (A_1 - A_2)/(1 + e^{(z - z_0)/d}),    (20)
in which the parameters A1 = 0.060, A2 = 1.005, z0 = 0.154, and d = 0.033 are tuned to fit the experimental observations as implemented on real hardware devices.130
Furthermore, the photonic sinusoidal is another activation function that is widely used in benchmarks and photonic hardware. The photonic layout corresponds to employing an MZM device145 that converts the data into an optical signal, along with a photodiode, yielding a sine-squared response. The formula of this photonic activation function is the following:
f(z) = \sin^2(z).    (21)
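Both fitted transfer functions can be dropped into a software training loop as ordinary activations; the short sketch below implements Eqs. (20) and (21) as reconstructed above (the sine-squared form of Eq. (21) follows the MZM response discussed in the text):

```python
import numpy as np

# Fitted parameters of the photonic sigmoid of Eq. (20), as reported for Ref. 130
A1, A2, Z0, D = 0.060, 1.005, 0.154, 0.033

def photonic_sigmoid(z):
    """Photonic sigmoid of Eq. (20), fitted to the SOA-MZI + XGM gate response."""
    return A2 + (A1 - A2) / (1.0 + np.exp((z - Z0) / D))

def photonic_sinusoidal(z):
    """Photonic sinusoidal of Eq. (21): MZM sine-squared response plus photodiode."""
    return np.sin(z) ** 2

z = np.linspace(0.0, 1.0, 5)
print(photonic_sigmoid(z), photonic_sinusoidal(z))
```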
Recently, the authors in Ref. 129 demonstrated a programmable analog opto-electronic (OE) circuit that can be configured to provide a range of non-linear activation functions for incoherent neuromorphic photonic circuits at up to 10 GHz line rates. The proposed programmable OE circuit provides activation functions similar to those typically used in DL models, including tanh-, sigmoid-, ReLU-, and inverted-ReLU-like activations. Additionally, it also introduces a series of novel photonic non-linear functions that are referred to as Rectified Sine Squared (Re-Sin), Sine Squared with Exponential tail (ExpSin), and Double Sine Squared.
The provided analog non-linear activation integrated circuit (NLA-IC) offers the capability to implement two different sets of activation functions. The first comprises the OE activations, where the NLA-IC is exploited as a standalone activation unit, producing a variety of activation functions that are described by a generic tanh mathematical function, based on the response of the programmable transimpedance amplifier (TIA),146 and is given by
(22)
where a, b, c, d, k ∈ ℝ are hyperparameters that define the different behaviors of the programmable circuit. The second set of AFs incorporates the OEO activations. In this case, the NLA-IC is combined with an MZM, producing a diverse range of OEO non-linear activation functions combining the OE responses with the MZM's sine-squared response. The mathematical function describing the OEO functions is given by
(23)
where, as before, e, d ∈ ℝ define the behavior of the programmable circuits. Such transfer functions are integrated during training to simulate the inference process on actual photonic devices. However, training models that utilize such activation functions becomes challenging due to the limited activation window they offer, making them susceptible to vanishing gradient phenomena.147,148 More precisely, such difficulties are attributed to the physical properties that force the activation to work on a smaller region of the input domain, leading to a narrow activation window and making them easily saturated. These limitations arise from the fact that physical systems operate within a specific power range, while low power consumption is also a parameter that must be taken into account during hardware design and implementation. These limitations, which are further exaggerated when recurrent architectures are used, dictate the employment of different training paradigms.63,149 Indeed, different activation functions require the use of different initialization schemes to ensure that the input signal will not diminish and that the gradients will correctly back-propagate. Failing to use an initialization scheme that is correctly designed for the activation function at hand can stall the training process or lead to sub-optimal results.150 Therefore, even though these photonic neuromorphic implementations can significantly improve the inference speed, further advances are required in the way that NNs are designed and trained in order to fully exploit the potential of such photonic hardware.

Motivated by the variance preserving assumption,147 novel initialization approaches, targeting photonic activation functions, analytically compute the optimal variance during the initialization.125 More advanced approaches propose activation agnostic methods applying auxiliary optimization tasks that allow initializing neural network parameters by taking into account the actual data distribution and the limited activation range of the employed transfer functions.151 

The aforementioned approaches lowered the performance gap between software- and hardware-implemented architectures.31 However, in hardware-implemented neural networks, there are also limitations that arise due to the noise that emerges from various sources, such as shot noise, thermal noise, and weight read noise,31,57,110 as well as due to other phenomena, such as limited bandwidth and extinction ratio. To this end, there are approaches that model such noise sources as AWGN, introducing the effect of noise into the software training process and significantly improving the performance during deployment. These approaches exploit the fact that DL models are intrinsically robust to noise, especially when they are adequately trained to tolerate the noise sources, which are modeled as AWGN.31 Therefore, the output of a neuron that incorporates such sources can be modeled as
u = \sum_{i=1}^{N} (x_i + n_{x,i})(w_i + n_{w,i}) + b,  with n_{x,i} ~ N(0, \sigma_i^2) and n_{w,i} ~ N(0, \sigma_w^2),    (24)
where σi² and σw² are the variances of the noise that affects the input and the weighting, respectively. In addition, the final output of the neuron is modeled as
y = f(u) + n_a,  with n_a ~ N(0, \sigma_a^2),    (25)
where σa² is the variance of the noise induced by the activation function. Training DL models using the aforementioned noise sources makes them more robust to the noise that will arise during deployment.
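A minimal noise-aware forward pass in the spirit of Eqs. (24) and (25) can be sketched as follows, with illustrative noise variances:

```python
import numpy as np

def noisy_neuron(x, w, b, f, sigma_i=0.01, sigma_w=0.01, sigma_a=0.01, rng=None):
    """Noise-aware forward pass: AWGN is injected on the inputs, the weights, and
    the activation output during training, so the model learns to tolerate the
    noise it will encounter on the photonic hardware."""
    rng = rng or np.random.default_rng()
    x_n = x + rng.normal(0.0, sigma_i, x.shape)      # input noise,  variance sigma_i^2
    w_n = w + rng.normal(0.0, sigma_w, w.shape)      # weight noise, variance sigma_w^2
    u = w_n @ x_n + b                                # noisy weighted sum, Eq. (24)
    return f(u) + rng.normal(0.0, sigma_a, u.shape)  # activation noise, Eq. (25)

x = np.random.rand(8)
w, b = np.random.randn(4, 8), np.zeros(4)
print(noisy_neuron(x, w, b, np.tanh))
```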
On top of that, the authors in Ref. 152 proposed an advanced initialization method that incorporates the existing noise sources into the initialization, estimating iteratively the optimal variance for each layer by taking into account the actual data distribution and, as such, leading to significant performance improvements during deployment. More specifically, in Ref. 152, the authors proposed an auxiliary-task-based approach that can estimate the initialization variance for each layer. To this end, they introduce an additional scale factor ai for each layer,
z_i = |a_i| W_i x + b_i,    (26)
where |·| denotes the absolute value operator. Assuming that the weights are initialized by drawing from a Gaussian distribution with zero mean and variance σ², denoted by N(0, σ²), altering the scaling factors results in adjusting the initialization variance of each layer. Moreover, in order to optimize the scaling factor ai, an auxiliary linear classification layer is required, W_i^{class} ∈ ℝ^{m(i)×N_C}, where N_C is the number of classes (for a multi-class classification task) or the number of values to regress (for a regression problem). In this way, ai and W_i^{class} are the terms that need to be optimized, while the actual weights and biases of the network are kept fixed. Then, the output of the classification layer can be directly used to perform the task at hand, i.e., either classification or regression. Additionally, an extra regularization penalty term is added, denoted by Ω(ai), in an effort to penalize scaling factors that lead to saturation of the activation function. Specifically, after forward passing through the linear term z_i = |a_i| W_i x + b_i of Eq. (26), we calculate Ω(ai) as follows:
(27)
where l and u are the lower and upper bounds of the activation region, n and m are the fan-in and fan-out, respectively, and max{·} denotes the maximum element in the set. The auxiliary objective J̃(W_i^{class}, a_i; X, y) is formulated as
(28)
Finally, the scaling factor ai and the classification layer weights are optimized using gradient descent. After the optimization has been completed, the weights of the ith layer can be re-initialized using the optimized scaling factor ai. All layers of the network, from input to output, are iteratively initialized with the aforementioned procedure. This initialization scheme considers the modeled noise sources and can appropriately adjust the variances accordingly. After this process has been completed, the model is ready to be trained using regular back-propagation. The ability of neural networks to compensate for such phenomena by taking them into account during training has also been used to decompose different noise sources and introduce them during the training process. Indeed, such approaches have also been used successfully to handle the limited bandwidth and extinction ratio.141
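The sketch below conveys the spirit of this auxiliary-task initialization for a single layer: with the layer weights frozen, the scale factor ai of Eq. (26) is adjusted so that the layer responses stay within the activation region [l, u]. The hinge-style penalty used here is only a simple stand-in for the exact regularizer Ω(ai) of Eq. (27), and the auxiliary classification layer is omitted:

```python
import numpy as np

def init_scale_for_layer(W, b, X, l=-1.0, u=1.0, lr=0.01, steps=200):
    """Tune a scalar scale factor a so that z = |a| W x + b (Eq. (26)) avoids the
    saturated regions of the activation; W and b stay frozen throughout."""
    a, S = 1.0, X @ W.T                                    # S: unscaled responses
    for _ in range(steps):
        Z = np.abs(a) * S + b                              # forward pass, Eq. (26)
        # subgradient of a hinge penalty on responses outside [l, u]
        g = ((Z > u).astype(float) - (Z < l).astype(float)) * S
        a -= lr * np.sign(a) * g.sum() / len(X)            # shrink a if saturated
    return a

W, b = 2.0 * np.random.randn(16, 8), np.zeros(16)
X = np.random.randn(64, 8)
print(init_scale_for_layer(W, b, X))  # the layer is then re-initialized with |a| * W
```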

Other operations, such as ADC and DAC conversions, have also been shown to affect the accuracy of photonic neural networks. However, considering these phenomena during training results in more robust representations and, in turn, higher performance during deployment. More specifically, photonic computing includes the employment of DAC and ADC conversions along with the parameter encoding, amplification, and processing devices, such as modulators, PDs, and amplifiers, which inevitably introduce degradation of the analog precision during inference, as each constituent introduces a relevant noise source that impacts the electro-optic link's bit resolution properties. Thus, the introduced noise increases when higher line rates are applied, translating to lower bit resolution. Furthermore, being able to operate lower-precision networks during deployment can further improve the potential of analog computing by increasing the computational rate of the developed accelerators while keeping energy consumption low.53,153

Typically, the degradation introduced to analog precision can be simulated through a quantization process that converts a continuous signal to a discrete one by mapping its continuous set to a finite set of discrete values. This can be achieved by rounding and truncating the values of the input signal. Despite the fact that quantization techniques are widely studied by the DL community,153–155 they generally target large CNNs containing a large number of surplus parameters with a minor contribution to the overall performance of the model.156,157 Furthermore, existing works in the DL community focus mainly on partially quantized models that ignore input and bias.154,158 These limitations, which are further exaggerated when high-slope photonic activations are used, dictate the use of different training paradigms that take into account the actual physical implementation.31 Indeed, neuromorphic photonics impose new challenges on the quantization of the DL model, requiring the appropriate adaptation of existing methodologies to the unique limitations of photonic substrates. Furthermore, the quantization scheme applied in neuromorphic photonics typically follows a very simple uniform quantization.57,159 This differs from the approaches traditionally used in trainable quantization schemes for DL models160 as well as mixed precision quantization.161 

To this end, several proposed approaches deal with the limited precision requirements before models are deployed to hardware. Some approaches calibrate networks to the limited precision requirements after training the models, namely, post-training quantization methods, offering improvements in contrast to applying the model directly to hardware without taking into account the limited precision components.161 Other approaches take into account the limited precision requirements during training, namely, quantization-aware training methods.161,162 The latter methods significantly exceed the performance of post-training approaches, eliminating or restricting the performance degradation between the full precision and limited precision models.162

The authors in Ref. 142 proposed an activation-agnostic, photonic-compliant, and quantization-aware training framework that does not require additional modifications of the hardware during inference, significantly improving model performance at lower bit resolutions. More specifically, they proposed to train the networks with quantized parameters by applying uniform quantization to all parameters involved during the forward pass; consequently, the quantization error is accumulated, propagated through the network to the output, and affects the employed loss function. In this way, the network is adjusted to lower-precision signals, making it more robust to reduced bit resolution during inference and significantly improving the model performance. To this end, every signal involved in the response of the i-th layer is first quantized in a specific floating range [h_min^i, h_max^i] ⊂ ℝ. Then, during the forward pass of the network, quantization error is injected to simulate the effect of rounding, while during backpropagation, the rounding is ignored and approximated with an identity function. A comprehensive mathematical analysis regarding quantization-aware training can be found in Refs. 161 and 162. Finally, more advanced approaches targeting novel dynamic precision architectures41,163 propose stochastic approaches to gradually reduce the precision of layers within a model, exploiting their position and tolerance to noise, based on theoretical indications and empirical evidence.164 More specifically, the stochastic mixed-precision quantization-aware training scheme proposed in Ref. 164 adjusts the bit resolutions among layers in a mixed-precision manner, based on the observed bit resolution distribution of the applied architectures and configurations. In this way, the authors are able to significantly reduce the inference execution times of the deployed NN.41
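A compact sketch of the forward-pass uniform quantization and the identity (straight-through) backward approximation is given below; the clipping range and bit width are illustrative:

```python
import numpy as np

def fake_quantize(x, bits, lo, hi):
    """Forward-pass uniform quantizer used in quantization-aware training: the
    signal is clipped to [lo, hi] and rounded onto 2**bits - 1 uniform steps."""
    step = (hi - lo) / (2 ** bits - 1)
    return lo + np.round((np.clip(x, lo, hi) - lo) / step) * step

def ste_grad(upstream_grad, x, lo, hi):
    """Straight-through estimator: during backpropagation the rounding is treated
    as an identity function, so the gradient passes through unchanged and is
    zeroed only outside the clipping range."""
    return upstream_grad * ((x >= lo) & (x <= hi))

x = np.linspace(-1.5, 1.5, 7)
print(fake_quantize(x, bits=3, lo=-1.0, hi=1.0))
print(ste_grad(np.ones_like(x), x, -1.0, 1.0))
```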

Applying the aforementioned methods allows us to employ PNNs where high compute rates with minimum energy consumption are required, extending DL techniques to a whole new spectrum of applications. Such applications include network monitoring and optical signal transmission, where the high compute rates limit the applicability of existing accelerators. For example, neuromorphic photonics are capable of operating at very high frequencies and can be integrated on the backplane pipeline of a modern high-end switch, which makes them an excellent choice for challenging Distributed Denial of Service (DDoS) attack detection applications, where high-speed and low-energy inference is required. More specifically, the authors in Refs. 37 and 165 build on the concept of a neuromorphic lookaside accelerator, targeting to perform real-time traffic inspection, searching for DDoS attack patterns during the reconnaissance attack phase, when the attacker tries to determine critical information about the target's configuration. Before deploying a DDoS attack, a port scanning procedure is performed to track open ports on a target machine. During this procedure, port scanning tools, such as Nmap, create synthetic traffic that can be captured and analyzed by the proposed network, which has to process huge amounts of packets at the computation rates of modern high-end switches.

Another domain that can potentially benefit from neuromorphic hardware is communications. Over the recent years, there has been an increasing interest in employing DL in the communication domain,166 ranging from wireless167 to optical fiber communications,42 exploiting the robustness of ANNs to noise, especially when it is taken into account during the training process. Such approaches design the communication system by carrying out the optimization in a single end-to-end process, including the transmitter, receiver, and communication channel, with the ultimate goal of achieving optimal end-to-end performance by acquiring a robust representation of the input message.42,168 In this direction, Ref. 42 introduced an end-to-end deep learning fiber communication transceiver design, emphasizing training that examines all-optical activation schemes and the respective limitations present in realistic demonstrations. The authors applied the data-driven noise-aware initialization method169 that is capable of initializing PNNs by taking into account the actual data distribution, the noise sources, as well as the unique nature of photonic activation functions. They focused on training photonic architectures that employ all-optical activation schemes130 by simulating their given transfer functions. This allows for reducing the effect of vanishing gradient phenomena as well as improving the ability of networks coupled with communication systems to withstand noise, e.g., due to the optical transmission link. As experimentally demonstrated, this method is significantly tolerant to the degradation that occurs when easily saturated photonic activations are employed and significantly improves the signal reconstruction of the all-optical intensity modulation/direct detection (IM/DD) system.

Conventional electronic computing architectures face many challenges due to the rapid growth of compute power, driven by the rise of AI and DNNs, calling for a new hardware computing paradigm that can overcome these limitations and sustain this ceaseless compute expansion. In this Tutorial, prompted by the ever-increasing maturity of silicon photonics, we presented the feasibility of PNNs and their potential embodiment in future DL environments. First, we discussed the essential concepts and criteria for NN hardware, examining the fundamental components of NNs and their core mathematical operations. Then, we investigated the interdependence of analog bit precision and energy efficiency in photonic circuits, highlighting the benefits and challenges of PNNs over conventional approaches. Moreover, we reviewed the state-of-the-art PNN architectures, analyzing their perspectives with respect to MVM operation execution, weight technology selection, and activation function implementation. Finally, the recently introduced optics-informed DL training framework was presented, which comprises a novel software-hardware NN co-design approach that aims to significantly improve the NN accuracy performance by incorporating the photonic hardware idiosyncrasy into NN training.

The work was, in part, funded by the EU Horizon projects PlasmoniAC (Grant No. 871391), SIPHO-G (Grant No. 101017194), and Gatepost (Grant No. 101120938).

The authors have no conflicts to disclose.

A.T. and M.M.-P. contributed equally to this work.

Apostolos Tsakyridis: Conceptualization (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Writing – original draft (equal). Miltiadis Moralis-Pegios: Conceptualization (lead); Investigation (equal); Methodology (equal); Validation (equal); Writing – original draft (equal). George Giamougiannis: Conceptualization (equal); Methodology (equal). Manos Kirtas: Data curation (equal); Investigation (equal); Software (equal). Nikolaos Passalis: Data curation (equal); Investigation (equal); Software (equal). Anastasios Tefas: Conceptualization (equal); Methodology (equal); Supervision (equal). Nikos Pleros: Conceptualization (equal); Methodology (equal); Supervision (equal); Writing – review & editing (equal).

The data that support the findings of this study are available from the corresponding author upon reasonable request.

In order to get a meaningful sense of the relationship between digital-to-analog precision loss and input optical power in a photonic matrix multiplier, we begin by assuming a linear PAM-M optical signal or, equivalently, a signal with a bit resolution B = log2 M, featuring infinite ER and an average optical power of Pavg. When receiving the signal in a thermal-noise-dominated optical link, we can evaluate its quality using the Q factor of the outer eye diagram of the PAM-M modulation, which can be expressed through
Q = (P_1 - P_m)/(2\sigma_t),    (A1)
where P1 is the optical signal's peak power, Pm is the optical power of the signal's penultimate level, and σt is the standard deviation of the thermal noise. The optical power of the penultimate level of a linear PAM-M signal can be calculated by subtracting the spacing between adjacent PAM-M levels from the peak power,
P_m = P_1 - (P_1 - P_0)/(M - 1),    (A2)
with P0 = 0 for an infinite ER signal. Replacing (A2) into (A1),
Q = P_1/(2(M - 1)\sigma_t).    (A3)
From Eq. (A3), we can deduce that, as the bit resolution B = log2 M increases, maintaining the same signal quality requires increasing the optical power of the receiver's input signal.
In the case of the analog matrix multiplier, assuming a loss-less weight matrix implementation, N input signals (PAM-M), and only positive weight values, the optical peak power of the signal emerging at the output can be calculated through
P_{out} = \sum_{i=1}^{N} P_i,    (A4)
where Pi is the optical peak power of each constituent signal. When the signals have the same optical peak power Pi, we can transform (A4) to
P_{out} = N P_i.    (A5)
Moreover, for an N × N optical matrix multiplier, the input optical signal will experience loss from the front-end optical splitter, which can be calculated from
S_{loss} = 10\log_{10}(N) + S_{e\_loss},    (A6)
where Se_loss is the excess loss of the splitter, which for a cascaded-tree MMI layout can be calculated through Se_loss = MMIloss × log2 N. As such, Eq. (A5) can be rewritten as
(A7)
Assuming a silicon photonic implementation and some state-of-the-art values for the MMI loss, i.e., MMIloss = 0.06 dB, or MMIloss = 0.014 in natural units,170 we can deduce that the aforementioned equation can be simplified, without significant accuracy loss, into
(A8)
as log2 N × 0.014 ≪ N.
Based on the above, we will highlight two distinct cases. In the first, which we refer to as full digital precision, aprec = 1, we increase the optical power of the input laser by a factor equal to the beam splitter loss, i.e., N, essentially compensating the optical loss of the splitter for each contributing beam, such that
(A9)
This safeguards that the minimum optical power difference between adjacent bits (MOPB) remains constant, such that
(A10)
Using exemplary values in the above equation, we can see that the equivalent bit resolution of the output signal is significantly higher than that of the input signal, e.g., for N = 4 and M = 2, Bout ≈ 3.7.
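The quoted value can be verified numerically, assuming the level-counting formula B_out = log2(N × (2^M − 1) + 1) stated in the summary below:

```python
import math

def output_bit_resolution(N: int, M: int) -> float:
    """Equivalent output bit resolution when N PAM signals are summed optically,
    following B_out = log2(N * (2**M - 1) + 1)."""
    return math.log2(N * (2 ** M - 1) + 1)

print(round(output_bit_resolution(N=4, M=2), 1))  # 3.7, matching the text
```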
Another interesting operational regime is defined as aprec = N, where we keep the output equivalent bit precision Bout the same as the input signal resolution B, i.e., Bout = B. In this case, the signal quality at the output, defined through the Q factor, remains the same as that of the input signal Pi,
(A11)
From the above equation, we can deduce that when we maintain the same equivalent bit resolution at the output of the matrix multiplier, we do not need to compensate for the splitter loss, as we are effectively trading bit resolution for reduced laser power. Comparing to the full digital precision case, we have
P_{laser}^{a_N} = P_{laser}^{a_1}/N,    (A12)
where P_laser^{a_N} and P_laser^{a_1} are the required optical powers for the same input–output bit resolution (aprec = N) and for full digital precision (aprec = 1), respectively.

Summing up, we defined aprec as the analog–digital precision and illustrated two operational regimes:

  • aprec = 1, where we increase the output power of the laser source to compensate for the splitting loss by a factor of N. In this case, the output bit resolution reaches B_out = \log_2(N \times (2^M - 1) + 1), which for integer-only values of bit resolution can be simplified to B_out = B + \log_2 N, and

  • aprec = N, where we keep the same bit precision at both the input and the output, trading the decreased bit precision, as opposed to the full digital precision case, for an input laser power lower by a factor of aprec = N.

1.
J.
von Neumann
, “
First draft of a report on the EDVAC
,”
IEEE Ann. Hist. Comput.
15
(
4
),
27
75
(
1993
).
2.
J.
Backus
, “
Can programming be liberated from the von Neumann style?: A functional style and its algebra of programs
,”
Commun. ACM
21
(
8
),
613
641
(
1978
).
3.
S. R.
Agrawal
et al
, “
A many-core architecture for in-memory data processing
,” in
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-50 ’17)
(
ACM digital library
,
2017
), pp.
245
258
.
4.
Y.
Arimoto
and
H.
Ishiwara
, “
Current status of ferroelectric random-access memory
,”
MRS Bull.
29
(
11
),
823
828
(
2004
).
5.
A.
Tsakyridis
et al
, “
10 Gb/s optical random access memory (RAM) cell
,”
Opt. Lett.
44
,
1821
1824
(
2019
).
6.
C.
Pappas
,
T.
Moschos
,
T.
Alexoudi
,
C.
Vagionas
, and
N.
Pleros
, “
16-bit (4 × 4) optical random access memory (RAM) bank
,”
J. Lightwave Technol.
41
,
949
956
(
2023
).
7.
C.
Pappas
et al
, “
Caching with light: A 16-bit capacity optical cache memory prototype
,”
IEEE J. Sel. Top. Quantum Electron.
29
(
2
),
6100911
(
2023
).
8.
T.
Alexoudi
,
G. T.
Kanellos
, and
N.
Pleros
, “
Optical RAM and integrated optical memories: A survey
,”
Light: Sci. Appl.
9
,
91
(
2020
).
9.
H.
Han
,
T.
Alexoudi
,
C.
Vagionas
,
N.
Pleros
, and
N.
Hardavellas
, “
A practical shared optical cache with hybrid MWSR/R-SWMR NoC for multicore processors
,”
J. Emerging Technol. Comput. Syst.
18
,
76
(
2022
).
10.
Y.
Chen
,
Y.
Xie
,
L.
Song
,
F.
Chen
, and
T.
Tang
, “
A survey of accelerator architectures for deep neural networks
,”
Engineering
6
(
3
),
264
274
(
2020
).
11.
I.
Boybat
,
M.
Le Gallo
,
S. R.
Nandakumar
et al
, “
Neuromorphic computing with multi-memristive synapses
,”
Nat. Commun.
9
,
2514
(
2018
).
12.
A.
Sebastian
,
M.
Le Gallo
,
R.
Khaddam-Aljameh
, and
E.
Eleftheriou
, “
Memory devices and applications for in-memory computing
,”
Nat. Nanotechnol.
15
,
529
544
(
2020
).
13.
Mythic, Taking Powerful, Efficient Inference to the Edge Paradigms Seems, Capable of Stimulating Additional Advances Shaping Future Digital Computing Roadmaps, Mythic https://mythic.ai/wp-content/uploads/2022/02/MythicWhitepaper-2019oct31.pdf.
14.
D.
Ielmini
and
H. S. P.
Wong
, “
In-memory computing with resistive switching devices
,”
Nat. Electron.
1
,
333
343
(
2018
).
15.
M.
Le Gallo
,
R.
Khaddam-Aljameh
,
M.
Stanisavljevic
et al
, “
A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference
,”
Nat. Electron.
6
,
680
693
(
2023
).
16.
H.
Mostafa
,
L. K.
Müller
, and
G.
Indiveri
, “
An event-based architecture for solving constraint satisfaction problems
,”
Nat. Commun.
6
,
8941
(
2015
).
17.
A.
Amir
et al
, “
A low power, fully event-based gesture recognition system
,” in
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(
IEEE
,
2017
), pp.
7388
7397
.
18.
C. D.
Schuman
,
S. R.
Kulkarni
,
M.
Parsa
et al
, “
Opportunities for neuromorphic computing algorithms and applications
,”
Nat. Comput. Sci.
2
,
10
19
(
2022
).
19.
T. N.
Theis
and
H.-S. P.
Wong
, “
The end of Moore’s law: A new beginning for information technology
,”
Comput. Sci. Eng.
19
(
2
),
41
50
(
2017
).
20.
D. A. B.
Miller
, “
Waves, modes, communications, and optics: A tutorial
,”
Adv. Opt. Photonics
11
,
679
825
(
2019
).
21.
Y.
Bai
et al
, “
Photonic multiplexing techniques for neuromorphic computing
,”
Nanophotonics
12
(
5
),
795
817
(
2023
).
22.
N.
Margalit
et al
, “
Perspective on the future of silicon photonics and electronics
,”
Appl. Phys. Lett.
118
,
220501
(
2021
).
23.
J.
Sevilla
et al
, “
Compute trends across three eras of machine learning
,” in
2022 International Joint Conference on Neural Networks (IJCNN)
(
IEEE
,
Padua, Italy
,
2022
), pp.
1
8
.
24.
G.
Giamougiannis
et al
, “
Universal linear optics revisited: New perspectives for neuromorphic computing with silicon photonics
,”
IEEE J. Sel. Top. Quantum Electron.
29
(
2
),
6200116
(
2023
).
25.
A.
Tsakyridis
et al
, “
Universal linear optics for ultra-fast neuromorphic silicon photonics towards Fj/MAC and TMAC/sec/mm2 engines
,”
IEEE J. Sel. Top. Quantum Electron.
28
(
6
),
8300815
(
2022
).
26.
J.
Feldmann
,
N.
Youngblood
,
M.
Karpov
et al
, “
Parallel convolutional processing using an integrated photonic tensor core
,”
Nature
589
,
52
58
(
2021
).
27.
A. N.
Tait
,
T. F.
de Lima
,
E.
Zhou
et al
, “
Neuromorphic photonic networks using silicon photonic weight banks
,”
Sci. Rep.
7
,
7430
(
2017
).
28.
B.
Shi
,
N.
Calabretta
, and
R.
Stabile
, “
Deep neural network through an InP SOA-based photonic integrated cross-connect
,”
IEEE J. Sel. Top. Quantum Electron.
26
(
1
),
7701111
(
2020
).
29.
Y.
Shen
et al
, “
Deep learning with coherent nanophotonic circuits
,”
Nat. Photonics
11
(
7
),
441
446
(
2017
).
30.
H.
Zhang
et al
, “
An optical neural chip for implementing complex-valued neural network
,”
Nat. Commun.
12
,
457
(
2021
).
31.
G.
Mourgias-Alexandris
,
M.
Moralis-Pegios
,
A.
Tsakyridis
et al
, “
Noise-resilient and high-speed deep learning with coherent silicon photonics
,”
Nat. Commun.
13
,
5572
(
2022
).
32.
F.
Ashtiani
,
A. J.
Geers
, and
F.
Aflatouni
, “
An on-chip photonic deep neural network for image classification
,”
Nature
606
,
501
506
(
2022
).
33.
H.
Zhou
et al
, “
Photonic matrix multiplication lights up photonic accelerator and beyond
,”
Light: Sci. Appl.
11
,
30
(
2022
).
34.
W.
Zhu
,
L.
Zhang
,
Y.
Lu
,
P.
Zhou
, and
L.
Yang
, “
Design and experimental verification for optical module of optical vector–matrix multiplier
,”
Appl. Opt.
52
,
4412
4418
(
2013
).
35.
F.
Shokraneh
et al
, “
A single layer neural network implemented by a 4 × 4 MZI-based optical processor
,”
IEEE Photonics J.
11
(
6
),
4501612
(
2019
).
36.
G.
Giamougiannis
et al
, “
Silicon-integrated Coherent Neurons with 32GMAC/sec/axon compute line-rates using EAM-based input and weighting cells
,” in
European Conference on Optical Communication (ECOC)
(
IEEE
,
2021
).
37.
A.
Tsakyridis
et al
, “
DDOS attack identification via a silicon photonic deep neural network with 50 GHz input and weight update
,” in
2023 Optical Fiber Communications Conference and Exhibition (OFC)
(
IEEE
,
San Diego, CA
,
2023
), pp.
1
3
.
38.
G.
Giamougiannis
et al
, “
Neuromorphic silicon photonics with 50 GHz tiled matrix multiplication for deep-learning applications
,”
Adv. Photonics
5
(
1
),
016004
(
2023
).
39.
C.
Huang
,
S.
Fujisawa
et al
, “
A silicon photonic–electronic neural network for fibre nonlinearity compensation
,”
Nat. Electron.
4
,
837
844
(
2021
).
40.
T. F.
de Lima
et al
, “
Machine learning with neuromorphic photonics
,”
J. Lightwave Technol.
37
,
1515
1534
(
2019
).
41.
G.
Giamougiannis
et al,
Analog nanophotonic computing going practical: Silicon photonic deep learning engines for tiled optical matrix multiplication with dynamic precision
,”
Nanophotonics
12
(
5
),
963
(
2023
).
42.
I.
Roumpos
et al
, “
High-performance end-to-end deep learning IM/DD link using optics-informed neural networks
,”
Opt. Express
31
(
12
),
20068
(
2023
).
43.
A. A.
Chien
,
L.
Lin
,
H.
Nguyen
,
V.
Rao
,
T.
Sharma
, and
R.
Wijayawardana
, “
Reducing the carbon impact of generative AI inference (today and in 2035)
,” in
Proceedings of the 2nd Workshop on Sustainable Computer Systems
,
2023
.
44.
A.
Sludds
et al
, “
Delocalized photonic deep learning on the internet’s edge
,”
Science
378
,
270
276
(
2022
).
45.
W. S.
McCulloch
and
W.
Pitts
, “
A logical calculus of the ideas immanent in nervous activity
,”
Bull. Math. Biophys.
5
,
115
133
(
1943
).
46.
C.
Nwankpa
, et al
, “
Activation functions: Comparison of trends in practice and research for deep learning
,” arXiv:1811.03378 (
2018
).
47.
V.
Bangari
,
B. A.
Marquez
,
H. B.
Miller
,
A. N.
Tait
,
M. A.
Nahmias
,
T.
Ferreira de Lima
,
H.-T.
Peng
,
P. R.
Prucnal
, and
B. J.
Shastri
, “
Digital Electronics and Analog Photonics for Convolutional Neural Networks (DEAP-CNNs)
,”
IEEE J. Quantum Electron.
26
(
1
),
7701213
(
2020
).
48.
A.
Bogris
,
C.
Mesaritakis
,
S.
Deligiannidis
, and
P.
Li
, “
Fabry-perot lasers as enablers for parallel reservoir computing
,”
IEEE J. Sel. Top. Quantum Electron.
27
(
2
),
7500307
(
2021
).
49.
J.
Hasler
and
B.
Marr
, “
Finding a roadmap to achieve large neuromorphic hardware systems
,”
Front. Neurosci.
7
,
118
(
2013
).
50.
Mythic, Taking Powerful, Efficient Inference to the Edge Paradigms Seems, Capable of Stimulating Additional Advances Shaping Future Digital Computing Roadmaps, Mythic https://mythic.ai/wp-content/uploads/2022/02/MythicWhitepaper-2019oct31.pdf.
51. C. Li et al., “Analogue signal and image processing with large memristor crossbars,” Nat. Electron. 1(1), 52–59 (2017).
52. K. Saito, M. Aono, and S. Kasai, “Amoeba-inspired analog electronic computing system integrating resistance crossbar for solving the travelling salesman problem,” Sci. Rep. 10, 20772 (2020).
53. R. Sarpeshkar, “Analog versus digital: Extrapolating from electronics to neurobiology,” Neural Comput. 10(7), 1601–1638 (1998).
54. Intel, Intel® High Level Synthesis Compiler Pro Edition: Best Practices Guide, available at https://www.intel.com/content/www/us/en/docs/programmable/683152/21-4/maximum-frequency-fmax.html; accessed 03 October 2023.
55. M. A. Nahmias, T. F. de Lima, A. N. Tait, H.-T. Peng, B. J. Shastri, and P. R. Prucnal, “Photonic multiply-accumulate operations for neural networks,” IEEE J. Sel. Top. Quantum Electron. 26(1), 7701518 (2020).
56. M. A. Al-Qadasi, L. Chrostowski, B. J. Shastri, and S. Shekhar, “Scaling up silicon photonic-based accelerators: Challenges and opportunities,” APL Photonics 7(2), 020902 (2022).
57. S. Garg et al., “Dynamic precision analog computing for neural networks,” arXiv:2102.06365v1.
58. S. Bandyopadhyay et al., “Single chip photonic deep neural network with accelerated training,” arXiv:2208.01623 (2022).
59. S. Gudaparthi, S. Narayanan, R. Balasubramonian, E. Giacomin, H. Kambalasubramanyam, and P.-E. Gaillardon, “Wire-aware architecture and dataflow for CNN accelerators,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’52) (IEEE, 2019).
60. R. Landauer, “Irreversibility and heat generation in the computing process,” IBM J. Res. Dev. 5, 183–191 (1961).
61. R. Hamerly, L. Bernstein, A. Sludds, M. Soljačić, and D. Englund, “Large-scale optical neural networks based on photoelectric multiplication,” Phys. Rev. X 9, 021032 (2019).
62. I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS, 2016), pp. 4107–4115; Y. Umuroglu et al., “FINN: A framework for fast, scalable binarized neural network inference,” in Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (ACM, 2017), pp. 65–74.
63. M. Moralis-Pegios et al., “Neuromorphic silicon photonics and hardware-aware deep learning for high-speed inference,” J. Lightwave Technol. 40(10), 3243–3254 (2022).
64. D. A. B. Miller, “Energy consumption in optical modulators for interconnects,” Opt. Express 20, A293–A308 (2012).
65. M. Pantouvaki et al., “Active components for 50 Gb/s NRZ-OOK optical interconnects in a silicon photonics platform,” J. Lightwave Technol. 35(4), 631–638 (2017).
66. A. Masood et al., “Comparison of heater architectures for thermal control of silicon photonic circuits,” in 10th International Conference on Group IV Photonics (IEEE, Seoul, Korea, 2013), pp. 83–84.
67. D. A. B. Miller, “Device requirements for optical interconnects to silicon chips,” Proc. IEEE 97(7), 1166–1185 (2009).
68. K. Nozaki et al., “Photonic-crystal nano-photodetector with ultrasmall capacitance for on-chip light-to-voltage conversion without an amplifier,” Optica 3(5), 483–492 (2016).
69. J. Proesel, C. Schow, and A. Rylyakov, “Ultra low power 10- to 25-Gb/s CMOS-driven VCSEL links,” in Optical Fiber Communication Conference (IEEE, 2012), paper OW4I.3.
70. D. A. B. Miller, “Attojoule optoelectronics for low-energy information processing and communications,” J. Lightwave Technol. 35(3), 346–396 (2017).
71. J. P. Epping et al., “Hybrid integrated silicon nitride lasers,” Proc. SPIE 11274, 112741L (2020).
72. D. A. B. Miller, “Self-configuring universal linear optical component [Invited],” Photonics Res. 1, 1–15 (2013).
73. F. D. Murnaghan, The Unitary and Rotation Groups (Spartan Books, 1962), Vol. 3, Chap. 2, p. 7.
74. M. Reck, A. Zeilinger, H. J. Bernstein, and P. Bertani, “Experimental realization of any discrete unitary operator,” Phys. Rev. Lett. 73, 58–61 (1994).
75. W. R. Clements, P. C. Humphreys, B. J. Metcalf, W. S. Kolthammer, and I. A. Walmsley, “Optimal design for universal multiport interferometers,” Optica 3, 1460–1465 (2016).
76. G. Giamougiannis et al., “A coherent photonic crossbar for scalable universal linear optics,” J. Lightwave Technol. 41(8), 2425–2442 (2023).
77. A. Tsakyridis, G. Giamougiannis, A. Totovic, and N. Pleros, “Fidelity restorable universal linear optics,” Adv. Photonics Res. 3, 2200001 (2022).
78. P. Dita, “Factorization of unitary matrices,” J. Phys. A: Math. Gen. 36, 2781–2789 (2003).
79. N. S. Lagali, M. R. Paiam, R. I. MacDonald, K. Worhoff, and A. Driessen, “Analysis of generalized Mach-Zehnder interferometers for variable-ratio power splitting and optimized switching,” J. Lightwave Technol. 17(12), 2542–2550 (1999).
80. S. Kovaios et al., “Generalized Mach-Zehnder interferometers integrated on Si3N4 waveguide platform,” IEEE J. Sel. Top. Quantum Electron. 29(6), 6101309 (2023).
81. Z. Zheng et al., “Hardware-software co-design of slimmed optical neural networks,” in Proceedings of the 24th Asia and South Pacific Design Automation Conference, 2019.
82. G. Mourgias-Alexandris et al., “Neuromorphic photonics with coherent linear neurons using dual-IQ modulation cells,” J. Lightwave Technol. 38(4), 811–819 (2020).
83. G. Giamougiannis, M. Moralis-Pegios, A. Tsakyridis, N. Bamiedakis, D. Lazovsky, and N. Pleros, “On-chip universal linear optics using a 4 × 4 silicon photonic coherent crossbar,” in OFC (IEEE, 2023), pp. 1–3.
84. M. Moralis-Pegios et al., “Perfect linear optics using silicon photonics,” arXiv:2306.17728 (2023).
85. A. Totovic et al., “WDM equipped universal linear optics for programmable neuromorphic photonic processors,” Neuromorph. Comput. Eng. (2022).
86. A. Totovic et al., “Programmable photonic neural networks combining WDM with coherent linear optics,” Sci. Rep. 12, 5605 (2022).
87. Z. Chen, A. Sludds, R. Davis et al., “Deep learning with coherent VCSEL neural networks,” Nat. Photonics 17, 723–730 (2023).
88. A. N. Tait et al., “Broadcast and weight: An integrated network for scalable photonic spike processing,” J. Lightwave Technol. 32, 4029–4041 (2014).
89. A. N. Tait et al., “Microring weight banks,” IEEE J. Sel. Top. Quantum Electron. 22(6), 312–315 (2016).
90. A. Tsakyridis et al., “Silicon photonic neuromorphic computing with 16 GHz input data and weight update line rates,” in CLEO (IEEE, 2022), pp. 1–2.
91. F. Brückerhoff-Plückelmann et al., “A large scale photonic matrix processor enabled by charge accumulation,” Nanophotonics 12(5), 819–825 (2023).
94. L. De Marinis et al., “A codesigned integrated photonic electronic neuron,” IEEE J. Quantum Electron. 58(5), 8100210 (2022).
95. T. B. Brown et al., “Language models are few-shot learners,” in Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS’20) (Curran Associates Inc., Red Hook, NY, 2020), pp. 1877–1901.
96. S. Smith et al., “Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model,” arXiv:2201.11990.
97. M. Wade et al., “Driving compute scale-out performance with optical I/O chiplets in advanced system-in-package platforms,” in 2023 IEEE Hot Chips 35 Symposium (HCS) (IEEE, 2023).
98. Q. Cheng et al., Optical Fiber Telecommunications VII (Academic Press, 2020), Chap. 18, pp. 785–825.
99. K. Shiflett et al., “Flumen: Dynamic processing in the photonic interconnect,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023.
100. M. R. Watts et al., “Very large scale integrated photonics (VLSI-P),” in CLEO: 2014 (OSA Technical Digest, 2014).
101. L. Ranno et al., “Integrated photonics packaging: Challenges and opportunities,” ACS Photonics 9(11), 3467–3485 (2022).
102. M. S. Nezami et al., “Packaging and interconnect considerations in neuromorphic photonic accelerators,” IEEE J. Sel. Top. Quantum Electron. 29(2), 6100311 (2023).
103. D. Caimi et al., “Heterogeneous integration of III–V materials by direct wafer bonding for high-performance electronics and optoelectronics,” IEEE Trans. Electron Devices 68(7), 3149–3156 (2021); B. Haq et al., “Micro-transfer-printed III-V-on-silicon C-band distributed feedback lasers,” Opt. Express 28(22), 32793–32801 (2020).
104. M. R. Billah et al., “Hybrid integration of silicon photonics circuits and InP lasers by photonic wire bonding,” Optica 5(7), 876–883 (2018).
105. F. Zanetto et al., “Time-multiplexed control of programmable silicon photonic circuits enabled by monolithic CMOS electronics,” Laser Photonics Rev. 17(11), 2300124 (2023).
106. D.-W. Kim et al., “3D system-on-packaging using through silicon via on SOI for high-speed optical interconnections with silicon photonics devices for application of 400 Gbps and beyond,” in ECTC (IEEE, San Diego, CA, 2018).
107. C. Sun et al., “Single-chip microprocessor that communicates directly using light,” Nature 528, 534–538 (2015).
108. B. Tossoun et al., “High-speed and energy-efficient non-volatile silicon photonic memory based on heterogeneously integrated memresonator,” arXiv:2303.05644.
109. J. Geler-Kremer, F. Eltes, P. Stark et al., “A ferroelectric multilevel non-volatile photonic phase shifter,” Nat. Photonics 16, 491–497 (2022).
110. X. Li et al., “Fast and reliable storage using a 5 bit, nonvolatile photonic memory cell,” Optica 6, 1–6 (2019).
111. Z. Lu, K. Murray, H. Jayatilleka, and L. Chrostowski, “Michelson interferometer thermo-optic switch on SOI with a 50-μW power consumption,” in IEEE Photonics Conference (IPC) (IEEE, 2016).
112. A. Ribeiro et al., “Demonstration of a 4 × 4-port universal linear circuit,” Optica 3, 1348–1357 (2016).
113. W. Zhang et al., “Silicon microring synapses enable photonic deep learning beyond 9-bit precision,” Optica 9, 579–584 (2022).
114. T. Grottke et al., “Optoelectromechanical phase shifter with low insertion loss and a 13π tuning range,” Opt. Express 29(4), 5525–5537 (2021).
115. N. Quack, H. Sattari, A. Y. Takabayashi, Y. Zhang, P. Verheyen, W. Bogaerts, P. Edinger, C. Errando-Herranz, and K. B. Gylfason, “MEMS-enabled silicon photonic integrated devices and circuits,” IEEE J. Quantum Electron. 56(1), 8400210 (2020).
116. R. Amin, R. Maiti, Y. Gui, C. Suer, M. Miscuglio, E. Heidari, R. T. Chen, H. Dalir, and V. J. Sorger, “Sub-wavelength GHz-fast broadband ITO Mach–Zehnder modulator on silicon photonics,” Optica 7(4), 333–335 (2020).
117. V. Sorianello, M. Midrio, G. Contestabile et al., “Graphene–silicon phase modulators with gigahertz bandwidth,” Nat. Photonics 12, 40–44 (2018).
118. W. M. Green, M. J. Rooks, L. Sekaric, and Y. A. Vlasov, “Ultra-compact, low RF power, 10 Gb/s silicon Mach-Zehnder modulator,” Opt. Express 15, 17106–17113 (2007).
119. K. Kawaguchi, “Deep learning without poor local minima,” in Advances in Neural Information Processing Systems (MIT Press, 2016), pp. 586–594.
120. X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” Proc. Mach. Learn. Res. 15, 315–323 (2011), available at https://proceedings.mlr.press/v15/glorot11a.html.
121. K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proceedings of the IEEE International Conference on Computer Vision (IEEE, 2015), pp. 1026–1034.
122. S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput. 9, 1735–1780 (1997).
123. J. K. George, H. Nejadriahi, and V. J. Sorger, “Towards on-chip optical FFTs for convolutional neural networks,” in 2017 IEEE International Conference on Rebooting Computing (ICRC) (IEEE, 2017), pp. 1–4.
124. J. George, A. Mehrabian, R. Amin, J. Meng, T. de Lima, A. Tait, B. Shastri, T. El-Ghazawi, P. Prucnal, and V. Sorger, “Neuromorphic photonics with electro-absorption modulators,” Opt. Express 27, 5181 (2019).
125. N. Passalis, G. Mourgias-Alexandris, A. Tsakyridis, N. Pleros, and A. Tefas, “Training deep photonic convolutional neural networks with sinusoidal activations,” IEEE Trans. Emerging Top. Comput. Intell. 5(3), 384–393 (2021).
126. A. Jha, C. Huang, and P. Prucnal, “Reconfigurable all-optical nonlinear activation functions for neuromorphic photonics,” Opt. Lett. 45, 4819 (2020).
127. A. Tait, T. Ferreira de Lima, M. Nahmias, H. Miller, H. Peng, B. Shastri, and P. Prucnal, “Silicon photonic modulator neuron,” Phys. Rev. Appl. 11, 064043 (2019).
128. I. Williamson, T. Hughes, M. Minkov, B. Bartlett, S. Pai, and S. Fan, “Reprogrammable electro-optic nonlinear activation functions for optical neural networks,” IEEE J. Sel. Top. Quantum Electron. 26, 7700412 (2020).
129. C. Pappas et al., “Programmable tanh-, ELU-, sigmoid-, and sin-based nonlinear activation functions for neuromorphic photonics,” IEEE J. Sel. Top. Quantum Electron. 29(6), 6101210 (2023).
130. G. Mourgias-Alexandris, A. Tsakyridis, N. Passalis, A. Tefas, K. Vyrsokinos, and N. Pleros, “An all-optical neuron with sigmoid activation function,” Opt. Express 27, 9620 (2019).
131. B. Shi, N. Calabretta, and R. Stabile, “InP photonic integrated multi-layer neural networks: Architecture and performance analysis,” APL Photonics 7, 010801 (2022).
132. A. Dejonckheere, F. Duport, A. Smerieri, L. Fang, J. Oudar, M. Haelterman, and S. Massar, “All-optical reservoir computer based on saturation of absorption,” Opt. Express 22, 10868 (2014).
133. A. E. Dehghanpour and S. Koohi, “All-optical recurrent neural network with reconfigurable activation function,” IEEE J. Sel. Top. Quantum Electron. 29(2), 7700114 (2023).
134. M. Miscuglio, A. Mehrabian, Z. Hu, S. Azzam, J. George, A. Kildishev, M. Pelton, and V. Sorger, “All-optical nonlinear activation function for photonic neural networks [Invited],” Opt. Mater. Express 8, 3851 (2018).
135. J. Feldmann, N. Youngblood, C. Wright, H. Bhaskaran, and W. Pernice, “All-optical spiking neurosynaptic networks with self-learning capabilities,” Nature 569, 208–214 (2019).
136. Z. Fu, Z. Wang, P. Bienstman, and R. Jiang, “Programmable low-power consumption all-optical nonlinear activation functions using a micro-ring resonator with phase-change materials,” Opt. Express 30, 44943 (2022).
137. B. Wu et al., “Low-threshold all-optical nonlinear activation function based on a Ge/Si hybrid structure in a microring resonator,” Opt. Mater. Express 12, 970–980 (2022).
138. G. H. Li, R. Sekine, R. Nehra, R. M. Gray, L. Ledezma, Q. Guo, and A. Marandi, “All-optical ultrafast ReLU function for energy-efficient nanophotonic deep learning,” Nanophotonics 12(5), 847–855 (2023).
139. T. Ferreira de Lima et al., “Noise analysis of photonic modulator neurons,” IEEE J. Sel. Top. Quantum Electron., 1–9 (2019).
140. S. K. Vadlamani, R. Hamerly, and D. Englund, “One-time training that transfers to arbitrary highly faulty optical neural networks,” in Frontiers in Optics + Laser Science 2022 (FIO, LS), Technical Digest Series (Optica Publishing Group, 2022), paper FTh1B.3.
141. G. Mourgias-Alexandris et al., “Channel response-aware photonic neural network accelerators for high-speed inference through bandwidth-limited optics,” Opt. Express 30, 10664–10671 (2022).
142. M. Kirtas et al., “Quantization-aware training for low precision photonic neural networks,” Neural Networks 155, 561–573 (2022).
143. B. Moeneclaey et al., “A 6-bit 56-GSa/s DAC in 55 nm SiGe BiCMOS,” in 2021 IEEE BiCMOS and Compound Semiconductor Integrated Circuits and Technology Symposium (BCICTS) (IEEE, 2021), p. 202.
144. A. Tsakyridis et al., “Theoretical and experimental analysis of burst-mode wavelength conversion via a differentially-biased SOA-MZI,” J. Lightwave Technol. 38(17), 4607–4617 (2020).
145. S. Pitris et al., “O-band silicon photonic transmitters for datacom and computercom interconnects,” J. Lightwave Technol. 37(19), 5140–5148 (2019).
146. G. Coudyzer, P. Ossieur, L. Breyne, M. Matters, J. Bauwelinck, and X. Yin, “A 50 Gbit/s PAM-4 linear burst-mode transimpedance amplifier,” IEEE Photonics Technol. Lett. 31(12), 951–954 (2019).
147. X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the International Conference on Artificial Intelligence and Statistics (JMLR Workshop, 2010), pp. 249–256.
148. R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in Proceedings of the International Conference on Machine Learning (ACM Digital Library, 2013), pp. 1310–1318.
149. M. Moralis-Pegios et al., “Photonic neuromorphic computing: Architectures, technologies, and training models,” in 2022 Optical Fiber Communications Conference and Exhibition (OFC) (IEEE, San Diego, CA, 2022), pp. 1–3.
150. K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proceedings of the IEEE International Conference on Computer Vision (IEEE, 2015), pp. 1026–1034.
151. N. Passalis, G. Mourgias-Alexandris, N. Pleros, and A. Tefas, “Adaptive initialization for recurrent photonic networks using sigmoidal activations,” in Proceedings of the IEEE International Symposium on Circuits and Systems (IEEE, 2020), pp. 1–5.
152. M. Kirtas, N. Passalis, G. Mourgias-Alexandris, G. Dabos, N. Pleros, and A. Tefas, “Learning photonic neural network initialization for noise-aware end-to-end fiber transmission,” in Proceedings of the European Signal Processing Conference (IEEE, 2022), pp. 1731–1735.
153. B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2018), pp. 2704–2713.
154. U. Kulkarni, M. Sm, S. V. Gurlahosur, and G. Bhogar, “Quantization friendly MobileNet (QF-MobileNet) architecture for vision based applications on embedded platforms,” Neural Networks 136, 28–39 (2021).
155. D. Lee, D. Wang, Y. Yang, L. Deng, G. Zhao, and G. Li, “QTTNet: Quantized tensor train neural networks for 3D object and video recognition,” Neural Networks 141, 420 (2021).