Photonic neural networks (PNNs) are gaining significant interest in the research community due to their potential for high parallelization, low latency, and energy efficiency. PNNs compute using light, which leads to several differences in implementation when compared to electronics, such as the need to represent input features in the photonic domain before feeding them into the network. In this encoding process, it is common to combine multiple features into a single input to reduce the number of inputs and associated devices, leading to smaller and more energy-efficient PNNs. Although this alters the network’s handling of input data, its impact on PNNs remains understudied. This paper addresses this open question, investigating the effect of commonly used encoding strategies that combine features on the performance and learning capabilities of PNNs. Here, using the concept of feature importance, we develop a mathematical methodology for analyzing feature combination. Through this methodology, we demonstrate that encoding multiple features together in a single input determines their relative importance, thus limiting the network’s ability to learn from the data. However, given some prior knowledge of the data, this can also be leveraged for higher accuracy. By selecting an optimal encoding method, we achieve up to a 12.3% improvement in the accuracy of PNNs trained on the Iris dataset compared to other encoding techniques, surpassing the performance of networks where features are not combined. These findings highlight the importance of carefully choosing the encoding for the accuracy and decision-making strategies of PNNs, particularly in size- or power-constrained applications.

## I. INTRODUCTION

Artificial Intelligence (AI) systems have gained widespread relevance in recent years,^{1} finding diverse applications ranging from image classification^{2} to speech recognition.^{3} These systems have traditionally been implemented in electronic hardware, benefiting from the steady performance improvements driven by the miniaturization of electronic integrated circuits. However, with components now shrinking to the atomic scale, the limitations of this platform become apparent.^{4} At this size, for example, quantum effects may disrupt functionality,^{5} and the heat from densely packed devices becomes hard to dissipate.^{6} In response, new technologies are being explored to enable further improvements in AI. These emerging technologies are often not subject to the same constraints as their electronic counterparts and thus might offer more efficient alternatives for certain applications.^{7}

Photonic neural networks (PNNs) are hardware implementations of AI systems that perform computations on optical signals, rather than on electronic ones. Using light, they are able to leverage several of its properties to potentially enable high parallelization, low latency, and reduced power consumption.^{8} For example, PNNs have been demonstrated to perform sub-nanosecond image classification^{9} and to achieve up to 10^{12} multiply-accumulate operations per second.^{10} However, transitioning from electronics to photonics remains challenging. Practical applications of medium- to large-scale systems are currently limited by the large physical footprint of photonic circuits,^{11,12} their loss accumulation, and the high power consumption of some of their electro-optic devices.^{13}

One way of alleviating these issues is by optimizing circuits,^{14,15} or carefully designing PNNs to minimize circuit size. A common practice found in the literature involves taking advantage of the complex representation of light (using amplitude and phase) to represent multiple features in a single input, thus combining multiple real-valued features into fewer complex-valued inputs. By using fewer inputs, a circuit requires fewer components and a smaller network, which leads to a reduction in the overall footprint. Such a technique aligns well with the capabilities of photonic circuits, which are able to process complex inputs through complex transformations.^{16}

However, the way we represent features in neural networks (NNs) greatly influences the difficulty of the problems they solve. For example, in tasks with radial symmetry centered around the origin, opting for a polar coordinate system can emphasize the relevant feature relationships necessary for accurately solving the task, short-cutting the network’s need to learn it. This approach can significantly reduce the computational complexity required for achieving high accuracy. Moreover, the choice of feature representation also shapes the network’s approaches to solve tasks, as NNs tend to rely on the most straightforward cues available within the data.^{17} This highlights the need to understand which feature relationships are emphasized by the representation strategies used in PNNs. By doing so, we can ensure that these networks not only achieve high accuracy but also adopt desirable decision-making strategies.

In this paper, we explore the role of feature representation in the accuracy and decision-making strategies of PNNs. We investigate the common practice of combining various features into a single input, using eXplainable AI (XAI) methods to compare the relative importance of the combined features. To our knowledge, only one work has investigated this practice as a means of improving accuracy in PNNs.^{18} However, the consequences of the feature combination itself are still unknown, and different feature representations were not explored. Our work tackles these open questions with a mathematical analysis of feature combination focused on photonic implementations, where networks and circuits are constrained by size. We point out how different data representations and hardware implementations can be exploited for higher accuracy and lower complexity, as well as the shortcomings of the current solutions.

The rest of this paper is structured as follows: in Secs. II and III, we review the basics of photonic implementations of AI and feature importance metrics. In Sec. IV, we calculate the relative importance of features that share the same input. Sections V and VI discuss practical examples and simulations of artificial neural networks (ANNs) and PNNs. Finally, Sec. VII concludes the discussions brought up in this paper.

## II. PHOTONIC NEURAL NETWORKS

In this section, we provide a review of ANNs and their photonic implementations. We also address common strategies of representing features in light, which will be used in our further discussions in Sec. IV.

### A. Artificial neural networks

Artificial neural networks (ANNs), first proposed in the 1940s,^{19} are mathematical functions loosely inspired by how the human brain processes information. These functions are known to be universal approximators,^{20} hence their ability to handle a wide variety of tasks. The network’s behavior, i.e., the way it processes inputs, is determined by its connection strengths (called “weights”) and non-linearities (referred to as “activation functions”).^{21} Typically, these parameters are obtained through training, approximating the ANN to a probability function associated with the given task. For instance, in classification tasks, ANNs are designed to assign a class to an input by approximating a function that calculates the likelihood of belonging to each class.^{22}

Consider a network with *L* layers, designed with *N* inputs and *N* outputs, as shown in Fig. 1(a). The process by which a given layer *l* transforms its inputs is described as

$$\vec{z}^{(l)} = \mathbf{W}^{(l)}\,\vec{y}^{(l-1)}, \tag{1}$$

$$\vec{y}^{(l)} = \sigma\big(\vec{z}^{(l)}\big). \tag{2}$$

Initially, inputs are combined through weighted sums by a weight matrix **W**^{(l)} to obtain $\vec{z}^{(l)}$. Then, the element-wise application of an activation function *σ*(·) to $\vec{z}^{(l)}$ introduces non-linearity and yields the output of the layer, where $\vec{y}^{(0)}$ is the input of the network and $\vec{y}^{(L)}$ the output. A bias $\vec{b}$ might be added before the activation function to allow the network to better adjust to the data. The entire network, from the first to the last layer, can be seen as a sequence of such transformations, written as $\vec{y}^{(L)} = f(\vec{y}^{(0)})$.
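As a concrete sketch, the layer transformation just described — a weighted sum by **W**^{(l)} followed by an element-wise activation — can be written in a few lines of NumPy (the shapes, depth, and tanh activation here are illustrative choices, not taken from the paper):

```python
import numpy as np

def layer(y_prev, W, b=None, sigma=np.tanh):
    """One ANN layer: weighted sum, optional bias, element-wise activation."""
    z = W @ y_prev
    if b is not None:
        z = z + b
    return sigma(z)

def network(y0, weights, sigma=np.tanh):
    """Chain L layers so that y^(L) = f(y^(0))."""
    y = y0
    for W in weights:
        y = layer(y, W, sigma=sigma)
    return y

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 4)) for _ in range(3)]  # L = 3 layers, N = 4
y_out = network(rng.normal(size=4), Ws)
print(y_out.shape)  # (4,)
```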

Thus, ANNs implement input–output mappings that can be either real or complex. Real-valued neural networks (RVNNs) are characterized by real parameters and inputs, with $f:\mathbb{R}^N \mapsto \mathbb{R}^N$. In these networks, each layer scales and combines inputs before non-linearly transforming them. Complex-valued neural networks (CVNNs), on the other hand, operate in the complex domain, meaning that both the input vector and the network’s parameters are complex-valued and $f:\mathbb{C}^N \mapsto \mathbb{C}^N$.^{23} In that case, each layer has the ability to not only scale and combine but also rotate inputs in the complex plane. This rotation, inherent to complex algebra, makes CVNNs more suitable for tasks where phase information is important, such as in audio processing^{24} or optical communications.^{25}
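The extra expressiveness of complex weights can be seen in isolation: a unit-modulus complex weight is a pure rotation in the complex plane, an operation no single real weight can perform. A minimal illustration with made-up values:

```python
import numpy as np

x = 1 + 0j                  # input on the real axis
w = np.exp(1j * np.pi / 2)  # complex weight of unit modulus: a pure rotation
y = w * x                   # input rotated by 90 degrees onto the imaginary axis
print(np.round(y, 10))      # (0+1j) up to rounding
```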

### B. Photonic implementations

Photonic computing is emerging as a promising approach to improve ANN implementations for specific applications by computing with light. This allows us to leverage its unique characteristics to potentially enable faster and more energy-efficient AI systems. For example, in the optical domain, linear transformations can be done passively^{26} and information can be easily parallelized and processed at high speeds.^{10}

PNNs are implementations of ANNs through photonic inputs, components, and transformations.^{27} Although no single photonic component acts as an artificial neuron, a circuit can be designed to perform the mathematical operations of an ANN. This is achieved by using several components, such as waveguides, interferometers, and modulators, which guide and manipulate light signals. These circuits operate on complex signals and implement complex transformations, meaning that PNNs can act as RVNNs and CVNNs, depending on the task at hand.

Several PNN circuits were suggested and demonstrated experimentally. They can be broadly categorized by how different inputs are distinguished, whether through spatial, wavelength, or time domains.^{28}

In this study, we focus on PNNs that use spatial differentiation of inputs. These networks assign a separate input to each optical signal and implement weight matrix multiplications by making different inputs interfere with each other. Most notably, this is achieved by using meshes of Mach–Zehnder interferometers (MZIs).^{26,29} The interference, and hence the specific mathematical operation performed by the mesh, can be selected by adjusting the phase shifters found in these devices. Activation functions, on the other hand, can be implemented by using any of the devices and circuits that exhibit optical non-linearity.^{30,31} The schematics of an ANN implementation and an MZI are shown in Figs. 1(b) and 1(c), respectively.

If PNNs use coherent light inputs, they can be represented in the complex domain. In these networks, the *i*th input is characterized by an amplitude *A*_{i} and phase *ϕ*_{i}. Thus, the input vector can be expressed as $\vec{y}^{(0)} = [A_1 e^{i\phi_1}, \ldots, A_N e^{i\phi_N}]^\intercal \in \mathbb{C}^N$. Given the two degrees of freedom available for each input, feature encoding can be achieved using various methodologies. We divide common approaches found in the literature into two distinct groups: real and complex encoding.

Real encoding simplifies the input representation by encoding data solely in the amplitude of the optical signals, maintaining a uniform initial phase across all inputs (in practice having *ϕ*_{i} = 0 ∀ *i* and thus $\vec{y}^{(0)} \in \mathbb{R}^N$). Several researchers employ this encoding method for its compatibility with RVNNs used in electronic computers.^{8,32} It allows for an easy mapping of weights from electronically trained networks to photonic transformations. In these networks, while the nature of the transformations of individual MZIs is inherently complex, the overall behavior can effectively be real-valued. Since no phase information is used, only the amplitude of the outputs is of interest, which simplifies the detection scheme. However, it is important to ensure that different inputs experience the same phase before reaching the network to maintain phase consistency, which might not be simple to achieve experimentally.

In contrast, complex encoding uses both amplitude and phase at the same time, having inputs that lie in the complex plane, that is, $\vec{y}^{(0)} \in \mathbb{C}^N$. The transformations in the PNN in this case are complex, and thus, detection of both intensity and phase in the outputs might be used, adding to the electronic complexity of the circuit. In image classification tasks, for example, real-valued input images can be transformed into a Fourier space representation to obtain phase and amplitude information,^{33–35} or have different sections mapped to the real and imaginary parts of complex numbers,^{18,36} which reduces by half the number of inputs.
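To make the two encoding groups concrete, the sketch below (our illustrative pairing, not the paper's exact pipeline) encodes four real features either as four amplitude-only inputs with uniform phase, or as two complex inputs that pair features into real and imaginary parts, halving the input count:

```python
import numpy as np

features = np.array([0.2, 0.7, 0.5, 0.1])

# Real encoding: amplitudes only, uniform phase (phi_i = 0 for all i)
y_real = features.astype(complex)                # 4 inputs, all on the real axis

# Complex encoding: pair features into real/imaginary parts -> half the inputs
y_complex = features[0::2] + 1j * features[1::2]  # 2 inputs

print(y_real.size, y_complex.size)  # 4 2
```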

The encoding choice for PNNs influences the network’s behavior, the type of information that is detected at the output, and the overall size of the circuit, as it may imply the use of additional peripheral devices. Beyond hardware specifications, this choice might also impact how features are processed within the network. When two features share the same input, the network may process them differently from the way they would be processed individually. Understanding these dynamics is crucial for optimizing the PNN performance.

## III. FEATURE IMPORTANCE

In this section, we look to the field of XAI for methods of evaluating feature importance in ANNs, to later study the impact of combining features in PNNs. We focus on gradient-based techniques, particularly sensitivity analysis.

ANNs, especially those with several layers, are highly non-linear models with numerous parameters. Their complexity often leads them to be regarded as opaque or “black-box” systems, since their decision-making processes are difficult to grasp intuitively. That is, while we can mathematically describe how a given output is obtained, it is difficult to explain “why” in an intuitive manner.

Nonetheless, being able to explain the decision-making strategies of a model has a number of practical applications. Clear explanations can, for instance, enhance our understanding of a problem or be used to demonstrate fair treatment. In photonics research, XAI is currently used to explain the inverse design of circuits^{37} or to aid in the description of physical models.^{38} The concept of “explainability” is still the subject of ongoing debate^{39,40} and, consequently, a variety of methods have been proposed to attain it.^{41,42} Highlighting which input features are considered important by an ANN is a common way to explain its outputs. Several methods estimate such feature importance, of which we emphasize sensitivity analysis.

Sensitivity analysis estimates the importance of an input through the gradient of the output with respect to that input.^{43–45} The underlying principle is that if small changes in an input lead to significant changes in the output, then that input is likely to be important for the network, i.e., it contributes to the prediction of this output. In such case, the importance of the *i*th input, $y_i^{(0)}$, to the *c*th output of the network, $y_c^{(L)}$, is denoted by $R_{i\to c}$,

$$R_{i\to c} = \left|\frac{\partial y_c^{(L)}}{\partial y_i^{(0)}}\right|. \tag{3}$$

Gradient-based explanations are frequently used in image classification tasks to generate saliency maps^{46} and also show fair performance in matching feature importance in simulated data.^{47} Over time, other methods built on sensitivity analysis, addressing some of its drawbacks by suggesting additional forms of estimating feature importance.^{48} For example, adding Gaussian noise to the input and averaging the resulting gradients helps in generating more consistent saliency maps.^{49} These techniques are often easy to implement, given that the necessary partial derivatives can be computed through back-propagation.
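As a sketch of the principle, the required gradients can also be approximated by central finite differences on any black-box model; here a toy linear function of our choosing stands in for a trained network:

```python
import numpy as np

def sensitivity(f, x, eps=1e-6):
    """R_i = |df/dx_i| estimated by central finite differences."""
    x = np.asarray(x, dtype=float)
    R = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        R[i] = abs((f(x + d) - f(x - d)) / (2 * eps))
    return R

# Toy model: output depends strongly on x0, weakly on x1
f = lambda x: 3.0 * x[0] + 0.1 * x[1]
print(sensitivity(f, [0.5, 0.5]))  # approximately [3.0, 0.1]
```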

## IV. ANALYTICAL DERIVATION OF RELATIVE IMPORTANCE

Here, we use the concepts elaborated in previous sections to investigate how the importance of features is shaped in PNNs. Initially, we employ the sensitivity analysis shown in Sec. III to obtain the importance of an arbitrary feature encoded in one input. Then, we introduce the concept of *encoding functions* to describe the different feature encoding processes and representations in photonics, given in Sec. II.

Consider a set of features $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}$. Assume that we want all the elements of $X$ to be used by our model. However, due to either a prohibitively large number of features or size restrictions on our network, we also wish to use fewer inputs than there are elements in $X$. To achieve both objectives, we combine features into complex inputs, as shown in Sec. II. In this context, we calculate the relative importance of such combined features, to understand what relationships are highlighted by our inputs.

Consider the importance of an arbitrary feature $x_j$ to the *c*th output of the network. Applying the chain rule to Eq. (3), we write this importance $R_{j\to c}$ as

$$R_{j\to c} = \left|\frac{\partial y_c^{(L)}}{\partial y_i^{(0)}}\,\frac{\partial y_i^{(0)}}{\partial x_j}\right|, \tag{4}$$

which depends on how $x_j$ is represented in the input. The process of creating an input from elements of $X$ is what we term feature encoding. An input $y_i^{(0)}$ obtained from the feature $x_j$ is hence written as $y_i^{(0)} = g_i(x_j)$, where $g_i$ is the encoding function for the *i*th input. Considering the encoding process as such, we can write

$$R_{j\to c} = \left|\frac{\partial y_c^{(L)}}{\partial y_i^{(0)}}\,\frac{\partial g_i(x_j)}{\partial x_j}\right|. \tag{5}$$

It should be noted how the importance depends on both the network, represented in the derivative from output to input, and the feature encoding process, given the presence of the encoding function. The modulus operation ensures that the importance is always positive and real-valued. Here, we assume the network to be derivable in the vicinity of the current input, which might not be the case for some CVNN architectures.

We now consider the case where two features, $x_j$ and $x_k$, are represented using a single input $y_i^{(0)}$, comparing their relevance. We define the relative importance between $x_j$ and $x_k$ to the *c*th output, $R_{j,k\to c}$, as the following ratio:

$$R_{j,k\to c} = \frac{R_{j\to c}}{R_{k\to c}} = \left|\frac{\partial g_i(x_j, x_k)/\partial x_j}{\partial g_i(x_j, x_k)/\partial x_k}\right|. \tag{6}$$

We see that the component of Eq. (4) related to the network is canceled, leaving only the derivatives of the encoding function. Thus, *R*_{j,k→c} is solely determined by the way features are encoded into *y*_{i}, and hence, it is independent of the considered output. To simplify the notation, we drop the subscript indicating the output for the rest of this paper. One of the consequences of Eq. (6) is that the encoding function chosen to combine *x*_{j} and *x*_{k} defines how these features are perceived by the model relative to one another.
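This independence from the network is easy to check numerically. In the sketch below (toy functions of our choosing, with a holomorphic map standing in for the complex-differentiable network the text assumes), the ratio of sensitivities to $x_j$ and $x_k$ reduces to the ratio of the encoding function's partial derivatives:

```python
import numpy as np

EPS = 1e-6

def num_grad(f, x, i):
    """Complex-valued derivative of f with respect to the real variable x[i]."""
    d = np.zeros(len(x))
    d[i] = EPS
    return (f(x + d) - f(x - d)) / (2 * EPS)

g = lambda x: x[0] * np.exp(1j * x[1])   # amplitude/phase encoding g(xj, xk)
net = lambda y: 2.3 * y + 0.7 * y ** 2   # arbitrary holomorphic "network"
model = lambda x: net(g(x))

x = np.array([0.4, 1.1])
R_jk = abs(num_grad(model, x, 0)) / abs(num_grad(model, x, 1))
print(round(R_jk, 4))  # 2.5 = 1/|xj|, independent of the network chosen
```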

Although encoding functions are a method of pre-processing features, in the context of PNNs, they can also be implemented in hardware. The incorporation of encoding functions in the circuit is particularly interesting for low-latency applications, as the speed at which inputs are transformed and combined would be limited only by the reconfigurability of the driving electronics. We now explore two types of complex encoding functions to see how they dictate relative feature importance. We also point out how they could be implemented in hardware.

### A. Exponential encoding

A straightforward way of combining $x_j$ and $x_k$ into a single input would be to encode $x_j$ in its amplitude and $x_k$ in its phase. This encoding function can be written as

$$g(x_j, x_k) = x_j e^{i x_k}. \tag{7}$$

In this case, the relative importance between the two features is dynamic, establishing an amplitude-dependent relation between the importance of amplitude and phase,

$$R_{j,k} = \left|\frac{\partial g/\partial x_j}{\partial g/\partial x_k}\right| = \frac{1}{|x_j|}. \tag{8}$$

A hardware version of an exponential encoding function is shown in Fig. 2(a), where a balanced MZI and a phase shifter are used to modulate the amplitude and phase of an input, respectively. The encoding and importance are not exactly the same as in Eq. (7), since this amplitude modulation scheme is mediated by a sine function: the circuit implements $g(x_j, x_k) = i\sin(x_j)e^{i x_k}$. Here, Eq. (7) can be achieved, up to a global phase shift, by mapping $x_j$ to $\arcsin(x_j)$.

### B. Linear encoding

A second natural choice is to encode the two features in the real and imaginary parts of the input, which we call linear encoding,

$$g(x_j, x_k) = x_j + i x_k. \tag{9}$$

In this case, the relative importance is constant,

$$R_{j,k} = 1, \tag{10}$$

so both features are weighted equally regardless of their values. An encoding function similar to that of Eq. (9), implemented in hardware, is shown in Fig. 2(b). There, two MZIs are used as amplitude modulators, while one of their outputs has its phase shifted by *π*/2 to encode the respective input in the imaginary axis. In that case, $g(x_j, x_k) = i[\sin(x_j) + i\sin(x_k)]$. Equation (9) can be achieved, up to a global phase shift, by mapping $x_j$ and $x_k$ to $\arcsin(x_j)$ and $\arcsin(x_k)$.
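The two encodings behave very differently under relative importance: taking the natural form $g(x_j, x_k) = x_j e^{i x_k}$ for amplitude/phase encoding, the ratio is the amplitude-dependent $1/|x_j|$, while for the real/imaginary form of Eq. (9) it is constant. A numerical check (assuming these idealized functional forms, before the sine-mediated hardware mapping):

```python
import numpy as np

def rel_importance(g, x, eps=1e-6):
    """R_{j,k} = |dg/dxj| / |dg/dxk| via central finite differences."""
    grads = []
    for i in range(2):
        d = np.zeros(2)
        d[i] = eps
        grads.append(abs((g(x + d) - g(x - d)) / (2 * eps)))
    return grads[0] / grads[1]

g_exp = lambda x: x[0] * np.exp(1j * x[1])  # amplitude/phase (exponential)
g_lin = lambda x: x[0] + 1j * x[1]          # real/imaginary (linear)

for xj in (0.5, 2.0):
    x = np.array([xj, 0.3])
    print(round(rel_importance(g_exp, x), 3),  # 1/|xj|: data-dependent
          round(rel_importance(g_lin, x), 3))  # 1.0: constant
```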

## V. ON THE IMPACT OF ENCODING FUNCTIONS TO ANNs

In this section, we address the practical implications of the discussions brought up in Sec. IV. Here, our objective is to demonstrate how a well-engineered encoding function can significantly improve the accuracy of an ANN on a test task. We begin by defining such a task and studying the relative feature importance found in a solution to it. Later, we create an encoding function that reproduces these importances on trained ANNs, finally comparing its use against others.

The task we define consists of classifying points as inside or outside an *n*-sphere. An *n*-sphere is the generalization of a circle to *n* + 1 dimensions, similar to how hyperplanes generalize planes. It is defined by a set of points *S*^{(n)} that are equidistant from a central point *c*_{0} = (*c*_{1}, …, *c*_{n+1}) by a radius *r*_{0}. The distance of a point *P* = (*x*_{1}, …, *x*_{n+1}) to *c*_{0} is

$$r_p = \sqrt{\sum_{i=1}^{n+1}(x_i - c_i)^2}. \tag{11}$$

*P* is considered outside of the *n*-sphere if *r*_{p} exceeds *r*_{0} and inside otherwise. In this context, a mathematical model that outputs the probability of *P* being outside of *S*^{(n)} can be constructed using a logistic function. The logistic function *σ*(*x*) = 1/(1 + *e*^{−x}) is bounded between 0 and 1 with a smooth sigmoid transition and is typically used in binary classification problems. Given the coordinates of *P*, this model can be expressed as

$$y = \sigma(r_p - r_0). \tag{12}$$

Here, *y* represents the probability that *r*_{p} > *r*_{0} given the coordinates of *P*. When *y* = 0.5, Eq. (12) delineates the boundary defined by *S*^{(n)}, allowing for accurate classification of points based on this threshold.
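A sketch of this reference model in Python, using one form consistent with the description, $y = \sigma(r_p - r_0)$, so that $y = 0.5$ exactly on the boundary $r_p = r_0$ (the center and radius below are illustrative):

```python
import numpy as np

def sphere_model(P, c0, r0):
    """Probability that point P lies outside the n-sphere with center c0
    and radius r0, using y = sigma(r_p - r0) with the logistic sigma."""
    r_p = np.sqrt(np.sum((np.asarray(P, dtype=float) - np.asarray(c0)) ** 2))
    return 1.0 / (1.0 + np.exp(-(r_p - r0)))

c0, r0 = (0.0, 0.0, 0.0, 0.0), 1.0
print(sphere_model((0, 0, 0, 0), c0, r0) < 0.5)  # center is inside: True
print(sphere_model((2, 0, 0, 0), c0, r0) > 0.5)  # distant point is outside: True
```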

Since this model accurately classifies *P*, we conjecture that its relative feature importances are desirable to other models that wish to do so as well. Thus, we examine the sensitivity of *y* to an arbitrary feature *x*_{j}, which can be calculated according to Eq. (4) as

$$R_j = \left|\frac{\partial y}{\partial x_j}\right| = \sigma(r_p - r_0)\big[1 - \sigma(r_p - r_0)\big]\,\frac{|x_j - c_j|}{r_p}. \tag{13}$$

This sensitivity is largest when *r*_{p} = *r*_{0}, that is, at the boundary of the *n*-sphere. Exactly at that point, small variations in *x*_{j} cause the largest deviations of the probability of *P* being outside of *S*^{(n)}. The relative feature importance between two features *x*_{j} and *x*_{k} is

$$R_{j,k} = \left|\frac{x_j - c_j}{x_k - c_k}\right|. \tag{14}$$

We now look for an encoding function *g*(*x*_{j}, *x*_{k}) that achieves the desired reduction in dimensionality while preserving the relationships given by Eq. (14). Given that, after combining features, their relative importance should follow Eq. (6), we can obtain one such *g*(*x*_{j}, *x*_{k}) by solving the following system of partial derivatives:

$$\begin{cases}\partial g/\partial x_j = x_j - c_j,\\[2pt] \partial g/\partial x_k = x_k - c_k,\end{cases} \tag{15}$$

whose solution, up to constant factors that do not affect the ratio, is

$$g(x_j, x_k) = (x_j - c_j)^2 + (x_k - c_k)^2. \tag{16}$$

In the same manner, we can obtain other functions that express different relative importances. A constant *R*_{j,k} = 1, for instance, is achieved by using $g(x_j, x_k) = (x_j + x_k)^n\ \forall\, n \in \mathbb{R}$. Alternatively, $g(x_j, x_k) = (x_j \times x_k)^n\ \forall\, n \in \mathbb{R}$ leads to *R*_{j,k} = |*x*_{k}/*x*_{j}|, which is the inverse of Eq. (14) when *c*_{j} and *c*_{k} are zero. To compare the use of these encoding functions, we trained several ANNs, benchmarking them against networks that do not combine inputs (called “independent” here). We are particularly interested in the performance of Eq. (16), which we call the “engineered” encoding function. The training procedures are described in the following.
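These candidate encodings can be compared with a finite-difference check of their relative importance. Note the engineered form used below, $g = (x_j - c_j)^2 + (x_k - c_k)^2$ with the center at the origin, is our reconstruction of a function matching the target ratio of Eq. (14):

```python
import numpy as np

g_engineered = lambda xj, xk: xj**2 + xk**2  # assumed form; R_{j,k} = |xj/xk|
g_sum        = lambda xj, xk: xj + xk        # R_{j,k} = 1
g_prod       = lambda xj, xk: xj * xk        # R_{j,k} = |xk/xj|

def R_jk(g, xj, xk, eps=1e-6):
    """Ratio of partial-derivative magnitudes via central differences."""
    dj = (g(xj + eps, xk) - g(xj - eps, xk)) / (2 * eps)
    dk = (g(xj, xk + eps) - g(xj, xk - eps)) / (2 * eps)
    return abs(dj) / abs(dk)

xj, xk = 1.5, 0.5
print(round(R_jk(g_engineered, xj, xk), 3))  # 3.0 = |xj/xk|
print(round(R_jk(g_sum, xj, xk), 3))         # 1.0
print(round(R_jk(g_prod, xj, xk), 3))        # 0.333, i.e., |xk/xj|
```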

A dataset of 1000 points in four dimensions was created, where each coordinate value was randomly chosen between −2 and 2. The points were labeled as either inside or outside of a three-sphere *S*^{(3)} of radius 1, centered at the origin *c*_{0} = (0, 0, 0, 0), according to their position. In order to obtain a balanced dataset, we generated the same amount of points inside and outside the sphere. The networks trained to solve this task were composed of an input layer containing either two or four neurons (depending on the combination of features or not), a hidden layer of six neurons, and an output layer with a single neuron. A logistic activation function *σ* was used for every layer. Each encoding function was used to train 100 different networks, thus accounting for the random initialization of weights and random shuffling of the dataset prior to training. The networks were trained on 70% of the available data for 100 epochs with a learning rate of 0.001 and tested on the remaining data.
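The dataset described above can be generated with a short sketch; the rejection-style sampling used here to balance the two classes is our assumption about the procedure:

```python
import numpy as np

def make_sphere_dataset(n_points=1000, dim=4, r0=1.0, seed=0):
    """Points drawn uniformly in [-2, 2]^dim, labeled 0 (inside) or 1 (outside)
    the sphere of radius r0 centered at the origin, balanced per class."""
    rng = np.random.default_rng(seed)
    inside, outside = [], []
    while min(len(inside), len(outside)) < n_points // 2:
        p = rng.uniform(-2, 2, size=dim)
        (outside if np.linalg.norm(p) > r0 else inside).append(p)
    X = np.array(inside[: n_points // 2] + outside[: n_points // 2])
    y = np.array([0] * (n_points // 2) + [1] * (n_points // 2))
    return X, y

X, y = make_sphere_dataset()
print(X.shape, y.mean())  # (1000, 4) 0.5
```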

The results of these experiments are shown in Fig. 2(d). We notice that some representations can render the task harder to solve, while others maintain, to some extent, the accuracy achieved by the use of independent inputs. The engineered encoding function in Eq. (16) outperformed all others. With this example, we show that the way we combine features plays a role in the accuracy of ANNs. Given prior knowledge on how features relate to one another, which may come from domain-specific knowledge or from inspecting the data (noticing symmetries or class distributions, for example), we could estimate relative feature importance and obtain an encoding function that aligns with them. Combining features with said encoding function could improve the network performance.

## VI. APPLICATION OF ENCODING FUNCTIONS IN PNNs

In this section, we retake the subject of this study and explore the use of different encoding functions in PNNs trained on the Iris dataset,^{50} a standard benchmark for classification algorithms. Our goal is to show how carefully chosen encoding functions might lead to higher accuracies in PNNs. To this end, we compare the performance of several encoding functions by means of simulations of PNNs, which differ significantly from the ANNs of Sec. V in terms of their complex-valued inputs and transformations.

The Iris flower classification task involves categorizing three different Iris species (Setosa, Versicolour, and Virginica) based on four features: the lengths and widths of sepals and petals. The dataset, consisting of 150 labeled data points, has considerable class overlaps, such that no single feature alone can distinguish all the species, making this an ideal candidate for our experiments. Visualizations of feature distributions and class overlaps are shown in Fig. 3(a).

Our experimental design involves training PNNs by combining features in pairs, as shown in Fig. 3(b). We assess their performance by averaging the accuracy of 100 trained PNNs, benchmarking them against a PNN that does not combine features. This sample size was chosen to allow for convergence in the average values obtained for accuracy, given the variability in the training process. The architecture of the PNNs consists of a single hidden layer with six neurons. Depending on the configuration, the number of input neurons varies between 3 (when combining features) and 5 (when using independent inputs), where one input acts as a bias for both configurations. The output layer has three neurons, matching the number of classes. All the configurations use the same underlying circuit, where NN layers are implemented using meshes of MZIs with trainable phase shifters, as shown in Fig. 1(b). Every layer is followed by a *softplus* activation function, which can be implemented in integrated photonic circuits.^{51} Although its hardware implementation would change both the modulus and the phase of the signals, we model it by applying *softplus*(*x*) = log(1 + exp(*x*)) solely to the modulus of the complex numbers.^{33} This approach allows us to simulate the gain and activation behavior while simplifying the model by avoiding additional phase changes. These phase changes can make the simulation and training more challenging and are less critical to the primary function of the *softplus* activation in this context.
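The modulus-only *softplus* described here can be modeled as below — a sketch that applies softplus(*x*) = log(1 + e^{x}) to the modulus of each complex signal while preserving its phase, consistent with the stated simplification:

```python
import numpy as np

def modulus_softplus(z):
    """Apply softplus to |z| while keeping the phase of z unchanged."""
    mag = np.log1p(np.exp(np.abs(z)))  # softplus(|z|), numerically stable
    phase = np.angle(z)
    return mag * np.exp(1j * phase)

z = np.array([1.0 + 1.0j, -0.5j])
out = modulus_softplus(z)
print(np.allclose(np.abs(out), np.log1p(np.exp(np.abs(z)))))  # True
print(np.allclose(np.angle(out), np.angle(z)))                # True
```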

The circuits were simulated using the Photontorch Python package.^{52} The simulations were performed under ideal conditions, excluding noise and component imperfections. They were trained for 300 epochs on 70% of the dataset, reserving the remaining 30% for testing. The dataset was divided into five shuffled batches per epoch to enhance training stability. A *softmax* function was used to convert the output light intensity values into class probabilities.^{22} Weight updates were performed using a cross-entropy loss function combined with stochastic gradient descent. The initial learning rate was set at 0.01 and adjusted at learning plateaus. While higher accuracies may be achieved by further optimizing the training process to each specific case, we opted for a constant training procedure across all circuits to isolate the effects of different encoding functions.

Here, we investigate the use of the encoding functions detailed in Sec. II: linear and exponential encoding. Given the anisotropic nature of the Iris classification task, unlike the *n*-sphere problem, we also consider which features to combine. To explore the impacts of this choice on the obtained accuracy, we use two combination strategies for features: grouping by the lengths and widths (*l*/*w*) or by petal and sepal information (*p*/*s*). We benchmarked the performance of PNNs using different encoding functions and groupings of features to the independent case, where features were not combined.

The results of these experiments, shown in Fig. 4, are summarized as follows: exponential encoding exhibited the lowest performance, falling by up to 11% in mean accuracy compared to the independent benchmark. In contrast, linear encoding, commonly used in the photonics community,^{18,33–36} was able to match the performance of the independent case. The difference between the best and worst performing encoding functions was 12.3%. These results highlight that both the manner in which features are combined and the choice of which features to combine play significant roles in the final accuracy of PNNs. When comparing different feature groupings, we found that (*l*/*w*) consistently performed worse than (*p*/*s*), demonstrating that the choice of which features to combine can also impact accuracy for some tasks.

These findings are supported by heuristics found in the data. A closer inspection of Fig. 3(a) reveals that petal length and petal width together are highly discriminative of the different classes. These features separate different species in a similar fashion, as evidenced by the distribution of classes along the diagonal of the plot, suggesting that they may have similar importance. Thus, the combination (*p*/*s*) with linear encoding would combine petal length and width with an equal relative importance, expressing such relationships.

## VII. CONCLUSIONS

Combining features into single inputs in PNNs can lead to a reduced number of inputs and associated devices, as well as enabling the use of smaller and more energy-efficient NNs. These benefits would help render some circuits more feasible to simulate, fabricate, test, or deploy. However, this method of feature combination imposes predefined relationships among the features that may not necessarily reflect the nature of the data or the task at hand. Nonetheless, selecting or designing encoding functions based on an understanding of the dataset or on domain-specific knowledge can lead to improved accuracy. We have illustrated this first on an idealized simple example and then on simulated PNNs.

In the scenarios shown here, as it is seen in the literature, features are combined into a single input. As an alternative, we could distribute features across many inputs, circumventing the discussions brought up here and making it possible to learn other relative feature importance. For instance, principal component analysis (PCA) can be used for dimensionality reduction, distributing features across many inputs simultaneously.^{32} Expanding on this concept, a learnable encoding function that uses every feature available would be a fully connected layer of NN,^{35} which is more complex and less efficient than what is explored in our work. In addition, the approach used here could be applied directly at a hardware level, using integrated photonics and CMOS-compatible platforms for volume production.

Here, the discussions highlight that there is no neutral way of using this feature combination strategy in PNNs. Combining features in this manner will necessarily emphasize certain feature relationships. Sometimes, a PNN might achieve good performance metrics despite combinations that are not ideal. However, even if high accuracy is achieved, these combinations can also introduce or amplify biases in the model outputs, depending on the specific features and their encoded interactions. Rather than leaving this to chance, we suggest carefully assessing how to encode features given the nature of the problem and data.

## ACKNOWLEDGMENTS

The authors thank **Peter Bienstman** and **Thomas Van Vaerenbergh** for their valuable advice and discussions during the early stages of this work, as well as for reviewing the paper before submission.

This project received funding from École Centrale de Lyon and ANR (Grant No. ANR-20-THIA-0007-01). **Paul Jimenez** and **Fabio Pavanello** acknowledge ANR’s support (Grant No. ANR-20-CE39-0004), and **Fabio Pavanello** acknowledges support from European Union’s Horizon Europe research and innovation programme (Grant No. 101070238).

Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Mauricio Gomes de Queiroz**: Conceptualization (lead); Data curation (lead); Formal analysis (lead); Investigation (lead); Methodology (lead); Software (lead); Visualization (lead); Writing – original draft (lead); Writing – review & editing (equal). **Paul Jimenez**: Formal analysis (supporting); Methodology (supporting); Writing – review & editing (equal). **Raphael Cardoso**: Methodology (supporting); Writing – review & editing (equal). **Mateus Vidaletti Costa**: Methodology (supporting); Writing – review & editing (equal). **Mohab Abdalla**: Methodology (supporting); Writing – review & editing (equal). **Ian O’Connor**: Supervision (supporting); Writing – review & editing (equal). **Alberto Bosio**: Funding acquisition (equal); Supervision (supporting); Writing – review & editing (equal). **Fabio Pavanello**: Conceptualization (supporting); Formal analysis (supporting); Funding acquisition (equal); Methodology (supporting); Supervision (lead); Writing – review & editing (equal).

## DATA AVAILABILITY

The code that reproduces the experiments and the data that support the findings of this study are available at github.com/mgomesq/feature representation pnns.

## REFERENCES

*Frontiers in Optics*