Intelligent performance inference: A graph neural network approach to modeling maximum achievable throughput in optical networks

One of the key performance metrics for optical networks is the maximum achievable throughput for a given network. Determining it, however, is a nondeterministic polynomial time (NP) hard optimization problem, often solved via computationally expensive integer linear programming (ILP) formulations. These are infeasible to implement as objectives, even on very small node scales of a few tens of nodes. Alternatively, heuristics are used although these, too, require considerable computation time for a large number of networks. There is, thus, a need for an ultra-fast and accurate performance evaluation of optical networks. For the first time, we propose the use of a geometric deep learning model, message passing neural networks (MPNNs), to learn the relationship between node and edge features, the network structure, and the maximum achievable network throughput. We demonstrate that MPNNs can accurately predict the maximum achievable throughput while reducing the computational time by up to five-orders of magnitude compared to the ILP for small networks (10–15 nodes) and compared to a heuristic for large networks (25–100 nodes)—proving their suitability for the design and optimization of optical networks on different time-and distance-scales


I. INTRODUCTION
Multiwavelength optical networks underpin the global data communication network infrastructure and use the wavelength domain both for routing and to increase point-to-point data transmission rates. Optical networks are also being used for intra-and inter-data center communications and in high performance computing. 1 To enhance the throughput of these networks, optical fibers carry many wavelength channels on a single fiber, known as wavelength division multiplexing (WDM), which are also used for wavelength routing. 2 This has greatly improved the overall capacity of optical networks but requires the solution of the routing and wavelength assignment (RWA) problem. These have been shown to be nondeterministic polynomial time (NP) hard and, therefore, computationally difficult to solve optimally. [2][3][4] The overarching goal of the physical network design is to maximize network performance, measured by throughput, latency, and resilience, while minimizing the cost and/or resource use, making networks intelligent and adaptive. 5,6 This is an evolution from the previous goals of minimizing the number of wavelengths needed to optically route data within the network, 7 which quantified the relationship between wavelength requirements and the physical topology. 8,9 However, the understanding of linear and nonlinear physical layer impairments, in the presence of multiple wavelegnth channels, has shown that these play a key role in determining both routing and throughput, highlighting the importance of including optical fiber physical properties in network analysis and design. 6 The maximum achievable network throughput is defined here as the throughput when the resources (wavelengths) are fully exploited, at zero blocking, for a given demand distribution (measured in bits per second). To quantify the maximum achievable throughput in a network requires an optimal solution to the RWA problem. Integer linear programming (ILP) formulations have been shown to solve the RWA problem optimally; [10][11][12] however, these are infeasible for networks larger than ∼30 nodes. Problem-agnostic optimization frameworks, which aim at efficiently exploring the solution space, i.e., meta-heuristics, have also been shown to give good solutions; however, these have no guarantee of achieving a global optimum, while often still taking a long time to solve. [13][14][15][16] Handcrafted algorithms, i.e., heuristics, are designed for specific purposes and are based on the automated rules-of-thumb. These are usually highly scalable but have shown limited success in terms of reaching optimum values, compared to ILP solutions. 4 In addition, graph cut algorithms have been analyzed to estimate the maximum achievable throughput; however, these also succumb to computational complexity for larger graphs and have been shown to be non-exact for non-uniform traffic distributions. 7,17 The efficient and accurate calculation of the maximum achievable throughput of optical networks, thus, remains a considerable challenge.
To make this task more computationally efficient, machine learning has been introduced to learn the relationship between the topology and performance parameters based on previously labeled datasets.
More specifically, supervised deep learning has been used, where large labeled datasets are used to train learnable functions. There are two broad approaches to deep learning: (i) artificial neural networks (ANNs) based on Euclidean data or (ii) geometric deep learning, using graph neural networks (GNNs), based on graph structured data to perform classification or regression tasks. In Ref. 18, an ANN was used to estimate the blocking probability of a network, and in Ref. 19, it was used to estimate the fast optical packet loss rate for bufferless optical packet-switched networks. The problem, however, with ANNs is that they operate on grid-style data, i.e., vector/matrix inputs. Therefore, they are not able to use relational data, i.e., the relation of one node to other nodes, nor fully represent structural features of graphs, such as node degree variance or algebraic connectivity. 9 This makes it difficult to learn a general relationship between these features and the performance parameters as the structure and physical properties of graphs have a significant impact on the overall network performance. More recently, there have been many works that have applied novel combinatorial optimization techniques, using reinforcement learning to solve the RWA problem. [20][21][22][23][24] However, these have yet to achieve performance comparable to the ILP solutions.
Geometric deep learning applies deep learning to graph structured data, incorporating the graph structure within the learning process. By including node features, edge features, and learnable functions that operate in a graph-structured manner, geometric deep learning learns relationships on graphs better than traditional deep learning. Graph neural networks (GNNs) are a collection of supervised learning frameworks for geometric deep learning, able to aggregate information in graphs. Message passing neural networks (MPNNs) are a specific formulation of GNN, shown to perform well for regression tasks, such as the evaluation of the maximum achievable throughput of an optical network, one of the fundamental goals of our work. 25 The benefit of this type of model is that it has been shown to generalize well to graphs of different sizes and structures and to be computationally efficient. The downside, however, is that it requires large datasets to learn and predict graph properties. More often than not, these are far and few between, making it difficult to apply these frameworks. Previously, this approach has been successfully applied in quantum chemistry and non-optical communication networks. [25][26][27] In this paper, we apply a MPNN architecture to model the performance of optical networks, for the first time, by learning the relationship between the topology and the maximum achievable throughput. We generated three large datasets (∼80 000 graphs per dataset), with the number of nodes in the range 10-100, expanding our previous work. 28,29 The datasets were used to train an accurate MPNN model to learn the relationship between the topology and the performance of optical networks. We demonstrate the significant computational time reduction with this model to estimate graph performance. This, in turn, reduces the complexity of optical network topology design problems by replacing computationally complex objectives with surrogate models, such as the one demonstrated here. This enables future topology optimization to maximize the achievable throughput, making topology design more intelligent and computationally efficient.
The rest of this paper is structured as follows. Section II describes the MPNN model and its operation. Section III details the methodology used to generate the three graph datasets. Section IV explains the training procedure and the different parameter settings for the MPNN model. Finally, using the generated datasets, Sec. V analyzes the model's capability to predict the maximum achievable throughput in terms of accuracy, computation time, and generalization capability.

II. MESSAGE PASSING NEURAL NETWORKS (MPNNs)
Optical networks are a set of nodes (for example, major cities or data center or servers within data centers) inter-connected, with arbitrary connectivity, by optical fibers. Optical fibers carry multiple discrete wavelength channels, each carrying individually modulated data. To transfer data between source and destination nodes, for each source-destination demand, a lightpath is set up, consisting of a route (a series of edges) and a wavelength designated for transmission. The process of selection of wavelength (color) and route is determined via the RWA algorithm. Optical networks are usually modeled as graphs and so ideally lend themselves as inputs to graph neural networks. Recently, there have been works that apply graph neural networks in the context of reinforcement learning (RL) to the RWA problem in optical networks. 23,24 These works, however, are only compared to heuristics, where sometimes they perform better and sometimes worse. We apply MPNNs to learn from the solutions directly from integer linear programming formulations and heuristics to consistently predict performance properties of optical networks, such as the maximum achievable throughput.
We start with a digraph denoted as G(N, E), where N and E are its set of nodes and edges, respectively. All nodes and edges have a pre-determined set of node and edge features, xn and enu, respectively, where n ∈ N and (n, u) ∈ E. In this work, the degree (δn) and the total traffic originating from a node are used to describe node features, and the worst case noise-to-signal ratio (NSR) of a fully populated link is used to describe the corresponding edge feature. These node and edge features are vectors with information related to either nodes: degree and traffic, or the edges: distance and signalto-noise (SNR) ratio. MPNNs use abstract vectors, referred to as a node or edge hidden state. In this work, for simplicity, we confine the hidden states to nodes, represented as h t n , where t represents the message passing iteration. These hidden states are vectors that ARTICLE scitation.org/journal/aml hold embeddings for nodes, i.e., for a specific node, they capture structural information from the rest of the graph. We define the set of node features/edge features as XN/XE, respectively, and the set of hidden node states as HN. GNNs can be used for either regression or classification tasks, and in this work, we focus on the regression of graph properties. The MPNN framework centers around three functions: message function Mt(h t n , h t u , enu), update function Ut(h t n , m t+1 n ), and readout function R (HN, XN), where n is a node in the node set N, t is the message passing iteration out of T iterations, and u is a node in the neighborhood of n [u ∈ (n)]. T generally is chosen to be in the order of either the average shortest path length or diameter of the graph. There have been several different formulations of these functions in the past; 30-37 however, they all follow the same general algorithm.
The MPNN-outlined in Algorithm 1-is made up of three stages: (i) message passing (line 7), (ii) update (line 8), and (iii) readout (line 11). In (i), each node in the graph requests information (messages) from its neighborhood [ (n)]. This information (messages) is given by feeding node and edge information into the message function [Mt(h t n , h t u , enu)]. To form the nodal message (m t+1 n ) of node n, we have to aggregate the information (messages) given by the neighborhood of n. This can be done via different operations, i.e., averaging, sampling, or summing. We obtain the message (m t+1 n ) of node n by summing the messages from the neighborhood of n. This is then fed through an update function (line 8) defined as Ut(h t n , m t+1 n ), which updates the state (h t+1 n ) of each node. These two steps are illustrated in the inner block of the diagram shown in Fig. 2, where one can see that the process is repeated for each node n ∈ N. This procedure iteratively distributes the information of the graph to every node by collecting local messages and using these to update the new hidden vectors. Figure 1 demonstrates this process of message passing for the computation of the node state of node 1 on an example six-node topology. Here, the process is shown for two message passing iterations, t = 1, 2. Working backward from t = 2 and node 1, we can see that we use the neighbors of node 1, which feed their messages into the aggregation that are then updated. Before this, their neighbors do the same for their states. This demonstrates how the information about the graph is distributed across the graph during message passing. After T message passing rounds and update layers, shown by the outer blocks in Fig. 2, the hidden states are aggregated and used to create a graph level prediction y, seen in line 11 of Algorithm 1. This process of aggregation and graph-level readout is summarized by the readout function R(HN, XN), where HN denotes the set of hidden states and XN denotes the set of node features. The readout function outputs a scalar value used for prediction. This architecture has been shown to provide good learning capability for general graph structured data, 26 where one can incorporate individual node and edge features. Another advantage of such a model is that the architecture is size agnostic, meaning that the model can generalize to graphs of different sizes.

A. Message function
The message function is used to extract information from both hidden states of neighboring nodes and adjacent edge features, thus producing messages (line 7). The functions that are used to formulate these messages are generally learnable functions. Within optical networking, the fiber lengths of edges significantly impact the transmission performance over these edges; therefore, it is essential to include this information in the formulation of messages,

ARTICLE
scitation.org/journal/aml As shown in Eq. (1), we use a matrix ANN (A) and vector valued ANN (b) to extract features from the edge feature vectors enu in constructed messages. These messages are then aggregated via a sum operation, as seen in line 7 of Algorithm 1 and the inner block of Fig. 2, to give the future message of node n, m t+1 n . The sum operator is chosen due to it being permutation agnostic and for its simplicity.

B. Update function
The update function is used to take the information (messages) aggregated from the neighborhood of node n and learn a new hidden state vector h t+1 n to incorporate this new information in the state (line 8). As this is a sequential process, we use a recurrent neural network (RNN) architecture, which can take previous states into account and learn how much of the previous states to use in the next state, while producing new abstract representations of the data. RNNs have been shown to struggle with vanishing gradients during training; therefore, we chose to use a gated recurrent unit (GRU). GRUs use reset and update gates to learn how much of the state and input to use or forget depending on the target. They exhibit improved performance over standard RNNs; however, they have been shown to have reasonable computational complexity, 38 As shown in Eq. (2), the update function solely consists of a GRU function. The current state, h t n , and the aggregated messages, m t+1 n , are fed in to produce an updated representation, h t+1 n .

C. Readout function
The final graph-level aggregation of the states and features is carried out via the readout function (line 11). The aim is to aggregate all the relevant inter-dependencies between the nodes via the hidden states, h T n , and represent it as a single vector, on which one can regress to the target outputs (throughput, in our case). The readout function is shown in the following equation: To learn which parts of the hidden vector, h T n , are important for the prediction of the target, an attention mechanism was used. The attention mechanism learns weights in the range of [0, 1], which are used to weight a vector, with the goal of learning which parts of the vector are important for the learning task. This is achieved by feeding the concatenated hidden vector, h T n , and node features, xn, through an ANN and passing the output through a sigmoid function σ, which normalizes the output between [0, 1]. Using an element-wise multiplication (Hadamard product), this vector acts as attention scores to the original hidden vector, h T n , which is fed through another ANN j. 26 Summing these operations over all nodes gives us the final vector used for the regression layer. The regression layer b consists of a single ANN layer, which reduces the output to a scalar value.
Having defined the MPNN model architecture, it now needs to be trained on graph labels, i.e., maximum achievable throughput, to predict this on all unseen graphs. Section III details the methodology for generating training and testing datasets.

III. DATASET GENERATION
To train the MPNN, three large sets of graphs, with different numbers of nodes, were generated. To calculate the maximum achievable throughput for nodes sizes below 25, an ILP was used to find optimal RWAs, although for the two larger sets of graphs (N = 25-100), a heuristic algorithm was used. Using these RWAs, the throughput was calculated and then stored as a graph label to be used for training and testing. This section details the methodology used for generating topologies, the ILP formulation, heuristic algorithms, the physical layer impairment (PLI) model, and the final calculation of the maximum achievable throughput.

A. Topology generation
A single generative graph model was chosen to create the graph structures for training and testing. These were created via the SNR-Barabasi-Albert (BA) model, which has been shown to reflect optical core network structures and physical properties. 6 In this generative graph model, distances between nodes are incorporated in the process of edge selection, creating localized hubs within graphs. Nodes are chosen uniformly over a grid, representing the size of the north-American continent, resulting in unique node locations for each graph and therefore giving a greater range of graph structures and physical properties than in Ref. 6, where the same real network node locations were used in all generated graphs. The SNR-BA generative graph model then uses these unique node locations to generate the graphs. Inter-node distances were constrained to be at least 100 km, 1500 km if 1000 km ≤ D hav ≤ 1200 km, To model distance, D hav , in the generated graphs, their geographical coordinates were used in conjunction with the haversine formula. 39 The haversine formula takes into account the curvature of the earth and calculates distances over a sphere rather than a plain. We also account for realistic fiber distances, as shown in Eq. (4). The fiber distances are estimated according to the European Telecommunications Standards Institute (ETSI) guidelines for estimating distance overheads in communication networks. 40 To demonstrate the model operation on different size topologies, three separate datasets were generated: (i) alpha set with 10 ≤ |N| ≤ 15 (∼75 000 graphs), beta set with 25 ≤ |N| ≤ 45 (∼95 000 graphs), and gamma set with 55 ≤ |N| ≤ 100 (∼75 000 graphs). The nodes increased by 1 and 5 for the alpha and beta and gamma datasets, respectively, giving 6, 5, and 10 node scales for the alpha, beta, and gamma datasets, respectively. To make sure that the model learns performance trends over a variety of edge numbers, edge numbers were chosen by adding an empirically determined percentage of the nodes for a given graph, multiple times, as seen in Eq. (5). Here, dn was chosen to be 0.2, typical of the relatively sparse core ARTICLE scitation.org/journal/aml networks, and i varied between 1 and 10 so that the graph, empirically, has ∼20% more edges than nodes; as the sparsest core networks have about 20% more edges than nodes, we have used this as the minimum value, 2 To generate the labels for the maximum achievable throughput values, an optimum RWA that maximizes the number of allocated lightpaths (simply referred to as the optimum RWA) is required for each network considered. To find these optimum RWAs, an ILP formulation was used for |N| < 25 and a heuristic was used for |N| = 25-100, as described in Secs. III B and III C.

B. Integer linear program
To maximize the maximum achievable throughput of the network, an ILP is developed that maximizes the number of lightpaths allocated, maximized using the objective function (6). The ILP considered here does not directly optimize the maximum achievable throughput; however, it assumes that maximizing the number of lightpaths allocated is highly correlated with the maximizing throughput within the network. Maximizing the number of lightpaths allows the problem to be specified in terms of an integer objective, making the problem less computationally complex. Maximizing throughput directly by solving the routing and wavelength allocation problem, a more complex problem to be solved because of nonlinear interactions between lightpaths. A decision variable δ w,k,z is used to define a lightpath, where w ∈ W, k ∈ K, and z ∈ Z are the set of wavelengths, k-shortest paths, and node-pairs, respectively. It is defined as in Eq. (7) and is constrained to assigning a lightpath, subject to the normalized traffic matrix (T c z ) and the objective M, defined in Eq. (8). For this work, all the training data were generated with uniform traffic, meaning that uniform bandwidth was assumed to be routed between all node-pairs. Here, the objective is to maximize M, as summarized in Eq. (6). The wavelength continuity and edge-disjoint constraints of paths are defined in Eq. (9), where I( j ∈ k) refers to whether edge ( j) occurs on path (k), Using this ILP formulation, optimum RWAs were found for each graph in the alpha dataset using 156 wavelengths and 20 kshortest paths. ILP formulations, however, do not scale for larger graphs, and heuristics in conjunction with sequential loading were used for the beta and gamma datasets, as described in Sec. III C.

C. Heuristics
Previously, it has been argued that heuristics, and specifically, first-fit k-shortest-paths (FF-kSPs) can achieve similar performance for estimating capacity in optical networks. 4 These heuristics are used in combination with sequential loading to load the network until the maximum achievable throughput can be found, and the FF-kSP heuristic has been shown to give very good performance for this type of task, with low computational complexity. 4 The heuristics here are used for two purposes: (i) to benchmark the performance of the MPNN (N ≤ 25) and (ii) to generate training labels for larger graphs (N ≥ 25), for which the ILP is infeasible.
The sequential loading applied is described by Algorithm 2. Here, the same objective (M) is maximized as in the ILP, where Tr and Tc are N × N matrices representing the connections to be routed and the normalized traffic matrix, respectively. FR is the routing function, which takes three inputs: a graph (G), a connections matrix (Tr), and the previous RWA. Once blocking is achieved, the iteration (ms) is halved and the network is loaded until blocking is achieved. This iterates mi times, after which the final RWA is returned, where mi was chosen to be 6, as more would be adding negligible amounts to the objective. In this work, kSP-FF and FF-kSP 4 were used, both of which are common heuristics for solving the RWA problem. FF-kSP was used to generate optimum RWA configurations for the graphs in the beta and gamma datasets due to its high linear correlation to the ILP-determined performance data, as seen later in Sec. V A.
Having optimized the RWA of a given network, we need to use these optimum RWA values to calculate the maximum achievable throughput. For this, the physical layer impairments (PLIs) over the fiber in which the data are routed need to be taken into account to calculate the SNR values of the individual lightpaths.

D. Physical layer impairment model
To calculate the resultant throughput for a given RWA, we calculated the accumulated SNR for each lightpath assignment. A lightpath i = {p i , λi} ∈ R has a path p i and a wavelength λi associated with it and is part of the set of lightpaths for a routing R. To  In this work, just by way of an example, we assumed multiples of 80 km standard single mode fiber spans, with α = 0.2 dB km , D = 18 ps nm km , and γ = 1.2 1 W km , amplified with erbium doped fiber amplifiers (EDFAs) with a noise figure of 4 dB. Nyquist spaced 32Gbd channels were interfaced with colorless, directionless, and contentionless reconfigurable optical add-drop multiplexers (CDC-ROADMs) over a constrained C-band (1530-1570 nm) optical bandwidth. The total SNR of that path is then calculated by taking the inverse sum of the noise-to-signal ratio (NSR) values of each link traversed by the path p i , shown in Eq. (11), This SNR is used to calculate the maximum achievable data rate over this lightpath using Eq. (12). 42 It can be seen that the capacity of a lightpath mainly depends on the SNR of that path, which, in turn, depends on the length and congestion along the path. BCH represents the channel bandwidth used, which was kept constant at 32 GHz for all channels, The throughputs for all the lightpaths were calculated and summed, as in Eq. (13), to give the total achievable throughput for a particular RWA and network. For the routing of larger networks, multiple fibers needed to be incorporated to allow for all-to-all connectivity, where four fibers per edge for 25 ≤ |N| ≤ 45 and 16 fibers per edge for 55 ≤ |N| ≤ 100 were used. Multiple fibers generally increase the throughput of a network by at least ξ times, where ξ is the number of fibers per edge. 43 The hierarchy of graphs, however, does not change when adding more fibers, i.e., which network is better does not change, therefore not changing the overall distribution of the generated graphs.
Using this methodology, the three datasets, alpha, beta, and gamma, were generated via the SNR-BA model; their RWAs were optimized via either ILP or FF-kSP (depending on their node scale); and finally their maximum achievable throughputs (TA) were calculated. The whole process is illustrated in Fig. 3. Using TA as a training label (target), the MPNN model, defined in Sec. II, was trained in a supervised manner, described in Sec. IV.

IV. TRAINING
Using the datasets created as described in Sec. III, the three MPNN models were trained to predict the maximum achievable throughput, TA. The node features (xn) chosen to be the degree of each node (δn) and its normalized traffic were used to initialize the node hidden vectors (h 1 n ). As throughput depends on the physical properties of the fiber links, the overall transmission quality metric is a critical feature. The number of lightpaths carried over an edge or set of edges, as well as the length of paths taken, influences the overall system performance via the achievable signal-to-noise ratio in the nonlinear optical regime. Throughout testing, it was seen that the inverse of the SNR (NSR) was a better feature to use as it is an additive quantity over a set of edges. In an ideal case, optimal launch powers would be found for each wavelength in the network; however, due to the computational complexity associated with carrying this out for many networks (240 000 topologies), the assumption of uniform launch powers, i.e., equal input power/lightpath for all lightpaths, was used, similarly to other work. 4 A hidden vector size of 16, with eight message passing rounds, was used for all model training. Larger hidden vector sizes did not seem to provide more accuracy but added higher computational complexity. The message passing rounds were chosen to work well over the whole spectrum of node scales, where T = 8 gave good performance for all. As overfitting was initially a problem, a dropout rate of 0.65 with L2 regularization at a rate of 0.03 was used, where higher values started reducing the accuracy of the model. The learning rate was initialized with a value of 0.001 and decayed exponentially using 10 000 steps at a rate of 0.95. The Adam optimization algorithm was used to train all models, with the graphs in batches of 50, and a single fully connected regression layer consisting of 256 neurons was used for the final regression output, where larger values did not provide further accuracy without overfitting. To monitor whether the model was tending to overfit, a validation set of 500 graphs was used to evaluate the performance at each epoch. The training was conducted on a Nvidia V100 16 GB graphics processing unit (GPU), with training generally taking around 72 h, covering 2000 epochs.

ARTICLE scitation.org/journal/aml
Having trained the three separate models on the different datasets, we used three separate test sets to gauge the performance of the trained inference models. These were generated identically to the methodology laid out in Sec. III, but were unseen (not used for training) by the model. Section V analyzes the accuracy, computation time, and generalization capabilities of the MPNN model using these test sets.

V. RESULTS
To understand how well the MPNN model performs in comparison to other methodologies for estimating the maximum achievable throughput of an optical network, there are three aspects we evaluate: (i) model accuracy, (ii) time complexity, and (iii) generalization capability to unseen graphs.
To evaluate the accuracy of the MPNN, the three test sets of 6000, 5000, and 10 000 graphs were used to test the alpha, beta, and gamma models, respectively. Here, 1000 graphs for each node scale were generated. Labels for the graphs in the alpha test set were generated via the ILP formulation, while for the beta and gamma test sets, FF-kSP was used. Again, uniform traffic was generated and used for all these results. To measure the accuracy of model predictions, the coefficient of determination R 2 was used, which measures how much of the variance in the data is explained by the model, by comparing the prediction variance to that of the original data. 44 Here, we define R 2 as in the following equation: where y n refers to the true label, p n refers to the predicted label, and y refers to the mean. The Pearson's correlation coefficient ρ was used as an alternative metric and determines how linearly labels and predictions are related, with a value of 1 signifying a perfect linear correlation. This coefficient is important in the context of surrogate models or cost functions, as a model might be inaccurate (low R 2 ), but might have a high linear correlation (high ρ), which means that although inaccurate, it can predict the relative performance of a network well-vital for optimization. The predictive accuracy for all three models is analyzed and discussed in Sec. V A.

A. Maximum achievable throughput
The label that was evaluated is that of maximum achievable throughput (TA). For each of the 6000 graphs in the alpha test set, the performance was evaluated via ILP (used for the labels), FF-kSP, kSP-FF, and MPNN and plotted in Fig. 4(a). It can be seen that the kSP-FF and FF-kSP heuristics underperform compared to the ILP, giving lower R 2 values in both, although kSP-FF is worse as it has even lower values for R 2 and ρ, as seen in Table I. This signifies poor predictive accuracy of the actual labels (TA). The performance of kSP-FF is worse than that of FF-kSP as it prioritizes shorter paths over optimizing wavelength selection; this is an expected result seen in Ref. 4. FF-kSP generally spreads the usage more evenly across all network links compared to kSP-FF. This spreading of resources uses a slightly more spectrum, but achieves much less network congestion. However, the MPNN has learned on a variety of graphs of these sizes and can accurately predict the throughput trend here, as indicated by its high R 2 value of 0.951. Furthermore, one can see the difference in throughput distributions, where the CDF of the different methodologies is plotted in Fig. 4(b). Here, it is clear that the kSP-FF and FF-kSP heuristics give different distributions compared to the ILP. The MPNN, however, is able to replicate the original distribution of the ILP well, and therefore, it can predict the throughput of the networks better. In addition to the MPNN, linear and nonlinear ML regression methods were evaluated to compare other ML frameworks. Here, the degree variance, connectivity, algebraic connectivity, communicability distance, and communicability traffic index were used as input features for the graph. ElasticNet and the artificial neural network (ANN) were trained and tested with the same data, both scoring lower with R 2 values of 0.763 and 0.768, respectively.
Another metric evaluated was the Pearson's correlation coefficient (ρ). As for the optimization, there is significant linear correlation between the real performance and predicted performance properties. It can be seen that the heuristics generally perform well, and the FF-kSP has a high linear correlation (ρ = 0.969) between the estimated throughput values and those calculated via the ILP, as seen in Table I. The MPNN has a similar correlation, with ρ = 0.975, meaning that it predicts the relative throughput performance of networks well. ElasticNet and the ANN both score ρ values lower than the MPNN and FF-kSP. The high linear correlation of FF-kSP compared to the ILP makes it a good candidate to evaluate the maximum achievable throughput for the larger graphs, even though the real maximum achievable throughput might be larger.
To further evaluate the performance of the model, the model was applied to the beta and gamma testing datasets that included larger graphs. Here, the labels were generated via the FF-kSP heuristic as the ILP is not able to scale to these larger topologies. The MPNN was used to infer the maximum achievable throughput of the graphs in the respective test sets and plotted in Figs. 5(a) and 5(b). It is clear that for the beta model, the MPNN performs better than the alpha and gamma models, with an R 2 value of 0.973. This is because the beta training set had the most training graphs per node scale. This was possible for the beta model as the heuristic runs faster on the smaller beta graphs than the larger gamma set graphs and also faster than the ILP. This can be seen in Figs. 6(a) and 6(b). Although the gamma model still achieves a high R 2 of 0.948, one can see that the larger throughput values have more variance in them. To remedy this, more samples are needed in the training set for larger node scales. Both the beta and gamma models achieved high linear correlation values (ρ) of 0.986 and 0.974, respectively.
One can conclude that heuristics consistently underperform in estimating the throughput in optical networks compared to the ILP solutions. However, FF-kSP is able to predict the trend (linear correlation-ρ) well, making it suitable for optimization tasks and training data generation. The MPNN on the other hand is able to accurately predict throughput values of unseen data based on similar structure and sizes seen during training. It also has a high linear correlation between the labels and its own predictions, making it an excellent candidate for modeling the throughput of these optical networks. Therefore, the MPNN is able to predict maximum achievable throughput accurately although it does not provide a solution, for example, for the RWA, of how to reach it. The model could potentially be adjusted to learn the RWA; this is outside the scope of the current work and is left for future research. However, given the extensive training and data needs of this model, what is the real benefit of using it in place of heuristics? This question is addressed in Sec. V B.

B. Computational time comparison
The key advantage of using machine learning to model relationships in data is very low inference times compared to ILP and, even, heuristic methods.
The ILP has a worst-case computational complexity, as described in the following equation, where D is the number of demands or connection requests (D = ∑ z∈Z ⌈T C z ⋅ M⌉), E is the number of edges, and W is the number of wavelengths used: 45 O (2 D⋅E⋅W ). For the heuristic algorithms used, the complexity generally scales as in Eq. (16), 46 but must be modified to include the term R-number of sequential loading iterations, as the heuristic needs to run many times before finding the optimum RWA, The advantage of the MPNN, thus, is that it can directly evaluate the network properties, learned from previous data. For the inference of the MPNN, the complexity is defined in the following equation, 47 where T denotes the number of message passing rounds, defined previously, and d is the length of hidden dimensional vector used, 16 in our case: Comparing Eqs. (15)- (17), it can be seen that the MPNN scales the best, computationally, with the number of nodes. To quantify the computational benefits of using MPNNs to model optical networks, graphs with nodes varying from 10 to 20 were used to evaluate the ILP, FF-kSP, kSP-FF, and MPNN time performance. Here, the node size has been expanded to 20 nodes for the test set to see the computational time scaling better with smaller graphs. This, however, was not possible in the case of the training set since the ILP computation is too complex. For each graph, the respective methodologies were used to calculate the maximum achievable throughput and their computation times measured and plotted in Fig. 6(a). The reduction in computation time through using the MPNN model can be clearly seen, where it takes ∼10 s of ms, compared to 10, 100, and 1000 s of seconds for kSP-FF, FF-kSP, and ILP, respectively. The same trend, between FF-kSP and the MPNN, was analyzed for the larger networks using the beta and gamma test sets, with the results shown in Fig. 6(b). The MPNN again shows a minimum computation time increase for these node ranges, compared to the heuristic method.
Therefore, the MPNN model can be seen as an accurate and fast method for evaluating performance metrics of graphs, which are generally computationally difficult to evaluate. This would allow for the fast evaluation of a large number of graphs within any topology design process. However, how well does this model generalize when varying the input structure and traffic distributions that affect the throughput of the network? This is the question explored in Sec. V C.

C. Generalization capability
The generalization capability of an machine learning (ML) model refers to the ability of the model to operate over distributions not seen during the training process. In the context of this work, this could encompass different graph structures, different traffic distributions, or distance scales. Here, we choose to evaluate the model against two different graph structures and a change in input traffic distribution.
To test different graph structures, two common generative models were used to vary the structures tested, the Erdos-Renyi (ER) and Barabasi-Albert (BA) models. 48,49 Using these models, 5000 graphs with 25 ≤ |N| ≤ 45 were generated per generative model. After calculating the throughputs of these graphs using the FF-kSP heuristic and the Gaussian Noise (GN) model, we feed the graphs through the MPNN model to predict the throughput values.
When comparing the accuracy of the model over these varied graph structures, one can see that the accuracy, in terms of coefficient of determination (R 2 ), drops significantly, as seen in Table II.
This means that the predictions vary largely from the actual labels of the graphs. However, this is to be expected as the graph structures resulting from the ER and BA models are largely different from those from SNR-BA graphs. 6 This difference in the graph structure can be quantified by the weighted spectral density distance (WSD), 50 which measures the difference in structure between two sets of graphs. When the WSD (FWSD) is smaller, the graph structures are more similar. The weighted spectral density distance is calculated between the original SNR-BA test graphs and the various test sets and are shown in Table II. Here, one can see that the WSD is close to zero for SNR-BA graphs as they are generated from the same distribution as the originally tested SNR-BA graphs. We ARTICLE scitation.org/journal/aml can see that the ER and BA graph's WSD is larger, where the ER structures are the most different from the original test graphs. In addition, due to the difference in graph structures, there are large differences in throughput distributions. This is measured by the Kolmogorov-Smirnov two sample test, which returns a distance D, which signifies the largest absolute difference between the CDF of two distributions. In Table II, the large D values for ER and BA test sets signify the difference in throughput distributions to those seen during training. These D values are larger than those of the SNR-BA test distributions, meaning that their throughput distributions are further from those arising from SNR-BA distributions. The Pearson's correlation coefficient (ρ) however remains high, which shows that it still indicates the throughput trends of the networks well. This is important for optimization as the cost function might not need to accurately describe throughput exactly; however, it needs to describe one graph being better than another.
To investigate how a change in the traffic demand matrix affects the accuracy of the model, we generated localized skewed traffic matrices for 5000 graphs. By defining the traffic as in the following equation, we generated five different traffic skews shown in Table II: For each of these skewed traffic matrices, we tested the MPNN accuracy in terms of R 2 and ρ. The R 2 values drop by about 7% and remain constant for the different skew values (γ > 0). The Pearson's correlation coefficient (ρ) value remains high at around 0.98, unchanged from the uniform traffic distribution results, meaning that it still predicts the throughput trend well.
The large variation in performance, in terms of R 2 for ER and BA graphs, indicates that the trained MPNN model does not generalize well to largely different graph structures. This highlights the importance of using a variety of different structures within the training dataset and that an expansion of the training data is necessary here to represent different graph structures. Once the training set is more representative, by generating a variety of graph structures, it accurately evaluates the vast solution space of the topology design problem.

VI. CONCLUSION
In this paper, the well-known NP-hard problem of estimating optimal performance of an optical network was investigated, focusing, in particular, on the maximum achievable network throughput. Understanding and being able to predict it are vital for optimizing optical networks for the future to maximize the longevity of investments made in infrastructure for communications and improve existing topologies. We have proposed the use of a geometric deep learning framework to learn this network property from a previously generated training dataset of graphs.
By applying this methodology, we showed that the model can accurately model the maximum achievable throughput of previously unseen graphs, proven by high coefficient of determination accuracy (R 2 ) and Pearson's correlation coefficient (ρ) values. This accuracy was achieved with significant time savings of multiple orders of magnitude, enabling rapid prediction (∼30 ms for 100 node graphs) of maximal performance of a large number of graphs.
Furthermore, both the importance and weaknesses of the training datasets were highlighted by applying the model to other graph structures, such as ER and BA graphs, where the model struggled to give accurate predictions, showing that the training set of graphs needed to be expanded. The generalization of different traffic input distributions generally performed well and showed suitability for future optimization purposes.
This type of modeling is invaluable to further optimize the physical topologies of optical networks as it acts as a surrogate model in future optimization. It directly enables the use of the maximum achievable throughput in the cost function of any number of optimization algorithms and gives huge time savings. In addition, networks are ever-evolving and may not operate with a fixed grid of channels. They might have a flexible grid, regeneration devices, or intermediate grooming at routers or switches. These changes in network configuration could be included in future training sets by using them as node or edge features and generating more training data using a range of these configurations. Further investigations will include work on analyzing the resilience of networks, accurate traffic modeling, and generalization to multiple fiber systems using MPNNs.

ACKNOWLEDGMENTS
Financial support from UK EPSRC Doctoral Training Programme and the Programme Grant TRANSNET (Grant No. EP/R035342/1) is gratefully acknowledged. Microsoft is acknowledged for the support under the "Optics for the Cloud" program.

Conflict of Interest
The authors have no conflicts to disclose.