Functional networks have emerged as powerful instruments to characterize the propagation of information in complex systems, with applications ranging from neuroscience to climate and air transport. In spite of their success, reliable methods for validating the resulting structures are still missing, forcing the community to resort to expert knowledge or simplified models of the system's dynamics. We here propose the use of a real-world problem, involving the reconstruction of the structure of flights in the US air transport system from the activity of individual airports, as a way to explore the limits of such an approach. While the true connectivity is known, and it is therefore possible to provide a quantitative benchmark, this problem presents challenges commonly found in other fields, including the presence of non-stationarities and observational noise, and the limited length of available time series. We explore the impact of elements like the specific functional metric employed, the way of detrending the time series, or the size of the reconstructed system and discuss how the conclusions here drawn could have implications for similar analyses in neuroscience.

The reconstruction of functional complex networks finds application in fields as diverse as neuroscience, climate, economics, finance, or air transport, thanks to their ability to reveal not only the hidden connectivity among elements within a complex system but also the relationship between such connectivity and the underlying dynamics. The reliability of results heavily depends on selecting the appropriate connectivity metrics, a choice constrained by data availability and the field of application; on the other hand, estimating such reliability is a challenging task, as the true underlying connectivity is usually not known. Our study addresses this challenge by considering an air transport problem, involving the reconstruction of US flight networks solely based on time series of airport operations. Using the true flight network as a reference, our analysis scrutinizes the sensitivity of results to four parameters of interest and three data pre-processing strategies. Our findings underscore a significant decline in the precision of the network reconstruction process as the network size surpasses 25–30 nodes and highlight that accepting a minor increase in the number of false positive links, through a careful adjustment of the significance level threshold, significantly enhances precision. Additionally, we show that metrics tailored to linear dependencies excel in this problem, whose nature is mainly linear, whereas some metrics sensitive to both linear and non-linear relationships demand extensive datasets to achieve comparable levels of performance. In short, we here leverage a real-world problem to compare multiple reconstruction tools and data processing techniques and provide insights and commentary on the limits of data-based functional networks.

## I. INTRODUCTION

Among the many possibilities that the theory of complex networks has offered to the scientific community in the last decade, of special relevance is the reconstruction and analysis of functional networks. These allow the study of the connectivity structure among the elements composing a complex system even when explicit information about such connectivity is not available. The basic idea is that the observed behavior of such elements is a function of both their internal dynamics and their connectivity; the analysis then becomes an inverse problem, in which the connectivity is inferred from the observed dynamics. This is usually done by encoding the dynamics of each element of interest through a time series; mapping those elements to nodes of a network; and pairwise connecting them when a flow of information (as obtained through metrics like correlation or causality tests) is observed between the corresponding time series. The possibility of recovering such hidden connectivity has found numerous applications in multiple fields, including neuroscience,^{1,2} climate,^{3–6} economics and finance,^{7–10} or air transport,^{11–14} to name but a few.

This approach is hindered by two important intertwined problems: the presence of many alternatives throughout the reconstruction process and the difficulty of validating the obtained results. Regarding the former, the researcher has to make multiple decisions on, for instance, how to pre-process the data, which functional metric to use, and how to post-process the obtained networks. As few solid theoretical principles have been proposed, such choices are mostly made according to subjective criteria. Regarding the latter, in most real scenarios, the actual connectivity network is not known, as, otherwise, there would be no need to reconstruct the functional one; hence, the two cannot be compared. This implies that the aforementioned choices cannot be directly tested against a ground truth. Beyond blindly trusting the mathematics behind the reconstruction approach, validation attempts have mostly been based on two ideas. On one hand, many studies have relied on synthetic numerical systems, in which the dynamics and connectivity of the elements composing the system are fixed.^{15–19} Yet, these synthetic systems do not always represent the complexity behind real data, and their results are only as realistic as the models used: conclusions should not be taken at face value. On the other hand, the use of exogenous information has been proposed, the reconstructed network then being evaluated on its capacity to recover such information.^{20–22} To illustrate, the researcher may reconstruct functional networks corresponding to healthy control subjects and to patients suffering from a neurological pathology; if a correlation between the topology and the group is found, it can be concluded that the functional networks are indeed representing the characteristics of brain activity. It is worth noting that such a test nevertheless evaluates the usefulness of the reconstruction, as opposed to the similarity between the result and the underlying information flow in the brain.

In this contribution, we explore the evaluation of the reliability of functional network reconstruction in the context of air transport and specifically the reconstruction of the network of flights in the United States using only information about operations at airports (i.e., departures and landings). While being conceptually similar to the problem of detecting the propagation of delays,^{11–14} it presents the advantage that the real connectivity network is known and is given by the operated flights. At the same time, the system under analysis is a real one, with many idiosyncrasies (including data limitedness and non-stationarity) that can hinder the results. We consider a large set of functional connectivity metrics and data pre-processing strategies and quantitatively evaluate the reliability of the reconstructed network against the real one. Parameters of sensitivity, like the size of the network and the length of the available time series, are further explored employing both real and synthetic data.

Results indicate that functional networks are good representations of the underlying system, provided only a limited number of nodes are included in them, and the dynamics of the system in the considered period is stationary. Some important caveats are further highlighted, including the inability of standard significance thresholds to define which links have to be discarded in the case of tests yielding a $p$-value.

## II. DATA PRE-PROCESSING AND NETWORK RECONSTRUCTION

For the sake of completeness, we start by reviewing the main methods used throughout this work, specifically: the strategies to eliminate trends in the time series and to make them stationary (Sec. II A); the functional metrics used to detect relationships between pairs of airports (Sec. II B); and the approach to evaluate the quality of the reconstructed networks, i.e., their degree of equivalence to the real ones (Sec. II C). While we have tried to include a large selection of metrics, covering all those commonly used in the literature, it is worth noting that any such list is necessarily incomplete. Notably, many application-specific alternatives have been proposed in the last decade, especially for the analysis of brain activity data,^{18,23} which could not be applied here due to their underlying hypotheses on the nature of the data.

### A. Time series pre-processing

The starting point of the analysis here presented is a time series representing the hourly number of landing operations at a set of airports, either obtained through a synthetic model or from real data. Most of the functional metrics here considered are based on the hypothesis that these time series are stationary, i.e., that their values have no explicit dependence on time. If such a hypothesis is not fulfilled, usually the result is the detection of spurious relationships. To illustrate, it is well-known that most of the flights are operated throughout the day, with fewer operations happening at night; if a correlation metric were used on the raw time series without eliminating the intra-daily periodicity, the result would be a strong relationship between all airports, irrespective of the real topology.

In addition to not performing any pre-processing on the time series, which is here used as a null model (NM), the following three transformations have been considered:

*Delta.* This normalization is calculated as the difference between the value observed at a given day and hour and the average value observed on the same weekday and at the same hour: $\Delta x(t) = x(t) - \langle x(t + 24 \cdot 7 \cdot k) \rangle$, with $k \in \mathbb{Z}$ and $\langle \cdot \rangle$ representing the average.

*Second difference.* The second difference of the time series, i.e., $x''(t) = x(t) - 2x(t-1) + x(t-2)$, is a common and simple way of making time series stationary when no information about the nature of the underlying periodic trends is available.

*Z-Score.* We finally consider a normalization based on a Z-Score, defined as $z(t) = [x(t) - \langle \tilde{x}(t) \rangle]/\sigma_{\tilde{x}(t)}$. Here, $\tilde{x}(t)$ represents the set of values observed on the same weekday and at the same hour, i.e., $x(t + 24 \cdot 7 \cdot k)$ with $k \in \mathbb{Z}$; additionally, $\langle \cdot \rangle$ and $\sigma_{\cdot}$ represent, respectively, the average and the standard deviation. Note that, when compared with the previous normalization, the Z-Score encodes how much the observed value deviates from the expectation, also taking into account the variability of the latter. In this sense, the Z-Score can be defined as a standard-deviation-normalized version of the Delta method.
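As an illustration, the three transformations can be sketched as follows. This is a minimal Python sketch, assuming hourly data with a weekly seasonality of $24 \cdot 7$ samples; the function names are ours.

```python
import numpy as np

PERIOD = 24 * 7  # hourly data, weekly seasonality


def delta(x):
    """Delta: subtract the average value observed on the same
    weekday and at the same hour."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    for phase in range(PERIOD):
        idx = np.arange(phase, len(x), PERIOD)
        out[idx] = x[idx] - x[idx].mean()
    return out


def second_difference(x):
    """x''(t) = x(t) - 2 x(t-1) + x(t-2); two samples shorter."""
    x = np.asarray(x, dtype=float)
    return x[2:] - 2 * x[1:-1] + x[:-2]


def z_score(x):
    """Z-Score: the Delta above, further normalized by the standard
    deviation of the same weekday/hour values."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    for phase in range(PERIOD):
        idx = np.arange(phase, len(x), PERIOD)
        sigma = x[idx].std()
        out[idx] = (x[idx] - x[idx].mean()) / sigma if sigma > 0 else 0.0
    return out
```

Note that, as expected, a purely weekly-periodic signal is mapped to zero by the Delta transformation, while the second difference shortens the series by two samples.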

### B. Network reconstruction

Once the time series representing the evolution of the traffic volume at each airport have been preprocessed, the next step involves the reconstruction of the corresponding functional network. Given a set of $N$ airports, this process involves testing for the presence of a relationship between any pair $x\u2192y$ of them and storing the result as the weight of the corresponding link in a weight matrix $ W$ of size $N\xd7N$.

In this work, we consider the following standard metrics used in the literature; for the sake of clarity, we are here assuming the test for $x$ driving $y$.

*Null model (NM).* In order to have a reference for the evaluation of the reconstructed networks, we start by considering a null model in which the weight between two airports is assumed to be proportional to the product of the number of flights operating at each of them. From an air transport perspective, this can be seen as a gravity model in which two airports attract flights according to their respective importance.^{27–29} On the other hand, the complex network practitioner will recognize the idea behind a null model preserving the degrees of nodes.^{30}

*Rank Correlation (RC).* Spearman's rank correlation between the two analyzed time series. In order to account for the time required for the propagation of flights, the caused time series is shifted between $0$ and $5$ h ahead in time; in other words, the correlation is calculated on the pairs $x(t), y(t+\lambda)$, with $\lambda \in \{0, 1, \ldots, 5\}$. The final result of the test corresponds to the smallest obtained $p$-value.

*Ordinal Synchronization (OS).* Echegoyen *et al.*^{31} introduced a metric to evaluate the level of synchronization between two time series, $x(t)$ and $y(t)$, grounded on their projection into non-overlapped $D$-ordinal patterns,^{32} $V_t$ and $W_t$, respectively. The instantaneous ordinal synchronization at time $t$, $\text{IOS}_t^{raw} = V_t^N \cdot W_t^N$, is defined as the dot product between the ordinal vectors at time $t$, previously normalized by their Euclidean norm ($V_t^N = V_t/\|V_t\|$ and $W_t^N = W_t/\|W_t\|$). Then, the value of $\text{IOS}_t^{raw}$ is linearly re-scaled to be bounded between $-1$ and $1$,

$$\text{IOS}_t = 2\left(\frac{\text{IOS}_t^{raw} - \min}{1 - \min} - \frac{1}{2}\right),$$

where $\min$ is the minimum possible value of the scalar product between two ordinal vectors. In general, for an arbitrary ordinal vector $U = \{0, 1, \ldots, D-1\}$ of length $D$,

$$\min = \frac{U \cdot U^R}{\|U\|^2}, \quad \text{with } U^R = \{D-1, \ldots, 1, 0\}.$$

Thus, the ordinal synchronization $\text{OS}_{x,y}$ is obtained by averaging the instantaneous values of $\text{IOS}_t$ along the whole time series of length $M$, i.e., $\text{OS}_{x,y} = \langle \text{IOS}_t \rangle$, with $t \in \{1, \ldots, M-D\}$ and $D=5$.

*Continuous Ordinal Patterns (COP) causality.* This causality metric is based on an evolution of the classical concept of ordinal patterns,^{32} introduced in Ref. 33 as continuous ordinal patterns. In short, given a set of sub-sequences of the original time series, a COP is defined as the set of values that best describes their evolution. This idea has then been exploited in Ref. 34, showing that pre-processing two time series using the COP maximizing their differences yields an improved causality detection by a classical Granger causality test (see below). COPs here have a length (also known as the embedding dimension) of $D=5$ and are obtained by testing a set of $100$ random patterns—see Ref. 34 for details.

*Granger Causality (GC).* The GC^{35} is one of the best-known exponents of *predictive causality* tests^{36} and is based on assessing whether the inclusion of information about the driving element helps predict the future dynamics of the driven element. Such an assessment is performed by comparing two autoregressive-moving-average (ARMA) models forecasting the future dynamics of $y$, respectively including or not information about the past of $x$; an F-test is finally performed on the residuals, thus obtaining a $p$-value assessing whether the presence of information about $x$ is relevant—and, hence, whether a causality relationship is present.

*Time-Reversed Granger Causality (TR-GC).* Among the alternatives that have been proposed to improve the robustness of the GC test in the presence of mixed signals and noise, we here consider the concept of time-reversed GC. It is based on contrasting the causality scores obtained on the original and time-reversed time series, with the idea that the directed information flow should be reduced if the temporal order is reversed.^{37,38} Given the GC score between the two time series $F_{x \to y}$, a new score is calculated as $\tilde{D} = F_{x \to y} - \tilde{F}_{\tilde{x} \to \tilde{y}}$, where $\tilde{\cdot}$ represents the time-reversal operation. The larger $\tilde{D}$, the stronger the information flow between $x$ and $y$.

*Mutual Information (MI).* MI is an entropy-based metric used to measure the amount of information shared between two variables, $X$ and $Y$. In this context, "information" refers to the uncertainty associated with the measurement of a particular variable. MI quantifies the expected reduction in uncertainty about one variable (e.g., $X$) resulting from knowledge of the value of another variable (e.g., $Y$). The MI between two variables $X$ and $Y$ is given by

$$I(X;Y) = \sum_{y} \sum_{x} P(x,y) \log\left(\frac{P(x,y)}{P(x)\,P(y)}\right),$$

where $P(x,y)$ represents the joint probability distribution, while $P(x)$ and $P(y)$ denote the marginal probability distributions. The MI, as defined above, quantifies the degree of dependence between two random variables and can be interpreted as a bi-directional information measure. Being a model-free tool capturing both linear and non-linear dependencies, MI is commonly recognized as a versatile metric for assessing relationships.

*Transfer Entropy (TE).* TE is an information-theoretic measure that quantifies the reduction in uncertainty of a future value of $Y$ when the past value of $X$ is known, given that the past value of $Y$ is already accounted for. In other words, TE evaluates the extra information that the historical state of process $X$ provides about the next observation in process $Y$, given the knowledge of $Y$'s past. The concept of transfer entropy was pioneered independently by Schreiber^{39} and Paluš^{40} and is mathematically represented as

$$TE_{X \to Y}(k,l) = \sum_{x \in X,\, y \in Y} p\left(y_{n+1}, y_n^{(l)}, x_n^{(k)}\right) \log\left(\frac{p\left(y_{n+1} \,|\, y_n^{(l)}, x_n^{(k)}\right)}{p\left(y_{n+1} \,|\, y_n^{(l)}\right)}\right).$$

Here, $l$ and $k$ designate the embedding vectors, or the past values, of the $Y$ and $X$ processes, respectively, i.e., $x_n^{(k)}$ corresponds to $(x_n, \ldots, x_{n-k+1})$. The most notable difference between MI and TE is that the latter is not symmetric, thus indicating the direction of the information flow from the source to the target variables. By capturing the direct exchange of information between two variables and being sensitive to non-linear attributes, TE is one of the measures used for network inference.
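To make the structure of such a test concrete, the lagged rank-correlation (RC) metric described above can be sketched as follows. This is a minimal Python sketch using `scipy.stats.spearmanr`; the function name and return convention are ours.

```python
import numpy as np
from scipy.stats import spearmanr


def rc_test(x, y, max_lag=5):
    """Test x -> y: Spearman correlation on the pairs x(t), y(t + lag)
    for lag = 0..max_lag, keeping the smallest p-value."""
    best = (0.0, 1.0, 0)  # (rho, p-value, lag)
    for lag in range(max_lag + 1):
        xs = x if lag == 0 else x[:-lag]
        ys = y if lag == 0 else y[lag:]
        rho, p = spearmanr(xs, ys)
        if p < best[1]:
            best = (rho, p, lag)
    return best
```

For instance, if $y$ is a copy of $x$ delayed by two steps, the test recovers both the dependence and the correct lag.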

### C. Evaluation of the reconstructed networks

Before moving to the description of the way the resulting networks are evaluated, it is important to pause and discuss what the previous metrics and tests yield. With the only exception of the GC and RC tests, the result of analyzing a pair of traffic time series is a statistic, or a metric, defining how strongly these are related; in other words, the larger the output, the more confident we can be about the existence of a connection between the corresponding nodes. This is nevertheless not a binary answer, i.e., whether a link is present or not, but rather a relative weight.

While the same happens in the case of tests yielding a $p$-value, such as the GC and RC ones, the problem is commonly dismissed by using a significance threshold: if the $p$-value is below a given value $\alpha$, possibly after a multiple test correction, the link is considered to exist. This is nevertheless an oversimplification. To understand why, it is necessary to recall that the $p$-value represents the probability of seeing an effect of equal or larger magnitude if the null hypothesis were true. In other words, if the obtained $p$-value is extremely small, it is telling us that it is very unlikely that the effect observed in the data could have emerged in the absence of a link. Yet, very unlikely is not the same as impossible. Consequently, $1/p$ (or $1-p$) can be seen as an indicator of the confidence in the presence of a link. For an in-depth discussion of how $p$-values can be transformed into probabilities using Bayesian statistics, and of a probabilistic interpretation of the resulting topological metrics, the interested reader can refer to Ref. 41.

The evaluation here carried out takes this factor into account and aims at checking whether the strongest links, according to the statistics and $p$-values yielded by the tests previously described, are also the strongest links in the data, in this case measured by the number of flights connecting the airports under analysis. Following this idea, two complementary metrics are evaluated:

*Area Under the Curve (AUC).* This metric represents the probability that, given a pair of airports connected by a flight, the weight of the corresponding link is higher than that of a pair of airports not connected; it is thus similar to the Mann–Whitney U or Wilcoxon rank tests.^{42,43} The AUC ranges from $0$ to $1$, where a value of $1$ indicates that the network structure is perfectly recovered, while an AUC of $0.5$ indicates non-informative results, i.e., no better than flipping a coin for each link.

*Rank correlation.* The absolute value of the rank correlation between the link weights, as calculated by the functional metrics, and the corresponding weights in the real networks, here representing the number of operated flights. Values close to $1$ thus indicate that the recovered ranking of link importance is similar to the real one.

Note the complementarity of the two metrics: while the AUC measures the correctness of the recovered binary structures (i.e., whether links exist or not), the rank correlation focuses on the relative importance of each link. In extreme cases of fully connected networks, it is, thus, possible to obtain an AUC of $1.0$, as a null threshold would recover the full network; yet, the rank correlation may be close to zero if the importance of each link does not correspond with the real one. In order to avoid confusion between the RC functional metric of Sec. II B and the rank correlation as a way to evaluate the results, the former is always referred to by its acronym.
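The two evaluation metrics can be computed jointly as sketched below. This is a minimal Python sketch; `evaluate` and its argument names are ours, with `true_flights` holding the number of operated flights for each pair of airports.

```python
import numpy as np
from scipy.stats import spearmanr


def evaluate(weights, true_flights):
    """AUC: probability that a pair of airports connected by flights
    receives a larger functional weight than an unconnected pair.
    Rank correlation: |Spearman| between functional and real weights."""
    w = np.asarray(weights, dtype=float).ravel()
    f = np.asarray(true_flights, dtype=float).ravel()
    pos, neg = w[f > 0], w[f == 0]
    # AUC as the fraction of correctly ordered (link, non-link) pairs,
    # with ties counted as one half -- i.e., the normalized
    # Mann-Whitney U statistic.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    auc = (greater + 0.5 * ties) / (len(pos) * len(neg))
    rho, _ = spearmanr(w, f)
    return auc, abs(rho)
```

Note how the sketch reflects the complementarity discussed above: a weight vector perfectly separating links from non-links yields an AUC of $1.0$, while the rank correlation additionally depends on the ordering of the true link weights.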

## III. ANALYSIS OF SYNTHETIC DATA

Before moving to the analysis of real air transport data, the network reconstruction is here tested with synthetic data generated by a minimal traffic model. The objective is threefold: to evaluate the impact of known characteristics of the real time series, such as their non-stationarity and noisiness; to assess the impact of sensitivity parameters, including the time series length and the presence of observational noise; and to evaluate the implications of the use of a significance threshold in the binarization of the resulting functional network.

### A. The synthetic traffic model: Stationary version

The initial version of the synthetic model is composed of a set of $N$ airports that are connected in a random configuration. We specifically consider a directed Erdös–Rényi graph of $N$ nodes and $l$ links. Note that, for the sake of simplicity, this last value is set through the link density $l_d$, such that $l \approx l_d N(N-1)$. Two additional values are attached to each link of the network. First, a weight representing how much traffic is expected on that link, which allows us to introduce heterogeneity between pairs of airports; given two airports $i$ and $j$, the value of $w_{i,j}$ is randomly drawn from a uniform distribution $U(0.5, 1.5)$. Note that, while any positive range could here be used, we opted for a minimum value of $0.5$ to ensure enough information (i.e., traffic) is available on each link. Second, we simulate the time required by aircraft to move between two airports by introducing a lag factor $\lambda_{i,j}$, defined as an integer random number between $1$ and $4$, both included. Note that no constraint is imposed on the time (and, hence, distance) between pairs of airports, as would be required to, e.g., fulfill the triangle inequality.

Once the connectivity network is defined, we use it to construct the time series of operations at each airport. Specifically, for each pair of connected airports (according to the previous graph) and for each time step, we first draw a random number from $U(0,1)$ and skip this time step if such number is greater than $0.2$. Note that this is equivalent to allowing flights between two connected airports only every five time steps on average and is used to recreate the sparseness and burstiness of real operations. Second, we define the number of aircraft operating between the two airports as $v_{i,j}(t) = \lfloor \epsilon w_{i,j} \rfloor + 1$, where $\epsilon$ is a random value drawn from an exponential distribution with rate parameter equal to one. As a final step, this number of aircraft $v_{i,j}(t)$ is added to the time series of the number of operations of airport $i$ at time $t$ and to that of airport $j$ at time $t + \lambda_{i,j}$ (i.e., to account for the flight time).

In short, the traffic between two airports is defined through a time series in which aircraft operate in bursts, with their intensity being proportional to the link weight $w$. Note that these traffic time series are stationary, i.e., they present no trends or seasonalities, as their values have no dependency on time; this is not generally true for real-world data and is a topic that will be tackled below. Finally, the complete traffic time series of one airport is composed of two contributions: all flights departing toward other airports and all flights arriving from them, with a delay given by the corresponding $\lambda$. A graphical representation of the result, for a system composed of three airports and three routes, is depicted in the top part of Fig. 1.
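The steps above can be condensed into a short generator. This is a minimal Python sketch of the stationary model under the stated parameters; the function name and default values are ours.

```python
import numpy as np


def simulate(N=20, link_density=0.1, T=2000, seed=0):
    """Stationary traffic model: directed random graph with link
    weights w ~ U(0.5, 1.5) and integer lags in 1..4; each link is
    active at a given time step with probability 0.2, in which case
    floor(eps * w) + 1 aircraft (eps ~ Exp(1)) are counted as
    operations at the origin at time t and at the destination at
    time t + lag."""
    rng = np.random.default_rng(seed)
    adj = rng.random((N, N)) < link_density
    np.fill_diagonal(adj, False)
    w = rng.uniform(0.5, 1.5, size=(N, N))
    lag = rng.integers(1, 5, size=(N, N))
    ops = np.zeros((N, T + 4))  # extra room for the largest lag
    for i, j in zip(*np.nonzero(adj)):
        active = rng.random(T) < 0.2
        flights = (np.floor(rng.exponential(1.0, T) * w[i, j]) + 1) * active
        ops[i, :T] += flights                           # departures at i
        ops[j, lag[i, j]:T + lag[i, j]] += flights      # delayed arrivals at j
    return adj, ops[:, :T]
```

The returned adjacency matrix serves as the ground truth against which the functional reconstruction can then be benchmarked.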

The results of calculating the tests and metrics of Sec. II B are presented in Fig. 2. The four panels report the results as a function of, from left to right and top to bottom, the number of simulated airports $N$, the length of the reconstructed time series $l_t$, the link density $l_d$, and the amplitude $\sigma$ of an observational noise added to the time series. Additionally, the left and right graphs in each panel report the evolution of the AUC and of the rank correlation, as discussed in Sec. II C.

Some interesting conclusions can already be drawn. First of all, panel (a) of Fig. 2 shows that results are good only when the network is very small, here for $N<15$. Even for slightly larger networks, e.g., for $N>30$, both the AUC and the rank correlation substantially drop; and for $N=50$, results are essentially indistinguishable from random ones. Second, simpler metrics, like GC or RC, perform better than more complex and non-linear ones. This is to be expected, as the problem under analysis is a linear one, due to the necessary conservation of the number of aircraft on each route and to the limited length of the time series. On this last point, some metrics, like the TE, improve their results with very long time series [see panel (b) of Fig. 2]. Third, sparser networks are easier to reconstruct than denser ones, see panel (c) of the same figure, as the information needed to reconstruct each link is less polluted by the information corresponding to other connections. Finally, as is to be expected, the addition of observational noise to the raw time series hinders the reconstruction process—see panel (d).

It is important to note that the synthetic model here considered is based on a directed Erdös–Rényi graph, which has been chosen for being the most hypothesis-free model of connectivity in a graph. Real air traffic is nevertheless not random and is specifically mostly bidirectional: if $n$ flights go from airport $A$ to airport $B$, a very similar number should be expected in the route $B\u2192A$. Note that such symmetry may have a major impact on results, depending on the metrics used to detect functional connectivity. To understand this point, Fig. 3 reports the results obtained with two modifications of the model in which the network is forced to always be symmetric or anti-symmetric. Enforcing a symmetry is beneficial to MI, as its results for $A\u2192B$ are by definition the same as for $B\u2192A$ but is extremely detrimental to TR-GC, as, in this case, the presence of an information flow between $A$ and $B$ precludes the detection of the opposite flow. Such conclusions will help to explain the results obtained in the case of the real US network.

### B. Is the significance level really significant?

As previously discussed in Sec. II B, in the literature, it is customary to use a significance level to post-process the $p$-values obtained by statistical tests, especially by the GC. To illustrate, a significance level of $\alpha =0.01$ may be set, which is then corrected for multiple comparisons through, for instance, a Bonferroni correction to obtain $ \alpha \u2217=0.01/[N(N\u22121)]$. Whenever the $p$-value obtained by testing two time series is below $ \alpha \u2217$, a link between the corresponding nodes is established.

Beyond the theoretical problems this approach entails, we here illustrate another, more practical issue. Fixing the significance level is tantamount to fixing the maximal false positive rate that we are willing to accept in the results. While this is strictly necessary in some contexts, e.g., while evaluating the effects of a drug,^{44} the objective here is different: to reconstruct a good representation of the underlying network. For this, we need to maximize the number of links correctly estimated (as either existing or not), but not necessarily limit the number of false links.

An example may help elucidate this point. Suppose the system under analysis is composed of $N=10$ nodes; by setting $ \alpha \u2217=0.01/[N(N\u22121)]\u22481.11\xd7 10 \u2212 4$, we expect to find on average one false positive link, and we further observe ten true positive reconstructed links. Now, let us suppose that by doubling $\alpha $ to $0.02$ we double the number of false positives (now two of them are present), but we also double the number of retrieved true positive links. From the point of view of understanding the underlying connectivity structure, retrieving $22$ links (two of which being false) is more informative than retrieving only $11$ links (one being false). In other words, we may accept increasing the number of false positives, provided the increase in the number of true positives compensates for that.

In order to understand how this affects the problem at hand, Fig. 4 reports the evolution of the best thresholds, for the GC (blue dashed lines) and the RC (orange solid lines) tests, obtained by maximizing the difference between the true positive and false positive rates in the synthetic model. When compared to the use of a fixed significance level $\alpha = 0.01$ (see dotted gray lines), it is clear that optimal results are obtained by using larger thresholds, i.e., by allowing a larger number of false positive links in the network in exchange for more true positive ones. Notably, such an optimal threshold depends on all parameters here considered, including the observational noise; finding a way of fixing it a priori may not always be feasible.
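The threshold optimization described above, i.e., maximizing the difference between the true and false positive rates (Youden's J statistic), can be sketched as follows; the function name is ours.

```python
import numpy as np


def best_threshold(p_values, true_links):
    """Scan the observed p-values as candidate thresholds and return
    the one maximizing TPR - FPR (Youden's J statistic)."""
    p = np.asarray(p_values, dtype=float).ravel()
    truth = np.asarray(true_links, dtype=bool).ravel()
    n_pos = max(truth.sum(), 1)
    n_neg = max((~truth).sum(), 1)
    best_t, best_j = None, -np.inf
    for t in np.unique(p):
        pred = p <= t  # links whose p-value passes the threshold
        j = (pred & truth).sum() / n_pos - (pred & ~truth).sum() / n_neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

As discussed above, the threshold selected in this way is typically larger than a Bonferroni-corrected significance level, as it trades a few false positives for many more true positives.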

### C. The synthetic traffic model: Non-stationary version

Air transport operations, and, hence, the time series extracted from them, are characterized by non-stationarities developing at multiple scales—from the day/night cycle to weekly and seasonal traffic patterns. As stationarity is a requirement of most correlation and causality metrics, we here simulate the presence of regular patterns in the data, to then check the efficiency of the three pre-processing strategies described in Sec. II A. For that, we modify the model presented in Sec. III A to calculate only the values corresponding to the first $24$ time steps for each airport, and then repeat those values, adding integer random numbers in the range $[-2, 2]$. In other words, we use the original model to simulate the dynamics during one day (i.e., $24$ h) and then repeat a similar pattern across several days. An example of the resulting time series is depicted in the bottom part of Fig. 1.
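This non-stationary surrogate can be sketched as follows, assuming, as described above, that the 24-step daily pattern is simply tiled across days and perturbed by integer noise in $[-2, 2]$; the function name is ours.

```python
import numpy as np


def make_nonstationary(day, n_days, seed=0):
    """Tile a 24-step daily pattern over n_days days, adding integer
    noise uniformly drawn from [-2, 2] at every time step."""
    rng = np.random.default_rng(seed)
    pattern = np.tile(np.asarray(day, dtype=float), n_days)
    return pattern + rng.integers(-2, 3, size=pattern.size)
```

The resulting series deviates from the repeated daily pattern by at most two operations per hour, thus reproducing a strong but noisy daily seasonality.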

The results, as a function of the number of airports in the model, are depicted in Fig. 5, where the four groups of panels correspond to the four pre-processing strategies. Notably, pre-processing through a delta [panel (b)] or a second difference [panel (c)] yields poor results, mostly similar to those obtained on the raw data [panel (a)]. The only effective strategy seems to be the application of a Z-Score, which allows recovering results very close to what was observed for the stationary model—see panel (a) of Fig. 2. This is also confirmed by the almost perfect correspondence between results obtained in the stationary and non-stationary (with Z-Score pre-processing) cases (see Fig. 6).

## IV. RECONSTRUCTING THE REAL US AIR TRANSPORT NETWORK

In order to test the previously obtained insights in the context of a real problem, we here consider the case of reconstructing the US air transport network, assuming that the only available information is the description of the operations taking place at each airport. We specifically consider operations at the largest US airports, provided by the Bureau of Transportation Statistics (BTS) of the United States Department of Transportation (DoT), including departure and arrival times of all flights between the years 2015 and 2019, both inclusive. Time series representing the number of landing aircraft per hour have been reconstructed and then processed using the same approaches applied to the synthetic data. Note that the resulting time series are highly non-stationary: out of the $30$ airports considered, $25$ yielded a $p$-value smaller than $0.01$ according to the Kwiatkowski–Phillips–Schmidt–Shin test;^{45} and $30$ out of $30$ were not stationary according to the augmented Dickey–Fuller unit root test.^{46}

The evolution of the quality of the reconstructed networks, as a function of the number of airports $N$ included in them, and for all combinations of metrics and pre-processing strategies, is reported in Fig. 7. Some interesting considerations ought to be highlighted. First, the AUC remains high in all cases up to $N \approx 15$. This is due to the fact that airports are sorted in descending order of size (i.e., number of operations); the first ones are, therefore, the largest and form a fully connected core that is easily recovered. The rank correlation of link weights, on the other hand, drops even for small values of $N$, suggesting that all algorithms struggle to recover the real internal structure. Second, results are qualitatively similar across all pre-processing strategies, and further similar to those obtained for the stationary model; see panel (a) of Fig. 2. This suggests that the daily and weekly trends characteristic of air transport do not strongly affect the results, probably due to the presence of enough variability in everyday operations, a consequence of, e.g., unexpected weather phenomena, delays, etc. Third, when using a Z-Score detrending and assessing the results through the rank correlation criterion, four functional metrics outperform the others: RC (orange solid line), COP (dark gray solid line), GC (blue dashed line), and MI (light gray solid line).
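The two quality criteria used throughout, the AUC over link detection and the rank correlation of link weights, can be computed, for instance, as follows; the toy weight matrix and the `evaluate_reconstruction` helper are illustrative only, not the paper's actual code:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_reconstruction(true_w, scores):
    """Compare a reconstructed weight matrix against the ground truth.

    AUC: probability that a true link receives a higher score than a non-link.
    Rank correlation: Spearman's rho between true and recovered weights,
    computed over the pairs that are actual links."""
    iu = np.triu_indices_from(true_w, k=1)   # assume symmetric, skip diagonal
    truth, s = true_w[iu], scores[iu]
    pos, neg = s[truth > 0], s[truth == 0]
    # AUC by direct pairwise comparison (fine for small networks).
    auc = (pos[:, None] > neg[None, :]).mean()
    rho, _ = spearmanr(truth[truth > 0], s[truth > 0])
    return auc, rho

# Toy example: recovered scores equal to true weights plus small noise.
rng = np.random.default_rng(4)
W = np.triu(rng.random((20, 20)) * (rng.random((20, 20)) < 0.3), k=1)
W = W + W.T                                  # symmetric ground truth
S = W + rng.normal(0, 0.05, W.shape)         # noisy reconstruction
auc, rho = evaluate_reconstruction(W, S)
```

Separating the two criteria matters because, as noted above, a method can score a high AUC by recovering the dense core while still misranking the individual link weights.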

The previous results have been obtained using five years of data, i.e., time series long enough to support the application of the metrics here considered. At the same time, air transport is a dynamic system, which constantly reconfigures itself in response to changes in demand and other external perturbations. It is, therefore, relevant to explore the limits of the network reconstruction process when one needs to focus on shorter time windows. Toward this end, panel (a) of Fig. 8 reports the evolution of the rank correlation when considering time series of different lengths, extracted at random from the 2015–2019 data set. The relationship between the quality of the results and the length of the time series has a concave shape: in other words, using too short or too long time series is detrimental to the reconstruction process. While this prima facie contradicts what was obtained with the synthetic model [see panel (b) of Fig. 2], it highlights the importance of taking into account the non-stationarity of the system: when long time series are used, the reconstructed network can only depict the constant core of the system, and sporadic connections are lost. Panel (b) of the same figure reports the evolution of the rank correlation as a function of the number of operations, when considering time series of one month in length. A weak negative correlation can be observed, i.e., it is easier to recover the underlying network structure in months with fewer operations. While this may be surprising, we speculate that it stems from two contributions. First, months with high levels of traffic (typically summer ones) see an increased number of routes and, thus, a higher link density; as seen in panel (c) of Fig. 2, this is detrimental to the network reconstruction process.
Second, even when keeping the link density constant, a larger number of flights may act as noise in the time series; to illustrate, the best situation for reconstructing a functional link would be a single daily flight between two otherwise isolated airports, as it would be clear which airports that flight connects. This is further confirmed in panel (c) of Fig. 8, reporting the evolution of the rank correlation when increasing the number of flights in the stationary synthetic model of Sec. III A; for the sake of clarity, the number of operations is there reported normalized by the corresponding number in Fig. 1.

## V. DISCUSSION AND CONCLUSIONS

In this contribution, we have explored, by means of synthetic models and real data sets, the process of reconstructing functional networks representing the presence of routes between a set of airports. While conceptually similar to other applications, especially the reconstruction of networks of delay propagation,^{11–14} this problem presents one important advantage: the underlying airport network is known, and the accuracy of the results can, therefore, be evaluated. At the same time, what is presented here differs from previous works in that the analysis is conducted on a real system, whose elements’ internal dynamics are neither artificially defined nor always easy to describe.^{47,48} Using a synthetic model simulating the movement of aircraft between airports, we have evaluated the performance of seven metrics (plus a null model) for detecting links, and their sensitivity to parameters like the size of the network, the length of the time series, the density of links, or the presence of observational noise (see Fig. 2). We have further compared three data pre-processing strategies for detrending the time series and thus eliminating the daily and weekly periodicities commonly found in this type of data (Fig. 5). We have finally analyzed a large data set of US flights and evaluated the reconstructed networks against the real one (Figs. 7 and 8).

The most important conclusion that can here be drawn is that the reconstructed networks have good properties, in terms of the number of correctly recovered links (measured through the AUC) and the links’ weights (rank correlation), provided the number of considered nodes is small. Both in the synthetic model and in the real data, going beyond 25–30 nodes entails a drastic reduction in the AUC to values close to $0.7$, and rank correlations in the 0.3–0.4 range. This is an important limitation in the analysis of real-world data, in which the availability of a large number of time series is usually considered a positive feature. To illustrate, while some electroencephalography (EEG) machines record from $32$ scalp locations at the same time, more advanced options include $64$ or even up to $256$ channels. Magnetoencephalography (MEG) machines usually include more than $300$ channels. Moving back to air transport, it is not unusual to study the properties of the top-$50$ or top-$100$ airports in a given region, with some works even surpassing $200$ airports.^{14,49–51} We speculate that the reasons behind such a drop in performance for large networks are twofold. First, increasing the number of nodes implies that each element receives information from more neighbors (assuming a constant link density), with each contribution being masked by the other ones; if, on the contrary, a node received information from a single source only, such a dependency would easily be identified. Second, at least in the case of air transport, increasing the system size requires including smaller airports, which have fewer operations and whose connectivity is more difficult to detect. While this may not apply to EEG data, drastically increasing the number of sensors brings other problems, like volume conduction.^{52}

A second interesting conclusion refers to the optimal quantity of information for the network reconstruction process. As may be expected, in the case of the synthetic model, panel (b) of Fig. 2, it is clear that longer time series improve the quality of the results. The same is nevertheless not true in the case of the real system: as seen in Fig. 8, considering very long time intervals is actually detrimental. The reason is easy to find: long time series improve the detection of the functional connectivity, but only if the system is stationary, i.e., if the underlying connectivity does not change with time. On the other hand, air transport presents important seasonalities, from weekly to annual ones; as a consequence, when the reconstruction process is performed over long time periods, the result only represents the backbone (or constant) connectivity, losing the faster dynamics. A similar problem is to be expected when reconstructing functional networks of brain activity.^{53–56} In summary, it can be concluded that using short time series is not necessarily negative: the increase in the uncertainty of the reconstruction is compensated by the reduction of the non-stationarity in the data.

A third conclusion that can be drawn is related to the relative performance of the different reconstruction metrics. As can be appreciated in Figs. 2, 5, and 7, the best metrics are the linear and simpler ones, e.g., COP, GC, and RC. This is explained by the fact that the underlying problem is a linear one; non-linear metrics are, thus, too complex for it. While we have shown that non-linear metrics, like the TE, perform properly if long enough time series are provided, this is still a major limitation in real-world studies. To illustrate, panel (b) of Fig. 2 suggests $\approx 10^4$ as an appropriate length; if data are sampled every hour, as customary in air transport, this corresponds to time series of more than one year in length. Not only are such long time series not always available, but they also preclude the analysis of transitory dynamics, as previously discussed. In other words, simpler metrics should not be discarded *a priori*, as they can outperform more complex ones. It is worth noting that translating this conclusion to other problems, e.g., to the reconstruction of brain functional networks, should be done with care, due to the differences in the underlying dynamics.

Fourth, it is worth analyzing the role of the detrending pre-processing. As illustrated in Fig. 5, the best strategy is always to use a Z-Score pre-processing, probably because it is the only one that both takes into account long-range (here, weekly) correlations and preserves the magnitude of the deviations. Despite this, differences are only minor when reconstructing the real air transport network (see Fig. 7), possibly due to the presence of a daily variability that is larger than the effect of the daily and weekly trends.

The results here presented also point toward some open questions that will have to be tackled in the future. First, if different functional metrics detect different aspects of the dynamics, it may be possible to define a way of merging them, possibly weighting their contributions using machine learning. Second, while it seems that the GC metric is able to retrieve good reconstructions even with only a few days of data [see inset of panel (a) of Fig. 8], more sensitive metrics may be beneficial, e.g., for detecting intra-day connectivity patterns. Third, the results here presented (especially Fig. 3) highlight the importance of ensuring a match between the symmetric vs asymmetric nature of the information flows in the system under analysis and that of the considered functional metric. This idea could also be extended to analyses based on the comparison of the results yielded by symmetric and asymmetric metrics, targeted at explaining the form of such information flows. Finally, as discussed in depth in Sec. III B, better ways of defining suitable thresholds will have to be constructed, going beyond simple significance levels and $p$-values.

## ACKNOWLEDGMENTS

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant Agreement No. 851255). This work has been partially supported by the María de Maeztu project CEX2021-001164-M funded by the MCIN/AEI/10.13039/501100011033.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Kishor Acharya:** Formal analysis (equal); Software (equal); Writing – original draft (equal); Writing – review & editing (equal). **Felipe Olivares:** Formal analysis (equal); Software (equal); Writing – original draft (equal); Writing – review & editing (equal). **Massimiliano Zanin:** Conceptualization (equal); Data curation (equal); Formal analysis (equal); Software (equal); Supervision (equal); Writing – original draft (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

The data that support the findings of this study are openly available from the Bureau of Transportation Statistics, United States Department of Transportation.^{57}

## REFERENCES

*Schaum’s Outline of Theory and Problems of Statistics*

*Mathematical Statistics with Applications in R*

*The Gravity Model in Transportation Analysis: Theory and Extensions*

*Handbook of International Trade and Transportation* (Edward Elgar Publishing, 2018), Vol. 4, pp. 141–158.

*Proceedings of the 2009 SIAM International Conference on Data Mining* (SIAM, 2009), pp. 780–791.

*Latent Variable Analysis and Signal Separation: 10th International Conference, LVA/ICA 2012, Tel Aviv, Israel, 12–15 March 2012. Proceedings 10* (Springer, 2012), pp. 25–33.