Inferring nonlinear and asymmetric causal relationships between multivariate longitudinal data is a challenging task with wide-ranging application areas including clinical medicine, mathematical biology, economics, and environmental research. A number of methods for inferring causal relationships within complex dynamic and stochastic systems have been proposed, but there is not a unified consistent definition of causality in the context of time series data. We evaluate the performance of ten prominent causality indices for bivariate time series across four simulated model systems that have different coupling schemes and characteristics. Pairwise correlations between different methods, averaged across all simulations, show that there is generally strong agreement between methods, with minimum, median, and maximum Pearson correlations between any pair (excluding two similarity indices) of 0.298, 0.719, and 0.955, respectively. In further experiments, we show that these methods are not always invariant to real-world relevant transformations (data availability, standardization and scaling, rounding errors, missing data, and noisy data). We recommend transfer entropy and nonlinear Granger causality as particularly strong approaches for estimating bivariate causal relationships in real-world applications. Both successfully identify causal relationships and a lack thereof across multiple simulations, while remaining robust to rounding errors, at least 20% missing data and small variance Gaussian noise. Finally, we provide flexible open-access Python code for computation of these methods and for the model simulations.

Quantifying causal relationships between longitudinal observations of a complex system is essential to an understanding of the interactions between sub-components of the system and is subsequently key to building better and more parsimonious models.^{1,2} In many real-world applications, we are rarely able to access or describe an underlying graphical network of these interactions *a priori*, and we are typically limited to observing simultaneously recorded variables from each subsystem as a multivariate time series. Two key properties that are widely regarded as crucial in defining causal relationships are that the effect is temporally preceded by the cause and that external changes to values of the causal variable propagate to values of the effect variable and do not break the causal structure.^{3} Correlation or synchronization in these multivariate time series does not necessarily imply a causal relationship between variables, and counterexamples are easy to find.^{4} Furthermore, a lack of correlation does not imply a lack of causality, and a reliance on correlation-based measures may result in nonlinear causal relationships being obscured, e.g., Ref. 5. In recent decades, various mathematical frameworks^{1,6,7} have been described to allow identification of nonlinear (and asymmetric) causal structures within complex systems, primarily driven by domain-specific applications, from diverse application areas including as statistical economics,^{8} climate science,^{9–11} and computational neuroscience.^{12,13}

## I. INTRODUCTION

No general method exists to identify causal structures within complex systems, and there is no single consistent and unifying notion of quantitative causality estimation for time series data. Published methods can be broadly categorized into the following groups:

regression-based indices that use “recent history” vectors as predictors in a model (e.g., Granger causality),

information-theoretic indices that build upon ideas of conditional mutual information (e.g., transfer entropy),

indices based on state space dynamics, such as local neighborhoods and trajectories (e.g., convergent cross mapping), and

graphical models that scale causal inference estimation to high-dimensional multivariate systems for causal identification.

There exist common themes between these methods, and membership within these groups is sometimes somewhat blurred. Figure 1 sets out key properties and similarities between methods from groups 1–3. Previous reviews of the literature^{14–17} typically focus on a subset of methods from one of these groups. The suitability, interchangeability, and performance of published methods, particularly where they span more than one of these groups, has received relatively little attention. In this work, we identified and assessed a widely used subset of indices for directed bivariate causality inference, concentrating on methods involving univariate embeddings to describe the recent history of the system (Sec. II). A review of such methods has been published previously by Lungarella *et al.*^{18} We reproduce these results for the original and newer methods. We also extend this work by proposing a set of modifications that can be made to simulated data prior to causal estimation in order to investigate sensitivity of each method to data availability, scaling, missing data, rounding, and Gaussian noise (Sec. III). Each of these reproduces phenomena that often occur in real-world data, such as when instruments have a fixed measurement precision and data are reported with rounding errors. We believe that these tests should provide in-depth benchmarking criteria for new proposed methodologies. Finally, we summarize the strengths and weaknesses of these approaches and identify key areas of further research (Sec. IV).

## II. METHODS

We observe a complex system as a set of variables within a multivariate time series. The time series $X=(x1,\u2026,xT)$ and $Y=(y1,\u2026,yT)$ describe a bivariate system with state $st=(xt,yt)$ at time $t$. A critical implicit assumption is that the series has unit time or equivalently that the data are observed at a fixed constant frequency. Underpinning all methods is the key assumption that causes directly precedes effect. As a result, a preliminary step is the construction of time-delay embedding vectors $xtm,\tau $ in the $m$-dimensional state space $X\u2245Rm$. An important distinction in the defined methods is whether their theoretical basis is stochastic or deterministic. Both employ the same time-delay vectors, though only in deterministic systems are the vectors $xtm,\tau $ considered elements within an $m$-dimensional state space $X\u2245Rm$. We construct equivalent embedding vectors $yt$ (and state space $Y\u2245Rm$) for $Y$ and joint embedding vectors $zt$ (and state space $Z\u2245R2m$),

In practice, many real-world systems are stochastic, with some level of noise or randomness in at least part of the system. A further assumption for stochastic causality estimation is that of separability, which states that there is unique information about the effect variable that is contained only within the causal variable. The standard approach here is to describe or model the current value $xt$ of $X$ as conditional upon the “recent history” joint embedding vector $zt\u22121$ (full model). Separability means that removing the causal variable $Y$ eliminates the information it contains about the effect $X$, which we observe either by identifying non-zero coefficients in the full model or constructing a reduced model, conditioned only upon $xt\u22121$. These methods are generally described with time index shifted $t\u21a6t+1$, though the interpretation (“current” and “recent history”) remains the same.

Granger causality^{1} (GC), a notable and popular method for causality estimation in time series, fits autoregressive models on the time series to this end. Extensions of GC to nonlinear systems include a locally linear version called extended Granger causality^{19} (EGC) and nonlinear Granger causality^{20} (NLGC), which perform a “global” nonlinear autoregression using radial basis functions (RBFs). Predictability improvement^{21} (PI) is another locally constant linear regression of “recent history” embeddings, which measures a reduction in the mean squared error when $zt$ is used for predicting a “horizon” value $xt+h$ instead of $xt$ alone.

Information theory is a natural framework for describing causal relationships. Transfer entropy^{7} (TE) measures deviation from the generalized Markov property $p(xt+1|xt)=p(xt+1|xt,yt)$ as a conditional mutual information. With weak coupling and limited data, transfer entropy can suffer from finite sample effects, and effective transfer entropy^{22} (ETE) corrects for this using shuffled versions of the causal variable. TE reduces to vanilla GC under the assumption of Gaussian variables^{23} ($\u22c6$, Fig. 1), and non-zero GC implies violation of the generalized Markov property and non-zero TE.^{24} A coarse-grained transinformation rate^{25} (CTIR) is based upon “coarse-grained entropy rates” and measures the rate of net information flow, averaged over different lags $\tau $. Often, the difficulty in information-theoretic methods (described in depth in Ref. 14) is the robust estimation of joint probabilities or entropy values, which in turn form building blocks for these methods. We use a histogram binning partition (H) and the (hypercube) Kraskov–Stögbauer–Grassberger (KSG) estimate,^{26} which is a technique involving $k$-nearest neighbor statistics. All information-theoretic computation here is in “nats” (logarithm base $e$).

Fully deterministic dynamical systems, which evolve according to a differential equation or a difference equation, do not necessarily satisfy the separability condition. In these systems, $xt$ can often be reformulated as a function of only past values of $X$, which makes the potential causal role of $Y$ in the coupled system less clear, as highlighted by Granger.^{1} Causal relationships in a coupled deterministic system are instead observed via the event that each variable belongs to a shared attractor manifold $A\u2282Z$. A consequence of Takens’ embedding theorem^{27} is that the “library of historical behavior” of $X$ preserves the topology of $A$ and, by transitivity, local neighborhoods in $X$ those in $Y$ and vice versa.^{5} It is possible to detect unidirectional causal influence, where only the dynamics of a causal variable propagate to the response variable in this way. Sugihara *et al.*^{5} argue that the inferred direction of unidirectional causal influence is counter-intuitively reversed (i.e., cross mapping from $X$ to $Y$ reveals causal influence from $Y$ to $X$).

The key assumption of cross mapped indices is that causal relationships are observed in the similarity between sets of (subscript) indices denoting the nearest neighbors for each set of embedding vectors, which can be mapped from one variable to the other to reveal interdependency. This is the idea behind the similarity indices: two similarity indices we test here, denoted $SIY\u2192X(1)$ and $SIY\u2192X(2)$, are $H(X|Y)$ in Ref. 28 and $E(X|Y)$ in Ref. 29, respectively. Convergent cross mapping^{5} (CCM) computes the correlation $\rho $ between the cross mapped estimate and the true value, with convergence in $\rho $ as $T$ increases “a key property that distinguishes causation from simple correlation.”^{5}

## III. RESULTS

Our results are split into two parts. First, we reproduce the results from Ref. 18, evaluating the performance of all methods including the additional CTIR and CCM, plus ETE using histogram binning and TE using KSG. In these simulations, we choose the same simulation model parameters and causality index parameters as in Ref. 18 (Table I). In the second part, we investigate sensitivity to common issues relevant to real-world data using the Ulam lattice system to illustrate these.

. | Method . | Parameters/other choices . | Notes/suggestions . | Values used here . |
---|---|---|---|---|

All | T: Time series length | Depends on data availability | 10^{p}, p = 3, 4, 5 | |

Embedding | h: Time horizon value | Normally h = 1, generalized to h ≥ 1 (in PI) | h = 1 | |

All/CTIR | m: Embedding dimension | “Optimal”^{30} vs “empirical” (m = 1, …, 5) | m = 1 or 2 | |

τ: Time-delay lag | “Optimal”^{31} vs “empirical” (τ = 1, 2, 3) | τ = 1 | ||

EGC^{19} | Nearest neighbor metric | ℓ_{p}, may depend on the state space/distribution | ℓ_{1} | |

L: No. of neighborhoods | L = 100 in Refs. 18 and 19 (depends on T) | L = 20 or 100 | ||

δ: Neighborhood size | Compute EGC for δ ↓ 0 (Ref. 19) | Various (Table S.I in the | ||

supplementary material) | ||||

NLGC^{20} | Radial basis function (RBF) | Gaussian RBFs in Refs. 18 and 20 | Gaussian | |

Regression error | P: No. of RBFs | E.g., gap statistics^{32} | Various (Table S.I in the | |

supplementary material) | ||||

x_{ρ}: Gaussian RBF centers | Via k-means or fuzzy c-means clustering | Via k-means | ||

σ^{2}: Gaussian RBF variance | A priori fixed, e.g., σ^{2} = 0.05 in Refs. 18 and 20 | σ^{2} = 0.05 | ||

PI^{21} | Nearest neighbor (NN) metric | ℓ_{p}, may depend on the state space/distribution | ℓ_{2} | |

R: No. of NNs | A priori unclear, e.g., R = 1, 10 in Refs. 18 and 21 | R = 1 or 10 | ||

h: Time horizon value | As above, e.g., h = 1 in Refs. 18 and 21 | h = 1 | ||

Information | Estimation | Estimation method | e.g., KSG, histogram partition | Both (H / KSG) |

theory | Nearest neighbor metric (KSG) | ℓ_{∞} (for hypercube dimensions)^{26} | ℓ_{∞} | |

k: No. of NNs (KSG) | Small values, e.g., k = 2, 3, 4^{26} | k = 4 | ||

N: No. of bins (histogram) | E.g., via minimum description length^{33,34} | N = 8 | ||

TE^{7} | n/a | No. of parameters besides estimation (above) | n/a | |

ETE^{22} | $Nshuffle$: No. of shuffled X or Y | A priori unclear, single shuffle in Ref. 22 | $Nshuffle=10$ | |

CTIR^{25} | τ_{max}: Max time-delay lag | $\tau max:I(xt,xt+\tau \u2032)\u22480,\u2200\tau \u2032\u2265\tau max$^{25} | τ_{max} = 5 or 20 | |

τ_{I}, ~ɛ_{I}: For estimation of τ_{max} | τ_{max} = min _{τ′}{τ′ ≤ τ_{I} : I(x_{t}, x_{t+τ′}) < ɛ_{I}} | Unused, fixed τ_{max} | ||

Cross | SI^{28,29} | Nearest neighbor (NN) metric | ℓ_{p}, may depend on the state space/distribution | ℓ_{2} |

mapped | R: No. of NNs | A priori unclear, e.g., R = 10 in Refs. 28 and 29 | Various (Table S.I in the | |

supplementary material) | ||||

CCM^{5} | Nearest neighbor metric | ℓ_{p}, may depend on the state space/distribution | ℓ_{2} | |

T_{max}: Max. segment length | Convergence: compute ρ for T ↑ T_{max}^{5} | T_{max} = T | ||

n_{T}: No. of segments of size T | ρ values averaged across n_{T} segments, size T | n_{T} = 40 | ||

ρ_{∞}: Converged CCM value | $\rho Tmax$ in Ref. 35 or fit exponential regression^{36} | $\rho Tmax$ (if ↓ holds) | ||

$\delta \rho $: Convergence tolerance | “Converged” if $\rho \u221e\u2212\rho m+2>\delta \rho $ | $\delta \rho =0.05$ |

. | Method . | Parameters/other choices . | Notes/suggestions . | Values used here . |
---|---|---|---|---|

All | T: Time series length | Depends on data availability | 10^{p}, p = 3, 4, 5 | |

Embedding | h: Time horizon value | Normally h = 1, generalized to h ≥ 1 (in PI) | h = 1 | |

All/CTIR | m: Embedding dimension | “Optimal”^{30} vs “empirical” (m = 1, …, 5) | m = 1 or 2 | |

τ: Time-delay lag | “Optimal”^{31} vs “empirical” (τ = 1, 2, 3) | τ = 1 | ||

EGC^{19} | Nearest neighbor metric | ℓ_{p}, may depend on the state space/distribution | ℓ_{1} | |

L: No. of neighborhoods | L = 100 in Refs. 18 and 19 (depends on T) | L = 20 or 100 | ||

δ: Neighborhood size | Compute EGC for δ ↓ 0 (Ref. 19) | Various (Table S.I in the | ||

supplementary material) | ||||

NLGC^{20} | Radial basis function (RBF) | Gaussian RBFs in Refs. 18 and 20 | Gaussian | |

Regression error | P: No. of RBFs | E.g., gap statistics^{32} | Various (Table S.I in the | |

supplementary material) | ||||

x_{ρ}: Gaussian RBF centers | Via k-means or fuzzy c-means clustering | Via k-means | ||

σ^{2}: Gaussian RBF variance | A priori fixed, e.g., σ^{2} = 0.05 in Refs. 18 and 20 | σ^{2} = 0.05 | ||

PI^{21} | Nearest neighbor (NN) metric | ℓ_{p}, may depend on the state space/distribution | ℓ_{2} | |

R: No. of NNs | A priori unclear, e.g., R = 1, 10 in Refs. 18 and 21 | R = 1 or 10 | ||

h: Time horizon value | As above, e.g., h = 1 in Refs. 18 and 21 | h = 1 | ||

Information | Estimation | Estimation method | e.g., KSG, histogram partition | Both (H / KSG) |

theory | Nearest neighbor metric (KSG) | ℓ_{∞} (for hypercube dimensions)^{26} | ℓ_{∞} | |

k: No. of NNs (KSG) | Small values, e.g., k = 2, 3, 4^{26} | k = 4 | ||

N: No. of bins (histogram) | E.g., via minimum description length^{33,34} | N = 8 | ||

TE^{7} | n/a | No. of parameters besides estimation (above) | n/a | |

ETE^{22} | $Nshuffle$: No. of shuffled X or Y | A priori unclear, single shuffle in Ref. 22 | $Nshuffle=10$ | |

CTIR^{25} | τ_{max}: Max time-delay lag | $\tau max:I(xt,xt+\tau \u2032)\u22480,\u2200\tau \u2032\u2265\tau max$^{25} | τ_{max} = 5 or 20 | |

τ_{I}, ~ɛ_{I}: For estimation of τ_{max} | τ_{max} = min _{τ′}{τ′ ≤ τ_{I} : I(x_{t}, x_{t+τ′}) < ɛ_{I}} | Unused, fixed τ_{max} | ||

Cross | SI^{28,29} | Nearest neighbor (NN) metric | ℓ_{p}, may depend on the state space/distribution | ℓ_{2} |

mapped | R: No. of NNs | A priori unclear, e.g., R = 10 in Refs. 28 and 29 | Various (Table S.I in the | |

supplementary material) | ||||

CCM^{5} | Nearest neighbor metric | ℓ_{p}, may depend on the state space/distribution | ℓ_{2} | |

T_{max}: Max. segment length | Convergence: compute ρ for T ↑ T_{max}^{5} | T_{max} = T | ||

n_{T}: No. of segments of size T | ρ values averaged across n_{T} segments, size T | n_{T} = 40 | ||

ρ_{∞}: Converged CCM value | $\rho Tmax$ in Ref. 35 or fit exponential regression^{36} | $\rho Tmax$ (if ↓ holds) | ||

$\delta \rho $: Convergence tolerance | “Converged” if $\rho \u221e\u2212\rho m+2>\delta \rho $ | $\delta \rho =0.05$ |

### A. Numerical simulations

We investigate performance on four simulated model systems (Table II). In each simulation, we assess the causality estimates of each method by varying the coupling strength $\lambda $. These simulated systems are widely studied in chaos theory, e.g., Ref. 37, and also appear elsewhere in the literature, e.g., Ulam lattice in Ref. 7.

Simulation . | Coupling . | Dynamics . | T = 10^{p}
. | Simulation model parameters . | Coupling strength . |
---|---|---|---|---|---|

Linear process | X ← Y | L & S | p = 4 | $bx=0.8,by=0.4,\sigma x2=\sigma y2=0.2$ | λ ∈ [0, 1] |

Ulam lattice | X → Y | NL & D & C | p = 3, 5 | N_{L} = 100 (size of lattice) | λ ∈ [0, 1] |

Hénon uni-d | X → Y | NL & D & C | p = 3, 4, 5 | a = 1.4, ~b_{x} = 0.3, ~b_{y} = 0.3 | λ ∈ [0, 1] |

Hénon bi-d (I, NI) | X ↔ Y | NL & D & C | p = 4 | a = 1.4, ~b_{x} = 0.3, ~b_{y} = 0.3 or 0.1 | λ_{xy}, λ_{yx} ∈ [0, 0.4] |

Simulation . | Coupling . | Dynamics . | T = 10^{p}
. | Simulation model parameters . | Coupling strength . |
---|---|---|---|---|---|

Linear process | X ← Y | L & S | p = 4 | $bx=0.8,by=0.4,\sigma x2=\sigma y2=0.2$ | λ ∈ [0, 1] |

Ulam lattice | X → Y | NL & D & C | p = 3, 5 | N_{L} = 100 (size of lattice) | λ ∈ [0, 1] |

Hénon uni-d | X → Y | NL & D & C | p = 3, 4, 5 | a = 1.4, ~b_{x} = 0.3, ~b_{y} = 0.3 | λ ∈ [0, 1] |

Hénon bi-d (I, NI) | X ↔ Y | NL & D & C | p = 4 | a = 1.4, ~b_{x} = 0.3, ~b_{y} = 0.3 or 0.1 | λ_{xy}, λ_{yx} ∈ [0, 0.4] |

*Linear process*:

*Ulam lattice*:

*Hénon unidirectional map*:

*Hénon bidirectional map*:

We reproduce figures for all simulations and methods in Figs. 3–5 and summarize our results in Fig. 2, which shows correlations between each pair of indices. For linear process and Ulam lattice simulations, we report causality estimates in both directions, i.e., $iX\u2192Y$ and $iY\u2192X$ (where $i$ denotes any of the causality indices). For Hénon maps, we instead use the directed index $DX\u2192Y=iX\u2192Y\u2212iY\u2192X$, following Ref. 18. In general, all measures exhibit a small standard deviation relative to the absolute value of the index, indicating that random initial conditions during data simulation have at most a very small influence on the causal structure when initial transients are discarded. Though we are able to replicate the findings in Ref. 18 in most cases, we occasionally find minor differences between their results and ours. In particular, we sometimes find results of a similar profile but a different magnitude, as $\lambda $ varies. We observe this for EGC and linear processes, PI and all simulations, SI$(1)$ and Ulam lattice, and TE and Hénon unidirectional maps. Though we handle numerical outliers differently in our visualization of results for Hénon bidirectional maps, our results for these simulations appear largely comparable in magnitude and profile. There is no mention of a data standardization step in Ref. 18 and the results we report do not involve any pre-processing, though this did not appear to rectify these differences. In one notable difference between identical Hénon bidirectional map results, Lungarella *et al.*^{18} find a region in the $\lambda $-space (namely, ${(\lambda xy,\lambda yx):\lambda xy>0.1,\lambda yx>0.1,\lambda xy+\lambda yx<0.35}$) in which they identify general synchronization between $X$ and $Y$ and have difficulty estimating indices due to numerical instabilities; yet, we do not observe this. We have followed the implementation in Ref. 18 as closely as possible, and it is unclear why these differences exist.

We knowingly deviated from the implementations in Ref. 18 only in the case of NLGC, in which we preferred to use $k$-means rather than fuzzy $c$-means clustering to determine RBF centers, after finding similar or improved results but with a much reduced computational cost. Lungarella *et al.*^{18} note that NLGC is numerically unstable for “small” $T$ and computationally expensive for “large” $T$, which we suggest may be partly due to their use of fuzzy $c$-means. Little detail is provided about their implementation of this, but it may perhaps be that an early stopping criteria sometimes forces a “poor quality” clustering. Furthermore, we found that the performance of NLGC in Hénon bidirectional map simulations improved significantly with a different set of NLGC parameters (e.g., $P=50$ instead of $P=10$), though we do not present these alternate results.

For the linear process (LP), the simplest simulation model, all indices show a very strong positive correlation in the $Y\u2192X$ direction (Fig. 3). In the reverse $X\u2192Y$ direction, TE and CTIR both decrease with increasing $\lambda $, the cross mapped indices all show a marked increase, and the remaining indices are approximately zero for all $\lambda $. This gives rise to patterns of positive and negative correlations between pairs of methods. As each $xt$ or $yt$ is a sum of Gaussian variables, we can derive theoretical values for Shannon entropy and, consequently, for the information theory methods (see the supplementary material). Figure 3 shows that TE (KSG) reliably estimates the ‘true” transfer entropy, but TE (H) significantly underestimates the theoretical values. This is a fundamental flaw that undermines any other advantageous properties of TE (H). Though computed CTIR values match the theoretical values, it is clearly negative in $X\u2192Y$ and as such does not reflect the causal structure of the system. Increasing the size of the data $T$ alters the value of TE (H) here, but TE (KSG) remains accurate as $T$ increases. However, in all other simulations, TE (H) is more robust to increasing data size.

The Ulam lattice (UL) chains together unidirectional coupled chaotic Ulam maps. For large $NL$, the causal influence from $Y$ to $X$ is negligible. UL exhibits synchronization for $\lambda \u22480.18,0.82$, where cause and effect variables are indistinguishable from each other; e.g., the system converges to a two state attractor. As a result, most indices either have values approximately equal to zero or suffer from high variance numerical instabilities. Outside of these regions of synchronization, the information-theoretic methods and regression-based indices show reasonable consistency (Figs. 2 and 4). The exception is CTIR, which slowly decreases as $\lambda $ increases for $T=105$, albeit still correctly identifying the direction of information flow. ETE (H) successfully corrects for the small sample effects that give rise to these spurious positive TE (H) results in the $Y\u2192X$ direction when $T=103$. Both similarity indices (SIs) fail to identify any causal structure in the UL. For CCM, while the net directed index $DX\u2192Y$ increases with $\lambda $, it is negative for $\lambda <0.5$ and so misidentifies the direction of causality. The positive correlations between methods in $iY\u2192X$ for $T=105$ occur due to a very slight peak in the value at $\lambda \u22480.5$ for nearly all methods (except CTIR and EGC).

Synchronization occurs in the range $\lambda \u2208[0.7,1]$ for Hénon unidirectional maps (Fig. 4). All indices are consistent and perform reasonably well, and we do not observe the noisy fluctuations seen in Ref. 18 apart from for EGC when $\lambda >0.7$. It is notable that about half the indices [TE (H), ETE (H), EGC, NLGC, PI] are fairly consistent even as the order of magnitude of the data size $T$ is increased, while the values of the other indices [TE (KSG), CTIR, SIs, CCM] increase, in some cases tripling in value. There is a strong degree of similarity between all methods for the Hénon bidirectional maps (Fig. 5). In HB (I) simulations, the exceptions to this are NLGC and CCM, though we found that setting a larger number of RBFs resulted in better performance for NLGC. For CCM, the direction of causality is sometimes incorrect and the reason for this is unclear, though this may also be a result of poor parameter choices. We observe expected symmetry in the values of $\lambda $, and synchronization in the region is approximately equal to ${(\lambda xy,\lambda yx):\lambda xy+\lambda yx>0.28}$. There are a small number of points in which numerical instabilities are present in all indices, but the consistency across all indices suggests that these are isolated points in which the system converges to some limit cycle or attractor. There are more differences between methods in HB (NI) results. Significant numerical instabilities occur in EGC, particularly when the system is in a state of synchrony: the region approximately equal to ${(\lambda xy,\lambda yx):0.05<\lambda xy<0.15,0.1<\lambda yx<0.28}$. Outside of this region, EGC is broadly similar to the information-theoretic indices, which are highly correlated, and to a slightly lesser degree with the SI$(1)$ and PI. In contrast, NLGC, SI$(2)$, and CCM have unusual results. The first of these is negative almost everywhere (even in a repeat analysis with more RBF kernels) and the latter two are mostly non-negative; moreover, the regions with the most extreme values occur in quite different places in all three.

### B. Computational burden

An important consideration in selecting a suitable method is any trade-off between performance and computational efficiency. The most significant factor in this is often how the algorithmic cost of each method scales with increasing data size $T$. Table S.II in the supplementary material shows the mean and standard deviations of the time taken to compute each index and simulation. TE (H)/ETE (H) is the fastest in almost all cases, even though this calculation also includes ten reshuffles and recomputations for ETE (H). CCM, EGC, and TE (KSG) are similarly among methods with a smaller computational cost. Several methods have extreme values in UL simulations with $T=105$, particularly CTIR and PI, but this is distorted by difficulties in computation when the system is in a synchronized state. We observed a marked difference in the computational cost for NLGC when using $k$-means for clustering instead of fuzzy $c$-means, and we suggest that $k$-means is more suitable here. Our computation was done in a high performance CPU computing cluster using SkyLake 6140 with 18 core 2.3 GHz processors and 384 GB of RAM. Although the computational times we report may be slightly faster than on a laptop computer with less processing power, we did not observe any substantial difference when internally comparing run times.

### C. Real-world relevant data issues

Next, we investigated the sensitivity of all causality indices to a number of modifications mirroring issues that often arise in real-world data. We choose UL with $T=103$ to illustrate the effects of these transformations, as LP is too simplistic a model to give sufficient insight. We keep all simulation parameters and causality index parameters the same. In Table III, we summarize the means $\mu ^$ and standard deviations $\sigma ^$ of the directed indices $DX\u2192Y$ for each $\lambda $, which are normalized by their deviation from the “base” UL results and averaged over all $\lambda $. This normalization allows us to compare across methods that take values in different ranges.

. | . | . | . | Scaling . | Rounding . | Missing data . | Gaussian noise . | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

. | Baseline . | . | Data size . | Stand. . | D_{10X→Y}
. | D_{X→10Y}
. | 1 d.p. . | 1 d.p. . | 2 d.p. . | 10% NA . | 20% NA . | $\sigma G2=0.1$ . | $\sigma G2=1$ . | $\sigma G2=1$ . |

Method . | T = 10^{3}
. | . | T = 10^{5}
. | X,Y . | . | . | X . | Y . | X,Y . | X,Y . | X,Y . | X,Y . | X . | Y . |

EGC | ⟨μ⟩ = 0.840 | $f(\mu ,\mu ^)$ | −0.064 | 0.036 | 0.207 | −0.071 | 0.031 | 0.112 | 0.004 | −0.027 | −0.040 | 0.533 | 0.981 | 0.950 |

⟨σ⟩ = 0.021 | $g(\sigma ,\sigma ^)$ | 0.660 | 1.004 | 1.425 | 0.691 | 0.959 | 0.946 | 0.970 | 1.023 | 1.033 | 1.025 | 0.473 | 0.598 | |

NLGC | ⟨μ⟩ = 0.610 | $f(\mu ,\mu ^)$ | 0.013 | 0.299 | 2.905 | −101.849 | 0.000 | −0.003 | 0.001 | −0.008 | −0.020 | 0.031 | 0.740 | −0.007 |

⟨σ⟩ = 0.023 | $g(\sigma ,\sigma ^)$ | 0.089 | 0.608 | 64.469 | 126.734 | 0.000 | 0.954 | 0.994 | 1.353 | 1.906 | 1.023 | 1.345 | 2.325 | |

PI | ⟨μ⟩ = 0.380 | $f(\mu ,\mu ^)$ | 0.002 | 0.293 | 1.576 | −98.727 | 0.950 | −0.951 | −0.016 | 0.001 | 0.005 | 0.011 | 0.617 | 0.019 |

⟨σ⟩ = 0.032 | $g(\sigma ,\sigma ^)$ | 0.094 | 0.681 | 69.340 | 66.364 | 0.918 | 1.051 | 1.001 | 1.214 | 1.511 | 1.007 | 2.041 | 2.120 | |

TE (H) | ⟨μ⟩ = 0.675 | $f(\mu ,\mu ^)$ | −0.158 | 0.000 | 0.000 | 0.000 | 0.011 | 0.030 | 0.000 | 0.071 | 0.159 | 0.026 | 0.786 | 0.731 |

⟨σ⟩ = 0.019 | $g(\sigma ,\sigma ^)$ | 0.085 | 1.000 | 1.000 | 0.000 | 1.004 | 0.994 | 1.013 | 1.313 | 1.633 | 1.112 | 1.015 | 0.920 | |

ETE (H) | ⟨μ⟩ = 0.674 | $f(\mu ,\mu ^)$ | −0.158 | 0.000 | 0.000 | 0.000 | 0.014 | 0.026 | 0.000 | 0.075 | 0.155 | 0.026 | 0.748 | 0.774 |

⟨σ⟩ = 0.019 | $g(\sigma ,\sigma ^)$ | 0.083 | 1.000 | 1.000 | 1.000 | 0.992 | 0.994 | 1.009 | 1.296 | 1.634 | 1.109 | 0.947 | 0.854 | |

TE (KSG) | ⟨μ⟩ = 1.509 | $f(\mu ,\mu ^)$ | −1.348 | 0.000 | 0.269 | 0.609 | 0.095 | −0.134 | −0.029 | 0.111 | 0.216 | 0.306 | 0.841 | 0.863 |

⟨σ⟩ = 0.025 | $g(\sigma ,\sigma ^)$ | 0.120 | 1.071 | 0.869 | 1.026 | 1.232 | 1.320 | 1.042 | 1.197 | 1.430 | 1.028 | 1.017 | 0.962 | |

CTIR | ⟨μ⟩ = 0.462 | $f(\mu ,\mu ^)$ | −1.226 | 0.000 | 0.128 | 0.713 | 0.299 | −0.355 | −0.026 | 0.083 | 0.161 | 0.273 | 0.826 | 0.848 |

⟨σ⟩ = 0.014 | $g(\sigma ,\sigma ^)$ | 0.110 | 1.016 | 0.898 | 0.920 | 1.159 | 1.209 | 1.042 | 1.076 | 1.258 | 0.958 | 0.942 | 0.872 | |

SI^{(1)} | ⟨μ⟩ = 0.001 | $f(\mu ,\mu ^)$ | 0.015 | 0.000 | 0.000 | 0.000 | 0.549 | −0.556 | 0.005 | −0.013 | 0.018 | −0.007 | −0.061 | 0.066 |

⟨σ⟩ = 0.029 | $g(\sigma ,\sigma ^)$ | 0.097 | 1.000 | 1.000 | 0.000 | 1.363 | 1.355 | 1.053 | 1.084 | 1.219 | 0.895 | 0.705 | 0.693 | |

SI^{(2)} | ⟨μ⟩ = 0.000 | $f(\mu ,\mu ^)$ | 0.029 | 0.000 | 0.000 | 0.000 | −3.105 | 3.121 | −0.001 | −0.037 | 0.017 | 0.000 | −8.256 | 7.736 |

⟨σ⟩ = 0.000 | $g(\sigma ,\sigma ^)$ | 0.000 | 1.000 | 1.000 | 0.000 | 1.394 | 1.399 | 1.074 | 1.666 | 2.630 | 0.968 | 2.334 | 2.018 | |

CCM | ⟨μ⟩ = 0.001 | $f(\mu ,\mu ^)$ | 0.031 | 0.000 | 0.000 | 0.000 | −1.249 | 1.289 | −0.005 | −0.009 | −0.025 | 0.013 | 0.151 | −0.075 |

⟨σ⟩ = 0.047 | $g(\sigma ,\sigma ^)$ | 0.103 | 1.000 | 1.000 | 0.000 | 1.115 | 1.105 | 0.740 | 1.090 | 1.250 | 1.010 | 0.944 | 0.959 |

. | . | . | . | Scaling . | Rounding . | Missing data . | Gaussian noise . | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

. | Baseline . | . | Data size . | Stand. . | D_{10X→Y}
. | D_{X→10Y}
. | 1 d.p. . | 1 d.p. . | 2 d.p. . | 10% NA . | 20% NA . | $\sigma G2=0.1$ . | $\sigma G2=1$ . | $\sigma G2=1$ . |

Method . | T = 10^{3}
. | . | T = 10^{5}
. | X,Y . | . | . | X . | Y . | X,Y . | X,Y . | X,Y . | X,Y . | X . | Y . |

EGC | ⟨μ⟩ = 0.840 | $f(\mu ,\mu ^)$ | −0.064 | 0.036 | 0.207 | −0.071 | 0.031 | 0.112 | 0.004 | −0.027 | −0.040 | 0.533 | 0.981 | 0.950 |

⟨σ⟩ = 0.021 | $g(\sigma ,\sigma ^)$ | 0.660 | 1.004 | 1.425 | 0.691 | 0.959 | 0.946 | 0.970 | 1.023 | 1.033 | 1.025 | 0.473 | 0.598 | |

NLGC | ⟨μ⟩ = 0.610 | $f(\mu ,\mu ^)$ | 0.013 | 0.299 | 2.905 | −101.849 | 0.000 | −0.003 | 0.001 | −0.008 | −0.020 | 0.031 | 0.740 | −0.007 |

⟨σ⟩ = 0.023 | $g(\sigma ,\sigma ^)$ | 0.089 | 0.608 | 64.469 | 126.734 | 0.000 | 0.954 | 0.994 | 1.353 | 1.906 | 1.023 | 1.345 | 2.325 | |

PI | ⟨μ⟩ = 0.380 | $f(\mu ,\mu ^)$ | 0.002 | 0.293 | 1.576 | −98.727 | 0.950 | −0.951 | −0.016 | 0.001 | 0.005 | 0.011 | 0.617 | 0.019 |

⟨σ⟩ = 0.032 | $g(\sigma ,\sigma ^)$ | 0.094 | 0.681 | 69.340 | 66.364 | 0.918 | 1.051 | 1.001 | 1.214 | 1.511 | 1.007 | 2.041 | 2.120 | |

TE (H) | ⟨μ⟩ = 0.675 | $f(\mu ,\mu ^)$ | −0.158 | 0.000 | 0.000 | 0.000 | 0.011 | 0.030 | 0.000 | 0.071 | 0.159 | 0.026 | 0.786 | 0.731 |

⟨σ⟩ = 0.019 | $g(\sigma ,\sigma ^)$ | 0.085 | 1.000 | 1.000 | 0.000 | 1.004 | 0.994 | 1.013 | 1.313 | 1.633 | 1.112 | 1.015 | 0.920 | |

ETE (H) | ⟨μ⟩ = 0.674 | $f(\mu ,\mu ^)$ | −0.158 | 0.000 | 0.000 | 0.000 | 0.014 | 0.026 | 0.000 | 0.075 | 0.155 | 0.026 | 0.748 | 0.774 |

⟨σ⟩ = 0.019 | $g(\sigma ,\sigma ^)$ | 0.083 | 1.000 | 1.000 | 1.000 | 0.992 | 0.994 | 1.009 | 1.296 | 1.634 | 1.109 | 0.947 | 0.854 | |

TE (KSG) | ⟨μ⟩ = 1.509 | $f(\mu ,\mu ^)$ | −1.348 | 0.000 | 0.269 | 0.609 | 0.095 | −0.134 | −0.029 | 0.111 | 0.216 | 0.306 | 0.841 | 0.863 |

⟨σ⟩ = 0.025 | $g(\sigma ,\sigma ^)$ | 0.120 | 1.071 | 0.869 | 1.026 | 1.232 | 1.320 | 1.042 | 1.197 | 1.430 | 1.028 | 1.017 | 0.962 | |

CTIR | ⟨μ⟩ = 0.462 | $f(\mu ,\mu ^)$ | −1.226 | 0.000 | 0.128 | 0.713 | 0.299 | −0.355 | −0.026 | 0.083 | 0.161 | 0.273 | 0.826 | 0.848 |

⟨σ⟩ = 0.014 | $g(\sigma ,\sigma ^)$ | 0.110 | 1.016 | 0.898 | 0.920 | 1.159 | 1.209 | 1.042 | 1.076 | 1.258 | 0.958 | 0.942 | 0.872 | |

SI^{(1)} | ⟨μ⟩ = 0.001 | $f(\mu ,\mu ^)$ | 0.015 | 0.000 | 0.000 | 0.000 | 0.549 | −0.556 | 0.005 | −0.013 | 0.018 | −0.007 | −0.061 | 0.066 |

⟨σ⟩ = 0.029 | $g(\sigma ,\sigma ^)$ | 0.097 | 1.000 | 1.000 | 0.000 | 1.363 | 1.355 | 1.053 | 1.084 | 1.219 | 0.895 | 0.705 | 0.693 | |

SI^{(2)} | ⟨μ⟩ = 0.000 | $f(\mu ,\mu ^)$ | 0.029 | 0.000 | 0.000 | 0.000 | −3.105 | 3.121 | −0.001 | −0.037 | 0.017 | 0.000 | −8.256 | 7.736 |

⟨σ⟩ = 0.000 | $g(\sigma ,\sigma ^)$ | 0.000 | 1.000 | 1.000 | 0.000 | 1.394 | 1.399 | 1.074 | 1.666 | 2.630 | 0.968 | 2.334 | 2.018 | |

CCM | ⟨μ⟩ = 0.001 | $f(\mu ,\mu ^)$ | 0.031 | 0.000 | 0.000 | 0.000 | −1.249 | 1.289 | −0.005 | −0.009 | −0.025 | 0.013 | 0.151 | −0.075 |

⟨σ⟩ = 0.047 | $g(\sigma ,\sigma ^)$ | 0.103 | 1.000 | 1.000 | 0.000 | 1.115 | 1.105 | 0.740 | 1.090 | 1.250 | 1.010 | 0.944 | 0.959 |

#### 1. Data availability

Many of the indices, with the exception of TE (KSG) and CTIR, remain consistent with increasing data size $T$, while at the same time exhibit decreasing variance. These results reinforce the similar observations in HU maps. The large increases in the value of the two methods mentioned are concerning and represent a drawback of both methods that should be acknowledged in applications of these approaches. It is unclear whether there is convergence to some “correct” value as the amount of data increases or whether both are unbounded as $T\u2192\u221e$, but initial computations do not support the former (not shown). Though the $DX\u2192Y$ values from both transfer entropy methods are highly correlated, they are both estimates of the same quantity, and it is difficult to reconcile their different magnitudes, particularly as we have already seen significant underestimation in TE (H) for LP simulations.

#### 2. Standardization and scaling

In the second set of experiments, we perform three tests: standardizing both series by their sample mean and standard deviation in the first and separately scaling each unstandardized time series by a factor of $10$ (Fig. S2 in the supplementary material). For the Ulam lattice system, sample means for both $X$ and $Y$ are typically between 0.4 and 0.7, and standard deviations are both approximately equal to 1.2 (except when the system is in synchrony). Several methods are invariant under linear scaling or shifting of the original series $X$ and $Y$, including cross mapping approaches. Information-theoretic measures are also invariant in theory, but the KSG algorithm, based on $k$-nearest neighbors, does not retain this property. Similarly, EGC relies on a neighborhood size parameter, and scaling the data without changing this parameter accordingly can result in insufficient points available for the locally linear regressions, as is observed when either $X$ or $Y$ is scaled by 10. The directed index for both NLGC and PI has a vastly inflated magnitude when $Y$ is scaled by 10. With this in mind, we recommend standardization or normalization of the data before employing these methods.

#### 3. Rounding errors and missing data

We perform three experiments to investigate rounding errors, first rounding each time series separately to one decimal place and then rounding both to two decimal places (Fig. S3 in the supplementary material). TE and both GC extensions have similar performance to the baseline in all cases, while CCM suffers the most. In two experiments with missing data of 10% and 20%, all methods appear robust to this.

#### 4. Noisy data

In the case of the earlier LP simulations, Gaussian noise forms an integral component of the system itself, and the theoretical expression for TE shows that this depends only on the ratio of the variances $\sigma x/\sigma y$ (see the supplementary material). However, this noise is inherent in the simulation process (i.e., it does not arise in observation of the system). In our UL experiments, we added additional Gaussian noise *after* simulation. The inclusion of this “observation” noise does not alter the state of the system or the information flow between variables, but it does obscure the causal structure. In the first of these experiments (Fig. S3 in the supplementary material), in which we added small variance Gaussian noise ($\sigma G=0.1$), the amplitude of this noise is an order of magnitude less than the original UL values and the inclusion of this noise has a small effect for all indices. In the latter experiments, we added Gaussian noise (with $\sigma G=1$) to each variable individually, and the effect is more pronounced. NLGC performs best in general and appears very resilient to noise added to $Y$ (effect variable), though it drops slightly in value when Gaussian noise is added to $X$ (cause variable). It is interesting that the two SIs have quite different results, with SI$(1)$ more robust to noise, though both methods are not in general able to successfully identify the direction of causality.

## IV. DISCUSSION

In-depth comparative studies of this kind are relatively rare in the mathematical literature (examples include Refs. 16 and 38), particularly in evaluating performance of methods for estimating a concept, such as causality, that does not have a consistent, fundamental mathematical definition. Even without this, causal inference has a huge importance in how we can model, predict, and exploit real-world applications from many scientific disciplines. Asymmetric bivariate causal inference is the first key step to providing this insight into interactions between components in complex networks. In the context of causality indices, review papers^{14–16} have previously had a narrower focus in some manner, for example, on only one group of methods or on a few bivariate methods and their multivariate extensions. We follow the template of Ref. 18 in reviewing methods drawn from diverse mathematical foundations, but we extend this review with additional methods, and crucially, we investigate the impact of common issues that are relevant to real-world data. In reproducing and updating their work, we are also able to resolve some computational stability issues and comment on the computational costs of each method, while we also make our code publicly available for other researchers to develop further.

### A. Further work

A primary concern in causality inference is the difficulties with model misspecification, specifically causal identification in multivariate systems. Omission of confounding variables can create spurious false-positive causal relationships. There may also be redundancy across multiple variables that provide similar information to the effect variable or sets of variables that interact synergistically such that their combined causal influence is greater than the “sum of their parts.” These are key concerns outlined in Ref. 15; consequently, results from bivariate indices cannot be definitively interpreted as the existence of a fundamental direct causal relationship between two variables.^{28} A key avenue for further work is to advance this analysis beyond a bivariate setting by including possible confounding variables, in line with conditional extensions to Granger causality^{8,19,39,40} and transfer entropy.^{41} Recent work with graphical models of multivariate systems^{2} is an important step toward high-dimensional causal identification.

Separate univariate embedding is not without some limitations and is not necessarily the optimal multivariate embedding. Aside from in Ref. 42, mixed embeddings are as yet uncommon in causality estimation. There is not yet a theoretical framework for longitudinal data that is recorded non-simultaneously and irregularly. Often, a typical workflow for such data involves pre-processing to transform the data into a multivariate series with constant time intervals. However, many imputation methods result in significant and poorly quantified biases in information content and flow, which inevitably propagate through to estimates of causality, and more work is needed to explicitly factor this into a causal inference framework.

### B. Summary and recommendations

Each causality index has strengths and weaknesses, and there is no single method whose all-round performance exceeds all others. Transfer entropy and Granger causality have long been regarded as the leading methods for systems that contain a small number of variables, and these have had wide applications.^{8,13} Transfer entropy has the distinct advantage that it is built upon the principles of Shannon entropy in a well-established and universal information-theoretic framework. It performs solidly throughout, though there is some tension between algorithms for TE, with the estimates rarely in complete agreement. We have shown that a histogram fixed partition approach is biased even in the simplest model despite TE (H) having general consistency and computational efficiency. Therefore, we recommend the KSG algorithm for transfer entropy computation unless perhaps data are extremely scarce ($T<103$). However, there are some unanswered concerns about TE (KSG), particularly that it appears to increase in magnitude as more data are available. TE (KSG) also suffers in performance when data are unequally scaled due to the resultant difficulties with identifying unique nearest neighbors. CTIR, while sometimes not wholly dissimilar in value from TE, did not seem to offer any obvious advantage to compensate for its much higher computational cost or occasional unusual behavior. Vanilla GC is widely favored but has restrictive assumptions and is ill-suited to complex nonlinear problems. Of the two nonlinear extensions to Granger causality, Lungarella *et al.*^{18} appear to prefer EGC. Some of the computational challenges and numerical instability that they experienced with NLGC may have been a result of their choice of a fuzzy $c$-means for determining RBF kernels, and alternate parameter choices appear to resolve some of their concerns. We find that NLGC is one of the most robust methods to rounding errors, missing data, and Gaussian noise. They rightly note that “If the rank of the data is small, kernel-based methods tend to overfit,”^{18} but we did not observe any issues with this in our simulation experiments. Predictability improvement (PI) likewise performed solidly and has a slight advantage among the regression-based indices in that it is perhaps less reliant on parameter choices. Finally, dynamical systems theory offers a different insight into causal inference that should not be readily dismissed despite our mixed results here, even though the deterministic simulation models appeared to be well-suited to the underlying theory. Convergent cross mapping is a more recent and popular method, and this offered a broad improvement on the similarity indices (SIs), which did not consistently identify the strength or direction of causality. However, CCM too did not always manage to determine the correct causal flow in our simulations.

We have highlighted the value of a standardization pre-processing step in avoiding algorithmic issues, which is also important in comparing results from different data for each method. Rounding errors give rise to practical issues within the implementation of several of the algorithms. For instance, in $k$-nearest neighbor approaches, it is typically assumed that the distances between pairs of points are unique and not discrete. Subsequent edge cases can be treated by adding random noise with low amplitude to the data before estimating the causal relationships,^{26} though propagation of this noise to final estimates is something that should be analyzed. Likewise, many existing implementations of the methods are not equipped to handle missing data (e.g., Refs. 41 and 43). We believe that this is broadly straightforward to implement across all indices, as it can be handled exclusively within the time-delay embedding vector step, by performing embedding and then removing any embedding vectors missing at least one component. As Mønster *et al.*^{36} put it, “Noise in real-world data is ubiquitous, the inclusion of noise in model investigations has been largely ignored.” Added Gaussian noise leads to the biggest changes in the value for most methods, particularly noise at observation in the causal variable. However, provided the magnitude of noise is small compared to the values themselves, all methods perform adequately.

On the basis of this work, we conclude that the strongest choice for identifying and quantifying bivariate causal relationships is, in our view, either transfer entropy (KSG) or nonlinear Granger causality. Predictability improvement is a reasonable alternative and perhaps the next best candidate. A more cautious approach may involve using more than one method, from different theoretical backgrounds. Where possible, it is advantageous to identify a base case for the system, where subsequent results can be reliably compared against. For new methodologies, we recommend investigation into the real-world issues that we have discussed.

## SUPPLEMENTARY MATERIAL

See the supplementary material that contains additional tables and figures. Tables S.I and S.II show full parameter choices and computational time requirements of each method, respectively. Figures S1–S3 show the results of real-world relevant transformation experiments. The supplementary material also contains theoretical results for information-theoretic measures in the linear process simulation.

## ACKNOWLEDGMENTS

T.E. is funded by the Engineering and Physical Sciences Research Council (EPSRC) National Productivity Investment Fund (NPIF) No. EP/S515334/1, Reference No. 2089662. A CC BY or equivalent license is applied to the AAM arising from this submission. We would like to acknowledge Marcel Stimberg and Daniel Nüst for their work with the CODECHECK. We found several existing open-source code repositories, listed in the Data Availability section. It was insightful to view and test these packages, though we still decided to develop our own code for these methods. In addition, we appreciate advice from George Sugihara on the use of CCM (pyEDM) in email correspondence. We also acknowledge the Python community for core packages that this work depends upon, including IPython^{47} v7.16.1, Matplotlib^{48} v3.3.2, NumPy^{49} v1.18.5, Palettable^{50} v3.3.0, Pandas^{51} v1.0.5, Python^{52} v3.8.3, scikit-learn^{53} v0.23.1, SciPy^{54} v1.5.0, and statsmodels^{55} v0.11.1.

## DATA AVAILABILITY

The data that support the findings of this study are openly available in GitHub at https://github.com/tedinburgh/causality-review, Ref. 44. Our code is available at the same repository. A CODECHECK certificate is available confirming that the computations underlying this article could be independently executed: doi.org/10.5281/zenodo.4720843. The existing open-access code for some indices includes repositories for information theory and transfer entropy: IDTxl^{41} v1.1, PyIF,^{45} and for convergent cross mapping: pyEDM^{43} v1.7.4. We also adapted the fuzzy $c$-means code based on Ref. 46. We checked our results for transfer entropy and convergent cross mapping against those from IDTxl and pyEDM, respectively. All the codes in our repository and in these others are in Python.

## REFERENCES

*Handbook of Econometrics*(Elsevier, 1984), Vol. 2, pp. 1101–1144.

*Dynamical Systems and Turbulence, Warwick 1980*(Springer, Berlin, 1981), pp. 366–381.

*Proceedings of the 9th Python in Science Conference (SciPy 2010)*(Proceedings of the Python in Sciences Conferences, 2010).