Causal network reconstruction from time series is an emerging topic in many fields of science. Beyond inferring directionality between two time series, the goal of causal network reconstruction or causal discovery is to distinguish direct from indirect dependencies and common drivers among multiple time series. Here, the problem of inferring causal networks including time lags from multivariate time series is recapitulated from the underlying causal assumptions to practical estimation problems. Each aspect is illustrated with simple examples including unobserved variables, sampling issues, determinism, stationarity, nonlinearity, measurement error, and significance testing. The effects of dynamical noise, autocorrelation, and high dimensionality are highlighted in comparison studies of common causal reconstruction methods. Finally, method performance evaluation approaches and criteria are suggested. The article is intended to briefly review and accessibly illustrate the foundations and practical problems of time series-based causal discovery and stimulate further methodological developments.
Reconstructing interaction networks from observed time series is a common problem in fields where active experiments are impossible, unethical, or expensive. Pairwise association networks, for example based on correlations, cannot be interpreted causally. The goal of causal network reconstruction or causal discovery is to distinguish direct from indirect dependencies and common drivers among multiple time series. Here, we briefly recapitulate the theoretical assumptions underlying causal discovery from time series, discuss practical estimation problems, and illustrate each aspect with accessible examples.
I. Introduction
Reconstructing the causal relations behind the phenomena we observe is a fundamental problem in all fields of science. The traditional approach is to conduct active experiments, but in many fields such as Earth system science or neuroscience, manipulations of the complex system under study are either impossible, unethical, or very expensive. On the other hand, modern science generates an ever-growing amount of data from these systems, in particular time series data. Concurrently, novel computing hardware today allows efficient processing of massive amounts of data. These developments have led to emerging interest in the problem of reconstructing causal networks or causal discovery from observational time series.
In the past few decades, a number of original causality concepts have been developed, such as Granger causality (Granger, 1969) or transfer entropy (Schreiber, 2000). Since the 1990s, computer scientists, statisticians, and philosophers have grounded causal reasoning and inference in a robust mathematical framework (Pearl, 2000; Spirtes et al., 2000). The (quite natural) definition of causality underlying this framework is that $X$ causes $Y$ if and only if an intervention or manipulation in $X$ has an effect on $Y$ (Pearl, 2000; Spirtes et al., 2000). This effect may be a change in $Y$'s mean or any change in its post-interventional distribution, denoted $P(Y \mid do(X=x))$, which is different from the conditional distribution $P(Y \mid X=x)$. Unfortunately, all we can measure from observational data are statistical dependencies. These can be visualized in a graphical model (Lauritzen, 1996) or time series graph (Eichler, 2011) that represents the conditional independence relations among the variables and their time lags (Fig. 1). The theory of causal discovery lays out the assumptions under which the underlying causal dependencies can be inferred from observational data.
There are different sets of assumptions that allow us to identify a causal graph. Here, we focus on time-lagged causal discovery in the framework of conditional independence testing using the assumptions of time-order, Causal Sufficiency, the Causal Markov Condition, and Faithfulness, among others, which are all discussed thoroughly in this paper. But some of these assumptions can be replaced. Recent work (Peters et al., 2017) shows ways to use assumptions on the noise structure and dependency types in the framework of structural causal models which can complement the approach studied here and we will include references to recent work from this framework throughout the sections.
The paper is organized as follows: In Sec. II, we relate Granger causality and similar concepts to the conditional-independence framework (Spirtes et al., 2000). Section III provides the necessary definitions and notation, and in Sec. IV we recapitulate the assumptions underlying time-lagged causal discovery from time series alongside illustrative examples. Practical estimation aspects, from causal discovery algorithms to significance testing, are presented in Sec. V, while Sec. VI discusses suggestions for performance evaluation. In Sec. VII, we present some comparison studies of common causal methods and conclude the paper with a brief discussion (Sec. VIII). The paper is accompanied by a Python Jupyter notebook at https://github.com/jakobrunge/tigramite that reproduces some of the examples.
II. From Granger causality to conditional independence
Granger (1969), based on work by Wiener (1956), was the first to propose a practical, operational definition of causality based on prediction improvement. The underlying idea of measuring whether $X$ Granger-causes $Y$ is that there is some unique information in $X$ relevant for $Y$ that is not contained in $Y$'s past as well as the past of "all the information in the universe" (Granger, 1969). In practice, typically only $Y$'s past is used (bivariate Granger causality). Measuring prediction improvement can be operationalized in different ways. The most common framework is that of vector autoregressive models (VAR),

$$\mathbf{X}_t = \sum_{\tau=1}^{\tau_{\max}} \Phi(\tau)\, \mathbf{X}_{t-\tau} + \boldsymbol{\epsilon}_t, \qquad (1)$$
where $\mathbf{X}_t = (X^1_t, \ldots, X^N_t)$, $\Phi(\tau)$ is the coefficient matrix at lag $\tau$, $\tau_{\max}$ some maximum time lag, and $\boldsymbol{\epsilon}_t$ denotes an independent noise term. Here, $X^i$ Granger-causes $X^j$ if any of the coefficients $\Phi_{ji}(\tau)$ at lag $\tau$ is non-zero. A non-zero $\Phi_{ji}(\tau)$ can then be denoted as a causal link $X^i_{t-\tau} \to X^j_t$ at lag $\tau$. Another option is to compare the residual variances of the VAR fitted with and without including the variable $X^i$. The use of VARs restricts this notion of causality to a causality in mean (Granger, 1969). A more general definition is that of (bivariate) transfer entropy (Schreiber, 2000; Barnett et al., 2009)

$$I^{\mathrm{TE}_{\mathrm{biv}}}_{X \to Y} = I\!\left(X^-_t;\; Y_t \mid Y^-_t\right),$$
where $I(\cdot\,;\cdot \mid \cdot)$ denotes the conditional mutual information (CMI) and $X^-_t = (X_{t-1}, X_{t-2}, \ldots)$ denotes the past of a variable. Bivariate TE is a common term; another naming option would be bivariable TE since $X$ and $Y$ could also be multivariate variables. Transfer entropy can also be phrased in a multivariate (or multi-variable) lag-specific version (Runge et al., 2012a). Many current methods are advancements of the concept of transfer entropy (Wibral et al., 2013; Staniek and Lehnertz, 2008; Vejmelka and Palus, 2008), in particular in its multivariate version (Sun and Bollt, 2014; Sun et al., 2015; Runge et al., 2012b; 2012a; Runge, 2015).
Tests for causality are then based on testing whether a particular CMI is greater than zero. Looking at the definition of CMI,

$$I(X; Y \mid Z) = \iiint p(x, y, z)\, \log \frac{p(x, y \mid z)}{p(x \mid z)\, p(y \mid z)}\; dx\, dy\, dz,$$

TE$_{\mathrm{biv}}$ and its advancements essentially test for conditional independence of $X^-_t$ and $Y_t$ given $Y^-_t$, denoted $X^-_t \perp Y_t \mid Y^-_t$, since

$$I(X; Y \mid Z) = 0 \;\Longleftrightarrow\; X \perp Y \mid Z.$$
The conditioning set $Z$ then represents $Y$'s past and other included variables. The lag-specific generalization of the VAR model (1) then is the full conditional independence (FullCI) approach

$$I^{\mathrm{FullCI}}_{X^i_{t-\tau} \to X^j_t} = I\!\left(X^i_{t-\tau};\; X^j_t \mid \mathbf{X}^-_t \setminus \{X^i_{t-\tau}\}\right),$$
where $\mathbf{X}^-_t = (\mathbf{X}_{t-1}, \mathbf{X}_{t-2}, \ldots)$ denotes the whole past of the multivariate process. $I$ can be CMI or any other conditional dependence measure. In the case of partial correlation, a non-zero $I^{\mathrm{FullCI}}_{X^i_{t-\tau} \to X^j_t}$ corresponds to a non-zero coefficient $\Phi_{ji}(\tau)$ in the VAR model (1). A general concept to represent conditional independence relations among multiple variables and their time lags is that of time series graphical models (Eichler, 2011).
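As a rough illustration of this lag-specific testing idea, the following Python sketch implements a FullCI-style test with a linear partial-correlation statistic (a minimal sketch, not the implementation of the accompanying notebook; the function name, the toy model, and its coefficients are purely illustrative):

```python
import numpy as np
from scipy import stats

def fullci_parcorr(data, i, j, tau, tau_max):
    """Test the link X^i_{t-tau} -> X^j_t by partial correlation, conditioning
    on all other lagged variables up to tau_max (FullCI-style)."""
    T, N = data.shape
    t = np.arange(tau_max, T)
    y = data[t, j]
    # design matrix of all lagged variables X^k_{t-s}, s = 1..tau_max
    lagged = np.column_stack([data[t - s, k]
                              for s in range(1, tau_max + 1) for k in range(N)])
    col = (tau - 1) * N + i                      # column of X^i_{t-tau}
    x = lagged[:, col]
    Z = np.delete(lagged, col, axis=1)
    # regress out Z from x and y, then correlate the residuals
    def residual(v):
        beta, *_ = np.linalg.lstsq(Z, v, rcond=None)
        return v - Z @ beta
    return stats.pearsonr(residual(x), residual(y))  # (statistic, naive p-value)

# toy example: variable 0 drives variable 1 at lag 2
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
for t in range(2, 500):
    X[t, 1] += 0.6 * X[t - 2, 0]
print(fullci_parcorr(X, i=0, j=1, tau=2, tau_max=3))
```

Note that the $p$-value returned here ignores the degrees of freedom consumed by the conditioning set; a proper partial-correlation $t$-test would correct for this.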
III. Definitions and notation
A. Definition of time series graphs
Consider a multivariate process $\mathbf{X} = (X^1, \ldots, X^N)$ of dimension $N$. We define the time series graph $\mathcal{G}$ of $\mathbf{X}$ as follows [Fig. 1(c)]: The set of nodes in that graph consists of the set of components $X^j_t$ at each time $t \in \mathbb{Z}$. That is, the graph is actually infinite, but in practice defined up to some maximum time lag $\tau_{\max}$. Compared to the general concept of graphical models (Lauritzen, 1996) for data without time-ordering, for time series graphs the time-dependence is explicitly used to define directional links in the set of edges (Eichler, 2011). For convenience, we treat lagged variable collections such as $\mathbf{X}^-_t$ and parent sets $\mathcal{P}(X^j_t)$ as sets of random variables here and use the difference symbol "$\setminus$" for sets.
Definition 1 (Definition of links). Variables $X^i_{t-\tau}$ and $X^j_t$ for $\tau > 0$ are connected by a lag-specific directed link $X^i_{t-\tau} \to X^j_t$ in $\mathcal{G}$ if and only if

$$X^i_{t-\tau} \;\not\perp\; X^j_t \;\mid\; \mathbf{X}^-_t \setminus \{X^i_{t-\tau}\}. \qquad (6)$$
For $i = j$, we call links autodependencies. For $\tau = 0$, the same definition can also be used to define undirected contemporaneous links (see Eichler, 2011; Runge, 2015), which would lead to the time series graph being a mixed graph instead of a directed acyclic graph. The arrow of time is a convenient way to disambiguate independence relationships: If we do not have access to the time ordering of the variables (or there is none) and observe $X \perp Z \mid Y$ as the only conditional independence relation while all other relations are dependent, then this relation can be generated by any of the three causal motifs $X \to Y \to Z$, $X \leftarrow Y \leftarrow Z$, or $X \leftarrow Y \to Z$, which form a Markov equivalence class.
Here, we define links with lags relative to some time point $t$, but throughout this paper we assume stationarity (discussed in Sec. IV E). Then a link $X^i_{t-\tau} \to X^j_t$ is repeated for every $t' \neq t$ if a link exists at time $t$. Alternatively, the links can be estimated from different realizations at time $t$. In practice, however, the links will mostly be estimated from single time series realizations, requiring stationarity.
The parents of a node $X^j_t$ are defined as

$$\mathcal{P}(X^j_t) = \left\{ X^i_{t-\tau} :\; X^i_{t-\tau} \to X^j_t,\; i = 1, \ldots, N,\; \tau \geq 1 \right\}. \qquad (8)$$
In the following, capital letters such as $X$, $Y$, and $Z$ (or $\mathcal{Z}$ for sets) denote nodes or sets of nodes in the graph and the corresponding random variables of the process $\mathbf{X}$, sometimes dropping the time subscript. We will denote a general conditional dependence measure as $I$, which can be CMI or also some other measure such as partial correlation, depending on the context.
B. Separation
When considering the dependency between two variables $X$ and $Y$ given a set of conditions $\mathcal{Z}$, as in $X \perp Y \mid \mathcal{Z}$, the idea of open and blocked paths or separation between the corresponding nodes in the time series graph [Fig. 1(c)] is important. A directed path is a sequence of linked nodes containing only motifs $\to \circ \to$. But there are also other paths on which information is shared even though no causal interventions could "travel" along these. In general (Eichler, 2011), in the above defined time series graph, a path between two single nodes $X$ and $Y$ is called open if it contains only the chain motifs $\to \circ \to$ (or $\leftarrow \circ \leftarrow$) or the common-cause motif $\leftarrow \circ \to$. For notational convenience, we will sometimes use left-pointing arrows, while still in the time series graph all directed links are forward in time. For example, $X \leftarrow \circ \leftarrow \circ \to Y$ is an open path. On the other hand, if any motif on a path is a collider motif $\to \circ \leftarrow$, the path is blocked. Nodes in such motifs are also called colliders. If we now consider a separating or conditioning set $\mathcal{Z}$, openness and blockedness of motifs reverse, i.e., denoting a conditioned node by $\bullet$, the motifs $\to \bullet \to$ and $\leftarrow \bullet \to$ are blocked and the motif $\to \bullet \leftarrow$ becomes open. For example, the path $X \leftarrow \circ \leftarrow \bullet \to Y$ is blocked, while $X \to \bullet \leftarrow Y$ is open. Note that paths can also traverse links repeatedly, e.g., forward and backward.
Definition 2 (Separation). Two nodes $X$ and $Y$ in the time series graph are separated given a set of nodes $\mathcal{Z}$ (with $X, Y \notin \mathcal{Z}$) if and only if every path between $X$ and $Y$ is blocked given $\mathcal{Z}$. We then write $X \bowtie Y \mid \mathcal{Z}$.
Intuitively, if two nodes are separated, no information is shared between the two. For example, in Fig. 1(c) two indirectly connected nodes are separated by the intermediate nodes on the paths between them, and separation then implies independence. Conversely, two nodes can still be connected given a conditioning set if a "back-door" path between them remains open. These definitions are important for the relations between the graph and the underlying process.
IV. Assumptions of causal discovery from observational time series
Causal information cannot be obtained from associations of measured variables without some assumptions. A variety of different assumptions have been shown to be sufficient to estimate the true causal graph (Spirtes et al., 2000; Peters et al., 2017). Here, we focus on three main assumptions under which the time series graph represents causal relations: Causal Sufficiency, the Causal Markov Condition, and Faithfulness. For time-lagged causal discovery from observational time series, we also need the assumptions of no instantaneous effects and stationarity. Further added to these are dependence type assumptions (e.g., linear or nonlinear) and no measurement error, and we will also assume that the joint distribution of the process has a positive density. All of these are discussed in the following.
A. Causal sufficiency
As Granger (1969) already notes, "[t]he one completely unreal aspect of the above definitions is the use of the series representing all available information [in the universe]." This definition makes sure that the measured variables include all of the common causes of $X$ and $Y$. However, we do not always need the whole universe. If we have available only a limited number of measured variables, we only need to assume that there exist no other unobserved (or latent) variables that directly or indirectly influence any other pair of our set of variables, which is the assumption of Causal Sufficiency (Spirtes et al., 2000).
Definition (Causal Sufficiency).
A set $\mathbf{W}$ of variables is causally sufficient for a process $\mathbf{X}$ if and only if in the process every common cause of any two or more variables in $\mathbf{W}$ is in $\mathbf{W}$ or has the same value for all units in the population.
(Unobserved variables)
What happens if such an unobserved (latent) confounder exists? Consider the example depicted in Fig. 2, where we assume no autodependencies. Here, an unobserved variable drives two of the observed variables. With the time lags considered, this common driver leads to an association between them. The estimated time series graph [via Eq. (6)] among the observed variables will then contain a link between them even though it is a spurious association and any manipulation of the one would not have any effect on the other. Here, the latent variable acts as a direct common driver leading to an induced association. But the estimated time series graph additionally contains a second spurious link even though there is no direct confounder between that pair of variables. The reason is that the corresponding path through the latent variable is opened by conditioning on a collider along it (see Definition 2).
The Fast Causal Inference (FCI) algorithm (Spirtes et al., 2000; Zhang, 2008) does not assume causal sufficiency and allows us to partially identify which links are spurious due to unobserved confounders and also for which links confoundedness cannot be determined. The underlying idea is that if conditional independence holds between two variables for any subset (including the empty set) of the remaining observed variables, then these variables are not linked. In the example above, this idea can be used to remove the second spurious link since the corresponding pair of variables is unconditionally independent. Latent causal discovery is further addressed, for example, in Entner and Hoyer (2010), Eichler (2013), Ramb et al. (2013), Smirnov (2013), Hyttinen et al. (2014), and Geiger et al. (2015).
(Sub-sampled time series)
Causal sufficiency can also be violated if all variables are observed, but they are sampled at too coarse time intervals relative to the causal links. Consider a process with the time series graph depicted in the top panel of Fig. 3, featuring a causal loop with all causal links at lag 1. If we sub-sample the time series, taking only every second time step of the original resolution, we would estimate the causal graph shown in the bottom panel of Fig. 3, which has a completely reversed causal loop. Looking at the top panel time series graph again, this spurious reversal can be understood: On a two-step path of the loop, the intermediate node is not sampled and, thus, unobserved, leading to a spurious link between its neighbors in the sub-sampled time series graph in the bottom panel (note that lags are then measured in units of twice the original sampling interval). Sub-sampled time series are an active area of research (Smirnov, 2013; Barnett and Seth, 2015; Spirtes and Zhang, 2016); to some extent sub-sampled time series graphs can be identified as addressed in Gong et al. (2015) and Hyttinen et al. (2016).
B. Causal Markov condition
All independence-based causal discovery methods necessitate the Causal Markov Condition (Spirtes et al., 2000), which constitutes a close relationship between the process $\mathbf{X}$ and its graph $\mathcal{G}$.
Definition (Causal Markov Condition). The joint distribution of a process $\mathbf{X}$ with graph $\mathcal{G}$ fulfills the Causal Markov Condition if and only if every variable $X^j_t$ is independent of its non-descendants given its parents, which for the time series graph implies $\mathbf{X}^-_t \setminus \mathcal{P}(X^j_t) \;\perp\; X^j_t \mid \mathcal{P}(X^j_t)$. In other words, separation in the graph implies independence in the process.
Note that if Causal Sufficiency is not fulfilled, then generally the Markov condition will also not hold (Spirtes et al., 2000). Intuitively, the Causal Markov Condition implies that once we know the values of $X^j_t$'s parents, all other variables in the past ($\mathbf{X}_{t-\tau}$ for $\tau \geq 1$) become irrelevant for predicting $X^j_t$. Of course, $X^j_t$'s descendants at future time points can also "predict" $X^j_t$.
(Non-Markovian processes)
(Time aggregation)
Another example where noise terms become dependent is time aggregation. Consider the causal chain of processes shown in the top panel of Fig. 5. Time aggregation of a time series realization is here done by averaging consecutive time steps into a new, coarser time series, and correspondingly for the other variables. Now, we observe additional contemporaneous links next to the directed links of the original chain in the time series graph of the aggregated process due to the too coarse time resolution. But, furthermore, we also get spurious directed links, for example, between variables that are not directly linked in the original chain. In general, the causal structure of an aggregated process may be very different from the original process.
Time aggregation is an important issue in many applied fields. For example, in climate research time series are frequently measured daily and then aggregated to a monthly scale to investigate dependencies (Runge et al., 2014). Ideally, the time resolution is at least as short as the shortest causal time lag (assuming no instantaneous effects, see Sec. IV D). See Breitung and Swanson (2002) and Barnett and Seth (2015) for a discussion on temporal aggregation in time series models. Ignoring time order, in some cases the recent methods discussed in Peters et al. (2017) can help.
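As a rough numerical illustration of this effect, the following sketch aggregates a simple chain $X \to Y$ at lag 1 over two time steps (the chain, the coefficients, and the aggregation length are illustrative choices, not the setup of Fig. 5):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
T = 20000
x = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.7 * x[t - 1] + rng.standard_normal()   # causal chain X -> Y at lag 1

# aggregate over blocks of two time steps (illustrative choice)
xa = x.reshape(-1, 2).mean(axis=1)
ya = y.reshape(-1, 2).mean(axis=1)

# at the original resolution there is no contemporaneous dependence ...
print("original lag-0 corr:   %.3f" % stats.pearsonr(x, y)[0])
# ... but the aggregated series show a spurious contemporaneous association
print("aggregated lag-0 corr: %.3f" % stats.pearsonr(xa, ya)[0])
```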
C. Faithfulness
The Causal Markov Condition guarantees that separation in the graph implies independence in the process. But what can be concluded from an estimated conditional independence relation, that is, the reverse direction? Faithfulness guarantees this reverse direction: all conditional independence relations in the process are entailed by separation in the graph.
Definition (Faithfulness). The joint distribution of a process $\mathbf{X}$ with graph $\mathcal{G}$ is faithful to the graph if and only if every conditional independence relation that holds in the process corresponds to a separation in the graph, i.e., $X \perp Y \mid \mathcal{Z} \;\Rightarrow\; X \bowtie Y \mid \mathcal{Z}$.
The combination of Faithfulness and the Markov property implies the equivalence $X \perp Y \mid \mathcal{Z} \;\Leftrightarrow\; X \bowtie Y \mid \mathcal{Z}$, and the logical contraposition of Faithfulness states that connected nodes are dependent. Both conditions are an important assumption for causal discovery algorithms as discussed in Spirtes et al. (2000). Intuitively, Faithfulness together with the Causal Markov Condition allow us to conclude that (in the limit of infinite sample size) a measured statistical dependency is actually due to some (not necessarily direct) causal mechanism and, conversely, a measured independence (given any set of conditions) implies that no direct causal mechanism exists (see also Remark 1 in the Discussion).
(Counteracting mechanisms)
Faithfulness can be violated if different causal mechanisms exactly counteract each other: if the dependence transmitted along one path is precisely canceled by another path, two variables appear independent even though they are causally connected, so that the independence is not due to a separation in the graph. Such exact cancellation requires finely tuned parameters.
(Determinism)
One may argue that we live in a deterministic world and the assumption of "an independent noise term" that is pertinent to statistics is unrealistic. On the other hand, for a given observed variable $X^j_t$, the complexity of the underlying processes will almost always imply that $X^j_t$ does not deterministically depend on its parents, but that some unresolved processes constitute "intrinsic" or "dynamical" noise that is also driving $X^j_t$.
The former example only illustrated a static case of determinism. The field of nonlinear dynamics studies the properties of nonlinear and chaotic dynamical processes, which has led to a plethora of nonlinear time series analysis methods (Kantz and Schreiber, 2003), often from an information-theoretic perspective (Hlavácková-Schindler et al., 2007) including transfer entropy. Many of these methods build on the assumption that no system is perfectly deterministic, for example, due to the coarse-graining of the system's phase-space in the measurement process. In Sec. VII A, we study the effect of dynamical noise on several common time-series based causal discovery approaches for chaotic systems.
(Non-pairwise dependencies)
Next to the fine-tuned example on counteracting mechanisms, Faithfulness can also be violated for a dependency realized in an XOR-gate. Suppose, as shown in Fig. 7 among several "faithful" pairwise dependencies, we have that $Z_t = \mathrm{XOR}(X_{t-1}, Y_{t-1})$ with $X$ and $Y$ being binary random processes with equal probability for all values of $X$ and $Y$. Then $I(X_{t-1}; Z_t) = 0$ (and likewise for $Y$) even though both are connected to $Z_t$, a violation of Faithfulness. Here, the full information is synergistically contained in the pair $(X_{t-1}, Y_{t-1})$. Note that from the chain rule it follows that $I(X_{t-1}, Y_{t-1}; Z_t) = I(X_{t-1}; Z_t) + I(Y_{t-1}; Z_t \mid X_{t-1})$. Thus, the MI is zero, but the CMI is not and there is, hence, a link in the time series graph. Note that already if the probabilities are not equal, Faithfulness is not violated anymore, as analyzed in Sun et al. (2015), showing that Faithfulness violations are rather pathological.
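The information-theoretic statements for the XOR example can be verified directly from the joint distribution; a small sketch (entropies in bits, the helper function is illustrative):

```python
import numpy as np
from itertools import product

# joint distribution of (X, Y, Z) with X, Y fair coins and Z = XOR(X, Y)
p = {(x, y, x ^ y): 0.25 for x, y in product([0, 1], repeat=2)}

def entropy(margin):
    """Shannon entropy (in bits) of the marginal over the given variable indices."""
    dist = {}
    for state, prob in p.items():
        key = tuple(state[i] for i in margin)
        dist[key] = dist.get(key, 0.0) + prob
    return -sum(q * np.log2(q) for q in dist.values() if q > 0)

# I(X;Z) = H(X) + H(Z) - H(X,Z)
mi_xz = entropy([0]) + entropy([2]) - entropy([0, 2])
# I(X;Z|Y) = H(X,Y) + H(Y,Z) - H(X,Y,Z) - H(Y)
cmi_xz_given_y = entropy([0, 1]) + entropy([1, 2]) - entropy([0, 1, 2]) - entropy([1])
print(mi_xz, cmi_xz_given_y)   # -> 0.0 and 1.0 bits
```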
Another form of synergy without violation of Faithfulness is the case where a dependency yields a rather weak MI $I(X_{t-1}; Z_t)$, but again a much larger CMI $I(X_{t-1}; Z_t \mid Y_{t-1})$. In Runge et al. (2015a), synergy is analyzed in the context of optimal prediction schemes.
As pointed out in James et al. (2016) for synergistic dependencies, the problem is that the concept of a pairwise dependency graphical model does not apply, but hyper-graphs are needed to represent such dependencies. Causal discovery of such graphs, however, carries the problem of combinatorial explosion if links between sets of nodes are considered.
D. Instantaneous effects
Granger causality and the definition of time series graphs are examples for lagged definitions of causality. To guarantee that the lagged parents defined in Eq. (8) are sufficient for the Causal Markov Condition to hold, we need to assume that there are no instantaneous (contemporaneous) causal effects, i.e., no causal links $X^i_t \to X^j_t$ at lag zero. One may argue that causality between dynamical systems cannot have instantaneous effects because the speed of light is finite and, if the process is sampled with sufficient resolution (otherwise, see the Examples of sub-sampling and aggregation), we only need to consider lagged causal effects. However, we often do not have a sufficiently sampled time series. Here, recent developments in causal inference theory (Zhang and Hyvärinen, 2009; Peters et al., 2013; Lopez-Paz et al., 2015; Spirtes and Zhang, 2016; Peters et al., 2017) address instantaneous causality within the framework of structural causal models which can be applied to determine causal directionality for contemporaneous links. These models work under assumptions on the noise in the model such as non-Gaussianity. Also, the logical causal orientation rules of the causal discovery approaches in Spirtes et al. (2000) can be used to partially orient contemporaneous links.
E. Stationarity
To estimate the time series graph defined in Definition 1 from time series data, we assume stationarity. Another option would be to utilize independent ensembles of realizations of lagged processes. Here, we define stationarity with respect to a time index set $\mathcal{T}$. For example, $\mathcal{T}$ can contain all time indices belonging to a certain regime of a dynamical process, e.g., only winter months in the climate system.
Definition (Causal stationarity). The process $\mathbf{X}$ with time series graph $\mathcal{G}$ is causally stationary over a time index set $\mathcal{T}$ if and only if for all links $X^i_{t-\tau} \to X^j_t$ in the graph the defining dependence $X^i_{t-\tau} \not\perp X^j_t \mid \mathbf{X}^-_t \setminus \{X^i_{t-\tau}\}$ holds for all $t \in \mathcal{T}$.
This constitutes actually a weaker form of stationarity than the common definition of stationarity in mean, variance, spectral properties, or of the value of individual coefficients in a linear model. For example, one could require that all CMIs are stationary,

$$I\!\left(X^i_{t-\tau};\; X^j_t \mid \mathbf{X}^-_t \setminus \{X^i_{t-\tau}\}\right) = \text{const.} \quad \text{for all } t \in \mathcal{T},$$
which is a much stronger statement. The strength of causal mechanisms may fluctuate over time and the causal stationarity assumption only requires conditional independence to be stationary.
(Non-stationarity due to confounding)
Consider the data shown in Fig. 8 and suppose we first only have access to three of the variables. Clearly, the time series of this subprocess are nonstationary in a classical sense, varying over time not only in their mean but also in their spectral properties. Estimating the time series graph on these three variables results in the graph shown in the bottom left panel of Fig. 8, where the common nonstationary trend leads to an almost fully connected graph.
A typical example of a common nonstationarity, albeit not the same as in our example, is found in climate time series, which are usually all driven by solar forcing leading to a common seasonal signal. In climate research, the time series are typically anomalized, that is, the seasonal signal is estimated and subtracted from the data (Storch and Zwiers, 1999), which is equivalent to regressing out its influence. But this is not always possible; in our example the common signal is not purely periodic and cannot easily be estimated from the data. Another option for the case of piecewise stationary processes is to include background knowledge on the stationary regimes and estimate the graphs separately for the stationary subsets of time indices. For example, the climatic regimes El Niño and La Niña lead to different causal directions of surface temperature anomalies in the tropical Pacific (Philander, 1985). Prior knowledge of when these regimes start and end allows us to restrict the estimation of time series graphs to samples within a particular regime.
Now suppose we actually have access to the common signal and include it in our analysis (but without estimating its parents, i.e., treating it as exogenous). Then, as shown in the bottom right panel of Fig. 8, we can recover the true causal structure, in which the common signal is a confounder of the three variables. Thus, here nonstationarity is a result of confounding and can be removed if we have access to the underlying trend.
The key point is that the causal structure, that is, the time series graph, of the whole process is invariant in time. One may argue that causal laws are generally invariant and non-stationarity is simply a problem of violation of causal sufficiency. The idea of finding invariant predictors for causal inference is explored in Peters et al. (2016).
F. Dependency type assumptions
To test conditional independence hypotheses $X \perp Y \mid \mathcal{Z}$, different test statistics can be utilized. These are typically based on making certain assumptions about the type of the underlying dependency structure. While classical statistical methods are often based on the assumption of linearity (which allows us to derive rigorous results), modern statistics, the physics community, and the recent field of machine learning have developed non-parametric or model-free methods that allow us to better capture the nonlinear reality of many dynamical complex systems, at the cost of weaker theoretical results. Conditional independence testing can be classified into regression-based and model-free approaches. Here, we only discuss tests for continuously valued variables; for discrete variables one can, for example, use methods based on contingency tables (Spirtes et al., 2000) or discrete CMI estimation (Cover and Thomas, 2006).
Regression-based conditional independence tests of $X \perp Y \mid \mathcal{Z}$ are based on first regressing out the influence of $\mathcal{Z}$ from $X$ and $Y$ and then testing the dependence between the residuals. We first fit a model assuming

$$X = f_X(\mathcal{Z}) + \epsilon_X, \qquad Y = f_Y(\mathcal{Z}) + \epsilon_Y,$$
for centered variables and independent and identically normally distributed noise terms $\epsilon_X, \epsilon_Y$. Now further restrictions can be laid upon the functional form of $f_X$ and $f_Y$. For example, the partial correlation test assumes linearity, while a non-parametric regression can be based on Gaussian Process regression (Rasmussen and Williams, 2006).
Secondly, from the estimated functions $\hat{f}_X$ and $\hat{f}_Y$, the residuals are formed as

$$r_X = X - \hat{f}_X(\mathcal{Z}), \qquad r_Y = Y - \hat{f}_Y(\mathcal{Z}).$$
Finally, the dependence between the residuals can be tested with different pairwise association tests. For partial correlation this is a $t$-test, while the dependence between the residuals of a non-parametric regression can be tested with non-parametric tests (Gretton et al., 2008; Székely et al., 2007) such as the distance correlation coefficient (Székely et al., 2007) (see also Sec. V C). Note that these models all make parametric assumptions and, thus, do not estimate conditional independence in its most general form.
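A minimal sketch of this two-step residual approach with a linear regression and a Pearson correlation test on the residuals (the regression step could equally be replaced by a non-parametric estimator such as Gaussian Process regression; function names and the toy model are illustrative):

```python
import numpy as np
from scipy import stats

def parcorr_ci_test(x, y, Z):
    """Residual-based conditional independence test of x _||_ y | Z
    using linear regression and a correlation test on the residuals."""
    D = np.column_stack([Z, np.ones(len(x))])   # add intercept
    def residual(v):
        beta, *_ = np.linalg.lstsq(D, v, rcond=None)
        return v - D @ beta
    return stats.pearsonr(residual(x), residual(y))

# toy common-driver example: Z drives both X and Y
rng = np.random.default_rng(2)
z = rng.standard_normal(1000)
x = 0.8 * z + rng.standard_normal(1000)
y = 0.8 * z + rng.standard_normal(1000)
print(parcorr_ci_test(x, y, z.reshape(-1, 1)))   # residuals ~ uncorrelated
```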
The other extreme to partial correlation are model-free methods that directly test conditional independence. The most prominent test statistic is CMI as defined in Eq. (3), for which non-parametric estimators based on nearest-neighbor statistics exist (Kraskov et al., 2004; Frenzel and Pompe, 2007; Vejmelka and Palus, 2008; Póczos and Schneider, 2012) [see also Gao et al. (2015) and Lord et al. (2018) for recent progress on nearest-neighbor entropy estimators]. Other possible conditional independence tests are Kernel Conditional Independence Tests (Zhang et al., 2011; Strobl et al., 2017) which essentially test for zero Hilbert-Schmidt norm of the partial cross-covariance operator or conditional distance correlation (Wang et al., 2015). Some new recent tests are based on neural networks (Sen et al., 2017) or decision tree regression (Chalupka et al., 2018). In Runge (2018), a conditional independence test based on CMI is introduced.
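For illustration, a compact nearest-neighbor CMI estimator in the spirit of Frenzel and Pompe (2007) can be sketched as follows (the choice of $k$ and the neighbor-counting details are illustrative simplifications, not the estimator implementations used in the cited works):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def cmi_knn(x, y, z, k=5):
    """Nearest-neighbor CMI estimate of I(X;Y|Z), Frenzel-Pompe style."""
    col = lambda v: np.asarray(v, dtype=float).reshape(len(v), -1)
    x, y, z = col(x), col(y), col(z)
    xyz = np.hstack([x, y, z])
    # distance of each sample to its k-th neighbor in the joint space (max-norm)
    eps = cKDTree(xyz).query(xyz, k=k + 1, p=np.inf)[0][:, -1]
    def neighbor_counts(sub):
        tree = cKDTree(sub)
        return np.array([len(tree.query_ball_point(pt, r, p=np.inf)) - 1
                         for pt, r in zip(sub, eps)])
    n_xz = neighbor_counts(np.hstack([x, z]))
    n_yz = neighbor_counts(np.hstack([y, z]))
    n_z = neighbor_counts(z)
    return digamma(k) - np.mean(digamma(n_xz + 1) + digamma(n_yz + 1) - digamma(n_z + 1))

# toy check: X and Y are conditionally independent given the common driver Z
rng = np.random.default_rng(3)
z = rng.standard_normal(1000)
x = z + rng.standard_normal(1000)
y = z + rng.standard_normal(1000)
print(cmi_knn(x, y, z))   # close to zero
```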
(Nonlinearity)
For the linear case (first row in Fig. 9), we consider a common driver $Z$ that linearly drives $X$ and $Y$, and the regression-based techniques correctly fit the dependencies of the pairs $(Z, X)$ and $(Z, Y)$ (red fit lines in gray scatterplots), and, thus, correctly identify the independence of the residuals (black scatterplot). A model-free test also correctly identifies the common driver motif here.
For the nonlinear additive noise case with a quadratic dependency on the common driver, partial correlation cannot fit the dependencies of the pairs $(Z, X)$ and $(Z, Y)$. As a result, the residuals are still correlated (spurious gray dashed link) and the causal graph is completely wrong: We overlook the links $Z \to X$ and $Z \to Y$ and get a false positive link between $X$ and $Y$. Since here the dependencies are still additive functions, non-parametric regressions and model-free tests yield a correct causal graph.
Finally, if the dependencies on the common driver are multiplicative (bottom row), both regression methods fail. Then the residuals are nonlinearly related, which is not detected with a partial correlation test (here two errors somewhat cancel each other out). A non-parametric test on the residuals, on the other hand, then wrongly estimates the spurious link (gray dashed in center bottom row).
Model-free methods in principle can deal with all these cases, which might lead to the conclusion that they are superior. But the “no-free-lunch-theorem” tells us that such generality has a price and model-free methods are very data-hungry and computationally expensive. If expert knowledge pointing to a linear or otherwise parametric dependency is available, then regression-based methods will typically greatly outperform model-free methods.
G. Measurement error
Measurement error, unlike dynamical noise, contaminates the variables between which we seek to reconstruct dependencies and constitutes a difficult problem in causal network reconstruction (Scheines and Ramsey, 2016).
(Observational noise)
Here, we only discuss measurement error in its simple form as observational noise, which can be modeled as the true variable plus independent noise, $\tilde{X}_t = X_t + \eta_t$. Such observational noise presents at least two sorts of problems for causal discovery.
Firstly, observational noise attenuates true associations and, therefore, lowers detection power. This is because in general $I(\tilde{X}; \tilde{Y}) \leq I(X; Y)$, which is a consequence of the data processing inequality (Cover and Thomas, 2006): Manipulating a variable can only reduce its information content. In Fig. 10, we added normal observational noise to one of the variables; then the links from and to this variable cannot be reconstructed anymore. Secondly, too much noise on conditioning variables makes it impossible to preserve conditional independence. In Fig. 10, a dependence remains between two variables given the noisy measurement of a conditioning variable even though they are independent given the noise-free variable. The effect of observational noise on causal discovery is discussed further in Smirnov (2013) and Scheines and Ramsey (2016), and Runge et al. (2018) give some numerical experiments for high-dimensional causal discovery.
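Both effects are easy to reproduce numerically; a small sketch with an illustrative common-driver model and noise level (not the setup of Fig. 10):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 5000
z = rng.standard_normal(n)
x = 0.9 * z + 0.3 * rng.standard_normal(n)
y = 0.9 * z + 0.3 * rng.standard_normal(n)
z_noisy = z + 1.0 * rng.standard_normal(n)      # observational noise on the condition

def partial_corr(a, b, cond):
    D = np.column_stack([cond, np.ones(n)])
    res = lambda v: v - D @ np.linalg.lstsq(D, v, rcond=None)[0]
    return stats.pearsonr(res(a), res(b))[0]

print("corr(x, z):             %.2f" % stats.pearsonr(x, z)[0])
print("corr(x, z_noisy):       %.2f" % stats.pearsonr(x, z_noisy)[0])   # attenuated
print("parcorr(x,y | z):       %.2f" % partial_corr(x, y, z))           # ~ 0
print("parcorr(x,y | z_noisy): %.2f" % partial_corr(x, y, z_noisy))     # spuriously > 0
```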
V. Practical estimation
The previous sections concerned fundamental assumptions of time-lagged causal discovery based on conditional independence relations. We now turn to the topic of practical estimation where we introduce several common causal discovery methods and discuss their consistency, significance testing, and computational complexity. We restrict the analysis to the class of conditional independence approaches, which flexibly allows us to use different independence tests. But graphical models, in general, can also be estimated with score-based Bayesian methods, e.g., the max-min hill-climbing algorithm (Tsamardinos et al., 2006).
A. Causal discovery algorithms
1. Granger causality/transfer entropy/FullCI
Transfer entropy, as introduced by Schreiber (2000), is a direct information-theoretic implementation of Granger causality (Barnett et al., 2009). In a lag-specific implementation, as given by FullCI [Eq. (6)], it tests for conditional independence between each $X^i_{t-\tau}$ and $X^j_t$ conditioned on the entire past $\mathbf{X}^-_t$ (excluding $X^i_{t-\tau}$). As the experiments in Sec. VII will demonstrate, this approach strongly suffers from the curse of dimensionality.
2. Optimal causation entropy
Sun and Bollt (2014) and Sun et al. (2015) developed a discovery algorithm based on the information-theoretic optimal causation entropy principle (algorithm abbreviated as OCE) which reconstructs the lagged parents of a variable $X^j_t$ by an iterative procedure alleviating the curse of dimensionality: Starting with an empty parent set, first the MIs $I(X^i_{t-\tau}; X^j_t)$ for all lagged variables $X^i_{t-\tau}$ are estimated. As the first parent, the variable with the largest MI with $X^j_t$ is selected. The next parent, however, is chosen according to the largest CMI conditional on the first selected parent among all remaining variables, the third parent is the one with largest CMI conditional on the two previously selected parents, etc. The process is continued until the CMI of a selected parent is non-significant. This forward-selection stage is followed by a backward elimination whereby the significance of each of the parents $X^i_{t-\tau} \in \hat{\mathcal{P}}(X^j_t)$ is tested conditional on the remaining parents, i.e., by testing

$$X^i_{t-\tau} \;\perp\; X^j_t \;\mid\; \hat{\mathcal{P}}(X^j_t) \setminus \{X^i_{t-\tau}\}. \qquad (26)$$
The significance of CMIs can be tested with a nearest-neighbor CMI estimator (Kraskov et al., 2004; Frenzel and Pompe, 2007; Vejmelka and Palus, 2008) in combination with a permutation test where the putative driver $X^i_{t-\tau}$ is randomly shuffled. Of course, the conditional independencies in Eq. (26) can also be tested with other test statistics.
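The forward-backward scheme can be sketched with a generic conditional independence test, here partial correlation for brevity (the threshold, the test, and the data handling are illustrative simplifications of the cited algorithm):

```python
import numpy as np
from scipy import stats

def parcorr_pval(x, y, conds):
    """p-value of a partial-correlation test of x _||_ y | conds (list of arrays)."""
    D = np.column_stack(conds + [np.ones(len(x))]) if conds else np.ones((len(x), 1))
    res = lambda v: v - D @ np.linalg.lstsq(D, v, rcond=None)[0]
    return stats.pearsonr(res(x), res(y))[1]

def forward_backward(y, candidates, alpha=0.01):
    """candidates: dict mapping labels (e.g., (variable, lag)) to lagged arrays."""
    parents = []
    # forward selection: iteratively add the most dependent remaining candidate
    while True:
        remaining = [c for c in candidates if c not in parents]
        if not remaining:
            break
        pvals = {c: parcorr_pval(candidates[c], y, [candidates[q] for q in parents])
                 for c in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] > alpha:
            break
        parents.append(best)
    # backward elimination: re-test each parent given all other selected parents
    for c in list(parents):
        others = [candidates[q] for q in parents if q != c]
        if parcorr_pval(candidates[c], y, others) > alpha:
            parents.remove(c)
    return parents
```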
3. PC algorithm
An alternative to this forward-backward scheme is the PC algorithm (named after its inventors Peter and Clark) (Spirtes and Glymour, 1991). The original PC algorithm was formulated for general random variables without assuming a time order. It consists of several phases where first, in the skeleton-discovery phase, an undirected graphical model (Lauritzen, 1996) is estimated whose links are then oriented using a set of logical rules (Spirtes and Glymour, 1991; Spirtes et al., 2000). A later improvement led to the more robust modification called PC-stable (Colombo and Maathuis, 2014).
For the case of time series, we can use the information of time order which naturally provides an orientation rule for links. The algorithm then is as follows: For every variable $X^j_t$, it starts by initializing the preliminary parents $\hat{\mathcal{P}}(X^j_t) = \mathbf{X}^-_t$. In the first iteration ($p = 0$), we remove a variable $X^i_{t-\tau}$ from $\hat{\mathcal{P}}(X^j_t)$ if the null hypothesis

$$X^i_{t-\tau} \;\perp\; X^j_t$$
cannot be rejected at a significance threshold $\alpha$. Then, in each next iteration, we increase $p \to p + 1$ and remove a variable $X^i_{t-\tau}$ from $\hat{\mathcal{P}}(X^j_t)$ if any of the null hypotheses

$$X^i_{t-\tau} \;\perp\; X^j_t \;\mid\; \mathcal{S}$$
cannot be rejected, where $\mathcal{S}$ iterates (in an inner loop) through all combinations of subsets $\mathcal{S} \subseteq \hat{\mathcal{P}}(X^j_t) \setminus \{X^i_{t-\tau}\}$ with cardinality $p$. The algorithm converges for a link once $p$ exceeds the cardinality of $\hat{\mathcal{P}}(X^j_t) \setminus \{X^i_{t-\tau}\}$ and the null hypothesis is rejected (if the null hypothesis cannot be rejected, the link is removed). Runge et al. (2018) provide pseudo-code for this algorithm.
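In code, the skeleton phase for the parents of a single variable can be sketched as follows (again with a partial-correlation test; the actual algorithms include further refinements such as the PC-stable modification and the orientation phase, which are omitted here):

```python
import numpy as np
from itertools import combinations
from scipy import stats

def parcorr_pval(x, y, conds):
    D = np.column_stack(conds + [np.ones(len(x))]) if conds else np.ones((len(x), 1))
    res = lambda v: v - D @ np.linalg.lstsq(D, v, rcond=None)[0]
    return stats.pearsonr(res(x), res(y))[1]

def pc_parents(y, candidates, alpha=0.01):
    """Skeleton phase: remove candidates found conditionally independent of y
    given some subset of the other remaining candidates."""
    parents = set(candidates)
    p = 0
    while p <= len(parents) - 1:
        for cand in sorted(parents):
            others = [c for c in parents if c != cand]
            if len(others) < p:
                continue
            # remove cand if independent of y given ANY condition set of cardinality p
            for S in combinations(others, p):
                if parcorr_pval(candidates[cand], y, [candidates[s] for s in S]) > alpha:
                    parents.discard(cand)
                    break
        p += 1
    return parents
```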
The forward-backward scheme of OCE conducts conditional independence tests only using the conditions with highest CMI in the preceding stage and quickly increases the number of conditions. This can lead to wrong parents being kept in the preliminary parent set (see Sec. V B), which are only removed in the backward stage where the dimensionality of the conditioning set can be already quite high. High dimensionality, in principle, leads to lower detection power (Example VII C). The PC algorithm conducts conditional independence tests not only using the condition with highest association, but it goes through (in theory all) combinations of conditions, which can help to alleviate the curse of dimensionality regarding the estimation dimension compared to OCE. On the other hand, the PC algorithm conducts many more tests, which increases other problems (see Sec. V C).
4. PCMCI
A more recent approach that addresses some of the shortcomings of the PC algorithm above is PCMCI (Runge et al., 2018). PCMCI is a two-step approach which uses a version of the PC algorithm only as a condition-selection step (PC$_1$ algorithm) to obtain estimated parents $\hat{\mathcal{P}}(X^j_t)$ for all $X^j_t$, followed by the momentary conditional independence (MCI) test for $X^i_{t-\tau} \to X^j_t$ defined as

$$\mathrm{MCI}:\quad X^i_{t-\tau} \;\perp\; X^j_t \;\mid\; \hat{\mathcal{P}}(X^j_t) \setminus \{X^i_{t-\tau}\},\; \hat{\mathcal{P}}(X^i_{t-\tau}), \qquad (29)$$
with $\hat{\mathcal{P}}(X^i_{t-\tau})$ denoting the (time-shifted) estimated parents of the lagged variable $X^i_{t-\tau}$. The main difference between PC$_1$ and the full PC algorithm is that PC$_1$ tests only the condition subset with the largest association instead of going through all possible combinations. To this end, $\hat{\mathcal{P}}(X^j_t)$ is sorted after every iteration according to the test statistic value and the condition set is determined by the first $p$ variables in $\hat{\mathcal{P}}(X^j_t)$ (excluding $X^i_{t-\tau}$). This leads to fewer tests, but still provably removes incorrect links.
The MCI test is the most important difference to the PC algorithm and the approach by Sun and Bollt (2014). The additional conditioning on the parents $\hat{\mathcal{P}}(X^i_{t-\tau})$ of the lagged variable in MCI accounts for autocorrelation, leading to well-controlled false positive rates at the expected level (Runge et al., 2018). A variant of PCMCI where the condition on these parents is dropped leads to a very similar approach to OCE. PCMCI, like FullCI, OCE, and PC, can be implemented with different conditional independence tests. For further details on PCMCI, see Runge et al. (2018). In Sec. VII we compare FullCI, OCE, and PCMCI in a number of numerical comparison studies.
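Given already estimated parent sets, the MCI test for a single link is one conditional independence test with a specific conditioning set; a partial-correlation sketch (function and argument names are illustrative, the reference implementation is part of the tigramite package linked above):

```python
import numpy as np
from scipy import stats

def mci_parcorr(data, i, j, tau, parents_of_j, parents_of_i):
    """Partial-correlation MCI test of X^i_{t-tau} -> X^j_t.

    parents_of_* are lists of (variable, lag) tuples with lag > 0."""
    T = data.shape[0]
    max_lag = tau + max([l for _, l in parents_of_i], default=0)
    max_lag = max(max_lag, max([l for _, l in parents_of_j], default=0), tau)
    t = np.arange(max_lag, T)
    x = data[t - tau, i]
    y = data[t, j]
    # parents of the target X^j_t, excluding the tested link itself
    conds = [data[t - l, k] for k, l in parents_of_j if not (k == i and l == tau)]
    # parents of the lagged driver, shifted by tau so they condition X^i_{t-tau}
    conds += [data[t - tau - l, k] for k, l in parents_of_i]
    D = np.column_stack(conds + [np.ones(len(t))]) if conds else np.ones((len(t), 1))
    res = lambda v: v - D @ np.linalg.lstsq(D, v, rcond=None)[0]
    return stats.pearsonr(res(x), res(y))
```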
B. Consistency
Consistency is an important property of causal methods that tells us whether the method provably converges to the true causal graph in the limit of infinite sample size. Consistency concerns the conditional independence tests on the one hand, but also the causal algorithm in the case of iterative approaches such as those discussed in Sec. V A.
For example, for the consistency of the non-parametric regression independence tests in Eq. (23), we need to assume that the function estimator converges to the true function, that the noise in the model is additive and independent, and finally that we have a consistent unconditional independence test for the residuals. With a consistent test, the time series graph can be directly estimated based on Definition 1. For iterative causal algorithms, we can define universal consistency as follows.
Definition (Universal causal consistency). Denote by $\hat{\mathcal{G}}_n$ the graph estimated by a causal discovery method from a sample of size $n$ and by $\mathcal{G}$ the true time series graph. The method is called universally consistent if $\hat{\mathcal{G}}_n$ converges to $\mathcal{G}$ in probability for every distribution $P$ of the process, i.e., $\lim_{n \to \infty} \Pr(\hat{\mathcal{G}}_n \neq \mathcal{G}) = 0$.
That is, the probability of estimating the wrong graph becomes arbitrarily small if enough data is available, for any distribution $P$ (hence "universal"). Consistency has been proven for classical causal discovery algorithms such as the PC algorithm (Spirtes et al., 2000), for the optimal causation entropy approach (Sun and Bollt, 2014; Sun et al., 2015), and for PCMCI (Runge et al., 2018) as an approach based on PC.
However, universal consistency is a weaker statement than, for example, uniform consistency, which bounds the error probability as a function of the sample size, giving a rate of convergence. Thus, for a method that is only universally but not uniformly consistent, the sample size at which a given error can be guaranteed can be different for every distribution $P$. Robins et al. (2003) showed that no uniformly consistent causal discovery algorithm from the class of independence-based approaches (Spirtes et al., 2000) exists since the convergence can always be made arbitrarily slow by a distribution that is almost unfaithful with some dependencies made arbitrarily small. Uniform consistency for conditional-independence based algorithms can only be achieved under further assumptions such as having strong enough dependencies (Kalisch, 2008).
(An inconsistent causal algorithm)
Consider again the forward-selection stage of the OCE algorithm (Sun and Bollt, 2014; Sun et al., 2015) introduced in Sec. V A as a standalone method to reconstruct the parents of a variable. Even though the scheme sounds appealing and efficient, the scheme alone is not a consistent estimator of causal graphs. It yields a superset of the parents (Sun et al., 2015) which may also contain false positives: Consider the example graph shown in Fig. 11 (dropping time subscripts here). If forward selection alone was a causal approach, then in each step the variable with strongest association would also need to be a causal parent. But in this example the MI between the target variable and a non-parent can be larger than the MIs of the target with each of its true parents, so that a wrong parent is selected first. See Appendix A for an information-theoretic analysis. This scheme, thus, requires the second step of the OCE approach, given by Eq. (26).
C. Significance testing
How can we assess the statistical significance of conditional independence tests on which the causal algorithms in Sec. V A are based, such as the tests discussed in Sec. IV F? Using a test statistic $\hat{I}$ (where $I$ here stands not only for CMI, but for any conditional independence test statistic) computed from the observed samples, we wish to test the null hypothesis

$$H_0:\quad X \perp Y \mid \mathcal{Z}$$
versus the general alternative

$$H_1:\quad X \not\perp Y \mid \mathcal{Z}.$$
To assess the significance of an outcome of such a test using $p$-values, we need to know the distribution of the test statistic under the null hypothesis. For partial correlation tests, exact analytical expressions of the null distribution exist under certain assumptions, but for nonlinear tests (such as CMI or also non-parametric regression tests) these are typically not available except for some asymptotic large sample size cases (Strobl et al., 2017). The alternative then is to use permutation tests as discussed in Example 12.
Given the null distribution, the $p$-value is defined as the probability, given $H_0$, of observing a value of the test statistic that is the same or more extreme than what was actually observed. If our test statistic is non-negative (such as CMI), the $p$-value for an observed test statistic value $\hat{I}^*$ is defined as $p = \Pr(\hat{I} \geq \hat{I}^* \mid H_0)$. Choosing a significance level $\alpha$, we reject the null hypothesis if $p < \alpha$.
There are two types of errors we can make. Rejecting $H_0$ when $H_0$ is true is called a type I error or false positive. Retaining $H_0$ when $H_0$ is false is called a type II error or false negative. We also call one minus the type II error rate of a test the true positive rate or detection rate.
If the test statistic has a continuous distribution, then under $H_0$ the $p$-value has a uniform distribution on $[0, 1]$ (Wasserman, 2004). Therefore, if we reject $H_0$ when the $p$-value is less than $\alpha$, the probability of a type I error is $\alpha$. For a well-calibrated test under $H_0$, we thus expect to measure on average a false positive rate of $\alpha$. If this is not the case, the test is ill-calibrated, indicating that we got the null distribution wrong because certain assumptions, such as independent samples, are violated. In Examples 12 and 13, we discuss such cases.
(Permutation testing)
Permutation testing is straightforward in the bivariate independence test case. To create an estimate of the null distribution of a test statistic $\hat{I}(x; y)$, we can simply generate a large number of surrogate test statistics $\hat{I}(x^*; y)$, where $x^*$ is a permuted version of the sample $x$.
In the conditional case, however, a full permutation of $x$ destroys not only its potential association with $y$ but also that with $z$, and hence tests the wrong null hypothesis. To achieve a test under the correct null hypothesis, we can use a local permutation scheme that preserves the association between $X$ and $\mathcal{Z}$. Runge (2018) suggests such a scheme, depicted in the bottom panel of Fig. 12, which only permutes those samples $x_i$ and $x_j$ whose corresponding values $z_i$ and $z_j$ are close (nearest neighbors). This scheme can be used for CMI conditional independence testing or also other test statistics. Other schemes are discussed in Doran et al. (2014) and Sen et al. (2017).
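A generic shuffle-based permutation test can be sketched in a few lines (the test statistic, the number of surrogates, and the toy data are illustrative; a local permutation scheme would additionally restrict which samples may be exchanged, as described above):

```python
import numpy as np

def permutation_pvalue(x, y, statistic, n_surrogates=500, seed=0):
    """One-sided permutation p-value: how often does shuffling x give a value
    at least as large as the observed test statistic?"""
    rng = np.random.default_rng(seed)
    observed = statistic(x, y)
    null = np.array([statistic(rng.permutation(x), y) for _ in range(n_surrogates)])
    return (1 + np.sum(null >= observed)) / (1 + n_surrogates)

# example with absolute correlation as the test statistic
abs_corr = lambda a, b: abs(np.corrcoef(a, b)[0, 1])
rng = np.random.default_rng(5)
x = rng.standard_normal(300)
y = 0.2 * x + rng.standard_normal(300)
print(permutation_pvalue(x, y, abs_corr))
```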
(Non-independent samples)
A basic assumption underlying many conditional independence tests is that the samples are independent and identically distributed (i.i.d.). Unfortunately, time series are typically dependent in time. To take this dependence into account, one can either adapt the distribution under the null hypothesis, for example, in partial correlation $t$-tests by estimating the effective degrees of freedom, which is, however, difficult in a multivariate setting, or one can modify the test statistic to explicitly account for autocorrelation (Runge et al., 2018). Also, a permutation scheme needs to be adapted to preserve auto-dependencies, for example, by shuffling blocks of samples (Peifer et al., 2005). For bivariate tests such an approach is again straightforward, but not for the multivariate case as analyzed in the autocorrelation comparison study in Sec. VII.
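A block-shuffle surrogate that approximately preserves auto-dependencies can be sketched as follows (the block length is an illustrative choice that in practice should be adapted to the autocorrelation of the data):

```python
import numpy as np

def block_shuffle(x, block_length=10, seed=0):
    """Return a surrogate of x in which blocks of consecutive samples are permuted,
    preserving the autocorrelation structure within each block."""
    rng = np.random.default_rng(seed)
    n_blocks = len(x) // block_length
    blocks = x[:n_blocks * block_length].reshape(n_blocks, block_length)
    return blocks[rng.permutation(n_blocks)].reshape(-1)
```

Such surrogates can replace the plain shuffle in the permutation test sketched above for bivariate tests.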
(Sequential testing of causal links)
The preceding discussion concerned tests of an individual conditional independence relationship. Directly testing causal links via Definition 1, thus, gives us a well-calibrated test if all assumptions are fulfilled.
However, in iterative causal algorithms (Sec. V A) such as the PC algorithm, OCE, or the first step of PCMCI (PC$_1$), multiple tests on a particular link are conducted with different condition sets that are determined by the outcome of previous tests. If a link is found non-significant in any of these tests, this link is removed. Because the tests are not independent of each other (they are typically based on the same data sample), it is almost impossible to derive a combined $p$-value of all these tests.
Figure 13 depicts an illustrative numerical example where the combined false positive rate is much lower than the significance level $\alpha$ of each individual test and the true positive rate is lower than the minimal true positive rate among the individual tests. In summary, even though all assumptions may be valid, sequential testing makes a significance assessment difficult. These issues are further discussed in Tsamardinos and Brown (2008) and Strobl and Spirtes (2016) and references therein, where False Discovery Rate (Benjamini and Hochberg, 1995) approaches are discussed.
As a side remark on the previously discussed OCE and PCMCI causal discovery approaches, in the backward-elimination stage [Eq. (26)] OCE tests only the significance of each of the parents conditional on the remaining parents. PCMCI (Runge et al., 2018), on the other hand, in the second MCI step tests all links again [Eq. (29)], which makes the MCI test slightly less dependent on the sequential testing issue of the condition-selection step PC$_1$ since parents that have been removed in PC$_1$ (false negatives) are tested again in the MCI test. In particular, false positives occur at the expected level, as demonstrated in the comparison studies in Sec. VII.
D. Computational complexity
Application areas of causal discovery methods vary in the typical numbers of variables $N$ as well as available sample sizes $T$. Next to the properties discussed before, an important issue then is how a method scales with dimensionality and sample size. High dimensionality arises from the number of included variables $N$ and the maximum time lag $\tau_{\max}$ [see Fig. 1(c)] and has at least two consequences: (1) higher computational complexity leading to longer runtimes and (2) typically lower detection power. Independence tests may also become ill-calibrated in high dimensions.
For directly testing causal links via Definition 1 (FullCI), the computational complexity depends on the complexity of a single high-dimensional conditional independence test. In the linear partial correlation case, OLS regression scales polynomially in the dimension $N\tau_{\max}$ of the conditioning set and linearly in the sample size $T$. FullCI estimated using nearest-neighbor estimators of CMI (Kraskov et al., 2004; Frenzel and Pompe, 2007), on the other hand, scales superlinearly in $T$ regarding time complexity, while the complexity in the dimension will depend on algorithmic details such as using efficient KD-tree nearest-neighbor search procedures (Maneewongvatana and Mount, 1999).
The methods PC, OCE, or PCMCI (Sec. V A), based on a condition-selection step, avoid high-dimensional conditional independence estimation by conducting more tests with lower dimensional conditioning sets. Their theoretical complexities are difficult to evaluate [for numerical evaluations see Sun et al. (2015) and Runge et al. (2018)], but typically their runtime scales polynomially.
The other major challenge with high dimensionality is detection power as analyzed in Example VII C.
VI. Performance evaluation criteria
How can causal discovery methods, such as those described in Sec. V be evaluated? Typically, we want to know which method performs best on data from the kind of system we wish to study. Ideally, we would like to compare different methods on a data sample where the underlying causal truth is known or evaluate methods by experimentally manipulating a system, i.e., actually performing the do-experiment (Pearl, 2000) mentioned in the introduction which forms the theoretical basis of the present concept of causality. Since both of these options are mostly not available, an alternative is to construct synthetic model data where the underlying ground truth is known. These can then be used to study the performance of causal methods for realistic finite sample situations.
A. Models
To evaluate causal methods on synthetic data, several aspects for constructing model systems are relevant:
Model realism: The model systems should mimic the domain-specific properties of real data in terms of nonlinearity, autocorrelation, spectral properties, noise structure (dynamical as well as observational), etc.
Model diversity: To avoid biased conclusions, a large number of different randomly selected connectivity structures should be tested [including link density as well as properties such as small-worldness (Watts and Strogatz, 1998)]. For example, the aforementioned forward-selection approach failed for the example shown in Fig. 11 but works for many other graphs. But also consistent methods may have biases for finite samples as studied in Runge et al. (2018).
Model dimensionality: As studied in Fig. 16, a method may perform well only for a small number of variables and the performance for high-dimensional settings (large networks) can quickly degrade.
Sample sizes: The comparative performance of different methods may vary widely for different sample sizes, which needs to be studied if no uniform consistency results are available.
B. Metrics
The performance of a causal method on a single realization of a model does not allow for reliable conclusions. Therefore, each model needs to be evaluated from many realizations. Then the most straightforward evaluation metric is to measure false positive rates and true positive rates for a given significance level $\alpha$, as shown in the comparison studies in Sec. VII. The number of realizations $M$ should be chosen high enough since the standard error of a false or true positive rate $p$ estimated from $M$ realizations is given by $\sqrt{p(1-p)/M}$. Alternative evaluation metrics that do not depend on a particular significance level but directly on the $p$-values are the Kullback-Leibler divergence to evaluate whether the $p$-values are uniformly distributed (to measure how well-calibrated a test is) and the Area Under the Power Curve (AUPC) to evaluate true positives.
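Given boolean arrays of true links and of links deemed significant across many realizations, these rates and their errors are straightforward to compute; a sketch (array shapes and conventions are illustrative):

```python
import numpy as np

def rates_with_error(true_links, decisions):
    """true_links, decisions: boolean arrays of shape (M, ...) over M realizations,
    marking existing links and links deemed significant, respectively."""
    M = true_links.shape[0]
    # per-realization detection and false positive rates
    tpr = np.array([(d & t).sum() / t.sum() for d, t in zip(decisions, true_links)])
    fpr = np.array([(d & ~t).sum() / (~t).sum() for d, t in zip(decisions, true_links)])
    # binomial standard error of a rate estimated from M realizations
    se = lambda r: np.sqrt(r.mean() * (1 - r.mean()) / M)
    return tpr.mean(), se(tpr), fpr.mean(), se(fpr)
```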
Next to the true and false positives of a causal method for finite samples, another performance criterion is computational runtime, though this may strongly depend on a given implementation.
VII. Comparison studies
In this section, we compare several common causal discovery methods in three numerical comparison studies highlighting the effect of dynamical noise in deterministic chaotic systems, autocorrelation, and high dimensionality.
A. Dynamical noise in deterministic chaotic systems
In Example 6, we studied a static example of determinism. Here, we evaluate the effect of dynamical noise in a system of coupled chaotic logistic maps driven by uniformly distributed independent dynamical noise, where a noise parameter controls the amount of dynamical noise in the system while the deterministic map dynamics are chaotic. To evaluate true positive rates (correctly detecting the true couplings) and false positive rates (incorrect detections for any other variable pair, direction, or lag), many realizations with a fixed time series length were generated.
We compare three methods from an information-theoretic framework (FullCI, OCE, PCMCI) with convergent cross mapping (CCM, Sugihara et al., 2012) (see also Arnhold et al., 1999; Hirata et al., 2016) as a nonlinear dynamics-inspired approach. The significance of CMIs in FullCI and OCE is tested with a nearest-neighbor CMI estimator (Kraskov et al., 2004; Frenzel and Pompe, 2007; Vejmelka and Palus, 2008) in combination with a permutation test where the putative driver is randomly shuffled. PCMCI is implemented with the CMIknn independence test (Runge, 2018) as discussed in Example 12. We also evaluate a variant of PCMCI (here denoted PCMCI$_0$) where the condition on the parents of the lagged variable is dropped, so that the only difference to OCE is the condition-selection step. CCM reconstructs the variables' state spaces using lagged coordinate embedding and concludes on a causal link $X \to Y$ if points of $X$ can be well predicted using nearest neighbors in the reconstructed state space of $Y$. Note that CCM and related works only use the time series of $X$ and $Y$, with the underlying assumption that the dynamics can be reconstructed using delay embedding. All methods were evaluated at a fixed significance level. For implementation details, see Appendix B 1. In the top panel of Fig. 14, we depict the true causal graph.
Figure 14 shows that in the purely deterministic regime without dynamical noise, PCMCI has almost no power, while PCMCI$_0$ features a moderate detection rate and FullCI, OCE, and CCM almost always detect the couplings. On the other hand, CCM also has the highest number of false positives, exceeding the significance level and indicating an ill-calibrated test. FullCI, OCE, and PCMCI$_0$ also do not control false positives well in this regime, while with PCMCI they are around the expected level.
For higher dynamical noise levels interesting behavior emerges: FullCI and CCM have continuously decreasing power (and slightly decreasing false positives), dropping to almost zero for strong noise, while the power of OCE and both PCMCI versions steadily increases up to intermediate noise levels, after which power decreases again, with the PCMCI versions retaining the highest power. False positives are above the threshold for FullCI and OCE for a wide range of noise levels, while for PCMCI false positives are always well-controlled at the expected level.
How can these results be understood? CCM attempts to reconstruct the attractor manifolds underlying the variables. With more dynamical noise, this reconstruction becomes more difficult and CCM loses power. The fact that CCM does not control false positives well, especially in the deterministic regime, deserves further study. FullCI suffers from the curse of dimensionality, especially for higher noise levels.
The MCI test statistic for a coupling link estimates the momentary information transfer conditional on the parents of both the target and the lagged driver. Now since, for vanishing dynamical noise, the lagged driver is a deterministic mapping of its own parents, in theory the MCI is zero for the same reasons as in the simple deterministic model (19) above (and analogously for the other coupling direction). In practice, we can only measure the entropies at some coarse-grained level (here determined by the CMI nearest-neighbor parameter) and the deterministic dependency is never exactly recovered, leading to a non-zero MCI. In OCE and PCMCI$_0$, only the parents of the target variable are included in the condition. The fact that FullCI, despite conditioning on the whole past, also detects the link deserves further study. Possible explanations are an ill-calibrated test or dynamical properties of the logistic-map system. In summary, purely deterministic dynamics here seem to generate too little momentary information, which is necessary for information-theoretic coupling detection with MCI (see also Pompe and Runge, 2011).
Given at least some dynamical noise to satisfy the Faithfulness condition, and together with the other assumptions discussed in this paper, OCE and PCMCI provably converge to the true causal graph in the limit of infinite sample size (Runge et al., 2018; Sun et al., 2015, see also Sec. V B), while no such theoretical result is available for CCM. In the infinite sample limit, also the inflated false positives of OCE due to time-dependent samples vanish. For finite samples, on the other hand, among other factors, consistency depends on how well-calibrated the significance test is. PCMCI here always yields expected levels of false positives. The inflated false positives for FullCI and OCE for a wide range of noise levels, and for PCMCI$_0$ for very low dynamical noise, are related to the way significance testing is implemented. In theory, OCE should have fewer false positives than the expected level due to the sequential testing problem (Sec. V C), but in our examples autocorrelation and the global permutation scheme lead to ill-calibrated tests with inflated false positives in each individual test, see Examples 12 and 13, leading to an overall higher false positive rate. The effect of autocorrelation is evaluated further in the next example.
B. Autocorrelation
In Fig. 15, we evaluate the previously introduced causal algorithms (Sec. V A) FullCI, OCE, and PCMCI on autocorrelated data, which is a ubiquitous feature of real world time series. The full model setup is described in Appendix B 2. In short, here we only evaluate the false positive rates (for zero coupling) and true positive rates (for non-zero coupling) of the coupling link under study (top panel of Fig. 15), where the autocorrelation strength is varied for different numbers of common drivers and the remaining coefficients are chosen such that the unconditional dependence stays the same, so that we only investigate the effect of autocorrelation. The time series length is fixed and many realizations were run for each model setup to evaluate false positive rates and true positive rates at a fixed significance level.
We compare partial correlation implementations of the following tests: (1) FullCI directly tests Definition 1. For OCE and PCMCI, we assume that the condition-selection steps already picked the correct parent sets and only test the link in the second stages of OCE and PCMCI, respectively: (2) OCE [Eq. (26)] here tests the link conditional on the parents of the target variable, given by the gray and blue boxes in the top panel of Fig. 15. (3) OCEpw with the conditioning set as for OCE, but where all variables are pre-whitened beforehand as described in Appendix B 2. (4) OCEbs with the conditioning set as for OCE, but where a block-shuffle test was used as described in Appendix B 2. (5) PCMCI tests the MCI condition [Eq. (29)] with the conditioning set given by the gray, blue, and red boxes. For all approaches, the analytical null distribution of the partial correlation test statistic was used ($t$-test), except for the block-shuffle permutation test.
In the bottom four panels of Fig. 15, we depict results for the bivariate case without common drivers and for a case with common drivers, both for varying autocorrelation strength. In the bivariate case, all approaches control the false positive rates well, except for OCE and (slightly better) OCEbs, which feature inflated false positive rates for very high autocorrelation. The reason is that when testing with the t-test, we assume i.i.d. data, but since the variables are autocorrelated, this is not the case. This false positive inflation is also seen in Fig. 14. Here, pre-whitening removes this autocorrelation and block-shuffling remedies it to some extent. PCMCI and FullCI both condition out autocorrelation and control false positives well with constant true positive levels, independent of the autocorrelation strength, while for OCE and its modifications the true positive level depends on the autocorrelation, even though the coupling coefficient is constant.
For the case with common drivers, false positive inflation becomes even more severe for OCE, and here pre-whitening does not help but leads to strongly increased false positive rates, since univariate pre-whitening is not suitable for multivariate conditional independence testing. The PCMCI approach conditions on the parents of the lagged variable, which helps to exclude autocorrelation, as shown in Runge et al. (2018), and allows us to utilize analytical null distributions that assume i.i.d. data.
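The mechanism behind this inflation is easy to reproduce: applying an i.i.d.-based correlation test to two independent but strongly autocorrelated series yields far more than the nominal 5% of false detections. The short simulation below is an illustrative sketch with our own parameter choices (autocorrelation 0.9, sample length 150, 500 trials), not the exact setup of Appendix B 2:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def ar1(a, T):
    """Standardized AR(1) process x_t = a * x_{t-1} + white noise."""
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = a * x[t - 1] + rng.standard_normal()
    return x

a, T, trials, alpha = 0.9, 150, 500, 0.05
false_positives = 0
for _ in range(trials):
    x, y = ar1(a, T), ar1(a, T)      # two independent AR(1) processes
    _, p = stats.pearsonr(x, y)      # naive test assuming i.i.d. samples
    false_positives += p < alpha
print(f"false positive rate: {false_positives / trials:.2f} (nominal {alpha})")
```

Conditioning on the lagged past of the variables (as FullCI and PCMCI do) or pre-whitening (Appendix B 3) brings this rate back toward the nominal level.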
C. Curse of dimensionality
In Fig. 16, we evaluate the causal algorithms (Sec. V A) for high-dimensional data using the same model as in Fig. 15 described in Appendix B 2. Here only FullCI, OCE, and PCMCI are compared, all of them again based on partial correlation.
As shown in Fig. 16, FullCI severely suffers from the curse of dimensionality, and the OLS solver even becomes ill-conditioned once the estimation dimension exceeds the sample size. For OCE and PCMCI, we again assume that the condition-selection algorithms selected the correct set of parents. Then the dimensionality of OCE and PCMCI increases only slightly and the power stays at higher levels.
The problem becomes even more severe for non-parametric tests such as multivariate transfer entropy. An alternative to OLS partial correlation estimation are regularization techniques such as ridge regression (Hoerl et al., 1970; Tikhonov, 1963) or the lasso (Tibshirani, 1996), but these come with other difficulties, for example regarding significance testing. These issues are further analyzed in Runge et al. (2018).
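As a rough illustration of the regularization route, the OLS residualization of the partial correlation test can be replaced by ridge regression; the sketch below uses scikit-learn's Ridge with a hypothetical penalty and is our own illustration, not the implementation analyzed in Runge et al. (2018). Note that the analytical t-test is no longer strictly valid for the regularized statistic, so significance would have to be assessed, e.g., by permutation.

```python
import numpy as np
from sklearn.linear_model import Ridge

def ridge_residual(v, z, penalty=1.0):
    """Residual of v after ridge-regressing it on the (possibly
    high-dimensional) conditioning set z of shape (n_samples, n_conditions)."""
    model = Ridge(alpha=penalty).fit(z, v)
    return v - model.predict(z)

def ridge_parcorr(x, y, z, penalty=1.0):
    """Partial-correlation-type statistic based on ridge-regularized residuals."""
    return np.corrcoef(ridge_residual(x, z, penalty),
                       ridge_residual(y, z, penalty))[0, 1]
```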
VIII. Discussion and conclusions
The long preceding list of theoretical assumptions in Sec. IV may make causal discovery seem a daunting task. In most real data applications, we will not have all common drivers measured, hence violating the Causal Sufficiency assumption. The Markov assumption may also be violated in many cases due to time aggregation. A conclusion on the existence of a causal link thus rests on a number of partially strong assumptions. So what can we learn from an estimated causal graph? Let us consider what the assumptions on the absence of a causal link are.
Note that for the practical estimation from finite data, we also need to assume that all dependencies among the lagged variables can be modeled with the conditional independence test statistic, that is, that no false negative errors occur. Nevertheless, Remark 1 rests on much weaker assumptions than the existence of a causal link since it does not require Causal Sufficiency or the Causal Markov Condition. The proof follows almost directly from the Faithfulness assumption.
Firstly, since we assume an error-free measurement process, a conditional independence measured among the observed variables carries over to the underlying process variables. By the Faithfulness assumption, this conditional independence then implies separation in the time series graph. Separation implies in particular that no direct link exists in the graph.
The second set of assumptions important for causal discovery are those underlying significance testing (Sec. V C). Failing to properly take into account autocorrelation, or using too simple permutation schemes, leads to ill-calibrated significance tests with false positive rates inflated beyond those expected from the significance level (see the comparison studies in Sec. VII). Next to the theoretical causal assumptions, the statistical reliability of reconstructed networks is an important aspect for drawing causal conclusions.
This paper is intended to recapitulate the main concepts of time-lagged causal discovery from observational time series data and to accessibly illustrate important challenges. Many more challenges exist; for example, we have not considered selection bias or issues with the definition of variables, as elaborated in Spirtes et al. (2000). We also have not discussed the topic of determining causal effects (Pearl, 2000) (causal quantification) or mediation (VanderWeele, 2015; Runge et al., 2015a, 2015b), as opposed to the pure existence or absence of causal links considered here.
Our focus was on time series, which make the causal discovery problem easier in some aspects (e.g., time order can be exploited) but more difficult in others, especially regarding statistical testing. We have briefly mentioned recent works based on different sets of assumptions in the framework of structural causal modeling (Peters et al., 2017), which do not require time order. Also, many more techniques and insights from the conditional independence framework (Spirtes et al., 2000) can be utilized in the time series case. An important conclusion is that causal discovery is a very active area of research in many fields, from mathematics, computer science, and physics to the applied sciences, and methodological progress can greatly benefit from more interdisciplinary exchange.
ACKNOWLEDGMENTS
J.R. thanks C. Glymour and F. Eberhardt for comments and D. Sejdinovic, K. Zhang, and J. Peters for helpful discussions. Special thanks to C. Linstead for help with high-performance computing. J.R. received funding from a postdoctoral award by the James S. McDonnell Foundation and gratefully acknowledges the European Regional Development Fund (ERDF), the German Federal Ministry of Education and Research, and the Land Brandenburg for supporting this project by providing resources on the high-performance computer system at the Potsdam Institute for Climate Impact Research. Software is available online at https://github.com/jakobrunge/tigramite.
Appendix A: Inconsistent causal algorithm
For the example graph shown in Fig. 11, consider the following decomposition of the mutual information (MI) via the chain rule (Cover and Thomas, 2006):
Alternatively, one can decompose the same MI as
where the last term vanishes because of a separation in the graph (Markov condition). From these two equations, a condition follows under which
the wrong parent has a higher MI than the true parent and, hence, would be selected with a pure forward-selection scheme.
Appendix B: Implementation details for numerical examples
B.1. Dynamical noise model
CCM was estimated with a fixed embedding dimension and the "ebisuzaki" surrogate test with 500 surrogates using the R-package rEDM. CCM requires two criteria (Sugihara et al., 2012): a significant CCM value at the given library size and an increasing CCM value over increasing library length. As a p-value of CCM, we thus combine the p-value of CCM at the given library size with the p-value for the hypothesis of an increasing linear trend. OCE was estimated with a threshold parameter in the forward step, a CMI nearest-neighbor parameter, and permutation surrogates. FullCI was also estimated with a CMI nearest-neighbor parameter and permutation surrogates. PCMCI was implemented with the CMIknn conditional independence test and permutation surrogates (Runge, 2018). PCMCI was run without restricting the number of parents in the MCI step, while in a second variant only the parents of the non-lagged variable were included. For each noise level, we ran the four methods on 200 time series realizations of the model. We compute as true positives the average rates at which the two coupling links were detected from the 200 realizations at a 5% significance level. We calculate as false positives the average rates for lags and variable pairs where there is no link. Since we use a 5% significance level, a well-calibrated test should yield 5% false positives. The edge color and width of the graphs in Fig. 14 correspond to the lag with maximum CMI/CCM value.
B.2. Model for examples on autocorrelation and high-dimensionality
The model time series graph is depicted in Fig. 15. In the model setup, all links correspond to linear dependencies. The autocorrelation is the same for all variables. The coupling coefficient of the link of interest is zero to test false positive rates and nonzero to test true positive rates. In the nonzero case, its value is chosen such that in the corresponding bivariate model without autocorrelation we obtain a constant mutual information for the linear coupling as estimated with partial correlation. On the other hand, the common driver forcing coefficient and the covariance among the drivers are chosen such that in the full model every pair of commonly driven variables has the same mutual information, corresponding to a relatively strong forcing in terms of correlation.
This setup guarantees that the unconditional dependence stays the same, so that we only investigate the effect of increasing autocorrelation and higher dimensionality. The noise terms are additive Gaussian. Without common drivers, the model setup corresponds to two autocorrelated processes without common driver forcing. A fixed sample length was used for partial correlation, and a number of realizations were evaluated to assess false and true positive rates.
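For concreteness, a data-generating sketch of this kind of setup could look as follows; the coefficient values (autocorrelation a, coupling c, forcing f) are our own illustrative choices, the driver covariance is ignored for simplicity, and none of this reproduces the exact values used for Figs. 15 and 16:

```python
import numpy as np

def common_driver_model(T=500, a=0.8, c=0.2, f=0.5, n_drivers=2, seed=None):
    """Two autocorrelated processes X and Y, both forced by n_drivers
    autocorrelated common drivers W; c is the X -> Y coupling (set c = 0 to
    evaluate false positives), a the autocorrelation, f the forcing strength."""
    rng = np.random.default_rng(seed)
    W = np.zeros((T, n_drivers))
    X = np.zeros(T)
    Y = np.zeros(T)
    for t in range(1, T):
        W[t] = a * W[t - 1] + rng.standard_normal(n_drivers)
        X[t] = a * X[t - 1] + f * W[t - 1].sum() + rng.standard_normal()
        Y[t] = a * Y[t - 1] + c * X[t - 1] + f * W[t - 1].sum() + rng.standard_normal()
    return X, Y, W
```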
B.3. Pre-whitening and block-shuffling
We also tested the OCE test using a pre-whitening step (OCEpw) and a block-shuffle permutation test (OCEbs). For the pre-whitening test, we preprocessed all time series by estimating the univariate lag-1 autocorrelation coefficients $\hat{a}$ and regressing out the AR(1) part of each signal, i.e., forming the residuals $\tilde{x}_t = x_t - \hat{a}\, x_{t-1}$.
Then the OCE test is applied to these residuals.
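A minimal sketch of this AR(1) pre-whitening step (our own implementation of the residualization described above) is:

```python
import numpy as np

def prewhiten_ar1(x):
    """Estimate the lag-1 autocorrelation coefficient of a univariate series
    and return the AR(1) residuals x_t - a_hat * x_{t-1}."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    a_hat = np.sum(xc[1:] * xc[:-1]) / np.sum(xc[:-1] ** 2)  # lag-1 coefficient
    return x[1:] - a_hat * x[:-1]   # residual series, one sample shorter
```

Each variable is pre-whitened separately before the OCE test is applied, which is exactly why this univariate step can fail in the multivariate common-driver case (Sec. VII B).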
Another remedy is a block-shuffle permutation test, based on block-shuffle surrogates following Peifer et al. (2005) and Mader et al. (2013). For the chosen test statistic, an ensemble of surrogate values is generated, where each surrogate is obtained by permuting non-overlapping blocks of the original time series. As the optimal block length, we use the method described in Peifer et al. (2005) and Mader et al. (2013) for non-overlapping blocks. The optimal block-length formula, Eq. (6) in Mader et al. (2013), involves the decay rate of the envelope of the autocorrelation function. The latter was estimated up to a maximum delay of 5% of the samples, and the envelope was estimated using the Hilbert transform. Then an exponentially decaying function (with a constant prefactor) was fit to the envelope to obtain the decay rate. The block length was limited to a maximum of 10% of the sample length. Finally, the surrogate values are sorted, and a p-value is obtained as the fraction of surrogates with test statistic values greater than or equal to the originally estimated value.
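The following sketch illustrates the surrogate p-value computation (our own simplified implementation; the block length is passed in rather than estimated from the autocorrelation envelope as described above, and the p-value follows the one-sided rule stated in the text):

```python
import numpy as np

def block_shuffle(x, block_length, rng):
    """Surrogate of x with non-overlapping blocks permuted."""
    n_blocks = len(x) // block_length
    blocks = x[: n_blocks * block_length].reshape(n_blocks, block_length)
    return blocks[rng.permutation(n_blocks)].ravel()

def block_shuffle_pvalue(test_statistic, x, y, block_length,
                         n_surrogates=500, seed=None):
    """Fraction of block-shuffled surrogates whose test statistic value is
    greater than or equal to the originally estimated value."""
    rng = np.random.default_rng(seed)
    n = (len(x) // block_length) * block_length  # truncate to whole blocks
    x, y = x[:n], y[:n]
    original = test_statistic(x, y)
    exceed = sum(
        test_statistic(block_shuffle(x, block_length, rng), y) >= original
        for _ in range(n_surrogates)
    )
    return exceed / n_surrogates
```

Here, test_statistic would be, for example, the absolute partial correlation evaluated on the residualized series.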