Causal network reconstruction from time series is an emerging topic in many fields of science. Beyond inferring directionality between two time series, the goal of causal network reconstruction or causal discovery is to distinguish direct from indirect dependencies and common drivers among multiple time series. Here, the problem of inferring causal networks including time lags from multivariate time series is recapitulated from the underlying causal assumptions to practical estimation problems. Each aspect is illustrated with simple examples including unobserved variables, sampling issues, determinism, stationarity, nonlinearity, measurement error, and significance testing. The effects of dynamical noise, autocorrelation, and high dimensionality are highlighted in comparison studies of common causal reconstruction methods. Finally, method performance evaluation approaches and criteria are suggested. The article is intended to briefly review and accessibly illustrate the foundations and practical problems of time series-based causal discovery and stimulate further methodological developments.

Reconstructing interaction networks from observed time series is a common problem in fields where active experiments are impossible, unethical, or expensive. Pairwise association networks, for example based on correlations, cannot be interpreted causally. The goal of causal network reconstruction or causal discovery is to distinguish direct from indirect dependencies and common drivers among multiple time series. Here, we briefly recapitulate the theoretical assumptions underlying causal discovery from time series, discuss practical estimation problems, and illustrate each aspect with accessible examples.

## I. Introduction

Reconstructing the causal relations behind the phenomena we observe is a fundamental problem in all fields of science. The traditional approach is to conduct active experiments, but in many fields such as Earth system science or neuroscience, manipulations of the complex system under study are either impossible, unethical, or very expensive. On the other hand, modern science generates an ever-growing amount of data from these systems, in particular time series data. Concurrently, novel computing hardware today allows efficient processing of massive amounts of data. These developments have led to emerging interest in the problem of reconstructing causal networks or causal discovery from observational time series.

In the past few decades, a number of original causality concepts have been developed, such as Granger causality (Granger, 1969) or transfer entropy (Schreiber, 2000). Since the 1990s, computer scientists, statisticians, and philosophers have grounded causal reasoning and inference in a robust mathematical framework (Pearl, 2000; Spirtes *et al.*, 2000). The (quite natural) definition of causality underlying this framework is that $X\u2192Y$ if and only if an intervention or manipulation in $X$ has an effect on $Y$ (Pearl, 2000; Spirtes *et al.*, 2000). This effect may be in changing $Y$’s mean or any change in its *post-interventional* distribution denoted $P[Y|do(X=x)]$ which is different from the conditional distribution $P(Y|X=x)$. Unfortunately, all we can measure from observational data are statistical dependencies. These can be visualized in a *graphical model* (Lauritzen, 1996) or *time series graph* (Eichler, 2011) that represents the *conditional independence* relations among the variables and their time lags (Fig. 1). The theory of causal discovery lays out the assumptions under which the underlying causal dependencies can be inferred from observational data.

There are different sets of assumptions that allow us to *identify* a causal graph. Here, we focus on time-lagged causal discovery in the framework of conditional independence testing using the assumptions of time-order, *Causal Sufficiency*, the *Causal Markov Condition*, and *Faithfulness*, among others, which are all discussed thoroughly in this paper. But some of these assumptions can be replaced. Recent work (Peters *et al.*, 2017) shows ways to use assumptions on the noise structure and dependency types in the framework of structural causal models which can complement the approach studied here and we will include references to recent work from this framework throughout the sections.

The paper is organized as follows: In Sec. II, we relate Granger causality and similar concepts to the conditional independence-framework (Spirtes *et al.*, 2000). Section III provides the necessary definitions and notation and in Sec. IV we recapitulate the assumptions underlying time-lagged causal discovery from time series alongside illustrative examples. The practical estimation aspect from introducing some causal discovery algorithms to significance testing is presented in Sec. V, while Sec. VI discusses suggestions for performance evaluation. In Sec. VII, we present some comparison studies of common causal methods and conclude the paper with a brief discussion (Sec. VIII). The paper is accompanied by a python jupyter notebook on https://github.com/jakobrunge/tigramite to reproduce some of the examples.

## II. From Granger causality to conditional independence

Granger (1969), based on work by Wiener (1956), was the first to propose a practical, operational definition of causality based on prediction improvement. The underlying idea of measuring whether $X$*Granger causes* $Y$ is that there is some unique information in $X$ relevant for $Y$ that is not contained in $Y$’s past as well as the past of “all the information in the universe” (Granger, 1969). In practice, typically only $Y$’s past is used (bivariate Granger causality). Measuring prediction improvement can be operationalized in different ways. The most common framework are *vector autoregressive models* (*VAR*),

where $Xt=(Xt1,\u2026,XtN)$, $\Phi (\tau )$ is the $N\xd7N$ coefficient matrix at lag $\tau $, $\tau max$ some maximum time lag, and $\eta $ denotes an independent noise term. Here, $Xi$*Granger-causes*$Xj$ if *any* of the coefficients $\Phi ji(\tau )$ at lag $\tau $ is non-zero. A non-zero $\Phi ji(\tau )$ can then be denoted as a causal link $Xt\u2212\tau i\u2192Xtj$ at lag $\tau $. Another option is to compare the residual variances of the VAR fitted with and without including the variable $Xi$. The use of VARs restricts this notion of causality to a *causality in mean* (Granger, 1969). A more general definition is that of (*bivariate*) *transfer entropy* (Schreiber, 2000; Barnett *et al.*, 2009)

where $I(X;Y\u2223Z)$ denotes the *conditional mutual information* (CMI). *Bivariate TE* is a common term, another naming option would be *bivariable TE* since $X$ and $Y$ could also be multivariate variables. Transfer entropy can also be phrased in a multivariate (or multi-variable) lag-specific version (Runge *et al.*, 2012a). Many current methods are advancements of the concept of transfer entropy (Wibral *et al.*, 2013; Staniek and Lehnertz, 2008; Vejmelka and Palus, 2008), in particular in its multivariate version (Sun and Bollt, 2014; Sun *et al.*, 2015; Runge *et al.*, 2012b; 2012a; Runge, 2015).

Tests for causality are then based on testing whether a particular CMI is greater than zero. Looking at the definition of CMI,

TEbiv and its advancements essentially test for *conditional independence* of $X$ and $Y$ given $Z$, denoted $X\u22a5\u22a5Y|Z$ since

$Z$ then represents $Y$’s past and other included variables. The lag-specific generalization of the VAR model (1) then is the *full conditional independence* (FullCI) approach

where $Xt(t\u22121,\u2026,t\u2212\tau max)=(Xt\u22121,\u2026,Xt\u2212\tau max)$. $I$ can be CMI or any other conditional dependence measure. In the case of partial correlation, a non-zero entry $\Phi ji(\tau )$ corresponds to a non-zero $Ii\u2192jFullCI(\tau )$. A general concept to represent *conditional independence* relations among multiple variables and their time lags is that of *time series graphical models* (Eichler, 2011).

## III. Definitions and notation

### A. Definition of time series graphs

Consider a multivariate process $X$ of dimension $N$. We define the *time series graph* $G=(V\xd7Z,E)$ of $X$ as follows [Fig. 1(c)]: The set of nodes in that graph consists of the set of components $V$ at each time $t\u2208Z$. That is, the graph is actually infinite, but in practice defined up to some maximum time lag $\tau max$. Compared to the general concept of graphical models (Lauritzen, 1996) for data without time-ordering, for time series graphs the time-dependence is explicitly used to define directional links in the set of edges $E$ (Eichler, 2011). For convenience, we treat $X$, $Xt$, and $Xt\u2212$ as sets of random variables here and use the difference symbol “$\u2216$” for sets.

(Definition of links).

*Variables*$Xt\u2212\tau i$

*and*$Xtj$

*are connected by a lag-specific*directed link “$Xt\u2212\tau i\u2192Xtj$” $\u2208G$ [Fig. 1(c)]

*for*$\tau >0$

*if and only if*

*i.e., if they are*not

*independent conditionally on the past of the whole process denoted by*$Xt\u2212=(Xt\u22121,Xt\u22122,\u2026)$,

*which implies a lag-specific conditional dependence with respect to*$X$.

For $i=j$, we call links *autodependencies*. For $\tau =0$, the same definition can also be used to define undirected contemporaneous links (see Eichler, 2011; Runge, 2015), which would lead to the time series graph being a *mixed graph* instead of a *directed acyclic graph*. The arrow of time is a convenient way to disambiguate independence relationships: If we do not have access to the time ordering of the variables (or there is none) and observe as the only conditional independence relation $X\u22a5\u22a5Y\u2223Z$ while all other relations are dependent, then this relation can be generated by any of the three causal motifs $X\u2192Z\u2192Y$, $Y\u2192Z\u2192X$, or $X\u2190Z\u2192Y$ which form a *Markov equivalence class*.

Here, we define links with lags $t\u2212\tau $ relative to some time point $t$, but throughout this paper we assume stationarity (discussed in Sec. IV E). Then a link is repeated for every $t\u2032<t$ if a link exists at time $t$. Alternatively, the links can be estimated from different realizations at time $t$. In practice, however, the links will mostly be estimated from single time series realizations requiring stationarity.

The parents of a node $Xtj$ are defined as

In the following, $A,B,S$ denote nodes or sets of nodes in the graph and $Xt\u2212\tau ,Yt,Zt\u2212\tau ,Ut\u2212\tau ,Z$ random variables of the process $X$, sometimes dropping the subscript. We will denote a general conditional dependence measure as $I(X;Y|Z)$ which can be CMI or also some other measure such as partial correlation, depending on the context.

### B. Separation

When considering the dependency between two variables $X,Y$ given a set of conditions $Z$ as in $I(X;Y|Z)$, the idea of open and blocked paths or separation between the corresponding nodes in the time series graph $G$ [Fig. 1(c)] is important. A *directed path* is a sequence of linked nodes containing only motifs $\u2192\u2219\u2192$. But there are also other paths on which information is shared even though no causal interventions could “travel” along these. In general (Eichler, 2011), in the above defined time series graph, a path between two single nodes $A$ and $B$ is called *open* if it contains only the motifs $\u2192\u2219\u2192$ or $\u2190\u2219\u2192$. For notational convenience, we will sometimes use left-pointing arrows, while still in the time series graph all directed links are forward in time. For example, $\u2219\u2190\u2219\u2192\u2219\u2192\u2219$ is an open path. On the other hand, if *any motif* on a path is $\u2192\u2219\u2190$, the path is blocked. Nodes in such motifs are also called *colliders*. If we now consider a *separating or conditioning set $S$*, openness and blockedness of motifs reverse, i.e., denoting a conditioned node by $\u25fc$, the motifs $\u2192\u25fc\u2192$ and $\u2190\u25fc\u2192$, are blocked and the motif $\u2192\u25fc\u2190$ becomes open. For example, the path $\u2219\u2190\u25fc\u2192\u2219\u2192\u2219$ is blocked, while $\u2219\u2190\u2219\u2192\u25fc\u2190\u2219$ is open. Note that paths can also traverse links repeatedly, e.g., forward and backward.

(Separation).

*Two nodes*$A$

*and*$B$

*are*separated given a conditioning set $S$

*with*$A,B\u2209S$ ($S$

*may also be empty*)

*if*all paths

*between*$A$

*and*$B$

*are blocked, denoted*

*Conversely, two nodes are*connected given a set $S$

*if*at least one path

*between the two is open*.

Intuitively, if two nodes are separated, no information is shared between the two. For example, in Fig. 1(c) $Xt1$ and $Xt\u221214$ are separated by $S={Xt\u221222,Xt\u221224}$ or also by $S={Xt\u221211,Xt\u221222}$. Then $I(Xt\u221214;Xt1|S)=0$. Conversely, $Xt1$ and $Xt\u221233$ are still connected given $S={Xt\u221222}$ since the “back door”-path $Xt\u221233\u2190Xt\u221243\u2192Xt\u221232\u2192Xt\u221211\u2192Xt1$ is still open. These definitions are important for the relations between the graph and the underlying process.

## IV. Assumptions of causal discovery from observational time series

Causal information cannot be obtained from associations of measured variables without some assumptions. A variety of different assumptions have been shown to be sufficient to estimate the true causal graph (Spirtes *et al.*, 2000; Peters *et al.*, 2017). Here, we focus on three main assumptions under which the time series graph represents causal relations: *Causal Sufficiency*, the *Causal Markov Condition*, and *Faithfulness*. For time-lagged causal discovery from observational time series, we also need the assumptions of *no instantaneous effects* and *stationarity*. Further added to these are *dependence type assumptions* (e.g., linear or nonlinear) and *no measurement error*, and we will also assume that the joint distribution of the process has a positive density. All of these are discussed in the following.

### A. Causal sufficiency

As Granger (1969) already notes, “[t]he one completely unreal aspect of the above definitions is the use of the series $Ut\u2212$ representing all available information [in the universe].” This definition makes sure that the measured variables include all of the common causes of $X$ and $Y$. However, we do not always need the whole universe. If we have available only a limited number of measured variables, we only need to assume that there exist no other unobserved (or latent) variables that directly or indirectly influence any other pair of our set of variables which is the assumption of *Causal Sufficiency* (Spirtes *et al.*, 2000).

(Causal Sufficiency).

*A set* $W\u2282V\xd7Z$ *of variables is* causally sufficient *for a process* $X$ *if and only if in the process every common cause of any two or more variables in* $W$ *is in* $W$ *or has the same value for all units in the population*.

**(Unobserved variables)**

What happens if such an unobserved (latent) confounder exists? Consider the example depicted in Fig. 2 where we assume no autodependencies. Here, $U$ is an unobserved variable and drives both $X$ and $Z$. With the time lags considered, this common driver leads to an association between $X$ and $Z$. The *estimated* time series graph [via Eq. (6)] among the observed variables $(X,Y,Z)$ will then contain a link $Xt\u22121\u2192Zt$ even though it is a spurious association and any manipulation of $X$ would not have any effect on $Z$. Here, $U$ acts as a direct common driver leading to an induced association. But the estimated time series graph additionally contains a link $Yt\u22122\u2192Zt$ even though there is no direct confounder between $Y$ and $Z$. The reason is that $Yt\u22122\u29f8\u22a5\u22a5Zt\u2223Xt\u2212$ because the path $Yt\u22122\u2192Xt\u22121\u2190Ut\u22122\u2192Zt$ is open through the condition on $Xt\u22121\u2208Xt\u2212$ (see Definition 2).

The *Fast Causal Discovery* (FCI) algorithm (Spirtes *et al.*, 2000; Zhang, 2008) does not assume causal sufficiency and allows us to partially identify which links are spurious due to unobserved confounders and also for which links confoundedness cannot be determined. The underlying idea is that if conditional independence holds between two variables for *any subset* (including the empty set) of $W$, then these variables are not linked. In the example above, this idea can be used to remove the link $Yt\u22122\u2192Zt$ since $Yt\u22122\u22a5\u22a5Zt$, i.e., $Yt\u22122$ and $Zt$ are unconditionally independent. Latent causal discovery is further addressed, for example, in Entner and Hoyer (2010), Eichler (2013), Ramb *et al.* (2013), Smirnov (2013), Hyttinen *et al.* (2014), and Geiger *et al.* (2015).

**(Sub-sampled time series)**

Causal sufficiency can also be violated if all variables are observed, but they are sampled at too coarse time intervals relative to the causal links. Consider a process with time series graph depicted in the top panel of Fig. 3 featuring a causal loop $X\u2192Y\u2192Z\u2192X$ with all causal links at lag $\tau =1$. If we sub-sample the time series with an original resolution of $\Delta t=1$ at $\Delta t=2$, we would estimate the causal graph from $(X~,Y~,Z~)$ as shown in the bottom panel of Fig. 3 that has a completely reversed causal loop. Looking at the top panel time series graph again, this spurious reversal can be understood: For example, in the path $Zt\u22122\u2192Xt\u22121\u2192Yt$ the node $Xt\u22121$ is not sampled and, thus, unobserved, leading to a spurious link $Z~t\u22121\u2192Y~t$ in the sub-sampled time series graph in the bottom panel (note that $t$ is measured with twice the sampling rate then). Sub-sampled time series are an active area of research (Smirnov, 2013; Barnett and Seth, 2015; Spirtes and Zhang, 2016), to some extent sub-sampled time series graphs can be identified as addressed in Gong *et al.* (2015) and Hyttinen *et al.* (2016).

### B. Causal Markov condition

All independence-based causal discovery methods necessitate the Causal Markov Condition (Spirtes *et al.*, 2000) which constitutes a close relationship between the process $X$ and its graph $G$.

(Causal Markov Condition).

*The joint distribution of a time series process*$X$

*with graph*$G$

*fulfills the Causal Markov Condition if and only if for all*$Yt\u2208Xt$

*with parents*$PYt$

*in the graph*

*that is, from separation in the graph*(

*since the parents*$PYt$

*separate*$Yt$

*from*$Xt\u2212\u2216PYt$

*in the graph*)

*follows independence*.

*This includes its contraposition*

*from dependence follows connectedness*.

Note that if Causal Sufficiency is not fulfilled, then also generally the Markov condition will not hold (Spirtes *et al.*, 2000). Intuitively, the Causal Markov Condition implies that once we know the values of $Yt$’s parents, all other variables in the past ($t\u2212\tau $ for $\tau >0$) become irrelevant for predicting $Yt$. Of course, $Y$’s descendants at future time points can also “predict” $Yt$.

**(Non-markovian processes)**

**(Time aggregation)**

Another example where noise terms become dependent is time-aggregation. Consider the causal chain of processes $Xt\u22122\u2192Yt\u22121\u2192Zt$ shown in the top panel of Fig. 5. Time aggregation of a time series realization with $\Delta t=2$ is here done by constructing the new time series $X~t=12(Xt+Xt\u22121)\u2200t$ and correspondingly for $Y,Z$. Now, we observe additional contemporaneous links next to the directed links $X~t\u22121\u2192Y~t$ and $Y~t\u22121\u2192Z~t$ in the time series graph of the aggregated process due to the too coarse time resolution. But, furthermore, we also get spurious directed links, for example between $X~$ and $Z~$. In general, the causal structure of an aggregated process may be very different from the original process.

Time aggregation is an important issue in many applied fields. For example, in climate research time series are frequently measured daily and then aggregated to a monthly scale to investigate dependencies (Runge *et al.*, 2014). Ideally, the time resolution is at least as short as the shortest causal time lag (assuming no instantaneous effects, see Sec. IV D). See Breitung and Swanson (2002) and Barnett and Seth (2015) for a discussion on temporal aggregation in time series models. Ignoring time order, in some cases the recent methods discussed in Peters *et al.* (2017) can help.

### C. Faithfulness

The Causal Markov Condition guarantees that separation in the graph implies independence in the process. But what can be concluded from an estimated conditional independence relation, that is, the reverse direction? *Faithfulness* guarantees that the graph entails *all* conditional independence relations that are implied by the Markov condition.

(Faithfulness).

*The joint distribution of a time series process*$X$

*with graph*$G$

*fulfills the Faithfulness condition if and only if for all disjoint subsets of nodes*(

*or single nodes*) $A,B,S\u2282G$

*it holds that*

*that is, from independence follows separation, which includes its logical contraposition*

*from connectedness follows dependence*.

The combination of Faithfulness and the Markov property implies that $A\u22c8B\u2223S\u21d4XA\u22a5\u22a5XB\u2223XS$ and its logical contraposition $A\u29f8\u22c8B\u2223S\u21d4XA\u29f8\u22a5\u22a5XB\u2223XS$. Both conditions are an important assumption for causal discovery algorithms as discussed in Spirtes *et al.* (2000). Intuitively, Faithfulness together with the Causal Markov Condition allow us to conclude that (in the limit of infinite sample size) a measured statistical dependency is actually due to some (not necessarily direct) causal mechanism and, conversely, a measured independence (given any set of conditions) implies that no *direct* causal mechanism exists (see also Remark 1 in the Discussion).

**(Counteracting mechanisms)**

*et al.*, 2000). One can, thus, argue that non-faithful distributions arise from an unrealistic fine-tuning of dependence parameters. However,

*approximately*vanishing partial correlations despite connectedness in the graph can also occur for a distribution that is faithful, but

*almost*unfaithful, if we have only a limited sample size available as discussed in Uhler

*et al.*(2013). An example of an unfaithfully fine-tuned process is

**(Determinism)**

One may argue that we live in a deterministic world and the assumption of “an independent noise term” that is pertinent to statistics is unrealistic. On the other hand, for a given observed variable $Y$, the complexity of the underlying processes will almost always imply that $Y$ does not deterministically depend on its parents, but some unresolved processes constitute “intrinsic” or “dynamical” noise that is also driving $Y$.

*et al.*, 2000) or using structural causal models in Janzing

*et al.*(2012) and Daniusis

*et al.*(2012).

The former example only illustrated a static case of determinism. The field of nonlinear dynamics studies the properties of nonlinear and chaotic dynamical processes which has led to a plethora of nonlinear time series analysis methods (Kantz and Schreiber, 2003), often from an information-theoretic perspective (Hlavácková-Schindler *et al.*, 2007) including transfer entropy. Many of these methods built on the assumption that no system is perfectly deterministic, for example, due to the coarse-graining of the system’s phase-space in the measurement process. In Sec. VII A, we study the effect of dynamical noise on several common time-series based causal discovery approaches for chaotic systems.

**(Non-pairwise dependencies)**

Next to the fine-tuned example on counteracting mechanisms, Faithfulness can also be violated for a dependency realized in an XOR-gate. Suppose, as shown in Fig. 7 among several “faithful” pairwise dependencies, we have that $Y=X1\u2295X2+\eta Y$ with $X1,X2$ being binary random processes with equal probability for all values of $X1$ and $X2$. Then $I(X1;Y)=I(X2;Y)=0$ even though both are connected to $Y$—a violation of Faithfulness. Here, the full information is *synergistically* contained in $I[(X1,X2);Y]$. Note that from the chain rule it follows that $0<I[(X1,X2);Y]=I(X1;Y)+I(X2;Y\u2223X1)=I(X2;Y\u2223X1)$. Thus, the MI is zero, but the CMI is not and there is, hence, a link in the time series graph. Note that already if $P(X1)\u2260P(X2)$, Faithfulness is not violated anymore as analyzed in Sun *et al*. (2015), showing that Faithfulness violations are rather pathological.

Another form of synergy without violation of Faithfulness is the case that $Y=X1X2+\eta Y$, where we have a rather weak MI $I(X1;Y)$, but again a much larger $I[(X1,X2);Y]$. In Runge *et al.* (2015a), synergy is analyzed in the context of optimal prediction schemes.

As pointed out in James *et al.* (2016) for synergistic dependencies, the problem is that the concept of a pairwise dependency graphical model does not apply, but hyper-graphs are needed to represent such dependencies. Causal discovery of such graphs, however, carries the problem of combinatorial explosion if links between sets of nodes are considered.

### D. Instantaneous effects

Granger causality and the definition of time series graphs are examples for lagged definitions of causality. To guarantee that the lagged parents defined in Eq. (8) are sufficient for the Causal Markov Condition to hold, we need to assume that there are *no instantaneous (contemporaneous) causal effects*, i.e., $Xti\u2192Xtj$. One may argue that causality between dynamical systems cannot have instantaneous effects because the speed of light is finite and, if the process is sampled with sufficient resolution (otherwise, see the Examples of sub-sampling and aggregation), we only need to consider lagged causal effects. However, we often do not have a sufficiently sampled time series. Here, recent developments in causal inference theory (Zhang and Hyvärinen, 2009; Peters *et al.*, 2013; Lopez-Paz *et al.*, 2015; Spirtes and Zhang, 2016; Peters *et al.*, 2017) address instantaneous causality within the framework of structural causal models which can be applied to determine causal directionality for contemporaneous links. These models work under assumptions on the noise in the model such as non-Gaussianity. Also, the logical causal orientation rules of the causal discovery approaches in Spirtes *et al.* (2000) can be used to partially orient contemporaneous links.

### E. Stationarity

To estimate the time series graph defined in Definition 1 from time series data, we assume stationarity. Another option would be to utilize independent ensembles of realizations of lagged processes. Here, we define stationarity with respect to a time index set $T$. For example, $T$ can contain all time indices belonging to a certain regime of a dynamical process, e.g., only winter months in the climate system.

(Causal stationarity).

*The time series process*$X$

*with graph defined in Definition 1 is called*causally stationary over a time index set $T$

*if and only if for all links*$Xt\u2212\tau i\u2192Xtj$

*in the graph*

This constitutes actually a weaker form of stationarity than the common definition of stationarity in mean, variance, spectral properties, or of the value of individual coefficients in a linear model. For example, one could require that all CMIs are stationary,

which is a much stronger statement. The strength of causal mechanisms may fluctuate over time and the causal stationarity assumption only requires conditional independence to be stationary.

**(Non-stationarity due to confounding)**

Consider the data shown in Fig. 8 and suppose we first only have access to the variables $(X,Y,Z)$. Clearly, the time series of this subprocess are nonstationary in a classical sense, varying over time not only in their mean but also in their spectral properties. Estimating the time series graph on these three variables results in the graph shown in the bottom left panel of Fig. 8, where the common nonstationary trend leads to an almost fully connected graph.

A typical example of a common nonstationarity, albeit not the same as in our example, is found in climate time series which are usually all driven by solar forcing leading to a common seasonal signal. In climate research the time series are typically *anomalized*, that is, the seasonal signal is estimated and subtracted from the data (Storch and Zwiers, 1999) which is equivalent to regressing out its influence. But this is not always possible, in our example the common signal is not purely periodic and cannot easily be estimated from the data. Another option for the case of piecewise stationary processes is to include background knowledge on the stationary regimes and estimate the graphs separately for the stationary subsets of $T$. For example, the climatic seasons El Niño and La Niña lead to different causal directions of surface temperature anomalies in the tropical Pacific (Philander, 1985). Prior knowledge of when the seasons start and end allow us to restrict the estimation of time series graphs to samples within a particular season.

Now suppose we actually have access to the common signal $U$ and include it in our analysis (but without estimating the parents of $U$, i.e., treating $U$ as exogenous). Then, as shown in the bottom right panel of Fig. 8, we can recover the true causal structure where $U$ is a confounder of the three variables while they are connected through the motif $X\u2190Z\u2192Y$. Thus, here nonstationarity is a result of confounding and can be removed if we have access to the underlying trend.

The key point is that the causal structure, that is, the time series graph, of the whole process $(X,Y,Z,U)$ *is* invariant in time. One may argue that causal laws are generally invariant and non-stationarity is simply a problem of violation of causal sufficiency. The idea of finding invariant predictors for causal inference is explored in Peters *et al.* (2016).

### F. Dependency type assumptions

To test conditional independence hypotheses $X\u22a5\u22a5Y\u2223Z$, different test statistics can be utilized. These are typically based on making certain assumptions about the type of the underlying dependency structure. While classical statistical methods are often based on the assumption of linearity (which allows us to derive rigorous results), modern statistics, the physics community, and the recent field of machine learning have developed non-parametric or model-free methods that allow us to better capture the nonlinear reality of many dynamical complex systems—at the cost of weaker theoretical results. Conditional independence testing can be classified into regression-based and model-free approaches. Here, we only discuss tests for continuously valued variables, for discrete variables one can, for example, use methods based on contingency tables (Spirtes *et al.*, 2000) or discrete CMI estimation (Cover and Thomas, 2006).

Regression-based conditional independence tests of $X\u22a5\u22a5Y|Z$ are based on first regressing out the influence of $Z$ from $X$ and $Y$ and then testing the dependence between the residuals. We first fit a model assuming

for centered variables $X,Y$ and independent and identically normally distributed $\u03f5X,Y$. Now further restrictions can be laid upon the functional form of $fX,Y$. For example, the partial correlation test assumes linearity, while a non-parametric regression can be based on Gaussian Process regression (Rasmussen and Williams, 2006).

Secondly, from the estimated functions $f^$, the residuals are formed as

Finally, the dependence between the residuals can be tested with different pairwise association tests. For partial correlation this is a $t$-test, while the dependence between the residuals of a non-parametric regression can be tested with non-parametric tests (Gretton *et al.*, 2008; Székely *et al.*, 2007) such as the *distance correlation coefficient*$R(rX,rY)$ (Székely *et al.*, 2007) (see also Sec. V C). Note that these models all make parametric assumptions and, thus, do not estimate conditional independence in its most general form.

The other extreme to partial correlation are model-free methods that directly test conditional independence. The most prominent test statistic is CMI as defined in Eq. (3), for which non-parametric estimators based on nearest-neighbor statistics exist (Kraskov *et al.*, 2004; Frenzel and Pompe, 2007; Vejmelka and Palus, 2008; Póczos and Schneider, 2012) [see also Gao *et al.* (2015) and Lord *et al.* (2018) for recent progress on nearest-neighbor entropy estimators]. Other possible conditional independence tests are Kernel Conditional Independence Tests (Zhang *et al.*, 2011; Strobl *et al.*, 2017) which essentially test for zero Hilbert-Schmidt norm of the partial cross-covariance operator or *conditional distance correlation* (Wang *et al.*, 2015). Some new recent tests are based on neural networks (Sen *et al.*, 2017) or decision tree regression (Chalupka *et al.*, 2018). In Runge (2018), a conditional independence test based on CMI is introduced.

**(Nonlinearity)**

For the linear case (first row in Fig. 9), we consider $f\u22c5=cZt\u22121,2+\eta t\u22c5$ and the regression-based techniques correctly fit the dependencies of the pairs $(X,Z)$ and $(Y,Z)$ (red fit lines in gray scatterplots), and, thus, correctly identify the independence of the residuals (black scatterplot). A model-free test also correctly identifies the common driver motif here.

For the nonlinear additive noise case with a quadratic dependency $f\u22c5=c\u22c5Zt\u22121,22+\eta t\u22c5$, partial correlation cannot fit the dependencies of the pairs $(X,Z)$ and $(Y,Z)$. As a result, the residuals are still correlated (spurious gray dashed link) and the causal graph is completely wrong: We overlook the links $X\u2192Z$ and $Y\u2192Z$ and get a false positive $X\u2192Y$. Since here the dependencies are still additive functions, non-parametric regressions and model-free tests yield a correct causal graph.

Finally, if the dependencies are multiplicative (bottom row) as in $f\u22c5=c\u22c5Zt\u22121,2\u22c5\eta t\u22c5$, both regression methods fail. Then the residuals are nonlinearly related which is not detected with a partial correlation test (here two errors somewhat cancel each other out). A non-parametric test on the residuals, on the other hand, then wrongly estimates the spurious link $X\u2192Y$ (gray dashed in center bottom row).

Model-free methods in principle can deal with all these cases, which might lead to the conclusion that they are superior. But the “no-free-lunch-theorem” tells us that such generality has a price and model-free methods are very data-hungry and computationally expensive. If expert knowledge pointing to a linear or otherwise parametric dependency is available, then regression-based methods will typically greatly outperform model-free methods.

### G. Measurement error

Measurement error, unlike dynamical noise, contaminates the variables between which we seek to reconstruct dependencies and constitutes a difficult problem in causal network reconstruction (Scheines and Ramsey, 2016).

**(Observational noise)**

Here, we only discuss measurement error in its simple form as observational noise, which can be modeled as $Z~=Z+\u03f5Z$. Such observational noise presents at least two sorts of problems for causal discovery.

Firstly, observational noise attenuates true associations and, therefore, lowers detection power. This is because in general $I(X~;Y~)=I(X+\u03f5X;Y+\u03f5Y)\u2264I(X;Y)$ which is a consequence of the data processing inequality (Cover and Thomas, 2006): Manipulating a variable can only reduce its information content. In Fig. 10, we added normal observational noise with $\sigma =20$ to $Z$. Then the links $Z\u2192X$ and $Z\u2192Y$ cannot be reconstructed anymore. Secondly, too much noise on conditioning variables makes it impossible to preserve conditional independence. In Fig. 10, we have $I(X;Y\u2223Z~)=I(X;Y\u2223Z+\u03f5Z)>0$ even though $I(X;Y\u2223Z)=0$. The effect of observational noise on causal discovery is discussed further in Smirnov (2013); Scheines and Ramsey (2016) and Runge *et al.* (2018) give some numerical experiments for high-dimensional causal discovery.

## V. Practical estimation

The previous sections concerned fundamental assumptions of time-lagged causal discovery based on conditional independence relations. We now turn to the topic of practical estimation where we introduce several common causal discovery methods and discuss their consistency, significance testing, and computational complexity. We restrict the analysis to the class of conditional independence approaches, which flexibly allows us to use different independence tests. But graphical models, in general, can also be estimated with score-based Bayesian methods, e.g., the max-min hill-climbing algorithm (Tsamardinos *et al.*, 2006).

### A. Causal discovery algorithms

#### 1. Granger causality/transfer entropy/FullCI

Transfer entropy, as introduced by Schreiber (2000), is a direct information-theoretic implementation of Granger causality (Barnett *et al.*, 2009). In a lag-specific implementation, as given by FullCI [Eq. (6)], it tests for conditional independence between each $Xt\u2212\tau i$ and $Xtj$ conditioned on the entire past $Xt(t\u22121,\u2026,t\u2212\tau max)$ (excluding $Xt\u2212\tau i$). As the experiments in Sec. VII will demonstrate, this approach strongly suffers from the curse of dimensionality.

#### 2. Optimal causation entropy

Sun and Bollt (2014) and Sun *et al.* (2015) developed a discovery algorithm based on the information-theoretic *optimal causation entropy* principle (algorithm abbreviated as OCE) which reconstructs the lagged parents of a variable $Xtj$ by an iterative procedure alleviating the curse of dimensionality: Starting with an empty parent set $P^OCE(Xtj)=\u2205$, first the MIs $I(Xt\u2212\tau i;Xtj)$ for all $Xt\u2212\tau i\u2208Xt\u2212$ are estimated. As the first parent $X(1)$, the one with the largest MI with $Xtj$ is selected. The next parent $X(2)$, however, is chosen according to the largest CMI $I(Xt\u2212\tau i;Xtj|X(1))$ among all remaining variables, the third parent is the one with largest CMI conditional on the two previously selected parents, etc. The process is continued until the CMI of a selected parent is non-significant. This forward-selection stage is followed by a backward elimination whereby the significance of each of the parents $Xt\u2212\tau i\u2208P^OCE(Xtj)$ is tested conditional on the remaining parents:

The significance of CMIs can be tested with a nearest-neighbor CMI estimator (Kraskov *et al.*, 2004; Frenzel and Pompe, 2007; Vejmelka and Palus, 2008) in combination with a permutation test where $Xt\u2212\tau i$ is randomly shuffled. Of course, the conditional independencies in Eq. (26) can also be tested with other test statistics.

#### 3. PC algorithm

An alternative to this forward-backward scheme is the PC algorithm (named after its inventors Peter and Clark) (Spirtes and Glymour, 1991). The original PC algorithm was formulated for general random variables without assuming a time order. It consists of several phases where first, in the skeleton-discovery phase, an undirected graphical model (Lauritzen, 1996) is estimated whose links are then oriented using a set of logical rules (Spirtes and Glymour, 1991; Spirtes *et al.*, 2000). A later improvement led to the more robust modification called PC-stable (Colombo and Maathuis, 2014).

For the case of time series, we can use the information of time order which naturally provides an orientation rule for links. The algorithm then is as follows: For every variable $Xtj\u2208Xt$ it starts by initializing the preliminary parents $P^(Xtj)=(Xt\u22121,Xt\u22122,\u2026,Xt\u2212\tau max)$. In the first iteration ($p=0$), we remove a variable $Xt\u2212\tau i$ from $P^(Xtj)$ if the null hypothesis

cannot be rejected at a significance threshold $\alpha $. Then, in each next iteration, we increase $p\u2192p+1$ and remove a variable $Xt\u2212\tau i$ from $P^(Xtj)$ if *any* of the null hypotheses

cannot be rejected, where $S$ iterates (in an inner loop) through all combinations of subsets $S\u2286P^(Xtj)\u2216{Xt\u2212\tau i}$ with cardinality $p$. The algorithm converges for a link $Xt\u2212\tau i\u2192Xtj$ once $S=P^(Xtj)\u2216{Xt\u2212\tau i}$ and the null hypothesis $Xt\u2212\tau i\u22a5\u22a5Xtj\u2223P^(Xtj)\u2216{Xt\u2212\tau i}$ is rejected (if the null hypothesis cannot be rejected, the link is removed). Runge *et al.* (2018) provide pseudo-code for this algorithm.

The forward-backward scheme of OCE conducts conditional independence tests only using the conditions with highest CMI in the preceding stage and quickly increases the number of conditions. This can lead to wrong parents being kept in $P^OCE(Xtj)$ (see Sec. V B) which are only removed in the backward stage where the dimensionality of the set $P^OCE(Xtj)$ can be already quite high. High dimensionality, in principle, leads to lower detection power (Example VII C). The PC algorithm conducts conditional independence tests not only using the condition with highest association, but it goes through (in theory all) combinations of conditions $S$ which can help to alleviate the curse of dimensionality regarding the estimation dimension of $I(Xt\u2212\tau i;Xtj\u2223S)$ compared to OCE. On the other hand, the PC algorithm conducts many more tests which increases other problems (see Sec. V C).

#### 4. PCMCI

A more recent approach that addresses some of the shortcomings of the PC algorithm above is PCMCI (Runge *et al.*, 2018). PCMCI is a two-step approach which uses a version of the PC-algorithm only as a condition-selection step (PC$1$ algorithm) to obtain $P^(Xtj)$ for all $Xtj\u2208Xt$, followed by the *momentary conditional independence* (MCI) test for $Xt\u2212\tau i\u2192Xtj$ defined as

with $Xt\u2212=(Xt\u22121,Xt\u22122,\u2026,Xt\u2212\tau max)$. The main difference between PC and PC$1$ is that PC$1$ tests only the condition subset $S$ with largest association instead of going through all possible combinations. To this end $P^(Xtj)$ is sorted after every iteration according to the test statistic value and $S$ is determined by the first $p$ variables in $P^(Xtj)$ (excluding $Xt\u2212\tau i$). This leads to less tests, but still provably removes incorrect links.

The MCI test is the most important difference to the PC algorithm and the approach by Sun and Bollt (2014). The additional conditioning on the parents $P^(Xt\u2212\tau i)$ in MCI accounts for autocorrelation leading to well-controlled false positive rates at the expected level (Runge *et al.*, 2018). A variant (PCMCI$0$) where the condition on the parents $P^(Xt\u2212\tau i)$ is dropped leads to a very similar approach to OCE. PCMCI, like FullCI, OCE, and PC, can be implemented with different conditional independence tests. For further details on PCMCI, see Runge *et al.* (2018). In Sec. VII we compare FullCI, OCE, and PCMCI in a number of numerical comparison studies.

### B. Consistency

Consistency is an important property of causal methods that tells us whether the method provably converges to the true causal graph in the limit of infinite sample size. Consistency concerns the conditional independence tests on the one hand, but also the causal algorithm in the case of iterative approaches such as those discussed in Sec. V A.

For example, for the consistency of the non-parametric regression independence tests in Eq. (23), we need to assume that the function estimator converges to the true function, that the noise in the model is additive and independent, and finally that we have a consistent unconditional independence test for the residuals. With a consistent test, the time series graph can be directly estimated based on Definition 1. For iterative causal algorithms, we can define *universal* consistency as follows.

(Universal causal consistency).

*Denote by $G^n$ the estimated graph of some causal estimator from a sample of a distribution $P$ with sample size $n$ and by $G$ the true causal graph. Then a causal estimator is said to be universally consistent if $G^n$ converges in probability to $G$ for every distribution $P$,*

That is, the probability of estimating the wrong graph becomes arbitrarily small if enough data is available, for any distribution $P$ (hence “universal”). Consistency has been proven for classical causal discovery algorithms such as the PC-algorithm (Spirtes *et al.*, 2000), the optimal causation approach Sun and Bollt (2014), Sun *et al*. (2015) and PCMCI Runge *et al.* (2018), as an approach based on PC.

However, universal consistency is a weaker statement than, for example, *uniform consistency* which bounds the error probability as a function of the sample size $n$ giving a *rate of convergence*. Thus, for a non-uniform, but only universally consistent method, the sample size at which a given error can be guaranteed and can be different for every distribution $P$. Robins *et al.* (2003) showed that no uniformly consistent causal discovery algorithm from the class of independence-based approaches (Spirtes *et al.*, 2000) exists since the convergence can always be made arbitrarily slow by a distribution that is *almost unfaithful* with some dependencies made arbitrarily small. Uniform consistency for conditional-independence based algorithms can only be achieved under further assumptions such as having strong enough dependencies (Kalisch, 2008).

**(An inconsistent causal algorithm)**

Consider again the forward-selection stage of the OCE algorithm (Sun and Bollt, 2014; Sun *et al.*, 2015) introduced in Sec. V as a standalone method to reconstruct parents of a variable $Yt\u2208Xt$. Even though the scheme sounds appealing and efficient, the scheme alone is not a consistent estimator of causal graphs. It yields a *superset* of the parents (Sun *et al.*, 2015) which may also contain false positives: Consider the example graph shown in Fig. 11. Here, the causal parents of $Y$ are $Z1,Z2$ (dropping time subscripts $t$ here). If forward-selection alone was a causal approach, then in each step the variable with strongest association would also need to be a causal parent. But in this example the MI between $X$ and $Y$ can be larger than the MIs of $Z1$ and $Z2$ with $Y$. For example, for $a=0.5,b=2$ we have $I(X;Y)\u22480.13$ *nats* while $I(Z1;Y)=I(Z2;Y)\u22480.06$ *nats*. See Appendix A for an information-theoretic analysis. Hence, the wrong parent $X$ is selected. This scheme, thus, requires the second step of the OCE approach, given by Eq. (26).

### C. Significance testing

How can we assess the statistical significance of conditional independence tests on which the causal algorithms in Sec. V A are based, such as the tests discussed in Sec. IV F? Using a test statistic $I^n(x;y\u2223z)$ ($I$ here stands not only for CMI, but any conditional independence test statistic) for the observed samples $(x,y,z)={xi,yi,zi}i=1n$ we wish to test the hypothesis

versus the general alternative

To assess the significance of an outcome of such a test using $p$-values, we need to know the distribution $Pr(In\u2223H0)$ of the test statistic under the null hypothesis. For partial correlation tests, exact analytical expressions of the null distribution exist under certain assumptions, but for nonlinear tests (such as CMI or also non-parametric regression tests) these are typically not available except for some asymptotic large sample size cases (Strobl *et al.*, 2017). The alternative then are permutation tests as discussed in Example 12.

Given the null distribution, the $p$-value is defined as the probability—given $H0$—of observing a value of the test statistic that is the same or more extreme than what was actually observed. If our test statistic is non-negative (such as CMI), the $p$-value for an observed test statistic value $I^n$ is defined as $p=Pr(In\u2265I^n\u2223H0)$. Choosing a *significance level* $\alpha $, we reject the null hypothesis if $p<\alpha $.

There are two types of errors we can make. Rejecting $H0$ when $H0$ is true is called a type I error or false positive. Retaining $H0$ when $H1$ is true is called a type II error or false negative. We also call one minus the type II error rate of a test the true positive rate or detection rate.

If the test statistic has a continuous distribution, then under $H0$ the $p$-value has a uniform $(0,1)$ distribution (Wasserman, 2004). Therefore, if we reject $H0$ when the $p$-value is less than $\alpha $, the probability of a type I error is $\alpha $. For a well-calibrated test under $H0$, we thus expect to measure on average a false positive rate of $\alpha $. If this is not the case, the test is ill-calibrated indicating that we got the null distribution wrong because certain assumptions, such as independent samples, are violated. In Examples 12 and 13, we discuss such cases.

**(Permutation testing)**

Permutation testing is straightforward in the bivariate independence test case. To create an estimate of the null distribution of a test statistic $I^n(x;y)$, we can simply generate a large number of test statistics $I^n(x\u2217;y)$ where $x\u2217$ is a permuted version of $x$.

*conditional*independence testing of $X\u22a5\u22a5Y\u2223Z$? In the top left panel of Fig. 12, we illustrate an example scatterplot of a sample drawn from a common driver scheme $X\u2190Z\u2192Y$ where we have $X\u22a5\u22a5Y|Z$. If we now permute $x$, we get the sample shown in the top right panel. Here, any association between $x$ and $y$ is indeed destroyed, but we also destroyed the association between $x$ and $z$. That is, for the permuted sample we have

To achieve a test under the correct null hypothesis, we can use a *local permutation scheme* that preserves the associations between $x$ and $z$. Runge (2018) suggests such a scheme depicted in the bottom panel of Fig. 12 which only permutes those $xi$ and $xj$ where $zi\u2248zj$. This scheme can be used for CMI conditional independence testing or also other test statistics. Other schemes are discussed in Doran *et al.* (2014) and Sen *et al.* (2017).

**(Non-independent samples)**

A basic assumption underlying many conditional independence tests is that the samples are independent and identically distributed (*i.i.d.*). Unfortunately, time series are typically dependent in time. To take this dependence into account, one can either adapt the distribution under the null hypothesis, for example, in partial correlation $t$-tests by estimating the *effective degrees of freedom*, which is, however, difficult in a multivariate setting. Or one can modify the test statistic to explicitly account for autocorrelation (Runge *et al.*, 2018). Also, a permutation scheme needs to be adapted to preserve auto-dependencies, for example, by shuffling blocks of samples (Peifer *et al.*, 2005). For bivariate tests such an approach is again straightforward, but not for the multivariate case as analyzed in the autocorrelation comparison study in Sec. VII.

**(Sequential testing of causal links)**

The preceding discussion concerned tests of an individual conditional independence relationship. Directly testing causal links via Definition 1, thus, gives us a well-calibrated test if all assumptions are fulfilled.

However, in iterative causal algorithms (Sec. V A) such as the PC algorithm, OCE, or the first step of PCMCI (PC$1$) *multiple* tests on a particular link $Xt\u2212\tau \u2192Yt$ are conducted with different condition sets that are determined by the outcome of previous tests. If a link is found non-significant in *any* of these tests, this link is removed. Since the tests are not independent of each other (since they are typically based on the same data sample), it is almost impossible to derive a combined $p$-value of all these tests.

Figure 13 depicts an illustrative numerical example where the combined false positive rate is much lower than the $0.05$ of each individual test and the true positive rate is lower than the minimal true positive rate among the individual tests. In summary, even though all assumptions may be valid, sequential testing makes a significance assessment difficult. These issues are further discussed in Tsamardinos and Brown (2008) and Strobl and Spirtes (2016) and references therein where False Discovery Rate (Benjamini and Hochberg, 1995) approaches are discussed.

As a side remark on the previously discussed OCE and PCMCI causal discovery approaches, in the backward-elimination stage [Eq. (26)] OCE tests only the significance of each of the parents $Xt\u2212\tau i\u2208P^OCE(Xtj)$ conditional on the remaining parents. PCMCI (Runge *et al.*, 2018), on the other hand, in the second MCI step tests *all* links again [Eq. (29)] which makes the MCI test slightly less dependent on the sequential testing issue of the condition-selection step PC$1$ since parents that have been removed in PC$1$ (false negatives) are tested again in the MCI test. In particular, the false positives are as expected as demonstrated in the comparison studies in Sec. VII.

### D. Computational complexity

Application areas of causal discovery methods vary in the typical numbers of variables as well as available sample sizes $n$. Next to the properties discussed before, an important issue then is how a method scales with dimensionality and sample size. High-dimensionality arises from the number of included variables $N$ and the maximum time lag $\tau max$ [see Fig. 1(c)] and has at least two consequences: (1) higher computational complexity leading to longer runtimes and (2) typically lower detection power. Independence tests may also become ill-calibrated in high-dimensions.

For directly testing causal links via Definition 1 (FullCI), the computational complexity depends on the complexity of a single high-dimensional conditional independence test. In the linear partial correlation case, OLS regression scales $\u223cO(n(N\tau max)2)$. FullCI estimated using nearest-neighbor estimators of CMI (Kraskov *et al.*, 2004; Frenzel and Pompe, 2007), on the other hand, will scale $\u223cO(nlog\u2061n)$ regarding time complexity while the complexity in $N\tau max$ will depend on algorithmic details such as using efficient KD-tree nearest-neighbor search procedures (Maneewongvatana and Mount, 1999).

The methods PC, OCE, or PCMCI (Sec. V A) based on a condition-selection step avoid high-dimensional conditional independence estimation by conducting more tests with lower dimensional conditioning sets. Their theoretical complexities are difficult to evaluate, for numerical evaluations see Sun *et al*. (2015) and Runge *et al.* (2018), but typically they scale polynomially in time.

The other major challenge with high dimensionality is detection power as analyzed in Example VII C.

## VI. Performance evaluation criteria

How can causal discovery methods, such as those described in Sec. V be evaluated? Typically, we want to know which method performs best on data from the kind of system we wish to study. Ideally, we would like to compare different methods on a data sample where the underlying causal truth is known or evaluate methods by experimentally manipulating a system, i.e., actually performing the *do*-experiment (Pearl, 2000) mentioned in the introduction which forms the theoretical basis of the present concept of causality. Since both of these options are mostly not available, an alternative is to construct synthetic model data where the underlying ground truth is known. These can then be used to study the performance of causal methods for realistic finite sample situations.

### A. Models

To evaluate causal methods on synthetic data, several aspects for constructing model systems are relevant:

**Model realism:**The model systems should mimic the domain-specific properties of real data in terms of nonlinearity, autocorrelation, spectral properties, noise structure (dynamical as well as observational), etc.**Model diversity:**To avoid biased conclusions, a large number of different randomly selected connectivity structures should be tested [including link density as well as properties such as small-worldness (Watts and Strogatz, 1998)]. For example, the aforementioned forward-selection approach failed for the example shown in Fig. 11 but works for many other graphs. But also consistent methods may have biases for finite samples as studied in Runge*et al.*(2018).**Model dimensionality:**As studied in Fig. 16, a method may perform well only for a small number of variables and the performance for high-dimensional settings (large networks) can quickly degrade.**Sample sizes:**The comparative performance of different models may vary widely for different sample sizes which needs to be studied if no uniform consistency results are available.

### B. Metrics

The performance of a causal method on a single realization of a model does not allow for reliable conclusions. Therefore, each model needs to be evaluated from many realizations. Then the most straightforward evaluation metric is to measure false positive rates and true positive rates for a given $\alpha $ as shown in the comparison studies in Sec. VII. The number of realizations should be chosen high enough since the error of a false or true positive rate $r$ for $B$ realizations is given by $\sigma r=r(1\u2212r)/B$. Alternative evaluation metrics that do not depend on a particular significance level but directly on the $p$-values are the Kullback-Leibler divergence to evaluate whether the $p$-values are uniformly distributed (to measure how well-calibrated a test is) and the Area Under the Power Curve (AUPC) to evaluate true positives.

Next to the true and false positives of a causal method for finite samples, another performance criterion is computational runtime, though this may strongly depend on a given implementation.

## VII. Comparison studies

In this section, we compare several common causal discovery methods in three numerical comparison studies highlighting the effect of dynamical noise in deterministic chaotic systems, autocorrelation, and high dimensionality.

### A. Dynamical noise in deterministic chaotic systems

In Example 6, we studied a static example of determinism. Here, we evaluate the effect of dynamical noise in a system of coupled chaotic logistic maps:

with uniformly distributed independent noise $\eta $ and $r=4$ leading to chaotic dynamics. Here, $\sigma $ controls the amount of dynamical noise in the system. To evaluate true positive rates (correctly detecting $Zt\u22121\u2192Xt$ and $Zt\u22121\u2192Yt$) and false positive rates (incorrect detections for any other variable pair, direction, or lag), $200$ realizations with time series length $n=150$ were generated.

We compare three methods from an information-theoretic framework (FullCI, OCE, PCMCI) with *convergent-cross mapping* (CCM, Sugihara *et al.*, 2012) (see also Arnhold *et al.*, 1999; Hirata *et al.*, 2016) as a nonlinear dynamics-inspired approach. The significance of CMIs in FullCI and OCE is tested with a nearest-neighbor CMI estimator (Kraskov *et al.*, 2004; Frenzel and Pompe, 2007; Vejmelka and Palus, 2008) in combination with a permutation test where $Xt\u2212\tau i$ is randomly shuffled. PCMCI is implemented with the CMIknn independence test (Runge, 2018) as discussed in Example 12. We also evaluate a variant (PCMCI$0$) where the condition on the parents $P^(Xt\u2212\tau i)$ is dropped and the only difference to OCE is the condition-selection step. CCM reconstructs the variable’s state-spaces using lagged coordinate embedding and concludes on $X\u2192Y$ if points on $X$ can be well predicted using nearest neighbors in the state space of $Y$. Note that CCM and related works only use the time series of $X$ and $Y$ with the underlying assumption that the dynamics of $Z$ can be reconstructed using delay embedding. All methods were evaluated at a significance level of $0.05$. For implementation details, see Appendix B 1. In the top panel of Fig. 14, we depict the true causal graph.

Figure 14 shows that in the purely deterministic regime with $\sigma =0$, PCMCI has almost no power, while PCMCI$0$ features a detection rate of $0.8$ and FullCI, OCE, and CCM almost always detect the couplings. On the other hand, CCM also has the highest number of false positives around $0.2$ which exceeds the significance level indicating an ill-calibrated test. FullCI, OCE, and PCMCI$0$ also do not control false positives well, while with PCMCI they are around the expected level.

For higher dynamical noise levels interesting behavior emerges: FullCI and CCM have continuously decreasing power (and slightly decreasing false positives) dropping to almost zero at $\sigma =0.4$, while the power of OCE and both PCMCI versions steadily increases up to $\sigma =0.2$ after which power decreases with a more pronounced decrease for PCMCI and PCMCI$0$ having the highest power. False positives are above the $0.05$ threshold for FullCI and OCE for a wide range of noise levels, while for PCMCI false positives are always well-controlled at the expected $0.05$.

How can these results be understood? CCM attempts to reconstruct the attractor manifolds underlying $X$ and $Y$. With more dynamical noise, this reconstruction becomes more difficult and CCM looses power. The fact that CCM does not control false positives well, especially in the deterministic regime deserved further study. FullCI suffers from the curse of dimensionality especially for higher noise levels.

The MCI test statistic for the link $Zt\u22121\u2192Xt$ estimates $IZt\u22121\u2192XtMCI=I(Zt\u22121;Xt\u2223Zt\u22122,Xt\u22121)$. Now since for $\sigma =0$ $Zt\u22121$ is a deterministic mapping of $Zt\u22122$, in theory we have $IMCI=0$ for the same reasons as in the simple deterministic model (19) above (and analogously for $Zt\u22121\u2192Yt$). In practice, we can only measure the entropies at some coarse-grained level (here determined by the CMI nearest-neighbor parameter) and the deterministic dependency is never exactly recovered leading to a non-zero MCI. In OCE and PCMCI$0$, only the parents of $Xt$ are included in the condition. The fact that FullCI, despite conditioning on the whole past, also detects the link deserves further study. Possible explanations are an ill-calibrated test or dynamical properties of the logistic-map system. In summary, purely deterministic dynamics here seem to generate too little momentary information which is necessary for information-theoretic coupling detection with MCI (see also Pompe and Runge, 2011).

Given at least some dynamical noise to suffice the Faithfulness condition and together with the other assumptions discussed in this paper, OCE and PCMCI provably converge to the true causal graph in the limit of infinite sample size (Runge *et al.*, 2018; Sun *et al.*, 2015, see also Sec. V B), while no such theoretical results is available for CCM. In the infinite sample limit also the inflated false positives of OCE due to time-dependent samples vanish. For finite samples, on the other hand, among other factors, consistency depends on how well-calibrated the significance test is. PCMCI here always yields expected levels of false positives. The inflated false positives for FullCI and OCE for a wide range of noise levels and for PCMCI$0$ for very low dynamical noise is related to the way significance testing is implemented. In theory, OCE should have less false positives than the expected $0.05$ due to the sequential testing problem (Sec. V C), but in our examples autocorrelation and the global permutation scheme leads to ill-calibrated tests with inflated false positives in each individual test, see Examples 12 and 13, leading to an overall higher false positive rate. The effect of autocorrelation is evaluated further in the next example.

### B. Autocorrelation

In Fig. 15, we evaluate the previously introduced causal algorithms (Sec. V A) FullCI, OCE, and PCMCI on autocorrelated data which is an ubiquitous feature in real world time series data. The full model setup is described in Appendix B 2. In short, here we only evaluate the false positive rates (for $c=0$) and true positive rates (for $c\u22600$) of the link $Xt\u22121\u2192Yt$ (top panel of Fig. 15) where the autocorrelation $a$ is varied for different numbers of common drivers $DZ$ and the coefficients $b$ and $\sigma Z$ are chosen such that the unconditional dependence stays the same, and we only investigate the effect of autocorrelation. The time series length is $n=150$ and $1000$ realizations were run for each model setup to evaluate false positive rates and true positive rates at an $\alpha =0.05$ significance level.

We compare partial correlation implementations (test statistic $\rho $) of the following tests: (1) FullCI directly tests Definition 1, $Xt\u22121\u22a5\u22a5Yt\u2223Xt(t\u22121,\u2026,t\u2212\tau max)\u2216{Xt\u22121}$ for $\tau max=5$. For OCE and PCMCI, we assume that the condition-selection steps already picked the correct parent sets and only test the link $Xt\u22121\u2192Yt$ in the second stages of OCE and PCMCI, respectively: (2) OCE [Eq. (26), equivalent to $PCMCI0$] here tests $Xt\u22121\u22a5\u22a5Yt\u2223PYt$, where $PYt$ is given by the gray and blue boxes in the top panel of Fig. 15. (3) OCEpw with conditioning set as for OCE, but where all variables are pre-whitened beforehand as described in Appendix B 2. (4) OCEbs with conditioning set as for OCE, but where a block-shuffle test was used as described in Appendix B 2. (5) PCMCI tests $Xt\u22121\u22a5\u22a5Yt|PYt,PXt\u22121$ [Eq. (29)] as given by the gray, blue, and red boxes. For all approaches, the analytical null distribution of the partial correlation test statistic was used ($t$-test), except for the block-shuffle permutation test.

In the bottom four panels of Fig. 15, we depict results for $DZ=0$, that is, the bivariate case, and $DZ=4$, both for varying the autocorrelation strength $a$. In the bivariate case, all approaches well control the false positive rates except for OCE and (slightly better) OCEbs, which feature inflated false positive rates for very high autocorrelation. The reason is that when testing $\rho OCE=\rho (Xt\u22121;Yt|Yt\u22121)$ with the $t$-test, we assume i.i.d.-data, but since $X$ is autocorrelated, this is not the case. This false positive inflation is also seen in Fig. 14. Here, pre-whitening removes this autocorrelation and block-shuffling remedies it to some extent. PCMCI and FullCI both condition out autocorrelation and well-control false positives with constant true positive levels, independent of $a$, while the true positive level depends on $a$ for OCE and its modifications, even though the coupling coefficient $c$ is constant.

For $DZ=4$ false positive inflation becomes even more severe for OCE and here also pre-whitening does not help but leads to strongly increased false positive rates since univariate pre-whitening is not suitable for multivariate conditional independence testing. The PCMCI approach conditions on the parents of the lagged variable which helps to exclude autocorrelation as shown in Runge *et al.* (2018) and allows us to utilize analytical null distributions that assume i.i.d. data.

### C. Curse of dimensionality

In Fig. 16, we evaluate the causal algorithms (Sec. V A) for high-dimensional data using the same model as in Fig. 15 described in Appendix B 2. Here only FullCI, OCE, and PCMCI are compared, all of them again based on partial correlation.

As shown in Fig. 16, FullCI severely suffers from the curse of dimensionality and the OLS-solver even becomes ill-conditioned for $DZ=32$ since then the estimation dimension exceeds the sample size. For OCE and PCMCI, we again assume that the condition-selection algorithms selected the correct set of parents. Then the dimensionality of OCE and PCMCI increases only slightly and the power stays at higher levels.

The problem becomes even more severe for non-parametric tests such as multivariate transfer entropy. Another alternative to OLS partial correlation estimation are regularization techniques such as ridge regression (Hoerl *et al.*, 1970; Tibshirani, 1996; Tikhonov, 1963), but these come with other difficulties, for example regarding significance testing. These issues are further analyzed in Runge *et al.* (2018).

## VIII. Discussion and conclusions

The long preceding list of theoretical assumptions in Sec. IV may make causal discovery seem a daunting task. In most real data application, we will not have all common drivers measured, hence violating the causal sufficiency assumption. Then also the Markov assumption may be violated in many cases due to time-aggregation. A conclusion on the *existence* of a causal link, thus, rests on a number of partially strong assumptions. So what can we learn from an estimated causal graph? Let us consider what the assumptions on the *absence* of a causal link are.

*Let*$X~$

*be measurements of a stochastic process*$X$.

*Assuming Faithfulness*(

*Definition*5)

*and that all variables in*$X~$

*are measured without error we have that for*$X~,Y~,Z~\u2208X~$

*with*$X~,Y~\u2209Z~$ if

*that is, if independence is measured given any subset of conditions, then there is no direct causal link between*$Xt\u2212\tau $

*and*$Yt$

*in*$G$.

Note that for the practical estimation from finite data, we also need to assume that all dependencies among lagged variables in $X$ can be modeled with the conditional independence test statistic, that is, no false negative errors occur. Nevertheless, Remark 1 rests on much weaker assumptions than the existence of a causal link since it does not require Causal Sufficiency or Causal the Markov Condition. The proof follows almost directly from the Faithfulness assumption.

*Proof.*

Firstly, since we assume an error-free measurement process, we have that $X~=X$. Then $X~t\u2212\tau \u22a5\u22a5Y~t\u2223Z~\u21d2Xt\u2212\tau \u22a5\u22a5Yt\u2223Z\u21d2Xt\u2212\tau \u22c8Yt\u2223Z$. The last relation is the Faithfulness assumption. Separation implies in particular that no *direct* link $Xt\u2212\tau \u2192Yt$ exists in $G$.

The second set of assumptions important for causal discovery are the assumptions underlying significance testing (Sec. V C). Failing to properly take into account autocorrelation or too simple permutation schemes imply ill-calibrated significance tests leading to inflated false positives beyond those expected by the significance level (see the comparison studies in Sec. VII). Next to the theoretical causal assumptions, statistical reliability of reconstructed networks is an important aspect for drawing causal conclusions.

This paper is intended to recapitulate the main concepts of time-lagged causal discovery from observational time series data and accessibly illustrate important challenges. But many more challenges exist, for example, we have not considered selection bias or issues with the definition of variables as elaborated on in Spirtes *et al.* (2000). We also have not discussed the topic of determining causal effects (Pearl, 2000) (causal quantification) or mediation (VanderWeele, 2015; Runge *et al.*, 2015a, 2015b) as opposed the pure existence or absence of causal links presented here.

Our focus was on time series which make the causal discovery problem easier in some aspects (e.g., time order can be exploited), but more difficult in other aspects, especially regarding statistical testing. We have briefly mentioned the recent works based on different sets of assumptions in the framework of structural causal modeling (Peters *et al.*, 2017), which do not require time-order. Also, many more techniques and insights from the conditional independence framework (Spirtes *et al.*, 2000) can be utilized in the time series case. An important conclusion is that causal discovery is a very active area of research in many fields, from mathematics, computer science, and physics to applied sciences, and methodological progress can greatly benefit from more interdisciplinary exchange.

## ACKNOWLEDGMENTS

J.R. thanks C. Glymour and F. Eberhardt for comments and D. Sejdinovic, K. Zhang, and J. Peters for helpful discussions. Special thanks to C. Linstead for help with high-performance computing. J.R. received funding from a postdoctoral award by the James S. McDonnell Foundation and gratefully acknowledges the European Regional Development Fund (ERDF), the German Federal Ministry of Education and Research, and the Land Brandenburg for supporting this project by providing resources on the high-performance computer system at the Potsdam Institute for Climate Impact Research. Software is available online under https://github.com/jakobrunge/tigramite.

### Appendix A: Inconsistent causal algorithm

For the example graph shown in Fig. 11, consider the following decomposition [chain rule (Cover and Thomas, 2006), dropping $t$]:

Alternatively, one can decompose the same MI as

where the last term vanishes because $(Z1,Z2)$ separates $X$ and $Y$ in the graph (Markov condition). From these two equations, it follows that if

Hence, the wrong parent $X$ has higher MI and would be selected with a pure forward-selection scheme.

### Appendix B: Implementation details for numerical examples

#### B.1. Dynamical noise model

CCM was estimated with embedding dimension $E=2$, and the surrogate test ebisuzaki with 500 surrogates using the R-package rEDM. CCM requires two criteria (Sugihara *et al.*, 2012), both a significant CCM value at library size $n$ and an increasing CCM value over increasing library length. As a $p$-value of CCM, we thus take $max(pn,pconv)$, where $pn$ is the $p$-value of CCM at library size $n$ and $pconv$ is the $p$-value for the hypothesis of an increasing linear trend. OCE was estimated with threshold parameter $\alpha OCE=0.1$ and $\tau max=2$ in the forward step and with CMI nearest-neighbor parameter $kCMI=15$ and $B=500$ permutation surrogates. FullCI was also estimated with CMI parameter $kCMI=15$ and $B=500$ permutation surrogates. PCMCI was implemented with $\alpha PC=0.1,\tau max=2$ and CMIknn parameters $kCMI=15,kperm=5$ and $B=500$ permutation surrogates (Runge, 2018). PCMCI was run without restricting the number of parents $pX$ in the MCI step, while for PCMCI$0$ only the parents of the non-lagged variable were included. For each noise level $\sigma $, we ran the four methods on 200 time series realizations of the model. We compute as true positives the average rates at which the links $Zt\u22121\u2192Xt$ and $Zt\u22121\u2192Yt$ were detected from the 200 realizations at an $\alpha =0.05$ significance level. We calculate as false positives the average rates for $i\u2260j\u2208{X,Y,Z}$ and $\tau \u2208{1,2}$ where there is no link. Since we use an $\alpha =0.05$ significance level, a well-calibrated test should yield 5% false positives. The edge color and width of the graphs in Fig. 14 corresponds to the lag with maximum CMI/CCM value.

#### B.2. Model for examples on autocorrelation and high-dimensionality

The model time series graph is depicted in Fig. 15. In the model setup, all links correspond to linear dependencies. The autocorrelation $a$ is the same for all variables $X,Y,Z1,\u2026,ZDZ$. The coupling coefficient $c$ of the link $Xt\u22121\u2192Yt$ is zero to test false positive rates and nonzero to test true positive rates. For nonzero $c$ its value is chosen such that in the corresponding bivariate model without autocorrelation ($DZ=0,a=0$), we obtain a constant mutual information $I(Xt\u22121;Yt)=0.03$ *nats* for the linear coupling with partial correlation. On the other hand, the common driver forcing coefficient $b=b(DZ,a)$ and the covariance among the drivers $\sigma Z=b/2$ are chosen such that in the full model for every pair ($DZ>0,a\u22650$) with $c=0$ we have $I(Xt\u22121;Yt)=0.4$ *nats*, a relatively strong forcing corresponding to a correlation of $\u22480.7$.

This setup guarantees that the unconditional dependence stays the same and we only investigate the effect of increasing autocorrelation and higher dimensionality. We test additive Gaussian noise terms $\eta \u223cN(0,1)$. For $DZ=0$ the model setup corresponds to two autocorrelated processes without a common driver forcing. The sample lengths are $n=150$ for partial correlation. $1000$ realizations were evaluated to assess false and true positives.

#### B.3. Pre-whitening and block-shuffling

We also tested the OCE test using a pre-whitening (OCEpw) and a block-shuffle permutation test (OCEbs). For the pre-whitening test, we preprocessed all $N$ time series by estimating the univariate lag-1 autocorrelation coefficients $a^i=\rho (Xt\u22121i;Xti)$ and regressing out the AR(1) autocorrelation part of the signals:

Then the OCE test is applied to these residuals $X~$.

Another remedy is a block-shuffle permutation test, which is based on a block-shuffle surrogate test following Peifer *et al.* (2005) and Mader *et al.* (2013). For the test statistic $T$, an ensemble of $M=500$ values of $I(Xt\u2212\tau \u2217;Yt\u2223\u2026)$ is generated where $Xt\u2212\tau \u2217$ is a block-shuffled surrogate of $Xt\u2212\tau $, i.e., with blocks of the original time series permuted. As an optimal block-length, we use the method described in Peifer *et al.* (2005) and Mader *et al.* (2013) for non-overlapping blocks. The optimal block-length formula Eq. (6) in Mader *et al.* (2013) involves the decay rate of the envelope of the autocorrelation function $\gamma (\tau )$. The latter was estimated up to a maximum delay of 5% of the samples, and the envelope was estimated using the Hilbert transform. Then a function $C\varphi \tau $ was fit to the envelope with constant $C$ to obtain the decay rate $\varphi $. The block length was limited to a maximum of 10% of the sample length. Finally, the estimated values are sorted, and a $p$-value is obtained as the fraction of surrogates with values greater than or equal to the estimated value.