We introduce and test a general machine-learning-based technique for the inference of short term causal dependence between state variables of an unknown dynamical system from time-series measurements of its state variables. Our technique leverages the results of a machine learning process for short time prediction to achieve our goal. The basic idea is to use the machine learning to estimate the elements of the Jacobian matrix of the dynamical flow along an orbit. The type of machine learning that we employ is reservoir computing. We present numerical tests on link inference of a network of interacting dynamical nodes. It is seen that dynamical noise can greatly enhance the effectiveness of our technique, while observational noise degrades the effectiveness. We believe that the competition between these two opposing types of noise will be the key factor determining the success of causal inference in many of the most important application situations.
The general problem of determining causal dependencies in an unknown time evolving system from time-series observations is of great interest in many fields. Examples include inferring neuronal connections from spiking data, deducing causal dependencies between genes from expression data, discovering long spatial range influences in climate variations, etc. Previous work has often tackled such problems by consideration of correlations, prediction impact, or information transfer metrics. Here, we propose a new method that leverages the potential ability of machine learning to perform predictive and interpretive tasks and uses this to extract information on causal dependence. We test our method on model complex systems consisting of networks of many interconnected dynamical units. These tests show that machine learning offers a unique and potentially highly effective approach to the general problem of causal inference.
I. INTRODUCTION
The core goal of science is often described as generalizing from observations to understanding,1 commonly embodied in predictive theories. Related to this is the desire to use measured data to infer necessary properties and the structure of any description consistent with a given class of observations. On the other hand, it has recently emerged that machine learning (ML) is capable of effectively performing a wide range of interpretive and predictive tasks on data.2 Thus, it is natural to ask whether machine learning might be useful for the common scientific goal of discovering structural properties of a system from data generated by this system. In this paper, we consider an important, widely applicable class of such tasks. Specifically, we consider the use of machine learning to address two goals.
Goal (i): Determine whether or not a state variable of a time evolving system causally influences another state variable.
Goal (ii): Determine the “strength” of such causal influences.
In the terminology of ML, Goal (i) is referred to as “classification ML,” and Goal (ii) is referred to as “regression ML.” These goals have previously been of great interest in many applications (e.g., economics,3 neuroscience,4 genomics,5 climate,6 etc.). Many past approaches have, for example, been based upon the concepts of prediction impact,3,4 correlation,7–9 information transfer,10,11 and direct physical perturbations.12,13 Other previous works have investigated the inference of network links from time series of node states assuming some prior knowledge of the form of the network system and using that knowledge in a fitting procedure to determine links.9,14–17 In addition, some recent papers address network link inference from data via techniques based on delay coordinate embedding,15 random forest methods,18 network embedding algorithms,19 and feature ranking.20 In this paper, we introduce a technique that makes use of an ML training process for predictive and interpretive tasks and attempts to use it to extract information about causal dependencies. In particular, we use a type of machine learning called reservoir computing, an efficient method of time-series analysis that has previously been used successfully for tasks such as prediction of chaotic dynamics21–23 and speech recognition.24,25 In our case, a “reservoir” dynamical system is trained such that it becomes synchronized to a training time series data set from the unknown system of interest. The trained reservoir system is then able to provide an estimate of the response to perturbations in different parts of the original system, thus yielding information about causal dependencies in the actual system. We will show that this ML-based technique offers a unique and potentially highly effective approach to determining causal dependencies. Furthermore, the presence of dynamical noise (either naturally present or intentionally injected) can very greatly improve the ability to infer causality,14,15 while, in contrast, observational noise degrades inference.
II. SHORT TERM CAUSAL DEPENDENCE (STCD)
We begin by considering the very general case of an evolving, deterministic, dynamical system whose state at time is represented by the -dimensional vector , where evolves via a system of differential equations, , and has reached a statistically steady dynamical state (perhaps chaotic). In this context, we frame the issue of causality as follows: Will a perturbation at time applied to a component of the state vector [i.e., ] lead to a subsequent change at a slightly later time, , of another scalar component [i.e., ]; and how can we quantify the strength of this dependence? This formulation might suggest comparison of the evolutions of that result from two identical systems, one with, and the other without, the application of the perturbation. However, we will be interested in the typical situation in which such a comparison is not possible, and one can only passively observe (measure) the state of the (single) system of interest. Aside from knowing that the dynamics of interest evolves according to a system of the form , we assume little or no additional knowledge of the system and that the available information is a limited-duration past time series of the state evolution . Nevertheless, we still desire to deduce causal dependencies, where the meaning of causal is in terms of responses to perturbations as defined above. Since, as we will see, accomplishment of this task, in principle, is not always possible, our approach will be to first propose a heuristic solution and then numerically test its validity. The main message of this paper is that our proposed procedure can be extremely effective for a very large class of important problems. We will also delineate situations where our procedure is expected to fail. We emphasize that, as our method is conceptually based on consideration of responses to perturbations, in our opinion, it provides a more direct test of what is commonly of interest when determining causality than do tests based on prediction impact, correlation, or entropy metrics.
Furthermore, although the setting motivating our procedure is for deterministic systems, , we will also investigate performance of our procedure in the presence of both dynamical noise [i.e., noise added to the state evolution equation, ] and observational noise [i.e., noise added to observations of used as training data for the machine learning]. Both types of noise are, in practice, invariably present. An important result from our study is that the presence of dynamical noise can very greatly enhance the accuracy and applicability of our method (a similar point has been made in Refs. 14 and 15), while observational noise degrades the ability to infer causal dependence.
To more precisely define causal dependence, we consider the effect of a perturbation on one variable on the other variables as follows. Taking the component of , we have
for . Perturbing by , we obtain, for small , that the component of the orbit perturbation of at time due to is
We define the Short Term Causal Dependence (STCD) metric, , of on by
where denotes a long time average of the quantity over an orbit and the function is to be chosen in a situation-dependent manner. For example, later in this paper, we consider examples addressing Goal (i) [where we want to distinguish whether or not is always zero] for which we use , while, when we consider an example addressing Goal (ii) and are concerned with the time-averaged signed value of the interaction strength, we then use . In either case, we view as quantifying the causal dependence of on , and the key goal of this paper will be to obtain and test a machine learning procedure for estimating from observations of the state evolution . For future reference, we will henceforth denote our machine learning estimate of by . In the case of our Goal (i) experiments, where , we note that defined by (1) is an average of a non-negative quantity and thus, , as will be our estimate, . Furthermore, for this case, we will define STCD of on by the condition, , and, when using our machine learning estimate , we shall judge STCD to likely apply when where we call the discrimination threshold. In the ideal case , the discrimination threshold can be set to zero, but, in practice, due to the error in our estimate, we consider to be a suitably chosen positive number. We note that, in the ideal case, can be regarded as a test for whether or not is independent of .
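As a concrete illustration of the averaging in Eq. (1), the Python sketch below time-averages the function g applied to one Jacobian element along an orbit, using the absolute value for Goal (i) and the signed value for Goal (ii), and compares the result to a discrimination threshold. The array and function names are hypothetical, and the Jacobian samples are placeholders rather than output of the actual method.

```python
import numpy as np

def stcd_estimate(jacobians, i, j, goal="i"):
    """Time-average g applied to the (i, j) Jacobian element along an orbit.

    jacobians : array of shape (T, M, M), one Jacobian estimate per time step
    goal "i"  : g(u) = |u|, for deciding whether any dependence exists
    goal "ii" : g(u) = u,   for the time-averaged signed interaction strength
    """
    samples = jacobians[:, i, j]
    return np.abs(samples).mean() if goal == "i" else samples.mean()

# Hypothetical usage: declare a causal dependence of x_i on x_j when the
# Goal (i) estimate exceeds a small positive discrimination threshold.
rng = np.random.default_rng(0)
jacobians = rng.normal(size=(1000, 3, 3))   # placeholder Jacobian samples
threshold = 0.1                             # placeholder discrimination threshold
print(stcd_estimate(jacobians, 0, 1, goal="i") > threshold)
```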
As a demonstration of a situation for which the determination of STCD from observations of the motion of on its attractor is not possible, we note the case where the attractor is a fixed point (a zero-dimensional attractor). Here, the measured available information is the numbers that are the coordinates of the fixed point, and this information is clearly insufficient for determining STCD. As another problematic example, we note that in certain cases, one is interested in a dynamical system that is a connected network of identical dynamical subsystems and that such a network system can exhibit exact synchronization of its component subsystems26 (including cases where the subsystem orbits are chaotic). In the case where such a synchronized state is stable, observations of the individual subsystems are indistinguishable, and it is then impossible, in principle, for one to infer causal relationships between state variables belonging to different subsystems. More generally, in addition to the above fixed point and synchronization examples, we note that the dimension of the tangent space at a given point on the attractor is, at most, the smallest embedding dimension of the part of the attractor in a small neighborhood of . Thus, the full Jacobian of at cannot be precisely determined from data on the attractor when the local attractor embedding dimension at is less than , which is commonly the case. Thus, these examples motivate the conjecture that to efficiently and accurately infer STCD, the orbital complexity of the dynamics should be large enough so as to encode the information that we seek. Note that these considerations of cases where inference of STCD is problematic do not apply to situations with dynamical noise, e.g., , as the addition of noise may be roughly thought of as introducing an infinite amount of orbital complexity. Alternatively, the addition of noise increases the embedding dimension of the data to that of the full state space, i.e., .
III. USING RESERVOIR COMPUTING TO DETERMINE STCD
We base our considerations on a type of machine learning called reservoir computing, originally put forward in Refs. 27 and 28 (for a review, see Ref. 29). We assume that we can sample the time-series data from our system at regular time intervals of length so that we have a discrete set of observations . To begin, we first describe a reservoir-computer-based machine learning procedure in which the reservoir computer is trained to give an output in response to an -dimensional input as illustrated in Fig. 1.
Schematic of the reservoir computing architecture used in this work. The input-to-reservoir coupling matrix couples the input time series for the vector to the reservoir state vector . The reservoir-to-output coupling matrix generates the output vector from the reservoir. is found to be a good estimate of after training.
For our numerical tests, we consider a specific reservoir computer implementation (Fig. 1) in which the reservoir consists of a network of nodes whose scalar states, , are the components of the -dimensional vector .
The nodes interact dynamically with each other through an network adjacency matrix , and their evolution is also influenced by coupling of the -dimensional input to the individual nodes of the reservoir network by the input coupling matrix according to the neural-network type of evolution equation (e.g., Refs. 21–23 and 29–31),
where for a vector is defined as . For proper operation of the reservoir computer, it is important that Eq. (2) satisfies the “echo state property”21,27,29 (in nonlinear dynamics, this condition is also known as “generalized synchronization”32–34): given two different initial reservoir states, and , for the same input time series of , the difference between the two corresponding reservoir states converges to zero as they evolve in time [that is, as , implying that, after a transient initial period, essentially depends only on the past history of , for , and not on the initial condition for ].
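For concreteness, a minimal sketch of a reservoir update of the commonly used form r(t + Δt) = tanh(A r(t) + W_in x(t)) is given below; the sizes, sparsity, and weight ranges are placeholder choices rather than the values used in the experiments reported here, and the actual implementation may include additional scaling factors not shown.

```python
import numpy as np

def reservoir_step(r, x, A, W_in):
    """One reservoir update of the form r(t + dt) = tanh(A r(t) + W_in x(t))."""
    return np.tanh(A @ r + W_in @ x)

# Placeholder sizes: a reservoir of N nodes driven by an M-dimensional input.
N, M = 200, 3
rng = np.random.default_rng(1)
A = rng.uniform(-1.0, 1.0, (N, N)) * (rng.random((N, N)) < 0.02)  # sparse coupling
W_in = rng.uniform(-1.0, 1.0, (N, M))

r = np.zeros(N)            # initial reservoir state
x = rng.normal(size=M)     # one input sample
r = reservoir_step(r, x, A, W_in)
```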
Using measured input training data over a training interval of length , which begins after the initial transient period mentioned above, we use Eq. (2) to generate . We also record and store these determined values along with the corresponding inputs, , that created them. The matrices and are regarded as fixed and are typically chosen randomly. In contrast, the output coupling matrix , shown in Fig. 1, is regarded as an adjustable linear mapping from the reservoir states to an -dimensional output vector ,
“Training” of the machine learning reservoir computer then consists of choosing the adjustable matrix elements (“weights”) of so as to make a very good approximation to over the time duration of the training data. This is done by minimization with respect to of the quantity, . Here, , with small, is a “ridge” regularization term35 added to prevent overfitting, and are the previously recorded and stored training data. In general, is required in order to obtain a good fit of to . For illustrative purposes, we now consider the ideal case where (i.e., the training perfectly achieves its goal).
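The minimization described above has the standard ridge-regression closed form, W_out = X R^T (R R^T + βI)^{-1}, when the recorded reservoir states and the corresponding training inputs are collected as columns of matrices R and X. The sketch below assumes this standard form; the matrix names and the placeholder data are ours, not the paper's.

```python
import numpy as np

def train_output_weights(R, X, beta=1e-6):
    """Ridge-regression fit of W_out so that W_out r(t) approximates x(t).

    R : (N, T) matrix whose columns are the recorded reservoir states
    X : (M, T) matrix whose columns are the corresponding training data
    beta : small ridge regularization parameter
    """
    N = R.shape[0]
    return X @ R.T @ np.linalg.inv(R @ R.T + beta * np.eye(N))

# Placeholder training data.
rng = np.random.default_rng(2)
R = rng.normal(size=(200, 5000))
X = rng.normal(size=(3, 5000))
W_out = train_output_weights(R, X)
print(W_out.shape)   # (3, 200)
```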
For the purpose of estimating STCD, we now wish to eliminate the quantity from the basic reservoir computer system [Eqs. (2) and (3)] to obtain an evolution equation solely for the state variable . To do this, we would like to solve (3) for in terms of . However, since , the dimension of , is much larger than , the dimension of , there are typically an infinite number of solutions of (3) for . To proceed, we hypothesize that it may be useful to eliminate by choosing it to be the solution of (3) with the smallest norm. This condition defines the so-called Moore-Penrose inverse36 of , which we denote ; i.e., the minimum norm solution for is written as . We emphasize that is not necessarily expected to give the correct obtained by solving systems (2) and (3). However, from numerical results to follow, our choice will be supported by the fact that it often yields very useful estimates of .
Now, applying to both sides of Eq. (2) and employing to eliminate from the argument of the function in Eq. (2), we obtain a surrogate time map for the evolution of , , where . Here, we note that we do not claim that this map in itself can be used for time-series prediction in place of Eqs. (2) and (3), which were commonly used in previous works (e.g., Refs. 21–23,30, and 31). Rather, we use it as a symbolic representation of the result obtained after eliminating the reservoir state vector from Eqs. (2) and (3). In particular, the prediction recipe using Eqs. (2) and (3) is always unique and well defined, in contrast to the above map, where is clearly nonunique. Therefore, we use this map only for causality estimation purposes, as described below. Differentiating with respect to , we have
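Our reading of this construction is that the Jacobian of the surrogate map can be evaluated along the trained orbit by the chain rule, using the recorded reservoir state to evaluate the derivative of the tanh nonlinearity and the Moore-Penrose pseudoinverse to replace the reservoir state by the measured input. The sketch below follows that reading; it is an illustration under these assumptions, not a verbatim transcription of the authors' code.

```python
import numpy as np

def surrogate_jacobian(r_next, A, W_in, W_out, W_out_pinv):
    """Estimate of the Jacobian d x(t + dt) / d x(t) of the surrogate map.

    r_next is the recorded reservoir state r(t + dt) = tanh(A r(t) + W_in x(t)),
    so 1 - r_next**2 is the derivative of tanh evaluated at the relevant argument.
    W_out_pinv is the Moore-Penrose pseudoinverse of W_out.
    """
    D = np.diag(1.0 - r_next**2)
    return W_out @ D @ (A @ W_out_pinv + W_in)

# Placeholder matrices for illustration only.
rng = np.random.default_rng(3)
N, M = 200, 3
A = rng.normal(size=(N, N)) / np.sqrt(N)
W_in = rng.uniform(-1.0, 1.0, (N, M))
W_out = rng.normal(size=(M, N))
W_out_pinv = np.linalg.pinv(W_out)
r_next = np.tanh(rng.normal(size=N))
J = surrogate_jacobian(r_next, A, W_in, W_out, W_out_pinv)
print(J.shape)   # (M, M): element (i, j) estimates the dependence of x_i on x_j
```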
In our numerical experiments, the number of training time steps is for Figs. 2 and 3 and for Fig. 4. In each case, the actual training data are obtained after discarding a transient part of time steps, and the reservoir system sampling time is . The elements of the input matrix are randomly chosen in the interval . The reservoir is a sparse random network of nodes for Figs. 2 and 3 and of nodes for Fig. 4. In each case, the average number of incoming links per node is . Each nonzero element of the reservoir adjacency matrix is randomly chosen from the interval , and is then adjusted so that the maximum magnitude eigenvalue of is . The regularization parameter is . These parameters are adapted from Ref. 23. The average indicated in Eq. (1) is over time steps. The chosen time step is sufficiently small compared to the time scale over which evolves that the discrete time series is a good representation of the continuous variation of .
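The reservoir construction described in this paragraph (a sparse random adjacency matrix rescaled so that its largest eigenvalue magnitude takes a prescribed value, together with a uniformly random input matrix) can be sketched as follows; the numerical values used here are placeholders rather than the specific parameters quoted above.

```python
import numpy as np

def build_reservoir(N, M, avg_degree=3, spectral_radius=0.9, seed=0):
    """Sparse random reservoir adjacency matrix A and input matrix W_in.

    Nonzero entries of A are uniform in [-1, 1]; A is then rescaled so that
    its largest eigenvalue magnitude equals spectral_radius. W_in is dense
    with entries uniform in [-1, 1]. All numbers here are placeholders.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random((N, N)) < avg_degree / N        # ~avg_degree links per node
    A = rng.uniform(-1.0, 1.0, (N, N)) * mask
    A *= spectral_radius / np.max(np.abs(np.linalg.eigvals(A)))
    W_in = rng.uniform(-1.0, 1.0, (N, M))
    return A, W_in

A, W_in = build_reservoir(N=300, M=60)
```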
Results of Experiment 1 (noiseless case). Panels (a) and (b) show the results of link inferences for two noiseless cases for links and links. The inference is perfect in (a) but is very bad in (b). (c) vs for , and averaged over random realizations of the system and the reservoir adjacency matrix. (d) The orbital complexity as measured by the attractor information dimension decreases with increasing . Note that at each value of , we compute the for 10 random realizations of a network with links with . The Kaplan-Yorke dimension is then averaged over all network realizations, and the resulting plot is further smoothed by applying a moving average filter.
The effect of noise on STCD inference. Panels (a)–(c) show how increasing the dynamical noise variance greatly enhances the effectiveness of link identification even at the rather low noise level of . In contrast, as shown in panels (d)–(f), starting with the situation of (c) and increasing the observational noise variance degrades link identification. for all the subfigures here.
Results of Experiment 3. Panel (a) shows a pixelated, shade-coded portrait of Edward N. Lorenz and (b) reconstruction of (a) by our ML link inference technique. Note that, in (b), we plot all the values greater than or equal to as black and all the values less than or equal to as white.
Although we use a specific reservoir computing implementation, we expect that, with suitable modifications, our approach can be adapted to “deep” types of machine learning,2 as well as to other implementations of reservoir computing24,25,37,38 [notably, implementations involving photonics,24 electronics,37 and field programmable gate arrays (FPGAs)25].
IV. TESTS OF MACHINE LEARNING INFERENCE OF STCD
In order to evaluate the effectiveness of our proposed method, we introduce mathematical model test systems that we use as proxies for the unknown system of interest for whose state variables we wish to determine STCD. We next use the test systems to generate simulated training data from which we determine STCD by our ML technique. We then assess the performance of the technique by the correctness of its results determined from the known properties of the test systems.
We first consider examples addressing our Goal (i) [ in Eq. (1)], and for our simulation test systems, we consider the case of a network of nodes and links, where each node is a classical Lorenz system39 with heterogeneity from node to node, additive dynamical noise, and internode coupling,
The state space dimension of this system is . The coupling of the nodes is taken to be only from the variable of one node to the variable of another node with coupling constant , and is either 1 or 0 depending on whether or not there is a link from to . The adjacency matrix of our Lorenz network (not to be confused with the adjacency matrix of the reservoir) is constructed by placing directed links between distinct randomly chosen node pairs. For each node , is randomly chosen in the interval , and we call the heterogeneity parameter. Independent white noise terms of equal variance are added to the left-hand sides of the equations for , , and , where, for example, . For , each node obeys the classical chaotic Lorenz equation with the parameter values originally studied by Lorenz.39 Furthermore, denoting the right-hand side of Eq. (5) by , we have or 0, depending on whether there is, or is not, a link from to .
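A minimal simulation of a test system of this kind is sketched below, assuming the classical Lorenz parameters, coupling from the x variable of one node into the y equation of another, and Gaussian dynamical noise applied in an Euler-Maruyama step. The coupling strength, noise level, network, and the exact placement of the coupling term are our assumptions for illustration, and the node-to-node heterogeneity described above is omitted for brevity.

```python
import numpy as np

def lorenz_network_step(state, Adj, c, dt, noise_std, rng,
                        sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One Euler-Maruyama step of N coupled, noisy Lorenz nodes.

    state : (N, 3) array holding (x, y, z) for each node
    Adj   : (N, N) directed adjacency matrix; Adj[k, l] = 1 feeds the x
            variable of node l into the y equation of node k (our assumption)
    """
    x, y, z = state[:, 0], state[:, 1], state[:, 2]
    dx = sigma * (y - x)
    dy = x * (rho - z) - y + c * (Adj @ x)      # coupling enters the y equation
    dz = x * y - beta * z
    drift = np.stack([dx, dy, dz], axis=1)
    noise = noise_std * np.sqrt(dt) * rng.normal(size=state.shape)
    return state + dt * drift + noise

rng = np.random.default_rng(4)
N = 20
Adj = (rng.random((N, N)) < 0.1).astype(float)   # placeholder random directed links
np.fill_diagonal(Adj, 0.0)
state = rng.normal(size=(N, 3))
for _ in range(10000):                           # placeholder integration length
    state = lorenz_network_step(state, Adj, c=0.3, dt=0.01,
                                noise_std=0.01, rng=rng)
```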
Since in this case, the derivative is time independent, is also either or 0, and adopting the notation , we denote its machine learning estimate, obtained by our previously described procedure, by . For a reasonably large network, the number of ordered node pairs of distinct nodes is large, and we consequently have many values of . Bayesian techniques (see Ref. 40 and references therein) can be applied to such data to obtain an estimate for the total number of links , and one can then set the value of so that there are values of that are greater than . Less formally, we find that making a histogram of the values of often reveals a peak at zero and another peak at a higher positive value with a large gap or discernible minimum in between. One can then estimate by a value in the gap or by the location of the minimum between the peaks, respectively. For simplicity, in our illustrative numerical simulations to follow, we assume that is known [approximately equivalent to the case that is unknown, but that a very good estimate () has been obtained].
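The two threshold choices mentioned here (a histogram gap, or assuming the number of links is known) might be implemented as in the following sketch; both functions are heuristics of our own construction, with placeholder data.

```python
import numpy as np

def threshold_known_L(estimates, L):
    """With the number of links L assumed known, place the threshold so that
    exactly the L largest STCD estimates are classified as links."""
    srt = np.sort(np.asarray(estimates))[::-1]
    return 0.5 * (srt[L - 1] + srt[L])

def threshold_from_histogram(estimates, bins=50):
    """Heuristic threshold: the minimum of the histogram of the estimates
    between the peak near zero and the peak at larger values."""
    counts, edges = np.histogram(estimates, bins=bins)
    peak_hi = np.argmax(counts[bins // 2:]) + bins // 2
    gap = np.argmin(counts[1:peak_hi]) + 1 if peak_hi > 1 else 1
    return 0.5 * (edges[gap] + edges[gap + 1])

# Placeholder estimates: 400 node pairs, 50 of them "true" links.
rng = np.random.default_rng(5)
est = np.concatenate([np.abs(rng.normal(0.0, 0.05, 350)),
                      np.abs(rng.normal(1.0, 0.10, 50))])
print(threshold_known_L(est, L=50), threshold_from_histogram(est))
```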
Experiment 1 (a heterogeneous noiseless case)
We consider the parameter set , , , , and we vary the number of links . Figures 2(a) (for ) and 2(b) (for ) each show an array of 20 × 20 boxes, where each box represents an ordered node pair of the 20-node network, and the boxes have been colored (see Table I) according to whether the results of our procedure predict a link from to (“positive”) or not (“negative”) and whether the prediction is correct (“true”) or wrong (“false”).
TP (true positive) | Black square
TN (true negative) | White square
FP (false positive) | Blue square
FN (false negative) | Red square
We see that for a typical case with [Fig. 2(a)], all the boxes have been correctly labeled, corresponding to all boxes being either black or white. In contrast to this perfect result at , at [Fig. 2(b)], the method fails terribly, and the fraction of correct inferences is small. In fact, we find excellent performance for , but, as increases past 50, the performance of our method degrades markedly. This is shown in Fig. 2(c) where we give plots of the number of false positives (FPs) normalized to the expected value of FP that would result if links were randomly assigned to the node pairs . [We denote this normalization ; it is given by .] Note that, with this normalization, for the different heterogeneities plotted in Fig. 2(c), the curves are similar: they all begin increasing at around and become nearly (i.e., inference no better than random) past . In our earlier discussion, we have conjectured that, for inference of STCD to be possible, the orbital complexity should not be too small. To test this conjecture, we have calculated the information dimension of the network system attractor corresponding to the parameters, , , , , as a function of . We do this by calculating the Lyapunov exponents of the system [Eqs. (5)–(7)] and then applying the Kaplan-Yorke formula (sketched below) for in terms of the calculated Lyapunov exponents.41,42 The result is shown in Fig. 2(d), where we see that decreases with increasing . Regarding as a measure of the orbital complexity, this is consistent with our expectation that the ability to infer STCD will be lost if the orbital complexity of the dynamics is too small. As we next show, the above negative result for increasing past about does not apply even when small dynamical noise is present.
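The Kaplan-Yorke formula referenced above is standard: order the Lyapunov exponents, find the largest K for which their partial sum is still non-negative, and interpolate into the next exponent. A generic sketch, not specific to this paper's network system:

```python
import numpy as np

def kaplan_yorke_dimension(lyapunov_exponents):
    """Kaplan-Yorke estimate of the attractor information dimension.

    D_KY = K + (lambda_1 + ... + lambda_K) / |lambda_{K+1}|, where the
    exponents are sorted in decreasing order and K is the largest index
    for which the partial sum is still non-negative.
    """
    lam = np.sort(np.asarray(lyapunov_exponents, dtype=float))[::-1]
    partial = np.cumsum(lam)
    if partial[0] < 0:
        return 0.0                      # e.g., a stable fixed point
    K = int(np.max(np.nonzero(partial >= 0)[0])) + 1
    if K == lam.size:
        return float(lam.size)
    return K + partial[K - 1] / abs(lam[K])

# Classic single Lorenz system exponents give a dimension of about 2.06.
print(kaplan_yorke_dimension([0.9, 0.0, -14.6]))
```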
Experiment 2 (the effects of dynamical and observational noise)
We first consider the effect of dynamical noise of variance for the parameters (homogeneous), , , and . Results [similar in style to Figs. 2(a) and 2(b)] are shown in Figs. 3(a)–3(c). For extremely low dynamical noise variance [Fig. 3(a)], the result is essentially the same as for zero noise, and about one quarter of the boxes are classified as TP, TN, FP, and FN each (since there are 200 links and 400 boxes, this is no better than random assignment). As the noise variance is increased to [Fig. 3(b)], the results become better, with a fraction of 0.75 of the boxes either TP or TN [as opposed to 0.52 for Fig. 3(a)]. Upon further increase of the dynamical noise variance to the still small value of [Fig. 3(c)], the results typically become perfect or nearly perfect. Furthermore, excellent results, similar to those for , continue to apply for larger . This is shown by the red curve in Fig. 3(f), which shows vs (). Importantly, we also note that our normalization of FP by essentially makes the red curve -independent over the range we have tested, . Our interpretation of this dynamical-noise-mediated strong enhancement of our ability to correctly infer links is that the dynamical noise allows the orbit to explore the state space dynamics off the orbit’s attractor and that the machine learning is able to make good use of the information it thus gains.
We now turn to the effect of observational noise by replacing the machine learning time-series training data formerly used, , by , where the parameter is the observational noise variance and the are independent Gaussian random variables with, e.g., . The blue curve in Fig. 3(f) shows the effect of adding observational noise of variance on top of dynamical noise for the situation of Fig. 3(c). We see from Figs. 3(d)–3(f) that, when is below about , it is too small to have much effect, but, as is increased above , the observational noise has an increasing deleterious effect on link inference. This negative effect of observational noise is to be expected, since inference of characteristics of the unknown system is necessarily based on the part of the signal that is influenced by the dynamics of the unknown system, which the observational noise tends to obscure.
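The distinction between the two noise types, as we read it, is that dynamical noise perturbs the state evolution itself (as in the Euler-Maruyama step sketched earlier), whereas observational noise corrupts only the recorded training data. A minimal illustration with placeholder data:

```python
import numpy as np

rng = np.random.default_rng(6)
clean_series = rng.normal(size=(10000, 60))   # placeholder measured time series
obs_noise_var = 1e-3                          # placeholder observational noise variance
noisy_series = clean_series + np.sqrt(obs_noise_var) * rng.normal(size=clean_series.shape)
# The reservoir is trained on noisy_series; the underlying dynamics that
# generated clean_series are unchanged, unlike the dynamical-noise case.
```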
Experiment 3 (inferring continuous valued dependence strengths)
V. DISCUSSION
In this paper, we have formulated and tested a new, highly effective, machine-learning-based approach for inferring causal dependencies of state variables of an unknown system from time-series observations of these state variables. A key finding is that the effectiveness of our approach is greatly enhanced in the presence of sufficient dynamical noise, provided that the deleterious effect of observational noise is not too great. The competition between the opposing effects of these two types of noise will likely be the key factor determining the success or failure of causality inference in many of the most important situations of interest (e.g., in neuroscience and genomics). Much work remains to be done to more fully address the utility of our method. In particular, further numerical tests on diverse systems, and, especially, experimental studies in real world applications, will ultimately determine the circumstances under which the method developed here will be useful.
ACKNOWLEDGMENTS
This work was supported by the U.S. National Science Foundation (NSF) (Grant No. DMS 1813027). The authors acknowledge useful discussion with Sarthak Chandra, Amitabha Sen, Woodrow Shew, Nuno Martins, Adrian Papamarcou, Erik Bollt, and, especially, Brian Hunt.