This work presents a heuristic for selecting a time delay for embedding a dynamical system, based on optimizing the global maximum of mutual information in orthonormal coordinates. This criterion is demonstrated to be more robust than methods that utilize a local minimum, as the global maximum is guaranteed to exist in the proposed coordinate system for any dynamical system. By contrast, methods using local minima can be ill-posed: a local minimum can be difficult to identify in the presence of noise or may simply not exist. The performance of the global maximum and local minimum methods is compared in the context of causality detection via convergent cross mapping, using both a noisy Lorenz system and experimental data from an oscillating plasma source. The proposed heuristic for time lag selection is shown to be more consistent in the presence of noise and closer to an optimal uniform time lag selection.
The critical questions that arise when discussing methods based on delay embedding revolve around the selection of an appropriate dimension and delay. The choice of delay is topologically irrelevant in the noiseless, infinite data limit, but this is not the case when encountering real-world data. The complications induced by working with finite, discrete, noisy data motivate the use of information theoretic methods to select a time delay. To this end, the first minimum of mutual information (MI) heuristic has seen wide popularity. However, this heuristic relies on a trade-off between “spreading out” short-term data redundance and avoiding long-time irrelevance, falling short of a compelling information theoretic argument while also introducing computational challenges in the stable estimation of a local minimum. This work argues that maximizing mutual information is a more justifiable measure of embedding quality. It is demonstrated that transforming the delay-embedded data into the natural coordinate system of short-time principal components both eliminates short-time redundance in noisy data and saturates a dimension-dependent lower bound on mutual information due to finite data. Maximizing mutual information in these coordinates may still be suboptimal due to encoding inefficiency at intermediate time delays, yet the approach provides a lower bound on the maximum embedded channel capacity that is as computationally efficient as traditional histogram-based methods. Furthermore, it provides a more stable, conceptually justifiable delay estimate than the widely used first minimum criterion. The efficacy of the method is demonstrated to compare favorably to the traditional selection criterion, relative to brute force uniform delay optimization, in the construction of cross maps for noisy data from both a model dynamical system and real-world plasma probe data.
I. INTRODUCTION
Mathematical embeddings are crucial for nonlinear data analysis tools, such as those aimed at determining causality in a system. Specifically, time-delay embedding is a method to sufficiently “unfold” scalar measurements of a dynamical system so that distinct states of the original system remain distinct in the embedding. In order to maintain the system’s dynamics in the time-delayed coordinates, the underlying dynamics must be sufficiently smooth and time-invariant. It has been established in the literature that selecting the appropriate embedding dimension, $d$, and time delay, $\tau$, is crucial to the quality of the embedding, also termed the shadow manifold.1–5 The canonical time-delay embedding used by Sugihara (as well as similar methods) in a causality determination study relies on Takens’ theorem.6 Takens demonstrated that for sufficiently long signals without noise, attractors constructed from a sufficient number of time-delay coordinates for any nonzero delay lead to an embedding of the state-space dynamics of a finite-dimensional system.7–10 Note that for this work, delay-coordinate embedding, time-lag embedding, time-delay embedding, and embedding all refer to the same concept.
There are still some notable challenges in applying Takens’ results to practical usage of delay-coordinate embeddings. Takens’ theorem only considers sufficiently long signals with no noise, resulting in a diffeomorphic criterion that is indifferent to how the delay is chosen. For real-world systems, certain time delays are more accurate and efficient than others due to additional factors such as noise, measurement discretization, and finite sampling frequency. At one extreme, as the time lag approaches zero, every point in the system is uniquely matched to itself and the mutual information is maximized in what is known as redundance. Short, finite time lags decrease the value of the mutual information but can be corrupted by high frequency noise that obscures the underlying signal dynamics. At the other extreme, excessively increasing the time lag renders the signal and its lag statistically independent and minimizes the mutual information, leading to irrelevance.
Between these extrema, information theory can be employed to evaluate the information content of a temporal signal4,11 to select a suitable time lag that yields sufficient information about the system. In particular, mutual information (MI) is a cornerstone for existing and proposed time-delay selection methodologies.
The longest standing heuristic for determining the appropriate time lag has been the “first local minimum of mutual information” criterion, proposed by Shaw and popularized in the work of Fraser and Swinney, henceforth referred to as Fraser’s criterion (or method).12 Here, the mutual information between a signal and its time-delayed counterpart is calculated over a swept range of time delays, with the Fraser time delay corresponding to the first “significant” local minimum. The prevailing argument behind this approach is the avoidance of inflated correlation values due to redundant information in the limit of short time lags (local minimum) while maintaining a relevant characteristic timescale of the dynamics (first). In other words, this choice offers a balance between the redundance of small time delays while also eschewing the irrelevance of larger time delays.1 Furthermore, this minimum indicates the first point at which the embedded data ceases to “spread out” for increasing time lags, i.e., a longer time lag will begin to condense the data again.
As an alternative to Fraser’s method of the first local minimum of mutual information, this work proposes choosing the time delay corresponding to the global maximum of total correlation (a high-dimensional generalization of mutual information). By rotating the data into an orthogonal coordinate system for uniform time-delay embeddings and applying non-uniform binning along each independent coordinate, it is shown that redundance can be completely eliminated. Upon eliminating redundance, selecting an embedding that maximizes mutual information simply corresponds to seeking the most compact clustering of noisy data near a lower-dimensional manifold embedded in high dimensions. For brevity, this will be referred to as the maximum information in transformed coordinates (MITC) method or criterion herein. The method ensures that the measured quantity remains significant given finite data, and its computational cost scales only linearly with the embedding dimension. This is particularly important when estimating high-dimensional information theoretic quantities. Although this method only finds an appropriate time lag, it can be paired with Cao’s algorithm13 or other methods to find the embedding dimension.
One of the predominant use cases where selecting the proper time delay significantly impacts the quality of results is the study of causality. Various methods have been proposed and investigated over the years to determine causality in dynamical systems14–16 to identify physical relations between dynamics or develop predictive capabilities for a given system. In particular, convergent cross mapping (CCM) will be utilized to compare the different time lag selection criteria. By mapping from the embedding of one signal onto another, CCM seeks to determine the causal relationships between two or more signals. Initial attempts to use Fraser’s method in conjunction with CCM for real-world data yielded inconclusive results despite a priori knowledge of the existence of causal relations in the data, which will be demonstrated in Sec. III B. The proposed method is shown to have superior robustness to noise and data sparseness when applied to convergent cross mapping.
Much work has been carried out to provide alternative heuristics and methods for determining an appropriate time lag (and, in certain cases, the embedding dimension). The methods proposed in Refs. 2, 5, and 17–21 provide a variety of approaches which trade off complexity and robustness for computational efficiency. Detailed reviews of each of these methods can be found in the existing comparison papers.5,17 In spite of numerous choices, Fraser’s method continues to be widely used due to its simplicity in implementation and computational cost. Hence, a viable replacement must produce accurate results in a comparable amount of time; in particular, it should not take longer than a “brute force” approach that tries every possible time lag. To that end, the proposed method scales as $O(d\,n\log n)$, where $n$ is the number of points in the analyzed signal and $d$ is the embedding dimension.
Contrasting the proposed method to the trade-offs of recent methods explored in Ref. 5 is worth particular note. The first distinction is that Ref. 5 primarily focuses on non-uniform embedding. While it is true that searching a multi-dimensional space of lags that includes uniform embedding is certain to provide equivalent or better performance, the search can easily become prohibitively computationally expensive. The fastest non-uniform method tested, maximizing derivatives on projection (MDOP),22 overcame this challenge using a sequential greedy optimization strategy. However, any greedy optimization naturally trades global optimality for computational efficiency. While the work here considers only uniform embeddings to leverage the natural geometry of smooth dynamics, it is worth noting that each such sweep could also be integrated into a progressive greedy decomposition strategy. This is likely to be of particular value for inherently multi-scale systems.
The second, more critical, distinction revolves around the fundamental role of noise as it relates to densification of embedded samples. Densification is required for demonstrating asymptotic convergence in CCM. In Ref. 5, it is shown that the MDOP method also provided relatively superior stability in lag identification for their noisy Lorenz case. MDOP was stable down to 17 dB signal-to-noise compared to the SToPS and PECUZAL methods. They further noted that MDOP was the only method to provide stable lag estimates below 17 dB. This is particularly noteworthy as MDOP, though premised on efficiently overcoming redundance as we also propose, fundamentally relies on notions of near-neighbor stability. Near-neighbors are inherently inconsistent with finite-amplitude noise, where sampled points near the embedded manifold are expected to correspond to overlapping distributions of sample density. At low noise levels relative to sample density, this is only a mild confounding influence. As samples near the embedded attractor become denser, as in CCM convergence studies, there exists a scale at which noise dominates the inter-sample distances. The relationship of these noise volumes to the optimal encoding of symbols on a surface motivates revisiting lag selection from a purely information theoretic perspective.
The rest of the paper is organized as follows: A presentation of mutual information and a robust approach to time lag selection, along with the benefits over Fraser’s criterion, are presented in Sec. II. The desired use case, convergent cross mapping, is presented in Sec. III along with examples using the canonical Lorenz system as well as an experimental plasma dataset.
II. METHODOLOGY
This section will present the process of selecting a time lag that maximizes the mutual information between two signals. To do this, the system is rotated onto coordinate axes which “spread out” the attractor, eliminating redundance caused by short-time delays in noisy data. From there, the axes are stretched to maximize the use of the phase space. Finally, the mutual information, or total correlation, is obtained based on this rotated, stretched representation. The new axes enable a time lag selection that is stable in the presence of noise while maintaining relatively low computational cost. The section concludes with a basic overview of the convergent cross mapping method used in evaluating embedding performance.
A. Lorenz system example
B. Orthogonalization and equiprobable binning
For any shadow manifold created through delay embeddings of a particular scalar signal, mutual information remains unchanged under homeomorphisms (as shown in the appendix of Ref. 24). Therefore, a simple rotation of the system or stretching of coordinates would have no effect on the discrete estimate of mutual information in the data as long as the same sample points fall within the equivalent discrete transformed partition boundaries encoding the discrete symbols. However, it is shown here that the short-time-lag ($\tau \to 0$) redundance that should be avoided in Fraser’s method can be effectively eliminated with an alternative discrete encoding.
Reference 25 demonstrated that a rotation can be defined by the orthogonal eigenvectors of the covariance matrix of the system and is equivalent to the principal component analysis algorithm for decomposing multidimensional data into linearly independent coordinates. This rotation can be expressed in terms of the discrete Legendre polynomials, which transform the standard time-delay coordinates. The use of orthogonal coordinates in this work refers to this transformation. Comparison of Figs. 1(b) and 1(c) demonstrates the impact of this transformation for a minimal single-step time delay for the Lorenz system. From these figures, it is evident that the high mutual information of redundance as measured for low $\tau$ with Fraser’s method is an artifact of the choice of coordinates. Sufficient non-redundant information is still contained in the signal along the diagonal of the delay embedding in Fig. 1(b). This is not unexpected, as derivative embeddings can also be diffeomorphic to the original state space,9 and accurate estimation via finite differences requires only infinitesimal time in the absence of noise.
As will be seen in Fig. 3, this allows for the calculation of mutual information in the small-$\tau$ limit by avoiding the auto-correlation of measurements without significantly impacting long-$\tau$ information estimates. As will be shown in Fig. 2, rotating the embedded data into orthogonal coordinates allows for the dimensions of the phase portrait to be scaled independently of the time lag, effectively “spreading out” the attractor to eliminate redundance. While this dramatically decreases the estimated mutual information at short $\tau$, it can be seen that this change is the result of an alternative encoding into higher aspect-ratio partitions aligned along the diagonal in the original coordinates. One could theoretically resolve equivalent mutual information estimates in the two coordinate systems with sufficient data to allow for mutually consistent discrete density estimates. However, the original coordinates are a particularly poor coordinate system for achieving good density estimates because they require sufficient partition resolution to resolve the spread of the data transverse to the diagonal between two samples arbitrarily closely spaced in time. This, in turn, would require asymptotically large numbers of samples to overcome variance errors in the resulting density estimate as the partition size approached zero.
Note that the three-dimensional transformation coordinates mimic the structure of the central average, centered first derivative, and centered second derivative. In higher dimensions, the highest order vector becomes the finite difference approximation for that dimension while all other vectors blend toward shapes reflecting the continuous Legendre polynomials similar to short-time principal component analysis (PCA).
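As a concrete check of this correspondence, the short sketch below (NumPy assumed; all names are illustrative) verifies that the three-dimensional rotation matrix listed in Appendix E is exactly the Gram–Schmidt orthonormalization of the monomials $1, k, k^2$ on the centered stencil $k = -1, 0, 1$, i.e., the discrete Legendre basis whose rows mimic the average, first derivative, and second derivative:

\begin{lstlisting}[language=Python]
import numpy as np

# Orthonormalize the monomials [1, k, k^2] on the centered stencil k = -1, 0, 1
k = np.array([-1.0, 0.0, 1.0])
monomials = np.vstack([k**0, k**1, k**2])

basis = []
for row in monomials:
    for e in basis:                       # Gram-Schmidt: remove earlier components
        row = row - (row @ e) * e
    basis.append(row / np.linalg.norm(row))
legendre = np.vstack(basis)

# The dimensions == 3 matrix from Appendix E
a, b, c = np.sqrt(2), np.sqrt(3), np.sqrt(6)
rotation = np.array([[1/b, 1/b, 1/b],
                     [-1/a, 0.0, 1/a],
                     [1/c, -2/c, 1/c]])

print(np.allclose(legendre, rotation))    # True: rows are average, slope, curvature
\end{lstlisting}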
The similarity to a filtered derivative embedding can explain the behavior of Figs. 1(d) and 1(e), where Gaussian noise at 17 dB SNR has been added to the signal. While in the noiseless case the information of the embedded manifold is retained along the diagonal, relatively small magnitude, high frequency noise dominates the derivative estimates at short $\tau$, as seen in Fig. 1(e). The magnitude of this noise, rather than the redundance, results in poor localization of near-neighbors on the embedded manifold as used in methods like CCM. This motivates the choice of a delay selection criterion that is a pure information theoretic maximization of the signal-to-noise ratio (SNR) under the assumption of an underlying smooth, lower-dimensional, finite variance signal consistent with the assumptions inherent to both CCM and the convergence of short-time PCA to discrete Legendre coordinates.
The resulting shadow manifolds can be discretized into probability density bins as a time-lagged phase portrait (TLPP) in this rotated frame.27 This is a visual representation of the joint probability mass functions used in the discrete information estimates. Such discretization can be performed using uniform bins, where the bin size along each axis is set by the range of the data on that axis of the TLPP.
For this work, equiprobable bins with uniform marginals of the joint distributions are used instead. These would result in uniform joint distributions and zero MITC for any delay embedding of infinite-dimensional, noise-dominated signals. However, for a finite number of sample points, the joint distribution of the noise deviates from uniform in a predictable way that allows for the estimation of an expected noise floor as a function of the amount of data and the number of partition bins used.
Note that the $k \geq 2$ terms of the Poisson occupancy distribution have been dropped due to the sparse-bin limit, $\lambda = n/N_b^d \ll 1$, in which nearly every occupied bin holds a single sample. Thus, the total correlation of the noise is given by $C_{\text{noise}} \approx d\log_2 N_b - \log_2 n$, where $n$ is the number of sample points and $N_b$ is the number of bins per dimension.
Equiprobable bins for MITC estimation are created to house an equal number of points, leading to non-uniform bin sizing. This is accomplished by sorting the data along each axis and setting the edge of each bin to coincide with every $\lfloor n/N_b \rfloor$th observation. This dimensional sorting results in the dominant computational cost of the algorithm, $O(d\,n\log n)$, which is linear in the embedding dimension $d$. However, for chaotic noisy signals, the applied orthogonalization and equiprobable binning inherently saturate the noise floor of the system at both short and long $\tau$ such that the global maximum will always be above the noise floor of a finite sample of a uniform distribution.
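A minimal sketch of this binning and the resulting floor is given below, assuming NumPy and the sparse-limit expression reconstructed above; the double-argsort rank trick mirrors Appendix B, and applying it to pure i.i.d. noise should sit at the predicted floor:

\begin{lstlisting}[language=Python]
import numpy as np

rng = np.random.default_rng(0)
n, Nb, d = 100_000, 64, 4

# Rank trick: bin edges fall on every (n/Nb)th sorted observation, so each
# marginal is exactly uniform (the same trick as Appendix B, per coordinate)
x = rng.standard_normal((n, d))                    # i.i.d. noise "embedding"
bins = np.argsort(np.argsort(x, axis=0), axis=0) * Nb // n

# Joint entropy from bin occupancies; each marginal contributes exactly log2(Nb)
_, counts = np.unique(bins, axis=0, return_counts=True)
p = counts / n
total_correlation = d * np.log2(Nb) + np.sum(p * np.log2(p))

print(f"measured floor: {total_correlation:.2f} bits, "
      f"sparse-limit estimate: {d*np.log2(Nb) - np.log2(n):.2f} bits")
\end{lstlisting}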
The X-coordinate data of the Lorenz system are employed to demonstrate the difference between the original embedding and one that has been orthogonalized with equiprobable bins. The resulting TLPPs have noticeably different bin occupancy. The non-rotated, standard TLPP of Fig. 2(a) spans all 64 bins in each direction but has large clusters of empty bins in each of the four corners. In comparison, the rotated TLPP of Fig. 2(b) better aligns the plot along the axes and reduces the fraction of empty bins. As stated previously, this transformation does not alter the mutual information obtained from the data but does significantly alter the discrete estimate of it.
C. Mutual information and total correlation
Figure 3(b) demonstrates how the entropy and mutual information for the system are saturated for $H(\bar{x})$, $H(v)$, and $H(\bar{x})+H(v)$ in the transformed coordinates. The improved use of phase space by orthogonalization allows for this saturation, unlike in the standard view, which has a far greater number of bins with zero or one observation per bin. Furthermore, while the joint entropy of $(x, x_\tau)$ and the mutual information of the same pair reach toward the one-dimensional saturation value of 6 bits in the standard view as $\tau \to 0$, Fig. 3(a), these values are nearly constant for all time lags in the orthogonalized view. The respective time lags for both the first local minimum and global maximum of mutual information are evident in each plot. The corresponding time lags for these values (16 and 22 time steps, respectively) were then used to create the time-lagged phase portraits in Fig. 2.
D. Rationale for mutual information maximization
Define a noisy scalar measurement of the state $x(t)$ to be $y(t) = g(x(t)) + \epsilon(t)$ and let its time delay be $y_\tau(t) = y(t-\tau)$. The measurement function $g$ is continuously differentiable and $\epsilon$ is independent and identically distributed noise. Two time series, $Y = \{y(t_i)\}$ and $Y_\tau = \{y(t_i - \tau)\}$, can be constructed in a similar fashion as $X = \{x(t_i)\}$. Under the assumption that the time series have captured the statistics of $x$, these statistics will be invariant to time shifts due to the autonomous dynamics. Consequently, $Y$ and $Y_\tau$ have the same entropy because they are the same measurements of $x$ offset by a delay.
Takens’ theorem (extended upon by Sauer, Yorke, and Casdagli) guarantees that the delay vector $\mathbf{y}(t) = (y(t), y(t-\tau), \ldots, y(t-(d-1)\tau))$ is an embedding of $x$ on the conditions that (a) the measurements are smooth and generic, (b) the time delay $\tau > 0$, and (c) the embedding dimension $d > 2d_A$, where $d_A$ is the box-counting dimension of the attractor.10 As a consequence of $\mathbf{y}$ being an embedding of $x$, the maximal mutual information $I(\mathbf{y}; x)$ is achieved.
If some time delay results in $I(\mathbf{y}; x)$ below this maximum, the channel capacity is not met and some components of $\mathbf{y}$ resemble noise relative to $x$. In other words, $\mathbf{y}$ can be better distributed by varying $\tau$ so that the mutual information is higher. The $\tau$ that maximizes mutual information is optimal and should allow for a better, less noisy representation of $x$ from $\mathbf{y}$.
The inequality in Eq. (11) is useful in that $I(Y; Y_\tau)$ will always take on values between $0$ and $H(Y)$. Equation (12) then gives an interpretation and justification of maximizing $I(Y; Y_\tau)$: doing so leaves the smallest gap $H(Y \mid Y_\tau)$ for the unexplained uncertainty to inhabit, allowing $Y_\tau$ to be more representative of $Y$. However, it could also be the case that pursuing a higher $I(Y; Y_\tau)$ might push $I(\mathbf{y}; x)$ away from its maximum instead. There is no reason (neither for nor against) to believe that such a scenario will happen, but it remains a possibility that this rationale cannot rule out.
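Since Eqs. (11) and (12) do not reproduce in this text, the presumed content, consistent with the surrounding argument, is the standard bound and entropy decomposition (a sketch, not a quotation of the original equations):
\[
0 \le I(Y; Y_\tau) \le H(Y), \qquad H(Y \mid Y_\tau) = H(Y) - I(Y; Y_\tau),
\]
so that maximizing $I(Y; Y_\tau)$ over $\tau$ directly minimizes the residual uncertainty $H(Y \mid Y_\tau)$ that the delayed signal leaves about the original.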
E. Convergent cross mapping overview
III. RESULTS
To show the benefits of the proposed time lag choice on convergent cross mapping, two systems are analyzed using the MITC and Fraser methods. To fairly compare with the ubiquitous Fraser time lag, the first local minimum of the total correlation is taken from the standard view (un-rotated and uniform bins), while the global maximum is obtained from the orthogonal view using equiprobable bins. Although Fraser’s method is not generally applied using the total correlation, it is used for these comparisons in order to keep the other embedding parameters equal, and is referred to as “Fraser’s method” for brevity. Furthermore, Fraser and Swinney suggested generalizing their method to higher dimensions as future work.12 For completeness, the same results were obtained in 2D using mutual information, and though the time lags selected varied slightly between the 2D MI and 4D total correlation first-minimum criteria, the CCM results were nearly identical (not shown here) when using the 2D criterion as the full delay embedding stencil width. After a brief description of convergent cross mapping, the first comparison uses the Lorenz system described in Sec. II to demonstrate a system where the causal links are known a priori, in the limit of few data points. The second comparison uses experimental data from the Electric Propulsion Test & Evaluation Methodologies for Plasma in the Environments of Space and Testing (EP TEMPEST) program to validate the method on a real-world system with finite data and a higher number of signals. This system consists of a Hall-effect thruster inside a cage, generating 11 time-dependent diagnostic signals sampled at 25 MHz for 5 ms. Further details of this experiment can be found in Ref. 33.
A. Noisy Lorenz system
To demonstrate the viability of the MITC time lag selection criterion, white Gaussian noise at a signal-to-noise ratio (SNR) of 25 dB is superimposed on the Lorenz system described in Sec. II. The asymptotic CCM performance and robustness are then studied at this noise level. Finally, to compare with the behavior of several modern time lag selection criteria applied to noisy data, the noisy Lorenz cases of Ref. 5 are revisited, exploring the behavior of MITC at the progressive SNR levels of 30–17 dB used therein and extending down to 7 dB.
1. Total correlation
Using a four-dimensional view with 16 bins per dimension, the total correlation calculations are plotted in Fig. 4 for $X$, $Y$, and $Z$, with the time lag choices provided by the Fraser and MITC methods denoted by $\tau_F$ and $\tau_M$, respectively. Note that the sequentially first local minimum is plotted in this figure; the significance of this statement will be discussed in Sec. III A 3. The number of dimensions and number of bins were selected subjectively, and further optimization is left for future work. Nevertheless, Table I shows that the time lags have smaller variations when selecting the MITC compared to the Fraser time lag.
Signal | Fraser method | MITC method
---|---|---
X | 0.32 | 0.32
Y | 0.59 | 0.14
Z | 0.30 | 0.15
X | 0.61 | 0.22
Y | 0.59 | 0.22
Z | 0.31 | 0.22
X | 0.61 | 0.36
Y | 0.62 | 0.23
Z | 0.62 | 0.17
The marked time lag choices in Fig. 4 show that the MITC method consistently yields a smaller time lag than the Fraser method. An initial glance would indicate that the shorter time lag selected by the MITC method could still be contaminated by redundancy and high frequency noise, but the rotation of the data in this method eliminates these concerns by spreading out the data in phase space. While difficult to determine in these results, a further benefit of the proposed method is that it does not cross the noise floor for a given system whereas the local minimum method sometimes does. This will be further discussed at the end of this example where the implications are more evident.
Regardless of the signal, both methods are shown to have similar decay rates and asymptote at nearly identical values of total correlation as the time lag increases, corresponding to the random independent samples due to chaos, indicating that the Lorenz system has a clearly defined characteristic time scale for irrelevance (the asymptotic limit).
2. Convergent cross mapping
The resulting cross mapping ability of each time lag choice is presented in Fig. 5 by performing CCM between each pair of signals in the Lorenz system as the dataset length is reduced. Due to the coupled ODEs that make up the Lorenz system, the three state variables, $X$, $Y$, and $Z$, are causally related. Thus, applying CCM in the limit of an infinite number of data points should approach a correlation value of nearly 1 at such a modest noise level. An ideal cross map would be expected to saturate at a correlation of approximately 0.9984 based on Eq. (4) of Ref. 34.
As each pair of signals has two associated time lags for each method (one for each signal), the shorter of the two time lags is chosen. For example, for a given mapping, the shorter of the two selected lags is used to create the shadow manifold for the local minimum method. Although this may result in redundancy in the variable with the longer indicated lag, this is superior to the ambiguity and irrelevance resulting from the shorter-lag signal folding back upon itself at longer time lags. Furthermore, it was observed that the shorter of the two lags more frequently approached the lag characteristic of the brute force optimal solution found by performing CCM across all possible time lags and selecting the one that returned the highest Pearson correlation. Note that while for this particular test case the global maximum method returned identical time lags for all three signals, this is only a coincidence of the particular noise realization used.
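For concreteness, a minimal sketch of the simplex-projection cross map underlying these comparisons is given below; `delay_embed` and `ccm_skill` are illustrative names rather than the authors' implementation, and the $O(n^2)$ neighbor search is kept naive for clarity:

\begin{lstlisting}[language=Python]
import numpy as np

def delay_embed(x, d, tau):
    # Rows are (x[t], x[t-tau], ..., x[t-(d-1)tau]) for all valid t
    n = len(x) - (d - 1) * tau
    return np.column_stack([x[(d - 1 - j) * tau:(d - 1 - j) * tau + n]
                            for j in range(d)])

def ccm_skill(x, y, d, tau):
    # Pearson correlation between y and its estimate from the shadow manifold
    # of x, using simplex projection with d+1 nearest neighbors
    Mx = delay_embed(x, d, tau)
    y_aligned = y[(d - 1) * tau:]
    y_hat = np.empty(len(Mx))
    for i, point in enumerate(Mx):
        dist = np.linalg.norm(Mx - point, axis=1)
        dist[i] = np.inf                          # exclude the self-match
        nn = np.argsort(dist)[:d + 1]
        w = np.exp(-dist[nn] / max(dist[nn[0]], 1e-12))
        y_hat[i] = np.sum(w * y_aligned[nn]) / np.sum(w)
    return np.corrcoef(y_aligned, y_hat)[0, 1]
\end{lstlisting}

Convergence would then be assessed by repeating the call on progressively longer library segments, with the shorter of the two candidate lags passed as `tau`.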
Both methods are shown to have similar decay rates as the number of points mapped decreases, an effect of the inherent dynamics of the Lorenz system and how much of the system is captured by a diminishing signal length. It is expected that the $X$–$Y$ reconstructions, Figs. 5(a) and 5(b), in both directions will return the highest cross mapping correlations due to the similarity in attractor shape as viewed from each signal, accounting for the minimal decrease in the correlation coefficient for less data. The better ability of $Y$ to map $Z$ in the limit of sparse data, Fig. 5(c), compared to $X$, Fig. 5(d), is attributed to a wider spread in the data when viewed from the $Y$ axis rather than the $X$ axis, making the use of the nearest neighbor technique more challenging in $X$. It should be noted that for the Lorenz system, cross mapping can be performed from the $X$ or $Y$ signals to reconstruct the $Z$ signal, but not vice versa, due to the topological differences between the attractors. The canonical view of the Lorenz system on the $Z$ axis undergoes a noninvertible 2:1 projection of the attractor that obfuscates true near-neighbors for reconstructing the $X$ or $Y$ signals. Hence, cross mapping from Lorenz-$Z$ is expected to yield ineffective results regardless of lag choice and will not be presented as part of this work.
In the limit of dense attractors for all cross mappings, both the MITC and Fraser methods are shown to correctly saturate to nearly identical, high Pearson correlation coefficient values in Fig. 5. The similarity in the maximum Pearson correlation coefficients regardless of the time lag used is expected in the limit of infinite data due to the densification argument in CCM for the Lorenz system. However, it is worth mentioning that the MITC method approaches this value faster than the Fraser method in every case. Furthermore, while all the curves asymptote toward the 25 dB limit of 0.9984, converting the final CCM values at the largest number of sample points back to effective signal-to-noise ratios, SNR$_{\mathrm{eff}}$, as shown in Table II, demonstrates an improvement factor of 1.7–5.6 in cross map error. These initial studies demonstrate a higher robustness to noise and number of points used for the MITC method, and so the method was applied to a noisy real-world system.
Map | Fraser skill ρ (dB) | MITC skill ρ (dB) | Improvement
---|---|---|---
Y → X | 0.9811 (14.11) | 0.9959 (20.84) | 4.71×
X → Y | 0.9685 (11.80) | 0.9942 (19.32) | 5.65×
Y → Z | 0.9844 (14.95) | 0.9910 (17.39) | 1.75×
X → Z | 0.9882 (16.19) | 0.9931 (18.56) | 1.72×
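The effective SNR values in Table II are consistent with interpreting $\rho^2$ as the fraction of target variance recovered by the cross map, i.e., $\mathrm{SNR}_{\mathrm{eff}} = 10\log_{10}[\rho^2/(1-\rho^2)]$; this form is inferred from the tabulated values rather than quoted from Ref. 34, but it reproduces them to roughly 0.01 dB:

\begin{lstlisting}[language=Python]
import numpy as np

def skill_to_snr_db(rho):
    # Effective SNR assuming rho^2 of the target variance is recovered
    rho = np.asarray(rho, dtype=float)
    return 10.0 * np.log10(rho**2 / (1.0 - rho**2))

# Table II values reproduce to ~0.01 dB, e.g., 0.9811 -> 14.11 dB
print(skill_to_snr_db([0.9811, 0.9959, 0.9685, 0.9942]))

# The 25 dB noise ceiling corresponds to rho ~ 0.9984
print(np.sqrt(10**2.5 / (1 + 10**2.5)))
\end{lstlisting}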
3. Robustness
Fraser’s method presents an implementation challenge due to the way it was developed, impacting its robustness in the presence of noise. As can be seen in Fig. 1 of Ref. 12, the sequentially first local minimum is not selected as the candidate time lag; instead, the first “significant” local minimum is chosen. Although the authors do not elaborate on what constitutes a “significant” minimum, the intention is to minimize the method’s sensitivity to noise. Such a selection is system dependent and requires detailed effort on the part of the user.
To illustrate this, Fig. 6 depicts the results of a Lorenz system run with two different random noise generator seeds. For each seed, the resulting sequentially first local minimum was calculated, revealing time lag selections that were inconsistent by 30 time steps or more, as shown by the pair of test cases in Figs. 6(a) and 6(b). Conversely, the global maximum does not change with different noise seeds.
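The fragility of the “first significant minimum” rule is easy to reproduce with any peak finder; in the sketch below (SciPy assumed), the `prominence` argument stands in for the unspecified significance threshold, and a synthetic total-correlation curve with a single shallow noise dip flips the selected lag unless that threshold is hand-tuned:

\begin{lstlisting}[language=Python]
import numpy as np
from scipy.signal import find_peaks

def first_minimum(lags, curve, prominence=0.0):
    # Lag of the first local minimum of a total-correlation curve; prominence
    # stands in for Fraser's unspecified "significance" filter
    minima, _ = find_peaks(-np.asarray(curve), prominence=prominence)
    return lags[minima[0]] if len(minima) else None

# Synthetic curve: smooth decay with a genuine first minimum near lag 33
lags = np.arange(1, 101)
curve = 2.0 * np.exp(-lags / 15.0) + 0.5 * np.cos(2 * np.pi * lags / 60.0) + 0.5
curve[7] -= 0.2                           # small noise-induced dip at lag 8

print(first_minimum(lags, curve))         # 8  -- latches onto the noise dip
print(first_minimum(lags, curve, 0.1))    # 33 -- needs a hand-tuned threshold
\end{lstlisting}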
4. Total correlation and noise floor
A particular advantage of using the MITC is the guarantee that the selected time lag will result in a total correlation above the expected entropy noise floor.
Any system with total correlation below this noise floor is indistinguishable from random noise and, therefore, useless. This is a significant concern for systems of high dimensionality with sparse data, where the first local minimum may exist below the noise floor.
The noisy Lorenz system discussed in this section meets this criterion for the available sample points separated into 64 bins in each of the four embedding dimensions. The resulting noise floor compared to the Fraser and MITC time lag selections is presented in Fig. 7. While it is evident that the Fraser selection lies below this noise floor, numerical analysis was performed to verify that the MITC never crosses it. The MITC is only equal to the noise floor at $\tau = 0$, though it does asymptote toward the noise floor as $\tau \to \infty$.
5. Stability with large noise amplitude
Figure 8 shows the behavior of the MITC in 3D with progressive noise levels from Ref. 5 down to 7 dB. For a more direct comparison between the presented method and that in Ref. 5, these cases were integrated with the time step from Ref. 5 out to 100 s. The solution was decimated to a sampling interval of $4\times 10^{-3}$ s, resulting in only 25 000 sample points for evaluation. Figure 8 demonstrates the relative stability of lag estimates regardless of the noise level. Note that all noise levels asymptote at short $\tau$ toward the noise floor predicted for this data length and bin count, demonstrating the tightness of the noise floor estimate. The optimal lag drifts toward longer lags before stabilizing for all noise levels past the 17 dB case. Figure 9 shows the noise clustering near the manifold for this MITC embedding in 3D, to contrast with Fig. 1(e).
The 3D embedding dimension was selected to facilitate plotting the noisy attractor, due to the intractability of plotting higher dimensions. The falloff toward the noise floor at large $\tau$ is more apparent for the 4D embedding used at the beginning of this section, resulting in a sharper, more consistent peak of MITC beyond this potentially minimal embedding dimension. The MITC curves for the Takens-guaranteed 5D embedding of these noisy Lorenz cases, shown in Fig. 10, demonstrate this effect. While the actual computational estimation of MITC curves scales well to higher dimensions, the noise floor suffers from the dimension-dependent multiplicative growth of histogram bins. This drives the noise floor higher, forcing MITC peaks to exist above the noise floor only for coarse binnings, due to the finite data in higher dimensions. However, the peak lag value is relatively stable. When dominated by noise or irrelevance, data requirements are quite large in high dimensions to overcome the Poisson noise inherent in estimating a uniform sample density in a high-dimensional volume. Near the MITC peak, the data are maximally clustered near the lower-dimensional embedded manifold. This means that the number of nonempty bins, while still filling a higher-dimensional volume around the embedded manifold based on the noise kernel width, deviates most significantly from the uniform distribution, making the estimate at the MITC peak more accurate and stable than far from it, even in high dimensions.
B. EP TEMPEST
Having used the Lorenz system to demonstrate that the capabilities of the MITC method are commensurate with the Fraser method, a real-world system with more intricately connected signals of unknown causality in the presence of complex noise sources is presented.
The EP TEMPEST test article at the Air Force Research Laboratory is an isolated, instrumented confinement cage used to study the behavior of Hall-effect thrusters. Various diagnostics were located throughout the cage and back beam dump to capture measurements of the plasma plume as a function of spatial dimension. A total of 11 measurement signals were captured as a part of this test, including the cathode, anode, beam dump, back wall, and five of the rings. The other two signals, the total cage and anode + cathode, are combinations of the original signals. The data collected during this study can be used to understand the relationships between thruster operational signals and the impact of the plume on the rings.34 These data have also previously been used to test the performance of shadow manifold-based interpolation in comparison to the prior FFT-based cross mapping.33 Figure 11 shows a diagram of the experimental setup with a calculated plume of the thruster. The locations of plume impingement on the walls of the confinement cage demonstrate which rings are expected to be affected the most.
During post-processing of the test data, it was observed that the thruster dynamics were dominated by an approximately one-dimensional “breathing-mode” limit cycle for the particular operating conditions used in this paper. This quasi-periodic oscillation in the discharge current is commonly observed in Hall thrusters.35 The approximately one-dimensional nature of this breathing mode suggests that a minimum of three dimensions is required for the time-delay embedding to satisfy Takens’ theorem. It is noted that the results of the CCM technique were consistent between three and four dimensions, so only the four-dimensional results are presented for both of the compared methods in this article.
1. Total correlation
The total correlation curves for each of the eleven signals are shown in Fig. 12, with the standard view solutions shown in Fig. 12(a) and the transformed view solutions shown in Fig. 12(b). These calculations were made using four dimensions and 16 bins per dimension. Note that every time step corresponds to 40 ns.
The total correlation plots indicate a few key results regarding each method. Unlike the simpler noisy Lorenz system, where the total correlations of each time delay had similar decay rates, variation, and value, the EP TEMPEST data vary more between time lag choices. The most evident difference is in the time taken for the cathode, anode, total cage, and anode + cathode to settle from short-time lag transients. The total correlations in the standard view cease their rapid decay beyond the 1 µs time lag, while in the transformed coordinates, a time lag longer than 6 µs is required to asymptote. This disparity in transient time scales leads to vastly different time lags chosen by the two methods, as demonstrated in Table III, for quantities such as Ring 6, which returned a time lag of either 77 or 201 steps (3.08 or 8.04 µs), depending on the chosen time lag selection method. Note that such a fast asymptote in the standard view data in Fig. 12(a) causes the Fraser criterion to be triggered relatively early, at lags where the transformed coordinates in Fig. 12(b) still indicate a significant deficit of total correlation due to the presence of noise. Both methods show similarly low total correlation values for Ring 1 and the back wall, though the MITC method can be seen to provide a non-zero value while Fraser in Fig. 12(a) is effectively zero. This can be attributed to the transformed view uncovering a very low-level correlation present in each signal that was originally obfuscated by the standard view.
Signal | Fraser method (µs) | MITC method (µs)
---|---|---
Total cage | 3.64 | 8.84
Anode + Cathode | 4.44 | 7.24
Anode | 8.04 | 18.44
Cathode | 10.52 | 10.36
Beam dump | 2.84 | 4.2
Ring 6 | 3.08 | 8.04
Ring 5 | 0.2 | 9.08
Ring 4 | 0.2 | 5.08
Ring 3 | 0.36 | 4.52
Ring 1 | 0.6 | 4.04
Back Wall | 0.68 | 0.36
Considering both time lag selection methods at a time lag of 15 µs reveals an inflection point in both the total cage and the anode + cathode signals. This time lag corresponds to the fourth harmonic of the breathing-mode oscillation frequency observed in the physical data. Similar inflection points are found at the breathing-mode oscillation frequency and its harmonics. The global maximum method indicates greater sensitivity to such inherent dynamics in the system, as the inflection points are more pronounced for a larger number of signals compared to the local minimum method results. Specifically, the anode signal is clearly shown to reflect such an inflection point at 15 µs in the transformed coordinate system of Fig. 12(b), whereas the standard view in Fig. 12(a) does not indicate any such peak. This provides evidence that the MITC method will return greater correlation values when applied in CCM.
2. Convergent cross mapping
Having used both selection methods to determine the suitable time lags for each signal from the EP TEMPEST data, cross mapping is performed between each pair of signals. The corresponding Fraser and MITC lag values for each signal used in the CCM studies are shown in Table III. Similar to the Lorenz system, the smaller time lag between the two signals is selected for cross mapping. Because the causal links between the signals are not known, a brute force optimal “solution” was constructed by sweeping the lag from 1 to 1000 time steps and picking the highest cross mapping value for each signal pair in each direction. A good heuristic for the time lag selection should closely reproduce the results from this brute force optimal “solution.” The results of both time lag selection methods and the brute force optimal solution are shown in Figs. 13 and 14, respectively. Note that the y axis signals are reconstructed from the x axis signals.
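A sketch of how such a reference could be assembled, reusing the illustrative `ccm_skill` helper sketched in Sec. III A (the cubic cost in signals and lags is exactly why a one-shot heuristic is desirable):

\begin{lstlisting}[language=Python]
import numpy as np

def brute_force_ccm(signals, d=4, lags=range(1, 1001)):
    # signals: (n_samples, n_signals) array; returns the best cross-map skill
    # over all swept uniform lags for each ordered pair (manifold -> target).
    # ccm_skill is the illustrative helper sketched in Sec. III A.
    n_sig = signals.shape[1]
    best = np.full((n_sig, n_sig), -np.inf)
    for i in range(n_sig):                # signal supplying the shadow manifold
        for j in range(n_sig):            # signal being reconstructed
            if i == j:
                continue
            for tau in lags:
                best[i, j] = max(best[i, j],
                                 ccm_skill(signals[:, i], signals[:, j], d, tau))
    return best
\end{lstlisting}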
Both the Fraser [Fig. 13(a)] and MITC methods [Fig. 13(b)] yield accurate results in the downstream area (beam dump, Ring 5, and Ring 6) and inside the thruster (anode and cathode), but the Fraser method underpredicts the Pearson correlation coefficient for the upstream data (Rings 1–4 and the back wall). The common assumption would be that most particle interactions should occur in this downstream region, and it is evident from the total correlation that Rings 3 and 4 seem to have a more pronounced physical response than the rest of the data for the Fraser method. The failure of the Fraser criterion is due to the sensitivity of the numerical derivative of the total correlation with respect to lag length when determining the first local extremum from finite noisy data. Nevertheless, the MITC method is shown to better replicate the brute force optimal “solution” generated in this section, with only slight underpredictions of the back wall and Ring 1 Pearson correlation coefficients.
The differences between the brute force “solution” and each of the results for the two time lag selection methods are visualized in Fig. 15. For the specific context of using the CCM algorithm, the MITC heuristic performs much better than the Fraser heuristic in reproducing the quantitative causality relationships predicted by a brute force optimal time lag selection.
3. Physical relevance
As a brief example of the physical meaning behind these results, the anode and Ring 1 data will be discussed. Physically, it is expected that information flows from the tightly coupled Hall thruster circuit (of which the anode signal is a major component) downstream to the noisy Ring 1 signal measured just behind the thruster face. Consistent with the expectations of CCM, the non-obvious implication is that the near-neighbors in the Ring 1 data perform better at selecting points for reconstructing the anode than the converse. The ability of cross mapping to reconstruct the anode data with any degree of confidence is remarkable, considering how nearly indistinguishable the Ring 1 signal appears to be relative to noise when plotted on a total correlation plot such as in Fig. 12.
IV. CONCLUSION
For the specific application of time embedded phase portraits, a robust lag criterion based on the global maximum of mutual information (or total correlation in higher dimensions) using stretched orthogonal coordinates and equiprobable binning is presented. The act of transforming data using the discrete Legendre coordinate system is shown to remove the short-lag redundant portion of the observed mutual information for continuous nonlinear systems. Thus, this transformation provides a means of measuring mutual information that is more relevant to the identification of coupled system dynamics. This enables a robustness to noise for the MITC method that returns consistent time lags, unlike the Fraser criterion, which can quickly be confounded by spurious inflection points. Hence, the method presented here is significantly more robust to noise and to algorithmic error caused by finite data and discretization than the celebrated first local minimum criterion, which relies on local quantities to select appropriate time lags.
The new heuristic was compared with Fraser’s time lag for both a noisy Lorenz system and the experimental Hall-effect thruster data. The noisy Lorenz solutions demonstrated a noticeably shorter time lag selected by the MITC method, though the overall Pearson correlation values were shown to saturate to similarly high values as the data became more abundant despite the difference in time lag. Note that the number of data points necessary to approach CCM convergence was also smaller for the MITC method than for Fraser’s method of lag selection. For the particular application of the EP TEMPEST data, the lag based on the MITC criterion was shown to more reliably predict the known or expected causal links between signals when compared to a brute force optimal solution generated from the same data.
Of particular interest for future inquiry is the impact of these techniques on systems with multiple time scales to determine whether these tools can aid the decomposition of such dynamics into their constituent scales. While the assumption of decorrelated blurring due to noise used here would appear to be a relatively strong assumption compared to the complexity of the EP TEMPEST data, it is worth noting early work exploring the use of CCM to separate causally and non-causally linked time series data of arbitrarily challenging overlapping timescales and data distributions akin to the cocktail party problem in Ref. 36. This suggests a path toward a greedy sequential decomposition of multi-scale systems where unresolved scales play the role of noise in the first pass of the embedding subject to further analysis in the residuals of the cross maps that are naturally orthogonal to the recovered signal. Rather than a greedy addition of one lag dimension sequentially as in MDOP, this work instead suggests a route toward providing lag estimates for noisy multidimensional dynamics sequentially. It is worth noting further that if one of the cross map signals has low enough dimension and noise to be well embedded, further extensions to CCM allow for delay embedding based statistical denoising and deconvolution37 of causally coupled signals and even potential for true asymptotic convergence38 of the noisy cross map recovery with sufficient data. However, the full unification of this information theoretic lag selection with these sequential and asymptotically convergent limits is beyond the scope of the current study. It is also expected that any approach to tackling higher-dimensional systems with finite data will likely require concepts from the multiview embedding technique33,39 to overcome the curse of dimensionality.
ACKNOWLEDGMENTS
The authors would like to acknowledge support provided by the Air Force Office of Scientific Research (AFOSR) [Grant Nos. FA9550-18RQCOR107 (PO: Dr. Fahroo), No. FA9550-17RQCOR497 (PO: Pokines), No. FA9550-23RQCOR001 (PO: Dr. Blasch), and FA9550-23RQCOR007 (PO: Dr. Leve)].
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
R. S. Martin: Conceptualization (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Supervision (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal). C. M. Greve: Formal analysis (equal); Investigation (equal); Methodology (equal); Validation (equal); Visualization (equal); Writing – original draft (lead); Writing – review & editing (lead). C. E. Huerta: Formal analysis (equal); Investigation (equal); Methodology (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). A. S. Wong: Investigation (supporting); Methodology (supporting); Writing – original draft (supporting). J. W. Koo: Investigation (supporting); Methodology (supporting); Writing – review & editing (supporting). D. Q. Eckhardt: Conceptualization (supporting); Formal analysis (supporting); Investigation (supporting); Writing – review & editing (supporting).
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.
APPENDIX A: PROBABILITY DENSITY ALGORITHM
\begin{lstlisting}[language=Python]
import numpy as np

def compute_probabilities(signal, probability_bins, dimensions, time_lag):
    # Map the signal onto integer bin labels 0..probability_bins-1
    signal = normalize(signal, probability_bins)
    signal_length = len(signal)
    # Dictionary of occupied bins avoids allocating the full d-dimensional histogram
    probability_distribution_function_dictionary = {}
    bin_index = np.zeros(dimensions, dtype=int)
    for i in range((dimensions - 1) * time_lag, signal_length):
        for j in range(dimensions):
            bin_index[j] = int(signal[i - j * time_lag])
        key = tuple(bin_index)
        if key in probability_distribution_function_dictionary:
            probability_distribution_function_dictionary[key] += 1
        else:
            probability_distribution_function_dictionary[key] = 1
    probability_distribution_function = np.asarray(
        list(probability_distribution_function_dictionary.values()))
    indices = np.asarray(list(probability_distribution_function_dictionary.keys()))
    return (probability_distribution_function
            / np.sum(probability_distribution_function), indices)
\end{lstlisting}
APPENDIX B: PDF ROTATION AND STRETCHING ALGORITHM
\begin{lstlisting}[language=Python]
import numpy as np

# Computes the pdf of the rotated signal using equiprobable bins (assumes time_lag != 0)
def rotate_and_compute_probabilities_equiprobable_bins(signal, probability_bins,
                                                       dimensions, time_lag):
    rotation_matrix = make_rotation_matrix(dimensions)
    # Stack the delayed copies column-wise: row t is (x[t], x[t-tau], ..., x[t-(d-1)tau])
    lagged_signal = signal[(dimensions - 1) * time_lag:]
    for i in range(1, dimensions):
        lagged_signal = np.column_stack(
            (lagged_signal, signal[(dimensions - 1 - i) * time_lag:-i * time_lag]))
    rotated_signal = rotation_matrix @ lagged_signal.T
    signal_length = len(rotated_signal[0])
    # argsort(argsort(signal)) returns the position of each point of the signal if it
    # were sorted in ascending order. These ranks are then scaled from 0 to nbin so
    # that the result is the bin number each point belongs to when each bin is
    # stretched to contain the same number of points (equiprobable bins).
    sorted_rotated_signal_indices = (np.argsort(np.argsort(rotated_signal, axis=1),
                                                axis=1)
                                     * probability_bins // signal_length)
    # Dictionary of occupied bins avoids allocating the full d-dimensional histogram
    probability_distribution_function_dictionary = {}
    bin_index = np.zeros(dimensions, dtype=int)
    for i in range(signal_length):
        for j in range(dimensions):
            bin_index[j] = sorted_rotated_signal_indices[j, i]
        key = tuple(bin_index)
        if key in probability_distribution_function_dictionary:
            probability_distribution_function_dictionary[key] += 1
        else:
            probability_distribution_function_dictionary[key] = 1
    probability_distribution_function = np.asarray(
        list(probability_distribution_function_dictionary.values()))
    indices = np.asarray(list(probability_distribution_function_dictionary.keys()))
    return (probability_distribution_function
            / np.sum(probability_distribution_function), indices)
\end{lstlisting}
APPENDIX C: ENTROPY CALCULATION ALGORITHM
\begin{lstlisting}[language=Python]
import numpy as np

def compute_entropy(probability_distribution_function):
    # Keep only non-zero elements of the pdf to avoid log2(0)
    pdf = probability_distribution_function[probability_distribution_function != 0]
    return np.sum(-pdf * np.log2(pdf))
\end{lstlisting}
APPENDIX D: TOTAL CORRELATION ALGORITHM
\begin{lstlisting}[language=Python]
import numpy as np

# Computes total correlation (mutual information when dimensions = 2)
def compute_total_correlation(probability_distribution_function, indices,
                              probability_bins):
    dimensions = indices.shape[1]   # one column of bin labels per dimension
    total_correlation = 0.0
    # Keep only non-zero elements of the pdf to avoid log2(0)
    pdf = probability_distribution_function[probability_distribution_function != 0]
    # Accumulate the marginal entropies, one per dimension
    for dimension in range(dimensions):
        pdf1D = np.zeros(probability_bins)
        for point in range(len(probability_distribution_function)):
            pdf1D[indices[point, dimension]] += probability_distribution_function[point]
        pdf1D = pdf1D[pdf1D != 0]
        total_correlation += np.sum(-pdf1D * np.log2(pdf1D))
    # Subtract the joint entropy
    total_correlation -= np.sum(-pdf * np.log2(pdf))
    return total_correlation
\end{lstlisting}
APPENDIX E: HELPER ALGORITHMS
\begin{lstlisting}[language=Python]
import numpy as np

# Creates rotation matrices up to 5D
# matrix elements obtained from Martin et al. Time Lag Optimization paper on arXiv
def make_rotation_matrix(dimensions):
    if dimensions == 2:
        a = np.sqrt(2) / 2
        rotation_matrix = np.asarray([[a, a],
                                      [a, -a]])
    if dimensions == 3:
        a = np.sqrt(2)
        b = np.sqrt(3)
        c = np.sqrt(6)
        rotation_matrix = np.asarray([[1/b, 1/b, 1/b],
                                      [-1/a, 0, 1/a],
                                      [1/c, -2/c, 1/c]])
    if dimensions == 4:
        a = np.sqrt(20)
        rotation_matrix = np.asarray([[1/2, 1/2, 1/2, 1/2],
                                      [-3/a, -1/a, 1/a, 3/a],
                                      [1/2, -1/2, -1/2, 1/2],
                                      [-1/a, 3/a, -3/a, 1/a]])
    if dimensions == 5:
        a = np.sqrt(5)
        b = np.sqrt(10)
        c = np.sqrt(14)
        d = np.sqrt(70)
        rotation_matrix = np.asarray([[1/a, 1/a, 1/a, 1/a, 1/a],
                                      [-2/b, -1/b, 0, 1/b, 2/b],
                                      [2/c, -1/c, -2/c, -1/c, 2/c],
                                      [-1/b, 2/b, 0, -2/b, 1/b],
                                      [1/d, -4/d, 6/d, -4/d, 1/d]])
    return rotation_matrix

# Normalizes a given signal onto integer bin labels 0..probability_bins-1
def normalize(signal, probability_bins):
    signal = signal - np.min(signal)
    signal = signal / np.max(signal)
    signal = np.floor(signal / (1 / (probability_bins - 1)))
    return signal

# Sums over each dimension other than that specified by skipped_axis; used to
# calculate the pdf along one dimension given the joint pdf, e.g.,
# p(y) = sum_over_other_axes(p(x,y,z), skipped_axis=1, dimensions=3)
def sum_over_other_axes(variable, skipped_axis, dimensions):
    ax = 0
    variable_sum = np.copy(variable)
    for axis in range(dimensions):
        if axis == skipped_axis:
            ax += 1   # keep this axis; it shifts down as earlier axes are summed away
            continue
        variable_sum = np.sum(variable_sum, axis=ax)
    return variable_sum
\end{lstlisting}
APPENDIX F: SAMPLE USE OF ALGORITHMS
\begin{lstlisting}[language=Python]
import numpy as np
import matplotlib.pyplot as plt
import scipy.signal as sn

# Assumes the functions from Appendixes A-E are defined and that
# lorenz_system is an (n_samples, 3) array of integrated Lorenz data
Np = 32            # bins per dimension
Nd = 4             # embedding dimension
Ns = 3             # number of signals (X, Y, Z)
tau_range = np.arange(1, 201, 1)
Mxx = np.zeros((len(tau_range), Ns))    # standard view, uniform bins
Mbv = np.zeros((len(tau_range), Ns))    # rotated view, equiprobable bins

plt.figure(figsize=(20, 5))
for j in range(Ns):
    signal = lorenz_system[:, j]
    for i in range(len(tau_range)):
        tau = tau_range[i]
        pxx, ixx = compute_probabilities(signal, Np, Nd, tau)
        pbv, ibv = rotate_and_compute_probabilities_equiprobable_bins(signal, Np,
                                                                      Nd, tau)
        Mxx[i, j] = compute_total_correlation(pxx, ixx, Np)
        Mbv[i, j] = compute_total_correlation(pbv, ibv, Np)
    plt.subplot(1, 3, j + 1)
    plt.plot(tau_range, Mxx[:, j], label="MI(x,x-)")
    plt.plot(tau_range, Mbv[:, j], label="MI(xbar,v)")
    plt.xlim([0, tau_range[-1]])
    plt.xlabel("Time lag")
    plt.ylabel("Entropy/Mutual Information (bits)")
    plt.legend()
plt.show()

# Fraser lag: sequentially first local minimum of the standard-view curves
min_lag = tau_range[np.asarray([sn.find_peaks(-Mxx[:, j])[0][0] for j in range(Ns)])]
# MITC lag: global maximum of the transformed-view curves
max_lag = tau_range[np.argmax(Mbv, axis=0)]
\end{lstlisting}