Delay embedding methods are a staple tool in the field of time series analysis and prediction. However, the selection of embedding parameters can have a significant impact on the resulting analysis. This has led to the creation of a large number of methods to optimize the selection of parameters such as embedding lag. This paper aims to provide a comprehensive overview of the fundamentals of embedding theory for readers who are new to the subject. We outline a collection of existing methods for selecting embedding lag in both uniform and non-uniform delay embedding cases. Highlighting the poor dynamical explainability of existing methods of selecting non-uniform lags, we provide an alternative method of selecting embedding lags that includes a mixture of both dynamical and topological arguments. The proposed method, *Significant Times on Persistent Strands* (SToPS), uses persistent homology to construct a characteristic time spectrum that quantifies the relative dynamical significance of each time lag. We test our method on periodic, chaotic, and fast-slow time series and find that our method performs similarly to existing automated non-uniform embedding methods. Additionally, $n$-step predictors trained on embeddings constructed with SToPS were found to outperform other embedding methods when predicting fast-slow time series.

Embedding methods are commonly used to analyze time series whose full system state cannot be fully or directly observed. However, most embedding methods require the careful selection of parameters to achieve a faithful reconstruction of the system dynamics. One common class of embedding methods—time delay embedding—requires careful selection of embedding lags. In this paper, we provide an outline of embedding theory and a collection of existing methods and principles for guiding the selection of embedding lags. Finally, we present an argument for the usage of non-uniform embedding and propose a new persistent homology-based method, SToPS, to inform the selection of multiple embedding lags.

## I. THE CASE FOR EMBEDDING

Since the significant results of Whitney^{1} and Takens,^{2} the ideas of mathematical embedding have pervaded almost all aspects of the nonlinear dynamics literature. The related theorems were subsequently formalized, codified, and discussed in the seminal paper “Embedology” by Sauer, Yorke, and Casdagli.^{3} This paved the way for the development of numerous embedding techniques such as the method of derivatives,^{4} time delay embedding,^{2,5} and PCA embedding,^{6} among others, that have since been applied to a wide variety of study areas. Today, the embedding approach remains an invaluable tool in nonlinear time-series analysis.

A time series $x(t)\in\mathbb{R}^m$ may be generically viewed as the product of a data-generating process consisting of successive, though not always regular, observations of some dynamical process with state $\vec{s}(t)\in\mathbb{R}^n$ via a measurement function $h:\mathbb{R}^n\to\mathbb{R}^m$. Typically, $m<n$ as the full state of the underlying dynamical process cannot be observed. For the purposes of illustration, we will consider in this paper the simplest case where $x(t)$ is scalar (i.e., $m=1$).

The main goals of time series analysis often fall into two categories: system identification or classification, and prediction. The former is focused on characterizing and understanding the dynamics and operating mechanisms of the underlying dynamical system. This can range from the study of system invariants such as Lyapunov spectra^{7,8} and correlation dimension^{9–11} to bifurcation analysis.^{11,12} The latter task of time-series prediction has a more practical aim that is clearly stated in its name: “given some history $x(t_0),\ldots,x(t)$, find the best predicted estimates of $x(t+\tau)$.” It is worth clarifying that these two areas are not exclusive and often work synergistically. However, we will focus on the latter problem of time series prediction in this paper.

Though simple in its aim, practically fulfilling the task of time series prediction presents many challenges. Outside the study of simple toy models, many interesting systems exhibit high-dimensional and chaotic behavior.^{13–15} For time delay systems, the dimension of the system dynamics may not even be finite. The potential inaccessibility of the full system state adds a further challenge. As such, it is common to reframe the time series prediction problem in terms of the measurement function $h:\mathbb{R}^n\to\mathbb{R}^m$, where $m<n$. Thus, time series prediction can be rewritten as: “given some observed history $h(\vec{s}(t_0)),\ldots,h(\vec{s}(t))$, predict the future state $x(t+\tau)$.” This can be framed in terms of probability theory, with the aim of calculating the conditional probability $P(x(t+\tau)\,|\,h(\vec{s}(t_0)),\ldots,h(\vec{s}(t)))$.

Given observational restrictions, it is desirable for any time-series predictor to extract and utilize as much of the information contained within the observed time series $x(t)$ as possible. Therein lies the value of embedding, vis-à-vis dimension augmentation. Fortunately, under sufficient precision and noise-free assumptions, Takens’ embedding theorem guarantees that a time delay embedding with dynamics $\Phi$ defined in a space of sufficiently high dimension $\mathbb{R}^d$ constructed from a scalar observed time series is generically diffeomorphic to the full state space dynamics of the underlying dynamical system.^{2} In essence, one can reconstruct a proxy image of the full state dynamics using only part of the observed system variables. The usual approach is to embed the observed time series into a sufficiently large dimension containing the “full” state dynamics, learn the state space dynamics or vector field using one’s favored modeling tool (e.g., neural networks,^{16,17} reservoir computing,^{18–20} support vector machines,^{21,22} etc.), and predict forward, performing all required calculations in the new embedded space.

Despite the elegance and utility afforded by embedding theorems, their reliance on infinite data precision, length, and noise-free signals poses practical problems. The effectiveness of embedding is highly dependent on the choice of embedding parameters used to augment the data.^{23,24} For example, in the case of time delay embedding, the selection of time lag^{25} and embedding dimension^{3,26} can have a profound effect on the quality of the resulting reconstruction.

Thus, we may summarize the main challenge of embedding as the following question: “How do we select good embedding parameters?” This will be the focus of our discussion. Numerous methods have been proposed to tackle this problem, ranging from statistical and dynamical arguments such as mutual information^{11,25} and the continuity statistic^{27} to more topological arguments like distortion,^{28} noise amplification,^{28,29} and fill-factor.^{30} In many cases, each of these methods performs well only for specific types of systems and not others.

This paper has three main objectives. First, to provide a simple overview of the challenges of selecting good embedding parameters. Second, to collate and compare the various popular methods across the dynamics-topology spectrum that have been proposed to tackle the problem of embedding parameter selection. We will focus on the particular case of optimizing time delay embedding. Finally, to present a different approach based on the growing field of persistent homology—the significance score—that attempts to incorporate both dynamical and topological arguments into selection of embedding parameters.

This paper is structured as follows. We begin in Sec. II by providing an overview of embedding theory and various common embedding methods. This is followed by a short discussion on guiding principles on the selection of embedding parameters in Sec. III. Sections IV–V introduce several embedding parameter selection methods for both uniform and non-uniform delay embedding. Finally, we present SToPS in Sec. VIII, a persistent homology approach to embedding parameter selection, which is our contribution to the embedding parameter selection problem.

## II. EMBEDDING METHODS

Multiple embedding methods exist to perform state space reconstruction. However, they all aim to perform a similar task: to augment an observed time series into a dimension high enough to usefully describe the underlying dynamics. For completeness, we provide a brief overview of embedding and three common embedding methods: time delay embedding, derivatives embedding, and global principal value embedding. However, the ideas presented in this paper will focus on the selection of time delay embedding parameters. A deeper discussion of other embedding methods can be found elsewhere.^{28}

### A. Embedding theory

For a given dynamical system with state $\vec{s}(t)$, dynamics on a state space $S\subseteq\mathbb{R}^d$, and evolution operator $f$ such that

$$\vec{s}(t+1) = f(\vec{s}(t)),$$

we can define a measurement function $h:\mathbb{R}^d\to\mathbb{R}$ that simulates the process of observing the system and extracting information evaluated at given time steps to produce a time series,

$$x(t) = h(\vec{s}(t)).$$

An embedding can be defined as a transformation $\Psi:\mathbb{R}\to\mathbb{R}^m$ that augments the dimension of the time series using observed values across some window of time $[t_1,t_2]$. For example, uniform delay embedding is given by the following equation:

$$\vec{x}(t) = \left(x(t),\, x(t-\tau),\, x(t-2\tau),\,\ldots,\, x(t-(m-1)\tau)\right),$$

where $\vec{x}(t)$ is the embedding vector with dynamics defined on a reconstructed state space $X\subseteq\mathbb{R}^m$ and a transformed evolution operator $F$. If $\Psi$ is a valid embedding, Takens’ embedding theorem guarantees that, generically, there exists a diffeomorphism $\Phi:S\to X$ that preserves the dynamics of the system such that

$$\vec{x}(t) = \Phi(\vec{s}(t)), \qquad F = \Phi\circ f\circ\Phi^{-1}.$$

Learning the dynamics on the reconstructed state space, $F$, is equivalent to learning the true system dynamics $f$ (see Fig. 1). Therefore, the task of time series prediction using embedding simplifies to learning the evolution operator $F$, where

$$\vec{x}(t+1) = F(\vec{x}(t)).$$

### B. Time delay embedding

First described by Packard *et al.*,^{4} time delay embedding involves the augmentation of a scalar time series $x(t)$ into a higher dimension through the construction of the delay vector $\vec{x}(t)$ given as

$$\vec{x}(t) = \left(x(t),\, x(t-\tau),\, x(t-2\tau),\,\ldots,\, x(t-(m-1)\tau)\right),$$

where the embedding parameters to be selected are the delay lag $\tau $ and embedding dimension $m$. According to the guarantees of Takens’ theorem, any value of $\tau $ will yield a valid embedding given sufficiently large $m$ and measurement values of infinite precision. However, this is not achievable in practice, and different selections of delay lag and embedding dimension can yield varying results. A further discussion of this is given in Sec. III A.
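To make the construction concrete, the delay vector above can be assembled with simple array slicing. The following is a minimal sketch; the function name and array layout are our own choices, not from the original references:

```python
import numpy as np

def delay_embed(x, m, tau):
    """Uniform delay embedding: row i is (x[t], x[t-tau], ..., x[t-(m-1)tau])."""
    x = np.asarray(x)
    start = (m - 1) * tau              # first index with a complete history
    n = len(x) - start                 # number of full delay vectors
    if n <= 0:
        raise ValueError("series too short for this choice of (m, tau)")
    # Column j holds the series delayed by j*tau.
    return np.column_stack([x[start - j * tau : start - j * tau + n]
                            for j in range(m)])
```

For example, `delay_embed(np.arange(10), 3, 2)` returns six rows, the first being `(4, 2, 0)`.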

We also note that the task of selecting an ideal delay lag $\tau$ and embedding dimension $m$ is not unique to time delay embedding. Selecting values of $\tau$ and $m$ is also a key decision in other time series analysis methods such as permutation entropy^{31} and ordinal partition networks.^{32,33} In both of these instances, a delay vector is constructed and represented by an encoding based on the size order of each component. The time series may then be viewed as transitions between different encoding states and used for further analysis.

Many automated time-series prediction methods such as recurrent neural networks^{16,17} and reservoir computing^{18–20,34} may also be related to delay embedding. In both cases, the input time series is fed into a dynamical network that contains some notion of memory. The forward propagation of this memory of past states onto future states effectively acts as a time delay embedding with small delay lag $\tau$ and large embedding dimension $m$.

### C. Derivatives embedding

The embedding method of derivatives reconstructs an embedding vector using successively increasing orders of time derivatives of the observed time series.^{4} This is given by

$$\vec{x}(t) = \left(x(t),\, \frac{dx(t)}{dt},\, \frac{d^{2}x(t)}{dt^{2}},\,\ldots,\, \frac{d^{m-1}x(t)}{dt^{m-1}}\right),$$

Derivatives are taken via numerical approximations. The derivatives embedding method is a valid embedding for sufficiently large $m$ if one is able to accurately calculate the required derivatives.
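A sketch of this construction using central-difference derivatives (via `np.gradient`) is given below; the function name and the choice of differencing scheme are our own:

```python
import numpy as np

def derivative_embed(x, m, dt=1.0):
    """Embed x using numerical time derivatives of orders 0 to m-1."""
    comps = [np.asarray(x, dtype=float)]
    for _ in range(m - 1):
        # Central differences; each application amplifies high-frequency noise.
        comps.append(np.gradient(comps[-1], dt))
    return np.column_stack(comps)
```

As discussed below, each extra derivative degrades the signal-to-noise ratio, which is the main practical limitation of this method.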

### D. Integral-differential embedding

One weakness of the derivatives embedding approach is the need to evaluate numerical derivatives from data. While this may be acceptable for the first derivative, approximations of successive higher order derivatives are generally inaccurate, as the signal-to-noise ratio tends to degrade with each order. This is true even for very clean datasets.

An alternative to derivatives embedding is integral-differential embedding.^{35} This approach avoids the calculation of successive higher order derivatives by replacing the second-order derivative with an integral instead. This yields the following embedding construction:

$$\vec{x}(t) = \left(\int_0^t x(t')\,dt',\; x(t),\; \frac{dx(t)}{dt}\right),$$

where the first component is set to zero mean before integration. The usage of a first-order integral and a numerical derivative degrades the signal-to-noise ratio by only one order each for the first and third embedded components. This is in contrast with the derivatives embedding approach, where each successive numerical derivative degrades the signal-to-noise ratio by a further order of magnitude. However, the integral-differential embedding approach suffers from the same noise effects as the pure derivatives method for higher dimensional embeddings. This limits its applicability for systems whose dynamics are presumed to be high dimensional.
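The three-component construction can be sketched as follows, assuming a rectangle-rule integral and a central-difference derivative (these implementation choices are ours):

```python
import numpy as np

def integral_differential_embed(x, dt=1.0):
    """Three-component embedding: (integral of x, x, dx/dt)."""
    x = np.asarray(x, dtype=float)
    x0 = x - x.mean()                  # zero mean before integration
    integral = np.cumsum(x0) * dt      # first-order (rectangle rule) integral
    derivative = np.gradient(x, dt)    # central-difference derivative
    return np.column_stack([integral, x, derivative])
```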

### E. Global principal value embedding

The method of principal value embedding was proposed by Broomhead and King^{6} as a modified alternative to time delay embedding using the theorems of Takens. This method draws upon the ideas of principal component analysis^{36} to find an ideal rotation of a time delay embedding with a sufficiently high dimension. Given a time series $x(t)$ of length $N_T$ and a sliding window of length $M$, we can construct a collection of $N = N_T - (M-1)$ delay vectors,

$$X = \left(\vec{x}_1,\, \vec{x}_2,\,\ldots,\, \vec{x}_N\right)^{T},$$

where $\vec{x}_i$ is the delay vector constructed using the $i$th value in the time series as the first component,

$$\vec{x}_i = \left(x_i,\, x_{i+1},\,\ldots,\, x_{i+M-1}\right).$$

An $M\times M$ covariance matrix $C$ can be calculated from $X$. The elements $C_{ij}$ of this matrix can be simply given as

$$C_{ij} = \left\langle x_{t+i}\, x_{t+j} \right\rangle_t,$$

where $\langle\cdots\rangle_t$ denotes a time average. The principal components of $C$ are then found by calculating its eigenvalues and eigenvectors. Taking the first $d$ principal components, corresponding to the desired number of embedding dimensions, the eigenvector matrix can be used to calculate a projection of $X$ corresponding to the final embedded coordinates. Readers are advised to refer elsewhere^{6,28} for more details.
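The procedure can be sketched in a few lines of linear algebra. This is an illustration in the spirit of Broomhead and King rather than a faithful reproduction of their algorithm; the normalization and function name are our own:

```python
import numpy as np

def pca_embed(x, M, d):
    """Project a unit-lag, M-dimensional delay embedding onto its
    first d principal directions."""
    x = np.asarray(x, dtype=float)
    N = len(x) - (M - 1)
    # Trajectory matrix: row i is the delay vector (x_i, ..., x_{i+M-1}).
    X = np.column_stack([x[j : j + N] for j in range(M)])
    X = X - X.mean(axis=0)
    C = (X.T @ X) / N                       # M x M covariance matrix
    evals, evecs = np.linalg.eigh(C)        # ascending eigenvalues
    order = np.argsort(evals)[::-1]         # strongest directions first
    return X @ evecs[:, order[:d]]          # embedded coordinates
```

By construction, the projected components are linearly uncorrelated, which is precisely the linear-correlation limitation discussed below.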

Global principal value embedding essentially aims to distill and simplify a high-dimensional delay embedding (usually obtained by taking a large number of lagged components) into a lower dimensional subspace. The remaining subspaces are argued to correspond to component directions with little dynamical variation and importance. One application of this method was as an attempt to simplify the selection of the optimal embedding dimension,^{6} where the ideal embedding dimension $m$ corresponds to the number of singular values that are distinctly greater than some “noise floor.” However, this approach has received several criticisms,^{37,38} the main one arguing that the onset of a plateau in the noise floor can be attributed to the precision and noise strength of the data, rather than to the importance of the corresponding eigenvector direction.

Within the general context of embedding, Paluš and Dvorák test the quality of delay embedding with reduced dimension obtained using the first $k$ principal eigenvector directions.^{38} The authors show that reduced dimension embeddings’ estimates of dynamical invariants such as the correlation dimension vary with different time delays and the number of components. They argue that the usage of principal components of the covariance matrix is restricted to linear correlations. Therefore, while components in embedding may be independent in the linear sense, they may still be nonlinearly dependent. Instead, the truncation of embedding dimensions can result in the exclusion of important nonlinear components.

The inclusion of large time lags within a given principal component direction also may not make much sense for chaotic systems where temporally close observations decorrelate exponentially in time. From the perspective of selecting time lags, each principal component will almost invariably contain contributions from all $M$ possible lagged components. Apart from the dimensional reduction argument^{38} (of which care must be taken in its interpretation), global principal value embedding does not present any significant difference to the general time delay embedding with large embedding dimension $m$.

## III. EMBEDDING CONSIDERATIONS

### A. Embedding quality

As previously discussed, the theoretical guarantees of Takens’ theorem fail in the presence of finite precision and noise,^{23,25} leading to the concept of “optimal” embedding parameters. The existence of such an “optimal” set implies that not all embeddings are of equal quality. However, this requires a measure against which embedding quality can be compared. Attempts to quantify embedding quality were studied extensively by Casdagli^{28} and Potapov.^{39}

There are large variations in the definition of embedding quality, such as those based on information theoretic arguments,^{25} prediction tasks,^{28} and attractor topology.^{30,40} It is worthwhile to note that the “optimality” of a set of parameters depends on the task that the embedding is being used for. As such, while one may find similar results between methods based on different notions of embedding quality, disagreements between results will likely always be present.^{11,23,28} Therefore, it is better to avoid claiming that a particular set of parameters is more favorable unless there are dynamical and topological reasons within the data itself that support it. However, we will highlight in this section the general considerations that are often used in defining the quality of an embedding.

The different notions of embedding quality can be summarized in two broad categories: prediction-based and topological arguments. Prediction-based notions of embedding quality are inspired by the application of embeddings in the context of time-series prediction. Fundamentally, good embeddings should enable better predictions.^{28,39}

In time series prediction tasks, the presence of an unknown measurement function $h$ and noisy data introduces some degree of uncertainty to the inference of the real system state $\vec{s}(t)$. Casdagli argued that an ideal embedding should minimize the uncertainty of inferring the true state $\vec{s}(t)$ given a position $\vec{x}(t)$ in the reconstructed state space. In essence, the inverse transformation $\Phi^{-1}$ applied to the constructed delay embedding in the presence of noise should leave little ambiguity about the true state of the system, if $\vec{s}(t)$ could be fully known.^{28} This robustness to noise and low ambiguity should in theory be beneficial for time series prediction and also forms the basis of the ideas of noise amplification and distortion used to quantify embedding quality. In poor embeddings, such as those whose attractor manifolds are laminar with little separation, the effect of noisy perturbations across manifold layers results in significant uncertainty about the true state $\vec{s}(t)$, making time series prediction difficult.

The information theoretic arguments for choosing embedding parameters (e.g., autocorrelation, minimum mutual information, and continuity statistics) are also closely related to the ideas of prediction. These methods generally try to maximize the amount of new information incorporated in each delay dimension, with the aim that this will provide more information about the true system state $\vec{s}(t)$ and aid in time series prediction.

The other broad category of embedding quality measures comprises those based on topological and geometrical arguments. Many of these methods focus on the study of the attractor structure and the distribution of the manifold in its ambient state space. In essence, a good embedding with respect to topology and geometry should be well spaced out and unfolded in its ambient space.^{11,30,40} This notion of quality has parallels with the noise amplification arguments of Casdagli. Some methods based on geometrical arguments include statistics such as the fill factor^{30} and displacement from diagonal.^{40} Ultimately, many of the considerations outlined above for determining the ideal lag and embedding dimension for time delay embedding can be summarized with the concepts of irrelevance and redundancy.^{28,41}

### B. Irrelevance and redundancy

The selection of time lag $\tau$ and embedding dimension $m$ are the main challenges when constructing a time delay embedding. There is uncertainty about the relative importance of embedding lag versus dimension. Furthermore, it has also been proposed that these embedding parameters may not be independent. Instead, the quantity $\tau_w = m\tau$, termed the embedding window, has been proposed as a more important parameter to optimize.^{28,41} However, for sufficiently large embedding dimension, it could be argued that the selection of $\tau_w$ may be simplified to an appropriate selection of $\tau$.

The selection of the embedding window $\tau_w$ (and by extension embedding lag $\tau$) may be summarized by the notion that it must be neither too short (redundance) nor too long (irrelevance). This explanation applies to chaotic or aperiodic signals. However, periodic signals may be successfully embedded with large lags $\tau$, where the effective lag $\tau^*$ is related to the period $T$,

$$\tau^* = \tau \bmod T.$$

Embeddings with high redundance result in trajectories that lie in layers roughly parallel to each other (e.g., close to the diagonal). In the presence of sufficient noise, the clear separation between layers is lost. This results in a greater degree of uncertainty about the true system state $\vec{s}(t)$, given some noisy observation $\vec{x}(t)$ in reconstructed state space.

Similarly, embeddings with high irrelevance contain components that are highly decorrelated with the true state.^{28} This is also unfavorable as it may introduce unwanted crossings between trajectories in the reconstructed manifold. Therein lies the Goldilocks problem of selecting an embedding window $\tau_w$ that is neither too large nor too small.

## IV. UNIFORM DELAY EMBEDDINGS

The simplest form of delay embedding is the case of uniform delays, where single constant values of $\tau$ and $m$ are selected. In this case, embedding vectors are constructed with uniformly increasing time delays as given in the following equation:

$$\vec{x}(t) = \left(x(t),\, x(t-\tau),\, x(t-2\tau),\,\ldots,\, x(t-(m-1)\tau)\right).$$

Due to the debate over the selection priority of $\tau$ and $m$, multiple methods have been proposed to simultaneously estimate both parameters. Some methods include those of Gao and Zheng,^{42,43} characteristic lengths,^{23} and Schuster (wavering product).^{44} An overview of these methods is provided in Sec. V.

Other common methods simplify the problem by assuming the independence of $\tau$ and $m$ and estimate the two values separately. Generally, an embedding lag $\tau$ is first determined using one of various measures. Once selected, uniform delay vectors of increasing dimension are constructed and tested with algorithms such as the Grassberger–Procaccia algorithm^{10} or false nearest neighbors^{26,45,46} until convergence is achieved. The dimension at which convergence is reached is taken as the embedding dimension $m$.
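As an illustration of the dimension-selection step, the following sketch estimates the fraction of false nearest neighbors when passing from dimension $m$ to $m+1$, using a distance-ratio test in the spirit of the false nearest neighbors algorithm. The threshold `rtol`, the handling of duplicates, and the function name are our own simplifications:

```python
import numpy as np

def fnn_fraction(x, m, tau, rtol=10.0):
    """Fraction of nearest neighbors in dimension m that become
    distant when the (m+1)th delay coordinate is added."""
    x = np.asarray(x, dtype=float)
    start = m * tau
    n = len(x) - start
    cols = [x[start - j * tau : start - j * tau + n] for j in range(m + 1)]
    E = np.column_stack(cols[:m])          # m-dimensional delay vectors
    extra = cols[m]                        # candidate (m+1)th coordinate
    false = total = 0
    for i in range(n):
        d = np.linalg.norm(E - E[i], axis=1)
        d[i] = np.inf                      # exclude the point itself
        j = int(np.argmin(d))
        if d[j] == 0.0:
            continue                       # exact duplicate; skip
        total += 1
        # Neighbor is "false" if the new coordinate blows the distance up.
        if abs(extra[i] - extra[j]) / d[j] > rtol:
            false += 1
    return false / max(total, 1)
```

In practice, $m$ is increased until this fraction drops to (near) zero.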

The methods that are used for determining embedding lag $\tau $ vary from simple heuristics to more complex statistical arguments. One common heuristic is the selection of embedding lag as one quarter of the signal period (or quasi-period for chaotic signals). Delving into more statistically grounded arguments, autocorrelation^{11} and its nonlinear generalization, mutual information,^{47} continuity statistics,^{27} and L-statistics^{29} are occasionally used to determine good values for $\tau $. A comprehensive overview and further discussion on these methods is provided in Secs. VI–VII.

## V. SIMULTANEOUS OPTIMIZATION OF UNIFORM EMBEDDING PARAMETERS

In contrast to many of the current methods that involve the selection of $\tau $ and $m$ independently, several embedding methods have been proposed that simultaneously estimate both values using a single measure. This measure is often calculated across multiple lags and repeated for increasing $m$. A value for $m$ is first selected according to some criterion followed by selection of $\tau $. Detailed steps on implementation of these methods are outlined by Celluci.^{23}

### A. Method of Gao and Zheng

The embedding method proposed by Gao and Zheng is based on the incidence of false nearest neighbors.^{42,43} False nearest neighbors can be attributed either to redundancy (an insufficiently unfolded attractor) or to irrelevancy (spurious intersections in the attractor). The method operates on the notion that the separation distance and the proportion of false nearest neighbors should be minimized in an ideal embedding.

Consider a pair of points in embedded space, $\vec{x}_i,\vec{x}_j$, and their evolution $k$ steps into the future, $\vec{x}_{i+k},\vec{x}_{j+k}$. Points that are false nearest neighbors will tend to separate faster than real neighbors as the attractor unfolds in a time delay embedding. As a result, the ratio between their distances, $|\vec{x}_{i+k}-\vec{x}_{j+k}|/|\vec{x}_i-\vec{x}_j|$, will be larger for pairs of false nearest neighbors and approximately equal to 1 for real neighbors. Gao and Zheng then propose the following measure $\Lambda$ to optimize the embedding parameters:

$$\Lambda(m,\tau) = \frac{1}{N_{\mathrm{ref}}}\sum_{(i,j)} \ln\frac{|\vec{x}_{i+k}-\vec{x}_{j+k}|}{|\vec{x}_i-\vec{x}_j|},$$

where $N_{\mathrm{ref}}$ is the number of randomly sampled point pairs over which the distance ratio is averaged. There are several additional restrictions on the selection of point pairs $\vec{x}_i,\vec{x}_j$. First, the initial separation of these points should satisfy $|\vec{x}_i-\vec{x}_j|\leq r$, where $r$ is a small selected threshold; i.e., the initial separation of points should be small enough that the calculation of growing separation is sensible. Second, the selected pairs of points should not have intersecting Theiler windows, $|i-j|>l_{\mathrm{Theiler}}$, where $l_{\mathrm{Theiler}}\in\mathbb{N}^+$. This is done to prevent unwanted correlations between points on the same local trajectory.^{48,49} Finally, the constant $k$ should not be too large and should be selected with respect to the natural time scale of the system dynamics.

To identify good embedding parameters, profiles of $\Lambda (\tau )$ are calculated for increasing values of embedding dimensions $m$. The value of $m$ that corresponds to the largest decrease across the profile $\Lambda (\tau )$ is selected as the embedding dimension. The embedding lag $\tau $ is then selected as the first minimum of $\Lambda (\tau )$.
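A sketch of the statistic is given below. Accounts of the method differ in details; here we simply average the log distance-growth ratio over admissible random pairs, with the closeness threshold `r`, the Theiler window, and the evolution time `k` left as user choices (all names and defaults are ours):

```python
import numpy as np

def gao_zheng_lambda(x, m, tau, k=3, r=0.3, theiler=10, n_ref=200, seed=0):
    """Average log distance-growth ratio over close, Theiler-separated pairs."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    start = (m - 1) * tau
    n = len(x) - start - k                 # leave room for k-step evolution
    E = np.column_stack([x[start - j * tau : start - j * tau + n + k]
                         for j in range(m)])
    total, count, tries = 0.0, 0, 0
    while count < n_ref and tries < 100 * n_ref:   # cap draws to avoid stalling
        tries += 1
        i, j = rng.integers(0, n, size=2)
        d0 = np.linalg.norm(E[i] - E[j])
        # Enforce closeness, a nonzero separation, and the Theiler window.
        if abs(i - j) <= theiler or d0 > r or d0 < 1e-12:
            continue
        dk = np.linalg.norm(E[i + k] - E[j + k])
        total += np.log(dk / d0)
        count += 1
    return total / max(count, 1)
```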

### B. Characteristic lengths

The method of characteristic lengths is an extension of Gao and Zheng^{23} that attempts to solve the problem of selecting an evolution time $k$. Instead of arbitrarily selecting $k$, a characteristic length $J(m,\tau)$ describing the natural spatial scale of the system attractor is calculated,

$$J(m,\tau) = \left\langle\, |\vec{x}_i - \vec{x}_j|\, \right\rangle,$$

where $\langle\cdots\rangle$ denotes an average over sampled pairs of points on the attractor. The characteristic length is then used to calculate the separation time $T_J(\vec{x}_i,\vec{x}_j)$, defined as the time taken for a pair of nearest neighbors to diverge by some proportion of the characteristic length $J(m,\tau)$. For real neighbors, $T_J$ will converge to a value related to the Lyapunov exponent of the system with increasing embedding dimension $m$, while false nearest neighbors will result in a smaller value of $T_J$ as trajectories quickly separate. The new measure used to determine the embedding parameters is given by

$$C(m,\tau) = \frac{1}{N_{\mathrm{ref}}}\sum_{(i,j)} T_J(\vec{x}_i,\vec{x}_j),$$

where $N_{\mathrm{ref}}$ is the number of sampled pairs of nearby neighbors. The values of $m$ and $\tau$ that maximize $C(m,\tau)$ are selected as the embedding dimension and lag.
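The characteristic length itself is straightforward to estimate by sampling random pairs of embedded points; a minimal sketch (the sampling scheme and function name are ours):

```python
import numpy as np

def characteristic_length(E, n_pairs=2000, seed=0):
    """Mean distance between random pairs of embedded points: a natural
    spatial scale J(m, tau) of the reconstructed attractor."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(E), size=n_pairs)
    j = rng.integers(0, len(E), size=n_pairs)
    return np.linalg.norm(E[i] - E[j], axis=1).mean()
```

For points on the unit circle, for example, this estimate approaches the mean chord length $4/\pi$.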

### C. Wavering product

The wavering product^{44} is similar to the methods of Gao and Zheng and of characteristic lengths in that all are based on the concept of nearest neighbors. The authors propose that good embeddings should preserve the correspondence between the order of nearby neighbors of a given reference point (i.e., the order of neighbors sorted according to distance from some reference point $\vec{x}_i$ should be preserved). This is done by comparing the order of $N$ nearby neighbors of a point $\vec{x}_i$ between a given embedding $\Phi_k$, whose ordered sequence of neighbors is given by

$$\left(\vec{x}_{i,1},\, \vec{x}_{i,2},\,\ldots,\, \vec{x}_{i,N}\right),$$

and its projection onto the next order embedding $\Phi_{k+1}$ (by increasing $m$ or $\tau$), with the sequence given by

$$\left(\vec{z}_{i,1},\, \vec{z}_{i,2},\,\ldots,\, \vec{z}_{i,N}\right).$$

Here, $\vec{x}_{i,n}$ corresponds to the $n$th nearest neighbor of the $i$th reference point $\vec{x}_i$. The projection $\vec{z}_{i,n}$ corresponds to the same neighbor data point $\vec{x}_{i,n}$ whose position is recalculated in the next order embedding $\Phi_{k+1}$.

Similarly, comparisons can also be made with a projection onto an embedding of lower order (by decreasing $m$ or $\tau$), giving a new set of ordered points,

$$\left(\vec{y}_{i,1},\, \vec{y}_{i,2},\,\ldots,\, \vec{y}_{i,N}\right).$$

Ideally, a good embedding should preserve a one-to-one correspondence between these ordered sequences. This will yield a value equal to 1 for the following ratios:

$$W_i^{+} = \prod_{n=1}^{N}\frac{|\vec{x}_i - \vec{z}_{i,n}|}{|\vec{x}_i - \vec{x}_{i,n}|}, \qquad W_i^{-} = \prod_{n=1}^{N}\frac{|\vec{x}_i - \vec{y}_{i,n}|}{|\vec{x}_i - \vec{x}_{i,n}|}.$$

The method presented by Schuster and Liebert proposes as a measure the product of the above two ratios,

$$W_i = W_i^{+}\, W_i^{-}.$$

The measure to be optimized is given by the average over $N_{\mathrm{ref}}$ randomly sampled reference points,

$$W = \frac{1}{N_{\mathrm{ref}}}\sum_{i=1}^{N_{\mathrm{ref}}} W_i,$$

with $m$ selected as the dimension that achieves the limiting behavior of $W$, and $\tau$ corresponding to the first minimum of the resulting profile.

## VI. SELECTING UNIFORM EMBEDDING LAG

In contrast to the embedding methods in Sec. V that utilize a single measure to optimize the selection of embedding lag and dimension, the most common approach in practice still selects the embedding lag $\tau$ and dimension $m$ independently according to separate metrics. Here, we will focus on the methods used to select the lag for time delay embedding. This lag may then be used to construct delay embeddings of arbitrary dimension.

The simpler case of uniform delay embedding requires the selection of a singular value of $\tau $ that is increased in multiples to construct the required delay vector. The methods used to inform the selection of delay lag $\tau $ mirror the broad dichotomy in notions of embedding quality outlined in Sec. III A. In this section, we divide the various methods for optimizing embedding lag into two categories: the first based on dynamical and information theoretic arguments and the second based on topological arguments. We also discuss the topic of non-uniform embedding in Sec. VII.

Methods rooted in a dynamical approach can be interpreted as focusing on the mechanism behind the data-generating process and on statistical relationships between measurements. Information theoretic and statistical approaches are also included in this category. The methods that we review in this category include autocorrelation, minimum mutual information, and the quarter-period heuristic. This is in contrast to those in the second category, whose methods are more topological. Some examples include the fill-factor^{30} and noise amplification.^{28,29}

A brief note on embedding dimension: there are several methods used to identify the required embedding dimension, such as false nearest neighbors (FNNs)^{26,45} and the Grassberger–Procaccia algorithm^{10} used to estimate the correlation sum and dimension. Other invariants similar to the correlation dimension, such as the Kaplan–Yorke dimension^{50} and box-counting dimension,^{51} are also often used to determine the embedding dimension. These results are usually used in conjunction with Whitney’s theorem, which states that any $d$-dimensional manifold (such as an attractor) can be embedded in $m \geq 2d+1$ dimensions.^{1} However, it has been noted that this direct application has its flaws, as Whitney’s theorem is only proved for integer dimensions $d$, which is rarely the case for a majority of systems of interest, such as those exhibiting fractal and chaotic behavior. We note, however, that an extension of Whitney’s theorem generalizing the inequality to box-counting dimensions ($m>2D_F$) was given by Sauer *et al.*^{11,52}

### A. Dynamical approaches

#### 1. Autocorrelation and minimum mutual information

Many commonly analyzed dynamical systems exhibit chaotic behavior, where nearby trajectories rapidly diverge and quickly become uncorrelated. The method of autocorrelation is based on the idea that each component in the reconstruction should include as much new information regarding the true state as possible. It has been suggested that components in the delay vector should be chosen to minimize their correlation.^{11} This has a similar effect to minimizing the redundancy of the reconstruction.

For delay embedding, the lag corresponding to the first minimum of the autocorrelation function is taken as the embedding time lag. Alternatively, the decay time to $1/e$ of the autocorrelation signal has also been proposed.^{11} A variation of this approach based on the first root of the mean local autocovariance has also been proposed as a robust alternative to the minimum autocorrelation approach.^{53}

One weakness of autocorrelation is its inability to account for non-stationarities in the time series (e.g., drifts in phase, frequency, and magnitude). Additionally, its application is limited to linear signals.^{54} In all but the simplest cases, dynamical systems exhibit some level of non-stationary and nonlinear behavior. Fraser and Swinney^{25} proposed that the mutual information between delayed observations of the system be used in place of autocorrelation. In their original paper, Fraser and Swinney first provide a geometrical interpretation to complement the information theoretic arguments for mutual information. Namely, consider a set of points whose values in one component $x$ lie within some fixed window. From this set, track their positions $\tau$ steps into the future and calculate the distribution of values $p_\tau(x)$ in the same component for the same set of points. A value of $\tau$ that results in a wider distribution $p_\tau(x)$ should correspond to a good lag, which also corresponds to small values of the mutual information.

The mutual information $I(\tau)$ can be interpreted as the nonlinear analog of the autocorrelation function,

$$I(\tau) = \sum_{x(t),\,x(t+\tau)} P(x(t), x(t+\tau)) \log \frac{P(x(t), x(t+\tau))}{P(x(t))\,P(x(t+\tau))},$$

where $P(x(t))$ is the probability of observing a state $x(t)$ at any given time and $P(x(t), x(t+\tau))$ is the joint probability defined similarly for both time $t$ and a future time $t+\tau$. Drawing from information theory, mutual information $I(\tau)$ aims to quantify the amount of information about a future state at time $t+\tau$ that is contained in an observation at time $t$. High levels of mutual information for a given lag $\tau$ imply a high degree of correlation between states and will result in higher redundancy in the delay reconstruction.

The strength of minimum mutual information and autocorrelation lies in their ability to provide reasonable estimates for the lag with relatively simple and quick computation. However, there is no guarantee of the existence of a clear minimum in a given mutual information profile $I(\tau)$.^{55} Additionally, calculating mutual information requires the numerical estimation of the probability density functions $P(x(t))$ and $P(x(t), x(t+\tau))$ and, thus, requires consideration of optimal histogram bin size and data length requirements.^{56,57} Numerous alternative methods to estimate mutual information more effectively have been proposed, including the usage of adaptive binning,^{25,58} kernel density estimators,^{59} and $k$-nearest neighbors.^{56}
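The first-minimum heuristic can be sketched in a few lines. The following is a minimal illustration, not Fraser and Swinney's adaptive-binning estimator: mutual information is estimated with a fixed-bin 2D histogram, and the first local minimum of the resulting profile is returned. The function names and the bin count are our own choices.

```python
import numpy as np

def mutual_information(x, tau, bins=16):
    """Histogram estimate of I(tau) between x(t) and x(t + tau)."""
    a, b = x[:-tau], x[tau:]
    pxy, _, _ = np.histogram2d(a, b, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / np.outer(px, py)[mask])))

def first_minimum_lag(x, max_lag=100, bins=16):
    """First local minimum of the mutual information profile I(tau)."""
    I = [mutual_information(x, tau, bins) for tau in range(1, max_lag + 1)]
    for i in range(1, len(I) - 1):
        if I[i] < I[i - 1] and I[i] < I[i + 1]:
            return i + 1          # profile index i corresponds to lag i + 1
    return int(np.argmin(I)) + 1  # fall back to the global minimum

# Example: a sine wave with 100 samples per period.
t = np.arange(0.0, 50.0, 0.01)
x = np.sin(2 * np.pi * t)
tau = first_minimum_lag(x, max_lag=200)
```

For real data, the histogram estimator is sensitive to bin size and data length, as noted above; adaptive-binning or $k$-nearest-neighbor estimators are more robust.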

#### 2. Quarter of period

A commonly used heuristic for selecting an embedding lag is to set $\tau$ to be a quarter of the most dominant period in the signal.^{60} This approach allows the natural time scale of the system dynamics to be encoded within the embedding procedure. The heuristic is inspired by the problem of embedding a sine wave $x(t) = \sin(\omega t)$ in 2D. In this case, $\tau = \frac{2\pi}{4\omega}$ (a quarter period) yields a 2D delay embedding that is the most circular, with other values resulting in elliptical trajectories instead. However, this heuristic cannot be directly applied to chaotic systems, where signals are aperiodic. Instead, an estimation of some form of pseudo-period is required, which will be the focus of our proposed method in this paper.
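As a sketch, the dominant period may be estimated from the peak of the power spectrum and then quartered; the function name and the FFT-peak choice here are our own.

```python
import numpy as np

def quarter_period_lag(x, dt=1.0):
    """Estimate the embedding lag as a quarter of the dominant period,
    taken from the peak of the power spectrum (DC component excluded)."""
    x = np.asarray(x) - np.mean(x)
    freqs = np.fft.rfftfreq(len(x), d=dt)
    power = np.abs(np.fft.rfft(x)) ** 2
    f_dom = freqs[1:][np.argmax(power[1:])]  # skip the zero-frequency bin
    period = 1.0 / f_dom
    return max(1, int(round(period / (4 * dt))))  # lag in samples

# A sine with period 1.0 sampled at dt = 0.01: expected lag of 25 samples.
t = np.arange(0.0, 50.0, 0.01)
x = np.sin(2 * np.pi * t)
lag = quarter_period_lag(x, dt=0.01)
```

For aperiodic signals, the spectral peak is broad or absent, which is precisely why a pseudo-period estimate is needed instead.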

### B. Topological approaches

#### 1. Fill-factor

The fill-factor approach, first proposed by Buzug *et al.*,^{30} is an entirely geometrical approach to quantifying the quality of a given embedding. This method assumes that an ideal embedding should unfold the attractor and maximize the separation between trajectories. The authors argue that such an embedding optimally utilizes the ambient space and reduces the ambiguity of the true state of the system for different points in the reconstructed state space.

The fill-factor is calculated by first sampling $m+1$ random points from an $m$-dimensional delay embedding of the data. A reference point $\vec{x}_r$ is then selected from this collection, and the corresponding relative distance vectors can be calculated,

The corresponding $m \times m$ matrix can then be expressed as

and the volume of the resulting parallelepiped is given by calculating the determinant of $M$,

The final expression for the fill-factor is given by calculating the average volume over a collection of randomly sampled parallelepipeds $V_i(\tau)$, normalized by the range of the sampled data points,

The authors recommend the selection of the $\tau$ that maximizes the fill-factor $f$ over the interval $\tau \in (0, T_c/2)$, where $T_c$ is the characteristic recurrence time. The value of $T_c$ is given by

$$T_c = \frac{2\pi}{\omega_c},$$

where $\omega_c$ is the most dominant frequency in the power spectrum of the time series.
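A simplified sketch of the fill-factor computation is given below: sample $m+1$ points, form the matrix of relative distance vectors, and average the absolute determinants, normalized by the data range. The exact normalization in Buzug *et al.* differs, so this should be read as an illustration of the idea rather than a reimplementation; all names are ours.

```python
import numpy as np

def delay_embed(x, m, tau):
    """Uniform delay embedding with m components and lag tau."""
    n = len(x) - (m - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(m)])

def fill_factor(x, m, tau, n_samples=200, rng=None):
    """Average parallelepiped volume over random (m+1)-point samples,
    normalized by the data range (simplified form of Buzug et al.'s f)."""
    rng = np.random.default_rng(rng)
    X = delay_embed(x, m, tau)
    span = x.max() - x.min()
    vols = []
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=m + 1, replace=False)
        ref, rest = X[idx[0]], X[idx[1:]]
        M = rest - ref  # m x m matrix of relative distance vectors
        vols.append(abs(np.linalg.det(M)))
    return float(np.mean(vols)) / span**m

# A quarter-period lag should unfold a 2D embedding of a sine wave far
# better than a very short lag, giving larger sampled volumes.
t = np.arange(0.0, 50.0, 0.01)
x = np.sin(2 * np.pi * t)
f_short = fill_factor(x, 2, 1, rng=0)
f_quarter = fill_factor(x, 2, 25, rng=0)
```

At a short lag, the embedded points collapse onto a thin diagonal ellipse, so the sampled parallelepipeds are nearly degenerate and the fill-factor is small.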

#### 2. Noise amplification

Noise amplification is a measure proposed by Casdagli in an attempt to quantify the quality of an embedding.^{28,39} It is supported by the notion that a good embedding should be useful for performing predictions. Additionally, a good embedding should still perform relatively well in the presence of noise. Noise amplification for a given embedding $\vec{x}(t)$ is defined with respect to the predictability of the system $T$ steps into the future under the presence of noise. Generally, this is given by

where

Here, $\mathrm{Var}[x(T)\,|\,B_\epsilon(\vec{x})]$ corresponds to the conditional variance of $T$-step predictions into the future in $\mathbb{R}$ from an initial condition $\vec{x}$ in the embedding space $\mathbb{R}^m$ contaminated with small added observation noise $\epsilon$. In this case, it is assumed that predictions have no model errors. This condition may be fulfilled by choosing nearby neighbors in the embedding space $\mathbb{R}^m$ as a proxy for noisy initial conditions.^{29}

Finally, the noise amplification quantity $\sigma(T, \vec{x})$ is averaged over a collection of reference points sampled across the time series in order to calculate the overall noise amplification value $\sigma$. Embeddings with high noise amplification imply that nearby neighbors in the embedded space $\mathbb{R}^m$ tend to have future trajectories that rapidly diverge because they do not correspond to real neighbors in the true state space manifold $M$. Therefore, the impact of noise is greatly amplified, as small perturbations in the reconstructed space $\mathbb{R}^m$ result in large uncertainties in the true state of the system.

#### 3. L-statistic

One weakness of the noise amplification measure is its requirement to define $T$, the prediction horizon over which noise amplification is calculated. This was addressed by Uzal *et al.*^{29} by modifying the definition of noise amplification to the following equation:

This definition calculates the noise amplification with respect to a range of prediction horizons up to a maximum value of $T_M$ and is found to be relatively robust for sufficiently large $T_M$.

The algorithm used to calculate $\sigma$ relies on using the $k$ nearest neighbors of a reference point $\vec{x}_i$ as a proxy for noisy initial conditions. Depending on the distribution of points, this can result in effective noise levels $\epsilon$ of different sizes for each point. Therefore, Uzal proposed a normalization constant $\alpha_k$ to accommodate this variation, given by

Combining these two ideas, the authors propose that the noise amplification $\sigma$ measures some notion of redundancy, while $\alpha_k$ measures some notion of irrelevance. The L-statistic is then defined as a cost function that minimizes both of these values simultaneously,

## VII. NON-UNIFORM EMBEDDING

The popularity of uniform delay embedding can be attributed to its ease of implementation and optimization. In a direct application, uniform delays only require the selection of two parameters, $\tau$ and $m$. However, the convenience of such an approach comes at the cost of reduced versatility, particularly when analyzing systems with dynamics occurring on multiple disparate time scales.^{60,61}

First, the choice to use a single delay limits the ability of the reconstruction to highlight features across multiple disparate time scales.^{27} For example, in a fast-slow system with characteristic time scales $\tau_1$ and $\tau_2$ where $\tau_1/\tau_2 \gg 1$, selecting $\tau_1$ (i.e., the slow dynamics) as the embedding lag can limit the reconstruction's ability to fully unfold attractor topologies corresponding to the fast dynamics. The dynamics on the time scale of $\tau_2$ (i.e., the fast dynamics) will appear as noisy fluctuations within the reconstructed state space.

Second, a uniform delay embedding that is sufficient is not necessarily optimal. Here, we must clarify that the definition of optimal presumes some criterion or notion of quality. Casdagli noted that the quality of an embedding, defined as the reconstruction's robustness to noisy data for prediction, can vary locally throughout different regions of the attractor.^{28} This behavior was also highlighted by Uzal in his extension of Casdagli's noise amplification and distortion methods.^{29} Additionally, we should also consider that invariant measures such as the Lyapunov exponent also vary locally.^{62,63} Hence, the selection of a single embedding lag implies that all these variations may be averaged out.

Non-uniform delay embedding has been proposed as a natural extension of uniform embedding that aims to address some of the latter's limitations. Non-uniform delay embedding requires the selection of multiple delay lags $\{\tau_1, \tau_2, \ldots, \tau_{m-1}\}$ in order to construct a delay vector,

$$\vec{x}(t) = \left(x(t), x(t - \tau_1), x(t - \tau_2), \ldots, x(t - \tau_{m-1})\right).$$

The selection of delay lags is a combinatorially hard problem that grows with increasing embedding dimension. The methods proposed for constructing non-uniform delay embeddings often involve the iterative selection of time lags to gradually construct a delay vector until the required embedding dimension is reached. In this section, we give an overview of various methods that have been proposed to solve and automate this problem. These methods include the continuity statistic,^{27} PECUZAL,^{64} maximizing derivatives on projections (MDOP),^{40} reduced autoregressive models,^{60,61} and search optimization algorithms such as ant colony optimization^{65} and Monte Carlo decision tree search (MCDTS).^{66}
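The delay vector construction itself is simple to sketch (function name ours): stack lagged copies of the series so that each row holds $(x(t), x(t-\tau_1), \ldots, x(t-\tau_{m-1}))$ for a given set of non-uniform lags.

```python
import numpy as np

def nonuniform_embed(x, lags):
    """Build delay vectors (x(t), x(t - tau_1), ..., x(t - tau_{m-1}))
    for a list of (not necessarily uniform) lags."""
    lags = [0] + sorted(lags)
    max_lag = lags[-1]
    n = len(x) - max_lag
    # column for lag tau starts at index max_lag - tau
    return np.column_stack([x[max_lag - tau : max_lag - tau + n] for tau in lags])

x = np.arange(100, dtype=float)
X = nonuniform_embed(x, [3, 10])  # columns: x(t), x(t-3), x(t-10)
```

The hard part, which the methods below address, is choosing the lag set itself rather than assembling the vectors.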

### A. Garcia and Almeida

One of the earliest methods of choosing non-uniform delays was proposed by Garcia and Almeida.^{47} They proposed a variation on the nearest neighbor methods of Kennel and Hegger, applied to the problem of selecting time delays. Their method recursively selects lags using a proposed $N$-statistic over multiple embedding cycles. At the end of each cycle, the false nearest neighbors algorithm is used to assess the quality of the newly constructed embedding. This process is repeated until the false nearest neighbor statistic $F$ decreases below a critical threshold.

For the selection of the first time lag $\tau_1$, a 2D delay embedding $\vec{x}(t)$ is first constructed with respect to some prospective time lag $\tau^*$ to be tested,

The closest neighbor $\vec{x}(t_j)$ for each point $\vec{x}(t_i)$ in the embedding reconstruction is identified. Neighbors should be chosen such that they are not temporally close (i.e., with respect to some Theiler window).^{48,49} This is to ensure that their spatial proximity is not purely due to their temporal proximity. The two Euclidean distances $d_{1,\tau^*}(t_i)$ and $d_{2,\tau^*}(t_i)$ between any given two points are then calculated as follows:

where $\delta t$ is the sampling time of the data. Simply put, $d_{1,\tau^*}$ is the spatial separation between pairs of nearest neighbors in the reconstructed state space and $d_{2,\tau^*}$ is the resulting separation one step forward in time. The $N$-statistic is taken as the proportion of points whose distance ratio $d_{2,\tau^*}/d_{1,\tau^*} > 10$,

where $N$ is the length of the time series and $\mathbb{1}$ is the indicator function. The threshold of 10 was heuristically selected by the authors based on the numerical calculations of Kennel *et al.*^{46} The time lag corresponding to the first minimum of $N(\tau^*)$ is taken to be the embedding lag.
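A rough sketch of the $N$-statistic for a single candidate lag might look as follows, using a brute-force nearest-neighbor search with a Theiler window; the parameter names and defaults are our own, and the pair bookkeeping is simplified relative to the original algorithm.

```python
import numpy as np

def n_statistic(x, tau, theiler=10, ratio_threshold=10.0):
    """Fraction of nearest-neighbor pairs in the 2D embedding (x(t), x(t+tau))
    whose separation grows by more than ratio_threshold one step ahead."""
    n = len(x) - tau - 1
    X = np.column_stack([x[:n], x[tau:tau + n]])
    count, total = 0, 0
    for i in range(n - 1):
        d = np.linalg.norm(X - X[i], axis=1)
        d[max(0, i - theiler):i + theiler + 1] = np.inf  # Theiler window
        j = int(np.argmin(d))
        if j + 1 >= n or not np.isfinite(d[j]) or d[j] == 0:
            continue
        d2 = np.linalg.norm(X[i + 1] - X[j + 1])
        total += 1
        if d2 / d[j] > ratio_threshold:
            count += 1
    return count / total if total else float("nan")

# Profile over candidate lags for a noisy sine wave; the first minimum
# of this profile would be taken as the embedding lag.
rng = np.random.default_rng(0)
t = np.arange(0.0, 20.0, 0.05)
x = np.sin(2 * np.pi * t) + 0.05 * rng.standard_normal(len(t))
profile = [n_statistic(x, tau) for tau in range(1, 40)]
```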

For non-uniform delay embedding, the selection of additional lags in each subsequent embedding cycle is done using a similar procedure. However, the reconstructed space used to calculate nearest neighbors and pairwise distances is constructed conditionally on the previously selected lags. Therefore, the selection of the $m$th embedding lag in a non-uniform embedding procedure will require neighbors and distances $d_{1,\tau^*}, d_{2,\tau^*}$ to be calculated using an embedding built from the $m-1$ lags that have already been chosen and the new candidate lag $\tau^*$,

### B. Continuity statistic

The continuity statistic was first proposed by Pecora *et al.* as a way to procedurally construct non-uniform delay vectors based on the idea of functional independence between vector coordinates.^{27} Takens and Sauer both discussed the requirement that an embedding reconstruction needs vectors whose coordinates are independent.^{3,67} Pecora *et al.* proposed a test for functional dependence between the components of a delay vector in order to assess the quality of an embedding. A functional dependence between vector coordinates implies

where $F$ is some arbitrary function. Constructing a non-uniform delay embedding requires iteratively building a collection of time lags $\vec{\tau} = \{\tau_1, \ldots, \tau_{m-1}\}$ that minimizes the likelihood of functional dependence between components. In each iteration, a prospective lag $\tau_i$ is tested for functional dependence with the existing lagged components corresponding to $\vec{\tau}$. If there is no significant functional dependence, then $\tau_i$ may be added to the collection of lags. To test the equality in Eq. (39), the authors assume that $F$ is smooth and use the property of continuity to quantify functional dependence.

Consider an existing $m$-dimensional embedding $\vec{x}_m(t) \in \mathbb{R}^m$ constructed from the lags $\vec{\tau} = \{\tau_1, \ldots, \tau_m\}$ and a potential new embedding lag $\tau_{m+1}$ to be tested. To test the functional dependence of a new lag, select a reference point $\vec{x}_m(t_0)$ in embedded space. If a smooth functional dependence exists, then the continuity condition states that points $\vec{x}_{m,i}$ near the reference point ($\|\vec{x}_{m,i} - \vec{x}_m(t_0)\| < \delta$) in the reconstructed space $\mathbb{R}^m$ should have lagged $(m+1)$th components that are also close to each other ($|x_i(t - \tau_{m+1}) - x(t_0 - \tau_{m+1})| < \epsilon$) (see Fig. 2).

The proportion $p$ of points $\vec{x}_{m,i}$ whose lagged components lie within $\epsilon$ of the reference point's lagged component can be calculated. This proportion is then compared against a null hypothesis: that the correspondence between these sets is due purely to chance. Large values of $p$ suggest a strong relationship between the $m$-dimensional reconstruction and the new $\tau_{m+1}$-lagged component. Pecora *et al.* suggest the usage of a binomial distribution with a critical value of $p^* = 0.5$ in order to decide whether a functional dependence exists with respect to some chosen $\epsilon$, due to its simplicity and robustness to noise.^{27}

For a given $\tau_{m+1}$ to be tested and a sample of points, the continuity test is applied with decreasing values of $\epsilon$ until the null hypothesis fails to be rejected. The smallest value for which the null hypothesis is still rejected is denoted $\epsilon^*$. This value is averaged over a collection of reference points sampled from the data to calculate the continuity statistic $\langle \epsilon^* \rangle(\tau_{m+1})$.

During each iteration of choosing a candidate lag $\tau_{m+1}$ for an existing collection of lags $\vec{\tau} = \{\tau_1, \ldots, \tau_m\}$, the continuity statistic profile $\langle \epsilon^* \rangle(\tau_{m+1})$ is calculated. The new lag $\tau_{m+1}$ is taken as the lag corresponding to a relative maximum of the continuity statistic profile. This is repeated until the desired embedding dimension (as per Whitney's theorem) is reached. Pecora *et al.* also propose an undersampling statistic that can be used as a termination criterion for the iterative selection of time delays. Further details can be found in the original paper.
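The following is a crude sketch of the continuity test for a single reference point and candidate lag. It uses a fixed $\delta$-neighborhood, a binomial critical value at $p^* = 0.5$, and a shrinking grid of $\epsilon$ values; the full method of Pecora *et al.* varies $\delta$ and averages $\epsilon^*$ over many reference points. All names and defaults here are our own.

```python
import numpy as np
from math import comb

def binom_crit(n, p=0.5, alpha=0.05):
    """Smallest k such that P(Binomial(n, p) >= k) <= alpha."""
    sf = 0.0
    for k in range(n, -1, -1):
        sf += comb(n, k) * p**k * (1 - p)**(n - k)
        if sf > alpha:
            return k + 1
    return 0

def continuity_eps_star(X, x_lagged, i, delta_frac=0.05, alpha=0.05):
    """Smallest epsilon (on a shrinking grid) at which the null hypothesis of
    'agreement by chance' is still rejected for reference point i."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf
    delta = np.quantile(d[np.isfinite(d)], delta_frac)
    nbrs = np.where(d <= delta)[0]  # delta-neighbors in embedding space
    if len(nbrs) == 0:
        return float("nan")
    crit = binom_crit(len(nbrs), alpha=alpha)
    diffs = np.abs(x_lagged[nbrs] - x_lagged[i])
    eps_star = float("nan")
    for eps in np.std(x_lagged) * np.linspace(1.0, 0.05, 20):
        if np.sum(diffs <= eps) >= crit:
            eps_star = float(eps)  # null still rejected at this eps
        else:
            break
    return eps_star

t = np.arange(0.0, 30.0, 0.05)
x = np.sin(2 * np.pi * t)
tau = 5
X = x[:-tau].reshape(-1, 1)  # current one-dimensional "embedding"
eps_star = continuity_eps_star(X, x[tau:], i=100)
```

Averaging `eps_star` over sampled reference points for each candidate lag yields the profile whose relative maxima are taken as new lags.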

This method was applied to a 2-torus, yielding embedding lags and embedding dimensions that matched theoretical expectations.^{27} The resulting reconstructed attractor was also found to be visually optimal. Similar results were obtained when the method was applied to the Lorenz chaotic time series. However, the resulting reconstruction appeared to be visually overfolded.

### C. PECUZAL

A criticism of the continuity statistic method is the ambiguity in selecting the optimal lag $\tau$ at each embedding iteration.^{64} In the original paper of Pecora *et al.*, the definition of "relative maxima" is unclear, and there is no objective criterion for selecting the best lag among multiple prospective local maxima.^{64} Additionally, the method does not consider the effects of selecting different distances $\delta$ used to define nearby points in the reconstruction. Finally, the undersampling statistic originally proposed as a breaking condition for the embedding algorithm is computationally intensive and does not indicate which of the prospective lags should be selected. A more detailed critique is provided by Kraemer *et al.*^{64}

Kraemer *et al.* suggested that the continuity statistic approach could be combined with Uzal's L-statistic^{29} in order to provide a fully automated method of constructing non-uniform embedding delays. In their paper, they provide a workflow that uses the continuity statistic to perform a coarse search over multiple lag times and identify a small set of potential lags. These usually correspond to the various local maxima of the continuity statistic profile $\langle \epsilon^* \rangle(\tau_{m+1})$.

The L-statistic is then used as an assessment criterion to select which of the prospective lags should be chosen in each embedding cycle. This addresses the ambiguity in selecting lags that is present when using the continuity statistic alone. The prospective lag whose extended delay embedding results in the largest decrease in the L-statistic is selected in each embedding cycle. The L-statistic also provides a breaking condition for the embedding algorithm: the embedding cycles end when there is no achievable decrease in the L-statistic among the collection of prospective lags, i.e., $\Delta L > 0$ between successive embedding cycles.

### D. Maximum derivatives on projection

The maximum derivatives on projection (MDOP) method was first proposed by Nichkawde as a geometrical alternative to the statistical and information theoretic approaches of mutual information and continuity statistics.^{40} MDOP optimizes an embedding based on the criterion that a reconstructed attractor should be maximally unfolded while minimizing redundancy in the delay components. Like the majority of non-uniform embedding methods, MDOP recursively constructs the delay vector through embedding cycles. Each cycle identifies a new time lag that maximizes the directional derivative $\varphi'_d(\tau)$ of points in the reconstructed state space.

Like Pecora's continuity statistic approach, MDOP begins with the criterion of functional dependence between each new prospective lag and an existing time delay reconstruction [see Eq. (40)]. However, unlike the continuity statistic, Nichkawde suggests using the directional derivative of the functional dependence $F$ [see Eq. (39)] to quantify the degree of redundancy in the embedding and the unfolding of the reconstructed attractor. This directional derivative is given by

where $\Delta x_m$ corresponds to the spatial distance between a pair of nearby neighbor points $\vec{x}(t_i), \vec{x}(t_j)$ in the reconstructed space with $m-1$ dimensions,

The sampled pairs of points should also be chosen such that their spatial closeness is not due to temporal proximity.^{48,49} This is easily achieved by imposing a Theiler window $l_{\mathrm{Theiler}}$ such that $|i - j| > l_{\mathrm{Theiler}}$.

Testing the inclusion of a new time lag $\tau_m$ requires evaluating the spatial variation in the prospective new lagged component, $\Delta F_{m,ij}$, given by the following equation:

This quantity is used to evaluate the directional derivative of a small region on the reconstructed attractor,

The directional derivative is evaluated with respect to each prospective new time lag $\tau_m$ and is averaged across randomly sampled close pairs of points across the entire reconstructed attractor,

where $\langle \cdots \rangle_{i,j}$ corresponds to the geometric mean across all sampled pairs of points. The author proposes using a geometric mean due to its robustness in the presence of outliers. In each recursive embedding cycle, the lag $\tau_m$ that maximizes the directional derivative $\beta_d(\tau_m)$ is selected for the reconstruction in the next cycle. This process is repeated until the desired embedding dimension $m$ is reached, where $m$ is chosen via one of the embedding dimension estimation methods such as false nearest neighbors.
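A sketch of the directional-derivative criterion follows: for sampled points, take the nearest neighbor outside a Theiler window and form the geometric mean of $|\Delta F| / \Delta x$. The names and defaults are ours, and the pair sampling is simplified relative to Nichkawde's algorithm.

```python
import numpy as np

def mdop_beta(X, x_lagged, n_samples=200, theiler=10, rng=None):
    """Geometric mean of |Delta F| / Delta x over nearest-neighbor pairs."""
    rng = np.random.default_rng(rng)
    n = len(X)
    logs = []
    for i in rng.choice(n, size=min(n_samples, n), replace=False):
        d = np.linalg.norm(X - X[i], axis=1)
        d[max(0, i - theiler):i + theiler + 1] = np.inf  # Theiler window
        j = int(np.argmin(d))
        dF = abs(x_lagged[i] - x_lagged[j])
        if np.isfinite(d[j]) and d[j] > 0 and dF > 0:
            logs.append(np.log(dF / d[j]))  # log for a stable geometric mean
    return float(np.exp(np.mean(logs))) if logs else float("nan")

# Assess a candidate lag on a logistic map series
x = np.empty(1000)
x[0] = 0.4
for k in range(len(x) - 1):
    x[k + 1] = 3.9 * x[k] * (1.0 - x[k])
tau = 1
X = x[:-tau].reshape(-1, 1)  # current one-dimensional "embedding"
beta = mdop_beta(X, x[tau:], rng=0)
```

Scanning `beta` over candidate lags and taking the maximum in each cycle mirrors the selection step described above.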

### E. Reduced autoregressive models

The reduced autoregressive model for non-uniform embedding was proposed by Judd and Mees^{60,68} as a method of constructing ideal models with respect to some information criterion. This method involves the construction of a pseudo-linear autoregressive predictive model with all $k$ possible lagged components as inputs or basis functions,

where $\lambda = (a_0, \ldots, a_k)^T$ are the coefficients of each input to be determined and $\epsilon_t$ are the model prediction errors. For a time series of length $N$, construct a matrix $V_\tau$ with each row containing a vector of lags at a given time $t$,

The matrix $V_\tau$ has dimensions $(N - k) \times k$ and is defined with respect to a set of lags $B = \{\tau_1, \ldots, \tau_k\}$ with $\tau_1 < \tau_2 < \cdots < \tau_k$.

Estimates for the coefficients $\hat{\lambda}$ can be calculated using least squares regression,

where $\xi = (x(\tau_k + 1), x(\tau_k + 2), \ldots, x(N))$. Therefore, the resulting model errors utilizing the set of all possible lagged components $B$ can be calculated as $e_\tau = \xi - V_\tau \hat{\lambda}_\tau$.
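The least squares step can be sketched as follows on a synthetic AR(2) signal; we add an intercept column so that $a_0$ is estimated alongside the lag coefficients, and the function names are our own.

```python
import numpy as np

def lagged_design(x, lags):
    """Design matrix V of lagged values x(t - tau) and target vector xi = x(t)."""
    lags = sorted(lags)
    kmax = lags[-1]
    V = np.column_stack([x[kmax - tau : len(x) - tau] for tau in lags])
    xi = x[kmax:]
    return V, xi

def fit_ar(x, lags):
    """Least squares coefficients (with intercept) and residuals."""
    V, xi = lagged_design(x, lags)
    A = np.column_stack([np.ones(len(V)), V])  # model: a0 + sum a_i x(t - tau_i)
    lam, *_ = np.linalg.lstsq(A, xi, rcond=None)
    resid = xi - A @ lam
    return lam, resid

# AR(2) test signal: x(t) = 0.5 x(t-1) - 0.3 x(t-2) + noise
rng = np.random.default_rng(0)
x = np.zeros(2000)
for i in range(2, len(x)):
    x[i] = 0.5 * x[i - 1] - 0.3 * x[i - 2] + 0.1 * rng.standard_normal()
lam, resid = fit_ar(x, [1, 2])
```

The residuals `resid` play the role of $e_\tau$ above and feed into the description length comparison described next.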

In order to reduce the number of lagged components to a smaller selection, Judd and Mees propose the method of minimum description length. The principle of minimum description length is an application of Occam's razor to the context of model selection. It posits that the best model for a given time series prediction task is the one that achieves the most concise description of the data. For model selection, this requires a compromise between model accuracy and model complexity (i.e., model description length). The model description length $L$ may be approximated by

where $L_\epsilon$ is the description length of the model errors, which is a function of the length of the time series $N$ and the mean square prediction error $\bar{\epsilon}$, and $L_\lambda$ is the description length of the model parameters.

The algorithm for reducing the number of lagged components is as follows:

1. Construct an empty set $B'$ of chosen lags and a set $B$ of candidate lags.

2. Define the prediction error with respect to the chosen set of lags $B'$ as $e_{B'} = \xi - V_{B'}\lambda_{B'}$, where $V_{B'}$ and $\lambda_{B'}$ are defined with respect to the smaller set of lags $B'$. If $B'$ is empty, then $e_{B'} = \xi$.

3. Calculate the vector $\mu = V_\tau^T e_{B'}$ and identify the index $p$ of the largest magnitude element, corresponding to the most significant lag component. Add this lag, $B' \cup \{\tau_p\} \to B'$.

4. Recalculate $\mu$ with respect to the new set of chosen lags $B'$ and verify that the least significant lag component is the lag $\tau_p$ that was most recently added. Otherwise, return to step 2.

5. Evaluate the model description lengths $L_B$ and $L_{B'}$ for the sets of lags $B$ and $B'$. If $L_{B'} < L_B$, return to step 2. Otherwise, end the algorithm and return the set of chosen lags $B' \to B$.

An implementation of the minimum description length criterion for optimal embedding lag and window was given by Small and Tse.^{69,70} An extension of this method was proposed by Hirata *et al.*,^{61} where a normalized maximum likelihood is used in place of the description length $L$ for model selection.^{71,72} Hirata *et al.* also propose a variation of the above algorithm of Judd and Mees based on cross-validation that utilizes radial basis models instead of pseudo-linear models.

### F. Search optimization algorithms

Many of the common non-uniform embedding methods involve a single optimization step in each embedding cycle. In contrast, search optimization algorithms attempt to search across the space of possible lags to identify an ideal combination of embedding lags without necessarily selecting the first local optimum encountered. Two examples of such approaches are the ant colony optimization (ACO) method^{65} and Monte Carlo decision tree search (MCDTS).^{66}

Ant colony optimization (ACO) is a swarm intelligence method first proposed by Dorigo^{73,74} that is inspired by the foraging behavior of ant colonies. Similar to other swarm optimization methods, ACO initializes a number of agents ("ants") that simultaneously perform an initial search of the solution space. The quality of each attempted solution is assessed according to an objective function, and a "pheromone" is assigned to the corresponding search path. These pheromones accumulate and fade over time. This biases the search direction of subsequent iterations of the algorithm and is reminiscent of the optimal path finding behavior of real world ant colonies.

The ant colony optimization method applied to non-uniform embedding (ACO-NE) builds upon this framework in two ways: (i) it uses an objective function based on various notions of embedding quality (mean neighborhood distance, minimum false nearest neighbors, and minimum description length^{69,70}) to optimize parameters, and (ii) it incorporates heuristics into the algorithm to speed up convergence. Interested readers are encouraged to refer to the original paper for more details.^{65}

Another search optimization algorithm, MCDTS, proposed by Kraemer *et al.*,^{66} reframes the non-uniform lag selection problem as a decision tree search. Each embedding cycle is represented by a collection of leaves or nodes stemming from a root (the original time series), where each leaf is the selection of a particular candidate embedding lag. A Monte Carlo approach is used to randomly sample the tree and identify various local optima for a given objective function, and backpropagation is then used to decide on the best selection of lags in each step.

## VIII. PERSISTENT STRANDS AND CHARACTERISTIC TIMES

The focus of this paper is the problem of selecting time delays for delay embedding and, in particular, for non-uniform delay embedding. Many of the proposed methods for optimizing the embedding delay in both the uniform and the non-uniform case fall into the broad categories of dynamics (e.g., mutual information, continuity, etc.) or topology (fill-factor, MDOP). With the exception of the PECUZAL automated embedding framework, non-uniform delay embedding strategies only focus on one of these two broad aspects in their definition of a good embedding.

Another weakness of non-uniform embedding methods is that they do not always provide a full picture of the relative significance of each delay. When operating under the iterative construction of delay vectors, each prospective new time delay must be reevaluated with respect to the most recently updated embedding. Hence, the significance of each subsequent delay is conditional on the previously selected sequence of delays. This is a weakness particularly in fully automated algorithms such as MDOP and PECUZAL, where ideal embedding lags are selected automatically, with little to no reference to their relative impact on embedding quality. Additionally, there is often a lack of consistency between the results of different methods. This can be attributed to the fact that optimization is done with respect to different notions of embedding quality. Often, these methods do not provide a dynamical explanation for each time lag's significance and ability to improve a given embedding.

In view of this, we argue that a good non-uniform embedding method should have two main qualities. First, the embedding criteria should utilize both dynamical (irrelevance, periodicity, independence of coordinates) and topological (attractor unfolding) features in the selection of embedding lags. Second, the significance of each selected embedding lag should be explainable.

As our contribution, we propose a new method, "significant times on persistent strands" or SToPS, for identifying non-uniform embedding time lags using techniques drawn from persistent homology and recurrence analysis. We introduce the idea of a characteristic time scale spectrum of a signal based on the periodicities of the time series and show how this may be used to identify ideal time delays. Multiple time delays are also treated independently, marking a contrast to the iterative approach of constructing delay vectors that is common in most non-uniform delay embedding methods. We demonstrate the performance of our method on a collection of standard periodic and chaotic time series from the literature. Additionally, we explore its performance on experimental neuron data containing fast-slow dynamics and show that SToPS is sensitive in identifying explainable time delays.

### A. Introduction to persistent homology

Persistent homology has seen a recent growth in popularity, particularly in the field of dynamical systems.^{75–79} We also note that recent work has attempted to automate the selection of delay embedding parameters with persistent homology.^{80} In essence, persistent homology aims to quantify and track the evolution of the topological properties of an object (network, point cloud data, etc.) under an increasing notion of distance.^{76} The process of gradually increasing distances is referred to as "thickening." For simplicity, we refer to the increasing distance $\epsilon$ as analogous to increasing time. Topological features that persist for a large interval of distance under the thickening process are deemed significant. Conversely, short-lived topological changes are typically perceived to be noise.

We describe the thickening process as follows. Consider a point cloud $A = \{x_i\}$ arranged in the pattern of a circle. Place an open ball $B_\epsilon(x_i)$ of radius $\epsilon$ centered at each point and let the union of all open balls be the set of interest,

$$B(\epsilon) = \bigcup_i B_\epsilon(x_i),$$

where we are interested in calculating the homology of the set $B(\u03f5)$ whose complex is given by $K(B(\u03f5))$. As $\u03f5$ is increased, the set of open balls also increase in size, forming a filtration shown below,

where $K_n = K(B(\epsilon_n))$ and $\epsilon_0 \leq \epsilon_1 \leq \cdots \leq \epsilon_n$. More precisely, each $K_n$ is a simplicial complex (a collection of simplices), with successive values of $\epsilon_n$ yielding a nested sequence called a filtration. Computation of the thickening process is well documented, with two main algorithms employed, Vietoris-Rips^{81} and Čech,^{82} of which we employ the former.

For this case, we are concerned with changes in the homology of the set as $\epsilon$ increases. Simply put, persistent homology aims to enumerate and track the number of $n$-dimensional “holes” in the set.^{83} Namely, $H_0$, $H_1$, and $H_2$ for cases of low-dimensional homology. The zero-, one-, and two-dimensional holes correspond to disjoint components, cycles, and voids, respectively. Simplices (triangles, tetrahedrons, etc.) are considered solid components and are, therefore, not holes.
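The $H_0$ case can be made concrete: in a Vietoris-Rips filtration, components merge exactly when an edge shorter than $\epsilon$ enters, so the zero-dimensional persistence diagram can be computed with a union-find pass over edges sorted by length. The following is a minimal illustrative sketch (not the implementation used here; $H_1$ and higher require a full boundary-matrix reduction):

```python
import numpy as np

def h0_persistence(points):
    """H0 persistence pairs (birth, death) for a Vietoris-Rips filtration.

    Every connected component is born at epsilon = 0; a component dies
    when the edge merging it into another component enters the filtration.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges sorted by length: the filtration order.
    edges = sorted(
        (float(np.linalg.norm(points[i] - points[j])), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    pairs = []
    for eps, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj                 # two components merge at eps
            pairs.append((0.0, eps))        # one of them dies here
    pairs.append((0.0, float("inf")))       # the final component persists
    return pairs

# Two well-separated clusters: 40 points give 39 finite deaths, of which
# exactly one (the merge between clusters) is long-lived.
rng = np.random.default_rng(0)
cluster_a = rng.normal(0.0, 0.05, size=(20, 2))
cluster_b = rng.normal(0.0, 0.05, size=(20, 2)) + np.array([5.0, 0.0])
pairs = h0_persistence(np.vstack([cluster_a, cluster_b]))
```

The single long-lived pair is the "significant" feature in the sense above; the many short-lived pairs are the noise-scale merges within each cluster.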

In the context of low-dimensional strange attractors, persistent homology tracks the persistence or lifetime (death time minus birth time) of homological features (usually $H_1$ holes) of the attractor using spatial data from an embedding or otherwise. All the information pertaining to the birth and persistence of features can be represented easily in a persistence diagram, in which each homological feature (i.e., $H_0$, disjoint components; $H_1$, holes) is represented by a plotted point with coordinates $(\epsilon_b, \epsilon_d)$ corresponding to its birth and death times, respectively.

Tracking the birth and death times of homological features is useful in two ways. First, it allows the locations of holes within the data to be tracked. When analyzing phase space trajectories, these holes may correspond to short-term pseudo-periods or turning points in the time series (see Fig. 5). By tracking $\epsilon$ and representative cycles, one can also identify the boundary points of holes. Second, the lifetime of a homological feature provides an estimate of the relative size of the feature, which may or may not be related to its significance depending on the type of data. All of this information may be represented in a persistence diagram (see Fig. 3), where points further from the diagonal represent more persistent homological features.

### B. Characteristic time scales

One approach to selecting lags for non-uniform embedding is to select values related to the natural time scale of the system’s dynamics. Picking lags that are much smaller or larger would logically correspond to the cases of high redundancy and irrelevance, respectively. Consider the simplest case of a periodic signal. One can argue that an ideal embedding requires a delay related to the time scale of its main dynamics, i.e., its periodicity. For this, we employ the quarter period heuristic in the definition of the characteristic time $\tau$,

$$\tau = \frac{T}{4},$$

where $T$ is the period of the signal.

A natural progression of this concept to more complex dynamics is to take lags from the collection of natural frequencies,

$$\tau_i = \frac{T_i}{4},$$

where $T_i$ is the period corresponding to the natural frequency $\omega_i$.

Because $\tau_i$ is evaluated at individual frequencies $\omega_i$, which may vary greatly in magnitude, it is possible to capture the dynamics of systems with multiple time scales, such as fast-slow dynamical systems. Relating embedding lag with natural frequencies and periodicity also introduces a degree of explainability to the selection of $\tau$ that directly relates to the dynamics. While this presents a potential advantage over typical non-uniform embedding methods, it requires one to be able to accurately measure the natural frequencies (or equivalent) from any given signal.

If the analyzed signals were relatively smooth and easily decomposable into sinusoids, Fourier transforms would provide an excellent solution to this problem. However, this approach quickly fails when analyzing discontinuous-like signals such as neuron voltages, where the time series is characterized by alternating phases of bursting and resting dynamics. Fourier transforms also require the time signal to be statistically stationary. This does not hold for chaotic time series, where the phase and period of the signal vary over time. For example, the frequency power spectrum of a chaotic signal such as Lorenz or Chua produces a broad spectrum with an exponential decay. Similar arguments may be made when analyzing experimental time series, where drifting and oscillations in the phase and/or period are expected.

Another alternative method to detect periodicity is the notion of recurrence distance employed in recurrence analysis and unstable periodic orbit detection,^{84}

$$d(\tau) = \lVert \vec{x}(t+\tau) - \vec{x}(t) \rVert,$$

where $\tau$ is a characteristic time if the recurrence distance $d(\tau)$ is below some threshold $d^*$,

$$d(\tau) \leq d^*.$$

The recurrence distance tracks the displacement between a point in state space and its future trajectory. For a periodic orbit and a correctly selected $\tau$, this results in a local minimum of $d(\tau)$ where the periodic orbit returns close to its initial position. However, this method of detecting characteristic times $\tau$ is unsuitable on its own, as it cannot distinguish cases where $T = \tau/2$ and the resulting embedded trajectory clusters along the diagonal.
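For a scalar series, the recurrence distance can be approximated by averaging the displacement between the series and its $\tau$-shifted copy; for a clean sinusoid, the minimum falls at the period. A brief sketch (illustrative only; the function name is our own):

```python
import numpy as np

def recurrence_distance(x, tau):
    """Mean displacement between a series and its tau-shifted copy."""
    return float(np.mean(np.abs(x[tau:] - x[:-tau])))

# A sinusoid with a period of 100 samples: d(tau) has a deep local
# minimum at the period, where the orbit returns on itself.
t = np.arange(2000)
x = np.sin(2 * np.pi * t / 100.0)
taus = np.arange(50, 151)
d = np.array([recurrence_distance(x, tau) for tau in taus])
best_tau = int(taus[np.argmin(d)])  # → 100
```

Note that $d(\tau)$ is equally small at every integer multiple of the period, which is one face of the ambiguity discussed above.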

We propose a method of identifying and weighting the significance of characteristic times from a time series. The identification of characteristic times is done by sampling “strands” (short contiguous windows of the 2D delay embedded time series) and calculating their persistent homology. The representative cycles of the persistent strands’ calculated homologies are used to assign a significance score to each identified characteristic time in order to construct a characteristic time spectrum. This spectrum can then be used to inform the selection of lags for non-uniform time delay embedding. This combined framework is named Significant Times on Persistent Strands (SToPS).

## IX. SIGNIFICANT TIMES ON PERSISTENT STRANDS

### A. Persistent strands

The first challenge to tackle is the identification of all pseudo-periodic behavior of period $T$. For a characteristic time $\tau$, the quarter period heuristic suggests that the corresponding 2D embedding with coordinates $(x(t), x(t-\tau))$ will result in a periodic orbit that is approximately maximally convex in reconstructed space. We note that the precise shape of this orbit is not guaranteed to be circular and varies depending on the signal.

To test for periodic behavior with characteristic time $\tau$ for a signal $x(t)$, a 2D delay embedding of the time series with a single lagged component is constructed,

$$\vec{x}(t) = (x(t), x(t-\tau)).$$

A collection of $N$ strands of length $l = 4\tau$ with random initial positions $t_i$ is then sampled from the time series, where each strand is given by

$$\vec{x}_i(t) = \{(x(t), x(t-\tau)) : t \in [t_i, t_i + l]\}.$$
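The sampling step can be sketched as follows: build the 2D delay embedding for a candidate $\tau$, then cut out windows of length $l = 4\tau$ at random starting indices (uniform random starts here, in place of the $k$-means++ sampler described later):

```python
import numpy as np

def sample_strands(x, tau, n_strands, rng):
    """Sample strands of length l = 4*tau from the 2D delay embedding.

    Each strand is an (l, 2) array of points (x(t), x(t - tau)).
    """
    emb = np.column_stack([x[tau:], x[:-tau]])  # 2D delay embedding
    l = 4 * tau
    starts = rng.integers(0, len(emb) - l, size=n_strands)
    return [emb[s:s + l] for s in starts]

rng = np.random.default_rng(1)
x = np.sin(2 * np.pi * np.arange(5000) / 200.0)
strands = sample_strands(x, tau=50, n_strands=10, rng=rng)
# Each strand has 4 * 50 = 200 embedded points.
```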

We argue that strands of this length should be approximately sufficient to detect loop structures in the 2D embedding based on the quarter period heuristic. The persistent homology of each strand can be calculated to detect the presence of orbits with period $l$. For each strand $\vec{x}_i(t)$, the maximum persistence is extracted from the resulting persistence diagram. A sampled strand is said to contain an orbit of length $l$ if the maximum persistence of the corresponding diagram exceeds a critical value $\rho$, taken to be the average distance between two consecutive observations in phase space. Therefore, a naïve value $P_i$ that quantifies the significance of a characteristic time $\tau$ can be defined as

$$P_i(\tau) = \begin{cases} \max\,(\epsilon_d - \epsilon_b), & \text{if } \max\,(\epsilon_d - \epsilon_b) > \rho, \\ 0, & \text{otherwise}, \end{cases}$$

where the maximum is taken over all features $(\epsilon_b, \epsilon_d)$ in the strand’s persistence diagram.

This value can be used to calculate an overall maximum persistence spectrum $P(\tau)$ by averaging over all non-zero scores $P_i$ for each characteristic time $\tau$,

$$P(\tau) = \langle P_i(\tau) \rangle_{P_i \neq 0}.$$

We also impose an additional constraint that the number of points used to reconstruct the boundary of the hole ($H_1$ homology) in the orbit should be at least $N_{\text{hole}}$. This avoids the problem of including spurious holes, where a small number of points suggest the existence of a hole even though the embedded strand is insufficiently long to close the orbit (Fig. 4). We select a minimum value of $N_{\text{hole}} = 8$ for our analyses based on the argument that 8 points should be sufficient to identify a hole at a small lag $\tau = 2$ without discounting higher lags. This value was found to work well in our analyses.

The maximum persistence spectrum provides a simple way to quantify the degree to which $\tau $ corresponds to a dynamically significant time scale (i.e., periods). However, $P(\tau )$ is biased toward larger features due to its reliance on the lifetime of each homological feature. As a result, spatially small but dynamically significant features over small time scales are underrepresented. We also note that $P(\tau )$ is unable to differentiate between pathologically inefficient embeddings such as those whose loops are not maximally circular or have trailing tails (see Fig. 5).

As we can see from Sec. III A, a good embedding is one that has been maximally unfolded to best utilize the reconstructed state space, while being robust to the effects of noise amplification. Therefore, the significance of each characteristic time $\tau $ should be weighted according to how well the corresponding strand “unfolds” into a loop structure in the 2D delay embedding with lag $\tau $.

From this, we propose the significance score $S(\tau)$, a measure of the dynamical significance of each characteristic time $\tau$ that accounts for the quality of unfolding of sampled strands. Using the topological notion of a good embedding, the significance score $S_i(\tau)$ for the $i$th sampled strand is defined as

$$S_i(\tau) = \alpha_i(\tau)\,\gamma_i(\tau),$$

where $\alpha (\tau )$ and $\gamma (\tau )$ are two separate measures named the circularity and efficiency, respectively. This score is also not biased toward larger homological features, allowing for the detection of both small and large pseudo-periodic dynamics.

The circularity $\alpha (\tau )$ tries to quantify the quality of unfolding of a persistent strand in embedded space. Embeddings that yield circular loops imply a selected lag $\tau $ that maximally unfolds the dynamics (i.e., $\tau $ may correspond to a characteristic time). Therefore, loops that are more circular or regular have higher values compared to ellipses and other shapes with eccentricities. This allows circularity to also function as a measure of redundancy with lower $\alpha $ corresponding to high redundancy.

To calculate $\alpha_i$ for a given sampled strand, the boundary points $\vec{b}_i$ corresponding to the birth of the most persistent homological feature are identified by examining representative cycles. Principal component analysis (PCA) is used to identify the major and minor axes of the embedded hole. Because each strand is used to evaluate a single lag in a 2D delay embedding, the first and second principal eigenvalues approximately correspond to the relative sizes of the major and minor axes of the embedded points’ bounding ellipse. Hence, we define circularity as the ratio of the second to the first principal eigenvalue, averaged across all strands,

$$\alpha(\tau) = \left\langle \frac{\lambda_{2,i}}{\lambda_{1,i}} \right\rangle_i,$$

where $\alpha(\tau) \in (0,1]$ and $\lambda_{1,i} \geq \lambda_{2,i}$ are the principal eigenvalues for strand $i$. As $\alpha \to 1$, embedded holes are more uniformly circular.
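For a single strand, the circularity computation reduces to a PCA of the boundary points; a sketch, assuming the raw eigenvalue ratio is used (taking its square root, i.e., the axis-length ratio, would satisfy the same bounds):

```python
import numpy as np

def circularity(boundary_pts):
    """Ratio of second to first principal eigenvalue of the boundary.

    Values near 1 indicate a circular hole; small values an eccentric
    (poorly unfolded) one. Assumption: raw eigenvalue ratio, not its
    square root.
    """
    centered = boundary_pts - boundary_pts.mean(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(centered.T))  # ascending order
    return float(eigvals[0] / eigvals[1])

theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
ellipse = np.column_stack([2.0 * np.cos(theta), np.sin(theta)])
# circularity(circle) ≈ 1.0; circularity(ellipse) ≈ 0.25
```

The 2:1 ellipse scores 0.25 because the variances along the axes scale with the squared axis lengths.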

Efficiency $\gamma(\tau)$ is defined by the ratio of two areas,

$$\gamma_i(\tau) = \frac{A_h}{A_{pc}},$$

where $A_h$ is the area of the hole given by the ordered set of boundary points $\vec{b}_i$ and $A_{pc}$ is the area of the smallest convex polygon that includes all points in the strand $\vec{x}_i(t)$. The area of the hole $A_h$ can be calculated simply using the shoelace algorithm.^{85} The area $A_{pc}$ can be calculated similarly after using a Graham scan to identify the smallest bounding convex polygon.^{86} Similar to $\alpha(\tau)$, the efficiency score is also bounded, with $\gamma(\tau) \in [0,1]$. Efficiency measures how well the ambient space of an embedded strand is utilized with respect to the hole. It is used to detect cases where the detected hole is circular but does not utilize the full length of the strand (see Fig. 5).
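A sketch of the efficiency computation, pairing the shoelace formula with a convex hull (a monotone-chain hull here, in place of the Graham scan):

```python
import numpy as np

def shoelace_area(poly):
    """Area of a simple polygon given by ordered vertices (shoelace)."""
    x, y = poly[:, 0], poly[:, 1]
    return 0.5 * abs(float(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1))))

def convex_hull(points):
    """Smallest bounding convex polygon (Andrew's monotone chain)."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return np.array(pts)

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return np.array(lower[:-1] + upper[:-1])

def efficiency(boundary_pts, strand_pts):
    """Ratio of the hole's area to the strand's convex hull area."""
    return shoelace_area(boundary_pts) / shoelace_area(convex_hull(strand_pts))

# Unit-square hole inside a strand with one stray point: gamma < 1.
hole = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
strand = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [2.0, 0.0]])
```

When the boundary and strand coincide, $\gamma = 1$; the stray point at $(2, 0)$ enlarges the hull to area $1.5$, giving $\gamma = 2/3$.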

## X. TESTING METHODOLOGY

SToPS was tested on three different types of time series covering periodic, chaotic, and fast-slow dynamics.

First, a sum of sines signal (see Fig. 6) with step size $dt = 0.001$ was used to simulate the case of periodic time series,

$$x(t) = \sum_{i=1}^{N} A_i \sin(2\pi \omega_i t + \varphi_i),$$

where $N = 3$ is the number of superimposed sine waves. Phases $\varphi = \{0, 0.25, 0.75\}$ and magnitudes $A = \{1, 0.5, 0.2\}$ were selected to ensure that the spatial scales of the dynamics were distinct (see Fig. 6).

The second time series analyzed was the $x$ component of the canonical Lorenz system, representing the case of chaotic dynamics. This time series was numerically integrated with a fourth-order Runge–Kutta scheme with an integration time step of 0.0004 and subsampled to an effective time step of $dt = 0.004$ for 25 000 steps.

The third time series consisted of experimental data measured from a lobster stomatogastric ganglion (STG) lateral pyloric (LP) neuron. This time series represents a third class of dynamics, fast-slow dynamics with two different spatial and temporal scales. Originally analyzed by Abarbanel,^{87} it includes two characteristic dynamics: small-magnitude, short-time-scale oscillations corresponding to neuron spiking and long-time-scale periodic behavior corresponding to neuron bursting (see Fig. 6). Additionally, the phase of the bursting dynamics varies slightly over time. This results in a gradual shift in the spatial position of the expected loop in a 2D embedding. As a result, strands with lengths $T$ that are too long are penalized, as these loops eventually get filled in.

All three input time series were normalized to zero mean and unit standard deviation before applying the embedding algorithms. A small amount of additive noise $\xi \sim \mathcal{N}(0, 0.001^2)$ was applied to the sum of sines and the digitized experimental lobster LP neuron data to ensure there were no exact overlaps in values. In all cases, the input data were limited to 25 000 points when calculating embedding lags to keep the computational time sufficiently short. This is due to the use of a $k$-means++ random sampler,^{88} whose computation grows with the number of points, to select strand locations that uniformly explore the reconstructed state space.

Different ranges of characteristic times were tested to calculate the efficiency, circularity, and significance score profile depending on the type of time series. Each $\tau$ involves sampling 250 strands of length $T = 4\tau$. Due to the poor computational scaling of the Vietoris-Rips filtration, long strands where $T$ is large are subsampled by a factor of $k$,

$$k = \left\lceil \frac{T}{N_p} \right\rceil,$$

where $N_p = 250$ is the approximate scale of the maximum allowable strand length. This value was chosen to accommodate the slow computation of the Vietoris-Rips filtration and associated persistent homology. The threshold $\rho^*$ used to define the minimum lifespan needed to classify a homological feature as significant was chosen to be the average distance between temporally adjacent points,

$$\rho^* = \big\langle \lVert \vec{x}(t_{i+k}) - \vec{x}(t_i) \rVert \big\rangle_{i,k},$$

where $\langle \cdots \rangle_{i,k}$ is the average across all points in the 2D delay embedded strand $\vec{x}(t)$ subsampled by a factor $k$. The circularity, efficiency, and significance score profiles were calculated using SToPS. Both mean $S_\mu(\tau)$ and standard deviation $S_\sigma(\tau)$ profiles are calculated and compared against a selection of embedding delay optimization measures. In our analyses, the peaks of each profile are selected by observation. However, automatic identification of peaks may be implemented by using a search algorithm to identify all local maxima in the significance score profiles.
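The threshold itself is a short average over temporally adjacent embedded points; a sketch for a single subsampled strand (averaging across strands is omitted):

```python
import numpy as np

def adjacent_point_threshold(strand, k=1):
    """Average distance between temporally adjacent points of a 2D
    embedded strand subsampled by a factor k."""
    sub = strand[::k]
    return float(np.mean(np.linalg.norm(np.diff(sub, axis=0), axis=1)))

# Four collinear points one unit apart: the threshold is exactly 1.
strand = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
rho_star = adjacent_point_threshold(strand)  # → 1.0
```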

The calculated significance scores were compared against the mutual information used to select $\tau$ for uniform delay embedding. Two automated methods, MDOP^{40} and PECUZAL,^{64} were also used as comparison benchmarks for non-uniform delay embedding. These calculations were done using implementations provided by the DynamicalSystems.jl package.^{89}

In addition to comparing the output lags from each method, it is of interest to determine how different non-uniform embeddings affect performance on prediction tasks. For each non-uniform embedding method, the first two dominant lags are taken to construct a 3D delay embedding. For SToPS, this is done by visually inspecting the $S_\mu(\tau)$ and $S_\sigma(\tau)$ profiles. Time lags for PECUZAL and MDOP were taken as the first two time lags detected in their respective iterative procedures.

The resulting delay embeddings were used to train a simple four-layer feedforward neural network consisting of one input layer, two hidden layers, and one output layer. Each hidden layer has 128 nodes with a ReLU activation function, giving an overall network architecture of 3:128:128:3.

The neural network is trained to output a one-step prediction,

$$\hat{\vec{x}}(t + dt) = F(\vec{x}(t)).$$

The learning rate was set to 0.001 with a batch size of 512 and run for 30 epochs. In each instance, only the first half of the time series data is used for training. The second half is reserved for validation. Validation is done by calculating $n$-step freerun predictions. This is calculated by providing an initial condition and feeding back the neural network outputs $n$ times to get the final prediction [see Eq. (65)], which is then used for calculating prediction error.
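The freerun loop can be sketched as follows, with a stand-in linear contraction in place of the trained network (the function and model names are our own):

```python
import numpy as np

def freerun_predict(model, x0, n_steps):
    """n-step freerun prediction: feed the model's output back n times."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = model(x)
    return x

# Stand-in one-step "model": a simple contraction toward the origin.
model = lambda v: 0.9 * v
x25 = freerun_predict(model, [1.0, 1.0, 1.0], 25)
# Each component is 0.9**25 of its initial value after 25 steps.
```

In the actual validation, `model` is the trained 3:128:128:3 network mapping a delay vector to its one-step-ahead counterpart.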

## XI. RESULTS AND DISCUSSION

### A. Significant times

#### 1. Periodic dynamics—Sum of sines

The periodic sum of sines time series represents the case of periodic dynamics with multiple time scales. The component frequencies were selected as $\omega = \{1, 5, 30\}$, corresponding to three different characteristic times at lags $\tau = (250, 50, 8)$. The sum of sines time series was a major challenge for the baseline mutual information $I(\tau)$ measure (see Fig. 7), whose minima identified only the highest frequency periodicity in the data. The max persistence $P(\tau)$ was also not useful in identifying any significant time lags from the time series.

For automated non-uniform embedding methods, PECUZAL found a single time lag at $\tau =341$. MDOP returned four time lags at $\tau =(241,42,271,180)$, two of which are close to the expected lags $\tau =(250,50,8)$. The other remaining time lags detected by both methods did not bear any clear relation to the expected characteristic times.

In contrast, the proposed significance score measures, $S_\mu(\tau)$ and $S_\sigma(\tau)$, both showed peaks around the expected time lags corresponding to approximately one quarter of the component periodicities (i.e., $T = 1, 1/5, 1/30$) (see Fig. 7). However, the peak corresponding to the $\omega = 5$ component (i.e., $\tau = 45$) is not clear, with two peaks occurring at nearby time lags instead. This anomaly may arise because the spatial and time scales of the second frequency are not dynamically distinct enough from the large time scale. When calculating the 2D embedding, this can cause persistent strands to form spirals instead of circular holes at the characteristic times. We note that circularity $\alpha(\tau)$ and efficiency $\gamma(\tau)$ provide quite different profiles, with the latter heavily influencing the shape of the resulting profile $S_\mu(\tau)$. A comparison of the phase space reconstructions between MDOP, PECUZAL, and SToPS is provided in the Appendix.

#### 2. Chaotic dynamics—Lorenz

For Lorenz, the minima of mutual information yielded lags at $\tau \approx (40, 150)$. This result matches closely with the maxima taken from the maximum persistence $P(\tau)$ profile. Of the two lags detected by PECUZAL, $\tau = (46, 24)$, one was similar to a minimum of the mutual information curve. Similarly, MDOP yielded three different lags, $\tau = (148, 192, 37)$, one of which approximately matches a minimum of the mutual information.

The significance measures calculated using SToPS yielded two distinct maxima across $S_\mu(\tau)$ and $S_\sigma(\tau)$, with lags at $\tau = (41, 130)$. The first lag is similar to the lags calculated by PECUZAL and MDOP. However, the second lag $\tau = 130$ results in an overembedding of the time series and does not directly correspond to any lag output by either PECUZAL or MDOP. Closer inspection of the embedded time series at $\tau = 120$ reveals the re-emergence of the lobes of the Lorenz attractor at $3/4$ of the period, with boundaries created by multiple dense loops. While this may produce well-defined holes in persistent homology near the lobes, it does not efficiently utilize all the points in the sampled strand (i.e., $4\tau = 520$ points) (see Fig. 4) and hence should not be classified as a characteristic time lag; this is reflected in the much lower significance score. Closer inspection of the sampled strands shows that part of the increase in $S_\mu(\tau)$ is attributed to the hole formed from the spread of trajectories near the saddle point of the attractor.

There is also an apparent correspondence between successive minima of mutual information and detected time lags in non-uniform embedding methods. This suggests that mutual information $I(\tau )$ may be useful for informing the selection of lags for non-uniform embedding. However, one difficulty is assessing whether the lag of a minimum is within an acceptable embedding window $m\tau $ such that irrelevance is not a problem.

#### 3. Fast-slow dynamics—Lobster LP neuron

From observing the lobster neuron data (see Fig. 6), it is possible to infer two dominant time scales corresponding to expected lags of approximately $\tau = (12, 400)$. Temporal variations in these can be attributed to observational noise or potentially chaotic dynamics. The uniform embedding measures of mutual information and max persistence were found to be poor at identifying lags, although max persistence $P(\tau)$ begins to increase quickly when approaching the expected characteristic lag $\tau = 400$. This is unsurprising, as 2D embedding at those lags begins to unfold large orbits from the large time scale dynamics corresponding to the slow bursting phase of the neuron (see Fig. 9).

For non-uniform embedding, both PECUZAL and MDOP yielded a large number of potential embedding lags. However, both methods failed to successfully recover the lags for the fast dynamics (see Fig. 8). Additionally, apart from the lag at $\tau \u2248500$ corresponding to the slow dynamics, both PECUZAL and MDOP produce multiple additional time lags with no obvious explainable relation to the time scales of the data.

The significance score using SToPS was able to retrieve the two main time lags present in the data at approximately $\tau = (5, 500)$. There appears to be a slight disagreement on the location of the larger time lag, with $S_\mu(\tau)$ and $S_\sigma(\tau)$ producing slightly different lag times. $S_\mu(\tau)$ also shows a small peak at approximately $\tau = (40, 60)$. Further inspection of the representative homology of the sampled strands reveals that this is the result of multiple overlapping orbits from the fast spiking dynamics. However, this feature is already captured by $\tau = 5$, and the peaks at $\tau = (40, 60)$ are an artifact of overembedded time series lying on similar orbits (see Fig. 4).

Based on these results, only SToPS was able to detect both dominant time scales in the data. The lags from PECUZAL and MDOP are conditional on the selection of previously detected lags due to the iterative approach employed by those algorithms. In contrast, SToPS produces a single characteristic time spectrum $S(\tau)$ from which the significance of each potential lag can be assessed and selected independently. We visually compare the resulting 3D delay embeddings of these three methods in Fig. 9. Similar comparisons for the sum of sines and Lorenz data are provided in the Appendix. For PECUZAL and MDOP, we select the first two non-zero delays in order of detection, at $\tau = (265, 496)$ and $\tau = (322, 552)$, respectively. The lags for SToPS were selected visually from $S_\mu(\tau)$ and $S_\sigma(\tau)$, with lag times $\tau = (5, 500)$. From the projections of the reconstructed state space, we find that SToPS associates different projections with dynamics of different time scales. This results in an unfolding that is visually easier to interpret. In contrast, PECUZAL requires a large number of dimensions before all dynamical components can be visually detected. Restricting the number of dimensions results in the fast dynamics being obscured in favor of the slow dynamics.

### B. Freerun predictions

Freerun prediction errors with models constructed from different non-uniform embeddings were calculated for the non-periodic Lorenz and lobster LP neuron time series. In each case, a 3D delay embedding was constructed using the first two lags detected with PECUZAL and MDOP. For SToPS, the first two visually dominant maxima of the $S_\mu(\tau)$ and $S_\sigma(\tau)$ profiles were selected. In the case of Lorenz, where only one relevant time lag exists (Fig. 10) (i.e., $\tau = 41$), we use a uniform embedding scheme where the second lag is a multiple of the first.

The prediction error for Lorenz was calculated using a 25-step freerun prediction with a trained feedforward neural network (see Sec. X). For the lobster LP neuron, training was done on a subsampled dataset that included every third point. This subsampling was done to reduce the training time. A 10-step freerun prediction was then evaluated with the subsampled data (i.e., equivalent to 30-step freerun prediction). The lower number of freerun prediction steps was used to allow a more accurate evaluation of the prediction performance within the fast neuron spiking regime of the time series. In both cases, the first half of the data was used for training and the second half used for testing and evaluation. Additional analyses for freerun prediction with non-subsampled data are provided in the Appendix.

The error of each prediction was calculated as the magnitude of the difference between the predicted and true delay vectors $n$ steps ahead,

$$E(t) = \lVert \hat{\vec{x}}(t + n\,dt) - \vec{x}(t + n\,dt) \rVert.$$

The resulting distributions of $E(t)$ for both time series with the three different embedding methods are given in Figs. 11 and 12. For Lorenz, we see that SToPS (persistent strands) provides a mean error in between PECUZAL and MDOP. In the experimental data case (lobster LP neuron), SToPS outperforms both measures with a lower error. These findings are also reflected in corresponding medians in both cases. The median prediction error in the Lorenz case was $0.106$ (SToPS), $0.122$ (PECUZAL), and $0.112$ (MDOP). The lobster LP neuron median prediction errors were $0.106$ (SToPS), $0.122$ (PECUZAL), and $0.112$ (MDOP). Additionally, SToPS shows an error distribution with a heavier tail for freerun predictions with the lobster LP neuron compared to PECUZAL and MDOP. Despite the potentially lower prediction error, we note that this improvement is not significant and should not be the targeted benefit of SToPS. Instead, we argue that the main advantage of SToPS is that it provides lags that are explainable in the context of the observed dynamics of the time series.

One advantage of using SToPS is the deliberate inclusion of both fast and slow time scales within the embedding of the time series. Therefore, it is expected that models trained on this embedding should be able to better resolve the fast dynamics that would otherwise have been missed if larger delays were selected. This is verified in Fig. 13, where SToPS produced a 10-step freerun prediction that better replicates the small-scale, fast spiking dynamics characteristic of the neuron time series. This is in contrast to models trained on the same number of lags from PECUZAL and MDOP, whose freerun predictions are not able to capture the same level of detail in the spiking dynamics. We note that this advantage is not as apparent when the full dataset without subsampling is used to train a model for prediction. However, the prediction errors were found to be slightly lower for SToPS in this case.

## XII. NOISE EFFECTS

One often cited benefit of persistent homology is its robustness to noise.^{90,91} Because only geometric features (i.e., holes) are tracked, the calculated homologies should be stable as long as the magnitude of the noise does not destroy the underlying structure of the embedded time series. To test this property, we repeat the analyses on the Lorenz dataset for varying signal-to-noise ratios and observe the changes in the resulting significance score profile $S_\mu(\tau)$. Additive Gaussian noise at signal-to-noise ratios $SNR = (1000, 500, 100, 50)$ was applied after normalizing the input Lorenz time series to zero mean and unit standard deviation. Results are shown in Fig. 14. We found that each of the profiles was relatively stable and robust to increasing noise levels, with little to no change in the observed significant lags. For high levels of noise ($>20$ dB), the spurious lag artifact at $\tau = 120$ disappeared. This is likely due to the effects of noise across multiple trajectories along a similar orbit destroying the underlying homology structure.
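The noise levels can be generated by scaling white Gaussian noise to a target signal-to-noise ratio, defined here as a variance ratio (an assumption for this sketch; the function name is our own):

```python
import numpy as np

def add_noise(x, snr, rng):
    """Add white Gaussian noise so that var(x) / var(noise) = snr."""
    sigma = float(np.std(x)) / np.sqrt(snr)
    return x + rng.normal(0.0, sigma, size=x.shape)

rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * np.arange(25000) / 250.0)
noisy = add_noise(x, snr=100, rng=rng)
measured_snr = float(np.var(x) / np.var(noisy - x))  # close to 100
```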

PECUZAL and MDOP likewise revealed similar lags for low noise levels ($<37$ dB). However, higher levels of noise resulted in gradual drifts in the detected lags. The lags from PECUZAL differed significantly at small $\tau$ for noise levels above 30 dB. In contrast, MDOP was relatively unaffected by noise, with only small drift effects. Additionally, MDOP was the only method able to produce any lag predictions for signal-to-noise ratios below 17 dB; both SToPS and PECUZAL were unable to produce any results.

## XIII. COMPUTATION COMPLEXITY

The computation of persistent homology is expensive and still suffers from poor computational scaling. While computing persistent homology for datasets with few points is relatively quick, the computation time grows rapidly for even moderately sized datasets. This presents a challenge for the SToPS method, as it requires the computation of persistent homology for multiple sampled strands across a large range of candidate lags. From the proposed algorithm, the time complexity is approximately

$$O\!\left(N_\tau\, N_s\, \varphi_{PH}(N_p)\right),$$

where $N_\tau$, $N_s$, and $N_p$ are the number of lags tested, the number of sampled strands, and the strand length, respectively, and $\varphi_{PH}(N_p)$ is the time complexity of the persistent homology computation. This varies depending on the implementation of the persistent homology algorithm and the type of filtration used (e.g., Čech, Rips, Delaunay, etc.). The Vietoris-Rips filtration has a simplicial complex whose size scales as $2^{O(N_p)}$ in current formulations.^{92} The computation of the Vietoris-Rips filtration can be split into two phases; however, calculating the computational complexity bounds is not straightforward.^{93} We note that there is ongoing work aimed at improving and optimizing the persistent homology algorithm, which has resulted in significant gains in performance.^{92,94}

A comparison of the approximate run times for SToPS, PECUZAL, and MDOP is provided in Table I. Computation was done on a Ryzen 7 4800HS with 16 GB of RAM using only one thread in order to keep results comparable. Despite its much longer computation time, we note that SToPS allows multiple lags to be considered in parallel, as the significance of each potential time lag is evaluated independently. This is in contrast to PECUZAL and MDOP, where an iterative embedding cycles approach is used and each new lag is selected conditional on previously selected lags. However, a new method proposed by Krämer *et al.* based on Monte Carlo tree search attempts to tackle this problem.^{66} The flexibility to assess embedding lags independently may provide large gains in computational speed, where the computation for multiple lags may be distributed across multiple threads. Our current implementation of the algorithm does not yet provide support for this.

Table I. Approximate run times of PECUZAL, MDOP, and SToPS for increasing maximum allowable lag $\tau_{max}$, with the embedding lags selected by PECUZAL and MDOP given in parentheses.

| Time series | $\tau_{max}$ | PECUZAL | MDOP | SToPS |
|---|---|---|---|---|
| Lorenz | 50 | 6.0 (45, 23) | 0.19 (42, 23) | 148 |
| | 100 | 10.2 (45, 23) | 0.23 (100, 62) | 982 |
| | 150 | 15.8 (45, 23) | 0.28 (147, 40) | 1689 |
| | 200 | 21.2 (45, 23) | 0.42 (147, 189, 40) | 2396 |
| Lobster | 50 | 7.1 (50, 24) | 0.39 (50, 31, 11, 44) | 95.9 |
| | 100 | 46.9 (100, 63, 81, 11, 35) | 0.48 (100, 51, 19, 81) | 881 |
| | 300 | 112 (300, 251, 114) | 0.82 (300, 149, 55, 245) | 3376 |
| | 600 | 2705 (515, 304, 132, 505, 32, 20, 253, 394, 81, 232, 55) | 1.62 (544, 274, 599, 114, 423) | 7266 |


The selection of new lags conditional on previously selected lags also means that the results of PECUZAL and MDOP are not robust to changes in the maximum allowable lag $\tau_{max}$. Changes to the range of potential lags can affect the order in which future lags are selected if a new candidate lag that better optimizes the objective statistic is introduced. For complex data, this can produce widely varying results as the maximum lag changes. The presence of noise in the data also results in different lags, as shown by the drifting lags in Fig. 14. Calculated lags for the lobster LP neuron time series were found to vary even across trials with different realizations of identically distributed noise.

We also note that the computation time for these embedding algorithms can vary depending on the complexity of the input time series. For example, successive embedding cycles in PECUZAL typically increase in computation time. For complex or noisy time series, PECUZAL may produce multiple lags of varying significance. This is seen in the jump in estimated lags for the lobster data between maximum lags of 300 and 600. In contrast, the SToPS algorithm assesses the significance of each lag independently, so its computation time should grow linearly with the number of lags tested. An exception occurs for small lags $\tau$, where strands are too short and contain no holes for the homology to track.

## XIV. CONCLUSIONS

One of the aims of this paper is to provide an overview of embedding fundamentals and review existing methods for optimizing embedding parameters. We first provide in Sec. I a brief overview of the fundamental concepts of embedding theorems and their applications in the context of time series analysis. A simple case for the usage of embedding in time series prediction tasks is also given for new or uninitiated readers. In this paper, we focus on the problem of identifying good embedding parameters, specifically the selection of embedding lags in non-uniform embedding. An overview of the considerations involved in selecting embedding parameters is provided in Sec. III, followed by a comprehensive review of various uniform delay embedding methods in Secs. IV–VI.

We argue that a non-uniform embedding approach provides more flexibility in reconstructing fast-slow dynamical systems. Following this, an overview of existing methods that attempt to automate the selection of non-uniform embedding lags is provided in Sec. VII. However, while many of these automated non-uniform embedding methods reliably return a collection of lags, they do not necessarily agree with each other or provide a satisfactory dynamical explanation for their selection. Furthermore, due to the iterative process used to select delays, the choice of each subsequent delay is conditional on previous selections.

In Secs. VIII and IX A, we propose a new method of selecting non-uniform embedding lags, SToPS, that aims to produce lags that have more dynamical explainability and are independently selected. SToPS utilizes persistent homology to detect loops formed by 2D delay embeddings of sampled windows of the time series, which we call “persistent strands.” This is done over multiple different lengths of sample windows to produce a characteristic time spectrum $S_\mu(\tau)$, where larger values of the significance score correspond to time scales that are dynamically significant (i.e., they relate to some notion of periodicity in the time series). The structure of each persistent strand loop is characterized by two quantities, circularity and efficiency, which are combined to give the significance score $S_i(\tau)$. Selection of time lags $\tau$ is done based on the mean and standard deviation profiles of the significance score, $S_\mu(\tau)$ and $S_\sigma(\tau)$.

The SToPS embedding method is tested on three different classes of time series: periodic (sum of sines), chaotic (Lorenz), and fast-slow (lobster LP neuron). In all cases, SToPS was found to detect dynamically explainable time scales that were not reflected in other reference non-uniform embedding methods PECUZAL and MDOP. Additionally, SToPS was found to outperform PECUZAL and MDOP in identifying dominant time scales for the lobster LP neuron where fast-slow dynamics are present.

The impact of each embedding method on time series prediction performance was also tested. Embedded time series were used to train a one-step neural network predictor. The resulting models were found to perform similarly across all embedding methods for both the Lorenz and lobster LP neuron time series. However, in freerun predictions of the lobster LP neuron time series, models trained using SToPS embedding lags were found to replicate the fast neuron spiking dynamics better than the reference embedding methods. We also provide a brief discussion and analysis of the computational efficiency of SToPS as well as its robustness to noisy input data.

Overall, while the performance of SToPS is only marginally better than the existing non-uniform embedding methods PECUZAL and MDOP, we argue that SToPS provides lags that are more dynamically explainable than its counterparts. The assessment of individual time lags also allows the method to be applied to multivariate time series by considering each component as an independent scalar time series and identifying their respective time lags. These may then be used to construct a delay vector that utilizes all components of the time series. The performance of SToPS in this context has not yet been tested and presents an avenue for further research. Additionally, the independent selection of lags via the characteristic time spectrum provides a clearer picture of the relative importance of each time lag when compared to existing iterative methods for automated non-uniform delay embedding, where only an explicit collection of lags is provided.
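The multivariate extension described above could be sketched as follows: each component is treated as an independent scalar series with its own lags, and the per-component delay coordinates are concatenated into a single delay vector. The lag values in the usage example are hypothetical, not lags produced by SToPS:

```python
def multivariate_delay_vector(components, lags_per_component, t):
    """Build a delay vector at time t from a multivariate series.

    components:         list of scalar time series (one per component)
    lags_per_component: one tuple of lags per component, e.g. as
                        obtained by running a lag-selection method
                        on each component independently
    """
    vector = []
    for series, lags in zip(components, lags_per_component):
        # Current value plus each lagged value for this component
        vector.append(series[t])
        vector.extend(series[t - tau] for tau in lags)
    return vector

# Hypothetical example: two components with lags (3, 7) and (5,)
x = list(range(100))
y = [2 * v for v in range(100)]
vec = multivariate_delay_vector([x, y], [(3, 7), (5,)], t=50)
```

The resulting vector interleaves information from all observed components while preserving each component's own characteristic time scales.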

However, the pursuit of more dynamically explainable delay lags introduces a level of subjectivity in the interpretation of the characteristic time spectrum. Nevertheless, we argue that focusing on selecting dynamically relevant and explainable delay lags is potentially a more meaningful approach to constructing models that are more relatable to observed system dynamics. This advantage is especially evident in systems with multiple disparate time and spatial scales, as demonstrated by the dynamics of the lobster LP neuron. We therefore propose dynamical relevance and explainability as a key additional consideration in the future development of time delay embedding methods.

## ACKNOWLEDGMENTS

D.C. was supported by the Australian Research Council through the Centre for Transforming Maintenance through Data Science (Grant No. IC180100030), funded by the Australian Government. D.W. and M.S. were supported by the Australian Research Council (Grant No. DP200102961). S.A. was supported by the Forrest Foundation. E.T. was supported by the Robert & Maude Gledden Foundation and the A.F. Pillow Applied Mathematics Trust. We would also like to thank H. Abarbanel and his team for providing us access to the experimental LP neuron data.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Eugene Tan:** Conceptualization (lead); Formal analysis (lead); Investigation (lead); Methodology (lead); Software (lead); Visualization (lead); Writing – original draft (lead); Writing – review & editing (lead). **Shannon Algar:** Conceptualization (supporting); Validation (supporting); Writing – review & editing (supporting). **Débora Corrêa:** Investigation (supporting); Supervision (supporting); Writing – review & editing (supporting). **Michael Small:** Investigation (supporting); Methodology (supporting); Project administration (supporting); Supervision (lead); Writing – review & editing (supporting). **Thomas Stemler:** Investigation (supporting); Supervision (supporting); Writing – review & editing (supporting). **David Walker:** Data curation (lead); Validation (supporting); Writing – review & editing (supporting).

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request and are openly available on GitHub at https://github.com/eugenetkj98/SToPS_Public, Ref. 95.

### APPENDIX A: NON-UNIFORM EMBEDDING PROFILES

A collection of profiles of the calculated statistics used to select embedding lags in the automated non-uniform embedding algorithms PECUZAL and MDOP is given in Fig. 15. The relevant statistics are the continuity statistic $\langle \epsilon^{\ast} \rangle$ for PECUZAL and the $\beta$ statistic for MDOP. For PECUZAL, the lags $\tau$ corresponding to local maxima in $\langle \epsilon^{\ast} \rangle$ are used as candidate lags, and the lag that results in the largest decrease in the $L$-statistic is chosen as the final embedding lag in each embedding cycle. A similar process is used for MDOP, but the global maximum of $\beta$ is chosen instead, with a termination criterion based on the false nearest neighbor (FNN) statistic used in conjunction.
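The two selection mechanics described above can be illustrated with a simple peak finder: PECUZAL treats all local maxima of the continuity statistic as candidate lags, while MDOP takes only the global maximum of $\beta$. The profile here is a hypothetical array, not the actual statistic:

```python
def local_maxima(profile):
    # Indices whose value exceeds both neighbours; PECUZAL treats
    # such peaks of the continuity statistic as candidate lags.
    return [i for i in range(1, len(profile) - 1)
            if profile[i] > profile[i - 1] and profile[i] > profile[i + 1]]

def global_maximum(profile):
    # MDOP instead selects the single lag maximising the beta statistic.
    return max(range(len(profile)), key=profile.__getitem__)
```

For instance, on the toy profile `[0, 1, 0, 2, 1, 3, 0]`, PECUZAL-style selection would consider lags 1, 3, and 5 as candidates, whereas MDOP-style selection would commit to lag 5 directly.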

### APPENDIX B: FREERUN PREDICTION OF NON-SUBSAMPLED DATA

This section contains results of additional freerun prediction tests for models trained on the full lobster LP neuron dataset without any subsampling. This is theoretically an easier task, as the model is provided with a larger amount of data and makes smaller magnitude predictions at each step. Prediction horizons of 10 steps (Fig. 16) and 30 steps (Fig. 17) were tested. The latter's prediction horizon is equal to that of the 10-step models trained on the subsampled data. In the 10-step non-subsampled case, SToPS is found to yield a lower prediction error than PECUZAL and MDOP. The replication of the fast spiking dynamics is relatively similar between all methods. Similar results were found for the 30-step prediction case. However, the replication of the fast spiking dynamics is poorer than in the 10-step prediction cases, likely due to the accumulation of errors in successively predicted values.
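Freerun prediction here means iterating a one-step predictor on its own outputs, which is why errors accumulate over longer horizons. A minimal sketch of the iteration, where the `one_step` callable is a hypothetical stand-in for the trained neural network predictor:

```python
def freerun_predict(history, lags, one_step, horizon):
    """Iterate a one-step predictor for `horizon` steps.

    history:  observed scalar series used to seed the prediction
    lags:     embedding lags (delay vector uses x[t - tau])
    one_step: callable mapping a delay vector to the next value
    horizon:  number of freerun steps
    """
    series = list(history)
    for _ in range(horizon):
        # Delay vector built from the (partly predicted) series;
        # errors in earlier predictions feed into later delay vectors.
        delay_vec = [series[-1]] + [series[-1 - tau] for tau in lags]
        series.append(one_step(delay_vec))
    return series[len(history):]

# Usage with a toy one-step rule (not a trained model):
preds = freerun_predict(list(range(10)), lags=(1, 2),
                        one_step=lambda v: v[0] + 1, horizon=3)
```

Each appended prediction becomes part of the input for the next step, so a 30-step horizon compounds errors three times as many times as a 10-step one.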

### APPENDIX C: PHASE SPACE RECONSTRUCTION COMPARISONS

A comparison of various reconstructed attractors for the sum of sines and Lorenz time series is provided in Figs. 18 and 19, with the first two detected lags given. For the periodic sum of sines, PECUZAL yielded only a single lag. For the Lorenz time series, SToPS yielded only a single peak at $\tau = 41$, which was subsequently used for uniform embedding. Both PECUZAL and SToPS share a similar lag at $\tau \approx 40$. MDOP yields lags that cause overembedding.

Recurrence plots for the corresponding 2D and 3D embeddings of the sum of sines, Lorenz, and lobster LP neuron time series are provided in Figs. 20–22. All embedding methods were able to preserve some of the periodic structure of the sum of sines dataset. However, the recurrence plots for MDOP and SToPS revealed more small-scale structure than that of PECUZAL. For Lorenz, both PECUZAL and SToPS yield similar recurrence plots. The MDOP recurrence plot loses some detail in comparison, likely due to overfolding of the attractor caused by large embedding lags. In the lobster LP neuron time series, the selection of a small lag with SToPS reveals the expected periodic behavior in several diagonal regions of the recurrence plot. This is not as clear in the PECUZAL and MDOP embeddings, where the detected periodic behavior is dominated by the slow neuron dynamics.
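The recurrence plots compared above follow the standard construction: two embedded states are recurrent when their distance falls below a threshold. A minimal sketch, with the threshold and Euclidean norm as illustrative choices:

```python
import math

def recurrence_matrix(points, eps):
    """Binary recurrence matrix: R[i][j] = 1 iff embedded states
    i and j lie within Euclidean distance eps of each other."""
    n = len(points)

    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    return [[1 if dist(points[i], points[j]) < eps else 0
             for j in range(n)] for i in range(n)]
```

Diagonal line structures in the resulting matrix correspond to stretches of trajectory that revisit earlier behavior, which is the periodic signature discussed for the embeddings above.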

## REFERENCES

*Dynamical Systems and Turbulence, Warwick 1980*(Springer, 1981), pp. 366–381.

*The Theory of Chaotic Attractors*(Springer, 2004), pp. 170–189.

*IEEE International Conference on Neural Networks*(IEEE, 1993), pp. 1786–1793.

*International Conference on Artificial Neural Networks*(Springer, 1997), pp. 999–1004.

*Proceedings of the 15th European Symposium on Artificial Neural Networks*(Schloss Dagstuhl, Leibniz Center for Informatics, 2007), pp. 471–482.

*Functional Differential Equations and Approximation of Fixed Points*(Springer, 1979), pp. 204–227.

*European Conference on Circuits Theory and Design*(European Circuit Society and the Institute of Electrical and Electronic Engineers, 2003).

*Topological Data Analysis*(Springer, 2020).