Estimating fractal dimensions: a comparative review and open source implementations

The fractal dimension is a central quantity in nonlinear dynamics and can be estimated via several different numerical techniques. In this review paper we present a self-contained and comprehensive introduction to the fractal dimension. We collect and present various numerical estimators and focus on the three most promising ones: generalized entropy, correlation sum, and extreme value theory. We then perform an extensive quantitative evaluation of these estimators, comparing their performance and precision using different datasets and comparing the impact of features like length, noise, embedding dimension, falsify-ability, among many others. Our analysis shows that for synthetic noiseless data the correlation sum is the best estimator with extreme value theory following closely. For real experimental data we found the correlation sum to be more strongly affected by noise versus the entropy and extreme value theory. The recent extreme value theory estimator seems powerful as it has some of the advantages of both alternative methods. However, using four different ways for checking for significance, we found that the method yielded ``significant' low-dimensional results for inappropriate data like stock market timeseries. This fact, combined with some ambiguities we found in the literature of the method applications, have implications for both previous and future real world applications using the extreme value theory approach, as, for example, the argument for small effective dimensionality in the data cannot come from the method itself. All algorithms discussed are implemented as performant and easy to use open source code via the DynamicalSystems.jl library.


I. INTRODUCTION
Fractal geometry deals with geometric objects (or sets) called fractals 1,2 , that are "irregular" in terms of traditional Euclidean geometry.Their most striking property arguably a) Electronic mail: g.datseris@exeter.ac.uk is that they possess structure at all scales, which typically remains invariant in one form or another, no matter how much one zooms into the set.Because of this, the traditional topological dimension is not fit to describe these sets adequately 3 (Chap.5).The concept of a fractal dimension, a generally non-integer number, is therefore employed to characterize such objects 2,4 .The fractal dimension, which will be shortened to FD in the rest of the article, can be used to quantify the complexity of the geometry, its scaling properties and self-similarity, and the effective ratio of surface areas to volumes 2 .It has been applied in a vast array of different scenarios, from the archetypal measurement of coastlines 5,6 to being suggested as a tool for validating abstract art, e.g., that of Pollock 7,8 .
The evolution of chaotic dynamical systems results in sets in the state space that are typically fractal 4,9 and thus can be characterized by a FD [10][11][12][13][14][15][16] , done to our knowledge for the first time by Russel, Hanson, and Ott 17 .For dissipative systems, these sets are called strange or chaotic attractors 17,18 (boundaries of basins of attraction and chaotic saddles can also be fractal 19,20 ).For conservative systems they often have special properties and are (typically) called fat fractals 21 .For fractal sets resulting from evolving dynamical systems an estimate of FD provides the additional crucial information of the effective degrees of freedom of the time evolution.This means that calculating the FD for an experimentally obtained dataset can be used to guide the modelling process 3,22,23 , as the ceiling of the FD is the minimum number of independent variables that can model the system.Alternatively, if one has a model that arXiv:2109.05937v5[nlin.CD] 23 Sep 2023 confidently approximates the real system, due to physical arguments, then the FD of the model output can be used to tune model parameters: since FD is a dynamic invariant 3,22 , one can compare FDs of the model simulations to the FD obtained by observed data, and change parameters until those two values match.
These reasons have motivated many researchers to compute the FD of many real world systems by delay embedding measured timeseries 22 .Examples include global climate [24][25][26] , physiology 27 , lasers 28 , but many more exist.Unfortunately, some of these studies have been challenged, because calculating the fractal dimension in practice is a difficult, error prone process, with limited means of providing confidence to the obtained result.This is, for example, highlighted well in the controversy of estimating FDs of "climate attractors" (see e.g., Ref. 26 and references therein), where, FD estimates were incorrectly used to claim very low FD of ≈ 3 for the whole climate system.Researchers thus often need to interpret results partly subjectively, due to the lack of objective measures.
In this paper we compare as many computationally feasible estimators of FDs as possible, in as objective manner as possible, and across as many scenarios as possible.With this we provide an objective baseline with which researchers can check against their results, reducing the amount of interpretation and subjectivity.It is important to stress the separation between computing the FD of "the deterministic dynamics", i.e., the dynamic invariant characterizing the flow in the state space, and the FD of "the graph of a timeseries f (t) vs t".The latter is relevant for stochastic perspectives, is often related with the Hurst exponent 29,30 , and is typically used as a quantifier of timeseries more suitable for classification tasks versus typical statistics-based quantifiers (see e.g., Refs. 31,32).These two FD versions are very different, and unrelated in the general sense.Our paper focuses exclusively on the first version.
Hence, our main operating assumption is that we have a multivariate (sometimes also called multidimensional) dataset, obtained from a dynamical system from which we do not know the dynamic rule (equations of motion).If one has a timeseries, it first must be reconstructed into a state space set via delay embedding or other means 3,22,23 .Our operating assumption is on the one hand motivated by the increase of interest in observed or measured multivariate data, and higher accessibility of sensors and experimental data in preparation for a nonlinear-dynamics-based analysis 33 .On the other hand, there is also noticeable recent progress regarding better attractor reconstruction techniques based on delay embeddings, which can also utilize multivariate measurements, yielding a higher quality reconstruction overall, see e.g., Refs. 34,35by Kraemer et al., and references therein.We point out that our goal here is estimating FDs of chaotic sets knowing that these exist, in order to separate the problem of the FD estimation from the scientific questions surrounding the interpretation of the data.The later requires following best practices, e.g., various data pre-and post-processing, de-noising, or surrogate tests comparing the FD value of the real data with those of surrogates.These steps are entirely skipped here, and the reader can find more information in standard timeseries analysis textbooks or review articles such as Refs. 23,28.
Since the first review on the topic of FD by Theiler in 1990 16 , computers and software have improved and several new algorithms to estimate fractal dimensions have been proposed.It is thus timely to revisit the subject, and here we will provide a comparison and evaluation of FD estimators across a range of topics much larger than what has been done so far.Beyond comparing and evaluating various fractal dimension estimators, we provide optimized, easy-to-use, extensively tested, open source implementations for all algorithms discussed in this review in the DynamicalSystems.jlsoftware library 36 .
This review and comparison paper is structured as follows (and see also Fig. 1 for a summary of the paper).In Sec.II we provide a self-contained concise definition of the major methods used to compute FDs and how they relate with the natural density and with each other.Connected with this section is Appendix A, which presents all (computationally feasible) algorithms we have found for estimating a fractal dimension.The core of the paper are Sec.III and Sec.IV, which compare in detail the three best and most popular estimators of FDs: scaling of entropy, scaling of correlation sum, and an extreme value theory approach.We compare across data dimensionality, data length, different kinds of dynamical systems, noise levels, real world data, various embeddings of timeseries, the order q of the fractal dimension, among others.We close the paper with a summary of our findings in Sec.VI.Every result, plot, and method that we present in this paper is fully reproducible via open source code, and adjustable to other input datasets, see Appendix B for details.

II. STATE OF THE ART
An easy to understand introduction and review regarding the fractal dimension was given by Theiler in 1990 16 .The theoretical background of the fractal dimension and the methods (known until 1990) to compute it are summarized, and a plethora of historic references is given there.A more recent publication on fractal dimensions in a style of a review is given by by Lopes and Betrouni in 2009 37 .However, it is focused on image analysis and pattern recognition, and does not do any quantitative comparison.The most recent detailed source that provides quantitative information on the limitations and pitfalls of fractal dimension estimates is the well known textbook by Kantz and Schreiber 22 .
A number of ways to estimate a FD, that are applicable to dynamical systems, have been devised in the literature.For this comparison we found, implemented, and compared the methods that we briefly summarize in Table I.From these, our main analysis focuses on the three most prominent ones.In the rest of this section we will provide a concise, yet selfcontained, summary of the concept of a fractal dimension in dynamical systems.Then, we introduce the three main estimators that we use in the extensive comparisons done in Secs.III and IV and illustrate how the different estimators connect to the natural density and with each other.In the following, we assume that we have a set X that contains N D-dimensional points representing a (possibly observed) multivariate time- series of a dynamical system.We will use the letter ∆ to denote various versions of a fractal dimension, and we will use a superscript in parenthesis, such as ∆ (S) to denote the particular estimator used to estimate ∆.
A. What does a fractal dimension really quantify?
In the introduction we discussed how the FD is useful for practical matters, and hence worthy of being estimated.However, before discussing any estimators for a FD, it is useful to conceptualize what the FD truly characterizes, as this is the origin of all practical estimations.
For this discussion we ignore the presence of noise existing in real data, and the fact that observed timeseries need to be delay-embedded to yield higher dimensional data.The starting assumption therefore is that the set X = {x 1 , . . ., x N } we have at hand is a faithful sampling of some sort of Ddimensional invariant set (typically a chaotic attractor) of a dynamical system, i.e., X ⊂ A. A is itself characterized by its natural, or invariant, measure µ(x) which defines a probability space on top of A by requiring µ(A) = 1.Equivalently we may use the natural density ρ(x)dx = dµ, which in practical terms is the D-dimensional histogram of X.To learn more on the natural measure one should consult various textbooks on nonlinear dynamics 3,18,21 or the review by Theiler 16 .In essence, while the samples x i ∈ X form a time sequence of points with deterministic origin, they can be also be thought of as points sampled "at random" from A according to the measure µ.Note that throughout this FD description we make the fundamental assumption that A (with measure µ) is ergodic.
A FD is a number characterizing the scaling of µ with the "scale" one looks at µ at.Let B(x, ε) be a D-dimensional sphere 51 centered at x with radius ε.For sufficiently small ε, and for almost all x i ∈ A, it is assumed that the scaling of µ with ε is exponential, so that Here ∆ i is labelled the local dimension at x i , and by construction ∆ i ≤ D. The assumption in Eq. ( 1) is a reflection of our intuitive notion of dimension, which describes that the "bulk" or the "amount" of a set scales with a linear size (scale) exponentiated to the dimension.E.g., the amount of a cube scales with its side length to the 3rd power.here we measure "amount" of a subset of A by its natural measure, i.e., its relative probability mass.From ∆ i a fractal dimension characterizing A as a whole can be obtained as the mean, This intuition-based definition of a local and attractor dimension is further motivated by Theiler in his review article on the basis of the Hausdorff dimension 16 .Eq. (1) however is non computable because µ is unknown, since in practice we have only partial knowledge of µ due to the finite observations of X ⊂ A. Besides, for the overwhelming majority of dynamical systems, µ does not have an analytic expression anyways, irrespectively of finite data.An estimator of a FD therefore

Logarithmic correction
Better converging fit instead of the standard least square fit of log(C 2 ) vs. log(ε) 42

Takens' estimator
Maximum likelihood estimation of scaling exponent,  attempts to measure either the local scaling of Eq. ( 1) or the global scaling by averaging in Eq. ( 2).Before going into specific estimators we need to stress that none of the estimators we considered in this review yield the Hausdorff dimension 2,52 .Often in the literature researchers use the term "Hausdorff dimension" to refer to the output of the estimators, but no-one has provided any formal proof of the equivalence between the estimates and the Hausdorff dimension (which itself has a very precise and rigorous mathematical definition).This is especially so for the box counting dimension, where it is easy to prove that it cannot be the same as the Hausdorff dimension 53 .

B. Entropy of natural density
The first way to define and compute ∆ is based on an approximation of the natural density ρ of the set.By discretizing the state space in boxes of size ε one can assign a probability p i , i = 1, . . ., M to each box, which is simply the count of points in the box divided by the total amount of points N.These probabilities approximate ρ, and from there the Rényi (also called generalized) entropy can be obtained as 54 Here q is the order of the entropy, and it allows putting more (q < 1) or less (q > 1) weight to boxes with relatively smaller visitation frequency by the trajectory (and hence smaller p i ).
For q = 1 H q reduces to the known Shannon entropy while for q = 0 it becomes log(M) with M the minimum number of non-overlapping boxes needed to cover the set X.More than one way exists for estimating the probabilities p i , see Appendix A. In the main comparison of Sec.III we will use the algorithm A 1 which works for any values of D, N, ε while having performance scaling of D • N • log(N).
To connect H q with µ we re-write ∑ i p q i = ∑ i p i p . This is a weighted average, equaling to ⟨p (q−1) ⟩, since ∑ p i = 1.
We note that ⟨p⟩ is our notion of "amount", as we measure amount by the probability mass.Regularizing the expression by its exponent, the quantity ⟨p (q−1) ⟩ 1/(q−1) ≡ exp(−H q ) is the "average bulk" or "average amount" in a hypercube of linear size (i.e., scale) ε.The number q settles the way we average: for q = 2 we have the arithmetic (typical) average, for q = 3 a root mean square and for q → 1 a geometric average.
By following the same intuition that led to Eq. (1) (that "amount" ∼ scale ˆexponent), the so-called generalized dimension of order q 55,56 , is defined as This definition was used, to our knowledge, for the first time for q = 1 by Russel, Hanson and Ott 17 .∆ (H) 0 is called the box counting or capacity dimension, while ∆ (H) 1 is called the information dimension.The box counting dimension can also be thought as the fractal dimension of the support of the attractor, as it disregards the values of µ.
In Eq. ( 4) both limits are theoretical and cannot be realized in practice.As a result, ∆ (H) q is estimated by plotting −H q vs. log(ε) and estimating the slope of a linear scaling region (for sufficiently large N, more on this in Sec.III I).
If the fractal dimension depends on q, the set is called multifractal.This is the case when the natural measure underlying the attractor is strongly non-uniform.Large positive values of q then put more emphasis on regions of the attractor with higher natural measure.In dynamical systems theory, almost all chaotic sets are multi-fractal 14 .The dependence of ∆ (H) q on q is connected with another concept, the so-called singularity spectrum, or multifractality spectrum f (α), however we will not be calculating f (α) here and refer to other sources for more (see e.g.chapter 9 of Ref. 18 by Ott).

C. Correlation sum
The second way of defining and estimating a fractal dimension uses the correlation sum 10 .The correlation sum is an alternative way of estimating the average amount at a given scale, based on the points nearby a given point.Specifically, we associate an amount S at point x i ∈ X and at scale ε as with || • || some distance norm (here always the Euclidean, but different norms have little practical impact and theoretically exactly no impact) and B(•) = 1 if its argument is true and 0 otherwise.I.e., we count the ε-close neighbors of x i in X.The exponential scaling of S(x i , ε) versus ε is called the pointwise (local) dimension 57 .The average "amount" S over X is called the correlation sum, given by Using the same reasoning as in the preceding subsection, this average amount is expected to scale with the linear size ε exponentiated to the FD.And, again similarly with the preceding section, we do not have to limit ourselves to the typical arithmetic average when defining the correlation sum, but can introduce an order q that adjusts how we are averaging the "amount".This leads to the definition of the q-order correlation sum 11,22,39 Here we also added w ≥ 0, the correction by Theiler 58 (known as Theiler window) which excludes as neighbors points that are temporally close.This removes spurious correlations due to dense sampling of continuous dynamical systems.We choose w as follows: for each timeseries present in the multidimensional input dataset X we calculate the first minimum of its self-mutual information 3 .The maximum of the time shifts corresponding to these minima is chosen as w.
Originally the version explicitly having q = 2 was used to define the correlation dimension 10 , and the process of defining a FD was in fact the same as in H q for any q.A linear scaling region is estimated from the curve of log(C q ) versus log(ε).Then the correlation-sum-based FD ∆ (C) q is the slope of that linear region.
We may leverage two potential improvements here.First, to calculate C q (ε) we used a box-assisted method 40,41 .We modify this method as discussed in App.A 4, because otherwise it fails for data with even a small amount of noise (see discussions in Sec.III F and App.A 4).Second, besides the standard least squares fit log(C q ) ∼ a + ∆ (C) q log(ε), we also used the correction by Sprott and Rowlands 42 when possible (i.e. when at least half the range has log(ε) < 0).Ref. 42  2 ).It is intended to give better fits for sets that have a slowly-converging fractal dimension estimate, but as we will show later, it is best to not use it in practice.
In the rest of the manuscript we will use H q or C q to refer to the methods of estimating FD via the (generalized) entropy or (generalized) correlation sum.We will explicitly use a subscript H 2 ,C 2 when we make statements that apply only to this particular order q = 2.

D. Extreme value theory
The third major way of defining and estimating a FD from a set X is based on extreme value theory (EVT) applied to dynamical systems 49 .The method estimates a local dimension i , as in Eq. ( 1), and then provides the FD as the average.Interestingly, the method utilizes exactly the same information as the correlation sum: all inter-point distances.However, is not estimated directly via an exponential scaling relationship in contrast to the generalized entropy and correlation sum methods.
To the best of our knowledge, the method has been developed using progress across several papers [59][60][61][62][63][64] , see also Ref 49 (Chap.4 and 9).Even though relatively recent, this method has been applied already to a plethora of real-world cases, see e.g., Refs. 65? ? -7 , and many more, see Ref. 73 for a summary of recent applications.Despite this plethora of applications, we have noticed that some applications have ambiguities with the basic theory that connects EVT with FD, and we discuss these issues in App.A 10.
First, let us summarize the computational algorithm to estimate a FD via EVT, and then discuss how it connects to the natural measure µ.Let g(ε) = − log(ε) be a function of a distance (or radius) in state space.For the i-th point in X, we estimate with || • || the Euclidean distance.Note that g i is a real-valued vector of length N − 1. Next, we choose an "extreme" probability p for a quantile of the distribution of g i (e.g., p = 0.99).We found no reference that clarified what "extreme" means in a mathematically precise way, but we discuss in high detail how the choice of p impacts the results in Sec.IV B.
In any case, we then compute g p as the p quantile of g i .Then, we collect the exceedances of g i , defined as i.e., all values of g i larger or equal to g p , also shifted by g p .E i is also a real-valued vector with n = ⌊N(1 − p)⌋ values in total.Now, according to extreme value theory 62,74 , in the limit N → ∞, p → 1, and for the particularly chosen form of the function g, the values E i follow a generalized Pareto Distribution (GPD) with parameters σ , ξ and shift parameter 0 (see App. (A 10) for more on GPD).However, if the measure µ and attractor A satisfy the criterion of Eq. ( 1), then the GPD is reduced to an exponential distribution (EXPD) with parameter σ , E ∼ exp(E/σ ).Within this extreme value theory approach, the local dimension ∆ (E) i assigned to state space point x i is given by the inverse of the σ i parameter of the EXPD fit to the exceedances, i.e.
In the above expression we explicitly used the maximum likelihood estimator for fitting the σ parameter, which is simply the sample mean for an exponential distribution.Additionally, the above expression places one additional assumption on the local scaling of the natural measure µ, see discussion below around Eq. ( 13).The FD for A follows as the arithmetic mean, which means that the EVT FD corresponds to a FD of order q = 2.
Let us now discuss how this theory connects to Eq. ( 1) and hence relates ∆ i to the σ i parameter.Due to the invertibility of the g function, the exceedances E i correspond to when the orbit of the dynamical system comes closer than ε * to the reference point x 0 , with ε * = exp(−g p ).This clarifies why we focus on extremes of g i : we are interested in what happens around a small radius ε around a reference point x 0 , in order to connect with the fundamental definition of the local dimension of Eq. ( 1).Now, let us discuss the probability π i (E) that given that an exceedance has occurred (i.e., a state x j is at least ε * -close to x), there is an exceedance of E. This probability is by construction given by 1 minus the cumulative distribution function of the fitted EXPD, i.e., π i (E) = exp(−E/σ i ).However, the same probability can be constructed in terms of the natural measure µ to be with We can relate ∆ i with σ i if we place one more assumption on µ.We assume that locally around where f i (ε) is a slowly varying function of ε as ε → 0, which may or may not depend on reference point x i .With this assumption, both numerator and denominator of Eq. ( 11) scale exponentially with ∆ i and the remaining factors cancel-out even though ε ̸ = ε * .By taking into account also the expressions of Eq. (12), and that π i (E) = exp(−E/σ i ), we put everything together and have from which ∆ i = 1/σ i .

E. Lyapunov (Kaplan-Yorke) dimension
Lastly, a FD estimate proposed by Kaplan and Yorke 47 is based on the Lyapunov exponents characterizing an attractor.Let {λ i } denote the Lyapunov spectrum, with λ 1 ≥ λ 2 ≥ . . .≥ λ D .Then the Lyapunov dimension is defined as In simple terms, it is the (linearly interpolated) index value where the sum of the Lyapunov exponents of the set first crosses zero.Table II provides estimates for the systems we use in this paper.It is conjectured that 1 , see also 75 , and generally one does find similar values in practice, however there is no formal proof yet.An extension to Eq. ( 15) has been proposed in 48 that is not included in the main comparison but discussed in Appendix A. Keep in mind that Eq. ( 15) is defined only for dissipative systems, where ∑ λ i < 0. Applying it to conservative systems does not make much sense.E.g., Hamiltonian systems satisfy λ i = −λ D−i+1 and obtain ∆ (L) = D always, even though the motion might happen in a lower-dimensional manifold.For the 4D Hénon-Heiles system (Fig. 3) energy conservation limits the dynamics in a 3D manifold, and thus its FD cannot be greater than 3.
∆ (L) has a huge advantage when compared to the previous definitions of fractal dimensions: it can be computed with very high precision, even for high dimensional systems (where the other methods typically suffer from accuracy, as we will show below).But it also has a huge disadvantage: practically it can be computed only if the dynamic rule (equations of motion) is known.Only using the dynamical rule and its linearization one can estimate the entire Lyapunov spectrum with satisfactory precision, for example by means of the known algorithm due Shimada and Nagashima 76 and Benettin et al. 77 .From a finite, and often noisy real world dataset, calculating the entire spectrum of exponents is a very challenging task that requires for higher dimensional attractors very large data sets 78 .Therefore one is in most cases better off calculating a fractal dimension directly from data, or instead fit an explicit model to the data, e.g.Ref. 79 .
Unfortunately, it is not straightforward to connect ∆ (L) to Eq. ( 1).One may find some intuition in Sec.11.5.2. of Ref. 22 , but this discussion relies on expansion and contraction rates in state space and hence falls outside the scope of this review.

III. CORRELATION SUM VS. ENTROPY
In this section we perform a quantitatively rigorous and exhaustive comparison of the methods based on entropy H q and correlation sum C q .Even though fundamentally different, both rely on estimating the scaling of some quantity versus some size ε.For a given q they both (in theory) approximate the same quantity, the exponential scaling of the q-average of "amount" of measure versus the scale, as we illustrated in Sec II.To compute H q we use the method A 1, and for C q  the method A 4 for most cases, and the straightforward implementation A 3 for very high dimensional data.For C 2 we also tested the logarithmic correction of Ref. 42 .The motivation of choosing these methods, and why an exhaustive comparison of other methods is not presented, is explained in Appendix A.
Before any numerical analysis we normalize input data X so that each of its columns is transformed to have 0 mean and standard deviation of 1.This linear transformation leaves dynamic invariants (like the FD) unaffected, however makes all numerical methods more accurate and faster to converge.
In the following subsections all results will be presented with the same plot type, as e.g.shown in Fig. 2. The legend shows the different datasets used in the plot and a description of the plot's purpose.Top panel is entropy estimate while bottom is correlation sum.To estimate ∆ (H) q , ∆ (C) q , for each curve we identify automatically, and objectively, a linear scaling region as discussed in Sec.III I.This region is denoted by markers of the same color on each curve.The secondary legends inside the panels provide the 5-95% confidence intervals for the estimated slopes of these segments.Unless otherwise stated, all datasets used have length N = 10 5 , which is for many experiments a typical upper bound for the amount of data one has access to.Continuous time systems are sampled with approximately 10 points per characteristic oscillation period.

A. Benchmark sets with known dimensions
The best place to start is a simple sanity and accuracy test of the two main estimators for sets whose fractal dimension can be computed analytically in a straightforward manner.This is shown in Fig. 2. We use a periodic orbit from the Rössler system (∆ = 1), a quasiperiodic orbit of order 2 from the Hénon-Heiles system (∆ = 2), the Koch snowflake (∆ = log(4)/ log(3) ≈ 1.262), the Kaplan-Yorke map (∆ = 1 − log(2)/ log(0.2) ≈ 1.4306), a uniform filling of the 3D sphere (∆ = 3) and a chaotic trajectory of the Standard Map (SM) for very high k = 64, which covers uniformly the state space and thus has ∆ = 2.Some results that will be repeatedly seen throughout this section become clear.For noiseless data, the correlation sum method is clearly better for three reasons.First, its linear scaling region covers a wider range of scales, while H 2 saturates much more quickly to flatness for small ε.Notice that log(C 2 ) cannot saturate for small ε, but instead diverges to −∞, but we can never reach this point due to the process that chooses the appropriate overall range of ε (Sec.III I).Both curves would in principle saturate to flatness for very large ε, specifically exceeding the total size of X, but we again do not reach this threshold based on the choice of the range of ε.
Second, within the linear scaling region the curves of log(C 2 ) fluctuate less than those of H 2 , resulting in a narrower confidence interval.Comparing confidence intervals is only meaningful if the same method is used to extract them.That is why specifically for Fig. 2 we used the standard least squares fit for log(C 2 ) instead of the aforementioned correction of Ref. 42 .Third, the actual numbers we obtain for ∆ from the correlation sum method are closer to the analytically expected values, than those of the entropy-based method.Especially for sets that should have an integer fractal dimension the correlation sum method is much closer to the actual result.

B. Different dynamical systems
In the following we cross-compare with the value obtained from the Lyapunov (Kaplan-Yorke) ∆ (L) dimension which for all systems of interested used in this paper is found in Table II.We note that ∆ (L) is conjectured to equate to the q = 1 version of dimensions, however here we use q = 2.That is because the 2 for sets coming from different dynamical systems with known dynamical rule.CLM stands for Coupled Logistic Maps.We are not sure why H 2 of CLM yields a jagged curve instead of a clear linear slope, but the focus of this paper is not individual systems so we leave this for further study.
correlation sum algorithm does not apply to q = 1, unless one uses a fixed mass approach, but we explain in Sec.A 5 why we do not.We assume that the differences between the two estimators, when compared to ∆ (L) , should not depend much when changing q = 1 to q = 2, hence the following results remain valid.
In Fig. 3 we compare 3 discrete and 3 continuous dynamical systems of different input dimensionality (systems are defined in detail in App.C).We confirm the result of the previous section, i.e. the correlation sum method is more accurate than the entropy one because it is much closer to the values expected from the Lyapunov dimension (Kaplan-Yorke conjecture).Furthermore, the entropy method seems to underestimate the fractal dimension more strongly as the state space dimensionality increases.This is expected given the fact that the entropy method works via a histogram approximation of the natural density, and it is well known that the higher dimensional the data, the less accurate producing a histogram for them becomes (i.e., space is covered more sparsely by N points as D increases).
This figure also allows us to indeed confirm that the logarithmic correction to the correlation sum by Sprott and Rowlands 42 can be impactful, especially for high-dimensional sets where the convergence is the slowest.For example, the standard linear regression fit would give confidence intervals (6.44, 6.6) for the 8-dimensional Lorenz96 model and (5.78, 6) for the 8 coupled logistic maps.These values are closer to the entropy based estimates but further away from the ∆ (L) estimates of Table II.

C. Data length
Keeping the dimensionality constant but varying data length leads us to an interesting observation in Fig. 4. The entropy-based method performs very poorly for small N and clearly should never be used with small datasets.In contrast, the correlation dimension is much more robust and yields already useful lower bounds for ∆ For ρ = 0.1 the upper limits for N = 500, 1000, 10000, 100000 are 2 max (N) = 5.4, 6, 8, 10, respectively.In this context other choices of ρ may also be justified and will decrease (for ρ < 0.1) or increase (for ρ > 0.1) the upper limit 2 max (N).The method for detecting linear scaling regions used in this study, for example, starts searching at ε-values corresponding to ρ ≈ 1/e ≈ 0.368 (see Sec. III I).Therefore, depending on the data, in some cases linear scaling may occur for ρ > 0.1 resulting in larger upper limits ∆ On the other hand, Tsonis et al. 26 , using the results of Nerenberg and Essex 81 , argue that the minimum number of points N min to estimate a dimension x with 95% confidence is scaling like N min ∼ 10 2+0.4x .In our example, assuming true value x ≈ 7, it would require at least N min = 63095 points, which is too high compared to what we can estimate from surrounded by its fair share of ambiguity, because it involves deciding a-priori a scaling region extent, and it does not say whether the dimension x in the expression should be the embedding dimension or the actual fractal dimension (that is unknown).We will discuss this topic again in Sec.III H.If only small data sets are available, finite sample corrections derived by Grassberger 82 could be used to improved estimates of H q or C q .
D. q-order dimensions (multi-fractality) Here we examine how well the estimators capture multifractal properties, i.e. the dependence of ∆ q on q, or the ab-FIG.5: Impact of order q on the fractal dimension.We used the standard least squares fit for log(C q ) as Ref. 42 does not discuss q ̸ = 2.The curves have been vertically offset for visual clarity.
sence of multi-fractality, i.e., results that should be invariant to q.This dependence on q was discussed in the review by Theiler 16 , but we think an even better reference is in the book by Rosenberg 83 .A known theoretical result is that ∆ q is a non-increasing function of q, ∆ q 2 ≤ ∆ q 1 for q 2 > q 1 , see e.g.Ref. 11 .
Here we will use two examples.The first is the Koch snowflake, which has an (approximately) uniform density and thus its FD should have no dependence on q whatsoever.The second is the Hénon map, which has a strongly non-uniform natural measure, giving the expectation of a clear decrease of ∆ q with increasing q.The results are shown in Fig. 5.
Both the entropy and correlation sum approaches perfectly capture the absence of multi-fractality, giving identical curves for all q for the Koch snowflake.For the Hénon map estimates, both methods satisfy the criterion of a decreasing ∆ q with q.But there is a problem.For the correlation sum method and for q ̸ = 2, the function log(C q ) vs. log(ε) is no longer a straight line, but is composed of two linear regions with significantly different slope, making the results ambiguous.It is also not obvious why the slopes change at the given ε value, or why the slopes below that indicate a significantly lower dimension value.The slopes change at log ε ≈ −10 and for q = 3 the left slope becomes ≈ 0.58 while for q = 4 the left slope becomes ≈ 0.39 (right slopes are shown in figure legend).We observed this behavior of log(C q ) having two slopes in practically all example sets with non-uniform measure.
We could not find anything in the literature about this observation, and in fact, we could not find a single figure in the literature plotting C q versus ε for q ̸ = 2, even though the correlation sum for q ̸ = 2 is provided in several publications 11,22,39 , which report a value for ∆ (C) q̸ =2 (but it is unclear whether they have encountered the same problem as we have, or not).We have extensively tested our code and we are confident that the implementation of Eq. ( 6) is correct.For multi-fractal analysis we are (typically) interested in quantifying the most fine properties of the fractal.Perhaps then one should determine the slope of the linear region at the smallest ε values, instead of the slope of the linear region covering the largest range of ε (but for very small ε the statistics becomes worse for finite data sets).However, the slopes of the smallest ε are clearly incorrect; the correct slopes are the ones of the largest ε values (those also highlighted in Fig. 5).In any case, using the slope of the largest ε values gives the correct results, but this strong dependence of slope with ε when q ̸ = 2 is worrisome and indicates that more clarity regarding C q̸ =2 must be established in the literature.

E. Dimension (delay embedding)
This section examines the impact of varying state space dimensionality D of input data, which is common case when delay-embedding timeseries 22,84,85 (because there one increases D and searches for convergence of ∆).In principle, provided the condition d > ∆ (H) 0 is met, with d ∈ N the embedding dimension, then the reconstructed set has the same fractal dimension as the original set the timeseries was recorded from 86 .In Fig. 6 we check how the methods fare with this statement, and whether their accuracy decreases with increasing input dimensionality, i.e. embedding dimension.
In Fig. 6 we used a chaotic timeseries from the Hénon-Heiles system.This has ∆ = 3 and as such we expect convergence of the fractal dimension estimates to a value around 3 for d ≥ 4. We see in Fig. 6 that this is indeed the case.∆ (C 2 ) does not seem to drop in performance with increasing d, besides a very small decrease in the overall range of order of magnitudes the linear scaling region covers.Same results were obtained using a timeseries from the chaotic Rössler system with ∆ ≈ 2 or a chaotic Lorenz96 system with seems to perform poorer with increasing d by either significantly reducing the linear scaling region or being less accurate in the dimension estimates.However, how much poorer it performs depends on the dataset: for the Hénon-Heiles it is still decent, while for Lorenz96 it performs much worse (not shown).Hence, we conclude that ∆ (C 2 ) 2 performs much better as the state space dimensionality of data increases (while keeping data length and other aspects constant) versus ∆ (H) 2 , which is somewhat already evident from Fig. 3.

F. Noise
Real world data is always accompanied by noise, and therefore the impact of noise on the calculation is highly important for the choice of the method.As it is well known, the presence of noise in the data makes estimating a fractal dimension harder, as the fractal dimension of the noise is equal to that of the state space, and hence almost always larger than the fractal dimension of the clean data.Figure 7 shows results for various kinds, and amount, of noise added to the (normalized) chaotic Rössler attractor.On purpose for this plot we have used ζ = 1 (see Sec. III I), and the standard linear regression method for log(C 2 ), because the logarithmic correction of Ref. 42 overestimates ∆ > 3 for noisy data (the input dataset is three dimensional and thus cannot have ∆ > 3).
We start with the case of additive noise.There it is known that there is some distance ε σ called the "noise level", below which the slope of log(C 2 ) changes from being the fractal dimension of the chaotic set to that of the noise 87 , This fact can be used to actually estimate the noise level of the data 22 .In Fig. 7 we see how quickly this really happens for log(C 2 ).Even for 5% additive noise, the curve is already dominated with the noise slope (which has ∆ ≈ 3 for 3-dimensional additive noise), and only a small segment of the curve at large ε values where the slope becomes the deterministic value.The entropy curve does not have this property, and remains having a single slope throughout (except the saturation part of course), while the slope value is an average between the purely deterministic ∆ and that of the noise.On one hand, this may be considered a downside, because it doesn't allow estimation of noise level.On the other hand, we should note the majority of the log(C 2 ) curve is already reflecting the noise slope.Thus, if we estimated the average slope of the log(C 2 ) curve, it would be much larger than that of H 2 , i.e., it puts much more weight on the noise dimension than the deterministic data.These results regarding additive noise are typical and do not seem to strongly depend on the system considered.Note that in the case of delay reconstruction, the slope of the noise should increase with increasing embedding dimension.The slope corresponding to the FD of the deterministic set should remain constant for embedding dimensions larger than a minimum value required for successful reconstruction of the state space of the dynamical system generating the data.
For dynamic noise we turned the ODE of the Rössler system into a stochastic differential equation by adding a Wiener process term ηdW in the second equation.For small amount of dynamic noise (which here reflects a proportionality of η with the expected size of the x variable), the fractal dimension increases slightly as expected, but does not have any noticeable change in its numerical value up to 10% noise.When one turns up the dynamic noise more, the dynamics collapse and there is no "chaotic" attractor anymore (not shown).Both entropy and correlation sum methods perform equally well versus this kind of noise, and there is no noise radius or change of slope discernible in the correlation sum case.We also looked at low-resolution data, by rounding Röessler timeseries to 2 digits after the decimal.This obviously decreases the valid ε-ranges one can do the computation for (see Sec. III I).For ∆ (C) 2 this also significantly changes the result to a value smaller than "correct", while for ∆ (H) 2 it has no impact.This means that ∆ (H) 2 performs better for rounded data, which makes sense given the way it is computed (as long as points are in the same box, it doesn't matter how close they are to each other).

G. Real world data
In Fig. 8 we show fractal dimension estimates for real world experimental data.We focused specifically in experiments that are relatively clean (large signal to noise ratio) and where the underlying dynamics is well known to display low dimensional deterministic chaos.This is important, because for this review we do not want to mix the scientific question of whether an observed system accommodates a lowdimensional deterministic representation, with the technical/computational question of whether an estimator would actually detect that.
In Sec.III H we further discuss what happens with real world data where neither of these two conditions apply.We FIG.8: Fractal dimension estimates for experimental systems known to be underlined by low-dimensional nonlinear dynamics.
limited densely sampled experimental data to at most N = 50, 000, with sampling of about 10 samples per characteristic timescale.Because of the observations of the previous subsection, we have used ζ = 1 and the standard linear regression method for log(C 2 ) instead of the logarithmic correction of Ref. 42 .The datasets are as follows: two electrochemical oscillator datasets (the second being more chaotic than the first) 34 ; timeseries from a circuit replicating the dynamics of the Shinriki oscillator 88 ; the mean field of a network of 28 circuits following Rössler dynamics from Ref. 89 ; data from a mechanical double pendulum from Ref. 90 ; ECG recordings during a pacing experiment of healthy individual from Ref. 32 .All experimental timeseries were delay embedded using the recent automated method due to Kraemer et al. 34 , see App.D for the embedding parameters.The method yielded embedding space of 7 or less for all experimental timeseries, giving even more confidence that the data may display low-dimensional deterministic chaos.
For dataset "Rössler Net" the curve (log(ε), log(C 2 )) continuously changes its slope instead of having one, or at most two, constant-slope segments.This makes deducing a single fractal dimension ambiguous.Slight curving can be observed also in "electroch.1", "electroch.2" and "ECG IBI" but it is a weak enough effect that two scaling regions can nevertheless be extracted (for "ECG IBI" Fig. 8 reports the slope of the noise).On the other hand, the entropy-based approach does not suffer from this problem and, besides the expected saturation for small ε, seems to be described quite accurately by a single slope and thus a specific FD value.The results shown in Fig. 8 are from 5-6 dimensional embeddings yet both H 2 ,C 2 yield FDs that are less than the embedding dimensions (excluding the 'Rössler Net" case for C 2 where a FD cannot be estimated).Hence, we can assume that the FD of the underlying dynamics has to be somewhere between the low bounds of ∆ (H)

and ∆ (C)
The "constant slope curving" of C 2 is something we have not seen before with synthetic data.From the discussion of Sec.III F, the problem may be because realistic noise may be neither white nor stationary, or because a too high of a noise level in the data.Indeed, we saw that for 5-10% relative noise, the slopes of C 2 already reflect the noise FD.And in Fig. 8, C 2 yields consistently higher FD than H 2 , even though we know that an underlying low-dimensional representation exists.To extract this lower FD value from C 2 one therefore must focus on the larger scales ε and try to find this consistent smaller slope by increasing embedding dimension (as is standard practice 22 ).We show such an analysis in App.F. Nevertheless, it appears true that C 2 is more strongly affected by noise when compared to H.

H. Extreme cases
In this subsection we examine the result of applying the aforementioned methods to ill-conditioned data which may be non-deterministic or non-stationary, or to extremely highdimensional data, where there exists this notion in the literature that the methods used so far are unlikely to succeed.For the first dataset we used data from the Lorenz-96 model, with D = 6, while having the F parameter increase linearly during the time evolution from 1.0 (periodic motion) to 24.0 (chaotic motion with ∆ ≈ 5).The second dataset is the concatenation of a periodic trajectory from the Rössler system with noise uniformly distributed on the 3D sphere.The third dataset is paleoclimate temperature timeseries from the Vostok Ice core, embedded in 8-dimensional space, which is unlikely to be stationary or to accommodate a low-dimensional representation 25 .The fourth dataset is a stock market timeseries for the "nifty50" index embedded in 6-dimensional space, which is definitely non-stationary and rather unlikely to be deterministic.The last two datasets are extremely high dimensional data of the Lorenz96 and the Kuramoto-Sivashinsky spatiotemporal system (the latter having 101 dimensions after discretization).The results of the dimension estimation are shown in Fig. 9.
Generally speaking, in Fig. 9 the results of the first four datasets show that something is "wrong".There is a large miss-match between the estimates of the entropy and correlation sum methods and the curves do not seem to be composed of a single slope.Oddly, for the non-stationary Lorenz96 data, the correlation sum has a clear straight slope with fractal dimension somewhere between the extreme values.The Vostok data in particular are plagued both by a continuous change in slope, especially in log(C 2 ), but also, the resulting FD values do not converge when increasing the embedding dimension (not shown).Additionally, the FD values obtained from H 2 or C 2 are very different.Notice that in all cases our automated algorithm finds a value for ∆ nevertheless.This only serves to highlight how careful one should be, and to always plot the curves of H 2 , log(C 2 ).
Although in the literature there exist several tests for nonstationarity using e.g.permutation entropy or other methods, we will now describe a simple, fractal-dimension-based FIG.9: Fractal dimension estimates for extreme cases.
scheme.One can divide the data into m equal parts in two ways: Making m segments of length N/m of successive points, or by choosing every m-th point, each time starting from point 1 to m − 1.For these subsets, the same fractal dimension estimation is done.If there is non-stationarity, the first kind of selection will show significantly different estimates across its sub-datasets, while the second will show approximately the same estimates.
Let us now consider the last two datasets of Fig. 9 (32dimensional Lorenz96 and Kuramoto-Sivashinsky).Surprisingly, C 2 shows a rather clear linear scaling region that has very high slope, but not as high as the expected dimension values ∆ (L) (28 and 32, respectively).This is consistent with the upper bound Eq. ( 16) of Eckmann and Ruelle with 2 max (100000) ≈ 10 for ρ = 0.1 or ∆ (C) 2 max (100000) ≈ 23 for ρ = 1/e, because even in cases where linear scaling occurs already for relatively large values ε = ρE it cannot be expected to obtain slopes of size 28 or 32 with N = 100000 data points, only.Furthermore, Fig. 9 shows that for these data the range of scales covered by the linear region is very small: only one e.Increasing the amount of data also increases the linear scaling region and the resulting confidence intervals shrink as we get more data points in the linear region (not shown here).
As discussed already in Sec.III C these examples show again that with high dimensional data one needs much longer timeseries to properly cover the (typically high-dimensional) chaotic set, and this affects any kind of estimate of dynamical properties; it is not a problem specific to the correlation dimension.Additionally, one needs very low signal-to-noise ratio, because due to the coverage of a very small range of scaling factors ε, even a small amount of noise may ruin the estimation.But, if one does have such a clean high dimensional dataset, ∆ (C) 2 may still provide a useful estimator at least for a lower bound of the correlation dimension of that data set, where the value obtained should always be compared with ∆ (C)

I. Estimation of slopes and sizes ε
To estimate the value of ∆ (H q ) or ∆ (C q ) we need to find the slope of −H q or log(C q ) versus log(ε).This matter is typically resolved in a context-specific manner, where each plot is carefully examined and the "linear region" is decided by the practitioner by eye.This approach cannot work in an objective comparison.Here we formulate an entirely objective and sensible (but not flawless) automated process that is separated into two parts: the choice of which sizes ε to calculate −H q or log(C q ) for, and how to estimate a linear scaling region from the respective curves.Once the linear scaling region is identified, actually obtaining the fractal dimension is a simple least squares fit 91 The range of ε is always decided with generating formula ε = e x where x are k linearly spaced values from log(ε min )+ψ to log(ε max ) − ζ .I.e., the values of ε are exponentially ranged in base e. ε min is the smallest inter-point distance existing in the set and ε max the average of the lengths along each of the variables of the set, and ψ, ζ constants.Unless stated otherwise, we have used k = 16, ψ = 1, ζ = 1 in Sec.III.In essence, we are limiting ε to be one order of magnitude (in base e) larger than the smallest inter-point distance and one order of magnitude smaller than ε max .If the resulting ε range does not cover at least two orders of magnitude (common case in high dimensional data), we use ζ = 0 instead.This choice of ε max brings very good performance in the box-assisted algorithm for the correlation sum (A 4) but it is not so small as to make the computation meaningless for realistic and/or noisy data.Notice that the automated fractal dimension estimates can be sensitive on k, ζ , ψ.In practice we would recommend to produce several estimates by varying these parameters and obtain the median of ∆.
To estimate the linear region we proceed as follows.We scan the local slopes of each one of the k − 1 segments of the curve y vs log(ε) starting from the leftmost one (here y = −H q or log(C q )).If the local slope of the preceding segment is approximately equal to that of the next one with relative tolerance tol, i.e. |s i−1 − s i | ≤ tol • max(s i−1 , s i ), then these two segments belong to the same linear region.We move to the next segment and compare it in the same way with the first segment belonging to the same linear region.When we find a miss-match, we start a new linear region.This way we have segmented the curve y vs log(ε) into approximately linear regions.We then choose the linear region which spans the largest amount of the log(ε) axis, and label it "the" linear region.We finally perform a least squares fit there and report the 5-95% confidence interval of the fitted slope.In Fig. 10 we visually demonstrate the process.We also compare it with another standard way fractal dimension related plots are presented: the successive local slopes δ 1 of each point of the curves and the same slopes δ 5 but fitted in a 5-long data window.Our linear regions approach is equivalent with finding the largest plateau in the local slopes plots.
Notice that there is a clear pitfall here.This algorithm will deduce a linear region no matter what.In many scenarios this region might be meaningless, being too small to be of actual value, or could even be the scaling of the noise in noisy data, as shown in Sec.III F. So after all, careful consideration of the result is always necessary.
Besides the algorithm presented here, also worth mentioning is recent work by Deshmukh et al. 92 that offers an alternative way to estimate a slope.All possible slopes that could be estimated from the curve (by choosing all possible segments of length more than a specified minimum), are estimated.These are weighted by their length and by their inverse error and compose a distribution.The mean of the distribution is presented as the slope, while the quantiles of the distribution can be used as confidence intervals.For the work presented here, we believe our approach is more fitting, because how large a scaling region is is also part of the accuracy of a FD estimator.Furthermore, we still wanted to display problems in the presence of two scaling regions (as in e.g., Sec.III F), while the approach of Deshmukh et al. transforms the "looking at the curve y(x) for two scaling regions" problem into "looking at the distribution of all possible slopes for two peaks", which, while easier to resolve, still would add more information content in our already extensive article.Nevertheless, the approach Deshmukh et al. is most likely better suited to use in practice, when estimating a fractal dimension from experimental data, as the authors have extended their method to also provide convergence criteria of fractal dimension estimates in Ref. 93 .

IV. EXTREME VALUE THEORY ANALYSIS
In this section we thoroughly analyse the power and shortcomings of the extreme value theory based FD ∆ (E) introduced in Sec.II D for computing FDs under a similar lens as the comparison of the previous section.Unless stated otherwise we use N = 10 5 as the data length, standardize all input sets before any computations, and use p = 0.99 as the quantile

A. Exemplary sets
We start with Fig. 11 which shows ∆ (E) computed for exemplary sets.The figure should be compared with Figs. 2 and  3.The figure style is typical for the rest of this section and shows the dimension estimates as distributions, with dashed white lines indicating the mean, and dotted white lines indicating the "expected" value, for which we use ∆ (L) if possible, otherwise ∆ (C) 2 .Small horizontal red lines cap the strict limit that the dimension estimates should not exceed, i.e., the state space dimensionality.The inner legend in the plot displays information about the distribution: the mean, and in parenthesis the percentage of values that exceed the dimensionality limit cap (the red line).
All in all the ∆ (E) estimates seem to match well those obtained by ∆ (C) 2 with two notable exceptions: the method performs "poorly" for a quasiperiodic (∆ = 2) trajectory of the Hénon-Heiles system, for the chaotic attractor of the Hénon map and for the Lorenz-96 chaotic attractor.Here "poorly" means significant inaccuracy in the first decimal digit.It is not clear exactly why the EVT method is not particularly accurate for these systems, however we can speculate the respective reasons.For quasiperiodic trajectories, the sampling time may be near-commensurate with one of the two periods, making some points very rarely visited in the finite set, even if they would be uniformly visited in the limit N → ∞.As the method assigns higher dimension value to a point according to its visitation frequency, these rarely-visited points on the quasiperiodic orbit get higher dimension values than they should (within the EVT framework, the lower the visitation frequency of a state space point, the higher its local FD).For the Hénon map, the only thing to note is that the attractor natural density is extremely singular and the assumption of Eq. ( 13) likely does not hold.For the Lorenz96 and the coupled logistic maps, the thing to note is that the attractors have a high "expected" (here Lyapunov) FD, which means that for given amount of data and high underlying FD, the EVT underestimates the FD more than C 2 .

B. Quantile probability
Unlike the entropy and correlation sum methods for computing FD presented so far, the EVT one is parametric 94 : it requires the choice of an "extreme" probability value p for which to extract the quantile of g i when calculating the exceedances E i in Eq. ( 8).Therefore, before performing any further evaluation of the method, we must examine how it depends on its parameter p.We have found no formal mathematical definition of an "extreme" in the literature of this EVT methodology, or how to practically compute an "optimal" value for p, or whether an optimal value exists at all.Ref. 65 provides some methodology for checking whether the chosen p is inappropriate that we evaluate in Sec.IV C.
In this subsection we examine the impact of choices of p.This choice is somewhat linked with the data length N, as the local dimension estimation for each state space point is done based on N(1 − p) points.In Fig. 12 we therefore vary p with fixed N but also co-vary N, p with fixed N(1 − p).
The results show that increasing p increases ∆ (E) .It also appears also that not only the mean of the distribution of ∆ (E) i depends on p, but the shape of the distribution as well.On one hand, it is somewhat re-assuring than once choosing a fixed N(1 − p) the results do not vary as wildly as when N is fixed, provided that N(1 − p) is large enough.But we also noticed that under fixed N(1 − p) the estimated dimension seems to monotonically decrease, further and further away from the expected value, when decreasing p.
On the other hand, in most real world applications it is N that is fixed and hence the results of the top panel of Fig. 12 are what is of most interest.Besides, even if it was possible to co-vary N, p in a realistic application, we still cannot provide instructions of what the best choice should be for p: while Fig. 12 reveals the dependence on p, it doesn't lead to any obvious conclusions on what p should be.A saving grace here is that while there is a clear dependence on p, the mean value ∆ (E) does not change significantly (i.e., differences span less than one integer, which is anyways the accuracy we are interested in in practice), provided that p remains in a range so that N(1 − p) ≥ 100 − 1000.FIG.12: Impact of choice for extreme probability p when estimating a quantile for Eq. ( 7).Top: under fixed data length, bottom: under fixed length of exceedances.D = 8 is used for Lorenz96 and ∆ (E) for it is divided by 4 as in other plots.

C. Quantifying significance
The entropy or correlation sum approaches provide a relatively straightforward way to check for significance of results: there should a single slope covering several orders of magnitude ε.The larger the range of magnitudes, the more significant the results.Such a simple visual significant check does not exist for the EVT approach.In this section we examine possible ways to test for the significance of the EVT results based on what has been suggested in the literature or with alternative means we devised while composing this review.
In practical applications as in Ref. 65 the authors identify a range of p values that are appropriate using a statistical hypothesis test of whether E i follows an Exponential distribution (EXPD).Examples of such statistical tests are the Anderson-Darling 95 or the Kolmogorov-Smirnov 96 .Here we used the Kolmogorov-Smirnov exact one sample test.The test proceeds as follows: for a given p, each of the E i are first fitted to a EXPD.Then, the fitted EXPD is used in a statistical test for the null hypothesis: "the data coming from the given GPD".The test yields a p-value.Typically, if the p-value is very small, e.g., p<0.05, the null hypothesis can be rejected, which may mean unsuitable data all-together, or not enough data (e.g., quantile probability p was chosen too high).In practice, one hopes that the majority of the p-values (each for each E i ) is significantly larger than some low threshold of e.g., 0.05.
Formally speaking however, a p-value greater than some threshold does not mean we can accept the null hypothesis; only that we fail to reject it.Any other distribution may have generated the data equally well.Hence, the convincing power of this line of argumentation (checking for large p-values) is weak from a statistical inference point of view.An alternative test mentioned in the literature is to check how stable the distributions for σ of the fitted EXPDs are when varying p; but we didn't find this argument convincing (stability of parameters does not mean significant fit), so we ignored it.
A third way that one may judge the significance of the results is to directly estimate the error of the EXD fit for each exceedances vector E i .To do this, we use a form of a normalized mean squarer error given by: Here P j is the empirical probability density of the measured E i values at their j-th bin (i.e., the j-th bin's histogram height).G j is the fitted EXPD estimated at the bin's mid point.U is the same as G j but assuming a uniform distribution fitted to data instead of a EXPD (hence, U does not depend on j).The denominator normalizes the measure, so that the error of the EXPD fit is measured with respect to the error of the uniform distribution (and hence, it is meaningful to compare NRMSEs across different E i ).If NRMSE > 1, the uniform distribution is a better distribution model.We have used this error measure in the past in varying scenarios, and noticed that values < 0.5 correspond to a visually relatively correct fit 97,98 (without any mathematical guarantee of course).In Fig. 13 we show distributions of p-values and NRMSEs for a chaotic trajectory of the towel map.We provide more similar such plots in App.E, which establish that our observations do not depend at all on the input set X.
The results are very surprising.It appears that the p-value based test is unhelpful and/or misleading.For example, it shows that quantile p = 0.95 is a bad choice, because the overwhelming majority of p-values are ≤ 0.05, eluding to a rejection of the hypothesis "the data come from a EXPD".Yet, if we look at the actual fitted data to the right of the top panel, we do not observe "bad quality" fits at all.This is further established by the distribution of NRMSE values which has all of its mass in low values.We are not sure why the hypothesis test behaves this way in this scenario.
On the other extreme of p = 0.999 the p-value based test is again misleading.This is the case where most of the p-values are > 0.05, hence, it would be the most trustworthy in terms of the p-value test.Yet clearly, this is the case where the actual fits are the worst, by far.We have performed extensive numerical tests and are confident that the code implementations yielding the p-value are correct 99 .
This leads us to conclude that the NRMSE based test is much more trustworthy.If a practitioner wants to transform the NRMSE data into a Boolean decision "is this p okay?", we would argue to check if the majority (i.e., 99%) of the mass of the NRMSE distribution is less than 0.5.Even so, the NRMSE test may only provide a range for p where the EXPD fits are of sufficiently high quality.Sec.IV B, the fluctuations of ∆ (E) with p are relatively small if p is in an appropriate range.What made the discussion of this subsection difficult is that we have not found any information regarding the p-values of this test in the literature, despite the plethora of real world applications (see App.A 10).While it has been mentioned that the p chosen in these real world applications "satisfy" this p-value test 65 , the actual p-value distributions were not shown.
In fact, we haven't found any discussions in general regarding the significance of the EVT results.The main argument the EVT FD literature has used in favor of significance of results is that the ∆ (E) i distributions did not change much within a range of appropriate p.But by itself this argument is not convincing of the validity of the results, only of their stability.We will discuss these aspects again in Sec.IV H and in the conclusions.For the rest of the manuscript we will be using p = 0.98 − 0.99, as these values seem to yield correct results for synthetic X with lengths 10 4 − 10 5 .

D. Comparison with pointwise dimension
In Fig. 14 we again compute ∆ (E) for exemplary sets as in Sec.IV A, but now we compare it with the pointwise dimension, i.e., the scaling of the inner sum of Eq. 6 versus ε.Our goal with this comparison is to see how well either method captures the "spread" of dimension values.We expect that for rather uniform fractal sets the distribution should be narrow, and wide for sets with highly non-uniform natural measure.Note that the average of the pointwise dimensions of the correlation sum does not coincide with ∆ (C) 2 .As is also made clear in Ref. 16 , ∆ (C) 2 gives a more accurate result for the FD of the whole attractor because it utilizes more points to estimate the scaling behavior.
The most important result here is that for the chosen p = 0.99 the two methods yield very similar results, hence establishing the overall accuracy of the EVT method for synthetic data.The pointwise dimension estimates however are more accurate for the Hénon map attractor and a quasiperiodic orbit, which we already discussed in Sec.IV A. Hence, C q is slightly more accurate for noiseless deterministic sets.FIG.15: Fractal dimension estimates using the extreme value theory (EVT) when varying data aspects: length (top panel, using a chaotic Lorenz96 D = 8 trajectory), dimensionality (middle panel, using delay embeddings of a chaotic Hénon-Heiles timeseries), or sampling time (bottom panel, using a chaotic Rössler trajectory).

E. Length, dimension, sampling time
We now test the EVT approach while varying various aspects of the input data: length, state space dimensionality (using delay embeddings of increasing dimension as in Sec.III E) and sampling time.The reason to judge the quality of the EVT approach versus sampling time is because, unlike the correlation sum approach which explicitly takes into account dense time sampling via the Theiler window w, the EVT approach is typically presented as agnostic to the sampling time.The analysis is presented in Fig. 15.
Regarding data length, EVT scales well with decreasing N up to a threshold.When N becomes too low, so that N(1 − p) becomes less than 50, the results significantly loose accuracy, making the the estimated ∆ (E) increase rapidly.Interestingly, with decreasing N the EVT overestimates the FD instead of underestimating it like H 2 or C 2 .Note that one cannot sim-ply fix this problem by reducing p because, as illustrated in Sec.IV B, this has its own downside of decreasing the estimated dimension.Still, we may conclude that for the data considered here, with N max > 10 3 the EVT performs well, which is a better scaling with N than H 2 , but worse than C 2 .Like with C 2 this data length N max should scale with the FD, however there is no analytic treatment as to how (while for C 2 analytic bounds are discussed in Sec.III C).Given that real world data are typically small in length, one has to be particularly aware about this point.
As far as input dimensionality is concerned, we observe similar results as with ∆ (C) 2 in Fig. 6: ∆ (E) seems quite robust when increasing dimensionality.
When it comes to sampling time, the correlation sum approach utilizes the Theiler window, which in practice shortens the distance calculations from N to N − w.This has negligible impact on data length but significant positive impact on the FD estimate in cases where data are sampled densely in time 3 .Similar results are obtained for ∆ (E) : for very small sampling times, the FD is biased towards lower values.Hence, it would make sense to include a Theiler window in Eq. ( 7), i.e., only include j with absolute distance from index i greater than some w like in Eq. ( 6).We note that Ref. 66 also considered the impact of "temporal neighbors" and reached the same negative bias conclusion.

F. Noise
We repeat here the analysis of Sec.III F in Fig. 16 using the Rössler system combined with various forms of noise.For additive noise ∆ (E) behaves much more similarly to ∆ (H) 2 because the mean FD value of EVT is an average without focus at a specific scale.This means that the FD value in the presence of additive noise is between 1.9 (deterministic) and 3 (state space dimensionality).Additionally, the FD values of ∆ (E) are the smallest (and hence, closest to the deterministic FD value) out of the three (E,C 2 , H 2 ).∆ (E) also seems to be completely unaffected by rounding.
These observations make sense if one considers how ∆ (E) is computed (Sec.II D).The logarithms of all inter-point distances are taken into account for the computation of the quantile of g i .Rounding will have negligible effect on the distribution of inter-point distances, and additive noise will have diminished effect due to being "averaged out" in some sense when computing the quantile.These properties make ∆ (E) preferable in the presence of noise, unless one wants to identify the noise radius, in which case C 2 is more suited.
Lastly, we mention that in the case of dynamic noise, ∆ (E) provides slightly higher values than H 2 or C 2 (but of course we don't know whether any of the three is the more "correct" number).It is worth noting nevertheless that the distribution of ∆ (E) for dynamic noise is much narrower than for other types of noise, which we found unexpected.We examined no further however, as the results with dynamic noise depend strongly on the system used and the exact form the noise was added, and hence can't lead to any general statements.

G. Real world data
In Fig. 17 we use the same datasets as in Sec.III G.Because the datasets are smaller in size than the typical length we used before (N = 10 5 ), we used p = 0.98 instead of p = 0.99.This value for p also satisfies the "NRMSE test" we described in Sec.IV C, in the sense of most NRMSE values being less than 0.5.
Besides the large extent of some of the distributions of ∆ (E) i , we do not notice any downside or incorrectness in the mean FD values: they are comparable with those coming from H 2 (and we can't know whether H 2 or EVT are more correct in their estimate).Given the results of the preceding Sec.IV F this is expected due to the better response EVT has to noise (versus C 2 ).FIG. 18: Application of extreme value theory fractal dimension estimates on extreme cases.Values for Lorenz96 and Kuramoto-Sivashinsky have been divided by 4 for visual purposes.We used p = 0.95 for the "nifty50" and "vostok" data, because they are much smaller (otherwise p = 0.98).
However, mean FD values depend only weakly on p.

H. Extreme cases
In Fig. 18 we apply ∆ (E) in various extreme cases as in Sec.III H. ∆ (E) does a good job distinguishing that two sets of very different dimension have been artificially merged with each other in the first two cases of Fig. 18, because it yields bimodal distributions for ∆ (E) .However, this is mainly due to the way that the sets were created.When compared to the case of adding noise to the data, we didn't see bimodal distributions with peaks at 2 and 3 dimension values, i.e., ∆ (E) can't be used to identify a noise radius unlike ∆ (C) 2 .Nevertheless, ∆ (E) could be a good tool detecting non-stationarity in an observed set, if that non-stationarity significantly changed the FD value over time.
Unfortunately, for the sets that do not accommodate a lowvalued FD description, like the Vostok and nifty50, a straightforward application of ∆ (E) gives a rather "clear" picture that a low-dimensional FD value describes the data.This is especially obvious in the nifty50 set (which we remind is a stock market timeseries), where ∆ (E) gives the lowest dimension values and a narrow distribution.Surrogate testing did not help here either: generated random-Fourier surrogates 100 had consistently higher ∆ (E) than the original data, enhancing the wrong conclusion that the estimated ∆ (E) of the data is valid.Additionally, performing the significance tests of Sec.IV C made things worse: the resulting plots (shown in Fig. 23) look very similar to those obtained from systems with a legitimate low-dimensional representation, again re-enforces the wrong conclusion.Lastly, we attempted to perform the standard analysis of checking whether the FD result converges with increasing embedding dimension of the timeseries.In App.F we find that, in contrast to the H 2 ,C 2 methods, the EVT method shows a convergence of the "nifty50" timeseries FD.Already in embedding dimension d = 5 it shows a constant mean ∆ (E) for any d ≥ 5, which, again, re-enforces the wrong conclusion.Same results were obtained for the "Vostok" timeseries.
It appears that four different significance testing methods (p-values, NRMSEs, surrogate timeseries, convergence with increasing embedding dimension) all yielded with confidence that inappropriate data like "vostok" or "nifty50" have a small FD, which is incorrect.These results have major implications for both previous and future applications of the EVT method in real world data.We haven't found a way to test whether the FD values yielded from the EVT method actually represent a low-dimensional deterministic system or not, i.e., there is no way to falsify the method.Hence, extreme care must be taken when applying the EVT method to arbitrary real world datasets, and whether the data accommodate a deterministic representation must be confirmed by other means (e.g., a selfconsistent physical theory or using the correlation sum with increasing embedding dimension as is standard practice in nonlinear timeseries analysis 22 ).
For extremely-high dimensional, but deterministically chaotic data, ∆ (E) does an excellent job in identifying a very high FD (note the FD values are divided by 4 in the figure), that is also very close to the expected value ∆ (L) .This was expected, as C 2 also does an excellent job identifying a very high dimension for clean data, and in general EVT and C 2 perform very similarly overall when it comes to deterministic noiseless data.For the Kuramoto-Sivashinsky example we see that EVT estimates higher FD than C 2 (while typically we noticed that for high dimensional data it under-estimates the FD when compared to C 2 ).However, in this example we are not sure the results are because EVT is indeed more accurate, or because EVT over-estimates the FD due to having too small N when compared to the expected FD value (see discussion in Sec.IV E).

V. A NOTE ON SURROGATE TIMESERIES
Surrogate timeseries have been recommended by Theiler et al. in the early 90s 100 as a mean to test for nonlinearity in noisy timeseries.It was suggested there that a discriminatory statistic for the test can be the FD computed via the correlation sum.That requires a bit of care.If one uses the algorithm we described here, i.e., using Eq. ( 6) and then deducing the slope of the largest linear scaling region, then the user risks estimating the fractal dimension of the noise (already existing in the original data) instead of the "underlying deterministic nonlinear dynamics data" (if any exist), thus invalidating the hypothesis testing approach in the first place.Other discriminatory statistics should be used instead, see 101 for example.
If a fractal dimension is chosen as a discriminatory statistic nevertheless, we propose the following alternatives for computing it: (i) use Takens' estimator (Sec.A 6) with the empirically good estimate ε max = 0.25std(x) with x the original timeseries.Because Takens' estimator performs a maximum likelihood estimation instead of a linear fit, and thus considers all regions of the correlation sum up to some ε max , it produces an "average" fractal dimension of the noise and the underlying data.It should be clear, however, that the Takens estimator is used in this context (just) as a discriminating statistics and its results must not be interpreted as meaningful dimension estimates.Or (ii) use the EVT version ∆ (E) which also similarly produces an "average" of fractal dimensions of noise and deterministic set.

VI. CONCLUSIONS
In this paper we have analyzed many different practically relevant fractal dimension (FD) estimators we found in the literature.From all these estimators, we focused on an extensive quantitative comparison between the entropy-from-histogram method H q (Sec.II B and App.A 1), the (potentially boxassisted) correlation-sum method C 2 (Sec.II C) without any logarithmic corrections, and the extreme value theory method (EVT, Sec.II D).Based on our review, these three estimators are the ones most worthy using in practice (App.A).
To keep this paper within sensible limits, we have used a relatively small set of possible dynamical systems and realworld data.We cross-checked our results with different systems (not shown), and we are confident that the results presented are robust.Nevertheless, it is impossible to guarantee their universality for all possible input datasets.To amend this, the reader can repeat our extensive analysis for any other dynamical system or input dataset of choice by only changing a couple of lines of code in the provided code base, see App.B.
Our conclusions are as follows (see also Fig. 1).When comparing H 2 with C 2 , we found that for synthetic (i.e., noiseless) data C 2 is clearly superior to H 2 , retaining much better accuracy for decreasing N (amount of points) increasing D (state space dimension), or decreasing ε (size scale).C 2 can also be used to detect the "noise radius", as e.g.illustrated by Kantz and Schreiber 22 .For real data this can become a downside, making the FD estimation using C 2 ambiguous due to either an almost continuous change of slope or the majority of the slope reflecting the slope of the noise (Sec.III G).This can be partly alleviated by examining the behavior of C 2 with increasing embedding dimension d.In Sec.III C we saw that, provided that a deterministic chaotic attractor is known to generate the data, it is still quite sensible to estimate a FD of relatively small datasets using C 2 , as the estimated FD remains very accurate even for relatively small data lengths.On the other hand, H q is very sensitive to data length and underestimates the FD strongly even for moderately small data (which is often the case in experiments).We had difficulties making sense of correlation sum for order q ̸ = 2, even though it has been mentioned several times in the literature.C q for q ̸ = 2 gives correct results only when one considers the slope at the largest ε values and it is unclear why the slopes (i.e., FD values) at small ε values are incorrect.As such, we suggest that when treating multifractality, the entropy method should be preferred, or, if one has too few data points (where H q performs poorly), then using C q and recording the slope of log(C q ) at largest ε is the best alternative.
We then compared C 2 with EVT.In deterministic datasets we found a very high degree of agreement between C 2 and EVT, confirming EVT's accuracy.EVT performed equally well for high dimensional data but worse than C 2 for decreasing N. Still, EVT performs better with decreasing N, increasing D, or increasing underlying FD when compared to H 2 .When it comes to noise, EVT appears to have similar results with H 2 , i.e., the results are an average of the deterministic (smaller) FD and that of the noise.EVT reports the smallest FD values when contaminated by noise, and hence, EVT is affected less strongly by noise than C 2 or H 2 .One more advantage of EVT versus C 2 , H 2 is that it foregoes the identification, fitting, and extraction of slope from a scaling region, and hence does not face the same limitations when the data can cover only very small range of magnitudes of ε.This advantage is balanced by the disadvantage of EVT being much more of a black box method than C 2 , H 2 .
It appears that the EVT is a promising method that combines benefits from both H 2 ,C 2 : it scales well with decreasing N or increasing D and is more tolerant to noise than C 2 .However, we also observed that it has a huge downside: it cannot be falsified.Or, at least, at the moment there does not exist a method (visual or statistical) in the literature that can confidently falsify it, nor that it can quantify its significance meaningfully.All four methods we utilized and discussed in Sec.IV C (p-values, root-mean-squared errors, surrogate tests, increasing embedding dimension) failed to indicate that e.g., stock market timeseries are an inappropriate input (despite being non-stationary and not satisfying practically any of the assumptions underlying the EVT method).Instead, all ways to test for significance gave big confidence that stock market timeseries are described by a very small FD of ≈ 2.5.Surprisingly, we found practically no discussion in the literature about falsifiability, despite the plethora of real-world applications in very varied input datasets (Sec.II D).Nevertheless, we believe that falsifiability is important, because with it one resolves controversies like the fractal dimension of "global climate attractors" Ref. 26 .The lack of falsifiability has major implications for both previous and future real world applications using the extreme value theory approach: the argument for low-dimensional determinism in the data cannot come from the EVT method itself, at least not at the moment.
It is clear, as it was before this work, that estimating a FD is not an easy task and hence focusing on only a single number can mislead.The best practice we feel is to calculate several versions of ∆, from different methods and with varying the parameters of each method (including the range of ε or the quantile probability p) and produce a median of the results.Besides, given the software implementation we provide here, calculating all FD variants only necessitates a couple lines of code, see App.B. Furthermore, plotting of the appropriate quantities −H q , log(C q ) versus ε is a must, and can hint whether the methods are applied at inappropriate data.In addition, expecting more than one decimal point of accuracy is unrealistic in most practical applications.
As an outlook, we believe there is still some future research to be done regarding estimating FDs.Besides the Lyapunov dimension, which is not easily applicable to observed data, every other estimator disregards the time-ordering information in observed data (i.e., that the sequence of points follows the flow in state space instead of being randomly drawn samples on the attractor).Perhaps here is a way to make a more powerful estimator by using this discarded information of the time-ordering of the points in the dataset.For example, this time-ordering information has been used in e.g.estimating the transfer operator 102,103 , which can also yield the natural density, and perhaps this operator can be utilized to create a FD estimator of higher accuracy or with better scaling with the number of points N.
Regarding the EVT approach we believe future research can improve in two fronts: 1) developing a mathematicallyrigorous framework for choosing p; 2) developing statistical tests for correctness and significance of the method's results that can successfully falsify the method with inappropriate data.With new statistical indicators, the method may be applied with more confidence to data of unknown dynamical origin.
Appendix A: Algorithms for estimating a fractal dimension 1. Optimized histograms for arbitrary ε Here we describe an optimized method to calculate histograms, which to our knowledge neither has been published before, nor we could find a faster method that works for arbitrary ε.The process has memory allocation scaling of D • N and performance scaling of D • N log(D • N), neither of which depends on ε.The processes is as follows.Every point in the dataset is first mapped to its corresponding bin via the operation b i = ⌊(x i −x min )/ε⌋, where x min is a vector containing the minimum value of the dataset along each dimension and ⌊•⌋ is the floor operation.The resulting b (N in total) are then sorted with a quick sorting algorithm, which results in all equal b i being successive to each other.The sorting is lexicographic, i.e., sorting by first dimension, then by second, and so on.We then count the successive occurrences of equal b i ≡ b i+1 , which gives the amount of points present in the corresponding bin and move on to the next bin (there are no actual bins being created in memory, a bin is conceptually defined as the group of successively equal b i ).Dividing by the total amount of points gives the probabilities that can then be plugged into Eq.( 3) to yield the entropy H q .

The Box-Counting Algorithm by Molteno
This box-counting algorithm for the generalized dimension was introduced by Molteno 38 as an improvement for the method introduced by Grassberger 104 .It claims to be of order D • N. The algorithm partitions the data into boxes and counts the number of points in each box to retrieve the probabilities p i necessary to calculate the generalized dimension ∆ (H) q .The faster runtime of the algorithm is due to the use of integer manipulations for the division of the data points into boxes.
To perform these manipulations, all data values are converted to unsigned integers.These are calculated by finding the maxima x max and minima x min in each dimension to then identify the dimension that covers the largest range X.
where N bits is the bit size of the unsigned integer type that was used and ⌊ • ⌋ is the floor function.
The first box contains all indices to the array that stores the data points, where a box refers to an array of indices.The boxes are subsequently partitioned, until the mean number of points per filled box falls below a threshold k 0 , recommended by Molteno to be 10.
The kth step of partitioning divides the previous box into 2 D new boxes with the dimension of the data D. For a data point x, the index i of its new box is calculated as Here » is the logical shift operation, shifting the bit representation of the first value to the right by the second value, and is the bit-wise "and" operation.Thereby (x j » (N bits − k)) and 1 translates to checking whether the bit at position N bits − k + 1 is one.In a cycle of partition, all boxes that contain more than one data point are split up and empty ones are deleted.To retrieve the probabilities p i , the number of data points contained in each box is counted and divided by the total number of data points.
The algorithm provides exponentially scaled sets of values, that suit the approximation of the limit in Eq. ( 4) and it only needs to compute the operation (A2) D • N times per partitioning process.The main disadvantage is the static box size choice.It works well for low-dimensional data sets, but for larger dimensions the number of boxes increases exponentially with dimension, thereby increasing the number of points necessary to calculate the dimension exponentially.This drastically limits the ε range to which the algorithm is applied over, and for some sets makes the computation of low accuracy or even straight-out impossible.Given how fast our histogram algorithm already is (all H q curves in this paper took on average 0.1 seconds to compute), we saw no reason to use the Molteno method.Also, the Molteno method does not allow the user to decide ε values, making it less flexible.

Correlation sum
The original correlation sum can be calculated as given in (6).For q = 2, the second sum can be changed to only include indices higher than the current one (i), which makes the computation twice as fast.In all cases, the calculation time scales with D • N 2 and therefore this method is exceptionally slow for high N.Note that the version of the q-order correlation sum given in Eqn.(6) differs from the version provided by Kantz and Schreiber 22 in the exponent of 1/(q − 1).If the q-order correlation given by Kantz and Schreiber is called Cq (ε), it scales as Cq (ε) ∼ ε (q−1)∆ (C) q , then Cq (ε) 1/(q−1) = C q (ε) ∼ ε ∆ (C) q and both versions are equal.Besides the exponent, the formulation of C q also differs in the indices of the outer of the sums which range from 1 to N, requiring a slightly changed normalisation.

The Box and Prism assisted Correlation sum
Theiler 40 proposed an improvement over the calculation of the correlation dimension by Grassberger and Procaccia 105 that divides the data into boxes before calculating the distances between points.Thereby the number of distance calculations is reduced and the scaling becomes faster than N 2 (how much, it depends on the box size r 0 ).After division into boxes the formula given in ( 6) is used to calculate the correlation sum, therefore an extension to the q-order correlation sum is possible.
The integer representations of the boxes b(x) with sidelength r 0 can be calculated by b(x) = x − x min r 0 (A4) for each point x and the minimum of each dimension x min .This method is based on Appendix(A 1), as Theiler does not specify a method for the calculation of the boxes.The permutations needed to sort all representations in lexicographic order are calculated with a quick sort algorithm.The representations have the same ordering as the points they were calculated with, therefore sorting the points into boxes reduces to the iteration through the permutations and searching for changing integer representations.If the integer representation changes, all previous permutations are stored in an array.For a point inside a box, there may be points outside the box that are within the given distance of the point.Therefore the neighboring boxes w.r.t. the current box also have to be found and checked.The distances between the points in the box and its neighbors are calculated as given in Eq. ( 6).The first sum runs over all points of the initial box and the second sum uses all points in the box and adjacent boxes.For q = 2 only boxes with equal or larger indices are included in the neighbor search and the optimization, presented in the previous chapter, is used.
A point of criticism for this algorithm is its poor runtime for higher dimensions.The advantage of distributing the points into boxes beforehand diminishes as the number of boxes increases considerably with dimension.To reduce the amount of boxes, Theiler proposed a prism algorithm, where only the first P dimensions are used to distribute the data into boxes.These new boxes, where P sides are of sidelength r 0 and the other D − P sides cover the whole range, are called prisms.The best choice given by Theiler is P = 0.5 log 2 N and should be used when D exceeds 0.75 log 2 N. A downside of this prism approach is that for any P < D, some point pairs that should have been discarded may be included due to having small distances in the first P dimensions but a larger distance in at least one of the remaining dimensions.
According to Theiler, the size of the boxes r 0 can be computed as where R is the size of the chaotic attractor and the dimension 2 is estimated by computation of the correlation sum for √ N points.Since these points are chosen randomly, the estimated dimension and in consequence the box size and even the final dimension can vary strongly.A downside is that for some sets the proposed box size can drop below the minimum interpoint distance of the set.Furthermore, our tests showed an irregular output value of r 0 with differences as high as two orders of magnitude for two box size estimates on the Hénon map.
Bueno-Orovio and Pérez-García 41 proposed a different, and more stable, algorithm for the calculation of the optimal box size r 0 .They optimized the expected calculation times for the optimal number of filled boxes to .

(A6)
ℓ is the effective length, r ℓ is the box size used to calculate the effective length with r ℓ = R/10 and η ℓ is the number of filled boxes in case of the effective length.η ℓ is calculated by distributing N/10 points into boxes of size r ℓ .The dimension 2 used for the calculation of r 0 , η opt and ℓ is again estimated by computation of the correlation sum for √ N points.The box size estimator still varies in its choice of box size but shows fluctuations of smaller amplitude than the Theiler estimator.The choice of a box size smaller than the smallest interpoint distance only occurs for high dimensional data sets with a low number of points.Furthermore, Bueno-Orovio and Pérez-García chose a prism dimension of always P = 2 (for D > 2).However a prism dimension of 2 can result in box size estimates smaller than the minimal interpoint distance for high dimensional datasets with comparably low size.In this paper we used P = 2 but we have different r 0 , see below.
The main benefit of a small r 0 is that it makes the computations much faster.Unfortunately, both suggestions, and especially that of Ref. 41 , often fail in practice.They give much too small r 0 values and for data with any amount of noise whatsoever this value is well below the noise radius.This is displayed in Fig. 7.The vertical dotted line shows the r 0 estimated by Eq. (A6) (the calculation of C 2 would be limited up to this r 0 ).In fact r 0 is so small, that even for only 5% additive noise the computation would only show the noise dimension and no hint of the deterministic (and smaller) slope of log(C 2 ) vs. log(ε).That is why in this paper we decided to use the box-assisted version for better performance, but with r 0 = ε max /e 2 as discussed in Sec.III I.The performance is not too bad, e.g. for the typical data lengths N = 10 5 considered here computing the entire C 2 curve takes about a minute on an average computer.Notice that even with the optimized-for-performance r 0 of Ref. 41 , the entropy-method is still massively faster (in fact it is even faster than just computing r 0 ).

Fixed-Mass correlation dimension
The correlation dimension stems from the assumption, that the probabilities p i scale as ∑ i p q i /δ −(q−1)∆ (M) q i ∼ 1 for ε → 0 with δ i being the diameter of the partition.In most methods that compute the dimension, the diameter is fixed and the probabilities are defined by the set.Termonia and Alexandrowicz 106-108 proposed a method using a fixed mass algorithm which does the opposite: the probabilities are fixed by defining p i = n/N, where n is a chosen number of points and N the total number of points in the set.The diameter is chosen to include n points.Following the explanation by Grassberger 82 the scaling can be rewritten to assume the form of Eq. (A7) (A7) (r ( j) ) −τ is the mean of all radii that contain j points, τ = (q − 1)∆ q , M is the number of points of the set considered for the calculation, Γ is the gamma function.Eq. (A7) is not solvable for ∆ q in general.Eq. (A8) can be obtained from Eq. (A7) from the limit q → 1 and applying L'Hôspital's rule.
Ψ is the digamma function, Ψ(x) = d log Γ(x)/dx.With this general form that is not solvable this algorithm is restricted to q = 1.We compared this version with the traditional correlation sum with q = 2 and we found very similar accuracy in the estimated FD, provided that maximum j was large enough.However, we also noticed that the fixed mass results spanned much less orders of magnitude in the estimated log r ( j) , even for very high j, see Fig. 19.Furthermore, it is not clear to us how to estimate up to what maximum j should the calculations be performed, i.e., it is not something that could be straightforwardly extracted from data as in Sec.III I. Regarding computational performance, the fixed mass algorithm is implemented using KDTrees 109 , giving very high performance for low-dimensional dataset, scaling poorly for very high dimensional ones.Hence, we decided to use only the traditional correlation sum version in the main comparison of Sec.III.
FIG. 19: Fractal dimension estimates for sets coming from different dynamical systems with known dynamical rule to compare the correlation sum to the fixed mass formulation.

Takens' Estimator
The estimator introduced by Takens 13 aims to estimate the correlation dimension using the method of maximum likelihood estimation (MLE) 110 .From m interpoint-distances ρ j < ε max , one can estimate the correlation dimension as The derivation of this formula starts with the assumption that ∃ ε max > 0, so that ∀ 0 ≤ ε ≤ ε max , the correlation sum holds exactly, (A10) without any higher order terms, for some proportionality constant c.The transformed variable y i = − log(ρ i /ε max ) is distributed exponentially with parameter ∆ (T ) .The log-likelihoodfunction of this distribution is given by l is then maximised with respect to ∆ (T ) to obtain the most likely value of the correlation dimension.As Takens noted correctly, just taking the maximum of l is biased 13,43 , which can be easily corrected by writing For a Gaussian distributed random variable, the loglikelihood function is a parabola, that at 1σ has fallen by 0.5 from its maximum and at 2σ by 2. By invariance 111 , this is FIG.20: Dependency of ∆ (T ) for different systems on the parameter ε max .The shaded 5%-95% confidence intervals around the curves are not visible in most cases.Clearly, the variation of ∆ (T ) over different of ε max exceeds the confidence intervals.
also the case for a non-Gaussian random variable, letting us easily estimate the variance of ∆ (T ) .When testing the algorithm and its dependency on ε max (Fig. 20) on different dynamical systems, we found that the variation of ∆ (T ) exceeds the confidence intervals at any fixed ε max for low dimensional systems.
These variations occur because for the estimation, it is assumed that Eq. (A10) holds.Thus, the estimated ∆ (T ) and its confidence intervals are of no use as long as the validity of the assumption (A10) is not known.
While Takens' estimator does not provide a significant advantage in the precise estimation of the fractal dimension, compared to correlation-sum based methods, it can be useful in the case of surrogate timeseries as described in Sec.V.

Judd's estimator
Judd's "improved" estimator for the correlation dimension 44,45 also uses maximum likelyhood estimation.To account for deviations from uniform scaling, it allows a polynomial a of degree t, so that the assumed correlation sum is It is stated that a degree t ≤ 2 is "usually sufficient".The estimator performs a binned maximum likelihood estimation of ∆ (J) 2 alongside the coefficients of the polynomial.Judd therefore introduces a logarithmic binning, where the bins are defined by for i > 0 and λ < 1.The parameter ε 0 is called the cutoff and w = log(1/λ ) the bin width.Now, the probability of observ- ing a distance in bin B i becomes p i = P i − P i+1 , with If the bin contents b i are distributed multinomially, the negative log-likelihood function for the bin contents which must be minimized under the constraints Eq. (A12) does not necessarily only have one minimum because it depends on t + 2 parameters and must be minimized numerically.
The optimal bin width for the estimator w minimizes where m is the index of the last bin for which the bin content b m ̸ = 0 and n is the number of considered interpoint distances.
Because the computation time of Eq. (A14) becomes unreasonably large for large numbers of distances, the estimator is restricted to small sample sizes.
Once the optimal bin width is found, the cutoff ε 0 is chosen as the right edge of the fullest bin of the histogram.All bins to the right of this bin are joined to form B 0 = [ε 0 , ∞).
The minimization of Eq. (A12) is subject to two difficulties that are already noted by Judd.First, an optimizer cannot understand the idea that the exponential ε ∆ (J) is the essence of the model, while the polynomial is only a device to correct for deviations from the scaling law.Second, the optimizer is highly sensitive to the initial condition of the optimization, which could be reduced by a tailored optimizer 112 , but still one can observe a very broad distribution of ∆ (J) 2 values over different samples of a long trajectory, especially for higherdimensional systems, as is shown in Fig. 21.
Due to these problems, we decided not to include the estimator in the main comparison.

Dimension from Lyapunov exponents
In Sec.II E we described the Lyapunov dimension ∆ (L) due to Kaplan and Yorke.We do not have anything to add here regarding ∆ (L) , but we want to mention Ref. 48by Chlouverakis and Sprott.They suggest that instead of a linear interpolation to the sum of λ i , a polynomial interpolation should be done instead.However, as we found no theoretical foundation for this proposal, we decided to skip it (and we also did not notice any significant improvement with the numeric results).

Mean return times
According to the Poincaré recurrence theorem, any trajectory within an ergodic set 113 will return arbitrarily often and arbitrarily close to any neighborhood in the ergodic set 98 .We represent this neighborhood as a hypersphere of radius ε, centered at some point x 0 in the ergodic set, and define as γ the mean return time to this hypersphere.Then one expects that log(γ) ≈ −∆ (γ) log(ε) with ∆ (γ) the fractal dimension, as estimated from the mean return times.A more formal discussion of this fact, and explicit connection with the natural measure of the ergodic set and the fractal dimension obtained via the generalized entropy (4) is given by Theiler 16 .Earliest reference we found using return times to estimate fractal dimensions was Ref. 46 .
Unfortunately, the method using mean return times is not recommended at all.A fundamental limitation is that knowledge of the dynamic rule f is necessary, otherwise the results of the method for measured data are too inaccurate to be considered seriously.Even for known rule f , the method converges slowly (numerically).Furthermore, it provides an estimate of the local dimension around the point of return, similarly with the point-wise dimension.Thus, it has to be further averaged over several state-space points, requiring several orders of magnitude more computation time than the correlation sum method or the Lyapunov dimension method.Lastly, we found its numeric output (not shown, see online repository) to be quite far from the results of the correlation dimension and thus we do not consider it accurate enough.

Extreme value theory
We introduced the algorithm of estimating a FD via extreme value theory (EVT) in Sec.II D.Here we will expand more on how the algorithm has been used in the literature and highlight potential ambiguities we have noticed in its real-world applications.
Recall from the discussion of Sec.II D that the exponential distribution (EXPD) the extremes of the g(ε) function (Sec.II D) follow is in fact a simplification of a Generalized Pareto Distribution (GPD).The cumulative function of GPD is and is valid for x ≥ 0 if ξ ≥ 0 and for 0 ≤ x ≤ −σ /ξ if ξ < 0. It reduces to EXPD for ξ = 0.The first ambiguity we found in the literature concerns the recently published work of Pons et al. 73 , as well as real-world applications that explicitly fit a GPD to data (i.e., allowing ξ ̸ = 0 instead of enforcing ξ = 0).It is correct that the exceedances E in a dataset may follow a GPD, but if one wants to make the claim that the σ parameter is connected with the fractal dimension, then one must assume that the data follow the reduced EXPD instead.That is because of Eq. ( 14).If one uses the GPD cumulative function instead of the EXPD one, one would get the expression exp(−E∆ i ) to be equated to (1 + ξ E/σ i ) −1/ξ , from which it is impossible to claim ∆ i = 1/σ i like in Eq. ( 14).Hence, when using GPD fits instead of EXPD one cannot simply equate the (local) fractal dimension with the σ parameter of the fit, in contrast to what has been done in the literature 73 .
The second ambiguity we encountered is the lack of application, and even discussion, of delay coordinates embeddings.Many real-world applications, e.g., Ref. 65 , analyze a single dynamic variable (such as sea level pressure in the case of climate applications).This dynamic variable is a spatiotemporal field, and hence provides a multi-dimensional input dataset.Despite the high input dimensionality, this single variable is likely coupled to many other dynamic variables in a coupled dynamical system (which is especially true in the case of climate dynamics).The theory of delay embeddings is supposed to re-construct the missing dynamic variables and as a result provide a more correct representation of the dynamical flow.Delay embedding is missing from almost all applications of of EVT we reviewed and cited, even though they all use the timeseries of only one dynamic variable.Given that delay embedding is a well established analysis step 28 that is completely separate from estimating fractal dimensions, we are not sure why there is a lack of discussion of it.
The third ambiguity we want to highlight is the report of relatively small values for the fractal dimensions of spatiotemporal, and highly complex, real-world data.For example, Ref. 70 reports dimensions of ∆ ≈ 3.5 for the dynamics of slow earthquakes in the Cascadia region, Ref. ?reports ∆ ≈ 15 for spatiotemporal atmospheric flow (of daily resolution; hence, large scale turbulence is considered), Ref. 71 report a difference of at most δ ∆ i ≈ 2 from the states of largest and smallest local fractal dimension of the 500-hPa geopotential height (Z500) dynamic variable for the weather of the Europian-Atlantic region.Especially in the last case this would mean that, out of the potentially millions of available degrees of freedom in this (discretized) spatiotemporal system, there is a difference of at most two additional degrees accessed by the state space flow between the least and most stable regions in state space.Given that in this review we noticed much higher differences of local dimensions in much lower dimensional systems (see e.g., Figs.11 or 18), we find this reported small number difficult to grasp.In general, given the discussion on falsifiability of Secs.IV C and IV H, as well as limitations that come from the length of input data that we discussed in Sec.III C, we believe that the absolute value of the fractal dimensions reported in these publications should be taken with a grain of salt, and not be equated with the available degrees of freedom in the state space.Whether or not the relative values of the local dimensions ∆ (E) i (in the sense that relatively higher value means higher local state instability for the real system) can be used to draw conclusions or not depends on the confidence one has on the stability of the distributions of ∆ (E) i , see Sec.IV C.

Persistent homology
Persistent homology methods are based on topological timeseries analysis and applications in dynamical systems.These techniques have been relatively recently applied to estimate fractal dimensions, and a quantitative review was recently published by Jaquette and Schweinhart 50 .
The methodology is based on tracking how d-dimensional holes form or disappear as the point cloud that composes X is "inflated" or "thickened".This means that each point in X is taken as a sphere with radius initially 0, and this radius is increased as the point cloud is "inflated".Estimators based on 0-dimensional persistent homology using minimal spanning trees were proposed already in the early 1990's by van der Weygaert et al. 115 and Martínez et al. 116 who stress that this approach provides an estimate of a (generalized) fractal dimension and also works with relative small data sets.For more information about implementations of the method, see Ref. 50 .
The review of Ref. 50 compared fluctuations in the output values and the distance of the output values themselves from the reference "true" values of the test sets applied to.The results showed that the persistent homology method has similar performance for d = 0, and dramatically worse performance for d > 0, when compared to the correlation sum.Unfortunately, for d = 0 the method performs poorly when noise is present, i.e., it does not distinguish two slopes (of the noise and of the deterministic set) and instead shows that of the noise, while it does find two slopes for d = 1.Additionally the method output depends strongly on its meta-parameter α, whose value cannot be deduced from input data.For these reasons, and because the methodology itself is more complicated to both explain and implement than the correlation sum of Sec.II C, we deem the method worse than the correlation sum and only considered the correlation sum for a more indepth comparison in Sec.III.FIG.23: Same as Fig. 13 but now for the "nifty50" stock market timeseries (embedded in 6 dimensional space).Note that this timeseries is only 3125 samples long, hence we used much smaller p when compared to other plots.The proportion N(1 − p) however remains similar.

FIG. 1 :
FIG. 1: Summary and overview of this paper.Sec.II A discusses top-left panel.Secs.II B to II D discuss bottom left panel.Secs.III and IV discuss top right panel and lead to the results summarized in the table of the bottom right panel (which are also stated in the conclusions, Sec.VI).

FIG. 2 :
FIG. 2: Fractal dimension estimates of ∆ (H) 2 and ∆ (C) 2 for sets with analytically known fractal dimension.The figure title maps colors to set names.The legends within each axis map colors to two numbers which mean the (5%, 95%) confidence intervals corresponding to the estimation of the fractal dimension (curve slope).The markers on the curves denote the start and end of the estimated linear region.

2 .
time series like N = 500.Eckmann and Ruelle 80 discussed the data requirements for estimating the correlation dimension and pointed out that the minimal number N min of data points needed to estimate a dimension ∆ Assuming linear scaling for ε = ρE, where E denotes the diameter of the set (largest pointwise distance) and ρ is a small number like 0.1, Eckmann and Ruelle state that for N data points the estimation of ∆ (C) 2 will not provide values larger than The results for the 8D Lorenz96 model shown in Fig.4are essentially consistent with the estimates of Eckmann and Ruelle.For N = 500 the (5%, 95%) confidence interval of the slope is (5.57, 5.94), i.e. slightly larger than ∆ (C) 2 max (N) = 5.4 assuming ρ = 0.1 (a value of ρ = 0.113, for example, would provide ∆ (C) 2max (N) = 5.7).With increasing N the estimated slopes converge towards the value ∆ (L) = 6.91 of the Kaplan-Yorke dimension of the 8D Lorenz96 system, a value that is already included in the interval (5.51, 7.46) obtained with N = 10000, a number of points large enough to estimate correlation dimensions up to ∆ (C) 2 max (N) = 8 according to the Eckmann-Ruelle limit.The bottom panel of Fig. 4 shows log-log plots of C 2 for real-world experimental data (see Sec. III G for a description).The dimension estimates ∆ (C) 2 for different lengths N are all about 3 and only the corresponding confidence intervals shrink for increasing N.This is also in agreement with the Eckmann-Ruelle bound, because dimensions ∆ (C) 2 ≈ 3 can in principle be achieved with N = 500 or larger.

Fig. 4 .
FIG. 4: Dependence of estimators on data length N. The plot in the top panel shows results for data from an 8-dimensional Lorenz96 system and the bottom plot for the experimental dataset "electroch.1" from Fig. 8.The curves have been vertically offset for visual clarity.For these plots we have used the automatic ε range estimation described in Sec.III I only for the largest datasets and use the same range for the rest.As a result log(C 2 ) cuts off to −∞ as N decreases and curves are truncated correspondingly.

FIG. 6 :
FIG. 6: Effect of increasing input dimensionality of an inputtimeseries from a set of known fractal dimension utilizing delay embeddings of dimension d.The timeseries used is the first variable of a chaotic Hénon-Heiles trajectory.

FIG. 7 :
FIG. 7: Impact of noise using the (chaotic) Rössler system as an example.The noise percentages approximately indicate ratio of the std. of the noise divided by the std. of the deterministic dynamics.See App.A 4 for a discussion of the vertical dotted line.

FIG. 10 :
FIG. 10: Demonstration of the algorithm estimating fractal dimensions from the curves of H 2 , log(C 2 ).Shaded interval are the estimated linear scaling regions.See text for the definition of δ 1 , δ 5 (differences of the top panels).

FIG. 11 :
FIG.11: Fractal dimension estimates for exemplary systems using the extreme value theory method.Shown are the distributions of ∆ (E) ifor each system.The inner legends display the mean of the distribution and in brackets the percentage of the distribution outside a strict cutoff value (the state space dimensionality).For the two 8-dimensional systems (Lorenz96, coupled logistic maps), we divide dimension values by 4 for visual purposes.More details in Sec.IV A.
FIG. 13: Distributions of p-values of the statistical hypothesis test of whether the data are coming from a EXPD, and the relative error of the EXPD fits normalized by the error of a uniform distribution.Inner legends show the distribution medians.The figure is divided into three rows for three quantile probabilities p. Next to the distributions of p-values and NRMSE there are three exemplary plots of the distributions of the exceedances E i and with black line the fitted EXPD (the text indicates the corresponding p-value and NRMSE).The bottom panel has the NRMSE distribution clampled in [0, 1].The input dataset X is a chaotic trajectory of the towel map.See discussion in Sec.IV C for more.

FIG. 14 :
FIG. 14: Comparison of local fractal dimension estimates using the extreme value theory (EVT) method and the pointwise dimension.Left and shaded distributions are EVT, right and transparent are pointwise.Dashed white lines are means of EVT, dotted colored lines are means of pointwise.

FIG. 16 :
FIG.16: Impact of noise in the fractal dimension estimates of extreme value theory.

1 / 3 .P 2 ℓ
is the number of dimensions used for the boxing, D for boxes and P for prisms.Introducing the effective length of the chaotic attractor ℓ = r ℓ η and solving η opt = (ℓ/r 0 ) ∆(C)

FIG. 21 :
FIG. 21: Estimates of ∆ (J) 2 for different dynamical systems.For each box, 100 different samples have been drawn from a trajectory.Each individual sample contained 100 points.The degree of the polynomial was deg(a) = 1.

TABLE I :
Description of various estimators for fractal dimension considered in this paper, see also Appendix A.

TABLE II :
Lyapunov (Kaplan-Yorke) dimensions for systems with known dynamic rule (listed in TableIII).