Order patterns apply well in many fields because they require only minimal stationarity assumptions. Here, we fix the methodology of patterns of length 3 by introducing an orthogonal system of four pattern contrasts, that is, weighted differences of pattern frequencies. These contrasts are statistically independent and turn up as eigenvectors of a covariance matrix both in the independence model and in the random walk model. The most important contrast is the turning rate. It can be used to evaluate sleep depth directly from EEG (electroencephalographic brain data). The paper discusses fluctuations of permutation entropy, statistical tests, and the need for new models of noises like EEG.

Order patterns have been applied successfully to various biomedical, geophysical, climate, finance, and other data.1,2 Their importance grows with the size of the available data. On the theoretical side, connections with dynamical systems have been thoroughly studied.3–5 There are links to stochastic processes,6–8 also in the multivariate case.9–11 Various statistical methods have been worked out.12–17 Nevertheless, basic problems remain to be clarified and new models to be built in order to establish order patterns firmly as a mathematical subject. Our paper deals with the statistics of noisy data in the case of a single time series. Only patterns of length up to 5 are of interest, but the delay between time points can vary. For length 3, we define four contrasts, that is, differences of pattern frequencies. This is the unique system of orthogonal and statistically independent contrasts. For EEG (electroencephalographic brain) data, a case study shows that the turning rate contrast is most relevant. As a function of time and delay, it has the power to classify sleep stages and to uncover an infra-slow rhythm of the brain. Concerning serial dependence tests, permutation entropy seems a good choice because of its slim distribution under the null hypothesis. EEG shows serial dependence at all times and scales and so requires new models of noise.

Karsten Keller asked me to give the opening talk of the Orpatt22 workshop in Dresden. This was a great honor, which encouraged me to tie together various loose ends of my research in recent years. Karsten and I collaborated on the topology of fractals in the 1990s, work which culminated in Karsten's habilitation,18 and after 2000 with Bernd Pompe and others on ordinal time series analysis. I went to Dresden for only two nights, because of my wife's poor health. We had fruitful discussions on both evenings which I shall never forget. A few weeks later, just four days after my wife, Karsten died unexpectedly, a shock for all of us. I write this paper with great sadness and gratitude, remembering Karsten's very original, kind, constructive, and optimistic way of teaching and doing research.

Figure 1 shows the six possible order patterns of three consecutive values in a time series, as shown in many papers and defined in Sec. II. Most authors study the frequencies $p_{123}, p_{132}, \ldots$ of these patterns. This paper shows that it is better to study contrasts, that is, differences of frequencies. The four lines in Fig. 1 indicate a canonical system of four contrasts. The line marked $\beta$, for instance, denotes the parameter $\beta = p_{123} - p_{321}$, which measures the asymmetry of up and down in a time series. The parameter $\alpha$ measures the amount of turning points (right) compared to monotonically increasing or decreasing parts (left) of the time series. We call $\alpha$ the turning rate since it is the relative number of time points where we have a local maximum or minimum. Small $\alpha$ indicates smoothness, large $\alpha$ means fast oscillation. The parameters $\gamma$ and $\delta$ measure deviations from spatially symmetric and time reversible processes,7 similar to $\beta$.

FIG. 1.

The four pattern contrasts subdivide the six patterns in the best possible way.

The parameters were introduced6,19–21 because they can be interpreted better than frequencies of single patterns. Here, we show that, for mathematical reasons, these four parameters form the canonical framework for the study of patterns of length 3. The results in Sec. IV say that the estimates of these four parameters are uncorrelated, while the estimates of single pattern frequencies are known to be heavily correlated.13,15,16 The four contrasts appear as eigenvectors of a covariance matrix in the random walk model as well as in the independence model. Thus, this system is unique, and the methodology for patterns of length 3 is more or less fixed.

On the flip side, this shows that patterns of length 3 do not contain much information. Suppose the process under study is symmetric with respect to time and spatial reversal, which all Gaussian processes and many observed processes like EEG are. Then, the three asymmetry parameters are zero, and only the turning rate will describe the order structure of length 3. In order to better distinguish symmetric processes, we have to study patterns of length 4, as well as permutation entropy for length $m = 5$ or 6. Recent papers indicate that there are other ways to evaluate long order patterns than to determine all $m!$ pattern frequencies.17,22,23

Statistical aspects of permutation entropy are studied by simulation in Sec. V. In the applied part of the paper, we discuss the straightforward application of $\alpha$ to sleep stage classification from two EEG channels. Section VI C also discusses an apparent infra-slow biorhythm uncovered by the turning rate. We study the effects of low-pass and high-pass filtering on the order structure. The statistics of entropy fluctuations is used to test serial dependence of the EEG data in Sec. VII C. EEG shows dependencies on every scale and requires new models.

In 1876, Bienaymé24 studied the "turning rate," the relative number of local maxima and minima, in a time series of $T$ observations. He claimed that the turning rate is approximately normally distributed, with mean $2/3$ and variance $8/(45T)$. His proof was not written down, but it was presented in the seminar of the Society by J. Bertrand.24 However, the correct value of the variance could only be obtained by studying the 120 order patterns of length 5, in a similar way as in Refs. 13, 15, 16, and 33. Moreover, the paper contained numerous calculations with astronomical and other data which supported the normal approximation.

Today, we know that this statistic of the turning rate is exact only for white noise. We can use Bienaymé's discovery for a test of serial dependence, see Sec. VII C. The difference between independent and randomly ordered ("non classées arbitrairement") data was not yet known, and Student's t test was published 30 years later.

In the century of computers and big data, the boring job of counting patterns is done by machines. Huge amounts of data require simple methods for screening, classification, and evaluation. For time series data, order patterns can simply be applied as a machine learning tool, with the additional advantage that we understand them.

Nonlinear distortion of the data, low-frequency perturbations, and outliers have little influence on order pattern methods. Intervals of missing data can be overcome: already Bienaymé collected his turning rate from several distinct short time series. No preprocessing of the data is required. Order methods are fast and easy to implement.

Modern sensors measure in tricky indirect ways. Fluctuations of voltage indicate brain activity, carbon dioxide concentration, or sea level. Changes in the environment are adjusted automatically, as a camera adapts to changing light. As a consequence, comparison of observed values is often more meaningful than the actual values themselves.

The most important advantage, however, is that minimal stationarity assumptions are required. Statistics is based on sampling, for instance, on repeated measurements. Many time series are observed only once, so we have to sample over time points. This requires that the rules of the game do not change with time. Assuming a stationary Gaussian process, for instance, means that all multidimensional dependencies among values reduce to correlations and do not change in time. This is never fulfilled in practice, so that perfect theoretical methods need not give nice results. Assumptions for order patterns are less restrictive and more likely true for data.

Having stressed the practical merits of order patterns, we now regret the lack of a theory. There are some beginnings. Bandt and Shiha6 studied order pattern frequencies in Gaussian processes. Sinn and Keller7 were the first to study estimators for pattern frequencies. They proved strong consistency and asymptotic normality for Gaussian processes. Their work was extended by Betken et al.8 to long-range dependent processes and by Weiß16 to independent variables. Tests for serial dependence, and related tests for m-dependence and structural breaks, were suggested by many authors, see Refs. 12–17 and 25 and the literature quoted there. On the whole, however, classical stochastic processes and order patterns do not fit well. Probably, we must develop a combinatorial time series analysis, as indicated by Keller and Sinn.26,27

A permutation $\pi$ is an ordering of $m$ objects. In our case, the objects are consecutive time points. It is natural to start with the mathematical definition: a permutation is a one-to-one mapping from the set $\{1, 2, \ldots, m\}$ onto itself. A time series is also a mapping, assigning to time points $t = 1, \ldots, T$ values $x_t = x(t)$. We write $\pi_k$ for $\pi(k)$ and $\pi = \pi_1\pi_2\cdots\pi_m$. Thus, $\pi = 231$ means $\pi(1) = 2$, $\pi(2) = 3$, $\pi(3) = 1$. The time series 2, 3, 1 shall represent the permutation 231.

The graph of a permutation $\pi$ represents an order pattern, see Fig. 2. Any pattern that shows the same comparisons between a series of $m$ numbers or objects is considered equivalent to $\pi$. Thus, $(0, 6, -2)$ and $(-0.199, 0.2, -4)$ are equivalent to 231. We consider $\pi$ as the name and standard form of an equivalence class of order patterns.

FIG. 2.

Left: the graph of the permutation 231. Middle: the permutation which fulfils $\pi_2 < \pi_3 < \pi_1$ is called 231 by some authors. In our notation, it is the inverse 312 of 231. Right: example time series. The dotted line indicates a pattern with $d = 2$.

Many authors define the permutation 231 by the relation $\pi_2 < \pi_3 < \pi_1$. They read the values on the $x$ axis from bottom to top while we read them on the $y$ axis from left to right. Note that this notation is counterintuitive since the time series $(3, 1, 2)$ then represents the pattern 231, and the time-reversed series $(2, 1, 3)$ does not correspond to the time-reversed pattern 132. However, Fig. 2 shows the only case for $m = 3$ where the two notations differ.

With $x = x_1, x_2, \ldots, x_T$, we denote a time series of length $T$. For $m$ time points $t_1 < t_2 < \cdots < t_m$ between 1 and $T$, we consider the pattern of the corresponding values. We define
$(x_{t_1}, x_{t_2}, \ldots, x_{t_m})$ shows pattern $\pi$ when $x_{t_i} < x_{t_j}$ if and only if $\pi_i < \pi_j$.
(1)
Actually, the pattern $\pi$ can be determined directly from the time series by calculating ranks: $\pi_k$ is the number of time points $t_j$ for which $x_{t_j} \le x_{t_k}$.

In this paper, we study only equally spaced time points $t < t+d < t+2d < \cdots < t+(m-1)d$. We call $m$ the length and $d$ the delay (or lag) of the pattern. We always take $m < 7$ and mainly consider $m = 3$ and $m = 4$. However, we vary $d$ as much as possible.

Example

The time series $x = 1, 2, 0, 1.5, -1, 3$ in Fig. 2 shows the patterns 231, 312, 231, 213 for delay 1, and 321, 213 (dotted line) for delay 2.

Now, we fix a length $m$. In examples, we usually take $m = 3$. We determine the relative frequencies of all patterns $\pi$ of length $m$, for various delays $d$. We divide the number of occurrences of $\pi$ by the number of time points where $\pi$ could occur,
$p_\pi(d) = \frac{1}{T-(m-1)d}\,\#\left\{t \mid 1 \le t \le T-(m-1)d,\ (x_t, x_{t+d}, \ldots, x_{t+(m-1)d}) \text{ shows pattern } \pi\right\}.$
(2)
For the example above, $p_{231}(1) = 1/2$, $p_{231}(2) = 0$ and $p_{213}(1) = 1/4$, $p_{213}(2) = 1/2$. In a language like Matlab or R, five lines of code suffice to determine all $p_\pi(d)$ for a given $d$.32
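
For readers who prefer to see this spelled out, here is a minimal Python sketch (not the code of Ref. 32; function and variable names are my own) which determines each pattern by the rank trick above and counts relative frequencies as in (2).

```python
import numpy as np
from itertools import permutations

def pattern_frequencies(x, m=3, d=1):
    """Relative pattern frequencies p_pi(d) as in Eq. (2); assumes no ties."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (m - 1) * d                  # number of windows
    counts = {pi: 0 for pi in permutations(range(1, m + 1))}
    for t in range(n):
        w = x[t : t + (m - 1) * d + 1 : d]    # (x_t, x_{t+d}, ..., x_{t+(m-1)d})
        # rank trick: pi_k = number of window values <= w[k]
        pi = tuple(int((w <= w[k]).sum()) for k in range(m))
        counts[pi] += 1
    return {pi: c / n for pi, c in counts.items()}

p = pattern_frequencies([1, 2, 0, 1.5, -1, 3], m=3, d=1)
print(p[(2, 3, 1)], p[(2, 1, 3)])             # 0.5 and 0.25, as in the text
```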

We are not interested in the time series for its own sake. We rather want to know the laws of an underlying mechanism which produced this time series and which could also produce many similar time series by random variation. It is customary to represent this mechanism by a sequence $X = X_1, X_2, \ldots$ of random variables on a probability space $(\Omega, P)$. We call $X$ a stochastic process and imagine that the time series $x$ is a random choice from the ensemble $X$.

Since we want to find the laws of $X$ from one single time series, we must assume that the statistics of values and the dependencies of the $X_k$ do not change in time. Usually, $X$ is required to be stationary. That is, the $m$-dimensional distribution of $(X_t, X_{t+1}, \ldots, X_{t+m})$ does not depend on $t$, for all $m$. This is a fine starting point for theory but never fulfilled in reality. Weak stationarity says that the mean $M(X_t)$ and the covariances $\mathrm{Cov}(X_t, X_{t+k})$ do not depend on $t$.28 Even this need not be true in practice, for instance in EEG measurements. We make a less stringent assumption.

Order stationarity. For a given pattern $\pi$ with length $m$ and delay $d$, let
$P_\pi(d) = P\{(X_t, X_{t+d}, \ldots, X_{t+(m-1)d}) \text{ shows pattern } \pi\}$
(3)
denote the probability that the process $X$ shows pattern $\pi$ just after time $t$. We say that $X$ is order stationary for patterns of length $m$ if $P_\pi(d)$ has the same value for all $t$. This should hold for all patterns $\pi$ of length $m$ and all delays $d = 1, \ldots, d_{\max}$. We take small $m$, usually 3 or 4, and choose $d_{\max}$ much smaller than the length $T$ of our time series. Then, there is hope that such order stationarity is fulfilled in reality, at least approximately. Note that stationary processes and processes with stationary increments (cf. next section) are order stationary for all $\pi, m, d$.
The second requirement, assumed throughout this paper, is
$P\{X_s = X_t\} = 0$ for $s \neq t$.
(4)
We exclude equality since order patterns become complicated in the case of equal values. When equal values appear in reality, we can just ignore them like missing values and adapt the denominator in (2). Condition (4) can also be guaranteed in practice by adding a tiny random noise to the data.

The $P_\pi(d)$ are the parameters of the process which we would like to determine. The numbers $p_\pi(d)$ are the concrete frequencies obtained from our data series $x$. It is natural to take $p_\pi(d)$ as an estimate of the unknown probability $P_\pi(d)$. This naive estimator is unbiased, and for Gaussian7,8 and $m$-dependent15,16 processes, strongly consistent and asymptotically normal. The estimation error, that is, the variance of the estimator, can only be determined for more specific models.

Spatial and time reversal. Time reversal transforms the series $x_1 x_2 \cdots x_T$ and the permutation $\pi_1\pi_2\cdots\pi_m$ into the series $x_T \cdots x_2 x_1$ and the permutation $\pi_m\cdots\pi_2\pi_1$, respectively. An order stationary process is called symmetric under time reversal if the frequency of any permutation $\pi$ agrees with the frequency of its time-reversed permutation. This concept,7 tightly related to the well-known concept of a reversible process,28 was recently studied by Martínez et al.29 and by Zanin et al.30 Similarly, the time series $x_1 x_2 \cdots x_T$ can be flipped at the horizontal line $y = C/2$, resulting in $(C-x_1)(C-x_2)\cdots(C-x_T)$, for any constant $C$. For the permutation $\pi_1\pi_2\cdots\pi_m$, the spatial reversal as introduced by Sinn and Keller7 is $(m+1-\pi_1)(m+1-\pi_2)\cdots(m+1-\pi_m)$. For instance, 1342 has spatial reversal 4213 and time reversal 2431. An order stationary process is called symmetric under spatial reversal if the frequency of any permutation $\pi$ agrees with the frequency of its spatially reversed permutation.
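
In code, both reversals are one-line operations on the pattern; a small sketch (notation mine) reproducing the example 1342:

```python
def time_reversal(pi):
    """Time reversal: pi_m ... pi_2 pi_1."""
    return tuple(reversed(pi))

def spatial_reversal(pi):
    """Spatial reversal: (m+1-pi_1)(m+1-pi_2)...(m+1-pi_m)."""
    m = len(pi)
    return tuple(m + 1 - k for k in pi)

print(spatial_reversal((1, 3, 4, 2)))   # (4, 2, 1, 3), the pattern 4213
print(time_reversal((1, 3, 4, 2)))      # (2, 4, 3, 1), the pattern 2431
```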

There are essentially only two random processes for which we know the probabilities $P_\pi(d)$ rigorously. The first one is a sequence of independent and identically distributed random variables $X_1, X_2, \ldots$, which is called an i.i.d. process, and white noise in the case of normal distribution. This process induces the equidistribution on patterns: $P_\pi(d) = 1/m!$ for all patterns $\pi$ of length $m$ and for all delays $d$. The reason is that an i.i.d. process is exchangeable: $(X_1, X_2, \ldots, X_m)$ has the same distribution as $(X_{\pi_1}, X_{\pi_2}, \ldots, X_{\pi_m})$ for any permutation $\pi$.

Note. The probability distribution of the values, which has nothing to do with order patterns, need not be specified. For simulations, we take the standard normal distribution. The class of exchangeable processes theoretically also contains mixtures of several i.i.d. processes. However, for an ergodic stationary process, Bandt and Shiha6 proved that the distribution of the values $x_t$ plus all pattern frequencies completely describes all distributions of vectors $(x_t, x_{t+1}, \ldots, x_{t+m})$ of the process. Thus, order structure and one-dimensional distribution are two complementary parts which together characterize a stationary process.

An i.i.d. process is self-similar in the sense that $(X_1, X_2, \ldots, X_m)$ and $(X_d, X_{2d}, \ldots, X_{md})$ have the same distribution. Thus, $P_\pi(d)$ does not depend on $d$. Actually, this seems to be the only stationary process for which $P_\pi(d)$ is known for arbitrarily large $d$.

White noise represents the null hypothesis for testing independence of the variables $X_k$, see Sec. VII C. It should be mentioned that although the $X_k$ are independent, the events $X_k < X_{k+1}$ are dependent. Under the condition $X_1 < X_2 < \cdots < X_k$, we have $P\{X_k < X_{k+1}\} = 1/(k+1)$, while in the mean, this probability is $1/2$.
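
This dependence is easy to check by simulation; a quick sketch, assuming standard normal variables (any continuous distribution gives the same result):

```python
import numpy as np

rng = np.random.default_rng(1)
k, trials = 3, 10**6
X = rng.standard_normal((trials, k + 1))
ascending = np.all(np.diff(X[:, :k], axis=1) > 0, axis=1)   # X_1 < ... < X_k
up_next = X[:, k] > X[:, k - 1]                             # X_k < X_{k+1}
print(up_next[ascending].mean())    # near 1/(k+1) = 0.25, not 1/2
```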

The second process is the symmetric random walk, where we set $X_0 = 0$ and let the increments $I_k = X_{k+1} - X_k$ be independent random variables with identical distribution for $k = 0, 1, \ldots$, with density function fulfilling $\varphi(-x) = \varphi(x)$. This is a non-stationary process with stationary and independent increments $I_k$. In the case of Gaussian distribution, the process is self-similar, so that $P_\pi(d)$ does not depend on $d$. This case will be called Brownian motion.

From an ordinal viewpoint, it is not appropriate to consider the random walk just as a cumulative sum of an i.i.d. process. For example, geometric Brownian motion $Y_k = \exp(X_k)$ has exactly the same order structure and is not a cumulative sum. It is a standard model for data from financial markets. Gaussian random walk also represents a null hypothesis.21

To determine $P_\pi(d)$ for $d = 1$ and patterns of length 3, we need only assume a symmetric random walk.15 Then,
$P_{123} = P\{X_1 < X_2 < X_3\} = P\{I_1 > 0,\ I_2 > 0\} = \left(\tfrac{1}{2}\right)^2 = \tfrac{1}{4} = P_{321}, \qquad P_{132} = P\{X_1 < X_3 < X_2\} = P\{I_1 > 0,\ I_2 < 0 \text{ and } |I_2| < I_1\} = \left(\tfrac{1}{2}\right)^3 = \tfrac{1}{8},$
(5)
and $1/8$ also for the remaining probabilities. Here, we used the Markov property of random walks: the occurrence of an order pattern among the $X_s$ with $s \le t$ is independent of the order patterns of the $X_s$ with $s \ge t$.
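
A short simulation sketch confirming (5), here with Gaussian increments (any symmetric density works):

```python
import numpy as np

rng = np.random.default_rng(0)
reps = 10**6
I = rng.standard_normal((reps, 2))      # increments I_1, I_2
X1 = np.zeros(reps)                     # start of the walk
X2 = X1 + I[:, 0]
X3 = X2 + I[:, 1]
print(np.mean((X1 < X2) & (X2 < X3)))   # P_123, near 1/4
print(np.mean((X1 < X3) & (X3 < X2)))   # P_132, near 1/8
```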

For length 4, pattern probabilities depend on the distribution of the increments. For the Gaussian random walk, they were calculated in Ref. 6. For length 5, they can be expressed as multidimensional integrals which have no analytic solution. Elizalde and Martinez31 studied classes of permutations which have the same pattern frequencies for random walks with arbitrary distribution.

Pattern probabilities have also been determined for moving average and autoregressive processes, mostly by simulation. See Refs. 6 and 15 and Table I.

TABLE I.

Pattern contrasts for the AR(1) model $X_t = \frac{1}{2}X_{t-1} + \varepsilon_t$ with different types of noise $\varepsilon_t$.

Noise         β         τ        γ        δ
Normal        0         0.086    0        0
Uniform       0         0.083    0.042    0
Bernoulli     0         1/6      1/6      0
Triangular    −0.074    0.088    0.026    0.048
Exponential   0.161     0.107    0.002    −0.095

Open problems. Both stationary processes and processes with stationary increments are order stationary. Do pattern probabilities indicate whether a process is stationary or has stationary increments? Find examples of order stationary processes which are neither stationary nor have stationary increments. Does order structure plus (one-dimensional) distribution of increments completely describe a process with stationary increments?

Example

The condition $p_{12} = 1$ characterizes an increasing series. This condition can be realized by processes with stationary increments where the distribution of increments is concentrated on the positive numbers. The conditions $p_{231} = p_{312} = p_{321} = 0$ describe a similar but slightly more complicated case.

When we have the frequencies $p_\pi$ for a time series $x$, we can combine them to calculate the permutation entropy $H$, which says something about the nature of $x$. Some comments on $H$ are given in Sec. V below. It is also possible to evaluate the frequencies $p_\pi$ separately. This may reveal more specific information. However, the information content of $p_\pi$ for different $\pi$ may overlap. Our intention is to obtain parameters which provide disjoint and interpretable information. We discuss here just the simplest case $m = 3$.

There are three reasons for considering only patterns of length up to 5.

  1. Estimation: statistical estimates of pattern frequencies would not be accurate for patterns of length 6. Already for the 120 patterns of length 5, one needs time series with several thousand values for a reasonable estimate.

  2. Interpretation: it is difficult to give a meaning to longer patterns.

  3. Assumptions of stationarity are more restrictive for longer patterns. We have no experience and intuition for such multidimensional dependencies in random processes.

In this section, we study only patterns with length $m = 3$ and delay $d = 1$, so we omit these parameters from the notation. We take an order stationary process as model and imagine that we sample a lot of time series of length $T$ from this process. For permutation entropy, this is exemplified in Sec. V B, and it turns out that time series should never be shorter than $T = 200$ in order to avoid discrete effects.

It is important to understand that in contrast to the values of a time series, pattern frequencies are "always" asymptotically normally distributed. We only have to assume that the dependence between the occurrence of patterns at two time points $s, t$ decreases fast when the time difference $|s-t|$ increases. A special case treated in Refs. 13 and 15 is ordinal $\ell$-dependence, which says that the patterns become independent if $|s-t| > \ell$. This assumption is true for white noise, random walks, and MA processes, and can be expected to hold in reality for a large $\ell$. Under such an assumption, even the combination of all six pattern frequencies will approximately follow a multivariate normal distribution. This follows from central limit theorems, cf. Refs. 15 and 16. We shall not rely on asymptotic statements which contain no quantitative estimates. But they are good for orientation. Since correlations describe all dependencies between variables of a multivariate normal distribution, we must study covariances of pattern estimators.

Indeed, the estimators of $p_\pi$, that is, the sample frequencies taken as functions on the space of all sample series, are not binomially distributed. They are heavily correlated, and their variance is different for different $\pi$. For example, an increasing time series contains $T-2$ patterns 123, while the pattern 231 can appear at most $\frac{1}{2}(T-1)$ times since occurrences cannot be consecutive. Combinatorial calculations performed by Elsinger,13 recently confirmed by de Sousa and Hlinka15 and Weiß,16 show that asymptotic variances and covariances scale with $1/T$ and are different for different stochastic processes. The exact form of the covariances is $a/T + b/T^2$, but the second term will not be discussed. It is negligible if $T \ge 200$. Here are the asymptotic covariance matrices of the $p_\pi$ for an i.i.d. process13 and for symmetric random walk15 which we need later. The six rows and columns correspond to the six permutations 123, 132, 213, 231, 312, and 321.
$$\Sigma_W = \frac{1}{360\,T}\begin{pmatrix} 46 & -23 & -23 & 7 & 7 & -14 \\ -23 & 28 & 10 & -20 & -2 & 7 \\ -23 & 10 & 28 & -2 & -20 & 7 \\ 7 & -20 & -2 & 28 & 10 & -23 \\ 7 & -2 & -20 & 10 & 28 & -23 \\ -14 & 7 & 7 & -23 & -23 & 46 \end{pmatrix},$$
(6)
$$\Sigma_B = \frac{1}{192\,T}\begin{pmatrix} 60 & -6 & -6 & -6 & -6 & -36 \\ -6 & 15 & 7 & -9 & -1 & -6 \\ -6 & 7 & 15 & -1 & -9 & -6 \\ -6 & -9 & -1 & 15 & 7 & -6 \\ -6 & -1 & -9 & 7 & 15 & -6 \\ -36 & -6 & -6 & -6 & -6 & 60 \end{pmatrix}.$$
(7)
To illustrate the meaning of these covariance matrices, let us determine a few correlation coefficients. Remember that for two random variables $U, V$, the correlation coefficient is $\rho(U,V) = \mathrm{Cov}(U,V)/(\sigma_U\sigma_V)$, and the variances $\sigma_U^2$ are on the diagonal of the covariance matrix for $U = p_\pi$. In particular, the factor in front of the matrix cancels out, so that correlation coefficients do not change with $T$, for large enough $T$. For white noise, we get
$\rho(p_{123}, p_{132}) = \frac{-23}{\sqrt{46\cdot 28}} \approx -0.64, \quad \rho(p_{123}, p_{321}) = \frac{-14}{46} \approx -0.3, \quad \rho(p_{231}, p_{312}) = \frac{10}{28} \approx 0.36.$
For the random walk model, we have
$\rho(p_{123}, p_{132}) = \frac{-6}{\sqrt{60\cdot 15}} = -0.2, \quad \rho(p_{123}, p_{321}) = \frac{-36}{60} = -0.6, \quad\text{and}\quad \rho(p_{231}, p_{312}) = \frac{7}{15} \approx 0.47.$
Since correlation coefficients lie between $-1$ and $+1$, these values are surprisingly large. They show that the $p_\pi$ share a lot of information. For that reason, a simple list of the $p_\pi$ is not the appropriate way to evaluate the ordinal structure of a time series.

Although there are three excellent presentations13,15,16 to which we refer, we add a few remarks on the matrices $\Sigma$. The covariance of random variables $U, V$ is defined as $\mathrm{Cov}(U,V) = M(UV) - M(U)M(V)$, where $M$ denotes the mean. Covariance measures linear dependence and is zero for independent variables. For two patterns $\pi$ and $\kappa$, we let $U = Tp_\pi$ and $V = Tp_\kappa$ be the absolute frequencies of the patterns in time series of length $T$. The values of $U$ and $V$ are determined for all time series of length $T$ which are generated by a certain model; for white noise, we would take all permutations of length $T$ with equal probability. Then, the covariance is determined as a mean.

This complicated theoretical calculation can be simplified by writing $U = \sum_{s=1}^{T-2} U_s$ and $V = \sum_{t=1}^{T-2} V_t$, where $U_s = 1$ if pattern $\pi$ occurs at time point $s$ and $U_s = 0$ otherwise, and similarly $V_t$ for $\kappa$. Due to the linearity of covariance,
$\mathrm{Cov}(U, V) = \sum_{s=1}^{T-2}\sum_{t=1}^{T-2} \mathrm{Cov}(U_s, V_t).$
Now, if the generating model is an ordinal $\ell$-dependent process, then $U_s$ and $V_t$ are independent, and thus $\mathrm{Cov}(U_s, V_t) = 0$, whenever $|s-t| > \ell$. Thus, the sum over $t$ runs from $\max\{0, s-\ell\}$ to $\min\{T-2, s+\ell\}$. And if the process is order stationary, then $\mathrm{Cov}(U_s, V_{s+k})$ does not depend on $s$. Collecting the terms, the sum boils down to
$(T-2)\,\mathrm{Cov}(U_1, V_1) + \sum_{k=1}^{\ell}(T-2-k)\left(\mathrm{Cov}(U_1, V_{k+1}) + \mathrm{Cov}(U_{k+1}, V_1)\right).$
These covariances contain terms like $P(\pi \text{ occurs at } 1 \text{ and } \kappa \text{ occurs at } k+1)$ which can easily be determined by counting cases.13,15,16 For white noise, we only need all permutations of length 4 and 5, for random walk just length 4. If we replace the factors $T-2-k$ by $T$ (this is the only reason to call the resulting formula asymptotic), we get $\mathrm{Cov}(U,V) = TC$ for some rational number $C$. Since $p_\pi = U/T$ and $p_\kappa = V/T$, this implies $\mathrm{Cov}(p_\pi, p_\kappa) = C/T$. This explains the scaling factor $1/T$. Performing the calculation for all pairs of patterns, we obtain the above matrices. Our sketch of proof shows that the correlations are only due to the effects of neighboring patterns.
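
The asymptotic covariances can also be checked by brute force; a sketch (simulation sizes are my choice) estimating $T\,\mathrm{Cov}(p_{123}, p_{132})$ for white noise, which by (6) should be close to $-23/360 \approx -0.064$:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(2)
T, reps = 400, 20000
f123, f132 = np.empty(reps), np.empty(reps)
for r in range(reps):
    w = sliding_window_view(rng.standard_normal(T), 3)   # all length-3 windows
    a, b, c = w[:, 0], w[:, 1], w[:, 2]
    f123[r] = np.mean((a < b) & (b < c))
    f132[r] = np.mean((a < c) & (c < b))
print(T * np.cov(f123, f132)[0, 1])   # near -23/360 = -0.0639
```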
It is not difficult to see that the six frequencies of patterns of length 3 have only four degrees of freedom. There are two constraints. The first one is obvious,
$\sum_\pi p_\pi = 1.$
(8)
Since in any time series the numbers of local minima and local maxima can differ by at most 1, it is true that $|p_{213} + p_{312} - p_{231} - p_{132}| \le 1/T$ for the frequencies in any time series of length $T$. For stochastic processes, without any stationarity, we then have the asymptotic equation
$p_{213} + p_{312} - p_{231} - p_{132} = 0.$
(9)
For order stationary processes, this holds exactly because $P\{X_1 < X_2\} = p_{123} + p_{132} + p_{231}$ and $P\{X_2 < X_3\} = p_{312} + p_{213} + p_{123}$ must be equal.
From $P\{X_2 > X_3\} = p_{132} + p_{231} + p_{321}$ and order stationarity, we get another equation,
$p_{12} - p_{21} = P\{X_1 < X_2\} - P\{X_2 > X_3\} = p_{123} - p_{321}.$
(10)
It is rare that particular patterns have a meaning, as 1324 and 1342 in the heart rate study of Parlitz et al.35 Usually, certain differences of pattern frequencies, which we call contrasts, are more informative than the $p_\pi$ themselves. They were introduced in Refs. 6, 19, 21, and 32 as a kind of autocorrelation function of $d = 1, 2, \ldots$ which describes the behavior of data, for instance, periodicities, short-time dependence, or self-similarity. This practically important aspect is not studied here. We take $d = 1$ and focus on showing that the pattern contrasts form an ideal orthogonalized system of patterns, both from an algebraic and a statistical viewpoint. Let us now introduce the pattern contrasts which were illustrated in Fig. 1. "Contrast" is a basic concept of the analysis of variance, contained in every advanced textbook of statistics and in the special introduction.36 It means "weighted sum of frequencies" and is used here in exactly the same spirit. It is actually a difference since the sum of the weights is usually zero.
The up-down balance $\beta = p_{123} - p_{321} = p_{12} - p_{21} = 2p_{12} - 1$
(11)
distinguishes upward and downward patterns. In (10) above, it was shown that $\beta$ is actually the contrast which comes from patterns of length 2. For symmetric or time-reversible processes, $\beta$ is zero. It is positive for time series which increase slowly and decrease fast. This is the basic asymmetry parameter, with extreme values $-1$ for a decreasing and $+1$ for an increasing time series. All identities among pattern frequencies hold for order stationary processes. For time series, they hold with a small boundary error.6,32
The persistence $\tau = p_{123} + p_{321} - \tfrac{1}{3} = \tfrac{2}{3} - \alpha$
(12)
has properties very similar to classical and rank autocorrelation when considered as a function of $d$. It can also be considered a smoothness parameter since it distinguishes straight and broken patterns. It is just the negative of the turning rate $\alpha$ of Bienaymé, plus $\tfrac{2}{3}$. The constant of $\tau$ was chosen so that it is zero for white noise, like the other pattern contrasts. Smoother time series have a smaller turning rate and larger persistence. The maximum value $\tfrac{2}{3}$ of $\tau$ occurs for a monotone series. The minimum value is $-\tfrac{1}{3}$ and occurs for an alternating sequence, as mentioned after (22). Negative persistence is rare for $d = 1$ but appears for $d > 1$ when $2d$ is near a period of the process. A two-dimensional variant of persistence was defined recently.23 $\tau$ or $\alpha$ is the most important pattern contrast, and both versions are useful. The turning rate has the advantage of a clear interpretation.
The rotational asymmetry $\gamma = p_{213} + p_{231} - p_{132} - p_{312} = 2(p_{213} - p_{132}) = 2(p_{231} - p_{312})$
(13)
distinguishes the patterns with middle rank 2 at the beginning from those where 2 is at the end. It is positive for an oscillation with increasing amplitude, as in Fig. 2, negative for a damped oscillation, and zero for time-reversible processes. The extreme values $\pm 1$ occur for alternating series. "Rotational" refers to a geometric half-turn of the pattern.
The up-down scaling $\delta = p_{132} + p_{213} - p_{231} - p_{312} = 2(p_{132} - p_{312}) = 2(p_{213} - p_{231})$
(14)
got its name from the fact that $\delta(d) = \beta(2d) - \beta(d)$. It is tightly connected with $\beta$. It is zero for time-reversible, spatially symmetric, and self-similar processes. Thus, the pattern contrasts $\beta, \gamma, \delta$ all indicate deviation from symmetry and reversibility, but in different ways.

To give an idea of the size of the parameters, we simulated AR(1) processes $X_t = \frac{1}{2}X_{t-1} + \varepsilon_t$ with different distributions of the i.i.d. noise $\varepsilon_t$. The length $T$ does not matter when it is larger than 1000. For symmetric distributions, we have $\beta = \delta = 0$. However, $\gamma$ is positive even for uniform noise, although such a minor effect is not visible in small sample series. (To distinguish it from normal noise with error probability 5%, let $T = 3000$. Then 98% of the samples with uniform noise will be recognized correctly.) Bernoulli noise means $\varepsilon_t = \pm 1$ with probability $\frac{1}{2}$. This model can be treated rigorously. Triangular noise was simulated as the minimum of two uniform random numbers $u_1, u_2$, and exponential noise as $1 + \log u$.
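
A sketch of such a simulation for uniform noise (seed and series length are my choices; compare the Uniform row of Table I):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 10**6
x = np.zeros(T)
eps = rng.uniform(-1, 1, T)                  # symmetric uniform noise
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + eps[t]           # AR(1) with coefficient 1/2

a, b, c = x[:-2], x[1:-1], x[2:]             # all windows for m = 3, d = 1
p123 = np.mean((a < b) & (b < c)); p321 = np.mean((a > b) & (b > c))
p132 = np.mean((a < c) & (c < b)); p312 = np.mean((b < c) & (c < a))
p213 = np.mean((b < a) & (a < c)); p231 = np.mean((c < a) & (a < b))

beta  = p123 - p321                  # near 0 (symmetric noise)
tau   = p123 + p321 - 1/3            # near 0.083, see Table I
gamma = p213 + p231 - p132 - p312    # near 0.042
delta = p132 + p213 - p231 - p312    # near 0
print(beta, tau, gamma, delta)
```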

In the analysis of variance, weight vectors of different contrasts are taken to be orthogonal, and we have followed this rule. We shall see why. As above for the covariance matrices, we assign the six patterns 123, 132, 213, 231, 312, and 321 to the unit vectors in $\mathbb{R}^6$, that is, $(1,0,0,0,0,0)$ up to $(0,0,0,0,0,1)$. Then, the four pattern contrasts $\beta, \tau, \gamma$, and $\delta$ correspond to the vectors
$\bar\beta = (1, 0, 0, 0, 0, -1)$, $\quad\bar\tau = \left(\tfrac{2}{3}, -\tfrac{1}{3}, -\tfrac{1}{3}, -\tfrac{1}{3}, -\tfrac{1}{3}, \tfrac{2}{3}\right)$, $\quad\bar\gamma = (0, -1, 1, 1, -1, 0)$, and $\quad\bar\delta = (0, 1, 1, -1, -1, 0)$.
(15)
Together with the vectors $c_1 = (1, 1, 1, 1, 1, 1)$ and $c_2 = (0, -1, 1, -1, 1, 0)$ of the two constraints (8) and (9), these vectors form an orthogonal basis of $\mathbb{R}^6$. Their norms are
$|\bar\beta|^2 = 2, \quad |\bar\tau|^2 = \tfrac{4}{3}, \quad |\bar\gamma|^2 = |\bar\delta|^2 = 4.$
(16)
Moreover, the four functions were defined so that they are all zero for white noise, and their value for pattern probabilities $p = (p_{123}, \ldots, p_{321})$ is $\tau(p) = \langle p, \bar\tau\rangle$ and similarly for $\beta, \gamma, \delta$. With $\langle x, y\rangle$ and $|x|$, we denote the scalar product and Euclidean norm, respectively. The distance to white noise20 can be defined for arbitrary $m$, summing over the set $S_m$ of patterns of length $m$. Here, it is used for $m = 3$. It is a version of permutation entropy, as explained in Sec. V.
$\Delta^2 = \sum_{\pi\in S_m}\left(p_\pi - \frac{1}{m!}\right)^2.$
(17)

Proposition 1

(Pythagoras' theorem for pattern contrasts)
The distance to white noise defined in (17) can be partitioned as follows:
$4\Delta^2 = 3\tau^2 + 2\beta^2 + \gamma^2 + \delta^2.$
The quadratic distance of pattern frequencies $p$ to the pattern frequencies $p_{BM}$ of symmetric random walk [see (5)] fulfils
$4|p - p_{BM}|^2 = 3\left(\tau - \tfrac{1}{6}\right)^2 + 2\beta^2 + \gamma^2 + \delta^2.$
For an arbitrary fixed pattern frequency distribution $p^o$, we have
$4|p - p^o|^2 = 3\left(\tau(p) - \tau(p^o)\right)^2 + 2\left(\beta(p) - \beta(p^o)\right)^2 + \left(\gamma(p) - \gamma(p^o)\right)^2 + \left(\delta(p) - \delta(p^o)\right)^2.$
Proof.
For an i.i.d. process, all four pattern contrasts are zero, and for symmetric random walk only $\tau = 1/6$ is non-zero. So, the first two equations are special cases of the last one, which we are going to prove now. The probability vectors $p$ and $p^o$ fulfil the constraints (8) and (9), so they are points in a four-dimensional affine space. Considering $p^o$ as the origin of this space, we have a four-dimensional vector space with orthogonal basis $\bar\tau, \bar\beta, \bar\gamma, \bar\delta$. Now, whenever $x$ is a vector in a four-dimensional space with an orthogonal basis $b_1, \ldots, b_4$, then
$|x|^2 = \sum_{k=1}^{4}\left\langle x, \frac{b_k}{|b_k|^2}\right\rangle\langle x, b_k\rangle = \sum_{k=1}^{4}\frac{1}{|b_k|^2}\,\langle x, b_k\rangle^2.$
Let $x = p - p^o$ and take $\bar\tau, \bar\beta, \bar\gamma, \bar\delta$ as the $b_k$. With (16), we obtain the desired equation.
This statement is simple algebra; no statistics is involved. It says that we can decompose the variance of the pattern distribution with respect to some origin or mean $p^o$ into components. We can say how much variance is due to $\tau$, $\beta$, etc. Thus, we have new parameters which add up to 1. In the case of an i.i.d. process, they are32
$\tilde\beta = \frac{2\beta^2}{4\Delta^2}, \quad \tilde\tau = \frac{3\tau^2}{4\Delta^2}, \quad \tilde\gamma = \frac{\gamma^2}{4\Delta^2}, \quad \tilde\delta = \frac{\delta^2}{4\Delta^2}.$
(18)
This very much resembles the ANOVA methodology in statistics. We now have to discuss the statistical independence of the components.
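
Proposition 1 can be verified numerically for any admissible frequency vector; a minimal sketch with an arbitrary $p$ fulfilling (8) and (9):

```python
import numpy as np

# frequencies in the order 123, 132, 213, 231, 312, 321,
# chosen to fulfil the constraints (8) and (9)
p = np.array([0.30, 0.10, 0.15, 0.12, 0.07, 0.26])
assert abs(p.sum() - 1) < 1e-9                   # (8)
assert abs(p[2] + p[4] - p[3] - p[1]) < 1e-9     # (9)

beta  = p[0] - p[5]
tau   = p[0] + p[5] - 1/3
gamma = p[2] + p[3] - p[1] - p[4]
delta = p[1] + p[2] - p[3] - p[4]
Delta2 = np.sum((p - 1/6) ** 2)                  # distance to white noise (17)
print(4 * Delta2)                                        # 0.170933...
print(3 * tau**2 + 2 * beta**2 + gamma**2 + delta**2)    # the same value
```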
We said that frequencies $p_\pi$ of different patterns $\pi$ are correlated, and the (asymptotic) covariance matrices for symmetric random walk and white noise were calculated in Refs. 13, 15, and 16 and shown in (7) and (6). Any two row vectors $x, y$ in $\mathbb{R}^6$ can be interpreted as linear combinations of patterns, and $x\Sigma y^T$ is the covariance of $x$ and $y$. So matrix multiplication determines variances and covariances of our pattern contrasts $\beta, \tau, \gamma, \delta$, considered now as estimators over the space of all sample series of length $T$. For symmetric random walk, it turns out that all our four pattern contrasts are uncorrelated! Their variances are
$T\,\mathrm{Var}\,\beta = 1, \quad T\,\mathrm{Var}\,\tau = \tfrac{1}{4}, \quad T\,\mathrm{Var}\,\gamma = \tfrac{1}{3}, \quad T\,\mathrm{Var}\,\delta = \tfrac{2}{3}.$
(19)
Principal component analysis of the covariance matrix $\Sigma_B$ is the standard method to find uncorrelated vectors. All eigenvalues of any covariance matrix are real and nonnegative. We determine the eigenvectors $x$ for each eigenvalue $\lambda$. The variance of $x$ is $x\Sigma x^T = \lambda x x^T = \lambda|x|^2$. And if $x, y$ belong to different eigenvalues, then $x\Sigma y^T = 0$, so they are uncorrelated. In our case, the vectors $c_1, c_2$ of the two constraints (8) and (9) are eigenvectors corresponding to $\lambda = 0$ since they represent constant functions.

Proposition 2

(Independence of pattern contrasts for symmetric random walk)

For the covariance matrix $T\Sigma_B$ of pattern frequencies of symmetric random walk given in (7), the vectors of the four pattern contrasts $\beta, \tau, \gamma, \delta$ are exactly the eigenvectors of the nonzero eigenvalues $\tfrac{1}{2}$, $\tfrac{3}{16}$, $\tfrac{1}{6}$, and $\tfrac{1}{12}$. Their variances are given in (19).

The proof is a simple calculation, for instance, $\bar\beta\,(T\Sigma_B) = \tfrac{1}{2}\bar\beta$. Note that eigenvectors are determined only up to a real constant, and eigenvector programs yield unit vectors. Our pattern contrasts are not unit vectors. Their variance is determined by $x\Sigma_B x^T$, or $\lambda|x|^2$ when we use (16).

The case of an i.i.d. process is a bit more complicated. By matrix multiplication with $\Sigma_W$, we get the (asymptotic) variances
$T\,\mathrm{Var}\,\beta = \tfrac{1}{3}, \quad T\,\mathrm{Var}\,\tau = \tfrac{8}{45}, \quad T\,\mathrm{Var}\,\gamma = \tfrac{2}{5}, \quad T\,\mathrm{Var}\,\delta = \tfrac{2}{3}.$
(20)
The covariances are all zero, except for $T\,\mathrm{Cov}(\beta, \delta) = \bar\beta\,(T\Sigma_W)\,\bar\delta^T = -1/3$. Thus,
$\mathrm{corr}(\beta, \delta) = -\sqrt{2}/2 \approx -0.707.$
(21)
Let us compare with eigenvectors. Weiß16 determined the nonzero eigenvalues of $T\Sigma_W$ as $\lambda_1 = \frac{1}{12}(2+\sqrt{2})$, $\lambda_2 = \frac{2}{15}$, $\lambda_3 = \frac{1}{10}$, and $\lambda_4 = \frac{1}{12}(2-\sqrt{2})$. Note that $\lambda_1$ and $\lambda_4$ belong to one quadratic factor of the characteristic polynomial. Matlab yields just the numeric eigenvalues $0, 0, 0.049, 0.1, 0.133, 0.285$ and the corresponding eigenvector columns,
$$\begin{pmatrix} 0.12 & 0.39 & 0.50 & 0 & 0.58 & 0.50 \\ -0.35 & 0.54 & -0.35 & -0.50 & -0.29 & 0.35 \\ 0.60 & 0.24 & -0.35 & 0.50 & -0.29 & 0.35 \\ -0.35 & 0.54 & 0.35 & 0.50 & -0.29 & -0.35 \\ 0.60 & 0.24 & 0.35 & -0.50 & -0.29 & -0.35 \\ 0.12 & 0.39 & -0.50 & 0 & 0.58 & -0.50 \end{pmatrix}.$$

The first two columns are linear combinations of the constraint vectors $c_1$ and $c_2$. Columns 4 and 5 are the unit vectors of $\bar\gamma$ and $\bar\tau$, respectively. Thus, $\gamma$ has variance $\frac{1}{10}\cdot 4 = \frac{2}{5}$, and $\tau$ has variance $\frac{2}{15}\cdot\frac{4}{3} = \frac{8}{45}$, as Bienaymé knew. Columns 3 and 6, corresponding to $\lambda_4$ and $\lambda_1$, are connected with $\beta$ and $\delta$. Their sum is $\bar\beta$, and their difference is $\bar\delta/\sqrt{2}$. So the eigenvectors are $\frac{1}{2}\bar\beta \pm \frac{1}{2\sqrt{2}}\bar\delta$. These are the orthogonal and uncorrelated unit vectors of $T\Sigma_W$, and they are unique since the nonzero eigenvalues are all different.
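
The whole eigenanalysis takes a few lines of numpy; a sketch with the matrix entries from (6):

```python
import numpy as np

M = np.array([[ 46, -23, -23,   7,   7, -14],
              [-23,  28,  10, -20,  -2,   7],
              [-23,  10,  28,  -2, -20,   7],
              [  7, -20,  -2,  28,  10, -23],
              [  7,  -2, -20,  10,  28, -23],
              [-14,   7,   7, -23, -23,  46]]) / 360   # T * Sigma_W, Eq. (6)

print(np.round(np.linalg.eigvalsh(M), 3))   # 0, 0, 0.049, 0.1, 0.133, 0.285

tau_bar = np.array([2, -1, -1, -1, -1, 2]) / 3
print(M @ tau_bar / tau_bar)                # constant 2/15: an eigenvector
```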

It would be difficult to interpret these linear combinations, however. We stick to $\beta$ and $\delta$, which are represented by orthogonal vectors but have the negative correlation (21). (Another option would be to exchange $\delta$ for $\beta_2 = \delta + \beta$, which is uncorrelated with $\beta$ and easy to interpret. But the vectors for $\beta$ and $\beta_2$ are not orthogonal, and Pythagoras' theorem then includes a mixed term.)

Proposition 3

(Independence of pattern contrasts of an i.i.d. process)

For the covariance matrix $T\Sigma_W$ of pattern frequencies of white noise given in (6), the vectors of the pattern contrasts $\tau$ and $\gamma$ are eigenvectors of the eigenvalues $\frac{2}{15}$ and $\frac{1}{10}$. The other two eigenvectors of nonzero eigenvalues generate the same plane as $\bar\beta$ and $\bar\delta$. We choose $\beta$ and $\delta$ instead of the eigenvectors, for better interpretation, and must take into account the strong negative correlation (21). All other pairs of the four estimators are uncorrelated.

Open problem. We have shown that for patterns of length 3, there is a canonical choice of four pattern contrasts. What about patterns of length 4?

Pattern frequencies and contrasts are linear functions. The normal approximation applies to their statistical estimates.26 In this section, we discuss statistical estimation of the permutation entropy $H$, which is a nonlinear function. This results in biased estimates. At its maximum, however, $H$ has special statistics which can be applied to serial independence tests,12,14 with great success in the EEG study in Sec. VII, even though they are not based on the $\chi^2$ distribution.13

Let $S_m$ denote the set of all permutations of length $m$, which usually will be between 3 and 6. The permutation entropy is defined as
$H = H_m(p) = -\sum_{\pi\in S_m} p_\pi \log p_\pi, \quad\text{where } p = (p_\pi)_{\pi\in S_m}.$
(22)
For simplicity, we do not include a delay parameter $d$, and we drop the subscript $m$ when possible. We use the natural logarithm. In information theory, base two logarithms are often used, and then the physical unit is called bit. Since $\log_2 x = \log x/\log 2$, in our version $H = 1$ corresponds to $1/\log 2 = 1.44$ bits. The maximum value $H^{\max} = \log m!$ is assumed for an i.i.d. process and the minimum value 0 for monotone time series.34 Note that the scale for $H$ is nonuniform. For a period 2 sequence, like $x_t = (-1)^t/t = -1, \tfrac{1}{2}, -\tfrac{1}{3}, \ldots$ with $P_{132} = P_{312} = \tfrac{1}{2}$, we have $H_3 = \log 2 = 0.69$, and for period 3 with $P_{123} = P_{231} = P_{312} = \tfrac{1}{3}$, we get $H_3 = \log 3 = 1.10$. Thus, $H$ works very well in detecting periodicities and distinguishing small periods and other dynamical phenomena.3 On the other hand, noisy time series, coming from sources like EEG or geophysical measurements, all have entropy very near the maximum. For symmetric random walk, (5) yields $H_3 = \tfrac{5}{2}\log 2 = 1.7329$, while $H_3^{\max} = \log 6 = 1.7918$. Nevertheless, random walk is fundamentally different from white noise.
Our question here is whether permutation entropy can distinguish statistically between such noisy processes. To this end, we consider white noise as our point zero. The essential quantity is not $H$ but its deviation $H^{\max} - H$ from the maximum. The distance to white noise was defined in (17). We now consider $H(p)$ as a differentiable function of $m!$ variables $p = (p_\pi)_{\pi\in S_m}$ near the white noise parameter $p^o = (1/m!)_{\pi\in S_m}$. Our purpose is to get an idea of the statistical fluctuations of $H$ in different samples of an i.i.d. process. We need the second order Taylor expansion of $H$ at $p^o$.
$H = H_m(p) \approx \log m! - \frac{m!}{2}\Delta^2 \quad\text{or}\quad \frac{m!}{2}\Delta^2 \approx H^{\max} - H.$
(23)
This is easy to check: $\partial H/\partial p_\pi = -1 - \log p_\pi$ is a constant for the i.i.d. process. So the first-order term vanishes, as it should for a maximum. The non-zero second derivatives are $\partial^2 H/\partial p_\pi^2 = -1/p_\pi$, with value $-m!$ at $p^o$. So, the second order term is $-\frac{m!}{2}$ times the sum of the $(p_\pi - 1/m!)^2$.

For a check of (23), take symmetric random walk with $H_3(p) = 1.7329$. The Taylor approximation, calculated as $H_3(p) \approx \log 6 - 3\Delta^2 = 1.7918 - 1/16 = 1.7293$, coincides to two decimals accuracy.
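
This check is a two-line computation; a sketch using the random walk probabilities from (5):

```python
import numpy as np

p = np.array([1/4, 1/8, 1/8, 1/8, 1/8, 1/4])   # random walk, Eq. (5)
H3 = -np.sum(p * np.log(p))                    # 2.5*log(2) = 1.7329
Delta2 = np.sum((p - 1/6) ** 2)                # 1/48, so 3*Delta2 = 1/16
print(H3, np.log(6) - 3 * Delta2)              # 1.7329 versus 1.7293
```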

We proved that near the i.i.d. process, $H^{\max} - H$ is the quadratic quantity $\frac{m!}{2}\Delta^2$. For all other processes, the Taylor approximation is of first order and the deviations are linear. This has consequences for the statistical fluctuations of $H$ in sample series of different lengths $T$.

For Fig. 3, we simulated 100 000 time series of different lengths $T$ and determined the distribution of $H_4^{\max} - H_4$. The true value is 0 for white noise and 0.194 for Brownian motion. This does not agree with the mean of the distributions. There is considerable bias in the statistical estimates. The bias is always to the right, to larger deviations from $H^{\max}$. It scales like $1/T$ with the sample size.

FIG. 3.

Left: deviations $H_4^{\max} - H_4$ of 100 000 white noise sample series of different length $T$. The deviations scale like $1/T$. Right: the same simulation for Brownian motion. The bias scales like $1/T$, but the standard error scales like $1/\sqrt{T}$.

For white noise, the bias is easy to explain. The sample frequencies p π are never all equal, so the entropy of the sample series cannot reach the value H max. Due to the convexity of the function H max H , we have such bias also for other processes. Table II shows that the bias is divided by two when we double the sample length T . This observation helps to extrapolate the true value and correct the bias.

TABLE II.

Mean, bias, and standard deviation of $H_4^{\max} - H_4$ taken over 100 000 sample series of Brownian motion with different $T$. Scaling of the bias by 2 is obvious; scaling of $\sigma$ by $\sqrt{2}$ follows from the central limit theorem.

T                    100       200       400       800       Limit
Mean                 0.320     0.254     0.224     0.209     0.194
Bias                 0.126     0.060     0.030     0.015
Std. dev. σ_T        0.0974    0.0648    0.0437    0.0304
σ_T/σ_{2T}           1.50      1.48      1.44      …         √2

The estimates for white noise are much more accurate than for any other process, due to its special Taylor expansion. For $T = 800$, for instance, Fig. 3 shows estimates between 0 and 0.04 for white noise and between 0.1 and 0.32 for Brownian motion. Actually, the standard deviation of the distribution scales with $1/T$ for white noise and with $1/\sqrt{T}$ for other processes.

The scaled deviations $Z = T(H^{\max} - H)$ in Fig. 4 illustrate the $1/T$ scaling of the statistical fluctuations of white noise. There is good agreement for different $T$, and there is a limit density for $T \to \infty$. It is calculated in Corollary 3.2 of Weiß,16 which extends to $m > 3$ when the corresponding eigenvalues are determined.

FIG. 4.

Distribution of scaled deviations $Z = T(H^{\max} - H)$ of 100 000 sample series of different length $T$ from the white noise process verifies the scaling. The distributions do not depend much on $T$ and approach a limit function.16 The distributions for $m = 3$ show irregularities due to discrete effects.

For $m = 3$, and in the case $m = 4$ for $T = 100$, the density functions look irregular. This is not statistical inaccuracy. The density did not change when the number of simulations was increased or decreased. It did change, however, when $T = 100$ was replaced by 99 or 101. The irregularities are caused by the fact that absolute frequencies of patterns are relatively small integers, so there are not many possible frequency combinations. To avoid such effects, just work with $m = 4$ and $T \ge 200$.

When we know the distribution of $Z$ for an i.i.d. process, in particular the $p$-values or tail probabilities $p(z) = P(Z \ge z)$, we can perform a statistical test of serial dependence. Such tests are very important for time series analysis. Whenever a model for a time series fits the data, the residues, theoretical minus observed values, should be indistinguishable from i.i.d. random numbers. An overview of the multitude of serial dependence tests can be found in Matilla-García and Ruiz Marín.12

The paper12 suggested taking permutation entropy, more precisely the values $2Z$, as the criterion for a serial dependence test. Unfortunately, the authors wrongly claimed that under the null hypothesis of an i.i.d. process, $2Z$ follows a $\chi^2$ distribution with $m! - 1$ degrees of freedom. Actually, $T\,m!\,\Delta^2$ fulfils a well-known formula for the $\chi^2$ test,
$2Z \approx T\,m!\,\Delta^2 = T\,m!\sum_\pi\left(p_\pi - 1/m!\right)^2 = \sum_\pi\frac{(Tp_\pi - T/m!)^2}{T/m!} = \sum\frac{(B-E)^2}{E},$
where $E = T/m!$ is the expected and $B = Tp_\pi$ the observed absolute frequency of pattern $\pi$ in the sample series. This roughly shows the type of limit distribution, at least for $H_3$, when we take four degrees of freedom. However, Elsinger13 pointed out that besides degrees of freedom, other strong corrections are necessary due to correlation and different variances of pattern frequencies, see Sec. III E. In Refs. 14 and 25, it was pointed out that $\chi^2$ holds when we appropriately restrict the index set of the data. Following Elsinger,13 two different tests were worked out by de Sousa and Hlinka15 and Weiß.16 The main tool in both cases is principal axes analysis, see Sec. IV. Weiß16 determines the limit distribution as a convolution of multiples of $\chi_1^2$ for $m = 3$. His Corollary 3.2 extends to $m > 3$ when the corresponding eigenvalues are determined. Elsinger13,15 uses a transformation to the standard multivariate normal distribution and classical $\chi^2$ statistics.

Such asymptotic results are important for our understanding. However, they do not tell us what happens for a particular $T$. For the tests of serial dependence in Sec. VII C, we include Fig. 5 and the critical quantiles ($z$-values with $P(Z \le z) = 0.95$, say) in Table III, obtained by direct simulation (100 000 samples for each $T$). For $H_3$, the empirical formula $p = e^{-2z/3}$ works well even for rather large $z$, as Fig. 5 shows. The table for $H_3$ agrees with the table for $2Z$ on p. 21 of Ref. 13.

FIG. 5.

To perform tests of serial dependence, we need tail probabilities ($p$-values). For $H_3$, they almost coincide for different $T$. The empirical formula $p = \exp(-2z/3)$ yields good $p$-values even for larger $z$ where $\chi^2$ is misleading. For $H_4$, the dependence on $T$ is small but visible up to $T = 800$.
TABLE III.

Simulated critical quantiles of $Z$ for tests of serial dependence with $Z = T(H^{\max} - H)$ for $m = 3$ (left) and $m = 4$ (right). Values for $T = \infty$ were extrapolated.

m = 3:
Sample size    95%     99%     99.9%
T = 800        4.46    6.81    10.36
T = 400        4.47    6.82    10.37
T = 200        4.49    6.91    10.39
e^{−2z/3}      4.49    6.91    10.36
χ²₄ (2z)       4.74    6.64    9.24

m = 4:
Sample size    95%     99%     99.9%
T = ∞          17.22   21.55   27.48
T = 800        17.29   21.63   27.56
T = 400        17.39   21.75   27.69
T = 200        17.64   22.06   28.08
T = 100        18.40   22.89   28.88
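
A sketch of the resulting test for $m = 3$, using the empirical $p$-value formula of Fig. 5 (pattern counting as in Sec. II; all names are mine):

```python
import numpy as np
from math import exp, factorial, log
from numpy.lib.stride_tricks import sliding_window_view

def serial_dependence_test(x, m=3):
    """Z = T*(log m! - H_m) and, for m = 3, the empirical p-value exp(-2z/3)."""
    x = np.asarray(x, dtype=float)
    w = sliding_window_view(x, m)
    # rank trick per window: pi_k = #{j : w_j <= w_k}, encoded as one integer
    ranks = (w[:, None, :] <= w[:, :, None]).sum(axis=2)
    codes = ranks @ (m ** np.arange(m))
    counts = np.unique(codes, return_counts=True)[1]
    p = counts / len(w)
    Z = len(x) * (log(factorial(m)) + np.sum(p * np.log(p)))
    return Z, exp(-2 * Z / 3) if m == 3 else None

rng = np.random.default_rng(4)
print(serial_dependence_test(rng.standard_normal(400)))             # large p
print(serial_dependence_test(np.cumsum(rng.standard_normal(400))))  # reject H0
```

Comparing $Z$ with the simulated quantiles of Table III gives the same decision as the empirical formula.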

EEG brain signals are probably the widest field of application of order patterns. There is a well-developed methodology for EEG signals based on Fourier spectra and graphic elements, which is now combined with machine learning because of the size of the data sets; see the comprehensive survey.37 On the other hand, the data contain a lot of different artefacts which make spectral methods difficult. Order patterns apply most successfully to such really big and "dirty" data sets.

Karsten Keller applied symbolic analysis as early as 2003 to EEG of epileptic children,38 to assess the results of therapy performed by Lauffer. Larger permutation entropy indicated higher vigilance of the brain and thus a therapeutic success. He came back several times to such applications.40,41 Epilepsy is the most striking application of entropy methods in EEG.15,39,42 Other studies concern effects of anesthetic43 and psychotropic44 drugs, or Alzheimer’s disease.45 

We consider data from sleep research46 which we studied before20 with permutation entropy and $\Delta^2$. It turns out that sleep analysis is still simpler with the turning rate, where we count only maxima and minima. This is a new application of pattern contrasts as a function of time $t$, not of the delay $d$, in the setting of long data series. The turning rate will be used to compress the data, to classify sleep stages, and to find slow oscillations which are hardly accessible by conventional methods.

We take the EEG sleep data of Terzano et al.46 available on Physionet.47 Although 20 years old, they are of excellent quality. There were several healthy volunteers whom we shall study. The EEG was sampled at 500 Hz with a decent hardware low-pass filter at 30 Hz. We shall take delay $d = 4$, that is, 8 ms. In this range, most EEGs are totally disturbed by electrical noise from the power grid, which is unavoidable in a clinical environment. It is common practice to filter away the "mains hum," which works well for conventional methods but can destroy the fine order structure. The Terzano data did not require any preprocessing.

With 15 million values in each EEG channel (eight hours sampled at 500 Hz), these data are big time series, as nowadays observed in so many fields. They must be screened automatically. Order patterns are an appropriate tool.

Figure 6 shows the sleep EEG of a healthy person. In the upper part, you see the annotation of sleep stages by a medical expert, with one value for every epoch of 30 s. The experts used various channels, among others eye trackers for REM phases. Their main goal was to find and analyze "cyclic alternating patterns." Annotation was an art, with a comprehensive catalog of rules. Now, it is done by machine,37 cheaper than by experts.

FIG. 6.

The turning rate of an EEG directly classifies sleep stages.

In the lower part, two special EEG channel differences can be seen. Both functions are almost parallel to the annotation. However, the two functions do not show the EEG values. They show the turning rate for $d = 4$. Thus, a simple count of maxima and minima on a time scale of milliseconds will classify sleep stages almost as efficiently as the study of various waves, graphical elements, and phenomena in non-EEG channels by qualified experts, or expert systems, in a frequency range between 20 and 0.2 Hz. In Fig. 7, three very short parts of the time series demonstrate how the number of maxima and minima decreases in sleep.

FIG. 7.

Left: three clean data segments of one second demonstrate differences in turning rate. Right: turning rates of 1 s segments (bottom) are averaged over 30 s to find slow oscillations (top).

The 15 million data points of the very irregular original time series were compressed to a series of 1000 turning rates by analyzing epochs of 30 s. With $T = 15\,000$ values per epoch, the estimates of the turning rate are fairly reliable even in the presence of small artefacts. As a result, we obtain a new time series $\alpha_1, \ldots, \alpha_{1000}$ of turning rates, drawn in Fig. 6. No preprocessing, no filtering, no artefact removal.
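
A sketch of this compression step, under the sampling assumptions stated above (500 Hz, 30 s epochs, $d = 4$); reading the Physionet records is omitted:

```python
import numpy as np

def turning_rate(x, d=4):
    """Fraction of points which are a local maximum or minimum at delay d."""
    a, b, c = x[:-2*d], x[d:-d], x[2*d:]
    turns = ((b > a) & (b > c)) | ((b < a) & (b < c))
    return turns.mean()

def epoch_turning_rates(channel, fs=500, epoch_s=30, d=4):
    """Compress one EEG channel to a turning rate per 30 s epoch."""
    n = fs * epoch_s                                   # 15000 samples per epoch
    epochs = channel[: len(channel) // n * n].reshape(-1, n)
    return np.array([turning_rate(e, d) for e in epochs])
```

Applied to an eight hour record, this yields roughly 1000 values, the series drawn in Fig. 6.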

Figures 6–8 are taken from an unpublished preprint48 where more than 10 persons from Terzano et al.46 were studied, some with sleep disorders. In spite of big differences between subjects, the turning rate was in all cases tightly connected with the sleep stage. It seems to us that the turning rate could even be used to define sleep depth. That would be better than a catalog of diverse rules. Of course, this requires further study since $\alpha$ depends not only on sleep stage, on $d$, and on the person, but also on the measuring equipment, filtering options of the hardware, etc. REM phases of dreaming are more difficult to classify than sleep depth. Figure 6 shows that the frontal channel Fp2-F4 has higher turning rates in REM than the central channel C4-P4, while outside REM it is the other way round. This effect was not seen for all persons.

FIG. 8.

Slow oscillations of turning rate for four other healthy subjects from the Terzano database. Note the synchrony between channels.

So far, we took disjoint 30 s epochs to determine turning rates. For more detail, we shift the 30 s windows by only 1 s. Equivalently, we can calculate turning rates for 1 s segments and then apply a moving average of length 30, as shown in Fig. 7. The new time series has 30 000 values $\alpha_t$ instead of 1000 and is fairly smooth.

We found clear oscillations of the turning rate with a period of about 1–2 min at the onset of sleep, in light sleep (stage S2), and in the morning before waking. Such infra-slow frequencies are far beyond the scope of conventional Fourier analysis. See Fig. 7 for our subject n5 and Fig. 8 for four other healthy subjects. Both channels (and further channels not presented) show almost synchronous oscillations. This may indicate a slow biorhythm from a central source and deserves further study.

In the previous section, we fixed d = 4 and considered α as a function of time. Now, we ask how the α-functions vary when we change d. Figure 9 illustrates the effects, for d between 2 and 32 ms on the left and for 64 ms up to 1 s on the right. Actually, the α-functions look very similar for most d, and they would agree still better if we took $d = 3,\dots,12$, for instance. Thus, we have a strong self-similarity in the order structure of EEG data.
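Tabulating α over both time and delay only requires a loop over d; a sketch with illustrative delay values, again based on the hypothetical helper above:

```python
# alpha as a function of delay d (rows) and 30 s epoch (columns)
ds = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
alpha_td = np.array([epoch_turning_rates(x, d=d) for d in ds])
```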

FIG. 9. Dependence of turning rates on the delay d, ranging from 2 ms to 1 s.

For d = 1 and d = 2, the number of maxima and minima is very small. This comes from the hardware low-pass filter, which smoothes the measured values, and it will be similar for any physical measurement. The turning rate then increases with increasing d, at least up to d = 32. This is quite natural: in Fig. 7, we marked the maxima for d = 1; for d = 4, there are many more maxima.

With increasing d, the curves become more noisy. (This can be improved by taking windows according to “zero crossings.”) The turning rate for d = 256 and d = 512 is near 2/3, the value for white noise. Here, we see an effect of the high-pass filter, which is also necessary for any physical measurement. It guarantees that the measured values do not drift too far away from the zero line, as they would for a random walk. The high-pass threshold was set to 0.3 Hz by Terzano et al.46 When values move away for one second, the filter tends to pull them back in the next second. In this way, the self-similarity, present in models like Brownian motion and apparently also in EEG, is damped on both sides by the filters in experimental data.

After studying the turning rate for fixed d and sliding t-windows, we now consider the persistence $\tau(d)$, which resembles an autocorrelation function, for $d = 1,\dots,500$ within fixed time epochs of 30 s. Due to the low-pass filter, the time series is rather smooth on small scales, and τ(1) is larger than 0.5, near the maximal value 2/3. For white noise the value would be zero, for Brownian motion 1/6. Probably, the first values of τ must be adjusted so that different measuring devices and hardware filters can be compared. There are no studies in this direction.
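Assuming the definition of persistence used earlier in the paper, $\tau(d) = p_{123}(d) + p_{321}(d) - \frac{1}{3}$, a sketch of one curve in Fig. 10 reads as follows; epoch_data stands for one 30 s epoch of the series and is our own placeholder:

```python
def persistence(x, d):
    """tau(d) = p123 + p321 - 1/3: excess of monotone patterns
    among (x[t], x[t + d], x[t + 2d]) over the i.i.d. value 1/3."""
    a, b, c = x[:-2 * d], x[d:-d], x[2 * d:]
    mono = ((a < b) & (b < c)) | ((a > b) & (b > c))
    return mono.mean() - 1 / 3

# one curve per 30 s epoch, as in Fig. 10
tau_curve = np.array([persistence(epoch_data, d) for d in range(1, 501)])
```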

For small d, the persistence rapidly declines and reaches a minimum, near $d_{\min} = 10$ in the first plot of Fig. 10 and near $d_{\min} = 20$ in the other plots. This parameter seems locked: the minimum was never at d = 15, say. The persistence at the minimum depends on the sleep stage and can lie between 0 and 0.3. For $d \le 100$, there were very few negative values of τ.

FIG. 10. Persistence in different stages. Each plot shows curves for 40 successive epochs of 30 s. The delay d runs from 2 ms to 0.2 s.

To the right of $d_{\min}$, the curve increases and reaches a maximum at $2d_{\min}$. In many curves, a second minimum at $3d_{\min}$ is visible. This indicates that we are seeing the so-called α-rhythm here, with frequency 12.5 Hz: the minimum at $d_{\min} = 20$ corresponds to the half-period of 40 ms, the maximum at $2d_{\min}$ to the full period of 80 ms. This rhythm is always seen in occipital channels, while we had the frontal channel Fp2F4. Here, it was apparent in the morning and disappeared for most of the night, which is not shown in Fig. 10. In deep sleep, τ assumed large values up to d = 400. Thus, we can classify this sleep stage also with large d. We had chosen d = 4, before the first minimum, because the curves vary little in this range.

Each plot in Fig. 10 refers to 40 successive epochs, corresponding to 20 min of measurement. The τ-curves seem rather consistent and interpretable. This was not true for the other contrasts β, γ, δ. Although they often show significant positive and negative values (see Table IV), their curves did not show consistent structure. They started near zero and looked like random walks. It seems that these contrasts express the numerous artefacts in EEG recordings. They may have some meaning, but not as averages over windows of 15 360 values. For instance, the cyclic alternating patterns, which were the subject of the EEG study,46 consist of an increasing oscillation and a decreasing oscillation. In an average of γ over 30 s, the two will cancel out.

TABLE IV. Testing EEG data on serial dependence. Half a million tests were performed, each with eight different test statistics. The null hypothesis says that the data come from an i.i.d. process. Confidence level 95% means that in 95% of tests with i.i.d. data, H0 would be accepted. The percentage of tests for which H0 was actually accepted is very low, even zero for H4 at the 95% level. In the case of two-sided tests, the percentage of significant deviations from H0 to both sides is given.

Test parameter             log24 − H4   log24 − H4   log6 − H3   τ      β     γ     δ     τ4     β4
Confidence level (%)       95           99.9         95          95     95    95    95    95     95
H0 accepted (%)            0            0.002        0.5         6.7    24    36    32    9.2    22
Significantly larger (%)   100          99.998       99.5        80     30    40    37    73     30
Significantly smaller (%)  …            …            …           14     46    24    31    17     49
We calculated the sum of squares of each contrast over all epochs and $d = 1,\dots,500$ and determined their percentages according to (18). We found in the mean
$$\tilde{\tau} = 92\,\%,\quad \tilde{\beta} = 4.5\,\%,\quad \tilde{\gamma} = 1.3\,\%,\quad \tilde{\delta} = 2.3\,\%.$$
Similar values were obtained when we restricted the domain of times and delays. Thus, τ certainly plays the dominant role. To model an EEG, one could start by assuming β = γ = δ = 0.
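A sketch of this bookkeeping, assuming a dict contrasts that maps each contrast name to its array of values over all epochs and delays (the normalization (18) itself is defined in the earlier sections):

```python
# percentage of the total sum of squares carried by each contrast
ss = {name: float(np.sum(vals ** 2)) for name, vals in contrasts.items()}
total = sum(ss.values())
shares = {name: 100 * s / total for name, s in ss.items()}  # e.g. {"tau": 92, ...}
```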

It is often necessary to test a time series against the null hypothesis that the values come from an i.i.d. process. In particular, any model of a process aims to explain the data in such a way that the residuals, the differences between observed and model values, cannot be distinguished from independent random numbers. As mentioned in Sec. V C, such tests on the basis of order patterns were suggested by Matilla-García and Ruiz Marín,12,25 improved by Elsinger,13 and, in this issue, by Sousa and Hlinka15 and Weiß.16 There is a great variety of other tests.12,14

We shall not enter this matter but discuss related questions for our EEG data: could they represent white noise when sampled at some distance d? And will $H_3$, $H_4$, and the pattern contrasts from Sec. III distinguish them from such noise? We have fairly long time series, with T = 15 360 for each of 1009 epochs of 30 s, and we can run the test for each $d = 1,\dots,500$, between 2 ms and 1 s. Thus, we can perform $500 \cdot 1009 = 504\,500$ tests for this single data set n5 from one night alone. Interdependence of tests and correction for multiple testing will be irrelevant for our questions. We are not looking for the few tests which reject the i.i.d. null hypothesis; rejection should be the normal case with sample size T = 15 360. We would rather like to find the few tests which accept the null hypothesis. We use the 95% confidence level to give all parameters a chance. Table IV shows that all of them reject the i.i.d. process in the majority of cases, but there are big differences.

For permutation entropy, we took the critical quantiles for 95% and 99.9% from Table III. For pattern contrasts, we took standard normal quantiles multiplied by $\sigma/\sqrt{T}$, where $\sigma^2$ is taken from (20). Compare Weiß,16 where simulations justified the normal approximation already for much smaller T. We also included the new tests of Weiß, Ruiz Marín, Keller, and Matilla-García17 with two pattern contrasts of length 4, as follows. Let $\beta_4 = p_{1234} - p_{4321}$ and $\tau_4 = p_{1234} + p_{4321} - \frac{1}{12}$. In Ref. 17, it is shown that for an i.i.d. process, $(p_{1234}, p_{4321})$ has an asymptotic normal distribution with covariance matrix $\frac{1}{4032}\begin{pmatrix} 199 & -17 \\ -17 & 199 \end{pmatrix}$. Thus, under the null hypothesis, $\beta_4$ and $\tau_4$ have mean zero and variances $\frac{3}{28}$ and $\frac{13}{144}$, respectively.
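As an illustration of these length-4 tests, here is a sketch of a two-sided z-test for $\beta_4$ and $\tau_4$ at delay d = 1, using the asymptotic variances just stated; the helper function is our own and only meant to show the mechanics, not the authors' implementation. For other delays, one would apply it to the subsampled series.

```python
import numpy as np
from scipy.stats import norm

def length4_tests(x, level=0.95):
    """Two-sided z-tests of beta4 and tau4 against the i.i.d. null."""
    a, b, c, e = x[:-3], x[1:-2], x[2:-1], x[3:]
    p1234 = np.mean((a < b) & (b < c) & (c < e))
    p4321 = np.mean((a > b) & (b > c) & (c > e))
    T = len(a)                          # number of length-4 patterns
    beta4 = p1234 - p4321
    tau4 = p1234 + p4321 - 1 / 12
    z_beta = beta4 / np.sqrt((3 / 28) / T)    # asymptotic variance 3/28
    z_tau = tau4 / np.sqrt((13 / 144) / T)    # asymptotic variance 13/144
    zc = norm.ppf(1 - (1 - level) / 2)        # e.g. 1.96 for 95%
    return {"beta4": (beta4, abs(z_beta) > zc),
            "tau4": (tau4, abs(z_tau) > zc)}
```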

Table IV shows that permutation entropy has much better power than the contrasts in these serial dependence tests. Even $H_3$ is much better than τ, although simulations of Weiß16 indicate that τ can perform better for small T. It seems that for large T, the $1/T$-scaling of $H_m$ becomes superior to the $1/\sqrt{T}$-scaling of the normal distribution. For our EEG data, all pattern contrasts show significant deviations from an i.i.d. process to both sides. Of course, this can change when we consider special conditions, for instance deep sleep, where τ and $\tau_4$ are large (small turning rate, see Sec. VI C). In this connection, we note that for $d = 1,\dots,50$, all 50 450 tests with $\tau_4$ gave a significant positive deviation from the null hypothesis. This may be due to the influence of the hardware low-pass filter, however. Also, τ accepted $H_0$ in less than 0.1% of the tests with $d \le 50$.

These results support the original suggestion of Matilla-García and Ruiz Marín12 to use $H_4$ as the test quantity. By definition, it can detect all differences in pattern frequencies for $m \le 4$. On the confidence level of 99.9%, there were only 9 of 504 500 tests where $H_0$ was accepted.

Thus, there is not the slightest chance to find an i.i.d. process in the EEG. Such data represent another type of noise, and new models are needed to understand them. For short time series with $T \le 300$, for example, monthly economic data, we can only estimate patterns of length 3, and the use of an i.i.d. model is justified in such a context. For high-resolution measurements, however, we should work with patterns of length 4 and should develop better models of noise so that we can dismiss the i.i.d. process.

The six patterns of length 3 are naturally transformed into four independent contrasts. The turning rate α quantifies roughness, and the parameters β , γ , δ describe deviations from symmetry and reversibility. For symmetric processes, to which some EEG series belong, α controls the whole order structure of length 3. In an EEG, α is taken as a function of time and delay, and its time oscillations seem to indicate an infra-slow rhythm of the brain.

The next step is to study patterns of length 4. Theory and careful study of large data sets must go hand in hand. Two contrasts τ 4 and β 4 were suggested in a new paper by Weiß et al.,17 which continues the recent development of statistical tests with order patterns.13,15,16,25 The field is active and driven by new applications. It will form its own theory in the coming years.

I am grateful to Karsten and also to Jose Amigo and Osvaldo Rosso for the wonderful workshop in Dresden. My thanks go to Bernd Pompe for fruitful discussions and to both referees of the paper for their competent comments.

The author has no conflicts to disclose.

Christoph Bandt: Conceptualization (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal).

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

1. J. Amigo, K. Keller, and J. Kurths, “Recent progress in symbolic dynamics and permutation complexity. Ten years of permutation entropy,” Eur. Phys. J. Spec. Top. 222, 247–257 (2013).
2. M. Zanin, L. Zunino, O. A. Rosso, and D. Papo, “Permutation entropy and its main biomedical and econophysics applications: A review,” Entropy 14, 1553–1577 (2012).
3. J. M. Amigo, Permutation Complexity in Dynamical Systems, Springer Series in Synergetics (Springer, Berlin, 2010).
4. K. Archer and S. Elizalde, “Cyclic permutations realized by signed shifts,” J. Comb. 5(1), 1–30 (2014).
5. T. Gutjahr and K. Keller, “Ordinal pattern-based entropies and the Kolmogorov–Sinai entropy: An update,” Entropy 22, 63 (2020).
6. C. Bandt and F. Shiha, “Order patterns in time series,” J. Time Ser. Anal. 28(5), 646–665 (2007).
7. M. Sinn and K. Keller, “Estimation of ordinal pattern probabilities in Gaussian processes with stationary increments,” Comput. Stat. Data Anal. 55, 1781–1790 (2011).
8. A. Betken, J. Buchsteiner, H. Dehling, I. Munker, A. Schnurr, and J. Woerner, “Ordinal patterns in long-range dependent time series,” Scand. J. Stat. 48(3), 969–1000 (2021).
9. A. Schnurr, “An ordinal pattern approach to detect and to model leverage effects and dependence structures between financial time series,” Stat. Pap. 55(4), 919–931 (2014).
10. A. Schnurr and H. Dehling, “Testing for structural breaks via ordinal pattern dependence,” J. Am. Stat. Assoc. 112, 706–720 (2017).
11. I. Nüßgen and A. Schnurr, “Ordinal pattern dependence in the context of long-range dependence,” Entropy 23, 670 (2021).
12. M. Matilla-García and M. Ruiz Marín, “A non-parametric independence test using permutation entropy,” J. Econom. 144(1), 139–155 (2008).
13. H. Elsinger, “Independence tests based on symbolic dynamics,” Working Paper No. 165 (Österreichische Nationalbank, 2010).
14. F. López, M. Matilla-García, J. Mur, and M. Ruiz Marín, “Statistical tests of symbolic dynamics,” Mathematics 9, 817 (2021).
15. A. M. Y. Rios de Sousa and J. Hlinka, “Assessing serial dependence in ordinal patterns processes using chi-squared tests with application to EEG data analysis,” Chaos 32, 073126 (2022).
16. C. H. Weiß, “Non-parametric tests for serial dependence in time series based on asymptotic implementations of ordinal-pattern statistics,” Chaos 32, 093107 (2022).
17. C. H. Weiß, M. Ruiz Marín, K. Keller, and M. Matilla-García, “Non-parametric analysis of serial dependence in time series using ordinal patterns,” Comput. Stat. Data Anal. 168, 107381 (2022).
18. K. Keller, Invariant Factors, Julia Equivalences and the (Abstract) Mandelbrot Set, Lecture Notes in Mathematics, Vol. 1732 (Springer, Berlin, 2000).
19. C. Bandt, “Permutation entropy and order patterns in long time series,” in Time Series Analysis and Forecasting, Contributions to Statistics, edited by I. Rojas and H. Pomares (Springer, 2015).
20. C. Bandt, “A new kind of permutation entropy used to classify sleep stages from invisible EEG microstructure,” Entropy 19(5), 197 (2017).
21. C. Bandt, “Small order patterns in big time series: A practical guide,” Entropy 21(6), 613 (2019).
22. A. Diaz-Ruelas, “A combinatorial view of stochastic processes: White noise,” Chaos 32, 123136 (2022).
23. C. Bandt and K. Wittfeld, “Two new parameters for the ordinal analysis of images,” arXiv:2212.14643.
24. I. J. Bienaymé, “Application d’un théorème nouveau au calcul des probabilités,” Bull. Sci. Math. Astron. 9, 219–225 (1875).
25. M. Matilla-García, J. Morales, and M. Ruiz, “Testing heteroskedasticity of unknown form using symbolic dynamics,” Eur. Phys. J. Spec. Top. 222, 317–332 (2013).
26. K. Keller and M. Sinn, “Ordinal analysis of time series,” Physica A 356(1), 114–120 (2005).
27. K. Keller, M. Sinn, and J. Emonds, “Time series from the ordinal viewpoint,” Stochastics Dyn. 7(2), 247–272 (2007).
28. G. R. Grimmett and D. R. Stirzaker, Probability and Random Processes (Oxford University Press, Oxford, 1995).
29. J. H. Martínez, J. L. Herrera-Diestra, and M. Chavez, “Detection of time reversibility in time series by ordinal patterns analysis,” Chaos 28(12), 123111 (2018).
30. M. Zanin, A. Rodríguez-González, E. Menasalvas Ruiz, and D. Papo, “Assessing time series reversibility through permutation patterns,” Entropy 20(9), 665 (2018).
31. S. Elizalde and M. Martinez, “The frequency of pattern occurrence in random walks,” DMTCS Proc. FPSAC’15, 217–228 (2015).
32. C. Bandt, “Autocorrelation type functions for big and dirty data series,” arXiv:1411.3904.
33. C. Bandt, “Order patterns, their variation and change points in financial time series and Brownian motion,” Stat. Pap. 55(4), 919–931 (2019).
34. C. Bandt and B. Pompe, “Permutation entropy: A natural complexity measure for time series,” Phys. Rev. Lett. 88, 174102 (2002).
35. U. Parlitz, S. Berg, S. Luther, A. Schirdewan, J. Kurths, and N. Wessel, “Classifying cardiac biosignals using ordinal pattern statistics and symbolic dynamics,” Comput. Biol. Med. 42(3), 319–327 (2012).
36. R. Rosenthal and R. Rosnow, Contrast Analysis (Cambridge University Press, 1985).
37. M. Saeidi, W. Karwowski, K. Fiok, R. Taiar, P. A. Hancock, and A. Al-Juaid, “Neural decoding of EEG signals with machine learning: A systematic review,” Brain Sci. 11, 1525 (2021).
38. K. Keller and H. Lauffer, “Symbolic analysis of high-dimensional time series,” Int. J. Bifurc. Chaos 13, 2657–2668 (2003).
39. O. A. Rosso, M. T. Martin, A. Figliola, K. Keller, and A. Plastino, “EEG analysis using wavelet-based information tools,” J. Neurosci. Methods 153, 163–182 (2006).
40. K. Keller, A. M. Unakafov, and V. A. Unakafova, “Ordinal patterns, entropy, and EEG,” Entropy 16(12), 6212–6239 (2014).
41. J. M. Amigo, K. Keller, and V. A. Unakafova, “Ordinal symbolic analysis and its application to biomedical recordings,” Philos. Trans. R. Soc. A 373, 20140091 (2015).
42. M. Staniek and K. Lehnertz, “Symbolic transfer entropy,” Phys. Rev. Lett. 100, 158101 (2008).
43. E. Olofsen, J. W. Sleigh, and A. Dahan, “Permutation entropy of the electroencephalogram: A measure of anaesthetic drug effect,” Br. J. Anaesth. 101(6), 810–821 (2008).
44. J. M. Diaz, D. M. Mateos, and C. Boyallian, “Complexity-entropy maps as a tool for the characterization of the clinical electrophysiological evolution of patients under pharmacological treatment with psychotropic drugs,” Entropy 19, 540 (2017).
45. F. C. Morabito, D. Labate, F. La Foresta, A. Bramanti, G. Morabito, and I. Palamara, “Multivariate multi-scale permutation entropy for complexity analysis of Alzheimer’s disease EEG,” Entropy 14, 1186–1202 (2012).
46. M. G. Terzano, L. Parrino, A. Sherieri, R. Chervin, S. Chokroverty, C. Guilleminault, M. Hirshkowitz, M. Mahowald, H. Moldofsky, A. Rosa, R. Thomas, and A. Walters, “Atlas, rules, and recording techniques for the scoring of cyclic alternating pattern (CAP) in human sleep,” Sleep Med. 2(6), 537–553 (2001).
47. A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. Ch. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley, “PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals,” Circulation 101(23), e215–e220 (2000).
48. C. Bandt, “Crude EEG parameter provides sleep medicine with well-defined continuous hypnograms,” arXiv:1710.00559 (2018).