Virtually all questions that one can ask about the behavioral and structural complexity of a stochastic process reduce to a linear algebraic framing of a time evolution governed by an appropriate hidden-Markov process generator. Each type of question—correlation, predictability, predictive cost, observer synchronization, and the like—induces a distinct generator class. Answers are then functions of the class-appropriate transition dynamic. Unfortunately, these dynamics are generically nonnormal, nondiagonalizable, singular, and so on. Tractably analyzing these dynamics relies on adapting the recently introduced meromorphic functional calculus, which specifies the spectral decomposition of functions of nondiagonalizable linear operators, even when the function's poles and zeros coincide with the operator's spectrum. Along the way, we establish special properties of the spectral projection operators that demonstrate how they capture the organization of subprocesses within a complex system. Circumventing the spurious infinities of alternative calculi, this leads in the sequel, Part II [P. M. Riechers and J. P. Crutchfield, Chaos 28, 033116 (2018)], to the first closed-form expressions for complexity measures, couched either in terms of the Drazin inverse (negative-one power of a singular operator) or the eigenvalues and projection operators of the appropriate transition dynamic.

For well over a century, science compared the randomness in physical systems via their temperatures or their thermodynamic entropies. These are measures of energy disorder. Using them, we say that one system is more random (hotter or more entropic) than another. Curiously, even today, we do not know how to compare two physical systems in terms of how organized they are. Such comparisons are particularly important when the systems of interest do not have well-defined energies—as found across mathematics and the sciences, from abstract dynamical systems to economic and social systems. This is what the endeavor of exploring complexity measures addresses: developing quantities that allow one to compare how nonlinear systems are structured, how they store and process information, and how they intrinsically compute. To date, complexity measures have been estimated empirically from experimental measurements, from large-scale simulations that generate synthetic data, or theoretically in the very few cases that are analytically tractable. We show that this arduous and limited state of affairs is no longer necessary, if one can theoretically deduce or empirically estimate a statistical representation called a hidden Markov model (HMM). We provide analytic, closed-form expressions for almost all complexity measures of processes generated by hidden Markov models.

Complex systems—that is, many-body systems with strong interactions—are usually observed through low-resolution feature detectors. The consequence is that their hidden structure is, at best, only revealed over time. Since individual observations cannot capture the full resolution of each degree of freedom, let alone a sufficiently full set of them, the measurement time series often appear stochastic and non-Markovian, exhibiting long-range correlations. Empirical challenges aside, restricting to the purely theoretical domain, even finite systems can appear quite complicated. Despite admitting finite descriptions, stochastic processes with sofic support, to take one example, exhibit infinite-range dependencies among the chain of random variables they generate.1 While such infinite-correlation processes are legion in complex physical and biological systems, even approximately analyzing them is generally appreciated as difficult, if not impossible. Generically, even finite systems lead to uncountably infinite sets of predictive features.2 These facts seem to put physical sciences' most basic goal—prediction—out of reach.

We aim to show that this direct, but sobering conclusion is too bleak. Rather, there is a collection of constructive methods that address the hidden structure and the challenges associated with predicting complex systems. This follows up on our recent introduction of a functional calculus that uncovered new relationships among supposedly different complexity measures3 and that demonstrated the need for a generalized spectral theory to answer such questions.4 Those efforts yielded elegant, closed-form solutions for complexity measures that, when compared, offered insight into the overall theory of complexity measures. Here, providing the necessary background for and greatly expanding those results, we show that different questions regarding correlation, predictability, and prediction each require their own analytical structures, expressed as various kinds of hidden transition dynamic. The resulting transition dynamic among hidden variables summarizes symmetry breaking, synchronization, and information processing, for example. Each of these metadynamics, though, is built up from the original given system.

The shift in perspective that allows the new level of tractability begins by recognizing that—beyond their ability to generate many sophisticated processes of interest—hidden Markov models can be treated as exact mathematical objects when analyzing the processes they generate. Crucially, and especially when addressing nonlinear processes, most questions that we ask imply a linear transition dynamic over some hidden state space. Speaking simply, something happens, then it evolves linearly in time, then we snapshot a selected characteristic. This broad type of sequential questioning cascades, in the sense that the influence of the initial preparation cascades through state space as time evolves, affecting the final measurement. Alternatively, other, complementary kinds of questioning involve accumulating such cascades. The linear algebra underlying either kind is highlighted in Table I in terms of an appropriate discrete-time transition operator T or a continuous-time generator G of time evolution.

TABLE I.

Having identified the hidden linear dynamic, either a discrete-time operator T or a continuous-time operator G, quantitative questions tend to be of either cascading or accumulating type. What changes between distinct questions are the dot products with the initial setup $\langle\cdot|$ and the final observations $|\cdot\rangle$.

Linear algebra underlying complexity

Question type | Discrete time | Continuous time
Cascading | $\langle\cdot|\,T^L\,|\cdot\rangle$ | $\langle\cdot|\,e^{tG}\,|\cdot\rangle$
Accumulating | $\langle\cdot|\,\bigl(\sum_L T^L\bigr)\,|\cdot\rangle$ | $\langle\cdot|\,\bigl(\int e^{tG}\,dt\bigr)\,|\cdot\rangle$

In this way, deploying linear algebra to analyze complex systems relies on identifying an appropriate hidden state space. And, in turn, the latter depends on the genre of the question. Here, we focus on closed-form expressions for a process' complexity measures. This determines what the internal system setup $\langle\cdot|$ and the final detection $|\cdot\rangle$ should be. We show that complexity questions fall into three subgenres and, for each of these, we identify the appropriate linear dynamic and closed-form expressions for several of the key questions in each genre. See Table II. The burden of the following is to explain the table in detail. We return to a much-elaborated version at the end.

TABLE II.

Question genres (leftmost column) about process complexity listed with increasing sophistication. Each genre implies a different linear transition dynamic (rightmost column). Observational questions concern the superficial, given dynamic. Predictability questions are about the observation-induced dynamic over distributions; that is, over states used to generate the superficial dynamic. Prediction questions address the dynamic over distributions over a process' causally equivalent histories. Generation questions concern the dynamic over any nonunifilar presentation M and observation-induced dynamics over its distributions. MSP is the mixed-state presentation.

Questions and their linear dynamics

Genre | Measures | Hidden dynamic
Observation | Correlations $\gamma(L)$; power spectra $P(\omega)$ | HMM matrix $T$
Predictability | Myopic entropy $h_\mu(L)$; excess entropy $\mathbf{E}$, $E(\omega)$ | HMM MSP matrix $W$
Prediction | Causal synchrony $C_\mu$, $\mathcal{H}^+(L)$; $\mathbf{S}$, $S(\omega)$ | $\epsilon$-Machine MSP matrix $\mathcal{W}$
Generation | State synchrony $C(M)$; $\mathcal{H}(L)$, $\mathcal{S}$ | Generator MSP matrix

Associating observables $x \in \mathcal{A}$ with transitions between hidden states $s \in \mathcal{S}$ gives a hidden Markov model (HMM) with observation-labeled transition matrices $\{T^{(x)} : T^{(x)}_{i,j} = \Pr(x, s_j \,|\, s_i)\}_{x \in \mathcal{A}}$. They sum to the row-stochastic state-to-state transition matrix $T = \sum_{x \in \mathcal{A}} T^{(x)}$. (The continuous-time versions are similarly defined, which we do later on.) Adding measurement symbols $x \in \mathcal{A}$ this way—to transitions—can be considered a model of measurement itself.5 The efficacy of our choice will become clear.
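As a concrete sketch of this bookkeeping (the two-state machine and its transition probabilities below are illustrative assumptions, not an example taken from the text), each symbol $x$ gets its own labeled matrix $T^{(x)}$, and their sum must be row-stochastic:

```python
import numpy as np

# Hypothetical symbol-labeled transition matrices T^{(x)} for a two-state HMM
# over the alphabet {0, 1}; entry T^{(x)}[i, j] = Pr(x, s_j | s_i).
T_sym = {
    0: np.array([[0.5, 0.0],
                 [0.0, 0.0]]),
    1: np.array([[0.0, 0.5],
                 [1.0, 0.0]]),
}

# Marginalizing over symbols gives the state-to-state transition matrix T ...
T = sum(T_sym.values())

# ... which must be row-stochastic: every row sums to one.
assert np.allclose(T.sum(axis=1), 1.0)
```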

It is important to note that HMMs, in continuous and discrete time, arise broadly in the sciences, from quantum mechanics,6,7 statistical mechanics,8 and stochastic thermodynamics9–11 to communication theory,12,13 information processing,14–16 computer design,17 population and evolutionary dynamics,18,19 and economics. Thus, HMMs appear in the most fundamental physics and in the most applied engineering and social sciences. The breadth suggests that the thorough-going HMM analysis developed here is worth the required effort to learn.

Since complex processes have highly structured, directional transition dynamics—T or G—we encounter the full richness of matrix algebra in analyzing HMMs. We explain how analyzing complex systems induces a nondiagonalizable metadynamics, even if the original dynamic is diagonalizable in its underlying state-space. Normal and diagonalizable restrictions, so familiar in mathematical physics, simply fail us here.

The diversity of nondiagonalizable dynamics presents a technical challenge, though. A new calculus for functions of nondiagonalizable operators—e.g., $T^L$ or $e^{tG}$—becomes a necessity if one's goal is an exact analysis of complex processes. Moreover, complexity measures naively and easily lead one to consider illegal operations. Taking the inverse of a singular operator is a particularly central, useful, and fraught example. Fortunately, such illegal operations can be skirted since the complexity measures only extract the excess transient behavior of an infinitely complicated orbit space.

To explain how this arises—how certain modes of behavior, such as excess transients, are selected as relevant, while others are ignored—we apply the meromorphic functional calculus and new results for spectral projection operators recently derived in Ref. 4 to analyze complex processes generated by HMMs.

The following shows that this leads to a simplified spectral theory of weighted directed graphs, that even nondiagonalizable eigenspaces can be manipulated individually, and that, more specifically, the techniques can be applied to the challenges of prediction. The results developed here greatly extend and (finally) explain those announced in Ref. 3. The latter introduced the basic methods and results by narrowly focusing on closed-form expressions for several measures of intrinsic computation, applying them to prototype complex systems.

The meromorphic functional calculus, summarized in detail later, concerns functions of nondiagonalizable operators when poles (or zeros) of the function of interest coincide with poles of the operator's resolvent—poles that appear precisely at the eigenvalues of the transition dynamics. Pole–pole and pole–zero interactions transform the complex-analysis residues within the functional calculus. One notable result is that the negative-one power of a singular operator exists in the meromorphic functional calculus. We derive its form, note that it is the Drazin inverse, and show how widely useful and common it is.

For example, the following gives the first closed-form expressions for many complexity measures in wide use—many of which turn out to be expressed most concisely in terms of a Drazin inverse. Furthermore, spectral decomposition gives insight into the subprocesses of a complex system in terms of the spectral projection operators of the appropriate transition dynamic.

In the following, we emphasize that when we observe processes generated by a source capable of even the simplest computations, much of the predictable structure lies beyond pairwise correlation. We clarify how different measures of complexity quantify and distinguish nuanced aspects of what is predictable and what is necessary for prediction. We then give closed-form solutions for this quantification, resulting in a new level of rigor, tractability, and insight.

Sections II and III briefly review the relevant background in stochastic processes, the HMMs that generate them, and complexity measures. Several classes of HMMs are discussed in Sec. III. Mixed-state presentations (MSPs)—HMM generators of a process that also track distributions induced by observation—are reviewed in Sec. IV. They are key to complexity measures within an information-theoretic framing. Section V then shows how each complexity measure reduces to the linear algebra of an appropriate HMM adapted to the question genre.

To make progress at this point, we summarize the meromorphic functional calculus in Sec. VI. Several of its mathematical implications are discussed in relation to projection operators in Sec. VII and a spectral weighted directed graph theory is presented in Sec. VIII.

With this all set out, the sequel, Part II,79 finally derives the promised closed-form complexities of a process and outlines common simplifications for special cases. This leads to the discovery of the symmetry collapse index, which indicates the sophistication of finite computational structures hidden in infinite-Markov-order processes. Leveraging the functional calculus, Part II79 introduces a novel extension—the complexity-measure frequency spectrum—and shows how to calculate it in closed form. It provides a suite of examples to ground the theoretical developments and works through a pedagogical example in depth.

We first describe a system of interest in terms of its observed behavior, following the approach of computational mechanics, as reviewed in Ref. 20. Again, a process is the collection of behaviors that the system produces and their probabilities of occurring. A process's behaviors are described via a bi-infinite chain of random variables, denoted by capital letters $\ldots X_{t-2} X_{t-1} X_t X_{t+1} X_{t+2} \ldots$. A realization is indicated by lowercase letters $\ldots x_{t-2} x_{t-1} x_t x_{t+1} x_{t+2} \ldots$. We assume values $x_t$ belong to a discrete alphabet $\mathcal{A}$. We work with blocks $X_{t:t'}$, where the first index is inclusive and the second exclusive: $X_{t:t'} = X_t X_{t+1} \cdots X_{t'-1}$. Block realizations $x_{t:t'}$ we often refer to as words $w$. At each time $t$, we can speak of the past $X_{:t} = \cdots X_{t-2} X_{t-1}$ and the future $X_{t:} = X_t X_{t+1} \cdots$.

A process's probabilistic specification is a density over these chains: $\Pr(X_{:})$. Practically, we work with finite blocks and their probability distributions $\Pr(X_{t:t'})$. To simplify the development, we primarily analyze stationary, ergodic processes: those for which $\Pr(X_{t:t+L}) = \Pr(X_{0:L})$ for all $t \in \mathbb{Z}$ and $L \in \mathbb{Z}^+$. In such cases, we only need to consider a process's length-$L$ word distributions $\Pr(X_{0:L})$.

A common first step to understanding how processes express themselves is to analyze correlations among observables. Pairwise correlation in a sequence of observables is often summarized by the autocorrelation function

$$ \gamma(L) = \bigl\langle \overline{X_t}\, X_{t+L} \bigr\rangle_t \,, $$

where the bar above $X_t$ denotes its complex conjugate and the angle brackets denote an average over all times $t$. Alternatively, the structure in a stochastic process is often summarized by the power spectral density, also referred to more simply as the power spectrum

where ω is the angular frequency.21 Though a basic fact, it is not always sufficiently emphasized in applications that power spectra capture only pairwise correlation. Indeed, it is straightforward to show that the power spectrum P(ω) is the windowed Fourier transform of the autocorrelation function γ(L). That is, power spectra describe how pairwise correlations are distributed across frequencies. Power spectra are common in signal processing, both in technological settings and physical experiments.22 As a physical example, diffraction patterns are the power spectra of a sequence of structure factors.23 
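As a minimal numerical illustration of that relationship (the sample sequence and window length below are assumptions made for the sketch, not data from the text), one can estimate $\gamma(L)$ from a realization and then form a truncated Fourier sum over lags to estimate the power spectrum:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.choice([0.0, 1.0], size=10_000)   # an illustrative binary time series
x = x - x.mean()                          # remove the mean to isolate fluctuations

# Autocorrelation estimate: gamma(L) = < conj(x_t) * x_{t+L} >_t, for lags 0..N-1.
N = 100
gamma = np.array([np.mean(np.conj(x[:len(x) - L]) * x[L:]) for L in range(N)])

# Windowed Fourier transform of gamma(L): an estimate of the power spectrum P(omega).
# For a real-valued series gamma(-L) = gamma(L), so the two-sided sum collapses.
omegas = np.linspace(0.0, np.pi, 256)
lags = np.arange(1, N)
P = np.array([gamma[0] + 2.0 * np.real(np.sum(gamma[1:] * np.exp(-1j * w * lags)))
              for w in omegas])
```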

Other important measures of observable organization, the Green–Kubo coefficients, determine transport properties in near-equilibrium thermodynamic systems—but are rather more application-specific.24,25 These coefficients reflect the idea that dissipation depends on the correlation structure. They usually appear as integrals of the autocorrelation of derivatives of observables. A change of observables, however, turns this into an integration of a standard autocorrelation function. Green–Kubo transport coefficients then involve the limit $\lim_{\omega \to 0} P(\omega)$ for the process of appropriate observables.

One theme in the following is that, though widely used, correlation functions and power spectra give an impoverished view of a process's structural complexity, since they only consider ensemble averages over pairwise events. Moreover, creating a list of higher-order correlations is an impractical way to summarize complexity, as seen in the connected correlation functions of statistical mechanics.26 

Information measures, in contrast, can involve all orders of correlation and thus help to go beyond pairwise correlation in understanding, for example, how a process' past behavior affects predicting it at later times. Information theory, as developed for general complex processes,1 provides a suite of quantities that capture prediction properties using variants of Shannon's entropy H[·] and mutual information I[·;·]13 applied to sequences. Each measure answers a specific question about a process' predictability. For example:

  • How much information is contained in the words generated? The block entropy1 
  • How random is a process? Its entropy rate12 
  • For dynamical systems with a continuous phase-space $\mathcal{B}$, the metric entropy, also known as Kolmogorov–Sinai (KS) entropy, is the supremum of entropy rates induced by partitioning $\mathcal{B}$ into different finite alphabets $\mathcal{A}$.27

  • How is the irreducible randomness hμ approached? Via the myopic entropy rates28 
  • How much of the future can be predicted? Its excess entropy, which is the past–future mutual information [Ref. 1, and references therein]

    E has also been investigated in ergodic theory29 and under the names stored information,30 effective measure complexity,31 and predictive information.32

  • How much information must be extracted to know its predictability and so see its intrinsic randomness hμ? Its transient information1 
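For reference, the standard definitions behind the quantities in this list can be stated compactly (this is a restatement following Ref. 1's conventions, writing the block entropy as $H(L)$ and the transient information as $\mathbf{T}$ to avoid confusion with the transition operator $T$):

$$ H(L) \equiv H[X_{0:L}] \,, \qquad h_\mu \equiv \lim_{L \to \infty} \frac{H(L)}{L} \,, \qquad h_\mu(L) \equiv H(L) - H(L-1) \,, $$

$$ \mathbf{E} \equiv I[X_{:0}; X_{0:}] = \lim_{L \to \infty} \bigl[ H(L) - h_\mu L \bigr] \,, \qquad \mathbf{T} \equiv \sum_{L=0}^{\infty} \bigl[ \mathbf{E} + h_\mu L - H(L) \bigr] \,. $$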

The spectral approach, our subject, naturally leads to allied, but new information measures. To give a sense, later we introduce the excess entropy spectrum $E(\omega)$. It completely, yet concisely, summarizes the structure of myopic entropy reduction, in a way similar to how the power spectrum completely describes autocorrelation. However, while the power spectrum summarizes only pairwise linear correlation, the excess entropy spectrum captures all orders of nonlinear dependency between random variables, making it an incisive probe of hidden structure.

Before leaving the measures related to predictability, we must also point out that they have important refinements—measures that lend a particularly useful, even functional, interpretation. These include the bound, ephemeral, elusive, and related informations.33,34 Though amenable to the spectral methods of the following, we leave their discussion for another venue. Fortunately, their spectral development is straightforward, but would take us beyond the minimum necessary presentation to make good on the overall discussion of spectral decomposition.

Process predictability measures, as just enumerated, certainly say much about a process' intrinsic information processing. They leave open, though, the question of the structural complexity associated with implementing prediction. This challenge entails a complementary set of measures that directly address the inherent complexity of actually predicting what is predictable. For that matter, how cryptic is a process?

Computational mechanics describes minimal-memory maximal prediction—using the minimal memory necessary to predict everything that is predictable about the future—via a process' hidden, effective, or causal states and transitions, as summarized by the process's $\epsilon$-machine.20 A causal state $\sigma \in \mathcal{S}^+$ is an equivalence class of histories $x_{:0}$ that all yield the same probability distribution over observable futures $X_{0:}$. Therefore, knowing a process's current causal state—that $S_0^+ = \sigma$, say—is sufficient for maximal prediction.

The computational mechanics framework can also be related to several more recent attempts at describing effective levels of complex systems. For example, if individual histories are taken to be the microstates of a stochastic process, then causal states are the minimal high-level description of a stochastic process that satisfies the informational closure criterion of Ref. 35.

Computational mechanics provides an additional suite of quantities that capture the overhead of prediction, again using variants of Shannon's entropy and mutual information applied to the ϵ-machine. Each also answers a specific question about an observer's burden of prediction. For example:

  • How much historical information must be stored for maximal prediction? The Shannon information in the causal states or statistical complexity36 
  • How unpredictable is a causal state upon observing a process for duration L? The myopic causal-state uncertainty1 
  • How much information must an observer extract to synchronize to—that is, to know with certainty—the causal state? The optimal predictor's synchronization information1 
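In the same spirit, and consistent with the expressions given later in Sec. V (where $\mathcal{H}^+(L)$ and $\mathbf{S}$ reappear), these prediction-overhead measures can be restated compactly:

$$ C_\mu \equiv H[\mathcal{S}^+] = \mathcal{H}^+(0) \,, \qquad \mathcal{H}^+(L) \equiv H[S_0^+ \,|\, X_{-L:0}] \,, \qquad \mathbf{S} \equiv \sum_{L=0}^{\infty} \mathcal{H}^+(L) \,. $$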

Paralleling the purely informational suite of the previous Sec. II B, we later introduce the optimal synchronization spectrum $S(\omega)$. It completely and concisely summarizes the frequency distribution of state-uncertainty reduction, similar to how the power spectrum $P(\omega)$ completely describes autocorrelation and the excess entropy spectrum $E(\omega)$ the myopic entropy reduction. Helpfully, the above optimal prediction measures can be found from the optimal synchronization spectrum.

The structural complexities monitor an observer's burden in optimally predicting a process. And so, they have practical relevance whenever an intelligent artificial or biological agent must take advantage of a structured stochastic environment—a Maxwellian Demon exploiting correlated environmental fluctuations,37 prey avoiding easy prediction, or an investor profiting from stock market volatility come to mind.

Prediction has many natural generalizations. For example, since maximal prediction often requires infinite resources, sub-maximal prediction (i.e., predicting with lower fidelity) is of practical interest. Fortunately, there are principled ways to investigate the tradeoffs between predictive accuracy and computational burden.2,38–40 As another example, maximal prediction in the presence of noisy or irregular observations can be investigated with a properly generalized framework; see Ref. 41. Blending the existing tools, resource-limited prediction under such observational constraints can also be investigated. There are also many applications where prediction is relevant to the task at hand, but is not necessarily the ultimate objective; this of course has a long history, and Ref. 42 has recently tried to formalize this effort. In all of these settings, information measures similar to those listed above are key to understanding and quantifying the tradeoffs arising in prediction.

Having highlighted the difference between prediction and predictability, we can appreciate that some processes hide more internal information—are more cryptic—than others. It turns out, this can be quantified. The crypticity $\chi = C_\mu - \mathbf{E}$ is the difference between the process's stored information $C_\mu$ and the mutual information $\mathbf{E}$ shared between past and future observables.43 Operationally, crypticity contrasts predictable information content $\mathbf{E}$ with an observer's minimal stored-memory overhead $C_\mu$ required to make predictions. To predict what is predictable, therefore, an optimal predictor must account for a process's crypticity.

How does a physical system produce its output process? This depends on many details. Some systems employ vast internal mechanistic redundancy, while others under constraints have optimized internal resources down to a minimally necessary generative structure. Different pressures give rise to different kinds of optimality. For example, minimal state-entropy generators turn out to be distinct from minimal state-set generators.44–46 The challenge then is to develop ways to monitor differences in the generative mechanism.47 

Any generative model1,48 $M$ with state-set $\mathcal{S}$ has a statistical state complexity (state entropy): $C(M) = H[\mathcal{S}]$. Consider the corresponding myopic state-uncertainty given $L$ sequential observations

And so

We also have the asymptotic uncertainty $\mathcal{H} \equiv \lim_{L \to \infty} \mathcal{H}(L)$. Related, there is the excess synchronization information

Such quantities are relevant even when an observer never fully synchronizes to a generative state; i.e., even when $\mathcal{H} > 0$. Finite-state $\epsilon$-machines always synchronize49,50 and so their $\mathcal{H}$ vanishes.
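A compact restatement of these generator-oriented quantities (the form given for $\mathcal{S}$ mirrors the excess-entropy construction and is an assumption to that extent; the rest follows the definitions in the text and Sec. V):

$$ \mathcal{H}(L) \equiv H[S_0 \,|\, X_{-L:0}] \,, \qquad C(M) = H[\mathcal{S}] = \mathcal{H}(0) \,, \qquad \mathcal{H} \equiv \lim_{L \to \infty} \mathcal{H}(L) \,, \qquad \mathcal{S} \equiv \sum_{L=0}^{\infty} \bigl[ \mathcal{H}(L) - \mathcal{H} \bigr] \,. $$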

Since many different mechanisms can generate a given process, we need useful bounds on the statistical state complexity of possible process generators. For example, the minimal generative complexity $C_g = \min C(M)$, where we minimize over all models that generate the process, is the minimal state-information a physical system must store to generate its future.46 The predictability and the statistical complexities bound each other

$$ \mathbf{E} \leq C_g \leq C_\mu \,. $$

That is, the predictable future information $\mathbf{E}$ is less than or equal to the information $C_g$ necessary to produce the future, which, in turn, is less than or equal to the information $C_\mu$ necessary to predict the future.1,44–47 Such relationships have been explored even for quantum generators of (classical) stochastic processes [Ref. 51, and references therein].

Up to this point, the development focused on introducing and interpreting various information and complexity measures. It was not constructive, in that there was no specification of how to calculate these quantities for a given process. Doing so requires models or, in the vernacular, a presentation of a process. Fortunately, a common mathematical representation describes a wide class of process generators: the edge-labeled hidden Markov models (HMMs), also known as Mealy HMMs.48,52 Using these as our preferred presentations, we will first classify them and then describe how to calculate the information measures of the processes they generate.

Definition 1. A finite-state, edge-labeled hidden Markov model $M = \{\mathcal{S}, \mathcal{A}, \{T^{(x)}\}_{x \in \mathcal{A}}, \eta_0\}$ consists of:

  • A finite set of hidden states $\mathcal{S} = \{s_1, \ldots, s_M\}$. $S_t$ is the random variable for the hidden state at time $t$.

  • A finite output alphabet $\mathcal{A}$.

  • A set of $M \times M$ symbol-labeled transition matrices $\{T^{(x)}\}_{x \in \mathcal{A}}$, where $T^{(x)}_{i,j} = \Pr(x, s_j \,|\, s_i)$ is the probability of transitioning from state $s_i$ to state $s_j$ and emitting symbol $x$. The corresponding overall state-to-state transition matrix is the row-stochastic matrix $T = \sum_{x \in \mathcal{A}} T^{(x)}$.

  • An initial distribution over hidden states: $\eta_0 = \bigl(\Pr(S_0{=}s_1), \Pr(S_0{=}s_2), \ldots, \Pr(S_0{=}s_M)\bigr)$.

Contrast this with the class-equivalent state-labeled HMMs, also known as Moore HMMs.11,45,52,75 In automata theory, a finite-state HMM is called a probabilistic nondeterministic finite automaton.76 Information theory13 refers to them as finite-state information sources, and stochastic process theory defines them as functions of a Markov chain.53,58,77,78

The dynamics of such finite-state models are governed by transition matrices amenable to the linear algebra of vector spaces. As a result, bra-ket notation is useful. Bras $\langle\cdot|$ are row vectors and kets $|\cdot\rangle$ are column vectors. One benefit of the notation is immediately recognizing the mathematical object type. For example, on the one hand, any expression that forms a closed bra-ket pair—either $\langle\cdot|\cdot\rangle$ or $\langle\cdot|\cdot|\cdot\rangle$—is a scalar quantity and commutes as a unit with anything. On the other hand, when useful, an expression of the ket-bra form $|\cdot\rangle\langle\cdot|$ can be interpreted as a matrix.

$T$'s row-stochasticity means that each of its rows sums to unity. Introducing $|\mathbf{1}\rangle$ as the column vector of all 1s, this can be restated as

$$ T \,|\mathbf{1}\rangle = |\mathbf{1}\rangle \,. $$

This is readily recognized as an eigenequation: $T|\lambda\rangle = \lambda|\lambda\rangle$. That is, the all-ones vector $|\mathbf{1}\rangle$ is always a right eigenvector of $T$ associated with the eigenvalue $\lambda$ of unity.

When the internal Markov transition matrix $T$ is irreducible, the Perron–Frobenius theorem guarantees that there is a unique asymptotic state distribution $\pi$ determined by

$$ \langle\pi| = \langle\pi|\, T \,, $$

with the further condition that $\pi$ is normalized in probability: $\langle\pi|\mathbf{1}\rangle = 1$. This again is recognized as an eigenequation: the stationary distribution $\pi$ over the hidden states is $T$'s left eigenvector associated with the eigenvalue of unity. When 1 is the only one of $T$'s eigenvalues on the unit circle—i.e., when the process and presentation lack any deterministic periodicities—then the stationary distribution $\pi$ is also the asymptotic state distribution for an ensemble of realizations, regardless of the initial distribution.
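Numerically, $\pi$ is conveniently obtained as the left eigenvector of $T$ for eigenvalue 1 (equivalently, an ordinary right eigenvector of $T^\top$), normalized in probability. A minimal sketch with an illustrative matrix:

```python
import numpy as np

# Illustrative row-stochastic transition matrix T (values assumed for the sketch).
T = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Right-eigenvector check: T |1> = |1>.
assert np.allclose(T @ np.ones(2), np.ones(2))

# Stationary distribution: left eigenvector of T for eigenvalue 1, i.e., a right
# eigenvector of T.T, normalized so that <pi|1> = 1.
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()

assert np.allclose(pi @ T, pi)    # <pi| = <pi| T
```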

To describe a stationary process, as done often in the following, the initial hidden-state distribution $\eta_0$ is set to the stationary one: $\eta_0 = \pi$. The resulting process generated is then stationary. Choosing an alternative $\eta_0$ is useful in many contexts. Note that, starting with $\eta_t$, the expected state distribution at the next time is simply $\langle\eta_{t+1}| = \langle\eta_t|\, T$. However, starting with such alternatives typically yields a nonstationary process.

An HMM $M$ describes a process' behaviors as a formal language $\mathcal{L} \subseteq \bigcup_{\ell \geq 1} \mathcal{A}^{\ell}$ of allowed realizations. Moreover, $M$ succinctly describes a process's word distribution $\Pr(w)$ over all words $w \in \mathcal{L}$. (Appropriately, $M$ also assigns zero probability to words outside of the process' language: $\Pr(w) = 0$ for all $w \in \mathcal{L}^c$, $\mathcal{L}$'s complement.) Specifically, the stationary probability of observing a particular length-$L$ word $w = x_0 x_1 \cdots x_{L-1}$ is given by

$$ \Pr(X_{0:L}{=}w) = \langle\pi|\, T^{(w)} \,|\mathbf{1}\rangle \,, \qquad (1) $$

where $T^{(w)} \equiv T^{(x_0)} T^{(x_1)} \cdots T^{(x_{L-1})}$.

More generally, given a nonstationary state distribution $\eta$, the probability of seeing $w$ is

$$ \Pr(X_{0:L}{=}w \,|\, S_0 \sim \eta) = \langle\eta|\, T^{(w)} \,|\mathbf{1}\rangle \,, \qquad (2) $$

where $S_t \sim \eta$ means that the random variable $S_t$ is distributed as $\eta$.13 And, the state distribution having seen word $w$ starting in $\eta$ is

$$ \langle\eta_w| = \frac{\langle\eta|\, T^{(w)}}{\langle\eta|\, T^{(w)} \,|\mathbf{1}\rangle} \,. \qquad (3) $$
These conditional distributions are used often since, for example, most observations induce a nonstationary distribution over hidden states. Tracking such observation-induced distributions is the role of a related model class—the mixed-state presentation, introduced shortly. To get there, we must first introduce several, prerequisite HMM classes. See Fig. 1. A simple example of the general HMM just discussed is shown in Fig. 1(a).
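A short sketch of Eqs. (1)–(3) in code, reusing the hypothetical two-state labeled matrices from the earlier sketch (the machine, the word, and the helper names are illustrative assumptions):

```python
import numpy as np

# The same illustrative labeled transition matrices as before.
T_sym = {0: np.array([[0.5, 0.0], [0.0, 0.0]]),
         1: np.array([[0.0, 0.5], [1.0, 0.0]])}
T = sum(T_sym.values())
one = np.ones(len(T))

# Stationary distribution pi: left eigenvector of T for eigenvalue 1.
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()

def word_matrix(word):
    """T^{(w)} = T^{(x0)} T^{(x1)} ... T^{(x_{L-1})}."""
    Tw = np.eye(len(T))
    for x in word:
        Tw = Tw @ T_sym[x]
    return Tw

def word_probability(word, eta=None):
    """Eqs. (1) and (2): Pr(w) = <eta| T^{(w)} |1>, with eta = pi by default."""
    eta = pi if eta is None else eta
    return eta @ word_matrix(word) @ one

def mixed_state(word, eta=None):
    """Eq. (3): <eta_w| = <eta| T^{(w)} / <eta| T^{(w)} |1>."""
    eta = pi if eta is None else eta
    v = eta @ word_matrix(word)
    return v / v.sum()

print(word_probability((0, 1, 1)), mixed_state((0, 1, 1)))
```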

FIG. 1.

Finite HMM classes and example processes they generate, depicted by their state-transition diagrams: For any setting of the transition probabilities $p, q \in (0,1)$ and transition rates $a, b, c \in (0,\infty)$, each HMM generates an observable stochastic process over its alphabet $\mathcal{A} \subseteq \{0,1,2\}$—the latent states themselves are not directly observable from the output process and so are “hidden.” (a) Simple nonunifilar source: two transitions leaving from the same state generate the same output symbol. (b) Nonminimal unifilar HMM. (c) $\epsilon$-Machine: minimal unifilar HMM for the stochastic process generated. (d) Generator of a continuous-time stochastic process.


An important class of HMMs consists of those that are unifilar. Unifilarity guarantees that, given a start state and a sequence of observations, there is a unique path through the internal states.53 This, in turn, allows one to directly translate between properties of the internal Markov chain and properties of the observed behavior generated along the sequence of edges traversed. The states of unifilar HMMs are maximally predictive.54 

In contrast, general—that is, nonunifilar—HMMs have an exponentially growing number of possible state paths as a function of observed word length. Thus, nonunifilar process presentations break nearly all quantitative connections between internal dynamics and observations, rendering them markedly less useful as process presentations. While they can be used to generate realizations of a given process, they cannot be used outright to predict a process. Unifilarity is required.

Definition 2. A finite-state, edge-labeled, unifilar HMM (uHMM)55 is a finite-state, edge-labeled HMM with the following property:

  • Unifilarity: For each state $s \in \mathcal{S}$ and each symbol $x \in \mathcal{A}$, there is at most one outgoing edge from state $s$ that emits symbol $x$.

An example is shown in Fig. 1(b).

Minimal models are not only convenient to use, but very often allow for determining essential informational properties, such as a process' memory $C_\mu$. A process' minimal state-entropy uHMM is the same as its minimal-state uHMM. And, the latter turns out to be the process' $\epsilon$-machine in computational mechanics.20 Computational mechanics shows how to calculate a process' $\epsilon$-machine from the process' conditional word distributions. Specifically, $\epsilon$-machine states, the process' causal states $\sigma \in \mathcal{S}^+$, are equivalence classes of histories that yield the same predictions for the future. Explicitly, two histories $x_{:0}$ and $x'_{:0}$ map to the same causal state $\epsilon(x_{:0}) = \epsilon(x'_{:0}) = \sigma$ if and only if $\Pr(X_{0:} \,|\, x_{:0}) = \Pr(X_{0:} \,|\, x'_{:0})$. Thus, each causal state comes with a prediction of the future $\Pr(X_{0:} \,|\, \sigma)$—its future morph. In short, a process' $\epsilon$-machine is its minimal-size, maximally predictive predictor.

Converting a given uHMM to its corresponding $\epsilon$-machine employs probabilistic variants of well-known state-minimization algorithms from automata theory.56 One can also verify that a given uHMM is minimal by checking that all its states are probabilistically distinct.49,50

Definition 3. A uHMM's states are probabilistically distinct if for each pair of distinct states $s_k, s_j \in \mathcal{S}$ there exists some finite word $w = x_0 x_1 \cdots x_{L-1}$ such that

$$ \Pr(X_{0:L}{=}w \,|\, S_0{=}s_k) \neq \Pr(X_{0:L}{=}w \,|\, S_0{=}s_j) \,. $$

If this is the case, then the process' uHMM is its $\epsilon$-machine.

An example is shown in Fig. 1(c).

The finite-state presentations in these classes form a hierarchy in terms of the processes they can finitely generate:44 Processes($\epsilon$-machines) = Processes(uHMMs) $\subsetneq$ Processes(HMMs). That is, finite HMMs generate a strictly larger class of stochastic processes than finite uHMMs. The class of processes generated by finite uHMMs, though, is the same as generated by finite $\epsilon$-machines.

Though we concentrate on discrete-time processes, many of the process classifications, properties, and calculational methods carry over easily to continuous time. In this setting transition rates are more appropriate than transition probabilities. Continuous-time HMMs can often be obtained as a discrete-time limit $\Delta t \to 0$ of an edge-labeled HMM whose edges operate for a time $\Delta t$. The most natural continuous-time HMM presentation, though, has a continuous-time generator $G$ of time evolution over hidden states, with observables emitted as deterministic functions of an internal Markov chain: $f : \mathcal{S} \to \mathcal{A}$.

Respecting the continuous-time analog of probability conservation, each row of $G$ sums to zero. Over a finite time interval $t$, marginalizing over all possible observations, the row-stochastic state-to-state transition dynamic is

$$ T(t) = e^{tG} \,. $$

Any nontrivial continuous-time process generated by such a continuous-time HMM has an uncountably infinite number of possible realizations within a finite time interval, and most of these have vanishing probability. However, probabilities regarding what state the system is in at any finite set of times can be easily calculated, essentially by bundling measurable sets of trajectories that satisfy certain constraints. For this purpose, we introduce the continuous-time observation matrices

$$ \Gamma_x = \sum_{s \in \mathcal{S}} \delta_{x, f(s)}\, |\delta_s\rangle \langle\delta_s| \,, $$

where $\delta_{x,f(s)}$ is a Kronecker delta, $|\delta_s\rangle$ the column vector of all 0s except for a 1 at the position for state $s$, and $\langle\delta_s|$ its transpose $(|\delta_s\rangle)^\top$. These “projectors” sum to the identity: $\sum_{x \in \mathcal{A}} \Gamma_x = I$.
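A minimal sketch of these continuous-time ingredients (the rates, the interval, and the observation function $f$ below are assumed for illustration):

```python
import numpy as np
from scipy.linalg import expm

# Illustrative generator G: nonnegative off-diagonal rates, rows summing to zero.
a, b = 1.0, 2.0
G = np.array([[-a,  a],
              [ b, -b]])
assert np.allclose(G.sum(axis=1), 0.0)

# State-to-state dynamic over a finite interval t: the row-stochastic matrix e^{tG}.
t = 0.5
Tt = expm(t * G)
assert np.allclose(Tt.sum(axis=1), 1.0)

# Observation function f: S -> A, and projectors Gamma_x = sum_s delta_{x,f(s)} |d_s><d_s|.
f = {0: 'x', 1: 'y'}            # state index -> emitted symbol (illustrative)
alphabet = sorted(set(f.values()))
Gamma = {sym: np.diag([1.0 if f[s] == sym else 0.0 for s in range(len(G))])
         for sym in alphabet}
assert np.allclose(sum(Gamma.values()), np.eye(len(G)))
```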

An example is shown in Fig. 1(d).

A given process can be generated by nonunifilar, unifilar, and ϵ-machine HMM presentations. A process' ϵ-machine is unique. However, within either the unifilar or nonunifilar HMM classes, there are an infinite number of presentations that generate the process.

This flexibility suggests that we can create a HMM process generator that, through its embellished structure, answers more refined questions than the information generation (hμ) and memory (Cμ) calculated from the ϵ-machine. To this end, we introduce the mixed-state presentation (MSP). An MSP tracks important supplementary information in the hidden states and, through well-crafted dynamics, over the hidden states. In particular, an MSP generates a process while tracking the observation-induced distribution over the states of an alternative process generator. Here, we review only that subset of mixed-state theory required by the following.

Consider an HMM presentation $M = (\mathcal{S}, \mathcal{A}, \{T^{(x)}\}_{x \in \mathcal{A}}, \pi)$ of some process in statistical equilibrium. A mixed state $\eta$ can be any state distribution over $\mathcal{S}$, and so we work with the simplex $\Delta_{\mathcal{S}}$ of state distributions. However, $\Delta_{\mathcal{S}}$ is uncountable and so contains far more mixed states than needed to calculate many complexity measures.

How can an observer's knowledge of the HMM state be efficiently monitored as successive symbols from the process are observed? This is the problem of observer–state synchronization. To analyze the evolution of the observer's knowledge through a sequence of observation-induced mixed states, we use the set $\mathcal{R}_\pi$ of mixed states that are induced by all allowed words $w \in \mathcal{L}$ from the initial mixed state $\eta_0 = \pi$

The cardinality of Rπ is finite when there are only a finite number of distinct probability distributions over M's states that can be induced by observed sequences, if starting from the stationary distribution π.

If $w$ is the first word (in lexicographic order) that induces a particular distribution over $\mathcal{S}$, then we denote this distribution as $\eta_w$, a shorthand for Eq. (3). For example, if the two words 010 and 110110 both induce the same distribution $\eta$ over $\mathcal{S}$ and no word shorter than 010 induces that distribution, then the mixed state is denoted $\eta_{010}$. It corresponds to the distribution

$$ \langle\eta_{010}| = \frac{\langle\pi|\, T^{(0)} T^{(1)} T^{(0)}}{\langle\pi|\, T^{(0)} T^{(1)} T^{(0)} \,|\mathbf{1}\rangle} \,. $$
Since a given observed symbol induces a unique updated distribution from a previous distribution, the dynamic over mixed states is unifilar. Transition probabilities among mixed states can be obtained via Eqs. (2) and (3). So, if

and

then

These transition probabilities over the mixed states in $\mathcal{R}_\pi$ are the matrix elements for the observation-labeled transition matrices $\{W^{(x)}\}_{x \in \mathcal{A}}$ of $M$'s synchronizing MSP (S-MSP)

$$ \text{S-MSP}(M) = \bigl( \mathcal{R}_\pi, \mathcal{A}, \{W^{(x)}\}_{x \in \mathcal{A}}, \delta_\pi \bigr) \,, $$

where $\delta_\pi$ is the distribution over $\mathcal{R}_\pi$ peaked at the unique start-(mixed)-state $\pi$. The row-stochastic net mixed-state-to-state transition matrix of S-MSP($M$) is $W = \sum_{x \in \mathcal{A}} W^{(x)}$. If irreducible, then there is a unique stationary probability distribution $\langle\pi_W|$ over S-MSP($M$)'s states obtained by solving $\langle\pi_W| = \langle\pi_W|\, W$. We use $\mathcal{R}_t$ to denote the random variable for the MSP's state at time $t$.
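The construction can be sketched directly in code: starting from $\pi$, apply each symbol's update of Eq. (3), record every new distribution as a new mixed state, and collect the transition probabilities into the labeled matrices $W^{(x)}$. The example machine is the same illustrative two-state HMM used above, and matching mixed states up to a numerical tolerance is an implementation assumption:

```python
import numpy as np

T_sym = {0: np.array([[0.5, 0.0], [0.0, 0.0]]),
         1: np.array([[0.0, 0.5], [1.0, 0.0]])}
T = sum(T_sym.values())
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()

# Breadth-first exploration of the mixed states reachable from pi.
states = [pi]      # R_pi, indexed by order of discovery; index 0 is the start state
edges = {}         # (i, x) -> (transition probability, destination index j)
queue = [0]
while queue:
    i = queue.pop(0)
    for x, Tx in T_sym.items():
        v = states[i] @ Tx
        p = v.sum()                    # Pr(x | eta_i) = <eta_i| T^{(x)} |1>
        if p <= 1e-12:
            continue
        eta = v / p                    # observation-updated mixed state, Eq. (3)
        for j, known in enumerate(states):
            if np.allclose(eta, known, atol=1e-10):
                break
        else:
            states.append(eta)
            j = len(states) - 1
            queue.append(j)
        edges[(i, x)] = (p, j)

# Assemble the labeled MSP matrices W^{(x)}; their sum W is row-stochastic.
n = len(states)
W_sym = {x: np.zeros((n, n)) for x in T_sym}
for (i, x), (p, j) in edges.items():
    W_sym[x][i, j] = p
W = sum(W_sym.values())
assert np.allclose(W.sum(axis=1), 1.0)
```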

Figure 2 illustrates how an S-MSP relates to the HMM whose state distributions it tracks. Both the original HMM and its MSP generate the same process. However, the MSP's dynamic over mixed states also tracks how, through the mixed states induced by successive observations, an observer comes to know the original HMM's current state. Importantly, this MSP's dynamic is nondiagonalizable, while the original HMM's dynamic is diagonalizable. The appearance of nondiagonalizability when analyzing particular properties (in this case, observer synchronization) is generic. Constructively working with this unavoidable fact motivates much of the following. Crucially, we find that the burden of predicting a stochastic process is fundamentally dependent on the nondiagonalizable characteristics of its MSP.

FIG. 2.

(a) Example HMM and (b) its S-MSP: both of these HMMs generate the (21)-GP-(2) process considered in Part II.79 However, S-MSP also tracks observation-induced states of knowledge about the example HMM's distribution over internal states A,B,C, and D. The double-circle state in the MSP state-transition diagram denotes the S-MSP's start-state. The green states are transient, whereas the blue states are recurrent. The MSP's mixed-state-to-state dynamic is nondiagonalizable—which is generic—even though the example HMM's state-to-state dynamic is diagonalizable.


That feedforward network structures lead to nondiagonalizability, and that this matters for network functionality, was also observed in a quite different setting—oscillator synchronization on complex networks with directional coupling dynamics.57 Briefly comparing settings reveals the origins of the commonality. In each, nondiagonalizability arises from an irreducible interdependence among elements—an interdependence that can be harnessed for hierarchical control. In Ref. 57, nondiagonalizable network structures allow upstream oscillators to influence downstream oscillators and this enables optimal synchronization among all oscillators. In contrast, while our setting does not concern synchronizing oscillator nodes to each other, it does analyze how an observer's belief state synchronizes to the true state of the system under study. During this kind of synchronization, past states of knowledge feed into future states of knowledge. In short, nondiagonalizability corresponds to intrinsically interdependent updates in the evolution of knowledge.

More generally, we work with distributions over $\Delta_{\mathcal{S}}$. We must consider a mixed-state dynamic that starts from a general (nonpeaked) distribution over $\Delta_{\mathcal{S}}$. This may be counterintuitive, since a distribution over distributions might seem to reduce to a single distribution. However, the general MSP theory with a general starting distribution over $\Delta_{\mathcal{S}}$ allows us to consider a weighted average of behaviors originating from different histories. And, this is distinct from considering the behavior originating from a weighted average of histories. This more general MSP formalism arises in the closed-form solutions for more sophisticated complexity measures, such as the bound information. This anticipates tools needed in a sequel.

With this brief overview of mixed states, we can now turn to use them. Section V shows that tracking distributions over the states of another generator makes the MSP an ideal algebraic object for closed-form complexity expressions involving conditional entropies—measures that require conditional probabilities. Sections II B and II C showed that many of the complexity measures for predictability and predictive burden are indeed framed as conditional entropies. And so, MSPs are central to their closed-form expressions.

Historically, mixed states were already implicit in Ref. 58, introduced in their modern form by Refs. 44 and 45, and have been used recently; e.g., in Refs. 59 and 60. Most of these efforts, however, used mixed-states in the specific context of the synchronizing MSP (S-MSP). A greatly extended development of mixed-state dynamics appears in Ref. 41.

The overall strategy, though, is easy to explain. Different information-theoretic questions require different mixed-state dynamics, each of which is a unifilar presentation. Employing the mathematical methods developed here, we find that the desired closed-form solutions are often simple functions of the transition dynamic of an appropriate MSP. Specifically, the spectral properties of the relevant MSP control the form of information-theoretic quantities.

Finally, we note that similar linear-algebraic constructions—whose hidden states track relevant information—that are nevertheless not MSPs are just as important for answering different sets of questions about a process. Since these constructions are not directly about predictability and prediction, we report on them elsewhere.

We are now in a position to identify the hidden linear dynamic appropriate to many of the questions that arise in complex systems—their observation, predictability, prediction, and generation, as outlined in Table II. In part, this section addresses a very practical need for specific calculations. In part, it also lays the foundations for further generalizations, to be discussed at the end. Identifying the linear dynamic means identifying the linear operator $A$ such that a question of interest can be reformulated as either being of the cascading form $\langle\cdot|A^n|\cdot\rangle$ or as an accumulation of such cascading events via $\langle\cdot|\bigl(\sum_n A^n\bigr)|\cdot\rangle$; recall Table I. Helpfully, many well-known questions of complexity can be mapped to these archetypal forms. And so, we now proceed to uncover the hidden linear dynamics of the cascading questions approximately in the order they were introduced in Sec. II.

For observable correlation, any HMM transition operator will do as the linear dynamic. We simply observe, let time (or space) evolve forward, and observe again. Let us be concrete.

Recall the familiar autocorrelation function. For a discrete-domain process, it is61 

where $L \in \mathbb{Z}$ and the bar denotes the complex conjugate. The autocorrelation function is symmetric about $L = 0$, so we can focus on $L \geq 0$. For $L = 0$, we simply have

For L > 0, we have

Each “*” above is a wildcard symbol denoting indifference to the particular symbol observed in its place. That is, the *s denote marginalizing over the intervening random variables. We develop the consequence of this, explicitly calculating62 and finding

The result is the autocorrelation in the cascading form $\langle\cdot|\,T^{t}\,|\cdot\rangle$, which can be made particularly transparent by subsuming time-independent factors on the left and right into the bras and kets. Let us introduce the new row vector

and the column vector

Then, the autocorrelation function for nonzero integer L is simply

(4)

Clearly, the autocorrelation function is a direct, albeit filtered, signature of iterates of the transition dynamic of any process presentation.

This result can easily be translated to the continuous-time setting. If the process is represented as a function of a Markov chain and we make the translation that

then the autocorrelation function for any τ is simply

(5)

where G is determined from T following Sec. III D. Again, the autocorrelation function is a direct fingerprint of the transition dynamic over the hidden states.

The power spectrum is a modulated accumulation of the autocorrelation function. With some algebra, one can show that it is

Reference 61 shows that for discrete-domain processes, the continuous part of the power spectrum is simply

(6)

where Re (·) denotes the real part of its argument and I is the identity matrix. Similarly, for continuous-domain processes, one has

(7)

Although useful, these signatures of pairwise correlation are only first-order complexity measures. Common measures of complexity that include higher orders of correlation can also be written in the simple cascading and accumulating forms, but require a more careful choice of representation.

For example, any HMM presentation allows us to calculate using Eq. (1) a process's block entropy

but at a computational cost $O(|\mathcal{S}|^3 L\, |\mathcal{A}|^L)$ exponential in $L$, due to the exponentially growing number of words in $\mathcal{L} \cap \mathcal{A}^L$. Consequently, using a general HMM, one can neither directly nor efficiently calculate many key complexity measures, including a process's entropy rate and excess entropy.

These limitations motivate using more specialized HMM classes. To take one example, it has been known for some time that a process' entropy rate $h_\mu$ can be calculated directly from any of its unifilar presentations.53 Another is that we can calculate the excess entropy directly from a process's uHMM forward $\mathcal{S}^+$ and reverse $\mathcal{S}^-$ states:59,60 $\mathbf{E} = I[X_{:0}; X_{0:}] = I[\mathcal{S}^+; \mathcal{S}^-]$.

However, efficient computation of the myopic entropy rates $h_\mu(L)$ remained elusive for some time, and we only recently found their closed-form expression.3 The myopic entropy rates are important because they represent the apparent entropy rate of a process if it is modeled as a finite Markov order-($L$ – 1) process—a very common approximation. Crucially, the difference $h_\mu(L) - h_\mu$ from the process' true entropy rate is the surplus entropy rate incurred by using an order-($L$ – 1) Markov approximation. Similarly, these surplus entropy rates lead not only to an apparent loss of predictability, but also to errors in the inferred physical properties. These include overestimates of dissipation associated with the surplus entropy rate assigned to a physical thermodynamic system.37

Unifilarity, it turns out, is not enough to calculate a process' hμ(L) directly. Rather, the S-MSP of any process presentation is what is required. Let us now develop the closed-form expression for the myopic entropy rates, following Ref. 41.

The length-$L$ myopic entropy rate is the expected uncertainty in the $L$th random variable $X_{L-1}$, given the preceding $L$ – 1 random variables $X_{0:L-1}$

$$ h_\mu(L) \equiv H[X_{L-1} \,|\, X_{0:L-1}] $$
$$ \qquad\; = H[X_{L-1} \,|\, X_{0:L-1},\, \eta_0 = \pi] \,, \qquad (8) $$

where, in the second line, we explicitly give the condition $\eta_0 = \pi$ specifying our ignorance of the initial state. That is, without making any observations, we can only assume that the initial distribution $\eta_0$ over $M$'s states is the expected asymptotic distribution $\pi$. For a mixing ergodic process, for example, even if another distribution $\eta_{-N} = \alpha$ was known in the distant past, we still have $\langle\eta_0| = \langle\eta_{-N}|\, T^{N} \to \langle\pi|$, as $N \to \infty$.

Assuming an initial probability distribution over $M$'s states, a given observation sequence induces a particular sequence of updated state distributions. That is, the S-MSP($M$) is unifilar regardless of whether $M$ is unifilar or not. Or, in other words, given the S-MSP's unique start state—$\mathcal{R}_0 = \pi$—and a particular realization $X_{0:L-1} = w_{L-1}$ of the preceding $L$ – 1 random variables, we end up at the particular mixed state $\mathcal{R}_{L-1} = \eta_{w_{L-1}} \in \mathcal{R}_\pi$. Moreover, the entropy of the next observation is uniquely determined by $M$'s state distribution, suggesting that Eq. (8) becomes

as proven elsewhere.41 Intuitively, conditioning on all of the past observation random variables is equivalent to conditioning on the random variable for the state distribution induced by particular observation sequences.

We can now recast Eq. (8) in terms of the S-MSP, finding

Here

is simply the column vector whose $i$th entry is the entropy of transitioning from the $i$th state of the S-MSP. Critically, $|H(W^{\mathcal{A}})\rangle$ is independent of $L$.

Notice that taking the logarithm of the sum of the entries of the row vector $\langle\delta_\eta|\, W^{(x)}$ via $\langle\delta_\eta|\, W^{(x)} |\mathbf{1}\rangle$ is only permissible since the S-MSP's unifilarity guarantees that $W^{(x)}$ has at most one nonzero entry per row. (We also use the familiar convention that $0 \log_2 0 = 0$.13)

The result is a particularly compact and efficient expression for the length-$L$ myopic entropy rates

$$ h_\mu(L) = \langle\delta_\pi|\, W^{L-1} \,| H(W^{\mathcal{A}}) \rangle \,. \qquad (9) $$

Thus, all that is required is computing powers $W^{L-1}$ of the MSP transition dynamic. The computational cost $O(L\, |\mathcal{R}_\pi|^3)$ is now only linear in $L$. Moreover, $W$ is very sparse, especially so with a small alphabet $\mathcal{A}$. And, this means that the computational cost can be reduced even further via numerical optimization.
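Given labeled MSP matrices and the start state $\delta_\pi$, Eq. (9) is a few lines of linear algebra. The sketch below hard-codes the four-state S-MSP that the construction sketched earlier yields for the hypothetical two-state HMM used there; those numbers, and the helper names, are assumptions of the example:

```python
import numpy as np

# Labeled transition matrices W^{(x)} of an illustrative four-state S-MSP;
# index 0 is the start state, so <delta_pi| = (1, 0, 0, 0).
W_sym = {0: np.array([[0, 1/3, 0,   0  ],
                      [0, 1/2, 0,   0  ],
                      [0, 1/4, 0,   0  ],
                      [0, 0,   0,   0  ]]),
         1: np.array([[0,   0, 2/3, 0  ],
                      [0,   0, 0,   1/2],
                      [3/4, 0, 0,   0  ],
                      [0,   1, 0,   0  ]])}
W = sum(W_sym.values())
n = len(W)

def plogp(p):
    return -p * np.log2(p) if p > 0 else 0.0

# |H(W^A)> : entry i is the entropy of the next-symbol distribution leaving MSP
# state i; by unifilarity this is also the outgoing-transition entropy.
H_ket = np.array([sum(plogp(Wx[i].sum()) for Wx in W_sym.values()) for i in range(n)])

# Eq. (9): h_mu(L) = <delta_pi| W^{L-1} |H(W^A)>.
delta_pi = np.zeros(n)
delta_pi[0] = 1.0

def h_mu(L):
    return delta_pi @ np.linalg.matrix_power(W, L - 1) @ H_ket

print([round(h_mu(L), 4) for L in range(1, 6)])
```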

With hμ(L) in hand, the hierarchy of complexity measures that derive from it immediately follow, including the entropy rate hμ, the excess entropy E, and the transient information T.1 Specifically, we have

The sequel, Part II,79 discusses these in more detail, introducing their closed-form expressions. To prepare for this, we must first review the meromorphic functional calculus, which is needed for working with the above operators.

We saw that correlation measures are easily extended to the continuous-time domain via continuous-time HMMs. Entropy rates (since they are rates) and state entropy (since it depends only on the instantaneous distribution) also carry over rather easily to continuous time. Indeed, the former is well studied for chaotic systems63 and the latter is exemplified by the thermodynamic entropy. Yet, other information-theoretic measures of information transduction are awkward when directly translated to continuous time. At least one approach has been taken recently towards understanding their structure,64–66 but more work is necessary.

If a process' state-space is known, then the S-MSP of the generating model allows one to track the observation-induced distributions over its states. This naturally leads to closed-form solutions to informational questions about how an observer comes to know, or how it synchronizes to, the system's states.

To monitor how an observer's knowledge of a process' internal state changes with increasing measurements, we use the myopic state uncertainty $\mathcal{H}(L) = H[S_0 \,|\, X_{-L:0}]$.1 Expressing it in terms of the S-MSP, one finds41

Here, $H[\eta]$ is the presentation-state uncertainty specified by the mixed state $\eta$

$$ H[\eta] = - \sum_{s \in \mathcal{S}} \langle\eta|\delta_s\rangle \log_2 \langle\eta|\delta_s\rangle \,, \qquad (10) $$

where $|\delta_s\rangle$ is the length-$|\mathcal{S}|$ column vector of all zeros except for a 1 at the appropriate index of the presentation-state $s$.

Continuing, we re-express $\mathcal{H}(L)$ in terms of powers of the S-MSP transition dynamic

$$ \mathcal{H}(L) = \langle\delta_\pi|\, W^{L} \,| H[\eta] \rangle \,. \qquad (11) $$

Here, we defined

$$ | H[\eta] \rangle \equiv \sum_{\eta \in \mathcal{R}_\pi} H[\eta]\, |\delta_\eta\rangle \,, $$

which is the $L$-independent length-$|\mathcal{R}_\pi|$ column vector whose entries are the appropriately indexed entropies of each mixed state.

The forms of Eqs. (9) and (11) demonstrate that $h_\mu(L+1)$ and $\mathcal{H}(L)$ differ only in the type of information being extracted after being evolved by the operator: observable entropy $H(W^{\mathcal{A}})$ or state entropy $H[\eta]$, as implicated by their respective kets. Each of these entropies decreases as the distributions induced by longer observation sequences converge to the synchronized distribution. If synchronization is achieved, the distributions become $\delta$-functions on a single state and the associated state-entropy vanishes.

Paralleling $h_\mu(L)$, there is a complementary hierarchy of complexity measures that are built up from functions of $\mathcal{H}(L)$. These include the asymptotic state uncertainty $\mathcal{H}$ and excess synchronization information $\mathcal{S}$, to mention only two

Compared to the $h_\mu(L)$ family of measures, $\mathcal{H}$ and $\mathcal{S}$ mirror the roles of $h_\mu$ and $\mathbf{E}$, respectively.

The model state-complexity

also has an analog in the hμ(L) hierarchy—the process' alphabet complexity

We just reviewed the linear underpinnings of synchronizing to any model of a process. However, the myopic state uncertainty of the ϵ-machine has a distinguished role in determining the synchronization cost for maximally predicting a process, regardless of the presentation that generated it. Using the ϵ-machine's S-MSP, the ϵ-machine myopic state uncertainty can be written in direct parallel to the myopic state uncertainty of any model

The script W emphasizes that we are now specifically working with the state-to-state transition dynamic of the ϵ-machine's MSP.

Paralleling $\mathcal{H}(L)$, an obvious hierarchy of complexity measures is built from functions of $\mathcal{H}^+(L)$. For example, the $\epsilon$-machine's state-complexity is the statistical complexity $C_\mu = \mathcal{H}^+(0)$. The information that must be obtained to synchronize to the causal state in order to maximally predict—the causal synchronization information—is given in terms of the $\epsilon$-machine's S-MSP by $\mathbf{S} = \sum_{L=0}^{\infty} \mathcal{H}^+(L)$.

An important difference when using $\epsilon$-machine presentations is that they have zero asymptotic state uncertainty

$$ \mathcal{H}^+ \equiv \lim_{L \to \infty} \mathcal{H}^+(L) = 0 \,. $$

Therefore, $\mathbf{S} = \mathcal{S}(\epsilon\text{-machine})$. Moreover, we conjecture that $\mathbf{S} = \min_M \sum_{L=0}^{\infty} \mathcal{H}(L)$, where the minimum is over presentations $M$ that generate the process, even if $C_\mu \neq C_g$.

Many of the complexity measures use a mixed-state presentation as the appropriate linear dynamic, with particular focus on the S-MSP. However, we want to emphasize that this is more a reflection of questions that have become common. It does not indicate the general answer that one expects in the broader approach to finding the hidden linear dynamic. Here, we give a brief overview of how other linear dynamics can appear for different types of complexity questions. These have been uncovered recently and will be reported in more detail in sequels.

First, we found the reverse-time mixed-functional presentation (MFP) of any forward-time generator. The MFP tracks the reverse-time dynamic over linear functionals of state distributions induced by reverse-time observations

The MFP allows direct calculation of the convergence of the preparation uncertainty $\overline{\mathcal{H}}(L) \equiv H(\mathcal{S}_0 \mid X_{0:L})$ via powers of the linear MFP transition dynamic. The preparation uncertainty in turn gives a new perspective on the transient information, since

can be interpreted as the predictive advantage of hindsight. Related, the myopic process crypticity $\chi(L) = \overline{\mathcal{H}}^+(L) - \mathcal{H}^+(L)$ had been previously introduced.43 Since $\lim_{L\to\infty} \mathcal{H}^+(L) = \mathcal{H}^+(\infty) = 0$, the asymptotic crypticity is $\chi = \overline{\mathcal{H}}^+(\infty) - \mathcal{H}^+(\infty) = \overline{\mathcal{H}}^+(\infty)$. And, this reveals a refined partitioning underlying the sum

Crypticity $\chi = H(\mathcal{S}_0^+ \mid X_{0:\infty})$ itself is positive only if the process' cryptic order

$$k \equiv \min \bigl\{ \ell \geq 0 : H[\mathcal{S}_\ell^+ \mid X_{0:\infty}] = 0 \bigr\}$$

is positive. The cryptic order is always less than or equal to its better-known cousin, the Markov order $R$

$$R \equiv \min \bigl\{ \ell \geq 0 : H[\mathcal{S}_\ell^+ \mid X_{0:\ell}] = 0 \bigr\} ,$$

since conditioning can never increase entropy. In the case of the cryptic order, we condition on the future observations $X_{0:\infty}$.

The forward-time cryptic operator presentation gives the forward-time observation-induced dynamic over the operators

Since the reverse-time causal state $\mathcal{S}_0^-$ at time 0 is a linear combination of forward-time causal states,67,68 this presentation allows new calculations of the convergence to crypticity that implicate $\Pr(\mathcal{S}_0^+ \mid X_{-L:\infty})$.

In fact, the cryptic operator presentation is a special case of the more general myopic bidirectional dynamic over operators

induced by new observations of either the future or the past. This is key to understanding the interplay between forgetfulness and shortsightedness: $\Pr(\mathcal{S}_0 \mid X_{-M:0}, X_{0:N})$.

The list of these extensions continues. Detailed bounds on entropy-rate convergence are obtained from the transition dynamic of the so-called possibility machine, beyond the asymptotic result obtained in Ref. 50. And, the importance of post-synchronized monitoring, as quantified by the information lost due to negligence over a duration

can be determined using yet another type of modified MSP.

These examples all find an exact solution via a theory parallel to that outlined in the following, but applied to the linear dynamic appropriate to the corresponding complexity question. Furthermore, they highlight the opportunity, enabled by the full meromorphic functional calculus,4 to ask and answer more nuanced and, thus, more probing questions about structure, predictability, and prediction.

It would seem that we achieved our goal. We identified the appropriate transition dynamic for common complexity questions and, by some standard, gave formulae for their exact solution. In point of fact, the effort so far has all been in preparation. Although we set up the framework appropriately for linear analysis, closed-form expressions for the complexity measures still await the mathematical developments of the following Secs. VI–VIII. At the same time, at the level of qualitative understanding and scientific interpretation, we have so far failed to answer the simple question:

  • What range of possible behaviors do these complexity measures exhibit?

    and the natural follow-up question:

  • What mechanisms produce qualitatively different informational signatures?

The following Sec. VI reviews the recently developed functional calculus that allows us to actually decompose arbitrary functions of the nondiagonalizable hidden dynamic to give conclusive answers to these fundamental questions.4 We then analyze the range of possible behaviors and identify the internal mechanisms that give rise to qualitatively different contributions to complexity.

The investment in this and the succeeding Secs. VI–VIII allows Part II to express new closed-form solutions for many complexity measures beyond those achieved to date. In addition to obvious calculational advantages, this also gives new insights into possible behaviors of the complexity measures and, moreover, their unexpected similarities with each other. In many ways, the results shed new light on what we were (implicitly) probing with already-familiar complexity measures. Constructively, this suggests extending complexity magnitudes to complexity functions that succinctly capture the organization to all orders of correlation. Just as our intuition for pairwise correlation grows out of power spectra, so too these extensions unveil the workings of both a process' predictability and the burden of prediction for an observer.

Here, we briefly review the spectral decomposition theory from Ref. 4 needed for working with nondiagonalizable linear operators. As will become clear, it goes significantly beyond the spectral theorem for normal operators. Although linear operator theory (especially as developed in the mathematical literature on functional analysis) already addresses nonnormal operators, it had not delivered comparable machinery for a tractable spectral decomposition of nonnormal and nondiagonalizable operators. Reference 4 explored this topic and derived new relations that enable the practical analysis of nonnormal and nondiagonalizable systems—in which independent subprocesses (irreducible subspaces) can be directly manipulated.

We restrict our attention to operators that have at most a countably infinite spectrum. Such operators share many features with finite-dimensional square matrices. And so, we review several elementary but essential facts that are used extensively in the following.

Recall that if $A$ is a finite-dimensional square matrix, then $A$'s spectrum is simply its set of eigenvalues

$$\Lambda_A = \{ \lambda \in \mathbb{C} : \det(\lambda I - A) = 0 \} ,$$

where $\det(\cdot)$ is the determinant of its argument.

For reference later, recall that the algebraic multiplicity $a_\lambda$ of eigenvalue $\lambda$ is the power of the term $(z - \lambda)$ in the characteristic polynomial $\det(zI - A)$. In contrast, the geometric multiplicity $g_\lambda$ is the dimension of the kernel of the transformation $A - \lambda I$ or, equivalently, the number of linearly independent eigenvectors for the eigenvalue. Moreover, $g_\lambda$ is the number of Jordan blocks associated with $\lambda$. The algebraic and geometric multiplicities are all equal when the matrix is diagonalizable.

Since there can be multiple subspaces associated with a single eigenvalue, corresponding to different Jordan blocks in the Jordan canonical form, it is structurally important to introduce the index of the eigenvalue to describe the size of its largest-dimension associated subspace.

Definition 4. The index $\nu_\lambda$ of eigenvalue $\lambda$ is the size of the largest Jordan block associated with $\lambda$.

The index gives information beyond what the algebraic and geometric multiplicities themselves reveal. Nevertheless, for $\lambda \in \Lambda_A$, it is always true that $\nu_\lambda - 1 \leq a_\lambda - g_\lambda \leq a_\lambda - 1$. In the diagonalizable case, $a_\lambda = g_\lambda$ and $\nu_\lambda = 1$ for all $\lambda \in \Lambda_A$.
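As a quick check of these definitions, the following sketch—using a toy matrix of our own, with sympy for exact arithmetic—computes $a_\lambda$, $g_\lambda$, and $\nu_\lambda$ and verifies the inequality above.

```python
# A minimal sketch (toy matrix, not from the paper): algebraic multiplicity from
# eigenvals(), geometric multiplicity from the nullspace, and the index from the
# rank stabilization of powers of (A - lambda*I).
import sympy as sp

A = sp.Matrix([[3, 1, 0, 0],      # one 2x2 Jordan block for eigenvalue 3,
               [0, 3, 0, 0],      # one 1x1 block for eigenvalue 3,
               [0, 0, 3, 0],      # one simple eigenvalue 5
               [0, 0, 0, 5]])
n = A.shape[0]

for lam, a_lam in A.eigenvals().items():          # algebraic multiplicities
    B = A - lam * sp.eye(n)
    g_lam = len(B.nullspace())                    # geometric multiplicity
    nu_lam, prev_rank = 0, n                      # index: smallest m with
    while True:                                   # rank(B^m) == rank(B^(m+1))
        nu_lam += 1
        r = (B ** nu_lam).rank()
        if r == prev_rank:
            nu_lam -= 1
            break
        prev_rank = r
    print(lam, a_lam, g_lam, nu_lam)
    assert nu_lam - 1 <= a_lam - g_lam <= a_lam - 1
```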

The resolvent

$$(zI - A)^{-1} ,$$

defined with the help of the continuous complex variable $z$, captures all of the spectral information about $A$ through the poles of the resolvent's matrix elements. In fact, the resolvent contains more than just the spectrum: the order of each pole gives the index of the corresponding eigenvalue.

Each eigenvalue $\lambda$ of $A$ has an associated spectral projection operator $A_\lambda$, which is the residue of the resolvent as $z \to \lambda$

$$A_\lambda \equiv \frac{1}{2\pi i} \oint_{C_\lambda} (zI - A)^{-1} \, dz , \qquad (12)$$

where $C_\lambda$ is a counterclockwise contour in the complex plane around eigenvalue $\lambda$ enclosing no other eigenvalue. The residue of the matrix can be calculated elementwise.

The projection operators are orthonormal

$$A_\lambda A_\zeta = \delta_{\lambda \zeta} \, A_\lambda \qquad (13)$$

and sum to the identity

$$\sum_{\lambda \in \Lambda_A} A_\lambda = I . \qquad (14)$$

For cases where $\nu_\lambda = 1$, we found that the projection operator associated with $\lambda$ can be calculated as4

$$A_\lambda = \prod_{\zeta \in \Lambda_A \setminus \{\lambda\}} \left( \frac{A - \zeta I}{\lambda - \zeta} \right)^{\nu_\zeta} . \qquad (15)$$

Not all projection operators of a nondiagonalizable operator can be found directly from Eq. (15), since some have an index larger than one. However, if there is only one eigenvalue with an index larger than one—the almost diagonalizable case treated in Part II79—then Eq. (15), together with the fact that the projection operators must sum to the identity, does give a full solution for the set of projection operators. Next, we consider the general case, with no restriction on $\nu_\lambda$.
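A small numerical illustration of Eq. (15) and of the sum-to-identity shortcut just described, using a toy matrix of our own whose eigenvalue 0 has index 2 while the eigenvalue 2 has index 1.

```python
# Sketch (our own toy example, not from the paper): compute the index-one
# eigenprojector from Eq. (15), obtain the remaining projector from Eq. (14),
# and check the projector algebra of Eqs. (13)-(14).
import numpy as np

A = np.array([[0., 1., 0.],
              [0., 0., 0.],
              [0., 0., 2.]])
I = np.eye(3)

# Eq. (15) for lambda = 2 (index one), with exponent nu_0 = 2 for zeta = 0:
A2 = np.linalg.matrix_power((A - 0. * I) / (2. - 0.), 2)
A0 = I - A2                                   # remaining projector, Eq. (14)

assert np.allclose(A2 @ A2, A2) and np.allclose(A0 @ A0, A0)
assert np.allclose(A2 @ A0, 0.) and np.allclose(A0 + A2, I)
```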

In general, as we now discuss, an operator's eigenprojectors can be obtained from all left and right eigenvectors and generalized eigenvectors associated with the eigenvalue. Let $(\Lambda_A) = (\lambda_1, \lambda_2, \ldots, \lambda_n)$ be the $n$-tuple of eigenvalues in which each eigenvalue $\lambda \in \Lambda_A$ is listed $g_\lambda$ times. So $\sum_{k=1}^{n} \delta_{\lambda_k, \lambda} = g_\lambda$, and $n = \sum_{\lambda \in \Lambda_A} g_\lambda$ is the total number of Jordan blocks in the Jordan canonical form. Each $\lambda_k \in (\Lambda_A)$ corresponds to a particular Jordan block of size $m_k$. The index $\nu_\lambda$ of $\lambda$ is thus

$$\nu_\lambda = \max_{\{ k \, : \, \lambda_k = \lambda \}} m_k .$$
There is a corresponding $n$-tuple of $m_k$-tuples of linearly independent generalized right-eigenvectors

$$\Bigl( \bigl( |\lambda_k^{(m)}\rangle \bigr)_{m=1}^{m_k} \Bigr)_{k=1}^{n} ,$$

where $|\lambda_k^{(m)}\rangle$ is the $m$th generalized right-eigenvector associated with the $k$th Jordan block, and a corresponding $n$-tuple of $m_k$-tuples of linearly independent generalized left-eigenvectors

$$\Bigl( \bigl( \langle \lambda_k^{(m)} | \bigr)_{m=1}^{m_k} \Bigr)_{k=1}^{n} ,$$

where $\langle \lambda_k^{(m)} |$ is the corresponding $m$th generalized left-eigenvector, such that

$$A \, |\lambda_k^{(m+1)}\rangle = \lambda_k \, |\lambda_k^{(m+1)}\rangle + |\lambda_k^{(m)}\rangle \qquad (16)$$

and

$$\langle \lambda_k^{(m+1)} | \, A = \lambda_k \, \langle \lambda_k^{(m+1)} | + \langle \lambda_k^{(m)} | \qquad (17)$$

for $0 \leq m \leq m_k - 1$, where $|\lambda_k^{(0)}\rangle = \vec{0}$ and $\langle \lambda_k^{(0)} | = \vec{0}$. Specifically, $|\lambda_k^{(1)}\rangle$ and $\langle \lambda_k^{(1)} |$ are conventional right and left eigenvectors, respectively.

Most directly, the generalized right and left eigenvectors can be found as the nontrivial solutions to

$$(A - \lambda_k I)^{m_k} \, |\lambda_k^{(m_k)}\rangle = \vec{0}$$

and

$$\langle \lambda_k^{(m_k)} | \, (A - \lambda_k I)^{m_k} = \vec{0} ,$$
respectively. Imposing appropriate normalization, we find that

$$\langle \lambda_j^{(m)} | \lambda_k^{(n)} \rangle = \delta_{j,k} \, \delta_{m+n, \, m_k+1} . \qquad (18)$$

Crucially, right and left eigenvectors are no longer simply related by complex-conjugate transposition, and right eigenvectors are not necessarily orthogonal to each other. Rather, the left eigenvectors and generalized eigenvectors form a dual basis to the right eigenvectors and generalized eigenvectors. Somewhat surprisingly, the most generalized left eigenvector $\langle \lambda_k^{(m_k)} |$ associated with $\lambda_k$ is dual to the least generalized right eigenvector $|\lambda_k^{(1)}\rangle$ associated with $\lambda_k$

$$\langle \lambda_k^{(m_k)} | \lambda_k^{(1)} \rangle = 1 .$$
Explicitly, we find that the spectral projection operators for a nondiagonalizable matrix can be written as

$$A_\lambda = \sum_{\substack{k \\ \lambda_k = \lambda}} \; \sum_{m=1}^{m_k} |\lambda_k^{(m)}\rangle \langle \lambda_k^{(m)} | . \qquad (19)$$
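The construction in Eq. (19) can be checked directly on a toy matrix of our own: the columns of the Jordan-form change of basis supply the generalized right eigenvectors, and the rows of its inverse supply the dual generalized left eigenvectors.

```python
# Sketch (assumptions: toy matrix, sympy's Jordan form as the source of the dual
# bases) building each eigenprojector as in Eq. (19) and checking Eqs. (13)-(14).
import sympy as sp

A = sp.Matrix([[4, 1, 0],
               [0, 4, 0],
               [0, 0, 2]])          # eigenvalue 4 carries a 2x2 Jordan block

P, J = A.jordan_form()               # A = P J P^{-1}
Pinv = P.inv()

projectors = {}
for k in range(A.shape[0]):
    lam = J[k, k]
    proj = projectors.get(lam, sp.zeros(*A.shape))
    projectors[lam] = proj + P[:, k] * Pinv[k, :]   # |lam_k^(m)><lam_k^(m)|

assert sum(projectors.values(), sp.zeros(*A.shape)) == sp.eye(3)   # Eq. (14)
for Alam in projectors.values():
    assert Alam * Alam == Alam                                     # Eq. (13)
```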

It is useful to introduce the generalized set of companion operators

$$A_{\lambda, m} \equiv A_\lambda \, (A - \lambda I)^m \qquad (20)$$

for $\lambda \in \Lambda_A$ and $m \in \{0, 1, 2, \ldots\}$. These operators satisfy the following semigroup relation:

$$A_{\lambda, m} \, A_{\zeta, n} = \delta_{\lambda \zeta} \, A_{\lambda, m+n} . \qquad (21)$$

$A_{\lambda, m}$ reduces to the eigenprojector for $m = 0$

$$A_{\lambda, 0} = A_\lambda \qquad (22)$$

and it exactly reduces to the zero matrix for $m \geq \nu_\lambda$

$$A_{\lambda, m} = \mathbf{0} \quad \text{for } m \geq \nu_\lambda . \qquad (23)$$

Crucially, we can rewrite the resolvent as a weighted sum of the companion matrices $\{A_{\lambda, m}\}$, with complex coefficients that have poles at each eigenvalue $\lambda$ up to the eigenvalue's index $\nu_\lambda$

$$(zI - A)^{-1} = \sum_{\lambda \in \Lambda_A} \sum_{m=0}^{\nu_\lambda - 1} \frac{A_{\lambda, m}}{(z - \lambda)^{m+1}} . \qquad (24)$$

Ultimately these results allow us to easily evaluate arbitrary functions of nondiagonalizable operators, to which we now turn. (Reference 4 gives more background.)

The meromorphic functional calculus4 gives meaning to arbitrary functions $f(\cdot)$ of any linear operator $A$. Its starting point is the Cauchy-integral-like formula

$$f(A) \equiv \frac{1}{2\pi i} \sum_{\lambda \in \Lambda_A} \oint_{C_\lambda} f(z) \, (zI - A)^{-1} \, dz , \qquad (25)$$

where $C_\lambda$ denotes a sufficiently small counterclockwise contour around $\lambda$ in the complex plane, such that no singularity of the integrand besides the possible pole at $z = \lambda$ is enclosed by the contour.

Invoking Eq. (24) yields the desired formulation

$$f(A) = \sum_{\lambda \in \Lambda_A} \sum_{m=0}^{\nu_\lambda - 1} A_{\lambda, m} \, \frac{1}{2\pi i} \oint_{C_\lambda} \frac{f(z)}{(z - \lambda)^{m+1}} \, dz . \qquad (26)$$

Hence, with the eigenprojectors {Aλ}λΛA in hand, evaluating an arbitrary function of the nondiagonalizable operator A comes down to the evaluation of several residues.

Typically, evaluating Eq. (26) requires less work than one might expect when looking at the equation in its full generality. For example, whenever $f(z)$ is holomorphic (i.e., well behaved) at $z = \lambda$, the residue simplifies to

$$\frac{1}{2\pi i} \oint_{C_\lambda} \frac{f(z)}{(z - \lambda)^{m+1}} \, dz = \frac{f^{(m)}(\lambda)}{m!} ,$$

where $f^{(m)}(\lambda)$ is the $m$th derivative of $f(z)$ evaluated at $z = \lambda$. However, if $f(z)$ has a pole or zero at $z = \lambda$, then it substantially changes the complex contour integration. In the simplest case, when $A$ is diagonalizable and $f(z)$ is holomorphic at $\Lambda_A$, the matrix-valued function reduces to the simple form

$$f(A) = \sum_{\lambda \in \Lambda_A} f(\lambda) \, A_\lambda .$$

Moreover, if $\lambda$ is nondegenerate, then

$$A_\lambda = \frac{ |\lambda\rangle \langle \lambda | }{ \langle \lambda | \lambda \rangle } ,$$

although $\langle \lambda |$ here should be interpreted as the solution to the left eigenequation $\langle \lambda | A = \lambda \langle \lambda |$ and, in general, $\langle \lambda | \neq \bigl( |\lambda\rangle \bigr)^{\dagger}$.

The meromorphic functional calculus agrees with the Taylor-series approach whenever the series converges and agrees with the holomorphic functional calculus of Ref. 69 whenever f(z) is holomorphic at ΛA. However, when both these functional calculi fail, the meromorphic functional calculus extends the domain of f(A) in a way that is key to the following analysis. We show, for example, that within the meromorphic functional calculus, the negative-one power of a singular operator is the Drazin inverse. The Drazin inverse effectively inverts everything that is invertible. Notably, it appears ubiquitously in the new-found solutions to many complexity measures.

How does one use Eq. (26)? It says that the spectral decomposition of $f(A)$ reduces to the evaluation of several residues, where

$$\operatorname{Res}\bigl( g(z), \, z \to \lambda \bigr) \equiv \frac{1}{2\pi i} \oint_{C_\lambda} g(z) \, dz .$$

So, to make progress with Eq. (26), we must evaluate function-dependent residues of the form $\operatorname{Res}\bigl( f(z)/(z-\lambda)^{m+1}, \, z \to \lambda \bigr)$. This is basic complex analysis. Recall that the residue of a complex-valued function $g(z)$ around its isolated pole $\lambda$ of order $n+1$ can be calculated from

$$\operatorname{Res}\bigl( g(z), \, z \to \lambda \bigr) = \frac{1}{n!} \lim_{z \to \lambda} \frac{d^n}{dz^n} \Bigl[ (z - \lambda)^{n+1} g(z) \Bigr] .$$
Equation (26) allows us to explicitly derive the spectral decomposition of powers of an operator. For $f(A) = A^L$, i.e., $f(z) = z^L$, $z = 0$ can be either a zero or a pole of $f(z)$, depending on the value of $L$. In either case, an eigenvalue of $\lambda = 0$ will distinguish itself in the residue calculation of $A^L$ via its unique ability to change the order of the pole (or zero) at $z = 0$.

For example, at this special value of λ and for integer L > 0, λ = 0 induces poles that cancel with the zeros of f(z)=zL, since zL has a zero at z = 0 of order L. For integer L < 0, an eigenvalue of λ = 0 increases the order of the z = 0 pole of f(z)=zL. For all other eigenvalues, the residues will be as expected.

Hence, for any $L$

$$A^L = \sum_{\lambda \in \Lambda_A \setminus \{0\}} \; \sum_{m=0}^{\nu_\lambda - 1} \binom{L}{m} \lambda^{L-m} \, A_{\lambda, m} \; + \; [0 \in \Lambda_A] \sum_{m=0}^{\nu_0 - 1} \delta_{L, m} \, A_{0, m} , \qquad (27)$$

where $\binom{L}{m}$ is the generalized binomial coefficient

$$\binom{L}{m} \equiv \frac{1}{m!} \prod_{n=0}^{m-1} (L - n) , \qquad (28)$$

with $\binom{L}{0} = 1$, and where $[0 \in \Lambda_A]$ is the Iverson bracket. The latter takes value 1 if 0 is an eigenvalue of $A$ and value 0 if not. Equation (27) applies to any linear operator with only isolated singularities in its resolvent.
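The following is a numerical sketch of Eq. (27) on a toy matrix of our own in which the eigenvalue 0 has index 2, so that the Iverson-bracket terms are visible, while the other eigenvalue is simple.

```python
# Sketch (our own toy operator, not from the paper): reconstruct A^L from the
# eigenprojectors and companion operators of Eq. (27) and compare with a direct
# matrix power.
import numpy as np

A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.5]])     # eigenvalues: 0 (index 2) and 1/2 (index 1)
I = np.eye(3)

A_half = np.linalg.matrix_power(A / 0.5, 2)   # Eq. (15) with exponent nu_0 = 2
A_zero = I - A_half                           # Eq. (14)
A_zero_m = [A_zero, A_zero @ A]               # companions A_{0,m}, Eq. (20)

for L in range(6):
    # lambda = 1/2 has index 1, so only the m = 0 term (binomial factor 1);
    # lambda = 0 contributes the delta_{L,m} terms of Eq. (27).
    AL = 0.5 ** L * A_half
    if L < len(A_zero_m):
        AL = AL + A_zero_m[L]
    assert np.allclose(AL, np.linalg.matrix_power(A, L))
```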

If $L$ is a nonnegative integer such that $L \geq \nu_\lambda - 1$ for all $\lambda \in \Lambda_A$, then

$$A^L = \sum_{\lambda \in \Lambda_A} \sum_{m=0}^{\nu_\lambda - 1} \binom{L}{m} \lambda^{L-m} \, A_{\lambda, m} , \qquad (29)$$

where $\binom{L}{m}$ is now reduced to the traditional binomial coefficient $L! / \bigl[ m! \, (L-m)! \bigr]$.

The form of Eq. (27), together with our earlier operator expressions for complexity measures that take on a cascading form, directly leads to the first fully general closed-form expressions for correlation, myopic entropy rates, and remaining state uncertainty, among others, for the broad class of processes that can be generated by HMMs. This will be made explicit in Part II,79 where the consequences will also be unraveled.

The negative-one power $A^{-1}$ of a linear operator is in general not the same as its inverse $\operatorname{inv}(A)$, since the latter need not exist. However, the negative-one power of a linear operator is always defined via Eq. (27)

$$A^{-1} = \sum_{\lambda \in \Lambda_A \setminus \{0\}} \sum_{m=0}^{\nu_\lambda - 1} (-1)^m \, \lambda^{-1-m} \, A_{\lambda, m} . \qquad (30)$$

Notably, when the operator is singular, the operator defined by Eq. (30) is the Drazin inverse $A^D$ of $A$, also known as the $\{1^{\nu_0}, 2, 5\}$-inverse.70 (Note that it is not the same as the Moore–Penrose pseudo-inverse.) The Drazin inverse is usually defined axiomatically, as the operator satisfying certain criteria; in contrast, Ref. 4 derived it naturally as the negative-one power of a singular operator in the meromorphic functional calculus.

Whenever $A$ is invertible, however, $A^{-1} = \operatorname{inv}(A)$. That said, we should not confuse this coincidence with equivalence. More to the point, there is no reason other than historical accidents of notation that the negative-one power should in general be equivalent to the inverse—especially if an operator is not invertible. To avoid confusing $A^{-1}$ with $\operatorname{inv}(A)$, we use the notation $A^D$ for the Drazin inverse of $A$. Still, $A^D = \operatorname{inv}(A)$ whenever $0 \notin \Lambda_A$.

Although Eq. (30) is a constructive way to build the Drazin inverse, it suggests more work than is actually necessary. We derived several simple constructions for it that require only the original operator and the eigenvalue-0 projector. For example, Ref. 4 found that, for any $c \in \mathbb{C} \setminus \{0\}$,

$$A^D = \bigl( A + c \, A_0 \bigr)^{-1} \bigl( I - A_0 \bigr) . \qquad (31)$$

Later, we will also need the decomposition of $(I - W)^D$, as it enters into many closed-form complexity expressions related to accumulated transients—the past–future mutual information among them. Reference 4 showed that

$$(I - T)^D = \bigl( I - T + T_1 \bigr)^{-1} - T_1 \qquad (32)$$

for any stochastic matrix $T$, where $T_1$ is the projection operator associated with $\lambda = 1$. If $T$ is the state-transition matrix of an ergodic process, then the RHS of Eq. (32) becomes especially simple to evaluate, since then $T_1 = |\mathbf{1}\rangle \langle \pi |$.
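The following sketch evaluates Eq. (32) on a small ergodic chain of our own choosing and checks the defining Drazin-inverse properties numerically.

```python
# Sketch (toy chain, not from the paper): build (I - T)^D from Eq. (32) with
# T_1 = |1><pi| and verify the {1^nu_0, 2, 5}-inverse properties.
import numpy as np

T = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])
I = np.eye(3)

# Stationary distribution: left eigenvector of T for eigenvalue 1, normalized.
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()

T1 = np.outer(np.ones(3), pi)                 # |1><pi|, Eq. (36)
D = np.linalg.inv(I - T + T1) - T1            # Eq. (32)

A = I - T
assert np.allclose(A @ D @ A, A)              # {1}-type property (index 1 here)
assert np.allclose(D @ A @ D, D)              # {2}-property
assert np.allclose(A @ D, D @ A)              # {5}-property
```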

Somewhat tangentially, this connects to the fundamental matrix $Z = (I - T + T_1)^{-1}$ used by Ref. 71 in its analysis of Markovian dynamics. More immediately, Eq. (32) plays a prominent role when deriving the excess entropy and the synchronization information. The explicit spectral decomposition is also useful

$$(I - T)^D = \sum_{\lambda \in \Lambda_T \setminus \{1\}} \sum_{m=0}^{\nu_\lambda - 1} \frac{T_{\lambda, m}}{(1 - \lambda)^{m+1}} . \qquad (33)$$

The preceding employed the notation that $A$ is a general linear operator. In the following, we reserve $T$ for the operator of a stochastic transition dynamic, as in the state-to-state transition dynamic of an HMM: $T = \sum_{x \in \mathcal{A}} T^{(x)}$. If the state space is finite and has a stationary distribution, then $T$ has a representation that is a nonnegative row-stochastic—all rows sum to unity—transition matrix.

We are now in a position to summarize several useful properties of the projection operators of any row-stochastic matrix $T$. Naturally, if one uses column-stochastic instead of row-stochastic matrices, all results can be translated by simply taking the transpose of every line in the derivations. [Recall that $(ABC)^\top = C^\top B^\top A^\top$.]

The fact that all elements of the transition matrix are real-valued guarantees that, for each $\lambda \in \Lambda_T$, its complex conjugate $\overline{\lambda}$ is also in $\Lambda_T$. Moreover, the spectral projection operator associated with the complex conjugate of $\lambda$ is $T_\lambda$'s complex conjugate

$$T_{\overline{\lambda}} = \overline{T_\lambda} .$$

This also implies that $T_\lambda$ is real if $\lambda$ is real.

If the dynamic induced by $T$ has a stationary distribution over the state space, then $T$'s spectral radius is unity and all of its eigenvalues lie on or within the unit circle in the complex plane. The maximal eigenvalues have unity magnitude and $1 \in \Lambda_T$. Moreover, an extension of the Perron–Frobenius theorem guarantees that eigenvalues on the unit circle have algebraic multiplicity equal to their geometric multiplicity. And so, $\nu_\zeta = 1$ for all $\zeta \in \{ \lambda \in \Lambda_T : |\lambda| = 1 \}$.

T's index-one eigenvalue λ = 1 is associated with stationarity of the hidden Markov model. T's other eigenvalues on the unit circle are roots of unity and correspond to deterministic periodicities within the process.

If $T$ is row-stochastic, then by definition

$$T \, |\mathbf{1}\rangle = |\mathbf{1}\rangle ,$$

where $|\mathbf{1}\rangle$ is the column vector of all ones. Hence, via the general eigenprojector construction Eq. (19) and the general orthogonality condition Eq. (18), we find that

$$T_\lambda \, |\mathbf{1}\rangle = \delta_{\lambda, 1} \, |\mathbf{1}\rangle . \qquad (34)$$

This shows that $T$'s projection operator $T_1$ is row-stochastic, whereas each row of every other projection operator must sum to zero. This can also be viewed as a consequence of conservation of probability for dynamics over Markov models.
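Equation (34) is easy to check numerically. In the sketch below—using a diagonalizable toy chain of our own, so that each eigenprojector is a simple outer product—the rows of $T_1$ sum to one while the rows of every other eigenprojector sum to zero.

```python
# Sketch (our own diagonalizable toy chain) verifying Eq. (34) row by row.
import numpy as np

T = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])

evals, R = np.linalg.eig(T)          # columns of R: right eigenvectors
L_dual = np.linalg.inv(R)            # rows of L_dual: dual left eigenvectors

ones = np.ones(3)
for k, lam in enumerate(evals):
    T_lam = np.outer(R[:, k], L_dual[k, :])      # eigenprojector for simple lam
    expected = ones if np.isclose(lam, 1.0) else np.zeros(3)
    assert np.allclose(T_lam @ ones, expected)
```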

If unity is the only eigenvalue of $\Lambda_T$ on the unit circle, then the process has no deterministic periodicities. In this case, every initial condition leads to a stationary asymptotic distribution. The expected stationary distribution $\pi_\alpha$ from any initial distribution $\alpha$ is

$$\langle \pi_\alpha | = \langle \alpha | \, T_1 . \qquad (35)$$
An attractive feature of Eq. (35) is that it holds even for nonergodic processes—those with multiple stationary components.

When the stochastic process is ergodic (one stationary component), then $a_1 = 1$ and there is only one stationary distribution $\pi$. The $T_1$ projection operator becomes

$$T_1 = |\mathbf{1}\rangle \langle \pi | , \qquad (36)$$

even if there are deterministic periodicities. Deterministic periodicities imply that different initial conditions may still induce different asymptotic oscillations, according to $\{ T_\lambda : |\lambda| = 1 \}$. In the case of ergodic processes without deterministic periodicities, every initial condition relaxes to the same steady-state distribution over the hidden states: $\langle \pi_\alpha | = \langle \alpha | T_1 = \langle \pi |$, regardless of $\alpha$, so long as $\alpha$ is a properly normalized probability distribution.
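A brief numerical check of Eqs. (35) and (36), reusing the same toy chain as above (aperiodic and ergodic, so every row of the limit of $T^L$ is the stationary distribution).

```python
# Sketch (our own toy chain): T^L -> T_1 = |1><pi|, and <alpha|T_1 = <pi| for
# any normalized initial distribution alpha.
import numpy as np

T = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])

T_inf = np.linalg.matrix_power(T, 2000)       # numerically ~ T_1
pi = T_inf[0]                                 # each row of T_1 is <pi|

assert np.allclose(T_inf, np.outer(np.ones(3), pi))          # Eq. (36)
for alpha in (np.array([1., 0., 0.]), np.array([0.2, 0.3, 0.5])):
    assert np.allclose(alpha @ T_inf, pi)                     # Eq. (35)
```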

As suggested in Ref. 4, the new results above extend spectral theory to arbitrary functions of nondiagonalizable operators in a way that contributes to a spectral theory of weighted digraphs beyond the purview of spectral graph theory proper.72 Moreover, this enables new analyses. In particular, the spectra of undirected graphs and their graph Laplacian matrices have been studied extensively and continue to be, in part because the spectral theory for normal operators applies directly to both undirected graphs and their Laplacians. Digraph spectra have also been studied,73 but to a much lesser extent—again, in part because the spectral theorem does not typically apply, which renders this case much more complicated. Thus, the spectral theory of nonnormal and nondiagonalizable operators offers new opportunities. This not only hints at the importance of extracting eigenvalues from directed graph motifs, but also begins to show how eigenvectors and eigenprojectors can be built up iteratively from directed graph clusters.

The next Secs. VIII A and VIII B show how spectra and eigenprojectors can be intuited, computed, and applied in the analysis of complex systems. These techniques often make the problem at hand analytically tractable, and they will be used in the examples of Part II79 to give exact expressions for complexity measures.

Consider a directed graph structure with cascading dependencies: one cluster of nodes feeds back only to itself according to matrix A and feeds forward to another cluster of nodes according to matrix B, which is not necessarily a square matrix. The second cluster feeds back only to itself according to matrix C. The latter node cluster might also feed forward to another cluster, but such considerations can be applied iteratively.

The simple situation just described is summarized, with proper index permutation, by a block matrix of the form

$$W = \begin{bmatrix} A & B \\ 0 & C \end{bmatrix} .$$

In this case, it is easy to see that

$$zI - W = \begin{bmatrix} zI - A & -B \\ 0 & zI - C \end{bmatrix} \qquad (37)$$

and

$$\det(zI - W) = \det(zI - A) \, \det(zI - C) . \qquad (38)$$

And so, $\Lambda_W = \Lambda_A \cup \Lambda_C$. This simplification presents an opportunity to read off eigenvalues from clustered graph structures that often appear in practice, especially for transient graph structures associated with synchronization, as with transient mixed-state transitions in MSPs.
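A quick numerical check, with random blocks of our own choosing, that the spectrum of a block upper-triangular matrix is the union of the block spectra.

```python
# Sketch (random toy blocks): Lambda_W = Lambda_A U Lambda_C for W = [[A, B], [0, C]].
import numpy as np

rng = np.random.default_rng(0)
A, B, C = rng.random((3, 3)), rng.random((3, 2)), rng.random((2, 2))
W = np.block([[A, B], [np.zeros((2, 3)), C]])

union = np.concatenate([np.linalg.eigvals(A), np.linalg.eigvals(C)])
evW = np.linalg.eigvals(W)
for lam in union:
    assert np.isclose(np.abs(evW - lam).min(), 0.0, atol=1e-8)
```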

Cyclic cluster structures (say, of length $N$ and edge weights $\alpha_1$ through $\alpha_N$) yield especially simple spectra

$$\Lambda_A = \left\{ \Bigl( \prod_{i=1}^{N} \alpha_i \Bigr)^{1/N} e^{i n 2\pi / N} \right\}_{n=0}^{N-1} . \qquad (39)$$

That is, the eigenvalues are simply the $N$th roots of the product of all of the edge weights. See Fig. 3(a).
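Equation (39) can be verified directly; the weights below are arbitrary values of our own, not from any example in the paper.

```python
# Sketch: an N-cycle with edge weights alpha_1..alpha_N has eigenvalues equal to
# the N-th roots of their product, Eq. (39).
import numpy as np

alphas = np.array([0.3, 1.2, 0.8, 0.5])
N = len(alphas)
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = alphas[i]            # edge from node i to node i+1 (mod N)

r = np.prod(alphas) ** (1.0 / N)
expected = r * np.exp(2j * np.pi * np.arange(N) / N)
ev = np.linalg.eigvals(A)
for lam in expected:
    assert np.isclose(np.abs(ev - lam).min(), 0.0, atol=1e-10)
```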

FIG. 3. (a) Weighted directed graph (digraph) of the feedback matrix $A$ of a cyclic cluster structure that contributes eigenvalues $\Lambda_A = \{ (\prod_{i=1}^{N} \alpha_i)^{1/N} e^{i n 2\pi/N} \}_{n=0}^{N-1}$ with algebraic multiplicities $a_\lambda = 1$ for all $\lambda \in \Lambda_A$. (b) Weighted digraph of the feedback matrix $A$ of a doubly cyclic cluster structure that contributes eigenvalues $\Lambda_A = \{0\} \cup \{ (\alpha [ (\prod_{i=1}^{N} \beta_i) + (\prod_{i=1}^{N} \gamma_i) ])^{\frac{1}{N+1}} e^{i n \frac{2\pi}{N+1}} \}_{n=0}^{N}$ with algebraic multiplicities $a_0 = N - 1$ and $a_\lambda = 1$ for $\lambda \neq 0$. (This eigenvalue "rule" depends on having the same number of β-transitions as γ-transitions.) The 0-eigenvalue only has geometric multiplicity of $g_0 = 1$, so the structure is nondiagonalizable for $N > 2$. Nevertheless, the generalized eigenvectors are easy to construct. The spectral analysis of the cluster structure in (b) suggests more general rules that can be gleaned from reading off eigenvalues from digraph clusters; e.g., if a chain of α's appears in the bisecting path.


Similar rules for reading off spectra from other cluster structures exist. Although we cannot list them exhaustively here, we give another simple but useful rule in Fig. 3(b). It also indicates the ubiquity of nondiagonalizability in weighted digraph structures. This second rule is suggestive of further generalizations where spectra can be read off from common digraph motifs.

We just outlined how clustered directed graph structures yield simplified joint spectra. Is there a corresponding simplification of the spectral projection operators? In fact, there is, and it leads to an iterative construction of "higher-level" projectors from "lower-level" clustered components. In contrast to the joint spectrum, which completely ignores the feedforward matrix B, the emergent projectors do require B to pull the associated eigencontributions into the generalized setting. Figure 4 summarizes the results for the simple case of nondegenerate eigenvalues. The general case is constructed similarly.

FIG. 4. Construction of $W$-eigenprojectors $W_\lambda$ from lower-level $A$-projectors and $C$-projectors, when $W = \bigl[\begin{smallmatrix} A & B \\ 0 & C \end{smallmatrix}\bigr]$. (Recall that $(\lambda I - A)^{-1}$ and $(\lambda I - C)^{-1}$ can be constructed from the lower-level projectors.) For simplicity, we assume that the algebraic multiplicity $a_\lambda = 1$ in each of these cases.


The preceding results imply a number of algorithms, both for analytic and for numerical calculations. Most directly, they show that eigenanalysis can be partitioned into a series of simpler problems that are later combined into a final solution. In addition to more efficient serial computation, there are also opportunities to parallelize the numerical algorithms that compute the eigenprojectors, whether the projectors are computed directly, say from Eq. (15), or from right and left eigenvectors and generalized eigenvectors. Such opportunities for further optimization are perhaps rare, considering how extremely well developed the field of numerical linear algebra already is. That said, the automation now possible will be key to applying our analysis methods to real systems with immense data produced from very high-dimensional state spaces.

Surprisingly, many questions we ask about a structured stochastic nonlinear process imply a linear dynamic over a preferred hidden state space. These questions often concern predictability and prediction. To make predictions about the real world, though, it is not sufficient to have a model of the world. Additionally, the predictor must synchronize their model to the real-world data that has been observed up to the present time. This metadynamic of synchronization—the transition structure among belief states—is intrinsically linear, but is typically nondiagonalizable.

We presented results for the observed processes generated by HMMs. However, the results easily apply to other state-based models, including observable operator models (OOMs)74 and generalized HMMs (GHMMs).45 In each case, the observation-induced synchronizing metadynamic is still an HMM. It will also be useful to adapt our methods to open quantum models, where a density matrix evolves via environmental influence and a protocol for partial measurements (POVMs) induces a synchronizing (meta)dynamic.

Recall organizational Tables I and II from the Introduction. After all the intervening detail, let's consider a more nuanced formulation. We saw that once we frame questions in terms of the hidden linear transition dynamic, complexity measures are usually either of the cascading or accumulation type. Scalar complexity measures often accumulate only the interesting transient structure that rides on top of the asymptotics. Skimming off the asymptotics led to the Drazin inverse. Modified accumulation turned complexity scalars into complexity functions. Tables III and IV summarize the results. Notably, Table IV gives closed-form formulae for many complexity measures that previously were only expressed as infinite sums over functions of probabilities.

TABLE III.

Once we identify the hidden linear dynamic behind our questions, most are either of the cascading or the accumulating type. Moreover, if a complexity measure accumulates transients, the Drazin inverse is likely to appear. Modulated accumulation can be a helpful theoretical tool, since all derivatives and integrals of the cascading type can be calculated if we know the modulated accumulation with $z$. With $z \in \mathbb{C}$, modulated accumulation involves an operator-valued $z$-transform; with $z = e^{i\omega}$ and $\omega \in \mathbb{R}$, it involves an operator-valued Fourier transform.

                           |                        | Discrete time                                          | Continuous time
Derivatives of cascading ↑ | Cascading              | $\langle\cdot|\, A^L \,|\cdot\rangle$                  | $\langle\cdot|\, e^{tG} \,|\cdot\rangle$
Integrals of cascading ↓   | Accumulated transients | $\langle\cdot|\, \sum_L (A^L - A_1^L) \,|\cdot\rangle$ | $\langle\cdot|\, \int (e^{tG} - G_0)\, dt \,|\cdot\rangle$
                           | Modulated accumulation | $\langle\cdot|\, \sum_L (zA)^L \,|\cdot\rangle$        | $\langle\cdot|\, \int (z e^{G})^t \, dt \,|\cdot\rangle$
TABLE IV.

Genres of complexity questions given in order of increasing sophistication; summary of Part I and a preview of Part II.79 Each implies a different linear transition dynamic. Closed-form formulae are given for several complexity measures, showing the similarity among them down the same column. Formulae in the same row have matching bra-ket pairs. The similarity within a column corresponds to similarity in the time evolution implied by the question type. The similarity within a row corresponds to similarity in question genre. (The three rightmost columns give example questions of the cascading, accumulated-transients, and modulated-accumulation types, respectively.)

Genre               | Implied linear transition dynamic                | Cascading                                                                                    | Accumulated transients                                                          | Modulated accumulation
Overt observational | Transition matrix T of any HMM                   | Correlation, $\gamma(L)$: $\langle \pi \bar{A} |\, T^{|L|-1} \,| A \mathbf{1}\rangle$        | Green–Kubo transport coefficients                                               | Power spectra, $P(\omega)$: $2\,\langle \pi \bar{A} |\, (e^{i\omega} I - T)^{-1} \,| A \mathbf{1}\rangle$
Predictability      | Transition matrix W of the MSP of any HMM        | Myopic entropy rate, $h_\mu(L)$: $\langle \delta_\pi |\, W^{L-1} \,| H(W^{\mathcal{A}})\rangle$ | Excess entropy, $E$: $\langle \delta_\pi |\, (I - W)^D \,| H(W^{\mathcal{A}})\rangle$ | $E(z)$: $\langle \delta_\pi |\, (zI - W)^{-1} \,| H(W^{\mathcal{A}})\rangle$
Optimal prediction  | Transition matrix W of the MSP of the ϵ-machine  | Causal state uncertainty, $\mathcal{H}^+(L)$: $\langle \delta_\pi |\, W^L \,| H[\eta]\rangle$   | Synchronization info, $\mathbf{S}$: $\langle \delta_\pi |\, (I - W)^D \,| H[\eta]\rangle$ | $\mathbf{S}(z)$: $\langle \delta_\pi |\, (zI - W)^{-1} \,| H[\eta]\rangle$

Let us remind ourselves: why, in this analysis, were nondiagonalizable dynamics noteworthy? They are noteworthy because the metadynamics of even diagonalizable dynamics are generically nondiagonalizable. And this is typically due to the 0-eigenvalue subspace that is responsible for the initial, ephemeral epoch of symmetry collapse. The metadynamics of transitioning between belief states demonstrated this explicitly. However, other metadynamics beyond those focused on prediction are also generically nondiagonalizable. For example, in the analysis of quantum compression, crypticity, and other aspects of hidden structure, the relevant linear dynamic is not the MSP. Instead, it is a nondiagonalizable structure that can be fruitfully analyzed with the same generalized spectral theory of nonnormal operators.4

Using the appropriate dynamic for common complexity questions and the meromorphic functional calculus to overcome nondiagonalizability, the sequel (Part II)79 goes on to develop closed-form expressions for complexity measures as simple functions of the corresponding transition dynamic of the implied HMM.
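As a final illustration of how such closed forms are evaluated in practice, the sketch below checks the accumulated-transients identity underlying the Table IV column—summing the transient part of a cascade reproduces the Drazin-inverse expression. The matrix W, the start vector, and the ket are toy stand-ins of our own, not an actual mixed-state presentation.

```python
# Sketch (toy stand-ins, not an actual MSP): sum_L <delta_pi|(W^L - W_1)|v>
# equals <delta_pi|(I - W)^D|v>, the form used for E and S in Table IV.
import numpy as np

W = np.array([[0.0, 0.6, 0.4],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])       # any row-stochastic, aperiodic toy dynamic
v = np.array([0.9, 0.4, 0.1])         # stand-in for the entropy ket |H(W^A)>
delta_pi = np.array([1.0, 0.0, 0.0])  # start distribution peaked on one state
I = np.eye(3)

W_inf = np.linalg.matrix_power(W, 5000)          # ~ W_1 = |1><pi|
drazin = np.linalg.inv(I - W + W_inf) - W_inf    # Eq. (32)

closed_form = delta_pi @ drazin @ v
series = sum(delta_pi @ (np.linalg.matrix_power(W, L) - W_inf) @ v
             for L in range(2000))
assert np.isclose(closed_form, series, atol=1e-8)
```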

J.P.C. thanks the Santa Fe Institute for its hospitality. The authors thank Chris Ellison, Ryan James, John Mahoney, Alec Boyd, and Dowman Varn for the helpful discussions. This material is based upon work supported by, or in part by, the U.S. Army Research Laboratory and the U.S. Army Research Office under Contract Nos. W911NF-12-1-0234, W911NF-13-1-0340, and W911NF-13-1-0390.

1. J. P. Crutchfield and D. P. Feldman, "Regularities unseen, randomness observed: Levels of entropy convergence," Chaos 13(1), 25–54 (2003).
2. S. E. Marzen and J. P. Crutchfield, "Nearly maximally predictive features and their dimensions," Phys. Rev. E 95(5), 051301(R) (2017).
3. J. P. Crutchfield, C. J. Ellison, and P. M. Riechers, "Exact complexity: The spectral decomposition of intrinsic computation," Phys. Lett. A 380(9), 998–1002 (2016).
4. P. M. Riechers and J. P. Crutchfield, "Beyond the spectral theorem: Decomposing arbitrary functions of nondiagonalizable operators," arXiv:1607.06526 [math-ph].
5. While we follow Shannon12 in this, it differs from the more widely used state-labeled HMMs.
6. C. Moore and J. P. Crutchfield, "Quantum automata and quantum grammars," Theor. Comput. Sci. 237(1–2), 275–306 (2000).
7. L. A. Clark, W. Huang, T. M. Barlow, and A. Beige, "Hidden quantum Markov models and open quantum systems with instantaneous feedback," New J. Phys. 14, 143–151 (2015).
8. O. Penrose, Foundations of Statistical Mechanics: A Deductive Treatment (Pergamon Press, Oxford, 1970).
9. U. Seifert, "Stochastic thermodynamics, fluctuation theorems and molecular machines," Rep. Prog. Phys. 75, 126001 (2012).
10. Nonequilibrium Statistical Physics of Small Systems: Fluctuation Relations and Beyond, edited by R. Klages, W. Just, and C. Jarzynski (Wiley, New York, 2013).
11. J. Bechhoefer, "Hidden Markov models for stochastic thermodynamics," New J. Phys. 17, 075003 (2015).
12. C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J. 27, 379–423, 623–656 (1948).
13. T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. (Wiley-Interscience, New York, 2006).
14. L. R. Rabiner and B. H. Juang, "An introduction to hidden Markov models," IEEE ASSP Mag. 3, 4–16 (1986).
15. R. A. Roberts and C. T. Mullis, Digital Signal Processing (Addison-Wesley, Reading, Massachusetts, 1987).
16. L. R. Rabiner, "A tutorial on hidden Markov models and selected applications," IEEE Proc. 77, 257 (1989).
17. M. O. Rabin, "Probabilistic automata," Inf. Control 6, 230–245 (1963).
18. W. J. Ewens, Mathematical Population Genetics, 2nd ed. (Springer, New York, 2004).
19. M. Nowak, Evolutionary Dynamics: Exploring the Equations of Life (Belnap Press, New York, 2006).
20. J. P. Crutchfield, "Between order and chaos," Nat. Phys. 8, 17–24 (2012).
21. P. Stoica and R. L. Moses, Spectral Analysis of Signals (Pearson Prentice Hall, Upper Saddle River, New Jersey, 2005).
22. R. W. Hamming, Digital Filters, 3rd ed. (Dover Publications, New York, 1997).
23. M. M. Woolfson, An Introduction to X-Ray Crystallography (Cambridge University Press, Cambridge, United Kingdom, 1997).
24. M. S. Green, "Markoff random processes and the statistical mechanics of time-dependent phenomena. II. Irreversible processes in fluids," J. Chem. Phys. 22(3), 398–413 (1954).
25. R. Zwanzig, "Time-correlation functions and transport coefficients in statistical mechanics," Annu. Rev. Phys. Chem. 16(1), 67–102 (1965).
26. J. J. Binney, N. J. Dowrick, A. J. Fisher, and M. E. J. Newman, The Theory of Critical Phenomena (Oxford University Press, Oxford, 1992).
27. A. N. Kolmogorov, "Entropy per unit time as a metric invariant of automorphisms," Dokl. Akad. Nauk. SSSR 124, 754 (1959) (in Russian); Math. Rev. 21, no. 2035b.5.
28. J. P. Crutchfield and N. H. Packard, "Symbolic dynamics of noisy chaos," Physica D 7(1–3), 201–223 (1983).
29. A. del Junco and M. Rahe, "Finitary codings and weak Bernoulli partitions," Proc. AMS 75, 259 (1979).
30. R. Shaw, The Dripping Faucet as a Model Chaotic System (Aerial Press, Santa Cruz, California, 1984).
31. P. Grassberger, "Toward a quantitative theory of self-generated complexity," Int. J. Theor. Phys. 25, 907 (1986).
32. W. Bialek, I. Nemenman, and N. Tishby, "Predictability, complexity, and learning," Neural Comput. 13(11), 2409–2463 (2001).
33. R. G. James, C. J. Ellison, and J. P. Crutchfield, "Anatomy of a bit: Information in a time series observation," Chaos 21(3), 037109 (2011).
34. P. M. Ara, R. G. James, and J. P. Crutchfield, "The elusive present: Hidden past and future dependence and why we build models," Phys. Rev. E 93(2), 022143 (2016).
35. O. Pfante, N. Bertschinger, E. Olbrich, N. Ay, and J. Jost, "Comparison between different methods of level identification," Adv. Complex Syst. 17(02), 1450007 (2014).
36. J. P. Crutchfield and K. Young, "Inferring statistical complexity," Phys. Rev. Lett. 63, 105–108 (1989).
37. A. B. Boyd, D. Mandal, and J. P. Crutchfield, "Leveraging environmental correlations: The thermodynamics of requisite variety," J. Stat. Phys. 167(6), 1555–1585 (2017).
38. S. Still, J. P. Crutchfield, and C. J. Ellison, "Optimal causal inference: Estimating stored information and approximating causal architecture," Chaos 20(3), 037111 (2010).
39. F. Creutzig, A. Globerson, and N. Tishby, "Past-future information bottleneck in dynamical systems," Phys. Rev. E 79(4), 041925 (2009).
40. S. E. Marzen and J. P. Crutchfield, "Predictive rate-distortion for infinite-order Markov processes," J. Stat. Phys. 163(6), 1312–1338 (2016).
41. C. J. Ellison and J. P. Crutchfield, "States of states of uncertainty" (unpublished).
42. D. Wolpert, E. Libby, J. Grochow, and S. Dedeo, "The many faces of state space compression," in From Matter to Life, edited by S. Walker, P. Davies, and G. Ellis (Cambridge University Press, NY, 2017), pp. 199–243.
43. J. R. Mahoney, C. J. Ellison, R. G. James, and J. P. Crutchfield, "How hidden are hidden processes? A primer on crypticity and entropy convergence," Chaos 21(3), 037112 (2011).
44. J. P. Crutchfield, "The calculi of emergence: Computation, dynamics, and induction," Physica D 75, 11–54 (1994).
45. D. R. Upper, "Theory and algorithms for hidden Markov models and generalized hidden Markov models," Ph.D. thesis (University of California, Berkeley/Microfilms Intl., Ann Arbor, Michigan, 1997).
46. W. Lohr and N. Ay, "Non-sufficient memories that are sufficient for prediction," in Complex Sciences, Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering Vol. 4 (Springer, 2009), pp. 265–276.
47. J. Ruebeck, R. G. James, J. R. Mahoney, and J. P. Crutchfield, "Prediction and generation of binary Markov processes: Can a finite-state fox catch a Markov mouse?," Chaos 28(1), 013109 (2018).
48. J. P. Crutchfield, C. J. Ellison, J. R. Mahoney, and R. G. James, "Synchronization and control in intrinsic and designed computation: An information-theoretic analysis of competing models of stochastic computation," Chaos 20(3), 037105 (2010).
49. N. Travers and J. P. Crutchfield, "Exact synchronization for finite-state sources," J. Stat. Phys. 145(5), 1181–1201 (2011).
50. N. Travers and J. P. Crutchfield, "Asymptotic synchronization for finite-state sources," J. Stat. Phys. 145(5), 1202–1223 (2011).
51. J. R. Mahoney, C. Aghamohammadi, and J. P. Crutchfield, "Occam's quantum strop: Synchronizing and compressing classical cryptic processes via a quantum channel," Sci. Rep. 6, 20495 (2016).
52. B. Vanluyten, J. C. Willems, and B. De Moor, "Equivalence of state representations for hidden Markov models," Syst. Control Lett. 57(5), 410–419 (2008).
53. R. B. Ash, Information Theory, Dover Books on Advanced Mathematics (Dover Publications, 1965).
54. C. R. Shalizi and J. P. Crutchfield, "Computational mechanics: Pattern and prediction, structure and simplicity," J. Stat. Phys. 104, 817–879 (2001).
55. The automata theory would refer to a uHMM as a probabilistic deterministic finite automaton.76 The awkward terminology does not recommend itself.
56. J. E. Hopcroft and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation (Addison-Wesley, Reading, 1979).
57. T. Nishikawa and A. E. Motter, "Synchronization is optimal in nondiagonalizable networks," Phys. Rev. E 73, 065106 (2006).
58. D. Blackwell, The Entropy of Functions of Finite-State Markov Chains (Publishing House of the Czechoslovak Academy of Sciences, Prague, 1957), Vol. 28, pp. 13–20.
59. J. P. Crutchfield, C. J. Ellison, and J. R. Mahoney, "Time's barbed arrow: Irreversibility, crypticity, and stored information," Phys. Rev. Lett. 103(9), 094101 (2009).
60. C. J. Ellison, J. R. Mahoney, and J. P. Crutchfield, "Prediction, retrodiction, and the amount of information stored in the present," J. Stat. Phys. 136(6), 1005–1034 (2009).
61. P. M. Riechers and J. P. Crutchfield, "Power spectra of stochastic processes from transition matrices of hidden Markov models" (unpublished).
62. Averaging over t invokes unconditioned word probabilities that must be calculated using the stationary probability π over the recurrent states. Effectively, this ignores any transient nonstationarity that may exist in a process, since only the recurrent part of the HMM presentation plays a role in the autocorrelation function. One practical lesson is that if T has transient states, they might as well be trimmed prior to such a calculation.
63. P. Gaspard and X. Wang, "Noise, chaos, and (ε, τ)-entropy per unit time," Phys. Rep. 235(6), 291–343 (1993).
64. S. Marzen, M. R. DeWeese, and J. P. Crutchfield, "Time resolution dependence of information measures for spiking neurons: Scaling and universality," Front. Comput. Neurosci. 9, 105 (2015).
65. S. Marzen and J. P. Crutchfield, "Informational and causal architecture of continuous-time renewal processes," J. Stat. Phys. 168(1), 109–127 (2017).
66. S. Marzen and J. P. Crutchfield, "Structure and randomness of continuous-time discrete-event processes," J. Stat. Phys. 169(2), 303–315 (2017).
67. J. R. Mahoney, C. J. Ellison, and J. P. Crutchfield, "Information accessibility and cryptic processes," J. Phys. A: Math. Theor. 42, 362002 (2009).
68. J. R. Mahoney, C. J. Ellison, and J. P. Crutchfield, "Information accessibility and cryptic processes: Linear combinations of causal states," e-print arXiv:0906.5099 [cond-mat].
69. N. Dunford, "Spectral theory I. Convergence to projections," Trans. Am. Math. Soc. 54(2), 185–217 (1943).
70. A. Ben-Israel and T. N. E. Greville, Generalized Inverses: Theory and Applications, CMS Books in Mathematics (Springer, 2003).
71. J. G. Kemeny and J. L. Snell, Finite Markov Chains (Springer, New York, 1960), Vol. 356.
72. D. M. Cvetkovic, M. Doob, and H. Sachs, Spectra of Graphs: Theory and Applications, 3rd ed. (Wiley, New York, New York, 1998).
73. R. A. Brualdi, "Spectra of digraphs," Linear Algebra Appl. 432(9), 2181–2213 (2010).
74. H. Jaeger, "Observable operator models for discrete stochastic time series," Neural Comput. 12(6), 1371–1398 (2000).
75. V. Balasubramanian, "Equivalence and reduction of hidden Markov models," Technical Report AITR-1370, 1993.
76. M. Sipser, Introduction to the Theory of Computation (Cengage Learning, 2012).
77. J. J. Birch, "Approximations for the entropy for functions of Markov chains," Ann. Math. Stat. 33(2), 930–938 (1962).
78. N. F. Travers, "Exponential bounds for convergence of entropy rate approximations in hidden Markov models satisfying a path-mergeability condition," Stochastic Processes Appl. 124(12), 4149–4170 (2014).
79. P. M. Riechers and J. P. Crutchfield, Chaos (to be published).