Virtually all questions that one can ask about the behavioral and structural complexity of a stochastic process reduce to a linear algebraic framing of a time evolution governed by an appropriate hidden-Markov process generator. Each type of question—correlation, predictability, predictive cost, observer synchronization, and the like—induces a distinct generator class. Answers are then functions of the class-appropriate transition dynamic. Unfortunately, these dynamics are generically nonnormal, nondiagonalizable, singular, and so on. Tractably analyzing these dynamics relies on adapting the recently introduced meromorphic functional calculus, which specifies the spectral decomposition of functions of nondiagonalizable linear operators, even when the function's poles and zeros coincide with the operator's spectrum. Along the way, we establish special properties of the spectral projection operators that demonstrate how they capture the organization of subprocesses within a complex system. By circumventing the spurious infinities of alternative calculi, this leads in the sequel, Part II [P. M. Riechers and J. P. Crutchfield, Chaos 28, 033116 (2018)], to the first closed-form expressions for complexity measures, couched either in terms of the Drazin inverse (negative-one power of a singular operator) or the eigenvalues and projection operators of the appropriate transition dynamic.
For well over a century, science compared the randomness in physical systems via their temperatures or their thermodynamic entropies. These are measures of energy disorder. Using them, we say that one system is more random (hotter or more entropic) than another. Curiously, even today, we do not know how to compare two physical systems in terms of how organized they are. Such comparisons are particularly important when the systems of interest do not have well-defined energies—as found across mathematics and the sciences, from abstract dynamical systems to economic and social systems. This is what the endeavor of exploring complexity measures addresses: developing quantities that allow one to compare how nonlinear systems are structured, how they store and process information, and how they intrinsically compute. To date, complexity measures have been estimated empirically from experimental measurements, from large-scale simulations that generate synthetic data, or theoretically in the very few cases that are analytically tractable. We show that this arduous and limited state of affairs is no longer necessary, if one can theoretically deduce or empirically estimate a statistical representation called a hidden Markov model (HMM). We provide analytic, closed-form expressions for almost all complexity measures of processes generated by hidden Markov models.
I. INTRODUCTION
Complex systems—that is, many-body systems with strong interactions—are usually observed through low-resolution feature detectors. The consequence is that their hidden structure is, at best, only revealed over time. Since individual observations cannot capture the full resolution of each degree of freedom, let alone a sufficiently full set of them, the measurement time series often appear stochastic and non-Markovian, exhibiting long-range correlations. Empirical challenges aside, restricting to the purely theoretical domain, even finite systems can appear quite complicated. Despite admitting finite descriptions, stochastic processes with sofic support, to take one example, exhibit infinite-range dependencies among the chain of random variables they generate.1 While such infinite-correlation processes are legion in complex physical and biological systems, even approximately analyzing them is generally appreciated as difficult, if not impossible. Generically, even finite systems lead to uncountably infinite sets of predictive features.2 These facts seem to put physical sciences' most basic goal—prediction—out of reach.
We aim to show that this direct, but sobering conclusion is too bleak. Rather, there is a collection of constructive methods that address the hidden structure and the challenges associated with predicting complex systems. This follows up on our recent introduction of a functional calculus that uncovered new relationships among supposedly different complexity measures3 and that demonstrated the need for a generalized spectral theory to answer such questions.4 Those efforts yielded elegant, closed-form solutions for complexity measures that, when compared, offered insight into the overall theory of complexity measures. Here, providing the necessary background for and greatly expanding those results, we show that different questions regarding correlation, predictability, and prediction each require their own analytical structures, expressed as various kinds of hidden transition dynamic. The resulting transition dynamic among hidden variables summarizes symmetry breaking, synchronization, and information processing, for example. Each of these metadynamics, though, is built up from the original given system.
The shift in perspective that allows the new level of tractability begins by recognizing that—beyond their ability to generate many sophisticated processes of interest—hidden Markov models can be treated as exact mathematical objects when analyzing the processes they generate. Crucially, and especially when addressing nonlinear processes, most questions that we ask imply a linear transition dynamic over some hidden state space. Speaking simply, something happens, then it evolves linearly in time, then we snapshot a selected characteristic. This broad type of sequential questioning cascades, in the sense that the influence of the initial preparation cascades through state space as time evolves, affecting the final measurement. Alternatively, other, complementary kinds of questioning involve accumulating such cascades. The linear algebra underlying either kind is highlighted in Table I in terms of an appropriate discrete-time transition operator T or a continuous-time generator G of time evolution.
Having identified the hidden linear dynamic, either a discrete-time operator T or a continuous-time generator G, quantitative questions tend to be of either the cascading or the accumulating type. What changes between distinct questions are the bra and ket in these dot products: the bra encodes the initial setup and the ket the final observation.
Linear algebra underlying complexity

Question type | Discrete time | Continuous time
---|---|---
Cascading | $\langle \cdot \,|\, T^{L} \,|\, \cdot \rangle$ | $\langle \cdot \,|\, e^{tG} \,|\, \cdot \rangle$
Accumulating | $\sum_{L} \langle \cdot \,|\, T^{L} \,|\, \cdot \rangle$ | $\int \! dt \; \langle \cdot \,|\, e^{tG} \,|\, \cdot \rangle$
In this way, deploying linear algebra to analyze complex systems relies on identifying an appropriate hidden state space. And, in turn, the latter depends on the genre of the question. Here, we focus on closed-form expressions for a process' complexity measures. This determines what the internal system setup and the final detection should be. We show that complexity questions fall into three subgenres and, for each of these, we identify the appropriate linear dynamic and closed-form expressions for several of the key questions in each genre. See Table II. The burden of the following is to explain the table in detail. We return to a much-elaborated version at the end.
Question genres (leftmost column) about process complexity listed with increasing sophistication. Each genre implies a different linear transition dynamic (rightmost column). Observational questions concern the superficial, given dynamic. Predictability questions are about the observation-induced dynamic over distributions; that is, over states used to generate the superficial dynamic. Prediction questions address the dynamic over distributions over a process' causally equivalent histories. Generation questions concern the dynamic over any nonunifilar presentation and observation-induced dynamics over its distributions. MSP is the mixed-state presentation.
Questions and their linear dynamics

Genre | Measures | Quantities | Hidden dynamic
---|---|---|---
Observation | Correlations, power spectra | $\gamma(L)$, $P(\omega)$ | HMM matrix $T$
Predictability | Myopic entropy, excess entropy | $h_\mu(L)$, $E$, $\mathbf{T}$ | HMM MSP matrix $W$
Prediction | Causal synchrony | $\mathcal{H}(L)$, $\mathcal{S}$, $C_\mu$ | ϵ-Machine MSP matrix $\mathcal{W}$
Generation | State synchrony | $\mathcal{H}(L)$, $\mathcal{S}'$, $C$ | Generator MSP matrix $W$
Associating observables $x \in \mathcal{A}$ with transitions between hidden states $s \in \boldsymbol{\mathcal{S}}$ gives a hidden Markov model (HMM) with observation-labeled transition matrices $\{ T^{(x)} \}_{x \in \mathcal{A}}$. They sum to the row-stochastic state-to-state transition matrix $T = \sum_{x \in \mathcal{A}} T^{(x)}$. (The continuous-time versions are similarly defined, which we do later on.) Adding measurement symbols this way—to transitions—can be considered a model of measurement itself.5 The efficacy of our choice will become clear.
It is important to note that HMMs, in continuous and discrete time, arise broadly in the sciences, from quantum mechanics,6,7 statistical mechanics,8 and stochastic thermodynamics9–11 to communication theory,12,13 information processing,14–16 computer design,17 population and evolutionary dynamics,18,19 and economics. Thus, HMMs appear in the most fundamental physics and in the most applied engineering and social sciences. The breadth suggests that the thorough-going HMM analysis developed here is worth the required effort to learn.
Since complex processes have highly structured, directional transition dynamics—T or G—we encounter the full richness of matrix algebra in analyzing HMMs. We explain how analyzing complex systems induces a nondiagonalizable metadynamics, even if the original dynamic is diagonalizable in its underlying state-space. Normal and diagonalizable restrictions, so familiar in mathematical physics, simply fail us here.
The diversity of nondiagonalizable dynamics presents a technical challenge, though. A new calculus for functions of nondiagonalizable operators—e.g., TL or etG—becomes a necessity if one's goal is an exact analysis of complex processes. Moreover, complexity measures naively and easily lead one to consider illegal operations. Taking the inverse of a singular operator is a particularly central, useful, and fraught example. Fortunately, such illegal operations can be skirted since the complexity measures only extract the excess transient behavior of an infinitely complicated orbit space.
To explain how this arises—how certain modes of behavior, such as excess transients, are selected as relevant, while others are ignored—we apply the meromorphic functional calculus and new results for spectral projection operators recently derived in Ref. 4 to analyze complex processes generated by HMMs.
The following shows that this leads to a simplified spectral theory of weighted directed graphs, that even nondiagonalizable eigenspaces can be manipulated individually, and that, more specifically, the techniques can be applied to the challenges of prediction. The results developed here greatly extend and (finally) explain those announced in Ref. 3. The latter introduced the basic methods and results by narrowly focusing on closed-form expressions for several measures of intrinsic computation, applying them to prototype complex systems.
The meromorphic functional calculus, summarized in detail later, concerns functions of nondiagonalizable operators when poles (or zeros) of the function of interest coincide with poles of the operator's resolvent—poles that appear precisely at the eigenvalues of the transition dynamics. Pole–pole and pole–zero interactions transform the complex-analysis residues within the functional calculus. One notable result is that the negative-one power of a singular operator exists in the meromorphic functional calculus. We derive its form, note that it is the Drazin inverse, and show how widely useful and common it is.
For example, the following gives the first closed-form expressions for many complexity measures in wide use—many of which turn out to be expressed most concisely in terms of a Drazin inverse. Furthermore, spectral decomposition gives insight into the subprocesses of a complex system in terms of the spectral projection operators of the appropriate transition dynamic.
In the following, we emphasize that when we observe processes generated by a source capable of even the simplest computations, much of the predictable structure lies beyond pairwise correlation. We clarify how different measures of complexity quantify and distinguish nuanced aspects of what is predictable and what is necessary for prediction. We then give closed-form solutions for this quantification, resulting in a new level of rigor, tractability, and insight.
Sections II and III briefly review the relevant background in stochastic processes, the HMMs that generate them, and complexity measures. Several classes of HMMs are discussed in Sec. III. Mixed-state presentations (MSPs)—HMM generators of a process that also track distributions induced by observation—are reviewed in Sec. IV. They are key to complexity measures within an information-theoretic framing. Section V then shows how each complexity measure reduces to the linear algebra of an appropriate HMM adapted to the question genre.
To make progress at this point, we summarize the meromorphic functional calculus in Sec. VI. Several of its mathematical implications are discussed in relation to projection operators in Sec. VII and a spectral weighted directed graph theory is presented in Sec. VIII.
With this all set out, the sequel, Part II,79 finally derives the promised closed-form complexities of a process and outlines common simplifications for special cases. This leads to the discovery of the symmetry collapse index, which indicates the sophistication of finite computational structures hidden in infinite-Markov-order processes. Leveraging the functional calculus, Part II79 introduces a novel extension, the complexity-measure frequency spectrum, and shows how to calculate it in closed form. It provides a suite of examples to ground the theoretical developments and works through a pedagogical example in depth.
II. STRUCTURED PROCESSES AND THEIR COMPLEXITIES
We first describe a system of interest in terms of its observed behavior, following the approach of computational mechanics, as reviewed in Ref. 20. Again, a process is the collection of behaviors that the system produces and their probabilities of occurring. A process's behaviors are described via a bi-infinite chain of random variables, denoted by capital letters $\ldots X_{-2} X_{-1} X_0 X_1 X_2 \ldots$. A realization is indicated by lowercase letters $\ldots x_{-2} x_{-1} x_0 x_1 x_2 \ldots$. We assume values $x_t$ belong to a discrete alphabet $\mathcal{A}$. We work with blocks $X_{t:t'}$, where the first index is inclusive and the second exclusive: $X_{t:t'} = X_t X_{t+1} \cdots X_{t'-1}$. Block realizations we often refer to as words $w$. At each time $t$, we can speak of the past $X_{-\infty:t}$ and the future $X_{t:\infty}$.
A process's probabilistic specification is a density over these chains: $\Pr(X_{-\infty:\infty})$. Practically, we work with finite blocks and their probability distributions $\Pr(X_{t:t'})$. To simplify the development, we primarily analyze stationary, ergodic processes: those for which $\Pr(X_{t:t+L}) = \Pr(X_{0:L})$ for all $t$ and $L$. In such cases, we only need to consider a process's length-$L$ word distributions $\Pr(X_{0:L})$.
A. Directly observable organization
A common first step to understand how processes express themselves is to analyze correlations among observables. Pairwise correlation in a sequence of observables is often summarized by the autocorrelation function

$$ \gamma(L) \equiv \left\langle \overline{X_t}\, X_{t+L} \right\rangle_t , $$

where the bar above $X_t$ denotes its complex conjugate, and the angled brackets denote an average over all times $t$. Alternatively, the structure in a stochastic process is often summarized by the power spectral density, also referred to more simply as the power spectrum

$$ P(\omega) \equiv \lim_{N \to \infty} \frac{1}{N} \left\langle \left| \sum_{t=1}^{N} X_t \, e^{-i \omega t} \right|^2 \right\rangle , $$

where $\omega$ is the angular frequency.21 Though a basic fact, it is not always sufficiently emphasized in applications that power spectra capture only pairwise correlation. Indeed, it is straightforward to show that the power spectrum is the windowed Fourier transform of the autocorrelation function $\gamma(L)$. That is, power spectra describe how pairwise correlations are distributed across frequencies. Power spectra are common in signal processing, both in technological settings and physical experiments.22 As a physical example, diffraction patterns are the power spectra of a sequence of structure factors.23
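To make these pairwise summaries concrete, the following minimal NumPy sketch estimates the autocorrelation function and a periodogram-style power spectrum from a finite sample path. The synthetic signal, its length, and the frequency grid are illustrative assumptions, not quantities from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4096
# Illustrative synthetic sequence: a noisy, weakly periodic signal.
x = np.sin(0.3 * np.arange(N)) + rng.normal(scale=0.5, size=N)

def autocorrelation(x, max_lag):
    """gamma(L) = < conj(x_t) x_{t+L} >_t, estimated from one sample path."""
    return np.array([np.mean(np.conj(x[:N - L]) * x[L:]) for L in range(max_lag)])

gamma = autocorrelation(x, max_lag=100)

# Periodogram estimate of the power spectrum: |sum_t x_t e^{-i w t}|^2 / N.
omegas = np.linspace(0, np.pi, 512)
P = np.abs(np.exp(-1j * np.outer(omegas, np.arange(N))) @ x) ** 2 / N
print(gamma[:5])
print(omegas[np.argmax(P)])   # peak near the driving frequency 0.3
```

The periodogram peak sits near the driving frequency, illustrating how the power spectrum localizes pairwise correlation in frequency.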
Other important measures of observable organization, called Green–Kubo coefficients, determine transport properties in near-equilibrium thermodynamic systems—but are rather more application-specific.24,25 These coefficients reflect the idea that dissipation depends on the correlation structure. They usually appear in the form of integrating the autocorrelation of derivatives of observables. A change of observables, however, turns this into an integration of a standard autocorrelation function. Green–Kubo transport coefficients then involve the infinite-time accumulation of the autocorrelation function for the process of appropriate observables.
One theme in the following is that, though widely used, correlation functions and power spectra give an impoverished view of a process's structural complexity, since they only consider ensemble averages over pairwise events. Moreover, creating a list of higher-order correlations is an impractical way to summarize complexity, as seen in the connected correlation functions of statistical mechanics.26
B. Intrinsic predictability
Information measures, in contrast, can involve all orders of correlation and thus help to go beyond pairwise correlation in understanding, for example, how a process' past behavior affects predicting it at later times. Information theory, as developed for general complex processes,1 provides a suite of quantities that capture prediction properties using variants of Shannon's entropy and mutual information 13 applied to sequences. Each measure answers a specific question about a process' predictability. For example:
- How random is a process? Its entropy rate $h_\mu \equiv \lim_{L \to \infty} H[X_{0:L}]/L$, where $H[\cdot]$ denotes the Shannon entropy of its argument. For dynamical systems with a continuous phase-space $\mathcal{M}$, the metric entropy, also known as Kolmogorov–Sinai (KS) entropy, is the supremum of entropy rates induced by partitioning $\mathcal{M}$ into different finite alphabets $\mathcal{A}$.27
- How much of the future can be predicted? Its excess entropy, which is the past–future mutual information [Ref. 1, and references therein]

$$ E \equiv I[X_{-\infty:0} ; X_{0:\infty}] . $$
E has also been investigated in ergodic theory29 and under the names stored information,30 effective measure complexity,31 and predictive information.32
- How much information must be extracted to know its predictability and so see its intrinsic randomness $h_\mu$? Its transient information1

$$ \mathbf{T} \equiv \sum_{L=0}^{\infty} \left[ E + h_\mu L - H(L) \right] , $$

where $H(L) \equiv H[X_{0:L}]$ is the Shannon entropy of length-$L$ words (the block entropy).
The spectral approach, our subject, naturally leads to allied, but new information measures. To give a sense, later we introduce the excess entropy spectrum . It completely, yet concisely, summarizes the structure of myopic entropy reduction, in a way similar to how the power spectrum completely describes autocorrelation. However, while the power spectrum summarizes only pairwise linear correlation, the excess entropy spectrum captures all orders of nonlinear dependency between random variables, making it an incisive probe of hidden structure.
Before leaving the measures related to predictability, we must also point out that they have important refinements—measures that lend a particularly useful, even functional, interpretation. These include the bound, ephemeral, elusive, and related informations.33,34 Though amenable to the spectral methods of the following, we leave their discussion for another venue. Fortunately, their spectral development is straightforward, but would take us beyond the minimum necessary presentation to make good on the overall discussion of spectral decomposition.
C. Prediction overhead
Process predictability measures, as just enumerated, certainly say much about a process' intrinsic information processing. They leave open, though, the question of the structural complexity associated with implementing prediction. This challenge entails a complementary set of measures that directly address the inherent complexity of actually predicting what is predictable. For that matter, how cryptic is a process?
Computational mechanics describes minimal-memory maximal prediction—using the minimal memory necessary to predict everything that is predictable about the future—via a process' hidden, effective or causal states and transitions, as summarized by the process's ϵ-machine.20 A causal state $\sigma$ is an equivalence class of histories that all yield the same probability distribution over observable futures $X_{0:\infty}$. Therefore, knowing a process's current causal state, that $\mathcal{S}_t = \sigma$, say, is sufficient for maximal prediction.
The computational mechanics framework can also be related to several more recent attempts at describing effective levels of complex systems. For example, if individual histories are taken to be the microstates of a stochastic process, then causal states are the minimal high-level description of a stochastic process that satisfies the informational closure criterion of Ref. 35.
Computational mechanics provides an additional suite of quantities that capture the overhead of prediction, again using variants of Shannon's entropy and mutual information applied to the ϵ-machine. Each also answers a specific question about an observer's burden of prediction. For example:
- How much historical information must be stored for maximal prediction? The Shannon information in the causal states or statistical complexity36

$$ C_\mu \equiv H[\mathcal{S}] . $$
- How unpredictable is a causal state upon observing a process for duration L? The myopic causal-state uncertainty1

$$ \mathcal{H}(L) \equiv H[\mathcal{S}_L \mid X_{0:L}] . $$
- How much information must an observer extract to synchronize to—that is, to know with certainty—the causal state? The optimal predictor's synchronization information1

$$ \mathcal{S} \equiv \sum_{L=0}^{\infty} \mathcal{H}(L) . $$
Paralleling the purely informational suite of the previous Sec. II B, we later introduce the optimal synchronization spectrum . It completely and concisely summarizes the frequency distribution of state-uncertainty reduction, similar to how the power spectrum completely describes autocorrelation and the excess entropy spectrum the myopic entropy reduction. Helpfully, the above optimal prediction measures can be found from the optimal synchronization spectrum.
The structural complexities monitor an observer's burden in optimally predicting a process. And so, they have practical relevance when an intelligent artificial or biological agent must take advantage of a structured stochastic environment—e.g., a Maxwellian Demon taking advantage of correlated environmental fluctuations,37 prey avoiding easy prediction, or an investor profiting from stock market volatility.
Prediction has many natural generalizations. For example, since maximal prediction often requires infinite resources, sub-maximal prediction (i.e., predicting with lower fidelity) is of practical interest. Fortunately, there are principled ways to investigate the tradeoffs between predictive accuracy and computational burden.2,38–40 As another example, maximal prediction in the presence of noisy or irregular observations can be investigated with a properly generalized framework; see Ref. 41. Blending the existing tools, resource-limited prediction under such observational constraints can also be investigated. There are also many applications where prediction is relevant to the task at hand, but is not necessarily the ultimate objective; this of course has a long history, and Ref. 42 has recently tried to formalize this effort. In all of these settings, information measures similar to those listed above are key to understanding and quantifying the tradeoffs arising in prediction.
Having highlighted the difference between prediction and predictability, we can appreciate that some processes hide more internal information—are more cryptic—than others. It turns out, this can be quantified. The crypticity $\chi \equiv C_\mu - E$ is the difference between the process's stored information $C_\mu$ and the mutual information E shared between past and future observables.43 Operationally, crypticity contrasts predictable information content E with an observer's minimal stored-memory overhead $C_\mu$ required to make predictions. To predict what is predictable, therefore, an optimal predictor must account for a process's crypticity.
D. Generative complexities
How does a physical system produce its output process? This depends on many details. Some systems employ vast internal mechanistic redundancy, while others under constraints have optimized internal resources down to a minimally necessary generative structure. Different pressures give rise to different kinds of optimality. For example, minimal state-entropy generators turn out to be distinct from minimal state-set generators.44–46 The challenge then is to develop ways to monitor differences in the generative mechanism.47
Any generative model1,48 with state-set $\boldsymbol{\mathcal{S}}$ has a statistical state complexity (state entropy): $C \equiv H[\mathcal{S}_0]$. Consider the corresponding myopic state-uncertainty given $L$ sequential observations

$$ \mathcal{H}(L) \equiv H[\mathcal{S}_L \mid X_{0:L}] . $$

And so

$$ \mathcal{H}(0) = C . $$

We also have the asymptotic uncertainty $\mathcal{H}(\infty) \equiv \lim_{L \to \infty} \mathcal{H}(L)$. Related, there is the excess synchronization information

$$ \mathcal{S}' \equiv \sum_{L=0}^{\infty} \left[ \mathcal{H}(L) - \mathcal{H}(\infty) \right] . $$

Such quantities are relevant even when an observer never fully synchronizes to a generative state; i.e., even when $\mathcal{H}(\infty) > 0$. Finite-state ϵ-machines always synchronize49,50 and so their $\mathcal{H}(\infty)$ vanishes.
Since many different mechanisms can generate a given process, we need useful bounds on the statistical state complexity of possible process generators. For example, the minimal generative complexity $C_g \equiv \min C$, where we minimize over all models that generate the process, is the minimal state-information a physical system must store to generate its future.46 The predictability and the statistical complexities bound each other

$$ E \le C_g \le C_\mu . $$

That is, the predictable future information E is less than or equal to the information necessary to produce the future which, in turn, is less than or equal to the information necessary to predict the future.1,44–47 Such relationships have been explored even for quantum generators of (classical) stochastic processes [Ref. 51, and references therein].
III. HIDDEN MARKOV MODELS
Up to this point, the development focused on introducing and interpreting various information and complexity measures. It was not constructive in that there was no specification of how to calculate these quantities for a given process. Doing so requires models or, in the vernacular, a presentation of a process. Fortunately, a common mathematical representation describes a wide class of process generators: the edge-labeled hidden Markov models (HMMs), also known as Mealy HMMs.48,52 Using these as our preferred presentations, we will first classify them and then describe how to calculate the information measures of the processes they generate.
Definition 1. A finite-state, edge-labeled hidden Markov model consists of:
A finite set of hidden states $\boldsymbol{\mathcal{S}} = \{ s_1, \ldots, s_M \}$. $\mathcal{S}_t$ is the random variable for the hidden state at time $t$.

A finite output alphabet $\mathcal{A}$.

A set of $M \times M$ symbol-labeled transition matrices $\{ T^{(x)} \}_{x \in \mathcal{A}}$, where $T^{(x)}_{ij}$ is the probability of transitioning from state $s_i$ to state $s_j$ and emitting symbol $x$. The corresponding overall state-to-state transition matrix is the row-stochastic matrix $T = \sum_{x \in \mathcal{A}} T^{(x)}$.

An initial distribution over hidden states: $\eta_0 = \bigl( \Pr(\mathcal{S}_0 = s_1), \ldots, \Pr(\mathcal{S}_0 = s_M) \bigr)$.
Contrast this with the class-equivalent state-labeled HMMs, also known as Moore HMMs.11,45,52,75 In automata theory, a finite-state HMM is called a probabilistic nondeterministic finite automaton.76 Information theory13 refers to them as finite-state information sources and stochastic process theory defines them as functions of a Markov chain.53,58,77,78
The dynamics of such finite-state models are governed by transition matrices amenable to the linear algebra of vector spaces. As a result, bra-ket notation is useful. Bras $\langle \cdot |$ are row vectors and kets $| \cdot \rangle$ are column vectors. One benefit of the notation is immediately recognizing the mathematical object type. For example, on the one hand, any expression that forms a closed bra-ket pair—either $\langle \cdot | \cdot \rangle$ or $\langle \cdot | A | \cdot \rangle$—is a scalar quantity and commutes as a unit with anything. On the other hand, when useful, an expression of the ket-bra form $| \cdot \rangle \langle \cdot |$ can be interpreted as a matrix.
T's row-stochasticity means that each of its rows sums to unity. Introducing $| \mathbf{1} \rangle$ as the column vector of all 1s, this can be restated as

$$ T \, | \mathbf{1} \rangle = | \mathbf{1} \rangle . $$

This is readily recognized as an eigenequation: $T | \mathbf{1} \rangle = \lambda | \mathbf{1} \rangle$. That is, the all-ones vector is always a right eigenvector of T associated with the eigenvalue λ of unity.
When the internal Markov transition matrix T is irreducible, the Perron-Frobenius theorem guarantees that there is a unique asymptotic state distribution π determined by

$$ \langle \pi | \, T = \langle \pi | , $$

with the further condition that π is normalized in probability: $\langle \pi | \mathbf{1} \rangle = 1$. This again is recognized as an eigenequation: the stationary distribution π over the hidden states is T's left eigenvector associated with the eigenvalue of unity. When 1 is the only one of T's eigenvalues on the unit circle—i.e., when the process and presentation lack any deterministic periodicities—then the stationary distribution π is also the asymptotic state distribution for an ensemble of realizations, regardless of the initial distribution.
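As a quick numerical illustration of these two eigenequations, the sketch below builds a small HMM (the familiar two-state Golden Mean generator, used here purely as a stand-in; the transition probabilities are illustrative and not taken from the text) and recovers π as the eigenvalue-one left eigenvector of T.

```python
import numpy as np

# Symbol-labeled transition matrices T^{(x)} for a small illustrative HMM
# (two-state Golden Mean generator; numbers chosen for illustration only).
T0 = np.array([[0.5, 0.0],
               [1.0, 0.0]])    # emit symbol 0
T1 = np.array([[0.0, 0.5],
               [0.0, 0.0]])    # emit symbol 1
T = T0 + T1                    # state-to-state matrix, row stochastic

ones = np.ones(T.shape[0])     # the all-ones ket |1>
assert np.allclose(T @ ones, ones)        # T|1> = |1>

# Stationary distribution: left eigenvector of T with eigenvalue 1,
# i.e., an eigenvector of T transpose; then normalize in probability.
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()
assert np.allclose(pi @ T, pi)            # <pi| T = <pi|
print("pi =", pi)                         # [2/3, 1/3] for these numbers
```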
To describe a stationary process, as done often in the following, the initial hidden-state distribution η0 is set to the stationary one: $\langle \eta_0 | = \langle \pi |$. The resulting process generated is then stationary. Choosing an alternative η0 is useful in many contexts. Note that, starting with $\langle \eta_t |$, the expected state distribution at the next time is simply $\langle \eta_{t+1} | = \langle \eta_t | \, T$. However, starting with such alternatives typically yields a nonstationary process.
An HMM describes a process' behaviors as a formal language $\mathcal{L}$ of allowed realizations. Moreover, it succinctly describes a process's word distribution $\Pr(w)$ over all words $w \in \mathcal{A}^*$. (Appropriately, it also assigns zero probability to words outside of the process' language: $\Pr(w) = 0$ for all $w \in \mathcal{A}^* \setminus \mathcal{L}$, $\mathcal{L}$'s complement.) Specifically, the stationary probability of observing a particular length-L word $w = x_0 x_1 \cdots x_{L-1}$ is given by

$$ \Pr(w) = \langle \pi | \, T^{(w)} \, | \mathbf{1} \rangle , \qquad (1) $$

where $T^{(w)} \equiv T^{(x_0)} T^{(x_1)} \cdots T^{(x_{L-1})}$.
More generally, given a nonstationary state distribution η, the probability of seeing w is

$$ \Pr_\eta(w) = \langle \eta | \, T^{(w)} \, | \mathbf{1} \rangle , \qquad (2) $$

where the subscript η means that the state random variable is distributed as η.13 And, the state distribution having seen word w starting in η is

$$ \langle \eta_w | = \frac{ \langle \eta | \, T^{(w)} }{ \langle \eta | \, T^{(w)} \, | \mathbf{1} \rangle } . \qquad (3) $$
These conditional distributions are used often since, for example, most observations induce a nonstationary distribution over hidden states. Tracking such observation-induced distributions is the role of a related model class—the mixed-state presentation, introduced shortly. To get there, we must first introduce several, prerequisite HMM classes. See Fig. 1. A simple example of the general HMM just discussed is shown in Fig. 1(a).
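Continuing with the same illustrative generator, the following sketch evaluates the stationary word probability of Eq. (1) and the observation-induced state distribution of Eq. (3). The helper names are hypothetical; only the bra-ket formulas above are being exercised.

```python
import numpy as np
from functools import reduce

Tx = {0: np.array([[0.5, 0.0], [1.0, 0.0]]),
      1: np.array([[0.0, 0.5], [0.0, 0.0]])}
T = sum(Tx.values())
ones = np.ones(2)
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()

def word_matrix(word):
    """T^{(w)} = T^{(x_0)} T^{(x_1)} ... T^{(x_{L-1})}."""
    return reduce(np.matmul, (Tx[x] for x in word), np.eye(2))

def word_probability(eta, word):
    """Pr_eta(w) = <eta| T^{(w)} |1>, Eq. (2)."""
    return eta @ word_matrix(word) @ ones

def mixed_state(eta, word):
    """<eta_w| = <eta| T^{(w)} / <eta| T^{(w)} |1>, Eq. (3)."""
    v = eta @ word_matrix(word)
    return v / v.sum()

print(word_probability(pi, (1, 1)))      # 0: "11" is forbidden for this example
print(word_probability(pi, (1, 0, 1)))   # > 0
print(mixed_state(pi, (0,)))             # observer's belief after seeing a 0
```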
Finite HMM classes and example processes they generate, depicted by their state-transition diagrams: For any setting of the transition probabilities and transition rates, each HMM generates an observable stochastic process over its alphabet—the latent states themselves are not directly observable from the output process and so are “hidden.” (a) Simple nonunifilar source: two transitions leaving from the same state generate the same output symbol. (b) Nonminimal unifilar HMM. (c) ϵ-Machine: minimal unifilar HMM for the stochastic process generated. (d) Generator of a continuous-time stochastic process.
A. Unifilar HMMs (uHMMs)
An important class of HMMs consists of those that are unifilar. Unifilarity guarantees that, given a start state and a sequence of observations, there is a unique path through the internal states.53 This, in turn, allows one to directly translate between properties of the internal Markov chain and properties of the observed behavior generated along the sequence of edges traversed. The states of unifilar HMMs are maximally predictive.54
In contrast, general—that is, nonunifilar—HMMs have an exponentially growing number of possible state paths as a function of observed word length. Thus, nonunifilar process presentations break almost all quantitative connections between internal dynamics and observations, rendering them markedly less useful process presentations. While they can be used to generate realizations of a given process, they cannot be used outright to predict a process. Unifilarity is required.
Definition 2. A finite-state, edge-labeled, unifilar HMM (uHMM)55 is a finite-state, edge-labeled HMM with the following property:
Unifilarity: For each state and each symbol , there is at most one outgoing edge from state s that emits symbol x.
An example is shown in Fig. 1(b).
B. Minimal unifilar HMMs
Minimal models are not only convenient to use, but very often allow for determining essential informational properties, such as a process' memory $C_\mu$. A process' minimal state-entropy uHMM is the same as its minimal-state uHMM. And, the latter turns out to be the process' ϵ-machine in computational mechanics.20 Computational mechanics shows how to calculate a process' ϵ-machine from the process' conditional word distributions. Specifically, ϵ-machine states, the process' causal states $\sigma \in \boldsymbol{\mathcal{S}}$, are equivalence classes of histories that yield the same predictions for the future. Explicitly, two histories $x_{-\infty:0}$ and $x'_{-\infty:0}$ map to the same causal state if and only if $\Pr(X_{0:\infty} \mid X_{-\infty:0} = x_{-\infty:0}) = \Pr(X_{0:\infty} \mid X_{-\infty:0} = x'_{-\infty:0})$. Thus, each causal state σ comes with a prediction of the future $\Pr(X_{0:\infty} \mid \sigma)$—its future morph. In short, a process' ϵ-machine is its minimal size, maximally predictive predictor.
Converting a given uHMM to its corresponding ϵ-machine employs probabilistic variants of well-known state-minimization algorithms in automata theory.56 One can also verify that a given uHMM is minimal by checking that all its states are probabilistically distinct.49,50
Definition 3. A uHMM's states are probabilistically distinct if for each pair of distinct states $s_i, s_j \in \boldsymbol{\mathcal{S}}$ there exists some finite word $w$ such that

$$ \Pr(X_{0:L} = w \mid \mathcal{S}_0 = s_i) \neq \Pr(X_{0:L} = w \mid \mathcal{S}_0 = s_j) . $$
If this is the case, then the process' uHMM is its ϵ-machine.
An example is shown in Fig. 1(c).
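A direct, if brute-force, way to check Definition 3 numerically is to compare the word probabilities emitted from each pair of start states up to some finite length. The sketch below does exactly that for the illustrative two-state generator used earlier; note that a finite search can confirm distinctness but cannot by itself prove two states identical.

```python
import numpy as np
from itertools import product
from functools import reduce

Tx = {0: np.array([[0.5, 0.0], [1.0, 0.0]]),
      1: np.array([[0.0, 0.5], [0.0, 0.0]])}
M = 2
ones = np.ones(M)

def prob_from_state(i, word):
    """Pr(w | S_0 = s_i) = <delta_i| T^{(x_0)} ... T^{(x_{L-1})} |1>."""
    delta = np.eye(M)[i]
    return reduce(np.matmul, (Tx[x] for x in word), delta) @ ones

def probabilistically_distinct(i, j, max_len=5, tol=1e-12):
    """Search for a word (up to max_len) whose probability differs
    when starting from s_i versus s_j."""
    for L in range(1, max_len + 1):
        for word in product(Tx.keys(), repeat=L):
            if abs(prob_from_state(i, word) - prob_from_state(j, word)) > tol:
                return True, word
    return False, None

print(probabilistically_distinct(0, 1))   # (True, (1,)) for these numbers
```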
C. Finitary stochastic process hierarchy
The finite-state presentations in these classes form a hierarchy in terms of the processes they can finitely generate:44 Processes(ϵ-machines) = Processes(uHMMs) ⊊ Processes(HMMs). That is, finite HMMs generate a strictly larger class of stochastic processes than finite uHMMs. The class of processes generated by finite uHMMs, though, is the same as generated by finite ϵ-machines.
D. Continuous-time HMMs
Though we concentrate on discrete-time processes, many of the process classifications, properties, and calculational methods carry over easily to continuous time. In this setting transition rates are more appropriate than transition probabilities. Continuous-time HMMs can often be obtained as a limit of an edge-labeled discrete-time HMM whose edges operate for a time $\Delta t \to 0$. The most natural continuous-time HMM presentation, though, has a continuous-time generator G of time evolution over hidden states, with observables emitted as deterministic functions of an internal Markov chain: $X_t = f(\mathcal{S}_t)$.
Respecting the continuous-time analog of probability conservation, each row of G sums to zero. Over a finite time interval t, marginalizing over all possible observations, the row-stochastic state-to-state transition dynamic is

$$ T(t) = e^{t G} . $$

Any nontrivial continuous-time process generated by such a continuous-time HMM has an uncountably infinite number of possible realizations within a finite time interval, and most of these have vanishing probability. However, probabilities regarding what state the system is in at any finite set of times can be easily calculated, essentially by bundling measurable sets of trajectories that satisfy certain constraints. For this purpose, we introduce the continuous-time observation matrices

$$ \Gamma^{(x)} \equiv \sum_{s \in \boldsymbol{\mathcal{S}}} \delta_{x, f(s)} \, | \delta_s \rangle \langle \delta_s | , $$

where $\delta_{x, f(s)}$ is a Kronecker delta, $| \delta_s \rangle$ the column vector of all 0s except for a 1 at the position for state s, and $\langle \delta_s |$ its transpose. These “projectors” sum to the identity: $\sum_{x \in \mathcal{A}} \Gamma^{(x)} = I$.
An example is shown in Fig. 1(d).
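The sketch below illustrates these continuous-time ingredients with a hypothetical three-state rate matrix: the generator's rows sum to zero, the finite-time transition matrix is the matrix exponential $e^{tG}$ (computed with scipy.linalg.expm), and the deterministic-readout "projectors" sum to the identity. The rates and the readout assignment are invented for illustration.

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical three-state generator: off-diagonal entries are
# transition rates; each row sums to zero (probability conservation).
G = np.array([[-1.0,  0.7,  0.3],
              [ 0.2, -0.5,  0.3],
              [ 0.0,  0.9, -0.9]])
assert np.allclose(G.sum(axis=1), 0.0)

t = 2.5
T_t = expm(t * G)                          # T(t) = e^{tG}
assert np.allclose(T_t.sum(axis=1), 1.0)   # row stochastic for t >= 0
assert np.all(T_t >= -1e-12)

# Deterministic readout f: states 1 and 2 emit 'a'; state 3 emits 'b'
# (an illustrative assignment).  The observation projectors sum to I.
f = ['a', 'a', 'b']
proj = {x: np.diag([1.0 if s == x else 0.0 for s in f]) for x in set(f)}
assert np.allclose(sum(proj.values()), np.eye(3))

# Probability of reading 'a' at time 0 and 'b' at time t, in stationarity:
evals, evecs = np.linalg.eig(T_t.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))]); pi /= pi.sum()
print(pi @ proj['a'] @ T_t @ proj['b'] @ np.ones(3))
```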
IV. MIXED-STATE PRESENTATIONS
A given process can be generated by nonunifilar, unifilar, and ϵ-machine HMM presentations. A process' ϵ-machine is unique. However, within either the unifilar or nonunifilar HMM classes, there are an infinite number of presentations that generate the process.
This flexibility suggests that we can create a HMM process generator that, through its embellished structure, answers more refined questions than the information generation ($h_\mu$) and memory ($C_\mu$) calculated from the ϵ-machine. To this end, we introduce the mixed-state presentation (MSP). An MSP tracks important supplementary information in the hidden states and, through well-crafted dynamics, over the hidden states. In particular, an MSP generates a process while tracking the observation-induced distribution over the states of an alternative process generator. Here, we review only that subset of mixed-state theory required by the following.
Consider a HMM presentation $M$ of some process in statistical equilibrium. A mixed state η can be any state distribution over $M$'s states and so we work with the simplex of state distributions. However, the simplex is uncountable and so contains far more mixed states than needed to calculate many complexity measures.
How to efficiently monitor the way in which an observer comes to know the HMM state as it sees successive symbols from the process? This is the problem of observer–state synchronization. To analyze the evolution of the observer's knowledge through a sequence of observation-induced mixed states, we use the set of mixed states that are induced by all allowed words from the initial mixed state π:

$$ \mathcal{R}_\pi \equiv \{ \pi \} \cup \{ \eta_w : \Pr(w) > 0 \} . $$

The cardinality of $\mathcal{R}_\pi$ is finite when there are only a finite number of distinct probability distributions over $M$'s states that can be induced by observed sequences, if starting from the stationary distribution π.
If w is the first (in lexicographic order) word that induces a particular distribution over $M$'s states, then we denote this distribution as ηw; a shorthand for Eq. (3). For example, if the two words 010 and 110110 both induce the same distribution η over $\boldsymbol{\mathcal{S}}$ and no word shorter than 010 induces that distribution, then the mixed state is denoted $\eta_{010}$. It corresponds to the distribution

$$ \langle \eta_{010} | = \frac{ \langle \pi | \, T^{(0)} T^{(1)} T^{(0)} }{ \langle \pi | \, T^{(0)} T^{(1)} T^{(0)} \, | \mathbf{1} \rangle } . $$
Since a given observed symbol induces a unique updated distribution from a previous distribution, the dynamic over mixed states is unifilar. Transition probabilities among mixed states can be obtained via Eqs. (2) and (3). So, if the current mixed state is η and

$$ \Pr(x \mid \eta) = \langle \eta | \, T^{(x)} \, | \mathbf{1} \rangle $$

and

$$ \langle \eta' | = \frac{ \langle \eta | \, T^{(x)} }{ \langle \eta | \, T^{(x)} \, | \mathbf{1} \rangle } , $$

then the transition probability from mixed state η to mixed state η′ on observing symbol x is

$$ W^{(x)}_{\eta \eta'} = \Pr(x \mid \eta) . $$

These transition probabilities over the mixed states in $\mathcal{R}_\pi$ are the matrix elements for the observation-labeled transition matrices $\{ W^{(x)} \}_{x \in \mathcal{A}}$ of $M$'s synchronizing MSP (S-MSP), whose start distribution is $\langle \delta_\pi |$,
where $\langle \delta_\pi |$ is the distribution over $\mathcal{R}_\pi$ peaked at the unique start-(mixed)-state π. The row-stochastic net mixed-state-to-state transition matrix of the S-MSP is $W = \sum_{x \in \mathcal{A}} W^{(x)}$. If irreducible, then there is a unique stationary probability distribution over the S-MSP's states obtained by solving $\langle \pi_W | \, W = \langle \pi_W |$. We use $\mathcal{R}_t$ to denote the random variable for the MSP's state at time t.
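The construction just described can be carried out mechanically: starting from π, repeatedly apply the mixed-state update for each symbol of positive probability and collect the distinct distributions encountered. The sketch below does this for the illustrative two-state generator used earlier; it assumes (and aborts if not) that the reachable set of mixed states is finite.

```python
import numpy as np

def build_synchronizing_msp(Tx, pi, atol=1e-9, max_states=500):
    """Enumerate mixed states reachable from pi and build the MSP's
    observation-labeled transition matrices W^{(x)}.
    A sketch only: it assumes the set of reachable mixed states is finite."""
    ones = np.ones(len(pi))
    states = [np.asarray(pi, dtype=float)]
    edges = []                              # (from, symbol, to, probability)
    frontier = [0]
    while frontier:
        i = frontier.pop()
        for x, Tmat in Tx.items():
            p = states[i] @ Tmat @ ones     # Pr(x | current mixed state)
            if p <= atol:
                continue
            eta = (states[i] @ Tmat) / p    # observation-induced update, Eq. (3)
            for j, known in enumerate(states):
                if np.allclose(eta, known, atol=atol):
                    break
            else:
                if len(states) >= max_states:
                    raise RuntimeError("mixed-state set appears not to be finite")
                states.append(eta)
                frontier.append(len(states) - 1)
                j = len(states) - 1
            edges.append((i, x, j, p))
    n = len(states)
    W = {x: np.zeros((n, n)) for x in Tx}
    for i, x, j, p in edges:
        W[x][i, j] += p
    return states, W

Tx = {0: np.array([[0.5, 0.0], [1.0, 0.0]]),
      1: np.array([[0.0, 0.5], [0.0, 0.0]])}
T = sum(Tx.values())
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))]); pi /= pi.sum()

states, W = build_synchronizing_msp(Tx, pi)
print(len(states), "mixed states reached from pi")        # 3 for this example
assert np.allclose(sum(W.values()).sum(axis=1), 1.0)      # W is row stochastic
```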
Figure 2 illustrates how an S-MSP relates to the HMM whose state distributions it tracks. Both the original HMM and its MSP generate the same process. However, the MSP's dynamic over mixed-states also tracks how, through the mixed states induced by successive observations, an observer comes to know the original HMM's current state. Importantly, this MSP's dynamic is nondiagonalizable, while the original HMM's dynamic is diagonalizable. The appearance of nondiagonalizability when analyzing particular properties, in this case, observer synchronization, is generic. Constructively working with this unavoidable fact motivates much of the following. Crucially, we find that the burden of predicting a stochastic process is fundamentally dependent on the nondiagonalizable characteristics of its MSP.
(a) Example HMM and (b) its S-MSP: both of these HMMs generate the -GP-(2) process considered in Part II.79 However, the S-MSP also tracks observation-induced states of knowledge about the example HMM's distribution over internal states. The double-circle state in the MSP state-transition diagram denotes the S-MSP's start-state. The green states are transient, whereas the blue states are recurrent. The MSP's mixed-state-to-state dynamic is nondiagonalizable—which is generic—even though the example HMM's state-to-state dynamic is diagonalizable.
That feedforward network structures lead to nondiagonalizability and that it is important to network functionality was also observed in a quite different setting—oscillator synchronization on complex networks with directional coupling dynamics.57 Briefly comparing settings reveals the origins of the commonality. In each, nondiagonalizability arises from an irreducible interdependence among elements—an interdependence that can be harnessed for hierarchical control. In Ref. 57, nondiagonalizable network structures allow upstream oscillators to influence downstream oscillators and this enables optimal synchronization among all oscillators. In contrast, while our setting does not concern synchronizing oscillator nodes to each other, it does analyze how an observer's belief state synchronizes to the true state of the system under study. During this kind of synchronization, past states of knowledge feed into future states of knowledge. In short, nondiagonalizability corresponds to intrinsically interdependent updates in the evolution of knowledge.
More generally, we work with distributions over mixed states. We must consider a mixed-state dynamic that starts from a general (nonpeaked) distribution over $\mathcal{R}_\pi$. This may be counterintuitive, since a distribution over distributions should correspond to a single distribution. However, the general MSP theory with a general starting distribution over mixed states allows us to consider a weighted average of behaviors originating from different histories. And, this is distinct from considering the behavior originating from a weighted average of histories. This more general MSP formalism arises in the closed-form solutions for more sophisticated complexity measures, such as the bound information. This anticipates tools needed in a sequel.
With this brief overview of mixed states, we can now turn to use them. Section V shows that tracking distributions over the states of another generator makes the MSP an ideal algebraic object for closed-form complexity expressions involving conditional entropies—measures that require conditional probabilities. Sections II B and II C showed that many of the complexity measures for predictability and predictive burden are indeed framed as conditional entropies. And so, MSPs are central to their closed-form expressions.
Historically, mixed states were already implicit in Ref. 58, introduced in their modern form by Refs. 44 and 45, and have been used recently; e.g., in Refs. 59 and 60. Most of these efforts, however, used mixed states in the specific context of the synchronizing MSP (S-MSP). A greatly extended development of mixed-state dynamics appears in Ref. 41.
The overall strategy, though, is easy to explain. Different information-theoretic questions require different mixed-state dynamics, each of which is a unifilar presentation. Employing the mathematical methods developed here, we find that the desired closed-form solutions are often simple functions of the transition dynamic of an appropriate MSP. Specifically, the spectral properties of the relevant MSP control the form of information-theoretic quantities.
Finally, we note that similar linear-algebraic constructions—whose hidden states track relevant information—that are nevertheless not MSPs are just as important for answering different sets of questions about a process. Since these constructions are not directly about predictability and prediction, we report on them elsewhere.
V. IDENTIFYING THE HIDDEN LINEAR DYNAMIC
We are now in a position to identify the hidden linear dynamic appropriate to many of the questions that arise in complex systems—their observation, predictability, prediction, and generation, as outlined in Table II. In part, this section addresses a very practical need for specific calculations. In part, it also lays the foundations for further generalizations, to be discussed at the end. Identifying the linear dynamic means identifying the linear operator A such that a question of interest can be reformulated as either being of the cascading form $\langle \cdot | A^L | \cdot \rangle$ or as an accumulation of such cascading events via $\sum_L \langle \cdot | A^L | \cdot \rangle$; recall Table I. Helpfully, many well-known questions of complexity can be mapped to these archetypal forms. And so, we now proceed to uncover the hidden linear dynamics of the cascading questions approximately in the order they were introduced in Sec. II.
A. Simple complexity from any presentation
For observable correlation, any HMM transition operator will do as the linear dynamic. We simply observe, let time (or space) evolve forward, and observe again. Let us be concrete.
Recall the familiar autocorrelation function. For a discrete-domain process, it is61

$$ \gamma(L) = \left\langle \overline{X_t}\, X_{t+L} \right\rangle_t , $$

where $\langle \cdot \rangle_t$ denotes a time average and the bar denotes the complex conjugate. The autocorrelation function is symmetric about L = 0, so we can focus on $L \ge 0$. For L = 0, we simply have

$$ \gamma(0) = \sum_{x \in \mathcal{A}} |x|^2 \Pr(x) = \langle \pi | \Bigl( \sum_{x \in \mathcal{A}} |x|^2 \, T^{(x)} \Bigr) | \mathbf{1} \rangle . $$

For L > 0, we have

$$ \gamma(L) = \sum_{x, x' \in \mathcal{A}} \overline{x}\, x' \; \Pr\bigl( x \ast \cdots \ast x' \bigr) , $$

with L − 1 intervening wildcards. Each “∗” above is a wildcard symbol denoting indifference to the particular symbol observed in its place. That is, the ∗s denote marginalizing over the intervening random variables. We develop the consequence of this, explicitly calculating62 and finding

$$ \gamma(L) = \langle \pi | \Bigl( \sum_{x \in \mathcal{A}} \overline{x}\, T^{(x)} \Bigr) \, T^{L-1} \, \Bigl( \sum_{x' \in \mathcal{A}} x'\, T^{(x')} \Bigr) | \mathbf{1} \rangle . $$
The result is the autocorrelation in the cascading form of Table I, which can be made particularly transparent by subsuming time-independent factors on the left and right into the bras and kets. Let us introduce the new row vector

$$ \langle A | \equiv \langle \pi | \sum_{x \in \mathcal{A}} \overline{x}\, T^{(x)} $$

and the column vector

$$ | B \rangle \equiv \sum_{x \in \mathcal{A}} x\, T^{(x)} | \mathbf{1} \rangle . $$

Then, the autocorrelation function for nonzero integer L is simply

$$ \gamma(L) = \langle A | \, T^{|L| - 1} \, | B \rangle . $$
Clearly, the autocorrelation function is a direct, albeit filtered, signature of iterates of the transition dynamic of any process presentation.
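As a check on the cascading form, the sketch below evaluates $\gamma(L) = \langle A | T^{L-1} | B \rangle$ for the illustrative generator (treating the alphabet as the real numbers 0 and 1) and compares it against a brute-force sum over words. The vector names follow the reconstruction above and are illustrative.

```python
import numpy as np
from numpy.linalg import matrix_power
from itertools import product
from functools import reduce

Tx = {0: np.array([[0.5, 0.0], [1.0, 0.0]]),
      1: np.array([[0.0, 0.5], [0.0, 0.0]])}
T = sum(Tx.values())
ones = np.ones(2)
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))]); pi /= pi.sum()

# Time-independent factors folded into a bra and a ket
# (symbols are real here, so no conjugation is needed).
A = pi @ sum(x * Tmat for x, Tmat in Tx.items())      # <A| = <pi| sum_x x T^(x)
B = sum(x * Tmat for x, Tmat in Tx.items()) @ ones    # |B> = sum_x x T^(x) |1>

def gamma(L):
    """gamma(L) = <A| T^{L-1} |B> for integer L >= 1."""
    return A @ matrix_power(T, L - 1) @ B

def gamma_brute(L):
    """Brute force: sum over length-(L+1) words of x_0 * x_L * Pr(word)."""
    total = 0.0
    for w in product(Tx.keys(), repeat=L + 1):
        Tw = reduce(np.matmul, (Tx[x] for x in w))
        total += w[0] * w[-1] * (pi @ Tw @ ones)
    return total

for L in range(1, 6):
    assert np.isclose(gamma(L), gamma_brute(L))
print([round(gamma(L), 4) for L in range(1, 6)])
```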
This result can easily be translated to the continuous-time setting. If the process is represented as a function of a Markov chain and we make the translation that

$$ \langle A | \to \langle \pi | \sum_{x \in \mathcal{A}} \overline{x}\, \Gamma^{(x)} \quad \text{and} \quad | B \rangle \to \sum_{x \in \mathcal{A}} x\, \Gamma^{(x)} | \mathbf{1} \rangle , $$

with the observation matrices $\Gamma^{(x)}$ of Sec. III D, then the autocorrelation function for any nonzero time separation t is simply

$$ \gamma(t) = \langle A | \, e^{G |t|} \, | B \rangle , $$

where G is determined from T following Sec. III D. Again, the autocorrelation function is a direct fingerprint of the transition dynamic over the hidden states.
The power spectrum is a modulated accumulation of the autocorrelation function. With some algebra, one can show that it is

$$ P(\omega) = \lim_{N \to \infty} \sum_{L = -(N-1)}^{N-1} \Bigl( 1 - \frac{|L|}{N} \Bigr)\, \gamma(L)\, e^{-i \omega L} . $$

Reference 61 shows that for discrete-domain processes, the continuous part of the power spectrum is simply

$$ P_{\mathrm{c}}(\omega) = \gamma(0) + 2\, \mathrm{Re} \, \langle A | \, \bigl( e^{i \omega} I - T \bigr)^{-1} \, | B \rangle , $$

where Re denotes the real part of its argument and I is the identity matrix. Similarly, for continuous-domain processes, one has

$$ P_{\mathrm{c}}(\omega) = 2\, \mathrm{Re} \, \langle A | \, \bigl( i \omega I - G \bigr)^{-1} \, | B \rangle . $$
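The resolvent expression is easy to evaluate numerically. The following sketch computes the continuous part of the power spectrum for the illustrative generator and cross-checks it, at several frequencies, against a convergent Fourier sum over the decaying part of γ(L); the phase conventions follow the reconstruction above and should be read as assumptions.

```python
import numpy as np
from numpy.linalg import matrix_power, inv

Tx = {0: np.array([[0.5, 0.0], [1.0, 0.0]]),
      1: np.array([[0.0, 0.5], [0.0, 0.0]])}
T = sum(Tx.values())
ones = np.ones(2)
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))]); pi /= pi.sum()

A = pi @ sum(x * M for x, M in Tx.items())                     # <A|
B = sum(x * M for x, M in Tx.items()) @ ones                   # |B>
gamma0 = pi @ sum(x * x * M for x, M in Tx.items()) @ ones     # gamma(0)
gamma_inf = (A @ ones) * (pi @ B)                              # lim_L gamma(L)

def P_continuous(omega):
    """Continuous part of the power spectrum via the resolvent expression."""
    resolvent = inv(np.exp(1j * omega) * np.eye(2) - T)
    return gamma0 + 2.0 * np.real(A @ resolvent @ B)

def P_continuous_check(omega, N=60):
    """Convergent check: Fourier sum over the decaying part of gamma(L)."""
    total = gamma0 - gamma_inf
    for L in range(1, N):
        gL = A @ matrix_power(T, L - 1) @ B
        total += 2.0 * np.real((gL - gamma_inf) * np.exp(-1j * omega * L))
    return total

for w in np.linspace(0.1, np.pi, 7):
    assert np.isclose(P_continuous(w), P_continuous_check(w), atol=1e-8)
print(P_continuous(np.pi))   # 2/3 for this example
```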
Although useful, these signatures of pairwise correlation are only first-order complexity measures. Common measures of complexity that include higher orders of correlation can also be written in the simple cascading and accumulating forms, but require a more careful choice of representation.
B. Predictability from a presentation MSP
For example, any HMM presentation allows us to calculate using Eq. (1) a process's block entropy

$$ H(L) \equiv H[X_{0:L}] = - \sum_{w \in \mathcal{A}^L} \Pr(w) \log_2 \Pr(w) , $$

but at a computational cost exponential in L, due to the exponentially growing number of words in $\mathcal{A}^L$. Consequently, using a general HMM, one can neither directly nor efficiently calculate many key complexity measures, including a process's entropy rate $h_\mu$ and excess entropy E.
These limitations motivate using more specialized HMM classes. To take one example, it has been known for some time that a process' entropy rate can be calculated directly from any of its unifilar presentations.53 Another is that we can calculate the excess entropy directly from a process's uHMM forward and reverse states:59,60 $E = I[\mathcal{S}^+ ; \mathcal{S}^-]$.
However, efficient computation of myopic entropy rates remained elusive for some time, and we only recently found their closed-form expression.3 The myopic entropy rates are important because they represent the apparent entropy rate of a process if it is modeled as a finite Markov order-(L – 1) process—a very common approximation. Crucially, the difference from the process' true entropy rate is the surplus entropy rate incurred by using an order-(L – 1) Markov approximation. Similarly, these surplus entropy rates lead directly to not only an apparent loss of predictability, but errors in the inferred physical properties. These include overestimates of dissipation associated with the surplus entropy rate assigned to a physical thermodynamic system.37
Unifilarity, it turns out, is not enough to calculate a process' $h_\mu(L)$ directly. Rather, the S-MSP of any process presentation is what is required. Let us now develop the closed-form expression for the myopic entropy rates, following Ref. 41.
The length-L myopic entropy rate is the expected uncertainty in the Lth random variable $X_{L-1}$, given the preceding L – 1 random variables

$$ h_\mu(L) \equiv H[X_{L-1} \mid X_{0:L-1}] = H[X_{L-1} \mid X_{0:L-1}, \, \mathcal{S}_0 \sim \pi] , \qquad (8) $$

where, in the second equality, we explicitly give the condition specifying our ignorance of the initial state. That is, without making any observations, we can only assume that the initial distribution η0 over $M$'s states is the expected asymptotic distribution π. For a mixing ergodic process, for example, even if another distribution was known in the distant past, we still have $\eta_0 \to \pi$, as the elapsed time grows.
Assuming an initial probability distribution over $M$'s states, a given observation sequence induces a particular sequence of updated state distributions. That is, the S-MSP is unifilar regardless of whether $M$ is unifilar or not. Or, in other words, given the S-MSP's unique start state, π, and a particular realization $x_{0:L-1}$ of the last L – 1 random variables, we end up at the particular mixed state $\eta_{x_{0:L-1}}$. Moreover, the entropy of the next observation is uniquely determined by the current mixed state, suggesting that Eq. (8) becomes

$$ h_\mu(L) = H[X_{L-1} \mid \mathcal{R}_{L-1}] , $$
as proven elsewhere.41 Intuitively, conditioning on all of the past observation random variables is equivalent to conditioning on the random variable for the state distribution induced by particular observation sequences.
We can now recast Eq. (8) in terms of the S-MSP, finding

$$ h_\mu(L) = - \sum_{\eta \in \mathcal{R}_\pi} \sum_{x \in \mathcal{A}} \langle \delta_\pi | \, W^{L-1} \, | \delta_\eta \rangle \; \langle \delta_\eta | \, W^{(x)} \, | \mathbf{1} \rangle \, \log_2 \langle \delta_\eta | \, W^{(x)} \, | \mathbf{1} \rangle . $$

Here

$$ | H(W^{\mathcal{A}}) \rangle \equiv - \sum_{\eta \in \mathcal{R}_\pi} | \delta_\eta \rangle \sum_{x \in \mathcal{A}} \langle \delta_\eta | \, W^{(x)} \, | \mathbf{1} \rangle \, \log_2 \langle \delta_\eta | \, W^{(x)} \, | \mathbf{1} \rangle $$

is simply the column vector whose ith entry is the entropy of transitioning from the ith state of the S-MSP. Critically, $| H(W^{\mathcal{A}}) \rangle$ is independent of L.
Notice that taking the logarithm of the sum of the entries of the row vector $\langle \delta_\eta | W^{(x)}$ via $\log_2 \langle \delta_\eta | W^{(x)} | \mathbf{1} \rangle$ is only permissible since the S-MSP's unifilarity guarantees that $W^{(x)}$ has at most one nonzero entry per row. (We also use the familiar convention that $0 \log_2 0 = 0$.13)
The result is a particularly compact and efficient expression for the length-L myopic entropy rates

$$ h_\mu(L) = \langle \delta_\pi | \, W^{L-1} \, | H(W^{\mathcal{A}}) \rangle . \qquad (9) $$
Thus, all that is required is computing powers of the MSP transition dynamic. The computational cost is now only linear in L. Moreover, W is very sparse, especially so with a small alphabet . And, this means that the computational cost can be reduced even further via numerical optimization.
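The sketch below evaluates this expression for the illustrative generator's three-state MSP (the $W^{(x)}$ are written out by hand here; they are exactly what the construction sketched in Sec. IV produces for that example) and then reads off entropy-rate and excess-entropy estimates from the resulting $h_\mu(L)$ sequence.

```python
import numpy as np
from numpy.linalg import matrix_power

# MSP of the illustrative two-state generator used in earlier sketches;
# mixed states ordered as (start state pi, delta_A, delta_B).
W = {0: np.array([[0.0, 2/3, 0.0],
                  [0.0, 0.5, 0.0],
                  [0.0, 1.0, 0.0]]),
     1: np.array([[0.0, 0.0, 1/3],
                  [0.0, 0.0, 0.5],
                  [0.0, 0.0, 0.0]])}
Wtot = sum(W.values())
delta_pi = np.array([1.0, 0.0, 0.0])          # <delta_pi|, peaked at the start state

def entropy_bits(p):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# |H(W^A)>: entry i is the entropy of the next symbol from MSP state i.
symbol_probs = np.stack([W[x].sum(axis=1) for x in sorted(W)], axis=1)
H_ket = np.array([entropy_bits(row) for row in symbol_probs])

def h_mu(L):
    """h_mu(L) = <delta_pi| W^{L-1} |H(W^A)>."""
    return delta_pi @ matrix_power(Wtot, L - 1) @ H_ket

rates = [h_mu(L) for L in range(1, 12)]
h_rate = rates[-1]                            # entropy-rate estimate
E_estimate = sum(r - h_rate for r in rates)   # excess entropy estimate
print([round(r, 4) for r in rates[:3]], round(h_rate, 4), round(E_estimate, 4))
```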
With $h_\mu(L)$ in hand, the hierarchy of complexity measures that derive from it immediately follows, including the entropy rate $h_\mu$, the excess entropy E, and the transient information $\mathbf{T}$.1 Specifically, we have

$$ h_\mu = \lim_{L \to \infty} h_\mu(L) , $$

$$ E = \sum_{L=1}^{\infty} \bigl[ h_\mu(L) - h_\mu \bigr] , \quad \text{and} $$

$$ \mathbf{T} = \sum_{L=1}^{\infty} L \, \bigl[ h_\mu(L) - h_\mu \bigr] . $$
The sequel, Part II,79 discusses these in more detail, introducing their closed-form expressions. To prepare for this, we must first review the meromorphic functional calculus, which is needed for working with the above operators.
C. Continuous time?
We saw that correlation measures are easily extended to the continuous-time domain via continuous-time HMMs. Entropy rates (since they are rates) and state entropy (since it depends only on the instantaneous distribution) also carry over rather easily to continuous time. Indeed, the former is well studied for chaotic systems63 and the latter is exemplified by the thermodynamic entropy. Yet, other information-theoretic measures of information transduction are awkward when directly translated to continuous time. At least one approach has been taken recently towards understanding their structure,64–66 but more work is necessary.
D. Synchronization from generator MSP
If a process' state-space is known, then the S-MSP of the generating model allows one to track the observation-induced distributions over its states. This naturally leads to closed-form solutions to informational questions about how an observer comes to know, or how it synchronizes to, the system's states.
To monitor how an observer's knowledge of a process' internal state changes with increasing measurements, we use the myopic state uncertainty $\mathcal{H}(L) \equiv H[\mathcal{S}_L \mid X_{0:L}]$.1 Expressing it in terms of the S-MSP, one finds41

$$ \mathcal{H}(L) = \sum_{\eta \in \mathcal{R}_\pi} \langle \delta_\pi | \, W^{L} \, | \delta_\eta \rangle \; H[\eta] . $$

Here, $H[\eta]$ is the presentation-state uncertainty specified by the mixed state η

$$ H[\eta] \equiv - \sum_{s \in \boldsymbol{\mathcal{S}}} \langle \eta | \delta_s \rangle \, \log_2 \langle \eta | \delta_s \rangle , $$

where $| \delta_s \rangle$ is the length-$M$ column vector of all zeros except for a 1 at the appropriate index of the presentation-state s.
Continuing, we re-express $\mathcal{H}(L)$ in terms of powers of the S-MSP transition dynamic

$$ \mathcal{H}(L) = \langle \delta_\pi | \, W^{L} \, | H[\boldsymbol{\eta}] \rangle . \qquad (11) $$

Here, we defined

$$ | H[\boldsymbol{\eta}] \rangle \equiv \sum_{\eta \in \mathcal{R}_\pi} | \delta_\eta \rangle \, H[\eta] , $$

which is the L-independent length-$|\mathcal{R}_\pi|$ column vector whose entries are the appropriately indexed entropies of each mixed state.
The forms of Eqs. (9) and (11) demonstrate that $h_\mu(L)$ and $\mathcal{H}(L)$ differ only in the type of information being extracted after $\langle \delta_\pi |$ is evolved by the operator W: observable entropy $| H(W^{\mathcal{A}}) \rangle$ or state entropy $| H[\boldsymbol{\eta}] \rangle$, as implicated by their respective kets. Each of these entropies decreases as the distributions induced by longer observation sequences converge to the synchronized distribution. If synchronization is achieved, the distributions become δ-functions on a single state and the associated state-entropy vanishes.
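For the same illustrative MSP, the next sketch swaps in the state-entropy ket: the myopic state uncertainty starts at the generator's state entropy and collapses to zero after a single observation, so the accumulated excess is the synchronization information for this example.

```python
import numpy as np
from numpy.linalg import matrix_power

# Same illustrative MSP as above: mixed states (pi, delta_A, delta_B),
# written out explicitly as distributions over the generator's two states.
mixed_states = [np.array([2/3, 1/3]),   # pi: the unsynchronized start state
                np.array([1.0, 0.0]),   # delta_A
                np.array([0.0, 1.0])]   # delta_B
Wtot = np.array([[0.0, 2/3, 1/3],
                 [0.0, 0.5, 0.5],
                 [0.0, 1.0, 0.0]])
delta_pi = np.array([1.0, 0.0, 0.0])

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# |H[eta]>: entry i is the state uncertainty carried by mixed state eta_i.
H_eta_ket = np.array([entropy_bits(eta) for eta in mixed_states])

def state_uncertainty(L):
    """H(L) = <delta_pi| W^L |H[eta]>."""
    return delta_pi @ matrix_power(Wtot, L) @ H_eta_ket

H_vals = [state_uncertainty(L) for L in range(0, 8)]
H_inf = H_vals[-1]                                   # asymptotic uncertainty
S_estimate = sum(h - H_inf for h in H_vals)          # synchronization information
print([round(h, 4) for h in H_vals[:4]], round(S_estimate, 4))
```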
Paralleling $h_\mu(L)$, there is a complementary hierarchy of complexity measures that are built up from functions of $\mathcal{H}(L)$. These include the asymptotic state uncertainty $\mathcal{H}(\infty)$ and excess synchronization information $\mathcal{S}'$, to mention only two

$$ \mathcal{H}(\infty) = \lim_{L \to \infty} \mathcal{H}(L) \quad \text{and} $$

$$ \mathcal{S}' = \sum_{L=0}^{\infty} \bigl[ \mathcal{H}(L) - \mathcal{H}(\infty) \bigr] . $$

Compared to the $h_\mu(L)$ family of measures, $\mathcal{H}(\infty)$ and $\mathcal{S}'$ mirror the roles of $h_\mu$ and E, respectively.
The model state-complexity

$$ C = H[\pi] = \mathcal{H}(0) $$

also has an analog in the $h_\mu(L)$ hierarchy—the process' alphabet complexity

$$ H[X_0] = h_\mu(1) . $$
E. Maximal prediction from ϵ-machine MSP
We just reviewed the linear underpinnings of synchronizing to any model of a process. However, the myopic state uncertainty of the ϵ-machine has a distinguished role in determining the synchronization cost for maximally predicting a process, regardless of the presentation that generated it. Using the ϵ-machine's S-MSP, the ϵ-machine myopic state uncertainty can be written in direct parallel to the myopic state uncertainty of any model

$$ \mathcal{H}(L) = \langle \delta_\pi | \, \mathcal{W}^{L} \, | H[\boldsymbol{\eta}] \rangle . $$

The script $\mathcal{W}$ emphasizes that we are now specifically working with the state-to-state transition dynamic of the ϵ-machine's MSP.
Paralleling the generic case, an obvious hierarchy of complexity measures is built from functions of the ϵ-machine's $\mathcal{H}(L)$. For example, the ϵ-machine's state-complexity is the statistical complexity $C_\mu = \mathcal{H}(0)$. The information that must be obtained to synchronize to the causal state in order to maximally predict—the causal synchronization information—is given in terms of the ϵ-machine's S-MSP by $\mathcal{S} = \sum_{L=0}^{\infty} \mathcal{H}(L)$.
An important difference when using ϵ-machine presentations is that they have zero asymptotic state uncertainty

$$ \mathcal{H}(\infty) = \lim_{L \to \infty} \mathcal{H}(L) = 0 . $$
Therefore, . Moreover, we conjecture that for any presentation that generates the process, even if .
F. Beyond the MSP
Many of the complexity measures use a mixed-state presentation as the appropriate linear dynamic, with particular focus on the -MSP. However, we want to emphasize that this is more a reflection of questions that have become common. It does not indicate the general answer that one expects in the broader approach to finding the hidden linear dynamic. Here, we give a brief overview for how other linear dynamics can appear for different types of complexity questions. These have been uncovered recently and will be reported in more detail in sequels.
First, we found the reverse-time mixed-functional presentation (MFP) of any forward-time generator. The MFP tracks the reverse-time dynamic over linear functionals of state distributions induced by reverse-time observations
The MFP allows direct calculation of the convergence of the preparation uncertainty via powers of the linear MFP transition dynamic. The preparation uncertainty in turn gives a new perspective on the transient information since
can be interpreted as the predictive advantage of hindsight. Related, the myopic process crypticity χ(L) had been previously introduced.43 Its limit, the asymptotic crypticity, is $\chi = C_\mu - E$. And, this reveals a refined partitioning underlying the sum $C_\mu = E + \chi$.
Crypticity itself is positive only if the process' cryptic order
is positive. The cryptic order is always less than or equal to its better-known cousin, the Markov order R
since conditioning can never increase entropy. In the case of the cryptic order, we condition on future observations .
The forward-time cryptic operator presentation gives the forward-time observation-induced dynamic over the operators
Since the reverse causal state at time 0 is a linear combination of forward causal states,67,68 this presentation allows new calculations of the convergence to crypticity that implicate .
In fact, the cryptic operator presentation is a special case of the more general myopic bidirectional dynamic over operators
induced by new observations of either the future or the past. This is key to understanding the interplay between forgetfulness and shortsightedness: .
The list of these extensions continues. Detailed bounds on entropy-rate convergence are obtained from the transition dynamic of the so-called possibility machine, beyond the asymptotic result obtained in Ref. 50. And, the importance of post-synchronized monitoring, as quantified by the information lost due to negligence over a duration
can be determined using yet another type of modified MSP.
These examples all find an exact solution via a theory parallel to that outlined in the following, but applied to the linear dynamic appropriate for the corresponding complexity question. Furthermore, they highlight the opportunity, enabled by the full meromorphic functional calculus,4 to ask and answer more nuanced and, thus, more probing questions about the structure, predictability, and prediction.
G. The end?
It would seem that we achieved our goal. We identified the appropriate transition dynamic for common complexity questions and, by some standard, gave formulae for their exact solution. In point of fact, the effort so far has all been in preparation. Although we set up the framework appropriately for linear analysis, closed-form expressions for the complexity measures still await the mathematical developments of the following Secs. VI–VIII. At the same time, at the level of qualitative understanding and scientific interpretation, we have so far failed to answer the simple question:
What range of possible behaviors do these complexity measures exhibit?
and the natural follow-up question:
What mechanisms produce qualitatively different informational signatures?
The following Sec. VI reviews the recently developed functional calculus that allows us to actually decompose arbitrary functions of the nondiagonalizable hidden dynamic to give conclusive answers to these fundamental questions.4 We then analyze the range of possible behaviors and identify the internal mechanisms that give rise to qualitatively different contributions to complexity.
The investment in this and the succeeding Secs. VI–VIII allows Part II to express new closed-form solutions for many complexity measures beyond those achieved to date. In addition to obvious calculational advantages, this also gives new insights into possible behaviors of the complexity measures and, moreover, their unexpected similarities with each other. In many ways, the results shed new light on what we were (implicitly) probing with already-familiar complexity measures. Constructively, this suggests extending complexity magnitudes to complexity functions that succinctly capture the organization to all orders of correlation. Just as our intuition for pairwise correlation grows out of power spectra, so too these extensions unveil the workings of both a process' predictability and the burden of prediction for an observer.
VI. SPECTRAL THEORY BEYOND THE SPECTRAL THEOREM
Here, we briefly review the spectral decomposition theory from Ref. 4 needed for working with nondiagonalizable linear operators. As will become clear, it goes significantly beyond the spectral theorem for normal operators. Although the linear operator theory (especially as developed in the mathematical literature of functional analysis) already addresses nonnormal operators, it had not delivered the comparable machinery for a tractable spectral decomposition of nonnormal and nondiagonalizable operators. Reference 4 explored this topic and derived new relations that enable the practical analysis of nonnormal and nondiagonalizable systems—in which independent subprocesses (irreducible subspaces) can be directly manipulated.
A. Spectral primer
We restrict our attention to operators that have at most a countably infinite spectrum. Such operators share many features with finite-dimensional square matrices. And so, we review several elementary but essential facts that are used extensively in the following.
Recall that if A is a finite-dimensional square matrix, then A's spectrum is simply its set of eigenvalues Λ_A = {λ ∈ ℂ : det(λI − A) = 0}, where det(·) is the determinant of its argument.
For reference later, recall that the algebraic multiplicity a_λ of eigenvalue λ is the power of the factor (z − λ) in the characteristic polynomial det(zI − A). In contrast, the geometric multiplicity g_λ is the dimension of the kernel of the transformation A − λI, or the number of linearly independent eigenvectors associated with the eigenvalue. Moreover, g_λ is the number of Jordan blocks associated with λ. The algebraic and geometric multiplicities are equal for all eigenvalues exactly when the matrix is diagonalizable.
Since there can be multiple subspaces associated with a single eigenvalue, corresponding to different Jordan blocks in the Jordan canonical form, it is structurally important to introduce the index of the eigenvalue to describe the size of its largest-dimension associated subspace.
Definition 4. The index of eigenvalue λ is the size of the largest Jordan block associated with λ.
The index gives information beyond what the algebraic and geometric multiplicities themselves reveal. Nevertheless, for any λ ∈ Λ_A, it is always true that 1 ≤ ν_λ ≤ a_λ − g_λ + 1. In the diagonalizable case, ν_λ = 1 and a_λ = g_λ for all λ ∈ Λ_A.
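As a concrete illustration of these definitions, the following sketch computes the algebraic multiplicity, geometric multiplicity, and index numerically, via ranks of powers of A − λI, for a hypothetical 4×4 matrix.

```python
import numpy as np

# A sketch of these definitions in practice, for a hypothetical 4x4 matrix with
# a single eigenvalue 0.5 whose Jordan structure is one 2x2 block plus two 1x1
# blocks: algebraic multiplicity 4, geometric multiplicity 3, index 2.
A = np.array([[0.5, 1.0, 0.0, 0.0],
              [0.0, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.0],
              [0.0, 0.0, 0.0, 0.5]])
lam, n = 0.5, A.shape[0]
B = A - lam * np.eye(n)

algebraic = int(np.sum(np.isclose(np.linalg.eigvals(A), lam)))
geometric = n - np.linalg.matrix_rank(B)     # dim ker(A - lam I) = number of Jordan blocks

# Index: smallest power at which rank(B^k) stops decreasing = largest block size.
nu, Bk = 1, B.copy()
while np.linalg.matrix_rank(Bk) != np.linalg.matrix_rank(Bk @ B):
    nu, Bk = nu + 1, Bk @ B
print(algebraic, geometric, nu)              # -> 4 3 2
```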
The resolvent R(z) ≡ (zI − A)^{-1}, defined with the help of the continuous complex variable z ∈ ℂ, captures all of the spectral information about A through the poles of the resolvent's matrix elements. In fact, the resolvent contains more than just the spectrum: the order of each pole gives the index of the corresponding eigenvalue.
Each eigenvalue λ of A has an associated spectral projection operator A_λ, which is the residue of the resolvent at z = λ: A_λ = (1/2πi) ∮_{C_λ} (zI − A)^{-1} dz, where C_λ is a counterclockwise contour in the complex plane around eigenvalue λ that encloses no other eigenvalue. The residue of the matrix can be calculated elementwise.
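The residue characterization can be checked numerically by summing the resolvent around a small circle that encloses only the chosen eigenvalue; the 2×2 matrix below is an arbitrary example.

```python
import numpy as np

# A numerical sketch of the residue characterization: approximate the projector
# as (1/2*pi*i) times the integral of (zI - A)^{-1} dz around a small circle
# enclosing only the chosen eigenvalue. The matrix is an arbitrary example.
A = np.array([[0.6, 0.4],
              [0.1, 0.9]])                     # eigenvalues 1.0 and 0.5
lam, radius, K = 1.0, 0.05, 400
n = A.shape[0]

P = np.zeros((n, n), dtype=complex)
for k in range(K):
    theta = 2 * np.pi * k / K
    z = lam + radius * np.exp(1j * theta)
    dz = 1j * radius * np.exp(1j * theta) * (2 * np.pi / K)
    P += np.linalg.inv(z * np.eye(n) - A) * dz  # resolvent times dz
P /= 2j * np.pi

assert np.allclose(P @ P, P, atol=1e-6)         # idempotent
assert np.allclose(A @ P, lam * P, atol=1e-6)   # projects onto the lambda = 1 eigenspace
```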
The projection operators are orthonormal
and sum to the identity
For cases where , we found that the projection operator associated with λ can be calculated as4
Not all projection operators of a nondiagonalizable operator can be found directly from Eq. (15), since some have an index larger than one. However, if there is only one eigenvalue that has an index larger than one—the almost diagonalizable case treated in Part II79—then Eq. (15), together with the fact that the projection operators must sum to the identity, does give a full solution to the set of projection operators. Next, we consider the general case, with no restriction on the index.
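The following sketch exercises this index-one construction, assuming the product form A_λ = ∏_{ζ ≠ λ} (A − ζI)/(λ − ζ) over the remaining eigenvalues; that form is stated here as an assumption consistent with the text, not necessarily the exact statement of Eq. (15).

```python
import numpy as np

# A sketch of the index-one construction, assuming the product form
#   A_lam = prod over zeta != lam of (A - zeta I) / (lam - zeta),
# stated as an assumption consistent with the text (not necessarily the exact
# form of Eq. (15)). The example matrix has three distinct eigenvalues, so
# every eigenvalue has index one.
T = np.array([[0.6, 0.4, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
spectrum = np.linalg.eigvals(T)
n = T.shape[0]

def projector(lam, spectrum, A):
    P = np.eye(len(A), dtype=complex)
    for zeta in spectrum:
        if not np.isclose(zeta, lam):
            P = P @ (A - zeta * np.eye(len(A))) / (lam - zeta)
    return P

projs = [projector(lam, spectrum, T) for lam in spectrum]
assert np.allclose(sum(projs), np.eye(n))                     # completeness
assert all(np.allclose(P @ P, P) for P in projs)              # idempotence
assert np.allclose(sum(lam * P for lam, P in zip(spectrum, projs)), T)
```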
B. Eigenprojectors via left and right generalized eigenvectors
In general, as we now discuss, an operator's eigenprojectors can be obtained from all left and right eigenvectors and generalized eigenvectors associated with the eigenvalue. Let be the n-tuple of eigenvalues in which each eigenvalue is listed times. So , and is the total number of Jordan blocks in the Jordan canonical form. Each corresponds to a particular Jordan block of size mk. The index of λ is thus
There is a corresponding n-tuple of mk-tuples of linearly independent generalized right-eigenvectors
where
and a corresponding n-tuple of mk-tuples of linearly independent generalized left-eigenvectors
where
such that
and
for , where and . Specifically, and are conventional right and left eigenvectors, respectively.
Most directly, the generalized right and left eigenvectors can be found as the nontrivial solutions to
and
respectively. Imposing appropriate normalization, we find that
Crucially, right and left eigenvectors are no longer simply related by complex-conjugate transposition and right eigenvectors are not necessarily orthogonal to each other. Rather, left eigenvectors and generalized eigenvectors form a dual basis to the right eigenvectors and generalized eigenvectors. Somewhat surprisingly, the most generalized left eigenvector associated with λk is dual to the least generalized right eigenvector associated with λk
Explicitly, we find that the spectral projection operators for a nondiagonalizable matrix can be written as
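For a nondiagonalizable example, the same projectors can be assembled from the generalized right eigenvectors and their dual left counterparts. The sketch below lets SymPy's Jordan decomposition supply the generalized eigenvectors; the 3×3 matrix is hypothetical.

```python
import sympy as sp

# A sketch for a nondiagonalizable example: assemble the eigenprojector from
# generalized right eigenvectors (columns of P) and the dual generalized left
# eigenvectors (rows of P**-1), supplied here by SymPy's Jordan decomposition
# A = P J P**-1. The 3x3 matrix is hypothetical.
A = sp.Matrix([[1, 1, 0],
               [0, 1, 0],
               [0, 0, sp.Rational(1, 2)]])
P, J = A.jordan_form()
Pinv = P.inv()

lam = sp.Integer(1)
chain = [i for i in range(A.rows) if J[i, i] == lam]   # columns in lambda's Jordan chain
A_lam = sum((P[:, i] * Pinv[i, :] for i in chain), sp.zeros(3, 3))

assert A_lam * A_lam == A_lam                          # idempotent
assert A * A_lam == A_lam * A                          # commutes with A
```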
C. Companion operators and resolvent decomposition
It is useful to introduce the generalized set of companion operators
for and . These operators satisfy the following semigroup relation:
reduces to the eigenprojector for m = 0
and it exactly reduces to the zero-matrix for
Crucially, we can rewrite the resolvent as a weighted sum of the companion matrices , with complex coefficients that have poles at each eigenvalue λ up to the eigenvalue's index
Ultimately these results allow us to easily evaluate arbitrary functions of nondiagonalizable operators, to which we now turn. (Reference 4 gives more background.)
D. Functions of nondiagonalizable operators
The meromorphic functional calculus4 gives meaning to arbitrary functions of any linear operator A. Its starting point is the Cauchy-integral-like formula
where denotes a sufficiently small counterclockwise contour around λ in the complex plane such that no singularity of the integrand besides the possible pole at is enclosed by the contour.
Invoking Eq. (24) yields the desired formulation
Hence, with the eigenprojectors in hand, evaluating an arbitrary function of the nondiagonalizable operator A comes down to the evaluation of several residues.
Typically, evaluating Eq. (26) requires less work than one might expect when looking at the equation in its full generality. For example, whenever f(z) is holomorphic (i.e., well behaved) at λ, the residue simplifies to
where f^(m)(λ) is the mth derivative of f(z) evaluated at λ. However, if f(z) has a pole or zero at λ, then it substantially changes the complex contour integration. In the simplest case, when A is diagonalizable and f(z) is holomorphic at Λ_A, the matrix-valued function f(A) reduces to the simple form
Moreover, if λ is nondegenerate, then
although here ⟨λ| should be interpreted as the solution to the left eigenequation ⟨λ|A = λ⟨λ| and, in general, ⟨λ| is not the conjugate transpose of |λ⟩.
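A quick numerical check of this diagonalizable, holomorphic case: the spectral sum of f(λ) times the projectors reproduces f(A), here with f(z) = e^z and an arbitrary 2×2 example.

```python
import numpy as np
from scipy.linalg import expm

# A quick check of the diagonalizable, holomorphic case: the spectral sum over
# f(lambda) * A_lambda reproduces f(A), here with f(z) = exp(z) and an
# arbitrary 2x2 example (eigenvalues -1 and -2).
A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])
evals, R = np.linalg.eig(A)         # columns of R: right eigenvectors
Ldual = np.linalg.inv(R)            # rows: dual left eigenvectors, <l_i | r_j> = delta_ij
projs = [np.outer(R[:, i], Ldual[i, :]) for i in range(len(evals))]

spectral = sum(np.exp(lam) * P for lam, P in zip(evals, projs))
assert np.allclose(expm(A), spectral)
```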
The meromorphic functional calculus agrees with the Taylor-series approach whenever the series converges and agrees with the holomorphic functional calculus of Ref. 69 whenever f(z) is holomorphic at ΛA. However, when both these functional calculi fail, the meromorphic functional calculus extends the domain of f(A) in a way that is key to the following analysis. We show, for example, that within the meromorphic functional calculus, the negative-one power of a singular operator is the Drazin inverse. The Drazin inverse effectively inverts everything that is invertible. Notably, it appears ubiquitously in the new-found solutions to many complexity measures.
E. Evaluating residues
How does one use Eq. (26)? It says that the spectral decomposition of f(A) reduces to the evaluation of several residues, where
So, to make progress with Eq. (26), we must evaluate the function-dependent residues appearing there. This is basic complex analysis. Recall that the residue of a complex-valued function g(z) around its isolated pole λ of order n + 1 can be calculated from Res_{z→λ} g(z) = (1/n!) lim_{z→λ} (d^n/dz^n) [(z − λ)^{n+1} g(z)].
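As a sanity check, the sketch below evaluates a residue both by this derivative formula and by SymPy's built-in residue; the function and pole are arbitrary.

```python
import sympy as sp

# A sanity check of the residue formula: evaluate the residue of an arbitrary
# g(z) with a pole of order n + 1 both ways.
z = sp.symbols('z')
lam, n = sp.Rational(1, 2), 2                     # pole of order 3 at z = 1/2
g = sp.exp(z) / (z - lam)**(n + 1)

by_formula = sp.limit(sp.diff((z - lam)**(n + 1) * g, z, n), z, lam) / sp.factorial(n)
assert sp.simplify(by_formula - sp.residue(g, z, lam)) == 0
print(by_formula)                                  # exp(1/2)/2
```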
F. Decomposing AL
Equation (26) allows us to explicitly derive the spectral decomposition of powers of an operator. For , z = 0 can be either a zero or a pole of f(z), depending on the value of L. In either case, an eigenvalue of λ = 0 will distinguish itself in the residue calculation of AL via its unique ability to change the order of the pole (or zero) at z = 0.
For example, at this special value of λ and for integer L > 0, λ = 0 induces poles that cancel with the zeros of , since zL has a zero at z = 0 of order L. For integer L < 0, an eigenvalue of λ = 0 increases the order of the z = 0 pole of . For all other eigenvalues, the residues will be as expected.
Hence, for any
where the generalized binomial coefficient is defined by binom(L, m) ≡ (1/m!) ∏_{j=0}^{m−1} (L − j), with binom(L, 0) ≡ 1, and where [0 ∈ Λ_A] is the Iverson bracket. The latter takes value 1 if 0 is an eigenvalue of A and value 0 if not. Equation (27) applies to any linear operator with only isolated singularities in its resolvent.
If L is a nonnegative integer such that for all , then
where is now reduced to the traditional binomial coefficient .
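The special role of a zero eigenvalue can be seen in a small example: below, λ = 0 has index 2, so for integer L ≥ 2 its contribution to A^L vanishes and only the nonzero eigenvalue survives. The matrix and its eigenvalue-1/2 projector are hypothetical choices.

```python
import numpy as np

# A sketch of the special role of a zero eigenvalue: here lambda = 0 has
# index 2 (one 2x2 nilpotent block), so for integer L >= 2 its contribution to
# A^L vanishes and only the nonzero eigenvalue 1/2 survives.
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.5]])
A_half = np.diag([0.0, 0.0, 1.0])                  # projector for eigenvalue 1/2

for L in range(2, 6):
    assert np.allclose(np.linalg.matrix_power(A, L), 0.5**L * A_half)

# For L = 1 the nilpotent part still contributes.
assert not np.allclose(A, 0.5 * A_half)
```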
The form of Eq. (27), together with our earlier operator expressions for complexity measures that take on a cascading form, directly leads to the first fully general closed-form expressions for correlation, myopic entropy rates, and remaining state uncertainty, among others, for the broad class of processes that can be generated by HMMs. This will be made explicit in Part II,79 where the consequences will also be unraveled.
G. Drazin inverse
The negative-one power of a linear operator is in general not the same as its inverse , since the latter need not exist. However, the negative-one power of a linear operator is always defined via Eq. (27)
Notably, when the operator is singular, we find that
This is the Drazin inverse of A, also known as the -inverse.70 (Note that it is not the same as the Moore–Penrose pseudo-inverse.) Although the Drazin inverse is usually defined axiomatically to satisfy certain criteria, Ref. 4 naturally derived it as the negative-one power of a singular operator in the meromorphic functional calculus.
Whenever A is invertible, however, the negative-one power coincides with the ordinary inverse. That said, we should not confuse this coincidence with equivalence. More to the point, there is no reason other than historical accidents of notation that the negative-one power should in general be equivalent to the inverse—especially if an operator is not invertible. To avoid confusing the two, we use the notation A^D for the Drazin inverse of A. Still, A^D = A^{-1} whenever A is invertible.
Although Eq. (30) is a constructive way to build the Drazin inverse, it suggests more work than is actually necessary. We derived several simple constructions for it that require only the original operator and the eigenvalue-0 projector. For example, Ref. 4 found that, for any
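A minimal sketch of one such construction, assuming the form A^D = (A + cA_0)^{-1}(I − A_0) for any c ≠ 0, which uses only the operator and its eigenvalue-0 projector as described; this particular formula is our assumption, not necessarily the expression referenced above. The code verifies it against the defining Drazin criteria.

```python
import numpy as np

# A minimal sketch of one such construction, assuming the form
#   A_drazin = (A + c * A_0)^{-1} (I - A_0)   for any c != 0,
# where A_0 is the eigenvalue-0 projector. This formula is our assumption; the
# sketch verifies it against the defining Drazin criteria.
A = np.array([[2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])        # eigenvalues: 2 and 0 (index 2)
n = A.shape[0]

A_2 = np.diag([1.0, 0.0, 0.0])         # projector for eigenvalue 2 (coordinate-aligned here)
A_0 = np.eye(n) - A_2                  # eigenvalue-0 projector

c = 1.0
A_drazin = np.linalg.inv(A + c * A_0) @ (np.eye(n) - A_0)

assert np.allclose(A_drazin @ A @ A_drazin, A_drazin)        # X A X = X
assert np.allclose(A @ A_drazin, A_drazin @ A)               # A X = X A
assert np.allclose(np.linalg.matrix_power(A, 3) @ A_drazin,
                   np.linalg.matrix_power(A, 2))             # A^{k+1} X = A^k, k >= index
```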
Later, we will also need the decomposition of , as it enters into many closed-form complexity expressions related to accumulated transients—the past–future mutual information among them. Reference 4 showed that
for any stochastic matrix T, where T1 is the projection operator associated with λ = 1. If T is the state-transition matrix of an ergodic process, then the RHS of Eq. (32) becomes especially simple to evaluate since then .
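The sketch below assumes Eq. (32) reads (I − T)^D = (I − T + T_1)^{-1} − T_1, with T_1 = |1⟩⟨π| in the ergodic case; this reading is stated as an assumption and then checked against the Drazin criteria for I − T.

```python
import numpy as np

# A numerical sketch, assuming Eq. (32) reads
#   (I - T)^Drazin = (I - T + T_1)^{-1} - T_1   with   T_1 = |1><pi|
# in the ergodic case (our reading, stated as an assumption). The candidate is
# checked against the Drazin criteria for I - T.
T = np.array([[0.6, 0.4, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
n = T.shape[0]

evals, vecs = np.linalg.eig(T.T)                     # left eigenvectors of T
pi = np.real(vecs[:, np.argmin(np.abs(evals - 1.0))])
pi /= pi.sum()                                       # stationary distribution

T1 = np.outer(np.ones(n), pi)                        # |1><pi|
Z = np.linalg.inv(np.eye(n) - T + T1)                # the "fundamental matrix" mentioned below
D = Z - T1                                           # candidate (I - T)^Drazin

M = np.eye(n) - T
assert np.allclose(D @ M @ D, D)
assert np.allclose(M @ D, D @ M)
assert np.allclose(M @ M @ D, M)                     # eigenvalue 0 of (I - T) has index 1
```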
Somewhat tangentially, this connects to the fundamental matrix used by Ref. 71 in its analysis of Markovian dynamics. More immediately, Eq. (32) plays a prominent role when deriving excess entropy and synchronization information. The explicit spectral decomposition is also useful
VII. PROJECTION OPERATORS FOR STOCHASTIC DYNAMICS
The preceding employed the notation that A is a general linear operator. In the following, we reserve T for the operator of a stochastic transition dynamic, as in the state-to-state transition dynamic of an HMM: . If the state space is finite and has a stationary distribution, then T has a representation that is a nonnegative row-stochastic—all rows sum to unity—transition matrix.
We are now in a position to summarize several useful properties for the projection operators of any row-stochastic matrix T. Naturally, if one uses column-stochastic instead of row-stochastic matrices, all results can be translated by simply taking the transpose of every line in the derivations. [Recall that .]
The fact that all elements of the transition matrix are real-valued guarantees that, for each λ ∈ Λ_T, its complex conjugate is also in Λ_T. Moreover, the spectral projection operator associated with the complex conjugate of λ is T_λ's complex conjugate
This also implies that T_λ is real if λ is real.
If the dynamic induced by T has a stationary distribution over the state space, then T's spectral radius is unity and all its eigenvalues lie on or within the unit circle in the complex plane. The maximal eigenvalues have unity magnitude and include λ = 1. Moreover, an extension of the Perron–Frobenius theorem guarantees that eigenvalues on the unit circle have algebraic multiplicity equal to their geometric multiplicity. And, so, ν_λ = 1 for all eigenvalues on the unit circle.
T's index-one eigenvalue λ = 1 is associated with stationarity of the hidden Markov model. T's other eigenvalues on the unit circle are roots of unity and correspond to deterministic periodicities within the process.
A. Row sums
If T is row-stochastic, then by definition T|1⟩ = |1⟩, where |1⟩ denotes the column vector of all ones.
Hence, via the general eigenprojector construction Eq. (19) and the general orthogonality condition Eq. (18), we find that T_1|1⟩ = |1⟩, while T_λ|1⟩ = 0 for every other λ ∈ Λ_T.
This shows that T's projection operator T1 is row-stochastic, whereas each row of every other projection operator must sum to zero. This can also be viewed as a consequence of conservation of probability for dynamics over Markov models.
B. Expected stationary distribution
If unity is the only eigenvalue in Λ_T on the unit circle, then the process has no deterministic periodicities. In this case, every initial condition leads to a stationary asymptotic distribution. The expected stationary distribution from any initial distribution α is lim_{L→∞} ⟨α|T^L = ⟨α|T_1.
An attractive feature of Eq. (35) is that it holds even for nonergodic processes—those with multiple stationary components.
When the stochastic process is ergodic (one stationary component), then the eigenvalue λ = 1 is nondegenerate and there is only one stationary distribution π. The T1 projection operator becomes T_1 = |1⟩⟨π|,
even if there are deterministic periodicities. Deterministic periodicities imply that different initial conditions may still induce different asymptotic oscillations, governed by the other eigenvalues on the unit circle and their projection operators. In the case of ergodic processes without deterministic periodicities, every initial condition relaxes to the same steady-state distribution over the hidden states: ⟨α|T_1 = ⟨π| regardless of α, so long as α is a properly normalized probability distribution.
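The row-sum and stationarity claims of this and the preceding subsection can be verified together for a small ergodic, aperiodic, diagonalizable example:

```python
import numpy as np

# A sketch tying together the row-sum and stationarity claims for a small
# ergodic, aperiodic, diagonalizable row-stochastic example: T_1 = |1><pi|,
# the rows of every other projector sum to zero, and <alpha| T_1 = <pi| for
# any normalized initial distribution alpha.
T = np.array([[0.6, 0.4, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
evals, R = np.linalg.eig(T)
Ldual = np.linalg.inv(R)
projs = {i: np.outer(R[:, i], Ldual[i, :]) for i in range(3)}

i1 = int(np.argmin(np.abs(evals - 1.0)))          # index of eigenvalue 1
pi = projs[i1][0, :].real                         # every row of T_1 is <pi|
assert np.allclose(pi @ T, pi) and np.isclose(pi.sum(), 1.0)
assert np.allclose(projs[i1], np.outer(np.ones(3), pi))       # T_1 = |1><pi|

for i, P in projs.items():
    target = 1.0 if i == i1 else 0.0
    assert np.allclose(P.sum(axis=1), target)                 # row sums

alpha = np.array([0.7, 0.2, 0.1])
assert np.allclose(alpha @ projs[i1], pi)                     # <alpha| T_1 = <pi|
```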
As suggested in Ref. 4, the new results above extend the spectral theory to arbitrary functions of nondiagonalizable operators in a way that contributes to a spectral weighted digraph theory beyond the purview of spectral graph theory proper.72 Moreover, this enables new analyses. In particular, the spectra of undirected graphs and their graph Laplacian matrices have been studied extensively and continue to be. However, those efforts have flourished in part because the spectral theorem for normal operators applies directly to both undirected graphs and their Laplacians. Digraph spectra have also been studied,73 but to a much lesser extent. Again, this is due in part to the spectral theorem not typically applying, rendering this case much more complicated. Thus, the spectral theory of nonnormal and nondiagonalizable operators offers new opportunities. This not only hints at the importance of extracting eigenvalues from directed graph motifs, but also begins to show how eigenvectors and eigenprojectors can be built up iteratively from directed graph clusters.
VIII. SPECTRA BY INSPECTION
The next Secs. VIII A and VIII B show how spectra and eigenprojectors can be intuited, computed, and applied in the analysis of complex systems. These techniques often make the problem at hand analytically tractable, and they will be used in the examples of Part II79 to give exact expressions for complexity measures.
A. Eigenvalues from a graph structure
Consider a directed graph structure with cascading dependencies: one cluster of nodes feeds back only to itself according to matrix A and feeds forward to another cluster of nodes according to matrix B, which is not necessarily a square matrix. The second cluster feeds back only to itself according to matrix C. The latter node cluster might also feed forward to another cluster, but such considerations can be applied iteratively.
The simple situation just described is summarized, with proper index permutation, by a block-upper-triangular matrix of the form W = [[A, B], [0, C]]. In this case, it is easy to see that det(W − zI) = det(A − zI) det(C − zI).
And so, Λ_W = Λ_A ∪ Λ_C. This simplification presents an opportunity to read off eigenvalues from clustered graph structures that often appear in practice, especially for transient graph structures associated with synchronization, as with transient mixed-state transitions in MSPs.
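A small numerical check of this cascading-structure rule, with arbitrary blocks: the spectrum of W is the union of the spectra of the diagonal blocks, independent of the feedforward block B.

```python
import numpy as np

# A check of the cascading-structure rule: the spectrum of the block-triangular
# W is the union of the spectra of the diagonal blocks A and C, regardless of
# the feedforward block B. The blocks below are arbitrary.
A = np.array([[0.5, 0.3],
              [0.2, 0.4]])                 # eigenvalues 0.7 and 0.2
C = np.array([[0.9]])
B = np.array([[0.2],
              [0.4]])
W = np.block([[A, B],
              [np.zeros((1, 2)), C]])

spec_W = np.sort_complex(np.linalg.eigvals(W))
spec_AC = np.sort_complex(np.concatenate([np.linalg.eigvals(A), np.linalg.eigvals(C)]))
assert np.allclose(spec_W, spec_AC)
```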
Cyclic cluster structures (say, of length N and edge-weights α1 through αN) yield especially simple spectra
That is, the eigenvalues are simply the Nth roots of the product of all of the edge-weights. See Fig. 3(a).
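A sketch verifying the cycle rule numerically, with arbitrary weights: every eigenvalue of the weighted N-cycle is an N-th root of the product of its edge weights.

```python
import numpy as np

# A sketch of the cycle rule: every eigenvalue of the weighted N-cycle is an
# N-th root of the product of its edge weights. The weights are arbitrary.
alphas = np.array([0.5, 0.8, 0.3, 0.9])
N = len(alphas)
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = alphas[i]          # edge i -> i+1 (mod N) carries weight alphas[i]

evals = np.linalg.eigvals(A)
assert np.allclose(evals**N, np.prod(alphas))                 # each is an N-th root of the product
assert np.allclose(np.abs(evals), np.prod(alphas)**(1.0 / N)) # common modulus (geometric mean)
```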
(a) Weighted directed graph (digraph) of the feedback matrix A of a cyclic cluster structure that contributes eigenvalues with algebraic multiplicities for all . (b) Weighted digraph of the feedback matrix A of a doubly cyclic cluster structure that contributes eigenvalues with algebraic multiplicities and for . (This eigenvalue “rule” depends on having the same number of β-transitions as γ-transitions.) The 0-eigenvalue only has geometric multiplicity of , so the structure is nondiagonalizable for N > 2. Nevertheless, the generalized eigenvectors are easy to construct. The spectral analysis of the cluster structure in (b) suggests more general rules that can be gleaned from reading-off eigenvalues from digraph clusters; e.g., if a chain of α's appears in the bisecting path.
Similar rules for reading off spectra from other cluster structures exist. Although we cannot list them exhaustively here, we give another simple but useful rule in Fig. 3(b). It also indicates the ubiquity of nondiagonalizability in weighted digraph structures. This second rule is suggestive of further generalizations where spectra can be read off from common digraph motifs.
B. Eigenprojectors from a graph structure
We just outlined how clustered directed graph structures yield simplified joint spectra. Is there a corresponding simplification of the spectral projection operators? In fact, there is, and it leads to an iterative construction of “higher-level” projectors from “lower-level” clustered components. In contrast to the joint spectrum, though, which completely ignores the feedforward matrix B, the emergent projectors do require B to pull the associated eigencontributions into the generalized setting. Figure 4 summarizes the results for the simple case of nondegenerate eigenvalues. The general case is constructed similarly.
Construction of W-eigenprojectors from low-level A-projectors and C-projectors, when . (Recall that and can be constructed from the lower-level projectors.) For simplicity, we assume that the algebraic multiplicity in each of these cases.
The preceding results imply a number of algorithms, both for analytic and numerical calculations. Most directly, they show that eigenanalysis can be partitioned into a series of simpler problems that are later combined into a final solution. Beyond more efficient serial computation, there are also opportunities to parallelize the numerical computation of the eigenprojectors, whether they are computed directly, say from Eq. (15), or from right and left eigenvectors and generalized eigenvectors. Such opportunities for further optimization are perhaps rare, given how extremely well developed the field of numerical linear algebra already is. That said, the automation now possible will be key to applying our analysis methods to real systems with immense data produced from very high-dimensional state spaces.
IX. CONCLUSION
Surprisingly, many questions we ask about a structured stochastic nonlinear process imply a linear dynamic over a preferred hidden state space. These questions often concern predictability and prediction. To make predictions about the real world, though, it is not sufficient to have a model of the world. Additionally, the predictor must synchronize their model to the real-world data that has been observed up to the present time. This metadynamic of synchronization—the transition structure among belief states—is intrinsically linear, but is typically nondiagonalizable.
We presented results for the observed processes generated by HMMs. However, the results easily apply to other state-based models, including observable operator models (OOMs)74 and generalized HMMs (GHMMs).45 In each case, the observation-induced synchronizing metadynamic is still an HMM. It will also be useful to adapt our methods to open quantum models, where a density matrix evolves via environmental influence and a protocol for partial measurements (POVMs) induces a synchronizing (meta)dynamic.
Recall organizational Tables I and II from the Introduction. After all the intervening detail, let's consider a more nuanced formulation. We saw that once we frame questions in terms of the hidden linear transition dynamic, complexity measures are usually either of the cascading or accumulation type. Scalar complexity measures often accumulate only the interesting transient structure that rides on top of the asymptotics. Skimming off the asymptotics led to the Drazin inverse. Modified accumulation turned complexity scalars into complexity functions. Tables III and IV summarize the results. Notably, Table IV gives closed-form formulae for many complexity measures that previously were only expressed as infinite sums over functions of probabilities.
Once we identify the hidden linear dynamic behind our questions, most are either of the cascading or the accumulating type. Moreover, if a complexity measure accumulates transients, the Drazin inverse is likely to appear. Modulated accumulation can be a helpful theoretical tool, since all derivatives and integrals of the cascading type can be calculated if we know the modulated accumulation. With z ∈ ℂ, modulated accumulation involves an operator-valued z-transform. However, with z = e^{iω} and ω ∈ ℝ, modulated accumulation involves an operator-valued Fourier transform.
| | | Discrete time | Continuous time |
| --- | --- | --- | --- |
| Derivatives of cascading ↑ | Cascading | | |
| Integrals of cascading ↓ | Accumulated transients | | |
| | Modulated accumulation | | |
Genres of complexity questions given in order of increasing sophistication; summary of Part I and a preview of Part II.79 Each implies a different linear transition dynamic. Closed-form formulae are given for several complexity measures, showing the similarity among them down the same column. Formulae in the same row have matching bra-ket pairs. The similarity within the column corresponds to similarity in the time-evolution implied by the question type. The similarity within the row corresponds to the similarity in question genre.
| Genre | Implied linear transition dynamic | Example questions: Cascading | Example questions: Accumulated transients | Example questions: Modulated accumulation |
| --- | --- | --- | --- | --- |
| Overt observational | Transition matrix T of any HMM | Correlation | Green–Kubo transport coefficients | Power spectra |
| Predictability | Transition matrix W of MSP of any HMM | Myopic entropy rate | Excess entropy, E | E(z) |
| Optimal prediction | Transition matrix of MSP of ϵ-machine | Causal state uncertainty | Synchronization info, S | S(z) |
Let us remind ourselves: why, in this analysis, were nondiagonalizable dynamics noteworthy? They are noteworthy because the metadynamics of even diagonalizable dynamics are generically nondiagonalizable. And, this is typically due to the 0-eigenvalue subspace that is responsible for the initial, ephemeral epoch of symmetry collapse. The metadynamics of transitioning between belief states demonstrated this explicitly. However, other metadynamics beyond those focused on prediction are also generically nondiagonalizable. For example, in the analysis of quantum compression, crypticity, and other aspects of hidden structure, the relevant linear dynamic is not the MSP. Instead, it is a nondiagonalizable structure that can be fruitfully analyzed with the same generalized spectral theory of nonnormal operators.4
Using the appropriate dynamic for common complexity questions and the meromorphic functional calculus to overcome nondiagonalizability, the sequel (Part II)79 goes on to develop closed-form expressions for complexity measures as simple functions of the corresponding transition dynamic of the implied HMM.
ACKNOWLEDGMENTS
J.P.C. thanks the Santa Fe Institute for its hospitality. The authors thank Chris Ellison, Ryan James, John Mahoney, Alec Boyd, and Dowman Varn for the helpful discussions. This material is based upon work supported by, or in part by, the U.S. Army Research Laboratory and the U.S. Army Research Office under Contract Nos. W911NF-12-1-0234, W911NF-13-1-0340, and W911NF-13-1-0390.
References
While we follow Shannon12 in this, it differs from the more widely used state-labeled HMMs.