We merge computational mechanics’ definition of causal states (predictively equivalent histories) with reproducing-kernel Hilbert space (RKHS) representation inference. The result is a widely applicable method that infers causal structure directly from observations of a system’s behaviors whether they are over discrete or continuous events or time. A structural representation—a finite- or infinite-state kernel ϵ-machine—is extracted by a reduced-dimension transform that gives an efficient representation of causal states and their topology. In this way, the system dynamics are represented by a stochastic (ordinary or partial) differential equation that acts on causal states. We introduce an algorithm to estimate the associated evolution operator. Paralleling the Fokker–Planck equation, it efficiently evolves causal-state distributions and makes predictions in the original data space via an RKHS functional mapping. We demonstrate these techniques, together with their predictive abilities, on discrete-time, discrete-value infinite Markov-order processes generated by finite-state hidden Markov models with (i) finite or (ii) uncountably infinite causal states and (iii) continuous-time, continuous-value processes generated by thermally driven chaotic flows. The method robustly estimates causal structure in the presence of varying external and measurement noise levels and for very high-dimensional data.

Computational mechanics is a mathematical framework for pattern discovery that describes how information is stored, structured, and transformed in a physical process. Its constructive application to observed data has been demonstrated for some time. Until now, though, success was limited by the need to strongly discretize observations or discover state-space generating partitions for correct symbolic dynamics. Exploiting modern machine-learning foundations in functional analysis, we broadly extend computational mechanics to inferring models of many distinct classes of structured processes, going beyond fully discrete data to processes with continuous data and those measured by heterogeneous instruments. Equations of motion for the evolution of process states can also be reconstructed from data. The method successfully recovers a process’s causal states and its dynamics in both discrete and continuous cases, including the recovery of noisy, high-dimensional chaotic attractors.

At root, the physical sciences and engineering turn on successfully modeling the behavior of a physical system from observations. To say that they have been successful is an understatement. That success raises a deep question, though—one requiring careful reflection. How, starting only from measurements, does one construct a model for behavioral evolution consistent with the given data?

Not surprisingly, dynamical-model inference has been tackled via a variety of theoretical frameworks. However, most approaches make strong assumptions about the internal organization of the data-generating process: Fourier analysis assumes a collection of exactly periodic oscillators; Laplace transforms, a collection of exponential relaxation processes; feed-forward neural networks, linearly coupled threshold units. We are all familiar with the results: A method is very successful when the data generator is in the assumed model class. The question begged in this is that one must know something—for some process classes, quite a lot—about the data before starting a successful analysis. And, if the wrong model class is assumed, this strategy tells one vanishingly little or nothing about how to find the correct one. This form of model inference is what we call pattern recognition: Does the assumed model class fairly represent the data and, in particular, the generator’s organization?

At some level of abstraction, one cannot escape these issues. Yet, the assumptions required by our framework are sufficiently permissive that we can attempt pattern discovery, in contrast with pattern recognition. Can we learn new structures, not assumed? Can we extract efficient representations of the data and their evolution? Do these yield optimal predictions?

Framed this way, it is not surprising that contemporary model inference is an active research area—one in which success is handsomely rewarded. Indeed, it is becoming increasingly important in our current era of massive data sets and cloud computing.

The following starts from and then extends computational mechanics1,2—a mathematical framework that lays out the foundations for pattern discovery. The core idea is to statistically describe a system’s evolution in terms of causally equivalent states—effective states, built only from given data, that lead to the same consequences in a system’s future. Its theorems state that causal states and their transition dynamic can be reconstructed from data, giving the minimal, optimally predictive model.3 Constructively, for discrete-event, discrete-time or continuous-time processes, computational mechanics delineates the minimal causal representations—their ϵ-machines.4,5 This leaves open, for example, ϵ-machines for the rather large and important class of continuous-value, continuous-time processes and also related spatiotemporal processes. The practical parallel to these remaining challenges is that currently, there is no single algorithm that reconstructs ϵ-machines from data in a reasonable time and without substantial discrete approximations.6–8 

Thus, one goal of the following is to extend computational mechanics beyond its current domains to continuous time and arbitrary data. This results in a new class of inference algorithms for kernel ϵ-machines that, notably, can be less resource-demanding than their predecessors. At the very least, the algorithms work with a set of weaker assumptions about the given data, extending computational mechanics’ structural representations to previously inaccessible applications with continuous data.

One of the primary motivations for using ϵ-machine representations in the first place is not just modeling. Once in hand, their mathematical properties allow direct and efficient estimation of a range of statistical, information-theoretic, and thermodynamic properties of a process. This latter benefit, however, is not the focus of the following; on the contrary, those goals are the target of sequels.

In the new perspective, kernel ϵ-machines are analogous to models associated with Langevin dynamics, which act on a set of state variables representing system configurations. The new continuous kernel ϵ-machines can also be written in the form of a stochastic differential equation (SDE) that acts instead on the predictively equivalent causal states. Similarly, a Fokker–Planck equation describing the evolution of a distribution over causal states can be defined. It is then used to infer the evolution from an initial probability distribution over the model’s internal states (be it a real distribution induced by uncertainty or a delta function), which is then used to make predictions in the original data space.

Section II recalls the minimal set of computational mechanics required for the full development. Section III B establishes that the space of causal states has a natural metric and a well-defined measure using reproducing-kernel Hilbert spaces. This broadly extends computational mechanics to many kinds of processes, from discrete-value and -time to continuous-value and -time and everywhere in between, including spatiotemporal and network dynamical systems. Section IV then presents the evolution equations and discusses how to infer from empirical data the evolution operator for system-state distributions. Section VI demonstrates how to apply the algorithm to discover optimal models from realizations of discrete-time, discrete-value infinite Markov-order processes generated by finite-state hidden Markov models of varying complexity and continuous-time, continuous-value processes generated by thermally driven deterministic chaotic flows.

We first describe the main objects of study in computational mechanics—stochastic processes. Then, we define the effective or causal states and their transition dynamics directly in terms of predicting a process’s realizations.

Though targeted to discover a stochastic process’s intrinsic structure, computational mechanics takes the task of predicting a process as its starting point. To describe how a process’s structure is discovered within a purely predictive framework, we introduce the following setup.

Consider a system of interest whose behaviors we observe over time t. We describe the set of behaviors as a stochastic process P. We require that the process is nonanticipatory: There exists a natural filtration—i.e., an ordered (discrete or continuous) set of indices t ∈ T associated to time—such that the observed behaviors of the system of interest only depend on past times.

P’s observed behaviors are described by random variables X, indexed by t and denoted by a capital letter. A particular realization Xt = x ∈ X is denoted via a lowercase letter. We assume that X is defined on the measurable space (X, BX, νX), where the domain X of X’s values could be a discrete alphabet or a continuous set. X is endowed with a reference measure νX over its Borel sets BX. In particular, P’s measure νX applies to time blocks: {Pr(X_{a<t≤b}) : a < b, a, b ∈ T}.

The process (Xt)_{t∈T} is the main object of study. We associate the present to a specific time t ∈ T that we map to 0 without loss of generality. We refer to a process’s past X_{t≤0} and its future X_{t>0}. In particular, we allow T = ℝ and infinite past or future sequences.

Beyond the typical setting in which an observation x is a simple scalar, here, realizations X_{t=0} = x can encode, for example, a vector x = [(mi)_{t≤0}]_{i=1,…,M} of M measured time series (mi)_{t≤0}, each from a different sensor mi, up to t = 0. Often, though, there are reasons to truncate such observation templates to a fixed-length past. Take, for example, the case of exponentially decaying membrane potentials where x measures a neuron’s electric activity9 or a lattice of spins with decaying spatial correlations. After several decay epochs, the templates can often be profitably truncated.

To emphasize, this contrasts with related works—e.g., Ref. 10—in that we need not consider Xt to be only the presently observed value nor, for that matter, Yt=Xt+1 to be its successor in a time series. In what follows, Xt can be the full set of observations up to time t. This leads to substantial conceptual broadening, the consequences of which are detailed in Refs. 1–3.

As just argued, we consider that random variable Xt contains information available from observing a system up to time t. At this stage, we just have given data, and we have made no assumptions about its utility for prediction. Consider another random variable Y that describes system observations we wish to predict from X. A common example would be future sequences (possibly truncated, as just noted) that occur at times t>0. We also assume that Y is defined on a measurable space (Y,BY,νY).

A prediction then is the distribution of outcomes Y given observed system configuration X=x at time t, denoted Pr(Y|Xt=x). The same definition extends to nontemporal predictive settings by changing t to the relevant quantity over which observations are collected, e.g., indices for pixels in an image.

A common restriction on processes is that they are stationary—the same realization xt at different times t occurs with the same probability.

Definition 1 (Stationarity)

A process (Xt)_{t∈T} is stationary if, at all times t ≠ t′, the same distribution of its values is observed: Pr(Xt = x) ≡ Pr(X_{t′} = x) for all x ∈ X and except possibly on a null-measure set.

The following, in contrast with this common assumption, does not require (Xt)_{t∈T} to be stationary. In point of fact, such an assumption rather begs the question. Part of discovering a process’s structure requires determining from data if such a property holds. However, we assume the following from realization to realization.

Definition 2 (Conditional stationarity)

A process (Xt)_{t∈T} is conditionally stationary if, at all times t ≠ t′, the same conditional distribution of next values Y is observed: Pr(Y|Xt = x) ≡ Pr(Y|X_{t′} = x) for all x ∈ X and except possibly on a null-measure set.

That is, conditional stationarity allows marginals Pr(Xt) to depend on time. However, at any given time, for each realization of the process where the same x is observed, then the same conditional distribution is also observed. In a spatially extended context, the hypothesis could also encompass realizations at different locations for which the same X=x is observed.

When multiple realizations of the same process are not available, we will instead consider a limited form of ergodicity: except for measure zero sets, any one realization, measured over a long enough period, reveals the process’s full statistics. In that case, observations of the same X=x can be collected at multiple times to build a unique distribution Pr(Y|X=x), now invariant by time-shifting. It may be that the system undergoes structural changes: for example, a volcanic chamber slowly evolving over the course of a few months or weather patterns evolving as the general long-term climate changes. Then, we will only assume that the Pr(Y|X=x) distribution is stable over a long enough time window to allow its meaningful inference from data. It may slowly evolve over longer time periods.

If multiple realizations of the same process are available, this hypothesis may not be needed. Then, only Definition 2’s conditional stationarity is required for building the system states according to the method we now introduce.

Finally, we come to the workhorse of computational mechanics that forms the basis on which a process’s structure is identified.

Definition 3 (Predictive equivalence)

Realizations x and x′ that lead to the same predictions (outcome distributions) are then gathered into classes ϵ(·) defined by the predictive equivalence relation ∼ϵ,

ϵ(x) = {x′ ∈ X : Pr(Y|X=x) ≡ Pr(Y|X=x′)}.
(1)

In other words, observing two realizations x and x′ means that the process is in the same effective state—the same equivalence class—if they lead to the same predictions,

x ∼ϵ x′ ⇔ Pr(Y|X=x) ≡ Pr(Y|X=x′).
(2)

Speaking in terms of pure temporal processes, two observed pasts x and x′ that belong to the same predictive class ϵ(·) are operationally indistinguishable. Indeed, by definition of the conditional distribution over futures Y, no new observation can help discriminate whether the causal state arose from past x or x′. For all practical purposes, these pasts are equivalent, having brought the process to the same condition regarding future behaviors.

Definition 4 (Causal states1)

Since the classes {ϵ(x) : x ∈ X} induce the same consequences—in particular, the same behavior distribution Pr(Y|X=x) = Pr(Y|ϵ(x))—they capture a process’s internal causal dependencies. They are a process’s causal states σ ∈ S.

Predictive equivalence can then be summarized as the same causes lead to the same consequences. Thanks to it, grouping a process’s behaviors X under the equivalence relation ϵ also gives the minimal partition required for optimal predictions: further refining the classes of pasts is useless, as just argued, while, at the same time, each class is associated with a unique distribution of possible outcomes.

This setup encompasses both deterministic observations, where Y=f(X) is fixed and Pr(Y|X=x) becomes a Dirac distribution with support f(x), as well as stochastic observations. The source of stochasticity need not be specified: fundamental laws of nature, lack of infinite precision in measurements of a chaotic system, and so on.

Beyond identifying a process’s internal structure, predictive equivalence is important for practical purposes: it ensures that the partition induced by ϵ on X is stable through time. Hence, data observed at different times ti may be exploited to estimate the causal states. In practice, if the long-term dynamic changes, we assume that predictive equivalence holds over a short time window.

While causal states are rigorously defined as above, in empirical circumstances, one does not know all of a system’s realizations and so one cannot extract its causal states. Practically, we assume that the given data consist of a set of N observations (xi, yi, ti)_{i=1,…,N}. These data encode, for each configuration xi at time ti, what subsequent configuration yi was then observed. The goal is to recover from such data an approximate set of causal states that model the system evolution; see, e.g., Refs. 11 and 12.

For now, assume that the data x ∈ X are a past—a sequence x = (v_{−LX<t≤0}) of discrete past values vt ∈ V at discrete times t ≤ 0. This discrete setting helps introduce several important concepts, anticipating generalizations that follow shortly. Indices t are now discrete observation times, ranging from −LX in the past up to the present at t = 0. We allow LX = ∞ when calculating causal states analytically on known systems. Similarly, we take y ∈ Y to be a future—a sequence (v_{0<t≤LY}) of discrete future values v ∈ V that may also be truncated at time LY in the future.

For discrete-value processes, V becomes an alphabet of symbols v, and X and Y both become semi-infinite (or truncated) sequences over that alphabet. The transition from time t0 to time t1 is associated with the observation of a new symbol v ∈ V.

It should be noted that a surrogate space V can always be obtained from continuous-value data if regularly sampled at times ti for t_{i+1} = ti + Δt with a fixed Δt. In this case, pairs (xi, x_{i+1}) are equivalent to observing transition symbols v. There are at most |V| such possible transitions from the current causal state σ0 = ϵ(x0) to which x0 = (v_{t≤0}) belongs.

Conditional symbol-emission probabilities Pr(v ∈ V | σ) are defined (and can be empirically estimated) for each causal state σ and for each emitted symbol v ∈ V.3 Since σ ∈ S encodes all past influence, by construction, state-to-state transitions do not depend on previous states. That is, causal states are Markovian in this sense. The dynamic over causal states is then specified by a set of symbol-labeled causal-state transition matrices {T^{(v)}_{σ,σ′} : σ, σ′ ∈ S, v ∈ V}.

Diagrammatically, the causal states σ form the nodes of a directed graph whose edges are labeled by symbols v ∈ V with associated transition probabilities Pr(v|σ). Moreover, the transitions are unifilar: the current state σ0 and next symbol v0 uniquely determine the next state σ1 = f(σ0, v0).

Taken altogether, the set of causal states and their transition dynamic define the ϵ-machine.1 Graphically, they specify the state-transition diagram of an edge-emitting, unifilar hidden Markov model (HMM). These differ in several crucial ways from perhaps-more-familiar state-emitting HMMs.13 For example, symbol emission in the latter depends on the state and is independent of state-to-state transition probabilities; see Fig. 1.
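To ground the notation, here is a minimal sketch of symbol-labeled transition matrices for a small unifilar, edge-emitting HMM; the two-state Even Process is used purely as an illustration and is not an example taken from the text.

```python
import numpy as np

# Symbol-labeled transition matrices T^(v)[sigma, sigma'] for a two-state,
# unifilar, edge-emitting HMM (the Even Process; illustrative only).
# States: 0 = E, 1 = O; alphabet V = {'0', '1'}.
T = {
    "0": np.array([[0.5, 0.0],    # from E, emit '0' with prob 0.5 and stay in E
                   [0.0, 0.0]]),  # '0' is never emitted from O
    "1": np.array([[0.0, 0.5],    # from E, emit '1' with prob 0.5 and go to O
                   [1.0, 0.0]]),  # from O, emit '1' with prob 1 and return to E
}

# Unifilarity check: for each state and each emitted symbol, at most one
# successor state has nonzero probability, so (state, symbol) uniquely
# determines the next state.
for v, Tv in T.items():
    assert all((row > 0).sum() <= 1 for row in Tv)

# Rows of sum_v T^(v) are total outgoing probabilities and must sum to 1.
assert np.allclose(sum(T.values()).sum(axis=1), 1.0)
```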

FIG. 1. (left) State-emitting hidden Markov models: State-transition probabilities Pr(si|sj) are specified independently from the symbol-emission probabilities q(a|si) and q(b|sj). (right) ε-Machines: Symbols are emitted on transitions, and the (causal) states capture dependencies. Unfortunately, for state-emitting HMMs, the number of hidden states is a poor proxy for structural complexity and is often a meta-parameter with low interpretability. Since the ϵ-machine is unique, therefore, it directly represents a stochastic process’s intrinsic properties, such as generated randomness (Shannon entropy rate) and structural complexity (memory).

The principal reason for using an ϵ-machine is that it is a process’s optimally predictive model.3 Specifically, an ϵ-machine is the minimal, unifilar, and unique edge-emitting HMM that generates the process.

Notably, ϵ-machines’ Markov property is derived from the predictive equivalence relation, thanks to the latter’s conditioning on pasts. More generally, the causal states are an intrinsic property of a process. They do not reflect a choice of representation, in contrast to how state-emitting HMMs are often deployed; again, see Fig. 1. The same holds for state transitions and symbol emissions, all of which are uniquely inferred from the process.

Given that they make markedly fewer and less restrictive assumptions, it is not surprising that reconstruction algorithms for estimating ϵ-machines from data are more demanding and costly to implement7,8,14,15 than performing a standard expectation–maximization estimate for a hypothesized HMM.16 Most ϵ-machine algorithms rely on a combination of clustering empirically estimated sequence probability distributions Pr(Y|X) together with splitting the candidate causal states when required for consistency (unifilarity) of unique emissions Pr(v ∈ V | σ ∈ S).6,17 To date, though, Bayesian inference for ϵ-machines provides the most reliable estimation and works well on small data sets.11

We mentioned that ϵ-machines are unifilar edge-emitting HMMs. Smaller nonunifilar HMMs can exist that are not predictive, but rather generate data with the same statistical properties as the given process. However, one cost in using these smaller HMMs is the loss of determinacy for which state a process is in based on observations. The practical consequence is that the processes generated by nonunifilar HMMs typically have an uncountable infinity of causal states. This forces one to use probabilistic mixed states, which are distributions over the states of the generating HMM. References 18–20 develop the theory of mixed states for nonunifilar, generative HMMs. For simplicity, the following focuses on processes with finite “predictive” models—the ϵ-machines. That said, Sec. VI B below analyzes an inference experiment using a process with an uncountable infinity of mixed states to probe the algorithm’s performance.

An ϵ-machine’s hidden states and transitions have a definite meaning and allow for proper definitions of process structure and organization—indicators of a process’s complexity.2 For example, we can calculate in closed-form various information-theoretic measures to quantify the information conveyed from the past to the future or that stored in the present.21 In this way, ϵ-machines give a very precise picture of a process’s information processing abilities.22,23

More specifically, each causal state’s surprisal can be used to build powerful data filters.8,24–27 The entropy of the causal states—the cost of coding them, the statistical complexity—can be used in entropy–complexity surveys of whole process families to discriminate purely random from chaotic data.2 Recent advances in signal processing28 show how processes with arbitrarily complex causal structures can still exhibit a flat power spectrum since the spectrum is the Fourier transform of only the two-point autocorrelation function. This demonstrates the benefit of inferring process structures using the full setup presented above—consider Xt encompassing all information up to time t and not restricting Xt to a present observation.23,28

However, these benefits have been circumscribed. Many previous ϵ-machine inference methods work with symbolic (discrete-value) data in discrete time. In practice, though, we often monitor continuous physical processes at arbitrary sampling rates, and these measurements can take a continuum of data values within a range V. In these cases, estimation algorithms rely on clustering causal states by imposing arbitrary boundaries between discretized groups, even though there may be a fractal set or continuum of causal states.29 More recent approaches consider continuous time but keep discrete events, such as renewal and semi-Markov processes.5,30,31

The method introduced below is, to our knowledge, the first that is able to estimate causal states for essentially arbitrary data types and to represent their dynamic in continuous time. This approach offers alternative algorithms that rest on a radically different set of assumptions and have a different algorithmic complexity from previous approaches. While it is also applicable to the discrete case (see Sec. VI A), it vastly expands computational mechanics’ applicability to process classes that were previously inaccessible.

The predictive-equivalence relation implicates conditional distributions in expressing a process’s structure. To work directly with conditional distributions, this section recalls the main results concerning the geometric view of probability distributions as points in a reproducing-kernel Hilbert space. Following the method from Refs. 32 and 33, we describe both unconditional and conditional distributions. Both are needed in computational mechanics. Once they are established in the RKHS setting, we then describe the geometry of causal states.

Consider a function kX of two arguments, defined on X × X → ℝ (resp., ℂ). Fixing one argument to x ∈ X and leaving the second free, we consider kX(x, ·) as a function in the Hilbert space HX of functions from X to ℝ (resp., ℂ). If kX is a positive symmetric (resp., sesquilinear) definite function,34 then the reproducing property holds: For any function f ∈ HX and for all x ∈ X, we have ⟨f, kX(x, ·)⟩_HX = f(x). Here, ⟨·, ·⟩_HX is the inner product in HX or a completion of HX; see Ref. 34. HX is known as the reproducing-kernel Hilbert space (RKHS) associated with kernel kX.

Kernel functions kX are easy to construct and so have been defined for a wide variety of data types, including vectors, strings, graphs, and so on. A product of kernels corresponds to the direct product of the associated Hilbert spaces.34 Thus, products maintain the reproducing property. Due to this, it is possible to compose kernels when working with heterogeneous data types.

Kernels are widely used in machine learning. A common use is to convert a given linear algorithm, such as estimating support vector machines, into nonlinear algorithms.35 Indeed, when a linear algorithm can be written entirely in terms of inner products, scalings, and sums of observations x, then it is easy to replace the inner products in the original space X by inner products in the Hilbert space HX. The kX(x,) functions are then called feature maps and HX the feature space.

Returning to the original space X, the algorithm now works with kernel evaluations kX(x1, x2) whenever an inner product ⟨kX(x1, ·), kX(x2, ·)⟩_HX is encountered. In this way, the linear algorithm in HX has been converted to a nonlinear algorithm in the original space X. Speaking simply, what was nonlinear in the original space is linearized in the associated Hilbert space. This powerful “kernel trick” is at the root of the probability distribution mapping that we now recall from Ref. 32.

1. Unconditional distributions

Consider a probability distribution Pr(X) of the random variable X. Then, the average map α ∈ HX of the kernel evaluations in the RKHS is given by α = E_X[kX(x, ·)]. Note that for any f ∈ HX,

⟨α, f⟩ = ⟨E_X[kX(x, ·)], f⟩ = E_X[⟨kX(x, ·), f⟩] = E_X[f(x)].

An estimator for this average map can be computed simply as α̂ = (1/N) Σ_{i=1}^{N} kX(xi, ·). This estimator is consistent32 and so converges to α in the limit N → ∞.

Consider now the two-sample test problem: We are given two sets of samples taken from a priori distinct random variables A and B, both valued in X but with possibly different distributions Pr(A)=PA and Pr(B)=PB. Do these distributions match: PA=PB? This scenario is a classical statistics problem and many tests were designed to address it, including the chi-square and the Kolmogorov–Smirnov tests.

Using the RKHS setup and the average map, a new test36 is simply to compute the distance between the average maps using the Hilbert space norm: ‖αA − αB‖_HX. Under suitable mild conditions on the kernel, it can be shown36 that ‖αA − αB‖_HX = 0 if, and only if, PA = PB up to a set of points with null measure in X.

This maximum mean discrepancy (MMD) test is consistent and accurate. Moreover, confidence levels can be obtained through bootstrapping or other techniques described in Ref. 36. In practice, for samples {ai}_{i=1,…,N} and {bj}_{j=1,…,M}, ‖α̂A − α̂B‖_HX can be computed via

‖α̂A − α̂B‖²_HX = ⟨α̂A − α̂B, α̂A − α̂B⟩ = ⟨α̂A, α̂A⟩ + ⟨α̂B, α̂B⟩ − 2⟨α̂A, α̂B⟩,

where inner products between α̂A = (1/N) Σ_{i=1}^{N} kX(ai, ·) and α̂B = (1/M) Σ_{j=1}^{M} kX(bj, ·) can easily be developed into sums of kernel evaluations kX(ai, aj), kX(bi, bj), and kX(ai, bj).

This test makes it possible to compare two distributions without density estimation and directly from data. For all practical purposes, under mild technical conditions,37 a distribution Pr(X) of random variable X can then be represented as a point in the RKHS HX, consistently estimated by the mean mapping α^. The RKHS norm then becomes a true distance between probability distributions.
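To make this concrete, here is a minimal numerical sketch of the empirical MMD statistic described above—the squared RKHS distance between the two mean maps, expanded into kernel sums. The Gaussian kernel and its bandwidth are illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def gaussian_kernel(a, b, xi=1.0):
    """Kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / xi^2)."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / xi**2)

def mmd2(a, b, xi=1.0):
    """Squared MMD: ||alpha_A - alpha_B||^2 expanded into averages of kernel evaluations."""
    kaa = gaussian_kernel(a, a, xi).mean()
    kbb = gaussian_kernel(b, b, xi).mean()
    kab = gaussian_kernel(a, b, xi).mean()
    return kaa + kbb - 2.0 * kab

# Example: two sample sets drawn from (possibly) different distributions.
rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(500, 1))
B = rng.normal(0.5, 1.0, size=(400, 1))
print(mmd2(A, B))  # close to 0 only if the two distributions match
```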

2. Conditional distributions

Consider random variables X and Y with the same notations as above. The joint variable (X, Y) leads to a direct product Hilbert space HX ⊗ HY,34 with the product kernel,

k_{X,Y}((x, y), ·) = kX(x, ·) kY(y, ·).

For functions f ∈ HX and g ∈ HY, a covariance operator C_YX : HX → HY can be defined such that

⟨g, C_YX f⟩_HY = E[f(X) g(Y)].

Similarly, C_XX can be defined in the case Y = X.

Then, under strong conditions on the kernel,33,38 we can relate the conditional mean map in HY to the conditioning point in HX using

E_Y[kY(y, ·) | X = x] = C_YX C_XX⁻¹ kX(x, ·).

The strong conditions can be relaxed to allow the use of a wide range of kernels by considering a regularized version,

E_{Y,ε}[kY(y, ·) | X = x] = C_YX (C_XX + εI)⁻¹ kX(x, ·).

This is a consistent estimator of the unregularized version when computed empirically from samples.39

As for the unconditional case,

s_{Y|X=x} = E_Y[kY(y, ·) | X = x]

can be seen as uniquely representing the distribution Pr(Y|X=x), up to a null-measure set.

The set S = {s_{Y|X=x}}_{x∈X} ⊂ HY traces out all possible conditional distributions Pr(Y|X=x) for all x ∈ X. It inherits the RKHS norm from HY: ‖s1 − s2‖_HY is well-defined for any s1, s2 ∈ S. Note that s_{Y|X=x} = ς(x) can also be interpreted as an injective function ς : X → S, selecting points in the RKHS HY for each x.

The connection with the regression problem for estimating ŝ_{Y|X=x} from data,38,40 together with the representer theorem,41 ensures that ŝ_{Y|X=x} lies in the data span,

ŝ_{Y|X=x} = Σ_{i=1}^{N} ωi(x) kY(yi, ·),

with N being the number of samples. For all practical purposes, then, dimension N is sufficient when working with S ⊂ HY. This is in contrast to working with the infinite dimension of the underlying HY. The unconditional case shown in Sec. III A 1 can be understood when ωi(x) = 1/N, in which case, no dependency on x remains.

A regularized and consistent estimator33,38,39 is given by

ω(x) = (GX + εI)⁻¹ K(x),

with G^X_{ij} = kX(xi, xj) being the Gram matrix of the X samples and K(x) a column vector such that Ki(x) = kX(x, xi). Thanks to Ref. 40, ε can be set by cross-validation.

In practice, it is more efficient to equivalently solve the linear system,

(GX + εI) ω(x) = K(x),
(3)

to find the vector ω(x). Note that, for an invertible matrix and without regularization, ω(x) is simply the vector with the only nonnull entry ωi = 1 for x = xi. In practice, there may be duplicate entries (e.g., for discrete data) or nearby values for which the regularization becomes necessary.
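As an illustration of Eq. (3), the following sketch computes the conditional-embedding weights ω(x) by solving the regularized linear system with a Cholesky factorization; the Gaussian kernel, bandwidth, and regularization value are assumptions made only for this example.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def conditional_weights(X_train, x_query, xi=1.0, eps=1e-3):
    """Solve (G_X + eps*I) w = K(x_query) for the conditional-embedding weights."""
    d2 = np.sum((X_train[:, None, :] - X_train[None, :, :]) ** 2, axis=-1)
    G = np.exp(-d2 / xi**2)                      # Gram matrix of the past observations
    K = np.exp(-np.sum((X_train - x_query) ** 2, axis=-1) / xi**2)
    factor = cho_factor(G + eps * np.eye(len(X_train)))
    return cho_solve(factor, K)                  # weights over the k_Y(y_i, .) functions

# The conditional embedding of Pr(Y | X = x_query) is then
#   s_hat = sum_i w_i * k_Y(y_i, .),
# represented in practice by the returned weight vector over the training samples.
```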

In this way, employing a suitable kernel and regularization, a conditional probability distribution Pr(Y|X=x) is represented in the RKHS HY as a point,

ŝ_{Y|X=x} = Σ_{i=1}^{N} ωi(x) kY(yi, ·),

with a reasonably easy way to estimate the coefficients ω(x) from data.

In this light, the full Ω matrix obtained via

(GX+εI)Ω=GX

can also be seen as a way to “spread” the influence of the (xi,yi) observations to nearby xi values so that all duplicate xi effectively belong to the same estimated conditional distribution.

It is also possible to convert a conditional embedding back into a true density over X using RKHS preimage techniques;42 the most advanced one to date is Ref. 43.

Section II D defined a process’s causal states σ ∈ S via the predictive equivalence relation,

ϵ(x) = {w ∈ X : Pr(Y|X=w) = Pr(Y|X=x)}.

This is exactly the preimage E = ς⁻¹(s_{Y|x}) ⊂ X of the unique point s_{Y|E} ∈ HY, with s_{Y|E} = s_{Y|w} for all w ∈ E = ϵ(x). Therefore, we can refer equivalently to one or the other concept: we refer to the set in X by the equivalence class and the point in the RKHS as s ∈ HY.

In short, the set S = {s_{Y|X=x}}_{x∈X} ⊂ HY is the set of causal states. Note that S is a subset of all possible conditional probability-distribution mappings in HY. It consists of only the mappings that actually correspond to some x in the domain of interest. We now drop referring to s_{Y|X=x} and refer only to causal states σ ∈ S with random variable S.

Let B be the Borel sets over S. For any set β ∈ B, its preimage in X is defined by

ς⁻¹(β) = ∪_{σ∈β} {x ∈ ς⁻¹(σ)}.

Recall that, by definition, the causal states form a partition of X. Hence, each x belongs to a unique preimage ς⁻¹(σ) when the union is taken over σ ∈ β in the preceding definition.

Recall that νX is the reference measure on X, in terms of which the probability distribution Pr(X) is defined. A natural, push-forward measure μ is defined on (S, B) by

μ(β ∈ B) = ν(ς⁻¹(β)).

If Pr(X) admits a probability density function p = dPr(X)/dνX—the Radon–Nikodym derivative of Pr(X) with respect to νX—then we can similarly push-forward the density on X to define a density of causal states,

q(σ) = ∫_{x∈ς⁻¹(σ)} p(x) dνX.

The distribution Q of states over S is then defined by

Q(β ∈ B) = ∫_{σ∈β} q(σ) dμ_σ = ∫_{σ∈β} ( ∫_{x∈ς⁻¹(σ)} p(x) dνX ) dμ_σ.

Note that the measure μ is defined on (and restricted to) S and that no equivalent of the Lebesgue measure exists for the whole infinite-dimensional space HY.

The net consequence is that causal states are to be viewed simultaneously as

  • Sets of points in X, using the predictive equivalence relation ϵ, with equivalence classes forming a partition of X. Though this is the original definition, presented in Sec. II D, it is still very useful in the RKHS setup to define the push-forward measure.

  • Conditional probability distributions of possible outcomes Y given observed system configurations X. This view is used in clustering algorithms7,15 to identify causal states. These algorithms directly map to the RKHS setting by using the MMD test instead of other statistical tests for clustering conditional distributions.

  • Points in the RKHS HY: More specifically, points in the subset S = {σ_{Y|X=x}}_{x∈X}. The subset S (and only S, not the rest of HY) is endowed with a push-forward measure μ. Thus, we can properly define probability distributions and densities of causal states in this Hilbert space setting.

Compared to previous works, this third alternate view offers several advantages:

  • (Nearly) arbitrary data types can be handled through the use of reproducing kernels. We are no longer limited to discrete alphabets V of symbolic values. Adequate kernels exist for many data types and can even be composed to work directly with heterogeneous data types.

  • The distance ‖·‖_HY can serve as the basis for clustering similar states together. Thanks to the MMD test introduced in Sec. III A 1, algorithms need not rely on estimating the conditional densities Pr(Y|X=x) for clustering states.

The preceding constructed causal states σ ∈ S statically. Computational mechanics aims at modeling a process’s behaviors, though, not just describing its constituents. A process’s dynamic induces trajectories over S. Sections IV A–IV C describe them and then define a new class of predictive models—the kernel ϵ-machines—and explain how to use them for computing statistics of the given data, how to simulate new “data,” and how to predict a process.

Consider a series of causal states σ_{t−1} → σ_t → σ_{t+1}. Distinct pasts x ∈ σ0 and w ∈ σ0 both contain the same information concerning the process dynamics since they induce the same distribution of futures Pr(Y|σ0). However, σ0 does not contain the information about which past is responsible for arriving at it: whether x or w or any other past led to the same state σ0.

So, what kind of causality is entailed by causal states? Causal states capture the same global information needed to describe how the future is impacted by the present, and they consist of all possible ways that information was organized in the past. Unifilarity can now be understood as a change of state δσ uniquely determined by the information gain. In contrast, the equivalent of nonunifilarity would be that despite having all useful knowledge about the past (previous state) and having all possible information about the change of system configuration (either δx or δw, and so on), the model states are still not defined unequivocally. This is not the case with the definition of the causal states, but would be with other generative models, for which only a distribution of states can be inferred from data.29 

The continuity of causal-state trajectories can be understood from an information-theoretic perspective. The relative information d_KL(x_{t+dt} ‖ x_t) (Kullback–Leibler divergence) between x_t and its evolution after an infinitesimal time t + dt is simply the change of information we have on x. Assuming that information comes only at finite velocity, then d_KL(x_{t+dt} ‖ x_t) → 0 as dt → 0. However, it is known that d_KL(x_{t+dt} ‖ x_t) → 0 implies ‖x_{t+dt} − x_t‖_HX → 0.37 Using Sec. III A 2’s construction of conditional distributions, we find that ‖σ_{t+dt} − σ_t‖_HY → 0 as dt → 0 with continuous kernels and positive definite operators. Hence, causal-state trajectories are continuous.

In practice, assumptions leading to continuous trajectories need not be respected:

  • For practical reasons, it is rarely possible to work with infinite series. Truncating the past and the future can be necessary. In these cases, there is information not fully contained in x̂—the truncated estimate of x. Truncation can be justified when the causal influence of past system configurations onto the present decays sufficiently fast with time; then, we ignore old configurations with negligible impact (and similarly for the future). This amounts to saying that there are no long-range correlations and that no useful information for prediction is lost when ignoring these past configurations; hence, Pr(Y|X) is unchanged by the truncation. This hypothesis may fail, of course, with the consequence that truncating the past or future may introduce jumps in the estimated trajectories of the causal states in S—a jump induced by the sudden loss or gain of information.

  • When using continuous kernel functions, the associated RKHS norm is also continuous in its arguments: ‖σ_{Y|X=x1} − σ_{Y|X=x2}‖_HY → 0 as x1 → x2. Hence, continuity in data trajectories also implies continuity in causal-state trajectories. However, when data are truly discrete in nature, the situation of Sec. II E is recovered. An alternate view is that, at a fundamental level, information comes in small packets: these are the symbols induced by the data changes in this discrete scenario.

  • It may also be that measurements are performed at a finite time scale τ. Then, the information gained between two consecutive measurements can be arbitrarily large but still appear instantaneously at the scale at which data are measured. This leads to apparent discontinuities.

Here, we discuss only the elements needed to address causal states in continuous time with nearly arbitrary data, i.e., for which a reproducing kernel exists. A full treatment of discontinuities is beyond the present scope and best left for the future.

From here on, assume that trajectories of causal states σ ∈ S are continuous. Recall, though, that S ⊂ HY—a metric space. This guarantees the existence of a reference Wiener measure ω on the space of trajectories defined over a time interval [0, t < T] (possibly T = ∞) with a canonical Wiener process W. With σt being a state-continuous Markov process, we posit that its actual trajectories evolve under an Itô diffusion—a stochastic differential equation (SDE) of the form

dσt = a(σt) dt + b(σt) dWt,
(4)

where a(σt) is a (deterministic) drift and b(σt) a diffusion. Both depend on the current state σt; hence, this Itô diffusion is inhomogeneous in state space. However, the coefficients are stable in time, a(σt, t) = a(σt) and b(σt, t) = b(σt), if the (possibly limited) ergodicity assumption is made in addition to the conditional stationarity assumption of Sec. II D. Therefore, the diffusion is homogeneous in time. This SDE is the equivalent of the ϵ-machine’s causal-state-to-state transition probabilities introduced in the discrete case (Sec. II E). The evolution equation encodes, for each causal state σt, how the process evolves to nearby states at t + dt.

The causal-state behavior arises from integrating over time,

σt = σ0 + ∫_0^t a(στ) dτ + ∫_0^t b(στ) dWτ,

for any state σ0 being the trajectory’s initial condition. As for the evolution of the causal-state distribution, consider a unit probability mass initially concentrated on δ(σ0). Then, the states σt that are reached at time t define a probability distribution Pr(β ∈ B | σ0, t) on S, with associated density p(σt | σ0). (Recall that the latter is defined as p = dPr(·)/dμ with μ being the push-forward measure from X; see Sec. III B.) This distribution encodes the state-to-state transition probabilities at any time t, parallel to iterating an ϵ-machine in the discrete case.

The evolution of probability densities p other than δ(σ0) is governed by a Fokker–Planck equation. The infinitesimal generator Γ of the process σt is

Γf(σ0) = lim_{t→0} [ E[f(σt) | σ0, t] − f(σ0) ] / t,

where Γ is an operator applied on a suitable class of functions f.44 For a state distribution Q, with associated density q = dQ/dμ, the Fokker–Planck equation corresponding to the Itô diffusion can be written in terms of the generator’s adjoint Γ†,

∂q(σ, t)/∂t = Γ† q(σ, t).
(5)

Restricting to the data span in S (see Sec. III A 2), using the representer theorem, and the samples as a pseudo-basis,41 the operator Γ can be represented using a vector equation in a construction similar to classical Euclidean RN state spaces; see Ref. 45, p. 103. There are, however, practical difficulties associated with estimating coefficients a and b (using tangent spaces) at each observed state σt in RKHS.

As an alternative, we represent the generator in a suitable functional basis in which q is also represented as a set of coefficients. Then, the state distribution evolves directly under

q(σ, t) = e^{(t−t0) Γ†} q(σ, t0).
(6)

The evolution operator E(t) = e^{(t−t0) Γ†} can also be inferred from data for standard SDEs over ℝ^N, as detailed in Ref. 46. Section IV C summarizes the method and adapts it to the RKHS setting.

Every Itô diffusion also admits a limit distribution L(σ ∈ S) as t → ∞, with associated density ℓ = dL/dμ. By definition of the limit distribution, ℓ is the eigenvector of E associated with eigenvalue 1: E(t)[ℓ] = ℓ.

The limit distribution can also be useful to compute how “complex” or “uncommon” a state is, using its self-information h(σ) = −log ℓ(σ). This is the equivalent of how the statistical complexity is defined in the discrete case.3 Also, it can be used for similar purposes (e.g., building filters8,24,26,27). That said, this differential entropy should be interpreted with caution.

The representer theorem41 allows us to develop causal-state estimates σ^ in the span of the data kernel functions kY(yi,), used as a pseudo-basis; see Sec. III A 1. Since the span dimension grows with the number of samples, estimation algorithms are impractical. However, the causal states are an intrinsic property of the system, independent of how they are coordinatized. Therefore, when working with data acquired from physical processes, S will appear as the dominant structure. The question then becomes how to work with it.

The following assumes S is of small finite dimension M compared to the number of observations N.47 Residing on top of the dominant structure, the statistical inaccuracies in the MMD test appear as much smaller contributions. The following, thus, introduces the tools needed to work directly on S, whether for visualizing the causal states using reduced coordinates or for representing evolution operators on S instead of using Eq. (4) on HY.

To do this, we exploit the methodology introduced for diffusion maps.48 These maps are especially relevant when data lie on a curved manifold, where the embedding’s high-dimensional distance between two data points does not reflect the manifold’s geometry. In contrast, diffusion maps, and their variable-bandwidth cousin,49 easily recover the Laplace–Beltrami operator for functions defined on the manifold. Assuming S is a smooth Riemannian manifold, then the diffusion coordinates cleanly recover S’s intrinsic geometry (a static property) independent of the observed local sampling density (a dynamic property, linked to trajectories as in Sec. IV).48 

The original diffusion map method artificially builds a proximity measure for data points using a localizing kernel, i.e., the one that takes nonnegligible values only for near neighbors. Path lengths are computed from these neighborhood relations. Two points are deemed “close” if there is a large number of short paths connecting them and “far” if there are only a few paths, or paths with long distances, connecting them. See Ref. 48 for details, which also shows that the diffusion distance is a genuine metric.

That said, in the present context, there is already a notion of proximity between causal states σ ∈ S. Indeed, reusing notation from Sec. III A 2, the state estimates in the RKHS are of the form

σ̂_{Y|X=x} = Σ_{α=1}^{N} ωα(x) kY(yα, ·).

Therefore, a Gram matrix G^S of their inner products can be defined easily,

G^S_{ij} = ⟨σ̂_{Y|X=xi}, σ̂_{Y|X=xj}⟩,
G^S_{ij} = ⟨ Σ_{α=1}^{N} ωα(xi) kY(yα, ·), Σ_{β=1}^{N} ωβ(xj) kY(yβ, ·) ⟩,
G^S_{ij} = Σ_{α=1}^{N} Σ_{β=1}^{N} ωα(xi) ωβ(xj) kY(yα, yβ),
(7)

where kY(yα, yβ) = G^Y_{αβ} is the Gram matrix of the Y observations. The determination of the ω coefficients for each x relies on the G^X Gram matrix, as detailed in Sec. III A 2.

It turns out that diffusion maps can be built from kernels with exponential decay.50 The original fixed-bandwidth diffusion maps48 also use exponential kernels for building the proximity relation. Such kernels are reproducing and they are also characteristic, fulfilling the assumptions needed for representing conditional distributions.39 Hence, when using the exponential kernel, the RKHS Gram matrix is also exactly a similarity matrix that can be used for building a diffusion map. (This was already made explicit in Ref. 51.) Moreover, Eq. (7) explicitly represents GijS as a weighted sum of exponentially decaying kernels and, hence, is itself exponentially decaying. Thus, GS can be directly used as a similarity matrix to reconstruct S’s geometry via diffusion maps.
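In matrix form, Eq. (7) for all sample pairs at once reads G^S = Ω^T G^Y Ω, with the columns of Ω solving (GX + εI) Ω = GX as above. A minimal sketch, assuming Gaussian kernels on both X and Y with illustrative bandwidths and regularization:

```python
import numpy as np

def gram(Z, xi):
    """Gaussian Gram matrix exp(-||z_i - z_j||^2 / xi^2)."""
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / xi**2)

def causal_state_gram(X, Y, xi_x=1.0, xi_y=1.0, eps=1e-3):
    """Gram matrix of causal-state estimates, Eq. (7): G^S = Omega^T G^Y Omega."""
    N = len(X)
    GX, GY = gram(X, xi_x), gram(Y, xi_y)
    # Columns of Omega are the conditional-embedding weights omega(x_i) of Eq. (3).
    Omega = np.linalg.solve(GX + eps * np.eye(N), GX)
    return Omega.T @ GY @ Omega
```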

Notice, though, that here, data are already scaled by the reproducing kernel. Therefore, for example, using a Gaussian kernel,

kX(x1, x2) = exp( −‖x1 − x2‖²_X / ξ² ),

ξ specifies the scale at which differences in the X norm are relevant, and this is also the case for kY and Y. Since that scale ξ can be set by cross-validation,40 we exploit this fact shortly in Sec. VI C’s experiments when an objective measure is provided, e.g., prediction accuracy.

In practice, a large range of ξ can produce good results, and an automated method for selecting ξ has been proposed.52 Varying the analyzing scale to coarse-grain the manifold S is also possible. Using the method from Ref. 53, this is similar, in a way, to wavelet analysis.

Once the data scale is properly set and the similarity matrix built, the diffusion map algorithm can be parameterized to cleanly recover S’s Riemannian geometry, doing so independently from how densely sampled S is. This is explained in Fig. 4 of Ref. 48, and it is exactly what is needed here: separate S’s static causal-state geometric structure from the dynamical properties (trajectories) that induce the density on S. This is achieved by normalizing the similarity matrix G^S to remove the influence of the sampling density and then applying a spectral decomposition to the nonsymmetric normalized matrix; see Ref. 48.

The result is a representation of the form

σi ↦ (λ1 ψ_{i,1}, …, λN ψ_{i,N}),
(8)

where each ψα, α = 1, …, N, is a right eigenvector and each λα is the associated eigenvalue. Note that the coefficients σ_{i,j} = λj ψ_{i,j} are also weights for the conjugate left eigenvectors Φ_{j=1,…,N}, which are themselves functions of the RKHS. (Hence, they are represented in practice by the N values they take at each sample.)

The first eigenvalue is 1, and it is associated with constant right eigenvector coordinates.48 The conjugate left eigenvector coefficients yield an estimate of the limit density ℓ(σ̂i), with respect to the reference measure μ in our case. Hence, Φ1 and ψ1 can be normalized so that Σ_j Φ_{1,j} = 1 and ψ_{1,j} = 1 for all j. With these conventions, λ1 ψ_{1,j} = 1 is constant and can be ignored, all the while respecting the bi-orthogonality ⟨ψ1, Φ1⟩ = 1. The other eigenvalues are all positive, 1 > λα > 0, and we can choose the indexing α so they are sorted in decreasing order.

When S is a low-dimensional manifold, as assumed here, a spectral gap should be observed. Then, it is sufficient to retain only the first M ≪ N components. Otherwise, M can be set so that the residual distance Σ_{α>M} λα² (ψ_{i,α} − ψ_{j,α})² remains below some threshold θ. Since the diffusion distance is a true metric in the embedding space HY, θ can also be set below a prescribed significance level for the MMD test (Sec. III A 1), if so desired. The residual components are then statistically irrelevant.
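One possible implementation of this density-normalized decomposition, starting from the causal-state similarity matrix GS, is sketched below; the α = 1 normalization of Ref. 48 and the plain dense eigensolver are implementation choices made only for illustration.

```python
import numpy as np

def diffusion_map(GS, M=10):
    """Density-normalized diffusion map of a similarity matrix GS (alpha = 1)."""
    d = GS.sum(axis=1)
    K = GS / np.outer(d, d)            # remove the influence of the sampling density
    row = K.sum(axis=1)
    P = K / row[:, None]               # row-stochastic (nonsymmetric) diffusion matrix
    evals, right = np.linalg.eig(P)    # right eigenvectors psi as columns
    order = np.argsort(-evals.real)[: M + 1]
    lam = evals.real[order]
    psi = right.real[:, order]
    psi /= psi[:, [0]]                 # convention: first right eigenvector is constant 1
    # Reduced coordinates of each causal state, Eq. (8), dropping the constant component.
    # (Left eigenvectors Phi can be obtained analogously from P.T, with matching order.)
    coords = lam[1:] * psi[:, 1:]
    return lam, psi, coords
```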

Taking stock, the preceding established that

  1. The causal-state manifold S can be represented in terms of a functional basis {Φm}_{m=1,…,M} in HY of reduced dimension M. This is in contrast to using the full data span {kY(yi, ·)} ⊂ HY of size N. The remaining components are irrelevant.

  2. The functional basis {Φm}_{m=1,…,M} can be defined in such a way that the induced representation of S does not depend on the density at which various regions of S are sampled. This cleanly recovers S’s geometry.

  3. Each causal state σ is represented as a set of coefficients on that basis.

Taken altogether, the RKHS causal states and diffusion-map equations of motion define a new structural model class—the kernel ϵ-machines. The constituents inherit the desirable properties of optimality, minimality, and uniqueness of ϵ-machines generally and provide a representation for determining quantitative properties and deriving analytical results. That said, establishing these requires careful investigation that we leave to the future.

As with ϵ-machines generally, kernel ϵ-machines can be used in a number of ways. The following describes how to compute statistics and how to make predictions.

To recover the expected values of functions of data, a functional can be defined on the reduced basis. Recall that, thanks to the reproducing property,

⟨σ, f⟩ = E_{P(Y|σ)}[f(xσ)]

for any f ∈ HX. Such functions f are represented in practice by the values they take on each of the observed N samples, with the reproducing property only achieved for the N samples (and otherwise approximated). One such function is fτ, which associates, to each observed past x, the entry of the future time series y matching time τ. This function can be fit from observed data. It is easy to generalize to spatially extended or network systems or to any function of the (xi, yi) data pairs. However, ⟨σ, f⟩ can be expressed equally well in the reduced basis {Φm}_{m=1,…,M}. Then, fτ is simply projected, fm = ⟨fτ, Φm⟩, onto each eigenvector.

This leads to an initial way to make predictions:

  1. For each data sample, represent the function f by the value it takes for that sample. For example, for vector time series of dimension D, x is a past series ending at the present time t0 and y a future series. fτ is then the D values observed at time t0 + τ in the data set, for each sample, yielding an N × D matrix.

  2. Project the function f onto the reduced basis by computing fm = ⟨fτ, Φm⟩ for each left eigenvector m ≤ M. This yields an M × D matrix representation f̂.

  3. Compute f̂[σi] for a state σi, itself represented as a set of M coefficients on the reduced basis [Eq. (8)]. This yields a D-dimensional vector in the original data space X in this example. This can be compared to the actual value from yi at time τ.
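A minimal sketch of steps 2 and 3, assuming Phi holds the M conjugate left eigenvectors (an N × M array of their values at the samples), f_values is the N × D matrix of step 1, and state_coeffs are a state's M reduced coordinates from Eq. (8); all names are illustrative.

```python
def project_function(f_values, Phi):
    """Step 2: project f (N x D values at the samples) onto the M left eigenvectors."""
    return Phi.T @ f_values            # M x D reduced representation f_hat

def evaluate_at_state(f_hat, state_coeffs):
    """Step 3: evaluate the projected function at a state given by its M coefficients."""
    return state_coeffs @ f_hat        # D-dimensional prediction in the data space
```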

A model is useful for prediction if its states can be computed on newly acquired data xnew for which future values y are not available. In the case of kernel methods and diffusion maps, Nyström extension54 is a well-established method that yields a reasonable state estimate σ^new if xnew lies within a dense region of data. That said, it is known to be inaccurate in sparsely sampled regions.

Given the Fokker–Planck evolution equation solutions [Eq. (6)] and the evolution operator estimation methods described shortly, we may estimate a distribution q^new(σ) over S, encoding the probability that the causal state associated with xnew is at location σS. Then, a single approximation σ^new=E[q^new(σ)] could be obtained, if desired. We can also allow the distribution to degenerate to the Dirac δ distribution and yield effectively a single estimate. This could occur, for example, when the evolution is applied to one of the original samples xnew=xi used for estimating the model.

To estimate a distribution q̂new(σ), we employ kernel moment matching,55 adapted to our context. The similarity of the new data observation xnew to each reference data point x_{i=1,…,N} is computed using kernel evaluations K(xnew) = {kX(xnew, xi)}_{i=1,…,N}. Applying Eq. (3) to the new vector K(xnew),

ωnew = argmin_ω ‖(GX + εI) ω − K(xnew)‖²,

subject to ω_i^new ≥ 0 for i = 1, …, N.

Compared to Eq. (3), this adds a positivity constraint similar to kernel moment matching.55 This also implies ω_i^new ≤ (1 + ξ)/(1 + ε), where ξ is the tolerance of the argmin solver. Proof: We know R = GX + εI has 1 + ε on the diagonal. Both GX and K(xnew) have positive entries. Then, (1 + ε) ωi + Σ_{j≠i} R_{ij} ωj = kX(xnew, xi) ± e, where e < ξ is the error of the argmin solver. However, Σ_{j≠i} R_{ij} ωj ≥ 0 since G^X_{ij} > 0 and ωj ≥ 0 by constraint. Moreover, kX(xnew, xi) ≤ 1 by construction. Hence, (1 + ε) ωi ≤ 1 + ξ and ωi ≤ (1 + ξ)/(1 + ε).

ωnew is thus the closest solution to (GX + εI)⁻¹ K(xnew) so that the estimated state

σ̂new = Σ_{i=1}^{N} ω_i^new kY(yi, ·)

remains in the convex set in HY of the data kY(yi, ·), up to a good approximation (1 + ξ)/(1 + ε) ≈ 1. Currently lacking a formal justification for convexity, we found that better results are obtained with it than with Nyström extensions—these use possibly negative ω_i^new values due to the unbounded matrix inverse, yielding estimates that can wander arbitrarily far away in HY (depending on ε). Normalizing, we get a probability distribution of the form

q̂new(σi) = ω_i^new / Σ_j ω_j^new,

where, as usual in RKHS settings, the function q̂new is expressed as the values it takes on every reference data sample, σ_{i=1,…,N} in this case.
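A small sketch of this constrained estimate, using SciPy's non-negative least-squares solver as a stand-in for the generic argmin with positivity constraint (an implementation choice, not something prescribed above):

```python
import numpy as np
from scipy.optimize import nnls

def estimate_q_new(GX, K_new, eps=1e-3):
    """Non-negative weights for a new observation, normalized to a distribution."""
    N = GX.shape[0]
    # Solve min_w || (G_X + eps*I) w - K(x_new) ||^2  subject to w >= 0.
    w, _ = nnls(GX + eps * np.eye(N), K_new)
    return w / w.sum()     # q_hat_new, expressed over the N reference samples
```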

Compared to kernel moment matching,55 we used the kernels kY(yi,) as the basis for expressing the density estimation, with coefficients that encode the similarity of the new sample to each reference sample in X. An alternative would be to adapt the method from Ref. 43 and express q^new on reference samples drawn from the limit distribution over S; see Sec. IV B.

When data are projected on a reduced basis {Φm}_{m=1,…,M}, the distribution q̂new can be applied to the reference state coordinates σ_{i,m} = λm ψ_{i,m} so that an estimated state σ̂new = E[q̂new(σ)] can be computed with

σ̂_m^new = λm Σ_i q̂_i^new ψ_{i,m},   for m = 1, …, M.
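In code, this expectation is a one-line weighted average of the reference-state coordinates; a sketch using the arrays from the earlier diffusion-map sketch (names are assumptions):

```python
def expected_state(q_new, lam, psi):
    """sigma_hat_new_m = lam_m * sum_i q_new_i * psi_{i,m}, for m = 1..M."""
    return lam * (q_new @ psi)
```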

Section V A’s method allows predicting any arbitrary future time τ, provided that the future is sufficiently well represented in the variable y ∈ Y. In practice, this means that the future look ahead LY needs to be larger than τ and that sufficient data are captured to ensure a good reproducing capability for the kernel kY. However, for some systems, the autocorrelation decreases exponentially fast, and there is no practical way to collect enough data for good predictions at large τ.

An alternative employs the Fokker–Planck equation to evolve state distributions over time. This, in turn, yields an estimate

E_{Q(σ,t|t0)} [ E_{Pr(Y|σ)} [f(xσ)] ].

Q here is the state distribution reached at time t>t0, whose density is given by Eq. (6). This method exploits S’s full structure, its Markovian properties, and the generator described in Sec. IV B.

This allows reaching longer times t > τ, while using look aheads LY (Sec. II E) that match the natural decrease of autocorrelation. If the variable y ∈ Y captures sufficient information about the system’s immediate future for a given x ∈ X, then the causal structure is consistently propagated to longer times by exploiting the causal-state dynamics given by the evolution operator E(t) = e^{(t−t0) Γ†}.

Thanks to expressing S on the basis {Φ_m}_{m=1,…,M}, it is possible to explicitly represent the coefficients a(σ_t) and b(σ_t) of Eq. (4) in a traditional vector representation and to apply established SDE coefficient estimation methods.45,56 However, recent results on similar models that work directly in data space—instead of, as here, in causal-state space—suggest that working directly with the evolution operator is more reliable.43,46,57

Assuming states are estimated from observations acquired at regularly sampled intervals, so that σ_{i+1} and σ_i are separated by a time increment Δt, then, on the functional basis {Φ_m}_{m=1,…,M}, the coefficients ψ_{i+1,m} are related to the coefficients ψ_{i,m} by the action of the evolution operator E(Δt) on the state σ_i. Hence, this time-shifting operator from t_0 to time t = t_0 + Δt can be estimated with

\hat{E}(\Delta t) \approx \psi_{2:N}^{\top} \, \Phi_{1:N-1}, \qquad (9)

where ψ_{2:N} is the set of M right eigenvectors restricted to times t ≥ t_0 + Δt and Φ_{1:N−1} are the corresponding M left eigenvectors restricted to times t ≤ t_0 + (N − 1)Δt. Normalization can be performed a posteriori: Ê(Δt)[σ̂] should have the constant 1 as its first coefficient (Sec. IV C). Therefore, it is straightforward to divide Ê(Δt)[σ̂] by its first coefficient for normalization. The estimator Ê(Δt) is efficiently represented by an M × M matrix. For a similar operator working in data space X, instead of in S, the estimator is consistent in the limit N → ∞, with an error growing as O(Δt/N).46

Predictions for values at future times nΔt, obtained by operator exponentiation,

E(n \Delta t) = e^{n \Delta t \, \Gamma} = \left( e^{\Delta t \, \Gamma} \right)^{n},

are simply estimated with their matrix counterpart Ê(nΔt) = Ê(Δt)^n. Thanks to the bi-orthogonality of the left and right eigenvectors, this last expression can be computed efficiently with

\hat{E}(n \Delta t) \approx \psi_{n+1:N}^{\top} \, \Phi_{1:N-n}

and a posteriori normalization. Though this estimator is statistically consistent in the limit N → ∞, in practice, when n approaches N, the method clearly fails. In some cases, this is counterbalanced by convergence toward the limit distribution in n ≪ N steps, so that the problem does not appear. This is the case for the experiments in Sec. VI C using a chaotic flow. Yet, the general case is a topic for future research.
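Both Ê(Δt) and its n-step analog are plain matrix products of the restricted eigenvector sets. A minimal sketch, assuming the first basis vector is the constant eigenfunction used for a posteriori normalization (array conventions and names are ours):

```python
import numpy as np

def evolution_operator(psi, phi, n=1):
    """Estimate E(n*dt) on the reduced basis [Eq. (9) and its n-step analog].

    psi : (N, M) right eigenvectors, one row per sampling time.
    phi : (N, M) left eigenvectors (bi-orthogonal dual basis).
    n   : number of elementary time steps dt to shift.
    """
    return psi[n:].T @ phi[:-n]  # (M, M) matrix: rows n..N-1 against rows 0..N-1-n

def evolve_state(E_n, sigma):
    """Apply the estimated operator to reduced-basis coordinates and renormalize
    a posteriori so the first (constant) coefficient stays equal to 1."""
    out = E_n @ sigma
    return out / out[0]
```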

To synopsize, prediction is achieved with the following steps:

  1. Represent a function f, defined on X, by the values it takes for each observed data pair (xi,yi), with the method described in Sec. V A. This gives an estimate f^.

  2. Build the evolution operator E^(Δt) as described above or powers of it E^(nΔt) for predicting n steps in the future.

  3. Compute the function E^(nΔt)[f^]. This amounts to a matrix multiplication on a reduced-basis representation, together with an a posteriori normalization.

  4. When a new data sample x^new becomes available at the present time t_0, estimate a distribution Q over the training samples that best represents x^new. Q is expressed by its density q̂^new on a reduced basis {Φ_m}_{m=1,…,M} as detailed in Sec. V B.

  5. Apply the evolved function Ê(nΔt)[f̂] to q̂^new to obtain the expected value E_{Q(σ, t_0+nΔt)}[E_{P(Y|σ)}[f(x_σ)]] that f takes at future time t_0 + nΔt. A schematic sketch of these steps follows this list.
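The following schematic strings these five steps together under the same assumed conventions as the earlier sketches; the exact transposes and normalizations depend on Secs. IV C and V A, so this is an illustration rather than a transcription of the released code:

```python
import numpy as np

def predict(f_values, phi, psi, lam, q_new, n=1):
    """Predict the value of f, n time steps dt ahead of the new sample.

    f_values : (N,) values of f on the training pairs (x_i, y_i)   [step 1].
    phi, psi : (N, M) left/right eigenvectors of the reduced basis.
    lam      : (M,) eigenvalues.
    q_new    : (N,) density over reference states for x_new        [step 4].
    """
    f_hat = phi.T @ f_values               # step 1: reduced-basis coefficients of f
    E_n = psi[n:].T @ phi[:-n]             # step 2: n-step evolution operator
    f_evolved = E_n.T @ f_hat              # step 3: evolve the function representation
    sigma_new = lam * (q_new @ psi)        # step 4: expected state of the new sample
    sigma_new = sigma_new / sigma_new[0]   # a posteriori normalization
    return f_evolved @ sigma_new           # step 5: expectation of f at t0 + n*dt
```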

Section VI C applies this to a concrete example.

The following illustrates reconstructing kernel ϵ-machines from data in three complementary cases: (i) an infinite Markov-order binary process generated by a two-state hidden Markov model, (ii) a binary process generated by an HMM with an uncountable infinity of causal states, and (iii) thermally driven continuous-state deterministic chaotic flows. In each case, the hidden causal structure is discovered assuming only that the processes are conditionally stationary.

The even process is generated by a two-state, unifilar, edge-emitting Markov model that emits discrete data values v ∈ {0, 1}. Figure 2 displays the ϵ-machine HMM state-transition diagram—states and transition probabilities.

FIG. 2.

Even process state-transition diagram: An HMM that generates a binary process over outputs v ∈ {0, 1}. Transitions are labeled with the emitted symbol, followed by the probability of taking this transition. The even process has infinite Markov order—emitted 1s occur in even-length blocks (of arbitrary length) bounded by 0s. The process is stationary when starting with the state distribution (Pr(S = σ_0), Pr(S = σ_1)) = (2/3, 1/3). This HMM is an ϵ-machine.


Realizations x = (v_t)_{−L_X < t ≤ 0} and y = (v_t)_{0 < t ≤ L_Y} consist of sequences in which blocks of an even number of 1s are bounded by any number of 0s; e.g., 01101111000011001111110. An infinite-past look ahead L_X is required to correctly predict these realizations. Indeed, truncation generates ambiguities when only 1s are observed.

For example, with L_X = 4, the observed series 1111 could be part of a larger group 011111, in which case the next symbol is necessarily a 1, or of larger groups of the form 111111, 001111, or 101111, in which case the next symbol is either 0 or 1 with probability 1/2. However, with a limited look ahead of L_X = 4, a prediction has no way to encode that the next symbol is necessarily a 1 in the first case. One implication is that no finite-order Markov chain generates the process.

Despite its simplicity—only two internal states—and compared to processes generated by HMMs with finite Markov order, the even process is a helpfully challenging benchmark for testing the reconstruction capabilities of kernel ϵ-machine estimation on discrete data.

Reconstruction is performed using product kernels,

k_X(x, \cdot) = \prod_{i=0}^{L_X - 1} k_V(x_{-i}, \cdot)^{\gamma^{\, i/(L_X - 1)}}, \qquad k_Y(y, \cdot) = \prod_{i=1}^{L_Y} k_V(y_{i}, \cdot)^{\gamma^{\, (i-1)/(L_Y - 1)}},

with a decay parameter γ setting the relative influence of the most distant past (future) symbol x_{−L_X+1} (y_{L_Y}). We use the exponential kernel,

k_V(a, b) = e^{-(a - b)^2 / 2\xi^2},

where a, b ∈ {0, 1} are the symbols in the series.
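For illustration, the sketch below evaluates these kernels between two symbol windows. It assumes our reading of the weighting, namely an exponent γ^{i/(L_X−1)} that equals 1 for the most recent symbol and γ for the most distant one; names are ours:

```python
import numpy as np

def k_V(a, b, xi=1.0):
    """Symbol kernel exp(-(a - b)^2 / (2 xi^2)) for symbols a, b."""
    return np.exp(-(np.asarray(a) - np.asarray(b)) ** 2 / (2.0 * xi ** 2))

def k_window(w, w_ref, gamma=0.01, xi=1.0):
    """Weighted product kernel between two windows, most recent symbol first."""
    L = len(w)
    exponents = gamma ** (np.arange(L) / (L - 1))  # 1 for the most recent, gamma for the oldest
    return float(np.prod(k_V(w, w_ref, xi) ** exponents))

# Example: compare a past window of ten symbols to a reference window.
past = [1, 1, 0, 1, 1, 1, 1, 0, 0, 1]
ref  = [1, 1, 1, 1, 0, 0, 1, 1, 0, 0]
print(k_window(past, ref, gamma=0.01, xi=1.0))
```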

Figure 3 presents the results for LX=10 and LY=5 for a typical run with N=30000 sample (x,y) pairs, with a decay γ=0.01 and a bandwidth ξ=1.

FIG. 3.

Even process: Reconstructed-state coordinates ψ1 on the first reduced-basis eigenvector Φ1, together with a graphical representation of the transitions inferred between the colored clusters.


The eigenvalue spectrum of the reduced basis decreases rapidly: λ_0 = 1 (as expected), λ_1 ≈ 10^−2, and all other eigenvalues λ_{j≥2} < 10^−4. We, therefore, project the causal states on the only relevant eigenvector Φ_1 and build the histogram shown in Fig. 3. Colors match labels automatically found by a clustering algorithm.58
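The clustering step is off the shelf; per footnote 58, DBSCAN with a threshold of 0.1 suffices. A minimal sketch with scikit-learn (the array psi_1 of coordinates on Φ_1 and the function name are ours):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_states(psi_1, eps=0.1):
    """Cluster the 1D causal-state coordinates; eps=0.1 follows footnote 58."""
    labels = DBSCAN(eps=eps).fit_predict(psi_1.reshape(-1, 1))
    # Empirical cluster probabilities, ignoring DBSCAN's noise label (-1).
    probs = {int(c): float(np.mean(labels == c))
             for c in np.unique(labels) if c != -1}
    return labels, probs
```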

Figure 3 summarizes the cluster probabilities and their transitions.59 They match the original even process together with a transient causal state. By inspection, one sees that this state represents the set of sequences of all 1s mentioned above—the source of ambiguity. Its probability, for L_X = 10, is that of jumping five consecutive times from state σ_0 to state σ_1 in the generating even process. Hence, 1/2^5 ≈ 0.03, which is the value we observe. From that transient state, the ambiguity cannot be resolved; therefore, transitions follow the (1/3, 2/3) proportions of the symbols in the series. Note that unifilarity is broken since there are two paths for the symbol v = 1 starting from state σ_1, also reflecting the ambiguity induced by the finite truncation.

The even process is a case where ambiguity arises from incomplete knowledge due to finite-range truncation. However, even for discrete finite data alphabets V, there are processes whose causal states are irreducibly infinite. This occurs generically for processes generated by nonunifilar HMMs. In this case, knowledge of the observed data is insufficient to determine the generative model’s internal state. Only distributions over those states—the mixed states29—are predictive.

In the limit L_X → ∞, the causal states then correspond to unique mixed-state distributions,21 and there can be infinitely many (countably or uncountably) causal states, even for a simple finite generative HMM. This arises from nonunifilarity since the same observed symbol allows transitions to two distinct internal states. In contrast, predictive equivalence requires that the same information (a newly observed symbol), starting from the same current state [the equivalence class ϵ(X) of pasts X], induces the same consequences [possible futures Y, with a fixed distribution Pr(Y|ϵ(X))]. Thus, nonunifilar models can be more compact generators. This can be beneficial, but it comes at the cost of being markedly worse predictors than unifilar HMMs.

Consider the nonunifilar mess3 HMM introduced in Ref. 29, represented graphically in Fig. 4.

FIG. 4.

Nonunifilar generative HMM mess3: Only transition probabilities are depicted; emitted symbols are described in the main text.


Mess3 uses three symbols V = {0, 1, 2} and consists of three generative states s_0, s_1, and s_2. The state-transition diagram is symmetric, and each state has the same transition structure. Consider the state s_i for i = 0, 1, 2 and modulo arithmetic so that i − 1 = 2 when i = 0 and i + 1 = 0 when i = 2. Then, the transitions are

  1. From state s_i to itself, for a total probability p = α = ay + 2bx, symbols are emitted as Pr(v = i) = ay and Pr(v = i−1) = Pr(v = i+1) = bx.

  2. From state s_i to state s_{i+1}, for a total probability p = (1 − α)/2, symbols are emitted as Pr(v = i) = ax, Pr(v = i−1) = bx, and Pr(v = i+1) = by.

  3. From state s_i to state s_{i−1}, for a total probability p = (1 − α)/2, symbols are emitted as Pr(v = i) = ax, Pr(v = i−1) = by, and Pr(v = i+1) = bx.

In this, a = 0.6, b = (1 − a)/2 = 0.2, x = 0.15, and y = 1 − 2x = 0.7.
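A minimal sampler consistent with these transition rules, treating the listed symbol probabilities as joint transition and emission probabilities (they indeed sum to the stated totals); names and the seed are ours:

```python
import numpy as np

def sample_mess3(T, a=0.6, x=0.15, seed=0):
    """Sample T symbols from the mess3 generative HMM described above."""
    b, y = (1 - a) / 2, 1 - 2 * x
    # Joint (next-state offset, symbol offset) probabilities relative to state i.
    # Rows: state offset ds in {0, +1, -1}; columns: symbol offset dv in {0, -1, +1}.
    joint = np.array([[a * y, b * x, b * x],
                      [a * x, b * x, b * y],
                      [a * x, b * y, b * x]])
    ds = np.repeat([0, +1, -1], 3)
    dv = np.tile([0, -1, +1], 3)
    p = joint.ravel()                      # sums to 1
    rng = np.random.default_rng(seed)
    s, symbols = 0, np.empty(T, dtype=int)
    for t in range(T):
        k = rng.choice(9, p=p)
        symbols[t] = (s + dv[k]) % 3       # symbol emitted relative to the current state
        s = (s + ds[k]) % 3                # move to the next generative state
    return symbols
```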

The generated process is known to have an uncountable infinity of causal states. Their ensemble S forms a Sierpinski gasket in the mixed-state simplex.29 The probability of observing states at deeper subdivisions decreases exponentially, though. With limited samples, only a few self-similar subdivisions of the main gasket triangle can be observed.

Reconstructing S is performed using LX=15 and LY=1. The same product exponential kernel is used as above, with a decay γ=0.01 and a bandwidth ξ=0.1. N=25000 sample (x,y) pairs are sufficient to reconstruct the first self-similar subdivisions.

The eigenvalue spectrum is λ_0 = 1, λ_1 = 0.999, λ_2 = 0.999, and λ_{i≥3} < 1.2 × 10^−15. The algorithm thus finds only two components of equal significance. Figure 5 shows a scatterplot of the inferred causal states using coordinates on a reduced basis {Φ_1, Φ_2}. The states are very well clustered on the main vertices of the Sierpinski gasket. (The inset shows that 4 orders of magnitude are needed to see a spread within each cluster compared to the main triangle size.) The triangle is remarkably equilateral and its center even lies on (0, 0), reflecting the symmetry of the original HMM. To our knowledge, no other algorithm is currently able to correctly recover the causal states purely from data from such a challenging, complex process.

FIG. 5.

Projection of the mess3’s causal states on a reduced basis {Φ1,Φ2}. The number of samples stacked on each point is indicated.


The number of samples at each triangle node is indicated. The nodes at the first subdivision are about 800 times less populated than the main nodes. While theory predicts that states appear on all subdivisions of the Sierpinski gasket, the sample size needed to observe enough such states is out of reach computationally.

The first two examples demonstrated that kernel ϵ-machine reconstruction works well on discrete-valued data, even when states are expected to appear on a complicated fractal structure or when the time series has infinite-range correlations. The next step, and a central point of the development, is to reconstruct a continuous infinity of causal states from sampled data. The following example processes also serve to demonstrate time-series prediction using the estimated kernel ϵ-machine, recalling the method introduced in Sec. IV B.

We first use the chaotic Lorenz ordinary differential equations from 1963 with the usual parameters (σ,ρ,β)=(10,28,8/3).60 We add isotropic stochastic noise components dW at amplitude η to model thermal fluctuations driving the three main macroscopic fluid modes, which the ODEs were crafted to capture

\begin{aligned}
du &= -\sigma (u - v)\, dt + \eta\, dW, \\
dv &= (\rho u - v - u w)\, dt + \eta\, dW, \\
dw &= (-\beta w + u v)\, dt + \eta\, dW.
\end{aligned}

A random initial condition (u_0, v_0, w_0) is drawn in the region near the attractor, and a sufficiently long transient is discarded before collecting data (u_t, v_t, w_t). SDE integration is performed using dt = 0.01, yielding samples (u_t, v_t, w_t)_{0 ≤ t < T} up to a maximal time T corresponding to N = 20 000 (past, future) pairs.

In addition to the thermal fluctuations, we also model measurement error by adding a centered Gaussian process Γ = (γ^0, γ^1, γ^2) with isotropic variance ν^2. The observed series, used to estimate a kernel ϵ-machine and perform a sensitivity analysis, is then (u_t + γ_t^0, v_t + γ_t^1, w_t + γ_t^2).
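For concreteness, a minimal Euler-Maruyama sketch of this data-generation step, including the measurement noise; the paper does not specify its SDE integrator, so the scheme, initial condition, and names below are ours:

```python
import numpy as np

def lorenz63_sde(T=20000, dt=0.01, eta=0.0, nu=0.0, discard=5000, seed=0):
    """Euler-Maruyama integration of the thermally driven Lorenz-63 SDE above,
    followed by additive Gaussian measurement noise of standard deviation nu."""
    sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0
    rng = np.random.default_rng(seed)
    u = np.array([1.0, 1.0, 1.0])            # arbitrary starting point near the attractor
    clean = np.empty((T, 3))
    for t in range(-discard, T):              # discard a transient before recording
        drift = np.array([-sigma * (u[0] - u[1]),
                          rho * u[0] - u[1] - u[0] * u[2],
                          -beta * u[2] + u[0] * u[1]])
        u = u + drift * dt + eta * np.sqrt(dt) * rng.standard_normal(3)
        if t >= 0:
            clean[t] = u
    measured = clean + nu * rng.standard_normal(clean.shape)
    return clean, measured
```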

1. Kernel ε-machine reconstruction in the presence of noise and error

When η = 0 and ν = 0, the deterministic ODEs’ trajectories do not cross and are uniquely determined by the initial condition (u_0, v_0, w_0) at t = 0. Hence, each state on the attractor is its own causal state. Retaining information from the past is then moot, but only if (u, v, w) is known with infinite precision, since the ODEs’ chaotic solutions amplify fluctuations exponentially fast. This is never the case in practice. Therefore, considering small values of L_X and L_Y may still be useful to better determine the causal states. We use L_X = L_Y = 5 in the reconstructions.

Figure 6 shows the projections (right) with coordinates (ψ1,ψ2,ψ3), together with the original (pre-reconstruction) attractor data (left) for different noise combinations. Figure 6 (top row) displays the results of kernel ϵ-machine estimation from the noiseless data (ν=0 and η=0). The second row shows the effects of pure measurement noise (ν=1 and η=0) on the raw data (left) and on the estimated kernel ϵ-machine (right). Similarly, the last row shows the effect of pure thermal noise (ν=0 and η=1).

FIG. 6.

Lorenz attractor (left) and its estimated kernel ϵ-machine (right) at various thermal η and measurement ν noise levels.


As expected, the structure is well recovered when no noise is added. A slight distortion is observed. In practice, the causal states are unaffected by coordinate transforms and reparameterizations of X and Y that do not change the equivalence of conditional distributions, Pr(Y|X = x_1) = Pr(Y|X = x_2); therefore, the algorithm could very well have found another parametrization of the Lorenz-63 attractor. See also Sec. VI D. The causal states of the original data series are also well recovered even when that series is severely corrupted by strong measurement noise at ν = 1 in Fig. 6 (middle row).

To appreciate these results, recall that, in the noiseless case, if x_1 and x_2 are in the same causal state, x_2 ∈ ϵ(x_1), of the original series, then by definition, Pr(Y|X = x_1) = Pr(Y|X = x_2). Measurement noise (ν > 0), independent at each time step, does not change this. Since measurement noise is added to each and every time step independently, noisy series x ending with the same noisy triplet (u, v, w) at the current time t = t_0 end up in the same causal state of the noisy system. This causal state reduces to the current triplet (u, v, w), itself a specific causal state of the deterministic system. Hence, the causal states of the deterministic ODEs are subsets of those of the kernel ϵ-machine estimated from the noisily measured series. We arrive at the important conclusion that, at least for deterministic chaotic systems, the uniqueness and continuity of ODE solutions guarantee that causal states are unaffected by measurement noise. This is generally not true for many-to-one maps and other functional state-space transforms that merge states.

In contrast to measurement noise, thermal noise (η > 0) modifies the equations of motion, and the resulting trajectories reflect the accumulated perturbations. Since each state on the (deterministic) attractor is its own causal state, the estimated causal states are modified.

Let us probe the robustness of kernel ϵ-machine estimation. With the parameters detailed below, we obtain an eigenvalue spectrum shown in Fig. 7. There is an inflection point after the first three components. The eigenvalues are remarkably insensitive to a strong measurement noise level ν=1. They are also very robust to the thermal noise η=1, which induces only some minor eigenvalue changes.

FIG. 7.

Eigenvalues for the Lorenz-63 attractor estimated kernel ϵ-machines at various thermal and measurement noise levels.


Thus, kernel ϵ-machine estimation achieves a form of denoising beneficial for random dynamical systems. However, the reconstructed causal states reflect the thermal noise induced by η = 1, as can be seen in the fine details of the bottom row of Fig. 6. Note that the algorithm strongly reduces the measurement noise, even when the attractor is severely corrupted, while the apparently minor thermal noise is preserved.

2. Kernel ε-machine prediction and sensitivity

In Ref. 46, a prediction experiment is performed using an evolution operator computed directly in (u,v,w) space instead of in S, as here; recall Sec. V C. That study focuses on how the error due to a small perturbation propagates with the number of prediction steps.

Here, we follow a more typical approach: first learn the model—i.e., compute the states, the basis Φ, the evolution operator, and so on—on a data set of N = 20 000 samples; then estimate the prediction error on a completely separate “test” set. This data set is generated with the same parameters as the training set, but starting from another randomly drawn initial condition. P = 100 test samples are selected after the transient, each separated by a large subsampling interval so as to reduce their correlation. Unlike the examples in Secs. VI A and VI B, this produces an objective error function—the prediction accuracy—useful for cross-validating the relevant meta-parameters.

Due to the computational cost involved with grid searches, we only cross-validate the data kernel kV bandwidth on a reduced data set in preliminary experiments. In Ref. 46, an arbitrary variance was chosen for the distribution used to project (u,v,w) triplets onto the operator eigenbasis. We also cross-validate it to improve the results for a fair comparison with kernel ϵ-machine reconstruction.

Figure 8 presents the results. Forecasts are produced at every Δt=0.05 interval, and operator exponentiation is used in between, as detailed in Sec. V C. This allows a comparison of trajectories over 500 elementary integration steps, which is large enough for the trajectory to switch between the attractor main lobes several times. Such trajectories are computed starting from each of the 100 test samples.

FIG. 8.

Predictions on the Lorenz-63 attractor.


In the noiseless case η = ν = 0, the average discrepancy ‖(u, v, w) − (u_p, v_p, w_p)‖ measures how predicted triplets (u_p, v_p, w_p) differ from data triplets (u, v, w), averaged at each prediction step over the 100 trajectories.

In the noisy cases, it makes little sense to test the algorithm’s ability to reproduce noisy data. We instead test its ability to accurately predict the noiseless series (u, v, w)_{t > t_0} based on noisy past observations (u, v, w)_{t ≤ t_0}. This is easily done for measurement noise ν > 0, for which the original noiseless series is available by construction.

For simulated thermal noise η > 0, all we have is a particular realization of an SDE, but no clean reference. Starting from the current noisy values at each prediction point (u, v, w), we evolve a noiseless trajectory with the basic Lorenz ODEs. Since we use an isotropic centered η dW Wiener process, that trajectory is also the ensemble average over many realizations of the SDE, starting from that same (u, v, w) point. It makes more sense to predict that ensemble average, from the current noisy causal-state estimate, rather than a particular SDE trajectory realization.

The results in Fig. 8 show a clear gain in precision when using the RKHS method, both in the unperturbed data case and when data are perturbed by measurement noise ν=1. This gain persists until the trajectory becomes completely uncorrelated with the original prediction point. The situation is less favorable for thermal noise η=1.

Figure 9 presents a sensitivity analysis that focuses on predictions after 50 time steps. For lower noise levels η < 1, the RKHS method still reduces the prediction error, while the operator in data space does not seem to be sensitive to η. We note (not shown here) that, for longer time scales, the RKHS method may produce worse results on average. The handling of measurement noise ν < 1 is also in favor of the RKHS method, consistent with the above results.

FIG. 9.

Sensitivity when predicting trajectories along the Lorenz-63 attractor: Reconstruction error dependency on noise levels. Data for the ν=η=0 noiseless case are shown as markers on the vertical axis.


This is all well and good, but do the attractor reconstruction and denoising capabilities hold in higher dimensions? In fact, it has been known for decades61 that the Lorenz-63 system is special and very easily reconstructed. This is due to its high state-space volume contraction rate and its simple, smooth vector field that sports only two nonlinear terms.

To address these issues, we employ the Lorenz 1996 model,62 with tunable dimension parameter D, defined for i = 1, …, D by du_i/dt = −u_{i−2} u_{i−1} + u_{i−1} u_{i+1} − u_i + F, with modulo arithmetic on the indices. We use F = 8, which yields chaotic dynamics. In order to better cover all of the attractor, series are generated from ten different starting points taken at random on the basin of attraction, after discarding a sufficiently long transient. N samples of (x, y) pairs are collected from these series, using history lengths of L_X = L_Y = 5 values of the vectors u. The projections of these series along the first three dimensions of the attractor for D = 5 and D = 100 are shown on the left part of Fig. 10.
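A minimal integrator for generating such series; the integration scheme, initial perturbation, and names below are ours, as the text does not specify them:

```python
import numpy as np

def lorenz96_rhs(u, F=8.0):
    """Right-hand side du_i/dt = -u_{i-2} u_{i-1} + u_{i-1} u_{i+1} - u_i + F,
    with periodic indices."""
    return -np.roll(u, 2) * np.roll(u, 1) + np.roll(u, 1) * np.roll(u, -1) - u + F

def simulate_lorenz96(D=5, T=10000, dt=0.01, F=8.0, discard=2000, seed=0):
    """Integrate the Lorenz-96 ODEs with a simple fourth-order Runge-Kutta scheme
    from a randomly perturbed initial condition, discarding a transient."""
    rng = np.random.default_rng(seed)
    u = F + 0.5 * rng.standard_normal(D)     # random perturbation around the fixed point
    out = np.empty((T, D))
    for t in range(-discard, T):
        k1 = lorenz96_rhs(u, F)
        k2 = lorenz96_rhs(u + 0.5 * dt * k1, F)
        k3 = lorenz96_rhs(u + 0.5 * dt * k2, F)
        k4 = lorenz96_rhs(u + dt * k3, F)
        u = u + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        if t >= 0:
            out[t] = u
    return out
```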

FIG. 10.

(Top left) First 3 components of the Lorenz-96 D = 5-dimensional attractor, with N = 5000 samples. (Top middle) Same, back projected onto the original space from the noisy 1000-dimensional random embedding. (Top right) First 3 coordinates of the reconstructed causal-state set S. (Bottom row) Same plots for the D = 100-dimensional Lorenz-96 formulation. The reconstructed coordinates can be equivalent reparameterizations of the original variables and need not match them one to one. In each case, the noise has been greatly reduced, as in Fig. 6. Colors correspond to ten time series, each starting from a distinct random location on the basin of attraction, taken after a sufficiently long transient.


To test the algorithm’s reconstruction performance, we embed the time series in a 1000-dimensional space using random projections: V = UR, with R a matrix of random components of size D × 1000, taken from a normal distribution of standard deviation 1/D. U holds the collected samples, organized in a matrix of size N × D. In practice, in such a high dimension, the projection directions in R are nearly orthogonal. In addition, Gaussian noise with variance ν^2 = 1/D is added to each component of V, similar to the setup in Sec. VI C 1. We use the resulting data W = V + noise as input to the algorithm: truly 1000-dimensional data, now leaking into all dimensions thanks to the added noise, while retaining the overall lower D-dimensional structure of the Lorenz-96 attractor. This experiment seeks to reconstruct this hidden D-dimensional structure solely from the noisy 1000-dimensional time series.

The middle panes of Fig. 10 show the corrupted W data, projected back into the original space using the pseudo-inverse pinv(R), for D = 5 and D = 100. The scaling of the noise variance ensures that the noise strength is comparable irrespective of D.
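The embedding, corruption, and back-projection steps reduce to a few matrix products; a sketch using the 1/D scalings stated above (names are ours):

```python
import numpy as np

def embed_and_corrupt(U, d_embed=1000, seed=0):
    """Randomly project the (N, D) Lorenz-96 samples U into d_embed dimensions,
    add isotropic Gaussian noise, and back-project for visualization."""
    N, D = U.shape
    rng = np.random.default_rng(seed)
    R = rng.normal(scale=1.0 / D, size=(D, d_embed))           # nearly orthogonal directions
    V = U @ R                                                   # noiseless embedding
    W = V + rng.normal(scale=np.sqrt(1.0 / D), size=V.shape)    # noisy input, variance 1/D
    back = W @ np.linalg.pinv(R)                                # pinv(R) back-projection
    return W, back
```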

The right panels of Fig. 10 show the well-reconstructed attractors. Moreover, the denoising of chaotic attractors observed in Sec. VI C 1 is also well reproduced in this markedly more complicated setting. Interestingly, the first three coordinates of the reconstructed attractor do not, and need not, match those of the original U series. Indeed, the causal states are equivalence classes of conditional distributions and, as such, are invariant under reparameterizations of the original data that preserve these equivalence classes, such as coordinate transformations.

We see that the algorithm finds a parametrization where each added coordinate best encodes the conditional distributions in the low-dimensional coordinate representation, as explained in Sec. VI C. Yet, that parametrization encodes each and every one of the initial D components, even though it is presented with 1000-dimensional noisy time series of lengths L_X = L_Y = 5. This is shown in Fig. 11 and Table I, which clearly demonstrate spectral gaps at exactly D reconstructed components. These spectral gaps become more pronounced as the number of samples N increases, as shown in Table II. As expected, a larger number of samples N is required to properly capture the spectral gaps as the dimension D increases.

FIG. 11.

Eigenvalues for the reconstruction of the Lorenz-96 attractor, randomly projected into dimension 1000, with added high-dimensional noise, for N = 10 000 samples. The dimension in the legend refers to the original parametrization of the Lorenz-96 attractor, which is recovered by the algorithm in the form of a spectral gap when reconstructed from the 1000-dimensional noisy time series. Values after the gap are very low and are given in Table I, together with the gap itself.

TABLE I.

Top row: Dimension D of the Lorenz-96 ODE system, that is, before trajectories on the attractor undergo random projection and the addition of noise. Middle row: Eigenvalue after the spectral gap, expressed in % of the first eigenvalue. Bottom row: Spectral gaps, relative to the value after the gap. Results for N = 10 000 samples are graphically presented in Fig. 11.

Dim. D             5     10    15    20    25    30    35    40    45    50    55    60    65    70    75    80    85    90    95    100
100 λ_{D+1}/λ_1    0.31  0.23  0.26  0.24  0.24  0.24  0.22  0.25  0.26  0.21  0.23  0.23  0.23  0.22  0.23  0.26  0.24  0.20  0.25  0.21
λ_D/λ_{D+1}        60.3  64.1  51.2  49.3  42.2  35.2  34.9  30.8  26.8  26.4  25.1  24.1  22.0  20.7  18.5  18.1  15.8  14.3  14.5  13.9
TABLE II.

Spectral gap dependency on the number N of samples: Eigenvalues after the gap do not change in magnitude. Spectral gaps themselves, though, are much better resolved for larger N, especially in higher dimensions D.

                   N = 5000              N = 10 000            N = 20 000
Dim. D             90     95     100     90     95     100     90     95     100
100 λ_{D+1}/λ_1    0.22   0.18   0.21    0.20   0.25   0.21    0.21   0.20   0.21
λ_D/λ_{D+1}        6.3    5.8    5.0     14.3   14.5   13.9    26.1   25.6   26.5

We introduced kernel ϵ-machine reconstruction—a first-principles approach to empirically discovering causal structure. The main step was to represent computational mechanics’ causal states in reproducing kernel Hilbert spaces. This gave a mathematically principled method for estimating optimal predictors of minimal size and so for discovering causal structure in data series of wide-ranging types.

Practically, it extends computational mechanics to nearly arbitrary data types (at least, those for which a characteristic reproducing kernel exists). Section III A showed, though, that this includes heterogeneous data types via kernel compositions.

Based on this, we presented theoretical arguments and analyzed cases for which causal-state trajectories are continuous. In this setting, the kernel ϵ-machine is equivalent to an Itô diffusion acting on the structured set of causal states—a time-homogeneous, state-heterogeneous diffusion. The generator of that diffusion and its evolution operator can be estimated directly from data. This allows efficiently evolving causal-state distributions in a way similar to a Fokker–Planck equation. This, in turn, facilitates predicting a process in its original data space in a new way, one particularly suited to time series analysis.

Future efforts will address the introduction of discontinuities, which may arise for reasons mentioned in Sec. IV A. This will be necessary to properly handle cases where data sampling has occurred above the scale at which time and causal-state trajectories can be considered continuous. Similarly, when the characteristic scale of the observed system dynamics is much larger than the sampling scale, a model reproducing the dynamics of the sampled data may simply not be relevant. Extensions of the current approach are thus needed, possibly incorporating jump components to properly account for a measured system’s dynamics at different scales. This, of course, will bring us back to the computational mechanics of renewal and semi-Markov processes.5 

Another future challenge is to extend kernel ϵ-machine reconstruction to spatiotemporal systems, where temporal evolution depends not only on past times but also on spatially nearby state values.8,27,63,64 An archetypal example of these systems is found with cellular automata. In fact, any numerical finite element simulation performed on discrete grids also falls into this category, including reaction–diffusion chemical oscillations and hydrodynamic flows. Spacetime complicates the definition of the evolution operator compared to that of time series. However, the applicability of kernel ϵ-machine reconstruction would be greatly expanded to a large category of empirical data sets.

Last but not least are the arenas of information thermodynamics and stochastic thermodynamics. In short, it is time to translate recent results on information engines and their nonequilibrium thermodynamics21–23,28 to this broadened use of computational mechanics. This, in addition to stimulating theoretical advances, has great potential for providing new and powerful tools for analyzing complex physical and biological systems, not only their dynamics and statistics, but also their energy use and dissipation.

The authors thank Alexandra Jurgens for providing the code used for generating samples from the “mess3” machine used in Sec. VI B. We also thank Tyrus Berry for useful exchanges over nonlinear dimension reduction through diffusion map variants and operator estimation methods.46,49,50 The authors acknowledge the kind hospitality of the Institute for Advanced Study at the University of Amsterdam. This material is based upon work supported by, or in part by, Inria’s CONCAUST exploratory action, the Foundational Questions Institute (Grant No. FQXi-RFP-IPW-1902), the U.S. Army Research Laboratory and the U.S. Army Research Office (Grant No. W911NF-18-1-0028), and the U.S. Department of Energy (Grant No. DE-SC0017324).

The authors declare no conflict of interest.

The source code for the method described in this document is provided as free/libre software and is available from this page: https://team.inria.fr/comcausa/continuous-causal-states/. Experiments in Secs. VI A–VI C can be reproduced by retrieving the tagged version “first_arxiv” from the GIT archive and the experiment in Sec. VI D with the tagged version “chaos_submission.”

1. J. P. Crutchfield and K. Young, “Inferring statistical complexity,” Phys. Rev. Lett. 63, 105–108 (1989).
2. J. P. Crutchfield, “Between order and chaos,” Nat. Phys. 8, 17–24 (2012).
3. C. R. Shalizi and J. P. Crutchfield, “Computational mechanics: Pattern and prediction, structure and simplicity,” J. Stat. Phys. 104, 817–879 (2001).
4. S. E. Marzen and J. P. Crutchfield, “Inference, prediction, and entropy-rate estimation of continuous-time, discrete-event processes,” arXiv:2005.03750 (2020).
5. S. Marzen and J. P. Crutchfield, “Structure and randomness of continuous-time discrete-event processes,” J. Stat. Phys. 169(2), 303–315 (2017).
6. C. R. Shalizi, K. L. Shalizi, and J. P. Crutchfield, “Pattern discovery in time series, part I: Theory, algorithm, analysis, and convergence,” arXiv:cs/0210025 (2002).
7. G. M. Goerg and C. R. Shalizi, “Mixed LICORS: A nonparametric algorithm for predictive state reconstruction,” in Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (PMLR, 2013), pp. 289–297, see https://proceedings.mlr.press/v31/goerg13a.html.
8. A. Rupe, N. Kumar, V. Epifanov, K. Kashinath, O. Pavlyk, F. Schlimbach, M. Patwary, S. Maidanov, V. Lee, M. Prabhat et al., “DisCo: Physics-based unsupervised discovery of coherent structures in spatiotemporal systems,” in 2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC) (IEEE, 2019), pp. 75–87.
9. N. Brodu, “Quantifying the effect of learning on recurrent spiking neurons,” in 2007 International Joint Conference on Neural Networks (IEEE, 2007), pp. 512–517.
10. S. Klus, I. Schuster, and K. Muandet, “Eigendecompositions of transfer operators in reproducing kernel Hilbert spaces,” J. Nonlinear Sci. 30(1), 283–315 (2020).
11. C. C. Strelioff and J. P. Crutchfield, “Bayesian structural inference for hidden processes,” Phys. Rev. E 89, 042119 (2014).
12. S. Marzen and J. P. Crutchfield, “Predictive rate-distortion for infinite-order Markov processes,” J. Stat. Phys. 163(6), 1312–1338 (2016).
13. R. J. Elliott, L. Aggoun, and J. B. Moore, Hidden Markov Models: Estimation and Control, Applications of Mathematics Vol. 29 (Springer, New York, 1995).
14. C. R. Shalizi and K. L. Klinkner, “Blind construction of optimal nonlinear recursive predictors for discrete sequences,” in Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference (UAI 2004), edited by M. Chickering and J. Y. Halpern (AUAI Press, Arlington, VA, 2004), pp. 504–511.
15. N. Brodu, “Reconstruction of epsilon-machines in predictive frameworks and decisional states,” Adv. Complex Syst. 14(05), 761–794 (2011).
16. R. J. Elliott, L. Aggoun, and J. B. Moore, Hidden Markov Models: Estimation and Control (Springer, New York, 1994).
17. C. R. Shalizi, K. L. Shalizi, and R. Haslinger, “Quantifying self-organization with optimal predictors,” Phys. Rev. Lett. 93, 118701 (2004).
18. A. Jurgens and J. P. Crutchfield, “Shannon entropy rate of hidden Markov processes,” J. Stat. Phys. 183, 32 (2021).
19. A. M. Jurgens and J. P. Crutchfield, “Divergent predictive states: The statistical complexity dimension of stationary, ergodic hidden Markov processes,” Chaos 31(8), 083114 (2021).
20. A. Jurgens and J. P. Crutchfield, “Minimal embedding dimension of minimally infinite hidden Markov processes” (unpublished, 2020).
21. C. J. Ellison, J. R. Mahoney, and J. P. Crutchfield, “Prediction, retrodiction, and the amount of information stored in the present,” J. Stat. Phys. 136(6), 1005–1034 (2009).
22. J. P. Crutchfield, C. J. Ellison, and J. R. Mahoney, “Time’s barbed arrow: Irreversibility, crypticity, and stored information,” Phys. Rev. Lett. 103(9), 094101 (2009).
23. P. M. Riechers and J. P. Crutchfield, “Spectral simplicity of apparent complexity. II. Exact complexities and complexity spectra,” Chaos 28, 033116 (2018).
24. J. E. Hanson and J. P. Crutchfield, “Computational mechanics of cellular automata: An example,” Physica D 103, 169–189 (1997).
25. C. S. McTague and J. P. Crutchfield, “Automated pattern discovery—An algorithm for constructing optimally synchronizing multi-regular language filters,” Theor. Comput. Sci. 359(1–3), 306–328 (2006).
26. C. R. Shalizi, R. Haslinger, J.-B. Rouquier, K. L. Klinkner, and C. Moore, “Automatic filters for the detection of coherent structure in spatiotemporal systems,” Phys. Rev. E 73(3), 036104 (2006).
27. A. Rupe and J. P. Crutchfield, “Local causal states and discrete coherent structures,” Chaos 28(7), 075312 (2018).
28. P. M. Riechers and J. P. Crutchfield, “Fraudulent white noise: Flat power spectra belie arbitrarily complex processes,” Phys. Rev. Res. 3(1), 013170 (2021).
29. S. E. Marzen and J. P. Crutchfield, “Nearly maximally predictive features and their dimensions,” Phys. Rev. E 95(5), 051301 (2017).
30. S. Marzen and J. P. Crutchfield, “Informational and causal architecture of continuous-time renewal processes,” J. Stat. Phys. 168, 109–127 (2017).
31. S. Marzen, M. R. DeWeese, and J. P. Crutchfield, “Time resolution dependence of information measures for spiking neurons: Scaling and universality,” Front. Comput. Neurosci. 9, 89 (2015).
32. A. Smola, A. Gretton, L. Song, and B. Schölkopf, “A Hilbert space embedding for distributions,” in Algorithmic Learning Theory: 18th International Conference (Springer, Berlin, Heidelberg, 2007), Vol. 31, pp. 13–31.
33. L. Song, J. Huang, A. Smola, and K. Fukumizu, “Hilbert space embeddings of conditional distributions with applications to dynamical systems,” in Proceedings of the 26th Annual International Conference on Machine Learning (ACM, 2009), pp. 961–968.
34. N. Aronszajn, “Theory of reproducing kernels,” Trans. Am. Math. Soc. 68(3), 337–404 (1950).
35. B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for optimal margin classifiers,” in Proceedings of the Fifth Annual Workshop on Computational Learning Theory (ACM, 1992), pp. 144–152.
36. A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,” J. Mach. Learn. Res. 13, 723–773 (2012).
37. B. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schoelkopf, and G. Lanckriet, “Hilbert space embeddings and metrics on probability measures,” J. Mach. Learn. Res. 11, 1517–1561 (2010).
38. L. Song, K. Fukumizu, and A. Gretton, “Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models,” IEEE Signal Process. Mag. 30(4), 98–111 (2013).
39. K. Fukumizu, L. Song, and A. Gretton, “Kernel Bayes’ rule: Bayesian inference with positive definite kernels,” J. Mach. Learn. Res. 14(1), 3753–3783 (2013).
40. S. Grünewälder, G. Lever, L. Baldassarre, S. Patterson, A. Gretton, and M. Pontil, “Conditional mean embeddings as regressors,” in Proceedings of the 29th International Conference on Machine Learning (ACM, 2012).
41. B. Schölkopf, R. Herbrich, and A. J. Smola, “A generalized representer theorem,” in International Conference on Computational Learning Theory (Springer, 2001), pp. 416–426.
42. P. Honeine and C. Richard, “Solving the pre-image problem in kernel machines: A direct method,” in 2009 IEEE International Workshop on Machine Learning for Signal Processing (IEEE, 2009), pp. 1–6.
43. I. Schuster, M. Mollenhauer, S. Klus, and K. Muandet, “Kernel conditional density operators,” in International Conference on Artificial Intelligence and Statistics (PMLR, 2020), pp. 993–1004.
44. For Itô diffusions, these f are compactly supported and twice differentiable.
45. K. Jacobs, Stochastic Processes for Physicists: Understanding Noisy Systems (Cambridge University Press, 2010).
46. T. Berry, D. Giannakis, and J. Harlim, “Nonparametric forecasting of low-dimensional dynamical systems,” Phys. Rev. E 91(3), 032915 (2015).
47. It may be that S’s dimension is actually infinite so that its estimate grows with the number of samples. This can be detected using the spectral method presented in the main text.
48. R. R. Coifman and S. Lafon, “Diffusion maps,” Appl. Comput. Harmon. Anal. 21(1), 5–30 (2006).
49. T. Berry and J. Harlim, “Variable bandwidth diffusion kernels,” Appl. Comput. Harmon. Anal. 40(1), 68–96 (2016).
50. T. Berry and T. Sauer, “Local kernels and the geometric structure of data,” Appl. Comput. Harmon. Anal. 40(3), 439–469 (2016).
51. R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker, “Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps,” Proc. Natl. Acad. Sci. U.S.A. 102(21), 7426–7431 (2005).
52. R. R. Coifman, Y. Shkolnisky, F. J. Sigworth, and A. Singer, “Graph Laplacian tomography from unknown random projections,” IEEE Trans. Image Process. 17(10), 1891–1899 (2008).
53. R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker, “Geometric diffusions as a tool for harmonic analysis and structure definition of data: Multiscale methods,” Proc. Natl. Acad. Sci. U.S.A. 102(21), 7432–7437 (2005).
54. P. Drineas and M. W. Mahoney, “On the Nyström method for approximating a Gram matrix for improved kernel-based learning,” J. Mach. Learn. Res. 6, 2153–2175 (2005).
55. L. Song, X. Zhang, A. Smola, A. Gretton, and B. Schölkopf, “Tailoring density estimation via reproducing kernel moment matching,” in Proceedings of the 25th International Conference on Machine Learning (ACM, 2008), pp. 992–999.
56. R. Friedrich, J. Peinke, M. Sahimi, and M. R. R. Tabar, “Approaching complexity by stochastic methods: From biological systems to turbulence,” Phys. Rep. 506(5), 87–162 (2011).
57. R. Alexander and D. Giannakis, “Operator-theoretic framework for forecasting nonlinear time series with kernel analog techniques,” Physica D 409, 132520 (2020).
58. Here, DBSCAN with a threshold of 0.1 was used. However, any reasonable clustering algorithm will work given that clusters are well separated.
59. The probabilities sum exactly to 1. We show all the transitions found.
60. E. N. Lorenz, “Deterministic nonperiodic flow,” J. Atmos. Sci. 20, 130 (1963).
61. J. P. Crutchfield and B. S. McNamara, “Equations of motion from a data series,” Complex Syst. 1, 417–452 (1987), see https://www.complex-systems.com/abstracts/v01_i03_a03/.
62. E. N. Lorenz, “Predictability: A problem partly solved,” in Proceedings of the Seminar on Predictability (ECMWF, 1996), Vol. 1.
63. A. Rupe, K. Kashinath, N. Kumar, V. Lee, M. Prabhat, and J. P. Crutchfield, “Towards unsupervised segmentation of extreme weather events,” arXiv:1909.07520 (2019).
64. A. Rupe and J. P. Crutchfield, “Spacetime autoencoders using local causal states,” in AAAI Fall Series 2020 Symposium on Physics-Guided AI for Accelerating Scientific Discovery, arXiv:2010.05451 (2020).