We merge computational mechanics' definition of causal states (predictively equivalent histories) with reproducing-kernel Hilbert space (RKHS) representation inference. The result is a widely applicable method that infers causal structure directly from observations of a system's behaviors, whether they are over discrete or continuous events or time. A structural representation—a finite- or infinite-state kernel $\epsilon$-machine—is extracted by a reduced-dimension transform that gives an efficient representation of causal states and their topology. In this way, the system dynamics are represented by a stochastic (ordinary or partial) differential equation that acts on causal states. We introduce an algorithm to estimate the associated evolution operator. Paralleling the Fokker–Planck equation, it efficiently evolves causal-state distributions and makes predictions in the original data space via an RKHS functional mapping. We demonstrate these techniques, together with their predictive abilities, on discrete-time, discrete-value infinite Markov-order processes generated by finite-state hidden Markov models with (i) finite or (ii) uncountably infinite causal states and (iii) continuous-time, continuous-value processes generated by thermally driven chaotic flows. The method robustly estimates causal structure in the presence of varying external and measurement noise levels and for very high-dimensional data.

Computational mechanics is a mathematical framework for pattern discovery that describes how information is stored, structured, and transformed in a physical process. Its constructive application to observed data has been demonstrated for some time. Until now, though, success was limited by the need to strongly discretize observations or discover state-space generating partitions for correct symbolic dynamics. Exploiting modern machine-learning foundations in functional analysis, we broadly extend computational mechanics to inferring models of many distinct classes of structured processes, going beyond fully discrete data to processes with continuous data and those measured by heterogeneous instruments. Equations of motion for the evolution of process states can also be reconstructed from data. The method successfully recovers a process’s causal states and its dynamics in both discrete and continuous cases, including the recovery of noisy, high-dimensional chaotic attractors.

## I. INTRODUCTION

At root, the physical sciences and engineering turn on successfully modeling the behavior of a physical system from observations. To say that they have been successful at this is an understatement. That success begs a deep question, though—one requiring careful reflection. How, starting only from measurements, does one construct a model for behavioral evolution consistent with the given data?

Not surprisingly, dynamical-model inference has been tackled via a variety of theoretical frameworks. However, most approaches make strong assumptions about the internal organization of the data-generating process: Fourier analysis assumes a collection of exactly periodic oscillators; Laplace transforms, a collection of exponential relaxation processes; feed-forward neural networks, linearly coupled threshold units. We are all familiar with the results: A method is very successful when the data generator is in the assumed model class. The question begged in this is that one must know something—for some process classes, quite a lot—about the data before starting a successful analysis. And, if the wrong model class is assumed, this strategy tells one vanishingly little or nothing about how to find the correct one. This form of model inference is what we call *pattern recognition*: Does the assumed model class fairly represent the data and, in particular, the generator’s organization?

At some level of abstraction, one cannot escape these issues. Yet, the assumptions required by our framework are sufficiently permissive that we can attempt *pattern discovery*, in contrast with pattern recognition. Can we learn new structures, not assumed? Can we extract efficient representations of the data and their evolution? Do these yield optimal predictions?

Framed this way, it is not surprising that contemporary model inference is an active research area—one in which success is handsomely rewarded. Indeed, it is becoming increasingly important in our current era of massive data sets and cloud computing.

The following starts from and then extends *computational mechanics*^{1,2}—a mathematical framework that lays out the foundations for pattern discovery. The core idea is to statistically describe a system's evolution in terms of causally equivalent states—effective states, built only from given data, that lead to the same consequences in a system's future. Its theorems state that causal states and their transition dynamic can be reconstructed from data, giving the minimal, optimally predictive model.^{3} Constructively, for discrete-event, discrete-time or continuous-time processes, computational mechanics delineates the minimal causal representations—their *$\epsilon$-machines*.^{4,5} This leaves open, for example, $\epsilon$-machines for the rather large and important class of continuous-value, continuous-time processes and also related spatiotemporal processes. The practical parallel to these remaining challenges is that currently, there is no single algorithm that reconstructs $\epsilon$-machines from data in a reasonable time and without substantial discrete approximations.^{6–8}

Thus, one goal of the following is to extend computational mechanics beyond its current domains to continuous time and arbitrary data. This results in a new class of inference algorithms for *kernel $\epsilon$-machines* that, notably, can be less resource-demanding than their predecessors. At the very least, the algorithms work with a set of weaker assumptions about the given data, extending computational mechanics' structural representations to previously inaccessible applications with continuous data.

One of the primary motivations for using $\epsilon$-machine representations in the first place is not just modeling. Once in hand, their mathematical properties allow direct and efficient estimation of a range of statistical, information-theoretic, and thermodynamic properties of a process. This latter benefit, however, is not the focus of the following; those goals are the target of sequels.

In the new perspective, kernel $\u03f5$-machines are analogous to models associated with Langevin dynamics, which act on a set of state variables representing system configurations. The new continuous kernel $\u03f5$-machines can also be written in the form of a stochastic differential equation (SDE) that acts instead on the predictively equivalent causal states. Similarly, a Fokker–Planck equation describing the evolution of a distribution over causal states can be defined. It is then used to infer the evolution from an initial probability distribution over the model’s internal states (be it a real distribution induced by uncertainty or a delta function), which is then used to make predictions in the original data space.

Section II recalls the minimal set of computational mechanics required for the full development. Section III B establishes that the space of causal states has a natural metric and a well-defined measure using *reproducing-kernel Hilbert spaces*. This broadly extends computational mechanics to many kinds of processes, from discrete-value and -time to continuous-value and -time and everywhere in between, including spatiotemporal and network dynamical systems. Section IV then presents the evolution equations and discusses how to infer from empirical data the evolution operator for system-state distributions. Section VI demonstrates how to apply the algorithm to discover optimal models from realizations of discrete-time, discrete-value infinite Markov-order processes generated by finite-state hidden Markov models of varying complexity and continuous-time, continuous-value processes generated by thermally driven deterministic chaotic flows.

## II. COMPUTATIONAL MECHANICS: A SYNOPSIS

We first describe the main objects of study in computational mechanics—stochastic processes. Then, we define the effective or causal states and their transition dynamics directly in terms of predicting a process’s realizations.

### A. Processes

Though targeted to discover a stochastic process’s intrinsic structure, computational mechanics takes the task of predicting a process as its starting point. To describe how a process’s structure is discovered within a purely predictive framework, we introduce the following setup.

Consider a *system of interest* whose behaviors we observe over time $t$. We describe the set of behaviors as a *stochastic process* $P$. We require that the process is *nonanticipatory*: There exists a natural *filtration*—i.e., an ordered (discrete or continuous) set of indices $t \in \mathcal{T}$ associated to time—such that the observed behaviors of the system of interest only depend on past times.

$P$'s observed behaviors are described by random variables $X_t$, indexed by $t$ and denoted by a capital letter. A particular *realization* $X_t = x \in \mathcal{X}$ is denoted via a lowercase letter. We assume that $X$ is defined on the measurable space $(\mathcal{X}, \mathcal{B}_\mathcal{X}, \nu_\mathcal{X})$, where the domain $\mathcal{X}$ of $X$'s values could be a discrete alphabet or a continuous set. $\mathcal{X}$ is endowed with a reference measure $\nu_\mathcal{X}$ over its Borel sets $\mathcal{B}_\mathcal{X}$. In particular, $P$'s measure applies to time blocks: $\{\Pr(X_{a < t \le b}) : a < b,\ a, b \in \mathcal{T}\}$.

The *process* $(X_t)_{t \in \mathcal{T}}$ is the main object of study. We associate the *present* to a specific time $t \in \mathcal{T}$ that we map to $0$ without loss of generality. We refer to a process's *past* $X_{t \le 0}$ and its *future* $X_{t > 0}$. In particular, we allow $\mathcal{T} = \mathbb{R}$ and infinite past or future sequences.

Beyond the typical setting in which an observation $x$ is a simple scalar, here, realizations $X_{t=0} = x$ can encode, for example, a vector $x = [(m_i)_{t \le 0}]_{i=1,\dots,M}$ of $M$ measured time series $(m_i)_{t \le 0}$, each from a different sensor $m_i$, up to $t = 0$. Often, though, there are reasons to truncate such observation templates to a fixed-length past. Take, for example, the case of exponentially decaying membrane potentials where $x$ measures a neuron's electric activity^{9} or a lattice of spins with decaying spatial correlations. After several decay epochs, the templates can often be profitably truncated.

To emphasize, this contrasts with related works—e.g., Ref. 10—in that we need not consider $X_t$ to be only the presently observed value nor, for that matter, $Y_t = X_{t+1}$ to be its successor in a time series. In what follows, $X_t$ can be the full set of observations up to time $t$. This leads to substantial conceptual broadening, the consequences of which are detailed in Refs. 1–3.

### B. Predictions

As just argued, we consider that random variable $X_t$ contains information available from observing a system up to time $t$. At this stage, we just have given data, and we have made *no* assumptions about its utility for prediction. Consider another random variable $Y$ that describes system observations we wish to predict from $X$. A common example would be future sequences (possibly truncated, as just noted) that occur at times $t > 0$. We also assume that $Y$ is defined on a measurable space $(\mathcal{Y}, \mathcal{B}_\mathcal{Y}, \nu_\mathcal{Y})$.

A *prediction* then is the distribution of outcomes $Y$ given observed system configuration $X = x$ at time $t$, denoted $\Pr(Y \mid X_t = x)$. The same definition extends to nontemporal predictive settings by changing $t$ to the relevant quantity over which observations are collected, e.g., indices for pixels in an image.

### C. Process types

A common restriction on processes is that they are stationary—the same realization $x_t$ at different times $t$ occurs with the same probability.

#### Definition 1 (Stationarity)

A process $(X_t)_{t \in \mathcal{T}}$ is *stationary* if, at all times $t' \neq t$, the same distribution of its values is observed: $\Pr(X_{t'} = x) \equiv \Pr(X_t = x)$ for all $x \in \mathcal{X}$, except possibly on a null-measure set.

The following, in contrast with this common assumption, does not require $(Xt)t\u2208T$ to be stationary. In point of fact, such an assumption rather begs the question. Part of discovering a process’s structure requires determining from data if such a property holds. However, we assume the following from realization to realization.

#### Definition 2 (Conditional stationarity)

A process $(X_t)_{t \in \mathcal{T}}$ is *conditionally stationary* if, at all times $t' \neq t$, the same conditional distribution of next values $Y$ is observed: $\Pr(Y \mid X_{t'} = x) \equiv \Pr(Y \mid X_t = x)$ for all $x \in \mathcal{X}$, except possibly on a null-measure set.

That is, conditional stationarity allows marginals $\Pr(X_t)$ to depend on time. However, at *any given time*, for each realization of the process where the same $x$ is observed, the same conditional distribution is also observed. In a spatially extended context, the hypothesis could also encompass realizations at different locations for which the same $X = x$ is observed.

When multiple realizations of the same process are not available, we will instead consider a limited form of *ergodicity*: except for measure-zero sets, any one realization, measured over a long enough period, reveals the process's full statistics. In that case, observations of the same $X = x$ can be collected at multiple times to build a unique distribution $\Pr(Y \mid X = x)$, now invariant under time-shifting. It may be that the system undergoes structural changes: for example, a volcanic chamber slowly evolving over the course of a few months or weather patterns evolving as the general long-term climate changes. Then, we will only assume that the $\Pr(Y \mid X = x)$ distribution is stable over a long enough time window to allow its meaningful inference from data. It may slowly evolve over longer time periods.

If multiple realizations of the same process are available, this hypothesis may not be needed. Then, only Definition 2’s conditional stationarity is required for building the system states according to the method we now introduce.

### D. Causal states

Finally, we come to the workhorse of computational mechanics that forms the basis on which a process’s structure is identified.

#### Definition 3 (Predictive equivalence)

Two realizations $x$ and $x'$ are equivalent under the *predictive equivalence relation* $\sim_\epsilon$ when they lead to the same prediction,
$$x \sim_\epsilon x' \iff \Pr(Y \mid X = x) = \Pr(Y \mid X = x') .$$

In other words, observing two realizations $x$ and $x'$ means that the process is in the same effective state—the same equivalence class—if they lead to the same predictions.

Speaking in terms of pure temporal processes, two observed pasts $x$ and $x'$ that belong to the same predictive class $\epsilon(\cdot)$ are operationally indistinguishable. Indeed, by definition of the conditional distribution over futures $Y$, no new observation can help discriminate whether the causal state arose from past $x'$ or $x$. For all practical purposes, these pasts are equivalent, having brought the process to the same condition regarding future behaviors.

#### Definition 4 (Causal states^{1})

Since the classes $\{\epsilon(x) : x \in \mathcal{X}\}$ induce the same consequences—in particular, the same behavior distribution $\Pr(Y \mid X = x) = \Pr(Y \mid \epsilon(x))$—they capture a process's internal causal dependencies. They are a process's *causal states* $\sigma \in \mathcal{S}$.

Predictive equivalence can then be summarized as *the same causes lead to the same consequences*. Thanks to it, grouping a process's behaviors $X$ under the equivalence relation $\sim_\epsilon$ also gives the minimal partition required for optimal predictions: further refining the classes of pasts is useless, as just argued, while, at the same time, each class is associated with a unique distribution of possible outcomes.

This setup encompasses both deterministic observations, where $Y=f(X)$ is fixed and $Pr(Y|X=x)$ becomes a Dirac distribution with support $f(x)$, as well as stochastic observations. The source of stochasticity need not be specified: fundamental laws of nature, lack of infinite precision in measurements of a chaotic system, and so on.

Beyond identifying a process’s internal structure, predictive equivalence is important for practical purposes: it ensures that the partition induced by $\u223c\u03f5$ on $X$ is stable through time. Hence, data observed at different times $ti$ may be exploited to estimate the causal states. In practice, if the long-term dynamic changes, we assume that predictive equivalence holds over a short time window.

While causal states are rigorously defined as above, in empirical circumstances, one does not know all of a system's realizations and so one cannot extract its causal states exactly. Practically, we assume that the given data consist of a set of $N$ observations $(x_i, y_i, t_i)_{i=1,\dots,N}$. These data encode, for each configuration $x_i$ at time $t_i$, what subsequent configuration $y_i$ was then observed. The goal is to recover from such data an approximate set of causal states that model the system evolution; see, e.g., Refs. 11 and 12.
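The grouping of observed configurations into approximate causal states can be illustrated with a naive sketch: bin the observed $(x_i, y_i)$ pairs and merge configurations whose empirical conditional distributions agree. This toy stand-in is *not* the kernel-based algorithm developed below; the function name, the tolerance-based merging, and the golden-mean-like example process are all illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

def empirical_causal_states(pairs, tol=0.1):
    """Group configurations x by their empirical distribution Pr(y | x),
    merging those closer than `tol` in total-variation distance."""
    counts = defaultdict(lambda: defaultdict(int))
    for x, y in pairs:
        counts[x][y] += 1
    symbols = sorted({y for _, y in pairs})
    dists = {}
    for x, c in counts.items():
        v = np.array([c[y] for y in symbols], dtype=float)
        dists[x] = v / v.sum()
    states = []  # (representative distribution, member configurations)
    for x, d in sorted(dists.items()):
        for rep, members in states:
            if 0.5 * np.abs(rep - d).sum() < tol:  # total-variation distance
                members.append(x)
                break
        else:
            states.append((d, [x]))
    return states

# Golden-mean-like toy process: a 1 is never followed by another 1.
rng = np.random.default_rng(0)
seq = [0]
for _ in range(20000):
    seq.append(0 if seq[-1] == 1 else int(rng.random() < 0.5))

# Configurations are length-2 pasts x = (v_{t-1}, v_t); y is the next value.
pairs = [((seq[i], seq[i + 1]), seq[i + 2]) for i in range(len(seq) - 2)]
states = empirical_causal_states(pairs)
```

On this toy process, the pasts (0, 0) and (1, 0) merge (a fair coin follows either), while (0, 1) stands alone (the next symbol is forced to 0), so two approximate causal states are recovered.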

### E. Causal-state dynamic for discrete sequences

For now, assume that the data $x \in \mathcal{X}$ are a *past*—a sequence $x = (v_{-L_X < t \le 0})$ of discrete past values $v_t \in \mathcal{V}$ at discrete times $t \le 0$. This discrete setting helps introduce several important concepts, anticipating generalizations that follow shortly. Indices $t$ are now discrete observation times, ranging from $L_X$ in the past up to the present at $t = 0$. We allow $L_X = \infty$ when calculating causal states analytically on known systems. Similarly, we take $y \in \mathcal{Y}$ to be a *future*—a sequence $(v_{0 < t \le L_Y})$ of discrete future values $v_t \in \mathcal{V}$ that may also be truncated at time $L_Y$ in the future.

For discrete-value processes, $\mathcal{V}$ becomes an alphabet of symbols $v$, and $\mathcal{X}$ and $\mathcal{Y}$ both become semi-infinite (or truncated) sequences over that alphabet. The transition from time $t_0$ to time $t_1$ is associated with the observation of a new symbol $v \in \mathcal{V}$.

It should be noted that a surrogate space $\mathcal{V}$ can always be obtained from continuous-value data if regularly sampled at times $t_i$ with $t_{i+1} = t_i + \Delta t$ for a fixed $\Delta t$. In this case, pairs $(x_i, x_{i+1})$ are equivalent to observing transition symbols $v$. There are at most $|\mathcal{V}|$ such possible transitions from the current causal state $\sigma_0 = \epsilon(x_0)$ to which $x_0 = (v_{t \le 0})$ belongs.

Conditional symbol-emission probabilities $\Pr(v \in \mathcal{V} \mid \sigma)$ are defined (and can be empirically estimated) for each causal state $\sigma$ and for each emitted symbol $v \in \mathcal{V}$.^{3} Since $\sigma \in \mathcal{S}$ encodes all past influence, by construction, state-to-state transitions do not depend on previous states. That is, causal states are Markovian in this sense. The dynamic over causal states is then specified by a set of symbol-labeled causal-state transition matrices $\{T^{(v)}_{\sigma,\sigma'} : \sigma, \sigma' \in \mathcal{S},\ v \in \mathcal{V}\}$.

Diagrammatically, the causal states $\sigma$ form the nodes of a directed graph whose edges are labeled by symbols $v \in \mathcal{V}$ with associated transition probabilities $\Pr(v \mid \sigma)$. Moreover, the transitions are *unifilar*: the current state $\sigma_0$ and next symbol $v_0$ uniquely determine the next state $\sigma_1 = f(\sigma_0, v_0)$.
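The symbol-labeled transition structure just described can be sketched concretely. The two-state toy machine, its state indexing, and the helper functions below are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Symbol-labeled transition matrices T^{(v)} for a two-state toy machine
# (in the style of the golden mean process): state A (index 0) emits
# 0 or 1 with probability 1/2 each; state B (index 1) always emits 0.
T = {
    0: np.array([[0.5, 0.0],    # A --0--> A
                 [1.0, 0.0]]),  # B --0--> A
    1: np.array([[0.0, 0.5],    # A --1--> B
                 [0.0, 0.0]]),  # B never emits 1
}

def is_unifilar(T):
    """Unifilarity: each (state, symbol) pair leads to at most one next state."""
    return all(np.count_nonzero(M[s]) <= 1
               for M in T.values() for s in range(M.shape[0]))

def emission_probs(T, state):
    """Pr(v | sigma): total outgoing probability per symbol from `state`."""
    return {v: float(M[state].sum()) for v, M in T.items()}
```

For this machine, `is_unifilar(T)` holds, and state B's emission distribution is the forced `{0: 1.0, 1: 0.0}`.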

### F. $\epsilon$-Machines

Taken altogether, the set of causal states and their transition dynamic define the *$\epsilon$-machine*.^{1} Graphically, they specify the state-transition diagram of an edge-emitting, unifilar *hidden Markov model* (HMM). These differ in several crucial ways from perhaps-more-familiar state-emitting HMMs.^{13} For example, symbol emission in the latter depends on the state and is independent of state-to-state transition probabilities; see Fig. 1.

The principal reason for using an $\epsilon$-machine is that it is a process's *optimally predictive* model.^{3} Specifically, an $\epsilon$-machine is the minimal, unifilar, and unique edge-emitting HMM that generates the process.

Notably, $\epsilon$-machines' Markov property is derived from the predictive equivalence relation, thanks to the latter's conditioning on pasts. More generally, the causal states are an intrinsic property of a process. They do *not* reflect a choice of representation, in contrast to how state-emitting HMMs are often deployed; again, see Fig. 1. The same holds for state transitions and symbol emissions, all of which are uniquely inferred from the process.

Given that they make markedly fewer and less restrictive assumptions, it is not surprising that reconstruction algorithms for estimating $\epsilon$-machines from data are more demanding and costly to implement^{7,8,14,15} than performing a standard expectation–maximization estimate for a hypothesized HMM.^{16} Most $\epsilon$-machine algorithms rely on a combination of clustering empirically estimated sequence probability distributions $\Pr(Y \mid X)$ together with splitting the candidate causal states when required for consistency (unifilarity) of unique emissions $\Pr(v \in \mathcal{V} \mid \sigma \in \mathcal{S})$.^{6,17} To date, though, Bayesian inference for $\epsilon$-machines provides the most reliable estimation and works well on small data sets.^{11}

We mentioned that $\epsilon$-machines are unifilar edge-emitting HMMs. Smaller *nonunifilar* HMMs can exist that are not predictive, but rather *generate* data with the same statistical properties as the given process. However, one cost in using these smaller HMMs is the loss of determinacy for which state a process is in based on observations. The practical consequence is that the processes generated by nonunifilar HMMs typically have an uncountable infinity of causal states. This forces one to use probabilistic *mixed states*, which are distributions over the states of the generating HMM. References 18–20 develop the theory of mixed states for nonunifilar, generative HMMs. For simplicity, the following focuses on processes with finite "predictive" models—the $\epsilon$-machines. That said, Sec. VI B below analyzes an inference experiment using a process with an uncountable infinity of mixed states to probe the algorithm's performance.

### G. Patterns captured by $\epsilon$-machines

An $\epsilon$-machine's hidden states and transitions have a definite meaning and allow for proper definitions of process structure and organization—indicators of a process's complexity.^{2} For example, we can calculate in closed form various information-theoretic measures to quantify the information conveyed from the past to the future or that stored in the present.^{21} In this way, $\epsilon$-machines give a very precise picture of a process's information processing abilities.^{22,23}

More specifically, each causal state's surprisal can be used to build powerful data filters.^{8,24–27} The entropy of the causal states—the cost of coding them, the *statistical complexity*—can be used in entropy–complexity surveys of whole process families to discriminate purely random from chaotic data.^{2} Recent advances in signal processing^{28} show how processes with arbitrarily complex causal structures can still exhibit a flat power spectrum since the spectrum is the Fourier transform of only the two-point autocorrelation function. This demonstrates the benefit of inferring process structures using the full setup presented above—taking $X_t$ to encompass all information up to time $t$ rather than restricting $X_t$ to the present observation.^{23,28}

However, these benefits have been circumscribed. Many previous $\epsilon$-machine inference methods work with symbolic (discrete-value) data in discrete time. In practice, though, we often monitor continuous physical processes at arbitrary sampling rates, and these measurements can take a continuum of values within a range $\mathcal{V}$. In these cases, estimation algorithms rely on clustering causal states by imposing arbitrary boundaries between discretized groups, even though there may be a fractal set or continuum of causal states.^{29} More recent approaches consider continuous time but keep discrete events, such as renewal and semi-Markov processes.^{5,30,31}

The method introduced below is, to our knowledge, the first that estimates causal states for essentially arbitrary data types and represents their dynamic in continuous time. It relies on a radically different set of assumptions and algorithmic complexity from previous approaches. While it also applies to the discrete case (see Sec. VI A), it vastly expands computational mechanics' applicability to process classes that were previously inaccessible.

## III. CONSTRUCTING CAUSAL STATES USING REPRODUCING KERNELS

The predictive-equivalence relation expresses a process's structure through conditional distributions. To work directly with conditional distributions, this section recalls the main results of the geometric view of probability distributions as points in a reproducing-kernel Hilbert space. Following the method of Refs. 32 and 33, we describe both unconditional and conditional distributions; both are needed in computational mechanics. Once these are established in the RKHS setting, we describe the geometry of causal states.

### A. Distributions as points in a reproducing-kernel Hilbert space

Consider a function $k_X$ of two arguments, defined on $\mathcal{X} \times \mathcal{X} \to \mathbb{R}$ (resp., $\mathbb{C}$). Fixing one argument to $x \in \mathcal{X}$ and leaving the second free, we consider $k_X(x, \cdot)$ as a function in the Hilbert space $\mathcal{H}_X$ of functions from $\mathcal{X}$ to $\mathbb{R}$ (resp., $\mathbb{C}$). If $k_X$ is a symmetric (resp., sesquilinear) positive-definite function,^{34} then the *reproducing property* holds: For any function $f \in \mathcal{H}_X$ and for all $x \in \mathcal{X}$, we have $\langle f, k_X(x, \cdot) \rangle_{\mathcal{H}_X} = f(x)$. Here, $\langle \cdot, \cdot \rangle_{\mathcal{H}_X}$ is the inner product in $\mathcal{H}_X$ or a *completion* of $\mathcal{H}_X$; see Ref. 34. $\mathcal{H}_X$ is known as the *reproducing-kernel Hilbert space* (RKHS) associated with the *kernel* $k_X$.

Kernel functions $k_X$ are easy to construct and so have been defined for a wide variety of data types, including vectors, strings, graphs, and so on. A product of kernels corresponds to the direct product of the associated Hilbert spaces.^{34} Thus, products maintain the reproducing property. Due to this, it is possible to compose kernels when working with heterogeneous data types.

Kernels are widely used in machine learning. A common use is to convert a given linear algorithm, such as estimating *support vector machines*, into a nonlinear algorithm.^{35} Indeed, when a linear algorithm can be written entirely in terms of inner products, scalings, and sums of observations $x$, then it is easy to replace the inner products in the original space $\mathcal{X}$ by inner products in the Hilbert space $\mathcal{H}_X$. The $k_X(x, \cdot)$ functions are then called *feature maps* and $\mathcal{H}_X$ the *feature space*.

Returning to the original space $\mathcal{X}$, the algorithm now works with kernel evaluations $k_X(x_1, x_2)$ whenever an inner product $\langle k_X(x_1, \cdot), k_X(x_2, \cdot) \rangle_{\mathcal{H}_X}$ is encountered. In this way, the linear algorithm in $\mathcal{H}_X$ has been converted to a nonlinear algorithm in the original space $\mathcal{X}$. Speaking simply, what was nonlinear in the original space is linearized in the associated Hilbert space. This powerful "kernel trick" is at the root of the probability-distribution mapping that we now recall from Ref. 32.
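The kernel trick can be made concrete with a minimal sketch: the squared feature-space distance between two inputs expands into three kernel evaluations, so it is computed without ever constructing the feature map. The Gaussian kernel and function names here are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(x1, x2, bandwidth=1.0):
    # k(x1, x2) = <phi(x1), phi(x2)> for an implicit feature map phi.
    d2 = np.sum((np.asarray(x1, float) - np.asarray(x2, float)) ** 2)
    return np.exp(-d2 / (2 * bandwidth**2))

def feature_distance2(x1, x2, k=gaussian_kernel):
    # ||phi(x1) - phi(x2)||^2 = k(x1, x1) - 2 k(x1, x2) + k(x2, x2):
    # three kernel evaluations, no explicit feature map needed.
    return k(x1, x1) - 2 * k(x1, x2) + k(x2, x2)

d = feature_distance2([0.0, 0.0], [3.0, 4.0])
# Distant inputs approach the maximum of 2, since k(x, x) = 1 here.
```

Any algorithm phrased in terms of such distances or inner products can thus be "kernelized" by swapping in a different kernel.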

#### 1. Unconditional distributions

Consider a probability distribution $\Pr(X)$ of the random variable $X$. Then, the average map $\alpha \in \mathcal{H}_X$ of the kernel evaluations in the RKHS is given by $\alpha = \mathbb{E}_X[k_X(x, \cdot)]$. Note that, for any $f \in \mathcal{H}_X$, the reproducing property gives
$$\langle \alpha, f \rangle_{\mathcal{H}_X} = \mathbb{E}_X\!\left[\langle k_X(x, \cdot), f \rangle_{\mathcal{H}_X}\right] = \mathbb{E}_X[f(x)] .$$

An estimator for this average map can be computed simply as $\hat{\alpha} = \frac{1}{N} \sum_{i=1}^N k_X(x_i, \cdot)$. This estimator is consistent^{32} and so converges to $\alpha$ in the limit $N \to \infty$.
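A minimal sketch of this estimator, assuming one-dimensional data and a Gaussian kernel (both illustrative choices): $\hat{\alpha}$ is simply a function built from kernel evaluations at the samples.

```python
import numpy as np

def mean_map(samples, bandwidth=1.0):
    """Estimator alpha_hat = (1/N) sum_i k(x_i, .) of the kernel mean
    embedding of Pr(X), for a 1-D Gaussian kernel."""
    samples = np.asarray(samples, dtype=float)

    def alpha_hat(x):
        # Evaluating at x is, by the reproducing property, <alpha_hat, k(x, .)>.
        return float(np.mean(np.exp(-(samples - x) ** 2 / (2 * bandwidth**2))))

    return alpha_hat

rng = np.random.default_rng(1)
alpha = mean_map(rng.normal(0.0, 1.0, size=5000))
# The embedding of a standard normal is larger near its mode than in the tail.
```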

Consider now the two-sample test problem: We are given two sets of samples taken from *a priori* distinct random variables $A$ and $B$, both valued in $\mathcal{X}$ but with possibly different distributions $\Pr(A) = P_A$ and $\Pr(B) = P_B$. Do these distributions match: $P_A = P_B$? This scenario is a classical statistics problem, and many tests have been designed to address it, including the chi-square and Kolmogorov–Smirnov tests.

Using the RKHS setup and the average map, a new test^{36} is simply to compute the distance between the average maps using the Hilbert-space norm: $\Vert \alpha_A - \alpha_B \Vert_{\mathcal{H}_X}$. Under suitable mild conditions on the kernel, it can be shown^{36} that $\Vert \alpha_A - \alpha_B \Vert_{\mathcal{H}_X} = 0$ if, and only if, $P_A = P_B$ up to a set of points with null measure in $\mathcal{X}$.

This *maximum mean discrepancy* (MMD) test is consistent and accurate. Moreover, confidence levels can be obtained through bootstrapping or other techniques described in Ref. 36. In practice, for samples $\{a_i\}_{i=1 \dots N}$ and $\{b_j\}_{j=1 \dots M}$, $\Vert \hat{\alpha}_A - \hat{\alpha}_B \Vert_{\mathcal{H}_X}$ can be computed via
$$\Vert \hat{\alpha}_A - \hat{\alpha}_B \Vert^2_{\mathcal{H}_X} = \langle \hat{\alpha}_A, \hat{\alpha}_A \rangle_{\mathcal{H}_X} - 2 \langle \hat{\alpha}_A, \hat{\alpha}_B \rangle_{\mathcal{H}_X} + \langle \hat{\alpha}_B, \hat{\alpha}_B \rangle_{\mathcal{H}_X} ,$$

where inner products between $\hat{\alpha}_A = \frac{1}{N} \sum_{i=1}^N k_X(a_i, \cdot)$ and $\hat{\alpha}_B = \frac{1}{M} \sum_{j=1}^M k_X(b_j, \cdot)$ can easily be developed into sums of kernel evaluations $k_X(a_i, b_j)$.
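A minimal sketch of this computation, assuming one-dimensional samples and a Gaussian kernel (the function name and the test data are illustrative):

```python
import numpy as np

def mmd2(A, B, bandwidth=1.0):
    """Biased estimate of ||alpha_A - alpha_B||^2 for a 1-D Gaussian
    kernel, developed into sums of kernel evaluations."""
    def gram(U, V):
        return np.exp(-(U[:, None] - V[None, :]) ** 2 / (2 * bandwidth**2))

    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # <a_A, a_A> - 2 <a_A, a_B> + <a_B, a_B>, each a mean of kernel values.
    return gram(A, A).mean() - 2 * gram(A, B).mean() + gram(B, B).mean()

rng = np.random.default_rng(2)
a = rng.normal(0.0, 1.0, 500)
b = rng.normal(0.0, 1.0, 500)  # same distribution as a
c = rng.normal(3.0, 1.0, 500)  # shifted distribution
# mmd2(a, b) stays near zero while mmd2(a, c) does not.
```

Significance thresholds for deciding whether a nonzero value is meaningful would come from bootstrapping, as noted above.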

This test makes it possible to compare two distributions *without* density estimation and directly from data. For all practical purposes, under mild technical conditions,^{37} a distribution $\Pr(X)$ of random variable $X$ can then be represented as a point in the RKHS $\mathcal{H}_X$, consistently estimated by the mean mapping $\hat{\alpha}$. The RKHS norm then becomes a true distance between probability distributions.

#### 2. Conditional distributions

Consider random variables $X$ and $Y$ with the same notations as above. The joint variable $(X, Y)$ leads to a direct product Hilbert space $\mathcal{H}_X \otimes \mathcal{H}_Y$,^{34} with the product kernel,
$$k_{XY}\big((x, y), (x', y')\big) = k_X(x, x')\, k_Y(y, y') .$$

For functions $f \in \mathcal{H}_X$ and $g \in \mathcal{H}_Y$, a covariance operator $C_{YX} : \mathcal{H}_X \to \mathcal{H}_Y$ can be defined such that
$$\langle g, C_{YX} f \rangle_{\mathcal{H}_Y} = \mathbb{E}_{XY}[f(x)\, g(y)] .$$

Similarly, $C_{XX}$ can be defined in the case $Y = X$.

Then, under strong conditions on the kernel,^{33,38} we can relate the conditional mean map in $\mathcal{H}_Y$ to the conditioning point in $\mathcal{H}_X$ using
$$s_{Y|X=x} = C_{YX}\, C_{XX}^{-1}\, k_X(x, \cdot) .$$

The strong conditions can be relaxed to allow the use of a wide range of kernels by considering a regularized version,
$$\hat{s}_{Y|X=x} = C_{YX}\, (C_{XX} + \epsilon I)^{-1}\, k_X(x, \cdot) ,$$
with $I$ the identity operator and $\epsilon > 0$ a regularization parameter.

This is a consistent estimator of the unregularized version when computed empirically from samples.^{39}

As for the unconditional case, the conditional mean map
$$s_{Y|X=x} = \mathbb{E}_{Y|X=x}[k_Y(y, \cdot)]$$

can be seen as uniquely representing the distribution $\Pr(Y \mid X = x)$, up to a null-measure set.

The set $\mathcal{S} = \{s_{Y|X=x}\}_{x \in \mathcal{X}} \subset \mathcal{H}_Y$ traces out all possible conditional distributions $\Pr(Y \mid X = x)$ for all $x \in \mathcal{X}$. It inherits the RKHS norm from $\mathcal{H}_Y$: $\Vert s_1 - s_2 \Vert_{\mathcal{H}_Y}$ is well-defined for any $s_1, s_2 \in \mathcal{S}$. Note that $s_{Y|X=x} = \varsigma(x)$ can also be interpreted as a surjective function $\varsigma : \mathcal{X} \to \mathcal{S}$, selecting a point in the RKHS $\mathcal{H}_Y$ for each $x$.

The connection with the regression problem for estimating $\hat{s}_{Y|X=x}$ from data,^{38,40} together with the representer theorem,^{41} ensures that $\hat{s}_{Y|X=x}$ lies in the data span,
$$\hat{s}_{Y|X=x} = \sum_{i=1}^N \omega_i(x)\, k_Y(y_i, \cdot) ,$$

with $N$ being the number of samples. For all practical purposes, then, dimension $N$ is sufficient when working with $\mathcal{S} \subset \mathcal{H}_Y$. This is in contrast to working with the infinite dimension of the underlying $\mathcal{H}_Y$. The unconditional case shown in Sec. III A 1 is recovered when $\omega_i(x) = 1/N$, in which case no dependency on $x$ remains.

A regularized and consistent estimator^{33,38,39} is given by
$$\omega(x) = \big(G^X + \epsilon I\big)^{-1} K(x) ,$$

with $G^X_{ij} = k_X(x_i, x_j)$ being the Gram matrix of the $X$ samples and $K(x)$ a column vector such that $K_i(x) = k_X(x, x_i)$. Thanks to Ref. 40, $\epsilon$ can be set by cross-validation.

In practice, it is more efficient to equivalently solve the linear system,
$$\big(G^X + \epsilon I\big)\, \omega(x) = K(x) ,$$

to find the vector $\omega(x)$. Note that, for an invertible matrix and without regularization, $\omega(x)$ is simply the vector with the only nonnull entry $\omega_i = 1$ for $x = x_i$. In practice, there may be duplicate entries (e.g., for discrete data) or nearby values for which the regularization becomes necessary.
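A minimal sketch of this linear solve, assuming a one-dimensional Gaussian kernel (the function name and parameter choices are illustrative):

```python
import numpy as np

def conditional_weights(x_samples, x, bandwidth=1.0, eps=1e-6):
    """Solve (G + eps I) w = K(x) for the coefficients w(x) representing
    Pr(Y | X = x) as sum_i w_i(x) k_Y(y_i, .). 1-D Gaussian kernel assumed."""
    x_samples = np.asarray(x_samples, dtype=float)
    G = np.exp(-(x_samples[:, None] - x_samples[None, :]) ** 2
               / (2 * bandwidth**2))                        # Gram matrix G_ij
    K = np.exp(-(x_samples - x) ** 2 / (2 * bandwidth**2))  # K_i(x)
    return np.linalg.solve(G + eps * np.eye(len(x_samples)), K)

# For well-separated samples and a small regularizer, querying at a sample
# point x_i approximately recovers the indicator vector with w_i = 1.
w = conditional_weights([0.0, 10.0, 20.0], x=10.0)
```

With duplicate or nearby samples, the Gram matrix becomes ill-conditioned and the regularizer `eps` carries the solve, spreading weight across the near-duplicates as described below.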

In this way, employing a suitable kernel and regularization, a conditional probability distribution $Pr(Y|X=x)$ is represented in the RKHS $H_Y$ as a point,

$$\hat{s}_{Y|X=x} = \sum_{i=1}^{N} \omega_i(x)\, k_Y(y_i,\cdot),$$

with a reasonably easy way to estimate the coefficients $\omega(x)$ from data.
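As an illustration, the estimation of $\omega(x)$ reduces to one regularized linear solve. The sketch below assumes a 1-D Gaussian kernel and made-up sample values; it is not the paper's implementation.

```python
import numpy as np

def gaussian_gram(xs, bw=1.0):
    # G^X_ij = k_X(x_i, x_j) for a 1-D Gaussian kernel (bandwidth bw is an assumption)
    return np.exp(-(xs[:, None] - xs[None, :]) ** 2 / (2 * bw**2))

def embedding_weights(xs, x, eps=1e-3, bw=1.0):
    # Solve the linear system (G^X + eps I) w = K(x) for the coefficients w(x)
    G = gaussian_gram(xs, bw)
    K = np.exp(-(xs - x) ** 2 / (2 * bw**2))
    return np.linalg.solve(G + eps * np.eye(len(xs)), K)

# Two pairs of near-duplicate samples: regularization spreads the weights
# over nearby x values, as described above.
xs = np.array([0.0, 0.1, 2.0, 2.1])
w = embedding_weights(xs, 0.05)
```

For a query near the first cluster, the weights concentrate on the two nearby samples, illustrating how near-duplicates share one estimated conditional distribution.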

In this light, the full $\Omega$ matrix obtained via

$$\Omega = \left( G^X + \epsilon I \right)^{-1} G^X,$$

whose columns are the weight vectors $\omega(x_j)$, can also be seen as a way to “spread” the influence of the $(x_i,y_i)$ observations to nearby $x_i$ values so that all duplicate $x_i$ effectively belong to the same estimated conditional distribution.

### B. RKHS causal states

Section II D defined a process’s causal states $\sigma\in S$ via the predictive equivalence relation,

$$x \sim_\epsilon w \iff Pr(Y|X=x) = Pr(Y|X=w).$$

Each equivalence class is exactly the preimage $E=\varsigma^{-1}(s_{Y|x})\subset X$ of the unique point $s_{Y|E}\in H_Y$, with $s_{Y|E}=s_{Y|w}$ for all $w\in E=\epsilon(x)$. Therefore, we can refer equivalently to one or the other concept: we denote the set in $X$ by the equivalence class and the point in the RKHS by $s\in H_Y$.

In short, the set $S=\{s_{Y|X=x}\}_{x\in X}\subset H_Y$ is the set of causal states. Note that $S$ is a subset of all possible conditional probability-distribution mappings in $H_Y$: it consists of only the mappings that actually correspond to some $x$ in the domain of interest. We now drop the distinction and refer only to causal states $\sigma\in S$, with associated random variable $\mathcal{S}$.

Let $B$ be the Borel sets over $S$. For any set $\beta\in B$, its preimage in $X$ is defined by

$$\varsigma^{-1}(\beta) = \bigcup_{\sigma\in\beta} \varsigma^{-1}(\sigma).$$

Recall that, by definition, the causal states form a partition of $X$. Hence, each $x$ belongs to a unique preimage $\varsigma^{-1}(\sigma)$ in the union above.

Recall that $\nu_X$ is the reference measure on $X$, in terms of which the probability distribution $Pr(X)$ is defined. A natural push-forward measure $\mu$ is defined on $(S,B)$ by

$$\mu(\beta) = \nu_X\!\left(\varsigma^{-1}(\beta)\right).$$

If $Pr(X)$ admits a probability density function $p=dPr(X)/d\nu_X$—the *Radon–Nikodym derivative* of $Pr(X)$ with respect to $\nu_X$—then we can similarly push forward the density on $X$ to define a density $q$ of causal states. The distribution $Q$ of states over $S$ is then defined by

$$Q(\beta) = \int_\beta q \, d\mu .$$

Note that the measure $\mu$ is defined on (and restricted to) $S$ and that no equivalent of the Lebesgue measure exists for the whole infinite-dimensional space $H_Y$.

The net consequence is that causal states are to be viewed simultaneously as:

1. Sets of points in $X$, using the predictive equivalence relation $\sim_\epsilon$, with equivalence classes forming a partition of $X$. Though this is the original definition, presented in Sec. II D, it is still very useful in the RKHS setup to define the push-forward measure.

2. Conditional probability distributions of possible outcomes $Y$ given observed system configurations $X$. This view is used in clustering algorithms^{7,15} to identify causal states. These algorithms directly map to the RKHS setting by using the MMD test instead of other statistical tests for clustering conditional distributions.

3. Points in the RKHS $H_Y$: more specifically, points in the subset $S=\{\sigma_{Y|X=x}\}_{x\in X}$. The subset $S$ (and only $S$, not the rest of $H_Y$) is endowed with a push-forward measure $\mu$. Thus, we can properly define probability distributions and densities of causal states in this Hilbert space setting.

Compared to previous works, this third alternate view offers several advantages:

(Nearly) arbitrary data types can be handled through the use of reproducing kernels. We are no longer limited to discrete alphabets $V$ of symbolic values. Adequate kernels exist for many data types and can even be composed to work directly with heterogeneous data types.

The distance $\Vert\cdot\Vert_{H_Y}$ can serve as the basis for clustering similar states together. Thanks to the MMD test introduced in Sec. III A 1, algorithms need not rely on estimating the conditional densities $Pr(Y|X=x)$ for clustering states.
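As an illustration, the squared RKHS distance between two conditional-distribution embeddings reduces to a quadratic form in the $Y$ Gram matrix. The toy 1-D samples and Gaussian kernel below are assumptions for the example, not the paper's data.

```python
import numpy as np

def mmd2(w1, w2, GY):
    # Squared RKHS distance ||s1 - s2||^2 between two embeddings
    # s_k = sum_i w_k[i] k_Y(y_i, .), expanded via the Y Gram matrix GY.
    d = w1 - w2
    return float(d @ GY @ d)

# Toy Y samples and Gaussian Gram matrix (an assumed kernel choice)
ys = np.array([0.0, 0.1, 3.0])
GY = np.exp(-(ys[:, None] - ys[None, :]) ** 2 / 2)

w_a = np.array([1.0, 0.0, 0.0])   # embedding concentrated on y_0
w_b = np.array([0.0, 1.0, 0.0])   # concentrated on the nearby y_1
w_c = np.array([0.0, 0.0, 1.0])   # concentrated on the distant y_2
```

States whose future distributions concentrate on nearby $y$ values come out close in $H_Y$, which is what a clustering algorithm based on this distance exploits.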

## IV. DYNAMICS OF RKHS CAUSAL STATES

The preceding constructed causal states $\sigma\in S$ statically. Computational mechanics aims at modeling a process’s behaviors, though, not just describing its constituents. A process’s dynamics induce trajectories over $S$. Sections IV A–IV C describe them, then define a new class of predictive models—the kernel $\epsilon$-machines—and explain how to use them for computing statistics of the given data, for simulating new “data,” and for predicting a process.

### A. Causality and continuity

Consider a series of causal states $\ldots\sigma_{t-1}\sigma_t\sigma_{t+1}\ldots$. Distinct pasts $x\in\sigma_0$ and $w\in\sigma_0$ both contain the same information concerning the process dynamics since they induce the same distribution of futures $Pr(Y|\sigma_0)$. However, $\sigma_0$ does not contain the information about which past is responsible for arriving at it: whether $x$ or $w$ or any other past led to the same state $\sigma_0$.

So, what kind of causality is entailed by causal states? Causal states capture the same global information needed to describe how the future is impacted by the present, and they consist of all possible ways that information was organized in the past. Unifilarity can now be understood as a change of state $\delta \sigma $ uniquely determined by the information gain. In contrast, the equivalent of nonunifilarity would be that despite having all useful knowledge about the past (previous state) and having all possible information about the change of system configuration (either $\delta x$ or $\delta w$, and so on), the model states are still not defined unequivocally. This is not the case with the definition of the causal states, but would be with other generative models, for which only a distribution of states can be inferred from data.^{29}

The continuity of causal-state trajectories can be understood from an information-theoretic perspective. The relative information $d_{KL}(x_{t+dt}\Vert x_t)$ (Kullback–Leibler divergence) between $x_t$ and its evolution after an infinitesimal time $t+dt$ is simply the change of information we have on $x$. Assuming that information comes only at finite velocity, then $d_{KL}(x_{t+dt}\Vert x_t)\to 0$ as $dt\to 0$. However, it is known that $d_{KL}(x_{t+dt}\Vert x_t)\to 0$ implies $\Vert x_{t+dt}-x_t\Vert_{H_X}\to 0$.^{37} Using Sec. III A 2’s construction of conditional distributions, we find that $\Vert\sigma_{t+dt}-\sigma_t\Vert_{H_Y}\to 0$ as $dt\to 0$ with continuous kernels and positive definite operators. Hence, causal-state trajectories are continuous.

In practice, assumptions leading to continuous trajectories need not be respected:

For practical reasons, it is rarely possible to work with infinite series. Truncating the past and the future can be necessary. In these cases, there is information not fully contained in $\hat{x}$—the truncated estimate of $x$. Truncation can be justified when the causal influence of past system configurations onto the present decays sufficiently fast with time; then, we ignore old configurations with negligible impact (and similarly for the future). This amounts to saying that there are no long-range correlations and that no useful information for prediction is lost when ignoring these past configurations—hence, that $Pr(Y|X)$ is unchanged by the truncation. This hypothesis may fail, of course, with the consequence that truncating the past or future may introduce jumps in the estimated trajectories of the causal states in $S$—a jump induced by the sudden loss or gain of information.

When using continuous kernel functions, the associated RKHS norm is also continuous in its arguments: $\Vert\sigma_{Y|X=x_1}-\sigma_{Y|X=x_2}\Vert_{H_Y}\to 0$ as $x_1\to x_2$. Hence, continuity in data trajectories also implies continuity in causal-state trajectories. However, when data are truly discrete in nature, the situation of Sec. II E is recovered. An alternate view is that, at a fundamental level, information comes in small packets: these are the symbols induced by the data changes in this discrete scenario.

It may also be that measurements are performed at a finite time scale $\tau $. Then, the information gained between two consecutive measurements can be arbitrarily large but still appear instantaneously at the scale at which data are measured. This leads to apparent discontinuities.

Here, we discuss only the elements needed to address causal states in continuous time with nearly arbitrary data, i.e., for which a reproducing kernel exists. A full treatment of discontinuities is beyond the present scope and best left for the future.

### B. Continuous causal-state trajectories

From here on, assume that trajectories of causal states $\sigma\in S$ are continuous. Recall, though, that $S\subset H_Y$—a metric space. This guarantees the existence of a reference Wiener measure $\omega$ on the space of trajectories defined over a time interval $[0,t<T]$ (possibly $T=\infty$) with a canonical Wiener process $W$. With $\sigma_t$ being a state-continuous Markov process, we posit that its actual trajectories evolve under an Itô diffusion—a stochastic differential equation (SDE) of the form

$$d\sigma_t = a(\sigma_t)\,dt + b(\sigma_t)\,dW_t,$$

where $a(\sigma_t)$ is a (deterministic) *drift* and $b(\sigma_t)$ a *diffusion*. Both depend on the current state $\sigma_t$; hence, this Itô diffusion is inhomogeneous in state space. However, the coefficients are stable in time—$a(\sigma_t,t)=a(\sigma_t)$ and $b(\sigma_t,t)=b(\sigma_t)$—if the (possibly limited) ergodicity assumption is made in addition to the conditional stationarity assumption of Sec. II D. Therefore, the diffusion is homogeneous in time. This SDE is the equivalent of the $\epsilon$-machine’s causal-state-to-state transition probabilities introduced in the discrete case (Sec. II E). The evolution equation encodes, for each causal state $\sigma_t$, how the process evolves to nearby states at $t+dt$.
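A minimal Euler–Maruyama sketch of such an Itô diffusion follows; the 1-D drift $a(s)=-s$ and constant diffusion $b=0.1$ are assumptions chosen purely for illustration.

```python
import numpy as np

# Euler-Maruyama integration of d sigma_t = a(sigma_t) dt + b(sigma_t) dW_t,
# with toy coefficients a(s) = -s (drift toward 0) and b(s) = 0.1 (constant).
rng = np.random.default_rng(4)
dt, T = 0.01, 1000
s = np.empty(T)
s[0] = 2.0
for t in range(T - 1):
    drift = -s[t]                               # a(sigma_t), an assumed toy drift
    diffusion = 0.1                             # b(sigma_t), an assumed constant
    s[t + 1] = s[t] + drift * dt + diffusion * np.sqrt(dt) * rng.normal()
```

The trajectory decays toward the drift's fixed point while fluctuating at a scale set by the diffusion coefficient, mirroring how a state distribution concentrates under the dynamics.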

The causal-state behavior arises from integrating over time,

$$\sigma_t = \sigma_0 + \int_0^t a(\sigma_s)\,ds + \int_0^t b(\sigma_s)\,dW_s,$$

for any state $\sigma_0$ taken as the trajectory’s initial condition. As for the evolution of the causal-state distribution, consider a unit probability mass initially concentrated on $\delta(\sigma_0)$. Then, the states $\sigma_t$ reached at time $t$ define a probability distribution $Pr(\beta\in B|\sigma_0,t)$ on $S$, with associated density $p(\sigma_t|\sigma_0)$. (Recall that the latter is defined as $p=dPr(\cdot)/d\mu$ with $\mu$ being the push-forward measure from $X$; see Sec. III B.) This distribution encodes the state-to-state transition probabilities at any time $t$, parallel to iterating an $\epsilon$-machine in the discrete case.

The evolution of probability densities $p$ other than $\delta(\sigma_0)$ is governed by a Fokker–Planck equation. The infinitesimal generator $\Gamma$ of the process $\sigma_t$ is

$$\Gamma f = a \cdot \nabla f + \tfrac{1}{2}\, b b^{\top} \!:\! \nabla\nabla f,$$

where $\Gamma$ is an operator applied on a suitable class of functions $f$.^{44} For a state distribution $Q$, with associated density $q=dQ/d\mu$, the Fokker–Planck equation corresponding to the Itô diffusion can be written in terms of the generator’s adjoint $\Gamma^*$,

$$\partial_t q = \Gamma^{*} q .$$
Restricting to the data span in $S$ (see Sec. III A 2), using the representer theorem and the samples as a pseudo-basis,^{41} the operator $\Gamma$ can be represented as a vector equation, in a construction similar to classical Euclidean $\mathbb{R}^N$ state spaces; see Ref. 45, p. 103. There are, however, practical difficulties in estimating the coefficients $a$ and $b$ (using tangent spaces) at each observed state $\sigma_t$ in the RKHS.

As an alternative, we represent the generator in a suitable functional basis in which $q$ is also represented as a set of coefficients. Then, the state distribution evolves directly under

$$q_t = e^{(t-t_0)\Gamma^{*}} q_{t_0} = E(t)\left[q_{t_0}\right].$$

The evolution operator $E(t)=e^{(t-t_0)\Gamma^{*}}$ can also be inferred from data for standard SDEs over $\mathbb{R}^N$, as detailed in Ref. 46. Section IV C summarizes the method and adapts it to the RKHS setting.

Every Itô diffusion also admits a limit distribution $L(\sigma\in S)$ as $t\to\infty$, with associated density $\ell=dL/d\mu$. By definition of the limit distribution, $\ell$ is the eigenvector of $E$ associated with eigenvalue $1$: $E(t)[\ell]=\ell$.

The limit distribution can also be useful to compute how “complex” or “uncommon” a state is, using its self-information $h(\sigma)=-\log\ell(\sigma)$. This is the equivalent of how the statistical complexity is defined in the discrete case.^{3} Also, it can be used for similar purposes (e.g., building filters^{8,24,26,27}). That said, this differential entropy should be interpreted with caution.
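As a toy sanity check of the eigenvector property above, a small row-stochastic matrix can stand in for the evolution operator $E(t)$ between three coarse-grained states; the matrix values are assumptions for illustration only.

```python
import numpy as np

# The limit density ell is the left eigenvector of E with eigenvalue 1.
E = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])
lam, V = np.linalg.eig(E.T)                  # eigenvectors of E^T = left eigenvectors of E
ell = np.real(V[:, np.argmax(np.real(lam))]) # pick the Perron eigenvalue 1
ell = ell / ell.sum()                        # normalize into a density
h = -np.log(ell)                             # self-information h(sigma) = -log ell(sigma)
```

The normalized eigenvector is invariant under the dynamics, and its negative logarithm gives the per-state self-information used above.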

### C. Diffusion maps and the intrinsic geometry of causal states

The representer theorem^{41} allows us to develop causal-state estimates $\hat\sigma$ in the span of the data kernel functions $k_Y(y_i,\cdot)$, used as a pseudo-basis; see Sec. III A 1. Since the span dimension grows with the number of samples, estimation algorithms become impractical. However, the causal states are an intrinsic property of the system, independent of how they are coordinatized. Therefore, when working with data acquired from physical processes, $S$ will appear as the dominant structure. The question then becomes how to work with it.

The following assumes $S$ has a small finite dimension $M$ compared to the number of observations $N$.^{47} Residing on top of the dominant structure, the statistical inaccuracies in the MMD test appear as much smaller contributions. We thus introduce the tools needed to work directly on $S$, whether for visualizing the causal states using reduced coordinates or for representing evolution operators on $S$ instead of using Eq. (4) on $H_Y$.

To do this, we exploit the methodology introduced for diffusion maps.^{48} These maps are especially relevant when data lie on a curved manifold, where the embedding’s high-dimensional distance between two data points does not reflect the manifold’s geometry. In contrast, diffusion maps, and their variable-bandwidth cousin,^{49} easily recover the Laplace–Beltrami operator for functions defined on the manifold. Assuming $S$ is a smooth Riemannian manifold, then the diffusion coordinates cleanly recover $S$’s intrinsic geometry (a static property) independent of the observed local sampling density (a dynamic property, linked to trajectories as in Sec. IV).^{48}

The original diffusion-map method artificially builds a proximity measure for data points using a localizing kernel, i.e., one that takes nonnegligible values only for near neighbors. Path lengths are computed from these neighborhood relations. Two points are deemed “close” if a large number of short paths connect them and “far” if only a few paths, or paths with long distances, connect them. See Ref. 48 for details, which also shows that the diffusion distance is a genuine metric.

That said, in the present context, there is already a notion of proximity between causal states $\sigma\in S$. Indeed, reusing notation from Sec. III A 2, the state estimates in the RKHS are of the form

$$\hat\sigma_i = \sum_{\alpha=1}^{N} \omega_\alpha(x_i)\, k_Y(y_\alpha,\cdot).$$

Therefore, a Gram matrix $G^S$ of their inner products can be defined easily,

$$G^S_{ij} = \langle \hat\sigma_i, \hat\sigma_j \rangle_{H_Y} = \sum_{\alpha,\beta} \omega_\alpha(x_i)\, G^Y_{\alpha\beta}\, \omega_\beta(x_j),$$

where $G^Y_{\alpha\beta}=k_Y(y_\alpha,y_\beta)$ is the Gram matrix of the $Y$ observations. The determination of the $\omega$ coefficients for each $x$ relies on the $G^X$ Gram matrix, as detailed in Sec. III A 1.
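As an illustration, $G^S$ can be assembled directly from the $Y$ Gram matrix and the per-sample weight vectors. The toy values and the column-normalized weights below are assumptions for the example.

```python
import numpy as np

# Column j of Omega holds the coefficients omega(x_j) of the state estimate
# sigma_j = sum_alpha Omega[alpha, j] k_Y(y_alpha, .).
rng = np.random.default_rng(0)
ys = rng.normal(size=6)
GY = np.exp(-(ys[:, None] - ys[None, :]) ** 2 / 2)   # G^Y_ab = k_Y(y_a, y_b), Gaussian
Omega = rng.random((6, 6))
Omega /= Omega.sum(axis=0)                           # normalized weights (a simplification)
GS = Omega.T @ GY @ Omega                            # G^S_ij = <sigma_i, sigma_j>
```

Since $G^Y$ is positive semidefinite and the weights enter quadratically, $G^S$ inherits symmetry and positive semidefiniteness, as required of a similarity matrix.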

It turns out that diffusion maps can be built from kernels with exponential decay.^{50} The original fixed-bandwidth diffusion maps^{48} also use exponential kernels for building the proximity relation. Such kernels are reproducing, and they are also characteristic, fulfilling the assumptions needed for representing conditional distributions.^{39} Hence, when using the exponential kernel, the RKHS Gram matrix is also exactly a similarity matrix that can be used for building a diffusion map. (This was already made explicit in Ref. 51.) Moreover, Eq. (7) explicitly represents $G^S_{ij}$ as a weighted sum of exponentially decaying kernels; hence, it is itself exponentially decaying. Thus, $G^S$ can be directly used as a similarity matrix to reconstruct $S$’s geometry via diffusion maps.

Notice, though, that here data are already scaled by the reproducing kernel. For example, using a Gaussian kernel,

$$k_X(x_1,x_2) = \exp\left(-\frac{\Vert x_1 - x_2 \Vert^2}{2\xi^2}\right),$$

$\xi$ specifies the scale at which differences in the $X$ norm are relevant; the same holds for $k_Y$ and $Y$. Since that scale $\xi$ can be set by cross-validation,^{40} we exploit this fact in Sec. VI C’s experiments, where an objective measure (e.g., prediction accuracy) is available.

In practice, a large range of $\xi $ can produce good results, and an automated method for selecting $\xi $ has been proposed.^{52} Varying the analyzing scale to coarse-grain the manifold $S$ is also possible. Using the method from Ref. 53, this is similar, in a way, to wavelet analysis.

Once the data scale is properly set and the similarity matrix built, the diffusion-map algorithm can be parameterized to cleanly recover $S$’s Riemannian geometry, doing so independently of how densely $S$ is sampled. This is explained in Fig. 4 of Ref. 48, and it is exactly what is needed here: separating $S$’s static causal-state geometric structure from the dynamical properties (trajectories) that induce the density on $S$. This is achieved by normalizing the similarity matrix $G^S$ to remove the influence of the sampling density and then applying a spectral decomposition to the nonsymmetric normalized matrix; see Ref. 48.

The result is a representation of the form

$$\hat\sigma_i = \sum_{j=1}^{N} \lambda_j\, \psi_{i,j}\, \Phi_j,$$

where each $\psi_\alpha$, $\alpha=1,\ldots,N$, is a right eigenvector and each $\lambda_\alpha$ is the associated eigenvalue. Note that the coefficients $\sigma_{i,j}=\lambda_j\psi_{i,j}$ are also weights for the conjugate left eigenvectors $\Phi_j$, $j=1,\ldots,N$, which are themselves functions in the RKHS. (Hence, they are represented in practice by the $N$ values they take at each sample.)

The first eigenvalue is $1$, and it is associated with constant right-eigenvector coordinates.^{48} The conjugate left-eigenvector coefficients yield an estimate of the limit density $\ell(\hat\sigma_i)$, with respect to the reference measure $\mu$ in our case. Hence, $\Phi_1$ and $\psi_1$ can be normalized so that $\sum_j\Phi_{1,j}=1$ and $\psi_{1,j}=1$ for all $j$. With these conventions, $\lambda_1\psi_{1,j}=1$ is constant and can be ignored, all the while respecting the bi-orthogonality $\langle\psi_1,\Phi_1\rangle=1$. The other eigenvalues are all positive, $1>\lambda_\alpha>0$, and we choose the indexing $\alpha$ so they are sorted in decreasing order.

When $S$ is a low-dimensional manifold, as assumed here, a spectral gap should be observed. Then, it is sufficient to retain only the first $M\ll N$ components. Otherwise, $M$ can be set so that the residual squared distance $\sum_{\alpha>M}\lambda_\alpha^2\left(\psi_{i,\alpha}-\psi_{j,\alpha}\right)^2$ remains below some threshold $\theta$. Since the diffusion distance is a true metric in the embedding space $H_Y$, $\theta$ can also be set below a prescribed significance level for the MMD test (Sec. III A 1), if so desired. The residual components are then statistically irrelevant.
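The density-normalized construction can be sketched as follows; the two-cluster toy data and the $\alpha=1$ normalization of Ref. 48 are the assumed setup, and a symmetric conjugate matrix is used for a stable eigensolve.

```python
import numpy as np

def diffusion_map(W, alpha=1.0):
    # Density-normalized diffusion map from a similarity matrix W.
    q = W.sum(axis=1)
    Wa = W / np.outer(q, q) ** alpha       # remove sampling-density influence
    d = Wa.sum(axis=1)
    A = Wa / np.sqrt(np.outer(d, d))       # symmetric conjugate of the Markov matrix
    lam, v = np.linalg.eigh(A)
    idx = np.argsort(lam)[::-1]
    lam, v = lam[idx], v[:, idx]
    psi = v / np.sqrt(d)[:, None]          # right eigenvectors of D^-1 Wa
    phi = v * np.sqrt(d)[:, None]          # left eigenvectors, bi-orthogonal to psi
    psi = psi / psi[0, 0]                  # convention: first right eigenvector is constant 1
    return lam, psi, phi

# Two well-separated clusters: expect lambda_0 = 1 and a spectral gap after lambda_1.
rng = np.random.default_rng(1)
pts = np.concatenate([rng.normal(0, 0.1, 10), rng.normal(3, 0.1, 10)])
W = np.exp(-(pts[:, None] - pts[None, :]) ** 2)
lam, psi, phi = diffusion_map(W)
```

The leading eigenvalue is $1$ with a constant right eigenvector, and the cluster structure shows up in the next coordinate, illustrating the truncation criterion above.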

Taking stock, the preceding established that:

- The causal-state manifold $S$ can be represented in terms of a functional basis $\{\Phi_m\}_{m=1,\ldots,M}$ in $H_Y$ of reduced dimension $M$, in contrast to using the full data span $\{k_Y(y_i,\cdot)\}_{i=1,\ldots,N}$ of size $N$. The remaining components are statistically irrelevant.

- The functional basis $\{\Phi_m\}_{m=1,\ldots,M}$ can be defined so that the induced representation of $S$ does not depend on the density at which various regions of $S$ are sampled. This cleanly recovers $S$’s geometry.

- Each causal state $\sigma$ is represented as a set of coefficients on that basis.

Taken altogether, the RKHS causal states and diffusion-map equations of motion define a new structural model class—the *kernel $\u03f5$-machines*. The constituents inherit the desirable properties of optimality, minimality, and uniqueness of $\u03f5$-machines generally and provide a representation for determining quantitative properties and deriving analytical results. That said, establishing these requires careful investigation that we leave to the future.

## V. DEPLOYING KERNEL *ε*-MACHINES


As with $\epsilon$-machines generally, kernel $\epsilon$-machines can be used in a number of ways. The following describes how to compute statistics and how to make predictions.

### A. Computing functions of given data

To recover the expected values of functions of data, a functional can be defined on the reduced basis. Recall the reproducing property,

$$f(x) = \langle f, k_X(x,\cdot) \rangle_{H_X},$$

for any $f\in H_X$. Such functions $f$ are represented in practice by the values they take on each of the observed $N$ samples, with the reproducing property only achieved for $N\to\infty$ samples (and otherwise approximated). One such function is $f_\tau$, which to each observed past $x$ associates the entry of a future time series $y$ matching time $\tau$. This function can be fit from observed data. It is easy to generalize to spatially extended or network systems or to any function of the $(x_i,y_i)$ data pairs. However, $\langle\sigma,f\rangle$ can be expressed equally well in the reduced basis $\{\Phi_m\}_{m=1,\ldots,M}$. Then, $f_\tau$ is simply projected onto each eigenvector: $f_m=\langle f_\tau,\Phi_m\rangle$.

This leads to an initial way to make predictions:

For each data sample, represent the function $f$ by the value it takes for that sample. For example, for vector time series of dimension $D$, $x$ is a past series ending at the present time $t_0$ and $y$ a future series. $f_\tau$ is then the $D$ values observed at time $t_0+\tau$ in the data set, for each sample, yielding an $N\times D$ matrix.

Project the function $f$ onto the reduced basis by computing $f_m=\langle f_\tau,\Phi_m\rangle$ for each left eigenvector $m\le M$. This yields an $M\times D$ matrix representation $\hat f$.

Compute $\hat f[\sigma_i]$ for a state $\sigma_i$, itself represented as a set of $M$ coefficients on the reduced basis [Eq. (8)]. This yields a $D$-dimensional vector in the original data space $X$ in this example, which can be compared to the actual value from $y_i$ at time $\tau$.
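The three steps can be sketched numerically as follows; all arrays are made-up stand-ins for a trained model ($D=1$, $N=5$, $M=2$ are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, D = 5, 2, 1
f_tau = rng.normal(size=(N, D))    # step 1: value observed at t0 + tau for each sample
Phi = rng.normal(size=(N, M))      # left eigenvectors Phi_m (assumed precomputed)
f_hat = Phi.T @ f_tau              # step 2: project, f_m = <f_tau, Phi_m>; M x D matrix
sigma_i = rng.normal(size=M)       # a state as M reduced-basis coefficients
pred = sigma_i @ f_hat             # step 3: D-dimensional prediction in data space
```

The prediction is a plain matrix–vector product once everything lives in the reduced basis, which is what makes the $M\ll N$ representation practical.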

### B. Representing new data values

A model is useful for prediction if its states can be computed on newly acquired data $x^{new}$ for which future values $y$ are not available. In the case of kernel methods and diffusion maps, the Nyström extension^{54} is a well-established method that yields a reasonable state estimate $\hat\sigma^{new}$ if $x^{new}$ lies within a densely sampled region of data. That said, it is known to be inaccurate in sparsely sampled regions.

Given the Fokker–Planck evolution equation solutions [Eq. (6)] and the evolution-operator estimation methods described shortly, we may estimate a distribution $\hat q^{new}(\sigma)$ over $S$, encoding the probability that the causal state associated with $x^{new}$ is at location $\sigma\in S$. Then, a single approximation $\hat\sigma^{new}=\mathbb{E}[\hat q^{new}(\sigma)]$ could be obtained, if desired. We can also allow the distribution to degenerate to the Dirac $\delta$ distribution, yielding effectively a single estimate. This could occur, for example, when the evolution is applied to one of the original samples $x^{new}=x_i$ used for estimating the model.

To estimate a distribution $\hat q^{new}(\sigma)$, we employ *kernel moment matching*,^{55} adapted to our context. The similarity of the new data observation $x^{new}$ to each reference data $x_i$, $i=1,\ldots,N$, is computed using kernel evaluations $K(x^{new})=\{k_X(x^{new},x_i)\}_{i=1,\ldots,N}$. Applying Eq. (3) to the new vector $K(x^{new})$,

$$\omega^{new} = \operatorname*{argmin}_{\omega} \left\Vert \left(G^X+\epsilon I\right)\omega - K(x^{new}) \right\Vert,$$

subject to $\omega^{new}_i\ge 0$ for $i=1,\ldots,N$.

Compared to Eq. (3), this adds a positivity constraint similar to kernel moment matching.^{55} This also implies $\omega^{new}_i\le(1+\xi)/(1+\epsilon)$, where $\xi$ is the tolerance of the argmin solver. Proof: We know $R=G^X+\epsilon I$ has $1+\epsilon$ on the diagonal. Both $G^X$ and $K(x^{new})$ have positive entries. Then, $(1+\epsilon)\omega_i+\sum_{j\ne i}R_{ij}\omega_j=k_X(x^{new},x_i)\pm e$, where $e<\xi$ is the error of the argmin solver. However, $\sum_{j\ne i}R_{ij}\omega_j\ge 0$ since $G^X_{ij}>0$ and $\omega_j\ge 0$ by constraint. Also, $k_X(x^{new},x_i)\le 1$ by construction. Hence, $(1+\epsilon)\omega_i\le 1+\xi$ and $\omega_i\le(1+\xi)/(1+\epsilon)$.

$\omega^{new}$ is thus the closest solution to $(G^X+\epsilon I)^{-1}K(x^{new})$ such that the estimated state

$$\hat\sigma^{new} = \sum_{i=1}^{N} \omega^{new}_i\, k_Y(y_i,\cdot)$$

remains in the convex set in $H_Y$ of the data $k_Y(y_i,\cdot)$, up to a good approximation $(1+\xi)/(1+\epsilon)\approx 1$. While currently lacking a formal justification for the convexity constraint, we found that better results are obtained with it than with Nyström extensions—these use possibly negative $\omega^{new}_i$ values due to the unbounded matrix inverse, yielding estimates that can wander arbitrarily far away in $H_Y$ (depending on $\epsilon$). Normalizing, we get a probability distribution of the form

$$\hat q^{new}(\sigma_i) = \frac{\omega^{new}_i}{\sum_{j=1}^{N}\omega^{new}_j},$$

where, as usual in RKHS settings, the function $\hat q^{new}$ is expressed by the values it takes on every reference data sample, $\sigma_i$, $i=1,\ldots,N$ in this case.
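The constrained problem above is a nonnegative least-squares problem; a minimal sketch using SciPy's NNLS solver follows, where the Gaussian kernel and the sample values are assumptions for the example.

```python
import numpy as np
from scipy.optimize import nnls

def new_state_distribution(xs, x_new, eps=1e-3):
    # w_new = argmin || (G^X + eps I) w - K(x_new) ||  subject to  w >= 0,
    # then normalized into a probability distribution over reference samples.
    G = np.exp(-(xs[:, None] - xs[None, :]) ** 2 / 2)   # assumed Gaussian k_X
    K = np.exp(-(xs - x_new) ** 2 / 2)
    w, _ = nnls(G + eps * np.eye(len(xs)), K)
    return w / w.sum()

xs = np.array([0.0, 0.1, 4.0])
q_new = new_state_distribution(xs, 0.05)
```

The result is nonnegative and sums to one, and it concentrates on the reference samples nearest to $x^{new}$, unlike a plain matrix inverse, which can produce negative weights.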

Compared to kernel moment matching,^{55} we used the kernels $k_Y(y_i,\cdot)$ as the basis for expressing the density estimate, with coefficients that encode the similarity of the new sample to each reference sample in $X$. An alternative would be to adapt the method from Ref. 43 and express $\hat q^{new}$ on reference samples drawn from the limit distribution $\ell$ over $S$; see Sec. IV B.

When data are projected on a reduced basis $\{\Phi_m\}_{m=1,\ldots,M}$, the distribution $\hat q^{new}$ can be applied to the reference state coordinates $\sigma_{i,m}=\lambda_m\psi_{i,m}$ so that an estimated state $\hat\sigma^{new}=\mathbb{E}[\hat q^{new}(\sigma)]$ can be computed with

$$\hat\sigma^{new}_{m} = \sum_{i=1}^{N} \hat q^{new}(\sigma_i)\,\lambda_m\,\psi_{i,m}.$$
### C. Prediction with the evolution operator

Section V A’s method allows predicting any arbitrary future time $\tau$, provided that the future is sufficiently well represented in the variable $y\in Y$. In practice, this means that the future look ahead $L_Y$ needs to be larger than $\tau$ and that sufficient data are captured to ensure good reproducing capability for the kernel $k_Y$. However, for some systems the autocorrelation decreases exponentially fast, and there is no practical way to collect enough data for good predictions at large $\tau$.

An alternative employs the Fokker–Planck equation to evolve state distributions over time. This, in turn, yields an estimate

$$\mathbb{E}_{Q(\sigma,t)}\!\left[\,\mathbb{E}_{Pr(Y|\sigma)}\left[f\right]\right].$$

Here, $Q$ is the state distribution reached at time $t>t_0$, whose density is given by Eq. (6). This method exploits $S$’s full structure, its Markovian properties, and the generator described in Sec. IV B.

This allows reaching longer times $t>\tau$, while using look aheads $L_Y$ (Sec. II E) that match the natural decrease of autocorrelation. If the variable $y\in Y$ captures sufficient information about the system’s immediate future for a given $x\in X$, then the causal structure is consistently propagated to longer times by exploiting the causal-state dynamics given by the evolution operator $E(t)=e^{(t-t_0)\Gamma^{*}}$.

Thanks to expressing $S$ on the basis $\{\Phi_m\}_{m=1,\ldots,M}$, it is possible to explicitly represent the coefficients $a(\sigma_t)$ and $b(\sigma_t)$ of Eq. (4) in a traditional vector representation, with established SDE coefficient-estimation methods.^{45,56} However, recent results suggest that working directly with the evolution operator is more reliable^{43,46,57} for similar models working directly in data space—instead of, as here, causal-state space.

Assuming states are estimated from observations acquired at regularly sampled intervals, so that $\sigma_{i+1}$ and $\sigma_i$ are separated by time increment $\Delta t$, then, on the functional basis $\{\Phi_m\}_{m=1,\ldots,M}$, the coefficients $\psi_{i+1,m}$ are related to the coefficients $\psi_{i,m}$ by the action of the evolution operator $E(\Delta t)$ on the state $\sigma_i$. Hence, this time-shifting operator from $t_0$ to time $t=t_0+\Delta t$ can be estimated with

$$\hat{E}(\Delta t) = \psi_{2:N}^{\top}\, \Phi_{1:N-1},$$

where $\psi_{2:N}$ is the set of $M$ right eigenvectors restricted to times $t\ge t_0+\Delta t$ and $\Phi_{1:N-1}$ are the corresponding $M$ left eigenvectors restricted to times $t\le t_0+(N-1)\Delta t$. Normalization can be performed *a posteriori*: $\hat{E}(\Delta t)[\hat\sigma]$ should have the constant $1$ as its first coefficient (Sec. IV C). Therefore, it is straightforward to divide $\hat{E}(\Delta t)[\hat\sigma]$ by its first coefficient for normalization. The estimator $\hat{E}(\Delta t)$ is efficiently represented by an $M\times M$ matrix. For a similar operator working in data space $X$ instead of in $S$, the estimator is consistent in the limit of $N\to\infty$, with an error growing as $O(\Delta t/N)$.^{46}

Predictions for values at future times $n\Delta t$, obtained by operator exponentiation,

$$\hat{E}(n\Delta t) = \hat{E}(\Delta t)^{n},$$

are simply estimated with the matrix counterpart of $\hat{E}(\Delta t)$. Thanks to the bi-orthogonality of the left and right eigenvectors, this expression can be computed efficiently, followed by *a posteriori* normalization. Though this estimator is statistically consistent in the limit of $N\to\infty$, in practice the method clearly fails when $n\ge N$. This is counterbalanced in some cases by convergence toward the limit distribution in $n\ll N$ steps, in which case the problem does not appear. This is the case for the experiments in Sec. VI C using a chaotic flow. The general case, however, is a topic for future research.
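The shift-operator estimation and its powers can be sketched on toy reduced coordinates. The linear dynamics below are an assumption chosen so the fit is exact, and a least-squares solve stands in for the bi-orthogonal projection.

```python
import numpy as np

# Rows of Psi are successive reduced-coordinate states; a least-squares fit of
# Psi[t] -> Psi[t+1] recovers the M x M matrix counterpart of E(dt), and matrix
# powers give E(n*dt). First coordinate held constant, per the normalization
# convention. A_true is a toy assumption, not the paper's dynamics.
A_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.9, 0.1],
                   [0.0, -0.1, 0.9]])
N = 200
Psi = np.empty((N, 3))
Psi[0] = [1.0, 1.0, -0.5]
for t in range(N - 1):
    Psi[t + 1] = A_true @ Psi[t]

# Least-squares estimate: Psi[:-1] @ E_hat ~ Psi[1:]
E_hat, *_ = np.linalg.lstsq(Psi[:-1], Psi[1:], rcond=None)
E_n = np.linalg.matrix_power(E_hat, 5)      # matrix counterpart of E(5*dt)
```

Because the toy dynamics are exactly linear, the estimator recovers the transition matrix and its powers propagate states $n$ steps ahead.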

To synopsize, prediction is achieved with the following steps:

Represent a function $f$, defined on $X$, by the values it takes for each observed data pair $(x_i,y_i)$, with the method described in Sec. V A. This gives an estimate $\hat f$.

Build the evolution operator $\hat{E}(\Delta t)$ as described above, or powers $\hat{E}(n\Delta t)$ of it for predicting $n$ steps into the future.

Compute the function $\hat{E}(n\Delta t)[\hat f]$. This amounts to a matrix multiplication on a reduced-basis representation, together with an *a posteriori* normalization.

When a new data sample $x^{new}$ becomes available at the present time $t_0$, estimate a distribution $Q$ over the training samples that best represents $x^{new}$. $Q$ is expressed by its density $\hat q^{new}$ on a reduced basis $\{\Phi_m\}_{m=1,\ldots,M}$, as detailed in Sec. V B.

Apply the evolved function $\hat{E}(n\Delta t)[\hat f]$ to $\hat q^{new}$ to obtain the expected value $\mathbb{E}_{Q(\sigma,t_0+n\Delta t)}\left[\mathbb{E}_{Pr(Y|\sigma)}[f(x\in\sigma)]\right]$ that $f$ takes at future time $t_0+n\Delta t$.

Section VI C applies this to a concrete example.

## VI. VALIDATION AND EXAMPLES

The following illustrates reconstructing kernel $\u03f5$-machines from data in three complementary cases: (i) an infinite Markov-order binary process generated by a two-state hidden Markov model, (ii) a binary process generated by an HMM with an uncountable infinity of causal states, and (iii) thermally driven continuous-state deterministic chaotic flows. In each case, the hidden causal structure is discovered assuming only that the processes are conditionally stationary.

### A. Infinite-range correlation: The even process

The *even process* is generated by a two-state, unifilar, edge-emitting Markov model that emits discrete data values $v\in\{0,1\}$. Figure 2 displays the $\epsilon$-machine HMM state-transition diagram—states and transition probabilities.

Realizations $x=(v_t)_{-L_X<t\le 0}$ and $y=(v_t)_{0<t\le L_Y}$ consist of sequences in which blocks of an even number of $1$s are bounded by any number of $0$s, e.g., $01101111000011001111110\ldots$. An infinite past look ahead $L_X$ is required to correctly predict these realizations. Indeed, truncation generates ambiguities when only $1$s are observed.

For example, with $L_X=4$, the observed series $1111$ could be part of a larger group $\ldots011111$, in which case the next symbol is necessarily a $1$, or of larger groups of the form $\ldots111111$, $\ldots001111$, or $\ldots101111$, in which case the next symbol is either $0$ or $1$ with probability $1/2$. However, with a limited look ahead of $L_X=4$, a prediction has no way to encode that the next symbol is necessarily a $1$ in the first case. One implication is that no finite Markov chain generates the process.

Despite its simplicity—only two internal states—and compared to processes generated by HMMs with finite Markov order, the even process is a helpfully challenging benchmark for testing the reconstruction capabilities of kernel $\u03f5$-machine estimation on discrete data.
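For readers wishing to reproduce such data, the even process is easy to sample. The sketch below assumes the standard two-state generator, with equal probabilities of emitting $0$ (self-loop) or $1$ (switch) from the first state, consistent with the $1/2$ branch probabilities and $(1/3,2/3)$ symbol proportions quoted in the text; the actual transition probabilities are those of Fig. 2.

```python
import random

def sample_even_process(n, seed=0):
    """Sample n symbols from a two-state unifilar HMM for the even process.

    Assumed structure: from state A, emit 0 (prob 1/2, stay in A) or
    emit 1 (prob 1/2, go to B); from state B, emit 1 (prob 1, back to A).
    Blocks of 1s bounded by 0s therefore always have even length.
    """
    random.seed(seed)
    state, out = "A", []
    for _ in range(n):
        if state == "A":
            if random.random() < 0.5:
                out.append(0)
            else:
                out.append(1)
                state = "B"
        else:
            out.append(1)
            state = "A"
    return out
```

Any maximal run of $1$s bounded by $0$s on both sides in the output has even length, which is the defining property of the process.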

Reconstruction is performed using product kernels,

with a decay parameter $\gamma$ setting the relative influence of the most distant past (future) symbol $x_{-L_X+1}$ ($y_{L_Y}$). We use the exponential kernel,

where $a,b\in\{0,1\}$ are the symbols in the series.
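One plausible form of this product kernel can be sketched as follows. The exact weighting scheme used in the paper may differ; here we assume each lag's exponential kernel is down-weighted geometrically by the decay $\gamma$, so the most distant symbol has the least influence.

```python
import numpy as np

def product_exponential_kernel(x, xp, gamma=0.01, xi=1.0):
    """Product kernel over two aligned symbol series (most recent symbol last).

    Each lag contributes an exponential kernel exp(-|a - b| / xi); the
    contribution of the symbol t steps in the past is down-weighted by
    (1 - gamma)**t.  This is a sketch of one possible decay scheme, not
    necessarily the paper's exact formula.
    """
    x, xp = np.asarray(x, float), np.asarray(xp, float)
    L = len(x)
    lags = np.arange(L - 1, -1, -1)          # oldest symbol has the largest lag
    weights = (1.0 - gamma) ** lags
    log_k = -np.abs(x - xp) / xi             # per-symbol log-kernels
    return float(np.exp(np.sum(weights * log_k)))
```

The kernel equals $1$ for identical series, is symmetric in its arguments, and decreases as the series differ, with recent mismatches penalized most.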

Figure 3 presents the results for $L_X=10$ and $L_Y=5$ for a typical run with $N=30\,000$ sample $(x,y)$ pairs, with a decay $\gamma=0.01$ and a bandwidth $\xi=1$.

The eigenvalue spectrum of the reduced basis decreases rapidly: $\lambda_0=1$ (as expected), $\lambda_1\approx10^{-2}$, and all other eigenvalues $\lambda_{j\ge2}<10^{-4}$. We therefore project the causal states onto the only relevant eigenvector $\Phi_1$ and build the histogram shown in Fig. 3. Colors match labels automatically found by a clustering algorithm.^{58}

Figure 3 summarizes the cluster probabilities and their transitions.^{59} They match the original even process, together with a transient causal state. By inspection, one sees that this state represents the set of all-$1$s sequences mentioned above, the source of ambiguity. Its probability, for $L_X=10$, is that of jumping five consecutive times from state $\sigma_0$ to state $\sigma_1$ in the generating even process; hence $1/2^5\approx0.03$, which is the value we observe. From that transient state, the ambiguity cannot be resolved; therefore, transitions follow the $(1/3,2/3)$ proportions of the symbols in the series. Note that unifilarity is broken, since there are two paths for the symbol $v=1$ starting from state $\sigma_1$, also reflecting the ambiguity induced by the finite truncation.

### B. Infinite-state complexity: An uncountable causal-state process

The even process is a case where ambiguity arises from incomplete knowledge due to finite-range truncation. However, even for discrete finite data alphabets $V$, there are processes whose causal states are irreducibly infinite. This occurs generically for processes generated by nonunifilar HMMs. In this case, knowledge of the observed data is insufficient to determine the generative model’s internal state. Only distributions over those states—the *mixed states*^{29}—are predictive.

In the limit $L_X\to\infty$, the causal states then correspond to unique mixed-state distributions,^{21} and there can be infinitely many (countably or uncountably) causal states, even for a simple finite generative HMM. This arises from nonunifilarity, since the same observed symbol allows transitions to two distinct internal states. In contrast, predictive equivalence ensures that the same information (a newly observed symbol), starting from the same current state [equivalence class $\epsilon(x)$ of pasts $x\in\mathcal{X}$], induces the same consequences [possible futures $Y$, with a fixed distribution $\Pr(Y|\epsilon(x))$]. Thus, nonunifilar models can be more compact generators. This can be beneficial, but it comes at the cost of being markedly worse predictors than unifilar HMMs.

The mess3 process uses three symbols $V=\{0,1,2\}$ and consists of three generative states $s_0$, $s_1$, and $s_2$. The state-transition diagram is symmetric, and each state has the same transition structure. Consider the state $s_i$ for $i=0,1,2$, with modulo arithmetic so that $i-1=2$ when $i=0$ and $i+1=0$ when $i=2$. Then, the transitions are as follows.

From state $s_i$ to itself, with total probability $p=\alpha=ay+2bx$, symbols are emitted as $\Pr(v=i)=ay$ and $\Pr(v=i-1)=\Pr(v=i+1)=bx$.

From state $s_i$ to state $s_{i+1}$, with total probability $p=(1-\alpha)/2$, symbols are emitted as $\Pr(v=i)=ax$, $\Pr(v=i-1)=bx$, and $\Pr(v=i+1)=by$.

From state $s_i$ to state $s_{i-1}$, with total probability $p=(1-\alpha)/2$, symbols are emitted as $\Pr(v=i)=ax$, $\Pr(v=i-1)=by$, and $\Pr(v=i+1)=bx$.

In this, $a=0.6$, $b=(1-a)/2=0.2$, $x=0.15$, and $y=1-2x=0.7$.
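The transition structure above can be encoded as labeled transition matrices and checked for consistency. This sketch follows the symmetric description just given; the function name and array layout are illustrative.

```python
import numpy as np

def mess3_tensor(a=0.6, x=0.15):
    """Labeled transition matrices T[v, i, j] = Pr(emit v, go to s_j | in s_i)
    for the mess3 HMM, following the symmetric structure described above."""
    b = (1 - a) / 2
    y = 1 - 2 * x
    T = np.zeros((3, 3, 3))  # indexed as T[symbol v, from-state i, to-state j]
    for i in range(3):
        ip, im = (i + 1) % 3, (i - 1) % 3   # modulo arithmetic on state indices
        # self-transition s_i -> s_i, total probability alpha = a*y + 2*b*x
        T[i, i, i] = a * y
        T[im, i, i] = b * x
        T[ip, i, i] = b * x
        # s_i -> s_{i+1}, total probability (1 - alpha) / 2
        T[i, i, ip] = a * x
        T[im, i, ip] = b * x
        T[ip, i, ip] = b * y
        # s_i -> s_{i-1}, total probability (1 - alpha) / 2
        T[i, i, im] = a * x
        T[im, i, im] = b * y
        T[ip, i, im] = b * x
    return T
```

Summing over symbols and destination states gives $1$ for every source state, and the total self-transition probability is $\alpha = ay + 2bx = 0.48$ for the default parameters.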

The generated process is known to have uncountably infinitely many causal states. Their ensemble $S$ forms a Sierpinski gasket in the mixed-state simplex.^{29} The probability of observing states at deeper subdivisions decreases exponentially, though. With limited samples, only a few self-similar subdivisions of the main gasket triangle can be observed.

Reconstruction of $S$ is performed using $L_X=15$ and $L_Y=1$. The same product exponential kernel is used as above, with a decay $\gamma=0.01$ and a bandwidth $\xi=0.1$. $N=25\,000$ sample $(x,y)$ pairs are sufficient to reconstruct the first self-similar subdivisions.

The eigenvalue spectrum is $\lambda_0=1$, $\lambda_1=0.999$, $\lambda_2=0.999$, and $\lambda_{i\ge3}<1.2\times10^{-15}$. The algorithm thus finds only two components of equal significance. Figure 5 shows a scatterplot of the inferred causal states using coordinates on the reduced basis $\{\Phi_1,\Phi_2\}$. The states are very well clustered on the main vertices of the Sierpinski gasket. (The inset shows that $4$ orders of magnitude are needed to see a spread within each cluster compared to the main triangle size.) The triangle is remarkably equilateral, and its center even lies at $\approx(0,0)$, reflecting the symmetry of the original HMM. To our knowledge, no other algorithm is currently able to correctly recover the causal states purely from data for such a challenging, complex process.

The number of samples at each triangle node is indicated. The nodes at the first subdivision are about $800$ times less populated than the main nodes. While theory predicts that states appear on all subdivisions of the Sierpinski gasket, the sample size needed to observe enough such states is out of reach computationally.

### C. Thermally driven continuous processes: Chaotic Lorenz attractors

The first two examples demonstrated that kernel $\u03f5$-machine reconstruction works well on discrete-valued data, even when states are expected to appear on a complicated fractal structure or when the time series has infinite-range correlations. The next step, and a central point of the development, is to reconstruct a continuous infinity of causal states from sampled data. The following example processes also serve to demonstrate time-series prediction using the estimated kernel $\u03f5$-machine, recalling the method introduced in Sec. IV B.

We first use the chaotic Lorenz ordinary differential equations from 1963 with the usual parameters $(\sigma,\rho,\beta)=(10,28,8/3)$.^{60} We add isotropic stochastic noise components $dW$ of amplitude $\eta$ to model thermal fluctuations driving the three main macroscopic fluid modes that the ODEs were crafted to capture.

A random initial condition $(u_0,v_0,w_0)$ is drawn in the region near the attractor, and a sufficiently long transient is discarded before collecting data $(u_t,v_t,w_t)$. SDE integration is performed using $dt=0.01$, yielding samples $(u_t,v_t,w_t)_{0\le t<T}$ up to a maximal time $T$ corresponding to $N=20\,000$ (past, future) pairs.
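The thermally driven Lorenz system can be integrated with a simple Euler-Maruyama scheme. This is a sketch under stated assumptions: the paper does not specify its integrator, transient length, or initial-condition region, so those choices here are illustrative.

```python
import numpy as np

def lorenz63_sde(T_steps=20000, dt=0.01, eta=0.0, sigma=10.0, rho=28.0,
                 beta=8.0 / 3.0, transient=1000, seed=0):
    """Euler-Maruyama integration of the Lorenz-63 ODEs driven by isotropic
    thermal noise of amplitude eta (a sketch; integrator details assumed)."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(-10, 10, size=3)   # random initial condition near attractor
    out = np.empty((T_steps, 3))
    for n in range(transient + T_steps):
        x, y, z = u
        drift = np.array([sigma * (y - x),
                          x * (rho - z) - y,
                          x * y - beta * z])
        # deterministic drift plus isotropic Wiener increments eta * dW
        u = u + drift * dt + eta * np.sqrt(dt) * rng.standard_normal(3)
        if n >= transient:             # discard the transient before collecting
            out[n - transient] = u
    return out
```

With `eta=0` this reduces to a plain Euler integration of the deterministic ODEs; measurement noise, as in the text, would be added to the returned samples afterward rather than inside the dynamics.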

In addition to the thermal fluctuations, we also model systematic measurement error by adding a centered Gaussian process $\Gamma=(\gamma^0,\gamma^1,\gamma^2)$ with isotropic variance $\nu^2$. The result is the observed series $(u'_t,v'_t,w'_t)=(u_t+\gamma^0_t,\,v_t+\gamma^1_t,\,w_t+\gamma^2_t)$ used to estimate a kernel $\epsilon$-machine and perform a sensitivity analysis.

#### 1. Kernel *ε*-machine reconstruction in the presence of noise and error


When $\eta=0$ and $\nu=0$, the deterministic ODEs' trajectories do not cross and are uniquely determined by the initial condition $(u_t,v_t,w_t)$ at $t=0$. Hence, each state on the attractor is its own causal state. Retaining information from the past is then superfluous, but only if $(u,v,w)$ is known with infinite precision, since the ODEs' chaotic solutions amplify fluctuations exponentially fast. This is never the case in practice. Therefore, considering small values of $L_X$ and $L_Y$ may still help to better determine the causal states. We use $L_X=L_Y=5$ in the reconstructions.
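Assembling the $(x, y)$ pairs from a sampled trajectory is then a simple windowing operation. The sketch below assumes contiguous, overlapping windows with one pair per time index; the paper's exact windowing convention may differ.

```python
import numpy as np

def past_future_pairs(series, Lx=5, Ly=5):
    """Slice a (T, d) series into overlapping (past, future) window pairs.

    X[k] holds the Lx values preceding time t, Y[k] the Ly values
    following it, for every valid t.  Windowing convention assumed.
    """
    T = len(series)
    X = np.stack([series[t - Lx:t] for t in range(Lx, T - Ly + 1)])
    Y = np.stack([series[t:t + Ly] for t in range(Lx, T - Ly + 1)])
    return X, Y
```

A series of $T$ vector samples thus yields $T - L_X - L_Y + 1$ pairs, each pair sharing a common "present" time $t_0$.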

Figure 6 shows the projections (right) with coordinates $(\psi_1,\psi_2,\psi_3)$, together with the original (pre-reconstruction) attractor data (left) for different noise combinations. Figure 6 (top row) displays the results of kernel $\epsilon$-machine estimation from the noiseless data ($\nu=0$ and $\eta=0$). The second row shows the effects of pure measurement noise ($\nu=1$ and $\eta=0$) on the raw data (left) and on the estimated kernel $\epsilon$-machine (right). Similarly, the last row shows the effect of pure thermal noise ($\nu=0$ and $\eta=1$).

As expected, the structure is well recovered when no noise is added, with only a slight distortion. The causal states are unaffected by coordinate transforms and reparameterizations of $X$ and $Y$ that do not change equivalence of conditional distributions $P(Y|X=x_1)\equiv P(Y|X=x_2)$; the algorithm could therefore very well have found another parametrization of the Lorenz-63 attractor. See also Sec. VI D. The causal states of the original data series are also well recovered even when that series is severely corrupted by strong measurement noise at $\nu=1$ in Fig. 6 (middle row).

To appreciate these results, recall that, in the noiseless case, if $x_1$ and $x_2$ are in the same causal state $x_2\in\epsilon(x_1)$ of the original series, then by definition $\Pr(Y|X=x_1)=\Pr(Y|X=x_2)$. Measurement noise ($\nu>0$), added independently at each time step, does not change this: noisy series $x'$ ending with the same noisy triplet $(u',v',w')$ at the current time $t=t_0$ end up in the same causal state of the noisy system. That state is determined by the current triplet $(u',v',w')$, itself a specific causal state of the deterministic system. Hence, the causal states of the deterministic ODEs are subsets of those of the kernel $\epsilon$-machine estimated from the noisy measured series. We arrive at the important conclusion that, at least for deterministic chaotic systems, the uniqueness and continuity of ODE solutions guarantee that *causal states are unaffected by measurement noise*. This is generally not true for many-to-one maps and other functional state-space transforms that merge states.

In contrast to measurement noise, thermal noise ($\eta>0$) modifies the equations of motion, and the resulting trajectories reflect the accumulated perturbations. Since each state on the (deterministic) attractor is its own causal state, the estimated causal states are modified.

Let us probe the robustness of kernel $\u03f5$-machine estimation. With the parameters detailed below, we obtain an eigenvalue spectrum shown in Fig. 7. There is an inflection point after the first three components. The eigenvalues are remarkably insensitive to a strong measurement noise level $\nu =1$. They are also very robust to the thermal noise $\eta =1$, which induces only some minor eigenvalue changes.

Thus, kernel $\epsilon$-machine estimation achieves a form of denoising beneficial for random dynamical systems. However, the reconstructed causal states reflect the thermal noise induced by $\eta=1$, as can be seen in the fine details of the bottom row of Fig. 6. Note that the algorithm strongly reduces the measurement noise, even when the attractor is very corrupted, while the apparently minor thermal noise is preserved.

#### 2. Kernel *ε*-machine prediction and sensitivity


In Ref. 46, a prediction experiment is performed using an evolution operator computed directly in $(u,v,w)$ space instead of in $S$, as here; recall Sec. V C. That study focuses on how the error due to a small perturbation propagates with the number of prediction steps.

Here, we use a more typical approach to first learn the model—i.e., compute the states, the basis $\Phi $, the evolution operator,…—on a data set of $N=20000$ samples. Then, however, we estimate the prediction error on a completely separate “test” set. This data set is generated with the same parameters as for the training set, but starting from another randomly drawn initial condition. $P=100$ test samples are selected after the transient, each separated by a large subsampling interval so as to reduce their correlation. Unlike the examples in Secs. VI A and VI B, this produces an objective error function—the *prediction accuracy*—useful for cross-validating the relevant meta-parameters.

Due to the computational cost involved with grid searches, we only cross-validate the data kernel $kV$ bandwidth on a reduced data set in preliminary experiments. In Ref. 46, an arbitrary variance was chosen for the distribution used to project $(u,v,w)$ triplets onto the operator eigenbasis. We also cross-validate it to improve the results for a fair comparison with kernel $\u03f5$-machine reconstruction.

Figure 8 presents the results. Forecasts are produced at every $\Delta t=0.05$ interval, and operator exponentiation is used in between, as detailed in Sec. V C. This allows a comparison of trajectories over $500$ elementary integration steps, which is large enough for the trajectory to switch between the attractor's main lobes several times. Such trajectories are computed starting from each of the $100$ test samples.

In the noiseless case $\eta=\nu=0$, the average discrepancy $\langle\|(u,v,w)-(u_p,v_p,w_p)\|\rangle$ measures how predicted triplets $(u_p,v_p,w_p)$ differ from data triplets $(u,v,w)$, averaged at each prediction step over the $100$ trajectories.
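This discrepancy measure is a per-step average of Euclidean distances over the test ensemble; a minimal sketch (array layout assumed):

```python
import numpy as np

def mean_discrepancy(true_traj, pred_traj):
    """Average Euclidean discrepancy <|| (u,v,w) - (u_p,v_p,w_p) ||>
    at each prediction step, over an ensemble of test trajectories.

    Both arrays are assumed to have shape (n_trajectories, n_steps, 3);
    the result has shape (n_steps,)."""
    return np.linalg.norm(true_traj - pred_traj, axis=-1).mean(axis=0)
```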

In the noisy cases, it makes little sense to test the algorithm's ability to reproduce noisy data. We instead test its ability to accurately predict the noiseless series $(u,v,w)_{t>t_0}$ based on noisy past observations $(u',v',w')_{t\le t_0}$. This is easily done for measurement noise $\nu>0$, for which the original noiseless series is available by construction.

For simulated thermal noise $\eta>0$, all we have is a particular realization of an SDE, but no clean reference. Starting from the current noisy values $(u',v',w')$ at each prediction point, we evolve a noiseless trajectory with the basic Lorenz ODEs. Since we use an isotropic centered Wiener process $\eta\,dW$, that trajectory is also the ensemble average over many realizations of the SDE starting from that same $(u',v',w')$ point. It makes more sense to predict that ensemble average from the current noisy causal-state estimate, rather than a particular SDE trajectory realization.

The results in Fig. 8 show a clear gain in precision when using the RKHS method, both in the unperturbed data case and when data are perturbed by measurement noise $\nu =1$. This gain persists until the trajectory becomes completely uncorrelated with the original prediction point. The situation is less favorable for thermal noise $\eta =1$.

Figure 9 presents a sensitivity analysis that focuses on predictions after $50$ time steps. For lower noise levels $\eta<1$, the RKHS method still improves the prediction error, while the operator in data space does not seem to be sensitive to $\eta$. We note (not shown here) that, for longer time scales, the RKHS method may produce worse results on average. The handling of measurement noise $\nu<1$ also favors the RKHS method, consistent with the above results.

### D. Behavior with high-dimensional data and attractors

This is all well and good, but do attractor reconstruction and denoising capabilities hold in higher dimensions? In fact, it has been known for decades^{61} that the Lorenz-63 system is special and very easily reconstructed. This is due to its high state-space volume contraction rate and its simple, smooth vector field, which sports only two nonlinear terms.

To address these issues, we employ the Lorenz 1996 model,^{62} with a tunable dimension parameter $D$, defined for $i=1,\ldots,D$ by $du_i/dt=-u_{i-2}u_{i-1}+u_{i-1}u_{i+1}-u_i+F$, with modulo arithmetic on the indices. We use $F=8$, which yields chaotic dynamics. In order to better cover all of the attractor, series are generated from ten different starting points taken at random in the basin of attraction, after discarding a sufficiently long transient. $N$ samples of $(x,y)$ pairs are collected from these series, using history lengths of $L_X=L_Y=5$ values of the vectors $u$. The projections of these series along the first three dimensions of the attractor for $D=5$ and $D=100$ are shown in the left panels of Fig. 10.
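The Lorenz-96 vector field above maps naturally onto `np.roll` for the periodic indexing. This sketch uses a plain Euler step; the paper's integrator is not specified, so a higher-order scheme may well have been used in practice.

```python
import numpy as np

def lorenz96_step(u, dt=0.01, F=8.0):
    """One Euler step of the Lorenz-96 model
        du_i/dt = -u_{i-2} u_{i-1} + u_{i-1} u_{i+1} - u_i + F
    with periodic (modulo) indexing implemented via np.roll."""
    # roll(u, 1)[i] = u_{i-1}, roll(u, 2)[i] = u_{i-2}, roll(u, -1)[i] = u_{i+1}
    dudt = (np.roll(u, -1) - np.roll(u, 2)) * np.roll(u, 1) - u + F
    return u + dt * dudt
```

The uniform state $u_i = F$ is a fixed point of the dynamics, which gives a quick correctness check of the indexing.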

To test the algorithm's reconstruction performance, we embed the time series in a $1000$-dimensional space using random projections: $V=UR$, with $R$ a matrix of random components of size $D\times1000$, drawn from a normal distribution with standard deviation $1/D$. $U$ holds the collected samples, organized in a matrix of size $N\times D$. In practice, in such a high dimension, the projection directions in $R$ are nearly orthogonal. In addition, Gaussian noise with variance $\nu^2=1/D$ is added to each component of $V$, similar to the setup in Sec. VI C 1. We use the resulting data $W=V+\mathrm{noise}$ as input to the algorithm: truly $1000$-dimensional data, now leaking into all dimensions thanks to the added noise, while retaining the overall lower, $D$-dimensional structure of the Lorenz-96 attractor. This experiment seeks to reconstruct that hidden $D$-dimensional structure solely from noisy $1000$-dimensional time series.
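The embedding setup just described is straightforward to reproduce; the function name and seed handling below are illustrative, but the projection and noise scales follow the text.

```python
import numpy as np

def embed_high_dim(U, target_dim=1000, seed=0):
    """Embed N x D samples U into a high-dimensional space: V = U R with
    R ~ Normal(0, (1/D)^2) of size D x target_dim, then add Gaussian
    measurement noise of variance 1/D, per the setup described above."""
    rng = np.random.default_rng(seed)
    N, D = U.shape
    R = rng.normal(0.0, 1.0 / D, size=(D, target_dim))   # std dev 1/D
    V = U @ R
    W = V + rng.normal(0.0, np.sqrt(1.0 / D), size=V.shape)  # variance 1/D
    return W, R
```

The noisy data `W` would then be passed to the reconstruction algorithm, while `pinv(R)` allows projecting it back to the original space for visualization, as in Fig. 10.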

The middle panels of Fig. 10 show the corrupted $W$ data, projected back into the original space using the pseudo-inverse $\mathrm{pinv}(R)$, for $D=5$ and $D=100$. The scaling of the noise variance ensures that the noise strength is similar irrespective of $D$.

The right panels of Fig. 10 show the well-reconstructed attractor. Moreover, the denoising of chaotic attractors observed in Sec. VI C 1 is also well reproduced in this markedly more complicated setting. Interestingly, the first three coordinates of the reconstructed attractor do not, and need not, match those of the original $U$ series. Indeed, the causal states are equivalence classes of conditional distributions and, as such, are invariant under reparameterizations of the original data that preserve these equivalence classes, such as coordinate transformations.

We see that the algorithm finds a parametrization where each added coordinate best encodes the conditional distributions in the low-dimensional coordinate representation, as explained in Sec. VI C. Yet, that parametrization encodes each and every initial $D$ component, even though it is presented with $1000$-dimensional noisy time series of lengths $LX=LY=5$. This is shown in Fig. 11 and Table I, which clearly demonstrate spectral gaps at exactly $D$ reconstructed components. These spectral gaps are more pronounced as the number of samples $N$ increases, as shown in Table II. As expected, a larger number of samples $N$ is required for properly capturing the spectral gaps as the dimension $D$ increases.

Table I. Eigenvalue ratios as a function of the attractor dimension $D$.

| Dim. $D$ | 5 | 10 | 15 | 20 | 25 | 30 | 35 | 40 | 45 | 50 | 55 | 60 | 65 | 70 | 75 | 80 | 85 | 90 | 95 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $100\,\lambda_{D+1}/\lambda_1$ | 0.31 | 0.23 | 0.26 | 0.24 | 0.24 | 0.24 | 0.22 | 0.25 | 0.26 | 0.21 | 0.23 | 0.23 | 0.23 | 0.22 | 0.23 | 0.26 | 0.24 | 0.20 | 0.25 | 0.21 |
| $\lambda_D/\lambda_{D+1}$ | 60.3 | 64.1 | 51.2 | 49.3 | 42.2 | 35.2 | 34.9 | 30.8 | 26.8 | 26.4 | 25.1 | 24.1 | 22.0 | 20.7 | 18.5 | 18.1 | 15.8 | 14.3 | 14.5 | 13.9 |

Table II. Eigenvalue ratios for the largest dimensions $D$ at increasing sample sizes $N$.

|  | $N=5000$ |  |  | $N=10\,000$ |  |  | $N=20\,000$ |  |  |
|---|---|---|---|---|---|---|---|---|---|
| Dim. $D$ | 90 | 95 | 100 | 90 | 95 | 100 | 90 | 95 | 100 |
| $100\,\lambda_{D+1}/\lambda_1$ | 0.22 | 0.18 | 0.21 | 0.20 | 0.25 | 0.21 | 0.21 | 0.20 | 0.21 |
| $\lambda_D/\lambda_{D+1}$ | 6.3 | 5.8 | 5.0 | 14.3 | 14.5 | 13.9 | 26.1 | 25.6 | 26.5 |

## VII. CONCLUSION

We introduced kernel $\u03f5$-machine reconstruction—a first-principles approach to empirically discovering causal structure. The main step was to represent computational mechanics’ causal states in reproducing kernel Hilbert spaces. This gave a mathematically principled method for estimating optimal predictors of minimal size and so for discovering causal structure in data series of wide-ranging types.

Practically, it extends computational mechanics to nearly arbitrary data types (at least, those for which a characteristic reproducing kernel exists). Section III A showed, though, that this includes heterogeneous data types via kernel compositions.

Based on this, we presented theoretical arguments and analyzed cases for which causal-state trajectories are continuous. In this setting, the kernel $\u03f5$-machine is equivalent to an Itô diffusion acting on the structured set of causal states—a time-homogeneous, state-heterogeneous diffusion. The generator of that diffusion and its evolution operator can be estimated directly from data. This allows efficiently evolving causal-state distributions in a way similar to a Fokker–Planck equation. This, in turn, facilitates predicting a process in its original data space in a new way, one particularly suited to time series analysis.

Future efforts will address the introduction of discontinuities, which may arise for reasons mentioned in Sec. IV A. This will be necessary to properly handle cases where data sampling has occurred above the scale at which time and causal-state trajectories can be considered continuous. Similarly, when the characteristic scale of the observed system dynamics is much larger than the sampling scale, a model reproducing the dynamics of the sampled data may simply not be relevant. Extensions of the current approach are thus needed, possibly incorporating jump components to properly account for a measured system’s dynamics at different scales. This, of course, will bring us back to the computational mechanics of renewal and semi-Markov processes.^{5}

Another future challenge is to extend kernel $\u03f5$-machine reconstruction to spatiotemporal systems, where temporal evolution depends not only on past times but also on spatially nearby state values.^{8,27,63,64} An archetypal example of these systems is found with cellular automata. In fact, any numerical finite element simulation performed on discrete grids also falls into this category, including reaction–diffusion chemical oscillations and hydrodynamic flows. Spacetime complicates the definition of the evolution operator compared to that of time series. However, the applicability of kernel $\u03f5$-machine reconstruction would be greatly expanded to a large category of empirical data sets.

Last but not least are the arenas of information thermodynamics and stochastic thermodynamics. In short, it is time to translate recent results on information engines and their nonequilibrium thermodynamics^{21–23,28} to this broadened use of computational mechanics. This, in addition to stimulating theoretical advances, has great potential for providing new and powerful tools for analyzing complex physical and biological systems, not only their dynamics and statistics, but also their energy use and dissipation.

## ACKNOWLEDGMENTS

The authors thank Alexandra Jurgens for providing the code used for generating samples from the “mess3” machine used in Sec. VI B. We also thank Tyrus Berry for useful exchanges over nonlinear dimension reduction through diffusion map variants and operator estimation methods.^{46,49,50} The authors acknowledge the kind hospitality of the Institute for Advanced Study at the University of Amsterdam. This material is based upon work supported by, or in part by, Inria’s CONCAUST exploratory action, the Foundational Questions Institute (Grant No. FQXi-RFP-IPW-1902), the U.S. Army Research Laboratory and the U.S. Army Research Office (Grant No. W911NF-18-1-0028), and the U.S. Department of Energy (Grant No. DE-SC0017324).

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors declare no conflict of interest.

## DATA AVAILABILITY

The source code for the method described in this document is provided as free/libre software and is available from this page: https://team.inria.fr/comcausa/continuous-causal-states/. Experiments in Secs. VI A–VI C can be reproduced by retrieving the tagged version “first_arxiv” from the GIT archive and the experiment in Sec. VI D with the tagged version “chaos_submission.”

## REFERENCES

*Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics* (PMLR, 2013), pp. 289–297; see https://proceedings.mlr.press/v31/goerg13a.html.

*et al.*, “DisCo: Physics-based unsupervised discovery of coherent structures in spatiotemporal systems,” in *2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC)* (IEEE, 2019), pp. 75–87.

*2007 International Joint Conference on Neural Networks*(IEEE, 2007), pp. 512–517.

*Hidden Markov Models: Estimation and Control*, Applications of Mathematics Vol. 29 (Springer, New York, 1995).

*Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference (UAI 2004)*, edited by M. Chickering and J. Y. Halpern (AUAI Press, Arlington, VA, 2004), pp. 504–511.

*Proceedings of the 26th Annual International Conference on Machine Learning*(ACM, 2009), pp. 961–968.

*Proceedings of the Fifth Annual Workshop on Computational Learning Theory*(ACM, 1992), pp. 144–152.

*Proceedings of the 29th International Conference on Machine Learning*(ACM, 2012).

*International Conference on Computational Learning Theory*(Springer, 2001), pp. 416–426.

*2009 IEEE International Workshop on Machine Learning for Signal Processing*(IEEE, 2009), pp. 1–6.

*International Conference on Artificial Intelligence and Statistics*(PMLR, 2020), pp. 993–1004.

*Proceedings of the 25th International Conference on Machine Learning*(ACM, 2008), pp. 992–999.

*Proceedings of the Seminar on Predictability*(ECMWF, 1996), Vol. 1.

*AAAI Fall Series 2020 Symposium on Physics-Guided AI for Accelerating Scientific Discovery*, arXiv:2010.05451 (2020).