Isolating slower dynamics from fast fluctuations has proven remarkably powerful, but how do we proceed from partial observations of dynamical systems for which we lack underlying equations? Here, we construct maximally predictive states by concatenating measurements in time, partitioning the resulting sequences using maximum entropy, and choosing the sequence length to maximize short-time predictive information. Transitions between these states yield a simple approximation of the transfer operator, which we use to reveal timescale separation and long-lived collective modes through the operator spectrum. Applicable to both deterministic and stochastic processes, we illustrate our approach through partial observations of the Lorenz system and the stochastic dynamics of a particle in a double-well potential. We use our transfer operator approach to provide a new estimator of the Kolmogorov–Sinai entropy, which we demonstrate in discrete and continuous-time systems, as well as the movement behavior of the nematode worm *C. elegans*.

A complex structure often arises from a limited set of simpler elements, such as novels from letters or proteins from amino acids. In musical composition, for example, we experience such construction in time: sounds and silences combine to form motifs; motifs form passages, which in turn form movements. But how can we generally identify variables which distinguish structures across timescales, especially under typical conditions of nonlinear dynamics which are measured only incompletely? Here, we introduce a framework for finding effective coarse-grained variables by leveraging the transfer operator evolution of ensembles. Just as a musical piece transitions from one movement to the next, the ensemble dynamics consists of transitions between local collections of states. We use operator dynamics to guide the construction of maximally predictive states from incomplete measurements, a reconstruction and partitioning of the underlying state space. We then identify timescale-separated, coarse-grained variables through operator eigenvalue decomposition. The generality of our approach provides an opportunity for insights on long-term dynamics within a wide variety of complex systems.

## I. INTRODUCTION

The constituents of the natural world combine to form a dizzying variety of emergent structures across a wide range of scales. Sometimes, these structures are obvious without referencing their underlying components; we need not, famously, link bulldozers to quarks.^{1} However, many are more ambiguous, especially in complex systems. In these cases, connecting across scales can provide greater insight than from either scale alone.

Statistical physics provides a number of successful examples in which the principled integration of fine-scale degrees of freedom gives rise to coarse-grained theories that successfully capture and predict larger scale structures. However powerful, such approaches generally require either a deep understanding of the underlying dynamics, symmetries, and conservation laws or an appropriate parameterization for methods such as the renormalization group.^{2} We aim to emulate this success while working directly from data, in systems that permit neither detailed equations of motion nor obvious scale separations. How can we find effective discretizations and coarse-grainings in such partially observed complex dynamical systems?

In the absence of theory, measurements typically constitute only a partial observation of the dynamics: one cannot know *a priori* the relevant degrees of freedom to measure. This poses fundamental challenges to predictive modeling. Indeed, successful forecasting of time series data typically requires history-dependent terms, either explicitly through autoregressive models^{3} or more implicitly through recurrent neural networks or hidden Markov models,^{4,5} for example. These methods are justified mathematically through delay-embedding theorems developed by Takens and others,^{6–10} which state that given a generic measurement of a dynamical system, it is possible to reconstruct the state space by concatenating enough time delays of the measurement data. Building on these results, we search for principled coarse-grainings of the dynamics from partial observations.

We trade individual trajectories for ensembles and seek long-lived dynamics in the patterns of stochastic state-space transitions. Formally, we apply the machinery of *transfer operators*^{11–14} to simultaneously reconstruct the dynamics and learn effective coarse-grained descriptions on long timescales. In this framework, nonlinear dynamics are captured through linear yet infinite-dimensional differential operators; the Fokker–Planck equation is a simple example. We focus on the discrete-time evolution of the probability density, which is governed by Perron–Frobenius (PF) operators^{15–18} that “transfer” state-space densities forward in time.

Transfer operators are invariant to smooth transformations of the state-space coordinates.^{19–22} This makes them especially convenient when working directly with incomplete time series measurements, from which it is generally possible to obtain a topologically equivalent state-space reconstruction through delay embedding.^{6–8,23–25} Beyond Fokker–Planck, transfer operators also describe other white-noise processes and even deterministic dynamics,^{17,18,26} such as the Liouville operator of classical Hamiltonian systems. We leverage this universality to study stochastic and deterministic systems through a single framework.

The operator approach affords an additional important advantage: the timescales of the system are naturally ordered by the eigenvalue spectrum, even when this ordering is hidden or subtle in the original observations. Correspondingly, the eigenfunctions capture emergent coarse-grained dynamics, such as transitions between metastable states. In the context of chemical kinetics, for example, operator eigenfunctions have been shown to provide effective reaction coordinates or order parameters.^{27–29}

Here, from partial observations of a complex dynamical system, we simultaneously reconstruct a maximally Markovian state space through delay embedding and partition the resulting reconstruction to obtain a finite approximation of the PF operator as a Markov transition matrix.^{14,18} We then use the eigendecomposition of the transition matrix to connect fine-scale and coarse-grained descriptions. In Sec. II, we introduce “maximally predictive states,” constructed by optimizing both the number of time delays and the spatial discretization scales given the constraints imposed by the data. In Sec. III, we coarse-grain the dynamics by using the eigenspectrum of the transition matrix to maximize the dynamical coherence of groups of fine-scale states. Finally, in Sec. IV, we use our transfer operator dynamics to provide a novel estimator of the Kolmogorov–Sinai (KS) entropy rate.

## II. MAXIMALLY PREDICTIVE STATE-SPACE RECONSTRUCTION

We consider the general setting of the time evolution of a system's state $\vec{x}$ through a set of differential equations $\dot{\vec{x}}=f(\vec{x})$, where $f$ encompasses both deterministic and noisy influences. Measurement data typically result from a generic nonlinear, possibly noisy, observation function^{30,31} $M$ that maps the unobserved state $\vec{x}(t)\in\mathbb{R}^D$, $t\in\mathbb{R}_0^+$, into a discrete time series $\vec{y}(t)=M(\vec{x}(t))\in\mathbb{R}^d$, $t\in\{0,\delta t,\ldots,T\}$, where generally $d<D$. This means that we typically measure only a subset of the variables that define the state $\vec{x}$, and with finite precision. The unobserved degrees of freedom induce history dependence in $\vec{y}(t)$, which depends not only on $\vec{y}(t-\delta t)$ but also on past observations. Formally, the measurement function $M$ imposes a Mori–Zwanzig projection of the dynamics^{32,33} such that
$$\vec{y}(t+\delta t) = f_0\!\left(\vec{y}(t)\right) + \sum_{k=1}^{K^*} f_k\!\left(\vec{y}(t-k\,\delta t)\right) + \eta_t,$$
where $f_0$ is a Markovian term, $f_k$ captures non-Markovian effects over a timescale $K^*$, and $\eta_t$ are fluctuations orthogonal to the projection $M$. This means that neglecting the history dependence imposed by this projection results in a systematic source of error,^{25,34,35} which we must resolve in order to build an accurate predictive model of the dynamics.

Instead of modeling $f_k$ explicitly, we search for a representation in which the dynamics is maximally Markovian. We concatenate $K-1$ time delays of the measurement time series, yielding a candidate state space $X_K(t)\in\mathbb{R}^{d\times K}$, Fig. 1(a). Including past information (larger $K$) decreases the uncertainty of the immediate future, thus increasing the predictability of the state space. We search for the number of delays $K^*$ that maximizes this predictability, as quantified through the short-time entropy rate.^{36}
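The delay-concatenation step can be sketched numerically. The `delay_embed` helper below is a hypothetical illustration in NumPy, not code from our analysis pipeline:

```python
import numpy as np

def delay_embed(y, K):
    """Stack K time delays of a measurement series y (shape (T,) or (T, d))
    into candidate states X_K of shape (T - K + 1, d * K)."""
    y = np.asarray(y, dtype=float)
    if y.ndim == 1:
        y = y[:, None]
    T = y.shape[0]
    # column block k holds the measurement shifted by k samples
    return np.hstack([y[k:T - K + 1 + k] for k in range(K)])

# partial observation: a single scalar coordinate of some larger system
x = np.sin(np.arange(0.0, 10.0, 0.01))
X = delay_embed(x, K=7)
print(X.shape)  # (994, 7): each row is (x(t), x(t + dt), ..., x(t + 6 dt))
```

Each row of `X` is one candidate state $X_K(t)$; the rows are then partitioned into Voronoi cells as described below.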

We bin $X=\{X_K(0),X_K(\delta t),\ldots,X_K(T)\}$ into a discrete partitioned space with $N$ Voronoi cells, $s_X=\{s_i(0),s_j(\delta t),\ldots,s_k(T)\}$, and estimate the short-time entropy rate^{37} as
$$h_{\delta t}(N,K) = -\frac{1}{\delta t}\sum_{i,j}\pi_i\,P_{ij}(\delta t)\log P_{ij}(\delta t),$$
where $P_{ij}(\delta t)=p(s_j(t+\delta t)\,|\,s_i(t))$ is a row-stochastic Markov chain and $\pi$ is the stationary distribution of $P$, Fig. 1(a). We note that $P$ is an approximation of the PF transfer operator that evolves densities on the reconstructed state space (Methods).

The behavior of $h_{\delta t}(N,K)$ with $N$, or equivalently with a typical length scale $\epsilon\sim 1/N$, is a non-decreasing function that is indicative of different classes of dynamics.^{38} For example, stochastic processes possess information on all length scales, and so $h_{\delta t}(N\to\infty,K)=\infty$, whereas the fractal nature of deterministic chaotic systems yields a typical length scale below which the entropy stops changing: $h_{\delta t}(N^*=1/\epsilon^*,K)=h_{\delta t}(\infty,K)$. In practice, however, finite-size effects yield an underestimation of the entropy rate for large $N$. With the aim of preserving as much information as possible given the data and our partitioning scheme, we set the number of partitions as the largest $N$ after which the entropy rate stops increasing, Fig. 1(a). We thus maximize the entropy with respect to the number of partitions, obtaining $h_{\delta t}(K)=\max_N h_{\delta t}(N,K)$.
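As a minimal numerical sketch of the short-time entropy rate in Eq. (1), assuming the cell labels have already been obtained (e.g., from a k-means/Voronoi partition of the delay vectors; the helper name is ours):

```python
import numpy as np

def entropy_rate(labels, N, dt):
    """Short-time entropy rate of a partitioned trajectory:
    h = -(1/dt) sum_i pi_i sum_j P_ij log P_ij,
    with P the row-stochastic transition matrix between the N cells
    and pi its empirical occupancy (stationary) distribution."""
    C = np.zeros((N, N))
    np.add.at(C, (labels[:-1], labels[1:]), 1.0)  # count cell-to-cell transitions
    P = C / np.maximum(C.sum(axis=1, keepdims=True), 1.0)
    pi = C.sum(axis=1) / C.sum()
    logP = np.where(P > 0, np.log(P), 0.0)        # 0 log 0 := 0
    return -np.sum(pi[:, None] * P * logP) / dt

rng = np.random.default_rng(0)
coin = rng.integers(0, 2, size=100_000)   # maximally unpredictable: h = log 2
cycle = np.tile([0, 1], 50_000)           # fully predictable: h = 0
print(entropy_rate(coin, 2, 1.0), entropy_rate(cycle, 2, 1.0))
```

In practice, one would scan this quantity over $N$ (keeping the largest value) and over $K$ until the decrease $\delta h$ between successive $K$ is negligible.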

For sufficiently short $K\delta t$, $h_{\delta t}(K)$ is a monotonically non-increasing function of $K$, and so we add time delays until we find $K^*$ such that $\delta h=h_{\delta t}(K^*+1)-h_{\delta t}(K^*)\approx 0$, capturing the amount of memory sitting in the incomplete measurements^{25,39,40} and obtaining a maximally predictive description of the dynamics. We note that $\delta h$ has been previously used to define measures of forecasting complexity in dynamical systems^{41} and is the amount of information that has to be kept in the $K-1$ time delays for an accurate forecast of the next time step. Notably, $K^*$ is related to the intrinsic dimension $D$ of the state space: given a generic measurement function,^{30,31} delay-embedding theorems^{6–10} guarantee a Markovian reconstruction when the number of time delays is larger than twice the dimension of the underlying state space, $K>2D$.

In Sec. II A, we illustrate our approach to building maximally predictive states with applications to the stochastic equilibrium dynamics of a particle in a double-well potential and the dissipative chaotic dynamics of the Lorenz system.

### A. Illustrative applications

Consider the underdamped Langevin dynamics of a particle in a double-well potential at thermal equilibrium, Fig. 2(a, top),
$$dx = v\,dt,\qquad dv = -\left[\gamma v+\partial_x U(x)\right]dt+\sqrt{2\gamma\beta^{-1}}\,dW_t,$$
where $\beta^{-1}=k_B T$, with $k_B$ and $T$ the Boltzmann constant and temperature, respectively, and $dW_t$ is the Wiener process capturing the much faster random collisions with the heat bath. We sample at $\delta t=0.05\,\mathrm{s}$ for $T=10^6\,\mathrm{s}$, with temperatures ranging between $\beta^{-1}=0.5\,J$ and $\beta^{-1}=2.5\,J$ (Appendix B); an example trajectory for $\beta^{-1}=0.5\,J$ is shown in Fig. 2(a, top). We emulate a real-data example by sampling only $x$ at discrete time steps, from which we seek to reconstruct the state space by including time delays. For simplicity, we here take a linear measurement function with noise coming from the finite precision of the simulation protocol. Nonetheless, our results generalize to more complex functions and added measurement noise, as previously discussed in, e.g., Refs. 42 and 43. We concatenate the $x$ measurement with $K$ delays, partition the resulting state with $N$ partitions, and study the behavior of the short-time entropy rate as a function of $K$ and $N$ [see Fig. 2(a, middle) and Fig. S1(a) in the supplementary material]. The stochastic nature of the dynamics yields $\lim_{N\to\infty}h_{\delta t}(N,K)=\infty$, up to finite-sampling effects, which produce an underestimation of the entropy^{38,44} for large $N$, a drop we see most visibly for $K=1\,\mathrm{frame}$. At intermediate $N$, where estimation effects are negligible, there is an abrupt change in the entropy rate from $K=1$ to $K\geq 2\,\mathrm{frames}=0.1\,\mathrm{s}$, indicative of the finite memory sitting in the measurement time series^{45–48} from the missing degree of freedom.^{49} The behavior of the entropy with $K$ and $N$ is qualitatively conserved across the range of temperatures studied here [Fig. S2(a) in the supplementary material], and we choose $K^*=7\,\mathrm{frames}=0.35\,\mathrm{s}$ for subsequent analysis. Our reconstructed state recovers the equipartition theorem, Fig. S2(b) in the supplementary material, showing that momentum information is accurately captured by the reconstructed state.
In addition, our effective model with $K^*$ produces realistic simulations of the position dynamics, as illustrated in Fig. 2(a, bottom) and Fig. S3(a) in the supplementary material.
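Generating surrogate trajectories from an inferred operator amounts to sampling the Markov chain over cells; a toy sketch (the two-cell matrix is illustrative, not the fitted operator):

```python
import numpy as np

def simulate(P, s0, steps, rng):
    """Sample a symbolic trajectory from a row-stochastic transition matrix P."""
    states = np.empty(steps, dtype=int)
    states[0] = s0
    for t in range(1, steps):
        states[t] = rng.choice(P.shape[0], p=P[states[t - 1]])
    return states

rng = np.random.default_rng(1)
# toy 2-cell model: strongly persistent cells with rare hops (a crude double well)
P = np.array([[0.95, 0.05],
              [0.05, 0.95]])
traj = simulate(P, 0, 2000, rng)
print(traj[:20])
```

In the full pipeline, each sampled cell label would be mapped back to a representative point of its Voronoi cell in the delay space (e.g., the cell centroid) to produce a simulated measurement time series.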

Consider next the chaotic dynamics of the Lorenz system,^{50}
$$\dot{x}=\sigma(y-x),\qquad \dot{y}=x(\rho-z)-y,\qquad \dot{z}=xy-\beta z,$$
with $\sigma=10$, $\rho=28$, and $\beta=8/3$, Fig. 2(b, top). Unlike the double-well dynamics, the fractal structure of the Lorenz attractor does not technically satisfy the requirements of our modeling approach, since our partition scheme (into disjoint, non-overlapping sets) assumes a smooth invariant measure rather than a fractal set. Nonetheless, we show that the same framework accurately reconstructs the state space of the Lorenz system and recovers effective coarse-grained states from partial observations.

We measure only $x$ at $\delta t=0.01\,\mathrm{s}$ for $T=5\times 10^5\,\mathrm{s}$ (Appendix B) and reconstruct the state space as before: concatenating $K$ time delays and studying the behavior of the short-time entropy rate with $N$ and $K$, Fig. 2(b, middle). As expected for deterministic chaos,^{38} we observe that the short-time entropy rate reaches an asymptotic value for increasing $N$, $h_{\delta t}(N>N^*,K)\approx h_{\delta t}(N^*,K)$. As a function of $K$, the short-time entropy rate exhibits a sharp decrease, becoming constant after $K\gtrsim 10\,\mathrm{frames}$, Fig. S1(b) in the supplementary material, indicating that we have reached a maximally predictive state. We use $K^*=12\,\mathrm{frames}=0.12\,\mathrm{s}$ to reconstruct the state space. Importantly, we note that our partition of the state space results in an overestimation of the Kolmogorov–Sinai entropy, which is only attainable in the limit $\delta t\to 0$ and $\epsilon\sim 1/N\to 0$, a point we return to in Sec. IV. Nonetheless, the state space reconstructed with our choice of $K^*$ is equivalent to the underlying state of the system: the short-time entropy rate computed from the underlying $(x,y,z)$ state space across $N$ matches that of the state reconstructed with $K^*$ time delays [dashed black lines in Fig. 2(b, middle)]. In addition, the inferred maximally predictive states and the resulting Markov approximation of the dynamics allow us to generate realistic simulations of the $x$ time series, as illustrated in Fig. 2(b, bottom) and Fig. S3(b) in the supplementary material. Finally, we also note that the reconstructed transfer operator dynamics yields accurate estimates of the information dimension^{51} (Fig. S4 in the supplementary material).

## III. EXTRACTING COARSE-GRAINED DYNAMICS

In the state-space reconstruction, we encode the ensemble dynamics in a matrix $P_{ij}$ constructed by counting transitions between cells $s_i$ and $s_j$ over a transition time $\tau$: formally, this is a discrete-space approximation of the PF operator $\mathcal{P}_\tau$,^{14} Fig. 1(a, right). This object plays a central role in our analysis: it not only guides the state-space reconstruction but also provides the means for a principled coarse-graining of the state space through its eigenvalue spectrum. The eigenvalues of $P_{ij}$ directly reveal timescale separation, and the corresponding eigenvectors can be used to identify regions of state space where the system lingers; these are "macroscopic" metastable states.^{14,52} The structure of these regions and the kinetics between them offer a principled coarse-graining of the original dynamics.

### A. Choosing a transition time

While the instantaneous ensemble dynamics corresponding to $\dot{\vec{x}}=f(\vec{x})$ is given by $\dot{\rho}=\mathcal{L}\rho$, the finite-time transfer operator $\rho_{t+\tau}=\mathcal{P}_\tau\rho_t$ is more immediately available when working with discrete-time measurements. The operators $\mathcal{L}$ and $\mathcal{P}_\tau$ share the same set of eigenfunctions, while the eigenvalues $\lambda_i^*$ of $\mathcal{P}_\tau$ are exponential functions of the eigenvalues $\Lambda_i^*$ of $\mathcal{L}$, $\lambda_i^*=e^{\Lambda_i^*\tau}$. When the estimated eigenvalues $\lambda_i$ are real, the relaxation of the inferred density dynamics is characterized by the implied relaxation times corresponding to each eigenvalue,
$$t_i^{\mathrm{imp}}(\tau) = -\frac{\tau}{\log \lambda_i(\tau)}.$$
$\mathcal{P}_\tau$ lives in an infinite-dimensional function space, which we discretize using $N$ basis functions: in our partitioned space, the basis functions are characteristic functions and the measure is piecewise constant (Appendix A). The truncation at finite $N$ erases fine-scale information within each partition, and therefore the ability of the transfer operator approximation $P_{ij}$ to capture the large-scale dynamics depends on the transition time $\tau$, a parameter that we vary. For the state-space reconstruction described in Sec. II, we chose $\tau$ as the sampling time, $\tau=\delta t$, in order to maximize short-time predictability. However, to accurately capture longer-time dynamics and metastable states, we vary $\tau$ and study changes in the inferred spectrum, Fig. 1(b).

For $\tau\to 0$, $P_{ij}$ is close to an identity matrix (the system has little time to leave its current partition), and all eigenvalues are nearly degenerate and close to unity, Fig. 1(b). For $\tau$ much longer than the mixing time of the system, the eigenvalues start collapsing, and in the limit $\tau\to\infty$, $P_{ij}(\tau)\sim\pi_j$: the probabilities of two subsequent states separated by such $\tau$ become independent, the eigenvalues of $P_{ij}(\tau)$ stop changing, and $t^{\mathrm{imp}}$ grows linearly with $\tau$ for all eigenfunctions. In this regime, the transition probability matrix contains noisy copies of the invariant density, akin to a shuffle of the symbolic sequence. This yields an effective noise floor, which is observed earlier (at shorter $\tau$) for faster-decaying eigenfunctions. For intermediate $\tau$, we find a spectral gap, indicating that the fast dynamics have relaxed and we can isolate the slow dynamics. In addition, the Markovian nature of the dynamics implies that the inferred relaxation times reach a constant value that matches the underlying long timescale $\Lambda_2^{-1}$ of the system:^{53} the Chapman–Kolmogorov equation (see, e.g., Ref. 54) is verified, $P_{n\tau}\rho=P_\tau^n\rho$, and thus $t_i^{\mathrm{imp}}(n\tau)=-n\tau/\log(\lambda_i(\tau)^n)=t_i^{\mathrm{imp}}(\tau)$. We choose $\tau^*$ after the initial transient, searching for a nearly constant $\hat{\Lambda}_2$ and a large spectral gap [green shaded region in Fig. 1(b)].
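The constancy of the implied timescales under the Chapman–Kolmogorov relation can be checked directly on a toy two-cell Markov model (the matrix values are illustrative, not fitted to data):

```python
import numpy as np

def implied_timescale(P, tau):
    """t_imp = -tau / log(lambda_2), from the second-largest eigenvalue of P."""
    lam = np.sort(np.real(np.linalg.eigvals(P)))[::-1]
    return -tau / np.log(lam[1])

# toy metastable system: two cells with rare hops between them
P1 = np.array([[0.99, 0.01],
               [0.02, 0.98]])
# Chapman-Kolmogorov: P(n*tau) = P(tau)^n, so t_imp is flat as tau grows
for n in (1, 2, 5, 10):
    print(n, implied_timescale(np.linalg.matrix_power(P1, n), n))
```

For data-estimated operators, deviations of $t^{\mathrm{imp}}(n\tau)$ from constancy signal non-Markovian effects or the noise floor discussed above.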

### B. Identifying metastable states

Within our reconstructed ensemble evolution, coarse-grained dynamics can be identified through state-space regions where the system becomes temporarily trapped; these are metastable states or almost-invariant sets.^{55,56} We search for collections of states that move coherently with the flow: their future states belong to the same macroscopic set. Since the slowest decay to the stationary distribution is captured by the first non-trivial eigenfunction of the transfer operator, we search for the subdivision along this eigenfunction that maximizes the coherence of the resulting macroscopic sets, a previously introduced heuristic.^{55}

A set $S$ is coherent when the system is more likely to remain within the set than to leave it within a time $\tau$. We quantify this intuition by measuring the overlap between a set of states and its time evolution through $\chi_{\mu,\tau}(S)=\frac{\mu(S\cap\Phi_{-\tau}S)}{\mu(S)}=\frac{\mu(\Phi_\tau S\cap S)}{\mu(S)}$, where $\mu$ is the invariant measure preserved by the invertible flow $\Phi_\tau$. Given an inferred transfer operator $P_{ij}(\tau)$ and its associated stationary eigenvector $\pi$, we can immediately compute $\chi$ (Appendix A),
$$\chi_{\mu,\tau}(S) = \frac{\sum_{i\in S}\sum_{j\in S}\pi_i\,P_{ij}(\tau)}{\sum_{i\in S}\pi_i}.$$
To identify optimally coherent metastable states through spectral analysis, we define a time-symmetric (reversibilized) transfer operator, $\mathcal{P}_r\equiv(\mathcal{P}+\mathcal{P}^\dagger)/2$, where $\mathcal{P}^\dagger$ is the dual operator to $\mathcal{P}$, pulling the dynamics backward in time; see Appendix A for the discrete numerical approximation $P_r$. While transfer operators describing ensemble dynamics are not generally symmetric,^{55,57} the definition of coherence is invariant under time reversal: since the measure is time invariant, it does not matter in which direction we look for mass loss from a set.^{58,59} Beyond this invariance property, an important benefit of working with $P_r$ is that its second eigenvector $\varphi_2$ provides an *optimal* subdivision of the state space into almost-invariant sets, as shown in Ref. 58. Given $\tau^*$ and the corresponding $P_r(\tau^*)$, we identify metastable sets by choosing a subdivision along $\varphi_2$ that maximizes an overall measure of coherence,
$$\varphi_2^{c,*} = \underset{\varphi_2^c}{\operatorname{argmax}}\left[\chi_{\mu,\tau}\!\left(S_+(\varphi_2^c)\right)+\chi_{\mu,\tau}\!\left(S_-(\varphi_2^c)\right)\right],$$
where $\{S_+(\varphi_2^c),S_-(\varphi_2^c)\}$ results from a partition at $\varphi_2^c$: the optimal almost-invariant sets are identified with respect to the sign of $\varphi_2-\varphi_2^c$, Fig. 1(b, bottom). In Sec. III C, we illustrate the process of extracting coarse-grained dynamics from incomplete measurements.
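A compact sketch of the reversibilization and the threshold scan along $\varphi_2$, on a toy four-cell chain (the helper names and the example matrix are ours, for illustration only):

```python
import numpy as np

def stationary(P):
    """Stationary distribution: leading left eigenvector of P."""
    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmax(np.real(w))])
    return pi / pi.sum()

def coherence(S, P, pi):
    """chi(S): stationary probability of staying in S over one transition time."""
    return (pi[S, None] * P[np.ix_(S, S)]).sum() / pi[S].sum()

def metastable_split(P):
    """Split the cells into two almost-invariant sets along phi_2 of P_r."""
    pi = stationary(P)
    Pdag = (pi[None, :] * P.T) / pi[:, None]     # time-reversed transition matrix
    Pr = 0.5 * (P + Pdag)                        # reversibilized operator
    w, V = np.linalg.eig(Pr)
    phi2 = np.real(V[:, np.argsort(np.real(w))[::-1][1]])
    vals = np.sort(phi2)
    best = (-np.inf, None, None)
    for c in 0.5 * (vals[:-1] + vals[1:]):       # scan thresholds along phi_2
        Sp, Sm = np.where(phi2 >= c)[0], np.where(phi2 < c)[0]
        if len(Sp) == 0 or len(Sm) == 0:
            continue
        score = coherence(Sp, P, pi) + coherence(Sm, P, pi)
        if score > best[0]:
            best = (score, Sp, Sm)
    return best

# two lumped "wells" of two cells each, with rare transitions between wells
P = np.array([[0.49, 0.49, 0.01, 0.01],
              [0.49, 0.49, 0.01, 0.01],
              [0.01, 0.01, 0.49, 0.49],
              [0.01, 0.01, 0.49, 0.49]])
score, Sp, Sm = metastable_split(P)
print(sorted(map(list, map(sorted, (Sp, Sm)))))  # recovers the two wells
```

The sign structure of $\varphi_2$ recovers the two wells regardless of the arbitrary sign of the eigenvector, as in Fig. 1(b, bottom).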

### C. Illustrative applications

We return to the equilibrium dynamics of a particle in a double-well potential, Eq. (2), Fig. 2(a). Using the maximally predictive reconstructed state space, we approximate the transfer operator with a transition timescale $\tau$. We construct the reversibilized transition matrix $P_r$ and show the 15 slowest inferred relaxation times for $N=1000$ partitions and varying transition times $\tau$, Fig. 3(a, left). The longest relaxation time initially decays and reaches its asymptotic limit after $\tau\sim 5\,\mathrm{s}$. As we increase $\tau$ to a few multiples of the hopping timescale, the eigenvalues stop changing, and $t^{\mathrm{imp}}$ simply grows linearly with $\tau$ [recall $t^{\mathrm{imp}}(\tau)=|\Lambda^{-1}(\tau)|=-\tau/\log\lambda(\tau)$]. Faster-decaying eigenfunctions exhibit this behavior earlier. Importantly, when $t^{\mathrm{imp}}$ is approximately constant, the inferred timescales accurately predict the mean first passage time between potential wells, as expected from theory (see, e.g., Ref. 61). In fact, we find that the inferred $|\hat{\Lambda}_2|$ provides an excellent fit to the hopping rates across temperatures, Fig. 3(a, right). We note that in order to examine different temperatures, the choice of $\tau^*$ should change accordingly to reflect the different resulting dynamics [Fig. S5(b) in the supplementary material]. Increasing the temperature increases the hopping rates, and therefore reduces the amount of time it takes before the system relaxes to the stationary distribution. Nonetheless, the qualitative behavior of the longest inferred implied relaxation times is conserved across temperatures [Fig. S5(b) in the supplementary material]: after a short $\tau$ transient, they quickly converge to the timescale of hopping between wells, before the system completely mixes and the implied timescales grow linearly. To choose $\tau^*$ consistently across temperatures, we find the transition time that minimizes the longest reversibilized $t^{\mathrm{imp}}(\tau)$.
As expected, the resulting $\tau \u2217$ is reduced as we increase the temperature, reflecting the faster dynamics.

To complement the above analysis of stochastic dynamics, we study the operator spectrum in the deterministic chaotic dynamics of the Lorenz system. Chaotic dynamics generally exhibit an intricate interplay of multiple timescales, which is apparent in the eigenvalue structure of $P_r$ for different transition times, Fig. 4(a). Indeed, in the chaotic regime, the Lorenz dynamics meanders around a skeleton of unstable periodic orbits (UPOs),^{62} which is reflected in the periodic behavior of the eigenvalues as a function of $\tau$. Nonetheless, for short timescales, we find a regime in which the slowest implied timescale is approximately constant [Fig. 4(a, inset)], and we choose $\tau^*=0.1\,\mathrm{s}$ for subsequent analysis. Notably, despite the fact that we only have access to partial observations, the reconstructed Markovian dynamics are predictive over timescales of $\langle\tau_{\mathrm{pred}}\rangle=3.04\,(2.99,3.08)$ Lyapunov times, Fig. S7(a) in the supplementary material.

In both our example systems, the structure of the second eigenvector $\varphi_2$ of the reversibilized transition matrix reveals a collective coordinate capturing transitions between coherent structures or almost-invariant sets, Figs. 3(b) and 4(b). To visualize the state space, we perform singular value decomposition (SVD) on the reconstructed trajectory matrices, $X_{K^*}=U\Sigma V^T$, and plot the projection onto the first two singular vectors $(v_1,v_2)$. Notably, in the double-well example, the dominant SVD modes correspond to the position and velocity, $(u_1,u_2)\sim(x,v)$ [see Fig. S2(c) in the supplementary material]. A contour plot of the inferred $\varphi_2$ for the double-well potential in the $(u_1,u_2)$ space shows that the sign of the eigenvector effectively splits the reconstructed state space into the two wells of the system, Fig. 3(b). The same is true across temperatures, Fig. S6 in the supplementary material, with the maximum of $\chi$, Eq. (6), at $\varphi_2^c\approx 0$, Fig. S5(c) in the supplementary material. Notably, due to the underdamped nature of the dynamics, the separation between potential wells includes the high-velocity (large $|u_2|$) transition regions, as expected from theory.^{63} Similarly, the sign of $\varphi_2$ for the Lorenz system divides the state space into its almost-invariant sets, Figs. 4(b) and S7(b) in the supplementary material, which are partially split along the shortest-period UPO,^{60,64} identified here through recurrences^{65} (Appendix A).

In summary, we introduce a general framework for the principled extraction of coarse-grained, slow dynamics directly from time series data, which applies to both deterministic and stochastic systems. To fully capture the short-time dynamics, we concatenate measurements in time and partition the expanded state space while maximizing predictive information. We then leverage the geometric invariance of the transfer operator representation to build an effective model of the dynamics, from which we extract coarse-grained metastable states.

## IV. ESTIMATING THE KOLMOGOROV–SINAI ENTROPY

Our approximation of the transfer operator dynamics yields an approximate model that holds for length scales and timescales beyond those set by the partition length scale $\epsilon\propto 1/N$ and the transition timescale $\tau$. However, certain fundamental properties of the underlying continuous dynamics are not immediately available from this representation. Among these, we here focus on the Kolmogorov–Sinai (KS) entropy, a fundamental measure of the "compressibility" of a dynamical system,^{66} capturing the inherent unpredictability of the dynamics. However important, the KS entropy is also extremely challenging to estimate from time series data:^{67,68} it is defined in the $\epsilon\to 0$, $\tau\to 0$ limit and requires infinitesimal trajectory information to be properly estimated. This is reflected in our overestimation of the KS entropy in the Lorenz system, Fig. 2(b), highlighting the challenge of bridging between a discrete representation and the underlying continuous dynamics. Nonetheless, despite the fact that our transfer operator approximation is constructed for finite $\epsilon$ and $\tau$, we here show that we can establish a connection between our estimates of the short-time entropy rate in a partitioned space and the underlying KS entropy.

Our estimates of the short-time unpredictability of the dynamics through the entropy rate are related to the Shannon–Kolmogorov $(\epsilon,\tau)$-entropy per unit time,^{38}
$$h_{SK}(\epsilon,\tau) = \lim_{T\to\infty}\frac{1}{\tau}\left[H_{SK}(\epsilon,\tau,T+\tau)-H_{SK}(\epsilon,\tau,T)\right],$$
where $H_{SK}(\epsilon,\tau,T)$ is the Shannon–Kolmogorov $(\epsilon,\tau)$-entropy obtained from the probability that segments of length $T$ are within a distance smaller than $\epsilon$. In comparison, our partition-based estimate of the short-time entropy rate, Eq. (1), is equivalent to truncating the limit in Eq. (7) at $T=2\tau=2\delta t$ and setting $\epsilon\sim 1/N$,
$$h_{\delta t}(N,K) \approx \frac{1}{\delta t}\left[H_{SK}(\epsilon,\delta t,2\delta t)-H_{SK}(\epsilon,\delta t,\delta t)\right]_{\epsilon\sim 1/N}.$$
Equivalently, we can define the partition-based Shannon entropy $H(S,\tau ,T)$, where $S$ represents the symbol set composed of $N$ partitions, from which follows the definition of the Kolmogorov–Sinai entropy rate,^{38,69}
$$h_{KS} = \sup_{S}\,\lim_{\tau\to 0}\,\lim_{T\to\infty}\frac{1}{T}\,H(S,\tau,T),$$
which is the supremum of the entropy rate over all possible partitions of the state space.

In practice, the KS entropy is extremely challenging to estimate from time series data due to the several limits in Eq. (8) and the inherent arbitrariness of the choice of $\tau$ and partition $S$. The situation simplifies when the partition is generating,^{14,70} in which case there is a one-to-one mapping between state-space trajectories and the resulting symbolic dynamics, and therefore even a Markovian estimator of the entropy rate will correspond to the KS entropy. However, there are no general algorithms for finding generating partitions, with the exception of simple one-dimensional discrete maps^{67} (e.g., the logistic or tent maps) or two-dimensional maps for which heuristic approximation schemes have been successfully applied^{71–73} (e.g., the Hénon or Ikeda maps). A misplaced partition can result in a severe underestimation of the entropy,^{68} whereas correlations that stem from partial observations and/or an inappropriate partition result in an overestimation of the entropy.^{67} Given a finite partition, one could in principle circumvent the lack of a generating partition by taking the sequence length to infinity, $T\to\infty$. However, a naïve estimator based on counting symbol sequences of increasing length $T$ is in practice unattainable: the number of possible symbolic sequences grows exponentially with the sequence length, making it impossible to reach the moderate to large values of $T$ needed to approach the KS entropy limit. Indeed, this has motivated the development of algorithms for the estimation of entropy in undersampled symbolic sequences (see, e.g., Refs. 74 and 75).

Alternative geometric approaches, such as that of Cohen and Procaccia,^{44} overcome the issues of symbolic estimates of the entropy rate by working directly with the time series data, essentially estimating the probability of finding two trajectories of length $T$ within an $\epsilon$ distance of each other. Given a measurement time series $\{y(t)\}$, where $t\in\{\tau,2\tau,\ldots,T\}$, $\epsilon$-tubes are built around sequences of length $K\tau$, $y_{t:t+K}$, and the entropy is estimated as

where $C_{K\tau}(\epsilon)$ is the correlation function of $K$-dimensional $\epsilon$-tubes built from time series measurements with a sampling time of $\tau$, which essentially measures the probability of finding two $y_{t:t+K}$ vectors within a distance $\epsilon$ of each other. This notion allows for the estimation of a number of ergodic properties of the dynamics, including intrinsic dimensions and entropies. For example, the correlation entropy rate can then be obtained by estimating
$$h_2(\epsilon, \tau) = \lim_{K \to \infty} h(K, \epsilon), \qquad (10)$$
which converges to the Kolmogorov–Sinai entropy in the limits $\epsilon \to 0$ and $\tau \to 0$. Alternatively, for some classes of dynamical systems, one can use the sum of positive Lyapunov exponents to obtain an upper bound on the KS entropy, according to Pesin's theorem.^{76} However powerful, geometric approaches can be challenging to apply directly to data since they are sensitive to the underlying geometry and the precise dimension of the reconstructed state space.

Here, we present a novel estimator of the KS entropy, which combines ideas from previous approaches with the transfer operator formalism to obtain accurate estimates of the KS entropy that are robust to the precise geometry of the reconstructed state space and do not require a generating partition. As in Cohen and Procaccia,^{44} we build trajectory segments of increasing length $T = K\delta t$, but then partition the resulting space and estimate the Markov approximation of the transfer operator for $\tau = K\delta t$, from which we obtain the entropy rate
$$\hat{h}(N, K) = -\frac{1}{K\delta t}\sum_{i,j} \pi_i \, P_{ij}(K\delta t)\,\ln P_{ij}(K\delta t). \qquad (11)$$
We then estimate the KS entropy as a supremum over $N$, $\hat{h}(K) = \max_N \hat{h}(N, K)$, estimate $\hat{h}(K)$ for an increasing number of delays, and probe the $K \to \infty$ limit (see Appendix C),
$$\hat{h} = \lim_{K \to \infty} \hat{h}(K). \qquad (12)$$
In Fig. 5, we illustrate our estimation procedure with applications in discrete maps (logistic and Hénon) and in the continuous-time dynamics of the Lorenz system and the Rössler system, as well as a real-world data example from *C. elegans* locomotion, where a complementary estimate of the entropy rate derived from Lyapunov exponents already exists.^{77} We assess the accuracy of our KS entropy estimates by comparing against the sum of positive Lyapunov exponents.^{76} For the discrete maps, we find that the direct estimates of $\hat{h}(K)$ provide an accurate approximation of the KS entropy, provided that we choose $K$ before finite-size effects become evident, as shown in Fig. S8(a) of the supplementary material. In contrast, for continuous-time dynamics, we find that it is often challenging to reach the KS entropy limit directly, and we instead estimate the $K \to \infty$ limit by extrapolation,^{78} Fig. S8(b) in the supplementary material (Appendix C). In both discrete and continuous systems, our entropy estimates from Eq. (12) are in excellent agreement with the sum of positive Lyapunov exponents.

## V. DISCUSSION

When seeking order in complex systems, many simplifications are possible, and this choice has consequences for both the nature and complexity of the resulting model. Often, such simplifications are chosen *a priori*, for example, a particular discretization, and the complexity of the model is then taken as a fact of the underlying dynamics. However, this is not generally true; the definition of the states and the resulting dynamics between them are inextricably linked. Here, we address this link directly and choose the simplification informed by the resulting model dynamics; we define maximally predictive states in an attempt to achieve an effective Markov model of the large-scale dynamics. Applicable to both deterministic and stochastic systems, we combine state-space reconstruction with ensemble dynamics to infer probabilistic state-space transitions and coarse-grained descriptions directly from data.

By characterizing nonlinear dynamics through transitions between state-space partitions, we trade trajectory-based techniques for the analysis of linear operators. This approach is complementary to previous methods of state-space reconstruction, which focus on geometric and topological quantities.^{36,65,77,79,80} In particular, the ergodic analyses of trajectories rely on precise estimates of dimension and of the local Jacobian, which is challenging in systems with unknown equations.

In our approach, we recognize and address the mutual dependence of state-space reconstruction and Markovian evolution: reconstructing the state space is required for an effective Markovian description, and the framework of ensemble dynamics provides the principled, information-theoretic measure of memory used to optimize the reconstructed state. By reconstructing the state space, hidden dynamics are incorporated into the state such that the slow modes uncovered by the operator reconstruction correspond to the slowest measurable ergodic dynamics in the system.

The approximation of transfer operators for the extraction of long-lived states has been successfully applied in a number of systems: from identifying metastable protein conformations in molecular dynamics simulations to determining coherent structures in oceanic flows (see, e.g., Refs. 14, 26, 28, 52, 59, and 81–85). Notably, most of these approaches benefit from a detailed physical understanding of the systems under study, which can be leveraged to approximate the transfer operator with high precision. However, finding optimal coordinates and metrics that can capture kinetic differences between metastable states is an active field of research.^{28} Interestingly, recent advances have been achieved by using temporal information to find better coordinates for Markov state modeling,^{27,86,87} which results from effective delay-embedding of the underlying state-space dynamics. In our approach, representation and modeling are unified under a single framework: we search for maximally predictive state-space reconstructions that yield a representation in which the transfer operator dynamics optimally captures the slow dynamics of the system and, thus, can appropriately identify timescale separation and metastable states from incomplete time series measurements.

To identify almost-invariant sets, we use the reversibilized operator $P_r$, Eq. (A2), which is generally different from $P$. For an overdamped system in thermodynamic equilibrium, this symmetrization simply enforces detailed balance, and $|\Lambda_2 - 1|$ is directly related to the hopping rates between metastable states. For irreversible dynamics, however, the dynamics of $P_r$ will relax more slowly than those of $P$, providing an upper bound on the true relaxation times.^{88,89} It will also be interesting to explore approaches based directly on $P$^{85} and to extract additional physical quantities, such as the rate of entropy production.

We restrict our analysis to autonomous ergodic systems, but slow non-ergodic variables that drive the dynamics on timescales comparable to the observation time can render the data non-stationary. In this case, the transfer operator has an explicit time dependence $P_\tau(t)$ that we cannot capture within our current approach. Nonetheless, varying $\tau$ might still allow for the identification of coherent sets or collective coordinates that dominate the dynamics on the timescale $\tau$.^{85,90–94}

When the state space is known or its reconstruction decoupled from the ensemble dynamics, mesh-free discretizations have been used to characterize ensemble evolution, including diffusion maps^{95,96} and other kernel-based approaches, such as reproducing kernel Hilbert spaces^{97} or extended dynamic mode decomposition.^{98} Though powerful, such methods require subtle choices in kernels, neighborhood length scales, and transition times, for which we lack guiding principles. Nonetheless, incorporating such approaches into our framework would likely be fruitful. In particular, when the measurement time series is high dimensional, our partitioning scheme will likely struggle: for large $K$, the trajectory matrices are high dimensional, making it challenging to perform k-means clustering. In this case, a prior dimensionality reduction step or a more appropriate choice of metric might be required.

The principled integration of fluctuating, microscopic dynamics into coarse-grained but effective theories is a remarkable success of statistical physics. Conceptually similar, our work here is designed for systems sampled through data whose fundamental dynamics are unknown. We leverage, rather than ignore, small-scale variability to subsume nonlinear dynamics into linear, ensemble evolution, enabling the principled identification of coarse-grained, long-timescale processes, which we expect to be informative in a wide variety of systems.

## SUPPLEMENTARY MATERIAL

See the supplementary material for Figs. S1–S9.

## ACKNOWLEDGMENTS

We thank Massimo Vergassola and Federica Ferretti for comments and lively discussions. This work was supported by a program grant from the Netherlands Organization for Scientific Research (Stichting voor Fundamenteel Onderzoek der Materie): FOM V1310M (A.C.C. and G.J.S.), by the LabEx ENS-ICFP: ANR-10-LABX-0010/ANR-10-IDEX-0001-02 PSL (A.C.C.), and we also acknowledge support from the OIST Graduate University (T.A. and G.J.S.), the Herchel Smith Fund (D.J.), and the Vrije Universiteit Amsterdam (A.C.C. and G.J.S.). G.J.S. and A.C.C. acknowledge useful (in-person!) discussions at the Aspen Center for Physics, which is supported by the National Science Foundation grant (No. PHY-1607611).

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

All the authors contributed equally to this work.

**Antonio C. Costa:** Conceptualization (equal); Data curation (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). **Tosif Ahamed:** Conceptualization (equal); Formal analysis (equal); Investigation (equal); Software (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). **David Jordan:** Conceptualization (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). **Greg J. Stephens:** Conceptualization (equal); Formal analysis (equal); Funding acquisition (equal); Investigation (equal); Methodology (equal); Project administration (equal); Supervision (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

The data that support the findings of this study are openly available in Zenodo at https://doi.org/10.5281/zenodo.7130012. In addition, the code for reproducing our results is publicly available: https://github.com/AntonioCCosta/maximally_predictive_states.

### APPENDIX A: METHODS

#### 1. State-space reconstruction

Given a measurement time series $\vec{y}(t)$, with $t \in \{\delta t, \ldots, T\delta t\}$ and $\vec{y} \in \mathbb{R}^d$, we build a trajectory matrix by stacking $K$ time-shifted copies of $\vec{y}$, yielding a $(T - K) \times Kd$ matrix $X_K$. For each $K$, we partition the candidate state space and estimate the entropy rate of the associated Markov chain (see below). We choose $K^*$ such that $\partial_K h_{\delta t}(K, N^*)|_{K^*} \sim 0$.
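As a concrete sketch, the trajectory-matrix construction can be written in a few lines of NumPy. The function name and the convention of dropping the final $K$ rows are ours, for illustration only, and are not taken from the released code.

```python
import numpy as np

def trajectory_matrix(y, K):
    """Stack K time-shifted copies of the measurement time series y
    (shape (T, d)) into a (T - K) x (K * d) trajectory matrix."""
    T, d = y.shape
    return np.hstack([y[k:T - K + k] for k in range(K)])

# Example: a scalar time series (T = 10, d = 1) embedded with K = 3 delays.
y = np.arange(10, dtype=float).reshape(-1, 1)
X = trajectory_matrix(y, K=3)
print(X.shape)  # (7, 3)
```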

#### 2. State-space partitioning

We partition the state space into $N$ Voronoi cells, $s_i$, $i \in \{1, \ldots, N\}$, through k-means clustering with a k-means++ initialization using scikit-learn.^{99}
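The partitioning step amounts to labeling each trajectory-matrix row with its nearest centroid. The sketch below is a minimal NumPy-only stand-in for the scikit-learn `KMeans` call used in the text; for simplicity, it replaces the k-means++ initialization with deterministic, evenly spaced seed points.

```python
import numpy as np

def voronoi_labels(X, N, n_iter=50):
    """Lloyd's algorithm: partition the rows of X into N Voronoi cells
    and return each point's cell label (a stand-in for sklearn KMeans)."""
    centers = X[np.linspace(0, len(X) - 1, N).astype(int)].astype(float)
    for _ in range(n_iter):
        # assign each point to its nearest centroid (Voronoi cells)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its cell
        for i in range(N):
            if np.any(labels == i):
                centers[i] = X[labels == i].mean(axis=0)
    return labels

# Two well-separated clouds are recovered as two cells.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
labels = voronoi_labels(X, N=2)
```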

#### 3. Approximation of the Perron–Frobenius operator

We build a finite-dimensional approximation of the Perron–Frobenius operator using an Ulam–Galerkin discretization. A Galerkin projection maps the infinite-dimensional operator onto an $N \times N$ operator of finite rank by truncating an infinite-dimensional set of basis functions at a finite $N$. Ulam's method uses characteristic functions as the basis for this projection,
$$\mathbb{1}_{s_i}(x) = \begin{cases} 1, & x \in s_i, \\ 0, & \text{otherwise}. \end{cases}$$
Our characteristic functions are implicitly defined through the k-means discretization of the space. We, thus, partition the space into $N$ connected sets with nonempty and pairwise disjoint interiors that cover $M$, $M = \cup_{i=1}^{N} s_i$, and approximate the operator as a Markov chain by counting transitions from $s_i$ to $s_j$ in a finite time $\tau$. Given $T$ observations, a set of $N$ partitions, and a transition time $\tau$, we compute
$$C_{ij}(\tau) = \#\left\{t : x_t \in s_i \text{ and } x_{t+\tau} \in s_j\right\},$$
The maximum likelihood estimator of the transition matrix is obtained by simply row normalizing the count matrix,
$$P_{ij}(\tau) = \frac{C_{ij}(\tau)}{\sum_{j'} C_{ij'}(\tau)}.$$
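The counting and row-normalization steps can be sketched directly from a sequence of cell labels. This is a minimal illustration in our own notation, not the released implementation.

```python
import numpy as np

def transition_matrix(labels, N, tau=1):
    """Ulam-style maximum likelihood estimate: count transitions
    s_i -> s_j over a lag of tau steps, then row-normalize."""
    C = np.zeros((N, N))
    for i, j in zip(labels[:-tau], labels[tau:]):
        C[i, j] += 1.0
    rows = C.sum(axis=1, keepdims=True)
    rows[rows == 0] = 1.0  # leave never-visited rows as zeros
    return C / rows

# A symbolic sequence that alternates deterministically: 0 -> 1 -> 0 -> ...
labels = np.array([0, 1, 0, 1, 0, 1])
P = transition_matrix(labels, N=2)
print(P)  # [[0. 1.] [1. 0.]]
```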
#### 4. Invariant density estimation

Given a transition matrix $P$, the invariant density is obtained through the left eigenvector of the non-degenerate eigenvalue 1 of $P$, $\pi P=\pi $.
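Numerically, the invariant density can be read off from the eigendecomposition of the transposed transition matrix; a minimal sketch (function name ours):

```python
import numpy as np

def invariant_density(P):
    """Left eigenvector of P with eigenvalue 1, normalized to sum to 1:
    solves pi P = pi via the eigendecomposition of P^T."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()

# Biased two-state chain: the chain spends 3x as long in state 0.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
pi = invariant_density(P)
print(pi)  # approximately [0.75, 0.25]
```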

#### 5. Short-time entropy rate estimation

Given a transition matrix $P(\tau =\delta t)$ and its corresponding invariant density $\pi $, we compute the short-time entropy rate through Eq. (1).
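Assuming Eq. (1) takes the standard Markov-chain form $h = -\frac{1}{\delta t}\sum_{ij}\pi_i P_{ij}\ln P_{ij}$ (an assumption on our part, since Eq. (1) appears earlier in the text), the estimate is a one-liner:

```python
import numpy as np

def entropy_rate(P, pi, dt=1.0):
    """Markov chain entropy rate, h = -(1/dt) sum_ij pi_i P_ij ln P_ij;
    assumed here to match the form of Eq. (1)."""
    logP = np.log(np.where(P > 0, P, 1.0))  # ln(1) = 0 handles zero entries
    return -np.sum(pi[:, None] * P * logP) / dt

# A fair-coin chain generates ln(2) nats of entropy per step.
P = np.full((2, 2), 0.5)
pi = np.array([0.5, 0.5])
h = entropy_rate(P, pi)
print(h)  # ln(2) ~ 0.693
```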

#### 6. Markov model simulations

At each iteration, we sample from the conditional distribution $P(s_j(t + \delta t)\,|\,s_i(t))$, given by the $i$th row of the inferred $P_{ij}(\delta t)$ matrix, to generate a symbolic sequence. At each frame $t$, we then randomly sample a state-space point within the partition corresponding to the simulated symbol and unfold it to obtain $x_{t - K^*/2\,:\,t + K^*/2}$, from which we get $x_{\rm sim}(t)$.
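The symbolic part of this simulation scheme can be sketched as follows; the unfolding step back to state-space points would additionally require the stored members of each partition, which we omit here.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_symbols(P, s0, n_steps):
    """Sample a symbolic sequence from the inferred transition matrix:
    at each step, draw the next state from the row of the current one."""
    seq = [s0]
    for _ in range(n_steps):
        seq.append(int(rng.choice(len(P), p=P[seq[-1]])))
    return np.array(seq)

# A deterministic alternating chain reproduces its 0 -> 1 -> 0 pattern.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
seq = simulate_symbols(P, s0=0, n_steps=5)
print(seq)  # [0 1 0 1 0 1]
```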

#### 7. Metastable state identification

Metastable states correspond to regions of the state space that the system visits often, separated by regions where transitions occur. We, therefore, search for collections of states that move coherently with the flow: their relative future states belong to the same macroscopic set. We leverage a heuristic introduced in Ref. 55, which makes use of the first non-trivial eigenfunction of the reversibilized transfer operator to identify almost-invariant sets:^{60} the coherence properties are invariant to this transformation,^{58} and the analysis is simplified due to the optimality properties of reversible Markov chains. In discrete time and space, $P_r$ is defined as
$$P_r = \frac{P + \hat{P}}{2}, \qquad \text{(A2)}$$
where
$$\hat{P}_{ij} = \frac{\pi_j P_{ji}}{\pi_i}$$
is the stochastic matrix governing the time reversal of the Markov chain. The first nontrivial right eigenvector of $P_r$, $\varphi_2$, allows us to define macrostates as
$$s \in \begin{cases} S^+, & \varphi_2(s) \geq \varphi_2^c, \\ S^-, & \varphi_2(s) < \varphi_2^c, \end{cases}$$
and $\varphi_2^c$ is chosen so as to maximize $\chi$, Eq. (6), yielding almost-invariant or metastable states. In practice, we compute Eq. (6) for the complete discrete set of $\varphi_2^c$ values and find the global maximum: this is an inexpensive calculation since $\chi$ can be obtained by matrix multiplications and $\varphi_2^c$ can only take $N$ different values, where $N$ is the number of partitions. We note that for an overdamped thermodynamic system in equilibrium, the eigenvectors $\varphi_k$ are discrete approximations of the eigenfunctions of the Koopman operator,^{98} which are nearly constant within a metastable set, thus simplifying the clustering into almost-invariant sets.
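A minimal sketch of the reversibilization and the sign split follows. For brevity, we threshold $\varphi_2$ at zero instead of optimizing the coherence measure $\chi$ over all candidate cut points $\varphi_2^c$; for a chain with two clean metastable blocks, the two coincide.

```python
import numpy as np

def reversibilize(P, pi):
    """P_r = (P + P_hat) / 2, with P_hat_ij = pi_j P_ji / pi_i (Eq. (A2))."""
    P_hat = (pi[None, :] * P.T) / pi[:, None]
    return 0.5 * (P + P_hat)

def split_by_phi2(P_r):
    """Split states by the sign of the first nontrivial right eigenvector
    of P_r (a zero-threshold simplification of the chi optimization)."""
    vals, vecs = np.linalg.eig(P_r)
    order = np.argsort(-np.real(vals))
    phi2 = np.real(vecs[:, order[1]])
    return phi2 >= 0.0

# A 4-state chain with two weakly coupled 2-state blocks (eps-coupling).
eps = 0.01
P = np.array([[0.5 - eps, 0.5, eps, 0.0],
              [0.5, 0.5 - eps, 0.0, eps],
              [eps, 0.0, 0.5 - eps, 0.5],
              [0.0, eps, 0.5, 0.5 - eps]])
pi = np.full(4, 0.25)  # doubly stochastic, so the invariant density is uniform
macro = split_by_phi2(reversibilize(P, pi))
# states {0, 1} and {2, 3} fall into different macrostates
```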

#### 8. Choice of transition time $\tau^*$

We choose $\tau^*$ as the shortest transition timescale after which the inferred implied relaxation times reach a plateau. For $\tau$ too short, the approximation of the operator will yield a transition matrix that is nearly the identity (due to the finite size of the partitions and the too-short transition time), which results in degenerate eigenvalues close to $\lambda \sim 1$: an artifact of the discretization and not reflective of the underlying dynamics. For $\tau$ too large, the transition probabilities become indistinguishable from noisy estimates of the invariant density, which results in a single surviving eigenvalue $\lambda_1 = 1$, while the remaining eigenvalues converge to a noise floor resulting from the finite sampling of the invariant density. Between these regimes, we find a region with the largest timescale separation [as illustrated in Fig. 1(b, middle)], which also corresponds to the regime for which the longest relaxation times $t_{\rm imp}$, Eq. (4), are robust to the choice of $\tau$. We compute $t_{\rm imp}$ using the eigenvalues of the reversibilized transition matrix, $P_r$, which only gives an upper bound on the relaxation dynamics.
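Assuming Eq. (4) takes the standard implied-timescale form, $t_{\rm imp} = -\tau / \ln|\lambda|$ (Eq. (4) appears earlier in the text, so this form is our assumption), the computation is short:

```python
import numpy as np

def implied_timescales(P_r, tau, n=3):
    """t_imp = -tau / ln|lambda_i| for the leading nontrivial eigenvalues
    of the (reversibilized) transition matrix, sorted by magnitude."""
    vals = np.sort(np.abs(np.linalg.eigvals(P_r)))[::-1]
    return -tau / np.log(vals[1:n + 1])  # skip the trivial lambda_1 = 1

# A symmetric two-state chain with weak hopping relaxes slowly.
P_r = np.array([[0.99, 0.01],
                [0.01, 0.99]])
t = implied_timescales(P_r, tau=1.0, n=1)
print(t)  # about 49.5 steps
```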

#### 9. Eigenspectrum estimation

Since we are focused on long-lived dynamics, we estimate the $n_{\rm modes}$ largest-magnitude real eigenvalues using the ARPACK^{100} algorithm.

#### 10. Periodic orbit identification

We identify the shortest-period unstable periodic orbit of the Lorenz system, Eq. (3), by studying the distribution of recurrence times. We set a short distance $\epsilon$ and look for the times $p$ at which $\|X_{t+p} - X_t\| < \epsilon$, where $\|\cdot\|$ represents the Euclidean distance. In practice, we compute $1/\|X_{t+p} - X_t\|^2$ and find peaks with height larger than $1/\epsilon^2$ using the find_peaks function from the scipy.signal package.^{101} For small enough $\epsilon$ ($\epsilon = 5 \times 10^{-6}$ in our case), the distribution of recurrence times has its first peak at the period of the shortest unstable periodic orbit.^{102} A trajectory corresponding to a shadow of this unstable periodic orbit^{103,104} is shown in Figs. 4(b) and S7(b) of the supplementary material.
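The idea behind the recurrence-time search can be illustrated with a NumPy-only stand-in for the `find_peaks`-based procedure above, applied to a noiseless periodic signal:

```python
import numpy as np

def recurrence_lags(X, eps):
    """Return all lags p at which some pair ||X_{t+p} - X_t|| < eps;
    a simplified stand-in for the scipy.signal.find_peaks search."""
    hits = []
    for p in range(1, len(X)):
        d = np.linalg.norm(X[p:] - X[:-p], axis=-1)
        if np.any(d < eps):
            hits.append(p)
    return np.array(hits)

# A noiseless period-5 orbit on the unit circle recurs at lags 5, 10, ...
t = np.arange(30)
X = np.stack([np.cos(2 * np.pi * t / 5), np.sin(2 * np.pi * t / 5)], axis=1)
p = recurrence_lags(X, eps=1e-8)
print(p)  # [ 5 10 15 20 25]
```

The first recurrence lag identifies the period, mirroring how the first peak of the recurrence-time distribution identifies the shortest unstable periodic orbit.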

### APPENDIX B: SIMULATIONS

#### 1. Double well

We use an Euler–Maruyama integration scheme to simulate Eq. (2) and generate a $T = 10^6\,\mathrm{s}$ long trajectory of a particle in a double-well potential with $m = \gamma = 1$ and $\beta^{-1} \in \{0.5, 0.75, 1.00, 1.25, 1.5, 1.75, 2.00, 2.25, 2.5\}\,J$. We first sampled at $0.01\,\mathrm{s}$ and then downsampled to $\delta t = 0.05\,\mathrm{s}$. In addition, the first $1000\,\mathrm{s}$ were discarded to avoid transients. In Fig. S5(a) of the supplementary material, we project the Boltzmann distribution into the SVD space by learning a linear mapping $\theta$ between $[u_1, u_2]$ and $[x, v]$: $[u_1, u_2] = \theta \cdot [x, v]$. This allows us to project the centroids of each Voronoi cell from the $[u_1, u_2]$ space to the $[x, v]$ space and estimate the corresponding Boltzmann weight $p([u_1, u_2] \cdot \theta^{-1}) = p([x, v]) = \exp\{-\beta[(x^2 - 1)^2 + v^2/2]\}/Z$, where $Z$ represents the normalization.
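A short Euler–Maruyama sketch of this simulation follows. Since Eq. (2) appears earlier in the text, we assume here the standard underdamped Langevin form consistent with the Boltzmann weight above, $\dot{x} = v$, $\dot{v} = -\gamma v - U'(x) + \sqrt{2\gamma/\beta}\,\xi(t)$ with $U(x) = (x^2 - 1)^2$; step counts and seeds are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def double_well_trajectory(beta, dt=0.01, n_steps=50_000):
    """Euler-Maruyama integration of underdamped Langevin dynamics in
    U(x) = (x^2 - 1)^2 with m = gamma = 1 (assumed form of Eq. (2))."""
    x, v = 1.0, 0.0
    xs = np.empty(n_steps)
    for n in range(n_steps):
        force = -4.0 * x * (x * x - 1.0)  # -dU/dx
        v += (-v + force) * dt + np.sqrt(2.0 * dt / beta) * rng.normal()
        x += v * dt
        xs[n] = x
    return xs

# At beta^{-1} = 0.5 the particle dwells near the wells at x = +/- 1.
xs = double_well_trajectory(beta=2.0)
```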

#### 2. Lorenz system

### APPENDIX C: ESTIMATING THE KOLMOGOROV–SINAI ENTROPY RATE IN A PARTITIONED STATE SPACE

In order to estimate the KS entropy of the source, we build trajectory matrices $X_K$ for increasing delays $K$ and estimate the entropy rate for increasing $K$ at a transition time $K\delta t$, Eq. (11). In order to capture as much information as possible in the partitioned state space, we take the supremum over $N$, $\hat{h}(K) = \max_N \hat{h}(N, K)$, and probe the $K \to \infty$ limit [Eq. (12)]. We note that Eq. (12) has an implicit $\delta t$-dependence, which we lift by probing the consistency of the estimation for different values of $\delta t$, as exemplified in the Rössler system (see Fig. S9 in the supplementary material). We aim for $\delta t \to 0$, while paying attention to the fact that when $\delta t$ is too short, we need larger $K$ to reach the $T = K\delta t \to \infty$ limit, which might be impractical. We use different methods to probe the $K \to \infty$ limit in discrete maps and in continuous-time dynamical systems. For discrete maps, we find that it is possible to achieve an accurate estimate of the KS entropy with a small number of delays, and we identify the $K \to \infty$ limit as the largest $K$ before which finite-size effects start becoming evident. As $K$ grows, the change in $\hat{h}(K)$ with $K$ slows down (typically as $1/K$), meaning that
$$\Delta \hat{h}(K) = \hat{h}(K - 1) - \hat{h}(K)$$
should be a monotonically decreasing function. However, we find that finite-size effects result in the underestimation of the entropy (beyond the expected $\sim 1/K$ decrease), which results in an increase in $\Delta \hat{h}(K)$. Therefore, we identify the largest $K$ before finite-size effects become significant by taking the minimum of $\Delta \hat{h}(K)$ [see Fig. S8(a) in the supplementary material]. In comparison, we find that given our choice of $\delta t$ for the continuous-time dynamical systems studied here, the value of $K$ required for an accurate estimate of the KS entropy is unattainable. In order to reach the $K \to \infty$ limit, $h_\infty$, we leverage the $1/K$ behavior of $\hat{h}(K)$ to extrapolate to $K \to \infty$ by a weighted least squares regression of $\hat{h}(K) = m/K + h_\infty$, as in Ref. 78 [see Fig. S8(c) in the supplementary material]. We note that this observation is fundamentally tied to the choice of sampling time $\delta t$, as discussed in Fig. S9 of the supplementary material.
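The extrapolation step can be sketched as an ordinary weighted least squares fit of $\hat{h}(K) = m/K + h_\infty$; the weights here are uniform for illustration, whereas in practice they would reflect the uncertainty of each $\hat{h}(K)$ estimate.

```python
import numpy as np

def extrapolate_entropy(K, h, w=None):
    """Weighted least squares fit of h(K) = m / K + h_inf, returning the
    K -> infinity intercept h_inf."""
    A = np.stack([1.0 / K, np.ones_like(K)], axis=1)
    if w is None:
        w = np.ones_like(h)
    W = np.diag(w)
    m, h_inf = np.linalg.solve(A.T @ W @ A, A.T @ W @ h)
    return h_inf

# Synthetic estimates decaying exactly as 1/K toward h_inf = 0.9.
K = np.arange(5, 50, dtype=float)
h = 2.0 / K + 0.9
h_inf = extrapolate_entropy(K, h)
print(h_inf)  # 0.9
```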

### APPENDIX D: ADDITIONAL SIMULATIONS FOR SEC. IV

#### 1. Rössler system

We use SciPy's odeint^{101} to generate a $T = 5 \times 10^3\,\mathrm{s}$ long trajectory of the Rössler system,^{105,106}
$$\dot{x} = -y - z, \qquad \dot{y} = x + a y, \qquad \dot{z} = b + z(x - c),$$
in a chaotic regime with $(a, b, c) = (0.52, 2, 4)$, sampled at $\delta t = 0.1\,\mathrm{s}$. We discard the first $100\,\mathrm{s}$ to avoid transients.
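The integration can be sketched with a fixed-step RK4 scheme as a NumPy-only stand-in for the scipy.integrate.odeint call; step size and duration here are illustrative.

```python
import numpy as np

def rossler(state, a=0.52, b=2.0, c=4.0):
    """Rossler vector field in the chaotic regime used above."""
    x, y, z = state
    return np.array([-y - z, x + a * y, b + z * (x - c)])

def rk4_trajectory(f, x0, dt, n_steps):
    """Fixed-step fourth-order Runge-Kutta integrator."""
    X = np.empty((n_steps + 1, len(x0)))
    X[0] = x0
    for n in range(n_steps):
        x = X[n]
        k1 = f(x)
        k2 = f(x + 0.5 * dt * k1)
        k3 = f(x + 0.5 * dt * k2)
        k4 = f(x + dt * k3)
        X[n + 1] = x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return X

X = rk4_trajectory(rossler, np.array([1.0, 1.0, 0.0]), dt=0.01, n_steps=10_000)
print(X.shape)  # (10001, 3)
```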

#### 2. Lorenz system in a limit cycle regime

#### 3. Henon map

We iterate the Hénon map,^{107}
$$x_{n+1} = 1 - a x_n^2 + y_n, \qquad y_{n+1} = b x_n,$$
in a chaotic regime, $(a, b) = (1.4, 0.3)$, to generate a trajectory of $T = 10^6$ time steps, discarding the first $1000$ iterations to avoid transients.
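A minimal sketch of this iteration (a shorter orbit than the $10^6$ steps used in the text, for illustration):

```python
import numpy as np

def henon_orbit(n_steps, a=1.4, b=0.3, x0=0.0, y0=0.0, n_discard=1000):
    """Iterate x_{n+1} = 1 - a x_n^2 + y_n, y_{n+1} = b x_n,
    discarding an initial transient as in the text."""
    x, y = x0, y0
    for _ in range(n_discard):
        x, y = 1.0 - a * x * x + y, b * x
    orbit = np.empty((n_steps, 2))
    for n in range(n_steps):
        x, y = 1.0 - a * x * x + y, b * x
        orbit[n] = x, y
    return orbit

# The chaotic attractor is confined to a bounded region of the plane.
orbit = henon_orbit(10_000)
```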

#### 4. Logistic map

We iterate the logistic equation
$$x_{n+1} = r x_n (1 - x_n)$$
to generate trajectories of $T = 10^6$ time steps, discarding the first $100$ iterations to avoid transients. We vary the parameter $r$ to span multiple dynamical regimes along the period-doubling route to chaos.
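The iteration is elementary; as a sketch, the orbit below uses $r = 3.2$, which lies in the period-2 window of the period-doubling route (the specific $r$ and orbit length are illustrative, not the values used in the text):

```python
import numpy as np

def logistic_orbit(r, n_steps, x0=0.3, n_discard=100):
    """Iterate x_{n+1} = r x_n (1 - x_n), discarding a transient."""
    x = x0
    for _ in range(n_discard):
        x = r * x * (1.0 - x)
    orbit = np.empty(n_steps)
    for n in range(n_steps):
        x = r * x * (1.0 - x)
        orbit[n] = x
    return orbit

# At r = 3.2 the attractor is a period-2 cycle: two visited values.
orbit = logistic_orbit(3.2, 1000)
print(sorted(set(np.round(orbit, 4))))
```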

## REFERENCES

*Lectures on Phase Transitions and the Renormalization Group*, Frontiers in Physics (Addison-Wesley, Reading, MA, 1992; CRC Press, 2018).

*Time Series Analysis: Forecasting and Control*, 1st ed., edited by E. Robinson (Holden-Day, San Francisco, CA, 1976).

*Dynamical Systems and Turbulence, Warwick 1980*, edited by D. Rand and L.-S. Young (Springer, Berlin, 1981), pp. 366–381.

*Chaos, Scattering and Statistical Mechanics*, Cambridge Nonlinear Science Series (Cambridge University Press, 1998).

^{69}which poses a fundamental bound to the predictability of a dynamical system. We note that our inferred Markov chain is *not* a complete model of the dynamics, but only an approximate description on timescales and length scales beyond the transition time $\tau$ and the length scale imposed by the partitioning, $\epsilon \sim 1/N$. In principle, there exist *generating partitions* that exactly preserve the continuum dynamics, but these are generally challenging to find except for simple one- or two-dimensional discrete maps.^{14,70} We return to this point in Sec. IV.


*Ergodic Theory, Analysis, and Efficient Simulation of Dynamical Systems*, edited by B. Fiedler (Springer, Berlin, 2001), pp. 191–223.

*Selected Works of A. N. Kolmogorov*,

*Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic: NIPS’01*(MIT Press, Cambridge, MA, 2001), pp. 471–478.

*An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation*, Advances in Experimental Medicine and Biology Vol. 797, edited by G. R. Bowman, V. S. Pande, and F. Noé (Springer Netherlands, Dordrecht, 2014).