Providing efficient and accurate parameterizations for model reduction is a key goal in many areas of science and technology. Here, we present a strong link between data-driven and theoretical approaches to achieving this goal. Formal perturbation expansions of the Koopman operator allow us to derive general stochastic parameterizations of weakly coupled dynamical systems. Such parameterizations yield a set of stochastic integrodifferential equations with explicit noise and memory kernel formulas to describe the effects of unresolved variables. We show that the perturbation expansions involved need not be truncated when the coupling is additive. The unwieldy integrodifferential equations can be recast as a simpler multilevel Markovian model, and we establish an intuitive connection with a generalized Langevin equation. This connection helps setting up a parallelism between the top-down, equation-based methodology herein and the well-established empirical model reduction (EMR) methodology that has been shown to provide efficient dynamical closures to partially observed systems. Hence, our findings, on the one hand, support the physical basis and robustness of the EMR methodology and, on the other hand, illustrate the practical relevance of the perturbative expansion used for deriving the parameterizations.

Parameterizations aim to reduce the complexity of high-dimensional dynamical systems. Here, a theory-based and a data-driven approach for the parameterization of coupled systems are compared, showing that both yield the same stochastic multilevel structure. The results provide very strong support to the use of empirical methods in model reduction and clarify the practical relevance of the proposed theoretical framework.

## I. INTRODUCTION AND MOTIVATION

Multiscale systems are typically characterized by the presence of significant variability over a large range of spatial and temporal scales. This multiscale character is due to a combination of the following factors: the nature of the external forcings; the inhomogeneity of the properties of the system’s various components; the complexity of the coupling mechanisms between the components; and the variety of instabilities, dissipative processes, and feedback acting at different scales. In many cases, both the theoretical understanding of such systems and the formulation of numerical models for simulating their properties are based on focusing upon a reduced range of large spatial and long temporal scales of interest, and upon devising an efficient way to effectively capture the impact of the faster dynamical processes acting predominantly in the neglected smaller spatial scales.^{24,64}

Given a high-dimensional dynamical system, we are thus interested in reformulating it in such a way that only the variables of interest are resolved. A first guess would be to ignore altogether the unresolved variables and consider uniquely the filtered evolution laws valid for the targeted range of scales. However, this is well known to be inadequate, because nonlinearity guarantees that the unresolved variables have an impact on the resolved ones, in terms of both the detailed dynamics and its statistical properties. Therefore, the problem of constructing accurate and efficient reduced-order models—or, equivalently, of defining the coarse-grained dynamics—is an essential and fundamental aspect of studying multiscale systems, both theoretically and through numerical simulations.

For the sake of concreteness, let us consider the case of climate science. It is well-nigh impossible, therefore, given our current scientific knowledge and our available or even foreseeable technological capabilities, to create a numerical model able to directly simulate the climate system in all details for all the relevant timescales, which span a range of over 15 orders of magnitude.^{30,32,66} Hence, one has to focus on a specific range of scales through suitably developed, approximate evolution equations that provide the basis for the numerical modeling. Such equations are derived from the fundamental laws of climate dynamics through systematic asymptotic expansions that are based on imposing an approximate balance between the forces acting on geophysical flows. These balance relations lead to removing small-scale, fast processes that are assumed to play a minor role at the scales of interest by filtering out the corresponding waves.^{38,42,80}

In climate science, parameterization schemes have been traditionally formulated in such a way that one expresses the net impact on the scales of interest of processes occurring within the unresolved scales via deterministic functions of the resolved variables, as in the pioneering work on the parameterization of convective processes by Arakawa and Schubert.^{2} More recently, it has been recognized, mostly on empirical grounds, that parameterizations should involve stochastic and non-Markovian components.^{4,28,62} Machine learning methods have been proposed as the next frontier of data-driven parameterizations,^{29,61,68,84,92} able to deliver a new generation of Earth system models;^{73} see, though, the caveats discussed by Refs. 26 and 39.

### A. The projection operator formalism of Mori and Zwanzig

Let us reformulate the problem of constructing parameterizations as a projection of the dynamics onto the set of resolved variables. By working at the level of observables, Mori^{60} and Zwanzig^{90} showed that the evolution laws for the projected dynamics incorporate a deterministic term that would be obtained by neglecting altogether the impact of the unresolved variables, to which a stochastic and a non-Markovian correction had to be added. Chorin *et al.*^{20,21} played an important role in developing further these ideas and applying them to several important problems. We briefly recapitulate below the Mori–Zwanzig projection operator approach.

Formally, let $\Phi $ denote a generic observable defined on a state space viewed as the product of two finite-dimensional spaces $X\xd7Y$, with variables $x\u2208X$ and $y\u2208Y$ being the resolved and unresolved variables, respectively. Next, let us define $P$ to be a projector onto functions depending only on the target variables in $X$, with the complementary projector on the unresolved variables being defined by $Q=Id\u2212P$.

Given a smooth flow $\psi t$ on $X\xd7Y$, we consider its action on smooth observables $\Phi =\Phi (x,y)$ defined by

The operator $Ut$ is the Koopman operator and the family ${Ut}t\u22650$ of Koopman operators indexed by time forms a semigroup; see Sec. II for further details. The Koopman operator describes how functions on the phase space change under the action of the flow; its time evolution obeys the following equation:

The linear operator $L$ gives the instantaneous rate of change of $\Phi $ under the action of the flow $\psi t$. Representing $Ut$ formally as the exponential of $L$, $Ut=etL$, is both useful and fully justified by operator semigroup theory.^{65} With this representation, $L$ satisfies the following identities:

where we have employed the Dyson identity^{23,25} to obtain Eq. (1.3b), as will be discussed in Sec. II A. The first term in Eq. (1.3b) is the contribution of the resolved variables $x$ alone to the instantaneous rate of change of $\Phi $. The second term models the fluctuating effects of the unresolved $y$-variable by itself, while the third and last term represents, via an integral, the time-delayed influence upon $x$ of its interactions with $y$.

This formal calculation suggests that any closed model for the $x$ variables should incorporate a fluctuating term to account for the $y$ contributions and a memory or integral term for the $y$–$x$ interactions. Unfortunately, the Mori–Zwanzig equation (1.3b)—also known as the generalized Langevin equation (GLE)^{43,63}—does not provide explicit analytic formulas to determine each of the three summands in the right-hand side (RHS) of Eq. (1.3b). Hence, we need efficient ways to approximate such an equation.

In the limit of perfect timescale separation between the $x$- and $y$ variables, the non-Markovian term drops out and the fluctuating term can be represented as a—possibly multiplicative—white-noise term, thus recovering the basic results obtained via homogenization theory;^{34,58,64} see the classical derivation by Hasselmann^{93} of this result in the context of climate dynamics. When no such separation exists, however, one has to resort to finding an integral kernel beyond the abstract formulation of Eq. (1.3b); see, for instance, the theoretical ansatz based on perturbation expansion presented in Refs. 88 and 89 and discussed later in the paper, and Refs. 43 and 83 for concrete applications.

In parallel with the theoretical approaches to approximate the Mori–Zwanzig equation (1.3b), data-driven methods have been proposed to model fluctuations and memory effects arising as a result of projecting a large state space onto (much) smaller subspaces. To this end, the work in Ref. 43 provides a rigorous connection between the Mori–Zwanzig equation (1.3b) and multilevel regression models that were initially introduced for climatological purposes in Ref. 48. More recently, the authors in Ref. 51 proposed Nonlinear Auto-Regressive Moving Average with eXogenous input (NARMAX) models as a data-driven methodology that is comparable with the Mori–Zwanzig formalism and applied such models to the deterministic Kuramoto-Sivashinsky equation and to a stochastic Burgers equation; see also Gupta and Lermusiaux.^{92} The complementarity of theory-based and data-driven model reduction methods in the absence of timescale separation is very well documented in Ref. 51 as well. A highly complementary recent contribution is aimed at finding a common thread between data-driven methods and the Mori–Zwanzig theoretical framework.^{52}

Efforts at using ingeniously selected basis functions as a stepping stone in data-driven methods for model reduction, effective simulation with partial data, and even prediction are multiple and flourishing. Thus, the eigenvalues and eigenfunctions of the Koopman and transfer operators have been used to capture the modes of variability of the underlying flow, regardless of the latter being deterministic or stochastic; see, respectively, Refs. 3 and 19. Dynamic mode decomposition (DMD)^{72} allows one to reconstruct from observations the eigenvectors and eigenvalues of the Koopman operator for observables of interest even in high-dimensional dynamical systems.^{59,69} The latter approach is complementary to the one presented herein because we shall use the eigenvectors of the Koopman operator to build the projected dynamics for the observables of interest, which can then be rewritten as a multilevel Markovian stochastic model.

Examples of other types of selection of dynamically interesting and effective bases are multichannel singular-spectrum analysis (MSSA)^{1,5,31} and data-adaptive harmonic decomposition (DAH-MSLM).^{10,44} Both methodologies have been used extensively and successfully in the simulation, as well as the prediction, of complex phenomena.^{31,41,44}

Our main theoretical result, Theorem 2.1, establishes how such spectral elements determine in turn the constitutive elements of data-driven non-Markovian closure of partially observed complex systems, when rewritten as a multilevel Markovian stochastic model. This theorem highlights, in particular, new bridges with Koopman modes and the DMD,^{59,69,72} as well with other kinds of projections onto spectral bases.^{10,31,37,47,67,78} A more thorough discussion of the complementary approaches involved is beyond the scope of this paper.

### B. This paper

Many approaches to constructing theoretically rigorous parameterizations have been devised. These can be broadly divided into top-down and data-driven approaches: Top-down approaches aim at deriving the parameterizations by applying suitable approximations to the equations describing the dynamics of the whole systems, for instance,^{35,57,83,86,88,89} while data-driven parameterizations are built by constructing a statistical-dynamical model of the impact of unresolved scales on the scales of interest. In fact, partial observations of a time-evolving system can be used to deduce the fluctuating and delayed effects of the unobserved processes, as shown in Refs. 43, 45, 48, and 85.

In this paper, we will discuss and compare the properties of the Wouters–Lucarini (WL) top-down parameterization^{83,88,89} and of the empirical model reduction (EMR) data-driven parameterization.^{43,45,46,48,49} We will also see when and how the integrodifferential equation occurring in the WL parameterization can be recast into a set of Markovian stochastic differential equations (SDEs). In other words, we investigate the quasi-Markovianity of the latter parameterization.^{63}

The two aforementioned methodologies are conceptually and practically distinct, even though the ultimate goal of both is to provide a computationally practical approximation for the Mori–Zwanzig or GLE integrodifferential equation. In other words, both approaches—top-down and data-driven—provide fluctuations in the form of stochastic noise and memory effects determined by an integral kernel. On the one hand, the WL approach assumes prior knowledge about the decoupled hidden dynamics but no information about the statistical properties of the coupled system. The empirical approach, on the other hand, samples the observed variables evolving according to the latter. The structure of the multilevel stochastic models (MSMs) that generalize EMRs^{43} allows one, moreover, to derive explicit formulas for the fluctuating and memory correction terms that parameterize the influence of hidden processes.

The overall goal of this paper is to provide a conceptual and analytical link between these two approaches, aiming, on the one hand, to buttress the practical relevance of the WL perturbative approach and, on the other hand, to provide further insight into the well-documented robustness of the EMR method. Moreover, we will clarify how multilevel systems arise from both the *top-down* (WL) and the *bottom-up* (EMR) approaches. The paper explores the complete set of boxes and explains all the arrows in Fig. 1. The diagram in the figure shows that, starting from the top box, one can arrive at a memory equation, like (1.3b), via *top-down* or *bottom-up* methods, as indicated by the left and right sequence of arrows, respectively.

The paper is structured as follows. Section II revisits the derivation of the WL parameterization method by applying the Dyson expansion to the Koopman operator associated with two weakly coupled dynamical systems. We show that such an expansion need not be truncated for additively coupled models and consider more general coupling laws than those in Ref. 89.

Furthermore, we study the problem of finding Markovian representations of the memory equation in the WL approach based on the spectral decomposition of the Koopman semigroup. Specifically, Theorem 2.1 shows, in the case of a scalar observable, how to recast the stochastic integrodifferential equation arising in the WL parameterization as a multilevel Markovian stochastic system involving explicitly the spectral elements of the (uncoupled) Koopman operator, and we point out in Remark 2.5 how such a Markovianization extends to the multidimensional setting. Section III provides new insights into the Markovian representation adopted in the MSM framework; these insights help one to determine, in particular, the number of levels required for EMR to converge.

Section IV presents a comparison of the data-driven and top-down parameterization approaches using a simple conceptual stochastic climate model. Finally, we discuss the conclusions obtained from this investigation in Sec. V.

Appendixes are included in order to avoid diverting the attention of the reader from the main message of the paper. Appendix A provides the proof of the key Theorem 2.1. Appendix B briefly revisits the spectral decomposition of correlation functions and provides a criterion to quantify the loss of the semigroup property that is useful in identifying non-Markovian effects. Appendix C discusses the stochastic Itô integration of the elementary form of an MSM. Appendix D shows how EMR approaches can capture the dynamical properties of partially observed systems; in it, we consider a simple climate model, obtained by coupling the Lorenz atmospheric model^{54} (L84) and the Lorenz convection model^{53} (L63).

## II. REVISITING THE WEAK-COUPLING-LIMIT PARAMETERIZATION

To study dynamical systems in which one can separate the variables into two groups, with weak coupling between the two, one often resorts to the so-called parameterizations of the effects of one group on the other. In the weak-coupling limit, the coupling itself can be treated as a perturbation of the main dynamics.^{56,88,89} Granted such an assumption and some degree of structural stability of the system, one can apply response theory to derive explicit stochastic and memory terms to describe the impact of the variables we want to neglect on the variables of interest, in the Mori–Zwanzig spirit. Note that, to do so, no assumption on timescale separation between the two groups of variables is necessary. This point is particularly relevant in fields like climate sciences, where no clear timescale separation is observed so that asymptotic expansions of the kind used in homogenization theory are of limited utility.

### A. Deriving the WL approximation for the Mori–Zwanzig formalism

Here, using a perturbative approach, we review the derivation of the parameterization presented in Refs. 56, 88, and 89. Formally, we want to couple two dynamical systems generated independently by two vector fields $F:X\u2286Rd1\u27f6X$ and $G:Y\u2286Rd2\u27f6Y$ with possibly $d1\u2260d2$ and typically $d1\u226ad2$. We study a broad class of systems of the form

The operation indicated by the colon $x:y$ denotes the Hadamard product that multiplies vectors or matrices component-wise. Here, four new vector fields have been introduced to model the coupling law, namely, $Cxx:X\u27f6X$, $Cxy:X\u27f6Y$, $Cyx:Y\u27f6X$, and $Cyy:Y\u27f6Y$.

The real parameter $\u03f5$ controls the strength of the coupling between the two groups of variables, $x(t)$ and $y(t)$, so that the $x$ and $y$ variables are uncoupled for $\u03f5=0$. We assume that the vector fields $F$ and $G$, as well as the coupling laws in Eqs. (2.1a) and (2.1b), are such that the system (2.1) possesses a global attractor. Furthermore, we assume throughout this article that this global attractor supports an physical invariant probability measure $\mu $ that describes the distribution of trajectories onto the global attractor.

The WL parameterization views the coupling as an $\u03f5$-perturbation of the otherwise independent $x$- and $y$-processes, with $x$ being the observed and $y$ being the hidden variables. One next assumes that the impacts of perturbations applied to these processes can be addressed using response theory^{70,71} so that response formulas can be used to derive an effective equation for the $x$ variables.

Taking the Mori–Zwanzig^{60,90} point of view, we wish to calculate the evolution of observables that depend on the observed variables $x$ alone, $\Phi =\Phi (x(t))$. The idea, following Ref. 89, is to perform a perturbative expansion of the differential operator $L$ governing the evolution of $\Phi (x(t))$ under the action of the flow associated with Eq. (2.1). Denoting by $u(x,y,t)$ the time evolution of a smooth observable $\Phi $ in $C\u221e(X\xd7Y)$, the first step of this Dyson-like operator expansion reads as follows:

Here, $L0$ and $L1$ account for the advective effects of the uncoupled and coupling terms, respectively, that compose the RHS of Eq. (2.1), namely,

in Eq. (2.3), $\u2207x$ and $\u2207y$ denote the vector differential operators with respect to the variables $x$ and $y$.

Recalling Eq. (1.2), the solution operator of Eq. (2.2) is the Koopman operator. Formally, its dual acts on densities and it is the so-called *transfer operator*.^{3,50} Equation (2.2) is thus a transport equation, where the physical quantity or observable is advected by the vector field on the RHS of Eq. (2.1).

Note that the operator formalism presented here in the deterministic dynamical systems setting—and the associated semigroup theory—extends to Markov diffusion processes driven by a stochastic forcing.^{19} In the latter case, the transport equation (2.2) becomes the so-called backward Kolmogorov equation that describes the evolution of the expected value of observables. Loosely speaking, the corresponding extension amounts to adding a Laplacian-like operator to the advection operator $L$.^{19,63} See Ref. 36 for the appropriate context in testing the applicability of response theory to the independent $x$ and $y$ processes, when $\u03f5=0$.

More precisely, one associates with the solution $u(x,y,t)$ of Eq. (2.2) unfolding from an observable $\Phi =\Phi (x,y)$ at time $t=0$, a family of linear Koopman operators indexed by time ${Ut}t\u22650$ such that $u(x,y,t)=Ut\Phi (x,y)$, for any $t\u22650$ and $(x,y)$ in $X\xd7Y$. These operators are defined—as mentioned already in connection with introducing the GLE (1.3b)—as exponentials of the operator $L$, i.e., $Ut=etL$. This notation is formal as the operator $L$ is unbounded; it is, however, usable as ${Ut}t\u22650$ satisfies the semigroup property, i.e., $Ut+s=UtUs$, $t,s\u22650$, as for a standard exponential. Over the appropriate function space [Such a space can be chosen, for instance, as $Dp={\Phi \u2208L\mu p(X\xd7Y)|A\Phi =limt\u21920t\u22121(Ut\Phi \u2212\Phi )exists}$ for some $p\u2208[1,\u221e]$, with $\mu $ denoting a relevant invariant measure of the system (2.1) supported by the global attractor, while the limit is taken in the sense of strong convergence.^{25}] of observables $\Phi $, this family actually forms a strongly continuous contracting semigroup.^{25}

The action of the flow on an observable $\Phi $ becomes thus more transparent thanks to the operator $Ut$, according to the equation

where $(x(t;x0),y(t;y0))$ denotes the system’s solution at time $t$ emanating from the initial state $(x0,y0)$ at time $t=0$. In what follows, we omit the subscript $0$ in $(x0,y0)$ but still take it as an initial state.

The semigroup ${Ut}t\u22650$ is known as the Koopman semigroup and for each $t$, $Ut$ is Koopman operator mentioned above; see also Ref. 43, Sec. 4. When the coupling parameter $\u03f5$ in system (2.1) is small, one can use formal perturbation expansions of the Koopman semigroup to better isolate and assess the coupling effects at the level of observables. To do so, we follow here the perturbation expansion first introduced by Freeman J. Dyson in the context of quantum electrodynamics^{23} and later formulated rigorously in mathematical terms in Ref. 33. Formally, this expansion reads as follows:

and it yields the following expansion of the Koopman operator in $\u03f5$:

This identity shows that the evolution of a generic observable can be described as an $\u03f5$-perturbation of its decoupled evolution according to $L0$. We note that these expansions are purely formal and, in particular, it is not clear in which sense this expansion might converge. For a bounded perturbation operator $L1$, it would be straightforward to prove boundedness of the resulting perturbed semigroup. However, $L1$ here is a differential linear operator, for which direct estimates are more laborious. Leaving aside the functional analysis framework that would make such an expansion rigorously convergent, we shall use nevertheless the expansion (2.6) throughout this article.

The objective now is, using this operator expansion, to derive an effective reduced-order model for the evolution of the $x$-variable without having to resolve the $y$-process. We start observing the system at $t=0$, but assume that it has already attained a steady state. Since we are only concerned with observables depending solely on the $x$ variables, we formulate now an evolution equation for such observables. To do so, we consider first the Liouville equation for a generic $y$-independent observable $\Phi $, this is, $\Phi (x,y)=\Phi (x)$, for every $x$ and $y$.

For such an observable, at the time we start observing the coupled system, Eq. (2.2) reduces to

where $\u22c5$ denotes an inner product in $X$. Equation (2.7) illustrates the trivial fact that the time evolution in Eq. (2.1) of an $x$-dependent physical quantity is also affected by the $y$ variables.

Following Refs. 88 and 89, the decoupled equations are assumed to have been evolving for some time prior to the coupling. Hence, we have to formally parameterize the evolution of the $Cyx(y)$-contribution to the vector field which is, ultimately, a vector-valued observable.

We do so by introducing an extended version of the Koopman operators that act on vectors component-wise, rather than just on real-valued observables. Consider $v:X\xd7Y\u27f6Rd$, for some positive integer $d$, and define the action of the Koopman operator $etL$ on $v$ as

for every $i=1,\u2026,d$. The definition (2.8) will allow us to use the semigroup notation for observables of possibly different dimensions, all of which take their inputs in the phase space $X\xd7Y$ . Ultimately, this is a component-wise evaluation of our extended Koopman operator family, and its generator can be obtained analogously. As mentioned above, we have to model the effects of the coupling vector field $Cyx(y)$, whose state at time $t=0$ is the product of the evolution from time $\u2212t$ to $0$. We then have, with the dynamics starting at time $\u2212t$ and initial state $(x0,y0)$,

This equation is an exact reformulation of the problem induced by Eq. (2.7). This reformulation demonstrates that memory effects enter at second order in powers of the coupling parameter. Notice, though, that even if Eq. (2.11) reduces the dimensionality of the problem from $d1+d2$ to $d1$, it does not constitute an approximation for the evolution of $\Phi $ as an observable of $x$ alone, since it depends on the evolution of the $y$ variables in the coupled regime by means of the action of $esL$ onto $L1$.

Therefore, we need to perform a further approximation by considering Eq. (2.10d) instead, which leads to

where the terms of order $\u03f53$ have been dropped. Equation (2.12) is our equivalent of the Dyson approximation in quantum electrodynamics; it approximates the evolution of the $x$ variables with no need for the evolution of the $y$ variables in the coupled regime.

This result amounts to saying that—by observing only the statistical properties of the decoupled dynamics of the $y$-process, obtained with $\u03f5=0$—one can construct a Markovian contribution

and a non-Markovian contribution

to the dynamics of the $x$ variables that is able to describe the impact of the coupling.

Expanding the kernel $K~$ of the memory contribution, we get

Note that the leading-order Koopman operator $esL0$ models the evolution of the observables in the uncoupled regime. Since there is no prior knowledge on initializing the coupled system at time $\u2212t$, the initial state $y0$ in the hidden variables should be drawn from an ensemble, according to a probability density function. At this stage, there is freedom in the choice of such a prior. However, since we are assuming that the coupled system was initialized at time $\u2212t$, it is natural to draw $y0$ according to the invariant measure $\nu $ associated with the dynamical system generated by the vector field $G$ from Eq. (2.1).

We wish to sample initial conditions from the coupled steady state, but do not assume any prior knowledge of the coupled statistics. As discussed in Refs. 88 and 89, we can take advantage of response theory to address this situation. Indeed, for any sufficiently smooth observable $\Psi $, we have

where $\u27e8\Psi \u27e9\u03f5$ is the expectation value of $\Psi $ in the coupled system (2.1), $\u27e8\Psi \u27e9\u03f5=0$ is the expectation value of $\Psi $ according to the statistics generated by the uncoupled $y$ process obtained by setting $\u03f5=0$ in Eq. (2.1b), and $\u03f5k\delta k[\Psi ]$ is the $k$th-order response. In what follows, we remove the subscripts for the averages when $\u03f5=0$. Therefore, we have that the expected value of the coupling function reads as

Likewise, we can calculate the average of such function at time $t$,

Now, by letting $\eta ~(t,y0)=etL0Cyx(y0)$, we find that in order for the approximate statistics to agree up to second order in $\u03f5$ with the exact one, we only need to impose the following conditions upon the first two moments of the parameterized noisy fluctuations (see also Ref. 88):

where $(\u22c5)\u22a4$ stands for the transpose of a vector or a matrix. It follows that any stochastic noise $\eta (t)$ that satisfies the two conditions above will be suitable for parameterizing the fluctuations in the $y$-dynamics tied to the lack of knowledge in the initial state. Each of the entries in the correlation matrix given by Eq. (2.16b) is the correlation function between the components of the vector field $Cyx$ and these will become explicit provided a suitable spectral decomposition is at hand. Such a decomposition will be provided later in Sec. II C, although the reader is referred at this point to Appendix B for clarity.

In the memory term, though, we neglect $\u03f5$-corrections to its statistics since memory effects are of order $\u03f52$ already. Thus, we have by averaging the kernel $K~(t,s,x0,y0)$ in Eq. (2.13b) with respect to the $y$ variables,

This way, the memory kernel only depends on the $x$ variables. Hence, we find a self-consistent evolution of the $x$ variables, subject to the influence of unobserved variables $y$, in the form of a stochastic integrodifferential equation (SIDE) resembling the GLE (1.3b),

where $\eta (t)$ is a stochastic forcing that agrees with the mean and correlation properties stated in Eq. (2.16).

We emphasize that the solution $x(t)$ of the original system of ordinary differential equations (2.1) does not satisfy Eq. (2.18): it is just the proposed reduced-order model for the $x$ variables. The closure provided by expressing the corrections in the second and third term on the RHS of Eq. (2.18) as functions of $x$ alone is typically called a parameterization of the effect of the unobserved $y$ variables in the climate sciences.^{32}

Note that there is considerable freedom in choosing the noise, since we only require agreement up to the second moment. However, a direct consequence of this weak-coupling parameterization is that realizations of the noise can be produced by directly integrating the decoupled hidden variables or by representing it using simple autoregressive models.^{82} We are assuming here that the uncoupled dynamics leads to a noisy signal; this can be achieved either by the presence of stochastic forcing in the hidden variables^{82,87} or by their uncoupled dynamics being chaotic.^{83}

To summarize, the weak-coupling limit allows one to develop a parameterization of the hidden variables for a system of coupled equations where no separation of timescales is assumed. Moreover, this approach provides explicit approximate expressions for the deterministic, stochastic, and non-Markovian terms in the Mori–Zwanzig formalism of Eq. (1.3b).

There are two sources of error in the parameterization proposed in Eq. (2.18). First, the truncation performed in the Dyson expansion neglects higher-order effects, which are weighted by the third power of the coupling parameter $\u03f5$. Second, averaging over the statistics of the uncoupled dynamics can also introduce errors. Furthermore, the nature of the stochastic correction is not fully determined except for its lagged correlation.

The perturbation operator approach taken here is analogous to that of Ref. 89, who only considered the independent or additive-coupling cases; the latter is expanded upon in Sec. II B. Here, though, we generalize further the parameterization formulas that can be obtained via perturbative expansion of linear operators. In fact, the present approach can also be extended to weakly coupled systems of the form

where $Cy$ encodes interactions that need not be separable between the $x$ and $y$ variables in the hidden layer of the model. Note that the full parameterization of arbitrary couplings was discussed by the two authors of Ref. 89 in the previous work,^{88} in which they used a response-theoretic approach.

### B. The additive-coupling case

The approximate Dyson expansion given in Eq. (2.12) is exact in the case of additive coupling. Such systems take the form

Indeed, letting $Cy(x,y)=Cy(x)$ in Eq. (2.19b) and using Eq. (2.10b) allow us to avoid the truncation of the Dyson expansion and yield the following expression for the memory term:

which is exact. Next, taking averages with respect to $\nu $, we obtain

Hence, the parameterization in this additive-coupling case is exact, as no terms proportional to $\u03f5k$, $k\u22653$ are present. The only assumption made is that the statistics in the $y$ variables have reached a steady state according to the unperturbed system. Finally, the full SIDE in this case takes the form

where the stochastic process $\eta $ has the mean and correlation properties given by Eq. (2.16). This equation is, thus, exactly the one obtained in Ref. 89.

Memory effects represented by integral terms seem unavoidable unless the memory kernels vanish quickly with respect to time. Infinite timescale separation between the two sets of variables leads, though, to the vanishing of the associated integral expressions.^{64} Here, we are not assuming no such property in the coupled dynamical system under study; see Eqs. (2.1) and (2.20). On the other hand, reduced phase spaces can help explain the statistics of the dynamical system without resorting to delayed effects that entail the integrals in Eqs. (2.18) and (2.23). Following Ref. 19, we briefly review in Appendix B a criterion based on Koopman operators—and, more generally, Markov operators—that enables one to decide whether memory effects can help explain the dynamics and statistics in reduced phase spaces.

### C. Markovian representation through leading Koopman eigenfunctions

In the context of Langevin dynamics, there are known conditions on memory kernels that allow one to recast certain stochastic integrodifferential equations into a Markovian SDE by means of extended variables; see Ref. 63, Sec. 8.2. The stochastic processes that allow such a procedure are called quasi-Markovian.^{63} (Such a Markovianization procedure is actually not limited to stochastic processes and it relies on the same type of ideas in other contexts; see Refs. 7, 9, 22, and 43, Sec. 1.3.)

This Markovianization theory can be formulated in the setting of near-equilibrium statistical mechanics, where one uses fluctuation-dissipation-like relations that link the decay properties of the memory kernel and the decorrelation rates of the fluctuations. Here, we follow the approach in Ref. 63 but without making any assumptions on the Hamiltonian behavior of the $y$ variables. We need, though, to make assumptions on the spectral properties of the generator of the $y$-dynamics, as explained below.

We define the generator $L0y$ of the Koopman semigroup associated with the $y$-dynamics by

for every real-valued observable $\Phi \u2208C\u221e(Rd2)$ and we denote the associated Koopman operator at time $t$ by $Ut=eL0yt$; the subscript $y$ has been dropped from the $\u2207$ operator herewith, for notational clarity. Recall that the spectrum of such operators provides useful insights into the statistical properties of the system; this topic is beyond the scope of the present paper but it is treated in detail in Ref. 19.

It suffices to show below that the spectrum of $Ut$ allows one to characterize the constitutive ingredients of the WL parameterization (2.18) and (2.23), subject to natural assumptions. Even though we have clarified in Eq. (2.8) how the Koopman operator acts on vector-valued observables of any dimension, we restrict now its action for simplicity to scalar real-valued observables, as in Eq. (2.24). In this case, along the lines of the methodology of dynamic mode decomposition,^{59,69,72} we can (formally) decompose the operator as

where ${\lambda j}j=1N$ are the eigenvalues that form the point spectrum of $L0y$ and $\Pi j$ is the spectral projector onto the eigenspace spanned by the eigenfunction $\psi j$. Here, $R(t)$ is the residual operator associated with the essential spectrum of $L0y$ and its norm is controlled by a decaying exponential. We assume furthermore that the spectrum of $L0y$ lies in the complex left half-plane and that, in particular, $Re\lambda j\u22640$ for any $j$.

Such a spectral decomposition and its properties can be rigorously justified for a broad class of differential equations perturbed by small noise disturbances; see Ref. 19, Theorem 1 and Appendix A.5. Based on these rigorous results, we assume, roughly speaking, that these properties survive in a certain small-noise limit and concentrate here on vector fields $y$ given by $G$ in (2.24) for which a decomposition such as (2.25) holds and a spectral gap does exist in the appropriate functional space.

In the following lines, we examine the expression of the memory kernel $K$ appearing in Eq. (2.23) using the eigendecomposition proposed in Eq. (2.25). In particular, we study such an integral kernel $K$ component-wise,

for $i=1,\u2026,d1$, where

and we have neglected the contribution coming from the essential spectrum. The $(\u22c5)\u2217$ superscript is used to denote the dual eigenfunction.

This decomposition highlights the fact that the leading eigenvalues of the operator governing the evolution of observables in the uncoupled $y$-dynamics set the timescale for the memory kernel. Furthermore, this spectral approximation implies that the correlation functions of the noise have the same decay properties, as will become apparent later in the proof of Theorem 2.1. It follows that the correspondence between the noise and integral timescales allows us to recast the SIDE in the WL equation (2.23) into a fully Markovian version with linearly driven hidden variables that are forced by the observed variables through a functional dependence that can be nonlinear. More exactly, we have the following theorem.

The point spectrum of $L0y$ is assumed to be constituted of $N$ simple eigenvalues, whose corresponding eigenpairs ${(\lambda j,\psi j),j=1,\u2026,N}$ are ordered as follows: $0\u2265Re\lambda j\u2265Re\lambda j+1$ and $\lambda j=\lambda j+1\xaf$ when $Im\lambda j>0$, for $j$ in ${1,\u2026,N}$.

We assume that $Cx$ in (2.20) lies in the $span{\psi j,j=1,\u2026,N}$ and has $\nu $-mean zero.

The full proof appears in Appendix A.

Note that the Koopman operator of interest here is the one associated with the $y$-subspace $Y\u2286Rd2$ and not with the entire $(x,y)$-space $X\xd7Y\u2286Rd1+d2$. Other techniques, like the DMD mentioned in Sec. I A, aim at extracting the modes of variability of the full system by means of studying the Koopman operator in the entire phase space through suitably defined observables. To this end, the latter methods employ projections of observables onto the eigenfunctions of the Koopman operator to obtain the so-called Koopman modes, which are susceptible of capturing the underlying dynamics. Notice that in Theorem 2.1, instead, we are using the Koopman eigenfunctions to identify the closure model, while projections only come into play in the definition of the coefficients $\alpha j$ and $\beta j$; see Eqs. (2.32a) and (2.32b), respectively.

Assumptions on $F$, $R$, and $\Lambda $ that ensure that (2.30) possesses a global random attractor—and thus a stable asymptotic behavior in the pullback sense—appear in Ref. 43, Theorem 3.1 and Corollary 3.2.

- Note that Theorem 2.1 can be viewed as a generalization of other Markovianization results for GLEs that appeared in the literature; see Ref. 63. For instance, the scalar GLE in $R$ reduces towhere $\lambda $ is in $Rn$, $M$ is a positive definite $n\xd7n$ matrix, and $K(t\u2212s)=(eM(t\u2212s)\lambda )\u22c5\lambda $ determines the autocorrelation of the process $\eta (t)$. In this setting, Eq. (2.36) is equivalent to the following SDE:(2.36)$x\u02d9=F(x(t))\u2212\u222b0tK(t\u2212s)x(s)ds+\eta (t),$with $\Sigma \Sigma \u2217=M+M\u2217$. Theorem 2.1 allows for nonlinear dependence on $x$ in the $z$-equation, and thus for memory kernels that are more complicated than in (2.36). Such a generalization is of practical importance since the process $z$ can then have a more complex correlation dependence on the observed variable $x$ than the one afforded by linear memory terms.(2.37)$x\u02d9=F(x(t))+\lambda \u22c5z,dz=(x\lambda \u2212Mz)dt+\Sigma dWt,$

When $L0y$ is self-adjoint—in a suitable Hilbert space, as outlined in Appendix B, i.e., when $L0y=L0y\u2217$—the eigenvalues are real and the eigenvectors are mutually orthogonal. Self-adjointness thus implies that there are no oscillations in the correlation functions of the noise or, equivalently, peaks in their power spectrum. With respect to Theorem 2.1, the matrix $H$ in this case would be the identity, since the eigenvalues and eigenfunctions are real and the Itô solutions of (2.30b) are, hence, real as well.

The resulting system given by Eq. (2.30) is now fully Markovian and the only sources of error with respect to the original SIDE (2.18) lie (i) in the effects of the essential spectrum, which are neglected herein, and (ii) the assumptions about the coupling terms. Neglecting the essential spectrum is only valid for Koopman operators with a point spectrum capable of capturing the correlations in the decoupled $y$-system; the latter might only hold in the case of Markovian diffusion processes and not for deterministic ones. Also, the assumption that the coupling functions project solely on the point spectrum might not hold in general.

From a practical perspective, though, a suitable choice of dominant eigendirections can reduce the number of extra dimensions needed to integrate the system. Such a suitable choice boils down to neglecting particular eigendirections and this can be done according to two handy criteria,

The weight determined by the $\alpha j$ and $\beta j$ coefficients defined in Eqs. (2.32a) and (2.32b) is small and

The eigenvalues $\lambda j$ of $L0y$ satisfy $Re\lambda j0\u226a\lambda \u2020$, for some $j0\u2208{1,\u2026,N}$, in which case $e\lambda j0t$ decays rapidly as $t$ grows; here $\lambda \u2020<0$ and $|\lambda \u2020|$ is some characteristic inverse time for the deterministic system $F$. In addition, if $\alpha j0>\alpha j$ for $j=1,\u2026,N$ and $j\u2260j0$, both the memory and the noise correlations die out fast. Hence, one can neglect the integral terms and perform a fully Markovian parameterization, which is possible in the presence of white noise.

Theorem 2.1 is stated for $x$ scalar for the sake of simplicity, but this result extends to the $d1$-dimensional Eq. (2.20) for the (observed) variables $x$. In this remark, we sketch the main elements that permit such a generalization.

The system equation (2.40) has the general structure one would obtain if the coupling function $Cx$ projected along all the eigendirections in the point spectrum. This might not be true in general, but the drift matrix $D$ can be rearranged so that only the relevant modes of variability are modeled—following criteria (i) and (ii), as formulated in Remark 2.4—and still afford a reduction of the number of levels $N$. Notice that the $N$th level variables $zN$ described by Eq. (2.40c) decorrelate the fastest, the rest, since their exponential decorrelation rate is given by $|Re\lambda N|\u2265|Re\lambda j|$, for all $j=1,\u2026,N\u22121$.

The advantages of the Markovian system of Eqs. (2.30) and (2.40) over the original WL equation (2.23) are twofold. First, we identify situations in which the WL equation can be Markovianized by introducing extended, hidden variables. This idea was already introduced in a preliminary application of the WL parameterization,^{87} in which the authors resorted to a Markovian system to perform their simulations. In fact, one of their examples is studied in the present framework; see Sec. II D.

Second, memory equations contain nonlocal terms that are cumbersome and computationally expensive to integrate, as well as requiring much larger storage for the full history of the system’s variables. The efficient Markovianization of evolution equations with memory terms is an active field of research in diverse areas of mathematics and the applied sciences; these areas include the study of bifurcations of delay differential equations,^{8,11} the reduction of stochastic partial differential equations to stochastic invariant manifolds,^{14,15} and material sciences,^{22} among many others.

### D. Preliminary example

As seen earlier in Theorem 2.1, if the coupling function is resonant with the Koopman operator associated with the $y$-dynamics, one can identify the dominant exponential rates of decay of the memory term and the characteristic decorrelation time of the noise. As a consequence, one can Markovianize the parameterization and greatly facilitate the numerical integrations involved.

To illustrate the above statement, we revisit here the preliminary application of the WL parameterization in the context of multiscale triads.^{87} In that work, the authors implemented the parameterization for a collection of three-dimensional models that do exhibit time scale separation and compare the corresponding outputs to those obtained via homogenization. The results are encouraging, since the parameterizations in Ref. 87 were obtained only from the decoupled hidden dynamics, in the lines of the present paper as well; see derivation of Eq. (2.18).

One of the first multiscale triads studied in Ref. 87 is the following:

Here, we require that $\u2211jB(j)=0$, $dWt(1)$ and $dWt(2)$ are scalar Brownian increments, and the parameter $\u03f5$ indicates both the timescale separation and the coupling strength. Notice that when the system is decoupled, i.e., when $\u03f5=0$, the fast dynamics evolve according to an Ornstein-Uhlenbeck (OU) process whose steady-state statistics are governed by Gaussian distributions with explicit mean and variance (see, e.g., Ref. 63). Hence, by virtue of the previous formulas or by following Ref. 87, the WL parameterization yields the following scalar SIDE:

here,

where the angular brackets refer to the ensemble averages according to the already mentioned Gaussian distributions arising from the decoupled model. Expanding these averages, Eq. (2.44c) leads to

The timescales are indicated by the exponents in the formulas above and they are the same for the noise and the memory kernel. This equality suggests the possibility of Markovianizing the memory equation into the following two-dimensional system:

Clearly, performing a numerical integration of this system is easier than for a memory equation like Eq. (2.43).

The results of Sec. II C allow us to carry out the dimension reduction of the multiscale triad by analyzing the spectral properties of the Koopman operator associated with the decoupled $y$-dynamics. Since the $y$ variables evolve stochastically, the Koopman operator becomes the *backward-Kolmogorov* equation, which governs the evolution of the expectation values of the observables. Thus, for a generic observable $\Psi $ in the $y$ phase space, the evolution of its expectation value is given by

Now, let $\Psi (y1,y2)=y1y2$ be the coupling function of the triad system (2.42), for which we find that

The above equation is an eigenvalue problem, showing that this particular $\Psi $ is an eigenfunction of the Koopman operator associated with the eigenvalue $e\u2212(\gamma 1+\gamma 2)$. This is no surprise since $y1$ and $y2$ are, respectively, the Hermite polynomial eigenfunctions of the backward-Kolmogorov equation of the scalar OU process.^{79} Hence, the product $y1y2$ is also an eigenfunction of the same equation for the joint process. Therefore, we can immediately re-Markovianize the parameterization according to Eqs. (2.30), where $D=\gamma 1+\gamma 2$ and

## III. MULTILEVEL STOCHASTIC MODELS AND EMPIRICAL MODEL REDUCTION (EMR)

### A. Multilevel stochastic models (MSMs)

MSMs are a general class of SDEs that were introduced in Ref. 43 and are, by their layered structure, susceptible to provide a good approximation of the GLE (1.3b) formulated by Mori and Zwanzig when a high-dimensional system is partially observed; see Ref. 43, Proposition 3.3 and Sec. 5. The MSM framework allows one to provide such approximations that are accompanied by useful dynamical properties, such as the existence of random attractors (Theorem 3.1 in Ref. 43). The conditions on the high-dimensional system’s coupling interactions between the resolved and hidden variables are also well understood (Corollary 3.2 in Ref. 43). (Moreover, random attractors with fractal structures that survive highly degenerate noise^{17}—a situation that might occur when approximating deterministic chaotic dynamics by stochastic pathwise dynamics—can still be present in MSMs; see Ref. 43, Sec. 7.)

As discussed in Ref. 43, MSMs arise in a variety of data-driven protocols for model reduction that typically use successive regressions from partial observations; see Sec. III B. The general form of an MSM is given by Ref. 43, Eq. (**MSM**); we only use herein its most basic version, which has the following structure:

Here, the observed vector variable $x(t)$ lies in $Rd1$ and, for $\u03f5=0$, the hidden variables $y(t)\u2208Rd2$ evolve in time independently. Otherwise, the dynamics of the $x$ variables is linearly coupled to that of the $y$ variables, which act upon (3.1a) as a stochastic forcing, via the canonical projection $\Pi :Rd2\u27f6Rd1$, while $Wt$ in (3.1b) is a $d1$-dimensional Wiener process. Clearly, Eq. (3.1) is closely related to Eq. (2.38) discussed above.

The matrix $C$ in $Rd2\xd7d1$ models the feedback of the $x$-process onto the $y$ variables. In the case of $C\u22610$, $y$ would evolve according to an OU process with drift matrix $D$ and covariance matrix $\Sigma \Sigma \u2217$. For the sake of simplicity, we restrict ourselves to the case $d1=d2$ so that the projection $\Pi $ reduces to the identity.

The more general MSM with nonlinear coupling considered in Ref. 43 was shown to be equivalent to a SIDE with explicit expressions for the memory kernels and stochastic forcing being obtained; see Ref. 43, Proposition 3.3. The noise term there results from successive convolutions of the homogeneous solutions of the lower levels of the system with an OU process. In particular, using the Itô stochastic calculus, one readily obtains a SIDE that is equivalent to an MSM; see Ref. 43, Sec. 3.2 and Appendix C.

We show next that the same SIDE can actually be obtained by using the operator formalism presented in Sec. II. One might object that an MSM is a stochastic system, due to the presence of white noise in the hidden layer, whereas the theory presented above applies to deterministic dynamics. However, as clarified below, the operator formalism applies equally well to the MSM case.

In fact, given a smooth, $C\u221e$ observable

its expected value along a stochastic trajectory $Xt=(x(t),y(t))\u22a4$ solving Eq. (3.1), namely, $E(\Phi (Xt))$, defines a Markov semigroup $Pt$ by

the only difference with respect to the transport equation (2.2) lies in the presence of a second-order differential operator induced by the white noise.

We introduce the operators

which play a role that is analogous to their deterministic relatives in Eq. (2.3) of Sec. II. Again, the operator $L1$ is viewed as a perturbation to the operator $L0$ due to the coupling.

If one considers observables $\Phi =\Phi (x)$, Eq. (3.4) becomes at time $t=0$,

and we apply now, as in Sec. II, the Dyson perturbative expansion. By virtue of the formula (2.18), the parameterization leads to a reduced equation of the form

where the hidden variables in the decoupled regime are governed by an OU process with invariant measure $\mu y$. The properties of the stochastic noise $\eta (t)$ are given by

where $y$ is a function analogous to the coupling function $Cyx$ in Sec. II and the initial condition $y0$ is assumed to be normally distributed with zero mean and variance $\Sigma \Sigma \u2217$. The memory kernel is given by

Using the intermediate steps above, the explicit parameterization finally becomes

The integrodifferential equation above is the same as Eq. (C2) one obtains using the Itô integration described in Appendix C. This similarity of results occurs because we are considering the case of additive coupling, and the Dyson expansion can be truncated after the memory term proportional to $\u03f52$, cf. Sec. II B.

### B. Empirical model reduction (EMR)

As discussed in Secs. I and II, and illustrated in Fig. 1, the evolution of the resolved variables is forced by fluctuating terms and the effects of the previous state of the system. It is desirable, therefore, to construct a full model of the system even when only capable to partially observe it. The EMR methodology^{43,45,48,49} aims at achieving this goal; we discuss it below in the broader context of MSMs. Note that EMR provides a solution for the dynamical closure of partially observed systems and thus it differs from the methodology recently proposed in Ref. 6, which requires one to fully observe the system for the data-driven discovery of its underlying equations to work.

Having a set of reduced $d1$-dimensional observations ${xi:i=1,\u2026,n}$ every $dt$ time units, one seeks to regress the tendencies ${dxi:i=1,\u2026,n}$ of the data onto a quadratic function of the form

where $b$ in $Rd1\xd7d1$ describes dissipative processes and $Q$ is a quadratic form describing self-interaction between the $x$ variables. The $i$th component of the quadratic form is given by

where $Ai$ in $Rd1\xd7d1$. The function $F$ is expected to approximate the vector field driving the dynamics in the absence of hidden external influences. Of course, performing regressions yields an error called *residual*${yi:i=1,\u2026,n}$. Hence, the evolution of the $x$ variables satisfies the equation

At this point, one can study the properties of the residual time series ${yi}i=1n$ and construct a model able to reproduce its main statistical features. However, we know that if it is possible to sample all the variables of the dynamical system of interest, one expects that the residuals are explained by the errors committed exclusively in the regression algorithm. If some sort of subsampling is done, whether spatial or temporal, the residuals are also due to the delayed influence of unresolved processes that are involved in the coupling.

Allowing for the main level variables $x$ to be linearly coupled with the residual $y$, we are creating a model that is able to incorporate memory effects as well. Hence, for each component $i$ of $x$, we have

where we have introduced new matrices $b(j)\u2208Rd1\xd7(j+2)d1$ that model the linear coupling. The residual at the last level $[y(l+1)]i$ is assumed to obey the Wiener process for which the correlation matrix is obtained from the last residual time series. The choice of stochastic process in the last step can only be done if the decorrelation of $[y(l+1)]i$ is sufficiently fast according to the timescale set by $dt$. This motivates the problem of choosing the optimal number $l$ of levels.

Several criteria have been established to determine the optimal number of levels $l$. The basic idea is that the resulting $(l+1)$-residual in Eq. (3.14e) should be well approximated by Gaussian white noise.^{43,49} One has, therefore, to test whether the residual variables decorrelate at lag $dt$ and whether the lag-0 covariance matrix is invariant in the last levels.

Therefore, regression on the tendency of the optimal level $y(l)$ should yield

where $\gamma (l)$ is the residual of the previous regression and is approximately equal to $y(l+1)$. Hence, $\gamma (l)$ would become a lagged version of $y(l+1)$. Subject to this assumption, it is possible to estimate the optimal value of the coefficient of determination $R2$,

This means that, when the amount of unexplained variance of the last regression is $50%$, one has reached the optimal number of levels. It is worth stressing that the empirical model (3.14) has the structure (3.1) of an MSM, as discussed in Ref. 43. It can, therewith, be integrated to transform it into an integrodifferential equation with explicit formulas for the fluctuating noise and memory kernel, cf. Ref. 43, Proposition 3.3; see also Sec. III A for such a transformation from another perspective.

Finally, note that the aforementioned stopping criterion for EMR—namely, $R2\u22430.5$, see Ref. 43, Appendix A—is based on decorrelation times and it is also present in the multilevel WL equation (2.40). We noted, in fact, in Remark 2.5 of Sec. II C that the last level modeled by Eq. (2.40c) decorrelates the fastest with respect to the rest. Ultimately, making these points amounts to saying that a low number of levels is expected to arise in the EMR method, provided most of the eigenvalues $\lambda j$ in Theorem 2.1 are located far away from the imaginary axis, except for a very few of them. Conversely, if the Koopman eigenvalues cluster near the imaginary axis or do not exhibit a spectral gap located at a, suitably defined, small negative real part, many levels are expected to be needed to capture the hidden dynamics; see again Remark 2.5.

## IV. NUMERICAL EXPERIMENTS

In Secs. II and III, we have shown that both the WL top-down approach and the EMR data-driven method yield a set of multilevel equations for the variables of interest in a multi-scale system. In particular, both approaches give explicit formulas for the fluctuation term and memory kernel in the GLE (1.3b) of the Mori–Zwanzig formalism. Furthermore, their Markovian representation share that the hidden layers are linearly driven with a white noise background [see Eqs. (2.38) and (3.1)]. We now compare the two approaches to model reduction in a simple, conceptual stochastic climate model.

Since the modeling of geophysical flows is the primary motivation for this research, we consider a set of SDEs proposed in Ref. 27 among others as a physically consistent climate “toy” model. In such a model, the main $x$ variables are slow and weakly coupled to the fast $y$ variables. The latter correspond to weather fluctuations and carry, in fact, most of the system’s variance. The model’s governing equations are

where $Wt(1)$ and $Wt(2)$ are two independent Wiener processes.

These equations describe the evolution of four real variables $x=(x1,x2)$ and $y=(y1,y2)$; their timescale separation is determined by the parameter $h$ and the coupling strength is controlled by $\u03f5$. The parameter values used herein are $c134=c341=0.25$, $c413=\u22120.5$, $L12=L21=1$, $L24=\u2212L13=1$, $a1=\u2212a2=1$, $d1=0.2$, $d2=0.1$, $F1=\u22120.25$, $F2=F3=F4=0$, $\gamma 1=2$, $\gamma 2=1$, and $\sigma 1=\sigma 2=1$. The timescale separation and the coupling strength are $h=0.1$ and $\u03f5=0.5$, respectively.

### A. WL approximation

Notice that the hidden variables evolve according to a decoupled OU process. Taking advantage of this fact, we calculate the weak-coupling–limit parameterization of the model, according to the formulas presented in Sec. II for the separable coupling functions given by

The coupling function $Cx$ in the slow equation (4.2a) is independent of the $x$ variables, indicating that the noise correction can be additively incorporated and implemented by examining the decoupled hidden process. Note that the functional form of $Cy$ implies that the WL parameterization cannot be exact in $\u03f5$, as noted in Eqs. (2.11) and (2.12). This indicates that the WL reduced model will not only introduce an error in averaging over the decoupled steady state, but also that the Dyson expansion Eq. (2.5) has to be truncated at $\u03f53$, rather than merely at $\u03f52$ where no memory effects would be included.

According to the WL parameterization discussed in Sec. II, the fluctuation terms correspond to the decoupled evolution of the coupling function $Cyx$, concretely as in Eq. (2.16). This allows to directly compute the correlation function,

From Eq. (4.3), we deduce that the noise covariance matrix for the given parameter values is given by

The memory kernel $K$, which is a vector of two components $(\kappa 1,\kappa 2)\u22a4$, is given by

here the brackets $\u27e8\u22c5\u27e9$ indicate the averages for the uncoupled equilibrium in the $y$ variables, which happen to be a set of independent OU processes. Explicitly,

The reduced-order model obtained herewith does give explicit formulas for the evaluation of the stochastic noise and the memory kernel, independently of the timescale separation $h$, although these formulas are rather complicated. Still, the scheme remains the same when changing parameter values, so it is flexible in studying different scenarios.

### B. EMR model and results

#### 1. Basic EMR algorithm implementation

Regarding the data-driven EMR protocol, we integrated the full model with a time step of $d\u2113t=10\u22123$ time units for a duration of $T\u2113=104$ time units in order to learn the model parameters. Then, a separate run was performed in order to examine the ability of the inferred model to reproduce the general statistical features. This time, the EMR system was integrated together with the full model using a time step of $d\tau t=10\u22122$ time units for a total of $T\tau =105$ time units. The equations were solved using a fourth-order Runge–Kutta and a Euler–Maruyama method for the deterministic and stochastic components, respectively.

By sampling every time step, we learned an EMR model whose coefficients were explicitly found. The convergence criterion $R2\u22430.5$ was attained by adding two extra levels, for a total of three. The convergence was not affected by changes in the timescale separation parameter—namely, $h=0.1$ and $h=1$ in the case at hand. Probably, the value of $h$ was not that important here because of the low dimensionality and stochastic nature of the hidden process. However, convergence is likely to be altered in more complicated models, as illustrated in Appendix D.

The climatologies of the slow $x$ variables are obtained using data from the full model, the EMR model, and the WL parameterization. The two-dimensional probability density functions (PDFs) of the stochastic model (4.1) in the $(x1,x2)$ plane are shown in Fig. 2. These PDFs were calculated by employing the Matlab R2019a kernel smoothing function *ksdensity*. Their respective marginals are shown in Fig. 3. The agreement between the two methodologies when approximating the clearly non-Gaussian density arising from the full model is clearly excellent.

The timescale separation between the $x$ variables and the $y$ variables is clearly depicted in the left panel of Fig. 4, where the fast variables decorrelate almost instantly compared to the slow ones. The approximation of these autocorrelation functions is also obtained using the EMR and WL methods.

In general, cf. Ref. 48, the regressions performed in the main level (3.14a) of the EMR model allow one to effectively reconstruct the coefficients of a weakly coupled model; see Appendix D. The EMR methodology, though, only allows for linear coupling between the slow $x$’s and the fast $y$’s. The nonlinear coupling between the slow and fast variables in system (4.1) compromises the estimation of the main model parameters in Eq. (3.14a) so that we cannot expect to recover the original, full model’s behavior given by (4.1). The EMR model coefficients at the first and second levels are as shown in Tables I and II, respectively.

f
. | x_{1}
. | x_{2}
. | $x12$ . | x_{1}x_{2}
. | $x22$ . |
---|---|---|---|---|---|

−0.314 04 | −0.509 54 | −0.065 313 | 0 | −1.009 2 | 0.997 04 |

−0.153 56 | 0.123 53 | 0.219 79 | 1.0092 | −0.997 04 | 0 |

f
. | x_{1}
. | x_{2}
. | $x12$ . | x_{1}x_{2}
. | $x22$ . |
---|---|---|---|---|---|

−0.314 04 | −0.509 54 | −0.065 313 | 0 | −1.009 2 | 0.997 04 |

−0.153 56 | 0.123 53 | 0.219 79 | 1.0092 | −0.997 04 | 0 |

f^{(1)}
. | x_{1}
. | x_{2}
. | $r1(1)$ . | $r2(1)$ . |
---|---|---|---|---|

0 | −5.6397 × 10^{−4} | −6.9382 × 10^{−05} | −20.282 3 | 0.016 165 |

0 | −1.648 × 10^{−4} | −4.72 × 10^{−4} | 0.086 621 | −10.042 6 |

f^{(1)}
. | x_{1}
. | x_{2}
. | $r1(1)$ . | $r2(1)$ . |
---|---|---|---|---|

0 | −5.6397 × 10^{−4} | −6.9382 × 10^{−05} | −20.282 3 | 0.016 165 |

0 | −1.648 × 10^{−4} | −4.72 × 10^{−4} | 0.086 621 | −10.042 6 |

As discussed in Sec. III B, the EMR has the structure of an MSM and it can be recast into an integrodifferential equation. If one only considers the first added level, the EMR can be readily integrated giving the following equation for the evolution of the slow variables $x=(x1,x2)$:

Here, $Ws$ is an independent two-dimensional Wiener process and

First thing to note is that the matrix $C$ has a small norm and, by virtue of Eq. (4.7), it means that memory effects are going to be very small. On the other hand, the eigenvalues of the matrix $D$ are $\lambda 1\u2243\u221220,\lambda 2\u2243\u221210$, which are approximatelly the drift coefficients of the uncoupled OU process driving the $y$ variables. This indicates that the exponential kernel is damping the effects of the $x$ variables in past times rather quickly. Moreover, the covariance matrix $\Sigma $ corresponds to that obtained by integrating the $y$-dynamics independently, according to Eq. (4.4).

Regarding the WL approximation, we stress that the $y$ variables are no longer present, after taking the averages in its construction. The memory kernel $K$ in this case differs from the matrix $D$ above, although its dominant terms correspond to its eigenvalues. Note that, if $L13=0$, the coupling function $Cyx$ would project entirely onto the eigenfunction of the OU process associated with the eigenvalue $\u2212(\gamma 1+\gamma 2)$. The same statement would hold for $c134=0$, where in this case $Cyx$ projects onto the eigenfunctions associated with the eigenvalue $\u2212\gamma 1$.

#### 2. Reduced time scale separation

The parameter $h$ controls the timescale separation in the evolution of the $x$ and $y$ variables. Here, we set $h=1$ so that this separation is reduced by an order of magnitude as illustrated by the autocorrelation functions in Fig. 7. The question to be addressed in this subsection is the effect of such a reduction in the WL and EMR parameterizations and their respective performance.

In the WL parameterization, there is no need to sample the dynamics in order to construct it, since the formulas of Sec. II are explicit and do not depend on $h$. In the case of model (4.1), the covariance matrix and time correlations of the WL noise correction are thus given by Eq. (4.3) with no reference to $h$. The memory term, though, is expected to change as the kernel $K$ will decay more slowly by a factor of $10$. Therefore, memory effects are more important, as expected.

The EMR approach, on the other hand, requires a new learning phase for this value of $h=1$. We used the same numerical integration parameters $d\u2113t=10\u22123$ and $T\u2113=104$ time units as for the previous case. In the first level regression, one observes that the coefficient values listed in Table III are essentially the same from those estimated in the previous case, for $h=0.1$, and listed in Table I, as well as being even more distant from the original ones in the table’s first row. The second level coefficients, for $h=1$, are shown in Table IV.

f
. | x_{1}
. | x_{2}
. | $x12$ . | x_{1}x_{2}
. | $x22$ . |
---|---|---|---|---|---|

−0.311 81 | −0.339 | −0.429 44 | 0 | −0.934 39 | 0.979 58 |

−0.169 25 | 0.468 33 | 0.175 13 | 0.934 39 | −0.979 58 | 0 |

f
. | x_{1}
. | x_{2}
. | $x12$ . | x_{1}x_{2}
. | $x22$ . |
---|---|---|---|---|---|

−0.311 81 | −0.339 | −0.429 44 | 0 | −0.934 39 | 0.979 58 |

−0.169 25 | 0.468 33 | 0.175 13 | 0.934 39 | −0.979 58 | 0 |

f^{(1)}
. | x_{1}
. | x_{2}
. | $r1(1)$ . | $r2(1)$ . |
---|---|---|---|---|

0 | −3.9452e-3 | −1.5027e-4 | −2.100 1 | 0.016 509 |

0 | −5.0312e-4 | −3.79e-3 | 0.054 861 | −1.121 4 |

f^{(1)}
. | x_{1}
. | x_{2}
. | $r1(1)$ . | $r2(1)$ . |
---|---|---|---|---|

0 | −3.9452e-3 | −1.5027e-4 | −2.100 1 | 0.016 509 |

0 | −5.0312e-4 | −3.79e-3 | 0.054 861 | −1.121 4 |

The covariance matrix $\Sigma $ of the noise correction is indicated in Eqs. (4.9a) and (4.9b) and it agrees fairly well with the previous values, for $h=0.1$, as given in Eq. (4.8b). The matrix $C$ indicates that the strength of the memory effects also has a magnitude that is of the same order as that in the previous case of $h=0.1$, which is rather surprising, given the factor of $10$ in timescale separation $h$; compare Eqs. (4.8a) and (4.9a). This observation tells us that the loss of Markovianity might be intrinsic to the nature of the coupling rather than being due to the time scale separation, even though, in the limit case of infinite scale separation, memory effects will dissapear entirely. The memory kernel, as determined by $D$, scales almost exactly with the timescale separation and it is expected to change depending on how the coupling functions project onto the eigenspaces of the underlying Orstein–Uhlenbeck process, as discussed more generally earlier in Theorem 2.1.

The performance of both parameterization techniques is summarized in Figs. 5–7. These figures are the exact counterparts of Figs. 2–4 for the reduced timescale separation $h=1$. First, we note from Fig. 7(a) that for $h=1$ there is indeed no strict timescale separation, as indicated by the autocorrelation functions obtained from the full model. Second, consideration of Figs. 5(a)–5(c), 6(a), 6(b), and 7(b) shows that neither the WL nor the EMR approach seems to be affected by the timescale reduction

### C. Memory effects

We would like to end this results section by analyzing the role of the memory effects when performing a reduction of the highly idealized model given by Eq. (4.1). For this purpose, we apply the criterion (B4) discussed in the corresponding Appendix B. We thus spectrally approximate the autocorrelation functions of the variables $x1$ and $x2$ using Eq. (B4) with the Koopman operator $T\tau $ estimated using Ulam’s method with a transition time of $\tau =0.5$ time units for the case where $h=0.1$ and $\tau =1$ time unit for $h=1$.

The difference between the two $\tau $-values is due to the fact that we expect the coarse-grained phase space to be sensitive to the system’s variability. Hence, if the timescale separation is large, a shorter transition time is required in order to capture the influence of the hidden processes. In fact, a range of transition times was tested in the case of $h=1$ to find that the optimal value was $\tau =1$. It follows that the methodology is robust in showing the effects of memory in the projected phase space.

We clearly observe in Fig. 8 that the correlation functions can be accurately reconstructed in the case of large timescale separation $h=0.1$ [Fig. 8(a)] but not so for $h=1$ [Fig. 8(b)]. This indicates, naturally, that memory effects are negligible in the first case and relevant in the second case.

## V. CONCLUSIONS

To formulate accurate and efficient parameterizations for multiscale processes is a crucial challenge in many areas of science and technology for one of the two reasons: either the numerical simulation of all scales active in a given system is computationally unfeasible or there is a mismatch between the model resolution and the granularity and homogeneity of the observations, as in the case in geophysical flows and in the climate system. Moreover, the construction of parameterizations is instrumental to help understand the nature of nonlinear fluxes across scales and the physical processes responsible for cascades, instabilities, and feedbacks.

There are two main approaches for constructing parameterizations: top-down, by deriving the parameterizations directly from the evolution equations governing the system through the use of suitable approximations; and data-driven, in which the parameterizations are constructed through suitable optimization procedures, which are first tuned in a training phase and then actually used in the prediction phase. Both approaches aim to derive the effective dynamics for the variables of interest: formally, this is achieved by applying the Mori–Zwanzig projection operator^{60,90} to the full dynamics. The result of doing so is to describe the impact of the hidden variables by formulating a generalized Langevin equation (GLE) (1.3b) for the variables of interest that includes a deterministic, a stochastic, and a non-Markovian component.

Top-down and data-driven approaches are conceptually complementary and have different practical advantages and disadvantages. In this paper, we have shown the fundamental equivalence between a top-down and a data-driven approach that have been formulated and applied in the recent literature. This equivalence was illustrated schematically in Fig. 1.

We first revisited in Sec. II the WL parameterization of Refs. 88 and 89, which relies on an assumption of weak coupling between the hidden and observed variables and have extended the previous results by considering more general coupling classes. We have also shown that the perturbative expansion that yields the WL parameterization is exact when the coupling between the hidden and resolved variables is additive.

The Dyson formalism (2.12) appears to be essential for computing the effects of the hidden processes on the dynamics of the observed variables, when working at the level of the system’s observables. This methodology is explicit in the sense that no information about the actual coupled process is needed, because the formal computations are performed by considering the limit in which no coupling is present. Other advantages of this approach are that it can be implemented without the need for any hypothesis on the timescale separation between the hidden variables and the observed ones and that it is also scale adaptive.^{83}

We addressed systematically the problem of re-Markovianizing the WL memory equation, which was first pointed out in Ref. 87 and discussed further in Sec. II D. In this example, a system (2.42) with one observed and two hidden variables that yielded a scalar WL parameterization was re-Markovianized to a Markovian system with just two scalar differential equations. Throughout Sec. II, we provided a broader framework for re-Markovianization. This framework was presented for a scalar equation in Theorem 2.1 and described for higher dimensional systems in Remark 2.5. The required assumptions for this treatment boil down to certain spectral properties of the Koopman operator for the hidden variables.

The multilevel structure of the re-Markovianization obtained in Sec. II motivated the comparison with multilayer stochastic models (MSMs) in Sec. III. Such MSMs arise naturally in data-driven reduction methods and they had been shown in Ref. 43 to approximate the GLE predicted by Mori^{60} and Zwanzig.^{91}

We showed in Sec. III A that a seemless application of the WL parameterization solves the MSM of Eq. (3.1) and coincides with its Itô integration, cf. Ref. 43. Note that an MSM can be obtained from partial observations of the coupled system, which amounts to the special case of the data-driven empirical model reduction (EMR) methodology.^{43,45,46,48,49}

The EMR methodology was revisited here in Sec. III B and it is, in principle, dual to the WL parameterization, in the sense that only partial observations of the coupled system are required, without the need for knowing the actual equations of motion. Comparing the multilevel structure of Eq. (2.40) with that of Eq. (3.14) suggests that the Koopman eigenvalues $\lambda j$ highlighted in Theorem 2.1 may help provide insights into the number of levels needed for EMR to converge. This practical role of the $\lambda j$’s deserves, therewith, a more careful examination in further work.

Additionally, we considered in Sec. IV a conceptual climate model to which we applied both of the methodologies revisited herein. Since both the MSM and the WL parameterization yield a memory equation that involves integrals and stochastic noise, we were able to compare their structure, as well as their statistical outputs. We found that both methodologies produced equivalent numerical results and that the memory kernel and noise predicted in the WL parameterization agreed with what was found using the data-driven EMR approach.

Concluding, our viewpoint is complementary to the dynamic mode decomposition^{59,69,72} as it uses the basis of eigenvectors of the Koopman operator to construct the projected—in the sense of Mori–Zwanzig—dynamics of the observables of interest, which is then recast in the Markovian form using the multilevel Markovian model framework, where the number of levels corresponds to the number of eigenvectors of the Koopman operators one considers in the reconstruction of the dynamics. In a nutshell, our findings support, on the one hand, the physical basis and robustness of the EMR methodology and, on the other hand, illustrate the practical relevance of the WL perturbative expansion used for deriving the parameterizations.

## ACKNOWLEDGMENTS

It is a pleasure to acknowledge useful exchanges with Georg Gottwald, Jeroen Wouters, and Gabriele Vissio. The authors would like to thank both reviewers for their encouraging comments and insightful suggestions. V.L. acknowledges the support received from the European Union’s Horizon 2020 research and innovation program through the project CRESCENDO (Grant Agreement No. 641816). This article is TiPES contribution No. 56; the TiPES (Tipping Points in the Earth System) project has received funding from the European Union’s Horizon 2020 research and innovation program under Grant Agreement No. 820970. This research was partially supported by the Israeli Council for Higher Education (CHE) via the Weizmann Data Science Research Center, by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant Agreement No. 810370), and by the Office of Naval Research (ONR) Multidisciplinary University Research Initiative (MURI) under Grant No. N00014-20-1-2023. This paper has also been partially supported by the EIT Climate-KIC; EIT Climate-KIC is supported by the European Institute of Innovation & Technology (EIT), a body of the European Union, under Grant Agreement No. 190733, and by the Russian Science Foundation (Grant No. 20-62-46056).

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

### APPENDIX A: PROOF OF THEOREM 2.1

The aim is to show that under the assumptions of this theorem—which require that the coupling function $Cx:R\u27f6R$ projects entirely onto $span{\psi j,j=1,\u2026,N}$—the memory and noise terms of the WL equation (2.23) are obtained from the term $\u03f5\Lambda \u22c5Z(t)$ in Eq. (2.30a), after integration of Eq. (2.30b).

**Step 1.** In this step, we expand the memory term and the lag correlations of the noise in the WL equation (2.23) in terms of the leading eigenelements of the uncoupled Koopman operator $L0y$. These expansions will serve us in Step 2 to compare the noise and memory terms of the WL equation with those produced after integration of Eq. (2.30b).

Note that to go from (A6) to (A7), we have made use of the aforementioned conjugacy relations that led to a real-valued memory kernel at the end. With (A3) and (A7) at hand, the noise and memory terms in the WL equation (2.23) are thus characterized in terms of the leading eigenelements of the uncoupled Koopman operator $L0y$.

**Step 2.** The second step consists of analyzing the noise and memory terms produced by integration of Eq. (2.30b) and to compare these terms with those of the WL equation.

However, the noise term $\eta $ in the WL equation is a real-valued stochastic process, and we are dealing with complex scalars, so therefore we still have to show that $\Lambda \u22c5q(t)$ is real for every $t$ in $R$. To do so, let us denote by $w\u22a4=[w1,\u2026,wN]$ any arbitrary row vector in $RN$. In other words, $w$ is an arbitrary column vector with real entries.

### APPENDIX B: SEMIGROUP PROPERTY OF THE PROJECTED KOOPMAN OPERATOR FAMILY

It was shown in Ref. 16, Theorem A, that projection onto a reduced state space is closely related with a coarse graining of the (full) probability transitions on the original system’s attractor, while Theorem 2 in Ref. 19 dealt recently with the impact of such a projection in terms of reduction of the Koopman semigroup. In Ref. 19, the authors proposed a criterion based on the spectral theory of Markov semigroups to ascertain whether the reduced state space associated with a given projection can fully explain the statistics of the desired variables. This approach provides potential insights into the need for modeling non-Markovian effects by inspecting the loss of the semigroup property, as explained below.

Moreover, it follows from Ref. 19 that the analysis of correlation functions is not only of physical interest but also of methodological utility. Correlation functions can be defined by means of the Koopman operator or, dually, by means of the transfer operator’s providing the solution of the Liouville equation.

Let $\mu $ denote an ergodic invariant measure of the system and takes two observables $\Phi 1$ and $\Phi 2$ in the space $L\mu 2$ of zero-mean functions that are square-integrable with respect to $\mu $. Assume furthermore that the spectrum of the operator $L$ in $L\mu 2$ is a pure point spectrum, given by the eigenvalues ${\lambda j}j=1\u221e$ and their associated eigenfunctions ${\psi j}j=1\u221e$, where the eigenvalues are ordered by their decreasing real parts.

Then, the correlation function $C\Phi 1,\Phi 2(t)$ between the functions $\Phi 1$ and $\Phi 2$ is given by

and it can be expanded, formally, as

The dual operators in (B1) and the adjoint eigenvectors in (B2) are indicated by the superscript $(\u22c5)\u2217$, while $\u27e8\u22c5,\u22c5\u27e9\mu $ denotes the inner product. We refer to Corollary 1 in Ref. 19 for a proof of (B2) in the context of Markov semigroups. The proof actually applies to the case of the Koopman semigroups considered here as long as the Koopman semigroup $Ut$ defined by (2.4) is a strongly continuous semigroup in $L\mu 2$. The RHS of Eq. (B2) consists of a linear combination of exponential terms whose coefficients are calculated by projecting $\Phi 1$ and $\Phi 2$ onto the corresponding eigenspaces. These coefficients weight each exponential function and they can become exceedingly large if the Koopman operator deviates very much from normality.^{77} Note also that the set of eigenvalues $\lambda j$’s play a key role in defining the response of the system to perturbations.^{55,75}

The interactions between the resolved and hidden variables that are modeled by the Dyson expansion of the Koopman operator in Sec. I A may introduce memory effects into the closed, reduced model for the $x$ variables, as given by Eqs. (2.18)–(2.22). In certain situations, such memory effects can be neglected, even in the absence of exact slaving relationships between the resolved and hidden variables.^{13} But the loss of slaving relationships requires, in general, an explicit representation of memory effects^{12} to achieve an efficient model reduction.

Furthermore, it was shown in Refs. 16 and 19 that the reduction of the Koopman semigroup to observables that act only on the reduced state space leads, in most circumstances, to a family of operators that, while Markovian, no longer satisfy the semigroup property. One might then ask to which extent this loss of the semigroup property arising from the reduction, and the related emergence of memory effects, is crucial for providing a faithful reduced model of the observed variables.

When considering reduced state spaces obtained by projection, along with observables $\Phi 1$ and $\Phi 2$ defined on them, Theorem 2 in Ref. 19 shows the existence of a family of Markov operators ${Tt}t\u22650$ that satisfies

for every $t\u22650$, where $\pi x$ is the canonical projection onto the reduced subspace and $\mu x$ is the *disintegrated* or *sample measure* associated with $\pi x$; see Ref. 19, Remark 3. However, due to the projection, the semigroup property is lost, namely, $TsTt\u2260Tt+s$ for some $t,s$.

Following the reasoning given above, one can establish a criterion for the need to model a memory contribution when performing the model reduction. Formally, if there exist $\tau >0$ and $T\u2208N$ such that for every $t\u2208{k\tau \u2208R:0\u2264k\u2264T}$, we have

and one can say that the semigroup is preserved, to some extent, depending on how large $T$ can be in Eq. (B4). Other such criteria are available in the context of mutually dual Koopman and transfer operators. Thus, Tantet *et al.*^{76} had already considered empirical ways of quantifying the loss of the semigroup property in reduced dimensions.

The interpretation of $\tau $ comes from the practical implementation of the methodology and it is usually referred to as the *transition time*. Indeed, numerically, the approximation of such Markov operators is done by Ulam’s method, by projecting them onto a finite basis, typically using the characteristic functions of certain domains of phase space. Then, the transitions between domains—after an adequate transition time $\tau $—are counted to obtain matrix estimates of the operator $T\tau $ acting on the reduced phase space. Hence, one seeks a suitably small, or large,^{75} transition time to obtain the best candidate for applying Eq. (B4), see also Ref. 19, Sec. 3.3. Thus, Eq. (B4) can be implemented in practice this way in order to (potentially) reconstruct correlation functions on the whole phase space. A very simple illustration of such a transfer operator calculation is given in Ref. 81.

### APPENDIX C: ITÔ INTEGRATION OF THE MSM

In the main text, we proposed a solution of the MSM given by Eq. (3.1) using the Dyson expansion for the linear operators involved in the backward Kolmogorov equation—advection acting on functions. Therefore, we substituted nonlinear ordinary differential equations for a partial differential equation, for the sake of having linear operators in hand. The same solution can be attained by direct integration of the MSM in the form (3.1). We convolute, in the Itô sense, Eq. (3.1b) to find an explicit solution for $y(t)$ when $d1=d2$,

here, $y(0)$ indicates an initial state that can be assumed to be distributed in a prescribed way. For more general, nonlinear MSMs considered there, see Ref. 43, Proposition 3.3.

By substituting the expression (C1) into Eq. (3.1a), we find an exact expression for the evolution of $x(t)$,

in which the memory effects in the fourth term are of second order in $\u03f5$. Note that the $\u03f5$-order terms arise from a noise realization in the decoupled regime, whereas the memory term is exclusive due to the coupling of the main variables with the hidden ones. Hence, the only degree of freedom left is the distribution of the initial states $y(0)$.

### APPENDIX D: THE COUPLED L84–L63 SYSTEM

The EMR methodology’s ability to capture the statistics of low-dimensional dynamical systems was illustrated in Ref. 48, where the authors considered the L63 system^{53} as a test case in which the phase space can be fully sampled. Moreover, provided that the integration time step is short enough, the parameters of the underlying model can be fully captured with a high degree of confidence.

Here, we repeat the analysis of Ref. 48 to illustrate the effectiveness of EMR in capturing statistical and dynamical properties in an extended system. The model we consider is the result of coupling the $X=(X,Y,Z)$ variables of the L84 system^{54} with the $x=(x,y,z)$ variables of the L63 system,^{53} namely,

the parameter values are $a=0.25,b=4,$$F0=8,G=1$, and $s=10,$$\rho =28,\beta =8/3$, respectively. The parameter $h$ measures the strength of the coupling, while $\tau $ scales the rate of change in the L63 system and, therewith, the timescale ratio between the two subsystems.

This system is a skew-product, in the sense of Ref. 74, since the coupling is one-way only, with the L63 system driving the L84 dynamics. Hence, one has—as noted in Ref. 82—a fully Markovian parameterization of the L63 variables. Furthermore, the correlation function that defines the stochastic noise $\eta (t)$ can be further expanded and simplified, with respect to Eq. (2.16). One can, in fact, write explicitly

where the angular brackets $\u27e8\u22c5\u27e9$ indicate averages with respect to the physical measure associated with the L63 system. Since L84 does not feed back into L63, the evolution of $x(t)$ only obeys the dynamics of L63, and thus the decorrelation of the noise scales with $\tau $.

In most of the numerical experiments, the timescale separation between the two systems is $\tau =5$. The relevance of this timescale parameter was investigated in the previous work.^{82} Here, we focus on the effects of the coupling strength $h$, and we shall study the cases of $h=0.25$ and $0.025$. Partial observations only will be used in these experiments, by sampling the three-dimensional outputs of the L84 system. Then, the observed tendencies are regressed and sequentially layered following the EMR approach, as explained in Sec. III.

#### 1. EMR outputs

The L84–L63 model is integrated for 730 time units that correspond in L84 to 10 natural years, with a time step of $5\xd710\u22123$ time units; two separate runs are made for the coupling strengths $h=0.25$ and $h=0.025$. These two full-model runs are used to train the corresponding EMR model versions, both of which use only the slow $X$ variables and eliminate the fast $x$ variables. Then, two separate full-model simulations are run, for testing purposes, over 7–300 time units, and the EMR model’s output is compared with it, for the two parameter values. Below we show the main statistical outputs of the EMR methodology compared to the two reference integrations of the full model. The results for the two separate $h$-values are shown in Figs. 9–11 and 12–14, respectively.

The region of phase space explored by the EMR model clearly coincides with the one visited by the full model, as seen in Figs. 9 and 12, and the relative occupancies within this region—as indicated by the smoothed PDFs shown in Figs. 10 and 13, respectively—agree very well. The timescales are also well captured, as indicated by the good approximation of the autocorrelation functions, cf. Figs. 11 and 14.

Notice that, while the original L84–L63 system is purely deterministic, the EMR model includes white noise acting on the hidden layers of the learned model. This fact could suggest that a smoothing of the invariant measure is inevitable and that the EMR methodology may not be able to capture fractal geometries in phase space, since the EMR model does not satisfy the Hörmander’s hypoellipticity condition.^{18,40} The numerical evidence in Figs. 9 and 12, however, illustrates a strikingly good approximation of the full model’s attractor, including its very fine, and presumably fractal, structure.

Actually, Theorem 3.1 and Corollary 3.2 in Ref. 43 provided sufficient conditions for the existence of a random attractor for a broad class of MSMs that are not subject to a non-degeneracy condition of Hörmander type. In other words, one can have an MSM that possesses a random attractor and is thus dynamically quite stable, while exhibiting in a forward sense an invariant measure of the associated Fokker–Planck equation that is singular with respect to the Lebesgue measure. This mathematically rigorous result helps explain what is observed numerically not only in the present paper for the EMR of the L84–L63 model, but also in the case of the EMR model of the Lotka–Volterra example in Ref. 43, Fig. 7.

Ulam’s method was used on the projection of the full $(X,x)$ phase space onto the $X$ subspace to approximate the spectrum of the Koopman operator, since it can provide further information on the characteristics of the time series, beyond PDFs and correlation functions. The observed spectra (red $\xd7$ symbols) using a coarse partition of phase space into $512$ nonintersecting boxes showed good agreement with the spectra based on the full model (blue open circles); see Fig. 15. This agreement confirms further that, at this level of coarse graining, the EMR model captures well the characteristics of the full model’s solutions.

#### 2. Convergence

Convergence in the EMR approach is determined by the “whiteness” of the last-level residual, as explained in Sec. III B; see Eq. (3.16) and discussion thereof. In Fig. 16, we plotted the mean of the determination coefficients $R2$ for the three $X$ variables and we show that its convergence in the EMR approach depends only mildly on the coupling parameter $h$. Indeed, for $h=0.25$ we observe in panel (a) that around 18 levels are necessary before achieving the optimal level, whereas for weaker coupling with $h=0.025$ convergence is attained in panel (b) already with 15 levels, as one might expect.

Furthermore, as already pointed out in Ref. 43, Sec. 7, on a different example, the results in Fig. 16(c) illustrate that a smaller timescale separation $\tau $ can require a higher number of levels for EMR to attain convergence: in the case at hand, around 25 levels are needed. For completeness, Fig. 16(d) shows that including additive white noise in the L63 system can, in fact, accelerate the convergence of the method, with convergence achieved at $\u2113=7$.

#### 3. Model coefficients

We show here that the EMR model coefficients can be efficiently approximated when phase space subsampling is carried out. Here, regressions are performed over $50$ short time series of $10$ time units each, with a time step of $5\xd710\u22123$, as in Section 1 of Appendix D. The reason for taking this sample length here is that $10$ time units is visually enough for the slow variable $X$ to go through a cycle, as illustrated in Fig. 17, for both $h=0.25$ and $0.025$.

The estimated coefficients and their standard deviations using the EMR regressions are listed in Tables V and VI for $h=0.25$ and in Tables VII and VIII for $h=0.025$. The tables show the coefficients of the linear and quadratic forms at the first level in the EMR regressions: see Eq. (3.14a).

EMR . | 1 . | x
. | y
. | z
. | x^{2}
. | xy
. | y^{2}
. | xz
. | yz
. | z^{2}
. |
---|---|---|---|---|---|---|---|---|---|---|

f_{X} | 1.949 | −0.352 | −0.002 | −0.104 | −0.015 | 0.01 | 0.052 | −0.946 | −0.001 | −0.921 |

f_{Y} | 0.999 | 0.001 | −1.001 | 0.002 | 0 | 1.001 | −4.003 | 0 | 0 | 0 |

f_{Z} | 0.002 | −0.003 | −0.002 | −1.001 | 0.001 | 4.003 | 1.001 | 0 | 0 | 0 |

EMR . | 1 . | x
. | y
. | z
. | x^{2}
. | xy
. | y^{2}
. | xz
. | yz
. | z^{2}
. |
---|---|---|---|---|---|---|---|---|---|---|

f_{X} | 1.949 | −0.352 | −0.002 | −0.104 | −0.015 | 0.01 | 0.052 | −0.946 | −0.001 | −0.921 |

f_{Y} | 0.999 | 0.001 | −1.001 | 0.002 | 0 | 1.001 | −4.003 | 0 | 0 | 0 |

f_{Z} | 0.002 | −0.003 | −0.002 | −1.001 | 0.001 | 4.003 | 1.001 | 0 | 0 | 0 |

EMR . | 1 . | x
. | y
. | z
. | x^{2}
. | xy
. | y^{2}
. | xz
. | yz
. | z^{2}
. |
---|---|---|---|---|---|---|---|---|---|---|

f_{X} | 0.543 | 1.005 | 0.246 | 0.291 | 0.428 | 0.152 | 0.188 | 0.104 | 0.076 | 0.099 |

f_{Y} | 0.001 | 0.002 | 0 | 0.001 | 0.001 | 0 | 0 | 0 | 0 | 0 |

f_{Z} | 0.001 | 0.002 | 0.001 | 0.001 | 0.001 | 0.001 | 0 | 0 | 0 | 0 |

EMR . | 1 . | x
. | y
. | z
. | x^{2}
. | xy
. | y^{2}
. | xz
. | yz
. | z^{2}
. |
---|---|---|---|---|---|---|---|---|---|---|

f_{X} | 0.543 | 1.005 | 0.246 | 0.291 | 0.428 | 0.152 | 0.188 | 0.104 | 0.076 | 0.099 |

f_{Y} | 0.001 | 0.002 | 0 | 0.001 | 0.001 | 0 | 0 | 0 | 0 | 0 |

f_{Z} | 0.001 | 0.002 | 0.001 | 0.001 | 0.001 | 0.001 | 0 | 0 | 0 | 0 |

EMR . | 1 . | x
. | y
. | z
. | x^{2}
. | xy
. | y^{2}
. | xz
. | yz
. | z^{2}
. |
---|---|---|---|---|---|---|---|---|---|---|

f_{X} | 2.006 | −0.253 | −0.001 | −0.003 | −0.004 | 0 | 0.001 | −1.002 | 0.001 | −1.003 |

f_{Y} | 1 | 0 | −1.001 | 0.002 | 0 | 1.001 | −4.003 | 0 | 0 | 0 |

f_{Z} | 0.001 | −0.002 | −0.002 | −1.001 | 0.001 | 4.003 | 1.001 | 0 | 0 | 0 |

EMR . | 1 . | x
. | y
. | z
. | x^{2}
. | xy
. | y^{2}
. | xz
. | yz
. | z^{2}
. |
---|---|---|---|---|---|---|---|---|---|---|

f_{X} | 2.006 | −0.253 | −0.001 | −0.003 | −0.004 | 0 | 0.001 | −1.002 | 0.001 | −1.003 |

f_{Y} | 1 | 0 | −1.001 | 0.002 | 0 | 1.001 | −4.003 | 0 | 0 | 0 |

f_{Z} | 0.001 | −0.002 | −0.002 | −1.001 | 0.001 | 4.003 | 1.001 | 0 | 0 | 0 |

EMR . | 1 . | x
. | y
. | z
. | x^{2}
. | xy
. | y^{2}
. | xz
. | yz
. | z^{2}
. |
---|---|---|---|---|---|---|---|---|---|---|

f_{X} | 0.034 | 0.063 | 0.019 | 0.021 | 0.026 | 0.01 | 0.014 | 0.007 | 0.008 | 0.006 |

f_{Y} | 0.001 | 0.002 | 0 | 0.001 | 0.001 | 0 | 0 | 0 | 0 | 0 |

f_{Z} | 0.001 | 0.002 | 0.001 | 0.001 | 0.001 | 0.001 | 0 | 0 | 0 | 0 |

EMR . | 1 . | x
. | y
. | z
. | x^{2}
. | xy
. | y^{2}
. | xz
. | yz
. | z^{2}
. |
---|---|---|---|---|---|---|---|---|---|---|

f_{X} | 0.034 | 0.063 | 0.019 | 0.021 | 0.026 | 0.01 | 0.014 | 0.007 | 0.008 | 0.006 |

f_{Y} | 0.001 | 0.002 | 0 | 0.001 | 0.001 | 0 | 0 | 0 | 0 | 0 |

f_{Z} | 0.001 | 0.002 | 0.001 | 0.001 | 0.001 | 0.001 | 0 | 0 | 0 | 0 |

As expected, a stronger coupling of $h=0.25$ leads to greater uncertainty in the estimation, as indicated by the corresponding standard error. For the fairly complex and chaotic system at hand, we note that no memory effects are artificially introduced in the regressions at the second level. Indeed, we found that the coupling of the main level with the subsequent ones was $0$ to the fourth decimal place.

## REFERENCES

*Climate Change: Multidecadal and Beyond*, edited by C. P. Chang, M. Ghil, M. Latif, and J. M. Wallace (World Scientific Publishing Co., 2015), pp. 31–51.

*Advances in Nonlinear Geosciences*, edited by A. A. Tsonis (Springer International Publishing, 2018).

*Stochastic Physics and Climate Modelling*, edited by T. N. Palmer and P. Williams (Cambridge University Press, Cambridge, 2010), pp. 35–72.

*Stochastic Physics and Climate Modelling*, edited by T. N. Palmer and P. Williams (Cambridge University Press, Cambridge, 2009).