We address a three-tier numerical framework based on nonlinear manifold learning for the forecasting of high-dimensional time series, relaxing the “curse of dimensionality” related to the training phase of surrogate/machine learning models. At the first step, we embed the high-dimensional time series into a reduced low-dimensional space using nonlinear manifold learning (local linear embedding and parsimonious diffusion maps). Then, we construct reduced-order surrogate models on the manifold (here, for our illustrations, we used multivariate autoregressive and Gaussian process regression models) to forecast the embedded dynamics. Finally, we solve the pre-image problem, thus lifting the embedded time series back to the original high-dimensional space using radial basis function interpolation and geometric harmonics. The proposed numerical data-driven scheme can also be applied as a reduced-order model procedure for the numerical solution/propagation of the (transient) dynamics of partial differential equations (PDEs). We assess the performance of the proposed scheme via three different families of problems: (a) the forecasting of synthetic time series generated by three simplistic linear and weakly nonlinear stochastic models resembling electroencephalography signals, (b) the prediction/propagation of the solution profiles of a linear parabolic PDE and the Brusselator model (a set of two nonlinear parabolic PDEs), and (c) the forecasting of a real-world data set containing daily time series of ten key foreign exchange rates spanning the time period 3 September 2001–29 October 2020.

Forecasting is of utmost importance in a wide spectrum of fields ranging from finance and macro-economics to geomechanics and earthquake forecasting and from ecology and environmental engineering to neuroscience and epidemiology, to name just a few. A key task is the construction of reduced-order surrogate models that are able to generalize well. However, for high-dimensional time series, the “curse of dimensionality” is a stumbling block toward this aim. Here, we address a numerical scheme to perform out-of-sample predictions by first embedding the high-dimensional time series into a low-dimensional manifold, then training reduced-order surrogate models on the manifold, and finally solving the pre-image problem to reconstruct the predictions in the ambient space. The efficiency of the proposed approach is demonstrated using (a) synthetic stochastic time series resembling electroencephalography (EEG) signals, (b) linear and nonlinear parabolic PDEs, and (c) a real-world data set from a financial problem and, in particular, the foreign exchange rate (FOREX) market.

## I. INTRODUCTION

Forecasting methods include exponential smoothing and moving average models, ARIMA, Bayesian nonparametric models including Gaussian processes and other machine learning schemes, such as forward and recurrent neural networks, deep learning, LSTMs, reinforcement learning, reservoir computing, and fuzzy systems.^{1–11} An inherent problem of using such data-driven modeling/forecasting approaches, when dealing with temporal measurements in a high-dimensional feature space, is the “curse of dimensionality.”^{12} In such cases, training, i.e., the estimation of the values of the model parameters, requires a large number of snapshots which increases exponentially with the dimension of the feature space. Thus, a fundamental task in modeling and forecasting is that of dimensionality reduction. Out of the many features that can be measured in time, one has to first identify the intrinsic dimension of the possible low-dimensional subspace (manifold) and the corresponding vectors that span it, which actually parameterize the emergent/macroscopically observed dynamics of the system under study. Assuming that the emergent dynamics evolve on a smooth enough low-dimensional manifold, an arsenal of linear, such as singular value decomposition (SVD) and dynamic mode decomposition (DMD),^{13–15} and nonlinear, such as kernel-PCA,^{16} locally linear embedding (LLE),^{17} ISOMAP,^{18,19} Laplacian eigenmaps,^{20} diffusion maps (DMs),^{21–25} and Koopman operator,^{26–28} related manifold learning algorithms can be exploited toward this direction.

For the twofold task of dimensionality reduction and modeling in the low-dimensional subspace, Bollt^{29} used an ISOMAP to approximate the manifold of dynamical systems in the form of ODEs and PDEs and to construct reduced-order models. Chiavazzo *et al.*^{30} constructed reduced kinetics models by extracting the slow dynamics on a manifold globally parameterized by truncated DMs. A comparison of the reconstructed and the original high-dimensional dynamics was also reported. The reconstruction was achieved using radial basis functions (RBF) interpolation and geometric harmonics.^{23} Liu *et al.*^{31} used DMs to identify coarse variables that govern the emergent dynamics of a particle-based model of animal swarming and based on these, they constructed a reduced stochastic differential equation. Dsilva *et al.*^{32} used DMs with a Mahalanobis distance to parameterize the slow manifold of multiscale dynamical systems. Williams *et al.*^{27} addressed an extension of DMD to compute approximations of the leading eigenvalues, eigenfunctions, and modes of the Koopman operator, thus showing that for large data sets, the procedure converges to the numerical approximation one would obtain from a Galerkin method. The performance of the method was tested via the unforced Duffing equation and a stochastic differential equation with a double-well potential that converges to a Fokker–Planck equation. Brunton *et al.*^{33} used SVD to embed the dynamics of nonlinear systems in a linear manifold spanned by a few modes and constructed state-space linear models on the manifold to approximate the full dynamics. The so-called sparse identification of the nonlinear dynamics (SINDy) method was demonstrated using the Lorenz equations and the 2D Navier–Stokes equations. 
Bhattacharjee and Matouš^{34} used ISOMAP to construct reduced-order models of heterogeneous hyperelastic materials in the 3D space, whose solutions are obtained by finite elements, while for the construction of the inverse map, they used RBF interpolation. Wan and Sapsis^{35} proposed a methodology for forecasting and quantifying uncertainty in a reduced-dimensionality space based on Gaussian process regression. The efficiency of the proposed approach was validated using data series from the Lorenz 96 model, the Kuramoto–Sivashinsky equation, as well as a barotropic climate model in the form of a PDE. Kutz *et al.*^{36} proposed a scheme based on the Koopman-operator theory to embed spatiotemporal dynamics of PDEs with the aid of DMD for approximating the evolution of the high-dimensional dynamics on the reduced manifold; the reconstruction of the states of the original system was achieved by a linear transformation. Chen and Ferguson^{37} have proposed a molecular enhanced sampling method based on autoencoders^{38} to discover the low-dimensional projection of molecular dynamics and then reconstruct the atomic coordinates. Vlachas *et al.*^{39} proposed an LSTM-based method to predict the high-dimensional dynamics from the information acquired in a low-dimensional space. The embedding space is constructed based on the Takens embedding theorem.^{40} The method is demonstrated through time series obtained from the Lorenz 96 equations, the Kuramoto–Sivashinsky equation, and a barotropic climate model as in Wan and Sapsis.^{35} Herzog *et al.*^{41} used a convolutional autoencoder and extended conditional random fields for dimensionality reduction and prediction, respectively, for dealing with chaotic spatiotemporal series. The method was demonstrated via the Lorenz-96 system and the Kuramoto–Sivashinsky PDE.
Lee *et al.*^{42} used DMs and automatic relevance determination to construct embedded PDEs with the aid of artificial neural networks (ANNs) and Gaussian processes. The methodology was illustrated with lattice-Boltzmann simulations of the 1D FitzHugh–Nagumo model. Koronaki *et al.*^{43} used DMs to embed the dynamics of a one-dimensional tubular reactor, as modeled by a system of two PDEs, on a 2D manifold and then constructed a feedforward ANN to learn the dynamics on the 2D manifold. The efficiency of the scheme was compared with the original high-dimensional PDE dynamics. Isensee *et al.*^{44} used PCA for the dimensionality reduction of spatiotemporal time series and delay-embedding nearest-neighbor methods in the low-dimensional space for prediction. The approach was demonstrated through noisy data produced from a Barkley model, the Bueno–Orovio–Cherry–Fenton model, and the Kuramoto–Sivashinsky model. Recently, Lin and Lu^{45} used the Koopman and the Mori–Zwanzig projection operator formalisms to construct reduced-order models for chaotic and randomly forced dynamical systems, and then, based on the Wiener projection, they derived nonlinear autoregressive moving average with exogenous input models. The approach was demonstrated through the Kuramoto–Sivashinsky PDE and the viscous Burgers equation with stochastic forcing.

The performance of most of the above schemes was assessed in terms of interpolation; namely, the data were produced using dynamical models (ODEs, PDEs, SDEs, or kinetic models) that were simulated in some specific interval of the high-dimensional domain, then reduced-order models were constructed and trained based on the generated data, and finally, a reconstruction of the high-dimensional dynamics was implemented within the original interval of the high-dimensional domain.

In this work, we address a three-tier scheme to perform forecasting, i.e., out-of-sample extrapolation, of high-dimensional time series by (i) embedding the high-dimensional time series into a low-dimensional manifold using nonlinear manifold learning, (ii) constructing and training reduced-order surrogate models on the low-dimensional manifold and, based on these, making predictions, and, finally, (iii) solving the pre-image problem, i.e., lifting the predictions back to the high-dimensional space using geometric harmonics^{23} or radial basis function (RBF) interpolation. The performance of this “embed–forecast–lift” scheme is assessed through three different families of problems: (a) the forecasting of three simplistic synthetic time series resembling EEG recordings generated by linear and weakly nonlinear discrete stochastic models, (b) the prediction/propagation of the solutions of two parabolic PDEs (a linear one and the Brusselator model, which gives rise to sustained oscillations) based on past spatiotemporal data, and (c) the forecasting of a real-world data set containing the time series of 10 key pairs of foreign exchange prices retrieved from the www.investing.com open API, spanning the period 3 September 2001–29 October 2020. For our computations, we used two well-established nonlinear manifold learning algorithms, namely, the LLE algorithm and DMs; two types of surrogate models, namely, multivariate autoregressive (MVAR) and Gaussian process regression (GPR) models; and two methods for solving the pre-image problem, namely, geometric harmonics and radial basis function interpolation. We considered various combinations of these methods and compared them against the naïve random walk and against MVAR and GPR surrogate models implemented directly in the original space. A comparison with the results obtained with PCA for the numerical solution of the parabolic PDEs and for the FOREX forecasting problem is also provided.
To the best of our knowledge, this is the first work that addresses such a numerical framework for the solution of the problem of forecasting of high-dimensional time series based on nonlinear manifold learning, thus providing a comparison of various embedding, modeling, and reconstruction approaches and assessing their performance on synthetic time series, PDEs, and a real-world FOREX data set.

## II. METHODOLOGY AND PRELIMINARIES

Let us denote by $x_i \in \mathbb{R}^D$ the vector containing the features at the $i$th snapshot of the time series and by $X \in \mathbb{R}^{D \times N}$ the matrix having as columns the vectors spanning a time window of $N$ observations. The assumption is that the time series/dynamics lie on a “smooth enough” low-dimensional (say, of dimension $d$) manifold that is embedded in the high-dimensional space $\mathbb{R}^D$. Our aim is to exploit manifold learning algorithms to forecast the time series in the high-dimensional feature space. Thus, our purpose is to bypass the “curse of dimensionality” associated with the training of surrogate models in the original input space and to cope with the limited size that usually characterizes real-world data (such as financial data). Toward this aim, we propose a numerical framework that consists of three steps.

At the first step, we employ embedding algorithms (LLE and DMs) to construct nonlinear maps from the high-dimensional input space to a low-dimensional subspace that preserve as much as possible the intrinsic geometry of the manifold (i.e., maps assuring that neighborhoods of points in the high-dimensional space are mapped to neighborhoods of the same points in the low-dimensional manifold, and the notion of distance between the points is maintained as much as possible). In general, it is expected that the manifold learning algorithms will provide an approximation to the intrinsic embedding dimension, which will differ from the “true” one; this depends on the threshold that one sets for the selection of the number of embedded vectors that span the low-dimensional manifold. A central concept here is the Riemannian metric, which defines the properties of this map. At the second step, based on the resulting embedding features that span the low-dimensional manifold, we train a class of surrogate models (MVAR and GPR) to predict the evolution of the embedded time series on the manifold based on their past values. The final step implements the solution of the pre-image problem, i.e., the construction of a lifting operator. The aim here is to form the inverse map from the out-of-sample forecasted embedded points to the reconstruction of the features in the high-dimensional input space. On the one hand, the existence of such an inverse map is theoretically guaranteed by the assumption that a “sufficiently smooth” low-dimensional manifold exists. At this point, we should note that an embedding computed from noisy data may produce a distorted and, therefore, unreliable geometry (see, e.g., the discussion in Gajamannage *et al.*^{46}). On the other hand, the data-driven derivation of the corresponding inverse map is neither unique^{47} nor trivial, in contrast, for example, with the case of PCA. In general, one has to solve a nonlinear least-squares problem requiring the minimization of an objective function that can be formed using different criteria for the properties of the neighborhood of points; this results in different inverse maps, whose performance has to be validated numerically.

Next, and for the completeness and clarity of the presentation of the methodology, we briefly describe basic concepts of the manifold theory.

Manifold learning techniques can be viewed as unsupervised machine learning algorithms in the sense that they “learn” from the available data a low-dimensional representation of the original high-dimensional input space, thus providing an “optimal” (under certain assumptions) embedded subspace where the information available in the high-dimensional space is preserved as much as possible. Here, we briefly present some basic elements of the theory of manifolds and manifold learning.^{47–50} Let us start with the definition of a $d$-dimensional manifold.

### (Manifold)


A set $M \subset \mathbb{R}^n$ is called a $d$-dimensional manifold (of class $C^\infty$) if for each point $p \in M$, there is an open set $W \subset M$ with $p \in W$ such that there exists a $C^\infty$ bijection $f : U \to W$, $U \subset \mathbb{R}^d$ open, with $C^\infty$ inverse $f^{-1} : W \to U$ (i.e., $W$ is $C^\infty$-diffeomorphic to $U$). The bijection $f$ is called a local parameterization of $M$, while $f^{-1}$ is called a local coordinate mapping, and the pair $(W, f^{-1})$ is called a chart (or neighborhood) of $p$ on $M$.

Thus, in a coordinate system $(W, f^{-1})$, a point $p \in W$ can be expressed by the coordinates $(f^{-1}_1(p), f^{-1}_2(p), \ldots, f^{-1}_d(p))$, where $f^{-1}_i$ is the $i$th element of $f^{-1}$. A chart $(W, f^{-1})$ is centered at $p$ if and only if $f^{-1}(p) = 0$. In addition, we always assume that the manifold satisfies the Hausdorff separation axiom, which states that every pair of distinct points on $M$ has disjoint open neighborhoods. Summarizing, a manifold is a Hausdorff (separated/$T_2$) space in which every point has a neighborhood that is diffeomorphic to an open subset of $\mathbb{R}^d$.

### (Tangent space and tangent bundle to a manifold)

The tangent space $T_pM$ of $M$ at a point $p \in M$ is the $d$-dimensional vector space containing all tangent vectors of $M$ at $p$ (i.e., the velocities $\gamma'(0)$ of smooth curves $\gamma : (-\epsilon, \epsilon) \to M$ with $\gamma(0) = p$); the tangent bundle $TM$ is the disjoint union of the tangent spaces $T_pM$ over all $p \in M$.

### (Riemannian manifold and metric)


A Riemannian manifold is a manifold $M$ endowed with a positive definite inner product $g_p$ defined on the tangent space $T_pM$ at each point $p$. The family of inner products $g_p$ is called a Riemannian metric, and the Riemannian manifold is denoted by $(M, g)$.

The above definitions refer to the continuum limit of an infinite number of points. However, for real data with a limited number of observations, one can only try to approximate the manifold; the analytic charts and the Riemannian metric that actually define the geometry of the manifold are simply not available. Thus, a consistent manifold learning algorithm numerically constructs a map that approximates, in a probabilistic sense, the continuous one as the sample size $N \to \infty$. A fundamental preprocessing step of a class of manifold learning algorithms, such as Laplacian eigenmaps, LLE, ISOMAP, and DMs, is thus the construction of a weighted graph, say, $G(X, E)$, where $E$ denotes the set of weights between points. This construction is based on a predefined similarity measure (such as a Gaussian kernel combined with the $k$-NN algorithm), which is used to appropriately connect each point $x \in \mathbb{R}^D$ with the others; this defines a weighted graph of neighborhoods of points. Theoretically, with an appropriate choice of this measure and its parameters, the graph is guaranteed to be connected; i.e., there is a path between every pair of points in $X$.^{47}
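To make the graph-construction step concrete, the following is a minimal sketch (our own illustration, not the authors' code): it builds a Gaussian-kernel affinity matrix and checks connectivity through the algebraic connectivity, i.e., the second-smallest eigenvalue of the graph Laplacian.

```python
import numpy as np

# Minimal graph-construction sketch: Gaussian-kernel affinities plus a
# spectral connectivity check via the graph Laplacian L = D - W.
def gaussian_affinity(X, sigma):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / sigma)

def is_connected(W, eps=1e-8):
    # The graph is connected iff the second-smallest eigenvalue of the
    # Laplacian (the algebraic connectivity) is positive.
    L = np.diag(W.sum(axis=1)) - W
    return bool(np.linalg.eigvalsh(L)[1] > eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
W = gaussian_affinity(X, sigma=1.0)
print(is_connected(W))  # True: a dense Gaussian kernel connects all points
```
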

## III. MANIFOLD LEARNING ALGORITHMS

### A. Locally linear embedding (LLE)

The LLE algorithm^{51} is a nonlinear manifold learning algorithm that constructs $G(X, E)$ based on the $k$-NN algorithm. It is assumed that each neighborhood forms a basis for the reconstruction of any point in the neighborhood itself; thus, every point is written as a linear combination of its neighbors. The weights of all pairs are then estimated by least squares, minimizing the $L_2$ norm of the difference between the points and their neighborhood-based reconstructions. The same rationale is assumed for the low-dimensional subspace, keeping the weights at the values estimated in the high-dimensional space. In the low-dimensional subspace, one now seeks the coordinates of the points on a $d$-dimensional manifold that minimize the $L_2$ norm of the difference between those coordinates and their weighted neighborhood reconstructions. The minimization problem reduces to an eigenvalue problem, whose $d$ bottom non-zero eigenvectors provide an ordered set of orthogonal coordinates that span the manifold. It should be noted that in order to obtain a unique solution to the weight-estimation problem, the number $k$ of nearest neighbors should not exceed the dimension of the input space.^{51}

The LLE algorithm can be summarized in the following three basic steps.^{51}

- Identify the $k$ nearest neighbors for all $x_i \in \mathbb{R}^D$, $i = 1, 2, \ldots, N$ (with $k \geq d + 1$). For each $x_i$, this forms a set $K\{x_i\} \subset \mathbb{R}^{k \times D}$ containing the $k$ nearest neighbors of $x_i$.

- Write each $x_i$ in terms of $K\{x_i\}$ as
  $$x_i = \sum_{j \in K\{x_i\}} w_{ij}\, x_j \tag{1}$$
  and find the matrix $W = [w_{ij}] \in \mathbb{R}^{N \times N}$ of the unknown weights by minimizing the objective function
  $$\mathcal{L}(W) = \sum_{i=1}^{N} \Big\| x_i - \sum_{j=1, j \neq i}^{N} w_{ij}\, x_j \Big\|^2_{L_2} \tag{2}$$
  with the constraint
  $$\sum_{j=1}^{N} w_{ij} = 1. \tag{3}$$
- Embed the points $x_i \in \mathbb{R}^D$, $i = 1, 2, \ldots, N$, into a low-dimensional space with coordinates $y_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, N$, $d \ll D$. This step in LLE is accomplished by computing the vectors $y_i \in \mathbb{R}^d$ that minimize the objective function
  $$\varphi(Y) = \sum_{i=1}^{N} \Big\| y_i - \sum_{j=1, j \neq i}^{N} w_{ij}\, y_j \Big\|^2_{L_2}, \tag{4}$$
  where the weights $w_{ij}$ are fixed at the values found by solving the minimization problem (2). The embedded vectors $y_i$ are constrained to be centered at the origin with a unit covariance matrix.^{51}

The cost function (4) is quadratic and can be stated as

$$\varphi(Y) = \sum_{i=1}^{N} \sum_{j=1}^{N} Q_{ij}\, \langle y_i, y_j \rangle,$$

involving the inner products of the embedding vectors and a symmetric square matrix $Q$ with elements

$$Q_{ij} = \delta_{ij} - w_{ij} - w_{ji} + \sum_{k=1}^{N} w_{ki}\, w_{kj},$$

where $\delta_{ij} = 1$ if $i = j$ and $0$ otherwise. In a matrix form, $Q \in \mathbb{R}^{N \times N}$ can be written as

$$Q = (I - W)^{T} (I - W).$$

In practice, this representation of $Q$ gives rise to a significant reduction of the computational cost, especially when $N$ is large, since left multiplication of a vector $v$ by $Q$ can be performed as

$$Q v = (v - W v) - W^{T} (v - W v),$$

requiring just two multiplications by the sparse matrices $W$ and $W^{T}$.

Minimizing the cost function (4) can be performed via the eigenvalue decomposition of $Q$. Specifically, the eigenvector associated with the smallest (zero) eigenvalue is the vector with all 1s, and it trivially minimizes (4);^{51} it is disregarded because it leads to a constant (degenerate) embedding. The optimal embedding is, therefore, obtained by the $d$ eigenvectors of $Q$, denoted $q_k \in \mathbb{R}^N$, $k = 1, 2, \ldots, d$, corresponding to the next $d$ smallest eigenvalues. Thus, the coordinates of $x_i$, $i = 1, 2, \ldots, N$, in the embedded space are given by the vector

$$y_i = (q_{1i}, q_{2i}, \ldots, q_{di}),$$

where $q_{ij}$ denotes the $j$th element of eigenvector $q_i$.

### B. Diffusion maps (DMs)

Here, the construction of the affinity matrix is based on the computation of a random walk on the graph $G(X, E)$. The first step is to construct a graph using a kernel function, say, $k(x_i, x_j)$. The kernel function can be chosen as a Riemannian metric so that it is symmetric and positive definite. Standard kernels, such as the Gaussian kernel, typically define a neighborhood of each point $x_i$, i.e., a set of points $x_j$ having strong connections with $x_i$, namely, large values of $k(x_i, x_j)$.

At the next step, one constructs a Markovian transition matrix, say, $P$, whose elements correspond to the probability of jumping from one point to another in the high-dimensional space. This transition matrix defines a Markovian (i.e., memoryless) random walk $X_t$ by setting

$$\mathrm{Pr}(X_{t+1} = x_j \mid X_t = x_i) = p(x_i, x_j).$$

For a graph constructed from a sample of finite size $N$, the weighted degree of a point (node) is defined by

$$\deg(x_i) = \sum_{j=1}^{N} k(x_i, x_j),$$

and the volume of the graph is given by

$$\mathrm{vol}(G) = \sum_{i=1}^{N} \deg(x_i).$$

Then, the random walk on such a weighted graph can be defined by the transition probabilities

$$p(x_i, x_j) = \frac{k(x_i, x_j)}{\deg(x_i)}. \tag{11}$$

Clearly, from the above definition, we have that $\sum_{j=1}^{N} p(x_i, x_j) = 1$.

We note that in the continuum, the above can be described as a continuous Markov process on a probability space $(\Omega, \mathcal{H}, \mathrm{P})$, where $\Omega$ is the sample space, $\mathcal{H}$ is a $\sigma$-algebra of events in $\Omega$, and $\mathrm{P}$ is a probability measure. Let $\mu$ be the density function of the points in the sample space $\Omega$ induced from the probability measure, $\mu : \mathcal{H} \to \mathbb{R}$ with $\mu(\Omega) = 1$. Then, using the kernel function $k$, the transition probability from a point $x \in \Omega$ to another point $y \in \Omega$ is given by

$$p(x, y) = \frac{k(x, y)}{\int_{\Omega} k(x, y')\, d\mu(y')},$$

which is the continuous-space counterpart of (11).

The above procedure defines a row-stochastic transition matrix, $P = [p_{ij}]$, which encapsulates the information about the neighborhoods of the points. Note that by raising $P$ to the power of $t = 1, 2, \ldots$, we get the jumping probabilities in $t$ steps. This way, the underlying geometry is revealed through high or low transition probabilities between the points; i.e., paths that follow the underlying geometry have a high probability of occurrence, while paths away from the “true” embedded structure have a low probability. Note that the $t$-step transition probabilities, say, $p_t(x_i, x_j)$, satisfy the Chapman–Kolmogorov equation,

$$p_{t_1 + t_2}(x_i, x_j) = \sum_{l=1}^{N} p_{t_1}(x_i, x_l)\, p_{t_2}(x_l, x_j).$$

The Markov process defined by the probability matrix $P$ has a stationary distribution given by

$$\pi(x_i) = \frac{\deg(x_i)}{\mathrm{vol}(G)},$$

and it is reversible, i.e.,

$$\pi(x_i)\, p(x_i, x_j) = \pi(x_j)\, p(x_j, x_i).$$

Furthermore, if the kernel is appropriately chosen so that the graph is connected, then the Markov chain is ergodic and irreducible, and its transition matrix has a stationary vector $\pi$ such that

$$\pi^{T} P = \pi^{T}, \qquad \pi = \frac{d}{\mathbf{1}^{T} d},$$

where $d$ is the degree vector whose $i$th entry is $\deg(x_i)$ and $\mathbf{1}$ is the all-ones vector.
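These identities are easy to verify numerically; the sketch below (our own toy example) builds a small random-walk matrix $P = D^{-1} K$ from a synthetic symmetric affinity matrix and checks the stationary vector and row-stochasticity:

```python
import numpy as np

# Numerical check of the stationary vector pi = d / (1^T d) for a random walk
# P = D^{-1} K built from a synthetic symmetric affinity matrix K.
rng = np.random.default_rng(4)
A = rng.uniform(size=(6, 6))
K = A + A.T                       # symmetric positive weights
deg = K.sum(axis=1)
P = K / deg[:, None]              # row-stochastic transition matrix
pi = deg / deg.sum()
print(np.allclose(pi @ P, pi))              # True: pi^T P = pi^T
print(np.allclose((P @ P).sum(axis=1), 1))  # True: P^t stays row-stochastic
```
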

From the Perron–Frobenius theorem, we know that the eigenvalue $\lambda = 1$ of $P$ has geometric multiplicity 1. It can be shown that all other eigenvalues have a magnitude smaller than 1.^{22} From (11), we get

$$P = D^{-1} K,$$

where $K = [k_{ij}]$, $k_{ij} = k(x_i, x_j)$, and $D$ is the diagonal matrix with entries $D_{ii} = \deg(x_i)$. We note that the transition matrix $P$ is similar to the symmetric positive definite matrix $S = D^{-1/2} K D^{-1/2}$. Thus, the transition matrix $P$ admits a decomposition given by

$$P = \sum_{i=1}^{N} \lambda_i\, v_i u_i^{T},$$

where $\lambda_i \in \mathbb{R}$ are the (positive) eigenvalues of $P$, $u_i \in \mathbb{R}^N$ are the left eigenvectors, and $v_i \in \mathbb{R}^N$ are the right eigenvectors such that $\langle u_i, v_j \rangle = \delta_{ij}$.

The set of right eigenvectors $v_i$ establishes an orthonormal basis for the subspace $\mathcal{R}(P^{T})$ of $\mathbb{R}^{N}$ spanned by the rows of $P$. Row $i$ represents the transition probabilities from point $x_i$ to all the other points of the graph. According to the Eckart–Young–Mirsky theorem,^{52,53} the best $d$-dimensional low-rank approximation of the row space of $P$ in the Euclidean space $\mathbb{R}^d$ is provided by its $d$ right eigenvectors corresponding to the $d$ largest eigenvalues.

It has been shown that asymptotically, when the number of data points uniformly sampled from a low-dimensional manifold goes to infinity, the matrix $\frac{1}{\sigma}(I - P)$ (where $\sigma$ is the kernel function scale) approaches the Laplace–Beltrami operator of the underlying Riemannian manifold.^{21,24} This allows one to consider the eigenvectors corresponding to the largest few eigenvalues of $P$ as discrete approximations of the principal eigenfunctions of the Laplace–Beltrami operator. Since the principal eigenfunctions of the Laplace–Beltrami operator establish an accurate embedding of the manifold,^{54} this result promotes the usage of the eigenvectors of $P$ for practical (discrete) data embedding.

At this point, the so-called diffusion distance, which is an affinity distance related to the reachability between two points $x_i$ and $x_j$, is given by

$$D_t^2(x_i, x_j) = \sum_{l=1}^{N} \frac{\big( p_t(x_i, x_l) - p_t(x_j, x_l) \big)^2}{\pi(x_l)}, \tag{20}$$

where $p_t(x_i, \cdot)$ is the $i$th row of $P^t$. The embedding of the $t$-step transition probabilities is achieved by forming a family of maps (DMs) of the $N$ points $x_i \in X$ in the Euclidean subspace of $\mathbb{R}^d$ defined by

$$R_t(x_i) = \big( \lambda_1^t\, v_1(i),\; \lambda_2^t\, v_2(i),\; \ldots,\; \lambda_d^t\, v_d(i) \big),$$

where $v_j(i)$ denotes the $i$th component of the right eigenvector $v_j$. A useful property of DMs is that the Euclidean distance in the embedded space, $\| R_t(x_i) - R_t(x_j) \|_{L_2}$, is the best $d$-dimensional approximation of the diffusion distance given by (20), where the equality holds for $d = N$.
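A compact NumPy sketch of the diffusion-maps embedding follows, computed through the symmetric conjugate $S = D^{-1/2} K D^{-1/2}$ (a standard numerical device, since $S$ shares its eigenvalues with $P$); the kernel scale and dimensions here are arbitrary illustration values of our own:

```python
import numpy as np

# Dense diffusion-maps sketch for small N: Gaussian kernel, row-normalized
# random walk, and the first d non-trivial diffusion coordinates.
def diffusion_maps(X, d=2, sigma=1.0, t=1):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / sigma)
    deg = K.sum(axis=1)
    # S = D^{-1/2} K D^{-1/2} is symmetric and similar to P = D^{-1} K.
    S = K / np.sqrt(np.outer(deg, deg))
    lam, phi = np.linalg.eigh(S)
    idx = np.argsort(lam)[::-1]                # sort eigenvalues descending
    lam, phi = lam[idx], phi[:, idx]
    v = phi / np.sqrt(deg)[:, None]            # right eigenvectors of P
    # Skip the trivial constant eigenvector (lambda = 1).
    return (lam[1 : d + 1] ** t) * v[:, 1 : d + 1]

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Y = diffusion_maps(X, d=2)
print(Y.shape)  # (100, 2)
```
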

## IV. FORECASTING WITH REGRESSION MODELS

Let us assume that the time series are being generated by a (nonlinear) law of the form

$$y_t = \varphi(z_t; \mu) + e_t, \tag{22}$$

where $y_t \in \mathbb{R}^d$ denotes the vector containing the measured values of the response variables at time $t$; $\varphi : \mathbb{R}^p \times \mathbb{R}^q \to \mathbb{R}^d$ is a function encapsulating the law that governs the system dynamics, which contains the parameters and exogenous variables represented by the vector $\mu \in \mathbb{R}^q$; $z_t \in \mathbb{R}^p$ is the vector containing the explanatory input variables (predictors), which may include past values of the response variables at certain times, say, $t-1, t-2, \ldots, t-m$; and $e_t$ is the vector of the unobserved noise at time $t$.

In general, the forecasting problem (here, in the low-dimensional space) using regression models can be written as a minimization problem of the following form:

$$\min_{g \in G}\; L\big( y_{t+k}, \hat{y}_{t+k} \big), \qquad \hat{y}_{t+k} = g(z_t; \theta),$$

where $k$ is the prediction horizon, $g : \mathbb{R}^p \times \mathbb{R}^l \to \mathbb{R}^d$ is the regression model with parameters $\theta \in \mathbb{R}^l$, $G$ is the space of available models $g$, $\hat{y}_t$ is the output of the model at time $t$, and $L$ is the loss function that determines the optimal forecast.

We note that the forecasting problem can be posed in two different modes:^{55} the iterative and the direct one. In the iterative mode, one trains one-step-ahead surrogate models based on the available time series and simulates/iterates the models to predict future values. In the direct mode, forecasting is performed in a sliding-window framework, where the model is retrained with the data contained within the sliding window to provide a multiperiod-ahead value of the dependent variables. Here, we aim at testing the performance of the proposed scheme for one-period-ahead predictions (i.e., with $k = 1$) in both the iterative and the direct mode. For our illustrations, we used two surrogate models, namely, MVAR and GPR. A brief description of the models follows.
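The iterative mode can be illustrated with a scalar AR(1) toy model of our own: fit a one-step-ahead map by least squares, then iterate it $k$ times, feeding each prediction back as the next input.

```python
import numpy as np

# Iterative-mode sketch: fit a one-step-ahead AR(1) map and iterate it.
def fit_ar1(y):
    Z = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    theta, *_ = np.linalg.lstsq(Z, y[1:], rcond=None)
    return theta  # (intercept, slope)

def iterate_forecast(y, k):
    c, a = fit_ar1(y)
    y_hat = y[-1]
    for _ in range(k):        # feed each prediction back as the next input
        y_hat = c + a * y_hat
    return y_hat

y = 0.9 ** np.arange(50)      # noiseless series with y_{t+1} = 0.9 y_t
print(abs(iterate_forecast(y, 3) - 0.9**52) < 1e-8)  # True: AR(1) recovered
```
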

### A. Multivariate autoregressive (MVAR) model

An MVAR model of order $m$ can be written as

$$y_t = \theta_0 + \sum_{j=1}^{m} \Theta_j\, y_{t-j} + e_t,$$

where $y_t = [y_{t1}, y_{t2}, \ldots, y_{td}]^T \in \mathbb{R}^d$ is the vector of the response time series at time $t$, $m$ is the model order, i.e., the maximum time lag, $\theta_0 \in \mathbb{R}^d$ is the regression intercept, $\Theta_j \in \mathbb{R}^{d \times d}$ is the matrix containing the regression coefficients at lag $j$, and $e_t = [e_t^{(1)}, e_t^{(2)}, \ldots, e_t^{(d)}]^T$ is the vector of the unobserved errors at time $t$, which are assumed to be uncorrelated random variables with zero mean and constant variance $\sigma^2$. In a more general form, in view of (22), the MVAR model can be written as

$$y_{ik} = \theta_{0k} + \sum_{j=1}^{m} \theta_{jk}\, z_{ij},$$

where $y_{ik}$ is the model output for the $i$th variable at the $k$th time instant, $\theta_{0k} \in \mathbb{R}$ is the corresponding regression intercept, $\theta_{jk}$ is the corresponding $j$th regression coefficient, and $z_{ij}$ is the $j$th predictor of the $i$th response variable (e.g., the time-delayed time series). According to the Gauss–Markov theorem, the best linear unbiased estimator of the regression coefficients is the one that results from the solution of the least-squares (LS) problem,

$$\min_{\Theta}\; \| Y - Z \Theta \|^2_{L_2},$$

given by

$$\hat{\Theta} = (Z^{T} Z)^{-1} Z^{T} Y,$$

where $Y = [y_1, y_2, \ldots, y_d] \in \mathbb{R}^{N \times d}$ and $Z = [\mathbf{1}_N, z_1, z_2, \ldots, z_m] \in \mathbb{R}^{N \times (m+1)}$. Assuming that the unobserved errors are i.i.d. normal random variables, the maximum likelihood estimate of the error covariance can then also be used to estimate the forecasting intervals of a new observation.
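A minimal least-squares MVAR fit along these lines can be sketched as follows; the lag-stacked layout of the design matrix (a ones column followed by $m$ lagged state blocks) is our own illustration choice:

```python
import numpy as np

# Least-squares MVAR(m) fit on a synthetic MVAR(1) series with known
# coefficient matrix A; the recovered Theta[1:].T should be close to A.
def fit_mvar(Y, m):
    N, d = Y.shape
    Z = np.ones((N - m, 1 + m * d))
    for j in range(1, m + 1):
        Z[:, 1 + (j - 1) * d : 1 + j * d] = Y[m - j : N - j]
    Theta, *_ = np.linalg.lstsq(Z, Y[m:], rcond=None)
    return Theta  # row 0: intercept; remaining rows: lag coefficients

rng = np.random.default_rng(2)
A = np.array([[0.5, 0.1], [0.0, 0.4]])
Y = np.zeros((2000, 2))
for t in range(1999):
    Y[t + 1] = A @ Y[t] + rng.normal(scale=0.01, size=2)
Theta = fit_mvar(Y, m=1)
print(Theta.shape)  # (3, 2); Theta[1:].T is close to A
```
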

### B. Gaussian process regression (GPR)

An introduction to the use of Gaussian processes for time-series forecasting can be found in Refs. 5 and 56. For the implementation of GPR, it is assumed that the unknown function $\varphi$ in (22) can be modeled by $d$ single-output Gaussian distributions given by

$$\varphi_i \sim \mathcal{N}\big( \mu(z), K(z, \theta) \big), \quad i = 1, 2, \ldots, d,$$

where $\varphi_i = [\varphi_i(z_1), \varphi_i(z_2), \ldots, \varphi_i(z_N)]$, $\varphi_i$ is the $i$th component of $\varphi$, $\mu(z)$ is the vector with the expected values of the function, and $K(z, \theta)$ is an $N \times N$ covariance matrix formed by a kernel with hyperparameters $\theta$. The prior mean function is often set to $\mu(z) = 0$ with appropriate normalization of the data.

Predictions at a new point, say, $z_*$, are made by drawing $\varphi_i^*$ from the posterior distribution $P(\varphi_i \mid (Z, y_i))$, $i = 1, 2, \ldots, d$, where $Z = [z_1, z_2, \ldots, z_N]^T \in \mathbb{R}^{N \times p}$. Assuming Gaussian distributed noise with variance $\sigma_n^2$ and appropriate normalization of the data points, the joint multivariate normal distribution is given by

$$\begin{bmatrix} y_i \\ \varphi_i^* \end{bmatrix} \sim \mathcal{N}\left( 0,\; \begin{bmatrix} K(Z, Z) + \sigma_n^2 I & K(Z, z_*) \\ K(z_*, Z) & k(z_*, z_*) \end{bmatrix} \right).$$

It can be shown that the posterior conditional distribution $P(\varphi_i^* \mid Z, y_i, z_*)$ can be analytically derived, and the expected value and covariance of the estimation are given by

$$\mathbb{E}[\varphi_i^*] = K(z_*, Z)\, \big[ K(Z, Z) + \sigma_n^2 I \big]^{-1} y_i,$$

$$\mathrm{cov}[\varphi_i^*] = k(z_*, z_*) - K(z_*, Z)\, \big[ K(Z, Z) + \sigma_n^2 I \big]^{-1} K(Z, z_*).$$
The hyperparameters in the above equations are estimated by maximizing the log marginal likelihood, given by

$$\log p(y_i \mid Z, \theta) = -\frac{1}{2}\, y_i^{T} \big[ K(Z, Z) + \sigma_n^2 I \big]^{-1} y_i - \frac{1}{2} \log \big| K(Z, Z) + \sigma_n^2 I \big| - \frac{N}{2} \log 2\pi.$$
Here, for the forecasting of the EEG and FOREX signals, we used a composite kernel (see, e.g., Corani *et al.*^{57}) formed by summing a radial basis function kernel,

$$k_{\mathrm{RBF}}(z, z') = \sigma_{\mathrm{RBF}}^2 \exp\left( -\frac{\| z - z' \|^2}{2 l^2} \right),$$

a linear kernel,

$$k_{\mathrm{Lin}}(z, z') = \sigma_{\mathrm{Lin}}^2\, z^{T} z',$$

a periodic kernel,

$$k_{\mathrm{Per}}(z, z') = \sigma_{\mathrm{Per}}^2 \exp\left( -\frac{2 \sin^2\big( \pi \| z - z' \| / p \big)}{l^2} \right),$$

and a white noise kernel,

$$k_{\mathrm{W}}(z, z') = \sigma_{n}^2\, \delta(z, z').$$
For the PDEs, we used a kernel of the Matérn class, and, in particular, the Matérn 5/2 kernel given by^{56}

where $\nu = 5/2$, $K_\nu$ is the modified Bessel function of the second kind, and $l>0$.
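A sketch of these kernel choices with scikit-learn (which we use for the GPR computations); the toy data and parameter values below are illustrative assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF, DotProduct, ExpSineSquared, WhiteKernel, Matern)

# Mixed kernel for the EEG/FOREX signals: RBF + linear + periodic + white noise
mixed = RBF() + DotProduct() + ExpSineSquared() + WhiteKernel()
# Matern-5/2 kernel used for the PDE examples
matern52 = Matern(length_scale=1.0, nu=2.5)

# Toy single-output regression on hypothetical noisy data
rng = np.random.default_rng(1)
X = np.linspace(0.0, 4.0, 40)[:, None]
y = np.sin(2.0 * X[:, 0]) + 0.05 * rng.standard_normal(40)
gpr = GaussianProcessRegressor(kernel=matern52, alpha=0.05**2).fit(X, y)
y_mean, y_std = gpr.predict(np.array([[2.0]]), return_std=True)
```

During `fit`, the kernel hyperparameters are tuned by maximizing the log marginal likelihood with L-BFGS-B, consistent with the settings described in Sec. VII.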

## V. SOLUTION OF THE PRE-IMAGE PROBLEM

The final step is to solve the pre-image problem, i.e., “lift” the predictions made by the reduced-order surrogate models on the manifold back to the original high-dimensional space. In the case of PCA, this task is trivial, as the solution to the reconstruction problem, given by

is just a linear transformation, which maximizes the variance of the data on the linear manifold, and is given by the first $d$ principal eigenvectors of the covariance matrix.

In the case of nonlinear manifold learning algorithms, such as LLE and DMs, we want to learn the inverse map (lifting operator),

for new samples on the manifold $y^* \notin R(X)$. This inverse problem is referred to as the “pre-image” problem. The “out-of-sample extension” problem usually refers to the direct problem, i.e., that of learning the direct embedding map (i.e., the restriction to the manifold) $R: X \to R(X)$ for new samples in the input space $x^* \notin X$. Toward this aim, a well-established methodology is the Nyström extension.^{58}

In general, the solution of the pre-image problem can be posed as

subject to a constraint, where $L(\cdot)|_{c}$ is the lifting operator depending on some parameters $c$.

Below, we describe the reconstruction methods that we used in this work, namely, RBF interpolation and geometric harmonics (GH), which provide a direct solution of the inverse problem, giving some insight into their implementation and their pros and cons. For a review and a comparison of such methods in the framework of chemical kinetics, see Chiavazzo *et al*.^{30}

### A. Radial basis function (RBF) interpolation

The lifting operator is constructed by interpolation through RBFs among the corresponding set of, say, $k$ nearest neighbors of the new point $y^*$. The lifting operator is defined by^{30}

where $x_i^*$ is the $i$th coordinate of $x^*$, the $y_j$’s are the neighbors of the unseen sample $y^*$ on the manifold, and $\psi$ is the radial basis function.

Similarly, the restriction operator can be written as

where $L(y_j)=x_j$ is the known image of $y_j$ (the neighbors of $y^*$ in the ambient space).

Two of the most common RBFs used for this task are the Gaussian kernel, defined as

where $\epsilon$ is the so-called shape parameter, and the radial powers, defined as

where $p$ is an odd integer.

Thus, the unknown coefficients $c_j^i$ of the lifting operator are given by the solution of the following linear system:

where

and $x_j^i$ is the $i$th coordinate of the $j$th point in the ambient space, whose restriction to the manifold is the $j$th nearest neighbor of $y^*$. Then, (41) can be used to find the coordinates of $x^*$ in the ambient space.
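A minimal sketch of the RBF lifting with radial powers ($p=1$), on a hypothetical one-dimensional manifold; `lstsq` is used in place of a plain LU solve to guard against ill-conditioning:

```python
import numpy as np
from scipy.spatial import cKDTree

def rbf_lift(y_star, Y, X, k=10):
    """Lift a manifold point y_star to the ambient space by RBF interpolation
    over its k nearest neighbors, using the radial power psi(r) = r (p = 1)."""
    tree = cKDTree(Y)
    _, idx = tree.query(y_star, k=k)
    Yk, Xk = Y[idx], X[idx]                      # neighbors and their ambient images
    A = np.linalg.norm(Yk[:, None, :] - Yk[None, :, :], axis=-1)  # psi(|y_i - y_j|)
    C, *_ = np.linalg.lstsq(A, Xk, rcond=None)   # coefficients c_j^i
    b = np.linalg.norm(y_star[None, :] - Yk, axis=-1)             # psi(|y* - y_j|)
    return b @ C

# Hypothetical example: lift from a 1D manifold coordinate back to a 2D curve
t = np.linspace(0.0, np.pi, 200)
Y = t[:, None]                                   # manifold coordinates
X = np.column_stack([np.cos(t), np.sin(t)])      # ambient points
x_star = rbf_lift(np.array([1.0]), Y, X, k=10)   # approximates (cos 1, sin 1)
```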

#### 1. Convergence and efficiency

It has been proven (see Monning *et al.*^{59} and the references therein) that for Gaussian RBFs, the approximation convergence, as measured by the fill distance, say, $h$, in terms of the $L_\infty$ norm to a function in the subspace spanned by the RBFs (also called the native space), is exponential. However, for any practical purpose, Gaussian kernels result in an ill-conditioned matrix $A$ when $k$ is large^{59,60} and in relatively narrow native spaces of convergence, while the selection of the optimal shape parameter imposes an extra burden. In such cases, one can resort to established techniques for the solution of (45), for example, the Moore–Penrose pseudoinverse of $A$ via singular value decomposition or Tikhonov regularization. On the other hand, radial powers are better suited to the task of interpolation than Gaussian kernels, as they do not suffer from saturation error and thus result in a wider native space of convergence and in faster convergence rates.^{59,61} Here, for our computations, we used radial powers with $p=1$.

### B. Geometric harmonics (GHs)

GHs are a set of functions that allow the extension of the embedding to new unseen points $x^* \notin X$, i.e., points not included in the set used for building the embedding.^{23} Their derivation is based on the Nyström (or quadrature) extension method,^{58} which has been used for the numerical solution of integral equations,^{62,63} in particular, the Fredholm integral equation of the second kind, which reads

where $f(t)$ is the unknown function, while $k(t,s)$ and $g(t)$ are known. The Nyström method starts with the approximation of the integral, i.e.,

where $s_j$ are $N$ appropriately chosen collocation points and $w_j$ are the corresponding weights, which are determined, e.g., by the Gauss–Jacobi quadrature rule. Then, by using (48) in (47) and evaluating $f$ and $g$ at the $N$ collocation points, we get the following approximation:

where the matrix $\tilde{K}$ has elements $\tilde{k}_{ij}=k(s_i,s_j)w_j$. Based on the above, the solution of the homogeneous Fredholm problem ($g=0$) is given by the solution of the eigenvalue problem,

i.e.,

where $\hat{f}_i=\hat{f}(s_i)$ is the $i$th component of $\hat{f}$. The Nyström extension of $f(t)$, using the set of $N$ sample (collocation) points, at an arbitrary point $x$ in the full domain is given by

Within the framework of DMs, we seek the out-of-sample (filtered) extension of a real-valued function $f$ defined at the $N$ sample points $x_i\in X$ to one or more unseen points $x^*\notin X$. The function $f$ can be, for example, a DM coordinate $\lambda_j^t v_j(x_i)$, $j=1,2,\ldots,d$, or another function representing the output of a regression model.^{64,65}

Recalling that the eigenvectors $v_j$ form a basis, the extension is implemented^{23} by first expanding $f(x_i)$ in the first $d$ parsimonious eigenvectors $v_l$ of the Markovian matrix $P^t$,

where $a_l=\langle v_l,f\rangle$ are the projection coefficients of the function on the first $d$ parsimonious eigenvectors and $f\in\mathbb{R}^N$ is the vector containing the values of the function at the $N$ points $x_i$. Then, one computes the Nyström extension of $\hat{f}$ at $x^*$ using the same projection coefficients as

where

are the corresponding GHs. Scaling up the (filtered) extension of the function $f$ to a set of, say, $L$ new points can be computed using the following matrix product:^{23,64}

where $K_{L\times N}$ is the corresponding kernel matrix, $V_{N\times d}$ is the matrix whose columns are the $d$ parsimonious eigenvectors $v_l$, and $\Lambda_{d\times d}$ is the diagonal matrix with elements $\lambda_l^t$.

The above direct approach provides a map from the ambient space to the reduced-order space (restriction) and vice versa (lifting).
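The extension just described can be sketched as follows; the Gaussian kernel scale `eps`, the eigenvalue cutoff `delta`, and the toy data are illustrative assumptions (in practice, the kernel and eigenpairs come from the DM computation itself):

```python
import numpy as np

def geometric_harmonics(X_train, f, X_new, eps=1.0, delta=1e-3):
    """Extend f, sampled at X_train, to X_new via the Nystrom extension with
    a Gaussian kernel, keeping eigenpairs above the cutoff delta * lambda_0."""
    sq = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq(X_train, X_train) / eps)
    lam, V = np.linalg.eigh(K)                 # symmetric kernel matrix
    lam, V = lam[::-1], V[:, ::-1]             # sort eigenpairs descending
    keep = lam > delta * lam[0]                # avoid dividing by tiny eigenvalues
    lam, V = lam[keep], V[:, keep]
    a = V.T @ f                                # projection coefficients a_l
    K_new = np.exp(-sq(X_new, X_train) / eps)  # kernel between new and old points
    return K_new @ (V / lam) @ a               # K_{L x N} V Lambda^{-1} V^T f

# Hypothetical example: extend f(x) = x_0^2 off a 2D sample set
rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, (300, 2))
f = X[:, 0] ** 2
f_new = geometric_harmonics(X, f, np.array([[0.3, -0.2]]), eps=0.5)
```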

## VI. THE CASE STUDIES

For demonstrating the performance of the proposed numerical framework and comparing the various embedding, modeling, and reconstruction approaches, we used (a) synthetic time series resembling EEG recordings generated by simplistic linear and weakly nonlinear stochastic discrete models, (b) a linear parabolic PDE for which an analytical solution exists, and the celebrated Brusselator model introduced by Prigogine^{66} consisting of a system of two nonlinear parabolic PDEs, and (c) a real-world data set of ten key FOREX pairs.

### A. The synthetic EEG time series

Our synthetic stochastic signals resembling EEG time series^{67,68} are generated by simplistic linear and weakly nonlinear five-dimensional discrete stochastic models with white noise. We note that these low-dimensional synthetic EEG time series are tractable, transparent paradigms with simple dynamics; they serve as a “computational proof of concept” for the proposed scheme, for which an MVAR model in the ambient space is expected, for any practical purpose, to perform accurately in the presence of low-amplitude noise.

The linear five-dimensional stochastic discrete model is given by the following equations:

where for the training process, the model order is assumed to be known (here equal to 1).

A second nonlinear model is given by the following equations (see, e.g., Nicolaou and Constandinou^{68}):

The proposed scheme was also validated through the time series produced by a linear stochastic model with a model order greater than 1, given by the following equations:

In all the above models, the time series $w_t^{(i)}$, $i=1,2,3,4,5$, are uncorrelated normally distributed noise terms with zero mean and unit standard deviation.

### B. The numerical continuation of solutions of (parabolic) PDEs

The proposed scheme can also be potentially used for constructing reduced-order surrogate models to learn and consequently predict the dynamics of high-dimensional systems as those resulting from the discretization of PDEs. Here, we illustrate this possibility by considering two problems. The first one is a linear parabolic PDE given by^{69}

This provides a simple paradigm that can be easily and successfully treated using PCA. Using Neumann boundary conditions on $[0,\pi]$ and the initial condition $u(x,0)=0.4\cos(2x)+1.5$, the analytical solution is given by

The second problem is the celebrated Brusselator system of nonlinear PDEs that gives rise to sustained oscillations (limit cycles). The model is given by the following coupled system of parabolic PDEs:

with Dirichlet boundary conditions (BCs). Here, for our illustrations, we solved the above system with $a=1$, $b=3$, and $D_u=D_v=1/50$ using central finite differences with 20 equidistant collocation points in $(0,1)$ with BCs,

and initial conditions (ICs)

For the time integration of the resulting stiff system of 40 ODEs in the interval $(0,33]$, we used the ode15s stiff solver of the Matlab ODE suite^{70} with absolute and relative tolerances set to $1\times 10^{-3}$ and $1\times 10^{-6}$, respectively. In order to (i) find the optimal reduced-order GPR surrogate model, (ii) find the optimal set of values of the hyperparameters (kernel parameters, number of eigenpairs, number of nearest neighbors) for the embedding and the solution of the pre-image problem, and (iii) assess the prediction performance of the proposed scheme, we split the data set into training, validation, and test sets. Details on the splitting are given in Sec. VII.
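A method-of-lines sketch of this setup in Python, mirroring the Matlab computation with scipy's stiff BDF solver; the standard Brusselator reaction terms, the Dirichlet values $u=a$, $v=b/a$, the diffusion coefficient, and the perturbed initial profile used below are illustrative assumptions, not necessarily the paper's exact choices:

```python
import numpy as np
from scipy.integrate import solve_ivp

a, b, Du, Dv = 1.0, 3.0, 0.02, 0.02    # reaction and (assumed) diffusion parameters
n = 20                                  # interior collocation points
x = np.linspace(0.0, 1.0, n + 2)[1:-1]
h = x[1] - x[0]

def lap(w, wb):
    """Central-difference Laplacian with Dirichlet boundary value wb."""
    wp = np.r_[wb, w, wb]
    return (wp[2:] - 2.0 * wp[1:-1] + wp[:-2]) / h**2

def rhs(t, y):
    u, v = y[:n], y[n:]
    du = a - (b + 1.0) * u + u**2 * v + Du * lap(u, a)
    dv = b * u - u**2 * v + Dv * lap(v, b / a)
    return np.r_[du, dv]

u0 = a + 0.1 * np.sin(np.pi * x)        # assumed perturbed initial profile
v0 = (b / a) * np.ones(n)
sol = solve_ivp(rhs, (0.0, 33.0), np.r_[u0, v0], method="BDF",
                t_eval=np.arange(0.0, 33.0, 0.01), rtol=1e-6, atol=1e-6)
profiles = sol.y.T                      # 3300 snapshots of the 40 ODE states
```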

### C. The FOREX forecasting problem

We used the investpy 0.9.14 module of Python^{71} to download the daily spot closing prices of ten USD-based FOREX pairs from the www.investing.com open API, spanning the period 03/09/2001–29/10/2020: EUR/USD, GBP/USD, AUD/USD, NZD/USD, JPY/USD, CAD/USD, CHF/USD, SEK/USD, NOK/USD, and DKK/USD. The time period was selected to contain crucial market crashes, such as the 2008 Lehman-related stock market crash, the 2010 sovereign debt crisis in the Eurozone, and the most recent COVID-19-related market crash of 2020.

For our analysis, we computed the compounded daily returns of the FOREX pairs as

where $S_{i,t}$ is the spot price of the $i$th FOREX pair in the data set above and $r_{i,t}$ is the corresponding compounded return at date $t$. However, trading the FOREX markets is associated not only with the spot price movements but also with the so-called currency carry trading. Currency carry trading involves first identifying high-interest-paying investment assets (e.g., bonds or short-term bank deposits) denominated in a given country's currency (e.g., the Japanese yen). The investment is then funded by borrowing money in another currency for which the paying interest rate is lower. Such trading requires the approximation of the so-called interest-rate-differential (IRD) excess returns, which should be incorporated in the FOREX price movements (see, e.g., Menkhoff *et al.*^{72}). Here, we used the short-term interest rate data retrieved from the OECD database as a proxy of these IRD excess returns. Since the short-term interest rate data are reported on a monthly basis and in an annual percentage format, we constructed daily approximations by linearly interpolating through the available downloaded records (using the Pandas module of Python^{73}) and normalizing on the basis of an annual calendar period. After this pre-processing step, we denote by IR$_{\mathrm{USA},t}$ the time series of the daily approximations of the US short-term interest rates and by IR$_{i,t}$ the corresponding time series of each of the other $i=1,\ldots,10$ countries. Then, the interest rate differential (IRD) time series are given by

Finally, the so-called carry-adjusted returns of the FOREX are given by (see also, e.g., Menkhoff *et al.*^{72})

where $x_{i,t}$ is the “carry”-adjusted return of the $i$th FX pair corresponding to its raw “unadjusted” market price return, $r_{i,t}$ is defined in (63), and IRD$_{i,t}$ is the corresponding interest rate differential.
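The pre-processing chain can be sketched on hypothetical data as follows; the log-return definition and the daily normalization of the annualized IRD (here by 252 trading days) are assumptions for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
dates = pd.bdate_range("2001-09-03", periods=500)
# Hypothetical spot prices and (annualized, %) short-term interest rates
spot = pd.Series(1.1 * np.exp(np.cumsum(0.003 * rng.standard_normal(500))), index=dates)
ir_us = pd.Series(2.0, index=dates)               # US short rate, % per annum
ir_i = pd.Series(3.5, index=dates)                # foreign short rate, % per annum

r = np.log(spot / spot.shift(1))                  # compounded daily returns
ird = (ir_i - ir_us) / 100.0                      # interest rate differential
x = r + ird / 252.0                               # carry-adjusted daily returns
```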

For quantifying the potential excess returns (profits), we constructed a trading strategy based on the so-called risk parity rationale (see, e.g., Braga^{74}), where each asset is allocated a portfolio weight proportional to its inverse risk. One thus assigns higher portfolio weights to less volatile assets and smaller weights to riskier ones. The risk is quantified by the volatility $\sigma_{i,t}$ (measured by the standard deviation of the logarithmic returns over a specific time period up to trading day $t$) of each asset $i$ of the portfolio plus the total portfolio volatility. In our FOREX problem, risk parity portfolio allocation practically means investing $1/\sigma_{i,t}$ in each of the $i=1,2,\ldots,10$ FOREX pairs with corresponding carry-adjusted returns $x_{i,t}$. By performing one-step-ahead predictions for each of the carry-adjusted returns of the ten pairs, denoted by $\hat{x}_{i,t+1}$, we create a “binary,” or otherwise called “directional,” trading signal of “buy” or “sell” for each of the FOREX pairs. Thus, the trading strategy reads as follows:

Based on the above, the profit or loss at the next day ($t+1$) is given by

where $x_{i,t+1}$ is the realized return of the $i$th FOREX pair at time $t+1$ and $\Pi_{t+1}$ is the risk parity portfolio return at time $t+1$.
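A sketch of the daily risk parity profit or loss with hypothetical returns; normalizing the inverse-volatility weights to sum to one is an assumption for illustration:

```python
import numpy as np

def risk_parity_pnl(x_hist, x_hat_next, x_next):
    """One-day risk-parity P&L sketch: inverse-volatility weights combined
    with the directional signal sign(x_hat_next)."""
    sigma = x_hist.std(axis=0, ddof=1)        # per-pair volatility up to day t
    w = (1.0 / sigma) / (1.0 / sigma).sum()   # inverse-risk portfolio weights
    signal = np.sign(x_hat_next)              # "buy"/"sell" directional signal
    return float(np.sum(w * signal * x_next))

rng = np.random.default_rng(4)
x_hist = 0.01 * rng.standard_normal((250, 10))   # hypothetical daily returns
x_next = 0.01 * rng.standard_normal(10)
# With perfect directional foresight, the one-day P&L is necessarily positive
pnl = risk_parity_pnl(x_hist, np.sign(x_next), x_next)
```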

## VII. NUMERICAL RESULTS

For the implementation of the numerical algorithms, we used the datafold, sklearn, and statsmodels packages of Python.^{75–77} The selection of the eigensolver was based on the sparsity and size of the input matrix: ARPACK was used if the size of the input matrix was greater than 200 and $n+1<10$ (where $n$ is the number of requested eigenvalues); otherwise, the “dense” option was used. The ARPACK package^{78} implements implicitly restarted Arnoldi methods, using a random default starting vector for each iteration, with a tolerance $tol=10^{-6}$ and a maximum of 100 iterations. This approach is implemented by the function scipy.sparse.linalg.eigsh of the scipy module (upon which the sklearn one depends).^{79} The “dense” eigensolver is implemented by the function eigh of the scipy module and returns the eigenvalues and eigenvectors computed with the LAPACK routine syevd, which uses the divide-and-conquer algorithm.^{80,81} We used the default value of the tolerance, which is of the order of the floating-point precision, and the maximum number of iterations was set to $30N$, where $N$ is the size of the matrix.

The DM embedding on the training set of all data sets was performed using $d$ parsimonious eigenvectors.^{82} Here, for the construction of the graph at the first step, we used the standard Gaussian kernel, defined by

where $\sigma$ is a scaling parameter that controls the size of the neighborhood (or the connectivity of the graph). For its selection, we followed the systematic approach provided by Singer *et al*.^{83} The full kernel is used for the calculations.
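For intuition, a bare-bones DM computation with this kernel on a toy data set (the actual computations use datafold; the density normalization shown here and the toy circle data are a minimal sketch, and the parsimonious selection of eigenvectors^{82} is omitted):

```python
import numpy as np

def diffusion_maps(X, sigma, d):
    """Minimal DM sketch: Gaussian kernel, density normalization, and the
    d leading non-trivial eigenpairs of the row-stochastic Markov matrix."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2.0 * sigma**2))
    q = W.sum(1)
    W = W / np.outer(q, q)                 # remove sampling-density bias
    P = W / W.sum(1, keepdims=True)        # Markov (row-stochastic) matrix
    lam, V = np.linalg.eig(P)
    order = np.argsort(-lam.real)
    lam, V = lam.real[order], V[:, order].real
    return lam[1:d + 1], V[:, 1:d + 1]     # skip the trivial constant mode

# Hypothetical example: a 1D circle embedded in 2D
t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
X = np.column_stack([np.cos(t), np.sin(t)])
lam, V = diffusion_maps(X, sigma=0.3, d=2)
```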

The VAR models were trained using the VAR class of the statsmodels.tsa.vector_ar.var_model routine with the default OLS method for the parameter estimation. The corresponding LAPACK function used to solve the least-squares problem is the default gelsd, which is based on the singular value decomposition of the input matrix.

The hyperparameters of the GPR model were optimized using the (default) L-BFGS-B algorithm of the scipy.optimize.minimize method.^{84,85} The gradient vector was estimated using forward finite differences, with the numerical approximation of the Jacobian performed with the default step size $eps=10^{-8}$. We used the default values for the tolerances, the maximum number of function evaluations and iterations (i.e., 15 000), and the maximum number of line searches per iteration (i.e., 20).

For the lifting task in the EEG and FOREX problems, 50 nearest neighbors were considered for interpolation by all the methods. The underlying $k$-NN search is based on the algorithm proposed by Maneewongvatana and Mount.^{86} Using different values of the number of nearest neighbors within the range 20–100 did not change the outcomes of the analysis. In the case of RBF interpolation, the underlying linear system of equations was solved by the LAPACK dgesv routine from scipy.linalg.lapack.dgesv, which implements the default method of LU decomposition with partial pivoting. For the GH approach, we used the Gaussian kernel. In effect, we are performing a “double” DM here: computing diffusion maps on the leading retained DM components for the reduced embedding. This procedure, suggested in Chiavazzo *et al.*,^{30} can actually be performed “once and for all” globally.^{87}

The computations were performed using a system with an Intel Core i7-8750H CPU @2.20 GHz and 16GB of RAM.

### A. Synthetic time series

For both the linear and nonlinear models, we produced 2000 points. We used 1500 points for learning the manifold and training the various models and 500 points for testing the forecasting performance of the various schemes. The forecasting performance was tested in iterative mode, i.e., by training the models for one-step-ahead predictions and then simulating the trained model iteratively to predict future values. The performance was measured using the root mean square error (RMSE) of the residuals, which reads

where $\hat{x}_i$ are the predictions and $x_i$ the actual data. To quantify the forecasting intervals due to the stochasticity of the models, we performed 100 runs for each combination of manifold learning algorithms (DMs and LLE), models (MVAR and GPR), and lifting methods (RBFs and GHs), reporting the median and the 5th and 95th percentiles of the resulting RMSE. The RMSE values obtained with the naïve random walk model, as well as with the MVAR and GPR models trained in the original space, are also given for comparison purposes.
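For reference, the RMSE of the residuals reads in code as:

```python
import numpy as np

def rmse(x_hat, x):
    """Root mean square error of the residuals."""
    return float(np.sqrt(np.mean((np.asarray(x_hat) - np.asarray(x)) ** 2)))

rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])   # sqrt(4/3) ~ 1.1547
```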

In Table I, we report the forecasting statistics of the time series produced with the linear model given by (56), as obtained over 100 runs. As shown, both the MVAR and GPR models trained in the original five-dimensional space outperform the naïve random walk. The RMSEs of the MVAR and GPR models suggest a good match with the stochastic model, with the residuals being approximately within one standard deviation of the distribution of the noise. In the same table, we provide the corresponding forecasting statistics obtained with the proposed “embed–forecast–lift” scheme for the various combinations of manifold learning algorithms, regression models, and lifting approaches. For our illustrations, we have chosen the first two parsimonious DM coordinates and the corresponding two LLE eigenvectors. As shown, the best performance is obtained with the GH lifting operator: using GHs for lifting with any combination of the selected manifold learning algorithms and models outperforms all other combinations, resulting in practically the same RMSE values as the predictions made in the original space. This suggests that the proposed “embed–forecast–lift” scheme provides a very good reconstruction of the predictions made in the original space. On the other hand, lifting with RBF interpolation with LU decomposition generally resulted in poor reconstructions of the high-dimensional space, in many cases giving wide forecasting intervals that contained the median RMSE value of the naïve random walk.

**Table I.** Median RMSE and (5th, 95th) percentiles over 100 runs for the linear stochastic model (56).

| Model/variable | $y_t^{(1)}$ | $y_t^{(2)}$ | $y_t^{(3)}$ | $y_t^{(4)}$ | $y_t^{(5)}$ |
|---|---|---|---|---|---|
| Random walk | 1.374 (1.287, 1.466) | 1.446 (1.333, 1.528) | 1.427 (1.339, 1.497) | 1.551 (1.455, 1.654) | 1.425 (1.335, 1.512) |
| MVAR(OS) | 1.151 (1.077, 1.218) | 1.180 (1.113, 1.269) | 1.010 (0.966, 1.062) | 1.224 (1.156, 1.281) | 1.011 (0.960, 1.063) |
| GPR(OS) | 1.151 (1.077, 1.218) | 1.179 (1.113, 1.272) | 1.010 (0.966, 1.061) | 1.224 (1.154, 1.284) | 1.011 (0.959, 1.064) |
| DM-GPR-GH | 1.153 (1.082, 1.220) | 1.184 (1.117, 1.273) | 1.012 (0.966, 1.061) | 1.228 (1.160, 1.287) | 1.013 (0.959, 1.065) |
| LLE-GPR-GH | 1.155 (1.077, 1.218) | 1.182 (1.115, 1.272) | 1.011 (0.966, 1.065) | 1.230 (1.157, 1.292) | 1.014 (0.962, 1.064) |
| DM-GPR-RBF | 1.260 (1.111, 2.201) | 1.390 (1.149, 1.980) | 1.163 (0.997, 2.331) | 1.365 (1.173, 1.939) | 1.166 (0.997, 1.853) |
| LLE-GPR-RBF | 1.197 (1.103, 1.452) | 1.234 (1.131, 1.461) | 1.076 (0.993, 1.582) | 1.270 (1.175, 1.468) | 1.103 (0.990, 1.553) |
| DM-MVAR-GH | 1.151 (1.078, 1.220) | 1.180 (1.113, 1.269) | 1.011 (0.966, 1.063) | 1.225 (1.155, 1.288) | 1.011 (0.959, 1.063) |
| LLE-MVAR-GH | 1.155 (1.077, 1.218) | 1.182 (1.115, 1.271) | 1.011 (0.966, 1.065) | 1.229 (1.158, 1.293) | 1.014 (0.962, 1.064) |
| DM-MVAR-RBF | 1.289 (1.124, 2.132) | 1.312 (1.146, 2.018) | 1.135 (0.996, 1.635) | 1.341 (1.186, 1.933) | 1.106 (0.99, 1.601) |
| LLE-MVAR-RBF | 1.187 (1.106, 1.477) | 1.230 (1.14, 1.477) | 1.080 (0.995, 1.602) | 1.264 (1.177, 1.479) | 1.096 (0.989, 1.558) |


Next, in Table II, we report the forecasting statistics of the time series for the nonlinear stochastic model (57) as obtained over 100 runs. As in the case of the linear stochastic model, both MVAR and GPR trained in the original 5D space outperform the naïve random walk. The resulting RMSE values suggest a good match with the nonlinear stochastic model. Yet, the match is poorer than the one obtained for the linear model.

**Table II.** Median RMSE and (5th, 95th) percentiles over 100 runs for the nonlinear stochastic model (57).

| Model/variable | $y_t^{(1)}$ | $y_t^{(2)}$ | $y_t^{(3)}$ | $y_t^{(4)}$ | $y_t^{(5)}$ |
|---|---|---|---|---|---|
| Random walk | 1.711 (1.597, 1.815) | 2.085 (1.939, 2.258) | 2.292 (2.147, 2.437) | 1.731 (1.62, 1.819) | 1.435 (1.354, 1.527) |
| MVAR(OS) | 1.181 (1.123, 1.234) | 1.438 (1.359, 1.555) | 1.565 (1.463, 1.648) | 1.212 (1.151, 1.272) | 1.021 (0.967, 1.076) |
| GPR(OS) | 1.182 (1.123, 1.233) | 1.437 (1.360, 1.555) | 1.684 (1.540, 1.799) | 1.213 (1.151, 1.272) | 1.020 (0.966, 1.076) |
| DM-GPR-GH | 1.181 (1.123, 1.234) | 1.440 (1.363, 1.56) | 1.573 (1.477, 1.654) | 1.214 (1.151, 1.276) | 1.022 (0.968, 1.077) |
| LLE-GPR-GH | 1.182 (1.126, 1.236) | 1.451 (1.363, 1.563) | 1.579 (1.473, 1.675) | 1.214 (1.151, 1.273) | 1.023 (0.97, 1.076) |
| DM-GPR-RBF | 1.488 (1.172, 8.88) | 1.618 (1.392, 9.95) | 2.060 (1.642, 6.102) | 1.710 (1.203, 16.868) | 1.199 (1.012, 9.108) |
| LLE-GPR-RBF | 1.214 (1.133, 1.557) | 1.449 (1.365, 1.561) | 1.587 (1.502, 1.692) | 1.254 (1.159, 1.446) | 1.088 (0.989, 1.5) |
| DM-MVAR-GH | 1.181 (1.123, 1.234) | 1.438 (1.365, 1.555) | 1.571 (1.472, 1.649) | 1.213 (1.151, 1.274) | 1.022 (0.968, 1.076) |
| LLE-MVAR-GH | 1.182 (1.125, 1.235) | 1.452 (1.364, 1.563) | 1.586 (1.468, 1.669) | 1.214 (1.152, 1.275) | 1.023 (0.97, 1.076) |
| DM-MVAR-RBF | 1.387 (1.169, 6.743) | 1.534 (1.368, 4.634) | 1.902 (1.609, 4.5) | 1.560 (1.199, 9.294) | 1.131 (0.99, 4.655) |
| LLE-MVAR-RBF | 1.213 (1.135, 1.507) | 1.445 (1.366, 1.561) | 1.570 (1.484, 1.657) | 1.253 (1.159, 1.621) | 1.091 (0.998, 1.674) |


We also provide the corresponding forecasting statistics as obtained with the proposed “embed–forecast–lift” scheme. For embedding in the reduced space, we have taken three (parsimonious) coordinates. Again, the best performance is obtained with the GH lifting operator for any combination of manifold learning algorithms and models. Importantly, the reconstruction errors between the forecasts with the “full” MVAR and GPR models trained directly in the original 5D space and the ones obtained with the proposed “embed–forecast–lift” scheme are negligible up to a three-digit accuracy for all five variables. As with the previous case, lifting with RBFs resulted in a poor reconstruction of the high-dimensional space, with forecasting intervals containing the median RMSE value of the naïve random walk.

Finally, in Table III, we report the RMSE statistics for the time series produced with the linear model with model order three [see (58)], as obtained over 100 runs from (a) the naïve random walk model applied to the original 5D space, (b) the MVAR models trained on the original 5D data set with model orders one [MVAR(1)] and three [MVAR(3)], and (c) the proposed “embed–forecast–lift” method applied to the original 5D data set, using DMs and LLE with three embedding coordinates. In the reduced-order space, we trained MVAR models with model orders 1 and 3 and used GHs for lifting. The best results were obtained when using the proposed scheme with DMs for embedding and a model order 3 for the training of the MVAR model on the corresponding manifold. Importantly, the implementation of DM-MVAR(3)-GH succeeds in reproducing quite well the results obtained by training MVAR with a maximum delay of three in the original space.

**Table III.** Median RMSE and (5th, 95th) percentiles over 100 runs for the linear stochastic model with model order three (58).

| Model/variable | $y_t^{(1)}$ | $y_t^{(2)}$ | $y_t^{(3)}$ | $y_t^{(4)}$ | $y_t^{(5)}$ |
|---|---|---|---|---|---|
| Random walk | 2.138 (1.898, 2.461) | 2.910 (2.412, 3.551) | 1.930 (1.755, 2.152) | 3.499 (2.978, 4.195) | 2.477 (2.253, 2.739) |
| MVAR(1) | 1.629 (1.466, 1.851) | 2.121 (1.817, 2.515) | 1.382 (1.275, 1.517) | 2.567 (2.244, 2.997) | 1.840 (1.699, 2.031) |
| MVAR(3) | 1.611 (1.449, 1.833) | 2.092 (1.787, 2.483) | 1.371 (1.265, 1.501) | 2.531 (2.206, 2.963) | 1.824 (1.689, 2.019) |
| DM-MVAR(1)-GH | 1.630 (1.467, 1.854) | 2.121 (1.825, 2.515) | 1.383 (1.273, 1.516) | 2.569 (2.244, 2.998) | 1.843 (1.697, 2.037) |
| LLE-MVAR(1)-GH | 1.651 (1.472, 1.905) | 2.159 (1.832, 2.567) | 1.397 (1.281, 1.545) | 2.619 (2.269, 3.080) | 1.866 (1.705, 2.084) |
| DM-MVAR(3)-GH | 1.621 (1.455, 1.839) | 2.103 (1.791, 2.484) | 1.376 (1.267, 1.509) | 2.546 (2.217, 2.972) | 1.835 (1.694, 2.028) |
| LLE-MVAR(3)-GH | 1.649 (1.473, 1.902) | 2.149 (1.820, 2.572) | 1.393 (1.277, 1.550) | 2.608 (2.252, 3.065) | 1.862 (1.703, 2.094) |


### B. Propagation of solutions of PDEs

For both problems, we show the results obtained with PCA, DMs, and LLE for embedding the data on the low-dimensional manifold, GPR as the reduced-order surrogate model, and RBFs and GHs for solving the pre-image problem. For the linear PDE (59), we used the analytical solution [see Eq. (60)] with 100 equidistant spatial points to produce 5000 solution profiles in the time interval $[0, 0.25]$. We used the first 4000 solution profiles to learn the manifold and to train the surrogate reduced-order GPR model, the next 500 profiles as a validation set to select the optimal hyperparameter values for the solution of the pre-image problem (i.e., those yielding the best prediction over this interval), and the last unseen 500 profiles as a test set to assess the prediction performance of the proposed scheme. In Fig. 1(a), we show the analytical solution of the PDE; blue lines depict the validation data and red lines the unseen test data. In Fig. 1(b), we show the errors with respect to the analytical solution using the leading two principal components of PCA. We note that PCA fails to adequately approximate the solution using just the first principal component. In Figs. 1(c) and 1(d), we show the spatiotemporal errors with respect to the analytical solution using the first LLE coordinate with RBFs and GHs, respectively, for lifting. Similarly, in Figs. 1(e) and 1(f), we show the spatiotemporal errors using the first DM coordinate for embedding with RBFs and GHs, respectively, for lifting. We see that, for any practical purposes, the 1D LLE and 1D DM embeddings with both RBFs and GHs for lifting perform equivalently, and both outperform the 2D PCA.

For the 1D Brusselator model of coupled nonlinear parabolic PDEs given by Eqs. (61) and (62), we sampled the solution in $[0, 33]$ every $\Delta t=0.01$, thus obtaining 3300 solution profiles. We used the first 2750 as a training set to learn the manifold and train the reduced-order surrogate GPR model, the next 250 points to select the optimal values of the hyperparameters for the solution of the pre-image problem (i.e., those yielding the best prediction performance for this interval), and the last 300 unseen points as a test set to assess the prediction performance of the proposed scheme. In Figs. 2(a) and 2(b), we show the solution profiles for $u(x,t)$ and $v(x,t)$, respectively. In Figs. 2(c)–2(f), we show the spatiotemporal errors with respect to the numerical solution using three DM coordinates for embedding and RBFs [Figs. 2(c) and 2(d)] and GHs [Figs. 2(e) and 2(f)] for lifting; Figs. 2(c) and 2(e) show the errors for $u(x,t)$, and Figs. 2(d) and 2(f) those for $v(x,t)$. For this nonlinear problem, PCA fails to adequately predict and reconstruct the solution profiles even when more than 20 principal components are retained.
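For readers who want a concrete picture of the embedding step, a bare-bones diffusion-maps routine can be written directly with NumPy. This is a generic sketch of the classical Coifman–Lafon construction (Gaussian kernel, $\alpha$-normalization, eigendecomposition of the Markov matrix), not the parsimonious DM variant used in the paper; the median-based kernel scale and the circle used in the usage example are illustrative choices.

```python
import numpy as np

def diffusion_maps(X, n_coords=3, alpha=1.0, t=1):
    """Minimal diffusion-maps embedding; rows of X are snapshots."""
    # Gaussian kernel with a median-based scale (a common heuristic)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D2 / np.median(D2))
    # alpha-normalization to factor out the sampling density
    q = K.sum(axis=1)
    K = K / np.outer(q, q) ** alpha
    # symmetric conjugate of the Markov matrix for a stable eigensolve
    d = K.sum(axis=1)
    S = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)
    idx = np.argsort(vals)[::-1]
    vals, vecs = vals[idx], vecs[:, idx]
    # recover eigenvectors of the Markov matrix; the leading one is constant
    psi = vecs / vecs[:, [0]]
    # drop the trivial constant coordinate; scale by eigenvalue^t
    return psi[:, 1:n_coords + 1] * vals[1:n_coords + 1] ** t

# usage sketch on points sampled from a circle
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X = np.column_stack([np.cos(theta), np.sin(theta)])
Z = diffusion_maps(X, n_coords=2)            # shape (200, 2)
```

For the Brusselator, the rows of `X` would be the stacked $(u, v)$ solution profiles, and the three leading nontrivial coordinates would serve as the reduced state for the GPR surrogate.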

### C. FOREX trading

Here, we assessed the performance of the proposed forecasting framework in the FOREX trading application described earlier, in terms of the annualized Sharpe (SH) ratio^{88} of the constructed risk parity portfolio. The returns are computed by (67). The basic formula for calculating the SH ratio is given by

$$\mathrm{SH}=\frac{\mu_{\Pi}-R_f}{\sigma_{\Pi}},$$

where $\mu_{\Pi}$ is the sample average return of the risk parity portfolio and $\sigma_{\Pi}$ the corresponding volatility, while $R_f$ is the risk-free rate, usually set equal to the annual yield of US Treasury Bonds. Here, for our analysis, we set $R_f=0$, which is a fair approximation of reality.
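A minimal implementation of this metric for daily returns is straightforward; the annualization factor of $\sqrt{250}$ (250 trading days per year) and the sample standard deviation are conventional choices, assumed here rather than taken from the paper.

```python
import numpy as np

def sharpe_ratio(daily_returns, rf=0.0, periods=250):
    """Annualized Sharpe ratio of a daily return series;
    rf is the per-period risk-free rate (0 here, as in the text)."""
    excess = np.asarray(daily_returns) - rf
    return np.sqrt(periods) * excess.mean() / excess.std(ddof=1)
```

With $R_f=0$, the ratio is simply the annualized mean return divided by the annualized volatility of the portfolio.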

The underlying dynamics of the FOREX market is, in general, non-autonomous, at least over a long time period. In principle, there are time-varying exogenous factors, including macroscopic economic indices, social interaction, and mimesis (see, e.g., Papaioannou *et al.*,^{89} where machine learning has been used to forecast FOREX taking into account Twitter messages), but also seasonal factors and rare events (such as the recent COVID-19 pandemic), which influence FOREX over time. This contrasts with the synthetic time series that we have examined here, as well as with other autonomous models that have been considered in other studies as benchmarks to assess the performance of the various schemes (see also the discussion in the conclusions, Sec. VIII). Hence, to cope with such changes in the FOREX market, and in financial assets in general, one would set up a sliding/rolling window, train models within the rolling window, and perform (usually) one-day-ahead forecasts for trading purposes.

For our illustrations, we assessed the performance of the proposed scheme based on the so-called risk parity trading strategy using a one-year (250 trading days) embedding rolling window and the last 50 or 100 days of the 250-day rolling window for training. The forecasting performance of the proposed scheme using DMs and LLE for embedding with three coordinates, MVAR and GPR for prediction, and GHs for lifting was compared against the naïve random walk and the full MVAR and GPR models trained and implemented directly in the original space. A comparison against the linear embedding provided by PCA with the same number of principal components was also performed.
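The rolling-window protocol described above can be sketched as follows. This is an illustrative skeleton only: a plain least-squares AR model stands in for the paper's embed-forecast-lift pipeline (and for the full MVAR/GPR baselines), and the window sizes mirror the 250-day window with a 100-day training segment used in the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def rolling_one_day_ahead(returns, window=250, train_len=100, lags=1):
    """One-day-ahead forecasts over a rolling window. A plain AR(lags)
    model fit by least squares stands in for the MVAR/GPR surrogates."""
    r = np.asarray(returns)
    preds = []
    for t in range(window, len(r)):
        seg = r[t - train_len:t]             # last train_len days of the window
        X = np.column_stack([seg[i:len(seg) - lags + i] for i in range(lags)])
        y = seg[lags:]
        model = LinearRegression().fit(X, y)
        preds.append(model.predict(r[t - lags:t][None, :])[0])
    return np.array(preds)
```

In the paper's scheme, each iteration would additionally embed the 250-day window on the manifold, forecast the reduced coordinates, and lift the prediction back to the ten exchange-rate returns before updating the portfolio.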

Figure 3 depicts the SH ratios obtained with the various methods. As shown, the proposed schemes using the DM and LLE algorithms for embedding outperform all the other schemes for the same size of the training window. In particular, the highest SH ratio ($\sim 0.83$) is obtained with the combination of LLE and MVAR within the 100-day training window, followed by the combination of DMs and GPR in the 100-day training window, resulting in an SH ratio of $\sim 0.76$. The third ($\sim 0.73$) and fourth ($\sim 0.72$) largest SH ratios result again from the implementation of the DMs and MVAR within the 50- and 100-day training windows, respectively. On the other hand, the naïve random walk (black bar) resulted in a negative SH ratio ($\sim -0.42$). Negative SH ratios (of the order of $-0.35$) resulted also from the implementation of PCA with GPR (yellow bars) for both sizes of the training window, while for the 50-day training window, the combination of PCA with MVAR resulted in an almost zero (but still negative) SH ratio. With PCA, a (small) positive SH ratio ($\sim 0.23$) was obtained with MVAR within the 100-day training window. The full MVAR and GPR models, trained and implemented directly in the original space, produced positive SH ratios, but still smaller than those of the DM and LLE schemes. In particular, within the 50-day training window, the full MVAR (GPR) model resulted in an SH ratio of $\sim 0.39$ ($\sim 0.04$), while within the 100-day training window, the full MVAR (GPR) model resulted in an SH ratio of $\sim 0.68$ ($\sim 0.3$). Finally, we also tested the forecasting performance of the full MVAR and GPR models using the 250-day training window. In this configuration, the MVAR model yielded an SH ratio of $\sim 0.49$, and the GPR model an SH ratio of $\sim 0.2$.

In Fig. 4, we indicatively depict the time series of the USD/EURO FOREX pair returns together with the predictions of the full MVAR model (blue line) and the LLE-MVAR(1)-GH scheme (red line), both using a 100-day sliding training window.

## VIII. CONCLUSIONS AND DISCUSSION

We proposed a data-driven numerical framework based on nonlinear manifold learning for forecasting high-dimensional time series, which is composed of three basic steps: (i) embedding of the original high-dimensional data in a low-dimensional manifold, (ii) construction of reduced-order surrogate models and forecasting on the manifold, and (iii) reconstruction of the predictions in the high-dimensional space by solving the pre-image problem.

The task of forecasting differs from that of reconstructing high-dimensional models of dynamical systems based on interpolation in four main aspects. First, for real-world data sets, the existence of a relatively smooth low-dimensional manifold is not guaranteed as it is in the case of well-defined dynamical models (see, e.g., the discussion in Gajamannage *et al.*^{46}). Second, non-stationary dynamics, which in general are not an issue when dealing with the approximation of dynamical systems, pose a major challenge for reliable/consistent forecasting. It should be noted that the stationarity assumption may be required to hold true even for interpolation problems. For example, the implementation of the MVAR models, but also of Gaussian processes with a Gaussian kernel, requires the stationary covariance function assumption to be satisfied (see, e.g., the discussion in Rasmussen^{90} and Cheng *et al.*^{91}). Third, when dealing with real-world time series, such as financial time series, the number of available snapshots (even at the intraday frequency of trading) is limited, in contrast to the volume of temporal data that one can produce by model simulations. In such cases, the quest for beating the “curse of dimensionality” through the correct (parsimonious) embedding and modeling is stronger. Finally, in the case of smooth dynamical systems, as the dynamics emerges from the same physical laws as expressed by the underlying ODEs, PDEs, or SDEs, what is usually sought is a single global manifold (a single geometry). However, in many complex problems, such as those in finance, the best parameterization of the manifold (if it exists) may change over time. Thus, one would seek a family of (sequential-in-time) submanifolds, which can be “identified” within a rolling window approach.

The performance of the scheme was assessed by implementing and comparing different combinations of nonlinear manifold learning algorithms (LLE and DMs), regression models (MVAR and GPR), and methods for solving the pre-image problem (geometric harmonics and radial basis functions) on various problems, including synthetic stochastic data series, the numerical propagation in time of the solution of linear and nonlinear parabolic PDEs, and a real-world data set of FOREX time series.

As discussed, the solution of the linear system associated with the RBF interpolation with Gaussian kernels may not be a good choice, as the system matrix (46) may be numerically rank deficient. In this case, one could compute the solution using, e.g., the Moore–Penrose pseudoinverse. Moreover, for large-scale almost-singular linear systems, one could use the GMRES method with preconditioning.^{92–94} We intend to study the performance of such approaches in a future work.
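Both remedies mentioned above are readily available in SciPy. The toy example below is an illustration of the issue, not the paper's system: two nearly coincident RBF centers make the Gaussian-kernel interpolation matrix numerically rank deficient, and the linear system is then solved via the Moore–Penrose pseudoinverse and via GMRES (the preconditioning suggested in the text is omitted for brevity); the centers, shape parameter, and right-hand side are arbitrary choices.

```python
import numpy as np
from scipy.linalg import pinv
from scipy.sparse.linalg import gmres

# Gaussian-kernel RBF interpolation matrix with two nearly coincident
# centers, which makes the system numerically rank deficient.
x = np.array([0.0, 0.1, 0.1 + 1e-9, 0.5, 1.0])
A = np.exp(-(2.0 * (x[:, None] - x[None, :])) ** 2)
b = np.sin(2 * np.pi * x)

# Option 1: Moore-Penrose pseudoinverse (truncates tiny singular values).
w_pinv = pinv(A) @ b

# Option 2: an iterative Krylov solve with GMRES.
w_gmres, info = gmres(A, b, atol=1e-10)
```

For large-scale problems, the dense pseudoinverse becomes prohibitive, which is where a preconditioned GMRES, as referenced above,^{92–94} becomes the more practical choice.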

At this point, we remark that there is no systematic way to choose which manifold learning technique to use (specifically, LLE or diffusion maps), and in practice, it often highly depends on the particular case. In the literature, one can find several empirical studies,^{95} and the current understanding in the community is that diffusion maps may possibly perform better in modeling non-convex manifolds, data on the manifold that are non-uniformly sampled, and boundary effects. These empirical observations have been (partially) supported theoretically in a recent line of work.^{96–98}

As a “restrict-run-lift” scheme, the proposed approach shares analogies with the equation-free approach and coarse-timestepping.^{99–102} Furthermore, as we demonstrated for the case of the PDEs, the proposed scheme can be bridged with the concept of “gap-tooth”/“patch dynamics” schemes^{103–105} for the numerical solution of high-dimensional PDEs; we will report on such a methodology in a future paper.

Finally, we note that in order to cope with the generalization property and the topological instability issues (see, e.g., Balasubramanian *et al.*^{19}) that arise in implementing kernel-based manifold learning algorithms when the data set is not sufficiently dense on the manifold, and/or in the presence of “strong” stochasticity, one can resort to techniques, such as the constant-shift one to appropriately adjust the graph metric, and techniques for the removal of outliers from the data set and the construction of smooth geodesics.^{46,50,106}

## ACKNOWLEDGMENTS

This work was partially supported by the Italian program Fondo Integrativo Speciale per la Ricerca (FISR) (No. FISR2020IP-02893). The work of Ioannis G. Kevrekidis was partially supported by the U.S. Department of Energy and the U.S. Air Force Office of Scientific Research.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Panagiotis G. Papaioannou:** Data curation (lead); Formal analysis (lead); Investigation (lead); Methodology (supporting); Software (lead); Validation (lead); Writing – original draft (supporting); Writing – review and editing (supporting). **Ronen Talmon:** Formal analysis (supporting); Methodology (supporting); Validation (equal); Writing – original draft (supporting); Writing – review and editing (equal). **Ioannis G. Kevrekidis:** Conceptualization (supporting); Methodology (supporting); Validation (supporting); Writing – original draft (supporting); Writing – review and editing (equal). **Constantinos Siettos:** Conceptualization (lead); Formal analysis (equal); Investigation (equal); Methodology (lead); Supervision (lead); Validation (equal); Writing – original draft (lead); Writing – review and editing (lead).

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

## REFERENCES

*Dynamical Systems and Turbulence, Warwick 1980* (Springer, 1981), pp. 366–381.

*Introduction to Smooth Manifolds* (Springer, 2013), pp. 1–31.

*Proceedings of the 9th Python in Science Conference*, Vol. 445 (Austin, TX, 2010), pp. 51–56.

*Proceedings of the 9th Python in Science Conference*, Vol. 57 (Austin, TX, 2010), p. 61.

*Data Structures, Near Neighbor Searches, and Methodology* (American Mathematical Society, 2002), Vol. 59, pp. 105–123.

*Summer School on Machine Learning* (Springer, 2003), pp. 63–71.