DEFM: Delay-Embedding-based Forecast Machine for Time Series Forecasting by Spatiotemporal Information Transformation

Making accurate forecasts for a complex system is a challenge in various practical applications. The major difficulty in solving such a problem concerns nonlinear spatiotemporal dynamics with time-varying characteristics. Takens' delay embedding theory provides a way to transform high-dimensional spatial information into temporal information. In this work, by combining delay embedding theory and deep learning techniques, we propose a novel framework, the Delay-Embedding-based Forecast Machine (DEFM), to predict the future values of a target variable in a self-supervised and multistep-ahead manner based on high-dimensional observations. With a three-module spatiotemporal architecture, the DEFM leverages deep neural networks to effectively extract both the spatially and temporally associated information from the observed time series, even with time-varying parameters or additive noise. The DEFM can accurately predict future information by transforming spatiotemporal information to the delay embeddings of a target variable. The efficacy and precision of the DEFM are substantiated through applications in three spatiotemporally chaotic systems: a 90-dimensional (90D) coupled Lorenz system, the Lorenz 96 system, and the Kuramoto-Sivashinsky (KS) equation with inhomogeneity. Additionally, the performance of the DEFM is evaluated on six real-world datasets spanning various fields. Comparative experiments with five prediction methods illustrate the superiority and robustness of the DEFM.


Introduction
In the era of big data, enormous amounts of time-course data are observed and generated in many fields, such as gene expression data in molecular biology, neural activity data in neuroscience, and atmospheric data in meteorology [1][2][3][4][5][6][7]. In these areas, the availability of accurate empirical models for multistep-ahead prediction is highly desirable. However, in view of the complexity and nonlinearity of real-world systems, it is challenging to bridge known information to future dynamics, especially when the known time series contains a massive number of factors (high-dimensional variables).
Numerous approaches have been developed for time series analysis, e.g., statistical regression methods, such as the mean average through exponential smoothing 8,9 and the autoregressive integrated moving average (ARIMA) model 10, and machine learning methods, including recurrent neural networks (RNNs) [11][12][13][14], long short-term memory (LSTM) networks 15,16, and reservoir computing 17,18. However, most existing methods rely only on the temporal information of time series, which limits their applications in many real-world systems. In contrast with the numerous existing prediction methods based on time association mining, few studies focus on leveraging high-dimensional data. Making time series forecasts based on high-dimensional data is usually necessary. On the one hand, high-dimensional time series data typically contain spatial heterogeneity in their system dynamics, which can significantly improve the performance of a time series forecasting method and reduce the needed length of the input time series 19. On the other hand, handling high-dimensional data is a common occurrence in many practical scenarios, such as biological experiments 20. Thus, new strategies are needed to better explore the time series derived from high-dimensional complex systems.
Takens' delay embedding theory 21,22 reveals that a high-dimensional attractor and reconstructed delay attractors with appropriately selected dimensions are topologically conjugated, thus suggesting mappings from the original attractor to the reconstructed attractors. Derived from delay embedding theory, the randomly distributed embedding (RDE) 23,24 approach makes a one-step-ahead prediction for a target variable based on short-term time series through a spatial-temporal information (STI) transformation. Our recently proposed autoreservoir computing framework (ARNN) 25 achieves multistep-ahead forecasting based on a semi-linearized STI transformation; however, a remaining challenge is how to exploit the nonlinear features of the observed high-dimensional variables, which restricts the forecasting performance of the model in terms of robustness and accuracy.
In this study, we propose a novel neural network-based framework, the Delay-Embedding-based Forecast Machine (DEFM), to produce highly accurate multistep-ahead predictions by transforming the spatial information of a high-dimensional time series to the temporal information of a target variable (Fig. 1(a)). Motivated by the previous linear or semi-linearized STI-based works 23,25, the DEFM is designed to solve the nonlinear STI equation that maps the observed high-dimensional variables to the delay embeddings of a target variable (the variable to be predicted) in a self-supervised manner. Specifically, the DEFM utilizes its spatiotemporal structure, consisting of a spatial module and a temporal module, to exploit the intrinsic dynamics of a complex system and extract both the spatial interactions and the temporally associated information among the variables in high-dimensional data (Fig. 1(b)). With such a data-driven architecture, the DEFM takes the high-dimensional variables as inputs and provides the delay embeddings of the target variable as outputs, and it is trained by a "consistently self-constrained scheme", thus directly producing a multistep-ahead prediction for the target variable with only one forward step (Fig. 1(b)). Simultaneously, the DEFM can be applied to make long-term predictions through an iterative scheme that takes the predicted values as part of the input for the next prediction step. We present DEFM applications across several benchmark systems, underscoring the robustness and effectiveness of the approach. Through real-world applications, our investigation highlights the capacity of the DEFM to serve as a valuable analytical instrument for the examination and prediction of intricate dynamic systems. Notably, our method relies on high-dimensional time series data, emphasizing its versatility and applicability in practical scenarios.

Delay embedding theorem for dynamical systems
The dynamics of a general discrete-time dissipative system can be defined as

$$\mathbf{x}_{t+1} = f(\mathbf{x}_t),$$

where $f: \mathbb{R}^n \to \mathbb{R}^n$ is a nonlinear mapping, and its state $\mathbf{x}_t = (x_t^1, x_t^2, \dots, x_t^n)^{\mathsf{T}}$ is defined in an $n$-dimensional space at a time point $t = t_k$, where the symbol "$\mathsf{T}$" denotes the transpose of a vector, and any time interval between two consecutive time points is equal. After a sufficiently long time, all states converge to a compact manifold $M$. Denoting $d$ as the box-counting dimension of the attractor contained in manifold $M$, the delay embedding theorem indicates that only the observed long-term data of a single variable are needed to topologically reconstruct the attractor of the original high-dimensional system when certain conditions are satisfied. Takens' embedding theorem 21,22 is formally articulated in Theorem 1 as follows.
Theorem 1. If $M$ is a compact manifold or an attractor with a dimensionality of $d$, then for a smooth diffeomorphism $\phi: M \to M$ and a smooth function $h: M \to \mathbb{R}^1$, it is a generic property that the mapping $\mathcal{F}_{\phi,h}: M \to \mathbb{R}^L$,

$$\mathcal{F}_{\phi,h}(\mathbf{x}) = \big(h(\mathbf{x}),\; h \circ \phi(\mathbf{x}),\; \dots,\; h \circ \phi^{L-1}(\mathbf{x})\big),$$

is an embedding when $L > 2d$, where the symbol "$\circ$" is the function composition operation.
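To make the delay-coordinate map concrete, here is a minimal numpy sketch. The helper `delay_embed` and the sine-wave series are illustrative, not from the paper: each row is the discrete analogue of $\mathcal{F}_{\phi,h}$, with $\phi$ the one-step time shift and $h$ the scalar observation.

```python
import numpy as np

def delay_embed(series, L):
    """Build the L-dimensional delay-coordinate vectors of a scalar series.

    Row i holds (y_i, y_{i+1}, ..., y_{i+L-1}): the observation h applied
    to the state, its one-step shift, ..., its (L-1)-step shift.
    """
    m = len(series) - L + 1
    return np.stack([series[i:i + L] for i in range(m)])

# Toy example: a scalar observation (a 1-D "projection" of a
# higher-dimensional state, as in the theorem).
t = np.arange(200) * 0.1
y = np.sin(t)
E = delay_embed(y, L=3)   # shape (198, 3): 198 delay vectors in R^3
```

For a periodic signal such as this one, the rows of `E` trace out a closed loop in $\mathbb{R}^3$, a reconstructed copy of the underlying cycle.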

DEFM-based STI transformation
As shown in Fig. 1(a), the original observed attractor in an $n$-dimensional space is sampled as a data matrix $X \in \mathbb{R}^{n \times m}$, whose columns are the observed states at $m$ consecutive time points. The DEFM-based STI transformation maps $X$ to the delay embeddings of a target variable $y$,

$$\mathcal{F}(X) = Y, \qquad (4)$$

where $Y$ is an $L \times m$ matrix constructed by a series of delay embeddings of $y$. Eq. (4) is also called the DEFM-based STI equation, in which the high-dimensional known/past information is transformed to the temporal information of the target variable, including its unknown future values. Clearly, fitting the formal nonlinear mapping $\mathcal{F}$ is the key to precisely predicting the future values of the target variable $y$ from the known information $X$.
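The shapes involved in the STI equation can be sketched as follows; the sizes and random data are illustrative only. The upper-left entries of the delay-embedding matrix are known observations, while the lower-right entries are the unknown future values that solving the STI equation provides.

```python
import numpy as np

n, m, L = 5, 8, 4                 # illustrative sizes (not the paper's settings)
rng = np.random.default_rng(0)
X = rng.normal(size=(n, m))       # known high-dimensional observations
y = X[0]                          # take the first variable as the target

# Delay-embedding matrix Y of the target: Y[j, i] = y at time t_{i+j}
# (0-based). Entries with i + j <= m - 1 are known past values; the
# remaining lower-right entries are the future values to be predicted.
Y = np.full((L, m), np.nan)
for j in range(L):
    for i in range(m):
        if i + j <= m - 1:
            Y[j, i] = y[i + j]

known = ~np.isnan(Y)              # mask of the known upper-left region
```

The unknown lower-right triangle contains $L(L-1)/2$ entries, corresponding to the $L-1$ future time points, each appearing in several columns.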

The architecture of the DEFM
First, based on Eq. (4), the input is the observation matrix $X = [\mathbf{x}_{t_1}, \mathbf{x}_{t_2}, \dots, \mathbf{x}_{t_m}]$. To explore the intertwined information in the observed high-dimensional time series data, a spatiotemporal structure is deployed for the DEFM framework, as shown in Fig. 1(b). The temporal module of the DEFM is assembled from $N$ self-attention 26 layers, which are applied to determine the temporal interactions among variables [26][27][28]. In particular, each self-attention layer contains a multi-head attention layer and a Feed-Forward layer. First, each high-dimensional sample $\mathbf{x}_{t_i} \in \mathbb{R}^n$ ($i = 1, 2, \dots, m$) is projected to a query vector $\mathbf{q}_{t_i} \in \mathbb{R}^{d_a}$, a key vector $\mathbf{k}_{t_i} \in \mathbb{R}^{d_a}$ and a value vector $\mathbf{v}_{t_i} \in \mathbb{R}^{d_a}$, where $d_a$ is the dimensionality of the attention model, with weight matrices $W^Q \in \mathbb{R}^{d_a \times n}$, $W^K \in \mathbb{R}^{d_a \times n}$, $W^V \in \mathbb{R}^{d_a \times n}$ and an output matrix $W^O \in \mathbb{R}^{h d_v \times d_a}$ for the multi-head combination. The Feed-Forward layer is usually a fully connected network, which is adopted to separately process each feature. Thus, each column $\hat{\mathbf{x}}_{t_i} \in \mathbb{R}^{\hat{n}}$ in the output of the temporal module $\hat{X} \in \mathbb{R}^{\hat{n} \times m}$ contains the temporal features at time $t = t_i$. Due to the delay structure of the output (Eq. (4)), we have a total of $m + L - 3$ temporally self-constrained conditions: estimates produced by different sub-mappings that correspond to the same time point $t = t_i$, $i \in \{2, 3, \dots, m + L - 2\}$, must coincide. Among these conditions, $m - 1$ conditions are imposed for the known states, and $L - 2$ conditions are applied for the future values. Clearly, these cross-sample conditions in Eq. (7) constrain the training process in terms of the temporal sample sequence. For the target variable $y$, the estimated values of its delay embeddings $\hat{Y}$ are obtained from the known data matrix $X$, where $(\hat{y}_{t_i})_j$ ($i = 1, 2, \dots, m + L - 1$; $j = 1, 2, \dots, S$) is generated from the output of the $j$-th sub-mapping function $\mathcal{F}_j$. From the matrix $\hat{Y}$ estimated above, the unknown future values of the target variable $\{y_{t_{m+1}}, y_{t_{m+2}}, \dots, y_{t_{m+L-1}}\}$ are provided by Eq. (9). Through an autoperception procedure, the training and optimization steps of the DEFM are accomplished through a process that minimizes a loss function $\mathcal{L} = \mathcal{L}_d + \mathcal{L}_f$, where $\mathcal{L}_d$ is a determined-state loss derived from the observed/known states, and $\mathcal{L}_f$ is a future-consistency loss, in which $\mathrm{mean}((\hat{y}_{t_i}))$ denotes the average of all estimated future values of $y$ that correspond to the same future time point $t_i$ ($i = m + 1, m + 2, \dots, m + L - 1$).
Clearly, $\mathcal{L}_f$ is constructed from the temporally self-constrained conditions in Eq. (7).
An intuitive understanding of the future-consistency loss is that minimizing $\mathcal{L}_f$ ensures that the outputs obtained from different sub-mappings but corresponding to the same future time point $t = t_i$ are identical, which preserves the temporal consistency of the outputs in the lower-right corner of the delay embedding matrix $Y$ during the training procedure.
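A minimal numpy sketch of the two losses follows, assuming the sub-mapping outputs are arranged so that entry $(j, i)$ estimates the target at time $t_{i+j}$ (0-based); the helper `defm_losses` and this layout are our assumption for illustration, not the paper's code.

```python
import numpy as np

def defm_losses(Y_hat, y_known, m):
    """Sketch of the determined-state and future-consistency losses.

    Y_hat  : S x m array; Y_hat[j, i] is sub-mapping F_j's estimate of the
             target at time index i + j.
    y_known: the m observed values of the target.
    Entries with the same i + j estimate the same time point.
    """
    S, cols = Y_hat.shape
    l_d, l_f = 0.0, 0.0
    for t in range(cols + S - 1):                  # time indices 0 .. m+S-2
        ests = np.array([Y_hat[j, t - j] for j in range(S)
                         if 0 <= t - j < cols])
        if t < m:
            # determined-state loss: estimates vs. the observed ground truth
            l_d += np.sum((ests - y_known[t]) ** 2)
        else:
            # future-consistency loss: estimates of one future point
            # must agree with their own mean
            l_f += np.sum((ests - ests.mean()) ** 2)
    return l_d, l_f
```

When all sub-mappings are exact, both losses vanish; perturbing any single estimate of a future point makes only the consistency term positive.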
Based on the integration of the above two losses, the DEFM is trained in a self-supervised manner. The DEFM is first validated on a 90D coupled Lorenz system,

$$\dot{\mathbf{x}}(t) = f(\mathbf{x}(t), \mathbf{P}), \qquad (12)$$

where the variables $\mathbf{x}(t) = (x_1(t), \dots, x_{90}(t))^{\mathsf{T}}$, $f(\cdot)$ is a 90D function set of the Lorenz system, and the vector $\mathbf{P}$ represents the time-invariant parameters. Specifically, the 90D coupled Lorenz system (Eq. (12)) is composed of 30 subsystems, and each subsystem is a three-variable Lorenz system whose first equation is additionally driven by the previous subsystem:

$$\dot{x}_i = \sigma (y_i - x_i) + c_{i-1,i}\, x_{i-1}, \quad \dot{y}_i = x_i (\rho - z_i) - y_i, \quad \dot{z}_i = x_i y_i - \beta z_i,$$

where $i = 1, 2, \dots, 30$ denotes the $i$-th subsystem, $c_{i-1,i}$ is the coupled factor coupling the $i$-th subsystem with the previous subsystem (we set $i - 1 = 30$ for the situation $i = 1$ to keep the system closed), $c_{i-1,i} = 0.1$, $\rho = 28$ and $\beta = 8/3$ are constants of each subsystem, and $\sigma$ is the factor determining whether the system is time-varying or time-invariant. A detailed description of the 90D coupled Lorenz system (Eq. (12)) is provided in Supplementary Section 2.1.
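Such a ring of coupled Lorenz subsystems can be simulated with a few lines of numpy. This is a sketch only: the exact coupling form and the constants below ($\sigma = 10$, coupling strength $0.1$) are assumptions for illustration and need not match the Supplementary settings.

```python
import numpy as np

def coupled_lorenz_rhs(state, n_sub=30, c=0.1, sigma=10.0,
                       rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of a ring of n_sub coupled Lorenz subsystems.

    Each subsystem (x, y, z) follows the classic Lorenz equations; its
    x-equation is driven by the previous subsystem's x via strength c,
    with the ring closed (subsystem 0 is driven by subsystem n_sub-1).
    """
    s = state.reshape(n_sub, 3)
    prev = np.roll(s, 1, axis=0)                 # subsystem i-1
    ds = np.empty_like(s)
    ds[:, 0] = sigma * (s[:, 1] - s[:, 0]) + c * prev[:, 0]
    ds[:, 1] = s[:, 0] * (rho - s[:, 2]) - s[:, 1]
    ds[:, 2] = s[:, 0] * s[:, 1] - beta * s[:, 2]
    return ds.ravel()

def rk4_step(f, x, dt):
    k1 = f(x); k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2); k4 = f(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

state = np.full(90, 1.0) + 0.01 * np.arange(90)  # 90D initial condition
for _ in range(1000):                            # 10 time units at dt = 0.01
    state = rk4_step(coupled_lorenz_rhs, state, 0.01)
```

With weak coupling, each subsystem stays close to the familiar two-wing Lorenz attractor, so trajectories remain bounded.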
First, the DEFM is applied to the noise-free system (Eq. (12)) to predict the future dynamics of three randomly selected target variables $y_1$, $y_2$ and $y_3$. The prediction is carried out with $n = 90$ dimensions, $m = 45$ as the length of the known series, and $L - 1 = 18$ as the prediction length. From a three-dimensional perspective (Figs. 2(a)-(c)), the DEFM predicts the future trends (the red curves) of all the targets $\{y_1, y_2, y_3\}$. The DEFM produces multistep-ahead predictions not only for the single-wing case, i.e., the case in which the known/observed and unknown/to-be-predicted time series are distributed in the same wing of the attractor (as in Fig. 2(c)), but also for the cross-wing cases, i.e., the cases in which the known/observed and unknown/to-be-predicted time series are distributed in two different wings of the attractor (as in Figs. 2(a) and (b)). Figs. S1(d)-(l) present the accurate single-target-variable predictions for the three cases, i.e., (d)-(f) for $y_1$, (g)-(i) for $y_2$, and (j)-(l) for $y_3$. For all the target predictions $\{y_1, y_2, y_3\}$, the Pearson correlation coefficients (PCCs) between the eighteen predicted points and the real values are all above 0.999, while the root mean square errors (RMSEs) are all below 0.06, which demonstrates the high prediction accuracy of the DEFM.

Long-term prediction
Due to its ability to obtain $(L-1)$-step-ahead future states in one prediction, the DEFM can be applied to make a long-term prediction for an $n$-dimensional system with an iterative scheme. Specifically, to forecast the long-term future states of a target variable $y$, e.g., $k(L-1)$-step-ahead future values ($k \ge 2$), the trained DEFM is applied $k$ times with updated inputs. For each $(L-1)$-step-ahead prediction, the input matrix $X \in \mathbb{R}^{n \times m}$ is updated with two groups of data: one part contains the predicted series of a group of selected target variables, and the other part includes the observed time series of the remaining variables. As shown in Figs. 2(d)-(f), the DEFM is applied to the 90D coupled Lorenz system (Eq. (12)) to make long-term predictions (300 steps) with different prediction spans ($L - 1$). It is seen that the DEFM still accurately forecasts the long-term future values when the span is $L - 1 = 10$ (Fig. 2(d)), while the predictions deviate from the real values when the span grows to $L - 1 = 20$ (Fig. 2(e)) and $L - 1 = 30$ (Fig. 2(f)).
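The iterative scheme can be sketched as follows, with a stand-in `predict` function in place of a trained DEFM; the policy of zero-filling the non-target variables in the refreshed window is a simplifying assumption (in practice their observed or predicted values would be used).

```python
import numpy as np

def long_term_forecast(predict, X, target_idx, n_rounds):
    """Iterative long-term prediction sketch.

    `predict` maps an n x m input window to the next (L-1) values of the
    target (a stand-in for a trained DEFM). After each round the window
    slides forward by L-1 steps, re-using the freshly predicted target
    values as part of the next input.
    """
    window = X.copy()
    out = []
    for _ in range(n_rounds):
        future = predict(window)               # shape (L-1,)
        out.append(future)
        s = len(future)
        window = np.roll(window, -s, axis=1)   # slide window forward by L-1
        window[:, -s:] = 0.0                   # dummy fill for other variables
        window[target_idx, -s:] = future       # feed predictions back in
    return np.concatenate(out)

# Toy usage: a dummy "predictor" that continues the target's last 3 values + 1.
X = np.tile(np.arange(10.0), (4, 1))
res = long_term_forecast(lambda w: w[0, -3:] + 1, X, target_idx=0, n_rounds=2)
```

Two rounds with a 3-step predictor yield a 6-step forecast, illustrating how prediction errors from one round feed the next, which explains the accuracy decay for larger spans seen in Figs. 2(d)-(f).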

Comparisons and robustness analysis
The spatiotemporal architecture of the DEFM enables the full exploration of the temporally associated and spatially intertwined information contained in the observed time series of a high-dimensional system. To illustrate the superior performance of the proposed prediction method, the performance of the DEFM is compared with that of four traditional prediction methods, including the moving average (MA), Holt's exponential smoothing (HES) 8,9, ARIMA 10, and LSTM 16. Moreover, to illustrate the benefit brought by the temporal module, a simplified DEFM, which has the same structure as the full DEFM except for the temporal module, is also applied in the comparison. The details of these five methods are listed in Supplementary Section 2.3.
First, the DEFM and the other five prediction methods are applied to the dataset generated from the 90D coupled Lorenz system (Eq. (12)) with the length of the unknown/to-be-predicted series in one prediction fixed to 18, i.e., $L - 1 = 18$. Specifically, when the length of the known time series is $m = 80$, Fig. 3(a) illustrates the prediction performances of all methods. It is seen that the DEFM accurately predicts the unknown values of the target variable (RMSE = 0.058) and performs better than the other methods (whose smallest RMSE is 1.10) (Table S2). As the known time series becomes shorter, the DEFM still achieves high accuracy when $m = 60$ (RMSE = 0.068, Table S2).
Second, the robustness of the DEFM is shown in an application involving the following 90D coupled time-varying Lorenz system:

$$\dot{\mathbf{x}}(t) = f(\mathbf{x}(t), \mathbf{P}(t)), \qquad (14)$$

where $f(\cdot)$ is the same 90D nonlinear function set of the Lorenz system as in Eq. (12), but $\mathbf{P}(t)$ is a time-varying/time-switching parameter vector, which changes as the time variable $t$ moves forward every 100 units. Notably, affected by external disturbances, most real-world systems are indeed time-varying rather than having constant parameters. Thus, the practical applicability of a prediction method should also be validated by its performance in such time-varying systems. As shown in Figs. 3(d)-(h), the DEFM maintains accurate predictions in this time-varying setting. Compared with the stable performance of the DEFM, it is also worth noting that as the length of the known series grows, the performance of LSTM becomes considerably poorer due to its limited memory 35. Simultaneously, when examining the DEFM under conditions of a fixed known length ($m = 80$) with variations in the multistep prediction length (ranging from 20 to 40), the minimal fluctuations observed in prediction performance (Fig. S3) demonstrate the robustness of the DEFM.
To further validate the robustness of the DEFM, a limited number of training cases and noisy conditions are applied. With different subsets of all 1000 cases selected from the 90D coupled Lorenz system (Eq. (12)), the DEFM is trained and tested. As shown in Fig. 3(i), the DEFM still works well after being trained on only 20% of the selected cases, which suggests that the proposed framework can eliminate dependencies on a large amount of training data. In addition, noise is inevitable in real-world systems and disturbs the data analysis process. Trained on disturbed data generated from the 90D Lorenz system with white noise, the DEFM is employed to predict future values that are also perturbed by the noise. Fig. 3(j) shows that the performance of the DEFM decreases gradually as the noise strength increases. However, even when the data are perturbed by considerably strong noise, i.e., when the variance of the noise is approximately 2.0, the average PCC and RMSE remain acceptable. We also investigated the sensitivity of the DEFM forecasting performance to different initialization methods and found that the DEFM performs well with several prevalent initialization methods 36,37, as detailed in Section 2.6 of the Supplementary Information.
From the above discussion, the robustness of the DEFM is demonstrated by its superior performance in both time-invariant and time-varying systems, even with short-term known series. In other words, the DEFM is widely applicable to various high-dimensional systems and under different sampling conditions. The prediction results obtained for the 90D coupled time-varying Lorenz system (Eq. (14)) further confirm this robustness.

Demonstration on the Lorenz 96 system
To demonstrate the effectiveness of the DEFM, we conduct a supplementary analysis using two additional representative numerical datasets, namely, the Lorenz 96 system 38 and the KS equation with inhomogeneity 17 , and compare the DEFM with the other five methods.
The Lorenz 96 model, introduced by Lorenz in 1996 38, characterizes the macroscopic dynamics of the mid-latitude atmosphere. From a mathematical perspective, this model describes the temporal evolution of the components $x_j$, where $j \in \{0, 1, \dots, n-1\}$, representing a spatially discretized atmospheric variable along a specific single latitude circle. The Lorenz 96 system can be formally defined as Eq. (15). The results obtained for the two evaluation metrics (average PCC and RMSE) across different methods are visually represented in Fig. 4(g) and Fig. 4(h), respectively. We can observe that the DEFM achieves the best performance in terms of both evaluation metrics. Notably, the DEFM exhibits a 67.5% PCC improvement over the second-best approach (ARIMA), accompanied by a significant RMSE reduction of 75.6%. In general, the dimensionality of a system significantly influences predictive performance. We conducted a thorough examination of the performance of the DEFM on the Lorenz 96 system (Eq. (15)) under varied dimensional conditions by incrementing the value of $n$. Additionally, the influence of the forcing term $F$ has been considered, with specific settings of $F = 4$ and $F = 5$. Note that a higher value of $F$ indicates an augmented chaotic nature. For all conditions, the known and prediction lengths of the DEFM are fixed at $m = 40$ and $L - 1 = 18$, respectively. Table 1 summarizes the evaluation results of the DEFM applied to the Lorenz 96 system, encapsulating its performance across distinct forcing terms and increasing dimensions. When $F = 4$, the Lorenz 96 system exhibits weak chaotic behavior, and the performance degradation of the DEFM with incremental system dimensions is relatively minor. However, in the case of $F = 5$, the predictive RMSE errors of the DEFM tend to escalate as the system dimension increases, reflecting the heightened complexity of the system. Notably, as the system dimension rises, although there is an augmentation in spatial information for the target variable, a concomitant rise in the number of irrelevant variables incorporated by the model introduces a surplus of extraneous information and noise into the forecasting task.

Validation on the Kuramoto-Sivashinsky equation with inhomogeneity
To examine another spatiotemporally chaotic system, we consider the modified KS equation, incorporating an additional spatial inhomogeneity term 17:

$$y_t = -y y_x - y_{xx} - y_{xxxx} + \mu \cos\!\left(\frac{2\pi x}{\lambda}\right). \qquad (16)$$

The scalar field $y(x, t)$ is periodic within the interval $[0, L)$, where $L$ is an integer multiple of $\lambda$. Note that the dimensionality of the attractor directly depends on the dimensionless parameter $L$, scaling linearly with $L$ for sufficiently large values of $L$.
The KS equation (Eq. (16)) is integrated on a grid comprising $Q$ equally spaced points with a temporal increment of $\Delta t = 0.25$. This integration yields a simulated dataset containing $Q$ time series, thereby establishing a $Q$-dimensional ($n = Q$) system.
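For reference, a homogeneous KS equation (without the inhomogeneity term) can be integrated pseudospectrally in a few lines. This sketch uses a simple semi-implicit Euler step on an illustrative small domain, not the paper's exact numerical scheme or its $L = 200$, $Q = 512$ parameters.

```python
import numpy as np

# Sketch: u_t = -u u_x - u_xx - u_xxxx, periodic on [0, 32*pi) with Q points.
# Linear terms are treated implicitly (stable for the stiff fourth derivative),
# the nonlinearity explicitly. Grid size, dt and domain are illustrative.
Q, Ldom, dt = 128, 32 * np.pi, 0.05
x = Ldom * np.arange(Q) / Q
k = 2 * np.pi * np.fft.fftfreq(Q, d=Ldom / Q)   # wavenumbers
lin = k**2 - k**4                                # symbol of -u_xx - u_xxxx

u = np.cos(x / 16) * (1 + np.sin(x / 16))        # smooth initial condition
u_hat = np.fft.fft(u)
for _ in range(400):                             # integrate to t = 20
    # -u u_x = -(1/2) d/dx (u^2), computed in physical space
    nonlin = -0.5j * k * np.fft.fft(np.real(np.fft.ifft(u_hat)) ** 2)
    u_hat = (u_hat + dt * nonlin) / (1 - dt * lin)
u = np.real(np.fft.ifft(u_hat))
```

The implicit treatment of the $k^2 - k^4$ symbol damps the stiff high-wavenumber modes, so a modest step size suffices for a qualitative simulation; production-grade KS datasets typically use higher-order exponential integrators.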
Following the approach outlined in reference 17, the modified KS equation (Eq. (16)) used here adopts the following parameter values: $L = 200$, $Q = 512$, $\mu = 0.01$, and $\lambda = 100$. In the context of this chaotic system, the forecasting results obtained from the DEFM reveal only slight disparities when compared to the actual solutions (Figs. 4(d)-(h)).
In particular, akin to its performance on the Lorenz 96 system, the DEFM showcases significantly superior average PCC and RMSE values for its KS equation prediction results in comparison with the other methods (Fig. 4(i) and Fig. 4(j)). Additionally, the results and analyses for the prediction horizon on the modified KS systems with varying system sizes ($L = 100$, $200$, and $400$) are presented in Section 2.4.10 of the Supplementary Information.
Specifically, our analysis reveals that the DEFM predicts up to approximately 1 Lyapunov time, as shown in Fig. S14 of the Supplementary Information, which is less effective than the reservoir computing method proposed by Pathak et al. in reference 17 .
In fact, the approach proposed in reference 17 is easier to implement and proves to be the better choice when less training data is available, when making longer-term predictions, or when the training samples come from spatiotemporally chaotic systems like the KS equation or the Lorenz 96 system. In terms of accuracy, the DEFM excels in short-term time-series forecasting. Meanwhile, in scenarios where the variables exhibit more complex nonlinear relationships (such as with real-world data), the DEFM, based on the STI transformation, is more effective in predicting the future information of each target variable.

Conclusion
In this paper, we present a novel dynamics-based machine learning model, the DEFM, to make multistep-ahead predictions of future information based on high-dimensional time series data. The key of the DEFM method is to solve the STI equation by fitting the nonlinear function that maps the known information (the high-dimensional time series of a complex system) to the future information (the time series of a target variable). Based on the delay embedding scheme, which enlarges the training sample size by including labeled embeddings and unlabeled future values (Fig. 1), the DEFM makes use of high-dimensional time series data with its three spatiotemporal modules. Trained in a self-supervised way, the DEFM fits the nonlinear function well and accurately predicts the future dynamics of the target variable in a multistep-ahead manner. Moreover, due to its ability to obtain multistep-ahead future states in one prediction, the DEFM can be applied to make long-term predictions with an iterative scheme. As a universal deep learning method, the DEFM is not only effective in forecasting for dynamical systems with explicit models but also exhibits satisfactory performance on various real-world datasets, demonstrating its applicability to periodic-like datasets such as the single-wing cases of a Lorenz system (Case 3 in Fig. 2) and solar irradiance (Fig. S10), nonperiodic datasets such as the two-wing cases of a Lorenz system (Cases 1 and 2 in Fig. 2) and typhoon positioning (Fig. S11), and even highly fluctuating cases such as those involving wind speed (Fig. S8) and traffic flow (Fig. S12). In addition, a comparison with previously developed methods 23,25,39 that incorporate the STI transformation is expounded upon in Section 2.7 of the Supplementary Information, highlighting the advancements achieved by the DEFM.
The DEFM achieves a prediction horizon of approximately 5 Lyapunov times for the Lorenz 96 model, in contrast to around 1 Lyapunov time for the KS system, as shown in Figs. 5 and S14, respectively. Upon further analysis, we hypothesize that the discrepancy in predictive performance between the KS case and the Lorenz 96 system may be attributed to the inherent characteristics of the data generated from each model. Specifically, the data generated from the KS model exhibit stronger temporal variability and periodicity compared to the time series from the Lorenz 96 system. As our DEFM method relies on recent short-term time series for prediction, the more variable these recent dynamics are, the more difficult accurate forecasting becomes, which may explain the shorter prediction horizon for the KS system.

Fig. 1. Schematic illustration of the DEFM. (a) Through a delay embedding scheme 21,22, a delay attractor $V$ is reconstructed for a target variable $y$ with appropriately reconstructed dimensions, and this attractor is topologically conjugated with the original attractor $M$. From attractor $M$ to attractor $V$, a formal nonlinear function $\mathcal{F}$ maps the known spatial information to the temporal information of the target variable $y$, while the future values of $y$ are obtained by $y_{t_{m+k}} = \mathcal{F}_k(x^1_{t_m}, x^2_{t_m}, \dots, x^n_{t_m})$, $k = 1, 2, \dots, L-1$. Therefore, the key to predicting $y$ is to fit the nonlinear function $\mathcal{F}$. (b) For the given target variable $y$, the DEFM framework is designed to solve the DEFM-based STI equation (Eq. (4)) and thus offers a bridge between the high-dimensional data $X$ and the delay embedding matrix $Y$. The DEFM fully extracts rich spatial and temporal information from the high-dimensional data with its temporal and spatial modules and integrates this spatiotemporal information via a merging module. After being trained in a self-supervised way, the DEFM predicts the future information of $y$ in a multistep-ahead manner.

As shown in Fig. 1(b), the DEFM is composed of three modules: a temporal, a spatial, and a merging module. The temporal module is assembled from a series of so-called self-attention 26 layers, which capture the underlying sequentially and temporally associated information among the input samples across all $m$ observed time points (Fig. 1(b)). The spatial module, for extracting the interactions among the $n$ variables in the high-dimensional data, and the merging module, for processing the spatiotemporal information and outputting the delay time series, are both implemented by two separate fully connected neural networks. Specifically, given a matrix $X \in \mathbb{R}^{n \times m}$ consisting of the time series data of the high-dimensional variables $\{x^1, x^2, \dots, x^n\}$ as the input, two data flow branches are utilized; i.e., each column $\mathbf{x}_{t_i} = [x^1_{t_i}, x^2_{t_i}, \dots, x^n_{t_i}]^{\mathsf{T}}$ of $X$ (a spatial vector at a time point $t = t_i$) is input into the spatial module, while the rows (the time series across all $m$ time points $\{t_1, t_2, \dots, t_m\}$) are fed into the temporal module (Fig. 1(b)). Then, the output of the temporal branch $\hat{X} \in \mathbb{R}^{\hat{n} \times m}$ and that of the spatial branch $\tilde{X} \in \mathbb{R}^{\tilde{n} \times m}$ are aggregated into one feature matrix through a concatenation operation and input into the merging module (Fig. 1(b)).
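A single-head self-attention layer of the kind used in the temporal module can be sketched in numpy as follows; the shapes and random weights are illustrative only, and the actual module is multi-head and trained end to end.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, Wq, Wk, Wv):
    """One single-head self-attention pass over m time samples.

    Rows of H are the m input samples; the m x m attention matrix encodes
    how strongly each time sample attends to every other one, i.e., the
    temporal association the temporal module exploits.
    """
    Q, K, V = H @ Wq, H @ Wk, H @ Wv          # (m, d_a) each
    d_a = Q.shape[1]
    A = softmax(Q @ K.T / np.sqrt(d_a))       # (m, m) attention weights
    return A @ V                              # attended features, (m, d_a)

rng = np.random.default_rng(0)
m, n, d_a = 6, 4, 8                           # 6 time samples of 4 variables
H = rng.normal(size=(m, n))
Wq, Wk, Wv = (rng.normal(size=(n, d_a)) for _ in range(3))
out = self_attention(H, Wq, Wk, Wv)           # shape (6, 8)
```

Stacking $N$ such layers (each followed by a feed-forward network) yields the temporal branch; a multi-head version simply runs $h$ of these in parallel subspaces and concatenates the results.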
As shown in Figs. 3(g) and (h), the DEFM achieves the best prediction results with the highest PCC and lowest RMSE in terms of average performance. Notably, even with the short-term information observed from 40 time points, the DEFM still reaches a mean PCC of approximately 0.92 overall for randomly selected cases, while the mean PCCs and RMSEs of the other methods are much worse, demonstrating the excellent ability of the DEFM to conduct short-term high-dimensional time series analyses.

(d) $m = 80$, (e) $m = 60$ and (f) $m = 40$. For all 200 randomly chosen cases involving the Lorenz system, (g) the average Pearson correlation coefficient (PCC) and (h) the mean RMSE between the predictions and ground truths are determined as the length of the known time series increases from 40 to 80. (i) The average PCC and RMSE between the DEFM predictions and the ground truths for cases with different data fractions. (j) The average PCC and RMSE between the DEFM predictions and the ground truths for cases with different noise strengths.

$$\dot{x}_j = (x_{j+1} - x_{j-2})\,x_{j-1} - x_j + F \qquad (15)$$

for $j \in \{0, 1, \dots, n-1\}$, where the boundary conditions are assumed periodic (i.e., $x_{-1} = x_{n-1}$, $x_{-2} = x_{n-2}$), and $F$ is a forcing constant that controls the chaotic nature of the system. In this study, we set $n = Q = 60$ and $F = 5$ to generate the numerical dataset. The DEFM is employed to forecast the Lorenz 96 system with a known length of $m = 40$ and a prediction horizon of $L - 1 = 18$. As the examples shown in Figs. 4(a)-(c) demonstrate, the DEFM outperforms the other five methods, achieving the highest prediction accuracy. The predicted results closely align with the actual values, exhibiting an impressive near-perfect level of alignment. Furthermore, we systematically evaluate the DEFM and the other compared methods, employing the average PCC and RMSE metrics calculated on the synthetic Lorenz 96 dataset.
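Eq. (15) is straightforward to simulate; the sketch below integrates a 60-dimensional Lorenz 96 system with RK4 (the step size and initial condition are illustrative choices, not necessarily those used to generate the paper's dataset).

```python
import numpy as np

def lorenz96_rhs(x, F=5.0):
    """Lorenz 96 right-hand side with periodic boundary conditions:
    dx_j/dt = (x_{j+1} - x_{j-2}) * x_{j-1} - x_j + F."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def rk4(f, x, dt):
    k1 = f(x); k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2); k4 = f(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

n, dt = 60, 0.01
x = np.full(n, 5.0)        # start at the uniform fixed point x_j = F ...
x[0] += 0.01               # ... with a small perturbation to trigger chaos
traj = []
for _ in range(2000):      # 20 time units
    x = rk4(lorenz96_rhs, x, dt)
    traj.append(x.copy())
traj = np.array(traj)      # (2000, 60) synthetic multivariate time series
```

The resulting array has exactly the high-dimensional form the DEFM consumes: 60 spatially coupled variables observed over consecutive time points.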

Fig. 4 .
Fig. 4. Validation of the performance of the DEFM on the Lorenz 96 system and the KS equation. The forecasting results are obtained by six different models on the Lorenz 96 system under a known length of $m = 40$ and a prediction horizon of $L - 1 = 18$ (a-c), and on the KS equation under $m = 40$ and $L - 1 = 13$ (d-f). Performance comparisons are conducted among the six models using the PCC and RMSE metrics attained on the Lorenz 96 system (g, h) and the KS equation (i, j), respectively. Note that DEFM* represents the DEFM model without a temporal module.

Fig. 5 .
Fig. 5. Visualization of the prediction outcomes for the entire Lorenz 96 system ($F = 5$, $n = 60$). (a) Ground truth derived from the Lorenz 96 system. (b) Iteratively obtained predictions from the DEFM. (c) The prediction error (panel (a) minus panel (b)). The ordinate axis represents all the system variables, while the abscissa axis represents the time $t$ scaled in units of one Lyapunov time, i.e., time $t$ multiplied by the largest Lyapunov exponent of the system.
$Q = [\mathbf{q}_{t_1}, \mathbf{q}_{t_2}, \dots, \mathbf{q}_{t_m}]^{\mathsf{T}}$, $K = [\mathbf{k}_{t_1}, \mathbf{k}_{t_2}, \dots, \mathbf{k}_{t_m}]^{\mathsf{T}}$, and $V = [\mathbf{v}_{t_1}, \mathbf{v}_{t_2}, \dots, \mathbf{v}_{t_m}]^{\mathsf{T}}$. In each head subspace, the queries, keys, and values are produced by linear projections with $d_q$, $d_k$ and $d_v$ dimensions. The final multi-head attention values are determined by the concatenated parallel attention values derived from the $h$ heads in the following form:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O.$$

Finally, the delay embedding matrix $Y$ of the target variable $y$ is the output of the merging module. To integrate the $S$ sub-mapping functions, an $S$-node dense layer is adopted at the end of the merging module. Utilizing the architecture with the above three modules, the DEFM can efficiently transform the spatial information derived from a high-dimensional time series into temporal information and provide the future values of the target variable. Each column of the temporal module's output (Fig. 1(b)), which is computed by $N$ stacked self-attention layers, contains the temporal features at time $t = t_i$, $i = 1, 2, \dots, m$, suggesting that the temporal module with stacked self-attention layers explores the sequentially and temporally associated information among the input samples across all $m$ time points. The spatial module takes each column $\mathbf{x}_{t_i}$ of $X$ (a spatial vector at a time point $t = t_i$) as input neurons and outputs the spatial features $\tilde{\mathbf{x}}_{t_i}$ at time $t = t_i$. Aggregated by concatenating the temporal features $\hat{\mathbf{x}}_{t_i}$ and the spatial features $\tilde{\mathbf{x}}_{t_i}$, each column of the merged feature matrix includes the spatiotemporal information at time $t = t_i$. The spatial module and merging module are implemented by fully connected neural networks. Recalling that the nonlinear function $\mathcal{F}$ is composed of $S$ intertwined sub-mappings $\{\mathcal{F}_1, \mathcal{F}_2, \dots, \mathcal{F}_S\}$, the DEFM is trained by a "consistently self-constrained scheme" to fit the sub-mappings simultaneously, thus maintaining the integrity of $\mathcal{F}$ and preserving temporal consistency, due to the nature of the delay embeddings in the output $Y$ (as shown in Eq. (4)).
Here, $\mathcal{L}_d$ is a determined-state loss in terms of the known series $\{y_{t_1}, y_{t_2}, \dots, y_{t_m}\}$ of $y$, and $\mathcal{L}_f$ is a future-consistency loss in terms of the future/unknown series $\{y_{t_{m+1}}, y_{t_{m+2}}, \dots, y_{t_{m+L-1}}\}$ of $y$. Specifically, the determined-state loss $\mathcal{L}_d$ is defined as:

$$\mathcal{L}_d = \sum_{i=1}^{m} \sum_{j} \big\| (\hat{y}_{t_i})_j - y_{t_i} \big\|^2,$$

where $(\hat{y}_{t_i})_j$ is the estimation of $\mathcal{F}_j(\mathbf{x}_{t_i})$ with $\mathbf{x}_{t_i} = [x^1_{t_i}, x^2_{t_i}, \dots, x^n_{t_i}]^{\mathsf{T}}$, and $y_{t_i}$ ($i = 1, 2, \dots, m$) is the known value of $y$. The loss $\mathcal{L}_d$ is constructed from the differences between the estimations $(\hat{y}_{t_i})_j$ and the observed values (ground truths) $y_{t_i}$ for all past time points $t = t_i$ ($i = 1, 2, \dots, m$). The future-consistency loss $\mathcal{L}_f$ is defined as:

$$\mathcal{L}_f = \sum_{i=m+1}^{m+L-1} \sum_{j} \big\| (\hat{y}_{t_i})_j - \mathrm{mean}\big((\hat{y}_{t_i})\big) \big\|^2.$$

Table 1 .
Performance of the DEFM on the Lorenz 96 system under different parameter settings.
To predict the entire Lorenz 96 system ($F = 5$, $n = 60$), we trained 60 target-variable-specific models, one for each variable within the system.