We use the minimum description length (MDL) principle, which is an information-theoretic criterion for model selection, to determine echo-state network readout subsets. We find that this method of MDL subset selection improves accuracy when forecasting the Lorenz, Rössler, and Thomas attractors. It also improves the performance benefit that occurs when higher-order terms are included in the readout layer. We provide an explanation for these improvements in terms of decreased linear dependence and improved consistency.

Minimum description length (MDL) is an information-theoretic criterion for evaluating model performance. It can be used to select an appropriately sized predictor subset from large readouts. This is useful for training echo-state networks (ESNs), which map high-dimensional dynamical systems to outputs via a readout layer. The size of the MDL-selected component subset reaches a plateau as ESN size is increased beyond some critical value, which results in readout sparsity. We make use of this sparsity by exploring recent works that augment reservoir readouts with higher-order sub-graph states. We find that these higher-order states substantially improve the performance of ESNs in some cases. However, we also show that the subsequent increase in readout size may increase the risk of over-fitting and make stronger regularization necessary. By using MDL to maintain a small readout size, these over-fitting risks can be mitigated while the performance benefits are achieved.

Reservoir computing is a branch of machine learning, in which only the responses of otherwise untrained dynamical systems are learned. Reservoir computing is attributed to Jaeger (2001) and Maass et al. (2002), who initially introduced the concept as a simplified framework for training recurrent neural networks (Nakajima and Fischer, 2021). Since then, reservoir computing has maintained relevance at least in part because of its conceptual simplicity (Nakajima and Fischer, 2021) and low learning cost (Tanaka et al., 2019).

Despite simple training methods, reservoir computing has demonstrated great effectiveness in modeling temporal problems and has been used widely for the prediction of chaotic systems. In recent years, much work has been done to incorporate greater complexity in the readouts of reservoir computers while retaining the simplicity of their training. It has been shown by Bollt (2021) that reservoirs with linear activation functions may be improved by incorporating quadratic terms of reservoir component states. Quadratic terms are also used by Pathak et al. (2018a) to build a readout for an ensemble of reservoirs and model spatio-temporal data. In Ren et al. (2024), higher-order terms are included to construct a Hilbert space of polynomials of reservoir components.

While this recent work in echo-state network (ESN) readout augmentation has demonstrated the ability to significantly improve reservoir performance, it does so while substantially increasing readout size. We propose using the minimum description length (MDL) principle as a criterion for selecting subsets from these larger readouts. MDL is a powerful technique for comparison between models of different sizes that was first introduced by Rissanen (1978). We show that, when combined with the subset selection algorithm presented by Judd and Mees (1995), MDL improves ESN performance. One possible explanation for such performance improvements is the reduction of linear dependence in the readout, which reduces the need for regularization.

We will first introduce reservoir computing with a particular focus on ESNs (Sec. II A), then we will introduce the MDL principle in Sec. II B. We look specifically at ESNs in this paper, though our implementation of MDL theory is applicable to all forms of reservoir computing including physical reservoirs. In Sec. II C, we provide methods for performance comparison between ESNs and present the dynamical systems that we use for prediction tasks. In Sec. III, we demonstrate the difference between MDL and ridge regression and compare their performance on benchmark tasks. In Sec. IV, we use the MDL principle to select components from a readout augmented with sub-graph states. Finally, in Sec. V, we provide an explanation in terms of linear dependence and consistency for the improved performance and reduced need for regularization in MDL readout subsets.

Reservoir computing is a method for time-series modeling with high-dimensional dynamical systems called reservoirs. These reservoirs are driven by an input signal, which they map onto a high-dimensional space of measurable components. After some transient period, these components become complex functions of historical input states if the echo-state property holds (Yildiz et al., 2012). If the echo-state property does hold, then the space of reservoir components is an embedding space (Duan et al., 2023), which converts a temporal problem into a spatial problem (Schrauwen et al., 2007). All that is needed to map the state of reservoir components back to the state of the input is a readout trained on pattern recognition (Tanaka et al., 2019). This readout consists of a weighted sum of reservoir components and is commonly trained using ordinary least squares regression or ridge regression (Sun et al., 2022). In theory, the latter helps prevent over-fitting, which is particularly important in reservoir computing since the reservoir state is high-dimensional by design.

In practice, an infinite transient period is not possible and the space of reservoir components may contain some amount of noise that is not functionally dependent on the input history. This can be demonstrated by driving two copies of a reservoir from different initial states with an identical time series and measuring the consistency capacity between them (Jüngling et al., 2021). Consistency capacity describes the effective dimension of the reservoir—the dimension of the subspace that is purely functionally dependent on the input history. A more detailed explanation of consistency capacity can be found in the Appendix.

Echo-state networks (Jaeger, 2001), which pioneered the field of reservoir computing alongside the liquid state machine (Maass et al., 2002), remain one of the most commonly used reservoir frameworks. They can be conceptualized as consisting of a network with nodes $\{x_i\}_{i=1}^{n}$, which are connected to a time-dependent input signal $u(t)$. We denote the adjacency matrix of this network by $M$ and denote the vector of connections between the input and network nodes by $W$. In order to introduce non-linearity, the network node states can be passed through an activation function $f:\mathbb{R}\to\mathbb{R}$.

In the simplest case, without leakage or feedback, the network node states are updated iteratively using a small time increment $\delta t$ according to the update equation below,
$$x(t+\delta t) = f\big(M x(t) + W u(t)\big). \qquad (1)$$
If leakage is used (denoted by $0 \leq \beta < 1$), then the update equation becomes
$$x(t+\delta t) = \beta\, x(t) + (1-\beta)\, f\big(M x(t) + W u(t)\big).$$
Often the hyperbolic tangent is used as the activation function, $f(\cdot) = \tanh(\cdot)$. However, activation functions are not always necessary: it has been shown that without any activation function, the ESN is a vector auto-regressive (VAR) model (Bollt, 2021).
The output of the reservoir at time $t$, denoted $\hat{y}(t)$, is calculated as a weighted sum of network nodes with weight vector $\theta$,
$$\hat{y}(t) = \theta^{T} x(t).$$
If the ESN is trained to predict the driving signal $m$ time increments into the future, then the weights $\theta$ are chosen such that $\hat{y}(t)$ best approximates $u(t + m\,\delta t)$. These coefficients are usually estimated by linear or ridge regression. In the case of linear regression,
$$\theta = (X^{T}X)^{-1}X^{T}Y,$$
where $X$ is a matrix of historical network node states and $Y$ is the vector of historical input states $m$ time steps ahead, $Y_j = u((j+m)\,\delta t)$. In the case of ridge regression with a penalty $\alpha \geq 0$,
$$\theta = (X^{T}X + \alpha I)^{-1}X^{T}Y.$$
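As a concrete illustration, the sketch below runs an ESN forward and fits the readout by ridge regression in the manner described above. It is a minimal NumPy example; the function names, washout length, and default penalty value are our own choices and not taken from the paper's implementation.

```python
import numpy as np

def run_esn(u, M, W, beta=0.0, washout=100):
    """Drive an ESN with input u and collect the node states (one row per time step).

    u: (T,) input series, M: (n, n) adjacency matrix, W: (n,) input weights,
    beta: leakage in [0, 1) as in the update equations above.
    """
    n, T = M.shape[0], len(u)
    x = np.zeros(n)
    states = np.zeros((T, n))
    for t in range(T):
        x = beta * x + (1.0 - beta) * np.tanh(M @ x + W * u[t])
        states[t] = x
    return states[washout:]  # discard the initial transient

def fit_readout(X, Y, alpha=1e-8):
    """Ridge-regression readout: theta = (X^T X + alpha I)^{-1} X^T Y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ Y)
```

For an $m$-step-ahead task, $Y$ would be the input series shifted forward by $m$ samples and truncated to the same length as the rows of $X$.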

Minimum description length is derived from the idea of code-length. It is a method for comparing models of different sizes within the same model class by approximating their ability to compress data (Nakamura et al., 2003). The aim when building a model according to MDL theory is to minimize the sum of the code-lengths $L(\cdot)$ of both the model parameters, $L(\theta)$, and the model errors, $L(e\,|\,\theta)$.

If the encoding of the model errors were optimal and the errors were discretized to finite precision, then the expected code-length of one error point would be bounded below by the entropy $H(P)$ of the probability density $P$ of the errors (Yamanishi, 2023). The average code-length of each error term is also bounded above by one more than the entropy of $P$:
$$H(P) \;\leq\; E[L(e_i)] \;\leq\; H(P) + 1. \qquad (2)$$
If we assume that the model errors have a zero-mean Gaussian distribution, then the entropy of the errors can be approximated by the differential entropy,
$$H(P) \approx \tfrac{1}{2}\log\!\left(2\pi e \sigma^{2}\right),$$
where $\sigma^{2} = \frac{1}{L}(Y - X\theta)^{T}(Y - X\theta)$ and $L$ is the length of $Y$ (the number of time steps in the training data). Note that this assumption is compatible with the choice of linear or ridge regression for ESN parameter optimization. If ESN prediction errors are assumed to be normally distributed with zero mean, then the linear regression solution $\theta = (X^{T}X)^{-1}X^{T}Y$ follows from maximum likelihood estimation and the ridge regression solution $\theta = (X^{T}X + \alpha I)^{-1}X^{T}Y$ follows from expectation maximization with the Bayesian prior $\theta \sim \mathcal{N}\!\left(0, \tfrac{1}{\alpha^{2}} I\right)$. If the parameters $\theta$ are encoded using the maximum feasible precision $\gamma$, then the minimum total description length of the model would be bounded above by
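For concreteness, the Gaussian approximation of the error code length can be written as a small helper. This is our own illustrative sketch of the quantity described above (in nats), not code from the paper.

```python
import numpy as np

def error_code_length(Y, X, theta):
    """Approximate code length of the residuals under a zero-mean Gaussian model.

    Uses sigma^2 = (1/L)(Y - X theta)^T (Y - X theta) and the Gaussian differential
    entropy 0.5 * log(2 pi e sigma^2), accumulated over the L training points.
    """
    e = Y - X @ theta
    L = len(Y)
    sigma2 = e @ e / L
    return 0.5 * L * np.log(2.0 * np.pi * np.e * sigma2)
```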
However, the total description length can often be reduced by sacrificing parameter precision. Moreover, it is possible to approximate an optimal precision $\delta_i \geq \gamma$ for each parameter $\theta_i$. Parameters to which the model is more sensitive require greater precision. Since some models are more sensitive to their parameters than others and require generally higher parameter precisions, MDL is able to account for differences in complexity between models of the same size (Grünwald, 2007). This is a substantial advantage of MDL over similar techniques such as the Bayesian (Schwarz, 1978) and Akaike (Akaike, 1974) information criteria (BIC and AIC).
We denote by $\bar{\theta}$ the parameters $\theta$ rounded to their optimal precisions $\delta$. Using the rounded parameters gives a different total description length
In the continuous case (where we allow $\log\delta_i \in \mathbb{R}$), the rounding error $\bar{\theta} - \theta$ is of order $\delta$, and the difference $L(\theta) - L(\bar{\theta})$ approximated to second order is
We use a second-order approximation because the first derivative $\partial L/\partial\theta$ vanishes if the parameters $\theta$ are computed using linear regression. However, if ridge regression is used to compute $\theta$ with a ridge penalty of $\alpha$, then it follows from the ridge regression loss function (whose penalty term is $\alpha\,\theta^{T}\theta$) that $\frac{\partial L}{\partial\theta_i} = -\frac{\alpha\theta_i}{\sigma^{2}}$. Since linear regression is just ridge regression with zero penalty, we use the more general expression
The second order derivative, which we denote by Q, equates to the following expression:
The precisions $\delta_i$ are optimized when $L(\theta) - L(\bar{\theta})$ is minimized. This occurs when
We solve the equation above for $\delta_i$ numerically using Newton's method, which involves iterating
$$\delta \;\leftarrow\; \delta - J^{-1}F(\delta),$$
where $F_i(\delta) = \sum_j Q_{ij}\,\delta_j - \frac{1}{\delta_i} + \frac{\partial L}{\partial\theta_i}$ and $J$ is the Jacobian of $F(\delta)$. We repeat this until $\delta$ is close to the desired solution, i.e., until $\sum_i |F_i(\delta)|$ falls below some small threshold value.
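The Newton iteration itself is standard. Below is a generic sketch in which the residual function $F$ and its Jacobian $J$ are supplied by the caller, so it is agnostic to the exact form of $F_i(\delta)$; the function is our own illustration rather than the paper's implementation.

```python
import numpy as np

def newton_solve(F, J, delta0, tol=1e-10, max_iter=100):
    """Solve F(delta) = 0 by Newton's method: delta <- delta - J(delta)^{-1} F(delta)."""
    delta = np.asarray(delta0, dtype=float).copy()
    for _ in range(max_iter):
        residual = F(delta)
        # stop once sum_i |F_i(delta)| falls below the threshold, as described above
        if np.sum(np.abs(residual)) < tol:
            break
        delta = delta - np.linalg.solve(J(delta), residual)
    return delta
```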

However, an optimal model does not necessarily include all variables. An algorithmic approach that optimizes over parameter subsets was given by Judd and Mees (1995). In this approach, which we will refer to as the breathing algorithm, a parameter $\theta_{in}$ is added to the optimal parameter subset of size $k$, increasing it to size $k+1$, and then the parameter $\theta_{out}$ with the smallest fitted coefficient is dropped. The added parameter $\theta_{in}$ corresponds to the component with the strongest absolute correlation to the errors made by prediction with the parameter subset of size $k$. In other words, if $e_k$ is the vector of errors obtained when predicting with the current set of parameters, then $\theta_{in}$ corresponds to the component $X_{in}$ that maximizes $|X_{in} \cdot e_k|$. This is repeated until $\theta_{in}$ and $\theta_{out}$ are the same parameter, at which point $\theta_{in}$ is included in the optimal subset and $k \to k+1$. After a model has been found for each subset size using this process, the description length of each is determined and the readout subset with the lowest description length is selected. This is illustrated in Fig. 1. Extensions to this algorithm have been proposed by Nakamura et al. (2007).

FIG. 1.

An illustration of subset selection according to MDL theory. The solid line represents the description length of models with different node subset sizes. Subsets of each size are proposed by the breathing algorithm, and the vertical dashed line indicates the best of the proposed subsets according to MDL. In this demonstration, a ridge parameter of $\alpha = 10^{-18}$ was used and subsets were trained for the purpose of predicting the Lorenz system one time step (0.02 Lyapunov times) ahead.


Both the subset selection algorithm and the Newton iterations used to optimize the coefficient precisions $\delta_i$ increase the computational complexity of the training process in most cases. If the subset selection algorithm is allowed to terminate early, however, the method detailed above can reduce computational cost by avoiding the inversion of large matrices. In our implementation, using Python code on an off-the-shelf laptop, ESNs with up to 1000 nodes required less than a minute to train.

We measure the performance of ESNs quantitatively by the mean square prediction error (MSE) and qualitatively with autonomous attractor reconstruction tasks. We calculate the MSE on $n$-step ahead prediction tasks for an unseen time series. First, $X^{(train)}$ is generated for a time series $u_{train}(t)$ according to Eq. (1). Coefficients $\theta$ are then calculated for each forecasting horizon $m\,\delta t$ such that $X^{(train)}_t\theta$ approximates $u_{train}(t + m\,\delta t)$. We then use a new time series $u_{test}(t)$ to generate $X^{(test)}$ and measure the mean squared error $(Y^{(test)} - X^{(test)}\theta)^{T}(Y^{(test)} - X^{(test)}\theta)$, where $Y^{(test)}_t = u_{test}(t + m\,\delta t)$. In autonomous reconstruction tasks, the ESN-predicted input state one time step (0.02 Lyapunov times) ahead is provided as the ESN input for the following time step, i.e., $u(t + \delta t) = \hat{y}(t)$.
In Pathak et al. (2017), autonomous reconstruction is referred to as modeling an attractor's climate—meaning its global shape and characteristics. Models that accurately fit the climate of an attractor generate reconstructions with similar topologies, as measured by their persistent homologies and fractal structures (Tan et al., 2021). The distinction between a model with low MSE and one capable of reconstruction is akin to the distinction between accurate short-term local weather forecasting and accurate modeling of global climate patterns. Reservoirs capable of performing such reconstruction tasks are useful for estimating invariant properties of systems such as the Lyapunov spectrum or fractal dimension (Hart, 2024).
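A minimal sketch of this closed-loop (autonomous) mode is given below, assuming one-step readout weights `theta` and the update rule of Eq. (1); the variable names and the scalar-input convention are ours.

```python
import numpy as np

def autonomous_run(x0, u0, M, W, theta, steps, beta=0.0):
    """Run the ESN autonomously: each one-step prediction becomes the next input."""
    x, u = np.array(x0, dtype=float), float(u0)
    trajectory = np.zeros(steps)
    for t in range(steps):
        x = beta * x + (1.0 - beta) * np.tanh(M @ x + W * u)
        u = float(theta @ x)  # predicted input state fed back as the next drive
        trajectory[t] = u
    return trajectory
```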
We also measure the noise robustness of ESNs by adding Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^{2}_{noise})$ with variance $\sigma^{2}_{noise}$ to the desired outputs $Y_t$ and node states $X_{i,t}$. Note that we do not drive the reservoir with a noisy time series or perturb the reservoir state; this noise is added to the design matrix $X$ and target outputs $Y$ after running the ESN. For example, using ridge regression with added noise and penalty size $\alpha$ would give the parameters
$$\theta = (\tilde{X}^{T}\tilde{X} + \alpha I)^{-1}\tilde{X}^{T}\tilde{Y},$$
where $\tilde{X}_{i,j} = X_{i,j} + \epsilon_{i,j}$ and $\tilde{Y}_i = Y_i + \epsilon_i$.
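In code, this amounts to perturbing the design matrix and targets after the reservoir has been run and before the ridge fit (a sketch under the same NumPy conventions as earlier; the function name is ours).

```python
import numpy as np

def fit_with_added_noise(X, Y, alpha, sigma_noise, rng=None):
    """Add Gaussian noise to node states and targets, then fit the ridge readout."""
    rng = np.random.default_rng() if rng is None else rng
    X_noisy = X + rng.normal(0.0, sigma_noise, size=X.shape)
    Y_noisy = Y + rng.normal(0.0, sigma_noise, size=Y.shape)
    n = X.shape[1]
    return np.linalg.solve(X_noisy.T @ X_noisy + alpha * np.eye(n), X_noisy.T @ Y_noisy)
```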

We use these metrics to compare ESN performance at modeling the standard Lorenz [$(\sigma, \rho, \beta) = (10, 28, 8/3)$], Rössler [$(a, b, c) = (0.1, 0.1, 14)$], and Thomas ($b = 0.208186$) systems over a range of forecasting horizons. The equations for these systems are provided in the Appendix. Since the characteristic timescales of each system are different, forecasting horizons are measured in Lyapunov time, which is the inverse of the leading Lyapunov exponent. We estimate the leading Lyapunov exponents of the Lorenz, Rössler, and Thomas systems with the coefficients above to be roughly $0.9121 \pm 0.0044$, $0.0660 \pm 0.0068$, and $0.0378 \pm 0.0040$, respectively. Accepted values for these systems are around 0.9057 for Lorenz (Viswanath, 1998) and 0.0714 for Rössler (Sprott, 2003). Our calculations are made using the method described by Sandri (1996), and the provided error margins are three times the standard deviation of bootstrap sample means. The Lyapunov times ($T_L$) we use are 1.096, 15.14, and 26.48. Given the difference between the timescale of the Lorenz attractor and those of Rössler and Thomas, we use $\delta t = 0.02\,T_L$ for each system (this equates to 0.02 s for Lorenz, 0.3 s for Rössler, and 0.53 s for Thomas). A time delay embedding of each attractor is illustrated in Fig. 2.

FIG. 2.

Time delay embeddings of the Lorenz (left), Rössler (center), and Thomas (right) attractors. The embedding delay τ is 0.08 s for the Lorenz attractor, 1.6 s for Rössler, and 4 s for Thomas.


In our experiments, we use the hyper-parameters outlined below:

  1. Network size: We use 200-node ESNs unless otherwise specified.

  2. Activation function: We use the hyperbolic tangent $\tanh(\cdot)$ activation function unless otherwise specified.

  3. Spectral radius: $\rho = 1$, or $\rho = 0.9$ when a linear activation function is used.

  4. Input weight matrix: We draw input weights from a zero-mean Gaussian distribution, $W \sim \mathcal{N}(0, 0.3\,I)$.

  5. Adjacency matrix: We create the adjacency matrix $M$ by first taking the adjacency matrix $A$ of a Watts–Strogatz random network with average degree 6 and re-wiring probability $10/n$, where $n$ is the network size. We then replace the non-zero entries of $A$ with independent samples from a zero-mean normal distribution and scale the resulting matrix by its spectral radius (see the sketch following this list).

  6. Time interval between inputs: $\delta t = 0.02\,T_L$, where $T_L$ is the Lyapunov time of the driving system.

  7. Testing and training data: We use 5000 points from each time series to train the ESNs. They are then tested on a new time series from the same system with 25 000 points.

  8. Leakage: β = 0. Non-zero leakage is beneficial for longer forecasting horizons. However, we avoid using leakage for simplicity.
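The reservoir construction in item 5 might be realized as follows. This is an illustrative sketch using networkx for the Watts–Strogatz graph; it follows the hyper-parameters listed above but is not the authors' code.

```python
import numpy as np
import networkx as nx

def build_reservoir(n=200, spectral_radius=1.0, avg_degree=6, seed=0):
    """Construct the adjacency matrix M and input weight vector W described above."""
    rng = np.random.default_rng(seed)
    # Watts-Strogatz network with average degree 6 and rewiring probability 10/n
    graph = nx.watts_strogatz_graph(n, avg_degree, 10.0 / n, seed=seed)
    A = nx.to_numpy_array(graph)
    # replace non-zero entries with independent zero-mean Gaussian samples
    M = np.where(A != 0, rng.normal(0.0, 1.0, size=(n, n)), 0.0)
    # rescale so the spectral radius of M equals the chosen value
    M *= spectral_radius / np.max(np.abs(np.linalg.eigvals(M)))
    # input weights drawn from a zero-mean Gaussian with variance 0.3
    W = rng.normal(0.0, np.sqrt(0.3), size=n)
    return M, W
```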

Typically in reservoir computing, ridge regression is used for regularization to reduce model over-fitting. The penalty term introduced by ridge regression reduces the overall magnitude of the coefficients and improves numerical stability in the case of perfect or imperfect multicollinearity. However, ridge regression includes all model parameters and does not encourage sparsity as strongly as alternatives like Lasso regression (Tibshirani, 1996). There is a slightly different rationale behind selecting readout subsets using MDL and the breathing algorithm described in Sec. II B. Reservoir components that have little influence on model performance are neglected completely. Multicollinearity in the matrix of output states $X$ is avoided by dropping components that are highly correlated with any others already contained within the optimal component subset. When ridge regression is used in conjunction with MDL subset selection, smaller ridge penalties $\alpha$ tend to be required, since a reduction in linear dependence in the design matrix $X$ results in a better-conditioned matrix inverse $(X^{T}X + \alpha I)^{-1}$.

MDL also encourages sparsity by limiting readout size. The size of the readout chosen using MDL is bounded above according to the amount of data used to train the model. This is because a model cannot achieve a minimum total description length if its parameters require the specification of more information than the target dataset. In reality, the optimal subset size will fall well below this limit for a good model.

In Fig. 3, we compare the sparsity of ESNs trained according to the MDL principle with the effective sparsity of ESNs trained using ridge regression alone. We measure the effective sparsity of the ridge regression readout by the number of coefficients required to constitute an arbitrary percentage $\Omega$ of the norm of the fitted coefficient vector $r(\alpha)$. We define $N^{\Omega}_{\text{eff}}(r(\alpha))$ to be the minimum size of the set of "useful" coefficients $C$ for which $|\tilde{r}|/|r(\alpha)| \geq \Omega/100$, where
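A direct implementation of this effective-size measure might look as follows. This is our own sketch of the definition above; `r` is the fitted coefficient vector and `omega` is the percentage $\Omega$.

```python
import numpy as np

def effective_readout_size(r, omega=99.0):
    """Smallest number of coefficients whose vector norm reaches omega percent of |r|."""
    order = np.argsort(np.abs(r))[::-1]          # largest-magnitude coefficients first
    target = (omega / 100.0) * np.linalg.norm(r)
    for k in range(1, len(r) + 1):
        if np.linalg.norm(r[order[:k]]) >= target:
            return k
    return len(r)
```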
FIG. 3.

Effective readout size for ESNs trained with and without MDL subset selection. Solid lines represent the $\log_{10}$ number of nodes in the MDL subset, while dotted lines are the effective readout sizes [$N^{99}_{\text{eff}}(r(\alpha))$] when the whole readout is trained. ESNs are trained to predict the Lorenz system one time step (0.02 Lyapunov times) ahead using 5000 time-series points. The MDL readout size remains roughly constant beyond some critical ESN size.


If we compare this measurement to the size of the ESN, we find that the effective readout size continues to increase with the readout size when ridge regression is used on its own (Fig. 3). Perhaps counter-intuitively, readouts trained with weaker regularization (smaller $\alpha$) appear more sparse. However, this is because the matrix inverse $(X^{T}X + \alpha I)^{-1}$ may become poorly conditioned for small $\alpha$, causing some coefficients to blow up and account for a large part of the norm of the coefficient vector. On the other hand, the size of the MDL readout is substantially smaller than that produced by ridge regression for large ESNs. The size of the MDL readout reaches a plateau beyond some sufficient ESN size.

Despite the sparsity of the MDL-trained readout, it performs better than ridge regression at the task of $n$-step ahead prediction for the Lorenz, Rössler, and Thomas systems. In the experiment below (Fig. 4), we compare the mean square prediction error of ESNs trained using either the entire readout (shown with dashed lines) or MDL readout subsets (shown with solid lines). MSE was found for three different forecasting horizons across a range of ridge regression penalty sizes. Results were averaged across 30 trials, and error bars indicate the 5th and 95th percentiles of bootstrap sample means. We find that the benefit of MDL subset selection is most apparent on short forecasting horizons. This is likely because model over-fitting is easier on the comparatively simpler task of short-term forecasting, which would also explain why stronger regularization (larger $\alpha$) is needed for shorter horizons.

FIG. 4.

Prediction error ($\log_{10}$ MSE) by ridge regression penalty ($\log_{10}\alpha$) on $n$-step prediction tasks for ESNs trained on the Lorenz, Rössler, and Thomas systems. The solid lines are ESNs trained using MDL readout subsets, while the dashed lines are trained using the entire readout. The black, blue, and red lines represent forecasting horizons of 0.02, 0.2, and 1 Lyapunov time, respectively. ESNs were trained using 5000 points from each time series. Each line represents the average across 30 samples, with error bars representing the 5th to 95th percentiles of bootstrap sample means.


Improvements from MDL subset selection at $n$-step ahead prediction are also apparent with 1000-node ESNs (Fig. 5) but not with small (50-node) ESNs (Fig. 6). It is not surprising that MDL provides no benefit to small reservoirs, since the size of the optimal subset according to MDL is the same as the size of the full readout (Fig. 3) for ESNs with fewer than around 100 nodes. However, performance is better in the larger (200- or 1000-node) ESNs, where MDL readout subset selection is beneficial.

FIG. 5.

Prediction error at n-step ahead prediction of Lorenz as in Fig. 4 but using a 1000-node ESN.

FIG. 6.

Prediction error at n-step ahead prediction of Lorenz as in Fig. 4 using a 50-node ESN.


We also test the robustness to noise of MDL-trained readouts in Fig. 7 and their ability to perform reconstruction in Fig. 8. Robustness to noise is similar in ESNs trained with and without MDL subset selection. The improved performance of the MDL readouts at low noise levels and short forecasting horizons simply follows from the results in Fig. 4. Performance in ESNs that use the entire readout is improved by low levels of Gaussian noise. This is because low levels of Gaussian noise reduce linear dependence in $X$ and result in a better-conditioned inverse $(X^{T}X + \alpha I)^{-1}$. However, performance is not improved by low levels of Gaussian noise (with variance above $10^{-6}$) when MDL subsets are selected, since linear dependence in $X$ is already reduced.

FIG. 7.

Noise robustness for ESNs trained using the entire readout (top row) or MDL readout subsets (bottom row) and ridge regression penalties $\alpha \in \{10^{-8}, 10^{-12}, 10^{-16}\}$. The x-axis of each panel is the forecasting horizon in Lyapunov time, and the y-axis is the $\log_{10}$ standard deviation of added Gaussian noise. The color of each heat map represents the $\log_{10}$ MSE for $n$-step ahead prediction of the Lorenz attractor.

FIG. 8.

Reconstruction composites of the Lorenz attractor made by superimposing 30 time delay embeddings of ESN autonomous reconstructions. ESNs in the top row were trained using the whole readout, while ESNs in the bottom row used MDL readout subsets. In all cases, they were trained to predict one time step ($\delta t = 0.02\,T_L$) ahead. The appearance of shaded backgrounds and diagonal bands comes from failed reconstructions. The embedding delay $\tau$ is 0.08 s, and standard hyper-parameters were used (see Sec. II C).


In autonomous reconstruction tasks, like in n-step ahead prediction tasks, ESNs trained using MDL readout subsets produce a similar performance to those trained using the whole readout when ridge penalties are large (Fig. 8). When penalties are smaller, performance appears to be qualitatively better with MDL subsets. However, MDL is arguably less relevant for autonomous reconstruction since ideal ridge parameters are larger.

Much work has been done lately to improve the performance of ESNs by incorporating higher-order products of reservoir components in their otherwise linear readouts. The performance improvements of this readout augmentation are particularly striking when linear activation functions are used in place of more conventional non-linear activation functions like $\tanh(\cdot)$. As demonstrated by Bollt (2021), ESNs with a linear activation function become vector auto-regressive (VAR) models whose performance may be improved with quadratic terms. These quadratic terms are introduced by taking the product of pairs of network nodes. In general, it has been shown that by introducing non-linearity in just the readout of an otherwise linear dynamical system, a wide range of signals may be approximated (Boyd and Chua, 1985). It has also been shown that quadratic terms can help to break harmful symmetries that lead to mirror attractors in autonomous reconstruction tasks (Herteux and Räth, 2020).

If we build on the quadratic readout in Bollt's paper and include network sub-graphs of arbitrary size by taking the product of sub-graph node states, then this VAR becomes a polynomial model.

Readout weights are determined for augmented readouts as they otherwise would be, except that columns representing sub-graph states are added to the design matrix $X$. For example, if a triangle involving nodes 1, 3, and 14 is added to the readout of a 200-node ESN, then the entry $X_{201,t}$ would take the value $X_{1,t}X_{3,t}X_{14,t}$.
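In code, appending such columns is straightforward. The sketch below assumes a design matrix with one row per time step (the transpose of the indexing used in the example above) and takes the sub-graphs as tuples of zero-based node indices; it is illustrative rather than the authors' implementation.

```python
import numpy as np

def augment_design_matrix(X, subgraphs):
    """Append one column per sub-graph, equal to the product of its node-state columns.

    X: (T, n) matrix of node states; subgraphs: iterable of index tuples, e.g.
    (0, 2, 13) appends the column X[:, 0] * X[:, 2] * X[:, 13].
    """
    extra = [np.prod(X[:, list(nodes)], axis=1) for nodes in subgraphs]
    return np.column_stack([X] + extra)
```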

In Fig. 9, we demonstrate that the inclusion of sub-graph states can improve performance substantially when a linear activation function is used. These improvements are perhaps unsurprising since, without higher-order sub-graph states, the reservoir output is a purely linear combination of historical inputs, and linear approximations of dynamical systems are only suitable over very short time horizons. However, if the whole readout is trained and small ridge penalties are used, performance decreases with readout augmentation.

FIG. 9.

Prediction error ($\log_{10}$ MSE) by ridge regression penalty ($\log_{10}\alpha$) for $n$-step ahead prediction of Lorenz using a linear activation function. Dashed lines represent ESNs trained with the whole readout while solid lines use MDL readout subsets. Dark lines indicate the inclusion of higher-order sub-graph states while the faded lines represent performance when just node states are included in the readout. Augmentation with higher-order states improves performance if MDL readout subsets are used or if ridge penalties are large. However, performance decreases with augmentation for small ridge penalties if the whole readout is trained.


Readout augmentation also improves performance for MDL readout subsets when a non-linear (hyperbolic tangent) activation function is used. Performance improvements from augmentation still occur, but are smaller, when the entire readout is trained. This is demonstrated in Fig. 10, where the tests performed in Fig. 4 are repeated with and without edge, triangle, and four-cycle states included in the readout.

FIG. 10.

Prediction error ($\log_{10}$ MSE) by ridge regression penalty ($\log_{10}\alpha$) on $n$-step prediction tasks for ESNs with augmented readouts. The solid lines are ESNs trained using MDL readout subsets, while the dashed lines are trained using the entire readout. Dark lines indicate the inclusion of higher-order sub-graph states, while the faded lines represent performance when just node states are included in the readout. The black, blue, and red lines represent forecasting horizons of 0.02, 0.2, and 1 Lyapunov time, respectively.


In Figs. 11 and 12, we repeat the noise robustness and autonomous reconstruction experiments from Sec. III. The best performance in noise robustness tests using the whole readout occurs with a larger amount of noise when sub-graph states are included. This is likely because noise has a regularizing effect and more regularization is needed for the larger readout. Augmentation appears to reduce robustness to high levels of Gaussian noise for ESNs trained using MDL readout subsets.

FIG. 11.

Noise robustness for augmented ESNs trained using the entire readout (top row) or MDL readout subsets (bottom row) and ridge regression penalties $\alpha \in \{10^{-8}, 10^{-12}, 10^{-16}\}$. The x-axis of each panel is the forecasting horizon in Lyapunov time, and the y-axis is the $\log_{10}$ standard deviation of added Gaussian noise. The color of each heat map represents the $\log_{10}$ MSE for $n$-step ahead prediction of the Lorenz attractor.

FIG. 12.

Reconstruction composites of the Lorenz attractor made by superimposing 30 time delay embeddings of ESN autonomous reconstructions. ESNs were augmented with higher-order sub-graph states and trained using the whole augmented readout (top row) or MDL subsets of the augmented readout (bottom row). The embedding delay and hyper-parameters are the same as in Fig. 8.


Reconstruction performance with and without augmentation is harder to compare due to its qualitative nature. However, the absence of dark bands off the attractor when $\alpha = 10^{-12}$ indicates an improvement over Fig. 8. Again, no successful reconstruction was achieved using the entire readout with $\alpha = 10^{-16}$.

An important finding from Secs. III and IV is that MDL readout subsets require less regularization. For example, we can see from n-step ahead prediction and autonomous reconstruction performance that large ridge penalties are needed when training the entire readout. However, performance on both tasks remains reasonable when smaller ridge penalties are used to train MDL readout subsets. Regularization from low levels of added Gaussian noise is beneficial for training the full readout but not for training MDL readout subsets.

Regularization in the form of a ridge regression penalty or added Gaussian noise helps stabilize the matrix inverse $(X^{T}X + \alpha I)^{-1}$ if the columns of $X$ [which are the time series of node or sub-graph states $(x_i(t_0), \ldots, x_i(t_T))$] are linearly dependent. We argued in Sec. III that this linear dependence is often avoided when using MDL readout subsets. We can test this theory with consistency capacity (see Subsection 3 of the Appendix), which measures the number of linearly independent components of the driving signal contained within the reservoir (Lymburn et al., 2019).

The consistency spectrum, as distinct from the consistency capacity, is a ranked set of reservoir component consistencies. The consistency of each reservoir component indicates its degree of functional dependence on the driving signal, and their sum is the consistency capacity. By looking at the consistency spectrum in Fig. 13, we can see that additional independent signal components are provided by the higher-order sub-graph states.

FIG. 13.

Consistency profile of augmented and non-augmented ESNs (with 200 nodes, and 200 edges, triangles, and four-cycles in the augmented case). MDL subsets are selected for the purpose of predicting the Lorenz system 0.02 Lyapunov times ahead. The vertical axis shows the size of the $j$th largest consistent component. Additional linearly independent components of the driving signal are contained within the higher-order sub-graph states.


We can also measure an effective consistency capacity for readout subsets (with or without higher-order states) by following the calculation in Subsection 3 of the Appendix but including only the columns of $X$ and $X'$ that correspond to elements of the readout. Rather than measuring the number of linearly independent signal components contained within the entire reservoir state, this effective consistency capacity measures the number of such components contained within the sub-state of the reservoir that gets mapped to the output. If coefficients are only fitted for a subset of readout components, then it is this effective consistency that matters. We normalize the effective consistency capacity of a readout subset by the subset size to provide a measure of the proportion of the readout that corresponds to linearly independent components of the driving signal.

We demonstrate in Fig. 14 that using the breathing algorithm with MDL as a subset selection criterion improves readout consistency. This means that the proportion of readout components in the MDL readout subset that are linearly independent and functionally dependent on the driving signal is higher than in the whole readout.

FIG. 14.

Total consistency of random readout subsets (black) compared with subsets chosen by the breathing algorithm (red) prior to (solid) and after (dashed) augmentation. The vertical lines indicate the optimal subset size according to MDL theory. Optimal subset sizes occur while consistency remains high and the breathing algorithm selects high-consistency subsets with and without readout augmentation.


One mechanism for this improvement is the reduction in readout size. Smaller readouts tend to be more consistent even with random subset selection. In theory, the optimal subset size according to MDL theory should occur while consistency is still high. Once a subset size is reached where new components are less relevant to the signal or similar to those already contained, any reduction in $L(e)$ is less likely to be worth the increase in $L(\theta)$.

The breathing algorithm also tends to select component subsets with disproportionately high consistency capacity when readouts are small. This is because, at each subset size, the algorithm chooses the predictor $\theta_{in}$ most strongly related to the prediction error. With large enough subsets, the breathing algorithm may eventually choose less consistent readouts than random sampling—as is the case when the readout is augmented in Fig. 14. However, this only occurs for readout sizes larger than the optimal size according to MDL, and it happens because there may be signal components that are consistent but not useful: irrelevant signal components will not be related to the prediction error.

In this paper, we demonstrated the use of minimum description length in reservoir training. In doing so, we found that using MDL readout subsets is beneficial for $n$-step ahead prediction tasks and results in qualitatively similar attractor reconstructions. One explanation for the advantages of MDL readout subsets is that they contain less linear dependence, which we demonstrated using consistency capacity. This reduction in linear dependence also made MDL readout subsets less sensitive to the size of the ridge penalty.

We argued that decreased linear dependence in MDL readout subsets was only partly explained by their smaller size. The other explanation was that the breathing algorithm selected subsets with higher consistency than random sampling.

The benefits of using the MDL principle in ESN training that we found did not apply to small (50-node) ESNs. However, these small ESNs produced inferior performance compared to their larger counterparts, which were improved by MDL subset selection.

As such, we believe that our findings provide a strong case for the consideration of MDL in practical reservoir training tasks and that this case is strengthened by recent attention to readout augmentation.

A.M. was supported by the Australian Government RTP scholarship at the University of Western Australia. M.S. and D.M.W. are supported by the ARC Discovery Grant (No. DP200102961), funded by the Australian Government. M.S. also acknowledges the support of the Australian Research Council through the Centre for Transforming Maintenance through Data Science (Grant No. IC180100030).

The authors have no conflicts to disclose.

Antony Mizzi: Conceptualization (equal); Data curation (equal); Investigation (equal); Methodology (equal); Software (equal); Visualization (equal); Writing – original draft (equal). Michael Small: Conceptualization (equal); Methodology (equal); Project administration (equal); Software (equal); Supervision (equal); Validation (equal); Writing – review & editing (equal). David M. Walker: Methodology (equal); Project administration (equal); Supervision (equal); Validation (equal); Writing – review & editing (equal).

The data that support the findings of this study are openly available in MDL_ESNs, at https://github.com/antony-mizzi/MDL_ESNs.git (Mizzi, 2025).

1. Attractors

We generate time series from the Lorenz, Rössler, and Thomas systems using Euler integration with a step of $dt = 0.002$ for Lorenz and $dt = 0.02$ for Rössler and Thomas. We then sample points at intervals of $\delta t = 0.02$ for Lorenz and $\delta t = 0.2$ for Rössler and Thomas. The set of sampled points is used for reservoir prediction tasks. The equations for each system are given below:

Lorenz [$(\sigma, \rho, \beta) = (10, 28, 8/3)$]:
$$\dot{x} = \sigma(y - x), \qquad \dot{y} = x(\rho - z) - y, \qquad \dot{z} = xy - \beta z.$$
Rössler [$(a, b, c) = (0.1, 0.1, 14)$]:
$$\dot{x} = -y - z, \qquad \dot{y} = x + ay, \qquad \dot{z} = b + z(x - c).$$
Thomas ($b = 0.208186$):
$$\dot{x} = \sin(y) - bx, \qquad \dot{y} = \sin(z) - by, \qquad \dot{z} = \sin(x) - bz.$$
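A minimal Euler-integration sketch for the Lorenz system is shown below; the Rössler and Thomas systems follow the same pattern with their own step sizes. The initial condition and the choice to return only the x component are our own.

```python
import numpy as np

def lorenz_series(n_samples, dt=0.002, sample_every=10,
                  sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Euler-integrate the Lorenz system and sub-sample the x component."""
    x, y, z = 1.0, 1.0, 1.0
    samples = []
    for step in range(n_samples * sample_every):
        dx = sigma * (y - x)
        dy = x * (rho - z) - y
        dz = x * y - beta * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        if step % sample_every == 0:          # dt * sample_every = 0.02 time units
            samples.append(x)
    return np.array(samples)
```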

2. Breathing algorithm

Let $X$ denote the state matrix whose rows $x_i(t)$ contain the states of the $i$th node at times $t \in \{0, \delta t, \ldots, (T-1)\delta t\}$. We require $X$ to be normalized such that $x_i(t)x_i^{T}(t) = x_j(t)x_j^{T}(t)$ for all $i, j$. If $B \subseteq \{1, 2, \ldots, N\}$ is a node subset, then $X_B$ is the sub-matrix with rows corresponding to nodes in $B$. Initially, we set $B = \emptyset$.

  1. Fit coefficients $\theta_B = (X_B X_B^{T} + \alpha I)^{-1} X_B Y$ for the nodes in the subset $B$, which has size $k$. Denote the errors of this fit by $e_B = Y - X_B^{T}\theta_B$.

  2. Pick the row $i$ that corresponds to the maximum absolute entry of $X e_B$.

  3. Fit coefficients $\theta_{B'}$ for the matrix $X_{B'}$ with rows corresponding to nodes in $B' = B \cup \{i\}$. Find the parameter $j$ with the smallest absolute entry in $\theta_{B'}$.

  4. If $i = j$ (the newly included row has the smallest absolute parameter), then include $i$ in the optimal parameter subset ($B \to B \cup \{i\}$, $k \to k+1$).

  5. If $i \neq j$, then set $B \to (B \cup \{i\}) \setminus \{j\}$ and repeat from step 1.

We terminate this process when $B = \{1, 2, \ldots, N\}$.
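A compact sketch of this procedure is given below (NumPy-style, following the steps above). The description-length evaluation that ultimately selects among the returned subsets is omitted, and the swap-cycle guard is our own safeguard rather than part of the original algorithm.

```python
import numpy as np

def breathing_algorithm(X, Y, alpha=1e-8):
    """Propose one candidate node subset per subset size, following Judd and Mees (1995).

    X: (n, T) normalized node-state matrix (rows are nodes), Y: (T,) target vector.
    Returns a list of candidate subsets, one for each size 1..n.
    """
    n = X.shape[0]
    B = []                                    # current optimal subset (node indices)
    subsets = []
    while len(B) < n:
        swaps = 0
        while True:
            # step 1: fit the current subset and compute its errors (e = Y if B is empty)
            if B:
                XB = X[B]
                theta = np.linalg.solve(XB @ XB.T + alpha * np.eye(len(B)), XB @ Y)
                e = Y - XB.T @ theta
            else:
                e = Y.copy()
            # step 2: candidate i is the node most strongly correlated with the errors
            corr = np.abs(X @ e)
            corr[B] = -np.inf
            i = int(np.argmax(corr))
            # step 3: refit with i included and find the smallest-magnitude coefficient
            trial = B + [i]
            Xt = X[trial]
            theta_t = np.linalg.solve(Xt @ Xt.T + alpha * np.eye(len(trial)), Xt @ Y)
            j = trial[int(np.argmin(np.abs(theta_t)))]
            # step 4: if the newcomer survives, grow the subset by one
            if j == i or swaps > 2 * n:       # guard against rare swap cycles
                B = trial
                break
            # step 5: otherwise swap j out and repeat
            B = [b for b in trial if b != j]
            swaps += 1
        subsets.append(list(B))
    return subsets
```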

3. Consistency capacity

Consistency capacity is a measure of the functional dependence of the reservoir state on the input signal. In fully consistent reservoirs, the echo-state property holds completely. Consistency capacity is measured as follows.

Two copies of the same reservoir, $x$ and $x'$, with identical adjacency matrices $M$ and input weights $W$ but different initial states $x(t_0) \neq x'(t_0)$, are driven by the same time series $u(t)$. The corresponding design matrices $X$ and $X'$ are mean centered, and a whitening transformation matrix $\Omega$ is defined for $X$,
$$\Omega = Q\,\Sigma^{-1/2}\,Q^{T},$$
where $Q$ is a matrix of eigenvectors of $X^{T}X$ and $\Sigma$ is a diagonal matrix of the corresponding eigenvalues. The consistency matrix $C$ is the covariance matrix between the two whitened state matrices,
$$C = (X\Omega)^{T}(X'\Omega).$$
The consistency capacity is the trace of $C$, and the consistency profile is the set of eigenvalues of $C$. This consistency profile is illustrated in Fig. 13 with the eigenvalues sorted by value. If we are interested in the number of linearly independent components of the driving signal contained in a subset of reservoir elements, we can use an effective consistency capacity. This is calculated in the same way as the true consistency capacity, but only the columns of $X$ and $X'$ that correspond to elements in the subset are included. For example, if the subset $B$ contains network nodes 1, 3, and 14, then $X_B$ and $X'_B$ consist of columns 1, 3, and 14 of $X$ and $X'$, and the effective consistency using the subset $B$ is found as follows:
$Q_B$ and $\Sigma_B$ are the matrices of eigenvectors and eigenvalues of $X_B^{T}X_B$. We then follow the same process as before to get the effective whitening matrix
$$\Omega_B = Q_B\,\Sigma_B^{-1/2}\,Q_B^{T}$$
and effective consistency matrix
$$C_B = (X_B\,\Omega_B)^{T}(X'_B\,\Omega_B).$$
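A sketch of this calculation is given below for two replica state matrices `X1` and `X2` with one row per time step. Whitening both replicas with the transform built from `X1`, and sorting the eigenvalues by real part, are our own conventions for this illustration.

```python
import numpy as np

def consistency_capacity(X1, X2, eps=1e-12):
    """Consistency capacity and profile from two replica state matrices of shape (T, n)."""
    X1 = X1 - X1.mean(axis=0)
    X2 = X2 - X2.mean(axis=0)
    # whitening transform Omega = Q Sigma^{-1/2} Q^T from the eigendecomposition of X1^T X1
    vals, Q = np.linalg.eigh(X1.T @ X1)
    Omega = Q @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ Q.T
    # consistency matrix between the two whitened replicas
    C = (X1 @ Omega).T @ (X2 @ Omega)
    capacity = np.trace(C)
    profile = np.sort(np.linalg.eigvals(C).real)[::-1]   # consistency spectrum
    return capacity, profile
```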
Akaike, H., "A new look at the statistical model identification," IEEE Trans. Autom. Control 19, 716–723 (1974).

Bollt, E., "On explaining the surprising success of reservoir computing forecaster of chaos? The universal machine learning dynamical system with contrast to VAR and DMD," Chaos 31, 013108 (2021).

Boyd, S. and Chua, L., "Fading memory and the problem of approximating nonlinear operators with Volterra series," IEEE Trans. Circuits Syst. 32, 1150–1161 (1985).

Duan, X.-Y., Ying, X., Leng, S.-Y., Kurths, J., Lin, W., and Ma, H.-F., "Embedding theory of reservoir computing and reducing reservoir network using time delays," Phys. Rev. Res. 5, L022041 (2023).

Grünwald, P. D., The Minimum Description Length Principle (MIT Press, 2007).

Hart, J. D., "Attractor reconstruction with reservoir computers: The effect of the reservoir's conditional Lyapunov exponents on faithful attractor reconstruction," Chaos 34, 043123 (2024).

Herteux, J. and Räth, C., "Breaking symmetries of the reservoir equations in echo state networks," Chaos 30, 123142 (2020).

Jaeger, H., "The 'echo state' approach to analysing and training recurrent neural networks-with an erratum note," GMD Technical Report 148, 13 (German National Research Center for Information Technology, Bonn, Germany, 2001).

Judd, K. and Mees, A., "On selecting models for nonlinear time series," Physica D 82, 426–444 (1995).

Jüngling, T., Lymburn, T., and Small, M., "Consistency hierarchy of reservoir computers," IEEE Trans. Neural Netw. Learn. Syst. 33, 2586–2595 (2021).

Lymburn, T., Khor, A., Stemler, T., Corrêa, D. C., Small, M., and Jüngling, T., "Consistency in echo-state networks," Chaos 29, 023118 (2019).

Maass, W., Natschläger, T., and Markram, H., "Real-time computing without stable states: A new framework for neural computation based on perturbations," Neural Comput. 14, 2531–2560 (2002).

Mizzi, A., "MDL_ESNs," GitHub repository, https://github.com/antony-mizzi/MDL_ESNs.git (2025).

Nakajima, K. and Fischer, I., Reservoir Computing (Springer, Singapore, 2021).

Nakamura, T., Hirata, Y., Judd, K., Kilminster, D., and Small, M., "Improved parameter estimation from noisy time series for nonlinear dynamical systems," Int. J. Bifurcation Chaos 17, 1741–1752 (2007).

Nakamura, T., Judd, K., and Mees, A., "Refinements to model selection for nonlinear time series," Int. J. Bifurcation Chaos 13, 1263–1274 (2003).

Pathak, J., Hunt, B., Girvan, M., Lu, Z., and Ott, E., "Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach," Phys. Rev. Lett. 120, 024102 (2018a).

Pathak, J., Lu, Z., Hunt, B. R., Girvan, M., and Ott, E., "Using machine learning to replicate chaotic attractors and calculate Lyapunov exponents from data," Chaos 27, 121102 (2017).

Ren, H.-H., Bai, Y.-L., Fan, M.-H., Ding, L., Yue, X.-X., and Yu, Q.-H., "Constructing polynomial libraries for reservoir computing in nonlinear dynamical system forecasting," Phys. Rev. E 109, 024227 (2024).

Rissanen, J., "Modeling by shortest data description," Automatica 14, 465–471 (1978).

Sandri, M., "Numerical calculation of Lyapunov exponents," Math. J. 6, 78–84 (1996).

Schrauwen, B., Verstraeten, D., and Van Campenhout, J., "An overview of reservoir computing: Theory, applications and implementations," in Proceedings of the 15th European Symposium on Artificial Neural Networks (ESANN, 2007), pp. 471–482.

Schwarz, G., "Estimating the dimension of a model," Ann. Stat. 6, 461–464 (1978).

Sprott, J., Chaos and Time-Series Analysis (Oxford University Press, 2003), ISBN 9780198508397.

Sun, C., Song, M., Cai, D., Zhang, B., Hong, S., and Li, H., "A systematic review of echo state networks from design to application," IEEE Trans. Artif. Intell. 5, 23–37 (2022).

Tan, E., Corrêa, D., Stemler, T., and Small, M., "Grading your models: Assessing dynamics learning of models using persistent homology," Chaos 31, 123109 (2021).

Tanaka, G., Yamane, T., Héroux, J. B., Nakane, R., Kanazawa, N., Takeda, S., Numata, H., Nakano, D., and Hirose, A., "Recent advances in physical reservoir computing: A review," Neural Netw. 115, 100–123 (2019).

Tibshirani, R., "Regression shrinkage and selection via the lasso," J. R. Stat. Soc. Ser. B 58, 267–288 (1996).

Viswanath, D., Lyapunov Exponents from Random Fibonacci Sequences to the Lorenz Equations (Cornell University, 1998).

Yamanishi, K., Learning with the Minimum Description Length Principle (Springer, Singapore, 2023).

Yildiz, I. B., Jaeger, H., and Kiebel, S. J., "Re-visiting the echo state property," Neural Netw. 35, 1–9 (2012).