We use the minimum description length (MDL) principle, an information-theoretic criterion for model selection, to determine echo-state network readout subsets. We find that MDL subset selection improves accuracy when forecasting the Lorenz, Rössler, and Thomas attractors, and that it enhances the performance benefit obtained when higher-order terms are included in the readout layer. We explain these improvements in terms of decreased linear dependence and improved consistency.
Minimum description length (MDL) is an information-theoretic criterion for evaluating model performance. It can be used to select an appropriately sized predictor subset from a large readout. This is useful for training echo-state networks (ESNs), which map high-dimensional dynamical systems to outputs via a readout layer. The size of the MDL-selected component subset reaches a plateau as ESN size is increased beyond some critical value, which results in readout sparsity. We make use of this sparsity by exploring recent work that augments reservoir readouts with higher-order sub-graph states. We find that these higher-order states substantially improve the performance of ESNs in some cases. However, we also show that the accompanying increase in readout size may increase the risk of over-fitting and make stronger regularization necessary. By using MDL to maintain a small readout size, these over-fitting risks can be mitigated while the performance benefits are retained.
I. INTRODUCTION
Reservoir computing is a branch of machine learning in which only the responses of otherwise untrained dynamical systems are learned. Reservoir computing is attributed to Jaeger (2001) and Maass et al. (2002), who initially introduced the concept as a simplified framework for training recurrent neural networks (Nakajima and Fischer, 2021). Since then, reservoir computing has maintained relevance at least in part because of its conceptual simplicity (Nakajima and Fischer, 2021) and low learning cost (Tanaka et al., 2019).
Despite its simple training methods, reservoir computing has demonstrated great effectiveness in modeling temporal problems and has been used widely for the prediction of chaotic systems. In recent years, much work has been done to incorporate greater complexity in the readouts of reservoir computers while retaining the simplicity of their training. It has been shown by Bollt (2021) that reservoirs with linear activation functions may be improved by incorporating quadratic terms of reservoir component states. Quadratic terms are also used by Pathak et al. (2018a) to build a readout for an ensemble of reservoirs and model spatio-temporal data. In Ren et al. (2024), higher-order terms are included to construct a Hilbert space of polynomials of reservoir components.
While this recent work in echo-state network (ESN) readout augmentation has demonstrated the ability to significantly improve reservoir performance, it does so while substantially increasing readout size. We propose using the minimum description length (MDL) principle as a criterion for selecting subsets from these larger readouts. MDL is a powerful technique for comparison between models of different sizes that was first introduced by Rissanen (1978). We show that, when combined with the subset selection algorithm presented by Judd and Mees (1995), MDL improves ESN performance. One possible explanation for such performance improvements is the reduction of linear dependence in the readout, which reduces the need for regularization.
We will first introduce reservoir computing with a particular focus on ESNs (Sec. II A), then we will introduce the MDL principle in Sec. II B. We look specifically at ESNs in this paper, though our implementation of MDL theory is applicable to all forms of reservoir computing including physical reservoirs. In Sec. II C, we provide methods for performance comparison between ESNs and present the dynamical systems that we use for prediction tasks. In Sec. III, we demonstrate the difference between MDL and ridge regression and compare their performance on benchmark tasks. In Sec. IV, we use the MDL principle to select components from a readout augmented with sub-graph states. Finally, in Sec. V, we provide an explanation in terms of linear dependence and consistency for the improved performance and reduced need for regularization in MDL readout subsets.
II. BACKGROUND
Reservoir computing is a method for time-series modeling with high-dimensional dynamical systems called reservoirs. These reservoirs are driven by an input signal, which they map onto a high-dimensional space of measurable components. After some transient period, these components become complex functions of historical input states if the echo-state property holds (Yildiz et al., 2012). If the echo-state property does hold, then the space of reservoir components is an embedding space (Duan et al., 2023), which converts a temporal problem into a spatial problem (Schrauwen et al., 2007). All that is needed to map the state of reservoir components back to the state of the input is a readout trained on pattern recognition (Tanaka et al., 2019). This readout consists of a weighted sum of reservoir components and is commonly trained using ordinary least squares regression or ridge regression (Sun et al., 2022). In theory, the latter helps prevent over-fitting, which is particularly important in reservoir computing since the reservoir state is high-dimensional by design.
In practice, an infinite transient period is not possible and the space of reservoir components may contain some amount of noise that is not functionally dependent on the input history. This can be demonstrated by driving two copies of a reservoir from different initial states with an identical time series and measuring the consistency capacity between them (Jüngling et al., 2021). Consistency capacity describes the effective dimension of the reservoir, that is, the dimension of the subspace that is purely functionally dependent on the input history. A more detailed explanation of consistency capacity can be found in the Appendix.
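As an informal illustration of this replica test, the consistency of each reservoir component can be sketched as the squared correlation between corresponding node states of two reservoir copies driven by the same input from different initial states. This is only a simplified stand-in for the full consistency capacity calculation in the Appendix, and the function name and normalization below are illustrative assumptions.

import numpy as np

def component_consistency(states_a, states_b):
    # Replica-test sketch (illustrative, not the Appendix calculation):
    # states_a and states_b are post-transient state matrices from two copies
    # of the same reservoir (rows = time steps, columns = nodes).
    consistencies = []
    for xa, xb in zip(states_a.T, states_b.T):
        consistencies.append(np.corrcoef(xa, xb)[0, 1] ** 2)
    return np.array(consistencies)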
A. Introduction to echo-state networks
Echo-state networks (Jaeger, 2001), which pioneered the field of reservoir computing alongside the liquid state machine (Maass et al., 2002), remain one of the most commonly used reservoir frameworks. They can be conceptualized as a network of nodes connected to a time-dependent input signal. The network topology is described by an adjacency matrix, and the input is coupled to the nodes through a vector of input weights. In order to introduce non-linearity, the network node states are passed through an activation function such as the hyperbolic tangent.
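As a concrete sketch, a tanh ESN without leakage can be driven as below. The variable names, washout length, and exact update form are illustrative assumptions rather than a prescription of the implementation used here.

import numpy as np

def run_esn(u, adjacency, input_weights, washout=500):
    # Drive a tanh echo-state network (no leakage) with a scalar input series u
    # and collect the node states after the washout transient.
    x = np.zeros(adjacency.shape[0])
    states = []
    for t, u_t in enumerate(u):
        x = np.tanh(adjacency @ x + input_weights * u_t)  # standard ESN update
        if t >= washout:                                   # discard the transient
            states.append(x.copy())
    return np.array(states)  # rows = time steps, columns = nodes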
B. Introduction to minimum description length
Minimum description length is derived from the idea of code-length. It is a method for comparing models of different sizes within the same model class by approximating their ability to compress data (Nakamura et al., 2003). The aim when building a model according to MDL theory is to minimize the sum of the code-lengths of both the model parameters and the model errors.
However, an optimal model does not necessarily include all variables. An algorithmic approach that tries to optimize parameter subsets was given by Judd and Mees (1995). In this approach, which we will refer to as the breathing algorithm, a candidate parameter is added to the current optimal parameter subset, increasing its size by one, and then the parameter with the smallest fitted coefficient is dropped. The added parameter corresponds to the component with the strongest absolute correlation to the errors made when predicting with the current parameter subset. In other words, the added parameter is the one whose component time series has the largest absolute inner product with the vector of errors obtained when predicting with the current set of parameters. This is repeated until the added and dropped parameters coincide, at which point that parameter is included in the optimal subset and the subset size increases by one. After a model has been found for each subset size using this process, the description length for each is determined. The readout subset corresponding to the lowest description length is selected. This is illustrated in Fig. 1. Extensions to this algorithm have been proposed by Nakamura et al. (2007).
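For intuition, a crude two-part description length for a linear model can be sketched as follows. This simplified score only approximates the criterion described above; in particular, it omits the optimization of coefficient precisions (e.g., via Newton's method, see Sec. II B) used in practice, so it should be read as an assumption-laden illustration.

import numpy as np

def description_length(errors, n_params):
    # Crude two-part MDL score: code-length of the residuals plus a penalty
    # for specifying the fitted parameters (coefficient precisions not optimized).
    n = len(errors)
    rss = float(np.dot(errors, errors))
    error_cost = 0.5 * n * np.log(rss / n)    # code-length of the model errors
    param_cost = 0.5 * n_params * np.log(n)   # code-length of the model parameters
    return error_cost + param_cost

The subset whose fitted model minimizes this total would then be selected, as in Fig. 1.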
An illustration of subset selection according to MDL theory. The solid line represents the description length of models with different node subset sizes. Subsets of each size are proposed by the breathing algorithm and the vertical dashed line indicates the best of the proposed subsets according to MDL. In this demonstration, a ridge parameter of was used and subsets were trained for the purpose of predicting the Lorenz system one time step ( Lyapunov times) ahead.
Both the subset selection algorithm and the iteration of Newton’s method for optimizing coefficient precisions increase the computational complexity of the training process in most cases. If the subset selection algorithm is allowed to terminate early, then the method detailed above can improve computational complexity by avoiding the inversion of large matrices. In our implementation, using Python code on an off-the-shelf laptop, ESNs with up to nodes required less than a minute to train.
C. Measuring performance
We use mean square prediction error to compare ESN performance at modeling the standard Lorenz ( ), Rössler ( ), and Thomas ( ) systems over a range of forecasting horizons. The equations for these systems are provided in the Appendix. Since the characteristic timescales of each system are different, forecasting horizons are measured in Lyapunov time, which is the inverse of the leading Lyapunov exponent. We estimate the leading Lyapunov exponents of the Lorenz, Rössler, and Thomas systems with the coefficients above to be roughly , , and , respectively. Accepted values for these systems are around for Lorenz (Viswanath, 1998) and for Rössler (Sprott, 2023). Our calculations are made using the method described by Sandri (1996), and the provided error margins are three times the standard deviation of bootstrap sample means. The Lyapunov times ( ) we use are , , and . Given the difference between the timescale of the Lorenz attractor and those of Rössler and Thomas, we use for each system (this equates to s for Lorenz, s for Rössler, and s for Thomas). A time delay embedding of each attractor is illustrated in Fig. 2.
Time delay embeddings of the Lorenz (left), Rössler (center), and Thomas (right) attractors. The embedding delay is s for the Lorenz attractor, s for Rössler, and s for Thomas.
In our experiments, we use the hyper-parameters outlined below:
Network size: We use -node ESNs unless otherwise specified.
Activation function: We use the hyperbolic tangent activation function unless specified.
Spectral radius: or when linear activation is used.
Input weight matrix: We draw input weights from a zero-mean Gaussian distribution.
Adjacency matrix: We create the adjacency matrix by first taking the adjacency matrix of a Watts–Strogatz random network with average degree and re-wire probability . We then replace the non-zero entries with independent samples from a zero-mean normal distribution and scale the resulting matrix by its spectral radius (a sketch of this construction is given after this list).
Time interval between inputs: , where is the Lyapunov time of the driving system.
Testing and training data: We use points from each time series to train the ESNs. They are then tested on a new time series from the same system with 25 000 points.
Leakage: We do not use leakage. Non-zero leakage can be beneficial for longer forecasting horizons, but we avoid it here for simplicity.
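The adjacency matrix construction referred to above can be sketched as follows. The default degree, re-wiring probability, spectral radius, and function names below are placeholders rather than the values used in our experiments.

import numpy as np
import networkx as nx

def make_adjacency(n_nodes, degree=6, rewire_p=0.1, spectral_radius=1.0, seed=None):
    # Build an ESN adjacency matrix from a Watts-Strogatz graph: replace the
    # non-zero entries with zero-mean Gaussian weights, then rescale the matrix
    # to the chosen spectral radius. Default arguments are illustrative only.
    rng = np.random.default_rng(seed)
    a = nx.to_numpy_array(nx.watts_strogatz_graph(n_nodes, degree, rewire_p, seed=seed))
    mask = a != 0
    a[mask] = rng.normal(size=int(mask.sum()))
    a *= spectral_radius / np.max(np.abs(np.linalg.eigvals(a)))
    return a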
III. ESN READOUT OPTIMIZATION
Typically in reservoir computing, ridge regression is used for regularization to reduce model over-fitting. The penalty term introduced by ridge regression reduces the overall magnitude of the coefficients and improves numerical stability in the case of perfect or imperfect multicollinearity. However, ridge regression includes all model parameters and does not encourage sparsity as strongly as alternatives like Lasso regression (Tibshirani, 1996). There is a slightly different rationale behind selecting readout subsets using MDL and the breathing algorithm described in Sec. II B. Reservoir components that have little influence on model performance are neglected entirely, and multicollinearity in the matrix of output states is avoided by dropping components that are highly correlated with any others already contained within the optimal component subset. When ridge regression is used in conjunction with MDL subset selection, smaller ridge penalties tend to be required, since a reduction in linear dependence in the design matrix results in a better-conditioned matrix inverse.
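For reference, fitting the full readout by ridge regression amounts to the usual penalized least-squares solution. The sketch below uses illustrative names and assumes the columns of the design matrix are the reservoir component time series.

import numpy as np

def ridge_readout(design, target, penalty):
    # Ridge regression readout: w = (X^T X + penalty * I)^{-1} X^T y,
    # where the columns of the design matrix X are component time series.
    n_cols = design.shape[1]
    return np.linalg.solve(design.T @ design + penalty * np.eye(n_cols),
                           design.T @ target)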
MDL also encourages sparsity by limiting readout size. The size of the readout chosen using MDL is bounded above according to the amount of data used to train the model. This is because a model cannot achieve a minimum total description length if its parameters require the specification of more information than the target dataset. In reality, the optimal subset size will fall well below this limit for a good model.
Effective readout size for ESNs trained with and without MDL subset selection. Solid lines represent the number of nodes in the MDL subset while dotted lines are the effective readout sizes [ ] when the whole readout is trained. ESNs are trained to predict the Lorenz system one time step ( Lyapunov times) ahead using time-series points. The MDL readout size remains roughly constant beyond some critical ESN size.
If we compare this measurement to the size of the ESN, we find that the effective readout size continues to increase with the full readout size when ridge regression is used on its own (Fig. 3). Perhaps counter-intuitively, readouts trained with weaker regularization (smaller ridge penalties) appear more sparse. However, this is because the matrix inverse may become poorly conditioned for small penalties, causing some coefficients to blow up and account for a large part of the norm of the coefficient vector. On the other hand, the size of the MDL readout is substantially smaller than that produced by ridge regression for large ESNs, and it reaches a plateau beyond some sufficient ESN size.
Despite the sparsity of the MDL-trained readout, it performs better than ridge regression at the task of -step ahead prediction for the Lorenz, Rössler, and Thomas systems. In the experiment below (Fig. 4), we compare the mean square prediction error (MSE) of ESNs trained using either the entire readout (shown with dashed lines) or MDL readout subsets (shown with solid lines). MSE was found for three different forecasting horizons across a range of ridge regression penalty sizes. Results were averaged across trials, and error bars indicate the and percentiles of bootstrap sample means. We find that the benefit of MDL subset selection is most apparent on short forecasting horizons. This is likely because model over-fitting occurs more easily on the comparatively simpler task of short-term forecasting, which would also explain why stronger regularization (larger ridge penalties) is needed for shorter horizons.
Prediction error ( MSE) by ridge regression penalty on -step prediction tasks for ESNs trained on the Lorenz, Rössler, and Thomas systems. The solid lines are ESNs trained using MDL readout subsets, while the dashed lines are trained using the entire readout. The black, blue, and red lines represent forecasting horizons of , , and Lyapunov time, respectively. ESNs were trained using points from each time series. Each line represents the average across samples, with error bars representing the to percentiles of bootstrap sample means.
Improvements from MDL subset selection at -step ahead prediction are also apparent with -node ESNs (Fig. 5) but not with small ( -node) ESNs (Fig. 6). It is not surprising that MDL provides no benefit to small reservoirs, since the size of the optimal subset according to MDL is the same as the size of the full readout (Fig. 3) for ESNs with fewer than around nodes. However, performance is better in the larger ( or )-node ESNs, where MDL readout subset selection is beneficial.
Prediction error at -step ahead prediction of Lorenz as in Fig. 4 but using a -node ESN.
Prediction error at -step ahead prediction of Lorenz as in Fig. 4 using a -node ESN.
We also test the robustness to noise of MDL-trained readouts in Fig. 7 and their ability to perform autonomous reconstruction in Fig. 8. Robustness to noise is similar in ESNs trained with and without MDL subset selection. The improved performance of the MDL readouts at low noise levels and short forecasting horizons simply follows from the results in Fig. 4. Performance in ESNs that use the entire readout is improved by low levels of Gaussian noise, because the added noise reduces linear dependence in the design matrix and results in a better-conditioned inverse. However, performance is not improved by low levels of Gaussian noise (with variance above ) when MDL subsets are selected, since linear dependence in the design matrix is reduced to begin with.
Noise robustness for ESNs trained using the entire readout (top row) or MDL readout subsets (bottom row) and ridge regression penalties . The -axis of each panel is the forecasting horizon in Lyapunov time, and the -axis is the standard deviation of added Gaussian noise. The color of each heat map represents the MSE for -step ahead prediction of the Lorenz attractor.
Reconstruction composites of the Lorenz attractor made by superimposing time delay embeddings of ESN autonomous reconstructions. ESNs were trained using the whole readout in the top row while ESNs in the bottom row used MDL readout subsets. In all cases, they were trained to predict one time step ( Lyapunov times) ahead. The appearance of shaded backgrounds and diagonal bands comes from failed reconstructions. The embedding delay is s and standard hyper-parameters were used (see Sec. II C).
In autonomous reconstruction tasks, as in -step ahead prediction tasks, ESNs trained using MDL readout subsets produce performance similar to those trained using the whole readout when ridge penalties are large (Fig. 8). When penalties are smaller, performance appears qualitatively better with MDL subsets. However, MDL is arguably less relevant for autonomous reconstruction since ideal ridge parameters are larger.
IV. AUGMENTING READOUT
Much work has been done recently to improve the performance of ESNs by incorporating higher-order products of reservoir components in their otherwise linear readouts. The performance improvements of this readout augmentation are particularly striking when linear activation functions are used in place of more conventional non-linear activation functions such as the hyperbolic tangent. As demonstrated by Bollt (2021), ESNs with a linear activation function become Vector Auto-Regressive (VAR) models whose performance may be improved with quadratic terms. These quadratic terms are introduced by taking the product of pairs of network nodes. In general, it has been shown that by introducing non-linearity in just the readout of an otherwise linear dynamical system, a wide range of signals may be approximated (Boyd and Chua, 1985). It has also been shown that quadratic terms can help to break harmful symmetries that lead to mirror attractors in autonomous reconstruction tasks (Herteux and Räth, 2020).
If we build on the quadratic readout in Bollt’s paper and include network sub-graphs of arbitrary size by taking the direct product of sub-graph node states, then this VAR becomes a polynomial model.
Readout weights for augmented readouts are determined as they otherwise would be, except that columns representing sub-graph states are added to the design matrix. For example, if a triangle involving three nodes is added to the readout, then the corresponding additional column contains, at each time step, the product of the states of those three nodes.
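A minimal sketch of this augmentation step is given below; the function name and the representation of sub-graphs as tuples of node indices are illustrative assumptions.

import numpy as np

def augment_readout(design, subgraphs):
    # Append higher-order sub-graph states to a design matrix whose columns are
    # node time series. Each sub-graph is a tuple of node indices and contributes
    # one extra column: the time series of the product of its node states.
    extra = [np.prod(design[:, list(nodes)], axis=1) for nodes in subgraphs]
    return np.column_stack([design] + extra)

# For example, augment_readout(X, [(0, 1), (2, 5, 9)]) would append one edge
# state and one triangle state to the readout.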
In Fig. 9, we demonstrate that the inclusion of sub-graph states can improve performance substantially when a linear activation function is used. These improvements are perhaps unsurprising: without higher-order sub-graph states, the reservoir output is a purely linear combination of historical inputs, and linear approximations of dynamical systems are only suitable over very short time horizons. However, if the whole readout is trained and small ridge penalties are used, performance decreases with readout augmentation.
Prediction error ( MSE) by ridge regression penalty for -step ahead prediction of Lorenz using a linear activation function. Dashed lines represent ESNs trained with the whole readout while solid lines use MDL readout subsets. Dark lines indicate the inclusion of higher-order sub-graph states while the faded lines represent performance when just node states are included in the readout. Augmentation with higher-order states improves performance if MDL readout subsets are used or if ridge penalties are large. However, performance decreases with augmentation for small ridge penalties if the whole readout is trained.
Performance is also improved by readout augmentation for MDL readout subsets when a non-linear (hyperbolic-tangent) activation function is used. Performance improvements from augmentation still occur, but are smaller, when the entire readout is trained. This is demonstrated in Fig. 10, where the tests performed in Fig. 4 are repeated with and without edge, triangle, and four-cycle states included in the readout.
Prediction error ( MSE) by ridge regression penalty on -step prediction tasks for ESNs with augmented readouts. The solid lines are ESNs trained using MDL readout subsets while the dashed lines are trained using the entire readout. Dark lines indicate the inclusion of higher-order sub-graph states while the faded lines represent performance when just node states are included in the readout. The black, blue, and red lines represent forecasting horizons of , , and Lyapunov time, respectively.
In Figs. 11 and 12, we repeat the noise robustness and autonomous reconstruction experiments from Sec. III. The best performance in noise robustness tests using the whole readout occurs at a larger amount of noise when sub-graph states are included. This is likely because noise has a regularizing effect and more regularization is needed for the larger readout. Augmentation appears to reduce robustness to high levels of Gaussian noise for ESNs trained using MDL readout subsets.
Noise robustness for augmented ESNs trained using the entire readout (top row) or MDL readout subsets (bottom row) and ridge regression penalties . The -axis of each panel is the forecasting horizon in Lyapunov time, and the -axis is the standard deviation of added Gaussian noise. The color of each heat map represents the MSE for -step ahead prediction of the Lorenz attractor.
Reconstruction composites of the Lorenz attractor made by superimposing time delay embeddings of ESN autonomous reconstructions. ESNs were augmented with higher-order sub-graph states and trained using the whole augmented readout (top row) or MDL subsets of the augmented readout (bottom row). The embedding delay and hyper-parameters are the same as in Fig. 8.
Reconstruction performance with and without augmentation is harder to compare due to its qualitative nature. However, the absence of dark bands off the attractor when indicates an improvement over Fig. 8. Again, no successful reconstruction was achieved using the entire readout with .
V. LINEAR DEPENDENCE AND CONSISTENCY
An important finding from Secs. III and IV is that MDL readout subsets require less regularization. For example, we can see from -step ahead prediction and autonomous reconstruction performance that large ridge penalties are needed when training the entire readout. However, performance on both tasks remains reasonable when smaller ridge penalties are used to train MDL readout subsets. Regularization from low levels of added Gaussian noise is beneficial for training the full readout but not for training MDL readout subsets.
Regularization in the form of a ridge regression penalty or added Gaussian noise helps stabilize the matrix inverse if the columns of the design matrix [which are the time series of node or sub-graph states] are linearly dependent. We argued in Sec. III that this linear dependence is often avoided when using MDL readout subsets. We can test this theory with consistency capacity (see Subsection 3 of the Appendix), which measures the number of linearly independent components of the driving signal contained within the reservoir (Lymburn et al., 2019).
The consistency spectrum, in contrast to the consistency capacity, is a ranked set of reservoir component consistencies. The consistency of each reservoir component indicates its degree of functional dependence on the driving signal, and their sum is the consistency capacity. Looking at the consistency spectrum in Fig. 13, we can see that additional independent signal components are provided by the higher-order sub-graph states.
Consistency profile of augmented and non-augmented ESNs (with nodes and edges, triangles and four cycles in the augmented case). MDL subsets are selected for the purpose of predicting the Lorenz system Lyapunov times ahead. On the vertical axis is the size of the largest consistent component. Additional linearly independent components of the driving signal are contained within higher sub-graph states.
We can also measure an effective consistency capacity for readout subsets (whether or not they include higher-order states) by following the calculation in Subsection 3 of the Appendix but including only the columns that correspond to elements of the readout. Rather than measuring the number of linearly independent signal components contained within the entire reservoir state, this effective consistency capacity measures the number of such components contained within the sub-state of the reservoir that is mapped to the output. If coefficients are only fitted for a subset of readout components, then it is this effective consistency that matters. We normalize the effective consistency capacity of a readout subset by the subset size to provide a measure of the proportion of the readout that corresponds to linearly independent components of the driving signal.
We demonstrate in Fig. 14 that using the breathing algorithm with MDL as a subset selection criterion improves readout consistency. This means that the proportion of readout components in the MDL readout subset that are linearly independent and functionally dependent on the driving signal is higher than in the whole readout.
Total consistency of random readout subsets (black) compared with subsets chosen by the breathing algorithm (red) prior to (solid) and after (dashed) augmentation. The vertical lines indicate the optimal subset size according to MDL theory. Optimal subset sizes occur while consistency remains high and the breathing algorithm selects high-consistency subsets with and without readout augmentation.
One mechanism for this improvement is the reduction in readout size. Smaller readouts tend to be more consistent even with random subset selection. In theory, the optimal subset size according to MDL should occur while consistency is still high: once a subset size is reached where new components are less relevant to the signal, or similar to those already contained, any reduction in the code-length of the model errors is less likely to be worth the increase in the code-length of the model parameters.
The breathing algorithm also tends to select component subsets with disproportionately high consistency capacity when readouts are small. This is because, at each subset size, the algorithm chooses the predictor most strongly related to the prediction error. With large enough subsets, the breathing algorithm may eventually choose less consistent readouts than random sampling—as is the case when the readout is augmented in Fig. 14. However, this is only the case for readout sizes larger than the optimal size according to MDL and occurs because there may be signal components that are consistent but not useful. Irrelevant signal components will not be related to prediction error.
VI. CONCLUSION
In this paper, we demonstrated the use of minimum description length in reservoir training. In doing so, we found that using MDL readout subsets is beneficial for -step ahead prediction tasks and results in qualitatively similar attractor reconstructions. One explanation for the advantages of MDL readout subsets is that they contain less linear dependence, which we demonstrated using consistency capacity. This reduction in linear dependence also made MDL readout subsets less sensitive to the size of the ridge penalty.
We argued that decreased linear dependence in MDL readout subsets was only partly explained by their smaller size. The other explanation was that the breathing algorithm selected subsets with higher consistency than random sampling.
The benefits of using the MDL principle in ESN training that we found did not apply to small ( -node) ESNs. However, these small ESNs produced inferior performance relative to their larger counterparts, which were improved by MDL subset selection.
As such, we believe that our findings provide a strong case for the consideration of MDL in practical reservoir training tasks and that this case is strengthened by recent attention to readout augmentation.
ACKNOWLEDGMENTS
A.M. was supported by the Australian Government RTP scholarship at the University of Western Australia. M.S. and D.M.W. are supported by the ARC Discovery Grant (No. DP200102961), funded by the Australian Government. M.S. also acknowledges the support of the Australian Research Council through the Centre for Transforming Maintenance through Data Science (Grant No. IC180100030).
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Antony Mizzi: Conceptualization (equal); Data curation (equal); Investigation (equal); Methodology (equal); Software (equal); Visualization (equal); Writing – original draft (equal). Michael Small: Conceptualization (equal); Methodology (equal); Project administration (equal); Software (equal); Supervision (equal); Validation (equal); Writing – review & editing (equal). David M. Walker: Methodology (equal); Project administration (equal); Supervision (equal); Validation (equal); Writing – review & editing (equal).
DATA AVAILABILITY
The data that support the findings of this study are openly available in MDL_ESNs, at https://github.com/antony-mizzi/MDL_ESNs.git (Mizzi, 2025).
APPENDIX: SYSTEMS, ALGORITHMS, AND CONSISTENCY
1. Attractors
We generate time series from the Lorenz, Rössler, and Thomas systems using Euler integration with an interval of for Lorenz and for Rössler and Thomas. We then sample points at a rate of for Lorenz and for Rössler and Thomas. The set of sampled points is used for reservoir prediction tasks. The update equations for each system are given below:
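As an illustration, the sketch below applies Euler integration to the Lorenz system. The parameter values shown are the standard textbook choices and are assumptions here; the Rössler and Thomas systems are handled in the same way with their own derivative functions and step sizes.

import numpy as np

def lorenz_deriv(s, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # Standard textbook Lorenz equations; parameter values are assumptions here.
    x, y, z = s
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def euler_trajectory(deriv, s0, dt, n_steps):
    # Generate a trajectory by simple Euler integration: s <- s + dt * f(s).
    trajectory = np.empty((n_steps, len(s0)))
    s = np.array(s0, dtype=float)
    for i in range(n_steps):
        s = s + dt * deriv(s)
        trajectory[i] = s
    return trajectory

# e.g. euler_trajectory(lorenz_deriv, [1.0, 1.0, 1.0], dt=0.001, n_steps=200_000)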
2. Breathing algorithm
Let the state matrix be the matrix whose rows contain the time series of each node's state, with each row normalized. For a node subset, the corresponding sub-matrix contains the rows of the nodes in that subset. Initially, the optimal subset is empty. The following steps are then repeated (a minimal code sketch of this procedure is given after these steps):
1. Fit coefficients for the nodes in the current subset and denote the errors of this fit as the error vector.
2. Pick the node outside the subset whose row has the maximum absolute inner product with the error vector, and add it to the subset.
3. Fit coefficients for the sub-matrix with rows corresponding to the enlarged subset and find the node with the smallest absolute coefficient.
4. If the newly included node has the smallest absolute coefficient, then include it in the optimal parameter subset and increase the subset size by one.
5. Otherwise, drop the node with the smallest absolute coefficient and repeat the process from step 1 with the resulting subset.
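A minimal sketch of one "breath" of this procedure, assuming ordinary least-squares fits and normalized rows, is given below. The names are illustrative and no safeguard against cycling is included.

import numpy as np

def breathing_step(state_rows, target, subset):
    # One breath of the selection procedure above. Rows of state_rows are
    # normalized node time series; subset is a list of row indices.
    subset = list(subset)
    while True:
        if subset:
            design = state_rows[subset].T                      # columns = chosen nodes
            coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
            errors = target - design @ coeffs                  # step 1: fit and errors
        else:
            errors = target.copy()
        outside = [i for i in range(state_rows.shape[0]) if i not in subset]
        added = max(outside, key=lambda i: abs(state_rows[i] @ errors))  # step 2
        trial = subset + [added]
        trial_coeffs, *_ = np.linalg.lstsq(state_rows[trial].T, target, rcond=None)
        dropped = trial[int(np.argmin(np.abs(trial_coeffs)))]  # step 3: weakest coefficient
        if dropped == added:
            return trial                                       # step 4: accept, subset grows
        subset = [i for i in trial if i != dropped]            # step 5: swap and repeat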
3. Consistency capacity
Consistency capacity is a measure of the functional dependence of the reservoir state on the input signal. In fully consistent reservoirs, the echo-state property holds completely. Consistency capacity is measured as follows.