Many natural systems exhibit chaotic behavior, including the weather, hydrology, neuroscience, and population dynamics. Although many chaotic systems can be described by relatively simple dynamical equations, characterizing these systems can be challenging due to sensitivity to initial conditions and difficulties in differentiating chaotic behavior from noise. Ideally, one wishes to find a parsimonious set of equations that describe a dynamical system. However, model selection is more challenging when only a subset of the variables is experimentally accessible. Manifold learning methods using time-delay embeddings can successfully reconstruct the underlying structure of a system from data with hidden variables, but not the equations. Recent work in sparse-optimization-based model selection has enabled model discovery given a library of possible terms, but regression-based methods require measurements of all state variables. We present a method combining variational annealing—a technique previously used for parameter estimation in chaotic systems with hidden variables—with sparse-optimization methods to perform model identification for chaotic systems with unmeasured variables. We applied the method to ground-truth time series simulated from the classic Lorenz system and to experimental data from an electrical circuit with Lorenz-system-like behavior. In both cases, we successfully recover the expected equations with two measured variables and one hidden variable. Application to simulated data from the Colpitts oscillator demonstrates successful model selection of terms within nonlinear functions. We discuss the robustness of our method to varying noise.

Chaos represents a challenge for studying the dynamic behavior of many physical and biological systems. Since the 1980s, we have known that time-series measurements from one variable of a chaotic system contain information about the underlying structure of the full multi-dimensional system. However, recovery of the full system from data with hidden variables has remained elusive. This work develops a novel data-assimilation technique to identify governing equations of chaotic systems from data with hidden variables. This method identifies fairly simple, low-dimensional, and deterministic models from seemingly incomplete data. Discovery of such equations can enable rich mathematical studies and physical insights for problems across nearly every discipline, including climate science, hydrology, neuroscience, ecology, medicine, and engineering.

## I. INTRODUCTION

Hypothesis generation through data-driven model identification has the potential to revolutionize science. Uncovering the interactions, structure, and mechanisms that determine the behavior of chaotic systems, in particular, could improve scientific understanding in almost every discipline with dynamical systems,^{1} including climate science,^{2} hydrology,^{3} population dynamics,^{4} and neuroscience.^{5} Many chaotic systems can be informatively described by relatively simple dynamical equations. However, characterization and control of these systems can be challenging^{6} due to sensitivity to initial conditions and difficulties in differentiating chaotic behavior from noise.^{7} Characterization through statistical, geometric, or model-based means becomes more challenging when only a subset of the variables is experimentally accessible. Model-based characterization can include the following: parameter estimation, where a model structure is known; model comparison, where multiple pre-defined structures are evaluated; and sparse model selection, where the model structure is selected from a comprehensive library of possible terms. Here, we use sparse model selection to identify a parsimonious set of equations that describe a chaotic system from measurements with hidden variables.

Much data analysis for chaotic systems has focused on learning the attracting manifold structure from time series. In the early 1980s, Takens’s theorem^{8} described the conditions under which one can use the time-delay embedding of a single variable to construct a manifold that preserves the topological properties of the full system. Takens’s result formalized the idea that the information about the manifold structure, and, therefore, the chaotic dynamics, could be recovered from the time-history of a single state variable. Manifold reconstruction methods^{9–11} based on partial information provide insight into the structure, dimensionality, and statistics of chaotic systems. By constructing manifolds from time-delays, Sugihara *et al.* developed methods for discriminating chaos from noise^{7} and detecting causality between measured variables.^{12} Methods including reservoir computing,^{13–15} other deep learning frameworks,^{16} data assimilation combined with neural networks,^{17} support vector machines,^{18} and nearest neighbors^{19} can accurately predict the dynamics of chaotic systems using a data-trained model with no specific physical knowledge of the system. For a review of predictive methods, see Ref. 20. Assuming a reasonable model structure is known, data-assimilation methods,^{21–24} including variational annealing (VA),^{25} can estimate model parameters for chaotic systems from incomplete, indirect, and noisy measurements. Some methods have been used to perform model comparison.^{23,24} We call the evaluation of multiple pre-defined models “model comparison” rather than “model selection” to avoid confusion with the sparse model selection methods described below.
Recent works use a data-assimilation approach with constraints on parameter values, but they define a known sparsity structure, such as nearest neighbor interactions based on prior knowledge, and estimate a subset of the parameters.^{26,27} These methods are designed to assimilate information from sparse, noisy data-streams with hidden variables and are more robust than previous methods for model comparison. However, they are not designed for the purpose of sparsely selecting model terms from a large library and, thereby, hypothesizing parsimonious model structures.

Data-driven discovery of parsimonious dynamical system models to describe chaotic systems is by no means new. Early on, least-squares fitting of combinatorial sets of polynomial basis functions to time-series data, followed by information-theory-based selection, produced models that reproduced the manifold structure and statistics of the system.^{28} Symbolic regression demonstrated successful recovery of the widely accepted equations for the chaotic double-pendulum system.^{29} Compressed sensing has been used to identify nonlinear dynamics of chaotic systems.^{30} An $\ell_0$ constraint to promote model selection was shown to produce predictive models, but the recovered models often required more terms than the ground truth.^{31} More recently, sparse regression^{32–34} motivated model selection techniques, such as SINDy,^{35} which recover the ground-truth equations for chaotic systems from a relatively large library of functions without needing a computationally intensive combinatorial search. Other sparsity-promoting frameworks have improved robustness for chaotic-system equation recovery through integral formulations^{36–38} and entropic regression.^{39} However, these methods require measurements of all state variables that significantly impact the desired dynamics.^{36–39} Notably, Champion *et al.* recently used an autoencoder framework for automatic discovery of system coordinates and equations but required input time series of a higher dimension than the intrinsic dimension of the system.^{40}

Sparse model selection with hidden variables requires different methodology. By “hidden variables,” we mean that the number of measured variables is smaller than the intrinsic dimension of the system. Measured variables are not considered hidden if they are corrupted by noise or indirectly sampled through a measurement function. A few methods address the problem of sparse model selection with hidden variables, but they have not been demonstrated for chaotic systems. For example, Daniels *et al.*^{41,42} combinatorially fit each model in a predefined model space using data assimilation and subsequently use Bayesian inference to select the best model. Successful recovery of mass-action kinetic systems for chemical reactions with hidden variables was demonstrated using a neural network approach.^{43} A recent method uses LASSO to select predictive models for chaotic systems from a library with higher-order derivatives given a single state variable.^{44} This method effectively finds a higher-order ordinary differential equation (ODE) representation of the Lorenz and Rössler systems, but it is unclear how the recovered structures relate to the ground-truth models. Contemporaneous with our work, a method combining SINDy, time-delay coordinates, and deep delay autoencoders has demonstrated selection of chaotic models of different structures but similar sparsity from a single variable of simulated models and a waterwheel experiment.^{45}

In this paper, we present a new method to perform model selection in dynamical systems with hidden variables. This method combines the data-assimilation technique VA, which has been used to estimate parameters when the structure of the system is known, with sparse model selection via hard thresholding. We call this method data assimilation for hidden, sparse inference (DAHSI). To demonstrate that our method could identify interpretable models for chaotic systems, we followed the philosophy of earlier works^{29,35} and demonstrated recovery of an accepted parsimonious model from simulated time series where the ground truth is known and from experimental data where there is an expected model. In Sec. II, we describe the DAHSI algorithm for model selection with hidden variables and demonstrate its application to the simulated Lorenz 63 system. In Sec. III, we analyze robustness to noise and sampling with the simulated system. We then apply DAHSI to synthetic time series from the Colpitts oscillator, a system that has exponential terms in its governing equations. DAHSI selected the correct exponential model from a library of non-additive, nonlinear functions. However, it made small structural errors in recovery of the ground-truth system when one variable is hidden. The correct system is recovered if DAHSI is run iteratively, indicating that iterative application may be a good check for robustness. As a final example, DAHSI successfully selected a set of models for a circuit that has Lorenz-like behavior from experimental data of two state variables (one hidden). One of the identified models has the same structure as the Lorenz system. The other identified models with strong statistical support exhibit nearly indistinguishable dynamics and suggest novel terms, which may better represent the experimental circuit system. We discuss future work to improve robustness and computational efficiency and to expand DAHSI to other classes of models.

## II. METHODS: SPARSE MODEL SELECTION USING A VARIATIONAL ANNEALING APPROACH

### A. Model selection with hidden variables

The dynamics of many physical systems can be described by ODE models with only a few terms. Our goal is to discover a sparse representation of these types of systems given the measurements of some, but not all, of the state variables. We consider a dynamical system with unknown governing equations
$$\frac{dX}{dt} = F(X, p), \tag{1}$$

where $X=(x_1,x_2,\ldots,x_D)\in\mathbb{R}^D$ are the state variables and $F=(F_1,F_2,\ldots,F_D)$ are the unknown functions that govern the dynamics of the system. Here, we assume that the dimension of the system is known. If it is not known, the dimension can be estimated using time-delay embedding methods.^{9} These methods effectively allow pre-calculation and analysis of the dimensionality of the system based on the compactness of the manifold in different embedding dimensions. $p$ is a set of unknown model parameters. Although not inherent to our methodology, we focus on autonomous dynamical systems.

For a system with hidden variables, the measurements $Y=(y_1,y_2,\ldots,y_L)\in\mathbb{R}^L$ are lower dimensional ($L \leq D$) than the underlying variables. The measurement function $h(X)=Y$ is a known transformation of a subset of the state variables in (1). The state variables in $X$ are ordered so that the first $L$ are measured and the rest are unmeasured. In principle, the measurement function could map some combination of state variables to a lower dimension, as in $h(X)=x_1+x_2$. In this work, we assume that $h$ captures Gaussian experimental noise such that $y_k = h_k(X) = x_k + \mathcal{N}(0,\omega)$. The measurements are taken at $N$ equally spaced points in time in $[t_1, t_N]$.

For sparse model selection to work, the function capturing the nonlinear dynamics of each state variable, $F_k$, must be well-represented by a subset of functions from some assumed function space.^{35,46} Given a library of possible functions $\Theta=(\theta_1,\theta_2,\ldots,\theta_q)$, we can write a candidate function $\hat{F}_k$ as
$$\hat{F}_k(X) = \sum_{j=1}^{q} p_{k,j}\, \theta_j\big(X, \hat{p}_j^{(k)}\big), \tag{2}$$

for $k=1,2,\ldots,D$ state variables. The vector of unknown parameters is $p=(\{p_{1,j}\},\{p_{2,j}\},\ldots,\{p_{D,j}\},\{\hat{p}_j^{(1)}\},\{\hat{p}_j^{(2)}\},\ldots,\{\hat{p}_j^{(D)}\})$. Each bracketed set contains the parameters for the $j=1,2,\ldots,q$ library functions. The first $q$ entries in the set $\{p_{1,j}\}$ are the coefficients on the functions in the equation of the first state variable, the next $q$ in $\{p_{2,j}\}$ are coefficients in the equation of the second state variable, and so on. The entries from $D\times q+1$ to the end correspond to the parameters inside the library functions, $\hat{p}_j^{(k)}$, which are empty vectors if the library function, $\theta_j$, in the $k$th equation does not have unknown parameters. The number of non-zero entries in the first $D\times q$ elements of $p$ corresponds to the subset of functions selected to form a candidate model. We examine libraries with unknown parameters inside nonlinear functions, such as $p_{k,j}\theta_j(X,\hat{p}_j^{(k)})=p_{k,j}\exp(\hat{p}_{j,1}^{(k)}x+\hat{p}_{j,2}^{(k)}y)$, and the selection of terms within $\theta_j(X,\hat{p}_j^{(k)})$ in Sec. III C.

The choice of library functions can be based on prior knowledge of the system, such as symmetry and physical knowledge. There are certainly many problems for which an appropriate library is unknown and methods that do not rely on such knowledge may be more appropriate;^{40} however, there are also many complex and interesting systems where a reasonable library can be defined, but the active terms in a given experiment are not known ahead of time. For example, a monomial basis is an intuitive choice for systems well described by mass-action kinetics where the specific interactions are unknown. Alternative function spaces, such as orthogonal polynomials, periodic functions such as the Fourier series, or radial basis functions,^{47–49} could also be used if there is physical intuition that the system is sparse within that space.

### B. Sparse-variational annealing cost function

We want to estimate the unknown parameters $p$ and all state variables $X$ using only the measurements $Y$, with the constraint that $p$ is sparse, meaning that many of the parameters are zero-valued. Following the statistical derivation in Talagrand and Courtier,^{50} Evensen,^{51} and Abarbanel,^{52} we can solve this problem by minimizing the negative log likelihood,
$$A(X, p) = A_E(X, Y) + R_f\, A_M(X, p, \hat{F}) + \lambda \Vert p \Vert_1. \tag{3}$$

This function is composed of three terms: the experimental error, $A_E(X,Y)=\frac{1}{N}\sum_{i=1}^{N}\Vert X(t_i)-Y(t_i)\Vert^2$; the model error, $A_M(X,p,\hat{F})=\frac{1}{N}\sum_{i=1}^{N-1}\Vert X(t_{i+1})-f(X(t_i),p,\hat{F})\Vert^2$; and a sparse penalty term, $\lambda\Vert p\Vert_1$ (full derivation in Sec. S1 of the supplementary material). Here, $f(X(t_i),p,\hat{F})=X(t_{i+1})$ defines the discrete-time model dynamics and is obtained by discretizing (1) using Hermite–Simpson collocation. The experimental error, $A_E$, depends only on the measurements and the measured variables estimated from the model, and we assume Gaussian measurement noise. The model error, $A_M$, contains the coupling between variables, taking advantage of the information about hidden variables in the time-history of the measured variables. We assume that the state at $t_{i+1}$ depends only on the state at $t_i$. A relaxed delta function, a Gaussian, is used to enforce the model between time points. The sparse penalty term, $\lambda\Vert p\Vert_1$, arises from the assumption that each element in $p$ follows a Laplace distribution with mean $0$ and scaling parameter $1/\lambda$, which sets the distribution width. Model selection is enabled through the sparse penalty, which effectively determines which parameters, $p_{k,j}$, will be active in the model and which will be zero. Parameters within nonlinear functions, $\hat{p}_j^{(k)}$, are also under constraint. If $\lambda=0$ in (3), the cost is the same as that used in VA, usually known as the weak 4D-Var formulation.^{50,51} We next introduce the Lorenz system as a test case and then explain how to minimize Eq. (3) in Sec. II D.
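To make the pieces of the cost concrete, here is a minimal one-dimensional sketch of $A_E + R_f A_M + \lambda\Vert p\Vert_1$. It is an illustration under stated assumptions, not the paper's implementation: a simple forward-Euler map stands in for the Hermite–Simpson collocation, and the toy system $dx/dt = p_0 x$ is invented for the example.

```python
def cost(X, Y, p, f_disc, Rf, lam):
    """Sparse-VA cost A_E + Rf*A_M + lam*||p||_1 for a 1-D state.

    X: estimated state trajectory, Y: measurements, p: parameter list,
    f_disc(x, p): discrete-time model x_{i+1} = f(x_i, p).  A forward-Euler
    map stands in for the paper's Hermite-Simpson collocation.
    """
    N = len(Y)
    A_E = sum((X[i] - Y[i]) ** 2 for i in range(N)) / N
    A_M = sum((X[i + 1] - f_disc(X[i], p)) ** 2 for i in range(N - 1)) / N
    return A_E + Rf * A_M + lam * sum(abs(pj) for pj in p)

# Toy system dx/dt = p0*x, Euler-discretized with dt = 0.1.
dt = 0.1
f_disc = lambda x, p: x + dt * p[0] * x

# Build X as an exact orbit of the discrete model, so A_M vanishes.
X = [1.0]
for _ in range(4):
    X.append(f_disc(X[-1], [-1.0]))

print(cost(X, X, [-1.0], f_disc, Rf=1.0, lam=0.0))  # → 0.0
```

With `Y = X` both error terms are exactly zero, and turning on `lam` adds only the $\ell_1$ penalty, which is the mechanism that drives parameters toward zero during selection.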

### C. Lorenz system as an explanatory test case

The Lorenz system^{53} was originally developed to forecast the weather and has become a canonical example when scientists develop new methods to characterize chaotic systems. To demonstrate model selection with hidden variables by minimizing Eq. (3), we will use time series simulated from the classic Lorenz system [Fig. 1(a)],
$$\dot{x} = \sigma(y - x), \tag{4}$$

$$\dot{y} = x(\rho - z) - y, \tag{5}$$

$$\dot{z} = xy - \beta z, \tag{6}$$

where $\sigma=10$, $\rho=28$, and $\beta=8/3$. To produce synthetic time-series “data,” we numerically simulated the system using a 4th-order Runge–Kutta scheme with a time step of $\Delta t=0.01$ and $N=501$ points. We treat $y$ as a hidden variable [Fig. 1(b)]. The measurement error was modeled as additive Gaussian noise with mean zero and standard deviation $\omega=0.01$. Therefore, the measurement function is $Y=h(X)=X_{1,\ldots,L}+\mathcal{N}(0,\omega)$ for the measured state variables.
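As a concrete illustration, the synthetic data generation described above can be sketched in pure Python. The initial condition, seed, and function names here are illustrative assumptions, not the paper's exact choices.

```python
import random

def lorenz(state, sigma=10.0, rho=28.0, beta=8.0/3.0):
    """Right-hand side of the classic Lorenz system."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

def rk4_step(f, state, dt):
    """One 4th-order Runge-Kutta step for an autonomous ODE."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(s + dt / 6.0 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def simulate(x0=(1.0, 1.0, 1.0), dt=0.01, n=501, seed=0, omega=0.01):
    """Simulate the Lorenz system; return noisy measurements of x and z.

    y is treated as hidden: it is simulated but not returned as data.
    """
    rng = random.Random(seed)
    states = [x0]
    for _ in range(n - 1):
        states.append(rk4_step(lorenz, states[-1], dt))
    # Measurement function: identity on (x, z) plus Gaussian noise N(0, omega).
    meas_x = [s[0] + rng.gauss(0.0, omega) for s in states]
    meas_z = [s[2] + rng.gauss(0.0, omega) for s in states]
    return states, meas_x, meas_z
```

The hidden `y` time series is retained here only so a ground-truth comparison is possible; DAHSI itself sees only `meas_x` and `meas_z`.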

### D. DAHSI: Data assimilation for hidden sparse inference

Our algorithm, data assimilation for hidden sparse inference (DAHSI), performs model identification for chaotic systems from data with hidden variables. It combines the data-assimilation technique VA with sparse thresholding [Figs. 1(d) and 1(e) and Algorithm 1]. The code base for DAHSI can be found at Ref. 54.

We illustrate DAHSI on simulated time series of the Lorenz test case [Figs. 1(a) and 1(b)]. First, we construct the model library $\Theta$ based on domain knowledge. For the Lorenz test case, we used monomials up to degree two in three variables, representing $10^9$ possible models composed of subsets of possible terms. From this library, we generate a generic governing equation for each variable via the linear combination of all the candidate functions [Fig. 1(b)]. Our goal is to find a small subset of candidate functions within the library that describe the dynamics in the data, in this case recovering the ground-truth Lorenz model. We do not assume that we know the correct model complexity *a priori* and search for the set of models that balance error and simplicity.
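The size of this model space follows directly from the library. A minimal sketch, enumerating the ten monomials up to degree two in three variables and counting the models obtained by switching each of the $3\times 10$ coefficients on or off (variable names here are illustrative):

```python
from itertools import combinations_with_replacement

variables = ("x", "y", "z")

# Monomials up to degree two: the constant, degree-1, and degree-2 terms.
library = ["1"]
for degree in (1, 2):
    for combo in combinations_with_replacement(variables, degree):
        library.append("*".join(combo))

n_funcs = len(library)         # 10 candidate functions per equation
n_models = 2 ** (3 * n_funcs)  # each of the 30 coefficients is active or zero
print(n_funcs, n_models)       # → 10 1073741824
```

The count $2^{30} \approx 1.07 \times 10^{9}$ is what makes an exhaustive combinatorial search impractical and motivates the sparse-selection approach.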

To sparsely select models by minimizing the cost function, Eq. (3), we used iterative VA^{25} and hard thresholding. VA is a data-assimilation technique for non-convex parameter estimation in nonlinear, chaotic systems. The problem is highly non-convex, with many local minima, due to the incoherence between the data and the model.^{52,55} Decreasing the information or measurements and increasing the number of terms in the library both increase the number of local minima^{56} (Fig. S1 in the supplementary material). VA allows us to efficiently find the global minimum despite the challenging landscape, and hard thresholding of the parameters promotes sparsity of the solutions [Figs. 1(d) and 1(e)]. An ensemble of possible models is selected for varying threshold values [Fig. 1(f)]. These models are then compared using information criteria on out-of-sample validation data [Fig. 1(g), Sec. II E], and the model or models with the highest statistical support are identified [Fig. 1(h)].

The core of the DAHSI algorithm is a nonlinear optimization step using VA, which is randomly initialized [Fig. 1(e)]. At each VA step, we minimize $A_E + R_f A_M$ over state variable trajectories $X(t)$ and parameters $p$ given $R_f$ using IPOPT, an optimization package that uses a gradient-descent method.^{57} The state variable trajectories $X_{\text{ini}}(t)$ are initialized as $Y(t)$ for the measured states and as random values from a uniform distribution within specified bounds for the unmeasured states. Since we expect the parameter vector $p$ to be sparse, it is initialized as $p_{\text{ini}}=0$. Because there are many local minima, we choose $N_I=100$ random initializations to fully explore the landscape. The annealing part of VA works by varying $R_f$, which sets the balance between the model error and the measurement error [Fig. 1(d)]. When $R_f=0$, only the measurement error contributes, leading to a convex cost function with an easy-to-find global minimum. As the model is enforced by gradually increasing $R_f$, the landscape increases in complexity and many local minima appear. By initializing each search near the minimum, $\{X^{(\beta)},p^{(\beta)}\}$, found for the previous $R_f$, the solution remains near the first minima, which best fit the data. In practice, the initial $R_f$ takes some small value $R_{f,0}=\epsilon$, as $R_f=0$ would lead to an unconstrained solution on the unmeasured states and $p$. At each step $\beta=0,1,2,\ldots,\beta_{\max}$ of VA, $R_f$ is updated to $R_f=R_{f,0}\alpha^{\beta}$ for $\alpha>1$. After each step $\beta$ of VA, we enforce sparsity by applying a hard threshold, $\lambda$, to $p^{(\beta)}$. We choose $\beta_{\max}$ so that the cost function plateaus [Fig. 2(a)], and our final solution is $\{X_{\text{fin}},p_{\text{fin}}\}$.
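The annealing schedule and the hard-thresholding step can be sketched as follows. Function names are illustrative assumptions; the inner $A_E + R_f A_M$ minimization, performed with IPOPT in the paper, is not reproduced here.

```python
def hard_threshold(p, lam):
    """Zero out parameters whose magnitude is below the threshold lam.
    Thresholded terms are re-initialized at zero but may re-enter later."""
    return [0.0 if abs(pj) < lam else pj for pj in p]

def annealing_schedule(Rf0, alpha, beta_max):
    """Model-error weights R_f = R_{f,0} * alpha**beta for beta = 0..beta_max."""
    return [Rf0 * alpha ** beta for beta in range(beta_max + 1)]

# The first few R_f values for alpha = 1.1, and one thresholding pass on a
# hypothetical parameter vector.
print(annealing_schedule(1e-4, 1.1, 3))
print(hard_threshold([0.02, -9.8, 0.31, 2.6], lam=0.4))  # → [0.0, -9.8, 0.0, 2.6]
```

At each annealing step the optimizer is warm-started from the previous minimum, the next $R_f$ is taken from the schedule, and the threshold is applied to the resulting parameters, which is the loop Algorithm 1 formalizes.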

As the desired model complexity is unknown ahead of time, DAHSI sweeps through $\lambda$, repeating the iterative VA and thresholding procedure [Fig. 1(e)]. Increasing the hard threshold $\lambda$ leads to model structures, or candidate models, with fewer terms. We chose the iterative thresholding framework over direct incorporation of the $\ell_1$ penalty into the minimized cost function based on results showing that least squares with thresholding converges locally, often outperforming convex variants,^{58,59} and recent demonstrations that LASSO makes mistakes early in the recovery pathway.^{60} As the penalty strength, $\lambda$, increases, the global minimum moves to zero in a larger number of parameters. For one particular initialization, with $\lambda=0.35$, the term $x^2$ is selected in the third equation of the system. With larger $\lambda=0.4$, the term $x^2$ is no longer selected [Fig. 2(b)]. Parameters thresholded to zero are initialized as zero in the next VA iteration, but the terms are not removed entirely. We found that allowing terms to re-enter the equations yielded higher recovery rates, defined as the fraction of correctly selected models across all random initializations, $N_I$. It is important to note that the same $\lambda$ yields multiple models due to the $N_I$ different initializations of the unmeasured states. For example, for $N_I=100$ with a fixed $\lambda=0.4$, we found a total of seven unique model structures [Figs. 2(a) and 2(c)]. Although the same $\lambda$ yields multiple models, higher values of $\lambda$ consistently produce models with fewer active terms [Fig. 2(c)]. For the $\lambda$ range that we considered, we have a total of 64 unique models [Figs. 2(a) and 2(c)].

Algorithm 1. DAHSI.

1: procedure DAHSI

2: Input: measurements $Y$, generic model library $\Theta$, $\lambda_{\max}$, $\beta_{\max}$, $\alpha$

3: Calculate discrete function $\hat{F}$ from $\Theta$

4: for $l = 1:L$ do

5: $x_l = y_l$ ▷ Fit measurements to data

6: for $i = 1:N_I$ do ▷ Do $N_I$ initializations

7: Randomly initialize unobserved variables $\{x_{L+1},\ldots,x_D\}$

8: $X_{\text{ini}} = \{x_1, x_2, \ldots, x_L, x_{L+1}, \ldots, x_D\}$

9: Initialize $p_{\text{ini}} = 0$ ▷ Force sparsity

10: Assemble pair $\{X_{\text{ini}}, p_{\text{ini}}\}$

11: $R_{f,0} = \epsilon$

12: while $\lambda < \lambda_{\max}$ do

13: for $\beta = 0:\beta_{\max}$ do ▷ Variational annealing

14: $R_f = R_{f,0}\alpha^{\beta}$

15: $\{X^{(\beta)}, p^{(\beta)}\} = \min_{X,p} A_E(X,Y) + R_f A_M(X,p,\hat{F})$ ▷ Minimize via IPOPT

16: if $|p_{k,j}^{(\beta)}| < \lambda$ then ▷ Hard-threshold $p$

17: $p_{k,j}^{(\beta)} = 0$

18: model$(\lambda) \leftarrow p^{(\beta)}$ ▷ Store models

19: $\lambda = 2\lambda$ ▷ Increase $\lambda$

20: Parameter estimation using VA on selected models.


### E. Validation using information criteria

The sparse-variational annealing process can produce a large number of candidate models. Optimally, we would validate all models on out-of-sample test data to statistically compare and rank the most likely models, but this can become computationally intensive. To down-select candidate models before validation, note that the cost function values have a gap of at least an order of magnitude [Fig. 1(f)]. We suggest performing validation only on those models that best fit the data, meaning those below the gap. If no clear gap exists, then all models may be candidates for validation. For the current case, we down-selected models with a cost function value less than $10^{-3}$, leaving 17 candidate models to validate out of the 64 models found in the DAHSI procedure. To ensure we had the best parameter fit for each down-selected model, we performed parameter estimation on the selected candidate models via VA without the sparsity constraint.

To validate the models, we needed to estimate an initial condition for the hidden variable $y$. We used an 8th-order finite-difference approximation of the time derivative of $x$ for each model structure and solved the resulting algebraic equation for $y_0$ (Sec. S3 in the supplementary material). We used the dynamic equation for $x$ since all down-selected models contain $y$ but no higher-order $y$ terms. Estimation of the initial condition for hidden variables is only possible after the candidate models are found and must be done for the initial condition of each segment of validation data. This procedure takes advantage of Takens’s theorem: the information in $y$ is available in the time-delay of $x$.
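A minimal sketch of this initial-condition estimate for the Lorenz case, using an 8th-order central difference and the identified equation $\dot{x}=\sigma(y-x)$ solved for $y$. For illustration only, we use the ground-truth $\sigma=10$ and noise-free simulated data; in practice $\sigma$ comes from the VA parameter estimate, and the estimate is repeated at the start of each validation segment.

```python
def lorenz_rhs(s, sigma=10.0, rho=28.0, beta=8.0/3.0):
    x, y, z = s
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

def rk4_step(f, s, dt):
    k1 = f(s)
    k2 = f(tuple(si + 0.5 * dt * ki for si, ki in zip(s, k1)))
    k3 = f(tuple(si + 0.5 * dt * ki for si, ki in zip(s, k2)))
    k4 = f(tuple(si + dt * ki for si, ki in zip(s, k3)))
    return tuple(si + dt / 6.0 * (a + 2 * b + 2 * c + d)
                 for si, a, b, c, d in zip(s, k1, k2, k3, k4))

def deriv8(f, i, dt):
    """8th-order central finite difference for df/dt at index i."""
    c = (4/5, -1/5, 4/105, -1/280)
    return sum(ck * (f[i + k] - f[i - k]) for k, ck in enumerate(c, start=1)) / dt

dt = 0.01
s = (1.0, 1.0, 1.0)
for _ in range(1000):          # burn-in onto the attractor
    s = rk4_step(lorenz_rhs, s, dt)
traj = [s]
for _ in range(8):             # nine points: stencil i-4..i+4 around index 4
    traj.append(rk4_step(lorenz_rhs, traj[-1], dt))

xs = [p[0] for p in traj]
# From the identified equation dx/dt = sigma*(y - x), solve for the hidden y.
y_est = deriv8(xs, 4, dt) / 10.0 + traj[4][0]
y_true = traj[4][1]
```

As in the paper, the first four points of the segment are consumed by the stencil, so the usable validation window starts at the center of the nine-point stencil.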

Validation within the Lyapunov time ensures that the time series do not diverge due to the inherent sensitivity to differences in initial conditions introduced by measurement and numerical error. All down-selected models have a similar Lyapunov time of around 1.2 time units. We considered $S=1000$ segments of the simulated time series (excluding the training set), each of length 1/4 of a Lyapunov time, to calculate the sum of the average error for each model. We discarded the first four points of each time segment, as these points were used to predict the initial condition for $y$. The average error for the $s$th time segment of the $m$th model is defined as $E_{\mathrm{av},m}^{s}=\frac{1}{2M}\sum_{i=1}^{M}\left[(x_{i,e}^{s}-x_{i}^{s})^{2}+(z_{i,e}^{s}-z_{i}^{s})^{2}\right]$, where $x_{i,e}$ and $z_{i,e}$ are the $x$ and $z$ components of the simulated time series, respectively, and $i$ is the time index. The sum of all average errors over the $S$ time segments for the $m$th model is $E_{\mathrm{av},m}=\sum_{s=1}^{S}E_{\mathrm{av},m}^{s}$. We used this error to calculate the Akaike information criterion (AIC), as done in Mangan *et al.*^{61} (Sec. S4 in the supplementary material), and found that the model with the most support is the ground-truth model [Figs. 1(g) and 1(h)]. Other model comparison techniques developed for chaotic systems^{23,24} could also be used.
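As a sketch of the ranking step, the following uses one standard least-squares form of the AIC, $n\ln(E/n)+2k$, where $k$ is the number of active terms. The paper follows the AIC variant of Mangan *et al.* (including a finite-sample correction), and the validation errors below are invented purely for illustration.

```python
import math

def aic(n_obs, total_error, k_params):
    """One standard least-squares AIC: n*ln(E/n) + 2k."""
    return n_obs * math.log(total_error / n_obs) + 2 * k_params

# Hypothetical total validation errors E_av,m and term counts k for three
# candidate model structures.
candidates = {"7-term (expected structure)": (0.01000, 7),
              "8-term": (0.00999, 8),
              "10-term": (0.00998, 10)}
n = 1000
scores = {name: aic(n, e, k) for name, (e, k) in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # → 7-term (expected structure)
```

With nearly identical errors, the $2k$ complexity penalty decides the ranking, which is how AIC favors the parsimonious structure over the extra-term variants.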

## III. RESULTS: MODEL SELECTION FOR CHAOTIC SYSTEMS

### A. Simulated Lorenz system

Details of the successful recovery of the Lorenz system from simulated time series using DAHSI are in Secs. II C and II D. For a system where the ground-truth model is known, success can be measured in terms of recovery of the correct model structure and accuracy of the parameters estimated for that structure. For $\alpha=1.1$, $\beta_{\max}=207$, and $\lambda=0.3:0.05:0.5$, DAHSI had a 90% recovery rate for $\lambda=0.35, 0.4, 0.45$. Using AIC, we successfully identified the correct expected model. Eight other models also had some support [Fig. 1(g)].

The final estimated parameters for the identified model [Fig. 1(h)] have a relative $L_2$ error of 0.4431 when compared to those of the ground-truth model [Fig. 1(a)]. The overall manifold structure is still captured by this parameter set. The discrepancy between true and estimated parameters is likely due to missing information from the hidden variable.

### B. Robustness study on the simulated Lorenz system

To study the robustness of our method, we use the same ground-truth time series simulated from the classic Lorenz system [Eqs. (4)–(6)]. We studied the recovery rates of DAHSI as a function of the VA tuning parameter, $\alpha$, and found trends similar to previous work^{62} (Table S1 in the supplementary material). To study the noise-robustness of our method, we modeled measurement error as additive Gaussian noise with mean zero and varying standard deviation, $\omega$. Therefore, the measurement function is $h(X)=X_{1,\ldots,L}+\mathcal{N}(0,\omega)$. We expect that different noise instances, controlled by the random number generator seed, will change our recovery rate due to random corruption of essential parts of the data or overall poor manifold sampling.

We calculated recovery for three different noise standard deviations with 20 noise seeds each and calculated the cumulative distribution function of the recovery rate (Fig. 3). The random noise seeds with the lowest standard deviation, $\omega=0.01$, produced a wide variation in the recovery rate, between 10% and 90%, indicating that the minimal data set used here is not very robust. As the noise strength increased, the cumulative distributions shifted left as more seeds had lower recovery. Setting $\omega=0.01$ produced a bimodal distribution, with either a high recovery rate ($>80$%, the majority of simulations) or a low recovery rate ($<15$%). The recovery rate was always non-zero, and therefore, the ground-truth model was always found. For $\omega=0.05$, there were some seeds with intermediate recovery rates (10%), more with low recovery rates (65%), and a few with a very high recovery rate (20%). One seed had a 0% recovery rate. The noise level dramatically affected the recovery rate for $\omega=0.1$. The vast majority of simulations led to less than 10% recovery. More than half (11) had 0% recovery, and only one had higher than 85% recovery. Only one instance of recovery is needed for success. For higher noise, either a much larger number of initializations or a larger number of data points may improve recovery, but further work is needed to ensure computational efficiency.

We investigated how manifold sampling affects the recovery rate of our system but were unable to find any significant trends. We varied the number of time points from $N=300$ to $500$ and, in a separate numerical experiment, added additional noise to the manifold on either the crossing between lobes or on the wings of the attractor. We anticipated that low sampling or increased noise at specific points on the manifold, such as the crossing between lobes, could result in lower recovery compared to data loss on other parts of the manifold. However, the recovery rates appeared to be dominated by the randomness of the noise instance, with some instances showing no change, some a decrease, and others an increase in recovery for lower $N$ or increased noise (Sec. S6 in the supplementary material). The instances with increased recovery at higher noise and lower sampling are counterintuitive but may suggest an optimal balance of sampling different regions of the manifold. Further research is needed to identify optimal sampling strategies to overcome noise corruption and reduce computational cost.

### C. Colpitts oscillator

The Colpitts oscillator is a nonlinear system describing an electrical circuit that exhibits chaotic behavior.^{63} We tested DAHSI with synthetic time-series generated by the Colpitts oscillator to demonstrate performance with non-polynomial terms in the library. The Colpitts equations are

$$\frac{dx}{dt} = \alpha z, \tag{7}$$

$$\frac{dy}{dt} = \eta\left(1 + z - e^{-x}\right), \tag{8}$$

$$\frac{dz}{dt} = -\gamma(x + y) - q z. \tag{9}$$

The parameters are $\alpha = 5$, $\eta = 6.2723$, $\gamma = 0.0797$, and $q = 0.6898$.^{52} We generated a time-series from the Colpitts oscillator of $N = 1501$ time points with $\Delta t = 0.02$ and no noise. First, we performed model selection with all variables measured to demonstrate recovery of the exponential-type term in Eq. (8). We ran DAHSI with four different libraries used for all state variables:

$\Theta_1 = \{1, x, y, z, x^2, xy, xz, y^2, yz, z^2, e^{-x}\}$;

$\Theta_2 = \{1, x, y, z, x^2, xy, xz, y^2, yz, z^2, e^{ax}\}$;

$\Theta_3 = \{1, x, y, z, x^2, xy, xz, y^2, yz, z^2, e^{ax}, e^{by}\}$; and

$\Theta_4 = \{1, x, y, z, x^2, xy, xz, y^2, yz, z^2, e^{ax+by}\}$,

where we must estimate the coefficient in front of each term as well as the parameters $a$ and $b$. All coefficients and parameters are under sparse selection. For this test case, we know the library must contain an exponential with $x$ dependence in the $dy/dt$ equation because we know the ground truth. One could conceivably generate a similar library based only on the knowledge that the system is an electronic circuit, where $x$ and $y$ are voltages and may therefore influence other terms in an exponential fashion. The library $\Theta_4$ is an especially interesting case, as it tests DAHSI’s ability to select between terms within nonlinear functions. Sparse-regression-based methods cannot select between terms *within* a nonlinear function even when all state variables are measured.^{35}
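As a sketch (not the authors' code), the synthetic Colpitts data and the parameterized library $\Theta_4$ can be set up as follows; the dimensionless Colpitts form below is an assumption, chosen to be consistent with the coefficients reported in Table I:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Assumed dimensionless Colpitts form (consistent with Table I):
# dx/dt = alpha*z, dy/dt = eta*(1 + z - exp(-x)), dz/dt = -gamma*(x+y) - q*z
ALPHA, ETA, GAMMA, Q = 5.0, 6.2723, 0.0797, 0.6898

def colpitts(t, s):
    x, y, z = s
    return [ALPHA * z,
            ETA * (1.0 + z - np.exp(-x)),
            -GAMMA * (x + y) - Q * z]

# N = 1501 noise-free points at dt = 0.02, as described in the text.
t_eval = np.arange(1501) * 0.02
sol = solve_ivp(colpitts, (0.0, t_eval[-1]), [0.1, 0.2, 0.1],
                t_eval=t_eval, rtol=1e-9, atol=1e-9)
X = sol.y.T  # (1501, 3) synthetic measurements

def theta4(state, a, b):
    """Candidate library Theta_4 at one state point; the exponent
    parameters a and b enter nonlinearly and must be estimated
    jointly with the coefficients."""
    x, y, z = state
    return np.array([1.0, x, y, z, x**2, x*y, x*z, y**2, y*z, z**2,
                     np.exp(a * x + b * y)])

# With a = -1 and b = 0, the last feature reduces to exp(-x), the true
# nonlinearity: estimating (a, b) selects terms *within* the exponential.
feats = theta4(X[0], a=-1.0, b=0.0)
```

This illustrates why $\Theta_4$ is the interesting case: a fixed sparse-regression library would need a separate column for every candidate exponent, whereas here the selection happens through the estimated values of $a$ and $b$.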

Assuming all three variables are measured, we recover the true system for all four libraries for 100% of the seeds (Table I). This result demonstrates that the DAHSI framework can be effective for exponential as well as polynomial libraries. For the library $\Theta_4$, DAHSI selects $e^{ax+by}$ as present in the equation for $\dot{y}$ and correctly recovers $a = -1$ and $b = 0$, selecting $x$ within the exponential but not $y$.

TABLE I. Coefficients recovered by DAHSI for the Colpitts system with all three variables measured, for each library $\Theta_1$–$\Theta_4$.

| Term | $\Theta_1$ eq. $\dot{x}$ | $\Theta_1$ eq. $\dot{y}$ | $\Theta_1$ eq. $\dot{z}$ | $\Theta_2$ eq. $\dot{x}$ | $\Theta_2$ eq. $\dot{y}$ | $\Theta_2$ eq. $\dot{z}$ | $\Theta_3$ eq. $\dot{x}$ | $\Theta_3$ eq. $\dot{y}$ | $\Theta_3$ eq. $\dot{z}$ | $\Theta_4$ eq. $\dot{x}$ | $\Theta_4$ eq. $\dot{y}$ | $\Theta_4$ eq. $\dot{z}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 6.2723 | 0 | 0 | 6.2723 | 0 | 0 | 6.2722 | 0 | 0 | 6.2723 | 0 |
| $x$ | 0 | 0 | −0.0797 | 0 | 0 | −0.0797 | 0 | 0 | −0.0797 | 0 | 0 | −0.0797 |
| $y$ | 0 | 0 | −0.0797 | 0 | 0 | −0.0797 | 0 | 0 | −0.0797 | 0 | 0 | −0.0797 |
| $z$ | 4.9999 | 6.2723 | −0.6898 | 4.9999 | 6.2723 | −0.6898 | 4.9999 | 6.2723 | −0.6898 | 5 | 6.2723 | −0.6898 |
| $x^2$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| $xy$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| $xz$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| $y^2$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| $yz$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| $z^2$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| $e^{-x}$ | 0 | −6.2723 | 0 | | | | | | | | | |
| coeff. of $e^{ax}$ | | | | 0 | −6.2723 | 0 | | | | | | |
| $a$ in $e^{ax}$ | | | | 0.1403^a | −1 | 0 | | | | | | |
| coeff. of $e^{ax}$ | | | | | | | 0 | 6.2723 | 0 | | | |
| $a$ in $e^{ax}$ | | | | | | | −0.0319^a | −1 | 0 | | | |
| coeff. of $e^{by}$ | | | | | | | 0 | 0 | 0 | | | |
| $b$ in $e^{by}$ | | | | | | | 0.7202^a | 0.0769^a | 0 | | | |
| coeff. of $e^{ax+by}$ | | | | | | | | | | 0 | −6.2723 | 0 |
| $a$ in $e^{ax+by}$ | | | | | | | | | | 0.1169^a | −1 | 0 |
| $b$ in $e^{ax+by}$ | | | | | | | | | | 0.3806^a | 0 | 0 |


^a Parameters are non-zero but appear in an exponential term that is not selected.

We tested DAHSI on the Colpitts system with one hidden variable ($z$), using $\Theta_4$ as our library. With one full iteration of DAHSI, we were not able to recover the ground-truth structure. For $\lambda = 0.025$, DAHSI recovered nearly the correct equations (7)–(9), but with an additional constant term in the $x$ equation, so that $dx/dt \approx -0.0006 - 7.5583\,z$. The relative $L_2$ error of all the other parameters, compared to the ground-truth values, is 1.676. Unlike the Lorenz system, whose terms have similar magnitudes, the coefficients in this system span two orders of magnitude, from $\approx 7 \times 10^{-2}$ to $\approx 5$. We suspect that this difference in coefficient size makes it difficult for a single threshold value, $\lambda$, to work across all terms. Within the model-selection iteration and before thresholding, DAHSI estimated the coefficient of the constant term as $p_{1,1} = -0.3026$, while the smallest term (in absolute value) in the true system is estimated as $\gamma \approx p_{3,3} = -0.0504$. Therefore, DAHSI was not able to threshold out the incorrect term while keeping the true term in the model structure. Notably, when we perform parameter estimation on the recovered structure (the last step of the DAHSI algorithm), $p_{1,1} = -0.0006$, which is much smaller than the value found during intermediate DAHSI iterations with the whole library.
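A toy numerical sketch (with values loosely modeled on the estimates quoted above, not actual DAHSI output) shows why no single hard threshold works when coefficients span two orders of magnitude:

```python
# Hard thresholding with disparate coefficient sizes: the spurious
# constant (estimated near -0.3026) is larger in magnitude than the
# smallest true term (near -0.0504), so no single lambda can remove
# the spurious term while keeping the true one.
estimates = {"1 (spurious)": -0.3026, "y (true)": -0.0504, "z (true)": -7.5583}

def hard_threshold(coeffs, lam):
    """Keep only coefficients at or above the threshold in magnitude."""
    return {k: v for k, v in coeffs.items() if abs(v) >= lam}

kept_small_lam = hard_threshold(estimates, 0.1)   # keeps the spurious constant
kept_large_lam = hard_threshold(estimates, 0.31)  # drops the small true term too
```

Any $\lambda$ small enough to keep the true $y$ term keeps the spurious constant, and any $\lambda$ large enough to drop the constant drops the true term first.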

We are able to recover the true model structure if we run two iterations of DAHSI. Recall that DAHSI keeps all possible terms in the library at each optimization iteration and initializes the parameters that were below the threshold to zero in the next run of IPOPT. This allows removed terms to re-enter the equations, which can correct for errors if terms are removed early on. If we instead take the model recovered from the first iteration of DAHSI and use it as the initial library in a second iteration of model selection, we recover the Colpitts oscillator equations. Re-running DAHSI on recovered model structures may therefore be a way to test the robustness of recovered models and determine whether additional terms could be selected out. For systems where the ground truth is not known, validation of the different model structures found in both iterations is essential. We discuss other possible modifications to DAHSI that may improve robustness and recovery in Sec. IV.

### D. Identification of models for the Lorenz circuit from experimental data

To demonstrate model selection on experimental data with hidden variables, we considered high-quality data from the electrical circuit in Blakely *et al.*^{64} [Fig. 4(a)]. This system exhibits a similar structure and behavior to the highly studied Lorenz system (see Sec. II D) and is well described by relatively simple circuit equations,

$$\frac{dx}{dt} = \hat{\sigma}(y - x), \tag{10}$$

$$\frac{dy}{dt} = \hat{\rho}\,x - \hat{\gamma}\,y - \hat{\eta}\,xz, \tag{11}$$

$$\frac{dz}{dt} = \hat{\epsilon}\,xy - \hat{\beta}\,z. \tag{12}$$
The structure of this system is similar to the Lorenz system, but when its parameters are derived from first principles for the circuit’s electrical components (as in Blakely *et al.*,^{64} Table II), they take different values from the traditional Lorenz formulation. In the standard Lorenz formulation, $\hat{\epsilon} = \hat{\eta} = \hat{\gamma}$. Here, $X = (x, y, z)$ denotes the voltages across the capacitors $C_1$, $C_2$, and $C_3$ in the circuit [Fig. 4(a)]. The measured variables are $x$ and $z$, and $y$ is unmeasured, or hidden. We denote the noisy measurements of $x$ and $z$, collected using an oscilloscope, by $x_e$ and $z_e$, respectively, and the measurement function is $h((x, y, z)) = (x_e, z_e) = Y$. The experimental sampling rate is $\Delta t_e = 80$ ns, resulting in $55\,000$ time points. A low-pass filter was applied to remove high-frequency measurement error.^{64} We re-scaled the experimental time by $\Delta t = \Delta t_e / (1.6 \times 10^{-5}\,\mathrm{s}) = 0.005$ so that the right-hand-side terms of (10)–(12) are $O(1)$. We trained our method with $N = 501$ time points [Fig. 4(a)] at a sampling rate of $2\Delta t = 0.01$ (re-scaled). The attractor is reasonably well sampled with 501 points (Fig. S8 in the supplementary material), and we retain the remaining data for validation. We sample the test data at the same rate as the training data, that is, $2\Delta t = 0.01$ (re-scaled), and therefore have roughly $25\,000$ time points for validation.
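The time re-scaling and train/validation split described above can be sketched as follows (with a placeholder array standing in for the filtered oscilloscope trace):

```python
import numpy as np

dt_e = 80e-9            # experimental sampling interval (80 ns)
dt = dt_e / 1.6e-5      # re-scaled step: 0.005 dimensionless time units

# Placeholder for the 55 000-point filtered voltage measurements.
x_e = np.zeros(55_000)

train = x_e[:1002:2]    # N = 501 training points at spacing 2*dt = 0.01
valid = x_e[1002::2]    # remaining data at the same rate, held out for validation
```

The same slicing would be applied to both measured channels, $x_e$ and $z_e$.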

TABLE II. Model parameters and validation metrics for the classical Lorenz structure, the circuit formulation,^{64} and the DAHSI-recovered models. “Standard” and “As in Blakely et al.^{64}” columns use literature parameters; “Estimated” columns use VA-estimated parameters.

| | Term | Parameter | Classical: standard | Classical: estimated | Circuit: as in Blakely et al.^{64} | Circuit: estimated | DAHSI: 7-terms | DAHSI: 10-terms | DAHSI: 11-terms |
|---|---|---|---|---|---|---|---|---|---|
| eq. $\dot{x}$ | 1 | p_{1,1} | | | | | | | −0.2514 |
| | x | p_{1,2} | −10 | −29.7560 | −12.9032 | −16.5369 | −16.9554 | −17.0172 | −17.0582 |
| | y | p_{1,3} | 10 | 29.7560 | 12.9032 | 16.5369 | 18.7853 | 19.9884 | 19.9840 |
| | z | p_{1,4} | | | | | | 0.1596 | 0.1833 |
| eq. $\dot{y}$ | x | p_{2,2} | 28 | 68.5427 | 54.2903 | 28.0876 | 24.3535 | 22.6028 | 22.6017 |
| | y | p_{2,3} | −1 | −12.8815 | −1.2903 | −0.0763 | 0.2580 | 0.3346 | 0.3567 |
| | xy | p_{2,6} | | | | | | −0.0906 | −0.0843 |
| | xz | p_{2,7} | −1 | −12.8815 | −14.2857 | −7.6252 | −6.7054 | −6.2507 | −6.2561 |
| eq. $\dot{z}$ | z | p_{3,4} | −2.6667 | −3.4168 | −3.8259 | −3.6547 | −3.6835 | −3.6954 | −3.6966 |
| | xy | p_{3,6} | 1 | 12.8815 | 3.4843 | 4.3315 | 4.8273 | 5.1412 | 5.1292 |
| | xz | p_{3,7} | | | | | | 0.0903 | 0.0791 |
| E_{av} | … | … | … | 2165 | 319 | 10.37 | 9.7441 | 9.0995 | 9.0345 |
| ΔAIC | … | … | … | 5920 | 3852 | 139.315 | 73.887 | 5.758 | 0 |
| ΔBIC | … | … | … | 5885 | 3827 | 114.377 | 53.937 | 0.77 | 0 |


We constructed a model library for three coupled variables with monomials up to degree two and used DAHSI to perform model identification on the experimental data. DAHSI generated 169 candidate models with unique structures, which were further down-selected and validated to complete the model-selection process. Noting a gap in the cost function between approximately $7 \times 10^{-4}$ and $2 \times 10^{-3}$ (similar to Sec. II D; see Fig. S9 in the supplementary material), we down-selected 25 models with low action values ($< 10^{-3}$) (Fig. S10 in the supplementary material). To ensure the best parameter fit for each down-selected model structure, we performed parameter estimation on the candidate models via VA without the sparsity constraint.

To validate the models, we approximate the initial condition for the hidden variable $y$ as we did in Sec. II D. The Lyapunov time of the down-selected models is around 0.9 time units. We considered $S=1083$ segments of the experimental data (excluding the training set), each of length 1/4 of a Lyapunov time to calculate the sum of the average error for each model.
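A minimal sketch of this segment-wise validation follows; the error metric and function names here are illustrative stand-ins, not the paper's exact definitions:

```python
import numpy as np

def average_segment_error(predict, data, seg_len):
    """Mean prediction error over consecutive short segments.

    predict(x0, n) -> length-n model trajectory started from x0.
    Each segment is kept short (~1/4 Lyapunov time in the text), so
    predictions are compared only over a horizon where sensitivity to
    initial conditions has not yet destroyed forecast skill.
    """
    errs = []
    for s in range(len(data) // seg_len):
        seg = data[s * seg_len:(s + 1) * seg_len]
        errs.append(np.mean((predict(seg[0], seg_len) - seg) ** 2))
    return float(np.mean(errs))

# Example: a trivial "persistence" model that holds the initial value.
persistence = lambda x0, n: np.full(n, x0)
err = average_segment_error(persistence, np.arange(10.0), seg_len=5)
```

In the actual validation, `predict` would integrate a candidate model from an initial condition that includes the estimated hidden variable $y_0$ for each segment.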

The identified candidate models along the Pareto front best balance model complexity and error [Figs. 4(c) and 4(d) and Fig. S10 in the supplementary material]. We successfully recovered the Lorenz-like structure derived by Blakely *et al.*,^{64} which has the lowest average error among recovered models with seven active terms. For $\lambda = 3.9$, the system presented in Blakely *et al.*^{64} is selected for 10.6% of the $N_I = 500$ randomly chosen initializations. However, we have no guarantee that this model is the “true model” for the circuit system. All identified models have a similar manifold structure [Fig. 4(c)] and low error within a Lyapunov time [Fig. 4(d)]. We believe the main limitation of our prediction window is the uncertainty that the hidden variable introduces into parameter estimation during VA. This uncertainty then propagates into the $y_0$ estimate required for each validation data set and magnifies noise (Figs. S2 and S3 in the supplementary material). Given the difficulty of selecting between proposed chaotic models that exhibit such similar behavior,^{65,66} the primary goal of DAHSI as a model-identification method is to generate possible models. Nevertheless, using the Akaike information criterion, as in Mangan *et al.*,^{61} and the Bayesian information criterion (Sec. S4 and Fig. S11 in the supplementary material), we consistently identified a unique model with the most support [salmon with 11 terms, Fig. 4(d)], as well as a weakly supported model [gold with 10 terms, Fig. 4(d)]. By identifying multiple models, DAHSI has effectively generated hypotheses for additional terms, which could be tested with further experimentation.

While DAHSI identified the same equation terms as the Lorenz and circuit formulations from Ref. 64, the parameters fit in the final step of VA are not the same. We compare the ability to predict the experimental data among the classical Lorenz-system structure, the circuit formulation from Ref. 64, and the DAHSI-recovered models, each of which has a different number of free parameters (Table II). We perform parameter estimation via VA for each model and use the validation data set described above to calculate the average error $E_{av}$, $\Delta$AIC, and $\Delta$BIC. We also calculate the performance of the circuit formulation from Ref. 64 with its original, first-principles parameters. A model with the standard parameters of the classical Lorenz system [Eqs. (4)–(6)] is not compared, as that structure performs poorly even with VA-estimated parameters.

Although the average error is similar for the VA-estimated circuit formulation and all DAHSI models, the DAHSI-recovered models with 10 and 11 terms have substantially more $\Delta$AIC and $\Delta$BIC support. The classical Lorenz structure, with four free parameters estimated by VA, is unable to capture the dynamics of the system. The parameters estimated via VA for the circuit model with six free parameters reduce the error by more than an order of magnitude compared to those estimated from first principles.^{64} Notably, the parameters estimated for the seven-term DAHSI model are very close to those of the original circuit model, where $p_{1,2} = p_{1,3}$ is enforced, with a relative difference of 0.1245. Further experimentation is needed to determine whether the coefficients in the $\dot{x}$ equation should be equal, $p_{1,2} = p_{1,3}$, and whether the coefficient on the $y$ term in the $\dot{y}$ equation, $p_{2,3}$, should be positive, negative, or zero [Fig. 4(c)]. The additional terms suggested by the 10- and 11-term DAHSI-recovered models are strongly supported by the AIC/BIC calculations but would require further experimentation to validate conclusively. They may represent parasitic resistances or other physical effects that have a small but real impact on the circuit dynamics and were neglected in the original derivation by Blakely *et al*.^{64} Recovery of the Lorenz-like model and identification of other models with AIC/BIC support demonstrate that DAHSI can successfully identify parsimonious models for chaotic systems.
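The $\Delta$AIC/$\Delta$BIC ranking used above can be sketched with the common least-squares information-criterion forms; the paper's exact variant may differ, so the values below are illustrative, not Table II's numbers:

```python
import numpy as np

def delta_ic(avg_errors, n_params, n_obs):
    """Relative AIC and BIC across candidate models.

    Uses the common least-squares forms AIC = n*ln(E) + 2k and
    BIC = n*ln(E) + k*ln(n); lower delta means more support, and the
    best-supported model has delta = 0.
    """
    e = np.asarray(avg_errors, dtype=float)
    k = np.asarray(n_params, dtype=float)
    aic = n_obs * np.log(e) + 2.0 * k
    bic = n_obs * np.log(e) + k * np.log(n_obs)
    return aic - aic.min(), bic - bic.min()

# Three hypothetical models: with enough observations, a modest error
# reduction can outweigh the penalty for extra parameters.
d_aic, d_bic = delta_ic([10.37, 9.74, 9.03], [6, 7, 11], n_obs=1000)
```

This captures the qualitative behavior in Table II: models with similar errors can still be separated decisively once the parameter-count penalty is included.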

## IV. DISCUSSION

In this paper, we presented DAHSI, a method to identify non-linear dynamical systems from data with hidden variables. DAHSI combines variational annealing, a data-assimilation technique, with sparse thresholding. We applied DAHSI to simulated time-series from the classic Lorenz system, simulated time-series from a Colpitts oscillator, and an experimental data set from a circuit that exhibits Lorenz-like dynamics.^{64} For the classic Lorenz system, DAHSI successfully recovered the ground-truth model with one hidden variable. DAHSI recovers the Colpitts oscillator when all variables are measured from a library with non-additive nonlinear functions but makes mistakes when one variable is hidden; the correct Colpitts model structure was recovered after a second iteration of DAHSI. For the experimental data, the outcome is a set of candidate models, including a model with the same Lorenz-like structure derived by Blakely *et al.* from circuit equations.^{64} Two additional parsimonious models with strong support from AIC/BIC-based validation were also identified. The unanticipated terms suggested by these models may represent real physical processes in the circuit, such as parasitic resistances or other factors not included in the idealized model derivation. Through this example, we demonstrated that DAHSI is an effective tool for generating models and functional hypotheses from data.

Our noise studies showed recovery rates of 80% for noise $\sim\mathcal{N}(0, 0.01)$ and 10% or lower for $\sim\mathcal{N}(0, 0.1)$. We therefore anticipate that the current formulation of DAHSI will have reasonable recovery rates for noise levels below 10% of the signal value. When combined with a validation step, which promotes the most parsimonious models regardless of their frequency of recovery, a low recovery rate may be tolerable as long as recovery does not drop to zero. Improved robustness to noise could be achieved through integral formulations similar to those used for sparse regression, rather than the discretized mapping between time points used here.^{36–38} The noise seed used to generate the synthetic data strongly impacts recovery in our numerical experiments. At the explored noise levels, recovery rates depend seemingly randomly on the time-series length, which effectively varies manifold sampling, and on targeted corruption of different regions of the manifold. These results suggest a nontrivial dependence on manifold sampling. For chaotic systems, increasing the duration of the experiment will eventually ensure robust sampling of the manifold. However, the computational time of DAHSI scales with the length of the input time-series.^{67} We therefore hypothesize that short bursts of time-series designed to optimally sample the manifold would balance sampling quality and computational efficiency. Further metrics for analyzing the information content of the data and the minimal data requirements for recovering models^{68} could lead to optimal manifold sampling.

One of the main benefits of a sparse model-selection framework is that we identify likely model structures while avoiding combinatorial testing and validation of all potential models. For example, the number of possible models describing three variables with monomials up to degree two is approximately $10^9$. Performing parameter estimation and validation on each of these models would be computationally intractable, taking at least $10^5$ processor-days with our setup. For comparison, our entire model selection and validation process for the experimental data took just over a day of computational time. Running one initialization of the problem and sweeping through $\lambda = 2.5:0.1:5.5$ with $N = 501$ (as done in the example in Sec. III D) took 4 h. We parallelized simulations using Northwestern’s High Performance Computing Cluster Quest, running about 100 simulations at a time, leading to a total computational time of roughly 20 h. Performing parameter estimation without thresholding on a single model takes between 15 s and 15 min, depending on the model structure. Parameter estimation on the 25 down-selected models took 5 h with our setup. Time estimates are for an Intel® Xeon® CPU E5-2680 v4 @ 2.40 GHz processor. To understand the impact of library size on a call to IPOPT, the optimizer used in DAHSI, we tested model libraries with 7, 10, 13, 16, 19, and 30 terms (Fig. S12 in the supplementary material). The computational time does not scale monotonically with library size; instead, we find that a library with 10 terms can take 100 times longer to run than the library of 30 monomials. We suspect that the variation in optimization time depends on correlations between library functions,^{69} model symmetries, and other structural features.
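The $\sim 10^9$ figure quoted above can be checked directly: three variables with monomials up to degree two give 10 candidate terms per equation, and each term is independently present or absent:

```python
from math import comb

n_vars, max_deg = 3, 2
# Number of monomials of degree <= 2 in 3 variables: C(3 + 2, 2) = 10
n_terms = comb(n_vars + max_deg, max_deg)
# Independent on/off choice for each term in each of the 3 equations.
n_models = (2 ** n_terms) ** n_vars   # 2**30, about 1.07e9
```

This is the combinatorial search that the sparse-selection framework avoids enumerating explicitly.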

In the case of the Colpitts system, one iteration of DAHSI is unable to recover the true system with hidden variables. We can recover the Colpitts system if we run an additional iteration of DAHSI on the models recovered from the initial library (Sec. S10 in the supplementary material). Therefore, multiple iterations of DAHSI may be required to guarantee a robust search of model structures near the Pareto front. The failure in a single DAHSI iteration is due to the difference in size between the parameters in the system. It is possible that the optimal manifold-sampling strategies mentioned above would better resolve the small term. Similar issues for systems with disparately sized parameters have been observed for sparse-regression methods, and re-scaling methods have been developed to avoid them.^{32,70} A recent method, SR3,^{58} effectively re-scales the data to improve recovery and robustness but has the disadvantage of introducing an extra tuning parameter in addition to $\lambda$, which increases computational time. Integrating one of these re-scaling strategies with DAHSI could provide a more robust search of the Pareto front when applied to systems with both large and small parameters, avoiding the need for iterative application.

In addition to the chaotic systems presented in the results, we have applied DAHSI to two non-chaotic systems: time-series data from a Lotka–Volterra-like system with no hidden variables and simulated time-series for mass-action kinetics systems with hidden variables. Although DAHSI recovered reasonable models for both systems, there are several caveats. Recovery of Lotka–Volterra required an iterative formulation similar to that used for the Colpitts system (Fig. S13 in the supplementary material). We also compared DAHSI to SINDy^{35} for the Lotka–Volterra system and found that SINDy was far superior in speed when all variables are observed. Recall that a comparison between DAHSI and SINDy is not possible for the Lorenz-like circuit system, as SINDy requires access to the unmeasured $y$ variable. The mass-action kinetic system modeled a semiconductor with two trap levels differing by one electronic unit of charge (Sec. S12 in the supplementary material). The recovery rate for the ground-truth model was low, around 3%. Unlike chaotic systems, whose cost functions are highly non-convex, the mass-action kinetic system has a very flat cost function due to structural parameter-identifiability issues (Figs. S17–S20 in the supplementary material).^{71–74} Gradient-based optimizers, such as the interior-point solver IPOPT, are known to perform poorly on flat cost functions; switching to an optimizer designed for such systems^{75} may therefore improve recovery. Other data-assimilation methods for parameter estimation with hidden variables, such as 3D-Var, 4D-Var, Kalman filtering, and hybrid methods,^{21} may be more cost-effective when VA is unnecessary to navigate to the global minimum of a highly non-convex function.

The formulation of the cost function and sparsity constraint also likely impacts recovery. Methods for sparse model selection include stepwise and all-subsets regression, ridge regression,^{76} LASSO,^{32} least-angle regression,^{77} and SR3.^{58} SR3 accelerates convergence and has been shown to outperform the other methods but requires an extra tuning parameter. The parameter paths of the first four methods differ,^{34} and we therefore expect different regularization methods to lead to different model identification. Comparing sparsity-enforcement mechanisms within the DAHSI framework could improve recovery, though any gains may be somewhat system dependent.

We anticipate many future applications and extensions of DAHSI. The DAHSI framework places no intrinsic restrictions on the functional form of the equations; in particular, the function library need not be linear in the unknown parameters. Variational annealing is designed to handle stochasticity through the model error. In addition, data assimilation is commonly used for partial differential equation (PDE) systems, including PDE discovery.^{78} We therefore anticipate that our framework can be applied or extended to broader problem classes without the reformulation needed in sparse-regression-based frameworks for rational functions,^{79} stochastic systems,^{80} and PDEs.^{81,82} Although we focus on autonomous systems here, the framework can, in principle, be applied to non-autonomous systems.^{83–85} Unlike strongly coupled state variables, which influence each other and are constrained through the model library, the dynamics of any external forcing terms would likely need to be measured or selected from a library of time-dependent models. Modifications to the optimization methodology and further investigation of optimal data-sampling strategies could improve the computational efficiency of DAHSI, opening up higher-dimensional problems to model selection with hidden variables.

## SUPPLEMENTARY MATERIAL

See the supplementary material for a full derivation of the cost function [Eq. (3)]; further details of the calculations, including the validation steps and robustness study; supporting figures; DAHSI identified models; and calculations on additional examples.

## ACKNOWLEDGMENTS

The authors acknowledge Northwestern’s Center for Optimization and Statistical Learning for partially funding this research.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

Several people contributed to the work described in this paper. N.M.M. conceived of the basic idea of the work and provided guidance and feedback on all aspects of the work. H.R. designed and carried out the application and demonstration on the data. H.R. and S.S. designed the method. H.R. and A.V.N. performed the robustness analysis. H.R. and N.M.M. wrote the first draft of the manuscript; all authors subsequently took part in the revision process and approved the final copy of the manuscript.

## DATA AVAILABILITY

The data that support the findings of this study are openly available in Zenodo (DAHSI code base) at https://doi.org/10.5281/zenodo.5911949, Ref. 54.

## REFERENCES

*Dynamical Systems and Turbulence, Warwick 1980*(Springer, 1981), pp. 366–381.

*Neural Networks for Signal Processing VII. Proceedings of the 1997 IEEE Signal Processing Society Workshop*(IEEE, 1997), pp. 511–520.

*et al.*, “

*Proceedings of the 5th Hellenic-European Conference on Computer Mathematics and its Applications (HERCMA-01), Athens, Greece, 20-22 September 2001*(LEA Press, Athens, Hellas, 2002), pp. 2–24.

*2002 28th Annual Conference of the Industrial Electronics Society (IECON2002)*(IEEE, 2002).

*et al.*, “