This paper investigates how models of spatiotemporal dynamics in the form of nonlinear partial differential equations can be identified directly from noisy data using a combination of sparse regression and weak formulation. Using the 4th-order Kuramoto-Sivashinsky equation for illustration, we show how this approach can be optimized in the limits of low and high noise, achieving accuracy that is orders of magnitude better than what existing techniques allow. In particular, we derive the scaling relation between the accuracy of the model, the parameters of the weak formulation, and the properties of the data, such as its spatial and temporal resolution and the level of noise.

In recent years, data-driven discovery of mathematical models of spatially extended systems described by nonlinear partial differential equations (PDEs) has emerged as a promising alternative to more traditional modeling approaches. Existing approaches to model discovery such as sparse regression have several major weaknesses, however. Most notably, they break down for data with high levels of noise and have to be tuned empirically to produce meaningful results, making them ill-suited for analyzing experimental data. We show how these weaknesses can be addressed using a weak formulation of the model PDE. The weak formulation has substantial freedom that makes it extremely powerful and flexible, but the question arises of how this freedom can be used to robustly obtain the most accurate model. This question is addressed here for the first time.

Partial differential equations (PDEs) provide a natural description for the temporal evolution of spatially extended systems in various fields of science and engineering. Historically and practically important examples include wave equations arising in many areas of physics, the Schrödinger equation in quantum mechanics, the Navier-Stokes equations in fluid dynamics, and reaction-diffusion equations used to model physical, chemical, or biological systems. In the past, models of such systems were almost always constructed from first principles or using a suitable empirical approach. However, in recent years, a data-driven paradigm for learning the dynamics has emerged, which leverages the modern prevalence of data and computational power to create models when the underlying governing laws have eluded first-principles derivation.

Many indirect methods for learning the dynamics that do not require a PDE have been proposed. Notable examples include equation-free modeling,1 artificial neural networks,2–4 dynamic mode decomposition5 and Koopman operator approaches,6 balanced truncation,7 and resolvent-based analysis.8 While these techniques can provide an economical approximate description of the dynamics, this is done at the cost of losing the mathematical structure that affords physical intuition or interpretability. Symbolic regression, which was originally used to derive nonlinear ordinary differential equations describing low-dimensional systems,9,10 offers an enticing alternative by allowing construction of exact models and discovery of conservation laws. The genetic algorithms used in these earlier studies are however computationally expensive, preventing application of this approach to high-dimensional systems. Thus, the recent emergence of a sparse regression approach for model discovery11–15 has made a significant impact. Applied to spatially extended systems, this approach allows data-driven discovery of governing equations in the form of PDEs by evaluating a library of candidate terms containing partial derivatives at a large number of points and using a regularized regression procedure to compute the coefficients of each term and select a parsimonious model.

Sparse regression has proven computationally efficient and capable of reconstructing numerous canonical PDEs,14,16 but it faces serious difficulties when used for analysis of experimental data. One complication is that the proper choice of parsimonious model is often unclear. In many implementations, it relies on a manual Pareto analysis to balance model accuracy and complexity13 or on an automatic but complex thresholding procedure (e.g., sequential threshold ridge regression14) that tends to be sensitive to the choice of parameters. More importantly, existing sparse regression methods often suffer from low accuracy even in the absence of noise and completely break down at noise levels characteristic of realistic applications. This is because they inherently require explicit numerical evaluation of partial derivatives of the data, which is a notably ill-conditioned problem.

In this paper, we present a weak formulation of the sparse regression problem that eliminates this fundamental issue. We also suggest a simple thresholding procedure that reliably identifies the correct form of the governing PDE even in the presence of extremely high noise. Finally, we explore how this flexible and robust approach can be optimized and tuned to the properties of the underlying data set to maximize accuracy. This paper has the following structure. Section II describes our approach and the system used to test it, and our results are presented in Sec. III. Sections IV and V contain the discussion and conclusions.

We consider the problem of using the data u(x,t) to identify a parsimonious mathematical model in the form of a PDE

$$\sum_{n=1}^{N} c_n\, f_n(u, \partial_t u, \partial_x u, \ldots) = 0, \tag{1}$$

where each term in the sum is a function of u and its partial derivatives in space and time with constant coefficients cn. In most applications, the form of the basis functions fn can be restricted based on physical considerations, such as symmetries, conservation laws, etc.15,17 Typically, fn are taken to be products of powers of independent variables (x, t) and dependent variables (u and its various derivatives), although the form can be arbitrary in theory. Generally, in order to allow for a flexible model form, one should start with N sufficiently large. Our goal is to determine the coefficients cn for the terms that should be present in the model while eliminating the dynamically insignificant and thus likely spurious terms, yielding a sparse solution where, with few exceptions, cn=0.

Sparse regression aims to convert the PDE (1) to a tractable (and ideally, robust) linear algebra problem. Conventionally, this is done by evaluating all of the terms in the PDE at a random collection of points (xk,tk) using finite differences,17,18 spectral methods,11,12 or polynomial approximation.14,15 All of these approaches are extremely sensitive to noise, especially when high-order derivatives are present. We will instead pursue a weak formulation of the problem that can be obtained by multiplying (1) by a weight wj(x,t) and then integrating the result over a domain Ωk. Repeating the process for K distinct combinations of weight functions and integration domains yields the linear system

$$Q\,c = 0, \tag{2}$$

where $c = [c_1, \ldots, c_N]^T$ and $Q = [q_1, \ldots, q_N]$ is a “library” matrix, with each column $q_n \in \mathbb{R}^K$ consisting of the integrals of the function fn with all K combinations of weights wj and domains Ωk.
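To make the structure of Q concrete, a minimal Python sketch of the assembly loop is given below; the helper entry(domain, weight, n), which evaluates a single integral $q_n^{jk}$ by the quadrature discussed later, and all other names are illustrative rather than taken verbatim from our implementation.

```python
import numpy as np

def build_library(domains, weights, n_terms, entry):
    """Assemble the K-by-N library matrix Q. Each of the K rows pairs one
    integration domain with one weight function; entry(domain, weight, n)
    (a hypothetical helper) returns the integral of the candidate term f_n
    against that weight over that domain."""
    rows = [[entry(dom, w, n) for n in range(n_terms)]
            for dom in domains
            for w in weights]
    return np.array(rows)  # shape (K, N), K = len(domains) * len(weights)
```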

Note that there is an extra degree of freedom in (2) corresponding to the normalization of c. Conventionally this is dealt with by assuming that $f_1 = \partial_t u$, setting $c_1 = 1$, and solving the overdetermined system that corresponds to the choice $K \gg N$ using least squares or some regularized version of it.14 This is however not always a valid assumption: it is usually unknown a priori whether any given temporal derivative should be included in the PDE at all, whereas in this case a particular term is forced into the model. Moreover, even if this term should be present in the model, the regression effectively assumes that the time derivative was computed without error, which reduces the practical accuracy of the procedure.

We will, therefore, not make the assumption that the model has the form of an evolution equation and consider the linear problem (2) in its most general form. The normalization of c can be fixed by adding an extra row with arbitrary nonzero elements to Q, after which the resulting Eq. (2) can be solved by ordinary least squares. A more elegant solution pursued in the present study is to instead compute c as the right singular vector of Q corresponding to the smallest singular value. Note that this corresponds to the solution of the constrained least squares problem

$$\min_{\|c\| = 1} \|Q c\|^2. \tag{3}$$

The minimizer is the eigenvector of $Q^T Q$ with the smallest eigenvalue.

Once a suitable solution has been obtained by further constraining the problem, the resulting parsimonious model can be rewritten in the form of an evolution equation by solving for a term such as $\partial_t u$ (or $\partial_t^2 u$ for a wave equation).
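The singular-vector computation at the core of this approach amounts to a single SVD call; a minimal sketch, assuming NumPy:

```python
import numpy as np

def solve_coefficients(Q):
    """Return the unit-norm vector c minimizing ||Qc||, i.e., the right
    singular vector of Q associated with its smallest singular value."""
    _, _, Vh = np.linalg.svd(Q, full_matrices=False)
    return Vh[-1]  # rows of Vh are ordered by decreasing singular value
```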

To obtain a parsimonious model, we employ an iterative procedure to eliminate unnecessary terms from (1). At each step i, singular value decomposition is used to obtain the solution $c^i$ given the matrix $Q^i$, and the magnitude of the residual $\eta^i = \|Q^i c^i\|$ is computed. We then find the term with the smallest contribution $|c_n^i|\,\|q_n\|$ and construct $Q^{i+1}$ by eliminating the column $q_n$ from $Q^i$. The corresponding term is eliminated from the model if $\eta^{i+1} < \gamma\,\eta^i$, where $\gamma > 1$ is some fixed constant (we use $\gamma = 1.4$ in the present study). The iteration terminates at step i if $\eta^{i+1} > \gamma\,\eta^i$, yielding a parsimonious model. We find that this method compares favorably to alternatives such as sequential threshold ridge regression14 as it robustly eliminates spurious terms without requiring an extremely careful choice of parameters. Moreover, the sparsification parameter has a simple interpretation: $\gamma - 1$ is the maximum acceptable relative increase in the residual resulting from discarding a single library term.
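The elimination loop can be summarized by the following sketch, which reuses solve_coefficients from above; the precise form of the contribution criterion is our reading of the procedure and should be treated as illustrative rather than definitive.

```python
import numpy as np

def sparsify(Q, gamma=1.4):
    """Greedy backward elimination: repeatedly drop the library column whose
    contribution |c_n| * ||q_n|| is smallest, as long as the residual norm
    grows by less than a factor gamma."""
    active = list(range(Q.shape[1]))        # indices of retained terms
    c = solve_coefficients(Q)
    eta = np.linalg.norm(Q @ c)             # residual of the full model
    while len(active) > 1:
        contrib = np.abs(c) * np.linalg.norm(Q[:, active], axis=0)
        drop = int(np.argmin(contrib))      # least significant term
        trial = [a for i, a in enumerate(active) if i != drop]
        c_trial = solve_coefficients(Q[:, trial])
        eta_trial = np.linalg.norm(Q[:, trial] @ c_trial)
        if eta_trial > gamma * eta:         # residual grows too much: stop
            break
        active, c, eta = trial, c_trial, eta_trial
    return active, c
```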

We illustrate the advantages of our approach by applying it to the Kuramoto-Sivashinsky equation,19,20

$$c_1\,\partial_t u + c_2\, u\,\partial_x u + c_3\,\partial_x^2 u + c_4\,\partial_x^4 u = 0, \tag{4}$$

which has posed a significant challenge in past studies of sparse regression11,14 because it contains a fourth-order partial derivative that is difficult to evaluate numerically with adequate accuracy. Here $c_1 = \cdots = c_4 = 1$ are all constants, although our approach can easily be extended even to the case when these coefficients are functions of time and/or space, as discussed below. Since this is a scalar equation in one spatial and one temporal dimension, we use scalar weight functions wj(x,t). If we denote the terms in the model (4) by $f_1, \ldots, f_4$, then

$$q_n^{jk} = \int_{\Omega_k} w_j\, f_n\, d\Omega, \tag{5}$$

where dΩ=dxdt. The key feature of the weak formulation is that it can almost always be used to completely eliminate, or at least reduce the order of, the derivatives acting on the noisy data by integrating by parts. In our particular case,

$$q_1^{jk} = -\int_{\Omega_k} u\,\partial_t w_j\, d\Omega, \quad q_2^{jk} = -\frac{1}{2}\int_{\Omega_k} u^2\,\partial_x w_j\, d\Omega, \quad q_3^{jk} = \int_{\Omega_k} u\,\partial_x^2 w_j\, d\Omega, \quad q_4^{jk} = \int_{\Omega_k} u\,\partial_x^4 w_j\, d\Omega, \tag{6}$$

under the assumption that wj and its first three partial derivatives with respect to x vanish on the boundary $\partial\Omega_k$. In our implementation, we use the composite trapezoidal rule to evaluate the integrals numerically.
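As an illustration, a single library entry such as $q_4^{jk} = \int_{\Omega_k} u\,\partial_x^4 w_j\, d\Omega$ can be evaluated as sketched below, assuming the fourth x-derivative of the weight is available in closed form (as it is for the weights introduced below); the function names are ours.

```python
import numpy as np

def trapz2d(f, dx, dt):
    """Composite trapezoidal rule on a uniform 2D grid (axis 0: x, axis 1: t)."""
    wx = np.ones(f.shape[0]); wx[0] = wx[-1] = 0.5
    wt = np.ones(f.shape[1]); wt[0] = wt[-1] = 0.5
    return dx * dt * (wx @ f @ wt)

def entry_q4(u_local, w_x4_local, dx, dt):
    """q4 = integral of u * (d^4 w / dx^4) over one domain: all derivatives
    act on the smooth weight, so the noisy data u is never differentiated."""
    return trapz2d(u_local * w_x4_local, dx, dt)
```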

Note that although this particular PDE features constant coefficients, terms with variable coefficients can be treated in a similar manner. For instance, suppose that the coefficient of the term $\partial_x^4 u$ is a function of x and t that can be expanded in some (finite) basis $\{b_p(x,t)\}$ as

$$c_4(x,t) = \sum_p c_p\, b_p(x,t), \tag{7}$$

with some constants cp. Then,

$$\int_{\Omega_k} w_j\, c_4(x,t)\,\partial_x^4 u\; d\Omega = \sum_p c_p\, q_p^{jk}, \tag{8}$$

where

$$q_p^{jk} = \int_{\Omega_k} u\;\partial_x^4\!\left(b_p\, w_j\right) d\Omega. \tag{9}$$

Sparse regression for a model including such a term would then simply require expanding the library Q to include additional columns $q_p$ with entries $q_p^{jk}$. In this case as well, no derivatives of the noisy data u are used in finding the elements of Q.

Although in principle integration domains of any shape can be used, here we will only consider rectangular domains of a fixed size,

$$\Omega_k = \left\{(x,t) : |x - x_k| \le H_x,\ |t - t_k| \le H_t\right\}, \tag{10}$$

where the centers $(x_k, t_k)$ of the rectangles Ωk are chosen randomly. Similarly, there are many possible choices for the weight functions satisfying the boundary conditions on $\partial\Omega_k$; we focus on functions of the form

$$w_j(x,t) = \left(\bar{x}^2 - 1\right)^{\alpha} \left(\bar{t}^2 - 1\right)^{\beta} e^{\pm i\pi l \bar{x}}\, e^{\pm i\pi m \bar{t}}, \tag{11}$$

where $\bar{x} = (x - x_k)/H_x$ and $\bar{t} = (t - t_k)/H_t$ are nondimensionalized independent variables and $\alpha \ge 4$, $\beta \ge 1$, $l \ge 0$, and $m \ge 0$ are integers. Note that there are four weight functions (corresponding to the four different choices of the signs in the exponentials) for each pair of nonzero l and m. The integrals $q_n^{jk}$ are all of the form

$$q_n^{jk} = H_x H_t \int_{-1}^{1}\!\int_{-1}^{1} f_n^{\alpha\beta}(\bar{x},\bar{t})\; e^{\pm i\pi l \bar{x}}\, e^{\pm i\pi m \bar{t}}\; d\bar{x}\, d\bar{t} \;\propto\; F_n^{lm}, \tag{12}$$

where

$$f_n^{\alpha\beta}(\bar{x},\bar{t}) = \left(\bar{x}^2 - 1\right)^{\alpha} \left(\bar{t}^2 - 1\right)^{\beta} f_n(x, t, u, \ldots), \tag{13}$$

so $F_n^{lm}$ are the coefficients of the two-dimensional Fourier series for $f_n^{\alpha\beta}(\bar{x},\bar{t})$. Although $f_n(x,t,u,\ldots)$ is not periodic on Ωk, the functions $f_n^{\alpha\beta}(\bar{x},\bar{t})$ are. Moreover, assuming u(x,t) is smooth enough that, inside Ωk, $f_n(x,t,u,\ldots)$ has $\alpha - 1$ continuous derivatives in space and $\beta - 1$ continuous derivatives in time, then the function $f_n^{\alpha\beta}(\bar{x},\bar{t})$ does as well. The Fourier coefficients then decay according to $F_n^{lm} \sim l^{-\alpha}\, m^{-\beta}$. The powers α and β, therefore, control the width of the Fourier spectrum of the entries $q_n^{jk}$ in the library Q, while the choice of l and m allows us to tune the frequencies of the weights to the spectral properties of the data. The convergence rate of the Fourier series turns out to control the accuracy with which the integrals are evaluated using data that are available only on a discrete grid. For simplicity, we will assume that the same weight functions are integrated on every domain. It is possible to use either weight functions involving only a single pair of frequencies (e.g., l and m) or a range of frequencies in space and/or time.
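Weights of this form, together with the derivatives required by integration by parts, are conveniently generated by symbolic differentiation. The sketch below constructs one of the four sign choices using SymPy; the default arguments mirror the near-optimal parameters used below ($\alpha = \beta = 8$, $l = 1$, $m = 2$, $H_x = F_x/2 \approx 7.365$) but are otherwise illustrative.

```python
import sympy as sp

def make_weight(alpha=8, beta=8, l=1, m=2, Hx=7.365, max_dx=4):
    """Build w = (xb**2-1)**alpha * (tb**2-1)**beta * exp(i*pi*(l*xb + m*tb))
    in the nondimensional variables and return vectorized callables for the
    x-derivatives of order 0..max_dx; each x-derivative brings a factor 1/Hx
    since xb = (x - x_k)/Hx (t-derivatives are obtained analogously)."""
    xb, tb = sp.symbols('xb tb', real=True)
    w = (xb**2 - 1)**alpha * (tb**2 - 1)**beta \
        * sp.exp(sp.I * sp.pi * (l * xb + m * tb))
    return [sp.lambdify((xb, tb), sp.diff(w, xb, nu) / Hx**nu, 'numpy')
            for nu in range(max_dx + 1)]
```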

To test our sparse regression approach, we computed a solution of the Kuramoto-Sivashinsky equation, using the integrator described in Ref. 14 to generate data on a physical domain with dimensions Lx = 32π and Lt = 500. The numerical integration generated data with spatial resolution Δx = 0.0491 using a time step Δt = 0.005, which was then downsampled to a lower spatial resolution δx and temporal resolution δt. Unless noted otherwise, the results presented below are for δx = 0.1964 and δt = 1. For reference, the solution has a correlation length $\ell_x \approx 1.67 \approx 8.5\,\delta x$ and correlation time $\ell_t \approx 8 = 8\,\delta t$. To test the effects of noise, uncorrelated Gaussian noise with standard deviation $\sigma s_u$ was added to the data for various choices of σ, where $s_u \approx 1.3$ is the sample standard deviation of u on the whole domain. (We verified that the results do not change if the uncorrelated noise is sampled from a different distribution with zero mean and the same standard deviation.)
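A sketch of this preprocessing step is given below; the strides correspond to the quoted resolutions (4 × 0.0491 ≈ 0.1964 and 200 × 0.005 = 1), and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def downsample_and_corrupt(u, stride_x=4, stride_t=200, sigma=0.03):
    """Coarsen the simulated field u[x, t] to resolution (dx, dt) and add
    uncorrelated Gaussian noise of standard deviation sigma * s_u, where
    s_u is the sample standard deviation of u."""
    u_coarse = u[::stride_x, ::stride_t]
    noise = sigma * u.std() * rng.standard_normal(u_coarse.shape)
    return u_coarse + noise
```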

To test the ability of the algorithm to eliminate spurious terms, in addition to the terms present in the Kuramoto-Sivashinsky equation (4), we also included terms $\partial_x u$, $\partial_x^3 u$, $u$, $u^2$, $u^3$, and $1$ (which represents a hypothetical forcing) in our library. The corresponding integrals were rewritten using integration by parts to remove derivatives acting on u, as described previously. In Sec. III, we quantify the performance of our sparse regression approach using two key metrics: how well the algorithm can discriminate between the essential and spurious terms and how accurately it can determine the coefficients of the essential terms. Since the data were generated using a known model, we know which terms are essential [those contained in the PDE (4)]. If the reference model is unavailable, ensemble regression15 may be used instead to help distinguish essential terms from spurious ones.

As discussed previously, the elements of the library matrix Q are given by the Fourier coefficients of the different terms included in the generalized model [windowed by the envelope $E^{\alpha\beta}(\bar{x},\bar{t}) = (\bar{x}^2 - 1)^{\alpha}(\bar{t}^2 - 1)^{\beta}$ on each domain Ωk]; hence knowledge of the Fourier spectrum of the data is crucial for an optimal choice of the size of the integration domains Ωk and the weight functions wj. The power spectrum (or, more precisely, the absolute value of the Fourier coefficients) of the noiseless data on the entire physical domain is shown in Fig. 1. In space, the spectrum is sharply peaked around a wave number $\kappa \approx 0.625$. At high wave numbers, the spectrum decays exponentially, $P \sim e^{-\kappa/\bar{\kappa}}$, where $\bar{\kappa} \approx 0.3$. In time, the spectrum is peaked at zero frequency ω and decays as a power law, $P \sim \omega^{-\chi}$, with $\chi \approx 2.5$.

FIG. 1. The power spectrum over (a) space and (b) time, normalized so that the maximum is 1. The black dots show the spectrum of the original data. The symbols correspond to the spectra of the windowed data multiplied by envelopes $E^{\alpha\beta}(\bar{x},\bar{t})$ (with different choices of α or β) on a “typical” integration domain Ωk [i.e., averaged over 1000 uniformly distributed choices of $(x_k, t_k)$]. The spatial and temporal frequencies correspond to $\kappa_l = 2\pi l/F_x$ and $\omega_m = 2\pi m/F_t$ for the windowed data.

Having characterized the data, we turn to the investigation of how the performance of our algorithm depends on the choice of various parameters. Since the number of parameters is quite large, instead of exploring the entire parameter space, we focus on the dependence on one or two parameters at a time, with the remaining parameters staying fixed. Specifically, the noise level σ is fixed to 3% and we use the following near-optimal parameters in the sparse regression. The dimensions of the integration domain are Fx=2Hx=14.73 and Ft=2Ht=75. This choice corresponds to an equal number of grid points in both directions, Fx/δx=Ft/δt=75. Unless noted otherwise, we use a single set of weights with α=β=8, l=1, m=2, and the sparsification parameter is γ=1.4. We generally use every combination of 4 weight functions over 50 integration domains, so that the total number of library rows is K=200. To characterize the stochastic effects, for each set of parameters, we used an ensemble of M=100 trials featuring different random distributions of the integration domains and realizations of noise.

First, we tested the ability of the method to reconstruct the correct form of the PDE (4) for various values of γ with all other parameters fixed at their near-optimal values. Our iterative regression procedure proved very robust for a fairly wide range of values of γ. In particular, at a noise level of 30%, it performed perfectly for $1.1 \le \gamma \le 2$, with the reconstructed model containing no missing or spurious terms in all of the trials. For the highest noise level considered here (100%), we found perfect performance for $1.2 \le \gamma \le 1.5$. In some fraction of the trials, spurious terms appeared at lower γ and missing terms at higher γ, as shown in Fig. 2. For reference, without the benefit of the weak formulation, sparse regression failed14 to correctly reconstruct the lambda-omega reaction-diffusion system, which is only second-order, for noise levels as low as 1%.

FIG. 2. Fraction p of identified models with spurious (circles) or missing (squares) terms at the maximum noise level (σ = 1) as a function of the sparsification parameter γ.

The accuracy of regression (i.e., model identification) was quantified by computing the relative error in each parameter of the Kuramoto-Sivashinsky equation

$$\Delta c_n = \left| \frac{c_n - \bar{c}_n}{\bar{c}_n} \right|, \tag{14}$$

where $\bar{c}_n$ and $c_n$ are the true and estimated values of the model parameters, respectively, for n = 2, 3, 4. We normalize the estimated parameters so that $c_1 = 1$. In all of the following figures, we plot the estimated mean value of Δcn with 95% error bars, where all of the parameters in the regression procedure are held at their near-optimal values stated previously unless noted otherwise.
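For completeness, the metric (14) can be computed as in the sketch below, which assumes the estimated coefficient vector is rescaled by its first component (the coefficient of $\partial_t u$); for the Kuramoto-Sivashinsky equation the true coefficients are all unity.

```python
import numpy as np

def parameter_errors(c_est, c_true):
    """Relative errors of Eq. (14) after normalizing the estimate so that
    its first component equals 1 (i.e., c_1 = 1)."""
    c_est = np.asarray(c_est, dtype=float)
    c_est = c_est / c_est[0]                 # enforce c_1 = 1
    c_true = np.asarray(c_true, dtype=float)
    return np.abs((c_est[1:] - c_true[1:]) / c_true[1:])

# e.g., for the Kuramoto-Sivashinsky model: parameter_errors(c, np.ones(4))
```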

In particular, Fig. 3 shows the accuracy of regression as a function of σ for two different choices of data resolution (δx and δt). It is worth noting that the average relative error is $\Delta c_n \sim 10^{-10}$ for all of the parameters for noiseless data with the higher of the two resolutions. However, even for 1% noise, $\Delta c_n \approx 2\times 10^{-4}$, which is more than three orders of magnitude smaller than what had been achieved in previous studies.14 The results are very similar for all three parameters; as this is generally the case, in subsequent figures, we only show the generally largest error $\Delta c_4$, which corresponds to the term $\partial_x^4 u$ involving the highest-order derivative. We find two distinct regimes. At higher noise levels, the error in evaluating the library matrix entries is due primarily to the averaged effect of noise. Applying the central limit theorem, we find that the relative noise-induced error $\varepsilon_n$ scales as

$$\varepsilon_n \propto \sigma \sqrt{\frac{\delta x\, \delta t}{F_x F_t}}. \tag{15}$$
FIG. 3. Parameter errors Δcn as a function of noise level. The circles, triangles, and squares correspond to n = 2, 3, 4, respectively, and the empty and filled symbols indicate results for data with double resolution (δx = 0.0982, δt = 0.5) and half resolution (δx = 0.393, δt = 2), respectively. The dashed lines show the predicted scaling.

At low noise levels, the parameter accuracy is controlled by numerical error, which has two different sources. The first source is a numerical error in the data itself, which is due to the finite accuracy of the integrator that “solves” the Kuramoto-Sivashinsky equation. This source dominates for smaller δx and δt. For experimental data, this source would correspond to systematic error. For larger δx and δt, the parameter inaccuracy is mainly due to the error in computing the library matrix entries based on data that are available on a discrete grid. Suppose we want to use numerical quadratures to evaluate an integral

$$I = \int_0^L g(x)\, dx, \tag{16}$$

where $g(x) \in C^m$ (i.e., has m continuous derivatives) and $g^{(i)}(0) = g^{(i)}(L)$ for all $0 \le i < m$. Then, for the composite trapezoidal rule on a grid with spacing h, the relative error associated with the discretization can be estimated using exact Euler-Maclaurin formulas21 and is found to scale as $h^{m+2}\,|g^{(m+2)}|$ for m even (or $h^{m+1}\,|g^{(m+1)}|$ for m odd), where a characteristic value of the derivative on the interval [0, L] is used.
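This scaling is straightforward to verify numerically. The toy check below integrates $g(x) = (x^2 - 1)^m$ on [−1, 1], for which g and its first m − 1 derivatives vanish at both endpoints (mirroring the envelope factors in the weight functions introduced above); it is our own illustration rather than part of the regression algorithm.

```python
import numpy as np
from math import factorial

def trapz_order(m):
    """Empirical convergence order of the trapezoidal rule for
    g(x) = (x**2 - 1)**m on [-1, 1]; expect m+2 for even m, m+1 for odd m."""
    exact = (-1)**m * 2**(2*m + 1) * factorial(m)**2 / factorial(2*m + 1)
    errs = []
    for n in (64, 128, 256):
        x, h = np.linspace(-1.0, 1.0, n + 1, retstep=True)
        g = (x**2 - 1.0)**m
        errs.append(abs(h * (g.sum() - 0.5 * (g[0] + g[-1])) - exact))
    # halving h divides the error by 2**order
    return [float(np.log2(e0 / e1)) for e0, e1 in zip(errs, errs[1:])]

# trapz_order(1) -> ~[2.0, 2.0]; trapz_order(2) -> ~[4.0, 4.0]
```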

Generalizing this result to two dimensions [and assuming $\delta x \ll \min(\ell_x, F_x)$ and $\delta t \ll \min(\ell_t, F_t)$], we find an estimate of the relative discretization error for an element of the library matrix Q that involves a temporal derivative of order $\nu_t$ and/or a spatial derivative of order $\nu_x$:

$$\varepsilon_d = O\!\left(h^{\mu}\right), \qquad \mu = \min\left(\alpha - \nu_x,\ \beta - \nu_t\right) + 2, \tag{17}$$

where $h = \delta t/\ell_t \approx \delta x/\ell_x$. It is easy to check that, due to the conditions on α and β, we always have $\mu \ge 2$, as it should be for the trapezoidal rule. The Kuramoto-Sivashinsky equation features terms that all involve derivatives, with the lowest order being one and the highest being four; hence, for even $\alpha = \beta \ge 4$, the exponent μ ranges between $\alpha - 2$ and $\alpha + 1$. Therefore, the scaling

$$\varepsilon_d \propto h^{\alpha - 2} \tag{18}$$

dominates for lower h, while the scaling

$$\varepsilon_d \propto h^{\alpha} \tag{19}$$

dominates for higher h.

The error Δcn can be found using perturbation theory. Let $\bar{Q}$ be the library matrix evaluated using a continuous noiseless solution so that $\bar{Q}\bar{c} = 0$ exactly (we assume that $\bar{Q}$ corresponds to the parsimonious model). In the presence of measurement noise and/or discretization error, the error in evaluating each entry $q_n^{jk}$ of the library matrix is proportional to $\varepsilon = \max(\varepsilon_d, \varepsilon_n)$, so

$$Q = \bar{Q} + \varepsilon\, \hat{Q} \tag{20}$$

for some matrix $\hat{Q}$ whose entries are distributed as white Gaussian noise. Note that the entries of $\hat{Q}$ are $O(F_x F_t)$. The entries of $\bar{Q}$ have a more complicated scaling that is determined by the Fourier spectrum of the data (i.e., exponential in space, power law in time). Specifically, we find (cf. Fig. 4)

$$q_n^{jk} = O\!\left(F_x F_t\; e^{-\lambda_n F_x}\; F_t^{-\xi_n}\right), \tag{21}$$

where $\lambda_n = O(\ell_x^{-1})$ and $\xi_n = O(1)$ are some positive constants. To leading order in ε, the least squares solution to (2) is given by

$$c = \bar{c} - \varepsilon\, \bar{Q}^{+} \hat{Q}\, \bar{c} + O(\varepsilon^2), \tag{22}$$

where $\bar{Q}^{+}$ is the Moore-Penrose pseudoinverse of $\bar{Q}$. Since the elements of $\hat{Q}$ can be considered uncorrelated, we have for $F_x \gg \ell_x$ and $F_t \gg \ell_t$

$$\Delta c_n \sim \frac{\varepsilon}{\sqrt{K}}\; \frac{F_x F_t}{F_x F_t\, e^{-\lambda_n F_x}\, F_t^{-\xi_n}}, \tag{23}$$

where the numerator and denominator describe the scaling of the entries of $\hat{Q}$ and $\bar{Q}$, respectively. Following from (21),

$$\Delta c_n \sim \frac{\varepsilon}{\sqrt{K}}\; e^{\lambda F_x}\, F_t^{\xi}, \tag{24}$$

with some positive constants $\lambda = O(\ell_x^{-1})$ and $\xi = O(1)$. For low σ, we have $\varepsilon = \varepsilon_d$ and, therefore, Δcn is independent of σ. For high σ, we have $\varepsilon = \varepsilon_n$, so combining (23) and (15) we find

$$\Delta c_n \sim \frac{\sigma}{\sqrt{K}} \sqrt{\frac{\delta x\, \delta t}{F_x F_t}}\; e^{\lambda F_x}\, F_t^{\xi}. \tag{25}$$

The predicted scaling of Δcn with σ in both regimes is consistent with the results shown in Fig. 3. In particular, we find that the effect of changing the resolution of the data is quite minor at high σ, where $\Delta c_n \propto h$ according to (25). At low σ, the effect is much stronger: for $\alpha = \beta = 8$, we have $\Delta c_n \propto h^6$ according to (17). The dependence of the scaling in (17) on α and β is further confirmed by Fig. 5, which shows results for noiseless data. In the $\alpha = \beta = 4$ case, we observe the scaling law $\Delta c_n \propto h^2$ corresponding to (18) in the entire range of h we examined. When $\alpha = \beta = 6$, the parameter error scales according to $\Delta c_n \propto h^4$ for small h and $\Delta c_n \propto h^6$ for large h, which correspond to the limiting cases (18) and (19), respectively. We should also note that for h as large as 1/4, the accuracy remains very good. Thus, the method is suitable for fairly sparse data.

FIG. 4. Scaling of the first four columns of the library Q with the size of the integration domain in the (a) spatial and (b) temporal directions. The columns correspond to $\partial_t u$ (squares), $u\,\partial_x u$ (circles), $\partial_x^2 u$ (triangles), and $\partial_x^4 u$ (diamonds).
FIG. 5. Parameter error Δc4 as a function of the resolution of noiseless data for α = β = 4 (circles) and α = β = 6 (squares). The dashed lines show the predicted scaling.

As illustrated in Fig. 6, we also observe the scaling of Δcn with K predicted by (25). This scaling is expected to break down when the total area of the integration domains exceeds the area of the physical domain due to the loss of statistical independence between the data on different integration domains, leading to an increased linear dependence of the rows of the library matrix Q. We can expect the error to asymptote to

$$\Delta c_n \sim \frac{\sigma}{\sqrt{N_d}} \sqrt{\frac{\delta x\, \delta t}{F_x F_t}}\; e^{\lambda F_x}\, F_t^{\xi} \tag{26}$$

for $K \gg N_d$, where

$$N_d = \frac{L_x L_t}{F_x F_t} \tag{27}$$

is the area ratio. For the reference set of parameters, saturation did not occur over the range of K we tested. To more easily observe the saturation effect, we set l = m = 0, so that only one weight function is used and the number of integration domains equals K (rather than K/4 for nonzero l and m). Furthermore, we reduce the size of the physical domain to Lx = 16π and Lt = 250, so that $N_d \approx 11$ is relatively small. As Fig. 6 illustrates, for large K, the parameter accuracy indeed asymptotes to a constant.

FIG. 6. Parameter error Δc4 as a function of the number of library rows K. Only the l = m = 0 weight function is used and the physical domain size is reduced to Lx = 16π, Lt = 250. The dashed lines show the predicted scaling.

The scaling described by (26) can also be observed in the dependence of Δcn on the size of the physical domain (and hence Nd) with all other parameters fixed. This dependence is quite important, since it determines how much data need to be collected to identify the model with meaningful precision. As Fig. 7 illustrates, choosing the physical domain to be just double the size of the (optimal) integration domain in both directions (which corresponds to Nd=4) already yields a rather acceptable accuracy when only one weight function is used. When l and m are nonzero, accurate reconstruction is possible even if Nd is only slightly greater than 1.

FIG. 7. Parameter error Δc4 as a function of Nd for l = m = 0 and K = 500. Squares correspond to fixing Lt = 100 and varying Lx from 15.7 to 98.2. Circles correspond to fixing Lx = 19.6 and varying Lt from 80 to 500. The dashed line shows the predicted scaling.

Next, we consider how the error in the estimated coefficients depends on the choice of the integration domain size. Figure 8 shows the dependence of the error Δc4 on the size of the integration domains for two different choices of the weight functions. In panel (a), Ft is fixed to 75 and Fx is taken to vary, and in panel (b), Fx is fixed to 14.73 with Ft varying. In both cases, we find that there is an optimal domain size with $F_x \approx 14.73$ and $F_t \approx 75$; moreover, the optimal values remain approximately the same even if we vary the size of the other dimension or the choice of weight functions. For small Fx and/or Ft, the error is large because (a) the integration domain is too small to effectively average out the influence of noise and (b) the numerical quadrature error is large (both εn and εd increase as Fx and/or Ft decrease). For large Fx and Ft, we enter the regime described by (23), which predicts that the error should grow exponentially in Fx and as a power of Ft. Indeed, this is exactly what we observe in Fig. 8. Based on (21), it appears that the optimal choice of Fx and Ft corresponds to the crossover between these two regimes, i.e., $F_x \sim \ell_x$ and $F_t \sim \ell_t$. Our numerical results suggest that the optimal choice corresponds to $F_x/\ell_x \approx F_t/\ell_t \approx 8$.

FIG. 8. Parameter error Δc4 as a function of the (a) spatial and (b) temporal dimensions of integration domains when only the l = m = 0 weight function is used (squares) and for the optimal choice of l and m (circles).

Finally, let us address the optimal choice of frequencies appearing in the weight functions (11). Figure 9 shows the effect of varying either l or m with all other parameters fixed at their reference values. Specifically, we plot Δc4 vs $\kappa_l = 2\pi l/F_x$ and $\omega_m = 2\pi m/F_t$. (Note that when l or m is 0, the number of distinct weight functions is halved, so we correspondingly double the number of integration domains to keep the number of rows in the library constant.) One could assume that the optimal values would be given by the dominant frequencies of the original data (as we discussed previously, the dominant wave number is $\kappa \approx 0.625$ and the dominant temporal frequency is ω = 0). According to Fig. 1, windowing the data broadens the peaks but leaves both dominant frequencies roughly the same: $\kappa \approx 0.8$ (l = 2) and ω = 0 (m = 0). Unfortunately, it turns out that we cannot use the spectra to exactly predict the optimal frequencies, which are $\kappa \approx 0.4$ (l = 1) and $\omega \approx 0.2$ (m = 2 or 3). However, choosing the frequencies based on the spectra still produces reasonably good accuracy (within a factor of 4 or so of the optimal result).

FIG. 9. Parameter error Δc4 as a function of (a) the wave number $\kappa_l = 2\pi l/F_x$ and (b) the frequency $\omega_m = 2\pi m/F_t$.

These results suggest that using weight functions with a combination of different frequencies may be more robust and/or accurate. To test this hypothesis, we considered the case in which weight functions with a range of frequencies in space or time were included, with the total number of library rows fixed at 200. However, this approach yielded a decrease in the accuracy, as the broader choice of weight functions did not compensate for a decrease in the number of integration domains. This suggests that the optimal strategy is to use a large number of integration domains while keeping the frequencies of the weight functions fixed.

It is worth considering what modifications may need to be made to reconstruct models other than the Kuramoto-Sivashinsky equation. The choice of the weight functions wj plays a critical role in determining the accuracy of the model identification. Recall that, under the assumption that the data u itself is sufficiently smooth, the behavior of wj at the boundary of the integration domains Ωk determines the accuracy with which the elements of the library matrix Q are evaluated using numerical quadratures. Our choice of the form of (11) was influenced primarily by the order of the partial derivatives appearing in the model, which determine the lowest possible values of the exponents α and β. Indeed, weight functions of the form (11), with appropriate choices of the exponents, were found to work well for a variety of PDE models.

A more interesting question arises for models admitting nonsmooth solutions, such as Burgers’ equation, which develops shocks. When the data or their derivatives are discontinuous, the scaling (17) that we have derived does not hold for integrals over integration domains Ωk that contain the discontinuity (for instance, if u is discontinuous, then $\varepsilon_d \sim h$). To preserve the high accuracy associated with the large scaling exponents μ, we have two choices. The first option is to restrict the sizes and positions of the integration domains such that they never include the discontinuities. In this case, we can still choose weight functions of the form (11). The second option is to modify the weight functions such that the smoothness of the products (13) is maintained across the discontinuity. For instance, if the position of the discontinuity can be parametrized by the equation ϕ(x,t) = 0, then, for the integration domains containing the discontinuity, one may choose weight functions of the form

$$w_j(x,t) = \phi^{\nu}(x,t)\, \left(\bar{x}^2 - 1\right)^{\alpha} \left(\bar{t}^2 - 1\right)^{\beta} e^{\pm i\pi l \bar{x}}\, e^{\pm i\pi m \bar{t}}, \tag{28}$$

where ν is sufficiently large.

We have introduced a robust and flexible approach to data-driven discovery of models in the form of nonlinear PDEs. The approach uses a weak formulation, coupled with a novel sparse regression procedure, to obtain a parsimonious description. We have demonstrated its capability to identify PDEs, even with high-order derivatives, from extremely noisy data with unprecedented accuracy. For instance, with 1% noise, we were able to reduce the error in estimating the parameters of the 4th-order Kuramoto-Sivashinsky equation from 50%14 to just $2\times 10^{-4}$. Furthermore, whereas correct identification of the functional form of the underlying PDE has been far from guaranteed at any noise level using past approaches, our algorithm was able to reconstruct the Kuramoto-Sivashinsky equation accurately in 100% of cases from data in which the noise was as strong as the signal itself (σ = 1).

This impressive performance is achieved by shifting the partial derivatives from the data onto a known smooth weight function using integration by parts, thus avoiding the large errors incurred by repeated numerical differentiation. Our method also proved to be well-adapted to sparse data, maintaining errors of less than 0.1% for a grid resolution only 4 times finer than the correlation length/time. Such reliability and high accuracy in the presence of noisy or sparse data is indispensable for analysis of experimental data. Notably, even in the absence of noise, our results compare very favorably with those of previous studies11,14 because the discretization error of the algorithm can be made extremely small: for the Kuramoto-Sivashinsky equation, the relative error in all parameters can easily be reduced to $10^{-10}$. It is also important to mention that the computational cost of our algorithm is comparable to that of existing sparse regression methods.

We also derived the scaling laws that describe the accuracy of the regression as a function of the parameters used in the algorithm and the properties of the data. These scaling laws can be used to fully exploit the flexibility of the weak formulation approach by tuning its various parameters. In particular, the size of the input used by the regression can be controlled by choosing both the number of different integration domains and the number of different weight functions. We have shown that the number of integration domains plays a much more important role than the number of weight functions: the best results can be obtained by using a set of weight functions with a fixed shape (frequency and envelope) and a large number of integration domains. Furthermore, we have determined the optimal shape of the weights and the optimal size of the integration domains. The latter turned out to be determined by the correlation length and time describing the data (with the size roughly an order of magnitude larger than these characteristic scales). We have also shown that, although the error can be reduced further by using data on ever-larger physical domains, satisfactory results can be obtained for physical domains that are just a factor of two larger than the optimal integration domain in each dimension.

This material is based upon work supported by the National Science Foundation (NSF) under Grant No. CMMI-1725587. D.G. gratefully acknowledges the support of the Letson Undergraduate Research Scholarship.

The reference data and the codes used in this study are freely accessible from the GitHub repository at https://github.com/pakreinbold/PDE_Discovery_Weak_Formulation.

1. I. G. Kevrekidis, C. W. Gear, J. M. Hyman, P. G. Kevrekidis, O. Runborg, C. Theodoropoulos et al., “Equation-free, coarse-grained multiscale computation: Enabling macroscopic simulators to perform system-level analysis,” Commun. Math. Sci. 1, 715–762 (2003).
2. K.-l. Hsu, X. Gao, S. Sorooshian, and H. V. Gupta, “Precipitation estimation from remotely sensed information using artificial neural networks,” J. Appl. Meteorol. 36, 1176–1190 (1997).
3. M. Raissi, “Deep hidden physics models: Deep learning of nonlinear partial differential equations,” J. Mach. Learn. Res. 19, 932–955 (2018).
4. J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott, “Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach,” Phys. Rev. Lett. 120, 024102 (2018).
5. J. H. Tu, C. W. Rowley, D. M. Luchtenburg, S. L. Brunton, and J. N. Kutz, “On dynamic mode decomposition: Theory and applications” (2013), e-print arXiv:1312.0041.
6. I. Mezić, “Analysis of fluid flows via spectral properties of the Koopman operator,” Annu. Rev. Fluid Mech. 45, 357–378 (2013).
7. C. W. Rowley, “Model reduction for fluids, using balanced proper orthogonal decomposition,” Int. J. Bifurcat. Chaos 15, 997–1013 (2005).
8. B. McKeon and A. Sharma, “A critical-layer framework for turbulent pipe flow,” J. Fluid Mech. 658, 336–382 (2010).
9. J. Bongard and H. Lipson, “Automated reverse engineering of nonlinear dynamical systems,” Proc. Natl. Acad. Sci. U.S.A. 104, 9943–9948 (2007).
10. M. Schmidt and H. Lipson, “Distilling free-form natural laws from experimental data,” Science 324, 81–85 (2009).
11. D. Xu and O. Khanmohamadi, “Spatiotemporal system reconstruction using Fourier spectral operators and structure selection techniques,” Chaos 18, 043122 (2008).
12. O. Khanmohamadi and D. Xu, “Spatiotemporal system identification on nonperiodic domains using Chebyshev spectral operators and system reduction algorithms,” Chaos 19, 033117 (2009).
13. S. L. Brunton, J. L. Proctor, and J. N. Kutz, “Discovering governing equations from data by sparse identification of nonlinear dynamical systems,” Proc. Natl. Acad. Sci. U.S.A. 113, 3932–3937 (2016).
14. S. H. Rudy, S. L. Brunton, J. L. Proctor, and J. N. Kutz, “Data-driven discovery of partial differential equations,” Sci. Adv. 3, e1602614 (2017).
15. P. A. K. Reinbold and R. O. Grigoriev, “Data-driven discovery of partial differential equation models with latent variables,” Phys. Rev. E 100, 022219 (2019).
16. X. Li, L. Li, Z. Yue, X. Tang, H. U. Voss, J. Kurths, and Y. Yuan, “Sparse learning of partial differential equations with structured dictionary matrix,” Chaos 29, 043130 (2019).
17. M. Bär, R. Hegger, and H. Kantz, “Fitting partial differential equations to space-time dynamics,” Phys. Rev. E 59, 337 (1999).
18. D. P. Vallette, G. Jacobs, and J. P. Gollub, “Oscillations and spatiotemporal chaos of one-dimensional fluid fronts,” Phys. Rev. E 55, 4274 (1997).
19. Y. Kuramoto, “Diffusion-induced chaos in reaction systems,” Prog. Theor. Phys. Suppl. 64, 346–367 (1978).
20. G. I. Sivashinsky, “Weak turbulence in periodic flows,” Physica D 17, 243–255 (1985).
21. L. N. Trefethen and J. Weideman, “The exponentially convergent trapezoidal rule,” SIAM Rev. 56, 385–458 (2014).