This paper presents a numerical method to implement the parameter estimation approach using response statistics that was recently formulated by the authors. The proposed approach formulates the parameter estimation problem of Itô drift diffusions as a nonlinear least-squares problem. To avoid solving the model repeatedly when an iterative scheme is used to solve the resulting least-squares problem, a polynomial surrogate model is employed on appropriate response statistics with smooth dependence on the parameters. The existence of minimizers of the approximate polynomial least-squares problems that converge to the solution of the true least-squares problem is established under appropriate regularity assumptions on the essential statistics as functions of the parameters. Numerical implementation of the proposed method is conducted on two prototypical examples that belong to classes of models with a wide range of applications: the Langevin dynamics and the stochastically forced gradient flows. Several important practical issues, such as the selection of an appropriate response operator to ensure the identifiability of the parameters and the reduction of the parameter space, are discussed. The numerical experiments show that the proposed approach outperforms the conventional approach that uses equilibrium statistics to determine the parameters.

In many dynamical systems, some of the model parameters cannot be identified from equilibrium statistics. For problems where the parameters can be identified from two-point statistics, this paper provides a method to estimate them numerically. Mathematically, we formulate the inverse problem as a well-posed, dynamically constrained nonlinear least-squares problem that fits a set of linear response statistics under small external perturbations. Since solving this problem directly is computationally infeasible, we consider a polynomial surrogate model to approximate the least-squares problem. Convergence of the solution of the approximate problem to the solution of the original nonlinear least-squares problem is established under appropriate regularity assumptions. Supporting numerical examples are given on two classes of models with a wide range of applications: the Langevin dynamics and the stochastically forced gradient flows.

## I. INTRODUCTION

Parameter estimation is ubiquitous in the modeling of nature. The goal of this inverse problem is to infer the parameters from observations with sufficient accuracy so that the resulting model becomes a useful predictive tool. Existing parameter estimation techniques often belong to one of two classes of approaches, maximum likelihood estimation^{1} and Bayesian inference such as Markov chain Monte Carlo (MCMC),^{2} depending on the availability of prior information about the parameters. Since the model parameters are usually not directly measured, the success of any inference method depends crucially on the identifiability of the parameters from the given observations. When the dependence of the observations on the parameters is implicit, that is, through the model, sensitivity analysis (see, e.g., the review article^{3}) is often a useful practical tool to determine parameter identifiability.

In this paper, we consider the parameter estimation of ergodic stochastic differential equations driven by Brownian noise. When the corresponding invariant measure of the dynamical system has an explicit dependence on the parameters, these parameters can usually be inferred from appropriate equilibrium statistical moments. A popular approach that exploits this idea is the reverse Monte Carlo method^{4} developed in chemistry. In particular, it formulates an appropriate nonlinear least-squares system of integral equations by matching the equilibrium averages of some pre-selected observables. Subsequently, Newton iterations are used to estimate the parameters. At each iterative step, samples at the current parameter estimate are generated for a Monte-Carlo estimation to approximate the least-squares integral equations and the corresponding Jacobian matrix. In practice, this method can be rather slow due to the repeated sampling procedure. A severe limitation of this approach is that it is restricted to inferring parameters of the equilibrium density.

This limitation can be overcome by fitting appropriate two-point statistics. In our previous work,^{5} we formulated the parameter estimation problem based on the linear response statistics subjected to an external forcing, which drives the system out of equilibrium. The fluctuation-dissipation theory (FDT), a hallmark of non-equilibrium statistical physics, suggests that the change of the average of an observable under small perturbations can be approximated by appropriate two-point statistics, called the FDT response operators.^{6} The key point is that these FDT response operators can be estimated using available samples of the *unperturbed* equilibrium dynamics so long as we know the exact form of the invariant measure of the dynamics, which we assume to be the case for a large class of problems, such as the Langevin dynamics and stochastic gradient flows. We should point out that the proposed approach relies on the validity of the FDT response statistics, which has been studied rigorously for a large class of stochastic systems that includes the stochastic differential equation (SDE) setup in this paper.^{7} For deterministic dynamics, one can use, for example, the statistical technique introduced in Ref. 8 to verify the validity of the FDT linear response. Following the ideas of Refs. 9–11, we developed a parameter inference method using these response operators in Ref. 5. While the method in Refs. 9–11 involves minimizing an information-theoretic functional that depends on both the mean and variance response operators, our approach fits a finite number of quantities, referred to as the *essential statistics*, which allow us to approximate the FDT response operators of appropriate observables (beyond just the mean and variance).

In our previous work,^{5} we showed the well-posedness of the formulation on three examples, in the sense that, under infinite sampling, appropriate choices of essential statistics ensure the identifiability of the model parameters. In one of these examples, the simplified model for turbulence,^{12} which is a nonlinear system of SDEs with a Gaussian invariant measure, we were able to explicitly determine the choice of observables and external forcings (which in turn determine the FDT response operators) that allows us to identify all of the model parameters uniquely. In this case, one can directly estimate the parameters by solving the system of equations involving these essential statistics. While this result suggests that the choice of observables and external forcings is problem dependent, the FDT provides a guideline for choosing appropriate two-point statistics for a well-posed parameter estimation. In another example, the Langevin dynamics (which will be discussed in Sec. IV), while the parameters can be identified by the essential statistics as shown in Ref. 5, the formulation suggests that one must include essential statistics that involve higher-order derivatives of the FDT response operators. This can be problematic because these higher-order statistics are rarely available in practice unless the available data are solutions of high-order SDE solvers. Given this practical issue, we propose to fit only the essential statistics that include the zeroth- and first-order derivatives of the FDT response operator (if possible, using only the zeroth-order information). The price paid for restricting to only the lower-order essential statistical information is that the dependence of the essential statistics on some of the model parameters becomes implicit through the solutions of the dynamical model, and the identifiability of these parameters becomes questionable.

In this paper, we devise a concrete numerical method to estimate the parameters by solving dynamically constrained least-squares problems of integral equations involving these low-order essential statistics. As one would imagine, the implementation of this approach faces several challenges. First, just as in the reverse Monte-Carlo method, a naive implementation of an iterative method for solving this dynamically constrained least-squares problem requires solving the model repeatedly. To avoid solving the model repeatedly in the minimization steps, we employ a polynomial surrogate-model approach^{13,14} on the least-squares cost function that involves the essential statistics. This approach is motivated by the facts that the cost function (or, effectively, the essential statistics) is subject to sampling error and that the essential statistics depend smoothly on the parameters, assuming that the Fokker-Planck operator of the underlying dynamics depends smoothly on the parameters. Under appropriate regularity assumptions on the essential statistics, we provide a theoretical guarantee for the existence of minimizers of the approximate polynomial least-squares problem that converge to the solution of the true least-squares problem. The second, related issue, as mentioned above, is that certain (lower-order) essential statistics might not be sensitive enough to some of the parameters, which will lead to inaccurate estimates. To ensure the practical identifiability of the parameters, we employ an empirical *a priori* sensitivity analysis based on the training data used in constructing the polynomial surrogate models, together with an *a posteriori* local sensitivity analysis to verify the validity of the a priori analysis.

While the proposed technique can be used to infer all of the model parameters, in practice, one should use the polynomial approximation to infer parameters that cannot be estimated directly to avoid the curse of dimensionality. This issue is a consequence of using the polynomial expansion in approximating the least-squares problem, where the required training data set for constructing the surrogate model increases exponentially as a function of the parameter dimension. In the numerical examples below, we will apply the approximate least-squares problems to estimate parameters that cannot be directly estimated from matching equilibrium and/or two-point statistics that involve the first-order derivative of the FDT response operator.

The rest of the paper is organized as follows. In Sec. II, we briefly review the concept of essential statistics and the parameter inference method using linear response statistics developed in Ref. 5. In Sec. III, we present the proposed numerical algorithm based on the polynomial-based surrogate model and discuss its convergence (with detailed proofs in Appendixes A–C). In Secs. IV and V, we show applications to two nonlinear examples, a Langevin model and a stochastic gradient system with a triple-well potential, respectively. In Sec. VI, we conclude the paper with a summary and discussion.

## II. A REVIEW OF ESSENTIAL STATISTICS

We begin by reviewing the parameter estimation formulation introduced in our previous paper.^{5} Consider an $n$-dimensional Itô diffusion

where the vector field $b(X;\theta)$ denotes the drift and $\sigma(X;\theta)$ is the diffusion tensor. Both coefficients are assumed to be smooth functions of $X$ and $\theta$; $W_t$ denotes the standard Wiener process, and the variable $\theta \in D$ contains the model parameters, where $D \subset \mathbb{R}^N$ is the parameter domain. In this paper, we only consider the case where $D$ is bounded and, for simplicity, we assume $D = [-1,1]^N$. Throughout this paper, we denote the underlying true parameter value by $\theta^\dagger$ and the corresponding estimate by $\hat{\theta}$. We assume that Eq. (1) is ergodic (see Ref. 15 for precise conditions) with equilibrium density $p_{\mathrm{eq}}(x;\theta)$ for all $\theta \in D$, and that $p_{\mathrm{eq}}(x;\theta)$ depends smoothly on both $x$ and $\theta$. When $\theta = \theta^\dagger$, we assume that we have access to the explicit formula of $p_{\mathrm{eq}}(x;\theta^\dagger)$ as a function of $x$ only, which we abbreviate as $p_{\mathrm{eq}}^\dagger(x)$. As we shall see in many applications, $p_{\mathrm{eq}}(x;\theta)$ only depends on a subset of the parameters $\theta$.

The main idea of the parameter estimation formulation introduced in Ref. 5 is to infer the parameters from the linear response operator associated with the fluctuation-dissipation theory (FDT). Recall that, in the FDT setting, we are interested in an order-$\delta$ ($0 < \delta \ll 1$) external perturbation of the form $f(x,t) = c(x)\,\delta f(t)$, such that the solutions of the perturbed dynamics,

which would otherwise remain at the equilibrium state of the unperturbed dynamics (1), can be characterized by a perturbed density $p^\delta(x,t;\theta)$. The density function is governed by the corresponding Fokker-Planck equation with initial condition $p^\delta(x,0;\theta) = p_{\mathrm{eq}}(x;\theta)$.

By a standard perturbation technique, e.g., Ref. 1, the difference between the perturbed and unperturbed statistics of any integrable function $A(x)$ can be estimated by a convolution integral, that is,

In (2), the term $k_A(t)$ is known as the linear response operator. The FDT formulates the linear response operator as the following two-point statistics:

where $B_i$ and $c_i$ denote the $i$th components of $B$ and $c$, respectively. We should point out that the validity of the FDT has been studied rigorously under mild conditions.^{7} A more explicit form is given by

where $\rho $ is the solution of the Fokker-Planck equation

Here, $L$ denotes the generator of the unperturbed dynamics (1), and $L^*$ is its adjoint operator in $L^2(\mathbb{R}^n)$.

The result in Ref. 16 states that if the coefficients of a linear parabolic PDE are $C^k$ as functions of $\theta$, then the weak solutions are also $k$-times differentiable with respect to $\theta$. In our case, if $b, \sigma \in C^k([-1,1]^N)$, then the linear response operator $k_A(t;\theta)$ in (4) is also $C^k$ with respect to the parameters $\theta$, under the mild assumption that $p_{\mathrm{eq}}(x;\theta)$ is smooth with respect to $\theta$. In Sec. III B, we will conduct a convergence analysis of the proposed parameter estimation method to determine the necessary value of $k$, the smoothness of the linear response operators as functions of $\theta$.

Notice that $B^\dagger(X) := B(X;\theta^\dagger)$ can be determined analytically based on our assumption of knowing the explicit formula of $p_{\mathrm{eq}}^\dagger$. Given $t$, the value of $k_A(t;\theta^\dagger)$ can be computed using a Monte-Carlo sum based on the time series of the unperturbed system, $X$, sampled from $p_{\mathrm{eq}}^\dagger(x)$. Therefore, the linear response operator can be estimated without knowing the underlying unperturbed system in (1), so long as the time series of $X$ at equilibrium and the explicit formula for the equilibrium density are known.
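As a minimal sketch of this Monte-Carlo step (with hypothetical function names, a scalar time series, and user-supplied observables $A$ and $B^\dagger$ for illustration), the two-point statistic $k_A(t) = \mathbb{E}[A(X_t)\,B^\dagger(X_0)]$ can be estimated by averaging over time origins of a single long equilibrium trajectory:

```python
import numpy as np

def estimate_response(x, A, B, max_lag, dt):
    """Monte-Carlo estimate of k_A(t) = E[A(X_t) B(X_0)] from a scalar
    equilibrium time series x, averaging over time origins (ergodicity)."""
    a, b = A(x), B(x)
    k = np.array([np.mean(a[lag:] * b[:len(b) - lag])
                  for lag in range(max_lag)])
    return np.arange(max_lag) * dt, k
```

For an Ornstein-Uhlenbeck process with $A(x) = B(x) = x$, for example, this estimate would recover the exponentially decaying autocorrelation function.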

The FDT, along with the expression for the kernel function (3), has been used to *predict* the change of the expectation of an observable $A$.^{17–22} The advantage is that the response operator is defined with respect to the equilibrium distribution, which does not require prior knowledge of the perturbed density $p^\delta$. In statistical physics, the FDT is a crucial formulation for deriving transport coefficients, e.g., electron conductivity, viscosity, and diffusion coefficients.^{6} In the current paper, however, we propose to use the response statistics to infer the underlying true parameter value $\theta^\dagger$.

Since this linear response operator is in principle infinite-dimensional, it is necessary for practical purposes to introduce a finite-dimensional approximation. The following parametric form of the response operator is motivated by the rational approximation of the Laplace transform.^{23} To explain the idea, consider the Laplace transform of $k_A(t;\theta)$, denoted by $K(s;\theta)$, which can be approximated by a rational function of the following form:

In the time domain, $k_A(t;\theta)$ can be written explicitly as

where $I$ denotes the $n$-by-$n$ identity matrix. Here, $m$ stands for the order of the rational approximation. In Ref. 23, the coefficients in the rational approximation were determined based on certain interpolation conditions, and such a rational approximation has been shown to be an excellent approximation of the time correlation function. For the current problem, $g_m$ will be determined using finitely many essential statistics, defined as follows.

The values $\{k_A^{(j)}(t_i;\theta^\dagger)\}_{i=1,2,\dots,K}$ are called *essential statistics* if they are sufficient to approximate the response operator $k_A(t;\theta^\dagger)$ up to order $m$ using (6), for $j = 0,\dots,2m-1$ and $t_1 < t_2 < \cdots < t_K$, where $j$ indicates the order of the derivatives.

In Ref. 5, we considered the derivatives of $k_A$ at $t = 0$ for a Langevin dynamics model, and explicit formulas involving the model parameters were found. Theoretically, all the parameters can be identified from $\{k_A^{(j)}(0^+;\theta^\dagger)\}$, which to some extent shows the consistency of the formulation. In practice, however, we use only derivatives of order no greater than one, since estimating higher-order derivatives of $k_A(t;\theta^\dagger)$ requires a time series $X(t_i)$ of sufficient accuracy, which is not necessarily available. By using the essential statistics corresponding to lower-order derivatives at $t_i$ other than zero, we give up the explicit expressions of the higher-order statistics with respect to the parameters. This prompted us to solve the problem in the least-squares sense. Furthermore, it requires other, more practical methods, such as a sensitivity analysis, to determine the identifiability of the parameters from these statistics.

Figure 1 shows the performance of such a discrete representation, where the linear response operator arises from a Langevin model subjected to a constant external forcing (25), which will be discussed in Sec. IV. With such a discrete representation, we reduce our problem from inferring an infinite-dimensional $k_A(t;\theta^\dagger)$ to a finite number of essential statistics. To derive a system of equations involving both $\theta$ and these essential statistics, we introduce

We also define for $j=0,1$,

We should stress that (7) is not the FDT response, since the $B^\dagger(x)$ in (7) is defined with respect to $p_{\mathrm{eq}}^\dagger$. The key idea in Ref. 5 is to estimate the true parameter values $\theta^\dagger$ by solving

Here, the term on the left-hand side is estimated from the available samples at $p_{\mathrm{eq}}^\dagger$, and the term on the right-hand side is computed from the time series generated by solving (1) for a given $\theta \in D$.

Since $p_{\mathrm{eq}}^\dagger$ usually depends on a subset of the true parameter value $\theta^\dagger$, the assumption of knowing the explicit form of $p_{\mathrm{eq}}^\dagger$ as a function of $x$ only is too strong. In general, we cannot even compute the left-hand-side term in (8), since the term $B^\dagger$, which may depend on the unknown true value $\theta^\dagger$, is not available to us. We will discuss how to address this issue in two examples in Secs. IV and V, each of which belongs to a class of models with a wide range of applications.

To ensure solvability, we assume that the total number of equations in (8) is always greater than the dimension of the unknown parameters $\theta$. For this over-determined scenario, we propose to solve (8) in the least-squares sense, addressing the practical issue raised in Ref. 5 when explicit solutions are not available. The main difficulty in solving the nonlinear least-squares problem comes from evaluating $\hat{k}_A^{(j)}(t_i;\theta)$, which often requires solving the model (1) repeatedly. In Sec. III, we will propose an efficient numerical method to address this issue.

## III. AN EFFICIENT ALGORITHM FOR PARAMETER ESTIMATION

In Sec. II, we formulated a parameter estimation method as a problem of solving a nonlinear system (8) subject to a dynamical constraint in (1). In a compact form, we denote the system (8) as

where $\theta \in [-1,1]^N$ and $K > N$. Notice that we have neglected the derivative constraints in (8). While a similar algorithm and convergence result as shown below also hold if the first-order derivative constraints are included in the least-squares procedure, we will only numerically solve the least-squares problem for the zeroth-order derivative constraints as in (9). As we shall see in Secs. IV and V, we will use the derivative constraints to directly estimate some of the parameters.

Our goal is to solve (9) in the sense of least-squares, that is,

However, as we mentioned in Sec. II, we do not necessarily have explicit expressions for $f_i(\theta)$, and evaluating $f_i$ at a given value of $\theta$ requires solving the true model (1) to approximate the integral in $f_i(\theta)$ by a Monte-Carlo sum. For example, if we apply the Gauss-Newton method to the nonlinear least-squares problem (10), the overall computation becomes very expensive, since we have to solve the model (1) repeatedly. As a remedy, we apply the idea of a polynomial-based surrogate model, which was introduced in Refs. 13 and 14 for solving Bayesian inference problems, to alleviate this part of the computation. The main difference between our approach and Refs. 13 and 14 is that they used orthogonal polynomials to expand the dynamical model, $X(t,\theta)$, whereas here we use orthogonal polynomials to expand $f_i(\theta)$. While the convergence of the approximate posterior density estimate of the Bayesian inference problem is established in Ref. 14, we will end this section by analyzing the convergence of the approximate least-squares problem.

### A. Polynomial based surrogate model

The key idea of our method is to approximate $f_i(\theta)$ by a polynomial function $f_i^M(\theta)$. This is motivated by the fact that sampling error cannot be avoided in computing the values of the essential statistics, that is, there is uncertainty on the left-hand side of (8). Thus, it is more reasonable to think of the nonlinear least-squares solution of (10) as a random variable with a certain distribution $\mu$ over $[-1,1]^N$. Furthermore, since $\{f_i(\theta)\}$ are $C^k$ with respect to $\theta$ (Remark 2.1) whenever $b$ and $\sigma$ in (1) are $C^k$ with respect to $\theta$, a polynomial chaos expansion provides a natural choice of $f_i^M(\theta)$ based on the orthogonal polynomials with respect to $\mu$.

For instance, if we consider $\mu \sim U[-1,1]$, i.e., the uniform distribution, then the corresponding orthogonal polynomials are the Legendre polynomials $\{p_n\}$. They are given by $p_0 = 1$, $p_1 = x$, and the recursive formula

which form a basis for $L^2([-1,1],\mu)$. The orthogonality of $\{p_n\}$ over $L^2([-1,1],\mu)$ is described by

One can normalize $p_n$ by introducing $P_n := p_n\sqrt{n + \tfrac{1}{2}}$.

For multi-dimensional cases, one can define $P_{\vec{k}}(\theta) := \prod_{j=1}^N P_{k_j}(\theta_j)$ ($N$ represents the dimension of $\theta$), where $\vec{k} \ge \vec{0}$ is a multi-index with $\vec{k} = (k_1,\dots,k_N) \in \{0,1,2,\dots\}^N$ and $\theta = (\theta_1,\dots,\theta_N) \in [-1,1]^N$. We can consider an order-$M$ approximation of $f_i(\theta)$,

In practice, there are two approaches to determine the coefficients $\alpha_{\vec{k}}^{(i)}$. The first is the Galerkin method, which requires the error in the approximation (11) to be orthogonal to the finite-dimensional subspace of $L^2([-1,1],\mu)$ generated by $\{P_{\vec{k}}(\theta)\}_{\|\vec{k}\|_\infty \le M}$. In our convergence analysis, we will deduce the error estimates based on the coefficients obtained through this Galerkin (or least-squares) fitting. Alternatively, one can determine the coefficients $\alpha_{\vec{k}}^{(i)}$ by the collocation method, that is, by matching the values of $f_i$ at certain points of $\theta$ in $D$, denoted by $\Theta$, which leads to the following linear equations:

Since (12) is equivalent to a polynomial interpolation, a common choice of $\Theta$ is the product space of order-$M_C$ Chebyshev nodes, where $M_C \ge M+1$ [when $M_C > M+1$, (12) becomes a linear least-squares problem]. In the numerical examples in this paper, we will use the collocation method to determine the coefficients.
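A minimal one-dimensional sketch of this collocation step (the function names are ours; the paper's setting is multi-dimensional, with a tensor product of Chebyshev nodes) builds the normalized-Legendre design matrix at the nodes and solves (12) as a linear least-squares problem:

```python
import numpy as np

def chebyshev_nodes(mc):
    # M_C Chebyshev nodes on [-1, 1]
    j = np.arange(1, mc + 1)
    return np.cos((2 * j - 1) * np.pi / (2 * mc))

def legendre_design(theta, M):
    # Columns are Legendre polynomials p_n built by the three-term recurrence
    # (n+1) p_{n+1} = (2n+1) x p_n - n p_{n-1}, then normalized as
    # P_n = p_n * sqrt(n + 1/2).
    theta = np.asarray(theta, dtype=float)
    P = np.zeros((theta.size, M + 1))
    P[:, 0] = 1.0
    if M >= 1:
        P[:, 1] = theta
    for n in range(1, M):
        P[:, n + 1] = ((2 * n + 1) * theta * P[:, n] - n * P[:, n - 1]) / (n + 1)
    return P * np.sqrt(np.arange(M + 1) + 0.5)

def fit_surrogate(f_vals, nodes, M):
    # Solve the collocation system (12); a least-squares fit when M_C > M + 1.
    alpha, *_ = np.linalg.lstsq(legendre_design(nodes, M), f_vals, rcond=None)
    return alpha
```

The surrogate $f^M(\theta) = \sum_n \alpha_n P_n(\theta)$ can then be evaluated at arbitrary $\theta$ without solving the model (1) again.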

Replacing $f_i(\theta)$ in (10) by $f_i^M(\theta)$, we obtain a new least-squares problem for the order-$M$ polynomial-based surrogate model

To simplify notation, let $f^M(\theta)$ and $J^M(\theta)$ denote the vector-valued function $[f_1^M(\theta),\dots,f_N^M(\theta)]^\top$ and its Jacobian matrix, respectively.

In summary, we have Algorithm 1 based on the Gauss-Newton method.

Let $\Theta$ be a set of collocation nodes in $[-1,1]^N$.

1. For each $\theta \in \Theta$, solve (1).

2. Estimate $f_i(\theta)$ using a Monte-Carlo sum for $i = 1,\dots,K$.

3. Compute the coefficients $\alpha_{\vec{k}}^{(i)}$ by solving the linear system (12) and obtain the approximations $f_i^M(\theta)$ for $i = 1,\dots,K$.

4. For an initial guess $\theta_0 \in [-1,1]^N$ and a threshold $\delta > 0$, compute $$\theta_k = \theta_{k-1} - \{[J^M(\theta_{k-1})]^\top J^M(\theta_{k-1})\}^{-1} [J^M(\theta_{k-1})]^\top f^M(\theta_{k-1}).$$

while the step length $\|\theta_k - \theta_{k-1}\| \ge \delta$ do

Repeat (14).

end while

return $\theta_k$


Computationally, the most expensive step is solving (1) on the collocation nodes $\Theta$, which can be done in parallel.
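Step 4 of Algorithm 1 is a plain Gauss-Newton iteration on the surrogate; a generic sketch (the names and stopping rule are ours) reads:

```python
import numpy as np

def gauss_newton(f, jac, theta0, delta=1e-10, max_iter=50):
    """Iterate theta_k = theta_{k-1} - [(J^T J)^{-1} J^T f](theta_{k-1}),
    as in (14), until the step length falls below the threshold delta."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        J = jac(theta)
        step = np.linalg.solve(J.T @ J, J.T @ f(theta))
        theta = theta - step
        if np.linalg.norm(step) < delta:
            break
    return theta
```

Each iteration here is cheap, because `f` and `jac` are evaluations of the polynomial surrogate rather than of the model (1).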

### B. The convergence of the approximate solutions

In this section, we provide the convergence analysis of the solutions of the approximate least-squares problem (13) to the solution of the true least-squares problem (10) as $M \to \infty$, under appropriate conditions. We denote the solution of (13) by $\theta_M^*$, that is,

Since each $f_i^M$ is a multivariate polynomial, the minimum in (13) can be attained in $[-1,1]^N$; but such a $\theta_M^*$ may not be unique. For the case $K = N$ and sufficiently smooth $f_i$, assuming that the true parameter value $\theta^\dagger \in (-1,1)^N$ is the unique solution of the true least-squares problem (10), we can show that there exists a sequence of minimizers $\{\theta_M^*\}$ of the approximate problem (13) such that $\theta_M^* \to \theta^\dagger$ as $M \to +\infty$. To be more precise,

Before we apply a numerical method to solve the problem, a fundamental question is: assuming that $\theta^\dagger \in (-1,1)^N$, the true parameter value, is the unique solution of (10), can we find a sequence of minimizers $\{\theta_M^*\}$ of the approximate problem (13) such that $\theta_M^* \to \theta^\dagger$ as $M \to +\infty$? We are able to prove the following theorem, which partially answers this question.

$f_i \in C^\ell([-1,1]^N)$, $\ell = \lceil \tfrac{3}{2}N \rceil + 2$, $i = 1,2,\dots,N$,

let $\theta^\dagger$ be the solution of (10) such that $f_i(\theta^\dagger) = 0$, $i = 1,2,\dots,N$,

the Jacobian matrix of $f := (f_1, f_2, \dots, f_N)^\top$ at $\theta^\dagger$, denoted by $\nabla \vec{f}(\theta^\dagger)$, is invertible.

We set $N = K$ since the existence of the minimizers relies on the technical assumption in Lemma 3.6. The second and third assumptions provide the well-posedness of the original nonlinear least-squares problem (10). Since we discuss the parameter estimation problem under the perfect-model setting, the second assumption naturally holds.

The proof of Theorem 3.1 can be found in Appendix A. Here, we present the proof for the one-dimensional case ($K = N = 1$) to illustrate the main ideas. Notice that in the one-dimensional case, the formula in (11) reduces to

where $\{P_n\}$ are the normalized Legendre polynomials. We start with two lemmas, which pertain to the pointwise convergence of $f^M$; their proofs can be found in Ref. 24.

Lemma 3.3 connects the smoothness of $f$ to the decay rate of the generalized Fourier coefficients $\alpha_k$. The following lemma is a direct consequence of Lemma 3.3.

Here, $\|\cdot\|_\infty$ denotes the $L^\infty$-norm on $C^2[-1,1]$. In the proof of Proposition 3.7, we also need $\|(f^M)' - f'\|_\infty \to 0$ as $M \to +\infty$. To obtain such convergence, one needs higher regularity of $f$. In particular, we have

Let $f \in C^4([-1,1])$ and let $f^M$ be the corresponding least-squares approximation given by (15). We have $\|f' - (f^M)'\|_\infty \to 0$ as $M \to +\infty$.

Following Ref. 24, we show that $\{(f^M)'\}$ is a Cauchy sequence in the $L^\infty$-norm. Notice that for $m > n$,

So far, we have shown that, with enough regularity of $f$, we have the uniform convergence of both $f^M \to f$ and $(f^M)' \to f'$ on $[-1,1]$. To show the existence of a minimizer near $\theta^\dagger$, the following lemma is crucial.

Let $F: D \subset \mathbb{R}^n \to \mathbb{R}^n$ be continuously differentiable in the domain $D$, and suppose that there are an open ball $B(x_0,r) \subset D$ and a positive constant $\gamma$ such that $\|\nabla F(x)^{-1}\| \le \gamma$ for all $x \in B(x_0,r)$ and $r > \gamma \|F(x_0)\|$. Then, $F(x) = 0$ has a solution in $B(x_0,r)$.

The proof can be found in Ref. 25. In the one-dimensional case, Lemma 3.6 basically says that if a $C^1$ function $f$ changes with a minimum speed ($|f'| > \gamma^{-1}$) on the interval $(x_0 - r, x_0 + r)$ and the function value at $x_0$ is small enough [$|f(x_0)| < r/\gamma$], then the function must reach zero in this interval. With Lemmas 3.4–3.6, we are able to prove the following result, which is the one-dimensional analog of Theorem 3.1.

$f \in C^4([-1,1])$,

$f(\theta^\dagger) = 0$,

$|f'(\theta^\dagger)| \neq 0$,

This result guarantees the existence of minimizers of the approximate polynomial least-squares problem that converge to the solution of the true least-squares problem under a fairly strong condition, namely, that $f$ is smooth enough. In practice (see Sec. IV), we shall see reasonably accurate estimates even when the essential statistics are not as regular as required by this theoretical estimate.

### C. Convergence of the algorithm

In Algorithm 1, we solve the least-squares problem (13). For the Gauss-Newton method, a standard local convergence result under a full-rank assumption on the Jacobian matrix is well known (see, e.g., Chapter 2 of Ref. 26 for details). To achieve such local convergence in our numerical algorithm, we now verify the full-rank condition in the polynomial chaos setting.

For our problem, the Jacobian matrix $J^M(\theta) \in \mathbb{R}^{K \times N}$ is given by

where $f_i^M$ is an element of the finite-dimensional polynomial space $\Gamma_N^M$ defined by

Using linear independence over the function space $\Gamma_N^M$ and some fundamental results in algebra, we can prove the following theorem.

See Appendix B.

Here, $N(J)$ is the set of $\theta$ for which $J(\theta)$ is not of full rank in the matrix sense. In our application, $N$ is the dimension of the parameter $\theta$, $K$ is the total number of equations in (9), and $M$ is the order of approximation used in (11). In practice, since the orthogonal polynomials involved in (11) form a basis for $\Gamma_N^M$, it is enough to check the rank of the coefficient matrix $A^M$, whose $i$th column consists of the coefficients $\alpha_{\vec{k}}^{(i)}$ in (11) for all possible $\vec{k}$. The linear independence hypothesis is indeed sensible in practice, since we want to avoid solving underdetermined least-squares problems.
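A possible numerical check of this hypothesis (our own sketch, not the paper's code) is to assemble the coefficient matrix $A^M$ column by column and test its rank:

```python
import numpy as np

def surrogate_full_rank(alpha_columns):
    """Stack the Legendre coefficient vectors alpha^(i) as the columns of
    A^M and verify full column rank, which certifies the linear-independence
    hypothesis and rules out an underdetermined least-squares fit."""
    A = np.column_stack(alpha_columns)
    return np.linalg.matrix_rank(A) == A.shape[1]
```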

## IV. EXAMPLE I: THE LANGEVIN MODEL

For the first example, we consider a classical model in statistical mechanics: the dynamics of a particle driven by a conservative force, a damping force, and a stochastic force. In particular, we choose the conservative force based on the Morse potential

where the last quadratic term in $U_0$ acts as a retaining potential (also known as a confining potential), preventing the particle from moving to infinity. For this one-dimensional model, we rescale the mass to unity ($m=1$) and write the dynamics as follows:

where $\dot{W}$ is white noise. The generator of the system (21), denoted by $L$, is given by

The smooth retaining potential $U(x)$ guarantees the ergodicity of the Langevin system (21) (see Appendix C for details). Namely, there is an equilibrium distribution (Gibbs measure) $p_{\mathrm{eq}}(x,v)$, given by

which is independent of $\gamma$. In particular, we have $v\sim\mathcal{N}(0,k_BT)$ at equilibrium. For this Langevin model, there are a total of five unknown parameters $\theta:=(\gamma,T,\epsilon,a,x_0)$ with true values $\theta^\dagger:=(\gamma^\dagger,T^\dagger,\epsilon^\dagger,a^\dagger,x_0^\dagger)$.

For this specific problem, notice that all parameters except $\gamma$ appear in the equilibrium density function. Intuitively, this suggests that one can estimate the four parameters $\{T,\epsilon,a,x_0\}$ from equilibrium statistics and $\gamma$ from two-point statistics. We will present a conventional estimation method based on this guideline for diagnostic purposes. In particular, the parameter $\gamma$ will be directly estimated from a two-point statistic as shown in Sec. IV A. Subsequently, the remaining four parameters are obtained via one-point statistics, as shown in (27) and Sec. IV B 1.

As for the proposed approach, we will consider estimating all the parameters using the essential statistics in the most efficient manner. First, two of the parameters, $\{\gamma,T\}$, will be estimated using the essential statistics in (27). Then, via a sensitivity analysis discussed in Sec. IV B 2, we will show that the essential statistics are insensitive to the parameter $x_0$. As a consequence, we formulate a least-squares problem to estimate $\theta=(\epsilon,a)$ and use equilibrium statistics to estimate $x_0$.

### A. Reduction of the parameter space

In our previous work,^{5} we suggested an approach to estimate $T^\dagger$ and $\gamma^\dagger$ directly from the essential statistics. To construct such essential statistics, we consider a constant external forcing $\delta f$ with $\delta\ll 1$. The corresponding perturbed system is given by

that is, $c(x)=(0,1)$ in the FDT formula (3). By selecting the observable $A=(0,v)^\top$, we work with the $(2,2)$ entry of the response operator given by

where $\rho$ is the solution of the corresponding Fokker-Planck equation (5) associated with the Langevin dynamics. Taking the time derivative of $\mathbb{E}_{p_{\mathrm{eq}}^\dagger}[v(t)v(0)]$ for $t>0$, we obtain

Letting $t\to 0^+$ in (26) and recalling that $v\sim\mathcal{N}(0,k_BT)$ at equilibrium, we have

where both left-hand sides can be computed from the sample. Thus, (27) provides direct estimates for $T^\dagger$ and $\gamma^\dagger$. As a result, the original parameter estimation problem is reduced to estimating $(\epsilon^\dagger,a^\dagger,x_0^\dagger)$ in the potential function $U$, which is a non-trivial task since the dependence of $p_{\mathrm{eq}}$ on these parameters is quite complicated. With these estimates, $B^\dagger$ becomes available since it only depends on $k_BT^\dagger$, which allows one to compute $M_j(t_i)$ in (8) and avoid the issue pointed out in Remark 2.4.
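A minimal sketch of these direct estimates, consistent with the derivation above: at equilibrium, $k_BT$ equals the variance of $v$, and the slope of the velocity autocorrelation at $t\to 0^+$ equals $-\gamma k_BT$. A harmonic potential is used here as a stand-in for the Morse potential (the two identities do not depend on the form of $U$), and the trajectory is generated by Euler-Maruyama rather than the operator-splitting scheme used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, kBT, h, n = 0.5, 1.0, 0.005, 400_000

# Euler-Maruyama for dx = v dt, dv = (-x - gamma*v) dt + sqrt(2*gamma*kBT) dW
noise = np.sqrt(2 * gamma * kBT * h) * rng.standard_normal(n)
x, v = 0.0, 0.0
vs = np.empty(n)
for i in range(n):
    x, v = x + v * h, v - (x + gamma * v) * h + noise[i]
    vs[i] = v

vs = vs[n // 10:]                       # discard the transient
kBT_hat = np.var(vs)                    # estimate of k_B T = Var(v)
C0 = np.mean(vs * vs)                   # E[v(0)v(0)]
Ch = np.mean(vs[1:] * vs[:-1])          # E[v(h)v(0)]
gamma_hat = -(Ch - C0) / (h * kBT_hat)  # short-time slope divided by -k_B T
```

With this sample size, the estimates are typically within a few percent of the true values $(\gamma,k_BT)=(0.5,1)$.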

### B. Parameter estimation approaches

In this subsection, we focus on the three-dimensional parameter estimation problem with $\theta:=(\epsilon,a,x_0)$. First, we review the conventional approach, which matches the equilibrium statistics of $x$. Then, we discuss the proposed new approach using the essential statistics. In particular, we perform a sensitivity analysis to determine the identifiability of the parameters from the essential statistics. Finally, we present the numerical results, including a comparison between the conventional and the new approaches.

#### 1. Conventional method

We first look at what we can learn from the equilibrium statistics. Since the marginal distribution of $x$ at the equilibrium state is proportional to $\exp[-U(x;\theta)/k_BT]$, a natural idea is to match the moments of $x$ with respect to this equilibrium density. In particular, one can introduce the following three equations:

where the left-hand sides are computed from the given data, while the right-hand sides are treated as functions of $(\epsilon,a,x_0)$, obtained from solving the model (24). The following pre-computation simplifies Eq. (28) into a one-dimensional problem.

To begin with, we define the probability density functions

in terms of $U$ and $U_0$, respectively. With the change of variables $y:=a(x-x_0)$, the normalizing constants $N$ and $N_0$ satisfy

As a result, the first two equations in (28) can be written as

with a unique solution

where $\mathrm{Var}_{p_{\mathrm{eq}}^{1,0}}[x]$ stands for the variance of $x$ with respect to the rescaled equilibrium density $p_{\mathrm{eq}}^{1,0}$. In practice, both $\mathbb{E}_{p_{\mathrm{eq}}^\dagger}[x]$ and $\mathrm{Var}_{p_{\mathrm{eq}}^\dagger}[x]$ can be estimated empirically from the available data, while $\mathrm{Var}_{p_{\mathrm{eq}}^{1,0}}[x]$ is a function of $\epsilon$ only. Thus, (29) can be denoted as

and (28) is reduced to a one-dimensional problem for $\epsilon$,

Note that the left-hand side of (30) will be estimated by Monte Carlo averaging from the unperturbed data, whereas the right-hand side integral will be approximated using a numerical quadrature.
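The reduced one-dimensional problem can be sketched as follows: tabulate the quadrature-based variance of the rescaled density as a function of $\epsilon$ and solve the scalar equation by bisection. The specific Morse form and the retaining-term coefficient $0.2$ below are assumptions for illustration, and the target value stands in for the Monte Carlo average that would come from data.

```python
import numpy as np

kBT = 1.0
y = np.linspace(-3.0, 8.0, 4001)
dy = y[1] - y[0]

def rescaled_var(eps):
    # variance of the rescaled density ~ exp(-U0(y; eps)/kBT); after the
    # change of variables it depends on eps only (assumed potential form)
    U = eps * (np.exp(-2 * y) - 2 * np.exp(-y)) + 0.2 * y**2
    w = np.exp(-(U - U.min()) / kBT)         # unnormalized density
    Z = np.sum(w) * dy
    mean = np.sum(y * w) * dy / Z
    return np.sum((y - mean) ** 2 * w) * dy / Z

def bisect(g, lo, hi, tol=1e-10):
    """Solve g(eps) = 0 for a sign-changing g on [lo, hi]."""
    glo = g(lo)
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        gmid = g(mid)
        if gmid * glo > 0:
            lo, glo = mid, gmid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

target = rescaled_var(0.2)                   # stands in for the data average
eps_hat = bisect(lambda e: rescaled_var(e) - target, 0.01, 1.0)
```

Since the variance is monotone in $\epsilon$ over this range, the bisection recovers the value used to generate the target.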

#### 2. Sensitivity analysis of the essential statistics

To ensure a successful parameter estimation, it is useful to gather some *a priori* knowledge of the parameter identifiability from the proposed essential statistics. This is especially important since the choice of essential statistics to fit is not unique, and we want to ensure that the parameters can be identified from the chosen statistics; recall that the essential statistics are defined by the choices of external forcing and observable.

While a local sensitivity analysis such as the pathwise derivative method described in Ref. 27 is desirable, it requires knowledge of the true parameters, which is not available to us. Essentially, the pathwise derivative method computes $\bar{y}_\theta:=\mathbb{E}[Y(t,\omega)]$, where $Y(t,\omega):=D_\theta X(t,\theta)$ and the expectation is defined with respect to the density of the Itô diffusion (written in its integral form)

Since we have no explicit solution for the density $\rho(x,t;\theta)$, one way to approximate $\bar{y}_\theta$ is to use an ensemble average of the solutions of the following equation:

Notice that this equation depends on (31) and the unknown parameters $\theta$. The dependence on $\theta$ implies that one cannot use this method for an *a priori* sensitivity analysis. However, it can be used for an *a posteriori* sensitivity analysis, evaluated at the estimates, to diagnose difficulties in particular parameter regimes.

As an empirical method for the *a priori* sensitivity analysis, we simply check the deviation of $\hat{k}(t,\theta)$ as a function of the training collocation nodes $\theta\in\Theta$, which are available from Step 2 of Algorithm 1. Of course, a more elaborate global sensitivity analysis technique, such as the Sobol index,^{28} can be performed as well, but we will not pursue this here. From this empirical *a priori* sensitivity analysis, we found that the corresponding essential statistic, $\mathbb{E}_{p_{\mathrm{eq}}}[v(t)v(0)]$, is not sensitive to $x_0$. However, it is sensitive to $a$ and strongly sensitive to $\epsilon$. Also, the sensitivity in the low damping case ($\gamma=0.5$) is stronger than in the high damping case ($\gamma=5.0$). To verify the validity of this empirical method, we compute the local sensitivity indices $\bar{y}_\theta:=\mathbb{E}[Y(t,\omega)]$, where $Y(t,\omega)$ solves (32) with the true parameters (again, this is not feasible in practice since the true parameters are not available).
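The empirical check can be sketched in a few lines: sweep each parameter over its collocation nodes while holding the others fixed, and record the spread of the statistic. A closed-form toy statistic (which, by construction, ignores $x_0$) replaces the simulated $\hat{k}(t,\theta)$ here.

```python
import numpy as np

t = np.linspace(0.1, 2.0, 20)

def k_hat(t, eps, a, x0):
    # toy stand-in: decays with eps, depends weakly on a, ignores x0
    return np.exp(-eps * t) * (1 + 0.1 * a * t)

nodes = np.cos((2 * np.arange(8) + 1) * np.pi / 16)     # Chebyshev nodes on [-1, 1]
eps_n, a_n, x0_n = 0.2 + 0.1 * nodes, 10 + 2 * nodes, 0.5 * nodes

def spread(values):
    """Max deviation of the statistic over a 1-D sweep of one parameter."""
    return np.max(np.ptp(values, axis=0))

s_eps = spread(np.stack([k_hat(t, e, 10.0, 0.0) for e in eps_n]))
s_a = spread(np.stack([k_hat(t, 0.2, a, 0.0) for a in a_n]))
s_x0 = spread(np.stack([k_hat(t, 0.2, 10.0, x) for x in x0_n]))
# s_eps and s_a are positive while s_x0 vanishes, flagging x0 as unidentifiable
```

A parameter whose sweep leaves the statistic unchanged cannot be identified from that statistic and should be removed from the least-squares problem.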

For our Langevin model, we have the corresponding drift and diffusion coefficients

Suppose we are interested in the dependence on the parameter $x_0$. Introducing the notations $x_{x_0}:=\frac{\partial}{\partial x_0}x(t,x_0)$ and $v_{x_0}:=\frac{\partial}{\partial x_0}v(t,x_0)$, the joint system of (31) and (32) (written in differential form) is given as follows:

There is no stochastic term in the last equation, since the noise in the Langevin model is additive with a coefficient independent of $x_0$. Notice that $(x_{x_0},v_{x_0})\equiv(1,0)$ is a fixed point of the last two equations in (33) because of

As a result, by choosing the initial condition $[x(0),v(0)]=(x_0,c)$ for an arbitrary constant $c$, one can claim that the solution $v(t)$ of the Langevin model (21) is independent of $x_0$. Thus, under this circumstance, $\mathbb{E}_{p_{\mathrm{eq}}^\dagger}[v(t)v(0)]$ is independent of $x_0^\dagger$, confirming the conclusion from our empirical *a priori* sensitivity analysis.

While this parameter insensitivity may seem discouraging, we can use it to our advantage to simplify the problem. That is, we can assign an arbitrary value to $x_0$ when applying Algorithm 1 and consider estimating only $\theta=(a,\epsilon)$. Once these two parameters are estimated, we can use the formula in (29) to estimate $x_0$.

We also use this local sensitivity analysis to verify the validity of the *a priori* sensitivity analysis with respect to $\epsilon$ and $a$. Since it is difficult to analyze the explicit behavior as above, for a given realization of the Langevin model, we solve the remaining ODEs by the fourth-order Runge-Kutta method. Figure 2 shows the numerical results for $\bar{v}_a$ and $\bar{v}_\epsilon$, where the average is estimated over 3000 realizations of the Langevin model. Based on the scales of $\bar{v}_a$ and $\bar{v}_\epsilon$ in Fig. 2, we confirm the sensitivity of the two-point statistic $\mathbb{E}_{p_{\mathrm{eq}}^\dagger}[v(t)v(0)]$ that was found empirically. Namely, the identifiability of $\epsilon$ is stronger than that of $a$, and the identifiability in the low damping case ($\gamma=0.5$) is stronger than in the high damping case ($\gamma=5.0$).
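A sketch of the pathwise computation behind $\bar{v}_\epsilon$: one noise realization drives the Langevin pair $(x,v)$, while the variational pair $(x_\epsilon,v_\epsilon)$ satisfies a noise-free linear ODE along that path. The $0.2x^2$ retaining term is an assumption, and forward Euler is used in place of the paper's fourth-order Runge-Kutta for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, kBT, eps, a, x0 = 0.5, 1.0, 0.2, 10.0, 0.0
h, n = 1e-3, 2000

def Ux(x):      # dU/dx for the Morse well plus the assumed 0.2*x^2 term
    z = np.exp(-a * (x - x0))
    return eps * 2 * a * (z - z * z) + 0.4 * x

def Uxx(x):     # d^2U/dx^2
    z = np.exp(-a * (x - x0))
    return eps * 2 * a * a * (2 * z * z - z) + 0.4

def Uxeps(x):   # mixed derivative d^2U/(dx d eps)
    z = np.exp(-a * (x - x0))
    return 2 * a * (z - z * z)

x, v, xe, ve = 0.1, 0.0, 0.0, 0.0
for _ in range(n):
    dW = np.sqrt(h) * rng.standard_normal()
    x, v, xe, ve = (
        x + v * h,
        v - (Ux(x) + gamma * v) * h + np.sqrt(2 * gamma * kBT) * dW,
        xe + ve * h,                                     # x_eps' = v_eps
        ve - (Uxx(x) * xe + Uxeps(x) + gamma * ve) * h,  # noise-free variational ODE
    )
v_eps_sens = ve   # dv(t)/d(eps) along this noise realization
```

Averaging `v_eps_sens` over many independent realizations produces the curve $\bar{v}_\epsilon$ reported in Fig. 2.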

### C. Numerical results

We now present several numerical results. Based on a time series of size $10^7$ (with temporal lag $h=2\times10^{-3}$), we compare the conventional method and the proposed scheme that fits the essential statistics under $\gamma^\dagger=0.5$ and $5.0$, representing a low damping case and a high damping case, respectively. The true values of the remaining four parameters are set to $(k_BT^\dagger,\epsilon^\dagger,a^\dagger,x_0^\dagger)=(1,0.2,10,0)$. The data, in the form of a time series, are generated from the model (24) using an operator-splitting method,^{29} which is a strong second-order method.

#### 1. Low damping case

We start with the conventional method. To examine the sensitivity of the estimation of $(\epsilon^\dagger,a^\dagger,x_0^\dagger)$ to the estimate $\hat{T}$, as a comparison, we also include the estimation in which we use the true value $k_BT^\dagger=1$ in solving (30) (we refer to the results of this scenario as partial estimates). The results are listed in Table I. We clearly observe that the results of the conventional method are sensitive to the value of $\hat{T}$. To explain this sensitivity, recall that in solving (30), $a$ and $x_0$ are viewed as functions of $\epsilon$. This suggests that high sensitivity would occur if

where $a(\epsilon)$ and $x_0(\epsilon)$ are given by (29) and the implicit function theorem has been used. By direct computation, we get

Since the explicit formula for the partial derivative with respect to $\epsilon$ is not easy to compute, its value will be computed using a finite difference formula. Evaluating it at $(k_BT,\epsilon,a,x_0)=(1,0.2,10,0)$, we have

which supports our observation.

| | $k_BT$ | $\gamma$ | $\epsilon$ | $a$ | $x_0$ |
|---|---|---|---|---|---|
| True | 1.0000 | 0.50000 | 0.20000 | 10.000 | 0.0000 |
| Eq. stat. full estimates | 0.99903 | 0.50003 | 0.16870 | 10.888 | 0.001120 |
| Eq. stat. partial estimates | … | 0.50003 | 0.20039 | 10.005 | 0.011263 |
| Essential stat. estimates | 0.99903 | 0.50003 | 0.20004 | 9.9741 | 0.008265 |

Next, we report the estimates obtained from the new approach using essential statistics; the results are also listed in Table I. The improvement in accuracy for $\epsilon^\dagger$ and $a^\dagger$ is noticeable compared with the full estimates of the conventional method. Here, we used a total of $20$ essential statistics given by $k_A(t_i;\theta^\dagger)$ for $t_i=0.1i$, $i=1,2,\dots,20$. We applied Algorithm 1 to solve an ensemble of two-dimensional ($\epsilon$ and $a$) nonlinear least-squares problems based on $300$ uniformly generated random initial conditions. Figure 3 (left) shows the contour plot of the cost function together with all of the estimates. Notice that, except for a few outliers, most of the estimates lie along the low-value contour of the cost function. The estimate $\hat{\epsilon}$ reported is the average of all the estimates (excluding the outliers). To satisfy the equilibrium constraint, $a(\hat{\epsilon})$ given by (29) is used as the estimate of $a^\dagger$.

The results from this test indicate that, for the conventional method, it is difficult to obtain very accurate estimates of $\epsilon^\dagger$ unless $T^\dagger$ can be estimated accurately, e.g., by using a longer time series. In contrast, the proposed approach of using essential statistics is much less sensitive to the error in $\hat{T}$. This approach, however, requires a pre-computing step as noted in Algorithm 1. In our numerical test, we solved the model (21) on 64 collocation nodes to evaluate the essential statistics over different values of $(\epsilon,a)$; that is, order $M_C=8$ Chebyshev nodes were used to construct the set $\Theta$ used in (12). However, as we have previously alluded to, this can be done in parallel, and it will not become a serious issue as long as the dimension of the parameter space is relatively small. As for the value of $M$ in (12), we picked $M=6$. (The same scheme was applied to the high damping case.)
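The surrogate pre-computation of Step 2 can be sketched as follows: evaluate a statistic on an $8\times 8$ tensor grid of Chebyshev nodes in the rescaled parameters, fit a Chebyshev expansion of degree $6$ per dimension by least squares, and query the cheap surrogate thereafter. A closed-form toy statistic replaces the simulated $k_A(t_i;\theta)$ here.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

Mc, M = 8, 6
nodes = np.cos((2 * np.arange(Mc) + 1) * np.pi / (2 * Mc))   # Chebyshev nodes
E, A = np.meshgrid(nodes, nodes, indexing="ij")              # tensor grid in [-1,1]^2

def stat(e, a):                       # toy stand-in for an essential statistic
    return np.exp(-0.5 * e) * np.cos(0.3 * a)

V = C.chebvander2d(E.ravel(), A.ravel(), [M, M])             # design matrix
coef, *_ = np.linalg.lstsq(V, stat(E, A).ravel(), rcond=None)
coef = coef.reshape(M + 1, M + 1)

# the surrogate is now cheap to evaluate anywhere in [-1, 1]^2
surrogate = lambda e, a: C.chebval2d(e, a, coef)
err = abs(surrogate(0.3, -0.2) - stat(0.3, -0.2))
```

Each call to `surrogate` replaces a full model simulation inside the Gauss-Newton iterations, which is the source of the computational savings.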

Another interesting issue arises when the damping parameter is large. Since the conventional method relies entirely on one-point statistics with respect to the equilibrium density, it is important to have high-quality independent samples. In Fig. 4 (left), we show the time correlation of $x$ for both the low damping regime ($\gamma=0.5$) and the high damping regime ($\gamma=5.0$). We observe that in the latter case, the auto-correlation function of $x$ decays much more slowly, indicating a strong correlation among samples with small lags. In this case, the estimates from the conventional method will deteriorate due to the difficulty in obtaining high-quality independent samples.

#### 2. High damping case

In this section, we focus on the high damping regime $\gamma^\dagger=5.0$. We will also make a connection between Fig. 2 and the estimation using the essential statistics. As shown in Fig. 4 (left), we no longer have a fast decay of the time auto-correlation of $x$. As a result, without changing the sample size, significant errors will occur in estimating the moments of $x$ used in the conventional method. This can be clearly seen from the results listed in Table II. In particular, besides suffering from the sensitivity to $\hat{T}$, the error in estimating $\mathbb{E}_{p_{\mathrm{eq}}^\dagger}[x^n]$ for $n=1,2,3$ also leads to inaccurate estimates of $\epsilon^\dagger$, $a^\dagger$, and $x_0^\dagger$. We should point out an interesting fact: although the marginal distribution of $x$ at the equilibrium state is independent of $\gamma$, a large value of $\gamma$ causes difficulties in estimating the moments of $x$ in practice.

| | $k_BT$ | $\gamma$ | $\epsilon$ | $a$ | $x_0$ |
|---|---|---|---|---|---|
| True | 1.0000 | 5.0000 | 0.20000 | 10.000 | 0.0000 |
| Eq. stat. full estimates | 0.99949 | 4.9993 | 0.69579 | 5.4717 | 0.12990 |
| Eq. stat. partial estimates | … | 4.9993 | 0.25931 | 8.9295 | −0.0034460 |
| Essential stat. estimates | 0.99949 | 4.9993 | 0.21160 | 9.8663 | −0.01822 |

Table II and Fig. 3 (right) show the numerical results using the essential statistics. Although the relative error is not as small as in the low damping case, it is still lower than that of the estimates from the conventional method. Our method is implemented as follows: similar to the low damping case, (29) is used to estimate the value of $a^\dagger$ based on the estimate $\hat{\epsilon}$. The 20 essential statistics here are given by $k_A(t_i;\theta^\dagger)$ for $t_i=0.04i+0.2$, $i=1,2,\dots,20$ [suggested by the time auto-correlation of $v$ shown in Fig. 4 (right)], which lie in a much shorter time interval. Compared to the low damping case, the loss of accuracy can be explained by the sensitivity analysis (Fig. 2), which indicated that the parameters $(\epsilon,a)$ in this regime are less identifiable when $\gamma$ is large.

## V. EXAMPLE II: A GRADIENT SYSTEM WITH A TRIPLE-WELL POTENTIAL

Our next example is a gradient system driven by white noise. We consider a two-dimensional stochastic system as follows:

where $W_t$ is a two-dimensional Wiener process, $V$ is a potential energy, and $C$ is a matrix defined by

The generator $L$ of the system (34) is given by

We choose a triple-well potential function $V$ similar to the model in Ref. 30,

with

where $\chi_{(-a,a)}(z)$ denotes the characteristic function of the interval $(-a,a)$. Notice that the matrix $C$ is positive definite. The additional quadratic term $0.2[(x_1-a)^2+(x_2-a/3)^2]$ in the triple-well potential (36) is, again, a smooth retaining potential. It is well known^{1} that the triple-well model (34) yields an equilibrium distribution given by

which is independent of the parameter $d$, the off-diagonal element of $C$.

In the numerical tests, we set $(d^\dagger,a^\dagger,k_BT^\dagger,\gamma^\dagger)=(0.5,1,1.5,0.25)$ as the true values of the parameters. Figure 5 shows the contour plot of the potential and a scatter plot of the time series under this set of parameters. To generate the data from (34), we applied the weak trapezoidal method introduced in Ref. 31, which is a weak second-order method. In the remainder of this section, we will choose appropriate essential statistics from which $\{d,T\}$ can be directly estimated; the remaining parameters, $\{a,\gamma\}$, will be estimated by solving an appropriate least-squares problem.

### A. The essential statistics and reduction of the parameter space

Similar to the Langevin model, we consider an external forcing that is constant in $x$. Subsequently, the perturbed dynamics is given by

where $|\delta|\ll 1$. If we select $A(x):=x$ as the observable, the corresponding linear response operator reads

where $B=(B_1,B_2)^\top$ with

given by (3). As a result, the entries of the linear response operator $k_A(t;\theta)$ satisfy

where $V_{x_i}$ denotes the partial derivative of $V$ with respect to $x_i$. Using integration by parts, one can show that

which leads to

This is known as the equipartition of the energy in statistical mechanics.

Notice that the response operators $k^{(i,j)}(t;\theta^\dagger)$ are not accessible because the function $V_{x_i}(x;\theta^\dagger)$ depends on the unknown true parameter values $(k_BT^\dagger,a^\dagger,\gamma^\dagger)$. This is precisely one of the issues raised in Remark 2.4. For gradient flow systems, this problem can be overcome by introducing a linear transformation to another set of two-point statistics. We define

and consider their time derivatives. Following the same calculations that led to (26), we obtain the following identities for $t>0$:

It is more helpful to rewrite (42) as a linear system for $t>0$,

where the coefficient matrix is non-singular since $d\in(-1,1)$. This linear relationship suggests that one can fit $m_{i,j}$ in place of $k_{i,j}$, since the former is numerically accessible.

Letting $t\to 0^+$ and applying (40) to the first and third equations of (43) with $\theta=\theta^\dagger$, we obtain the following estimates for $d^\dagger$ and $T^\dagger$:

where $m_{i,j}'(0^+;\theta^\dagger)$ is computable from the available sample. Equation (44) reduces the problem to a two-dimensional one of estimating $(a^\dagger,\gamma^\dagger)$ only.

### B. Numerical results

As we pointed out earlier, the linear response operator $k^{(i,j)}(t;\theta^\dagger)$ in (39) is not directly accessible. Motivated by the linear relation in (43), we consider inferring the values $(a^\dagger,\gamma^\dagger)$ from the two-point statistics $m_{i,j}(t;\theta^\dagger)$ in (41). In particular, we choose $m_{1,1}(t_i;\theta^\dagger)$ with $t_i=0.1i$, $i=1,\dots,20$, as the essential statistics in our numerical test.

In light of the fact that $m_{1,1}(t;\theta)$ is a two-point statistic of $x_1$, we can apply the empirical *a priori* sensitivity test by checking the dependence of $\hat{k}(t,\theta)$ over the training collocation nodes $\theta=(a,\gamma)\in\Theta$, which are available to us from Step 2 of Algorithm 1. In this numerical experiment, the training used 25 collocation nodes $\theta\in\Theta$. To validate this empirical sensitivity test, we also apply the local sensitivity analysis described in Sec. IV B 2 to the parameters $a$ and $\gamma$ (which, again, is not possible in practice since the true parameters are unknown). The results are shown in Fig. 6 (left). From the scales of $\bar{x}_a$ and $\bar{x}_\gamma$ (averaged over $2000$ realizations of $x$), we can see that this two-point statistic, $m_{1,1}(t;\theta^\dagger)$, is quite sensitive to the parameters $\gamma$ and $a$, with a stronger dependence on the parameter $a$.

Based on a time series of size $4\times10^6$ (with time lag $h=1\times10^{-3}$), we implemented Algorithm 1 with $M=4$ and collocation nodes $\Theta$ constructed from order $M_C=5$ Chebyshev nodes in (12).

The estimates are shown in Table III. We also show in Fig. 6 (right) the contour plot of the cost function together with the estimates based on $300$ uniformly generated initial guesses. Notice that the true parameter value does not lie on the lowest contour value; these discrepancies are due to the surrogate modeling and the quality of the samples. However, the estimates $\hat{a}$ and $\hat{\gamma}$ are still reasonably accurate, as reported in Table III. We should point out that the estimates $\hat{a}$ and $\hat{\gamma}$ reported in this table are the averages of all the estimates (excluding the outliers), while the estimates $\widehat{k_BT}$ and $\hat{d}$ are obtained by solving (44). The contour plot also confirms the sensitivity analysis, which suggested that estimating $a^\dagger$ is easier than estimating $\gamma^\dagger$.

| | $a$ | $\gamma$ | $k_BT$ | $d$ |
|---|---|---|---|---|
| True | 1.0000 | 0.2500 | 1.5000 | 0.5000 |
| Estimates | 1.0249 | 0.2463 | 1.4946 | 0.5077 |

One interesting question is whether the approximate parameter values can reproduce the essential statistics, which is useful for predicting non-equilibrium averages in the presence of external forces. Using the estimates reported in Table III, we compare the resulting two-point statistics used in estimating the parameters, $m_{1,1}(t;\hat{\theta})$, with the corresponding true statistics, $m_{1,1}(t;\theta^\dagger)$ [see Fig. 7 (left)]. In Fig. 7 (right), we also compare the estimated response operator, $k_{1,1}(t;\hat{\theta})$, with the true response operator, $k_{1,1}(t;\theta^\dagger)$. Excellent agreement is found.

## VI. SUMMARY AND FURTHER DISCUSSION

This paper presented a parameter estimation method for stochastic models in the form of Itô drift diffusions. To infer parameters that do not appear in the equilibrium density function, we proposed to fit two-point *essential statistics* formulated using the fluctuation-dissipation theory. Building upon the framework established in our previous work,^{5} we formulated the problem as a nonlinear least-squares problem subject to a dynamical constraint when the parameters cannot be estimated directly by solving the corresponding statistical constraints. To avoid the expensive cost of evaluating the essential statistics at each iteration when the Gauss-Newton method is used, we proposed to solve an approximate least-squares problem based on a polynomial surrogate modeling approach. This approach is motivated by the facts that sampling error cannot be avoided in computing the values of the essential statistics and that the essential statistics are smooth functions of the parameters. We established the existence of minimizers of the approximate least-squares problem that converge to the solution of the true least-squares problem involving the essential statistics, under the assumption that these statistics are smooth functions of the parameters. We also showed that the polynomial approximate least-squares problem has a Jacobian that is full rank almost everywhere, which implies the local convergence of Gauss-Newton iterations. We tested the proposed method on two examples that belong to two large classes of stochastic models: a Langevin dynamics model and a stochastic gradient system. In general, we expect that the parameter estimation procedure should be carried out as follows:

1. Reduce the parameter space by direct estimation using appropriate statistics that are easily computable, whenever this is possible.

2. Based on the observable of interest, the functional form of the equilibrium density, the external forcing term, and the available data, identify appropriate essential statistics for the remaining parameters using the *a priori* sensitivity analysis test. In our implementation, we compare the essential statistics computed on the training parameter values (e.g., collocation nodes). A more elaborate global sensitivity analysis technique, such as the Sobol index,^{28} can be used to determine the parameter identifiability.

3. For the remaining parameters, formulate a nonlinear least-squares problem using the appropriate essential statistics at some training parameter values.

4. Apply Algorithm 1 to obtain the estimates $\hat{\theta}$ of the true parameter values $\theta^\dagger$.

5. Apply a sensitivity analysis, such as the local sensitivity analysis discussed in Sec. IV B 2, as an *a posteriori* confidence check on the estimates. If the parameters are found to be insensitive to the selected response statistics, go back to Step 3 and choose a different set of response statistics.

One restriction of our formulation is the need to compute $B^\dagger$ in (7), which may require knowledge of (some of) the true values $\theta^\dagger$; this is not feasible in general. As a result, the nonlinear system (8) cannot be evaluated without knowing the true parameters. For the Langevin dynamics, we found that the parameter $k_BT^\dagger$ in $B^\dagger$ can be estimated from an equilibrium statistic, the variance of $v$. For the gradient flow problem, there exists an invertible linear transformation between the linear response statistics, which are not computable, and other two-point statistics that can be estimated without knowing the true parameter values. Since Algorithm 1 does not rely on the FDT formulation, one can apply it to any available two-point statistics, e.g., time auto-correlations of the solution. We are not aware of a general routine that bypasses this difficulty. However, as we have demonstrated, at least for general Langevin dynamics models and stochastic gradient systems, this idea works.

While the proposed method requires knowledge of the equilibrium density function of the unperturbed system, one can relax this condition and estimate it from the data. For low-dimensional problems, one can use the non-parametric kernel density estimation method.^{32,33} If the invariant distribution of the dynamical system can be characterized by a density function defined on a smooth manifold (embedded in a high-dimensional phase space) that the data lie on or close to, then one can estimate the density function by applying the theory of kernel embedding of distributions^{34} on a Hilbert space with basis functions defined on the data manifold. Analogous to the conditional density estimation proposed in Refs. 37 and 38, these data-driven basis functions can be obtained via the diffusion maps algorithm.^{35,36} An alternative to nonparametric methods is a parametric density function estimation using the moment-constrained maximum entropy principle.^{39} For densities with moderate, say five to seven, dimensions, one can use the recently developed Equation-By-Equation (EBE) method^{40} for solving the moment-constrained maximum entropy problems. The source code of the EBE scheme is available at Ref. 41.

One of our ultimate goals is to use this method for parameter estimation in molecular modeling, in which a Langevin type of model is usually used. The potential energy typically involves a large set of parameters, e.g., the stiffness constants associated with bond stretching and the deformation of bond angles, as well as the coefficients in the van der Waals and screened electrostatic interactions. The selection of damping coefficients is also non-trivial; see the review paper^{42} for various perspectives. Compared to existing methods, the novelty of our approach is that we use the response statistics to formulate the parameter estimation problem, which can reveal parameters that do not appear in the equilibrium density. In addition, the polynomial surrogate model provides an efficient means of solving the nonlinear least-squares problem.

To achieve this goal, however, some challenges remain. First, for a large parameter space, we suspect that one needs to combine existing methods that compute the parameters associated with the equilibrium density, as one way to reduce the problem (as suggested in Step 1 above), with the proposed method to estimate the remaining parameters. Second, the proposed method requires high-quality and possibly large samples for an accurate evaluation of the essential statistics. However, since the essential statistics depend on the choice of the observable and external forcing, approximating the high-dimensional integral can be avoided so long as the dimensions of the ranges of these two functions are small. Finally, the underlying model might be subject to modeling error. For example, the model may be derived from a multiscale expansion or simply postulated empirically. The formulation of the response statistics in the presence of modeling error will be investigated in separate works.

## ACKNOWLEDGMENTS

This research was partially supported by the National Science Foundation (NSF) (Grant No. DMS-1619661). X.L. is also supported by the NSF (Grant No. DMS-1522617). J.H. is also supported by the Office of Naval Research (ONR) (Grant No. N00014-16-1-2888).

### APPENDIX A: PROOF OF THEOREM 3.1

In Sec. III B, we presented the convergence result Theorem 3.1 and included the proof for the one-dimensional case. Here, we show the proof for the general $N$-dimensional case. We will first generalize Lemmas 3.3–3.5 to the $N$-dimensional case.

Let $f \in C^{\ell}([-1,1]^N)$ with $\ell > \frac{3}{2}N$, and consider its least-squares approximation $f^M$ given by (11). Then, we have $\|f - f^M\|_{\infty} \to 0$ as $M \to +\infty$.
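This sup-norm convergence can be illustrated numerically in the one-dimensional case. The sketch below uses an arbitrary smooth test function and a Legendre least-squares fit of increasing degree; the function and degrees are illustrative choices, not taken from the paper:

```python
import numpy as np

# Smooth (in fact analytic) test function on [-1, 1]; an illustrative choice
def f(x):
    return np.exp(np.sin(2 * x))

x_fit = np.linspace(-1, 1, 2000)   # grid on which the least-squares fit is computed
x_chk = np.linspace(-1, 1, 5001)   # finer grid used to estimate the sup norm

sup_errors = []
for M in [2, 4, 8, 12]:
    # Degree-M least-squares approximation in the Legendre basis
    coef = np.polynomial.legendre.legfit(x_fit, f(x_fit), deg=M)
    fM = np.polynomial.legendre.legval(x_chk, coef)
    sup_errors.append(np.max(np.abs(fM - f(x_chk))))
```

The estimated sup-norm errors decrease rapidly as the degree $M$ grows, consistent with the lemma.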

To generalize Lemma 3.5, it is enough to consider the partial derivative case, that is, to find the condition under which $\|f^M_{x_i} - f_{x_i}\|_{\infty} \to 0$ as $M \to +\infty$.

Now, we are ready to prove Theorem 3.1.

### APPENDIX B: PROOF OF THEOREM 3.8

The full rank condition on the Jacobian matrix is usually required in the local convergence theorems for Newton-like methods. We discuss this issue over a finite-dimensional function space $\Gamma^M_N$ defined by

Since the zero element in $\Gamma^M_N$ is the function that is identically zero, we can introduce the following definition of linear independence.

We define the differential operator $D_{\theta_i}: \Gamma^M_N \longrightarrow \Gamma^M_N$ as follows:

Now, let us deduce some relevant properties of such an operator.

The operators $\{D_{\theta_i}\}_{i=1}^{N}$ satisfy

$\operatorname{rank}(D_{\theta_i}) = M(M+1)^{N-1};$

for all $i \neq j$, there exists a permutation matrix $P_{ij}$ satisfying $D_{\theta_j} = P_{ij}^{\top} D_{\theta_i} P_{ij}$.

The following lemma will also be useful to prove our main result.

To gain intuition, one can consider the following example with $N = 2$. One can select the following ordered basis for $\Gamma^M_2$:

We can derive the matrix representations of $D_{\theta_1}$ and $D_{\theta_2}$,

where $I \in \mathbb{R}^{(M+1)\times(M+1)}$ is the identity matrix and $E \in \mathbb{R}^{(M+1)\times(M+1)}$ is defined as follows:

The linear combination $D_{\theta_1} + \lambda D_{\theta_2}$ satisfies

and $\operatorname{rank}(D_{\theta_1} + \lambda D_{\theta_2}) \geq M(M+1)$ for all $\lambda \in \mathbb{R}$.
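For the $N = 2$ example, these rank and permutation-similarity properties can be checked numerically. A minimal sketch, assuming a monomial-type tensor-product basis in which the one-dimensional differentiation matrix $E$ is superdiagonal (this explicit form of $E$ is our assumption for illustration; the paper's basis may differ, but the rank counts are unchanged under a change of basis):

```python
import numpy as np

M = 3                                  # polynomial degree in each variable
E = np.diag(np.arange(1.0, M + 1), 1)  # assumed 1D differentiation matrix: d/dt t^k = k t^(k-1)
I = np.eye(M + 1)

# Matrix representations of the two operators on the tensor-product basis
D1 = np.kron(E, I)   # differentiation in theta_1
D2 = np.kron(I, E)   # differentiation in theta_2

rank_D1 = np.linalg.matrix_rank(D1)    # expected: M(M+1)^(N-1) = M(M+1) for N = 2
ranks_comb = [np.linalg.matrix_rank(D1 + lam * D2)
              for lam in (0.0, 1.0, -2.5, 10.0)]

# Perfect-shuffle (commutation) permutation exchanging the Kronecker factors,
# which realizes the similarity D2 = P^T D1 P
P = np.zeros(((M + 1) ** 2, (M + 1) ** 2))
for i in range(M + 1):
    for j in range(M + 1):
        P[i * (M + 1) + j, j * (M + 1) + i] = 1.0
```

Here `rank_D1` equals $M(M+1)$, each entry of `ranks_comb` is at least $M(M+1)$, and `P` is the permutation matrix of the second property.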

We now turn to our discussion of the full rank condition of the Jacobian matrix,

where $f_i \in \Gamma^M_N$ and $K \geq N$. Let $f$ be the column vector $[f_1(\theta), \ldots, f_K(\theta)]^{\top}$; then $f \in (\Gamma^M_N)^K$, the $K$-fold product space of $\Gamma^M_N$. We can rewrite the Jacobian matrix in the following way:

where $D_{\theta_i} f$ is defined by applying $D_{\theta_i}$ to each component of $f$. With the above observation, we are ready to prove Theorem 3.8.

### APPENDIX C: PROOF OF THE ERGODICITY OF THE LANGEVIN DYNAMICS (21)

With the change of variables $q = a(x - x_0)$, $p = av$, we reduce (21) to

To verify the ergodicity of (C1), we apply Theorem 3.2 in Ref. 15, which requires the potential $U_0$ to satisfy the following two conditions:

- $U_0(q) \geq 0$ for all $q \in \mathbb{R}$.

- There exist $\alpha > 0$ and $\beta \in (0,1)$ such that
$$\frac{1}{2} U_0'(q)\, q \geq \beta U_0(q) + \frac{\gamma^2 \beta (2-\beta)}{8(1-\beta)} q^2 - \alpha. \tag{C2}$$

For the first condition, we simply notice that adding a constant term to $U_0$ does not affect the dynamics or the equilibrium distribution. Replacing $U_0$ in (C1) by

we have

For the second condition, substituting $U_0$ in (C2) with (C3), the inequality becomes, for all $q \in \mathbb{R}$,

It is enough to show that

- the function $f(q) = -q e^{-2q} + q e^{-q} - \beta e^{-2q} + 2\beta e^{-q}$ has a lower bound, and

- for any given $\gamma, \epsilon > 0$, there exists $\beta \in (0,1)$ such that
$$0.01(1-\beta) - \frac{\gamma^2 \beta (2-\beta)}{8\epsilon(1-\beta)} > 0. \tag{C4}$$

Since $f(q)$ is continuous, by checking its limits on both sides, $\lim_{q \to -\infty} f(q) = +\infty$ and $\lim_{q \to +\infty} f(q) = 0$,

one can see that $f(q)$ must have a lower bound. Given $\epsilon, \gamma > 0$, (C4) can be satisfied by taking $\beta \in \left(0,\; 1 - \sqrt{1 - \frac{0.08}{0.08 + \gamma^2/\epsilon}}\right)$. Thus, one can conclude that both conditions are satisfied and the Langevin dynamics (21) is indeed ergodic for all admissible values of the parameters.
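The algebra behind this interval for $\beta$ reduces, after multiplying (C4) by $8\epsilon(1-\beta) > 0$ and writing $\beta(2-\beta) = 1 - (1-\beta)^2$, to the requirement $(1-\beta)^2 > \frac{\gamma^2/\epsilon}{0.08 + \gamma^2/\epsilon}$. A quick numerical sanity check of the stated interval (the sample values of $\gamma$ and $\epsilon$ below are arbitrary):

```python
import math

def lhs_C4(beta, gamma, eps):
    """Left-hand side of condition (C4)."""
    return 0.01 * (1 - beta) - gamma**2 * beta * (2 - beta) / (8 * eps * (1 - beta))

ok = True
for gamma, eps in [(0.5, 1.0), (2.0, 0.3), (0.1, 5.0)]:
    # Upper endpoint of the admissible interval for beta
    beta_max = 1 - math.sqrt(1 - 0.08 / (0.08 + gamma**2 / eps))
    beta = 0.5 * beta_max          # any beta strictly inside (0, beta_max)
    ok = ok and (0 < beta < 1) and (lhs_C4(beta, gamma, eps) > 0)
```

For every sampled pair $(\gamma, \epsilon)$, any $\beta$ strictly inside the interval makes the left-hand side of (C4) positive.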

## REFERENCES

*Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference*, 2nd ed., Chapman & Hall/CRC Texts in Statistical Science (Taylor & Francis, 2006).

*International Conference on Algorithmic Learning Theory*(Springer, 2007), pp. 13–31.