Learning dynamics on invariant measures using PDE-constrained optimization

We extend the methodology in [Yang et al., 2023] to learn autonomous continuous-time dynamical systems from invariant measures. The highlight of our approach is to reformulate the inverse problem of learning ODEs or SDEs from data as a PDE-constrained optimization problem. This shift in perspective allows us to learn from slowly sampled inference trajectories and perform uncertainty quantification for the forecasted dynamics. Our approach also yields a forward model with better stability than direct trajectory simulation in certain situations. We present numerical results for the Van der Pol oscillator and the Lorenz-63 system, together with real-world applications to Hall-effect thruster dynamics and temperature prediction, to demonstrate the effectiveness of the proposed approach.

Data-driven models have proven to be instrumental across numerous scientific disciplines for their ability to predict and control the behavior of complex physical systems. 1 Popular approaches for modeling dynamical trajectories typically adopt a Lagrangian perspective and seek a pointwise matching with either the observed data or its approximate state derivatives. When the observed data has a poor temporal resolution and the state derivatives are difficult to approximate, these approaches may struggle. Such difficulties are further exaggerated when measurements are contaminated with noise, and the system in question exhibits sensitive dependence on initial conditions. In this paper, we propose an alternative approach that can circumvent some of these challenges by treating global statistics of the observed dynamics as the inference data.

I. INTRODUCTION
Differential equations are typically used to model trajectory data originating from physical systems. Common techniques for fitting differential equations to trajectory data include the shooting methods, 6,7 neural differential equations, 5,8,9 and SINDy. 2,10 Methods based on the Kalman filter are effective in data assimilation and in estimating unknown states and parameters of a system as it evolves in time. [11][12][13][14][15][16] When used to identify model parameters, these approaches fall under the broad category of system identification.
These approaches all adopt a Lagrangian perspective and directly fit the modeled trajectories or their state derivatives to the observed measurements. While these techniques have seen great success in modeling complex dynamics, their application is generally limited to inference trajectories sampled at a relatively high frequency. When the inference trajectory is sampled slowly, or in the worst-case scenario when measurement times are unknown, these approaches may not be a) Corresponding author: jrb482@cornell.edu applicable. For example, see Figure 1 in which we investigate the use of SINDy 2 and a Neural ODE 5 for modeling the dynamics of a slowly sampled limit cycle.
There are at least three sources of instabilities when directly using the trajectory data to perform velocity reconstruction. First, for certain chaotic dynamical systems, a small perturbation in the initial condition can lead to a large deviation in the trajectory at a later time, which cannot be differentiated from inaccurate dynamics by looking at the data alone. Second, the estimation of the particle velocity suffers from slowly sampled trajectory data, which directly affects the reconstruction of the dynamics, as shown in Figure 1. Third, the measurement (extrinsic) noise and the model intrinsic noise both change the state location. The small noise pollution can be amplified more in the velocity estimation using the divided difference with a small time step. All three factors share the nature that a small perturbation in the trajectory data leads to a large deviation in the estimated velocity/learned dynamics.
In contrast with the Lagrangian approach to modeling dynamics, our method builds on an Eulerian perspective 17,18 in which velocity models are constructed to yield the same asymptotic statistics as the observed measurements. This approach converts what is traditionally regarded as an ordinary differential equation (ODE) or stochastic differential equation (SDE) modeling problem into a partial differential equation (PDE)-constrained optimization problem. The motivation of our method is that, in certain situations, the PDE forward model yields better stability in solving the inverse problem than direct trajectory forward simulation based upon an ODE or SDE. Importantly, our method does not rely on prior knowledge of sampling times and can thus be used to learn the dynamics from slowly sampled trajectories.
There are two important differences between the line of work using Kalman filters and our proposed method. First, a Kalman filter is a particular case of the Bayes filter using the Bayes theorem, while our reconstruction follows a deterministic inverse problem (PDE-constrained optimization). Second, time is a crucial element in designing a Kalman filter, while in our approach, we use the invariant measure and a timeindependent PDE surrogate model. Once the flow has been inferred, we can also perform uncertainty quantification for the FIG. 1. Comparison with the SINDy 2-4 and the Neural ODE 5 frameworks for learning slowly sampled dynamics. The left panel shows the original dynamics (in grey) and the first eight points of a slowly sampled trajectory (in red). While the SINDy and Neural ODE frameworks can learn the quickly sampled dynamics (model output in grey), both struggle to learn from the slowly sampled trajectory (model output in red). On the other hand, our framework can learn from both quickly and slowly sampled dynamics (the red and grey model outputs coincide). Additional experiment details are provided in Section V A.
forecasted dynamics, building towards extending grid-based Bayesian estimation of nonlinear low-dimensional systems 19 to slowly sampled unknown systems with nontrivial invariant measures.
More specifically, instead of directly treating the noisy observations {x(t i )} n i=1 from one single trajectory of an autonomous flowẋ = v * (x) as inference data, we consider the occupation measure ρ * generated by a single trajectory, where for each measurable set B, When the occupation measures generated by a nontrivial (see Section II A) set of initial conditions all weakly converge to the same invariant measure, the limiting measure is said to be physical. 20 In this work, we consider the class of autonomous systems for which the occupation measure of Lebesgue-almost all initial conditions converges to a unique physical measure. Notably, this encompasses chaotic attractors such as the Lorenz-63 system. 21,22 This assumption guarantees the uniqueness of the invariant measure for the dynamical system under study. If we relax it and allow the existence of multiple invariant measures, further treatment of the PDE forward model is needed. For instance, the fact that different invariant measures are mutually singular as well as information on the initial condition, among other considerations, is necessary to guarantee that the steady-state solution picked up by the PDE model matches the observed invariant measure. We remark that the definition of a physical measure demonstrates its robustness to perturbations with respect to initial conditions. Going forward, we write v = v(θ ) = v(x; θ ) to denote the dependence of the reconstructed velocity fields on a set of parameters θ ∈ Θ where Θ ⊂ R m is the admissible set of all parameter values. The concrete form of θ depends on the hypothesis space of v, which will be discussed in Section IV B. The task is now to find the best-parameterized model v(x; θ ) approximating the true velocity v * through the optimization The formulation (2) represents an inverse data-matching problem, in which D denotes a metric or divergence on the space of probability measures and ρ ε (v(θ )) is a regularized approximation to the physical measure of the dynamical system, given some regularization parameter ε > 0 and the current velocity v(θ ). That is, v(θ ) → ρ ε (v(θ )) is our new forward model. Although one could approximate ρ(v(θ )) by numerically integrating a trajectory and binning the observed states to a histogram, 17 this approach does not permit simple differentiation of the resulting measure with respect to the parameters θ . When the size of θ , i.e., m, is large, it is practical to use gradient-based optimization methods for solving the optimization problem (2), and one has to compute the essential gradient ∂ θ J . In Ref. 18, this was handled by viewing ρ ε (v(θ )) as the dominant eigenvector of a regularized Markov matrix originating from an upwind finite volume discretization of the continuity equation. The derivative ∂ θ J was then seamlessly computed via the adjoint-state method. 18 The computation time of the adjoint-state method is independent of the size of θ , making the framework presented in Ref. 18 well-suited for large-scale computational inverse problems.
In this work, we build upon the framework proposed in Ref. 18  in order to guarantee the uniqueness of the computed stationary solution ρ ε (v(θ )).

2.
In contrast to only learning three coefficients as done in Ref. 18, we parameterize the velocity v(θ ) using piecewise polynomial, global polynomial, and neural network discretizations, which can all yield large parameter spaces with thousands of dimensions. We compare the reconstructed velocity in each case and further discuss how the choice of parameterization affects the inverse problem's well-posedness and the reconstructed velocity's regularity. We also consider various metrics/divergences as the choice of the objective function.
3. We investigate velocity learning in time-delay coordinates, which can characterize the full dynamics from partial state measurements alone. 23 After performing the optimization (2), we evolve the learned Fokker-Planck equation forward in time to quantify the uncertainty in predictions of future dynamics. Based on this framework, we demonstrate that forecasts incur larger uncertainties when the embedding dimension is not sufficiently high. It is worth noting that there is no analytic form for the velocity in time-delay coordinates, even for well-studied dynamical systems. We also stress that our proposed approach permits larger-scale modeling of time-delayed dynamics than the approach considered in Ref. 17, due to the use of the adjoint-state method when solving the PDE-constrained optimization.
The rest of the paper is organized as follows. In Section II, we review essential background on dynamical systems, invariant measures, the Fokker-Planck equation, and time-delay coordinates. In Section III, we introduce the forward surrogate model ρ ε (v(θ )) and analyze its modeling errors. In Section IV, we present an efficient gradient calculation for the objective function J (θ ) by treating (2) as a PDE-constrained optimization problem and utilizing the adjoint-state method. We then adapt the gradient calculation to various velocity parameterizations, including neural network discretizations in which the gradient is computed along with the backpropagation technique. 24 Finally, in Section V, we present velocity reconstructions for the Van der Pol oscillator and the Lorenz-63 system. We also model dynamics in time-delay coordinates based on realworld data from a Hall-effect thruster and actual temperature recordings. We perform uncertainty quantification on the last two real-data examples. Conclusions follow in Section VI.

II. BACKGROUND
This section reviews the essential background on invariant measures, stochastic dynamics, the Fokker-Planck equation, and time-delay coordinates. We also review the Eulerian approach for parameter identification proposed in Refs. 17 and 18, as well as past work on the discrete inverse Frobenius-Perron problem. 25

A. Physical Measures
Physical measures characterize the long-term statistical behavior of a significant collection of dynamical trajectories. When a dynamical system is chaotic and exhibits sensitive dependence on initial conditions, the existence of a physical measure unifies the statistical properties of trajectories that are pointwise dissimilar. While ergodic measures also describe the long-term statistical behavior of dynamical trajectories, they may have very small support or even be singular. On the other hand, when a dynamical system admits a physical measure, it holds that the trajectories corresponding to a positive Lebesgue measure subset of initial conditions will all share the same statistical behavior. We will now formalize these ideas in the language of ergodic theory. For a more thorough treatment of the topic, we refer to Refs. 20, 26-28.
While we will review the theory of physical measures in the context of discrete-time dynamical systems, our applications will consider dynamics given by a time-∆t flow map for some ∆t > 0. Following Ref. 20, we assume that M is a compact Riemannian manifold and that T : M → M is a diffeomorphism. A probability measure µ is said to be invariant with respect to the map T if µ(T −1 (B)) = µ(B) for all B ∈ B, where B denotes the Borel σ -algebra (see Ref. 29, Definition 2.1). Hereafter, we will assume that µ is an invariant measure. A point x ∈ M is said to be generic (see Ref. 20, Section 2.2) if for all g ∈ C(M), it holds that The left-hand side of (3) is known as the time-average of a function g ∈ C(M) whereas the right-hand side of (3) is known as the space average. It follows Birkhoff's pointwise ergodic theorem (see Ref. 29, Theorem 2.30) that the timeaverage of any g ∈ C(M) necessarily exists on a set of full µ-measure. To formally discuss the statistical properties of dynamical trajectories, we now define the N-step occupation measure given the initial condition x ∈ M as The condition that a point x ∈ M is generic is equivalent to the condition where convergence takes place in the weak-* topology (see Ref. 29,Definition 4.19). Since the quantity µ x,N (B) approximates the average amount of time for which the orbit {T k (x)} ∞ k=0 initiated at x ∈ M resides in a measurable set B ∈ B, this convergence indicates that the collection of generic points all share the same asymptotic statistical behavior. When the measure µ is ergodic (see Ref. 29,Definition 4.19), it holds that µ-almost every x ∈ M is a generic point (see Ref. 29,Corollary 4.20). However, if µ is an ergodic measure that is singular with respect to the Lebesgue measure, the resulting collection of generic points may be physically insignificant and difficult to observe computationally. Motivated by this perspective, an invariant measure µ is said to be physical if there exists a collection of generic points with positive Lebesgue measure (see Ref. 20, Definition 2.3).
We will next discuss the ways in which a physical invariant measure µ can be computationally approximated. If one collects the measurements {T k (x)} N k=1 , the weak-* convergence in (5) suggests that the physical measure µ will describe the statistics of our measurements provided that N is sufficiently large. Motivated by this perspective, we can discretize the domain M and directly compute the occupation measure (4) for each cell in the discretization to approximate the physical measure. This procedure has been previously used to approximate physical measures. 17,18,30 Other approaches have been proposed to compute the invariant measure as the stationary vector of the finite-dimensional approximation of the continuous Frobenius-Perron operator, 31 including Ulam's method, 27 and Galerkin-type methods. 32,33 More precisely, these discretizations are used to construct a Markov matrix that represents a random dynamical system approximating the deterministic map T : M → M. An invariant measure for the discrete approximation is then recovered as a stationary vector of the resulting Markov matrix. As the discretization is refined, certain assumptions guarantee that the desired physical measure will be recovered in the weak-* limit (see Ref. 32, Theorem 4.14).

B. Stochastic Dynamics and the Fokker-Planck Equation
Consider an Itô stochastic differential equation (SDE) of the form Above, W t is a Brownian motion, v is the velocity, and σ determines the diffusion matrix Σ(x) = 1 2 σ (x)σ (x) . For simplicity, we will consider the case of a constant diffusion. Similar to the deterministic setting, there are analogous notions of invariant measures, ergodicity, and physical measures in the stochastic setting. 34,35 One may use the Euler-Maruyama method to obtain the numerical solution to (6) on the time interval [0, T ], which assigns where {ξ j } are independently and identically distributed By assuming a constant diffusion, we may write Σ(x) = DI, where I denotes the identity and D > 0 is a constant representing the scale of the diffusion. Equation (7) can then be simplified to read We leave the study of a non-constant or anisotropic diffusion for later work. We remark that if D = 0, (8) reduces to the so-called continuity equation, which instead models the probability flow of the ODE given byẋ = v(x). Under certain conditions, 37 the steady-state solution ρ(x) of (8) exists and satisfies Since (9) describes a limiting distribution lim t→∞ ρ(x,t), it has been previously used to provide approximations of invariant measures for stochastically-forced dynamical systems. 30 In Ref. 38, an SDE learning problem was studied using (7) as the modelling equation with different data assumptions.

C. Delay Coordinates and Takens' Theorem
The technique of time-delay embedding is a popular approach for reconstructing chaotic dynamical systems from limited observations. 17,[39][40][41] The procedure involves embedding time series measurements ψ(t) = ψ(x(t)) of the state x(t) into d-dimensional Euclidean space by considering the vector of time-lagged observations for some τ > 0. Takens' theorem 23 provides suitable assumptions under which Ψ d,τ (t) and x(t) are related via a diffeomorphism, implying that the time-lagged vector of partial observations Ψ d,τ (t) is sufficient for reconstructing the full state x(t). Notably, the embedding dimension provided in Ref. 23 is d = 2m + 1 where m is the dimension of a compact manifold M on which the flow map f t for the original dynamics is defined. In cases when trajectories are attracted to a compact subset A with box-counting dimension (see Ref. 42, Page 586) d A strictly less than m, it turns out that lower-dimensional embeddings can be obtained.
When a time-series projection ψ(t) of an unknown systeṁ x = v(x) is observed, one can try to numerically determine a suitable embedding dimension d and time delay τ; see for example, Refs. 43-46. Choosing a proper embedding dimension and time delay is important for obtaining a reliable surrogate model of the original dynamics in time-delayed coordinates. Notably, in Section V B, we demonstrate that models for the velocity in time-delayed coordinates can incur excess uncertainties when the embedding dimension is not sufficiently large.

D. Prior Work on Learning Dynamics from Invariant Measures
For chaotic systems, trajectories are sensitive to initial conditions and estimation parameters. Sometimes, the approximate reference velocity field {v(x(t i ))} cannot be accurately estimated from a trajectory {x(t i )} due to the lack of observational data, slow sampling, discontinuous or inconsistent time trajectories, and noisy measurements. To tackle such difficulties, instead of working with the Lagrangian trajectories, Refs. 17 and 18 propose an Eulerian approach by treating the occupation measure (4) as the data. When enough samples are available, the occupation measure can be treated as an approximation to the invariant measure; see Section II A. Finding the optimal parameter θ is then translated into the optimization problem (2). The reference measure ρ * is the occupation measure converted from the observed trajectories {x(t i )}; see (4). In Ref. 17, the approximated synthetic ρ ε (v(θ )) is generated by first simulating the synthetic trajectories {x(t i ; θ )} based on the dynamical system and then computing its histogram following (4). Since this approach requires lengthy trajectory simulation, each evaluation of ρ ε (v(θ )) for a given θ is relatively costly. Moreover, it is difficult to compute the derivative of ρ ε (v(θ )) with respect to θ due to the histogram approximation of nonlinear trajectories. As an improvement to the original idea in Ref. 17,Ref. 18 proposes a surrogate model to approximate ρ ε (v(θ )) that is differentiable in θ and sometimes faster to compute. The key idea is to solve for ρ ε (v(θ )) as the distributional steady-state solution to the continuity equation (i.e., (9) with D = 0) using a finite volume upwind scheme together with the teleportation regularization. The gradient of the objective function J in (2) with respect to the parameter θ can be efficiently computed based on the adjoint-state method (see Ref. 18, Sec. 5). The problem of learning an SDE from an invariant measure is also studied in Ref. 47, which uses a deep learning framework to invert the drift and diffusion terms.
The task of learning a dynamical system from an invariant measure has also been studied in the discrete-time setting under the inverse Frobenius-Perron problem. 25,[48][49][50] The Frobenius-Perron operator, also known as the transfer operator, characterizes the time evolution of an initial measure µ 0 according to some prespecified dynamical system. Given a probability measure µ, the inverse Frobenius-Perron problem seeks to construct a dynamical system for which µ is a fixed point of the associated transfer operator. The most widely studied case involves recovering an ergodic map T on [0, 1] for which a prescribed absolutely continuous measure is the unique fixed point of the discrete transfer operator. In this particular setting, various approaches such as topological conjugation 51 and matrix methods 52 have been introduced to solve the inverse problem. The multivariate inverse Frobenius-Perron problem was also studied in Ref. 53, where ergodic maps were constructed to adhere to the statistics of two-dimensional densities. Moreover, due to inherent nonuniqueness in the inverse problem, recent approaches further restrict the solution space of the discrete ergodic maps to those with a prescribed power spectrum. 54 To the best of our knowledge, Refs. 17, 18, and 47, and our contributions here are the first works that numerically solve the inverse Frobenius-Perron problem in the continuous-time setting. Notably, we do not assume that µ is absolutely continuous, as we use a finite-volume discretization to approximate the Frobenius-Perron operator.

III. THE FORWARD MODEL AND MODELING ERRORS
A central contribution of this work is to consider a different regularized forward model than the one in Ref. 18, especially for trajectory measurements containing intrinsic noise, which can be interpreted as sample paths of stochastic dynamical systems (6). In those cases, the Fokker-Planck equation (7) is a better candidate as the PDE surrogate model, as it contains a diffusion term that can fit noise present in the data. Based on the relationship between (6) and (7), one can learn both the velocity field v(x) and the diffusion tensor Σ(x) in the optimization framework (2). For simplicity, we only consider a fixed diffusion constant and leave the investigation of multi-parameter inversion to future work.
We will use (9) as the forward model to fit invariant measures generated by trajectories with intrinsic noise. While the diffusion term allows the model to fit the intrinsic noise and prevent over-fitting the noise into the target velocity component, it also controls the scaling of the reconstructed veloc- However, for most cases, v and v will not solve the stationary Fokker-Planck equation (9) for D > 0.

A. Finite Volume Discretization
We assume that our system evolves on the d-dimensional rectangular state space with a spatially dependent velocity v : Ω → R d . We define n i ∈ Z + , 1 ≤ i ≤ d, to be the number of equally-spaced points along the i-th spatial dimension at which we wish to approximate the solution of (8), as well as the mesh spacing We are thus interested in obtaining a solution to the forward problem at points of the form where k i ∈ {1, . . . , n i }. We will index our coordinates using column-major order and write We will regard x j as the center of the cell C j where Following the approach in Ref. 19, we implement a first-order upwind finite volume discretization of the continuity equation, adding a diffusion term using the central difference scheme and enforcing a zero-flux boundary condition. 55 This allows us to obtain an explicit time-evolution of the probability vec- While ρ is a discrete probability measure over the cells C j , it also corresponds to a piecewise-constant probability density function on Ω.
With an abuse of notation, we will refer to both the piecewise-constant density and the discrete probability mea-sure as ρ. We discretize the time domain with a step size ∆t. Based on (8), the probability vector at the l-th time step evolves as where each K i is a tridiagonal matrix of the form Above, we have defined for each j ∈ {1, . . . , N} the upwind velocities where v i j := v x j − e i ∆x i /2 · e i and w i j := v x j + e i ∆x i /2 · e i denote the i-th components of the velocity field at the center of cell faces, and {e i } is the standard basis in R d . We remark that if x j is away from ∂ Ω, then w i,± j = v i,± j+1 . To enforce the zero-flux boundary condition, we set both the velocity v and diffusion D to be zero on ∂ Ω. As a result, the columns of K each sum to zero, and the total probability is conserved under time evolution. Since numerical artifacts cause the flux accumulation along the boundary, we also enforce ρ = 0 on ∂ Ω. When the boundary ∂ Ω is sufficiently far from the trajectory data, this artifact is insignificant. Hereafter, we assume the uniform spatial discretization ∆x i = ∆x for all i = 1, . . . , d. Here, we used an explicit time stepping scheme. The Courant-Friedrichs-Lewy (CFL) stability condition enforces ∆t = O(∆x 2 ) to ensure the stability of the scheme. To be more specific, we choose ∆t < 1 2d In this way, we can enforce that all entries of I + K are non-negative with columns summed to one, which implies that I + K is a Markov matrix. For a complete description of the finite volume scheme, we refer to Ref. 55. We remark that there are many higher-order structure-preserving schemes to solve (8) which also yield a Markov matrix; see Ref. 56 for example. A more accurate numerical scheme can further reduce the forward modeling error, which is left for future work.

B. Teleportation and Diffusion Regularization
We use the finite volume discretization of the Fokker-Planck equation in Section III A to approximate its steadystate solution. After discretization, finding such stationary distributions to (9) is equivalent to solving the linear system:  Since the columns of K sum to zero, we have that M := I + K is a column-stochastic Markov matrix. When D = 0, M is a transition matrix for an ergodic Markov chain, which has a unique equilibrium. When D = 0, to guarantee the uniqueness of the equilibrium, Ref. 18 applies the so-called teleportation regularization 57 and considers There is now a unique solution to the linear system From a computational aspect, it is useful to take advantage of the fact that M − I is sparse where I ∈ R N×N is the identity matrix and to instead solve where we have simply rearranged terms in (11) and used the fact that ρ · 1 = 1.
Since U is also a column stochastic Markov matrix with the uniform probability of visiting any point of the mesh, using M ε amounts to stopping the dynamics based on M at a random time and restarting it from a uniformly randomly chosen initial point. The size of ε represents the restarting frequency-the smaller ε, the rarer we restart. 18 On the other hand, adding the diffusion component D to the tridiagonal matrix K can be seen as another way of regularizing the noise-free Markov matrix by adding a scaled Brownian motion after each discrete evolution of the deterministic dynamics. For deterministic dynamics with D = 0, the solution to (9) might not be unique if there is more than one attractor. The use of teleportation connects all attractors through the "random restart", and the solution ρ ε to the linear system (11) has support that connects all the disjoint attractors.
Similarly, when D = 0, the Brownian motion connects all disjoint attractors of the deterministic dynamics, giving a unique steady-state solution. In this scenario, the use of teleportation for the diffusive case is simply a numerical treatment to improve the conditioning of matrix M rather than to guarantee the uniqueness of ρ.
It is worth noting that both the teleportation regularization and an incorrect diffusion coefficient could be sources of modeling error when we perform parameter identification. Although these regularizations enable faster evaluation of ρ ε (v(θ )) and better posedness of the forward problem, they may reduce the accuracy of the inverse problem solution.

C. Numerical Diffusion
In Figure 2, we illustrate the ρ ε computed as the steadystate solution to the Fokker-Planck equation in the top row and the approximation to physical invariant measures of the corresponding SDE in the bottom row. From Figure 2, we see that on a coarse mesh, the first-order finite volume scheme incurs significant numerical error, which gives a computed solution with an artificial diffusion effect and thus is often referred to as the numerical diffusion. 19 The amount of numerical diffusion is reduced as the mesh is refined since it is incurred by the first-order scheme. In particular, it is expected to decay as O(max i ∆x i ) in the L ∞ norm as we refine the mesh. 55 Besides the teleportation and the modeling diffusion D, the presence of numerical diffusion is another modeling error incurred from solving the forward problem.

IV. GRADIENT CALCULATION & VELOCITY PARAMETERIZATION
Another main contribution of this paper is to reconstruct the velocity field v(x) using large-scale parameterizations v(x; θ ), which turns an infinite-dimensional problem of searching for v(x) in a function space to a finite-dimensional optimization problem of finding θ ∈ Θ ⊂ R m . Here, we introduce parameterizations based on piecewise-constant, neural network, and global polynomial functions. We also investigate various data-fitting objective functions J that compare the mismatch between the observed and simulated invariant measures, ρ * and ρ ε (v(θ )). We compute the gradient of such functions with respect to the coefficients θ in the parameterized velocity model v(x; θ ) based on the adjoint-state method for the PDEconstrained part and the backpropagation technique 24 for the neural network part. Thanks to these techniques, we can then efficiently evaluate the gradients of J with respect to θ and thus conveniently use gradient-based optimization algorithms to iteratively update θ , e.g., steepest descent, L-BFGS, conjugate gradient descent methods as well as stochastic methods such as Adam. 58 For notational simplicity, we will write ρ(v(θ )) = ρ ε (v(θ )) throughout this section.

A. Gradient Calculation Through the Adjoint-State Method
Recall the finite volume scheme in Section III A for solving (9). The forward model yields a discrete measure ρ(v(θ )) = ρ(θ ) = [ρ 1 (θ ) . . . ρ j (θ ) . . . ρ N (θ )] over the cells {C j }, which converges to the solution to (9) in the weak sense as we refine the discretization parameters. For the explicit form of ρ(v(θ )), we refer to Ref. 18, Eqn. (5.1). Note that we have highlighted the dependence of our approximate steadystate distributional solution to the Fokker-Planck equation (9) on the velocity v(x; θ ). Our goal is to solve the optimization problem (2): by using gradient-based methods, where J is the cost function, and ρ * represents our inference data. The adjoint-state method is an efficient technique by which we can evaluate the derivative ∂ θ J , as the computation time is largely independent of the size of θ . One can derive the adjoint-state method for gradient computations by differentiating the discrete constraint, 59 which in our case is the eigenvector problem: where ρ(θ ) · 1 = 1. Specifically, we compute ∂ θ J = λ ∂ θ g where λ solves ∂ ρ g λ = − ∂ ρ J . In our case, this linear system is the adjoint equation (see Ref. 18, Eqn. (5.8)) and the derivative As a result, we only need to compute the derivatives ∂ ρ J and ∂ θ M ε to determine the gradient ∇ θ J = (∂ θ J ) . The former depends on the choice of the objective function, while the latter is based on a specific parameterization of the velocity field v(x; θ ) determined by its hypothesis space.

The Computation of ∂ ρ J
For the objective function J , we consider the quadratic Wasserstein distance, the squared L 2 norm, the Kullback-Leibler (KL) Divergence, and the Jensen-Shannon (JS) Divergence.
Quadratic Wasserstein Distance: For probability measures ρ and ρ * on Ω, with finite second-order moments, the squared quadratic Wasserstein distance is defined by is the set of maps that push ρ forward into ρ * . 60 With an abuse of notation, we also use ρ(x) and ρ * (x) to denote the densities of ρ and ρ * respectively. For efficient computation of the W 2 distance, we utilize the back-and-forth method, 61 which instead uses the dual Kantorovich formulation 60 where φ 1 ∈ L 1 ρ * (Ω) and φ 2 ∈ L 1 ρ (Ω) are required to satisfy φ 1 (x) + φ 2 (y) ≤ |x − y| 2 . In this case, the Fréchet derivative of J = W 2 2 (ρ, ρ * ) with respect to ρ is given by Squared L 2 Norm: The squared L 2 distance as the objective function and its Fréchet derivative are given by KL-Divergence: The KL-divergence and its Fréchet derivative are given by We remark that our definition of the KL-divergence differs from many applications in which it is commonly computed as J = D KL (ρ * , ρ). JS-Divergence: Defining ρ := (ρ + ρ * )/2, the JSdivergence and its Fréchet derivative are given by Based on definitions of the KL and JS divergence, it is clear that we may encounter numerical instability issues if either ρ or ρ * is not supported on the entire domain Ω. Thus, we remark that for the computation of both the KL and JS divergences, we restrict the domain Ω to regions where both ρ and ρ * are strictly positive. This is equivalent to the definition of the KL and JS divergence based upon the so-called Csiszar divergence (see Ref. 62, Eqn. (1)).

The Computation of ∂ θ J
We have presented a few cases of ∂ ρ J for different choices of J . Next, we show how to obtain ∂ θ M ε , which is the other necessary component in the adjoint-state method for gradient calculation; see (12)- (13). To begin with, we consider θ = {v i j } for all i = 1, . . . , d and j = 1, . . . , N, which corresponds to one variant of piecewise-constant velocity parameterization.
Since we are only interested in computing the gradient away from ∂ Ω, we can utilize the property that w i,± j = v i,± j+1 . First, observe that as well as In (14), H(·) is the Heaviside function. We remark that ∂ v i j K i can only be nonzero in the ( j, j), ( j, j − S i ), ( j − S i , j), and (10). After solving (12) for λ and applying (13), we deduce that Equation (15) provides an efficient way for computing the gradient of the objective function with respect to the piecewise-constant velocity based on cells {C j } from our finite-volume discretization.
Alternatively, if the velocity v = v(x; θ ) is smoothly parameterized by the vector θ = [θ 1 , . . . , θ k , . . . , θ m ] ∈ R m , for each θ k , we can then evaluate to determine the derivative ∂ θ J . By using a similar indexing convention to Section III A, we can collect the terms ∂ v i j J and ∂ θ k v i j into the vectors ∂ v J and ∂ θ k v, respectively. Therefore, the double summation in (16) is achieved by the inner-product ∂ v J · ∂ θ k v. Note that for different θ k , we only need to change ∂ θ k v as ∂ v J does not depend on θ k .

B. Velocity parameterization
We now apply Equations (15) and (16) to evaluate the gradients of several parameterized velocity models. Specifically, we consider piecewise constant, global polynomial, and neural network parameterizations of the velocity.

Piecewise-Constant parameterization
In the case of the piecewise-constant parameterization, we model the velocity as Here, we again use the column-major ordering from Section III A to accumulate vectors of cells C j with centers x j , and velocity components v i j = v(x j − e i ∆x i /2) · e i along the i-th direction of the cell face located at x j − e i ∆x/2. The parameter space of the model presented in (17) is given by {v i j }, which has size N · d, and the gradient of the parameters {v i j } can be directly evaluated by (15). We remark that (17) is only one variant of piecewiseconstant parameterization since the parameterization mesh is the same as the discretization mesh in the finite-volume method; see Section III A. These two meshes do not have to be coupled together. To reduce the numerical error from the first-order scheme, it is preferable to reduce the spacing {∆x i }, but we can keep the parameterization mesh fixed so the size of the optimization problem does not change. In this case, we need to apply the chain rule (16) to obtain the final gradient after evaluating (15).
The model defined by (17) can be learned by gradientbased optimization methods. The regularity of the piecewiseconstant model defined by (17) can be improved to a C 0 function by interpolating between the values v i j using either piecewise linear or higher-order piecewise polynomial functions, as in Ref. 63.

Global Polynomial parameterization
Though the regularity of the piecewise-constant model given by (17) can be improved by interpolation, the inverted velocity v(x; θ ) may still be highly oscillatory if the mesh size ∆x is small. Modeling approaches, such as SINDy, 2 learn the velocity fields of dynamical systems from a polynomial basis together with sparse regression. Here, we show how the gradient derivation in (16) can be adapted to such polynomial basis parameterization of the velocity field The i-th component of the velocity field v i (x; θ ) parameterized by a linear combination of the monomial basis of degree at most K can be written as where the powers are represented by multi-indices Although a global polynomial parameterization guarantees ideal C ∞ regularity of the parameterized velocity v(x; θ ), the Runge phenomenon could be a potential downside of this approach. Specifically, as we increase the maximum degree K of the polynomial basis, we may encounter substantial interpolation errors near the boundary ∂ Ω.

Neural Network parameterization
Motivated by the universal approximation theory of neural networks, 64 we may also choose to model each component of the velocity v i (x; θ ) as a feed-forward neural network, where the tunable parameters θ make up the network's weights and biases. We follow Ref. 65 to combine the adjoint-state method for the PDE constraints and the backpropagation technique to update the weights and biases of the neural network.
The term ∂ v i j J in the gradient calculation (16) can be computed by first evaluating the neural network on the mesh of cell face centers oriented in the direction of e i to obtain {v i j }, which is then plugged into (15) to obtain ∂ v i j J . The remaining term ∂ θ v in (16) is then computed via the backpropagation technique. 24 For simplicity, we restrict ourselves to single-layer feedforward networks. Moreover, by using a smooth activation function, such as the hyperbolic tangent or the sigmoid function, we can guarantee C ∞ regularity of the reconstructed velocity v(x; θ ) on the domain Ω. To enforce the zero-flux boundary condition, we manually set v = 0 on ∂ Ω. Consequently, the neural network parameterization may lack regularity near ∂ Ω. However, if the domain is sufficiently large, the support of the physical measure will be very far from ∂ Ω, in which case we will not observe any discontinuities originating from the boundary condition while simulating the trajectories based on (6). As we increase the number of nodes in the hidden layer of the neural network, both the approximation power and the potential difficulty of training the neural network are expected to increase.

V. NUMERICAL RESULTS
In this section, we present several numerical examples to demonstrate the utility of the proposed approach for learning dynamical systems from invariant measures with intrinsic noise. 1 In Section V A, we study the inverse problem for the Van der Pol oscillator with a neural network parameterization of the velocity. In Section V B, we time-delay embed a signal sampled from a Hall-effect thruster and proceed to model the dynamics in delay-coordinates based upon the time-delayed invariant measure. We then illustrate that a low-dimensional embedding may increase the uncertainty of the learned model and that the choice of parameterization largely affects the regularity of the reconstructed velocity. In Section V C, we study rolling averages of a temperature data set and perform uncertainty quantification using the learned Fokker-Planck PDE in time-delayed coordinates. We conclude in Section V D by inverting a component of the Lorenz-63 system's velocity using a neural network parameterization. All experiments are conducted using an Intel i7-1165G7 CPU.

A. Van der Pol Oscillator
We begin by considering the autonomous Van der Pol oscillator, 66 given by Our results for learning a dynamical system with prescribed statistical properties given by the stochastically-forced Van der Pol oscillator are shown in Figure 3. In the top row, the first panel features the velocity of (19) for the choice of c = 0.5, the second panel shows the approximate occupation measure (see (4)) obtained from the simulation of a single SDE trajectory (see (6)), the third panel shows the SDE trajectory used to approximate the invariant measure, and the fourth panel shows the dynamics of the oscillator without stochastic forcing. Throughout, we color the SDE trajectories by their histogrammed density to illustrate the connection between the Lagrangian and Eulerian perspectives. We also stress that the experiment in Figure 3 assumes the diffusion coefficient to be known a priori, but that Section V B relaxes this assumption.
In the following rows of Figure 3, we use neural network, piecewise constant, and global polynomial parameterizations of the velocity to solve the inverse problem using the optimization framework from Sections III A and IV. For the case of the neural network parameterization, we compare each objective function studied in Section IV A 1, while we only focus on the L 2 objective for the remaining two parameterizations. Across all tests, the reconstructed velocity is shown to vary significantly from the true velocity shown in the first row of Figure 3. This is mainly due to the lack of data away from the main attracting limit cycle. In regions of the state space with no available data, we can only expect that the modeled velocity v(x; θ ) will direct trajectories towards the attracting limit cycle on which the invariant measure is supported. Indeed, this is what we observe.   Figure 3. The error is quantified by the squared W 2 distance between the occupation measure of the ground truth diffuse trajectory (see the third panel of Figure 3a), and the occupation measure accumulated from the simulation of a trajectory with diffusion according to the learned velocity (see the fourth panel of Figure 3d-3c).
Moreover, while the learned PDE model (9) matches the observed occupation measure (4) across all tests, we find that the SDE and ODE trajectories generated using the learned velocity vector fields may vary depending on the parameterization. Table I provides a comparison of the accuracy of the learned models, as well as the required computation times. While the piecewise constant velocity is by construction discontinuous and thus does not naturally guarantee the existence and uniqueness of the corresponding ODE solution, the neural network parameterization based on the hyperbolic tangent Learning velocity fields to reproduce the statistics of the stochastically-forced Van der Pol oscillator. The ground truth occupation measure, velocity, and dynamics are shown in (a). The results for inverting the velocity based on the occupation measure from (a) using neural network, piecewise constant, and global polynomial parameterizations are shown in (b)-(g). The first column shows the objective function; the second column shows the learned velocity vector field; the third column shows the final PDE forward model evaluation based on the learned velocity; the fourth column shows the simulation of a diffuse trajectory, and the final column shows the simulation of a trajectory without diffusion. Specifically, the "diffuse trajectories" are simulated according to the Euler-Maruyama method using the assumed diffusion coefficient D = 0.02, while the "non-diffuse" trajectories assume D = 0. The coloring of each diffuse trajectory is given by the occupation measure it generates; see (4). Across all tests, the objective function is minimized to 0.25% − 0.35% of its initial value. For (b)-(c), the L-BFGS-B algorithm is used for optimization. In (d)-(g), the neural network architecture consists of a single hidden layer with the hyperbolic tangent activation function, trained by the Adam optimizer with a learning rate of 10 −1 . activation function yields a C ∞ velocity. Moreover, while the global polynomial parameterization is also C ∞ , it may suffer from the Runge phenomena and grow rapidly near the boundary of the domain. Thus, we mainly consider neural network parameterizations of the velocity for the remainder of the numerical tests.
To reduce the computational cost of the inversion in the final row of Figure 3, we compute J = W 2 2 on a coarsened mesh. Among the four objective functions in Figure 3, it is worth noting that the W 2 metric does not compare the two densities pointwisely and is well-defined for comparing singular measures. The distance reflects both the local intensity differences and the global geometry mismatches. 67 It has also been shown that the Wasserstein metric is robust to noise. 68,69 Thanks to the geometric nature of the optimal transportation problem, the Wasserstein metric is primarily sensitive to global changes such as translation and dilation and is robust to small local perturbations such as noisy measurements of ρ * . The better stability also brings a downside as the optimization landscape can be relatively flat around the ground truth, which may lead to compromised accuracy in the velocity inversion.
The different velocities shown in the second column of Figure 3 reveal that there is nonuniqueness if we only use the invariant measure as the reference data. The current modeling assumption yields dynamics reproducing the same invariant measure but does not necessarily recover the same velocity field. Depending on the concrete application, one can add regularization, time information, or focus on velocities in a particular parameterized subspace to avoid nonuniqueness. The large error for the reconstructed velocity near the origin is due to the fact that the method aims to learn the flow on or (in the case of stochastically-forced dynamics) near the invariant measure. It is, therefore, unsurprising that the learned velocity does not match the ground truth where there is no data.
In Figure 4, we show how the inversion accuracy and computation time depend on the chosen value of ∆x. That is, as ∆x decreases, we can learn velocities that can reproduce the statistics of the observed occupation measure more accurately, with the cost of longer computation time.
Next, we provide experimental details on the comparison of our approach with SINDy 2 and the Neural ODE 5 frameworks in Figure 1. This test uses the Van der Pol oscillator with c = 2. Since the SINDy and Neural ODE methods are designed for modeling ODEs, the experiments in Figure 1 use the diffusion coefficient D = 0. While we only plot the first 8 points of the slowly sampled trajectory in Figure 1, the full trajectory used for inference contains 2.5 · 10 3 observations. The quickly sampled trajectory also consists of 2.5 · 10 3 observations. The three approaches considered for comparison have various hyperparameters which can be tuned. For SINDy, we learn the models from the monomial basis up to degree three and use the sequentially thresholded least squares optimizer with threshold 0.025 to enforce a sparsity condition on the learned coefficients; see Ref. 2. For the Neural ODE framework, the velocity is parameterized by a single-layer fully connected neural network with 100 nodes and a hyperbolic tangent activation function. The Neural ODE is trained using a multiple shooting approach with the mean-squared er- FIG. 4. We demonstrate how the computation time and inversion accuracy depend on the mesh spacing used in the first-order FVM solver. Here, we use the Van der Pol oscillator with D = 0.05 and learn the velocity using a neural network parameterization. The Adam optimizer is used with a learning rate of 10 −2 . In each case, we reduce the KL divergence objective function below 0.5% of its initial value. The error is quantified in terms of the squared W 2 discrepancy between the simulated occupation measureρ according to the learned dynamics and the observed occupation measure ρ * . ror objective function. More specifically, rather than treating the simulation of a single long time-trajectory as the forward model, we integrate N − 1 trajectories initiated at the observed data points {x(t i )} N−1 i=1 for a time of ∆t = t i+1 − t i . This approach results in greater success while modeling slowly sampled dynamics. The Adam optimizer with a learning rate of 10 −3 is used, and the tolerance for both relative and absolute error of the ODE solver is set as 10 −5 .
To ensure a fair comparison with the Neural ODE framework, we consider our approach based on a neural network parameterization of the velocity using the same architecture, optimizer, and learning rate. For our approach, we use the KLdivergence objective function (see Section IV A 1), apply additional Gaussian filtering to the occupation measure (see (4)) to simplify the resulting optimization, assume a diffusion coefficient of D = 10 −3 during training, and set ∆x = 0.1. Thus, the only differences between the setup for our approach and the Neural ODE framework are the forward model and objective function.  . The wall-clock computation time is reported, and the error is quantified by W 2 2 (ρ, ρ * ), whereρ is the simulated occupation measure from the learned velocity field, and ρ * is the observed occupation measure.
As shown in Figure 1, all three frameworks can learn from the quickly sampled trajectory. However, SINDy and the Neural ODE frameworks are less robust to changes in the sampling frequency of the inference data than our approach. This is further demonstrated in Table II, where we quantify the error in the simulated occupation measure based on the learned velocity. We report the average error over ten trials with different random training seeds to compare our method and the Neural ODE framework. When the data is sampled at a sufficiently high frequency, Table II also shows that methods such as SINDy or the Neural ODE are preferable in terms of both computational cost and accuracy.

B. Hall-Effect Thruster
We now turn to the more realistic setting of experimentally sampled time-series data. Specifically, we study the Cathode-Pearson signal sampled from a Hall-effect thruster (HET) in its breathing mode. Hall-effect thrusters are in-space propulsion devices that exhibit dynamics resembling stable limit cycles while in breathing mode. For details about the experimental setup used to collect the data, the reader is encouraged to consult Refs. 70 and 71. In Section V B 1, we utilize Takens' theorem 23 to reformulate the large-scale optimization framework presented in Sections III and IV to be compatible with scalar time-series observations, and in Section V B 2 we demonstrate numerical results based upon this reformulation.

Methods
Intrinsic physical fluctuations present in the Cathode-Pearson signal indicate that the HET's dynamics may be modeled well by a Fokker-Planck equation. Motivated by this insight, we first time-delay embed the Cathode-Pearson signal C(t) in d-dimensions to form the trajectory C d,τ (t) := (C(t),C(t − τ) . . . ,C(t − (d − 1)τ)). We then use a histogram approximation to compute the occupation measure ρ * of C d,τ (t); see (4). By viewing each dimension of the coordinate system on which the measure ρ * is supported as the independent variables C −kτ (t) := C(t −kτ) where 0 ≤ k ≤ d −1, we then seek a solution to the optimization problem (2) for a velocity v = v(C d,τ ; θ ). Such a velocity can then provide us with a model of the asymptotic statistics of the embedded trajectory C d,τ (t), provided that a suitable diffusion coefficient can be found.
We note that forming the time-delay coordinates C d,τ (t) does require a knowledge of measurements at uniform increments in time. However, the available data may still be sampled slowly enough such that it is impractical to seek a direct approximation of the Lagrangian velocity through the standard approaches described in Section I. This perspective motivates our use of the approach developed in Sections III and IV to learn dynamical systems from invariant measures in time-delay coordinates.
There are a few additional considerations that arise when adapting the modeling framework presented in Sections III and IV to real-world data. Namely, we do not know the proper diffusion coefficient a priori (as was the case in Section V A). Moreover, the invariant measure that the model is based on does not contain any information about the time scale at which the system evolves. Towards this, we utilize the following three-step procedure as a computationally efficient means to mitigate these difficulties.
1. Bin the trajectory C d,τ (t) onto a d-dimensional mesh with spacing ∆x along each axis to form the occupation measure ρ * , assume a constant diffusion coefficient D > 0, and learn the velocity v = v(C d,τ ; θ ), using the framework from Sections III and IV.
2. Bin the trajectory C d,τ (t) onto another d-dimensional mesh with spacing ∆x ≤ ∆x to create a new occupation measureρ * and adjust the diffusion coefficient by solving the optimization problem where the term ρ ε (v;D) in (20) denotes the forward model evaluation with the diffusion coefficientD.
3. Rescale both the velocity and diffusion by solving the optimization problem whereĈ(t i ; a) denotes the time-t i solution of the ODE initial value problem with velocity av(·; θ ) and initial condition C d,τ (t 0 ) . The final velocity and diffusion are then given byãv(·; θ ) andãD, respectively.
The three-step approach makes repeated use of the fact that ρ ε (v; D) = ρ ε (av; aD), for any scalar multiple a > 0. Indeed, if the true diffusion coefficient D * > 0 is unknown a priori, but we instead seek a solution v(·, θ ) with a different diffusion D > 0, it is guaranteed that the velocity v = (D/D * )v * will still provide a solution to the inverse problem. This observation motivates step one, in which an arbitrary diffusion coefficient is used to find a solution v(·; θ ) to the inverse problem. As the dimensionality d is increased, solving the large-scale optimization problem in step 1 on a fine mesh becomes infeasible. As such, step one is typically performed on a coarse mesh where additional Gaussian filtering is applied to the inference measure ρ * to make the large-scale optimization more feasible.
The diffusion coefficient is then adjusted in step two on a finer mesh via (20) to mitigate the errors due to the Gaussian filtering, numerical diffusion, and histogram errors incurred during step one (see Figure 2). Finally, in step three, the scale of both the velocity and diffusion are adjusted via (21) such that the time evolution of simulated trajectories is consistent with the inference trajectory C d,τ (t) in delay coordinates. Since diffusion plays a relatively small role over short time scales for the quasi-periodic HET data, we use the zerodiffusion trajectory to calibrate a reasonable time-scaling between our model and the available data. However, as the magnitude of the diffusion increases, the least squares fit in (21) will become less reliable, and it may be preferable to instead minimize a transport cost between a collection of model samples and a collection of data samples at each time-step. While this final optimization is similar in spirit to various Lagrangian approaches for learning dynamics (see Section I), we remark that the parameter space in (21) has only one dimension.

Results
The results of the three-step procedure in Section V B 1 for learning the HET dynamics are shown in Figure 5 for an embedding dimension of d = 3 and time-delay of τ = 1.4 · 10 −5 seconds, or rather τ = .23 when normalizing the time-scale to the HET breathing mode frequency (16.6kHz). The modeled trajectory accurately reconstructs the shape of the embedded Cathode-Pearson signal but cannot capture the variable diffusion present throughout the time-delayed signal. We do not expect to capture such details, as we assume a constant diffusion coefficient in our model. Nevertheless, we regard the reconstruction of the 3D globally attracting limit cycle as a success and leave extending the model to account for the case of a non-constant diffusion tensor to future work.
The dimensionality of the original HET dynamics is unknown, and as such, a sufficient embedding dimension for the Cathode-Pearson signal is unclear, though likely very high. Interestingly, we can compare the model learned in Figure 5 with a 2D analog to demonstrate that when the number of time delays is not sufficiently large, there is more uncertainty in modeling the time-delayed dynamics. This phenomenon is most evident when inspecting regions of the delayed Cathode-Pearson signal for which the 2D embedding lacks structure readily observed in 3D.
Specifically, consider a collection of nearby samples will also be nearby one another in the 2D time-delay coordinate system (C 0 ,C −τ ). In Figure 6, we initiate uniform distributions centered about these samples in both 2D and 3D timedelay coordinate systems. We then evolve both the samples and initial uniform distributions forward in time. The evolution of the ground truth samples is simply determined by the time-delayed Cathode Pearson signal C d,τ (t), and the evolution of the uniform distributions is given by Fokker-Planck models constructed from the time-delayed Cathode Pearson signal's invariant measure. As the modeled probability densities and ground truth samples evolve in time, we observe in Figure 6 that the mean of the 3D model matches the true sample mean more closely than the 2D model and that it has less uncertainty.
In Figure 7, we study the three parameterizations from Section IV B for learning the time-delayed Cathode-Pearson signal's velocity, now with an embedding dimension of two to allow for clearer visualizations. It can be seen that the density associated with each velocity parameterization indeed matches the ground truth density in Figure 7, but that the velocity fields differ significantly from one another. The piecewise-constant velocity in Figure 7 suffers from poor regularity with discontinuities on the attracting limit cycle. As a result, we lose the connection between the Eulerian and Lagrangian dynamics and cannot reconstruct zero-diffusion trajectories that form a stable limit cycle. On the other hand, the velocities parameterized by the global polynomial and the neural network are both C ∞ . The differences among these three can clearly be seen via the zoomed-in velocity plots in the second row of Figure 7. The global polynomial and neu-(a) Using the 2D model to predict the evolution of the samples C 2,τ .
(b) Using the 3D model to predict the evolution of the samples C 3,τ .  (16.6kHz). Both the 2D and 3D models utilized a neural network velocity parameterization with 500 nodes in a single hidden layer and reduced the KL divergence objective function to 0.1% of its initial value during training. As in Figure 5, the three-step procedure in Section V B 1 is used to learn the models, and in step one, additional Gaussian filtering is applied to the occupation measure ρ * to simplify the resulting optimization. The 3D visualization was plotted using Ref. 72. ral network discretizations are both global parameterizations of the velocity, and as such, their values near the domain's boundary are dictated by the available data in the center of the domain. This causes the polynomial velocity to rapidly increase near the boundary, and a similar effect can also be seen for the neural network.
It is worth noting that the initial condition for the optimiza-tion in Figure 7 can play a large role in the reconstructed velocity, which is related to the optimization landscape of the nonconvex optimization problem (2) we tackle. In the case of the piecewise-constant discretization, we initialize all velocities to be significantly less than the diffusion coefficient D = 0.1. Thus, diffusion initially dominates in the finite volume solver, and all non-boundary cells will contain , and the forward model output ρ ε (v(θ )) for each parameterization (c). The resulting parameter spaces of these discretizations have a dimensionality of 9800 (PC), 56 (GP), and 400 (NN). The L 2 loss is reduced below 0.1% of its initial cost for the PC and NN discretizations and reduced below 0.7% of its initial value for the GP case when we stopped the optimization.
nonzero mass, which allows for accurate gradient updates everywhere. This phenomenon can also help neural network training, though it is not always necessary due to the global nature of parameterization. Moreover, we initialize our polynomial basis to form the velocity which describes a globally attracting limit cycle. To converge to the ground truth limit cycle of the time-delayed Cathode-Pearson signal, this initial velocity only needs to be translated and deformed.

C. Temperature Uncertainty Quantification
We now study 2D time-delay embedded data of weekly rolling averages of the temperature in Ithaca, NY, between 2006 and 2020. 73 We view temperature fluctuations over short time scales as an intrinsic diffusion process and the approximately periodic oscillation of seasonal temperatures driven by some nonzero velocity. Thus, we model the 2D data in delay coordinates as a diffuse limit cycle. We again follow the procedure in Section V B 1 to learn a velocity v(x; θ ) and diffusion coefficient D, which closely matches the occupation measure.
As in Section V B, we can use the trained model v(x; θ ) to quantify measurement uncertainties through the Fokker-Planck equation (9), whose solution is a probability density in the time-delay coordinates (C 0 ,C −τ ). Specifically, if we know some initial probability distribution that captures the current state of the temperature system well, we can consider the time evolution of the distribution using our trained model to quantify the uncertainty of future temperature measurements. The process of evolving both the Fokker-Planck PDE from a uniform distribution and the ground truth sample paths from past temperature measurements is shown in Figure 8. The uncertainty bounds from the model accurately capture fluctuations in the training data used to form the occupation measure (plotted in black), as well as a testing sample path previously unseen by the model (plotted in red).
It is also worth noting that the confidence intervals we construct may be larger than the actual range due to several factors, including additional extrinsic noise from filtering the data, modeling errors accumulated from the hypothesis space, numerical diffusion in the forward model, and a sub-optimal embedding dimension. Reducing such errors may result in tighter confidence intervals and considering time delays in higher dimensions could yield better predictions of the temperature's transient behaviors.

D. Lorenz-63 System
We conclude this section by studying the Lorenz-63 system, 21 defined by where we consider (c 1 , c 2 , c 3 ) = (10, 28, 8/3). For these choices of parameters, the Lorenz-63 system exhibits chaotic behavior and admits a unique physical measure. 22 In Figure 9, we assume that the quantitiesẏ andż are known, and we learn a model for the velocity in the x-direction, using the stochastically-forced Lorenz-63 system's occupation measure. We emphasize that the data used to approximate the Lorenz system's occupation measure can be sampled slowly or even randomly in time (see Ref. 18, Figure 7). From the approximate occupation measure, we are able to successfully invert the first componentẋ of the Lorenz-63 system's velocity via a neural network parameterization. We remark that whenẋ,ẏ, andż are all simultaneously inverted, the optimization is unsuccessful at reconstructing the true velocity (22). While we may be able to learn a velocity that approximately recovers the stationary state of the Lorenz-63 system in the sense of (9), the physical property (3) does not hold. Whether the difficulties of inverting all velocity components of the Lorenz-63 system are due to inherent nonuniqueness in the inverse problem or simply inconvenient local minima during training is worth further investigation in future work. To demonstrate the applicability of our approach to non-rational velocities, we also consider the Arctan Lorenz-63 system, 18 where again (c 1 , c 2 , c 3 ) = (10, 28, 8/3). The results for invertingẋ from the occupation measure generated by (23) with additional stochastic forcing are shown in Figure 10, assuming that the quantitiesẏ andż are known.

VI. CONCLUSION
In this paper, we introduced a PDE-constrained optimization approach to modeling trajectory data originating from stochastic dynamical systems. We first adapted the invariant measure surrogate model in Ref. 18 based upon the continuity equation to the Fokker-Planck equation. This increased our modeling capacity and prevented overfitting the reconstructed velocity while modeling intrinsically noisy trajectories. We next extended the three-coefficient learning performed in Ref. 18 to thousands of coefficients by modeling the velocity via global polynomials, piecewise polynomials, and fully connected neural networks. The efficient gradient computation presented in Section IV made these large-scale parameterizations of the velocity computationally tractable. We finally studied velocity inversion for invariant measures of time-delay embedded observables. The method of time-delay embedding is useful for analyzing real-world data, where in many cases, only limited observations of complex systems are available. As such, we proceeded to learn the velocity in timedelay coordinates for a Hall-effect thruster system and rolling weekly averages of temperature measurements. Using these models, we predicted future states of the systems and quantified uncertainty in forecasts by evolving the learned Fokker-Planck equation forward in time.
(a) Learned velocity vector field (left), a simulated trajectory with diffusion (middle), and a simulated trajectory without diffusion (right).
(b) True velocity (left), a ground truth SDE trajectory (middle), and a ground true ODE trajectory (right).
FIG. 10. Neural network parameterization ofẋ using the Arctan Lorenz system's stochastically perturbed invariant measure with D = 10 and ∆x = 2. The neural network used to learn the velocity contains a single layer of 100 nodes with the sigmoid activation function, and the L 2 objective function is used to train the model. In (a), we show results for the learned velocity vector field and the simulation of trajectories using the learned velocity both with and without diffusion. For comparison, in (b), we also include the ground truth velocity and trajectories with and without diffusion. tal program funded by AFSOR grant FA9550-17QCOR497 (Program Officer: Dr. Brett Pokines). The data used in Section V C is openly available via Ref. 73.