We identify effective stochastic differential equations (SDEs) for coarse observables of fine-grained particle- or agent-based simulations; these SDEs then provide useful coarse surrogate models of the fine scale dynamics. We approximate the drift and diffusivity functions in these effective SDEs through neural networks, which can be thought of as effective stochastic ResNets. The loss function is inspired by, and embodies, the structure of established stochastic numerical integrators (here, Euler–Maruyama and Milstein); our approximations can thus benefit from backward error analysis of these underlying numerical schemes. They also lend themselves naturally to “physics-informed” gray-box identification when approximate coarse models, such as mean field equations, are available. Existing numerical integration schemes for Langevin-type equations and for stochastic partial differential equations can also be used for training; we demonstrate this on a stochastically forced oscillator and the stochastic wave equation. Our approach does not require long trajectories, works on scattered snapshot data, and is designed to naturally handle different time steps per snapshot. We consider both the case where the coarse collective observables are known in advance, as well as the case where they must be found in a data-driven manner.

We discuss the data-driven approximations of drift and diffusivity functions in effective stochastic differential equations (SDEs) through neural networks. The loss function used in training the neural networks embodies the structure of established stochastic numerical integrators, which allows us to work with data from very brief observations of the system. We apply this learning method to time series data of coarse observables from fine-grained particle- or agent-based simulations. The learned SDEs provide useful coarse surrogate models of the fine scale dynamics.

Residual networks (ResNets,1 but see also Ref. 2) successfully employ a forward-Euler integrator based approach to create useful deep architectures. This has inspired follow-up work to include other integrator schemes for deterministic, ordinary differential equations (ODEs) such as symplectic integrators.3–5 In this work, we propose to include stochastic integrators in the search for appropriate loss functions for training network architectures, to improve system identification and prediction tasks.

Identification of dynamical systems from data is an established systems task.6–11 Machine learning both in its “second spring” in the 1980s–1990s, and its current “third spring” has made a major impact on the approaches and algorithms for this task. From ResNets1 and Neural ODEs2,12 to transformers,13 “neural PDEs,”14 and DeepONets,15 the identification of dynamical systems from (discrete) spatiotemporal data is a booming business because it is now comparatively easy to program and train neural networks. Learning SDEs and stochastic partial differential equations (SPDEs) is a natural “next step” and there is also an array of excellent contributions to this as we outline in the related work section below. The time-honored SDE estimation techniques for local drift and diffusivity can now be synthesized in a (more or less global) surrogate model.

Given all these recently available options, our contribution is as follows. On the application side, we were motivated by the identification of coarse, or effective, SDEs, which are often used as models for macroscopic statistics (e.g., leading moments) of more fine-grained physics simulations (e.g., atomistic, stochastic, or agent-based). This is the reason for part of our title, and for some of the examples: lattice simulations of epidemic models.

There is also, however, a methodological motivation whose roots are in the numerical analysis of integration schemes for SDEs/SPDEs. These established algorithms (starting with Euler–Maruyama16) iteratively produce discrete time data; it makes sense to “mathematically inform” the neural network architecture used for identification with the structure of the numerical integrator one would use on an explicitly available, closed form a coarse stochastic model. This way, our work naturally links with, and can take advantage of, the extensive body of work in what is called “backward error analysis” of these integration schemes.17 

A third motivation for our work: beyond its links to backward error analysis, the approach naturally lends itself to the what one might call “physics-informed” or “gray-box” identification algorithms: neural network architectures that seamlessly incorporate approximate or partially known analytical models if these are available. We developed such an approach for deterministic ODE identification in the early 1990s, devising recurrent Runge–Kutta ResNet architectures for this purpose, and extending them to gray boxes and implicit ResNets.2,14,18 It is a natural next step to extend these techniques so that they can be applied to the identification of effective stochastic models, allowing the possibility of physics-informed gray boxes, and exploiting mathematical support via numerical backward error analysis. Learned, black-box models for agent-based systems may be used in malicious or questionable ways in technology and society. An example could be careless, uncontrolled generalization in predicting the behavior of physical or societal systems using learned models. Using gray-box, physics-informed machine learning methods that are amenable to thorough mathematical analyses and interpretability can help mitigate such risks, and our optimization approach is specifically designed for this type of knowledge integration and analysis.

Our contributions are as follows:

  1. We introduce a neural network training loss function that allows us to approximate the state dependent drift and diffusivity of an SDE from discrete observation data.

  2. The approach does not require long time series: paired snapshots (xt,xt+h), scattered over the domain of interest, are sufficient. Step sizes h can vary for each snapshot.

  3. The approach works point-wise: No distributions need to be approximated from data, nor do we need to compute distances between distributions.

  4. We formulate explicit loss functions for Euler–Maruyama- and Milstein-type integration schemes using Gaussian noise.

  5. For data sampled from SDE of Langevin-type, where the assumption of normally distributed noise in the state variable does not hold anymore, we show how to reformulate simple numerical schemes to enable accurate approximations of the drift and diffusivity in the acceleration terms.

  6. We show how a given numerical integration scheme for stochastic PDE can be reformulated to match our loss function and demonstrate it by learning forcing terms in the stochastic wave equation.

  7. We demonstrate that we can employ the approach to coarse-grain particle/lattice simulations; in particular, we approximate the drift and diffusivity of a two-dimensional mean field SDE for a Gillespie-based19,20 kinetic mechanistic model, as well as for the kinetic Monte Carlo lattice simulation of an epidemiological SIR model, leading to those kinetics at the mean field limit.

Loss functions and network architectures informed by numerical integration methods were devised several decades ago, e.g., Refs. 2 and 18. Recently, these ideas have been rediscovered and used in deep architectures such as ResNets,1 with symplectic schemes for Hamiltonian3–5 and Poisson networks.21 Scalable gradient approaches were developed for SDE approximation recently,22 which allow backpropagation through a long series of SDE iterations. If only a single time step for each data point is available, as in our work, this type of scalability is not an immediate training issue.

The theory of SDE is well developed; we refer to standard books on stochastic processes,23 stochastic calculus,24 and stochastic algorithms.25 In the broader machine learning community, SDEs are used mostly for generative modeling, meaning that the actual drift and diffusivity functions learned from the data are not informative, as long as the generated distribution is “acceptable.” A classical example are generative-adversarial networks.26,27 In recent years, learning random ODEs (deterministic ODE with random parameters) has been done using an approach with generative-adversarial networks.28 Long–short-term-memory networks were used in a stochastic setting for the generation of hand-written text.29 The authors of Ref. 30 present a framework that learns an SDE for training and sampling from score-based models, allowing sample generation. In a classical dynamical systems context, the SDE itself is the desired outcome of approximation and thus, typically, different methods are necessary. Parameter estimation techniques for Ornstein–Uhlenbeck processes with constant coefficients are readily available.31 Gaussian processes have been used to identify SDE from observations,32 with a target likelihood based on the Euler–Maruyama scheme (not explicitly mentioned by the authors). Our setting is very different from identifying ODE from noisy observations, for example, through Gaussian processes.33 Generative stochastic modeling of strongly nonlinear flows with non-Gaussian statistics is also possible.34 Researchers studying neural SDEs have studied their relation to Wasserstein-GANs (generative adversarial networks with Wasserstein loss)35,36 and for long time series.37 Since a pre-print of this paper appeared, other pre-prints expanded on it by learning SDEs with other noise types, e.g., Lévy-type.38 Another research direction is to identify the correct parameters of a latent SDE, given partial observations of it.39 This requires special assumptions on the latent space and the observation functions. Propagators for SDE have also been covered in the recent literature, with loss functions for moment functions and distributions.40 These approaches differ from our work in that they do not learn using point-wise on snapshots, but instead require accurate ensemble averages and corresponding large batch sizes. Spectral-Galerkin type SPDE solvers using neural networks have also been developed.41 Such approaches differ from our work in that they learn the solution to a (possibly unknown) SDE, not the SDE itself. Density and ensemble approximation techniques42 provide a complementary view to the particle-like approach with SDE. Stochastic dynamical systems can also be described through the associated Fokker–Planck operator. See, e.g.,43 for a non-parametric forecasting method using the “shift-map,” a numerical approximation of the operator projected on the eigenfunctions of the Laplace operator.

We discuss SDEs of the form

(1)

where f:RnRn and σ:RnRn×n are smooth, possibly nonlinear functions; every σ(x) is positive and bounded away from zero (and a positive-definite matrix if n>1), and Wt is a Wiener process such that for t>s, WtWsN(0,ts). See Ref. 23 for an introduction to stochastic processes. We assume we have access to a set of N snapshots D={(x1(k),x0(k),h(k))}k=1N, where x0(k) are points scattered in the state space of (1) and the value of x1(k) results from the evolution of (1) under a small time step h(k)>0, starting at x0(k). The snapshots are samples from a distribution, x0p0, and the transition densities x1p1(|x0,h) are associated with (1) for a given time step h>0, which in turn is chosen from some distribution ph. The joint data-generating distribution is, therefore, given by p(x0,x1,h)=p1(x1|x0,h)p0(x0)ph(h). Alternatively, the data could be collected along a long trajectory {xti} of (1) with sample frequency hi>0, that is, ti+1=ti+hi. The problem is to identify the drift f and diffusivity σ through two neural networks fθ:RnRn and σθ:RnRn×n, parameterized by their weights θ, only from the data in D. We assume the points in D are sampled sufficiently densely in the region of interest.

As we will see, each choice of SDE integrator will give a different training algorithm, with a loss based on the particular discretization scheme; training algorithms based on integrators of different accuracy, integrators with different rates and types of convergence (strong or weak), and even integrators based on different stochastic calculus type (Ito vs Stratonovich) fall under this umbrella. We will demonstrate these ideas here through two Ito-calculus based training algorithms: the Euler–Maruyama and the Milstein schemes.

We now formulate and rationalize the loss term that we use to train the neural networks fθ and σθ. The Euler–Maruyama scheme is a simple method to integrate (1) over a small time h>0,

(2)

where h>0 is small and δW0 is a vector of n random variables, all i.i.d. and normally distributed around zero with variance h. The convergence of (2) for h0 has been studied at length; we refer to the standard literature.23 We can use this idea to construct a loss function for training the networks σθ and fθ simultaneously. We initially restrict the discussion to the case n=1 for simplicity. Essentially, conditioned on x0 and h, we can think of x1 as a point drawn from a multivariate normal distribution

(3)

In the training dataset D, we only have access to triples (x0(k),x1(k),h(k)) and not the drift f and diffusivity σ. To approximate them, we define the probability density pθ of the normal distribution (3) and then, given the neural networks fθ and σθ, ask that the log-likelihood of the data D under the assumption in Eq. (3) is high,

(4)

We can now formulate the loss function that will be minimized to obtain the neural network weights θ. The logarithm of the well-known probability density function of the normal distribution, together with the mean and variance from (3), yields the loss to minimize during training,

(5)

This formula can easily be generalized to higher dimensions (see  Appendixes A– C), and we use such generalizations, for example, in two and three dimensions. Minimizing L in (5) over the data D implies maximization of the log marginal likelihood (4) with the constant terms removed (as they do not influence the minimization).23 Likelihood estimation in combination with the normal distribution is used in many variational and generative approaches.22,26,30,32,33,44 Note that here, the step size h(k) is defined per snapshot, so it is possible that it has different values for every index k. This will be especially useful in coarse-graining Gillespie simulations, where the time step is determined as part of the scheme.

We note that the specific form (3) of the transition probabilities pθ induced by the Euler–Maruyama method implies that it will, in general, be impossible to exactly reproduce the data-generation density p(x0,x1,h) for any finite step size h. This motivates the application of more refined numerical methods, such as the Milstein scheme considered in Sec. III B. Alternatively, one can think of pθ as being induced by an SDE (1) with modified drift f and diffusion σ. A “modified equation” is a selected ODE that is approximating the results of a fixed numerical scheme. This is the opposite of “solving an ODE with a numerical scheme,” where the ODE is fixed and a proper scheme must be selected to solve it. Modified equations have been derived for a variety of numerical methods, such as the Euler or the Milstein method.17 In general, there is no second-order modified equation for the Euler–Maruyama method, while one exists for the Milstein scheme. Modified equation analysis in our setting is typically performed with equal step sizes h(k) for all data points.

We now discuss how the loss function can be based on the Milstein integration scheme,45 to illustrate complexities that arise when deviating from simple Euler schemes, even in one dimension. For n=1, the Milstein scheme is a more accurate method to integrate (1) and is given through

(6)

where h>0 is small and δW0 is a random variable, normally distributed around zero with variance h. For scheme (6), it is also possible to write down the distribution of x1, similar to Eq. (3). It is no longer normal, however, because of the square of the variable δW0. To determine the distribution of x1, we define

Then, for a random variable zN(0,1), we have x1=A+Bz+Cz2. Assuming C0,

The variable ξ has a non-central χ2 distribution with one degree of freedom and non-centrality parameter λ=(B2C)2. With this derivation, we can see that x1 is distributed according to a linear transformation of a non-central χ2 distribution. The probability density of ξ is

where Ik/21 is a modified Bessel function of the first kind. We can now use the function pξ to define the probability density pθ of x1, given x0,

(7)

The corresponding point-wise loss function then is very similar to (5),

(8)

Equation (7) defines a probability density that can vanish over large portions of its domain, which leads to issues both with log-marginal likelihood estimation and with gradient computations. These issues can be mitigated through regularization, but also through kernel density approximations, which we outline in Sec. III C. To illustrate the difference of the probability distributions for the Euler–Maruyama and the Milstein templates, we plot them for h=0.15, f(x)=x+1, σ(x)=14x2+x+12 in Fig. 1 against 105 sampled points using the two templates with the starting point x0=0.

FIG. 1.

Probability density functions and sampled data from the two numerical integrator templates.

FIG. 1.

Probability density functions and sampled data from the two numerical integrator templates.

Close modal

Inspired by the iteration procedures use to discover kernels for deep Gaussian processes46 and kernel density estimates,16,47 we now describe an iterative method to approximate the probability density of other numerical integration schemes, such as Runge–Kutta type integrators.48 This will be useful in cases for which the density cannot be derived in analytical, closed form or for schemes with potentially locally vanishing probability density, like the Milstein method.

Consider an arbitrary (but convergent and accurate) numerical integration scheme ϕ for (1), so that for a fixed step size h>0 and a noise process δW0, a single time step becomes

(9)

Similar to Secs. III A and III B, given the scheme ϕ as well as the approximate drift fθ and diffusivity σθ parametrized by θ, we need to approximate the probability density pθ(x1|x0,h). For schemes that rely on noise processes δW0 that can be sampled easily (e.g., Gaussian noise as in Euler–Maruyama and Milstein schemes), the key idea for our numerical approximation is the following. We begin by sampling a set of M points δW0(i) from the distribution of δW0. Then, we use the SDE integration scheme ϕ to estimate candidates x^1(i) for the next point in time, given x0 and h. Finally, we describe the density of these candidates through a mixture of Gaussians, each with a small standard deviation σr>0 so that the actual probability distribution pϕ is approximated as

(10)

The approximation accuracy of (10) improves as M and σr0; for a discussion of this, we refer to the standard literature on kernel density estimation16,47 and viscosity solutions for PDE.49,50Figure 2 demonstrates this convergence for varying σ, with fixed M=104, x0=0.01, h=1.0, f(x)=2x, σ(x)=x, and ϕ the Milstein scheme with the exact form for pϕ given through Eq. (7). Many other density estimation methods are applicable here as well, such as generative-adversarial networks26 or normalizing flows.51 

FIG. 2.

Approximations of (7) using kernel density estimation (KDE) at varying scales σr, in standard and log-scale.

FIG. 2.

Approximations of (7) using kernel density estimation (KDE) at varying scales σr, in standard and log-scale.

Close modal

Up to this point, we discussed SDE where all coordinates are directly affected by noise. Second-order SDE of Langevin-type cannot be accurately described in this way, because the diffusivity is exactly zero for half of the coordinates,

(11a)
(11b)

The density of xt is no longer Gaussian, and it would seem that we require a loss function other than (5) to approximate drift f and diffusivity σ. Equation (11) can be approximated using the (symplectic) Euler–Maruyama scheme

(12a)
(12b)

This will not approximate the stationary distribution well, and other, more accurate schemes exist.52 Still, since we only assume access to snapshots ((x0,v0),(x1,v1),h) and do not process longer time series, we can effectively view x1 in (12) as a fixed parameter for the functions f,σ over a time span of h. Then, v1 is again normally distributed, conditioned on x1,v0 and step size h, similar to (3). Thus, loss function (5) can be used, but now only for the dynamics of vt, when given the corresponding values of xt, to learn approximations fθ and σθ. Including the symplectic adaptation which uses x1 instead of x0 to predict v1, we must, thus, minimize

(13)

We show in Sec. IV C that using this adapted training scheme for data sampled from SDE of Langevin-type (11) significantly improves approximation accuracy, compared to (incorrectly) using loss function (5) on (xt,vt) data.

Equation (1) defines a stochastic differential equation in finite-dimensional space, s.t. xtRn. Stochastic partial differential equations (SPDEs) describe properties of functions embedded in infinite-dimensional spaces. In the context of agent-based systems, a description of the system behavior though an SPDE is useful if the number of agents is very large, but finite-size effects still play a role, so that a deterministic description though a deterministic PDE would be too restrictive. There are no generically applicable numerical schemes for SPDEs similar to the Euler–Maruyama scheme (2), and numerical methods of higher-order also typically do not exist.53 However, in certain cases, we can reformulate existing numerical schemes for specific SPDEs and still learn parts of its law in a black- or gray-box setting. For example, we can start with a stochastic wave equation that has both deterministic driving f(x) and stochastic forcing σ(x), and initial conditions u0,v0,

(14a)
(14b)

To solve Eq. (14), we use a numerical algorithm suggested by Walsh.53 We refer to the paper53 for details and only summarize the main ideas here. To start, we choose a small step size h>0 and discretize space and time into points xi=ih, tj=jh. Then, we define a staggered grid on which we construct the solution, as shown in Fig. 3. The center grid points (xi,tj) are excluded in the discretization, no solution is approximated on it. The following numerical scheme then approximates the solution to the stochastic wave equation on the grid:

(15a)
(15b)
(15c)

FIG. 3.

Staggered grid on which we can solve the stochastic wave equation. The center grid point (xi,tj) is excluded in the discretization. The distance from xi to xi+1 is the step size h, the same between tj and tj+1.

FIG. 3.

Staggered grid on which we can solve the stochastic wave equation. The center grid point (xi,tj) is excluded in the discretization. The distance from xi to xi+1 is the step size h, the same between tj and tj+1.

Close modal

Scheme (15) can be reformulated to match the Euler–Maruyama scheme (2) by defining

Then,

which is the same scheme as (2). Given data D={u~i,j+1,u^i,j,h^}i,j, we can learn the forcing terms f,σ through minimization of the loss (5). In this paper, we only considered the autonomous case, where f and σ do not depend on time. Note that the SPDE (14) is also of Langevin type, and the reformulation of the Euler–Maruyama scheme here corresponds to the reformulation explained in Sec. III D.

In this paper, we consider cases where the coarse variables are already known (e.g., their mean field or quasi-chemical approximation variables). However, many observations in the real world do not immediately reveal the smallest dimensionality of the space in which the dynamics evolve. Latent space identification and related unsupervised learning are possible with a wide variety of techniques, also for SDE.39 If a latent space must be identified, we propose to learn it with Diffusion Maps54 as well as an SDE-integrator-informed auto-encoder with the SDE loss (5) augmented by the log mean squared loss: L(θ|D)=log(1Nk=1N(decoder(encoder(x0(k)))x0(k))2)+LEuler--Maruyama. The Diffusion Map coordinates define a latent space; learning an SDE on these new variables, and then mapping back to the original data is thus similar to latent SDE identification.22 With Diffusion Maps, we first identify coarse variables and then learn the effective SDE. This has benefits compared to an end-to-end auto-encoder approach, because by construction, the latent coordinates are invariant to isometry and sampling density in the ambient space.54 Learning the latent coordinates together with the drift and diffusivity functions instead leads to a latent space that is adapted to the specific SDE. This construction usually does not have the same invariance properties as the coordinates from Diffusion Maps. A description of the methods to obtain these latent spaces and the identification of SDE on them for kMC lattices can be found in Appendix B 6.

We first illustrate the neural network identification techniques on toy examples, and then show an application to learning effective SDEs from lattice dynamics simulations. Table I lists important training parameters. All computational experiments were performed on a single desktop computer with an AMD Ryzen 7 3700X 8-Core Processor, 32GB RAM, and a NVIDIA GeForce GTX 1070 GPU. Training did not take longer than 30 min for each case, data generation did not take longer than two hours. The neural networks and loss functions were defined in TensorFlow V2.4.155 (Apache 2.0 license). For the toy examples, we created snapshots by integrating the SDEs with 10 steps of Eq. (2) from t=0 to t=h, only storing the initial and final result. The given network topology was used twice, separately for drift and diffusivity. If not otherwise stated, we use a test/validation split of 90%/10%, the ADAM optimizer with standard parameters learning_rate = 0.001, β1=0.9, β2=0.999, ϵ=107, batch size of 32, 100 epochs, and exponential linear unit activation functions.

TABLE I.

Parameters, equations for loss, training, and validation errors for the experiments.

Exampleh# pointsNetwork topologyLossTrainValidation
Section IV A: cubic, 1D 0.01 10 000 1|50|50|1 (5) −0.003 54 −0.0111 
Section IV A: cubic, 1D 0.01 10 000 1|50|50|1 (7) −0.0010 −0.0112 
Section IV A: linear, 3D 0.25 15 000 3|50|50|3/6 (drift) (5) −7.60 −7.63 
   3|50|50|6 (diff.)    
Section IV A: 1D–19D 0.01 10 000 n|50|50|n (5) Varies Varies 
Section IV B a: Gillespie Varies 90 000 2|50|50|2 (5) −10.9 −11.0 
Section IV B b: kMC lattice Varies 90 000 2|50|50|2 (5) −11.7 −11.8 
Section IV C: Langevin 0.05 10 000 2|50|50|2 (13) −0.44 −0.42 
Section IV D: wave SPDE 0.001 Varies 2|50|50|2 (5) Varies Varies 
Exampleh# pointsNetwork topologyLossTrainValidation
Section IV A: cubic, 1D 0.01 10 000 1|50|50|1 (5) −0.003 54 −0.0111 
Section IV A: cubic, 1D 0.01 10 000 1|50|50|1 (7) −0.0010 −0.0112 
Section IV A: linear, 3D 0.25 15 000 3|50|50|3/6 (drift) (5) −7.60 −7.63 
   3|50|50|6 (diff.)    
Section IV A: 1D–19D 0.01 10 000 n|50|50|n (5) Varies Varies 
Section IV B a: Gillespie Varies 90 000 2|50|50|2 (5) −10.9 −11.0 
Section IV B b: kMC lattice Varies 90 000 2|50|50|2 (5) −11.7 −11.8 
Section IV C: Langevin 0.05 10 000 2|50|50|2 (13) −0.44 −0.42 
Section IV D: wave SPDE 0.001 Varies 2|50|50|2 (5) Varies Varies 

We tested our implementation with examples in one dimension first, with drift f and diffusivity σ defined through f(xt)=2xt34xt+1.5,σ(xt)=0.05xt+0.5. A comparison between the learned and true functions is shown in Fig. 4, along with densities obtained from sampling the true and approximate SDE. We trained networks with both Euler–Maruyama and Milstein loss functions, with very comparable results—even the training and validation curves were very similar. When increasing the dimension from n=1 to n=19, approximation quality decreases after n=6. Non-diagonal diffusivity matrices were approximated accurately for n=3 (see Appendix B 3).

FIG. 4.

Approximation vs true values of f and σ in example A. Densities over time, from points sampled uniformly in [2,2] and then integrated using the true SDE and using the approximation of drift and diffusivity with the network.

FIG. 4.

Approximation vs true values of f and σ in example A. Densities over time, from points sampled uniformly in [2,2] and then integrated using the true SDE and using the approximation of drift and diffusivity with the network.

Close modal

We validate our approach through a well-studied model for which analytical approximations at several levels of coarse-graining are available. The problem is the dynamics of epidemics on a lattice, which we briefly outline below. We consider two different versions of the model, and “learn” the corresponding effective SDEs. The first version at the finest scale involves Monte Carlo lattice simulations; the second version involves Gillespie simulation of the mechanistic scheme implied by the mean field model. Not surprisingly, if we use mean field SDE data, we discover an approximation to the mean field SDE. However, we find an approximation to dynamics in the mean field variables which is different than the mean field SDE, if we get data from Gillespie/lattice simulations and coarse-grain it in terms of mean field variables.56,57

Epidemic SIRS model: We consider a hierarchical sequence of three stochastic models describing the dynamics of the classical SIR (susceptible–infected–recovered)-type epidemic process.58–60 The first, most detailed model is represented by kinetic Monte Carlo (kMC) simulations on a lattice. The kMC model considers a two-dimensional square lattice of size N=Nx×Ny with periodic boundary conditions, where each site contains a single individual. In accordance to the health state, each individual can be in one of three states: Susceptible (S), Infected (I), or Recovered (R). The elementary reactions (independent Poisson processes) occurring on a lattice are as follows: (1) I+SI+I (rate constant k1; infection), (2) IR (rate constant k2; recovery with immunity), (3) RS (rate constant k3; loss of immunity).

The infection spreads through nearest neighbor bonds only; thus the first reaction may occur only when I and S are the first neighbors on a lattice. Elementary events take place with transition probabilities kp (p=1,2,3). When k3>0, the model is called as susceptible–infected–recovered–susceptible (SIRS), while the SIR model is a particular case of the SIRS model with k3=0. In addition to steps (1)–(3), migration of individuals occurring via the exchange mechanism is considered: (4) Si+IjIi+Sj (rate constant D1), and (5) Si+RjRi+Sj (rate constant D2). Here, the sites i and j are first neighbors on a lattice. The time evolution of the probability distribution for the populations of individuals on a lattice can be described by the chemical master equation, derived from the Markov property of the underlying stochastic process. To generate realizations of this process, we applied the so-called rejection-free direct kMC algorithm,19,61 which produces a sequence of configurations and the times when they change from one to the other. It is interesting to mention that the same lattice kMC model can be formulated in a mathematically equivalent form using the following three types of states: excited (infected), refractory (recovered), or quiescent (susceptible). In this case, the model may reproduce the typical patterns of excitable media such as traveling pulses and spiral waves.62 Furthermore, we assume that D=D1=D2. In essence, the migration rate D may take into account quarantine restrictions and other regulations in place to keep the infection from spreading. Small D values correspond to strong restrictions. If a spatially homogeneous, random distribution of all individuals on a lattice is assumed, the problem can be significantly simplified. Such a homogeneous distribution can be achieved in the “fast migration” limit (D).

The second stochastic model we study is represented by the Gillespie’s stochastic simulation algorithm (SSA19) that is used to simulate a well-stirred, spatially homogeneous system. For the SIRS model, SSA operates in terms of three integer stochastic variables: the number of susceptible individuals, n0, the number of infected individuals, n1, and the number of recovered individuals, n2; their concentrations are calculated as: yp=np/N. The mass balance implies that y0+y1+y2=1. At each time step of the SSA algorithm the following three rates are calculated through r1=4k1y0y1,r2=k2y1,r3=k3y2. Then, an event to be performed is selected with a probability proportional to its rate and the numbers np are updated accordingly.

The third stochastic model is represented by a set of stochastic differential equations (SDEs) of the Langevin type which, under certain assumptions, approximates the results of the SSA algorithm.20 The SDE model at finite N is as follows:

(16a)
(16b)
(16c)

Here, Wp(t) are independent Wiener processes such that dWp(t)N(0,dt). Only two of three SDEs must be considered, since y0+y1+y2=1. In the limit N, the SDE system transforms to the standard mean field SIRS model. In particular, for the SIR model, this SDE system reduces to having a diagonal diffusivity matrix

(17)

Equation (16) or (17) can be integrated using the Euler–Maruyama method with fixed time step h. The SDE models approximate the solution of both SSA and kMC models. The following parameters and initial conditions are used in the simulations: k1=1, k2=1, k3=0; y0(0)=0.9, y1(0)=0.1. In the limit of infinite N, the mean field equations simplify to

(18)

We assume that SSA and kMC simulations generate trajectories which can be approximated by the solution of two latent SDEs with diagonal diffusion matrix, similar to Eq. (17). Under this assumption, the neural network was used to learn the unknown drift and diffusion terms at the fixed values of the model parameters k1,k2,k3, and N. Training was performed using a set of Ntr short-time stochastic simulations on a N1×N2 lattice with the random, but physically-relevant initial conditions for the average concentrations ym(0,1), m{0,1,2}. The initial species distribution on a lattice was also random. During each independent SSA and kMC simulation run, the data points were evaluated at the time moments t0{0,t1,,tmax}, where hi=ti+1tih. The values of tmax and h are the fixed parameters of the algorithm. Note that a time step in the SSA/kMC algorithm is a random variable. The time is advanced via an increment given from the exponential distribution: dt=log(ξ)/R, where R is the total reaction rate and ξ(0,1) is a uniformly distributed random number. Therefore, hi are close to h, but they are not constant. Each data point used for training contains the following five values: hi,y(0,i),y(2,i),y(0,i+1),y(2,i+1). The total number of data points is Ndp=Ntr([tmax/h]1). The trained network approximates the unknown drift and diffusion terms at any given values of ym(0,1). Thus, the SDE approximation in combination with the Euler–Maruyama method provides a time-stepper (decoder), which can reproduce the dynamics.

a. Gillespie simulations

Figure 5 shows that a network can be trained to match the results of SSA simulations. A distribution function (histogram) of the time steps hi is plotted in Fig. 5(a). It demonstrates that our approach works with variable time steps in the training dataset. Three sample paths of y0 and y1 and the average of 200 paths are shown in Figs. 5(b) and 5(c).

FIG. 5.

Histogram of the time steps hi used in the training dataset (a). Three sample paths of y0 and y1 (b); average of 200 paths with deviation of max|yi,NNyi,SSA|<0.02 (c). Blue (red) lines correspond to NN (SSA) results. Parameters: N=400; Ntr=10000, tmax=0.1, h=0.01.

FIG. 5.

Histogram of the time steps hi used in the training dataset (a). Three sample paths of y0 and y1 (b); average of 200 paths with deviation of max|yi,NNyi,SSA|<0.02 (c). Blue (red) lines correspond to NN (SSA) results. Parameters: N=400; Ntr=10000, tmax=0.1, h=0.01.

Close modal
b. Kinetic Monte Carlo lattice simulations

Figure 6 shows that our approach successfully reproduces the results of kMC simulations on a 32×32 lattice. The Euler–Maruyama method with a fixed time step hEM=0.002 was used in numerical integration of the network SDE approximation. Figure 6(a) demonstrates three sample paths obtained using the neural network (blue) and kMC (red, “true solution”). The averaged paths are presented in Fig. 6(b). A sufficiently high accuracy network approximation is demonstrated in Figs. 6(c) and 6(d), where the probability density functions for y0 and y1 at t=2 are shown. They were estimated by NN and kMC and are visually identical.

FIG. 6.

Illustration of the kMC lattice (top row), physically relevant values y0,y1 measured over time respective simulated with the identified SDE from the network (a), averaged paths over 200 simulations with a deviation max|yi,NNyi,kMC|<0.01 (b), and propagated densities from the initial condition until t=2 (c) and (d). Lattice snapshots show S, I, and R-type species as gray, yellow, and blue squares, respectively. Parameters: 32×32 lattice, D=50; Ntr=4000, tmax=0.05, h=0.01.

FIG. 6.

Illustration of the kMC lattice (top row), physically relevant values y0,y1 measured over time respective simulated with the identified SDE from the network (a), averaged paths over 200 simulations with a deviation max|yi,NNyi,kMC|<0.01 (b), and propagated densities from the initial condition until t=2 (c) and (d). Lattice snapshots show S, I, and R-type species as gray, yellow, and blue squares, respectively. Parameters: 32×32 lattice, D=50; Ntr=4000, tmax=0.05, h=0.01.

Close modal

The described examples consider the lattice SIR model which has only the trivial steady states with y1=0. When k3>0, the system exhibits richer dynamic behavior. Simulations with k3=0.2 were performed as well, and the network model is capable of reproducing the correct steady state and damped oscillations (the mean field SIRS model has a focus-type steady state at these values of parameters).

This is actually surprising: When the SIRS model is considered, the diffusion matrix is not diagonal, because there are three different reactions. Nevertheless, the trained network is capable here of approximating the kMC results, even when only trained with a diagonal diffusivity matrix.

The reformulation of the loss function (5) to (13) introduced in Sec. III D allows us to also learn non-Gaussian dynamics of Langevin-type. We will now demonstrate this on a simple example, and show that incorporating the assumption of Langevin-type dynamics for data from Langevin systems is necessary for accurate approximation of drift and diffusivity functions. The system we investigate is defined through one-dimensional state xt and velocity vt, such that

(19)

where γ=12, β=10, f(xt)=U(xt)γvt with potential U(xt)=14xt4+12xt2, and σ=2γ/β. When approximating f and σ, we assume they depend on both xt and vt, which creates a more challenging learning problem. As a loss function, we use (13). The data from the spatial coordinate xt is used as a fixed parameter for each given snapshot ((x0,v0),(x1,v1),h). The result is shown in Fig. 7, with probability densities for different times shown in Fig. 8.

FIG. 7.

Approximations of drift and diffusivity, and sample path comparison when using the same sample path of the noise process Wt. The spread of values in the drift term is due to the small influence of γvt in the true function f, which is also captured accurately.

FIG. 7.

Approximations of drift and diffusivity, and sample path comparison when using the same sample path of the noise process Wt. The spread of values in the drift term is due to the small influence of γvt in the true function f, which is also captured accurately.

Close modal
FIG. 8.

Probability density functions at times t=0.5,1.0,5.0, starting with xN(0,110), v=0. The network approximation works well for smaller times, but the integrated error accumulates for larger ones.

FIG. 8.

Probability density functions at times t=0.5,1.0,5.0, starting with xN(0,110), v=0. The network approximation works well for smaller times, but the integrated error accumulates for larger ones.

Close modal

When using loss function (5) for both dimensions (xt,vt) instead, we effectively (incorrectly) assume the SDE (1) instead of (11). In this case, the training fails to converge, as shown in Fig. 9, even when trained with 20 times the number of epochs. The final loss value in this experiment is 8.4 and would diverge to with increasing number of data points. The issue is also apparent in the loss evaluated during training, as shown in Fig. 10.

FIG. 9.

Approximations of drift and diffusivity, and sample path comparison when using the same noise process, if the wrong type of SDE is assumed. Drift and diffusivity terms for vt are only learned after 20 times more training time, and the path prediction also fails.

FIG. 9.

Approximations of drift and diffusivity, and sample path comparison when using the same noise process, if the wrong type of SDE is assumed. Drift and diffusivity terms for vt are only learned after 20 times more training time, and the path prediction also fails.

Close modal
FIG. 10.

Loss against epoch when training with correct Langevin loss (11) and incorrect Euler loss (5). Note the erratic behavior of the loss, and that training 20 times longer also does not lead to proper convergence. The numerical value of the loss in the right panel is also much lower, resulting from an approximated diffusivity very close to zero.

FIG. 10.

Loss against epoch when training with correct Langevin loss (11) and incorrect Euler loss (5). Note the erratic behavior of the loss, and that training 20 times longer also does not lead to proper convergence. The numerical value of the loss in the right panel is also much lower, resulting from an approximated diffusivity very close to zero.

Close modal

We now demonstrate learning deterministic driving, f(x), and stochastic forcing, σ(x), terms in a stochastic PDE, using the reformulation of the numerical scheme (15) with step size h=0.001. The training data in Fig. 11 were generated using autonomous, spatially varying forcing terms,

We use our methodology to identify approximations of these two functions with N=500,000 pairs of training data. In addition to this experiment, Figure 12 shows how the network approximations converge to the true functions (which would be unknown in a real setting), when the number of available data points (triples (un,un+1,h)) is increased while keeping the number of network parameter updates in one training run constant at approximately 105. We perform ten training runs per dataset size to illustrate variability of the results due to stochastic training and data point selections. The loss (5) used during training and validation is approximately constant for all datasets of size >104, indicating convergence of training for all cases where the number of data points surpasses the number of trainable parameters (including weights and biases, we use a network with 5602 trainable parameters).

FIG. 11.

Training data from the stochastic wave equation, and space dependent deterministic and stochastic forcing functions learned through the reformulated numerical solution method.

FIG. 11.

Training data from the stochastic wave equation, and space dependent deterministic and stochastic forcing functions learned through the reformulated numerical solution method.

Close modal
FIG. 12.

Experimental results for convergence of the neural network approximations to the true forcing functions (drift and diffusivity) for increasing number of available data points. Each data point in the panels indicates the result of an entire training process, with varying number of available data points but constant number of training updates.

FIG. 12.

Experimental results for convergence of the neural network approximations to the true forcing functions (drift and diffusivity) for increasing number of available data points. Each data point in the panels indicates the result of an entire training process, with varying number of available data points but constant number of training updates.

Close modal

Training neural networks with loss functions based on numerical integrators such as (5) and (8), but also (13), has several limitations. If the time step between many samples is too small, we cannot accurately identify the drift, because the diffusivity term will dominate. This could be mitigated by starting with small h, estimating the diffusivity, and then estimating the drift with subsampled trajectories. Even with infinite data, the drift is difficult to estimate because the time step also has to go to zero.23 Conversely, if the time step is too big, we cannot accurately identify the diffusivity. If the dynamics of the coarse-grained observables include rare events, then learning the corresponding SDE is a challenge, because these events will not be present in a lot of snapshots, and hence, not a lot of data are available to learn them. Using a loss function based on Lévy noise SDE integrators38 might provide a useful surrogate for SDEs with rare events.

There are many possible extensions and applications of our work. An important task is to find explicit probability density functions for integrators using other noise types, such as Lévy or Poisson noise,38 or integrators of higher order.48 Integrating additional noise from measurement devices on top of the inherent stochasticity of the dynamics will also be important. Analyzing integrators based on other types of calculus such as Heun’s63 method may help impose more variegated types of priors during training. It is a predictor–corrector method, x~1=x0+f(x0)h+σ(x0)z;x1=x0+1/2[f(x0)+f(x~)]h+1/2[σ(x0)+σ(x~1)]z;zN(0,h), based on Stratonovich calculus and will give rise to implicit neural networks.11,64 Using the adjoint method28 allows us to propagate the loss gradients back through longer time series. The choice of numerical approximation (9) entails a systematic error for the data generating distribution, which can impact adversely on the capability to generalize beyond the training data. This effect can be minimized by higher-order numerical methods and analyzed through modified equation analysis.17 In terms of applications, many more particle- and agent-based models can be coarse-grained. For many of these, it is important to find coarse variables, and latent space techniques such as VAE-type architectures and Diffusion Maps discussed in Sec. III F will help in identifying them. These techniques can help to construct local SDE models in the process of larger-scale simulations to guide sampling and exploration. Another clear next step based on numerical analysis is the identification of SPDEs, for which we have demonstrated how our approach can be used to approximate forcing terms. Identification of entire SPDEs from data, including all terms, is still an open issue.

The authors wish to dedicate this paper to the memory of Professor Alexei Makeev, a key contributor to the work, a friend, and an inspiring and knowledgeable collaborator, whom we lost to Covid last year. We thank A. J. Roberts for insightful discussions. The research of F.D. has been partially funded by the Deutsche Forschungsgemeinschaft (DFG)—Project No. 468830823. The research of S.R. has been partially funded by the Deutsche Forschungsgemeinschaft (DFG)—Project No. 318763901-SFB1294. The work of I.G.K., N.E., and T.B. was partially supported by the U.S. Department of Energy and the U.S. Air Force Office of Scientific Research.

The authors have no conflicts to disclose.

Felix Dietrich: Conceptualization (equal); Data curation (lead); Formal analysis (lead); Funding acquisition (equal); Investigation (equal); Methodology (lead); Project administration (equal); Resources (equal); Software (lead); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (lead). Alexei Makeev: Data curation (equal); Methodology (equal); Software (supporting); Validation (equal); Visualization (equal); Writing – original draft (supporting); Writing – review & editing (supporting). George Kevrekidis: Software (supporting); Validation (supporting); Writing – original draft (supporting). Nikolaos Evangelou: Data curation (supporting); Software (equal); Validation (supporting); Visualization (supporting); Writing – original draft (supporting); Writing – review & editing (supporting). Tom Bertalan: Data curation (supporting); Software (supporting); Writing – original draft (supporting). Sebastian Reich: Conceptualization (equal); Formal analysis (equal); Funding acquisition (equal); Methodology (equal); Supervision (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (supporting). Ioannis G. Kevrekidis: Conceptualization (lead); Formal analysis (equal); Funding acquisition (lead); Investigation (equal); Methodology (equal); Project administration (equal); Supervision (lead); Validation (equal); Writing – original draft (lead); Writing – review & editing (equal).

The data and code that supports the findings of this study are openly available in GitLab at https://gitlab.com/felix.dietrich/sde-identification, Ref. 68.

The code to reproduce the results of the computational experiments is available at https://gitlab.com/felix.dietrich/sde-identification. Table II lists the software we used to test if the code can be compiled and actually produces the experimental results.

TABLE II.

Software we used to test that the code in the supplement can be compiled and run.

SoftwareVersion
Microsoft Windows 10.0.19042 
Python 3.8 
Microsoft Visual Studio Community 2019 16.9.0 
Microsoft Visual C++ 2019 
SoftwareVersion
Microsoft Windows 10.0.19042 
Python 3.8 
Microsoft Visual Studio Community 2019 16.9.0 
Microsoft Visual C++ 2019 

All required python packages can be installed by running “pip install -r requirements.txt” in the folder where the requirements.txt file is placed. The toy examples and lattice computations are all available in separate files, the mapping is listed in Table III. To compile the kMCLattice.dll, the Microsoft Visual Studio solution file in kmc/kMCLattice_dll can be used. Make sure to use an x64 setup (for both Python and the .dll file). The file needs to be placed in the same folder as the Python script kmc/KMC_SIRS3.py. The SSA results can be obtained by setting both Dif1 and Dif2 to 1000 in the file kmc/KMC_SIRS3.py.

TABLE III.

Mapping of computational experiments to Jupyter notebooks (.ipynb)/python (.py) files.

ExperimentFile
Toy example: 1D, cubic drift, linear diff. (E.–M.) example3—1d sde-cubic.ipynb 
Toy example: 1D, cubic drift, linear diff. (Milstein) example3—1d sde-cubic-with-milstein.ipynb 
Toy example: 3D, linear drift, lower-triangular diff. example4—3d sde-nondiagonal.ipynb 
Toy example: 3D, linear drift, SPD diff. example4—3d sde-spd.ipynb 
Toy example: nD (1-19D), cubic drift, linear diff. example3—nd sde-cubic.ipynb 
Kinetic Monte Carlo simulations kmc/KMC_SIRS3.py 
Stochastic wave equation example7—wave SPDE.ipynb 
Langevin dynamics example8—nonGaussian.ipynb 
ExperimentFile
Toy example: 1D, cubic drift, linear diff. (E.–M.) example3—1d sde-cubic.ipynb 
Toy example: 1D, cubic drift, linear diff. (Milstein) example3—1d sde-cubic-with-milstein.ipynb 
Toy example: 3D, linear drift, lower-triangular diff. example4—3d sde-nondiagonal.ipynb 
Toy example: 3D, linear drift, SPD diff. example4—3d sde-spd.ipynb 
Toy example: nD (1-19D), cubic drift, linear diff. example3—nd sde-cubic.ipynb 
Kinetic Monte Carlo simulations kmc/KMC_SIRS3.py 
Stochastic wave equation example7—wave SPDE.ipynb 
Langevin dynamics example8—nonGaussian.ipynb 

Appendixes B 2B 6 contain additional descriptions and results for the computational experiments. They are not necessary to follow the main results described in the paper.

1. Milstein loss

The TensorFlow implementation for the Milstein loss uses a slightly different Bessel function, with integer-valued parameter ν=0 instead of the required ν=1/2 (no other Bessel function was available). Figure 13 demonstrates the difference is quite small.

FIG. 13.

Probability density functions and sampled data from the two numerical integrator templates. The top panel shows the probability density functions (PDFs) with the correct Bessel function with real-valued parameter ν=1/2, using scipy.stats.ncx2 for the PDF of the non-centered χ2 distribution. The bottom panel shows the PDF with an integer-valued parameter ν=0 (the only one available in TensorFlow).

FIG. 13.

Probability density functions and sampled data from the two numerical integrator templates. The top panel shows the probability density functions (PDFs) with the correct Bessel function with real-valued parameter ν=1/2, using scipy.stats.ncx2 for the PDF of the non-centered χ2 distribution. The bottom panel shows the PDF with an integer-valued parameter ν=0 (the only one available in TensorFlow).

Close modal

2. Toy example: Cubic drift and linear diffusivity

The loss and validation loss curves for Euler–Maruyama and Milstein schemes are shown in Fig. 14.

FIG. 14.

Milstein vs Euler–Maruyama loss curves.

FIG. 14.

Milstein vs Euler–Maruyama loss curves.

Close modal

3. Toy example: Non-diagonal diffusivity

We now demonstrate that our approach can learn non-diagonal diffusivity matrices. We sample initial points x0[.3,.3]3, randomly sample a lower-triagonal diffusivity matrix Σ with positive eigenvalues, which we use as the (constant) diffusivity σ(x)=Σ, and set the drift to be f(x)=x. The absolute error between original and (the average over) identified matrix entries is smaller than 0.01, and the standard deviation of the diffusivity values over the entire dataset is smaller than 0.003 for all elements of the matrix. Note that the network σθ is still a nonlinear function over the state space, not a constant. To ensure the matrix has positive eigenvalues, we pass the real-valued output of the neurons that encode the diagonal of the matrix through a soft-plus activation function. Figure 15 shows that we learn a good approximation for Σ.

FIG. 15.

Lower-triangular (top row) and full symmetric, positive definite (bottom row) diffusivity matrix approximation. The panels show the averaged network output over the uniformly sampled data inside the training region, the true diffusivity matrices, the absolute error between the results shown in the first and second columns, and the deviation of the network matrix output over the training region. The color range is the same for all four plots.

FIG. 15.

Lower-triangular (top row) and full symmetric, positive definite (bottom row) diffusivity matrix approximation. The panels show the averaged network output over the uniformly sampled data inside the training region, the true diffusivity matrices, the absolute error between the results shown in the first and second columns, and the deviation of the network matrix output over the training region. The color range is the same for all four plots.

Close modal

4. Toy example: Increasing dimension from 1D to 19D

We define f(x)=(4x38x+3)/2 and σ(x)=5e2x+0.5, where all operations are meant coordinate-wise (e.g., x3 computes the third power of each individual coordinate of the vector). Figure 16 (left) illustrates how the training and validation losses change when increasing n from 1 to 19. The loss (LEuler--Maruyama) is adapted to the changing dimensionality by adding log(2π)n, as described by Eq. (C1). This normalization is necessary because (apparently) TensorFlow 2.4.1 does not add this constant when computing the log probability (see https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/MultivariateNormalDiag##log_prob). Figures 17 and 18 show the individual functions compared to the true values, for dimensions n=10 and n=19. For higher dimensions, it becomes harder to approximate the true functions correctly, which can also be seen here. The increase in loss can be explained by the constant number of points (N=10000) we used: Increasing the intrinsic dimension of the problem by sampling the input data in the n-dimensional cube [2,2]n causes the data sampling to get sparser and sparser. By increasing the number of training data points linearly with the dimension (while keeping the number of training iterations per dimension constant), we can see that the training loss is relatively small even for n=12 (Fig. 16, right). Additionally, the training loss tracks the validation loss much better, indicating better generalization properties of the network. This is reasonable, because the provided data is less sparse, even for higher dimensions.

FIG. 16.

Normalized training and validation loss when increasing the intrinsic dimension of the problem from n=1 to n=19. The solid lines show the results of keeping the number of samples constant at N=10000, the dashed lines show the average of what happens when increasing N=10000×n, but training for the same number of iterations per dimension (Epochs=1000/n).

FIG. 16.

Normalized training and validation loss when increasing the intrinsic dimension of the problem from n=1 to n=19. The solid lines show the results of keeping the number of samples constant at N=10000, the dashed lines show the average of what happens when increasing N=10000×n, but training for the same number of iterations per dimension (Epochs=1000/n).

Close modal
FIG. 17.

Approximation of individual drift and diffusivity functions compared to the true values, in dimension 10. Function f9 is not approximated well.

FIG. 17.

Approximation of individual drift and diffusivity functions compared to the true values, in dimension 10. Function f9 is not approximated well.

Close modal
FIG. 18.

Approximation of individual drift and diffusivity functions compared to the true values, in dimension 19. Several functions are not approximated well.

FIG. 18.

Approximation of individual drift and diffusivity functions compared to the true values, in dimension 19. Several functions are not approximated well.

Close modal

5. Kinetic Monte Carlo lattice simulations

Results on a smaller 20×20 lattice are shown in Fig. 19. Figure 20 shows that the algorithm works at different system sizes, even when N is large—that is, even when the noise level is low. For N, the system becomes deterministic and can be described by an ODE. The data presented in Fig. 20 correspond to a relatively fast migration rate D=D1=D2=50. In this case, a well-mixed state is achieved and the kMC results are close to that obtained with the SSA algorithm. However, the network can also be trained to match the kMC results at lower migration rate values. Figure 21 shows the simulation results for D=5. Since the initial distribution on a lattice is random, we omit the first few time steps and select the data points for training only when a “matured” state is achieved. For the results presented in Fig. 21, the training process was started at t0=0.1. At D=5, the accuracy of the network is acceptable, but at lower D values, the coincidence of the network and kMC results decreased considerably. This suggests that around that value of D the dynamics may not effectively close in terms of the mean field variables and that higher dimensional approximations (possibly including pair probabilities as additional independent variables) may become necessary. Similar results can be obtained on a smaller 20×20 lattice. In this case, as expected, the influence of the stochasticity becomes more pronounced. In a separate experiment, we also confirmed the algorithm works at different system sizes, even when N is large—that is, even when the noise level is low. For N, the system becomes deterministic and can be described by an ODE. At low migration rate values (D<5), the approximation quality of the network decreased considerably as expected: The system can hardly be described in terms of two averaged concentrations because the species distribution on a lattice becomes inhomogeneous: Clustering occurs, and a higher dimensional SDE, possibly in terms of densities as well as pair probabilities, becomes necessary.

FIG. 19.

Blue (red) lines correspond to NN (kMC) results. (a) Three sample paths of y0 and y1; (b) average of 200 paths; (c) and (d) probability density function for y0 and y1 at t=1.5 computed from 2×104 sample paths. Parameters: 20×20 lattice, D=50; Ntr=10000, tmax=0.05, h=0.01.

FIG. 19.

Blue (red) lines correspond to NN (kMC) results. (a) Three sample paths of y0 and y1; (b) average of 200 paths; (c) and (d) probability density function for y0 and y1 at t=1.5 computed from 2×104 sample paths. Parameters: 20×20 lattice, D=50; Ntr=10000, tmax=0.05, h=0.01.

Close modal
FIG. 20.

Influence of the system size on the neural network-estimated probability density function at t=2: 20×20 (black), 32×32 (navy), 100×100 (blue).

FIG. 20.

Influence of the system size on the neural network-estimated probability density function at t=2: 20×20 (black), 32×32 (navy), 100×100 (blue).

Close modal
FIG. 21.

Relatively low migration rate. Parameters: 32×32 lattice, D=5; Ntr=4000, tmax=0.4, h=0.02.

FIG. 21.

Relatively low migration rate. Parameters: 32×32 lattice, D=5; Ntr=4000, tmax=0.4, h=0.02.

Close modal

When k3>0, the system exhibits richer dynamic behavior. An example of simulations with k3=0.2 is shown in Fig. 22. The network model is capable of reproducing the correct steady state and damped oscillations (the mean field SIRS model has a focus-type steady state at these values of parameters).

FIG. 22.

SIRS model with k3=0.2, 60×60 lattice, D=50; Ntr=5000, tmax=0.1, h=0.01. The left panel shows 200 averaged paths, the right panel shows the propagated densities until t=2.

FIG. 22.

SIRS model with k3=0.2, 60×60 lattice, D=50; Ntr=5000, tmax=0.1, h=0.01. The left panel shows 200 averaged paths, the right panel shows the propagated densities until t=2.

Close modal

6. Learning latent spaces for the SIR model

We identify latent spaces of six-dimensional data obtained from the kMC lattice simulations. Figure 23 illustrates that the latent spaces are both two-dimensional and triangular. This shape is due to the conservation law of the underlying model (when written in terms of fractions of the total population that are susceptible (S), infected (I), and removed (R), the law would be S+I+R=1). Appendix B 6 a outlines the Diffusion Maps algorithm, Appendix B 6 b describes how the auto-encoder was set up and trained.

FIG. 23.

Latent spaces identified by SDE-integrator-informed auto-encoder (a) and with Diffusion Maps (b). Both show a triangular structure, as expected from a system with three intrinsic variables with a single conservation law (y0+y1+y2=Const.).

FIG. 23.

Latent spaces identified by SDE-integrator-informed auto-encoder (a) and with Diffusion Maps (b). Both show a triangular structure, as expected from a system with three intrinsic variables with a single conservation law (y0+y1+y2=Const.).

Close modal
a. Diffusion Maps

The Diffusion Maps algorithm by Coifman and Lafon54 follows three main steps:

  1. Construct the kernel matrix Ki,j=exp(d(xi,xj)2/ϵ).

  2. Normalize the matrix to mitigate effects of data density.

  3. Compute eigenvectors to approximate eigenfunctions of Δ evaluated on the data.

These steps are expanded upon in Algorithm 1. We use the implementation from Ref. 65.

ALGORITHM 1

Diffusion Maps algorithm from Ref. 54, adapted from Ref. 66, and implemented in Ref. 65.

Input:nN: Dimension of points in the input. N: Number of eigenvectors to compute. XRN×n: input dataset, with N points and 
n dimensions per point. 
(1)  Compute kernel matrix KRN×N with Kij=exp(XiXj2/ϵ) and kernel bandwidth ϵ through 5\% of the median of squared 
            distances between all points. Note that sparse computations are also possible in the software.65  
(2)  Normalize kernel matrix to make it invariant to sampling density: 
         (a)  Form the diagonal normalization matrix Pii=j=1NKij
         (b)  Form the matrix W=P1KP1
         (c)  Form the diagonal normalization matrix Qii=j=1NWij
         (d)  Form the matrix T^=Q1/2WQ1/2
(3)  Solve the eigensystem and construct eigenvectors: 
         (a)  Find the largest eigenvalues ap associated to non-harmonic eigenvectors vp of T^. See67 for a discussion and method to remove 
                   harmonic eigenvectors. 
         (b)  Compute the eigenvalues of T^1/ϵ by λp=ap1/(2ϵ)
         (c)  Compute the eigenvectors ϕp of the matrix T=Q1W through ϕp=Q1/2vp
Output: eigenvectors ϕp, each with N entries, and with corresponding eigenvalues λp
Input:nN: Dimension of points in the input. N: Number of eigenvectors to compute. XRN×n: input dataset, with N points and 
n dimensions per point. 
(1)  Compute kernel matrix KRN×N with Kij=exp(XiXj2/ϵ) and kernel bandwidth ϵ through 5\% of the median of squared 
            distances between all points. Note that sparse computations are also possible in the software.65  
(2)  Normalize kernel matrix to make it invariant to sampling density: 
         (a)  Form the diagonal normalization matrix Pii=j=1NKij
         (b)  Form the matrix W=P1KP1
         (c)  Form the diagonal normalization matrix Qii=j=1NWij
         (d)  Form the matrix T^=Q1/2WQ1/2
(3)  Solve the eigensystem and construct eigenvectors: 
         (a)  Find the largest eigenvalues ap associated to non-harmonic eigenvectors vp of T^. See67 for a discussion and method to remove 
                   harmonic eigenvectors. 
         (b)  Compute the eigenvalues of T^1/ϵ by λp=ap1/(2ϵ)
         (c)  Compute the eigenvectors ϕp of the matrix T=Q1W through ϕp=Q1/2vp
Output: eigenvectors ϕp, each with N entries, and with corresponding eigenvalues λp
b. SDE-informed auto-encoder

The SDE-informed auto-encoder has a single encoder network (from observations to latent space) and two additional networks: one decoder from latent space back to the observations, and one network that defines drift and diffusivity functions on the latent space. Note that the drift and diffusivity we find can still be considered functions of the original observations, because f(x)fθ(encoder(x)) and σ(x)σθ(encoder(x)). In the computational example that lead to Fig. 23, we used an encoder architecture of 6|25|25|25|2, a decoder architecture 2|25|25|25|6, and an SDE approximation network with diagonal diffusivity matrix and architecture 2|25|25|25|6, all with tf.nn.selu activations. The loss function for training the auto-encoder is the following, where LEuler--Maruyama is defined in the paper [Eq. (5)]:

(B1)

where B is the size of the current mini-batch (here: B=32). The Python class implementing this auto-encoder loss is called AEModel, and can be found in the file

sde/sde_learning_network.py of the repository.

The logarithm of the probability density p of a multivariate normal distribution in n dimensions is

(C1)
Proof.
This follows directly from
1.
K.
He
,
X.
Zhang
,
S.
Ren
, and
J.
Sun
, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2016).
2.
R.
Rico-Martínez
,
K.
Krischer
,
I.
Kevrekidis
,
M.
Kube
, and
J.
Hudson
, “
Discrete- vs continuous-time nonlinear signal processing of Cu electrodissolution data
,”
Chem. Eng. Commun.
118
(
1
),
25
48
(
1992
).
3.
T.
Bertalan
,
F.
Dietrich
,
I.
Mezić
, and
I. G.
Kevrekidis
, “
On learning hamiltonian systems from data
,”
Chaos
29
(
12
),
121107
(
2019
).
4.
S.
Greydanus
,
M.
Dzamba
, and
J.
Yosinski
, “Hamiltonian neural networks,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems (Curran Associates, Inc., 2019).
5.
A.
Zhu
,
P.
Jin
, and
Y.
Tang
, “Deep hamiltonian networks based on symplectic integrators,” arXiv:2004.13830 (2020).
6.
J.
Bongard
and
H.
Lipson
, “
Automated reverse engineering of nonlinear dynamical systems
,”
Proc. Natl. Acad. Sci. U.S.A.
104
(
24
),
9943
9948
(
2007
).
7.
S. L.
Brunton
,
J. L.
Proctor
, and
J. N.
Kutz
, “
Discovering governing equations from data by sparse identification of nonlinear dynamical systems
,”
Proc. Natl. Acad. Sci. U.S.A.
113
(
15
),
3932
3937
(
2016
).
8.
S.
Klus
,
F.
Nüske
,
S.
Peitz
,
J.-H.
Niemann
,
C.
Clementi
, and
C.
Schütte
, “
Data-driven approximation of the koopman generator: Model reduction, system identification, and control
,”
Phys. D
406
,
132416
(
2020
).
9.
I.
Mezić
, “
Spectral properties of dynamical systems, model reduction and decompositions
,”
Nonlinear Dyn.
41
(
1
),
309
325
(
2005
).
10.
A.
Rahimi
and
B.
Recht
, “Unsupervised regression with applications to nonlinear system identification,” in Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS’06 (MIT Press, Cambridge, MA, 2006), pp. 1113–1120.
11.
R.
Rico-Martínez
and
I.
Kevrekidis
,
Nonlinear System Identification Using Neural Networks: Dynamics and Instabilities
(
Elsevier Science
,
1995
), Chap. 16, pp.
409
442
.
12.
R. T. Q.
Chen
,
Y.
Rubanova
,
J.
Bettencourt
, and
D.
Duvenaud
, “Neural ordinary differential equations,” in NeurIPS Conference 2018 (Curran Associates Inc., 2018).
13.
N.
Geneva
and
N.
Zabaras
, “
Transformers for modeling physical systems
,”
Neural Networks
146
,
272
289
(
2022
).
14.
R.
González-García
,
R.
Rico-Martínez
, and
I.
Kevrekidis
, “
Identification of distributed parameter systems: A neural net based approach
,”
Comput. Chem. Eng.
22
,
S965
S968
(
1998
).
15.
L.
Lu
,
P.
Jin
,
G.
Pang
,
Z.
Zhang
, and
G. E.
Karniadakis
, “
Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators
,”
Nat. Mach. Intell.
3
(
3
),
218
229
(
2021
).
16.
D. J.
Higham
, “
An algorithmic introduction to numerical simulation of stochastic differential equations
,”
SIAM Rev.
43
(
3
),
525
546
(
2001
).
17.
K. C.
Zygalakis
, “
On the existence and the applications of modified equations for stochastic differential equations
,”
SIAM J. Sci. Comput.
33
(
1
),
102
130
(
2011
).
18.
R.
Rico-Martinez
,
J.
Anderson
, and
I.
Kevrekidis
, “Continuous-time nonlinear signal processing: A neural network based approach for gray box identification,” in Proceedings of IEEE Workshop on Neural Networks for Signal Processing (IEEE, 1994).
19.
D. T.
Gillespie
, “
A general method for numerically simulating the stochastic time evolution of coupled chemical reactions
,”
J. Comput. Phys.
22
(
4
),
403
434
(
1976
).
20.
D. T.
Gillespie
, “
The chemical Langevin equation
,”
J. Chem. Phys.
113
(
1
),
297
306
(
2000
).
21.
P.
Jin
,
Z.
Zhang
,
I. G.
Kevrekidis
, and
G. E.
Karniadakis
, “Learning Poisson systems and trajectories of autonomous systems via Poisson neural networks,” arXiv:2012.03133 (2020).
22.
X.
Li
,
T.-K. L.
Wong
,
R. T.
Chen
, and
D.
Duvenaud
, “Scalable gradients for stochastic differential equations,” arXiv:2001.01328 (2020).
23.
G. A.
Pavliotis
,
Stochastic Processes and Applications
(
Springer
,
New York
,
2014
).
24.
I.
Karatzas
and
S. E.
Shreve
,
Brownian Motion and Stochastic Calculus
(
Springer
,
New York
,
1998
).
25.
R.
Perez-Carrasco
and
J.
Sancho
, “
Stochastic algorithms for discontinuous multiplicative white noise
,”
Phys. Rev. E
81
(
3
),
032104
(
2010
).
26.
I.
Goodfellow
,
J.
Pouget-Abadie
,
M.
Mirza
,
B.
Xu
,
D.
Warde-Farley
,
S.
Ozair
,
A.
Courville
, and
Y.
Bengio
, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Curran Associates, Inc., 2014), pp. 2672–2680.
27.
A.
Jalal
,
A.
Ilyas
,
C.
Daskalakis
, and
A. G.
Dimakis
, “The robust manifold defense: Adversarial training using generative models,” arXiv:1712.09196v5 (2017).
28.
J.
Liu
,
Z.
Long
,
R.
Wang
,
J.
Sun
, and
B.
Dong
, “Rode-net: Learning ordinary differential equations with randomness from data,” arXiv:2006.02377 (2020).
29.
A.
Graves
, “Generating sequences with recurrent neural networks,” arXiv:1308.0850 (2013).
30.
Y.
Song
,
J.
Sohl-Dickstein
,
D. P.
Kingma
,
A.
Kumar
,
S.
Ermon
, and
B.
Poole
, “Score-based generative modeling through stochastic differential equations,” in International Conference on Learning Representations (ICLR, 2021).
31.
S.
Ditlevsen
and
P.
Lansky
, “
Estimation of the input parameters in the Ornstein-Uhlenbeck neuronal model
,”
Phys. Rev. E
71
(
1
),
011907
(
2005
).
32.
C.
Yildiz
,
M.
Heinonen
,
J.
Intosalmi
,
H.
Mannerstrom
, and
H.
Lahdesmaki
, “Learning stochastic differential equations with Gaussian Processes without gradient matching,” in 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP) (IEEE, 2018).
33.
S.
Yang
,
S. W. K.
Wong
, and
S. C.
Kou
, “
Inference of dynamic systems from noisy and sparse data via manifold-constrained Gaussian processes
,”
Proc. Natl. Acad. Sci. U.S.A.
118
(
15
),
e2020397118
(
2021
).
34.
H.
Arbabi
and
T.
Sapsis
, “Generative stochastic modeling of strongly nonlinear flows with non-Gaussian statistics,” arXiv:1908.08941 (2019).
35.
P.
Kidger
, “On neural differential equations,” Master’s thesis (University of Oxford, 2021).
36.
P.
Kidger
,
J.
Foster
,
X.
Li
, and
T. J.
Lyons
, “Neural SDEs as infinite-dimensional gans,” in Proceedings of the 38th International Conference on Machine Learning (PMLR, 2021), pp. 5453–5463.
37.
J.
Morrill
,
C.
Salvi
,
P.
Kidger
, and
J.
Foster
, “Neural rough differential equations for long time series,” in Proceedings of the 38th International Conference on Machine Learning (PMLR, 2021), pp. 7829–7838.
38.
C.
Fang
,
Y.
Lu
,
T.
Gao
, and
J.
Duan
, “An end-to-end deep learning approach for extracting stochastic dynamical systems with α-stable Lévy noise,” arXiv:2201.13114 [cs, stat] (2022).
39.
A.
Hasan
,
J. M.
Pereira
,
S.
Farsiu
, and
V.
Tarokh
, “
Identifying latent stochastic differential equations
,”
IEEE Trans. Signal Process.
70
,
89
104
(
2022
).
40.
Y.
Zhu
,
Y.-H.
Tang
, and
C.
Kim
, “Learning stochastic dynamics with statistics-informed neural network,” arXiv:2011.13456 [cs.LG] (2022).
41.
C.
Salvi
,
M.
Lemercier
, and
A.
Gerasimovics
, “Neural stochastic partial differential equations: Resolution-invariant learning of continuous spatiotemporal dynamics,” arXiv:2110.10249 [cs] (2021).
42.
L.
Yang
,
C.
Daskalakis
, and
G. E.
Karniadakis
, “Generative ensemble-regression: Learning stochastic dynamics from discrete particle ensemble observations,” arXiv:2008.01915v1 (2020).
43.
T.
Berry
,
D.
Giannakis
, and
J.
Harlim
, “
Nonparametric forecasting of low-dimensional dynamical systems
,”
Phys. Rev. E
91
(
3
),
3
(
2015
).
44.
D. P.
Kingma
and
M.
Welling
, “Auto-encoding variational Bayes,” in Proceedings of the 2nd International Conference on Learning Representations (ICLR) (Curran Associates, Inc., 2014).
45.
G. N.
Mil’shtejn
, “
Approximate integration of stochastic differential equations
,”
Theory Probability Appl.
19
(
3
),
557
562
(
1975
).
46.
J.
Lee
,
J.
Sohl-Dickstein
,
J.
Pennington
,
R.
Novak
,
S.
Schoenholz
, and
Y.
Bahri
, “Deep neural networks as Gaussian processes,” in International Conference on Learning Representations (Curran Associates, Inc., 2018).
47.
B.
Schölkopf
and
A. J.
Smola
,
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond
(
The MIT Press
,
2018
).
48.
A. J.
Roberts
, “Modify the improved euler scheme to integrate stochastic differential equations,” arXiv:1210.0933 [math] (2012).
49.
M. G.
Crandall
,
H.
Ishii
, and
P.-L.
Lions
, “
User’s guide to viscosity solutions of second order partial differential equations
,”
Bull. Am. Math. Soc.
27
,
1
67
(
1992
).
50.
M. G.
Crandall
and
P.-L.
Lions
, “
Viscosity solutions of Hamilton-Jacobi equations
,”
Trans. Am. Math. Soc.
277
(
1
),
1
42
(
1983
).
51.
I.
Kobyzev
,
S. J. D.
Prince
, and
M. A.
Brubaker
, “
Normalizing flows: An introduction and review of current methods
,”
IEEE Trans. Pattern Anal. Mach. Intell.
43
(
11
),
3964
3979
(
2021
).
52.
R.
Mannella
, “
Numerical stochastic integration for quasi-symplectic flows
,”
SIAM J. Sci. Comput.
27
(
6
),
2121
2139
(
2006
).
53.
J. B.
Walsh
, “
On numerical solutions of the stochastic wave equation
,”
Ill. J. Math.
50
(
1–4
),
991
1018
(
2006
).
54.
R. R.
Coifman
and
S.
Lafon
, “
Diffusion maps
,”
Appl. Comput. Harmonic Anal.
21
(
1
),
5
30
(
2006
).
55.
M.
Abadi
,
A.
Agarwal
,
P.
Barham
,
E.
Brevdo
,
Z.
Chen
,
C.
Citro
,
G. S.
Corrado
,
A.
Davis
,
J.
Dean
,
M.
Devin
,
S.
Ghemawat
,
I.
Goodfellow
,
A.
Harp
,
G.
Irving
,
M.
Isard
,
Y.
Jia
,
R.
Jozefowicz
,
L.
Kaiser
,
M.
Kudlur
,
J.
Levenberg
,
D.
Mané
,
R.
Monga
,
S.
Moore
,
D.
Murray
,
C.
Olah
,
M.
Schuster
,
J.
Shlens
,
B.
Steiner
,
I.
Sutskever
,
K.
Talwar
,
P.
Tucker
,
V.
Vanhoucke
,
V.
Vasudevan
,
F.
Viégas
,
O.
Vinyals
,
P.
Warden
,
M.
Wattenberg
,
M.
Wicke
,
Y.
Yu
, and
X.
Zheng
, see www.tensorflow.org for “TensorFlow: Large-scale Machine Learning on Heterogeneous Systems,” 2015.
56.
A. G.
Makeev
,
D.
Maroudas
, and
I. G.
Kevrekidis
, “
Coarse stability and bifurcation analysis using stochastic simulators: Kinetic Monte Carlo examples
,”
J. Chem. Phys.
116
(
23
),
10083
10091
(
2002
).
57.
A. G.
Makeev
,
D.
Maroudas
,
A. Z.
Panagiotopoulos
, and
I. G.
Kevrekidis
, “
Coarse bifurcation analysis of kinetic Monte Carlo simulations: A lattice-gas model with lateral interactions
,”
J. Chem. Phys.
117
(
18
),
8229
8240
(
2002
).
58.
L. J.
Allen
, “
A primer on stochastic epidemic models: Formulation, numerical simulation, and analysis
,”
Infect. Dis. Modell.
2
(
2
),
128
142
(
2017
).
59.
W. O.
Kermack
and
A. G.
McKendrick
, “
A contribution to the mathematical theory of epidemics
,”
Proc. R. Soc. London, Ser. A
115
(
772
),
700
721
(
1927
).
60.
C.
Siettos
and
L.
Russo
, “
Mathematical modeling of infectious disease dynamics
,”
Virulence
4
,
295
306
(
2013
).
61.
A. B.
Bortz
,
M. H.
Kalos
, and
J. L.
Lebowitz
, “
A new algorithm for Monte Carlo simulation of Ising spin systems
,”
J. Comput. Phys.
17
(
1
),
10
18
(
1975
).
62.
A. G.
Makeev
and
N. L.
Semendyaeva
, “
A basic lattice model of an excitable medium: Kinetic Monte Carlo simulations
,”
Math. Models Comput. Simul.
9
(
5
),
636
648
(
2017
).
63.
K.
Burrage
,
P. M.
Burrage
, and
T.
Tian
, “
Numerical methods for strong solutions of stochastic differential equations: An overview
,”
Proc. R. Soc. London, Ser. A.
460
(
2041
),
373
402
(
2004
).
64.
J.
Anderson
,
I.
Kevrekidis
, and
R.
Rico-Martinez
, “
A comparison of recurrent training algorithms for time series analysis and system identification
,”
Comput. Chem. Eng.
20
,
S751
S756
(
1996
).
65.
D.
Lehmberg
,
F.
Dietrich
,
G.
Köster
, and
H.-J.
Bungartz
, “
Datafold: Data-driven models for point clouds and time series on manifolds
,”
J. Open Source Softw.
5
(
51
),
2283
(
2020
).
66.
T.
Berry
,
J. R.
Cressman
,
Z.
Gregurić-Ferenĉek
, and
T.
Sauer
, “
Time-scale separation from diffusion-mapped delay coordinates
,”
SIAM J. Appl. Dyn. Syst.
12
(
2
),
618
649
(
2013
).
67.
C. J.
Dsilva
,
R.
Talmon
,
R. R.
Coifman
, and
I. G.
Kevrekidis
, “
Parsimonious representation of nonlinear dynamical systems through manifold learning: A chemotaxis case study
,”
Appl. Comput. Harmonic Anal.
44
(
3
),
759
773
(
2018
).
68.
F.
Deitrich
, “
Data and code for SDE identification
,”
GitLab
,
2022
. https://gitlab.com/felix.dietrich/sde-identification