We identify effective stochastic differential equations (SDEs) for coarse observables of fine-grained particle- or agent-based simulations; these SDEs then provide useful coarse surrogate models of the fine scale dynamics. We approximate the drift and diffusivity functions in these effective SDEs through neural networks, which can be thought of as effective stochastic ResNets. The loss function is inspired by, and embodies, the structure of established stochastic numerical integrators (here, Euler–Maruyama and Milstein); our approximations can thus benefit from backward error analysis of these underlying numerical schemes. They also lend themselves naturally to “physics-informed” gray-box identification when approximate coarse models, such as mean field equations, are available. Existing numerical integration schemes for Langevin-type equations and for stochastic partial differential equations can also be used for training; we demonstrate this on a stochastically forced oscillator and the stochastic wave equation. Our approach does not require long trajectories, works on scattered snapshot data, and is designed to naturally handle different time steps per snapshot. We consider both the case where the coarse collective observables are known in advance, as well as the case where they must be found in a data-driven manner.
We discuss the data-driven approximations of drift and diffusivity functions in effective stochastic differential equations (SDEs) through neural networks. The loss function used in training the neural networks embodies the structure of established stochastic numerical integrators, which allows us to work with data from very brief observations of the system. We apply this learning method to time series data of coarse observables from fine-grained particle- or agent-based simulations. The learned SDEs provide useful coarse surrogate models of the fine scale dynamics.
I. INTRODUCTION
Residual networks (ResNets,1 but see also Ref. 2) successfully employ a forward-Euler integrator based approach to create useful deep architectures. This has inspired follow-up work to include other integrator schemes for deterministic, ordinary differential equations (ODEs) such as symplectic integrators.3–5 In this work, we propose to include stochastic integrators in the search for appropriate loss functions for training network architectures, to improve system identification and prediction tasks.
Identification of dynamical systems from data is an established systems task.6–11 Machine learning both in its “second spring” in the 1980s–1990s, and its current “third spring” has made a major impact on the approaches and algorithms for this task. From ResNets1 and Neural ODEs2,12 to transformers,13 “neural PDEs,”14 and DeepONets,15 the identification of dynamical systems from (discrete) spatiotemporal data is a booming business because it is now comparatively easy to program and train neural networks. Learning SDEs and stochastic partial differential equations (SPDEs) is a natural “next step” and there is also an array of excellent contributions to this as we outline in the related work section below. The time-honored SDE estimation techniques for local drift and diffusivity can now be synthesized in a (more or less global) surrogate model.
Given all these recently available options, our contribution is as follows. On the application side, we were motivated by the identification of coarse, or effective, SDEs, which are often used as models for macroscopic statistics (e.g., leading moments) of more fine-grained physics simulations (e.g., atomistic, stochastic, or agent-based). This is the reason for part of our title, and for some of the examples: lattice simulations of epidemic models.
There is also, however, a methodological motivation whose roots are in the numerical analysis of integration schemes for SDEs/SPDEs. These established algorithms (starting with Euler–Maruyama16) iteratively produce discrete time data; it makes sense to “mathematically inform” the neural network architecture used for identification with the structure of the numerical integrator one would use on an explicitly available, closed form a coarse stochastic model. This way, our work naturally links with, and can take advantage of, the extensive body of work in what is called “backward error analysis” of these integration schemes.17
A third motivation for our work: beyond its links to backward error analysis, the approach naturally lends itself to the what one might call “physics-informed” or “gray-box” identification algorithms: neural network architectures that seamlessly incorporate approximate or partially known analytical models if these are available. We developed such an approach for deterministic ODE identification in the early 1990s, devising recurrent Runge–Kutta ResNet architectures for this purpose, and extending them to gray boxes and implicit ResNets.2,14,18 It is a natural next step to extend these techniques so that they can be applied to the identification of effective stochastic models, allowing the possibility of physics-informed gray boxes, and exploiting mathematical support via numerical backward error analysis. Learned, black-box models for agent-based systems may be used in malicious or questionable ways in technology and society. An example could be careless, uncontrolled generalization in predicting the behavior of physical or societal systems using learned models. Using gray-box, physics-informed machine learning methods that are amenable to thorough mathematical analyses and interpretability can help mitigate such risks, and our optimization approach is specifically designed for this type of knowledge integration and analysis.
Our contributions are as follows:
We introduce a neural network training loss function that allows us to approximate the state dependent drift and diffusivity of an SDE from discrete observation data.
The approach does not require long time series: paired snapshots , scattered over the domain of interest, are sufficient. Step sizes can vary for each snapshot.
The approach works point-wise: No distributions need to be approximated from data, nor do we need to compute distances between distributions.
We formulate explicit loss functions for Euler–Maruyama- and Milstein-type integration schemes using Gaussian noise.
For data sampled from SDE of Langevin-type, where the assumption of normally distributed noise in the state variable does not hold anymore, we show how to reformulate simple numerical schemes to enable accurate approximations of the drift and diffusivity in the acceleration terms.
We show how a given numerical integration scheme for stochastic PDE can be reformulated to match our loss function and demonstrate it by learning forcing terms in the stochastic wave equation.
We demonstrate that we can employ the approach to coarse-grain particle/lattice simulations; in particular, we approximate the drift and diffusivity of a two-dimensional mean field SDE for a Gillespie-based19,20 kinetic mechanistic model, as well as for the kinetic Monte Carlo lattice simulation of an epidemiological SIR model, leading to those kinetics at the mean field limit.
II. RELATED WORK
Loss functions and network architectures informed by numerical integration methods were devised several decades ago, e.g., Refs. 2 and 18. Recently, these ideas have been rediscovered and used in deep architectures such as ResNets,1 with symplectic schemes for Hamiltonian3–5 and Poisson networks.21 Scalable gradient approaches were developed for SDE approximation recently,22 which allow backpropagation through a long series of SDE iterations. If only a single time step for each data point is available, as in our work, this type of scalability is not an immediate training issue.
The theory of SDE is well developed; we refer to standard books on stochastic processes,23 stochastic calculus,24 and stochastic algorithms.25 In the broader machine learning community, SDEs are used mostly for generative modeling, meaning that the actual drift and diffusivity functions learned from the data are not informative, as long as the generated distribution is “acceptable.” A classical example are generative-adversarial networks.26,27 In recent years, learning random ODEs (deterministic ODE with random parameters) has been done using an approach with generative-adversarial networks.28 Long–short-term-memory networks were used in a stochastic setting for the generation of hand-written text.29 The authors of Ref. 30 present a framework that learns an SDE for training and sampling from score-based models, allowing sample generation. In a classical dynamical systems context, the SDE itself is the desired outcome of approximation and thus, typically, different methods are necessary. Parameter estimation techniques for Ornstein–Uhlenbeck processes with constant coefficients are readily available.31 Gaussian processes have been used to identify SDE from observations,32 with a target likelihood based on the Euler–Maruyama scheme (not explicitly mentioned by the authors). Our setting is very different from identifying ODE from noisy observations, for example, through Gaussian processes.33 Generative stochastic modeling of strongly nonlinear flows with non-Gaussian statistics is also possible.34 Researchers studying neural SDEs have studied their relation to Wasserstein-GANs (generative adversarial networks with Wasserstein loss)35,36 and for long time series.37 Since a pre-print of this paper appeared, other pre-prints expanded on it by learning SDEs with other noise types, e.g., Lévy-type.38 Another research direction is to identify the correct parameters of a latent SDE, given partial observations of it.39 This requires special assumptions on the latent space and the observation functions. Propagators for SDE have also been covered in the recent literature, with loss functions for moment functions and distributions.40 These approaches differ from our work in that they do not learn using point-wise on snapshots, but instead require accurate ensemble averages and corresponding large batch sizes. Spectral-Galerkin type SPDE solvers using neural networks have also been developed.41 Such approaches differ from our work in that they learn the solution to a (possibly unknown) SDE, not the SDE itself. Density and ensemble approximation techniques42 provide a complementary view to the particle-like approach with SDE. Stochastic dynamical systems can also be described through the associated Fokker–Planck operator. See, e.g.,43 for a non-parametric forecasting method using the “shift-map,” a numerical approximation of the operator projected on the eigenfunctions of the Laplace operator.
III. MATHEMATICAL PROBLEM SETTING
We discuss SDEs of the form
where and are smooth, possibly nonlinear functions; every is positive and bounded away from zero (and a positive-definite matrix if ), and is a Wiener process such that for , . See Ref. 23 for an introduction to stochastic processes. We assume we have access to a set of snapshots where are points scattered in the state space of (1) and the value of results from the evolution of (1) under a small time step , starting at . The snapshots are samples from a distribution, , and the transition densities are associated with (1) for a given time step , which in turn is chosen from some distribution . The joint data-generating distribution is, therefore, given by . Alternatively, the data could be collected along a long trajectory of (1) with sample frequency , that is, . The problem is to identify the drift and diffusivity through two neural networks and , parameterized by their weights , only from the data in . We assume the points in are sampled sufficiently densely in the region of interest.
As we will see, each choice of SDE integrator will give a different training algorithm, with a loss based on the particular discretization scheme; training algorithms based on integrators of different accuracy, integrators with different rates and types of convergence (strong or weak), and even integrators based on different stochastic calculus type (Ito vs Stratonovich) fall under this umbrella. We will demonstrate these ideas here through two Ito-calculus based training algorithms: the Euler–Maruyama and the Milstein schemes.
A. Identification of drift and diffusivity with the Euler–Maruyama scheme
We now formulate and rationalize the loss term that we use to train the neural networks and . The Euler–Maruyama scheme is a simple method to integrate (1) over a small time ,
where is small and is a vector of random variables, all i.i.d. and normally distributed around zero with variance . The convergence of (2) for has been studied at length; we refer to the standard literature.23 We can use this idea to construct a loss function for training the networks and simultaneously. We initially restrict the discussion to the case for simplicity. Essentially, conditioned on and , we can think of as a point drawn from a multivariate normal distribution
In the training dataset , we only have access to triples and not the drift and diffusivity . To approximate them, we define the probability density of the normal distribution (3) and then, given the neural networks and , ask that the log-likelihood of the data under the assumption in Eq. (3) is high,
We can now formulate the loss function that will be minimized to obtain the neural network weights . The logarithm of the well-known probability density function of the normal distribution, together with the mean and variance from (3), yields the loss to minimize during training,
This formula can easily be generalized to higher dimensions (see Appendixes A– C), and we use such generalizations, for example, in two and three dimensions. Minimizing in (5) over the data implies maximization of the log marginal likelihood (4) with the constant terms removed (as they do not influence the minimization).23 Likelihood estimation in combination with the normal distribution is used in many variational and generative approaches.22,26,30,32,33,44 Note that here, the step size is defined per snapshot, so it is possible that it has different values for every index . This will be especially useful in coarse-graining Gillespie simulations, where the time step is determined as part of the scheme.
We note that the specific form (3) of the transition probabilities induced by the Euler–Maruyama method implies that it will, in general, be impossible to exactly reproduce the data-generation density for any finite step size . This motivates the application of more refined numerical methods, such as the Milstein scheme considered in Sec. III B. Alternatively, one can think of as being induced by an SDE (1) with modified drift and diffusion . A “modified equation” is a selected ODE that is approximating the results of a fixed numerical scheme. This is the opposite of “solving an ODE with a numerical scheme,” where the ODE is fixed and a proper scheme must be selected to solve it. Modified equations have been derived for a variety of numerical methods, such as the Euler or the Milstein method.17 In general, there is no second-order modified equation for the Euler–Maruyama method, while one exists for the Milstein scheme. Modified equation analysis in our setting is typically performed with equal step sizes for all data points.
B. Identification of drift and diffusivity with the Milstein scheme
We now discuss how the loss function can be based on the Milstein integration scheme,45 to illustrate complexities that arise when deviating from simple Euler schemes, even in one dimension. For , the Milstein scheme is a more accurate method to integrate (1) and is given through
where is small and is a random variable, normally distributed around zero with variance . For scheme (6), it is also possible to write down the distribution of , similar to Eq. (3). It is no longer normal, however, because of the square of the variable . To determine the distribution of , we define
Then, for a random variable , we have Assuming ,
The variable has a non-central distribution with one degree of freedom and non-centrality parameter . With this derivation, we can see that is distributed according to a linear transformation of a non-central distribution. The probability density of is
where is a modified Bessel function of the first kind. We can now use the function to define the probability density of , given ,
The corresponding point-wise loss function then is very similar to (5),
Equation (7) defines a probability density that can vanish over large portions of its domain, which leads to issues both with log-marginal likelihood estimation and with gradient computations. These issues can be mitigated through regularization, but also through kernel density approximations, which we outline in Sec. III C. To illustrate the difference of the probability distributions for the Euler–Maruyama and the Milstein templates, we plot them for , , in Fig. 1 against sampled points using the two templates with the starting point .
C. Approximation of the probability density of numerical integration schemes
Inspired by the iteration procedures use to discover kernels for deep Gaussian processes46 and kernel density estimates,16,47 we now describe an iterative method to approximate the probability density of other numerical integration schemes, such as Runge–Kutta type integrators.48 This will be useful in cases for which the density cannot be derived in analytical, closed form or for schemes with potentially locally vanishing probability density, like the Milstein method.
Consider an arbitrary (but convergent and accurate) numerical integration scheme for (1), so that for a fixed step size and a noise process , a single time step becomes
Similar to Secs. III A and III B, given the scheme as well as the approximate drift and diffusivity parametrized by , we need to approximate the probability density . For schemes that rely on noise processes that can be sampled easily (e.g., Gaussian noise as in Euler–Maruyama and Milstein schemes), the key idea for our numerical approximation is the following. We begin by sampling a set of points from the distribution of . Then, we use the SDE integration scheme to estimate candidates for the next point in time, given and . Finally, we describe the density of these candidates through a mixture of Gaussians, each with a small standard deviation so that the actual probability distribution is approximated as
The approximation accuracy of (10) improves as and ; for a discussion of this, we refer to the standard literature on kernel density estimation16,47 and viscosity solutions for PDE.49,50Figure 2 demonstrates this convergence for varying , with fixed , , , , , and the Milstein scheme with the exact form for given through Eq. (7). Many other density estimation methods are applicable here as well, such as generative-adversarial networks26 or normalizing flows.51
D. Underdamped Langevin equations
Up to this point, we discussed SDE where all coordinates are directly affected by noise. Second-order SDE of Langevin-type cannot be accurately described in this way, because the diffusivity is exactly zero for half of the coordinates,
The density of is no longer Gaussian, and it would seem that we require a loss function other than (5) to approximate drift and diffusivity . Equation (11) can be approximated using the (symplectic) Euler–Maruyama scheme
This will not approximate the stationary distribution well, and other, more accurate schemes exist.52 Still, since we only assume access to snapshots and do not process longer time series, we can effectively view in (12) as a fixed parameter for the functions over a time span of . Then, is again normally distributed, conditioned on and step size , similar to (3). Thus, loss function (5) can be used, but now only for the dynamics of , when given the corresponding values of , to learn approximations and . Including the symplectic adaptation which uses instead of to predict , we must, thus, minimize
E. Stochastic partial differential equations
Equation (1) defines a stochastic differential equation in finite-dimensional space, s.t. . Stochastic partial differential equations (SPDEs) describe properties of functions embedded in infinite-dimensional spaces. In the context of agent-based systems, a description of the system behavior though an SPDE is useful if the number of agents is very large, but finite-size effects still play a role, so that a deterministic description though a deterministic PDE would be too restrictive. There are no generically applicable numerical schemes for SPDEs similar to the Euler–Maruyama scheme (2), and numerical methods of higher-order also typically do not exist.53 However, in certain cases, we can reformulate existing numerical schemes for specific SPDEs and still learn parts of its law in a black- or gray-box setting. For example, we can start with a stochastic wave equation that has both deterministic driving and stochastic forcing , and initial conditions ,
To solve Eq. (14), we use a numerical algorithm suggested by Walsh.53 We refer to the paper53 for details and only summarize the main ideas here. To start, we choose a small step size and discretize space and time into points , . Then, we define a staggered grid on which we construct the solution, as shown in Fig. 3. The center grid points are excluded in the discretization, no solution is approximated on it. The following numerical scheme then approximates the solution to the stochastic wave equation on the grid:
Then,
which is the same scheme as (2). Given data , we can learn the forcing terms through minimization of the loss (5). In this paper, we only considered the autonomous case, where and do not depend on time. Note that the SPDE (14) is also of Langevin type, and the reformulation of the Euler–Maruyama scheme here corresponds to the reformulation explained in Sec. III D.
F. Exploiting latent spaces
In this paper, we consider cases where the coarse variables are already known (e.g., their mean field or quasi-chemical approximation variables). However, many observations in the real world do not immediately reveal the smallest dimensionality of the space in which the dynamics evolve. Latent space identification and related unsupervised learning are possible with a wide variety of techniques, also for SDE.39 If a latent space must be identified, we propose to learn it with Diffusion Maps54 as well as an SDE-integrator-informed auto-encoder with the SDE loss (5) augmented by the log mean squared loss: . The Diffusion Map coordinates define a latent space; learning an SDE on these new variables, and then mapping back to the original data is thus similar to latent SDE identification.22 With Diffusion Maps, we first identify coarse variables and then learn the effective SDE. This has benefits compared to an end-to-end auto-encoder approach, because by construction, the latent coordinates are invariant to isometry and sampling density in the ambient space.54 Learning the latent coordinates together with the drift and diffusivity functions instead leads to a latent space that is adapted to the specific SDE. This construction usually does not have the same invariance properties as the coordinates from Diffusion Maps. A description of the methods to obtain these latent spaces and the identification of SDE on them for kMC lattices can be found in Appendix B 6.
IV. COMPUTATIONAL EXPERIMENTS
We first illustrate the neural network identification techniques on toy examples, and then show an application to learning effective SDEs from lattice dynamics simulations. Table I lists important training parameters. All computational experiments were performed on a single desktop computer with an AMD Ryzen 7 3700X 8-Core Processor, 32GB RAM, and a NVIDIA GeForce GTX 1070 GPU. Training did not take longer than 30 min for each case, data generation did not take longer than two hours. The neural networks and loss functions were defined in TensorFlow V2.4.155 (Apache 2.0 license). For the toy examples, we created snapshots by integrating the SDEs with 10 steps of Eq. (2) from to , only storing the initial and final result. The given network topology was used twice, separately for drift and diffusivity. If not otherwise stated, we use a test/validation split of 90%/10%, the ADAM optimizer with standard parameters learning_rate = , , , , batch size of 32, 100 epochs, and exponential linear unit activation functions.
Example . | h . | # points . | Network topology . | Loss . | Train . | Validation . |
---|---|---|---|---|---|---|
Section IV A: cubic, 1D | 0.01 | 10 000 | 1|50|50|1 | (5) | −0.003 54 | −0.0111 |
Section IV A: cubic, 1D | 0.01 | 10 000 | 1|50|50|1 | (7) | −0.0010 | −0.0112 |
Section IV A: linear, 3D | 0.25 | 15 000 | 3|50|50|3/6 (drift) | (5) | −7.60 | −7.63 |
3|50|50|6 (diff.) | ||||||
Section IV A: 1D–19D | 0.01 | 10 000 | n|50|50|n | (5) | Varies | Varies |
Section IV B a: Gillespie | Varies | 90 000 | 2|50|50|2 | (5) | −10.9 | −11.0 |
Section IV B b: kMC lattice | Varies | 90 000 | 2|50|50|2 | (5) | −11.7 | −11.8 |
Section IV C: Langevin | 0.05 | 10 000 | 2|50|50|2 | (13) | −0.44 | −0.42 |
Section IV D: wave SPDE | 0.001 | Varies | 2|50|50|2 | (5) | Varies | Varies |
Example . | h . | # points . | Network topology . | Loss . | Train . | Validation . |
---|---|---|---|---|---|---|
Section IV A: cubic, 1D | 0.01 | 10 000 | 1|50|50|1 | (5) | −0.003 54 | −0.0111 |
Section IV A: cubic, 1D | 0.01 | 10 000 | 1|50|50|1 | (7) | −0.0010 | −0.0112 |
Section IV A: linear, 3D | 0.25 | 15 000 | 3|50|50|3/6 (drift) | (5) | −7.60 | −7.63 |
3|50|50|6 (diff.) | ||||||
Section IV A: 1D–19D | 0.01 | 10 000 | n|50|50|n | (5) | Varies | Varies |
Section IV B a: Gillespie | Varies | 90 000 | 2|50|50|2 | (5) | −10.9 | −11.0 |
Section IV B b: kMC lattice | Varies | 90 000 | 2|50|50|2 | (5) | −11.7 | −11.8 |
Section IV C: Langevin | 0.05 | 10 000 | 2|50|50|2 | (13) | −0.44 | −0.42 |
Section IV D: wave SPDE | 0.001 | Varies | 2|50|50|2 | (5) | Varies | Varies |
A. Toy examples: Cubic drift, linear diffusivity, and higher dimensions
We tested our implementation with examples in one dimension first, with drift and diffusivity defined through . A comparison between the learned and true functions is shown in Fig. 4, along with densities obtained from sampling the true and approximate SDE. We trained networks with both Euler–Maruyama and Milstein loss functions, with very comparable results—even the training and validation curves were very similar. When increasing the dimension from to , approximation quality decreases after . Non-diagonal diffusivity matrices were approximated accurately for (see Appendix B 3).
B. Learning coarse, effective SDE from fine-grained (particle) simulations
We validate our approach through a well-studied model for which analytical approximations at several levels of coarse-graining are available. The problem is the dynamics of epidemics on a lattice, which we briefly outline below. We consider two different versions of the model, and “learn” the corresponding effective SDEs. The first version at the finest scale involves Monte Carlo lattice simulations; the second version involves Gillespie simulation of the mechanistic scheme implied by the mean field model. Not surprisingly, if we use mean field SDE data, we discover an approximation to the mean field SDE. However, we find an approximation to dynamics in the mean field variables which is different than the mean field SDE, if we get data from Gillespie/lattice simulations and coarse-grain it in terms of mean field variables.56,57
Epidemic SIRS model: We consider a hierarchical sequence of three stochastic models describing the dynamics of the classical SIR (susceptible–infected–recovered)-type epidemic process.58–60 The first, most detailed model is represented by kinetic Monte Carlo (kMC) simulations on a lattice. The kMC model considers a two-dimensional square lattice of size with periodic boundary conditions, where each site contains a single individual. In accordance to the health state, each individual can be in one of three states: Susceptible (S), Infected (I), or Recovered (R). The elementary reactions (independent Poisson processes) occurring on a lattice are as follows: (1) (rate constant ; infection), (2) (rate constant ; recovery with immunity), (3) (rate constant ; loss of immunity).
The infection spreads through nearest neighbor bonds only; thus the first reaction may occur only when I and S are the first neighbors on a lattice. Elementary events take place with transition probabilities (). When , the model is called as susceptible–infected–recovered–susceptible (SIRS), while the SIR model is a particular case of the SIRS model with . In addition to steps (1)–(3), migration of individuals occurring via the exchange mechanism is considered: (4) (rate constant ), and (5) (rate constant ). Here, the sites and are first neighbors on a lattice. The time evolution of the probability distribution for the populations of individuals on a lattice can be described by the chemical master equation, derived from the Markov property of the underlying stochastic process. To generate realizations of this process, we applied the so-called rejection-free direct kMC algorithm,19,61 which produces a sequence of configurations and the times when they change from one to the other. It is interesting to mention that the same lattice kMC model can be formulated in a mathematically equivalent form using the following three types of states: excited (infected), refractory (recovered), or quiescent (susceptible). In this case, the model may reproduce the typical patterns of excitable media such as traveling pulses and spiral waves.62 Furthermore, we assume that . In essence, the migration rate may take into account quarantine restrictions and other regulations in place to keep the infection from spreading. Small values correspond to strong restrictions. If a spatially homogeneous, random distribution of all individuals on a lattice is assumed, the problem can be significantly simplified. Such a homogeneous distribution can be achieved in the “fast migration” limit .
The second stochastic model we study is represented by the Gillespie’s stochastic simulation algorithm (SSA19) that is used to simulate a well-stirred, spatially homogeneous system. For the SIRS model, SSA operates in terms of three integer stochastic variables: the number of susceptible individuals, , the number of infected individuals, , and the number of recovered individuals, ; their concentrations are calculated as: . The mass balance implies that . At each time step of the SSA algorithm the following three rates are calculated through Then, an event to be performed is selected with a probability proportional to its rate and the numbers are updated accordingly.
The third stochastic model is represented by a set of stochastic differential equations (SDEs) of the Langevin type which, under certain assumptions, approximates the results of the SSA algorithm.20 The SDE model at finite is as follows:
Here, are independent Wiener processes such that . Only two of three SDEs must be considered, since . In the limit , the SDE system transforms to the standard mean field SIRS model. In particular, for the SIR model, this SDE system reduces to having a diagonal diffusivity matrix
Equation (16) or (17) can be integrated using the Euler–Maruyama method with fixed time step . The SDE models approximate the solution of both SSA and kMC models. The following parameters and initial conditions are used in the simulations: , , ; , . In the limit of infinite , the mean field equations simplify to
We assume that SSA and kMC simulations generate trajectories which can be approximated by the solution of two latent SDEs with diagonal diffusion matrix, similar to Eq. (17). Under this assumption, the neural network was used to learn the unknown drift and diffusion terms at the fixed values of the model parameters , and . Training was performed using a set of short-time stochastic simulations on a lattice with the random, but physically-relevant initial conditions for the average concentrations , . The initial species distribution on a lattice was also random. During each independent SSA and kMC simulation run, the data points were evaluated at the time moments , where . The values of and are the fixed parameters of the algorithm. Note that a time step in the SSA/kMC algorithm is a random variable. The time is advanced via an increment given from the exponential distribution: , where is the total reaction rate and is a uniformly distributed random number. Therefore, are close to , but they are not constant. Each data point used for training contains the following five values: . The total number of data points is . The trained network approximates the unknown drift and diffusion terms at any given values of . Thus, the SDE approximation in combination with the Euler–Maruyama method provides a time-stepper (decoder), which can reproduce the dynamics.
a. Gillespie simulations
Figure 5 shows that a network can be trained to match the results of SSA simulations. A distribution function (histogram) of the time steps is plotted in Fig. 5(a). It demonstrates that our approach works with variable time steps in the training dataset. Three sample paths of and and the average of paths are shown in Figs. 5(b) and 5(c).
b. Kinetic Monte Carlo lattice simulations
Figure 6 shows that our approach successfully reproduces the results of kMC simulations on a lattice. The Euler–Maruyama method with a fixed time step was used in numerical integration of the network SDE approximation. Figure 6(a) demonstrates three sample paths obtained using the neural network (blue) and kMC (red, “true solution”). The averaged paths are presented in Fig. 6(b). A sufficiently high accuracy network approximation is demonstrated in Figs. 6(c) and 6(d), where the probability density functions for and at are shown. They were estimated by NN and kMC and are visually identical.
The described examples consider the lattice SIR model which has only the trivial steady states with . When , the system exhibits richer dynamic behavior. Simulations with were performed as well, and the network model is capable of reproducing the correct steady state and damped oscillations (the mean field SIRS model has a focus-type steady state at these values of parameters).
This is actually surprising: When the SIRS model is considered, the diffusion matrix is not diagonal, because there are three different reactions. Nevertheless, the trained network is capable here of approximating the kMC results, even when only trained with a diagonal diffusivity matrix.
C. Learning underdamped Langevin dynamics
The reformulation of the loss function (5) to (13) introduced in Sec. III D allows us to also learn non-Gaussian dynamics of Langevin-type. We will now demonstrate this on a simple example, and show that incorporating the assumption of Langevin-type dynamics for data from Langevin systems is necessary for accurate approximation of drift and diffusivity functions. The system we investigate is defined through one-dimensional state and velocity , such that
where , , with potential , and . When approximating and , we assume they depend on both and , which creates a more challenging learning problem. As a loss function, we use (13). The data from the spatial coordinate is used as a fixed parameter for each given snapshot . The result is shown in Fig. 7, with probability densities for different times shown in Fig. 8.
When using loss function (5) for both dimensions instead, we effectively (incorrectly) assume the SDE (1) instead of (11). In this case, the training fails to converge, as shown in Fig. 9, even when trained with 20 times the number of epochs. The final loss value in this experiment is and would diverge to with increasing number of data points. The issue is also apparent in the loss evaluated during training, as shown in Fig. 10.
D. Learning forcing terms in SPDEs
We now demonstrate learning deterministic driving, , and stochastic forcing, , terms in a stochastic PDE, using the reformulation of the numerical scheme (15) with step size . The training data in Fig. 11 were generated using autonomous, spatially varying forcing terms,
We use our methodology to identify approximations of these two functions with pairs of training data. In addition to this experiment, Figure 12 shows how the network approximations converge to the true functions (which would be unknown in a real setting), when the number of available data points (triples ) is increased while keeping the number of network parameter updates in one training run constant at approximately . We perform ten training runs per dataset size to illustrate variability of the results due to stochastic training and data point selections. The loss (5) used during training and validation is approximately constant for all datasets of size , indicating convergence of training for all cases where the number of data points surpasses the number of trainable parameters (including weights and biases, we use a network with 5602 trainable parameters).
V. DISCUSSION
Training neural networks with loss functions based on numerical integrators such as (5) and (8), but also (13), has several limitations. If the time step between many samples is too small, we cannot accurately identify the drift, because the diffusivity term will dominate. This could be mitigated by starting with small , estimating the diffusivity, and then estimating the drift with subsampled trajectories. Even with infinite data, the drift is difficult to estimate because the time step also has to go to zero.23 Conversely, if the time step is too big, we cannot accurately identify the diffusivity. If the dynamics of the coarse-grained observables include rare events, then learning the corresponding SDE is a challenge, because these events will not be present in a lot of snapshots, and hence, not a lot of data are available to learn them. Using a loss function based on Lévy noise SDE integrators38 might provide a useful surrogate for SDEs with rare events.
There are many possible extensions and applications of our work. An important task is to find explicit probability density functions for integrators using other noise types, such as Lévy or Poisson noise,38 or integrators of higher order.48 Integrating additional noise from measurement devices on top of the inherent stochasticity of the dynamics will also be important. Analyzing integrators based on other types of calculus such as Heun’s63 method may help impose more variegated types of priors during training. It is a predictor–corrector method, based on Stratonovich calculus and will give rise to implicit neural networks.11,64 Using the adjoint method28 allows us to propagate the loss gradients back through longer time series. The choice of numerical approximation (9) entails a systematic error for the data generating distribution, which can impact adversely on the capability to generalize beyond the training data. This effect can be minimized by higher-order numerical methods and analyzed through modified equation analysis.17 In terms of applications, many more particle- and agent-based models can be coarse-grained. For many of these, it is important to find coarse variables, and latent space techniques such as VAE-type architectures and Diffusion Maps discussed in Sec. III F will help in identifying them. These techniques can help to construct local SDE models in the process of larger-scale simulations to guide sampling and exploration. Another clear next step based on numerical analysis is the identification of SPDEs, for which we have demonstrated how our approach can be used to approximate forcing terms. Identification of entire SPDEs from data, including all terms, is still an open issue.
ACKNOWLEDGMENTS
The authors wish to dedicate this paper to the memory of Professor Alexei Makeev, a key contributor to the work, a friend, and an inspiring and knowledgeable collaborator, whom we lost to Covid last year. We thank A. J. Roberts for insightful discussions. The research of F.D. has been partially funded by the Deutsche Forschungsgemeinschaft (DFG)—Project No. 468830823. The research of S.R. has been partially funded by the Deutsche Forschungsgemeinschaft (DFG)—Project No. 318763901-SFB1294. The work of I.G.K., N.E., and T.B. was partially supported by the U.S. Department of Energy and the U.S. Air Force Office of Scientific Research.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Felix Dietrich: Conceptualization (equal); Data curation (lead); Formal analysis (lead); Funding acquisition (equal); Investigation (equal); Methodology (lead); Project administration (equal); Resources (equal); Software (lead); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (lead). Alexei Makeev: Data curation (equal); Methodology (equal); Software (supporting); Validation (equal); Visualization (equal); Writing – original draft (supporting); Writing – review & editing (supporting). George Kevrekidis: Software (supporting); Validation (supporting); Writing – original draft (supporting). Nikolaos Evangelou: Data curation (supporting); Software (equal); Validation (supporting); Visualization (supporting); Writing – original draft (supporting); Writing – review & editing (supporting). Tom Bertalan: Data curation (supporting); Software (supporting); Writing – original draft (supporting). Sebastian Reich: Conceptualization (equal); Formal analysis (equal); Funding acquisition (equal); Methodology (equal); Supervision (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (supporting). Ioannis G. Kevrekidis: Conceptualization (lead); Formal analysis (equal); Funding acquisition (lead); Investigation (equal); Methodology (equal); Project administration (equal); Supervision (lead); Validation (equal); Writing – original draft (lead); Writing – review & editing (equal).
DATA AVAILABILITY
The data and code that supports the findings of this study are openly available in GitLab at https://gitlab.com/felix.dietrich/sde-identification, Ref. 68.
APPENDIX A: INSTRUCTIONS FOR THE SOFTWARE
The code to reproduce the results of the computational experiments is available at https://gitlab.com/felix.dietrich/sde-identification. Table II lists the software we used to test if the code can be compiled and actually produces the experimental results.
Software . | Version . |
---|---|
Microsoft Windows | 10.0.19042 |
Python | 3.8 |
Microsoft Visual Studio Community 2019 | 16.9.0 |
Microsoft Visual C++ | 2019 |
Software . | Version . |
---|---|
Microsoft Windows | 10.0.19042 |
Python | 3.8 |
Microsoft Visual Studio Community 2019 | 16.9.0 |
Microsoft Visual C++ | 2019 |
All required python packages can be installed by running “pip install -r requirements.txt” in the folder where the requirements.txt file is placed. The toy examples and lattice computations are all available in separate files, the mapping is listed in Table III. To compile the kMCLattice.dll, the Microsoft Visual Studio solution file in kmc/kMCLattice_dll can be used. Make sure to use an x64 setup (for both Python and the .dll file). The file needs to be placed in the same folder as the Python script kmc/KMC_SIRS3.py. The SSA results can be obtained by setting both Dif1 and Dif2 to in the file kmc/KMC_SIRS3.py.
Experiment . | File . |
---|---|
Toy example: 1D, cubic drift, linear diff. (E.–M.) | example3—1d sde-cubic.ipynb |
Toy example: 1D, cubic drift, linear diff. (Milstein) | example3—1d sde-cubic-with-milstein.ipynb |
Toy example: 3D, linear drift, lower-triangular diff. | example4—3d sde-nondiagonal.ipynb |
Toy example: 3D, linear drift, SPD diff. | example4—3d sde-spd.ipynb |
Toy example: nD (1-19D), cubic drift, linear diff. | example3—nd sde-cubic.ipynb |
Kinetic Monte Carlo simulations | kmc/KMC_SIRS3.py |
Stochastic wave equation | example7—wave SPDE.ipynb |
Langevin dynamics | example8—nonGaussian.ipynb |
Experiment . | File . |
---|---|
Toy example: 1D, cubic drift, linear diff. (E.–M.) | example3—1d sde-cubic.ipynb |
Toy example: 1D, cubic drift, linear diff. (Milstein) | example3—1d sde-cubic-with-milstein.ipynb |
Toy example: 3D, linear drift, lower-triangular diff. | example4—3d sde-nondiagonal.ipynb |
Toy example: 3D, linear drift, SPD diff. | example4—3d sde-spd.ipynb |
Toy example: nD (1-19D), cubic drift, linear diff. | example3—nd sde-cubic.ipynb |
Kinetic Monte Carlo simulations | kmc/KMC_SIRS3.py |
Stochastic wave equation | example7—wave SPDE.ipynb |
Langevin dynamics | example8—nonGaussian.ipynb |
APPENDIX B: ADDITIONAL INFORMATION REGARDING THE COMPUTATIONAL EXPERIMENTS
Appendixes B 2–B 6 contain additional descriptions and results for the computational experiments. They are not necessary to follow the main results described in the paper.
1. Milstein loss
The TensorFlow implementation for the Milstein loss uses a slightly different Bessel function, with integer-valued parameter instead of the required (no other Bessel function was available). Figure 13 demonstrates the difference is quite small.
2. Toy example: Cubic drift and linear diffusivity
The loss and validation loss curves for Euler–Maruyama and Milstein schemes are shown in Fig. 14.
3. Toy example: Non-diagonal diffusivity
We now demonstrate that our approach can learn non-diagonal diffusivity matrices. We sample initial points , randomly sample a lower-triagonal diffusivity matrix with positive eigenvalues, which we use as the (constant) diffusivity , and set the drift to be . The absolute error between original and (the average over) identified matrix entries is smaller than , and the standard deviation of the diffusivity values over the entire dataset is smaller than for all elements of the matrix. Note that the network is still a nonlinear function over the state space, not a constant. To ensure the matrix has positive eigenvalues, we pass the real-valued output of the neurons that encode the diagonal of the matrix through a soft-plus activation function. Figure 15 shows that we learn a good approximation for .
4. Toy example: Increasing dimension from 1D to 19D
We define and , where all operations are meant coordinate-wise (e.g., computes the third power of each individual coordinate of the vector). Figure 16 (left) illustrates how the training and validation losses change when increasing from to . The loss () is adapted to the changing dimensionality by adding , as described by Eq. (C1). This normalization is necessary because (apparently) TensorFlow 2.4.1 does not add this constant when computing the log probability (see https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/MultivariateNormalDiag##log_prob). Figures 17 and 18 show the individual functions compared to the true values, for dimensions and . For higher dimensions, it becomes harder to approximate the true functions correctly, which can also be seen here. The increase in loss can be explained by the constant number of points () we used: Increasing the intrinsic dimension of the problem by sampling the input data in the -dimensional cube causes the data sampling to get sparser and sparser. By increasing the number of training data points linearly with the dimension (while keeping the number of training iterations per dimension constant), we can see that the training loss is relatively small even for (Fig. 16, right). Additionally, the training loss tracks the validation loss much better, indicating better generalization properties of the network. This is reasonable, because the provided data is less sparse, even for higher dimensions.
5. Kinetic Monte Carlo lattice simulations
Results on a smaller lattice are shown in Fig. 19. Figure 20 shows that the algorithm works at different system sizes, even when is large—that is, even when the noise level is low. For , the system becomes deterministic and can be described by an ODE. The data presented in Fig. 20 correspond to a relatively fast migration rate . In this case, a well-mixed state is achieved and the kMC results are close to that obtained with the SSA algorithm. However, the network can also be trained to match the kMC results at lower migration rate values. Figure 21 shows the simulation results for . Since the initial distribution on a lattice is random, we omit the first few time steps and select the data points for training only when a “matured” state is achieved. For the results presented in Fig. 21, the training process was started at . At , the accuracy of the network is acceptable, but at lower values, the coincidence of the network and kMC results decreased considerably. This suggests that around that value of the dynamics may not effectively close in terms of the mean field variables and that higher dimensional approximations (possibly including pair probabilities as additional independent variables) may become necessary. Similar results can be obtained on a smaller lattice. In this case, as expected, the influence of the stochasticity becomes more pronounced. In a separate experiment, we also confirmed the algorithm works at different system sizes, even when is large—that is, even when the noise level is low. For , the system becomes deterministic and can be described by an ODE. At low migration rate values (), the approximation quality of the network decreased considerably as expected: The system can hardly be described in terms of two averaged concentrations because the species distribution on a lattice becomes inhomogeneous: Clustering occurs, and a higher dimensional SDE, possibly in terms of densities as well as pair probabilities, becomes necessary.
When , the system exhibits richer dynamic behavior. An example of simulations with is shown in Fig. 22. The network model is capable of reproducing the correct steady state and damped oscillations (the mean field SIRS model has a focus-type steady state at these values of parameters).
6. Learning latent spaces for the SIR model
We identify latent spaces of six-dimensional data obtained from the kMC lattice simulations. Figure 23 illustrates that the latent spaces are both two-dimensional and triangular. This shape is due to the conservation law of the underlying model (when written in terms of fractions of the total population that are susceptible (S), infected (I), and removed (R), the law would be ). Appendix B 6 a outlines the Diffusion Maps algorithm, Appendix B 6 b describes how the auto-encoder was set up and trained.
a. Diffusion Maps
The Diffusion Maps algorithm by Coifman and Lafon54 follows three main steps:
Construct the kernel matrix .
Normalize the matrix to mitigate effects of data density.
Compute eigenvectors to approximate eigenfunctions of evaluated on the data.
Input: : Dimension of points in the input. : Number of eigenvectors to compute. : input dataset, with points and |
dimensions per point. |
(1) Compute kernel matrix with and kernel bandwidth through 5\% of the median of squared |
distances between all points. Note that sparse computations are also possible in the software.65 |
(2) Normalize kernel matrix to make it invariant to sampling density: |
(a) Form the diagonal normalization matrix . |
(b) Form the matrix . |
(c) Form the diagonal normalization matrix . |
(d) Form the matrix . |
(3) Solve the eigensystem and construct eigenvectors: |
(a) Find the largest eigenvalues associated to non-harmonic eigenvectors of . See67 for a discussion and method to remove |
harmonic eigenvectors. |
(b) Compute the eigenvalues of by . |
(c) Compute the eigenvectors of the matrix through . |
Output: eigenvectors , each with entries, and with corresponding eigenvalues . |
Input: : Dimension of points in the input. : Number of eigenvectors to compute. : input dataset, with points and |
dimensions per point. |
(1) Compute kernel matrix with and kernel bandwidth through 5\% of the median of squared |
distances between all points. Note that sparse computations are also possible in the software.65 |
(2) Normalize kernel matrix to make it invariant to sampling density: |
(a) Form the diagonal normalization matrix . |
(b) Form the matrix . |
(c) Form the diagonal normalization matrix . |
(d) Form the matrix . |
(3) Solve the eigensystem and construct eigenvectors: |
(a) Find the largest eigenvalues associated to non-harmonic eigenvectors of . See67 for a discussion and method to remove |
harmonic eigenvectors. |
(b) Compute the eigenvalues of by . |
(c) Compute the eigenvectors of the matrix through . |
Output: eigenvectors , each with entries, and with corresponding eigenvalues . |
b. SDE-informed auto-encoder
The SDE-informed auto-encoder has a single encoder network (from observations to latent space) and two additional networks: one decoder from latent space back to the observations, and one network that defines drift and diffusivity functions on the latent space. Note that the drift and diffusivity we find can still be considered functions of the original observations, because and . In the computational example that lead to Fig. 23, we used an encoder architecture of , a decoder architecture , and an SDE approximation network with diagonal diffusivity matrix and architecture , all with tf.nn.selu activations. The loss function for training the auto-encoder is the following, where is defined in the paper [Eq. (5)]:
where is the size of the current mini-batch (here: ). The Python class implementing this auto-encoder loss is called AEModel, and can be found in the file
sde/sde_learning_network.py of the repository.
APPENDIX C: ADDITIONAL PROOFS
The logarithm of the probability density of a multivariate normal distribution in dimensions is