Neural Ordinary Differential Equations (ODEs) are a promising approach for learning dynamical models from time-series data in science and engineering applications. This work aims to learn neural ODEs for stiff systems, which commonly arise in chemical kinetic modeling of chemical and biological systems. We first show the challenges of learning neural ODEs on the classical stiff ODE system of Robertson’s problem and propose techniques to mitigate the challenges associated with scale separations in stiff systems. We then present successful demonstrations on the stiff systems of Robertson’s problem and an air pollution problem. The demonstrations show that the use of deep networks with rectified activations, proper scaling of the network outputs as well as of the loss functions, and stabilized gradient calculations are the key techniques enabling the learning of stiff neural ODEs. The success of learning stiff neural ODEs opens up possibilities of using neural ODEs in applications with widely varying time scales, such as chemical dynamics in energy conversion, environmental engineering, and the life sciences.

Neural Ordinary Differential Equations (ODEs) have emerged as a powerful tool for describing a dynamical system with an artificial neural network. Despite their many advantages, there are many scientific and engineering examples where neural ODEs may fail due to stiffness. This study demonstrates how instability arises in neural ODEs during training on a benchmark stiff system. Furthermore, we apply scaling to the differential equations and loss functions to mitigate stiffness. The proposed techniques for adjoint handling and equation scaling in stiff neural ODEs can be used in many biological, chemical, and environmental systems subject to stiffness.

## I. INTRODUCTION

Neural Ordinary Differential Equations (ODEs) are an elegant approach for data-driven modeling of dynamical systems from time-series data in science and engineering applications. Neural ODEs offer various advantages over physics-based modeling and traditional neural network modeling. Compared to physics-based modeling, where a presumed functional form is required, the neural ODE approach precludes the need for expert knowledge in identifying such prior information.^{1,2} Compared to traditional neural network modeling, such as recurrent neural networks, neural ODEs enable flexible modeling of irregularly and incompletely sampled time series^{3} and can be more efficient by taking advantage of modern ODE solvers and adjoint methods.^{4,5}

Despite the success of neural ODEs^{4} in many science and engineering problems,^{6–9} it has been recognized that it is very challenging to learn neural ODEs for stiff dynamics,^{10} characterized by widely separated time scales. Part of the challenge is due to the high computational cost of solving stiff ODEs as well as handling the ill-conditioned gradients of loss functions with respect to neural network weights.^{11} In addition, it has been shown that stiffness can lead to failures in many data-driven modeling approaches, such as reduced-order modeling^{12} and physics-informed neural networks.^{13} It has been suggested that stiffness can lead to gradient pathologies and ill-conditioned optimization problems, leading to the failure of stochastic gradient descent based optimization.^{14}

This work elucidates the challenges of learning neural ODEs for stiff systems and proposes strategies to overcome them. We study learning neural ODEs on data generated from two classical stiff systems, ROBER^{15} and POLLU,^{16} which are extensively used for benchmarking stiff ODE integrators. Both ROBER and POLLU describe the dynamics of species concentrations in stiff chemical reaction systems. We propose strategies to mitigate the challenges associated with scale separations in time as well as in the magnitudes of the states, and we are thus able to successfully learn stiff neural ODEs for those two benchmark problems. We demonstrate how applying adjoint methods to stiff systems leads to numerical issues and unfavorable computational complexity, and we propose new adjoint techniques to overcome both. These strategies not only demonstrate successful numerical methods and architectures for handling time scale separations in neural ODEs but also elucidate general theoretical issues of data-driven modeling of stiff dynamical systems, which can be transferred to other techniques.

## II. BACKGROUND

### A. Neural ordinary differential equations

Many science and engineering problems can be formulated as ODEs; that is,

$$
\frac{dy}{dt} = f(y, t),
$$

where $t$ is time, $y(t)$ is the vector of state variables, and the function $f$ models the dynamics. Identifying the model $f$ is one of the central tasks in many scientific discoveries, and it usually involves proposing functional forms and then performing parameter inference against observation data. Proposing functional forms of $f$ is a very challenging task for many complex systems, such as modeling the gene regulatory networks and cell signaling networks in the life sciences.^{17,18} Discovering the functional forms of $f$ can require decades of effort with expert knowledge, and it is often the case that much unknown dynamics is yet to be discovered in those complex systems.^{19,20} With the help of the universal approximation theorem, one can approximate the model $f$ using a neural network without worrying about missing important dynamics as in physics-based modeling; i.e.,

$$
\frac{dy}{dt} \approx NN(y, t; \theta).
$$

If one can measure the data pairs $(y(t), \frac{dy(t)}{dt})$, one can train the neural network straightforwardly as a normal supervised learning problem. However, we usually only have access to time-series data of $y(t)$. Instead, neural ODEs train the $NN$ by integrating the ODEs,

$$
y(t) = y(t_0) + \int_{t_0}^{t} NN(y, \tau; \theta)\, d\tau,
$$

where $y(t_0)$ are the initial conditions of the integration and $t \in (t_0, t_1)$ is the range of the integration. We can define the loss function as the difference between the observed state variables and the predicted state variables,

$$
L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left| y_{\mathrm{obs}}(t_i) - y_{\mathrm{pred}}(t_i) \right|,
$$

where the mean absolute error (MAE) is used as an example metric. Using modern differentiable ODE solvers,^{5} one can efficiently compute the gradient of the loss function with respect to the model parameters. This work employs Julia’s^{21} Scientific Machine Learning ecosystem, mainly consisting of DifferentialEquations.jl,^{22} Flux.jl,^{23} and ForwardDiff.jl,^{24} to integrate the neural ODEs and perform backpropagation. We can thus use stochastic gradient descent algorithms to optimize the neural network.
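The paper’s training loop uses the Julia stack above; as a language-agnostic illustration of the same idea, the sketch below fits a single unknown rate constant in $dy/dt = -\theta y$ by gradient descent on an MAE loss, using SciPy in place of a differentiable solver and a finite-difference gradient in place of adjoint or forward sensitivities (all names and settings here are illustrative, not from the paper).

```python
import numpy as np
from scipy.integrate import solve_ivp

# Synthetic observations from the "true" dynamics dy/dt = -2*y.
t_obs = np.linspace(0.0, 2.0, 20)
y_obs = np.exp(-2.0 * t_obs)

def predict(theta):
    """Integrate the candidate model dy/dt = -theta*y and sample it at t_obs."""
    sol = solve_ivp(lambda t, y: -theta * y, (0.0, 2.0), [1.0],
                    t_eval=t_obs, rtol=1e-8, atol=1e-10)
    return sol.y[0]

def loss(theta):
    # Mean absolute error between observed and predicted states.
    return np.mean(np.abs(predict(theta) - y_obs))

# Plain gradient descent; a central finite difference stands in for the
# solver-differentiated gradients used in the paper.
theta, lr, eps = 0.5, 0.2, 1e-5
for _ in range(150):
    g = (loss(theta + eps) - loss(theta - eps)) / (2.0 * eps)
    theta -= lr * g

print(theta)  # converges toward the true rate constant 2.0
```

The same structure (solve, compare to data, differentiate, update) carries over directly to neural network right-hand sides.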

### B. Stiff ODE systems

Ernst Hairer’s classic working definition for the field of numerical ODE solving is that “stiff equations are problems for which explicit methods do not work.”^{25,26} In other words, it is the case where numerical ODE solvers such as the dopri5 method, chosen in the original neural ODE paper,^{4} fail to adequately solve the differential equation. Many definitions have been proposed to explain this phenomenon. A commonly used definition is the stiffness index,

$$
S = \frac{\max_i \left| \operatorname{Re}(\lambda_i) \right|}{\min_i \left| \operatorname{Re}(\lambda_i) \right|},
$$

where the $\lambda_i$ are the eigenvalues of the Jacobian. While these types of stiffness indices can help identify essential analytical properties for numerical methods to capture, no stiffness index captures all of the numerically difficult problems. Shampine famously stated: “Indeed, the classic Robertson chemical kinetics problem typically used to illustrate stiffness has two zero eigenvalues. Any definition of stiffness that does not apply to this standard test problem is not very useful.”^{27} In addition, the role of the time span is often overlooked. Shampine notes that a system may not seem stiff if one is only solving on the timescale of the fast process: stiffness requires that one investigate the effects on the time scale of the slow process.
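Shampine’s remark about Robertson’s problem is easy to verify numerically: the ROBER Jacobian (rate constants from Sec. IV) has two exactly zero eigenvalues at the initial condition, while mid-trajectory its spectrum spreads over many orders of magnitude. A small NumPy check (the mid-trajectory state below is illustrative):

```python
import numpy as np

# ROBER rate constants from Sec. IV.
k1, k2, k3 = 0.04, 3.0e7, 1.0e4

def jacobian(y1, y2, y3):
    """Analytic Jacobian of the ROBER right-hand side."""
    return np.array([
        [-k1,                 k3 * y3,  k3 * y2],
        [ k1, -2.0*k2*y2 - k3*y3,      -k3 * y2],
        [0.0,            2.0*k2*y2,         0.0],
    ])

# At the initial condition [1, 0, 0], two eigenvalues are exactly zero,
# which is Shampine's point about eigenvalue-based stiffness indices.
eig0 = np.linalg.eigvals(jacobian(1.0, 0.0, 0.0))
print(np.sort(eig0.real))  # [-0.04, 0, 0]

# At an illustrative mid-trajectory state, the largest eigenvalue magnitude
# is several orders of magnitude larger than at the initial condition.
eig1 = np.linalg.eigvals(jacobian(0.9, 2.0e-5, 0.1))
print(np.max(np.abs(eig1)))
```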

Due to the ambiguity of the definition of stiff systems, we choose two differential equations studied extensively in the stiff ODE solver literature as representative highly stiff systems:^{25} ROBER^{15} and POLLU.^{16} Specifically, the canonical ROBER problem consists of three species and five reactions, and the POLLU problem consists of 20 species and 25 reactions describing air pollution formation in atmospheric chemistry. The formulas of the ROBER and POLLU problems are presented in Sec. IV and the supplementary material. We will investigate the difficulty of fitting data sampled from these systems, thus illuminating the unique features of the training process that arise in the discovery of stiff systems and the numerical methods for handling them.

## III. METHOD

### A. Stability of adjoints on stiff ODEs

The core problem of training neural ODEs is calculating the gradient of the ODE’s solution. In the method proposed by Chen *et al.*,^{4} the ODE is reversed and augmented with the adjoint equations as

$$
\frac{d}{dt}\begin{bmatrix} z \\ a \\ \frac{dL}{d\theta} \end{bmatrix}
= \begin{bmatrix} f(z, \theta, t) \\ -a^{T} \frac{\partial f}{\partial z} \\ -a^{T} \frac{\partial f}{\partial \theta} \end{bmatrix},
$$

solved backward from $t_1$ to $t_0$ with $a(t_1) = \frac{\partial L}{\partial z(t_1)}$.

However, previous work has called into question the stability of reversing the ODE; i.e., solving

$$
\frac{dz}{dt} = f(z, \theta, t)
$$

when solving from $t_1$ to $t_0$. While theoretically this reverses the ODE exactly, in practice, numerical issues can cause errors that propagate into the gradient. Gholami *et al.*^{28} noted that if the Lipschitz constant is too large in absolute value, then either the forward pass or the reverse solve is numerically unstable. This can be seen by using a standard ODE solver on the linear ODE $\frac{dy}{dt} = -\lambda y$ on $t \in (0, 1)$ with $y(0) = 1$: $\lambda = 100$ imparts about 1% error, while $\lambda = 10\,000$ cannot be reversed numerically in double precision.
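This forward-then-reverse error amplification can be reproduced with a few lines of SciPy (a sketch of the experiment described above, with illustrative tolerances):

```python
from scipy.integrate import solve_ivp

def roundtrip_error(lam):
    """Solve dy/dt = -lam*y forward on (0, 1), then integrate the same ODE
    backward in time from the forward end state; return |y_back(0) - 1|."""
    f = lambda t, y: -lam * y
    fwd = solve_ivp(f, (0.0, 1.0), [1.0], rtol=1e-8, atol=1e-10)
    back = solve_ivp(f, (1.0, 0.0), [fwd.y[0, -1]], rtol=1e-8, atol=1e-10)
    return abs(back.y[0, -1] - 1.0)

err10 = roundtrip_error(10.0)    # mild decay: reverses with small error
err100 = roundtrip_error(100.0)  # fast decay: the reverse solve is destroyed
print(err10, err100)
```

The forward error near the end state, however tiny, is amplified by roughly $e^{\lambda}$ when integrated in reverse.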

To show how this result extends to stiff ODEs, recall that the Lipschitz constant of a nonlinear function is the infimum of values $\Lambda$ such that $\|f(x) - f(y)\| \le \Lambda \|x - y\|$ for some domain $x, y \in \Omega$, which directly implies that the global Lipschitz constant satisfies $\left\| \frac{df}{dy} \right\|_{\infty} \le \Lambda$ over the whole domain. Thus, because the maximum eigenvalue of the Jacobian gives a lower bound on the Lipschitz constant, stiff ODEs fall into the class of problems that are numerically difficult to reverse. Figure 1 demonstrates this on the ROBER stiff ODE test problem:^{13,15} even with imperceptible errors in the forward pass, we receive noticeable errors in the reverse solve, even when plotted on a log scale. Using the well-known CVODE BDF integrator from Sundials,^{29} we see that with $tol = \mathrm{abstol} = \mathrm{reltol} = 10^{-6}$ we receive 72% error, $tol = 10^{-7}$ gives 38% error, and $tol = 10^{-8}$ gives 5% error, with 0.1% error only reached at $tol < 10^{-10}$, i.e., below the threshold of single precision arithmetic.

The adjoint calculation only amplifies this error. With a test loss function $L(\theta) = \sum_i y_1(t_i)$ at evenly spaced points in log-space, using the adjoint as described caused CVODE to exit with a Newton divergence near the start of the reverse pass for all $tol = 10^{-i}$ with integers $i = 6, 7, \ldots, 14$. Manually inspecting the solution process reveals the problem. While the reverse solution only has an approximately $10^{-10}$ error in $y_2$ at $t = 99999.98999$, as shown in Fig. 2, the large $k_2$ amplifies this term by $6 \times 10^7$, imparting approximately $O(10^{-3})$ error into $\frac{d}{dt}\frac{dL}{d\theta}$. This is because, while in the original ODE the middle term has the form $k_2 y_2^2$, in the Jacobian the term is $2 k_2 y_2$; without the squaring, the small error is not diminished. This sends $\frac{dL}{d\theta}$ negative in a way that explodes linearly.

The simple linear ODE $u'(t) = \lambda u(t)$, $u(0) = 1$ on the domain $t \in [0, 1]$ is sufficient to demonstrate the blow-up behavior of the reverse solve after the forward solve. In the presence of truncation error and round-off error, the forward solution is $u_f(t) = e^{\lambda t} + \epsilon_f(t)$, where $\epsilon_f$ denotes the error of the forward solve. It follows that the reverse problem has the initial condition $u_b(1) = e^{\lambda} + \epsilon_f(1)$. Thus, at $t = 0$, the backward solution is $u_b(0) = 1 + e^{-\lambda}\epsilon_f(1) + \epsilon_b(0)$, where $\epsilon_b$ denotes the error of the backward solve. Therefore, the total error at $t = 0$ from solving the linear equation forward and then backward is $e^{-\lambda}\epsilon_f(1) + \epsilon_b(0)$, which exhibits inherent exponential blow-up when $\lambda < 0$. A stable forward problem implies exponential divergence of the reverse problem. To demonstrate this divergence, the simple linear ODE above is solved forward and backward, and the error at $t = 0$ is plotted for various parameters $\lambda$ in Fig. 3.

Avoiding this issue requires that one not attempt to reverse the ODE. To this effect, we tested the interpolated checkpointing method of CVODES^{29} (known as InterpolatingAdjoint in DiffEqFlux) and discrete adjoint sensitivities via automatic differentiation of the solver,^{30} both of which successfully solved the equation and agreed on all derivatives to at least three decimal places when solved with $tol = 10^{-6}$. We note that one can prove the consistency of the adjoint discretization in the case of discrete adjoint sensitivities, ensuring stability.^{31}

### B. Linear scaling adjoints

Being limited to non-reversing adjoints, we investigate the computational cost of derivative calculations on stiff ODEs. To illustrate the computational cost and motivate new formulations, consider the implicit Euler method,

$$
y_{n+1} = y_n + h f(y_{n+1}, t + h),
$$

which is solved via the root-finding problem $G(y_{n+1}) = y_{n+1} - y_n - h f(y_{n+1}, t + h) = 0$ for the next step. The instability of fixed-point solvers^{25} makes it necessary to solve the root-finding problem by Newton’s method, i.e., $y^{i+1} = y^{i} - \left(\frac{dG(y^i)}{dy}\right)^{-1} G(y^i)$ so that $y^i \to y_{n+1}$. The core computational cost is thus solving the linear system $\frac{dG(y^i)}{dy} x = G(y^i)$. The condition number of the matrix $\frac{dG(y^i)}{dy}$ is governed by the ratio of the maximum and minimum eigenvalues of the system’s Jacobian, meaning that the linear system is ill-conditioned for stiff equations. Krylov subspace methods such as GMRES converge slowly when the condition number of the matrix is high,^{32} and thus using these methods can require problem-specific preconditioners to make convergence computationally acceptable. For this reason, methods for solving stiff ODEs generally use direct methods, either dense or sparse, on the Jacobian. The general cost of performing such a linear solve via LU-factorization is $O(k^3)$, where $k$ is the number of columns of the Jacobian, which is equal to the number of states in the ODE.
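A minimal sketch of one implicit Euler step with a Newton iteration, applied to the ROBER right-hand side (Python/NumPy used here for illustration; the dense linear solve inside each Newton iteration is exactly the $O(k^3)$ LU-factorization cost discussed above):

```python
import numpy as np

# ROBER right-hand side and analytic Jacobian (rate constants from Sec. IV).
k1, k2, k3 = 0.04, 3.0e7, 1.0e4

def f(y):
    y1, y2, y3 = y
    return np.array([-k1*y1 + k3*y2*y3,
                      k1*y1 - k2*y2**2 - k3*y2*y3,
                      k2*y2**2])

def jac(y):
    y1, y2, y3 = y
    return np.array([[-k1,                 k3*y3,  k3*y2],
                     [ k1, -2.0*k2*y2 - k3*y3,    -k3*y2],
                     [0.0,            2.0*k2*y2,     0.0]])

def implicit_euler_step(yn, h, tol=1e-12, maxit=50):
    """One implicit Euler step: solve G(y) = y - yn - h*f(y) = 0 by Newton.
    Each iteration solves a linear system with the matrix I - h*J(y)."""
    y = yn.copy()
    for _ in range(maxit):
        G = y - yn - h * f(y)
        dG = np.eye(3) - h * jac(y)
        dy = np.linalg.solve(dG, G)   # the O(k^3) direct linear solve
        y = y - dy
        if np.linalg.norm(dy) < tol:
            break
    return y

y = implicit_euler_step(np.array([1.0, 0.0, 0.0]), h=0.01)
print(y, y.sum())  # ROBER conserves mass, so the components still sum to 1
```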

The cubic cost growth of solving a stiff system changes the efficiency calculus of methods for computing the derivative of the ODE solution and thus for calculating the gradient of the neural ODE’s loss function. In the traditional argument, forward-mode automatic differentiation of an ODE is equivalent to solving the forward sensitivity equations,

$$
\frac{d}{dt}\left(\frac{dz}{dp}\right) = \frac{\partial f}{\partial z} \frac{dz}{dp} + \frac{\partial f}{\partial p},
$$

where there exists one sensitivity $\frac{dz}{dp}$ for each parameter. Thus, given $k$ state equations and $m$ parameters, the total size of the ODE system is $O(km)$. In contrast, the adjoint method only solves the original ODE forward and then solves a system of size $O(m + k)$ in reverse, greatly reducing the computational cost when the size of the ODE and the number of parameters are sufficiently large.

However, the total computational cost of implicit methods is based on the cube of the system size, meaning standard forward-mode sensitivities scale as $O(k^3 m^3)$ while the adjoint method of Chen *et al.*^{4} or Serban and Hindmarsh^{33} scales as $O((k + m)^3)$. Given the large number of parameters in a neural ODE, the cubic scaling adds a tremendous computational cost. To mitigate this issue, we developed three separate strategies. For the first, note that many solvers contain a scheme known as dense output for generating a high-order interpolation by reusing the internal computations of the steps.^{34} In many cases, this interpolation scheme requires zero additional $f$ calculations by the ODE solver to construct and is thus computationally efficient. However, because it uses the intermediate steps for the calculation, it requires an increase in memory by a factor of $s$ for an $s$-stage Runge–Kutta or Rosenbrock method.^{35} Using the dense output schemes, we can instead calculate Eq. (6) by sequentially solving $z' = f(z, \theta, t)$ forward, then solving $\omega' = -\omega^{T} \frac{\partial f}{\partial z}$ in reverse [using the dense output of $z(t)$ for any Jacobian computation], and finally solving $\frac{dL}{d\theta} = \int_{t_0}^{t_1} \omega^{T}(t) \frac{\partial f}{\partial \theta}\, dt$ independently using the dense output for $\omega(t)$. Quadrature techniques such as Clenshaw–Curtis can converge exponentially fast, using fewer steps than the ODE solver, and thus potentially further decrease the computational cost of this step of the calculation, which is $O(m)$. Most importantly, this splits the computation into two stiff ODE solves matching the size of $z$ plus an explicit quadrature matching the size of $\theta$. Because of the cubic cost scaling of LU-factorizations in a stiff ODE solve, this trades computation for memory and brings the cost down to $O(k^3 + m)$. We termed this the QuadratureAdjoint method.
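The three-stage structure of the QuadratureAdjoint (forward solve with dense output, reverse adjoint solve, then a separate quadrature for $dL/d\theta$) can be sketched on a scalar toy problem where the gradient is known analytically (SciPy stand-in for the Julia implementation; the toy problem and tolerances are illustrative):

```python
import numpy as np
from scipy.integrate import solve_ivp, quad

# Toy problem: z' = -theta*z, z(0) = 1, loss L = z(T).
# Analytic gradient: dL/dtheta = -T * exp(-theta*T).
theta, T = 2.0, 1.0

# 1) Forward solve with dense output (the continuous interpolant of z).
fwd = solve_ivp(lambda t, z: -theta * z, (0.0, T), [1.0],
                dense_output=True, rtol=1e-10, atol=1e-12)

# 2) Adjoint solve in reverse, w' = -w * (df/dz); here df/dz = -theta, so the
#    right-hand side is theta*w, but the structure matches the general scheme.
back = solve_ivp(lambda t, w: theta * w, (T, 0.0), [1.0],
                 dense_output=True, rtol=1e-10, atol=1e-12)

# 3) Separate quadrature: dL/dtheta = int_0^T w(t) * (df/dtheta) dt,
#    with df/dtheta = -z(t), using both dense outputs.
grad, _ = quad(lambda t: back.sol(t)[0] * (-fwd.sol(t)[0]), 0.0, T)

print(grad, -T * np.exp(-theta * T))  # both approximately -0.1353
```

Because step 3 is an explicit quadrature, the stiff implicit solves never see the parameter dimension, which is the source of the $O(k^3 + m)$ scaling.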

A core fact exploited in that formulation is that the final equation is explicit and thus does not necessarily share the stability properties of the forward or adjoint equations. It would only be stiff if the Jacobian of $\omega^{T}(t) \frac{\partial f}{\partial \theta}$ is ill-conditioned, which does not necessarily follow from the conditioning of $\frac{\partial f}{\partial z}$. Thus, another formulation that could reduce the memory cost while achieving a similar effect is to use an implicit–explicit (IMEX) ODE solver. IMEX ODE solvers allow solving the differential equation as $f = f_i + f_e$, where $f_i$ is treated implicitly while $f_e$ is treated explicitly. In this sense, Eq. (6) could be split so that

$$
\frac{d}{dt}\begin{bmatrix} \omega \\ \frac{dL}{d\theta} \end{bmatrix}
= \underbrace{\begin{bmatrix} -\omega^{T} \frac{\partial f}{\partial z} \\ 0 \end{bmatrix}}_{f_i}
+ \underbrace{\begin{bmatrix} 0 \\ \omega^{T} \frac{\partial f}{\partial \theta} \end{bmatrix}}_{f_e}.
$$

As an IMEX solver only requires factorizing the Jacobian of the $f_i$ term, specializing the factorization on the block structure reduces the LU-factorization cost to $O(k^3)$; factoring in the linear cost of the explicit Runge–Kutta handling, we arrive at another $O(k^3 + m)$ scheme, but with reduced peak memory requirements.

These two approaches highlight that the main difficulty is due to the inclusion of the $m$ parameter derivatives in the stiff solve. Thus, the last way to achieve a similar effect is to apply any of the aforementioned schemes while only calculating the derivative of $l \ll m$ parameters at a time. In this case, one needs to repeat the adjoint solve $\frac{m}{l}$ times. However, because each solve has a cubic computational cost with respect to the states, this form of splitting brings the complexity down to $O(m k^3 l^2)$ for forward sensitivities and $O\left(\frac{m}{l}(k + l)^3\right)$ for adjoints. Note that this technique also applies to direct differentiation via forward and reverse automatic differentiation of the solver. Additionally, the separate differentiation passes can all be solved in parallel, further reducing the real-world compute cost. This is the technique we used to make the experiments of Sec. IV computationally viable.
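The “$l$ parameters at a time” splitting can be checked on a toy problem: chunked forward-sensitivity solves must reproduce the gradient of one large augmented solve. A sketch with three parameters and chunks of one (the toy dynamics and all names are illustrative):

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy dynamics z' = -(theta0 + theta1*t + theta2*t^2) * z with z(0) = 1,
# and "loss" L = z(1), so dL/dtheta_i = dz(1)/dtheta_i.
theta = np.array([0.5, 1.0, 0.3])

def rate(t, th):
    return th[0] + th[1]*t + th[2]*t**2

def sens_solve(idx):
    """Solve z plus the forward sensitivities s_i = dz/dtheta_i for the
    parameter indices in idx only (the 'l parameters at a time' trick)."""
    def rhs(t, u):
        z, s = u[0], u[1:]
        r = rate(t, theta)
        # ds_i/dt = (df/dz)*s_i + df/dtheta_i, with df/dtheta_i = -t**i * z.
        ds = -r * s + np.array([-(t**i) * z for i in idx])
        return np.concatenate(([-r * z], ds))
    u0 = np.concatenate(([1.0], np.zeros(len(idx))))
    sol = solve_ivp(rhs, (0.0, 1.0), u0, rtol=1e-10, atol=1e-12)
    return sol.y[1:, -1]  # dz(1)/dtheta_i for i in idx

full = sens_solve([0, 1, 2])                                    # one augmented solve
chunked = np.concatenate([sens_solve([i]) for i in range(3)])   # three small solves
print(full, chunked)  # identical gradients from much smaller per-solve systems
```

The chunked solves are independent, so in practice they can run in parallel.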

### C. Equation scaling

Armed with computationally stable and efficient gradient calculations, we noticed that these techniques were still not enough to stabilize the training process. To further tackle the difficulties introduced by scale separation in training stiff neural ODEs, we propose to normalize the neural network outputs and the loss components for the different species. Specifically, we learn the following normalized form of neural ODEs:

$$
\frac{dy}{dt} = NN(y, t; \theta) \odot \frac{y_{\mathrm{scale}}}{t_{\mathrm{scale}}},
$$

with $\odot$ denoting elementwise multiplication,

where $y_{\mathrm{scale}}$ is a vector consisting of the characteristic scales for each species and $t_{\mathrm{scale}}$ is the characteristic time scale. We further re-balance the loss components by normalizing the loss function as

$$
L(\theta) = \mathrm{MAE}\!\left(\frac{y_{\mathrm{obs}}(t)}{y_{\mathrm{scale}}},\ \frac{y_{\mathrm{pred}}(t)}{y_{\mathrm{scale}}}\right),
$$

where the division is elementwise.
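A quick numerical illustration of why the loss must be normalized: with ROBER-like magnitudes, an unnormalized MAE is dominated entirely by the $O(1)$ species, while dividing each component by its characteristic scale balances the contributions (the numbers below are illustrative):

```python
import numpy as np

# Characteristic scales for ROBER-like states: y2 lives at ~1e-5 while
# y1 and y3 are O(1).
y_scale = np.array([1.0, 1.0e-5, 1.0])

# Hypothetical prediction errors, each 10% of its species' own scale.
err = 0.1 * y_scale

# Unnormalized MAE contributions: y2 is invisible next to y1 and y3.
frac_raw = err / err.sum()
print(frac_raw)     # ~[0.5, 5e-6, 0.5]

# Normalized contributions, dividing each component by its scale:
scaled = err / y_scale
frac_scaled = scaled / scaled.sum()
print(frac_scaled)  # [1/3, 1/3, 1/3] -- balanced loss components
```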

## IV. EXPERIMENTS

### A. ROBER problem

Robertson’s equations, denoted ROBER, are one of the prominent stiff ODE systems, modeling a reaction network with three chemicals.^{15} The species concentrations $[y_1, y_2, y_3]$ are governed by Eq. (15) with reaction rate constants $k_1 = 0.04$, $k_2 = 3 \times 10^7$, and $k_3 = 10^4$,

$$
\begin{aligned}
\frac{dy_1}{dt} &= -k_1 y_1 + k_3 y_2 y_3, \\
\frac{dy_2}{dt} &= k_1 y_1 - k_2 y_2^2 - k_3 y_2 y_3, \qquad (15) \\
\frac{dy_3}{dt} &= k_2 y_2^2.
\end{aligned}
$$
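Training data of the kind described in this section can be generated with any stiff integrator; a SciPy sketch using Radau and 50 log-spaced sample points over $[10^{-5}, 10^5]$ (the paper itself uses Julia’s DifferentialEquations.jl):

```python
import numpy as np
from scipy.integrate import solve_ivp

# ROBER right-hand side with the rate constants given above.
k1, k2, k3 = 0.04, 3.0e7, 1.0e4

def rober(t, y):
    y1, y2, y3 = y
    return [-k1*y1 + k3*y2*y3,
             k1*y1 - k2*y2**2 - k3*y2*y3,
             k2*y2**2]

# 50 sample points uniformly spaced in log time.
t_eval = np.logspace(-5, 5, 50)
sol = solve_ivp(rober, (1.0e-5, 1.0e5), [1.0, 0.0, 0.0], method='Radau',
                t_eval=t_eval, rtol=1e-8, atol=1e-10)

y = sol.y
print(y[:, -1])                           # near-equilibrium state at t = 1e5
print(np.abs(y.sum(axis=0) - 1.0).max())  # mass-conservation check
```

The stiffness is visible in the data itself: $y_2$ never exceeds a few times $10^{-5}$ while $y_1$ and $y_3$ are $O(1)$.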

#### 1. Baseline model

We first study baseline neural ODEs to elucidate the challenges of training stiff neural ODEs. The baseline model consists of one hidden layer with 50 hidden nodes and a hyperbolic tangent activation function, similar to the demo code in DiffEqFlux.jl and torchdiffeq. The stiff ODE integrator Rosenbrock23, an order 2/3 L-stable Rosenbrock-W method, is employed, and we combine Rosenbrock23 with Tsit5 via automatic switching to accelerate the integration and training. ForwardDiff.jl is adopted for the backpropagation, which we found to be more efficient than adjoint methods in the current case due to the small number of states. The training data are generated by integrating Eq. (15) with the initial conditions $[y_1, y_2, y_3] = [1, 0, 0]$ over the time span $[10^{-5}, 10^5]$. The 50 sample data points are uniformly spaced on a logarithmic time scale. The neural ODE is trained with the ADAM^{36} optimizer with a learning rate of 0.005. The training was ended at 10 000 epochs, where the loss function flatlines, as shown in Fig. 4.

Figure 5 shows the predictions of the learned baseline model. The baseline model failed to learn the actual dynamics and consequently failed to reproduce the species profiles. In addition, the gradient norm fluctuates strongly and the training losses do not decrease, which implies training instability (Fig. 4). We hypothesize that the failure is attributable to the time scale separations, the stiffness-induced scale separations in the neural network outputs, and the imbalance among the loss function components. Specifically, after the initial short induction period, the changes in $y_1$ and $y_3$ are on the order of unity while those in $y_2$ are on the order of $10^{-5}$, which implies that a single neural network has to approximate such widely separated scales. In addition, since the magnitude of $y_2$ is five orders of magnitude smaller than $y_1$ and $y_3$, the imbalance among the three loss components could lead to gradient pathologies, in whose presence stochastic gradient descent has been shown to be unstable.^{14}

#### 2. Mitigation of gradient pathologies via scaling

Figure 6 shows the results of successfully learning the stiff ROBER problem with equation scaling. The neural network consists of six hidden layers with five nodes per layer and the GELU^{37} activation function. GELU was chosen to mitigate potential issues with vanishing gradients typically seen in recurrent neural networks with saturating activation functions.^{38} Figure 7 shows the history of the loss function and the $L_2$ norm of the gradient. The loss steadily decreases, and the gradient norm is stable during training. The training took 2 h on a workstation using a single CPU core.

We carried out sensitivity analyses to better understand the importance of the proposed techniques, i.e., the equation scaling and the deep networks, for the training of stiff neural ODEs. Our tests show that scaling is essential for training stiff neural ODEs, as training with the same neural network structure but without equation scaling failed to learn the ROBER problem. Regarding the neural network structure, we found that shallow neural networks with the hyperbolic tangent (tanh) activation function can also successfully learn the ROBER problem but require a longer training time, which could be attributed to potential gradient issues. The results of the training without equation scaling and of the training with the shallow network are provided in the supplementary material.

We can also accelerate the training with temporal regularization^{10} and learning rate annealing, although these techniques were not required for successful learning of stiff neural ODEs. The temporal regularization technique STEER^{10} randomly truncates the end of the integration time span during training, thereby introducing stochasticity into the optimization to accelerate it. Learning rate annealing is a widely adopted technique for accelerating the training of neural networks in later stages.

### B. POLLU problem

To show the generality of the techniques, we demonstrate the effectiveness of the proposed methods on the POLLU problem, which is more complex than the ROBER problem and consists of 20 species and 25 reactions. POLLU is an air pollution model developed at the Dutch National Institute of Public Health and Environmental Protection. It is described mathematically by the 20 non-linear ODEs of Eq. (16); full details about the model can be found in Ref. 16 and the supplementary material.

The neural ODEs consist of three hidden layers with ten nodes per hidden layer. The label data are sampled uniformly in $[0, 60]$ on a linear scale. The rest of the hyper-parameter settings are the same as for the ROBER problem. Figure 8 presents the label data generated with Eq. (16) and the predictions of the learned neural ODEs. Note that the baseline model again failed on the POLLU problem, similarly to the ROBER problem, as presented in the supplementary material. The trained neural ODEs capture both the dynamics during the induction period and the near-equilibrium states. They also capture well both the dynamics of the major species, such as $y_1$ and $y_2$, with relatively large concentrations and slow time scales, and the dynamics of transient intermediate species, such as $y_3$ and $y_{10}$, with relatively low concentrations and fast time scales. Therefore, the POLLU demonstration further shows the robustness of the proposed approaches in learning stiff neural ODEs where previous techniques had failed.

## V. CONCLUSION AND DISCUSSION

This work proposed new derivative calculation techniques that preserve stability for stiff ODE systems while reducing the computational complexity. These techniques are coupled with equation scaling and deeper neural networks with rectified activation functions to learn neural ODEs for stiff systems. The effectiveness of the proposed approaches is demonstrated on two classical stiff systems, ROBER and POLLU. While the baseline models failed to learn even the overall trend of the dynamics, the proposed approaches accurately reproduced all of the species profiles, including those of species with both slow and fast time scales.

While we have shown the effectiveness of the proposed techniques, there is a demand for further theoretical analysis of the challenges of learning stiff neural ODEs, such as how stiffness affects the gradient dynamics during training and leads to training failures. In addition, theoretical analysis relating equation scaling to the gradient flow of neural ODEs would better elucidate why it is important in the training process. It is also interesting to explore other equation scaling and neural network scaling techniques, such as scaling the neural network inputs and batch normalization.^{39} Given that first-order optimizers themselves solve an ordinary differential equation on the gradient flow, one could likely define a stiffness index on the Jacobian of the gradient of the optimization, i.e., the Hessian with respect to the parameters. Handling the optimization itself with stiffness-aware methods could be beneficial in cases where gradient pathologies are not mitigated.

Importantly, this paper demonstrates the impact of errors in the continuous adjoint handling on the training of a neural ODE. Such an analysis may be required for other “implicit layer” methods as well. For example, we referenced how a nonlinear solver destabilizes when its Jacobian is ill-conditioned, which is equivalent to stiffness in the ODE system when viewing the solver as integrating to steady state.^{25} This suggests that on highly stiff systems, similar behavior would apply to the forward pass of methods such as the deep equilibrium model (DEQ).^{40} More importantly, similar instabilities likely occur in the adjoint problem due to its reliance on the assumption that $G(x) = 0$ holds exactly, which is more difficult to satisfy as the condition number rises; similar stabilization may be required in that case. Likewise, differentiable optimization as a neural network layer^{41,42} requires satisfaction of the KKT equations for its adjoint, which are likely to be satisfied less accurately when the Hessian of the optimization problem is ill-conditioned, and therefore a similar stability analysis is likely required.

We end by noting that such developments in handling highly stiff equations can allow for machine learning architectures that satisfy arbitrary constraint equations during their evolution. Rackauckas *et al.*^{1} proposed using differential-algebraic equations (DAEs) in a mass-matrix form for this purpose; i.e.,

$$
M \frac{dy}{dt} = f(y, t),
$$

where $M$ is singular. For example, it is well known that the Robertson equations can be written in DAE form,

$$
\begin{aligned}
\frac{dy_1}{dt} &= -k_1 y_1 + k_3 y_2 y_3, \\
\frac{dy_2}{dt} &= k_1 y_1 - k_2 y_2^2 - k_3 y_2 y_3, \\
0 &= y_1 + y_2 + y_3 - 1,
\end{aligned}
$$

where the mass matrix is $M = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}$. Similarly, an arbitrary system constrained to have its dynamical states sum to 1 could be imposed by

$$
M \frac{dy}{dt} = \begin{bmatrix} NN(y; \theta) \\ 1 - \sum_i y_i \end{bmatrix},
$$

and more generally,

$$
M \frac{dy}{dt} = \begin{bmatrix} NN(y; \theta) \\ g(y, t) \end{bmatrix};
$$

the constraint equation can be set or fit using domain information as necessary. While such systems can currently be specified and trained in differentiable DAE solver systems, we noticed fitting instabilities similar to those of stiff neural ODEs arising in such cases. Indeed, Wanner and Hairer^{25} note that DAEs are a form of infinite stiffness, being posed as the limit of singularly perturbed stiff ODEs. Petzold^{43} famously showcased how “DAEs are not ODEs,” showing that they exhibit additional behavior that must be appropriately treated; thus, an analysis of such systems will be required to fully stabilize neural network DAE formulations, including the relationship of training techniques to the differential index.

## SUPPLEMENTARY MATERIAL

See the supplementary material for the sensitivity analysis to the neural network structures and the details in POLLU equations.^{16}

## ACKNOWLEDGMENTS

The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award No. DE-AR0001222. The views and opinions of the authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

## DATA AVAILABILITY

The data that support the findings of this study are openly available in GitHub at https://github.com/DENG-MIT/StiffNeuralODE, Ref. 44.

## REFERENCES

*33rd Conference on Neural Information Processing Systems*(Curran Associates, Inc., 2019), available at https://papers.nips.cc/paper/2019/hash/42a6845a557bef704ad8ac9cb4461d43-Abstract.html.

*et al.*, “Turbulence forecasting via neural ode,” arXiv:1911.05180 (2019).

*Numerical Analysis: Introduction*(Thompson, 1966), pp. 178–182.

*AAAI Spring Symposium: MLPS 2020*(AAAI, 2020).

*Encyclopedia of Applied and Computational Mathematics*(Springer, Berlin, 2015), pp. 339–345.

*International Conference on Machine Learning*(PMLR, 2013), pp. 1310–1318.

*International Conference on Machine Learning*(PMLR, 2015), pp. 448–456.

*International Conference on Machine Learning*(PMLR, 2017), pp. 136–145.