Mathematical models of epidemiological systems enable investigation of and predictions about potential disease outbreaks. However, commonly used models are often highly simplified representations of incredibly complex systems. Because of these simplifications, the model output, of, say, new cases of a disease over time or when an epidemic will occur, may be inconsistent with the available data. In this case, we must improve the model, especially if we plan to make decisions based on it that could affect human health and safety, but direct improvements are often beyond our reach. In this work, we explore this problem through a case study of the Zika outbreak in Brazil in 2016. We propose an embedded discrepancy operator—a modification to the model equations that requires modest information about the system and is calibrated by all relevant data. We show that the new enriched model demonstrates greatly increased consistency with real data. Moreover, the method is general enough to easily apply to many other mathematical models in epidemiology.

Potential epidemics of communicable diseases are a major health concern of the modern world, especially as city density, air and water pollution, and worldwide travel steadily increase. A stark example of this is the global coronavirus outbreak, already responsible for more than 2000 deaths around the world at the time of this article’s submission and more than 13 000 at the time of revision, approximately one month later. When faced with a potential outbreak, decision-makers such as health officials and medical professionals rely on mathematical models to aid their decision-making processes. However, oftentimes, these models are not consistent with the dynamical system they are designed to represent. The discrepancy between the output of a model and the real system is then a serious impediment, as it may decrease our confidence in the model, or even invalidate it entirely, so that it can no longer be used to aid in decision-making. When such a discrepancy is observed, we, as modelers, must either improve the model or somehow account for the discrepancy itself. While a direct model improvement is usually the most desirable solution, how to do so may be infeasible because of computational reasons, time constraints, or lack of domain knowledge. This paper provides a systematic method to instead account for the discrepancy itself, explored via a case study of the Brazilian Zika epidemic of 2016. The method is not a correction of the model output to data, but rather a modification of the model equations themselves by a so-called embedded discrepancy operator. The operator is designed with three critical properties in mind: interpretability, domain-consistency, and robustness. We show that including the embedded discrepancy operator greatly increases the fidelity of the model so that the model output and real data are now in fact consistent.

## I. INTRODUCTION

Mathematical models of scientific systems necessarily include simplifications about the actual system they aim to represent. In some cases, these simplifications do not preclude the use of the model to understand, investigate, and make decisions and predictions about the system. The quintessential example of this comes from the domain of classical mechanics: Newtonian mechanics ignores quantum and relativistic effects but, over a wide domain of masses and energies, provides a completely adequate model to describe the motion of macroscopic objects. Outside this domain, however, quantum or relativistic effects are no longer negligible, and Newtonian mechanics fails.

Just as classical mechanics is insufficient to describe a quantum system, the simplifications of a modern mathematical model may yield a discrepancy between the model and the system at hand too great to be ignored. This discrepancy is revealed during *model validation*, a process by which we check that the mathematical model is a reliable representation of reality. Without accounting for the discrepancy, one cannot trust the model output, much less use it to make predictions or decisions. In this case, there are two immediate options: (1) improve the model directly, i.e., from first principles or by including additional information and (2) represent the model discrepancy itself. While option 1 is usually desirable, it may not be feasible due to computational constraints or because we in fact lack the knowledge to directly improve the model. Then, we are left with option 2—represent the model discrepancy.

A common approach to account for a model discrepancy is through a *response discrepancy function*,^{1} also called a bias function. A response discrepancy function corrects the model output (or response) to data. Typically, an additive function on the model output is calibrated to data, either point-wise or with a parametric form. An advantage of this approach is that it can be implemented even if the model is a black box; that is, one only needs access to the model output, not the model itself. There are also disadvantages. In essence, a response discrepancy function builds a better interpolation to a single dataset over the range of usable data. Thus, this approach provides no basis for extrapolation to, for example, make a prediction about the probability of an epidemic next year. Furthermore, the action of this bias function is not interpretable, as it lies outside the model equations.

Instead, in this paper, we show how to modify equations directly to account for the model error with an *embedded discrepancy operator*. The advantages of this approach are threefold:

*Interpretability*: As the embedded operator appears within the model equations, and acts on state variables, the action of this operator is interpretable.*(Domain-)Consistency*: Information or constraints about the system can be incorporated into the discrepancy operator.*Robustness*: Discrepancy parameters can, and should, be calibrated over all the available data. This can include data from multiple scenarios or initial conditions.

These three properties—interpretability, consistency, and robustness—are designed to allow for decisions or extrapolative predictions. The inclusion of the embedded discrepancy operator into the original, or reduced model, yields an *enriched model*. In essence, the enriched model takes advantage of both mechanistic and statistical modeling: it retains the reduced mechanistic model and incorporates a general, statistically calibrated discrepancy model. Of course, this intrusive approach highly depends on the context. Here, we investigate the value of an embedded discrepancy operator in the context of epidemiology modeling.

Mathematical modeling of disease spread and outbreaks has a long and rich history; see Refs. 2–5, to name just a few. One of the most common classes of these models consists of coupled ordinary differential equations (ODEs), whose state variables include populations of the host (here, humans) and the disease carrier, or vector. These populations are further specified as either **S**usceptible, **E**xposed, **I**nfected, or **R**ecovered, leading to thus-named SEIR models.^{6} These models are relatively simple to implement and understand. Model parameters allow for the specification of transmission rates, incubation times, etc.

In particular, we investigate the model discrepancy of a well-studied SEIR-SEI model of the Zika outbreak in Brazil in 2016.^{7} In the previous work, after calibration of model parameters, the reduced model captured major tendencies of the outbreak. This was a major improvement compared to the reduced model with parameter values as suggested by current literature, which bore almost no usable resemblance to the real epidemic data. However, the calibrated model was still insufficient to precisely capture the dynamical behavior of the Zika outbreak. The current work extends previous works of the embedded model discrepancy, used in the contexts of combustion^{8} and ecological models,^{9} to the current domain of epidemiology. The discrepancy model is embedded within the coupled model differential equations, and the introduced discrepancy parameters are calibrated to the available data. The enriched model is shown to greatly outperform the original model.

To differentiate the current article from Ref. 9, note that the study was primarily a numerical study over a constrained set of scenarios. The interaction matrices (determining reduced and true model coefficients) follow a number of assumptions, such as negative-definiteness, yielding highly well-behaved models. In addition, the actual discrepancy was known exactly between each reduced model and the corresponding data-generating model; some of that information was used to further constrain the discrepancy model parameters. Thus, the previous paper provided no guarantee that this method would work in a highly applied, real-world model scenario without those strong assumptions.

Although the current modeling scheme only describes a single outbreak,^{10} and thus is not suitable to describe multiple incidences of the disease, this type of model is useful for guiding decision and policy makers. For example, Ref. 11 describes common questions faced by decision-makers, such as how many total people will be infected or even how to slow or prevent the outbreak from occurring. The objective of the present article is to reproduce with reasonable precision the data of a real epidemic. An appropriate enrichment and calibration of the model can then yield useful predictions about the dynamical system.

A final complication of this field is that new disease cases are often not reported, causing the outbreak numbers to appear artificially low. A study from 2018 estimates that as much as 90% of the cases are not reported.^{12} However, as we will see shortly, the issue of faulty data is insufficient to account for the discrepancy between the original model and observations. At the same time, the issue of under-reported cases does obviously play an important role during the model validation process. We try to disentangle the two problems—observational errors and model errors—by first considering only model errors and later allowing for both model errors and significant under-reporting. We consider how the enriched model performs in different possible under-reporting scenarios, such as 10% or 50%.

The rest of the paper is organized as follows. Section II describes the specific model of the 2016 Brazilian Zika outbreak and reconstructs previous results as a reference. Section III presents the formulation and calibration of the embedded discrepancy operator and also corresponding numerical results. Section IV explores the issue of under-reporting and how well the enriched model might perform given some sample under-reporting scenarios. A brief concluding discussion is given in Sec. V.

## II. ZIKA DISEASE MODELING

As mentioned in Sec. I, a typical approach to model the spread of an infectious disease is with a set of coupled ordinary differential equations. Here, we consider the well-known SEIR-SEI model, which describes coupled growth rates of species of interest, namely, susceptible, exposed, infected, and recovered humans, as well as susceptible, exposed, and infected vectors. In the case of Zika (also dengue and yellow fever) in Brazil, this vector is the *Aedes aegypti* mosquito.

In this section and in Sec. III, we assume that the data represent the actual truth. That is, the modeling objective is to achieve a model consistent with the given data.

### A. Model specification

We follow the SEIR-SEI model discussed by Dantas *et al*.,^{7} which included species $Sh,Eh,Ih,Rh,Sv,Ev,$ and $Iv$. Subscripts $h$ and $v$ indicate humans and vectors, respectively. The model also includes a state variable $C(t)$, which counts cumulative new cases over time. Then, the eight coupled equations are

where $Nh$ represents Brazil’s human population and $Nv$ represents the vector population. Nominal values of the interaction rates were determined by a careful literature study. These rates are

Let us collect these model parameters into the vector $\theta $, and let $\theta n$ refer to the nominal values given above.

To fully specify this model, it remains to provide initial conditions. These are^{13}

Finally, let us call the above model $Z$ and the state vector $x$, where $x$ is ordered in the same way as Eqs. (1a)–(1h) ($x1=Sh$, $x2=Eh$, and so on). Then, we may refer to the above model as

We may also refer to this model as the original model or a *reduced* model.

### B. Previous results compared to data

The work in Ref. 7 presents a detailed approach to calibrate this model to data. The data are made available by the Brazilian Ministry of Health^{14} and are available as a supplementary material in Ref. 7. Each data point $di,i=1,\u2026,52$, gives the recorded cumulative number of Zika cases at epidemiological week $i$ of the year 2016. In this section, we re-plot the results from that paper to serve as an immediate reference and comparison.

First, Fig. 1 compares the model output to data, using the above model with nominal parameters. The model with nominal parameter values, $Z(x;\theta =\theta n)$, severely underestimates the outbreak. Note that under-reporting cannot explain the observed discrepancy: higher reporting rates would only increase this discrepancy.

Clearly, the reduced model, given $\theta n$ parameter values, is a poor representation of reality. After observing such a discrepancy, the authors of Ref. 7 performed a sophisticated calibration of the model parameters $\theta $, using the Trust-Region-Reflective (TRR) method.^{15,16} Following that method, and using the public data to calibrate, two slightly different results are obtained by imposing a different set of constraints on the possible parameter values. The model outputs after both calibrations are shown in Fig. 2. Although the model output is now much closer to the data, still a detectable inconsistency persists. To be specific, note that after about week 30, the difference is tens of thousands of new cases of humans infected by Zika.

From a modeling perspective, the salient point is that the model is unable to capture the dynamical behavior of the outbreak, even after calibration. Assuming (as we are for now) that the given data are correct, this suggests that the problem lies with the model itself. Indeed, there are several possible sources of model errors that may impact not only this specific Zika model but many other epidemiological models. First, these SEIR models are not built from first principles, but rather from assumptions about interactive behavior, empirical information, and domain scientists’ intuition and experience. Second, these models provide a continuous, deterministic description of discrete interactions, which naturally involve some stochasticity.^{17} With large enough populations, though, this should not be a problem. Third, diseases do not spread in a closed system of host and vector. Rather, the spread of a disease involves other species such as livestock and non-human primates.^{18} Fourth, other modes of transmission are possible, such as sexual interactions^{19,20} and blood transfusions.^{21} Finally, there could certainly be additional time- or spatially-dependent effects, such as migrations and local dynamics,^{22,23} collective behavior, and time-delayed synchronization dynamics.^{24} Some modelers assume power-law dynamics of the networks,^{25} while others use fractional derivatives to describe relevant dynamics.^{26} Instead, this model assumes time-independent parameters and only model populations over time, not space. In summary, the spread of a contagious disease is an incredibly complex problem, and it remains unclear what is critically missing from the model or how to best improve it directly from epidemiological information. Here, then, we can turn to the field of model discrepancy to help.

## III. EMBEDDED DISCREPANCY OPERATOR

Before describing the embedded discrepancy operator, we illustrate the overall relationship between the different models considered in this paper. A schematic diagram is shown in Fig. 3. As seen in Sec. II, after calibrating the model parameters to data, there is still a significant discrepancy between the model output and the data. That is, the answer to Q2 (and Q1) in Fig. 3 is “No,” and therefore, we move to state (M3): we model this discrepancy with the goal of reaching states (U1) and ultimately (E3).

### A. Proposed approach: Embedded discrepancy operator

Previous work has shown that missing dynamics on the right-hand side (RHS) of differential equations can be approximated with “extra” information about the existing state variables,^{27–29} such as memory or derivative information. Exploiting this, we pose the following enriched model:

where

with $\kappa =(\kappa 1,\u2026,\kappa 7)$ and $\lambda =(\lambda 1,\u2026,\lambda 7)$. That is, the differential equation for $xi,i=1,\u2026,7$ in the reduced model is modified by two additional terms, one linear in $xi$ and the other linear in $dxi/dt$. The discrepancy parameters are collected into the vector $\varphi $,

Note, the RHS for $dC(t)/dt$ is not modified because this function simply counts the exposed cases as given by the model. A change here would be analogous to modifying the model output itself, not interpretable, and not reliable for any type of decision or prediction.

As mentioned in Sec. I, this type of discrepancy model can be constrained to the available information about the system. For example, the discrepancy operator for a combustion reaction in Ref. 8 is constrained to satisfy conservation of atoms and conservation of energy. In this scenario, we do not have such strict constraints; see Ref. 9 for constrained operators in similar Lotka–Volterra models.

Altogether, the enriched model is

Denote the enriched model as $E(x,dx/dt;\theta ,\varphi )$, i.e.,

### B. Calibration details

In contrast with the calibration process of Ref. 7 described in Sec. II B, here we use a Bayesian framework^{30,31} to calibrate the discrepancy parameters $\varphi $ given the data $d$. This allows for the representation of uncertainty about these parameters and also how this uncertainty propagates to the model output.

Recall that the observations are cumulative cases at each epidemiological week, $d={di},i=1,\u2026,52$. For each $di$, let $yi$ be the corresponding model output. We assume that the measurements are independent and that the measurement error is additive and Gaussian as such,

with standard deviation $\sigma \u03f5=5\xd7103$. This standard deviation value seems reasonable as the uncertainty in reported values is high and because the observations are on the order of tens to hundreds of thousands.

In the Bayesian framework, the conditional probability density of $\varphi $ given the data $d$, $ppo(\varphi |d)$, is called the *posterior* and given as

We specify each term on the RHS above:

*Prior:*The prior density $ppr(\varphi )$ collects the prior knowledge we have about the parameters. Specifically, these parameters are assumed independent and uniform in the prior, where each $ppr(\varphi i)=U(\u22120.3,0.15)$, and therefore,(13)$ppr(\varphi )=\u220fi=114ppr(\varphi i).$*Likelihood:*The likelihood $pli(d|\varphi )$ tells us how likely it is to observe $d$, given a particular value of $\varphi $. The measurement error model in Eq. (11) yields the likelihood functionwhere $\Sigma =\sigma \u03f52I$.(14)$pli(d|\theta )=12\pi |\Sigma |exp\u221212(d\u2212y)T\Sigma \u22121(d\u2212y),$*Evidence:*The evidence $pev(d)=\u222bpli(d|\varphi )ppr(\varphi )d\varphi $ gives the probability of observing the data $d$. This is typically difficult to compute, but note that it is not a function of $\varphi $. With a Markov chain Monte Carlo (McMC) approach, the posterior is found by computing ratios of the RHS in Eq. (12) (for different values of $\varphi $), and therefore, fortunately, this term cancels.

Under this framework, the discrepancy parameters $\varphi $ are calibrated using the DRAM method, developed by Ref. 32 and implemented through the library QUESO.^{33} The complete code for this project is available at https://github.com/rebeccaem/zika.^{34}

Table I presents posterior means and standard deviations for each of the 14 marginal posterior densities of the discrepancy parameters.

. | Posterior . | Posterior . |
---|---|---|

Parameter . | mean . | standard deviation . |

κ_{1} | −0.04 | 0.005 |

κ_{2} | −0.26 | 0.02 |

κ_{3} | 0.10 | 0.02 |

κ_{4} | −0.04 | 0.13 |

κ_{5} | 0.07 | 0.01 |

κ_{6} | 0.11 | 0.02 |

κ_{7} | 0.10 | 0.01 |

λ_{1} | 0.00 | 0.09 |

λ_{2} | −0.15 | 0.08 |

λ_{3} | −0.18 | 0.08 |

λ_{4} | −0.15 | 0.09 |

λ_{5} | −0.02 | 0.10 |

λ_{6} | −0.15 | 0.10 |

λ_{7} | −0.06 | 0.09 |

. | Posterior . | Posterior . |
---|---|---|

Parameter . | mean . | standard deviation . |

κ_{1} | −0.04 | 0.005 |

κ_{2} | −0.26 | 0.02 |

κ_{3} | 0.10 | 0.02 |

κ_{4} | −0.04 | 0.13 |

κ_{5} | 0.07 | 0.01 |

κ_{6} | 0.11 | 0.02 |

κ_{7} | 0.10 | 0.01 |

λ_{1} | 0.00 | 0.09 |

λ_{2} | −0.15 | 0.08 |

λ_{3} | −0.18 | 0.08 |

λ_{4} | −0.15 | 0.09 |

λ_{5} | −0.02 | 0.10 |

λ_{6} | −0.15 | 0.10 |

λ_{7} | −0.06 | 0.09 |

### C. Numerical results

Figure 4 shows the enriched model response compared to the data. Uncertainty in discrepancy parameters $\varphi $ is propagated through to the model output: the thick centerline shows the median response, the darker band shows the 50% confidence interval, and the lighter band the 95% confidence interval (CI). Importantly, all observations are in fact captured by the 95% CI.

### D. Interpretation

The embedded discrepancy operator can be interpreted from two different points of view. The first is a more mathematical lens, although not disconnected from the physics: we interpret the discrepancy operator as a linear feedback signal. The second, in contrast, relies on an epidemiological basis: here, we interpret the corrections made by the discrepancy operator as effects due to causes of a biological origin.

This second point of view is especially interesting for the goal of elucidating potential deficiencies in the baseline model. As a wide range of issues must yet be explored to obtain a consistent epidemiological interpretation, this line will not be addressed in this paper but will be the topic of future work. Instead, the first perspective is explored further below.

In light of the theory of systems with a linear feedback, the discrepancy operator is a linear combination of the system state and its first order time-derivative, thus defining a signal that feeds the original nonlinear system with information from the present state and its rate of change. Roughly speaking, the parameters of the enrichment can be seen as “gains” that adjust to drive the epidemic curve generated by the model toward the real observational curve. These parameters are identified via Bayesian inference, with prior distributions that admit negative and positive values: thus, the gains can define both negative and positive feedbacks. The realizations of the enriched dynamical system may then admit a superposition between negative and positive feedbacks, generating a kind of competition between the model’s stimuli signals. This competition stabilizes some coordinates of the state vector and destabilizes others, while the global effect materializes in the corrected (enriched) epidemic curve.

To understand more deeply why this competition between corrective signals produces such an effective correction, consider the injection and removal of information (i.e., energy) in the system, as well as its flow between the different coordinates of the state (groups of human and mosquito populations). To make an analogy with the dynamics of mechanical oscillators, the feedback effects proportional to the state derivative produce a kind of “viscous force,” which introduces (via a positive feedback) or removes (via a negative feedback) information into or from the epidemiological state variables. Furthermore, the terms proportional to the system state correspond to a kind of “restoring force,” which redistributes information among the different population groups. The intensity of this additional information flow between the different coordinates of the system state is controlled by the new time scales induced by the feedback signals, which are nonlinear functions of the gains $\kappa i$ and $\lambda i$, $i=1,\u2026,7$.

It should be noted that the approach presented in the paper is neither control theory in the literal sense, since no information from the biological system is obtained in real-time, nor is an action signal sent to the real system to adjust its epidemic curve trajectory. Therefore, the observability and controllability issues related to the real system are not taken into account. The notion of duality between parameters and gains is explored above only as a way to pave an initial reasonable interpretation of how the discrepancy operator acts to correct the model’s response, by promoting additional information flows between the different compartments of the populations.

## IV. EFFECTS OF UNDER-REPORTING

Now, let us also consider the scenario that the data are in fact under-reported. First, suppose 10% of cases are not reported so that $di=0.9di\u2217$, where $di\u2217$ represents the value of observations we would expect without under-reporting. (This is not claiming that $di\u2217$ is the exact true value, as we still expect an unbiased measurement error.) The discrepancy parameters are re-calibrated, and the corresponding model response is shown in Fig. 6. Again, all (modified) observations are captured by the enriched model response.

Finally, we suppose that only 50% of cases are reported so that $di=0.5di\u2217$. These results are shown in Fig. 7. Even here, the enriched model adapts to this scenario and covers the dynamical behavior of the outbreak in this highly under-reported scenario.

## V. CONCLUSION

This work presents an initial endeavor to represent the model discrepancy of an epidemiological system, namely, the 2016 Brazilian Zika outbreak. Preliminary results are promising—compared to the original model, the embedded discrepancy operator greatly improves the consistency between the model output and available observations.

The general applicability of this method to other epidemiological models is best understood in two parts. One one hand, the formulation of the enriched model equations is immediately applicable to another model, if that model is also comprised of a set of ODEs. That is, nothing prevents a modeler from testing the proposed model enrichment framework in that case. On the other hand, the particular details of the calibration and whether or not this approach is in fact able to capture the discrepancy between the original model and the data may depend on domain specific information. Future studies will test this approach in other outbreak data sets.

Many other open questions remain. In Sec. III D, we discuss two possible interpretations of the embedded discrepancy operator. First, the linear terms added into the differential equations resemble a type of a linear feedback, with one term proportional to the state variable and one a differential “control” term. In this case, the discrepancy operator is not actually controlling the real system itself but driving the model system to the target (epidemic data). Second, and perhaps more importantly, would be a physiological interpretation of the discrepancy terms. However, beyond the scope of this paper, a deep exploration of interpretability, connections to the linear feedback theory, and an explanation of these discrepancy terms in a physiological sense will be the subject of immediate future work.

Related to the point above, we would also like to understand what the calibrated discrepancy operator implies about the missing dynamics of the reduced model. That is, can we use the learned discrepancy model to infer what the reduced model is most critically lacking? This question is currently under study, also in the context of ecological models (which have a similar structure, as sets of coupled ODEs). Doing so would allow the use of these embedded operators to function as a type of modeling tool, as opposed to only a model correction.

Finally, this study would be perhaps more convincing with more trustworthy data. How to achieve this, though, is just as complex a problem as the epidemiological system itself, as it involves accessibility to healthcare in remote regions, public awareness of mandatory reporting policies, and incentives and rewards for timely reporting of a communicable disease.

## ACKNOWLEDGMENTS

The authors acknowledge Professor Davi Antônio dos Santos (ITA Brazil) for very helpful discussions, especially regarding the linear feedback aspect of this work. The second author acknowledges the financial support given by the Brazilian agencies Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001 and the Carlos Chagas Filho Research Foundation of Rio de Janeiro State (FAPERJ) under Grant Nos. 210.021/2018 and 211.037/2019.

## DATA AVAILABILITY

The data are available in a database maintained by the Brazilian Ministry of Health (Ref. 14) and also available as supplementary material in Ref. 7. All data (and code) needed to reproduce results in this paper are also included in the code base at https://doi.org/10.5281/ZENODO.3666845, Ref. 34.

## REFERENCES

*et al.*,

*et al.*, “

*et al.*“

*et al.*, “

*Euro-Par 2011: Parallel Processing Workshops*(Springer, 2012), pp. 398–407.