This paper analyzes the sensitivity of antineutrino count rate measurements to changes in the fissile content of civil power reactors. Such measurements may be useful in IAEA reactor safeguards applications. We introduce a hypothesis testing procedure to identify statistically significant differences between the antineutrino count rate evolution of a standard “baseline” fuel cycle and that of an anomalous cycle, in which plutonium is removed and replaced with an equivalent fissile worth of uranium. The test would allow an inspector to detect anomalous reactor activity, or to positively confirm that the reactor is operating in a manner consistent with its declared fuel inventory and power level. We show that with a reasonable choice of detector parameters, the test can detect replacement of 82 kg of plutonium in 90 days with 95% probability, while controlling the false positive rate at 5%. We show that some improvement on this level of sensitivity may be obtained by various means, including use of the method in conjunction with existing reactor safeguards methods. We also identify a necessary and sufficient minimum daily antineutrino count rate and a maximum tolerable background rate to achieve the quoted sensitivity, and list examples of detectors in which such rates have been attained.

## I. INTRODUCTION

The International Atomic Energy Agency (IAEA) nuclear safeguards regime is designed to detect diversion of fissile material from civil nuclear fuel cycle facilities to weapons programs.^{1} In previous work, we predicted^{2} and demonstrated^{3–5} that cubic meter scale antineutrino detectors, operating at a distance of tens of meters from a 1 gigawatt electric (GWe) pressurized water reactor (PWR), can directly detect changes in operational status, power levels, and fissile inventory of the reactor core. Similar results were achieved earlier by a Russian group.^{6} These metrics are all of potential use for the IAEA reactor safeguards regime.

In this paper, we demonstrate a possible methodology for using antineutrino detection in a safeguards context. We use a hypothesis test to identify statistically significant differences between the antineutrino count rate evolution of a standard “baseline” fuel cycle and that of an anomalous cycle in which 82 kg of fissile plutonium have been removed and replaced with the equivalent fissile worth of uranium. (This quantity of plutonium represents the removal and replacement of ten partially burnt assemblies with ten fresh fuel assemblies.) The test would allow an inspector to detect anomalous reactor activity, or to positively confirm that the reactor is operating in a manner consistent with its declared fuel inventory and power level. We show that with a reasonable choice of detector parameters, the test can detect the net removal from the core of 82 kg of fissile plutonium ($^{239}\mathrm{Pu}$ + $^{241}\mathrm{Pu}$) in 90 days with 95% probability, while controlling the false positive rate at 5%.

The purpose of the study is to explore this possible alternative method of reactor safeguards, by quantifying the sensitivity of an antineutrino count rate measurement to anomalous changes in fissile content. In describing our example, we avoid the standard IAEA term “diversion,” since we do not explicitly specify the fate of the removed plutonium. In particular, we are not asserting that the removal of plutonium in this example could not be uncovered by existing IAEA safeguards methodologies.

One of IAEA’s inspection goals is to be able to detect diversion of 8 kg of plutonium from a civil nuclear facility in a 90-day period.^{7} This amount of plutonium is designated by the IAEA as a “significant quantity” (SQ), or “the approximate quantity of nuclear material in respect of which, taking into account any conversion process involved, the possibility of manufacturing a nuclear explosive device cannot be excluded.”^{8} Our current sensitivity to anomalous reactor operation caused by removal of plutonium is at the level of several SQ. Enhancements to the detector, including the capability to measure the antineutrino energy spectrum, may allow for detection of even smaller changes in the reactor’s fissile content. While the demonstrated sensitivity is not to actual diversion, but to anomalous reactor operations, we expect that this method can be used in conjunction with existing IAEA safeguards methodologies to achieve IAEA SQ goals for diverted material. We note that other IAEA surveillance and accountancy measurement devices do not in isolation reach the SQ goals, but are used as part of a comprehensive accountancy strategy. Examples include Cherenkov light monitors in spent fuel cooling ponds, which are not sensitive at the SQ level, but which provide continuity of knowledge and confirm the presence of large numbers of radioactive spent fuel assemblies.

We begin by briefly describing the relationship between the antineutrino count rate and the reactor fissile inventory, and contrast our method for anomaly detection with current IAEA reactor safeguards practice. Next, we describe the test procedure and its inputs, including the fuel loadings of the baseline and anomalous scenario cycles. We then examine the statistical power of the procedure to distinguish between the two cycles and thereby identify an anomaly in reactor operations. We include the effects of counting statistics, background, a systematic bias in detector response, deliberate malfeasance on the part of the reactor operator, the duration of data acquisition, and simulation errors. We also establish a range of acceptable detector masses, intrinsic efficiencies and standoff distances that would permit discovery of the anomaly in our example. We conclude by summarizing the potential impact of this approach on current IAEA safeguards and useful next steps.

## II. CURRENT IAEA REACTOR SAFEGUARDS AND ANTINEUTRINO-BASED SAFEGUARDS

Currently, the IAEA uses nuclear material accountancy, as well as containment and surveillance (CS) techniques, to verify the quantities of fuel used in and discharged from reactors. Nuclear material accountancy refers to a quantitative and independent check of fuel inventories, performed by the Agency. At reactors, the predominant material accountancy method is item accountancy, or counting of items (fresh and spent fuel assemblies and rods) considered to contain fixed and known quantities of fissile material. In addition, the presence and integrity of radioactive spent fuel assemblies and rods in cooling ponds at the reactor is checked by Cherenkov light measurements and other methods. CS techniques, such as video cameras and seals on the reactor head, are also used.^{8}

By contrast, antineutrino-based safeguards offer a form of near-real-time and nondestructive bulk accountancy. Bulk accountancy methods provide estimates of the total fissile mass without relying on assumptions about the mass contents of premeasured items. Examples include coincidence neutron counting, mass spectroscopy and chemical analyses. As such, antineutrino-based methods are complementary to the existing safeguards regime, since they provide independent quantitative information about fissile material inventories, as long as the reactor is operational. Among other uses, this information can provide independent confirmation that the fuel inventory throughout the reactor cycle is consistent with operator declarations. In principle, the inventory estimate so derived can also be used to check for shipper/receiver differences, both for fresh fuel taken in by the operator and for spent fuel sent to downstream reprocessing or storage facilities.

While the measurement capability appears promising, its actual import for IAEA safeguards is beyond the scope of this paper. As an example of the complications that arise, we note that for existing power reactors, the antineutrino-based inventory estimates would have to be reconciled with and integrated into the full accounting of all materials at the reactor site, including that in spent fuel cooling ponds. For such sites, with decades of accumulated and largely unassayed fuel, containing many tens of tons of fissile material, such accounting may prove impractical. For this reason, we recommend that a more detailed analysis of the capability be conducted by safeguards experts, both for existing and future reactor safeguards regimes. In addition, we note that the antineutrino flux from spent fuel near reactors contributes negligibly to the measured rate in a practical detector. Even in very unfavorable geometries, the contribution at a typical plant is estimated to be only approximately 1%.^{9}

## III. MODELING THE ANTINEUTRINO COUNT RATE FOR SAFEGUARDS APPLICATIONS

A change in fissile mass content in a reactor core, such as that occurring when uranium is consumed and plutonium produced in the course of a reactor fuel cycle, creates a measurable systematic shift in the antineutrino count rate (and energy spectrum). In previous work,^{5} we have shown that the antineutrino count rate is reduced by about 10% relative to its initial value over the course of a typical 1.5-yr PWR fuel cycle. This reduction occurs even when (as is typical) the reactor maintains constant power throughout the cycle; therefore, monitoring the antineutrino count rate provides information about core fissile inventory evolution that is not accessible through a measurement of the reactor power alone.

In a safeguards context, the measured antineutrino count rate evolution would be compared to a predicted count rate evolution, assuming normal conditions (i.e., no removal of plutonium) over some portion or all of the fuel cycle. The predicted evolution under normal operating conditions will be referred to as the “baseline scenario” for the remainder of this paper. The prediction is obtained from a reactor simulation code which takes as inputs the operator-declared thermal power and initial fissile isotopic masses, as well as other reactor parameters, and returns fission rates for each isotope. The individual fission rates are then converted into a predicted emitted antineutrino flux using standard analytical formulas. The emitted antineutrino flux is finally converted to a measured antineutrino count rate, using a detector response function derived from experiment and modeling.

In the present work, we simulate both the baseline and anomalous antineutrino count rates over the course of the fuel cycle for use in our hypothesis test. We use an ORIGEN simulation^{10} of the core of unit 2 of the San Onofre Nuclear Generating Station (SONGS), originally published in Ref. 11. The detector response function was derived from the SONGS1 experiment,^{2} for which the antineutrino signal was approximately 360 counts per day at beginning of cycle after subtraction of reactor-off background.

Following Ref. 6, we describe the PWR core antineutrino count rate evolution $N_{\bar{\nu}}(t)$ at time *t* in the fuel cycle as a product of two time-dependent factors:

$$N_{\bar{\nu}}(t) = \alpha\, P_{th}(t)\,[1+k(t)]. \tag{1}$$

$P_{th}(t)$ is the reactor thermal power. The term $[1+k(t)]$ depends on the changing fissile isotopic content of the core, embodied in the parameter *k*(*t*). $\alpha$ is a constant related to the detector mass, efficiency, and standoff distance. This parameterization highlights the direct dependence of the count rate on the thermal power, an important consideration we return to in Sec. V E.

For the PWR core being considered here, Eq. (1) is well approximated by a quadratic function of time:

$$N_{\bar{\nu}}(t) = \beta_0 + \beta_1 t + \beta_2 t^{2}. \tag{2}$$

The quadratic model in Eq. (2) is valid for PWRs loaded with typical low enriched uranium (LEU) fuel. Other fuel loadings and reactor types can result in an antineutrino count rate evolution that is substantially different in form from Eq. (2).

The coefficients $\beta_0$, $\beta_1$, and $\beta_2$ in Eq. (2) can be used to detect a departure from the baseline scenario. The measured antineutrino count rate evolution can be used to estimate the coefficients, which can then be compared to those predicted for the baseline scenario. A statistically significant difference in at least one of the estimated coefficients from its baseline counterpart could indicate a departure of the observed evolution from that of the baseline scenario.
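As a minimal sketch of estimating the three coefficients of the quadratic model, the following Python example fits a quadratic to synthetic daily counts. The coefficient values are hypothetical, chosen only so that the rate falls from roughly 375 to roughly 335 counts per day over 500 days; they are not taken from the reactor simulation.

```python
import numpy as np

# Hypothetical coefficients (illustrative only, not from the ORIGEN simulation):
# about 375 counts/day at the start and about 335 counts/day after 500 days.
beta_true = np.array([375.0, -0.10, 4.0e-5])  # beta_0, beta_1, beta_2

rng = np.random.default_rng(0)
t = np.arange(1, 501, dtype=float)            # day index within the cycle
mean_rate = beta_true[0] + beta_true[1] * t + beta_true[2] * t**2
counts = rng.poisson(mean_rate)               # daily counts with Poisson fluctuations

# np.polyfit returns the highest power first; reverse to get (beta_0, beta_1, beta_2).
beta_hat = np.polyfit(t, counts, deg=2)[::-1]

def fitted_rate(day):
    """Evaluate the fitted quadratic at a given day."""
    return beta_hat[0] + beta_hat[1] * day + beta_hat[2] * day**2

print("estimated coefficients:", beta_hat)
print("fitted rate at mid-cycle:", fitted_rate(250.0))
```

The estimated coefficients scatter around the assumed values because of counting statistics, which is the reason the comparison with the baseline has to be made statistically.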

## IV. TESTING FOR ANOMALOUS ACTIVITY

Following the model in Eq. (2), the true baseline evolution of antineutrino count rate as a function of time *t* in the fuel cycle is given by

$$N_{\bar{\nu}}^{(B)}(t) = \beta_0^{(B)} + \beta_1^{(B)} t + \beta_2^{(B)} t^{2}. \tag{3}$$

(The superscript “*B*” in the above equation and for the remainder of the paper indicates “baseline”.) As discussed earlier, the predicted baseline evolution is obtained from a reactor simulation, which is subject to both random and systematic errors. To account for random errors arising from uncertainties on input parameters, such as thermal power, we represent the baseline count rate at time *t* as a Gaussian random variable with mean equal to the predicted simulation value and standard deviation equal to 1% of this value:

$$N_{\bar{\nu}}^{(B)}(t) \sim \mathcal{N}\!\left(\mu(t),\,[0.01\,\mu(t)]^{2}\right). \tag{4}$$

$\mu(t)$ is the baseline evolution antineutrino count rate value at time *t* predicted by the simulation and can be modeled as

$$\mu(t) = \beta_0^{(B)} + \beta_1^{(B)} t + \beta_2^{(B)} t^{2}. \tag{5}$$

The assumed 1% random error is typical for these and other ORIGEN simulations.^{11,12} The simulations are also limited in accuracy by systematic errors, arising from bias in the input antineutrino spectral densities, as well as other factors. These systematic shifts can be removed by a calibration procedure and are treated separately in Sec. V D.

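Let $\{N_{\bar{\nu}}^{(M)}(t)\}$ denote the measured count rate evolution (the superscript “*M*” indicates “measured”) to be tested against the baseline scenario evolution. Since the measurements follow Poisson statistics,

$$N_{\bar{\nu}}^{(M)}(t) \sim \mathrm{Poisson}\!\left(\beta_0^{(M)} + \beta_1^{(M)} t + \beta_2^{(M)} t^{2}\right). \tag{6}$$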

To determine whether the measured antineutrino count rate evolution deviates significantly from that of the baseline, we can compare the baseline coefficient $\beta_i^{(B)}$ in Eq. (5) and the measured coefficient $\beta_i^{(M)}$ in Eq. (6) for each $i=0,1,2$. This requires us to estimate each coefficient.

This can be done using least squares (LS) regression of both the modeled baseline count rates $N_{\bar{\nu}}^{(B)}(t)$ and the measured count rates $N_{\bar{\nu}}^{(M)}(t)$ on *t* and *t*^{2}. By construction, the modeled baseline count rates are Gaussian, and the high Poisson statistics of the measured count rates make them approximately Gaussian. Furthermore, as noted in Sec. III, the counts change by approximately 10% over the course of the cycle and so does the count variance, making it nearly constant. Under these conditions, LS regression should produce statistically near-optimal coefficient estimates.

To remove instabilities in coefficient estimates arising from a high correlation between *t* and *t*^{2} in Eq. (2), we follow the standard practice of performing LS regression on deviations from the sample mean, $(t-\bar{t})$.^{13} That is, we reparameterize the model for the measured count rates $N_{\bar{\nu}}^{(M)}(t)$ in Eq. (6) as

$$N_{\bar{\nu}}^{(M)}(t) \sim \mathrm{Poisson}\!\left(\gamma_0^{(M)} + \gamma_1^{(M)}(t-\bar{t}) + \gamma_2^{(M)}(t-\bar{t})^{2}\right). \tag{7}$$

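We must also reparameterize the model in Eq. (5). The baseline count rate $N_{\bar{\nu}}^{(B)}(t)$ still follows Eq. (4), but the baseline mean function $\mu(t)$ is now given by

$$\mu(t) = \gamma_0^{(B)} + \gamma_1^{(B)}(t-\bar{t}) + \gamma_2^{(B)}(t-\bar{t})^{2}. \tag{8}$$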
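The effect of centering can be checked numerically. The sketch below, assuming daily measurements over the first 90 days, compares the correlation between the two regressors before and after centering, and includes a hypothetical helper `beta_to_gamma` that maps uncentered quadratic coefficients to centered ones by expanding the quadratic around the sample mean.

```python
import numpy as np

t = np.arange(1, 91, dtype=float)   # assumed: one measurement per day for 90 days
tc = t - t.mean()                   # deviations from the sample mean

# Correlation between the linear and quadratic regressors.
corr_raw = np.corrcoef(t, t**2)[0, 1]          # close to 1: nearly collinear
corr_centered = np.corrcoef(tc, tc**2)[0, 1]   # essentially 0 for evenly spaced days

def beta_to_gamma(beta, t_bar):
    """Map (beta_0, beta_1, beta_2) to centered coefficients (gamma_0, gamma_1, gamma_2).

    Obtained by substituting t = (t - t_bar) + t_bar into the quadratic.
    """
    b0, b1, b2 = beta
    return (b0 + b1 * t_bar + b2 * t_bar**2, b1 + 2.0 * b2 * t_bar, b2)

print(f"corr(t, t^2)                 = {corr_raw:.3f}")
print(f"corr(t - tbar, (t - tbar)^2) = {corr_centered:.1e}")
```

Both parameterizations describe the same curve, so centering changes only the numerical conditioning of the fit, not the fitted evolution.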

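Each coefficient $\gamma_i^{(M)}$ in Eq. (7) can then be compared to its counterpart $\gamma_i^{(B)}$ in Eq. (8) by testing the following pairs of hypotheses for $i=0,1,2$:

$$H_0^{(i)}:\ \gamma_i^{(M)} = \gamma_i^{(B)} \quad \text{versus} \quad H_a^{(i)}:\ \gamma_i^{(M)} \neq \gamma_i^{(B)}. \tag{9}$$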

The test procedure then consists of the following steps (assume $i=0,1,2$ throughout the paper):

1. Generate $\{N_{\bar{\nu}}^{(B)}(t)\}$ according to Eq. (4), with $\mu(t)$ taken from the baseline reactor simulation.

2. Perform LS regression of $\{N_{\bar{\nu}}^{(B)}(t)\}$ from step 1 and measured counts $\{N_{\bar{\nu}}^{(M)}(t)\}$ on $(t-\bar{t})$ and $(t-\bar{t})^{2}$ to obtain coefficient estimates $\hat{\gamma}_i^{(B)}$ and $\hat{\gamma}_i^{(M)}$ and standard errors $se(\hat{\gamma}_i^{(B)})$ and $se(\hat{\gamma}_i^{(M)})$.

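3. Obtain test statistics

$$T_i = \frac{\hat{\gamma}_i^{(M)} - \hat{\gamma}_i^{(B)}}{\sqrt{se(\hat{\gamma}_i^{(M)})^{2} + se(\hat{\gamma}_i^{(B)})^{2}}} \tag{10}$$

and their corresponding *p*-values, given by

$$p_i = P\left(|S| \ge |T_i|\right), \tag{11}$$

where *S* has a Student’s *t* distribution with $2\cdot(n-3)$ degrees of freedom with *n* equal to the number of count rate measurements.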

4. Determine the acceptable false positive (FP) rate (see Sec. V) and apply the false discovery rate (FDR) procedure, described in Ref. 14, to determine whether to reject each of the $H_0^{(i)}$ in favor of $H_a^{(i)}$ in Eq. (9). (As described in detail in Ref. 14, the FDR procedure controls the false positive error rate associated with testing multiple hypotheses.) If at least one of the null hypotheses is rejected, conclude that the measured evolution deviates significantly from that of the baseline.
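The four steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used for the results in this paper. In particular, the FDR procedure is assumed to be the Benjamini-Hochberg step-up rule, the *p*-values use a normal approximation to the Student's *t* distribution (reasonable when the number of measurements is large), and the test is taken to be two-sided. All function names and numerical inputs are hypothetical.

```python
import math
import numpy as np

def fit_centered_quadratic(t, y):
    """LS fit of y on (t - tbar) and (t - tbar)^2; returns estimates and standard errors."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    tc = t - t.mean()
    X = np.column_stack([np.ones_like(tc), tc, tc**2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - 3)            # residual variance, n - 3 d.o.f.
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef, se

def two_sided_p(stat):
    """Two-sided p-value from a normal approximation to Student's t (assumption)."""
    return math.erfc(abs(stat) / math.sqrt(2.0))

def benjamini_hochberg(pvals, q):
    """Rejection flags from the Benjamini-Hochberg step-up rule at level q (assumed FDR rule)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])            # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject

def anomaly_test(t, measured, mu_baseline, q=0.05, rng=None):
    """Return (anomaly flagged?, p-values) for one measured evolution."""
    rng = np.random.default_rng() if rng is None else rng
    baseline = rng.normal(mu_baseline, 0.01 * mu_baseline)    # step 1
    g_b, se_b = fit_centered_quadratic(t, baseline)           # step 2
    g_m, se_m = fit_centered_quadratic(t, measured)
    t_stats = (g_m - g_b) / np.sqrt(se_m**2 + se_b**2)        # step 3
    pvals = [two_sided_p(s) for s in t_stats]
    return bool(benjamini_hochberg(pvals, q).any()), pvals    # step 4

# Hypothetical inputs, for illustration only.
rng = np.random.default_rng(1)
t = np.arange(1, 91, dtype=float)
mu = 1875.0 - 0.5 * t + 2.0e-4 * t**2     # assumed baseline prediction (counts/day)
measured = rng.poisson(1.05 * mu)         # assumed measurement with a large 5% excess
flag, pvals = anomaly_test(t, measured, mu, rng=rng)
print("anomaly flagged:", flag)
```

The 5% excess in the usage example is deliberately large so that the outcome is unambiguous; realistic anomalies are smaller and their detection probability has to be assessed statistically.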

## V. TEST PERFORMANCE

The test can produce two types of errors: it could find a significant difference from the baseline in at least one coefficient when the evolution was in fact produced by a baseline scenario (a false positive, or FP, result), or it could miss a significant difference in all three coefficients when the evolution was different from the baseline (a false negative result).

The complement of the false negative rate is the true positive (TP) rate. The TP rate is defined as the probability of finding a significant difference in at least one of the coefficients from its baseline counterpart when the evolution in question is in fact different from that of the baseline. A good test has a low FP rate and a high TP rate. There is a trade-off between these two quantities: all else being equal, increasing the TP rate of the test comes at a price of a higher FP rate. To study this trade-off, we generated receiver operating characteristic (ROC) curves for each of the cases we considered. A ROC curve shows the TP rate as a function of the FP rate, thus allowing one to determine the TP rate associated with an acceptable FP rate.

### A. Simulation

To estimate the TP rate for a given FP rate on a ROC curve, we carried out a simulation (not to be confused with the reactor simulation). This simulation was performed for a scenario in which ten once-burned assemblies with the highest plutonium content are removed and replaced with 3.91% enriched fresh fuel. This represents the removal of 82 kg of fissile Pu ($^{239}\mathrm{Pu}$ + $^{241}\mathrm{Pu}$) from the core. Complete fissile inventories at beginning of cycle for the baseline and anomalous scenarios are shown in Table I.

TABLE I. Beginning-of-cycle isotopic inventories for the baseline and anomalous scenarios.

| Isotope | Baseline mass (kg) | Anomalous scenario mass (kg) | Mass difference (kg) |
|---|---|---|---|
| ^{235}U | 2834 | 2849 | 15 |
| ^{238}U | 82912 | 83351 | 439 |
| ^{239}Pu | 225 | 152 | −73 |
| ^{241}Pu | 21 | 12 | −9 |

Figure 1 shows the antineutrino count rate evolutions predicted by the ORIGEN simulation for the baseline scenario (solid green) and the anomalous scenario (red). (The shifted baseline evolution, shown in dashed green, is discussed in Secs. V D and V E.)

A given point on a ROC curve is obtained as follows. One hundred thousand pairs of anomalous and baseline evolutions are generated, with the former from a Poisson distribution with the coefficients $\gamma_i^{(M)}$, and the latter from a Gaussian distribution according to Eqs. (4) and (8) with the coefficients $\gamma_i^{(B)}$. Both sets of coefficients are obtained from the ORIGEN reactor simulation for the given scenario and time period. We then apply steps 2 through 4 of the test procedure introduced in Sec. IV at the given FP rate (the *x* coordinate of the point on the ROC curve) to each pair of evolutions. We then estimate the TP rate (the *y* coordinate of the point on the ROC curve) with the fraction of the 100,000 evolution pairs for which at least one of the null hypotheses in Eq. (9) is rejected. This is repeated for a sequence of FP rate values from 0 to 1, thus producing a curve. The large number of generated evolutions ensures that every TP rate estimate is within 1% of the relevant true TP rate.
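The precision of these Monte Carlo estimates follows from the binomial standard error of an estimated proportion. A short check, assuming independent trials and interpreting "within 1%" as one percentage point, is sketched below; the helper name is hypothetical.

```python
import math

def worst_case_standard_error(n_trials):
    """Largest binomial standard error of an estimated rate, attained at p = 0.5."""
    return 0.5 / math.sqrt(n_trials)

se = worst_case_standard_error(100_000)
print(f"worst-case standard error: {se:.5f}")       # about 0.0016
print(f"three standard errors:     {3 * se:.5f}")   # still below 0.01
```

Even at a TP rate of 50%, where the binomial variance is largest, three standard errors stay below one percentage point.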

To verify that the nominal FP rate of our test procedure corresponds to its actual FP rate, we also generated 100,000 baseline evolutions from a Poisson distribution with the coefficients $\gamma_i^{(B)}$. We estimated the actual FP rates with the fractions of these evolutions for which at least one of the null hypotheses in Eq. (9) was rejected. We found these fractions to be very close to the nominal FP rates.
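A reduced-size version of this Monte Carlo study can be sketched as follows. The mean count rate curves are hypothetical stand-ins for the ORIGEN output (the anomalous curve is simply assumed to lie 1.5% above the baseline), the number of trials is cut to 500 to keep the run short, the FDR step is assumed to be the Benjamini-Hochberg rule, and *p*-values use a normal approximation to the Student's *t* distribution. The resulting rates therefore illustrate the method only and should not be compared with Table II.

```python
import math
import numpy as np

def design(t):
    """Design matrix for a quadratic in deviations from the mean time."""
    tc = t - t.mean()
    return np.column_stack([np.ones_like(tc), tc, tc**2])

def fit(X, xtx_inv_diag, y):
    """LS coefficient estimates and their standard errors."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    se = np.sqrt(resid @ resid / (len(y) - 3) * xtx_inv_diag)
    return coef, se

def any_rejection(pvals, q):
    """Benjamini-Hochberg (assumed FDR rule): True if at least one null is rejected."""
    p = np.sort(pvals)
    m = len(p)
    return bool(np.any(p <= q * np.arange(1, m + 1) / m))

def rejection_rate(mu_measured, mu_baseline, t, q, n_trials, rng):
    """Fraction of simulated evolution pairs in which the test flags a deviation."""
    X = design(t)
    d = np.diag(np.linalg.inv(X.T @ X))
    hits = 0
    for _ in range(n_trials):
        measured = rng.poisson(mu_measured).astype(float)        # Poisson measurement
        baseline = rng.normal(mu_baseline, 0.01 * mu_baseline)   # Gaussian, 1% error
        g_m, se_m = fit(X, d, measured)
        g_b, se_b = fit(X, d, baseline)
        z = np.abs(g_m - g_b) / np.sqrt(se_m**2 + se_b**2)
        p = np.array([math.erfc(v / math.sqrt(2.0)) for v in z]) # normal approximation
        hits += any_rejection(p, q)
    return hits / n_trials

rng = np.random.default_rng(2)
t = np.arange(1, 91, dtype=float)
mu_b = 1875.0 - 0.5 * t + 2.0e-4 * t**2   # hypothetical baseline mean (counts/day)
mu_a = 1.015 * mu_b                        # hypothetical anomalous mean, NOT from the paper

fp = rejection_rate(mu_b, mu_b, t, q=0.05, n_trials=500, rng=rng)  # measured = baseline
tp = rejection_rate(mu_a, mu_b, t, q=0.05, n_trials=500, rng=rng)  # measured = anomalous
print(f"estimated FP rate: {fp:.3f}, estimated TP rate: {tp:.3f}")
```

Repeating the two calls over a grid of `q` values between 0 and 1 would trace out a ROC curve for the assumed scenario.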

While the performance of the test will depend on the specific scenario, the present example allows us to identify several important factors that influence our ability to detect any anomalous reactor operation. In the following sections we assess the impact on our test performance of finite counting statistics, background, systematic error in the detector response or baseline simulation, operator malfeasance, and the duration of data acquisition within the cycle.

### B. Effect of counting statistics

For the evolutions shown in Fig. 1, antineutrino count rates range from approximately 375 per day at the beginning of cycle to approximately 335 per day at the end of cycle. As discussed in Sec. VI, easily achievable increases in the combined detector mass and efficiency can lead to a five-fold improvement in counting statistics. We considered the impact of these changes on the test performance, simply by increasing the count rate used in our test by a factor of 5.

Figure 2 shows that this dramatically improves the performance of the test. The ROC curve for high count rates collected over the first 90 days in the cycle, shown in purple, is up to six times higher than the ROC curve for the low count rates for the same time period, shown in orange. For example, at the FP rate of 5%, the high count TP rate is 95%, while the low count TP rate is 34%. This strong effect was observed for other data acquisition periods. The results for the high count rate case, as well as the other factors we considered (discussed in Secs. V C–V F), are summarized in Table II.

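TABLE II. TP rates (%) at the 5% FP rate for the high count rate case. Column percentages give the background level as a percentage of the signal.

| Duration (days) | No malfeasance, 0% | No malfeasance, 25% | No malfeasance, 100% | Malfeasance, 0% | Malfeasance, 25% | Malfeasance, 100% |
|---|---|---|---|---|---|---|
| first 90 | 95 | 83 | 52 | 23 | 17 | 11 |
| first 250 | 99 | 96 | 72 | 56 | 39 | 21 |
| 500 | ∼100 | 98 | 75 | 99 | 92 | 63 |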

We assume that an acceptable test for IAEA safeguards or a similar monitoring regime will require at least 95% TP rate at the 5% FP rate. For the particular scenario considered here, we verified that a minimum five-fold improvement in counting statistics is necessary to achieve this target. This was accomplished by progressively increasing the count rate in the testing procedure until the 95%/5% TP/FP combination was attained.

### C. Effect of background

Nonantineutrino events in the detector can mimic the antineutrino signal, producing background. In an earlier paper,^{4} we showed that background, measured during reactor-off periods, is distributed as a Poisson random variable. Depending on its level, the background can dilute the test sensitivity. In Sec. V B, we showed that in the absence of background, approximately 2000 counts per day are both necessary and sufficient to meet our sensitivity goals. To study the influence of background on the sensitivity of our test, we added a Poisson-distributed background term with mean proportional to the given evolution’s initial count rate, and recalculated the TP rates at the target 5% FP rate. We set the background mean equal to 5, 25, and 100% of the initial antineutrino count rate. For comparison, the SONGS1 reactor-off background rate was approximately 25% of the antineutrino rate.^{3}

The test performance remains virtually unchanged at the 5% background rate. As indicated in Table II, the performance degrades slightly at the 25% background level, but is substantially worse when the background rate approaches that of the signal. In the following sections, we also consider the effects of detector or simulation bias, and of operator malfeasance in the presence of background.
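A simple counting argument gives some intuition for this trend. It is not the simulation used for Table II: it assumes the background mean is a fixed fraction of the signal and only compares Poisson standard deviations. Adding independent Poisson background with mean equal to a fraction *f* of the signal raises the variance of the daily total from *S* to *S*(1 + *f*), so the statistical noise grows by a factor of √(1 + *f*).

```python
import math

def noise_inflation(background_fraction):
    """Ratio of the Poisson std. dev. with background to that of the signal alone."""
    return math.sqrt(1.0 + background_fraction)

for f in (0.05, 0.25, 1.00):
    print(f"background = {f:.0%} of signal -> noise x {noise_inflation(f):.3f}")
# background = 5% of signal -> noise x 1.025
# background = 25% of signal -> noise x 1.118
# background = 100% of signal -> noise x 1.414
```

Under this simplified picture, a 5% background increases the noise by only about 2.5%, whereas a background equal to the signal increases it by about 41%.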

### D. Effect of a systematic bias in the predicted or measured antineutrino count rate

Systematic uncertainties in the predicted or measured count rate could cause the detector measurements to deviate significantly from the predicted baseline evolution, even in the absence of an anomalous fuel loading. In this section we analyze the consequences of such shifts for the hypothesis testing procedure.

The absolute count rate of reactor antineutrinos has been predicted with just under 3% systematic uncertainty. Errors in the measured count rate, arising from detector-related systematic uncertainties, such as imperfect knowledge of the number of target atoms in the detector, are smaller, at the ∼1.5% level.^{15} A systematic shift in either the predicted or measured response will decrease the statistical power of the hypothesis test. However, the negative impact of the shift can be partially or even fully mitigated using the template matching strategy discussed below.

For the current scenario, we first considered the impact of a systematic shift incorrectly interpreted as evidence for anomalous reactor operations. We considered two types of shift, with nearly equivalent effects. The first is an overall 1% upward shift in the predicted count rate, arising from a systematic bias in the input antineutrino spectrum or the input thermal power. The second is an overall 1% downward shift in the measured count rate, resulting from a miscalibration of the detector, such as an underestimate of the detector volume. The directions of both shifts were chosen to undermine the statistical power of the test. A 1% absolute systematic error in either the prediction or the measurement is smaller than that obtained in reactor antineutrino experiments, but is already large enough to illustrate the strong impact of such shifts.

Figure 1 shows the case of a 1% upward shift in the predicted baseline. As seen in the plot, for much of the cycle (roughly for the first 200 days), the shift causes the predicted baseline to be closer to the measured anomalous evolution than to the evolution one would measure under normal operating conditions. As a result, the test performance deteriorates dramatically since the predicted baseline is used as a reference in the hypothesis test. For example, at 5% FP rate and 90 days of data acquisition, the TP rate is 0.4%, compared to 95% in the absence of a shift. The test attains the desired 95% TP rate only at the FP rate of practically 100%. Thus, even a small bias in either the measured or predicted response severely weakens the statistical power of the hypothesis test if an *absolute* comparison of predicted and measured count rate trajectories is made.

#### 1. Template matching

The negative impact of these systematic shifts can be mitigated by comparing the measured count rate to a template defined in a previous cycle. This template would be obtained by shifting the predicted baseline count rate evolution to match the evolution measured with the same detector, for a cycle in which operating conditions were known by other means to be standard. Using this template in effect removes systematic errors in both the prediction and the detector response, since agreement is empirically enforced when the template is defined. The template would still require operator-reported thermal power as an input for the latest cycle. The template remains valid so long as the detector response is unchanged, which can be verified by various automated calibration techniques.
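One way such a template could be constructed is sketched below. It assumes the "shift" is a single multiplicative factor applied to the whole predicted evolution, which is an assumption of this sketch and not a detail given in the text. The factor is obtained by least squares from a reference cycle known to be standard, and all names and numbers are hypothetical.

```python
import numpy as np

def fit_template_scale(predicted, measured_reference):
    """Least-squares scale s minimizing sum((measured_reference - s * predicted)^2)."""
    predicted = np.asarray(predicted, dtype=float)
    measured_reference = np.asarray(measured_reference, dtype=float)
    return float(predicted @ measured_reference / (predicted @ predicted))

t = np.arange(1, 91, dtype=float)
predicted = 1875.0 - 0.5 * t + 2.0e-4 * t**2   # hypothetical simulation output (counts/day)

# Hypothetical reference cycle seen by a detector whose response is 1% low
# (kept noise-free here so that the recovered factor is easy to read off).
measured_reference = 0.99 * predicted

scale = fit_template_scale(predicted, measured_reference)
template = scale * predicted                    # reference curve for later cycles
print(f"fitted scale factor: {scale:.4f}")
```

In later cycles the template, rather than the raw prediction, would serve as the baseline mean in the hypothesis test, provided the detector response has not changed.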

Template matching in this way is equivalent to making antineutrino count rate measurements relative to a predicted initial value. Such relative measurements have a considerably smaller systematic error (less than 1%)^{16} than an absolute measurement. The error reduction occurs since many of the systematic biases present in the absolute measurement, including the error in the predicted flux and many detector-related errors, are canceled by subtraction.

This strategy will lead to practically the same result obtained earlier for high statistics acquisition in the absence of a shift—95% TP rate at 5% FP rate with 90 days of data acquisition. The impact of background following the use of this template-matching calibration is thus also practically unchanged.

We investigated another strategy for mitigating the impact of a detector bias, namely, correcting the measured counts throughout the cycle by the difference between them and the predicted values averaged over the first 20 days of the cycle. This enforces agreement of predicted and measured count rates at beginning of cycle before the testing procedure is applied. However, this strategy has limited value: when the measured counts are corrected in this way, at 5% FP rate and 90 days of data acquisition, the TP rate is only 12%. While this is a significant improvement over the 0.4% TP rate reported above for the absolute comparison of unadjusted predicted and measured count rate trajectories, it is nevertheless too low to be of practical use. Although the TP rates improve for longer acquisition periods, approximately 250 days of acquisition are required for the rates to become comparable to those observed in the absence of a bias.
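The early-cycle correction described above can be written compactly. The sketch below assumes an additive offset equal to the mean difference between predicted and measured counts over the first 20 days; the function name and the example data are hypothetical.

```python
import numpy as np

def correct_with_early_offset(measured, predicted, n_days=20):
    """Add to every measured count the mean (predicted - measured) gap over the first n_days."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    offset = np.mean(predicted[:n_days] - measured[:n_days])
    return measured + offset

t = np.arange(1, 91, dtype=float)
predicted = 1875.0 - 0.5 * t + 2.0e-4 * t**2   # hypothetical baseline prediction
measured = 0.99 * predicted                     # hypothetical 1% low detector response
corrected = correct_with_early_offset(measured, predicted)

# After correction, the early-cycle averages agree by construction.
print(np.mean(corrected[:20]) - np.mean(predicted[:20]))
```

Because the offset is estimated from the same cycle that is being tested, it may also absorb part of a genuine early-cycle difference caused by an anomalous loading, which is consistent with the low TP rate reported for this strategy.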

Thus, the approach of comparing to an adjusted prediction from a previous cycle appears to be the most effective method for identifying anomalous fuel loadings, so long as systematic errors in predicted and measured detector response remain at the level of a few percent, and the same detector is used for benchmarking the prediction and making the new measurements to be tested. In addition, this strategy requires independent knowledge that the cycle used to benchmark the prediction has a standard baseline fuel loading.

### E. Effect of operator malfeasance

Equation (1) shows that both thermal power and fissile isotopic content can be altered to change the antineutrino count rate. Thus, in an attempt to conceal the removal of plutonium in the present example, the reactor operator could report a higher thermal power value than the true operating power. This input information would cause the simulation to incorrectly predict a systematic upward shift in the baseline evolution.

To assess the impact of a misreported power history, we considered the effect of a 1% upward systematic shift of the baseline evolution that was originally obtained from the ORIGEN simulation (solid green curve in Fig. 1). This has the same effect on the predicted baseline as the simulation bias discussed in Sec. V D, but in this case this shifted prediction (dashed green curve in Fig. 1) would also be the reference in the hypothesis test. We assume that misreporting power is only a sensible strategy for the operator if the removal of plutonium is taking place. Thus, a false positive result would only occur if the operator is in fact operating the reactor at the reported power (but different from the original baseline scenario), for example, in order to generate a different amount of electricity than the original baseline scenario would allow. As can be seen in Fig. 1, the resulting shifted baseline prediction is much less distinguishable from the anomalous evolution than the original baseline, so this shift is also expected to deteriorate the test’s performance.

Indeed, for a test using high count rate data for the first 90 days, at 5% FP rate and in the absence of background, the TP rate was 23% when using the shifted baseline, compared to 95% obtained for the original unshifted baseline. As expected and as indicated in Table II, the TP rates are reduced even further in the presence of background. One obvious, but important, difference between the case of malfeasance and that of prediction or measurement bias discussed in Sec. V D is that template matching is not viable in the former case, as the shift in the prediction is being deliberately introduced. In Sec. VII, we discuss operational and experimental means to address the problem of deliberate misreporting.

It is important to note that longer data acquisition times reduce the impact of malfeasance. As Table II shows, the TP rates at 5% FP rate in the absence of background are 56% and practically 100% for 250 and 500 days of data acquisition, respectively. Hence, even in the presence of malfeasance, the anomaly can be detected with high sensitivity if one acquires antineutrino data over the entire cycle. The relevant entries in Table II show that this conclusion is largely unchanged in the presence of background, up to a signal-to-background ratio of 4:1 (i.e., background is 25% of the initial count rate).

### F. Effect of the duration of data acquisition

Naturally, the estimates of the evolution coefficients $\hat{\gamma}_i^{(M)}$ and the test performance both improve as data are acquired for longer periods. In our ROC curve simulation, we considered the following four durations: 500 days (roughly full cycle length), first 250 days (half cycle length), first 90 days, and first 30 days in the cycle. Figure 3 shows the ROC curves for these four duration periods, assuming high count rates. At the FP rate of 5%, the TP rate is practically 100% for 500 days versus 99, 95, and 58% for the first 250, 90, and 30 days, respectively.

In addition, in all cases, the test performance is largely maintained in the presence of background up to the level of 25%, while at unity signal-to-background ratio the performance degrades substantially. These various effects are summarized in Table II.

## VI. DETECTOR DESIGN AND OPERATION

The test performance described above can be used to guide the design of future safeguards antineutrino detectors. For a given anomalous scenario and desired true and false positive rates, a minimum antineutrino count rate requirement can be established. Within practical limits set by the reactor site, detector cost and complexity, a desired count rate may be achieved by adjusting the detector standoff distance, size or intrinsic efficiency. The effect of background on the test performance also sets an approximate minimum background requirement on the detector.

As discussed earlier, the antineutrino rate in the SONGS1 experiment^{2} was approximately 360 counts per day at beginning of cycle after subtraction of reactor-off background. According to the ROC curve in Fig. 2, this antineutrino count rate gives a TP rate of 34% for a 5% FP rate with a 90-day acquisition period. In Sec. V B, we showed that for the anomalous scenario we considered, a 2000 count per day net antineutrino event rate is necessary and sufficient to achieve the presumed IAEA target 95%/5% TP/FP rate combination.

The SONGS1 detector was located 24.5 meters from the reactor core, with a 0.48 ton target mass and 11% intrinsic detection efficiency.^{5} An increase in the count rate compared to SONGS1 could be accomplished by a combination of reduced standoff distance, increased detector target mass and/or increased intrinsic detection efficiency. For example, at 24.5 m standoff, a one ton detector with 30% intrinsic efficiency, or a two ton detector with 15% intrinsic efficiency would reach the 2000 count rate level and thus, the desired 95%/5% TP/FP rates. Alternatively, a one ton, 11% efficient detector at 15 m standoff would reach the same TP/FP rate combination. Among these adjustments, the changes in standoff distance and the detector size depend on the plant configuration and access levels, and so might not be achievable at all sites. However, several of the detectors in Table III already exhibit the required efficiency.
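The scaling argument in the preceding paragraph can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes that the net count rate scales linearly with target mass and intrinsic efficiency and as the inverse square of the standoff distance, at fixed reactor power, and it takes the SONGS1 values quoted in the text (360 counts per day, 0.48 ton, 11%, 24.5 m) as the reference point. The function name `scaled_rate` and the configuration labels are hypothetical.

```python
# Reference point: SONGS1 values as quoted in the text (net rate at beginning of cycle).
REF_RATE = 360.0   # counts per day
REF_MASS = 0.48    # ton
REF_EFF = 0.11     # intrinsic detection efficiency
REF_DIST = 24.5    # m


def scaled_rate(mass_ton, efficiency, distance_m):
    """Scale the reference rate, assuming rate ~ mass * efficiency / distance^2.

    This simple scaling is an assumption of the sketch; it ignores any
    site-specific changes in background, geometry, or reactor power.
    """
    return (REF_RATE
            * (mass_ton / REF_MASS)
            * (efficiency / REF_EFF)
            * (REF_DIST / distance_m) ** 2)


# The three example configurations mentioned in the text.
configs = {
    "1 ton, 30% eff., 24.5 m": (1.0, 0.30, 24.5),
    "2 ton, 15% eff., 24.5 m": (2.0, 0.15, 24.5),
    "1 ton, 11% eff., 15 m": (1.0, 0.11, 15.0),
}

for label, (mass, eff, dist) in configs.items():
    print(f"{label}: {scaled_rate(mass, eff, dist):.0f} counts/day")
```

Under these assumptions, the three configurations give roughly 2045, 2045, and 2001 counts per day, consistent with the 2000 count per day requirement.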

Experiment | Power (GWt) | Mass (ton) | Distance (m) | Efficiency (%) | Signal/background (counts/day) | Detector type
---|---|---|---|---|---|---
Rovno 1 (Ref. 6) | 1.375 | $\sim 0.5$ | 18 | 20 | 909/149 | $^{3}$He + water
Rovno 2 (Ref. 17) | 1.375 | $\sim 0.2$ | 18 | 30 | 267/94 | Gd scint.
CHOOZ (Ref. 18) | 4.4 | 5.0 | 1000 | 69.8 | 24/1.2 | Gd scint.
Palo Verde (Ref. 19) | 11.6 | 11.3 | 800 | 10 | 200/300 | Gd scint.
SONGS1 (Ref. 2) | 3.4 | 0.64 | 24.5 | 11 | 564/105 | Gd scint.
Bugey (Ref. 20) | 3.4 | 0.60 | 15.0 | 30 | 62/2.5 | Li scint.


As Table II shows, test performance can be maintained up to a signal-to-background ratio of about 4:1, similar to that achieved in the SONGS1 detector, provided that the test uses longer data acquisition times. Since the normal PWR cycle is uninterrupted for hundreds of days, the use of longer acquisition times should be feasible. In the event of unplanned shutdowns within a few months of start-up, the antineutrino measurements could still be used to ensure that the operating state of the reactor had not changed in terms of fuel loading and power, after which the hypothesis testing procedure could be reinitiated following a restart.

As shown in Table III, previous antineutrino detectors have had the masses, efficiencies, and signal-to-background ratios required to achieve the desired TP/FP rate performance. The series of deployments at the Rovno reactor complex in Ukraine is of particular interest since the efficiencies are high, while the overburden and other conditions are similar to those that would be encountered in many reactors under IAEA safeguards. By contrast, the high efficiency of the CHOOZ detector reflects the state of the art for this class of detectors, but is achieved in part through significantly greater overburden and reduced ambient radioactivity compared to the other experiments, so such a device is unlikely to be practical in a safeguards context. Background levels from 4 to 25% were achieved in five of the six examples, meeting or exceeding the minimum background requirement of 25%.

## VII. CONCLUSIONS AND POSSIBLE FUTURE WORK

This paper introduced a test procedure that determines whether a given antineutrino count rate evolution significantly deviates from that of the baseline. The procedure uses a quadratic model for the antineutrino count rate as a function of time since the beginning of the fuel cycle. However, the procedure can be adapted to a much wider class of models. In particular, any antineutrino count rate evolution that can be represented by an analytic function could be treated in a similar manner. The procedure described in this paper involves least squares estimation of the parameters in the quadratic model for the evolution in question and a multiple hypothesis testing procedure, known as false discovery rate (FDR) control, to determine whether at least one of the estimated parameters is significantly different from its baseline counterpart.
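The two ingredients named above, a least squares fit of a quadratic model and an FDR-based comparison of the fitted parameters with their baseline counterparts, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this paper: it assumes ordinary (unweighted) least squares, Gaussian two-sided p-values for each coefficient, and the Benjamini-Hochberg step-up rule as the FDR procedure, and it ignores correlations between the coefficient estimates. All function names and the synthetic numbers in the demonstration are hypothetical.

```python
import math
import numpy as np


def fit_quadratic(t, counts):
    """Ordinary least squares fit of counts = g0 + g1*t + g2*t**2.

    Returns the coefficient estimates and their standard errors.
    """
    X = np.column_stack([np.ones_like(t), t, t ** 2])
    coef, *_ = np.linalg.lstsq(X, counts, rcond=None)
    resid = counts - X @ coef
    sigma2 = resid @ resid / (len(t) - 3)        # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)        # coefficient covariance
    return coef, np.sqrt(np.diag(cov))


def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of hypotheses rejected by the BH step-up rule at level q."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])         # largest rank passing the rule
        reject[order[:k + 1]] = True
    return reject


def deviates_from_baseline(t, counts, baseline_coef, q=0.05):
    """True if at least one fitted coefficient differs from its baseline value."""
    coef, se = fit_quadratic(t, counts)
    z = (coef - np.asarray(baseline_coef)) / se
    pvals = [math.erfc(abs(zi) / math.sqrt(2.0)) for zi in z]  # two-sided
    return bool(benjamini_hochberg(pvals, q).any())


# Demonstration on synthetic data (all numbers are illustrative assumptions).
rng = np.random.default_rng(0)
t = np.arange(90, dtype=float) / 100.0           # time in units of 100 days
baseline = np.array([2000.0, -100.0, 0.0])       # hypothetical baseline coefficients
observed = rng.poisson(1900.0 - 60.0 * t).astype(float)
print(deviates_from_baseline(t, observed, baseline))
```

Expressing time in units of 100 days keeps the design matrix well conditioned. A fuller treatment would account for the correlation between the fitted coefficients when forming the test statistics.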

The anomalous operations identified in this paper do not constitute a diversion scenario *per se*, since we have not specified the ultimate fate of the removed fuel. Instead, we have estimated the sensitivity of antineutrino rate measurements to changes in typical civil power reactor fuel loadings. An important future exercise, best conducted by IAEA safeguards experts, is a fuller analysis of the reactor safeguards implications of this novel bulk accountancy method.

While the specific performance of the test will depend on the scenario, this work has identified the factors that most influence our ability to detect anomalous fuel loadings generally. Among the factors that we considered, counting statistics, the presence of detector bias, and introduction of a systematic shift due to operator malfeasance had the most dramatic impact on the test performance. High counting statistics collected over longer periods of time in the absence of a deliberate shift in the baseline or detector bias yield the best performance and attain the target 95% TP rate at the 5% FP rate. We also found a template matching method to be the most effective way to maintain test performance in the presence of systematic biases. This approach has the further advantage of reducing the dependence of the method on a reactor simulation.

Past experience has demonstrated that increasing the antineutrino count rate through efficiency or mass increases is achievable, so that our target 95% TP / 5% FP rate combination can be attained with practical detectors. More problematic in a safeguards context is the issue of deliberate misreporting of power levels on the part of the operator that would undermine the statistical power of our test. While this is a serious concern, we note that the operator’s misreporting must be fully consistent with the antineutrino data, which are independently acquired by and remain under the control of the safeguards inspector. This independently acquired information places an important additional constraint on the operator compared to current practice, in which declarations, along with item accountancy, are the primary sources of quantitative information about the reactor thermal power and fuel loading. Moreover, the misrepresentation must be tuned to the particular anomalous operational state chosen by the operator. If different amounts or types of fissile material are removed, the hypothesis test may still detect a significant departure from the baseline. To further examine the robustness of this method, it is necessary to investigate a wider class of anomalous scenarios, varying both fuel and reactor type.

As described in Ref. 21, a direct measurement of the antineutrino spectrum would provide sufficient information to simultaneously constrain both power and fissile isotopic content. This would severely undermine or even eliminate the benefit to the operator of misreporting the thermal power. However, since the antineutrino rate per energy bin will be necessarily reduced, the statistical power of the test may be compromised, or, alternatively, a larger detector may be required than is the case for a pure rate measurement. In future work, we will apply a hypothesis testing procedure on a spectrally resolved antineutrino measurement, including realistic statistical and systematic uncertainties, to quantify any additional sensitivity inherent in the spectral analysis.

Finally, as noted earlier, we used an ORIGEN simulation of the SONGS unit 2 reactor core. Assemblies were assumed to have no spatial extent: the only spatial information in our calculation was the variation in distance of each pointlike assembly from the detector. A full three-dimensional treatment of the assemblies would allow inclusion of effects such as the variation of the centroid of fission over the cycle.

## ACKNOWLEDGMENTS

We thank the DOE Office of Nonproliferation Research and Engineering for their sustained support of this project. We also thank Nathaniel Bowden, Scott Kiff, and the anonymous reviewer for insightful comments on earlier versions of this manuscript. Finally, we express our gratitude to the management and staff of the San Onofre Nuclear Generating Station for allowing us to deploy and take data with our prototype safeguards antineutrino detectors.