In community noise studies, there is often a desire to understand how the annoyance response to multiple noise events aggregates over a long period of time. Many cumulative response metrics, such as day-night level (DNL), are based on the idea that humans respond, on average, to the sum of frequency-weighted acoustic energy over time. This paper introduces a generalization of DNL that includes a parameter, $b$, that ranges between zero and one. When $b$ equals zero, the metric returns the maximum level of the events. When $b$ equals 0.5, the metric reproduces the equal-energy-based output of DNL. When $b$ equals one, the metric returns a value that more harshly penalizes the number of events. In this way, these common possible hypotheses are organized onto a single scale, one that may be used to craft effective noise mitigation techniques or implement regulations. The analysis is demonstrated in two ways: first, on synthetic datasets to show the utility and consistency of the metric, and second, on limited quiet-supersonic response data gathered during the Quiet Supersonic Flights 2018 community study.

## I. INTRODUCTION

Community noise studies commonly strive to understand and characterize the annoyance generated by a particular noise source over a period of time. This work proposes and demonstrates an analysis method that can be used to determine the relationship between short-term annoyance generated by individual noise events and longer-term cumulative annoyance that results from exposure to multiple single events. Determining the way that a metric should aggregate energy from individual events into a single value is typically referred to as the “noise and number” problem and much of the previous work focuses on conventional fixed-wing aircraft. There, the commonly used cumulative noise metric is the day-night level (DNL).^{1} Although it is well established for regulatory purposes, DNL may not fully capture the cumulative response to noise as it is based on the equal energy hypothesis (EEH), where level, duration, and number of events are given an “equal” trade-off as determinants of annoyance response.^{2} However, other heuristics are entirely possible. For instance, people may be more sensitive to the loudest or most recent event in a given day (as suggested by the psychological heuristic called the “peak-end rule”^{3}). It may be the case that people are more annoyed by the number or pace of events throughout a day—the fact that something is happening repeatedly either regularly or irregularly—regardless of the level.^{4}

In light of the different potential hypotheses to describe annoyance, this paper proposes an analysis technique that generalizes the DNL calculation. The formulation introduces a parameter, $b$, that changes how single events are summed together such that the metric can represent several different response hypotheses. Statistical methods are given that can be used to estimate this parameter from community response data and show the level of confidence in that estimate. The formulation allows the metric to represent a wide range of possible responses that may be observed in empirical data. If the response is found to be equivalent to the EEH, the metric will revert to DNL. If the response appears to be more nuanced, the method can represent different behaviors—from a response that is equal to only the maximum event that occurred over a period of time, to one that more harshly penalizes the number of events that occurred over the period. This analysis technique is meant to motivate the design space of single-event summation possibilities to produce more detailed understanding of the underlying factors driving cumulative annoyance.

In developing this approach, it is instructive to have a real-world set of data to use as a testbed. The data considered here are from community response to sonic booms. Annoyance to sonic booms has hindered commercial supersonic flight and contributed to the 1973 Federal Aviation Administration ban on overland supersonic civil flight.^{5} Building on decades of sonic boom research,^{6} the National Aeronautics and Space Administration (NASA) is developing an experimental aircraft, the X-59, to demonstrate quiet-supersonic flight. Community flight campaigns accompanying the X-59 aircraft present a unique opportunity to collect human response data to quiet-supersonic noise signatures to inform regulators in their efforts to consider alleviating the overland supersonic ban with permissible noise standards.^{7} In preparation for the X-59 community studies, NASA conducted a risk reduction quiet-supersonic community study.^{8} These survey data, which consist of single-event responses to individual flyover events and end-of-day, or cumulative, responses to events of the day, are used in this paper to demonstrate the proposed noise metric and its accompanying analysis methodology.

The remainder of this paper is outlined as follows: Section II provides background technical information primarily from fixed-wing noise literature (a literature review is contained at the end of Sec. II). Section III defines the analysis method for quiet-supersonic events and gives the statistical procedures of evaluating community response data. Section IV introduces some community response datasets on which the method is demonstrated—one artificial and one from a previous quiet-supersonic community study. Section V discusses the results of applying the method to these datasets and confers lessons that might be applied to the design of future community response studies.

## II. BACKGROUND

This section introduces several concepts that are at the core of this work. The EEH, how it relates to existing noise metrics, and how it might be extended to encompass other responses based on psychoacoustic data are discussed. With this context, previous work is reviewed from laboratory and *in situ* psychoacoustic testing.

### A. Whence equal energy

At the most basic level, a noise metric is a mathematical function that reduces some intricate description of noise (e.g., a pressure waveform or one-third octave band time history) to a single number. This can be performed for noise of short or long duration—from milliseconds to a year or more. Many noise metrics are designed to correlate well with some aspect of how people respond to the sound. For historical reasons—a matter of the last 60 yrs or so of policy and research—annoyance is studied as the primary impact of noise on humans (see the recent article by Clark *et al.*^{9}). Thus, researchers are often trying to find correlations of annoyance to understand, predict, and hopefully help to limit people's exposure to noise.

There are several aspects of long-term sound exposure that should understandably play a role in annoyance when considering multiple noise events: the level of the events (if levels increase, annoyance should increase), the length of the events (if the sounds last longer, annoyance presumably increases), and the number of events (experiencing many events is likely more annoying than experiencing only a few). Although it is difficult to pinpoint the particular origin, in the pursuit of a noise metric that captured all these features, it was proposed to use acoustical energy integration over time and (weighted) frequency. Therefore, for a predictor of annoyance, just take the acoustical energies of the noise over a period of time of interest and add them up. This simple heuristic captures all these features (level, length, and number) and proposes a “natural” trade-off between them of 3 dB in equivalent level of doubling either the length of the sound or the number of events. The idea that this strategy represents an optimal way of correlating noise exposure with annoyance is the EEH.^{2} This can be extended to produce common noise metrics with additional details such as nighttime penalties or different frequency weighting schemes.

DNL is the most common implementation of the EEH for community noise assessment purposes.^{1,10} This metric uses an *A*-weighting to mimic the frequency response of the human ear and assesses a 10-dB penalty on noises occurring during the nighttime hours of 10 p.m.–7 a.m. For a more comprehensive history of how this measure came into fashion and how it is used (at least in the U.S.), see Fidell and Mestre.^{2} Over time, there has been significant concern with the shortcomings of DNL as a measure of noise, predictor of annoyance, and tool for communication and regulation;^{11} however, DNL has also never been strongly and consistently rejected. A few disparate laboratory and community studies, e.g., Vogt,^{12} seem to refute the EEH in particular contexts, but in totality, EEH has been the predominant working hypothesis in the literature. There is an important distinction to make here: one should not misunderstand the prevalence of the EEH to mean that individuals act as integrating sound level meters. Even entire communities or groups of test subjects may show behavior that is strongly weighted to one or two of the aspects (level, length, or number) based on the particulars of the noise problem that they are faced with, but overall, attempting to incorporate the reactions of a large number of people to a variety of scenarios over a large amount of time makes it difficult to do away with the EEH.
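The equal-energy summation behind DNL can be sketched in a few lines. This is a minimal illustration, not the paper's Eq. (1) verbatim: it assumes each event is given as a hypothetical `(exposure_level_db, hour_of_day)` pair, applies the 10-dB nighttime penalty described above, and normalizes by the number of seconds in a day (about 49.4 dB).

```python
import math

NIGHT_PENALTY_DB = 10.0                        # penalty for 10 p.m. to 7 a.m. events
DAY_NORMALIZATION_DB = 10 * math.log10(86400)  # seconds per day, about 49.4 dB


def is_night(hour):
    """Return True for hours inside the 10 p.m. to 7 a.m. nighttime window."""
    return hour >= 22 or hour < 7


def dnl_from_events(events):
    """Energy-sum single-event exposure levels into a day-night level.

    `events` is a list of (exposure_level_db, hour_of_day) tuples; this
    input layout is an assumption made for illustration.
    """
    energy = 0.0
    for level_db, hour in events:
        penalty = NIGHT_PENALTY_DB if is_night(hour) else 0.0
        energy += 10 ** ((level_db + penalty) / 10)
    return 10 * math.log10(energy) - DAY_NORMALIZATION_DB


one_event = dnl_from_events([(90.0, 12)])
two_events = dnl_from_events([(90.0, 12), (90.0, 15)])
print(round(two_events - one_event, 2))  # doubling equal events adds about 3 dB
```

Under this sketch, an 80-dB event at 11 p.m. contributes the same energy as a 90-dB event at noon, which is the trade-off the nighttime penalty encodes.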

### B. Nonequal energy responses—The canonical approach

This work seeks to extend metrics based on the EEH to be able to correlate with other types of responses. If a new parameter can be added to the metric in such a way that a particular setting of this parameter reproduces the old output of the metric, then this new generalized metric becomes a superset of the old one. If psychoacoustic data from a community response study provide evidence that the response is in accordance with the EEH, then this parameter can simply be specified to the value implied by the EEH. On the other hand, if evidence is presented that allows for the confident rejection of the EEH, then this metric can be used with the parameter accordingly set to assess noise in a way that is in line with the observed data. Last, even if the EEH cannot be confidently rejected, such a parameter may provide insight into either further experimentation or development of metrics that could capture more subtleties of the responses that are not in accordance with the EEH.

In the $k$ formulation of Eq. (2),^{13} the EEH implies that $k = 10$, and much of the past noise and number literature deals with trying to disprove this as a null hypothesis. If $k = 0$, there would be no effect of the number of events, $N$, and the response is only dependent on the average $L_{AE}$. For $k > 10$, an increase in the number of events in the defined period of time would be penalized more harshly than the EEH would suppose.
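The role of $k$ can be illustrated with a short sketch. It assumes the $k$ formulation has the common form of an energetic average of the single-event levels plus $k \log_{10} N$, with the time normalization constant dropped; the function name `k_level` is hypothetical.

```python
import math


def k_level(levels_db, k):
    """Energetic mean of the event levels plus k*log10(N).

    Assumed form of the k formulation; the time normalization
    constant is omitted for clarity.
    """
    n = len(levels_db)
    mean_energy = sum(10 ** (level / 10) for level in levels_db) / n
    return 10 * math.log10(mean_energy) + k * math.log10(n)


levels = [94.0, 93.0, 88.0]
for k in (0, 10, 15):
    print(k, round(k_level(levels, k), 1))
```

With $k = 10$ the result equals the plain energy sum of the events, so the EEH is recovered; larger $k$ adds a steeper penalty per doubling of $N$.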

Although noise assessments around the world today are mostly in accord with EEH, there are some notable historical exceptions. Goldstein^{14} catalogues several instances, such as the isopsophic index, noisiness index, and total noise load. The noise and number index (NNI),^{1} for example, explicitly uses $k$ = 15 and was used for community noise assessment in the United Kingdom from the 1960s to 1990, when it was replaced by a metric based on the EEH (cf. the introduction of Fields^{13}). Another non-EEH heuristic was the “Kosten unit” (Ke) of noise exposure, which has been in use for a long time in The Netherlands.^{15} Although this strategy is not directly relatable to a particular value of $k$, it became deprecated when increasing number of aircraft movements led to a reduction in Ke that was not representative of the persisting levels of annoyance—a noise and number problem—and it was replaced by day-evening-night level (DENL).

One of the interesting consequences of using $k$ ≠ 10 is that it requires a separate input of the number of events (or all events individually). If $k$ = 10, then a time-integrating sound level meter can be used—a device that is ignorant of the number of noise events there are within its integration window. For any other $k$, each event must be identified, counted, and assessed separately. Whereas this is simple to do with predictive computational tools, to do it with recorded data can represent a considerable increase in effort. Thus, the metrics employing the EEH have necessarily lower complexity than metrics that use any other trade-off. It is possible that this added complexity leads to an evolutionary force, which has helped to move regulations over time into a state of accordance with the EEH, as in the case of the NNI.^{16}

### C. Vector norms and the $b$ concept

^{17}Whereas these are commonly used norms, the “base” of the vector norm calculation is a continuous number $p$, and the operation is well-defined for $p \in [1, \infty]$ such that

^{18–20}

Reconsidering the earlier vector of inputs in decibel form ($L_{AE} = 94, 93, 88$), this formulation in Eq. (6) with the time normalization constant (49.4) dropped can achieve the same results: when $b = 0$, then $L_{dn,b} = 94$ dB, which is the maximum element. When $b = 0.5$, then $L_{dn,b} = 97.1$ dB, which is the same answer that the original DNL summation provides in Eq. (1). When $b = 1$, then $L_{dn,b} = 101.6$ dB, which is to say that the formula has penalized the number of elements in the summation more harshly than would be implied by the EEH. (Note that $b = 1$ also corresponds to a coherent pressure summation of the inputs as opposed to an energy summation.) Now, $b$ runs along the bounded unit line from zero to one with these three readily understandable computations that are equidistant along the line.
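The three values quoted above can be reproduced with a short sketch. It assumes the $b$ formulation is a $p$-norm with $p = 1/b$ applied to pressure-like amplitudes $10^{L/20}$, i.e., $20\,b \log_{10} \sum_i 10^{L_i/(20b)}$ with the normalization constant dropped; the function name `b_level` and the max-referenced evaluation are illustrative choices, not taken from the paper.

```python
import math


def b_level(levels_db, b):
    """Assumed b formulation: 20*b*log10(sum(10**(L/(20*b)))).

    The sum is evaluated relative to the loudest event so that small
    values of b do not overflow; b = 0 returns the maximum level.
    """
    loudest = max(levels_db)
    if b == 0:
        return loudest
    total = sum(10 ** ((level - loudest) / (20 * b)) for level in levels_db)
    return loudest + 20 * b * math.log10(total)


levels = [94.0, 93.0, 88.0]
print([round(b_level(levels, b), 1) for b in (0, 0.5, 1)])  # [94.0, 97.1, 101.6]
```

Referencing the sum to the loudest event keeps every exponent non-positive, which matters when $b$ is stepped down toward zero on a fine grid.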

Perhaps a more powerful concept exists with the idea that $b = 1$ should return the number of events in the input that are above a certain value. This, in concept, is in line with how vector norms are sometimes defined for $p = 0$ as being a count of the positive values in a vector and with the “number-above” noise metric that has found recent notoriety as a supplement for DNL (e.g., GAO^{21}). However, the transformation from $p$ to $b$ becomes less straightforward (likely involving a cotangent operation), and the response of the metric necessarily passes through the unphysical region of $p \in (0, 1)$, which risks computational instability.

How do the $b$ and $k$ formulations [i.e., Eqs. (6) and (2), respectively] compare? In the case where the input $L_{AE}$ values are all the same, the two methods produce equal results with $k = 20b$. Therefore, the $b$ formulation has an advantage when the inputs have some depth to them, that is, if the summation of dissimilar events results in an interesting effect on the response. Consider a scenario made up of one loud event (e.g., 100 dB) and numerous quiet events (e.g., 99 unique events at 50 dB). Suppose that all these events occurred over the course of a day, and the goal is to come up with a single value that best represents the whole day of exposure. In this scenario, the $k$ formulation would compute the average $L_{AE}$ such that when $k = 0$, instead of returning 100 dB, $L_{dn,k}$ [Eq. (2)] would return 80 dB, a value diluted by the numerous quiet events instead of the true maximum. In contrast, when $b = 0$, $L_{dn,b}$ [Eq. (6)] would return 100 dB (again, with the normalization constant dropped) no matter how many other quiet events there were. On the other hand, the $b$ formulation also captures the “number” effect better with the many small events in this scenario. The small events, even when combined in the taxicab norm, would add little to the total in the $b$ formulation. In contrast, the $N$ is impactful in the $k$ formulation regardless of the level of individual events. Again, for the EEH case of $b = 0.5$ and $k = 10$, the two approaches will exactly agree.
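The scenario above can be checked numerically. The sketch below assumes the same forms as elsewhere in this discussion (energetic mean plus $k \log_{10} N$ for the $k$ formulation, and a $p$-norm with $p = 1/b$ on pressure-like amplitudes for the $b$ formulation), both with the normalization constant dropped; the function names are hypothetical.

```python
import math


def k_level(levels_db, k):
    """Assumed k formulation: energetic mean level plus k*log10(N)."""
    n = len(levels_db)
    mean_energy = sum(10 ** (level / 10) for level in levels_db) / n
    return 10 * math.log10(mean_energy) + k * math.log10(n)


def b_level(levels_db, b):
    """Assumed b formulation: 20*b*log10(sum(10**(L/(20*b))))."""
    loudest = max(levels_db)
    if b == 0:
        return loudest
    total = sum(10 ** ((level - loudest) / (20 * b)) for level in levels_db)
    return loudest + 20 * b * math.log10(total)


# One loud event and 99 quiet events over the same day.
scenario = [100.0] + [50.0] * 99
print(round(k_level(scenario, 0), 1))  # 80.0, diluted by the quiet events
print(round(b_level(scenario, 0), 1))  # 100.0, the true maximum
print(round(b_level(scenario, 1), 1))  # quiet events add only a few dB
```

For identical inputs the two functions agree whenever $k = 20b$, and for any inputs they agree at $b = 0.5$, $k = 10$, which matches the equivalences stated in the text.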

The primary cost for using the $b$ formulation over $k$ is that the method is more conceptually complex, although not significantly so from a computational standpoint. It also means that there is a need to keep track of each individual event. In practice, arbitrarily large sets of $L_{AE}$ values may be reduced to $\overline{L_{AE}}$ and $N$ such that only two numbers need to be stored. Unfortunately, any computation for the $b$ concept that starts from this resolution of data will only reproduce the output of the $k$ methodology. It is also fair to point out that $k$ is not limited to the range [0,20], whereas $b$ (in its present formulation) is limited to the equivalent range of [0,1]. Thus, the $k$ formulation is able to represent penalties for increases in the number of events beyond that which the $b$ concept can. That being said, there is no existing literature that points to $k > 20$ being a reasonable descriptor of human response to multiple events.

### D. Selected literature

Armed with an understanding of the EEH and the $b$ concept, the current efforts can now be contextualized in relation to the existing literature.

The first and perhaps most important piece of earlier work is the previously cited meta-analysis of Fields.^{13} He applies a regression model to determine $k$ for a large number of community response tests. Although there are individual tests for which the best guess of $k$ seems to be quite different from the EEH value of 10, the large confidence intervals on $k$ found throughout the study make it difficult to confidently reject the EEH (i.e., as a null hypothesis) in any particular instance and impossible overall.

The most similar piece of previous work comes from Miedema *et al.*,^{22} who present an analysis of a multiple-event dataset that augments the DNL computation with two free parameters: one for the magnitude of the time-of-day penalty, and one for a number-trade-off rate. Their formulation introduces a parameter, $\alpha$, which matches the $1/b$ exponential term in Eq. (6) to a factor of 2. This derives from two similar observations to those given in Sec. II B: that the depth of the individual $L_{AE}$ values may be important to evaluating whether the EEH is true or not (and not just the mean value and number). Also, they note that the $k$ formulation partially presupposes the EEH by reducing the data to $\overline{L_{AE}}$ and $N$ via an energetic average. However, they do not identify their operation as a norm and do not use the inverse exponentiation after the sum. Hence, the result of their analysis ceases to be in decibel-like units, and the end points of $\alpha$ do not match other common measures of sound, perhaps leading to trouble with linearization as $\alpha$ moves away from the EEH. Ultimately, their study is also unable to reject the EEH.

Many laboratory experiments have been performed to try to answer this noise and number question as well. For instance, NASA performed a laboratory psychoacoustic study to generate data that would discriminate between the EEH and several other strategies for cumulative noise assessment.^{23} Presently, NASA has plans to continue this type of experimentation for novel vertical lift vehicles.^{24} The applicability of laboratory data to real-world situations was called into question by Vogt,^{12} although he still executed a laboratory study that finds rather small values for $k$ (he also provides a thorough review of experimentation through the mid-1990s). Other authors look at this problem through slightly different lenses, although most come up with equivalent results. For instance, Morinaga *et al.*^{4} looked at the duration of the quiet time interval between events but still determined $L_{eq}$ (an EEH-based computation) as the first-order correlate to their data.

The noise and number topic has also been studied outside of the aircraft-noise arena. Researchers on road traffic noise face similar problems of integrating noise across multiple pass-by events. Sato *et al.*^{25} describe a field experiment in which they record the number of traffic events along with sound level meter measurements. Although they use analysis methods not covered here, they conclude that there is a strong relationship between prevalence of annoyance and the peak events recorded over a period of time. This situation might benefit from the use of $b$ analysis, as it retains the maximum single-event value, as opposed to $k$-based analysis, which only retains the mean.

There are other schemes put forward to deal with this kind of data: the noise pollution level concept of Robinson^{26} asserts that it may not be the number *per se* of the events but how the overall sound pressure level modulates up and down over time as a result of things coming and going—a highway may be proportionally less annoying than sounds that one hears one at a time—although no alternate scheme like this has ever gained significant traction in practice (see, for instance, Rice^{27}). There are also threshold-based concepts such as Gjestland and Oftedal,^{28} which have proved efficacious in laboratory settings. (Do subjects *only* integrate noise power between events that are clearly heard over some ambient background noise?) The latter approach is unfortunately not extensible beyond the experiment over which it is demonstrated, although it resembles current research by Christian^{20} that, once more mature, may be applicable to the more general noise and number issue.

Last, Vaughn *et al.*^{29} parallels the application of the subsequently described analysis methodology to simulated data of community response to quiet-supersonic noise events. That work contains a brief literature review of multiple-boom response studies and considers potential dose designs for X-59 community studies in the context of $b$ analysis, which complements the effort of the present study.

## III. METHODOLOGY

This section introduces the $b$ analysis methodology as applied to quiet-supersonic community response data. The same methodology is outlined in Vaughn *et al.*,^{29} although a more thorough exposition is given here.

### A. Noise metrics

Implementing the $b$ analysis for quiet-supersonic events requires the use of a cumulative metric that is expressive of human response to the quiet-supersonic noise signatures. Stevens's Mark VII perceived level (PL)^{30} as a candidate metric for single-event dose and the corresponding day-night averaged perceived level (PLDNL) as the cumulative dose metric have commonly been used in previous analyses of preliminary community response studies conducted by NASA.^{8,31–33} However, neither the human response to the cumulative metric of PLDNL nor the implications of cumulative daily response are well understood. Therefore, the present analysis investigates potential relationships between single-event and cumulative data with the PLDNL metric to garner greater understanding of cumulative response survey results.

The starting point of the $b$ analysis is to use Eq. (6) to generate sets of cumulative doses that will be referred to as $b$ doses to differentiate from DNL, which is equivalent to the $b$ dose when $b = 0.5$. Given the single parameter of interest ($b$) and the finite range of [0,1], this analysis uses a “grid approximation,” as opposed to the more common Markov chain Monte Carlo approach.^{34} Therefore, sets of $b$ doses are generated for values of $b$ from zero to one in small increments of 0.001, resulting in 1001 $b$ dose sets. Each set of $b$ doses and its corresponding responses (which remain unchanged) are then ready to be fit to a dose-response model.
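The grid of $b$ doses can be sketched as follows. The code assumes the $b$ formulation is $20\,b \log_{10} \sum_i 10^{L_i/(20b)} - 49.4$ and that each participant-day is a list of single-event levels; the names `b_dose` and `build_b_dose_sets` and the example data are hypothetical.

```python
import math

NORMALIZATION_DB = 49.4  # time normalization constant quoted in the text


def b_dose(levels_db, b):
    """Assumed cumulative b dose for one day of single-event levels."""
    loudest = max(levels_db)
    if b == 0:
        return loudest - NORMALIZATION_DB
    total = sum(10 ** ((level - loudest) / (20 * b)) for level in levels_db)
    return loudest + 20 * b * math.log10(total) - NORMALIZATION_DB


def build_b_dose_sets(days, step_count=1000):
    """Return the b grid and one list of doses per grid value."""
    b_grid = [i / step_count for i in range(step_count + 1)]
    dose_sets = [[b_dose(day, b) for day in days] for b in b_grid]
    return b_grid, dose_sets


# Hypothetical single-event levels (dB) for three participant-days.
days = [[78.0, 82.5, 70.1], [88.0], [66.0, 67.5, 69.0, 71.2]]
b_grid, dose_sets = build_b_dose_sets(days)
print(len(dose_sets), [round(d, 1) for d in dose_sets[500]])
```

The responses are not touched at this stage; only the dose attached to each response changes with $b$.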

### B. Statistical dose-response model

The cumulative dose-response data are modeled using a simple logistic regression. The choice of model assumes a particular relationship between the dose and response. A simple logistic regression treats observations as independent by fully pooling the data. Given that previous and future quiet-supersonic flight studies implement a longitudinal design, data may be impacted by order effects^{35} and have within-subject correlation, which can be accounted for with a multilevel model.^{36} Previous analysis of the presently considered dataset produced similar single-event dose-response curves using simple and multilevel logistic regression.^{37} Although it is reasonable for individuals to be differently annoyed than the tendency of the overall population, a simple logistic regression is considered here to observe overall population trends.

### C. Profile likelihood

This analysis evaluates the likelihood,^{38} $L$, as a function of $b$. The function, $L$, describes how likely a particular value of $b$ was to have given rise to the data that were observed. Thus, it can be used as a measure of the goodness of fit for different values of $b$. Unlike the commonly reported $R^2$, which has use as an absolute measure of goodness of fit, likelihood has statistical meaning as a relative measure between values of $b$ and can be used for statistical inference, i.e., to differentiate between values of the parameter. A Bernoulli likelihood^{39} is computed here for the simple logistic regression [Eq. (8)]. Reference 39 provides a tutorial on Bernoulli likelihood for social scientists. A book by Edwards^{38} offers a more thorough accounting of the concept of likelihood, including its application to Bernoulli data. The likelihood statistic is computed in the logarithmic domain for computational stability purposes, which is a typical modification to Eq. (8).

The logit $\beta $ parameters from Eq. (7) are uniquely estimated for each value of $b$ using maximum likelihood estimation. This procedure of maximizing over the $\beta $ parameters is a scheme for creating a “profile” likelihood function. That is, a curve fitting tool may be used to quickly maximize over the $\beta $ parameters of the logit, and then the unidimensional likelihood is only evaluated for the parameter of interest. This is in contrast to “integrated” likelihood methods for which all combinations of parameters are evaluated (e.g., via Monte Carlo methods) and then integrated over the probability of the “nuisance” parameter posterior distributions to gain a unidimensional function. For this application, the results appear to be free of the pathologies of profile likelihood approaches, for instance, the presence of “sharp ridges” (cf. Berger *et al.*^{40}). Integrated methods tend to be common for models that have many parameters—models for which it would be computationally infeasible to create a high-dimensional grid approximation. However, for this application, generation of the profile likelihood is much more expedient than a Monte Carlo approach.
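A profile log-likelihood over $b$ can be sketched with the standard library alone. This is an illustration under stated assumptions, not the authors' implementation: the $b$ dose is taken as $20\,b \log_{10} \sum_i 10^{L_i/(20b)}$ (constant offsets do not affect the fit), the logistic regression is fit by a damped Newton iteration on standardized doses, and all function names and data are hypothetical.

```python
import math


def b_dose(levels_db, b):
    """Assumed b dose (normalization constant omitted)."""
    loudest = max(levels_db)
    if b == 0:
        return loudest
    total = sum(10 ** ((level - loudest) / (20 * b)) for level in levels_db)
    return loudest + 20 * b * math.log10(total)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bernoulli_log_likelihood(beta0, beta1, xs, ys):
    """Sum of y*z - log(1 + exp(z)) with z = beta0 + beta1*x."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = beta0 + beta1 * x
        total += y * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total


def max_log_likelihood(doses, ys, iterations=50):
    """Maximize the Bernoulli log-likelihood over the two logit parameters."""
    n = len(doses)
    mean = sum(doses) / n
    spread = math.sqrt(sum((d - mean) ** 2 for d in doses) / n) or 1.0
    xs = [(d - mean) / spread for d in doses]  # rescaling leaves the maximum unchanged
    b0 = b1 = 0.0
    best = bernoulli_log_likelihood(b0, b1, xs, ys)
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:
            break
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        step, improved = 1.0, False
        while step > 1e-6:  # halve the Newton step until the fit improves
            trial = bernoulli_log_likelihood(b0 + step * d0, b1 + step * d1, xs, ys)
            if trial > best:
                improved = True
                break
            step /= 2.0
        if not improved:
            break
        b0, b1 = b0 + step * d0, b1 + step * d1
        gain, best = trial - best, trial
        if gain < 1e-10:
            break
    return best


def profile_log_likelihood(days, ys, b_grid):
    """Profile over b: refit the logit at each b and keep the maximized value."""
    return [max_log_likelihood([b_dose(day, b) for day in days], ys) for b in b_grid]


# Hypothetical participant-days (dB) and binary highly annoyed responses.
days = [[70.0], [80.0, 75.0], [85.0, 85.0, 80.0], [90.0],
        [88.0, 86.0], [66.0, 67.0], [72.0, 90.0, 65.0], [78.0]]
responses = [0, 0, 1, 1, 0, 0, 1, 0]
profile = profile_log_likelihood(days, responses, [0.0, 0.25, 0.5, 0.75, 1.0])
print([round(value, 3) for value in profile])
```

Because the nuisance parameters are maximized out at every grid point, only a one-dimensional curve over $b$ remains to be inspected.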

### D. Posterior distribution

^{32}This is evaluated using Bayes's rule as follows:

^{34}The prior is a means for including a presumption regarding the value of $b$ before performing the analysis. To give no predisposition for the value of $b$, this analysis uses a “noninformative” prior, which equates to a standard uniform distribution over the unit interval [0,1] as $b$ values beyond this interval are discounted as unphysical or nonsensical. The step between the likelihood function and posterior is therefore trivial: a multiplication by one and a division to normalize the area under the curve to one.
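The step from likelihood to posterior on a grid can be sketched as below. The sketch assumes log-likelihood values are already available on an ascending grid of $b$, applies a uniform prior of one, and normalizes with a trapezoidal area; the function name and the example log-likelihood curve are hypothetical.

```python
import math


def posterior_on_grid(b_grid, log_likelihood):
    """Posterior density of b on a grid under a uniform prior on [0, 1]."""
    peak = max(log_likelihood)
    prior = 1.0  # noninformative prior: multiplication by one
    unnormalized = [math.exp(value - peak) * prior for value in log_likelihood]
    area = sum(
        0.5 * (unnormalized[i] + unnormalized[i + 1]) * (b_grid[i + 1] - b_grid[i])
        for i in range(len(b_grid) - 1)
    )
    return [value / area for value in unnormalized]


# Hypothetical log-likelihood curve peaked at b = 0.3.
b_grid = [i / 100 for i in range(101)]
log_like = [-50.0 * (b - 0.3) ** 2 for b in b_grid]
posterior = posterior_on_grid(b_grid, log_like)
mode = b_grid[posterior.index(max(posterior))]
print(mode)
```

Subtracting the peak log-likelihood before exponentiating avoids underflow and has no effect after normalization.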

### E. Interpretation

This analysis uses two descriptors to convey the results of the posterior distribution for $b$: a point estimate that maximizes the posterior density function of $b$ and an interval estimate about that value. The point estimate is simply the mode or peak of the posterior distribution, and its magnitude within the interval [0,1] describes the general behavior of respondents. The peak point estimate is complemented by an interval estimate that describes the precision about the point estimate. Bayesian inference can be used to generate this credible interval (CI).^{34} Given that the encapsulated area of the posterior distribution equals one, a critical value, $Po^*$, of the posterior distribution can be determined, where the area within the curve equals a chosen criterion. For example, a criterion of 0.95 would determine the set of $b$ that contains 95% of the probability of that parameter given the data by computing $Po^*$ and taking all $b$ for which $Po(b \mid \mathrm{Data}) > Po^*$ to be within this interval. Note that this method may generate intervals that have multiple disconnected sections. For instance, with a bimodal distribution, the data may support $b$ being either zero or one but not 0.5 (although this is not encountered in this work).
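The critical-value construction can be sketched on a uniform grid. The code assumes equally spaced $b$ values, treats each grid point's share of the total density as its probability mass, and keeps the highest-density points until the chosen criterion is reached; the function name and the example densities are hypothetical.

```python
def highest_density_set(b_grid, density, criterion=0.95):
    """Return (mode, critical_value, included_b) for a gridded posterior.

    Points are added in order of decreasing density until their share of
    the total mass reaches `criterion`; the density of the last point
    added plays the role of the critical value.
    """
    total = sum(density)
    order = sorted(range(len(b_grid)), key=lambda i: density[i], reverse=True)
    included, mass = [], 0.0
    for index in order:
        included.append(index)
        mass += density[index] / total
        if mass >= criterion:
            break
    critical_value = density[included[-1]]
    mode = b_grid[order[0]]
    return mode, critical_value, sorted(b_grid[i] for i in included)


# Hypothetical coarse posterior for illustration.
b_grid = [0.0, 0.25, 0.5, 0.75, 1.0]
density = [0.1, 0.3, 3.0, 0.5, 0.2]
print(highest_density_set(b_grid, density, criterion=0.9))
```

Because the set is selected by density rather than by position, it can come out as several disconnected pieces when the posterior is multimodal.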

These two descriptors provide insight for interpreting $b$ analysis results. As the CI contains the peak point estimate and a bound about it (provided the function is continuous), the width indicates the precision in the determination of $b$. A broad CI width suggests that the observed data do not support a particular $b$ value over another, whereas a narrow CI width provides assurance in the $b$ estimation. Furthermore, the CI width typically relates to curvature of the posterior distribution (and likelihood function) near the maximum. Statistical “information”—a usage of the word that predates the work of Claude Shannon^{38}—is defined as a measure of the curvature of the likelihood function near its maximum. High information corresponds to high curvature and a narrow peak.^{41} The informativeness of the observed data determines the confidence in estimating $b$, which subsequently impacts the understanding of the underlying cumulative response behavior. It is conceivable to have two equal-sized datasets where one is informative and the other is noninformative with regard to $b$.

This notion of information-density is important in developing test plans as a test design can improve the gathered information by allowing for contrasting outcomes for different settings of the parameter of interest. The higher the contrast, the greater the statistical power of the test and the more potential information gained regarding the parameter (cf. the introduction of Miedema *et al.*^{22}). The simulated datasets described next will demonstrate what can happen in informative and noninformative cases.

## IV. DOSE-RESPONSE DATASETS

This section describes the simulated and observed quiet-supersonic community study dose-response datasets. All datasets are similarly composed of cumulative doses computed from sets of single-event doses and binary highly annoyed (HA) responses.

### A. Simulation datasets

Simulation datasets were produced to demonstrate successful implementation of the proposed analysis method and indicate potential dataset limitations. Two sets of data were generated: Simulation 1 spans the perceptual range from 0% and approaches 100%HA and simulation 2 spans from 0% to 4%HA over the dose range, which is similar to a previous community study.^{33} Each simulated dataset consists of 10 000 independent cumulative dose-response pairs. This value was chosen as large enough to reduce sampling error yet small enough for relatively quick runtimes. Each dataset has the same set of dose values but different responses. Prior to generating response data, a user-defined input $b$ value is chosen. The input $b$ value determines the $b$ dose from which responses are generated and can be compared to the output $b$ value for verification of the proposed analysis.

The simulated dose data follow the same format as real quiet-supersonic cumulative dose-response data, as shown in the subsequent section. Each cumulative dose value is calculated from one to seven single-event doses ranging from 65 to 90 dB. The number of single events and each single-event dose value are drawn from uniform distributions with the aforementioned ranges that are comparable to observed values for a previous community study,^{8} although the distributions differ, as will be shown in Sec. IV B. The resultant cumulative dose in PLDNL ranges from 15 to 46 dB with a distribution depicted in Fig. 1(a). This distribution shape is the result of the logarithmically summed single-event doses drawn from the uniform distributions. A potential benefit to this distribution is that there are more doses sampled from the upper dose range where there is a greater potential for annoyance responses, which may help better characterize the dose-response curve. Figure 1(b) illustrates the various combinations of number of events and the maximum single-event PL with the color bar noting the PLDNL values. The dominant trend of similar PLDNL values at a given maximum single-event PL is attributed to the logarithmic summation of single events.
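The dose generation described above can be sketched as follows. The sketch assumes the cumulative dose is an energy sum of the single-event levels minus the 49.4-dB normalization constant with no nighttime penalty applied; the function name and the fixed seed are illustrative choices.

```python
import math
import random


def simulate_cumulative_doses(n_days, rng):
    """Draw 1 to 7 single-event doses in [65, 90] dB and energy-sum each day."""
    doses = []
    for _ in range(n_days):
        n_events = rng.randint(1, 7)  # uniform over 1..7 inclusive
        events = [rng.uniform(65.0, 90.0) for _ in range(n_events)]
        energy = sum(10 ** (level / 10) for level in events)
        doses.append(10 * math.log10(energy) - 49.4)
    return doses


doses = simulate_cumulative_doses(10000, random.Random(0))
print(round(min(doses), 1), round(max(doses), 1))
```

The logarithmic summation pulls most days toward the level of their loudest event, which is why the resulting distribution is weighted toward the upper dose range.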

The simulated response values are drawn from a Bernoulli distribution based on a logistic function with user-defined $\beta_0$ and $\beta_1$ parameters. This response generation process can be implemented simply by taking a given $b$ dose value and using the corresponding value on the parameter-defined logistic curve as the probability of high annoyance. A random number is then drawn from a uniform distribution from zero to one: if the random number is greater than the probability of high annoyance, the response is assigned a zero for not HA, and if the random number is less than the probability value, the response is a one for HA. This process is repeated for all 10 000 dose-response pairs in each simulated dataset.
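The response-generation step above can be sketched as below. The function names and the seeded generator are illustrative assumptions. The logic follows the description: evaluate the logistic curve at the dose, draw a uniform random number, and assign HA when the draw falls below the probability.

```python
import math
import random


def p_high_annoyance(dose, beta0, beta1):
    """Logistic probability of high annoyance at a given b dose."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * dose)))


def simulate_responses(doses, beta0, beta1, seed=0):
    """Return one Bernoulli response (1 = HA, 0 = not HA) per dose."""
    rng = random.Random(seed)
    responses = []
    for dose in doses:
        p = p_high_annoyance(dose, beta0, beta1)
        u = rng.random()  # uniform on [0, 1)
        responses.append(1 if u < p else 0)
    return responses
```

Fixing the seed makes a simulated dataset reproducible, which is convenient when the same doses are reused with different input $b$ values.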

The simulated response values differ between the simulation 1 and simulation 2 datasets. The $\beta$ parameters for simulation 1 are $\beta_0=-20$ and $\beta_1=0.6$, which places the 50%HA point at $-\beta_0/\beta_1 \approx 33$ dB. This allows the logistic curve to span the perceptual range and represent a more idealized dataset for fitting a logistic regression where one-third of the responses are HA. For simulation 2, $\beta_0=-8.25$ and $\beta_1=0.11$, which yields a more limited perceptual range that is a more realistic depiction of previously observed quiet-supersonic dose-response data with approximately 2%HA at a PLDNL of 40 dB.^{33}
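As a quick check of the stated ranges, the two logistic curves can be evaluated at a few doses. This is a minimal sketch; the helper name `percent_ha` and the chosen evaluation doses are illustrative, with 15 and 46 dB taken as the ends of the simulated PLDNL range.

```python
import math


def percent_ha(dose_db, beta0, beta1):
    """Percent highly annoyed predicted by the logistic curve at a dose."""
    return 100.0 / (1.0 + math.exp(-(beta0 + beta1 * dose_db)))


SIM1 = (-20.0, 0.6)   # (beta0, beta1) for simulation 1
SIM2 = (-8.25, 0.11)  # (beta0, beta1) for simulation 2

# Simulation 1 at the ends of the dose range; simulation 2 at 40 and 46 dB.
checks = {
    "simulation 1": (SIM1, (15.0, 46.0)),
    "simulation 2": (SIM2, (40.0, 46.0)),
}
for name, (betas, doses) in checks.items():
    values = [percent_ha(d, *betas) for d in doses]
    print(name, [f"{d:.0f} dB: {v:.2f}%HA" for d, v in zip(doses, values)])
```

With these parameters, simulation 1 runs from essentially 0%HA to nearly 100%HA across the dose range, while simulation 2 stays near 2%HA at 40 dB and near 4%HA at the top of the range.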

### B. QSF18 community study dataset

NASA conducted a risk reduction community study, Quiet Supersonic Flights 2018 (QSF18), at Galveston, TX in 2018. Although this study had limited cumulative dose-response data and was not planned with the $b$ analysis in mind, it provides a real-world dataset with which to demonstrate the $b$ analysis methodology. Dose-response data were collected over 9 days with 4–8 quiet-supersonic, or “sonic thump,” events per day, resulting in a total of 52 events. Single-event dose values were assigned via a combination of predicted and measured noise levels for each participant based on their location at the time of the event. Cumulative dose values can then be computed from the single-event doses that a participant experienced in a given day as given in Eq. (1) [or Eq. (6) for $b$ analysis]. Responses were solicited from participants after each sonic thump via single-event surveys and at the end of each day via daily summary surveys. The surveys asked participants the following questions: “How much did the sonic thump bother, disturb, or annoy you?” and “Over the course of your day, how much did the sonic thumps bother, disturb, or annoy you?”^{42,43} Participants then responded with a selection from the following five-point verbal scale: *not at all*, *slightly*, *moderately*, *very*, or *extremely*. More details regarding QSF18 can be found in Refs. 8 and 32.

The QSF18 cumulative response data came from the daily summary surveys. Of the 500 initially recruited participants, 386 participants completed a total of 1952 daily summary surveys, which are considered in the present analysis. These cumulative responses are depicted as a histogram in Fig. 2, where the vast majority of responses fall under the *not at all* annoyed category and relatively few at higher annoyance categories. Response choices are commonly dichotomized for analysis in community noise studies where *very* and *extremely* are considered to be HA and other responses are considered to be not HA.^{42,43} Additionally, a lower threshold for annoyance at *moderately* or *slightly* can be used to perform analyses with participants categorized as at least moderately annoyed (MA+) or at least slightly annoyed (SA+) relative to participants less annoyed.^{32,33}
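The three dichotomizations described above can be sketched as a single thresholding step on the five-point verbal scale. The names `SCALE`, `THRESHOLDS`, and `dichotomize` are illustrative assumptions, not identifiers from the QSF18 dataset.

```python
# Five-point verbal scale, ordered from least to most annoyed.
SCALE = ("not at all", "slightly", "moderately", "very", "extremely")

# Lowest scale category counted as "annoyed" for each dichotomization.
THRESHOLDS = {"HA": "very", "MA+": "moderately", "SA+": "slightly"}


def dichotomize(responses, cut="HA"):
    """Map verbal responses to 1 (at or above the cut) or 0 (below it)."""
    floor = SCALE.index(THRESHOLDS[cut])
    return [1 if SCALE.index(r) >= floor else 0 for r in responses]
```

Lowering the cut from HA to MA+ or SA+ can only add ones, which is why the lower thresholds supply more annoyed responses to the logistic model.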

Cumulative dose values corresponding to the 1952 completed daily summary surveys are depicted in PLDNL as a histogram in Fig. 3(a). They are computed from 8704 single-event doses: 4998 from matched single-event dose-response pairs and the remaining 3706 assigned from the daily summary survey, in which participants reported their location at the time of day of each quiet-supersonic event so that single-event doses could be assigned even when a single-event survey was not completed. Figure 3(b), which is similar to Fig. 1(b), illustrates the observed QSF18 combinations of number of events and maximum single-event PL, with the color bar noting the PLDNL values. Although 4–8 single events occurred per day, cases with fewer than four events arose from unusable participant location information, and the one day with eight events had an event with unusable data. More details regarding the QSF18 cumulative dose-response dataset can be found in Sec. 3.2.2 of Fidell *et al.*^{32}

## V. RESULTS

This section discusses the results of applying the $b$ analysis to the described datasets. This includes observations of how the range of annoyance response data impacts the precision with which one can estimate $b$ and other implications for future noise response test designs.

### A. Simulation results

The results for the simulation datasets demonstrate the successful implementation of the $b$ analysis and potential dataset limitations. The output $b$ value needs to be equal or similar to the input $b$ value to demonstrate consistency of the analysis. Simulation 1 demonstrates an idealized dataset while simulation 2 is more indicative of realistic quiet-supersonic data. Detailed results of both simulations are presented with a specified input $b$ value of 0.5. The general relationship for varying input $b$ values relative to the returned output $b$ value is then given for the two simulations.

Simulation 1 consists of a fully sampled dose-response curve with responses spanning the %HA range from 0% to nearly 100% within the prescribed dose range, as illustrated in Fig. 4(a). Figure 4(b) reveals that the most probable, or output, $b$ value matches the input value of 0.5. The CI is narrow about this maximum, and the curvature of the probability function is very high (very negative), indicating that this dataset is highly informative for the $b$ parameter.

Simulation 2 consists of a limited dose-response curve with responses spanning the %HA range from 0% to about 4% within the prescribed dose range, as visualized in Fig. 5(a). The $b$ posterior distribution in Fig. 5(b) suggests a $b$ value that is close to the input $b$ value of 0.5. Although the estimate is reasonably accurate, its precision is low, as depicted by the broad 95% CI. Rather than focusing on the peak of the distribution, considering which $b$ values can be excluded may provide more insight into the most apt underlying predictive phenomenon. The maximum single-event level $(b=0)$ and number of events $(b=1)$ can potentially be rejected as they lie outside the CI, but it is still not clear whether the EEH or an in-between $b$ value would be the most predictive of annoyance. Although one might suppose that this sort of result is misleading or wrong, it is not. Rather, it is simply uninformative, as established by the wide CI.

The reason for the difference between the outcomes depicted in Figs. 4 and 5 might not be immediately obvious. At a basic level, both datasets consist of the same number of points, and these points were generated in the same way from similar logistic curves. The difference primarily lies in the amount of information that is contained in each data point, a quantity that differs markedly between the two cases. For this type of regression, the information provided by each point for the determination of the $\beta$ regression coefficients is primarily related to two factors: the slope of the function being fit at the *x*-coordinate of that point and the variance of the statistical process at the *y*-coordinate of that point (see Green^{44}). Although the Bernoulli likelihood has a variance maximum near 50%HA, that is also the location of the maximum slope of the logistic curve. These two factors combine such that data points located near the 50% level are the most informative for the curve fit. The uncertainty in $b$ necessarily depends on the uncertainty in the curve fit, which implies that data collected around 50%HA will also produce better determinations of $b$. Thus, the case shown in Fig. 4, which explores the entire “transition region” between 0% and 100%HA, will necessarily result in a better determination of $b$.
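The slope-versus-variance trade-off can be made concrete with a small sketch. It uses a standard Fisher-information expression for a single Bernoulli observation, slope squared divided by variance, which for a logistic curve reduces to $\beta_1^2\,p(1-p)$. This expression is an assumption chosen for illustration and is not necessarily the formulation in the cited reference. The function names and the evenly spaced 15 to 46 dB dose grid are likewise illustrative.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def info_per_point(dose_db, beta0, beta1):
    """Slope squared over Bernoulli variance at one dose."""
    p = logistic(beta0 + beta1 * dose_db)
    variance = p * (1.0 - p)
    slope = beta1 * p * (1.0 - p)   # d p / d dose for a logistic curve
    return slope ** 2 / variance    # equals beta1**2 * p * (1 - p)


def mean_info(doses_db, beta0, beta1):
    return sum(info_per_point(d, beta0, beta1) for d in doses_db) / len(doses_db)


doses = [15.0 + 0.5 * i for i in range(63)]  # 15 to 46 dB in 0.5 dB steps
sim1 = mean_info(doses, -20.0, 0.6)
sim2 = mean_info(doses, -8.25, 0.11)
print(f"mean information per point: sim 1 = {sim1:.4g}, sim 2 = {sim2:.4g}")
```

On this grid, the simulation 1 parameters yield far more information per point than the simulation 2 parameters, even though both datasets contain the same number of points.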

This is not the entire story, as there seem to be further factors that impact the amount of information contained in a dataset about $b$. These factors have to do with the number of single events that comprise a multiple-event response and the relationship between points in the dataset as a whole. For instance, sets that have lots of data around 50%HA but have all HA responses coming from loud single-event days will result in good determination of the $\beta$ coefficients but be completely uninformative for $b$. Further, sets with various numbers of events per day may be more informative than sets that all have the same number of events per day. These sorts of rules are not yet fully understood and are the focus of future work (cf. Vaughn *et al.*^{29}).

Figures 4 and 5 only represent results where the input $b$ equals 0.5. The same simulation can be run with varying values of input $b$ from zero to one while keeping the same dose values and sets of $\beta$ parameters for simulations 1 and 2. Results of such simulations are given in Fig. 6 and demonstrate the informativeness of the datasets in returning the prescribed input $b$ value. A linear trend along the diagonal, where the input $b$ value equals the output $b$ value, signifies that the method accurately captures the expected result. The equivalence line is shown as a dashed line in Fig. 6, and the solid and dotted lines depict the posterior distribution peak values for simulations 1 and 2, respectively. The solid line for simulation 1 and its accompanying CIs closely follow the equivalence line, demonstrating informative data that yield accurate and precise results for the proposed $b$ analysis. On the other hand, the dotted line for simulation 2 varies wildly, and although the equivalence value is captured within its broad CIs, these data represent a case that produces largely uninformative results regarding the most predictive $b$ value. This result suggests that field data (i.e., QSF18) on which simulation 2 was based may not yield highly informative results regarding the most predictive $b$ value.

Of course, setting a goal for uncertainty for a real-world test campaign (e.g., of quiet-supersonic flyovers) is a practical question that is necessarily convolved with considerations of the time and expense needed to execute the test. Understanding the factors that contribute to uncertainty serves as guidance for how to design a test to minimize uncertainty in a parameter like $b$ for a fixed amount of testing effort (again, cf. Vaughn *et al.*^{29}). In the end, even though one would want, from a purely statistical standpoint, to generate data that are as informative as possible with many HA responses, it is clearly impractical and impolitic to execute a test in which some populations are ensonified to a point at which they are 100%HA (even 50%HA is likely too far).

### B. QSF18 community study results

The $b$ analysis results for QSF18 data are given in Fig. 7. Similar to the simulation results in Figs. 4(b) and 5(b), QSF18 results are presented in Figs. 7(a)–7(c) as posterior distributions with the peak and 95% CI bounds indicated. Examining Fig. 7(a) reveals that the typical dichotomization into HA data produces an uninformative situation for this analysis: the distribution is essentially flat, with the CI nearly encompassing the entire unit interval. Accordingly, two more analyses were undertaken, where the dichotomization point was lowered to include MA+ and SA+ responses, and their results are depicted, respectively, in Figs. 7(b) and 7(c). These alternative dichotomizations provide more 1s (annoyed responses) for the logistic modeling and, thus, transitioning from HA to SA+ results has a similar effect as going from simulation 2 toward simulation 1. As expected, this corresponds to a more peaked distribution and tighter CI.

Although examining the peaks in Figs. 7(a)–7(c) provides an interesting interpretation in terms of what $b$ values might be optimal, a more pragmatic point of view might be to ask which $b$ values can be rejected. None of the three sets of results reject the EEH at $b=0.5$. Also, although there are regions in Figs. 7(a)–7(c) that can be rejected, this may only be an artifact of the CI computation method: the defined CIs contain 95% of the area under the distribution and, therefore, 5% is rejected by definition. Given these results, a reasonable question to ask is whether or not any part of the interval $b \in [0,1]$ should be rejected. A method to answer this question is provided by taking a “likelihoodist” approach.^{38,45} The reasoning here is that if the likelihood function is similar to a normal distribution, then the frequentist 95% confidence interval will be formed by the set of $b$ values for which the log-likelihood value is within roughly two units of the peak value. This can be adopted directly as the criterion for interval estimation. Thus, we can form a likelihoodist interval (LI) by observing where the log-likelihood values fall by a value of two from their respective peaks.
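The threshold of roughly two follows from the normal approximation. As a brief sketch, with $\ell(b)$ denoting the log-likelihood, $\hat{b}$ its peak location, and $\sigma$ the standard deviation of the approximating normal (symbols introduced here only for illustration):

$$
\ell(b) \approx \ell(\hat{b}) - \frac{(b-\hat{b})^2}{2\sigma^2},
\qquad
\ell(\hat{b} \pm 1.96\,\sigma) - \ell(\hat{b}) \approx -\frac{1.96^2}{2} \approx -1.92 .
$$

A drop of two log-likelihood units therefore marks out an interval slightly wider than the nominal 95% interval when the normal approximation holds.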

Figure 7(d) shows the same results of Figs. 7(a)–7(c) cast as log-likelihood functions instead of as distributions and scaled such that their peaks are all set to zero. The locations where these functions fall below $-2$ demarcate the LI. For the SA+ case, the LI produced is similar to the CI; however, for the HA and MA+ cases, the likelihoodist interpretation indicates that no part of the interval $b \in [0,1]$ should be rejected as the functions never fall below the threshold. This conclusion relates to the earlier discussion on the curvature of the likelihood function near its peak,^{41} and it is reasonable to say that the datasets lack the information needed to confidently reject any values of $b$ for the HA and MA+ cases.
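Reading an LI off a gridded log-likelihood curve can be sketched as below. The function name `likelihood_interval` and its arguments are hypothetical. The sketch assumes a single-peaked curve evaluated on a grid of $b$ values and returns the outermost grid points that lie within the chosen drop of the peak, so its resolution is limited by the grid spacing.

```python
def likelihood_interval(b_grid, loglik, drop=2.0):
    """Return (low, high) grid values whose log-likelihood is within
    `drop` of the peak. Assumes a unimodal curve on the grid."""
    peak = max(loglik)
    inside = [b for b, ll in zip(b_grid, loglik) if ll - peak >= -drop]
    return min(inside), max(inside)


# Illustrative synthetic curves on b in [0, 1]: one sharply peaked, one flat.
b_grid = [i / 100 for i in range(101)]
sharp = [-((b - 0.5) ** 2) / (2 * 0.1025 ** 2) for b in b_grid]
flat = [-((b - 0.5) ** 2) / (2 * 2.0 ** 2) for b in b_grid]

print("sharp curve LI:", likelihood_interval(b_grid, sharp))
print("flat curve LI:", likelihood_interval(b_grid, flat))
```

For the flat synthetic curve the log-likelihood never falls two units below its peak anywhere on the grid, so the returned interval spans the whole of $[0,1]$, mirroring the situation described for the HA and MA+ cases.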

Considering the peaks of the distributions in Fig. 7, the HA results seem to indicate that people with HA responses are responding to the peak event that they experienced in a given day: even if they are exposed to several events, by the end of the day they are responding primarily to a memory of the loudest event. In contrast, the MA+ and SA+ results suggest that people are responding to a hybrid of the number of events and the energy integration suggested by the EEH.

The differing distribution peaks for the QSF18 results point to an interesting issue in community noise research: a tacit assumption built into many studies on noise and annoyance is that people will be annoyed in the same way by loud and soft sounds and that the response process does not vary with scale. That is, a researcher could figure out how a few quiet sounds annoy someone and then potentially extrapolate up in level, number, and so on. Similarly, one might assume that sections of the population who are HA respond that way by being annoyed in the same manner as those who are less annoyed. These results hint at an interesting counterpoint, and some basic ideas in the way that community response to noise is modeled and studied may need to be scrutinized if people who are more annoyed are also annoyed in a systematically different way than their less-irked peers. That said, it is unknown whether the analyses used here are capable of contending with such features in the data. The assumption of response homogeneity is included within this study just as much as it is in other studies. The simulation studies from Sec. IV A were based on data that were homogeneous in $b$, and it is unknown how the analysis would respond to this sort of heterogeneity in the dataset. The dose-response data collected in QSF18 were not intended to be definitive or conclusive in regard to human response to quiet supersonic flight, and further development would be needed to determine whether the distribution peaks in Figs. 7(a)–7(c) are reliable and what they actually reveal about the response.

### C. Implications for future test design

What are the takeaways from these analyses to aid the design of future studies so that they can generate data to discriminate between values of $b$? The results of the simulation study and QSF18 data reveal similar trends: sampling more of the perceptual response range results in a better curve fit and a more informative estimation of $b$ (again, see Vaughn *et al.*^{29}). Simply increasing the sample size would likely not provide any additional information as the percentage of annoyed responses should remain somewhat constant. An assumption that $b$ is homogeneous across annoyance levels could allow for fitting to SA+ or similar data; however, this would need to be scrutinized given the results above. Having a greater variety of single-event dose combinations (i.e., various numbers of events and several mixtures of levels) seems to be important. The $b$ analysis simulation study results of Vaughn *et al.*^{29} also uphold these inferences, in particular, the importance of dose design.

## VI. CONCLUSION

This paper outlined a method to determine the best predictor of annoyance to multiple noise events for use in community noise studies. The method uses a parameter, $b$, that continuously varies a cumulative dose metric from representing the maximum single event when $b=0$, through a DNL or “equal energy” summation when $b=0.5$, to a value more responsive to the number of events when $b=1$. The $b$ analysis was demonstrated on data from simulations and the QSF18 community study. Implementation on simulated data revealed that the response buried in the data can be confidently resolved by the analysis when the data are sufficiently informative. In simulation 1, the well-sampled curve illustrated an idealized case, whereas the limited curve in simulation 2 exposed potential data insufficiencies for implementing this method. Limitations observed for simulation 2 are exemplified in the QSF18 results. Overall, the results highlight difficulties in estimating such a parameter from real data, suggest what structures are needed in the data to provide an informative estimation of a parameter like $b$, and encourage further research of the $b$ analysis, which could include application to other noise sources. Community noise surveys designed with $b$ analysis in mind can lead to cumulative noise metrics that better reflect people's annoyance, more effective noise mitigation techniques, and implementation of regulations to curtail potential community annoyance.

## SUPPLEMENTARY MATERIAL

See supplementary material for the QSF18 cumulative dose-response dataset.

## ACKNOWLEDGMENTS

This work was conducted in support of the NASA Commercial Supersonic Technology Project. The authors express gratitude to the following reviewers of this manuscript: K. Ballard, R. Cabell, N. Cruze, W. Doebler, and J. Rathsam.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Ethics Approval

The QSF18 data were collected under protocol approved by the NASA Langley Research Center Institutional Review Board. Informed consent was obtained from all participants as part of this protocol.

## DATA AVAILABILITY

The data that support the findings of this study are available within the article and its supplementary material.

## REFERENCES

*Elements of Aviation Acoustics*

*A Guide to U.S. Aircraft Noise Regulatory Policy*

*Choices, Values, and Frames*

*Quieting the Boom: The Shaped Sonic Boom Demonstrator and the Quest for Quiet Supersonic Flight*

*Community Noise*

*Real Analysis for Graduate Students*

*et al.*, Ref. 19). It was also applied to temporal summation within single noise events of flyovers of unmanned aerial vehicles (see Ref. 20). Rather than $b$, these initial works used $\beth$, the Hebrew letter “bet,” for its symbology with the following rationale: A parameter was needed to represent the inverse of the “base” of the norm, and the Greek $\beta$ was already in use (as it is in this work, for the parameters of the logistic curve).

*L*_{eq} calculations


*Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan*

*Theory of Optimal Designs*