Although the behavioral pure-tone threshold audiogram is considered the gold standard for quantifying hearing loss, assessment of speech understanding, especially in noise, is more relevant to quality of life but is only partly related to the audiogram. Metrics of speech understanding in noise are therefore an attractive target for assessing hearing over time. However, speech-in-noise assessments have more potential sources of variability than pure-tone threshold measures, making it a challenge to obtain results reliable enough to detect small changes in performance. This review examines the benefits and limitations of speech-understanding metrics and their application to longitudinal hearing assessment, and identifies potential sources of variability, including learning effects, differences in item difficulty, and between- and within-individual variations in effort and motivation. We conclude by recommending the integration of non-speech auditory tests, which provide information about aspects of auditory health that have reduced variability and fewer central influences than speech tests, in parallel with the traditional audiogram and speech-based assessments.

Hearing loss is the third most prevalent chronic health condition in the United States (Blackwell et al., 2014; National Center for Environmental Health, 2018), with over 37 million Americans reporting subjective hearing loss. The primary clinical complaint of patients with elevated audiometric thresholds associated with hearing loss is difficulty hearing in noise (Abrams and Kihm, 2015; Arlinger, 2003; Wilson, 2004). Hearing loss affects, among other things, education, employment, and social engagement (Arlinger, 2003; Emmett and Francis, 2015). Current clinical interventions for hearing loss include acoustic amplification, auditory prosthetic devices (e.g., cochlear implants), and aural rehabilitation. On the horizon are pharmaceutical interventions to restore hearing (Xu and Yang, 2021).

The gold standard for quantifying hearing loss and differentiating conductive from sensorineural etiology is comparison of behavioral assessments of air- and bone-conduction pure-tone thresholds. However, there are several reasons pure-tone thresholds should not be the primary outcome measure for evaluating intervention-related benefits. From a theoretical standpoint, pure-tone thresholds provide little information about the functional performance of the auditory system. Pure-tone thresholds represent the minimum sound-pressure level required to detect the presence of a tone in quiet. If restoring audibility were the only thing necessary to eliminate the performance deficits associated with hearing loss, then amplification alone would be a sufficient intervention for the majority of people with hearing loss. However, even after amplification is provided to the maximum extent possible given limitations such as feedback or loudness recruitment (Byrne and Dillon, 1986), many people with hearing loss still experience difficulty understanding speech, especially in noise (Lesica, 2018; Varnet et al., 2021). Unaided pure-tone audiometric thresholds can predict speech-reception thresholds in quiet (Dirks et al., 1982; Humes, 2007; Schum, 1996; Smoorenburg, 1992; Suter, 1985), but the correlation between elevated hearing thresholds and increased difficulty hearing in noise is generally weak (Vermiglio et al., 2012); other aspects of hearing acuity, ranging from frequency resolution (Davies-Venn et al., 2015) to the ability to use fine temporal information (Lorenzi et al., 2006), along with cognitive processes (Dryden et al., 2017), are likely contributing factors. Furthermore, unaided thresholds only partially account for subjective listener complaints (Hannula et al., 2011; Jerger, 2011), and audiometric thresholds and word recognition ability in quiet are generally not considered to be good predictors of real-world capability in noisy listening situations (Barrenäs and Wikström, 2000).
Indeed, more than 25 million Americans with “normal” audiometric thresholds report difficulty hearing in noise (Edwards, 2020).

Addressing the subjective hearing problems of individuals who report disproportionate difficulty hearing speech in noise (SiN) despite having hearing thresholds within normal limits, or of those who continue to have difficulty understanding SiN after audibility is restored by amplification, will require the ability to measure outcomes that correlate with real-world SiN performance. Assessments of pure-tone thresholds and word recognition in quiet are carried out using well-established procedures that are standardized across audiology clinics. While tests of SiN have been developed for clinical use, such as the Quick Speech in Noise test (QuickSIN; Killion et al., 2004) and Hearing in Noise Test (HINT; Soli and Wong, 2008), they are not commonly used, perhaps because they are perceived as less useful given that device interventions often provide only modest benefits for SiN (Lakshmi et al., 2021). Current SiN tests have flaws that make them poorly suited for tracking the small longitudinal improvements in hearing performance that might be obtained from a pharmaceutical intervention for hearing loss. These challenges were evident in the initial results of a Phase 2a clinical trial of the intratympanic drug FX-322, in which individuals in the placebo group showed an unexpected apparent level of hearing benefit (Frequency Therapeutics, 2021), which might have been driven by learning effects or other aspects of the study design.

In this paper, we discuss some of the challenges inherent in measuring SiN performance in listeners with hearing loss, and suggest possible strategies to address these problems in future studies.

Plomp (1978) proposed a model that divided the effect of hearing loss into two components: an audibility component, caused by the loss of information from inaudible parts of the speech signal, and a distortion component, caused by loss of fine detail that makes it difficult for the listener to extract information from the suprathreshold components of the speech signal. Hearing researchers have long been aware that the SiN deficits experienced by listeners with hearing loss cannot be fully explained by the changes in pure-tone sensitivity indexed by the audiogram. The audibility constraints imposed by elevated thresholds place an upper bound on SiN performance, but the distortion component of hearing loss may cause listeners to have difficulty with SiN even when the level of the speech signal is high enough to ensure audibility at all frequencies. In the absence of amplification, the high-frequency pure-tone average can account for much of the variance in speech performance in noise or in quiet (65%–75%; Amos and Humes, 2007; Humes, 2007), but once frequency-shaped gain is applied to overcome audibility limitations, the audiogram provides little (Summers et al., 2013) or no (Amos and Humes, 2007) significant predictive power for SiN performance.

The distortion component of hearing loss can be challenging to isolate in most clinical SiN tests because listeners' performance is influenced by both the audibility and distortion components of hearing loss. The relative contribution of audibility tends to be higher in SiN tests where the target speech is presented at a relatively low level, e.g., the Speech Recognition in Noise Test, where the target speech is presented at 50 dB hearing level (HL) (Brungart et al., 2017), or where the target speech signal is adaptively varied to change the signal-to-noise ratio (SNR) relative to a fixed-level noise, e.g., the Hearing in Noise Test (Nilsson et al., 1994; Vermiglio, 2008) and Oldenburg Sentence Test (Kollmeier and Wesselkamp, 1997; Wardenga et al., 2015). However, even in tests where the target speech is presented at a fixed (and relatively high) level, such as the QuickSIN (Killion et al., 2004), audibility limitations may still impact high frequencies. Listeners with sloping hearing loss have poorer hearing in the high-frequency region, a region that intersects with the speech spectrum's 6-dB-per-octave roll-off. This greatly increases the chances that audibility limitations will cause significant correlations between the audiogram and SiN performance.

Hearing interventions that produce significant improvement in hearing sensitivity would also be expected to produce substantial improvements in performance on conventional SiN tests, as these tests tend to be influenced by stimulus audibility. However, it is possible that an intervention could affect the distortion component of speech perception without a substantial effect on audibility. This could theoretically occur, for example, by regenerating auditory nerve-fiber synapses (Seist et al., 2020), given that the loss of some proportion of synapses has been shown not to affect absolute threshold (Kujawa and Liberman, 2015). Such a change may be more difficult to measure, since audibility and distortion are conflated in many SiN tests. Even if a test were able to isolate the distortion component of hearing loss, a great degree of sensitivity might be required to measure a significant change in auditory function, because changes in distortion that have a large impact on communication effectiveness tend to produce small changes in objective tests of speech perception. One example of a test that can be used to illustrate this point is an 80-word version (Brungart et al., 2021a) of the Modified Rhyme Test (MRT; House et al., 1965), termed the MRT80, which was developed as a tool to assess the fitness for duty of Service Members (SMs) with hearing loss. The MRT has long been the de facto standard for assessing intelligibility in communication systems (American National Standards Institute, 2020), so there is a great deal of information available on how the MRT score corresponds to intelligibility in a real-world listening environment.
The MRT80 is particularly useful to illustrate the challenges associated with measuring SiN performance because (1) the stimuli have a high-frequency emphasis and are presented at a high level to minimize the influence of audibility; and (2) it has been used to evaluate SiN performance on a large number of listeners with normal hearing (NH) thresholds and with hearing loss.

Figure 1 shows the cumulative distributions of the individual MRT80 scores that Brungart et al. (2021a) obtained from 2374 SMs, who were divided into four groups with systematically increasing levels of hearing loss: those with NH thresholds, and those with three levels of hearing loss (H1, H2, and H3) established by the United States Army for determinations of auditory fitness for duty (U.S. Department of the Army, 2019). The results show only a 10-percentage-point difference in median score between listeners with NH and those with the worst hearing profile (H3). There is also substantial overlap in individual scores across the four groups, with roughly 20% of the H3 listeners performing better than the median NH listener and 60% of the H3 listeners performing better than the bottom 5th percentile of NH listeners (shown by the vertical dashed line). This result is generally consistent with other studies that have shown that listeners with hearing loss on average perform worse than listeners with NH when the effects of audibility are accounted for, but the difference is relatively small in comparison to the variations in individual performance that occur within each group.

FIG. 1.

(Color online) Distributions of individual scores on the MRT80 test for four different listener groups with different levels of hearing loss. From Brungart et al. (2021a).


Although the 10-percentage-point difference between the average NH listener and the average H3 listener on the MRT80 test is not large in absolute terms, one should not assume that it is too small to have a significant effect on real-world hearing ability. The US military standard for human factors design (US Department of Defense, 2015) sets the cutoff for exceptional intelligibility in military environments at an MRT score of 97% (uncorrected for guessing), normally acceptable intelligibility at 93%, and minimally acceptable intelligibility at 79%. Thus, a 10-percentage-point difference represents more than half of the entire range (18 percentage points) of expected variability in “acceptable” MRT performance in a real-world listening environment. Results from listening tests with real-world stimuli also indicate that listeners with hearing loss who score below 70% on the MRT80 are much more likely to have difficulty in challenging listening environments than those who score better than 80% (Brungart et al., 2021b). For this group, a 5-percentage-point improvement in individual performance on the MRT80 could mean the difference between retention in their current military assignments, reassignment, or separation. Hence, any outcome measure designed to assess the distortion component of SiN perception will need to be sensitive to small changes in performance, on the order of 5 percentage points or less, on objective measures of SiN performance to capture differences that have a major impact on subjective hearing difficulty and functional hearing ability in noisy listening environments. Achieving this degree of sensitivity is made challenging by the numerous sources of variability in SiN test performance that are unrelated to a listener's true underlying speech perception ability.

To achieve the goal of assessing small differences in hearing performance across listeners, or small longitudinal changes in hearing performance within listeners, it is necessary to develop tests that can measure SiN perception with a high degree of precision and accuracy. Unfortunately, speech perception measures are susceptible to several sources of variability that can make these precise measurements difficult. The recent reviews of Mattys et al. (2012) and Rönnberg et al. (2019) provide frameworks for considering the wide variety of factors that influence speech perception. Here, we consider a few sources of variability that are broadly relevant to how SiN perception is measured.

In most cases, the only reliable way to assess auditory performance is to conduct behavioral tests where stimuli are presented to listeners who are then required to make some decision about the characteristics of the stimulus and report it back to the experimenter or clinician. The accuracy and precision of the estimate of auditory performance can be improved by increasing the number of trials, but some degree of statistical variability is unavoidable when estimating performance from a finite number of observations. Perfect precision would require averaging performance across an infinite number of trials (Gelfand, 1998; Schlauch and Carney, 2018).

The statistical variability that exists when estimating a percentage-correct performance score by observing the outcome of a sequence of trials is a fundamental limitation that cannot be eliminated. If the number of observations is large, the estimate of the probability p derived from n independent trials is approximately Gaussian-distributed with a standard deviation equal to sqrt(p(1 − p)/n) (Sokal and Rohlf, 1981, p. 77). As an example, consider the MRT80. Listeners with normal hearing have an average score of 80% correct, which corresponds to a standard deviation of sqrt(0.8(1 − 0.8)/80), or 4.5%. A listener with a true underlying performance level of 80% has a 5% chance of scoring below 73% (and a 5% chance of scoring above 87%). Doubling the number of trials from 80 to 160 would reduce the standard deviation by a factor of √2, to 3.2%, reducing the chance of scoring below 73% (or above 87%) to roughly 1%. Given that the average MRT score is only 10 percentage points higher for NH listeners (80%) than for listeners with moderate to severe hearing loss (70%), these random statistical variations are not negligible.
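These binomial calculations are easy to reproduce. The short Python sketch below (the function names are our own, purely illustrative) computes the standard deviations and tail probabilities quoted above:

```python
import math

def score_sd(p, n):
    """Standard deviation of a proportion-correct estimate from n independent trials."""
    return math.sqrt(p * (1 - p) / n)

def p_below(cutoff, p, n):
    """Gaussian approximation to the chance that an observed score falls below cutoff."""
    z = (cutoff - p) / score_sd(p, n)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

sd_80 = score_sd(0.80, 80)          # ~0.045: 4.5 percentage points for the 80-trial MRT80
sd_160 = score_sd(0.80, 160)        # ~0.032: doubling the trials shrinks the sd by sqrt(2)
low_80 = p_below(0.73, 0.80, 80)    # ~0.06: chance a "true 80%" listener scores below 73%
low_160 = p_below(0.73, 0.80, 160)  # ~0.01 with 160 trials
```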

Another example is the QuickSIN test (Killion et al., 2004). In this test, the outcome measure, SNR loss, is derived from the number of correct key words (out of 30) using the following formula:
25.5 − Total Correct Key Words = dB SNR Loss.
The cutoff SNR loss for a moderate hearing loss, 7 dB, corresponds to a 62% correct response rate, which is associated with a standard deviation of 8.9%. This corresponds to a theoretical standard deviation in SNR loss of 2.7 dB for a single QuickSIN list, a difference which could easily alter the interpretation of the test results. For this reason, it is typically recommended to average the scores for at least two QuickSIN lists, especially when comparing two conditions (e.g., unaided and aided, or two different hearing aids), which would reduce the estimated standard deviation to 1.9 dB.
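Under the same binomial assumptions, the QuickSIN numbers above work out as follows (a sketch; the helper names are our own):

```python
import math

def quicksin_snr_loss(correct_key_words):
    """QuickSIN scoring rule: 25.5 minus the total correct key words (of 30)."""
    return 25.5 - correct_key_words

def snr_loss_sd(p, n_words=30, n_lists=1):
    """Binomial sd of the correct-word count, averaged over n_lists lists.
    One key word corresponds to 1 dB, so the sd in words equals the sd in dB."""
    sd_words = math.sqrt(n_words * p * (1 - p))
    return sd_words / math.sqrt(n_lists)

p_cutoff = (25.5 - 7) / 30                 # ~62% correct at the 7-dB moderate-loss cutoff
sd_one = snr_loss_sd(p_cutoff)             # ~2.7 dB for a single list
sd_two = snr_loss_sd(p_cutoff, n_lists=2)  # ~1.9 dB when two lists are averaged
```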

Statistical limitations exist, in some form, for all objective measures of speech perception. The impact of these statistical limitations may be more difficult to estimate in tests that use adaptive tracking than in tests that present stimuli at fixed SNR values. In all cases, the only way to systematically reduce this variability is to increase the number of trials while keeping the underlying difficulty of the test constant. Clinicians and experimenters may elect to shorten stimulus lists for the sake of efficiency, but this can contribute to a higher degree of measurement error (Carney and Schlauch, 2007; Thornton and Raffin, 1978). One alternative that can improve precision without adding testing time is to score by phoneme rather than by word, when possible (Gelfand, 2003; Schlauch et al., 2014).

Many speech perception tests are designed to include stimuli that account for the diverse auditory cues that exist in speech, either by explicitly testing all possible phonemes (Fletcher and Galt, 1950) or by attempting to balance the distribution of phonemes to match the statistical distribution of sounds that occur in natural speech. While this approach ensures a higher level of test validity, it can also be counterproductive to the goal of assessing individual differences or changes in auditory function, because the need to test a wider range of speech sounds can lead to longer test times. There are fundamental differences in acoustic cue robustness associated with different phonemes. Some speech tokens will be so easy that all listeners are able to identify them correctly (i.e., ceiling effect) or so difficult that no listeners are able to identify them correctly (i.e., floor effect). This frequently happens in tests that present phonetically-balanced materials at a fixed SNR, because there are large variations in how robustly different phonemes can be detected in noise. Stimuli that are perceived correctly or incorrectly by all listeners increase the testing time but provide no information about the relative performance of different listeners. Thus, it may be advisable to eliminate these stimuli from a SiN test, even if elimination results in a speech corpus that is not phonetically balanced. For example, Brungart et al. (2017) found that it was possible to eliminate 100 speech tokens with near-ceiling and near-floor performance from the Speech Recognition in Noise Test (SPRINT) and still obtain scores that were very highly correlated with those obtained with the original 200-word test (r = 0.96).
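The kind of item screening described above can be sketched in a few lines of Python. The accuracy values and cutoffs here are hypothetical, chosen only to illustrate the idea of dropping near-floor and near-ceiling tokens:

```python
def informative_items(item_accuracy, floor=0.05, ceiling=0.95):
    """Keep only items whose group-level accuracy lies between the floor and
    ceiling cutoffs; items nearly everyone gets right (or wrong) consume
    testing time without separating listeners."""
    return {item: acc for item, acc in item_accuracy.items()
            if floor < acc < ceiling}

# Hypothetical per-token accuracies from a normative sample
accuracy = {"bat": 0.99, "than": 0.04, "pool": 0.62, "shop": 0.78}
kept = informative_items(accuracy)  # retains only "pool" and "shop"
```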

The selection of speech materials would be much less challenging if it were always possible to collect repeated measures of the same speech tokens on the same listeners. However, this may not be possible in cases where listeners are able to improve their scores over time by learning the characteristics of the speech stimuli. Listeners hear speech in noisy environments all the time, so one could argue that they already have so much practice with SiN perception that they are unable to improve over the course of a single study. However, speech perception tests tend to use corpora that are far more constrained than random everyday speech, and listeners are very good at learning these constraints and using them to improve their overall performance. Speech is an inherently variable stimulus in the sense that there is no unique one-to-one relationship between the waveform of a speech signal and the corresponding word (Allen et al., 2003; Newman et al., 2001). Listeners are always making a probabilistic assessment of what the word is, given their prior expectations of what the word might be (i.e., contextual cues) and their ability to adjust their expectations of the acoustic features of the waveform based on knowledge of the talker's voice and other factors that might cause variations in the stimulus (Kleinschmidt and Jaeger, 2015). For example, robust short-term learning effects are seen for speech materials that are altered or degraded by time compression (Manheim et al., 2018; Peelle and Wingfield, 2005), noise-vocoding (Davis et al., 2005; Peelle and Wingfield, 2005), temporal distortions (Phatak and Grant, 2019), dysarthria (Borrie et al., 2017), or non-native accent (Bradlow and Bent, 2008; Clarke and Garrett, 2004).

One important area where learning effects interact with issues related to stimulus variability and statistical efficiency is in the decision to use open-set speech tests, where the listener hears a word and verbally repeats it back to the test administrator, or closed-set speech tests, where the listener is given a list of alternative responses to choose from after each stimulus presentation. Open-set tests have historically been more widely used in the clinic, as they are easy to administer and score and have a high degree of face validity for characterizing the overall hearing performance of a listener with hearing loss. Closed-set tests have been more popular in research and the assessment of communication systems because they can be scored automatically and are suitable for testing large numbers of conditions across individual listeners. Open-set tests tend to have a steeper psychometric function than closed-set tests, increasing the efficiency for assessing performance differences with a small number of trials compared to closed-set tests (Clopper et al., 2003; Clopper et al., 2006; Lash et al., 2013; Mullennix et al., 1989; Sommers et al., 1997; Yu and Schlauch, 2019). However, learning effects can be problematic in open-set test paradigms that require individual test stimuli to be presented more than once, because listeners are likely to be much better at open-set word recognition when they are familiar with the possible response options. This can make it very challenging to obtain enough unique test materials in cases where more than one condition and/or more than one time point needs to be tested per listener, or in cases where the differences in performance across experimental conditions are simply too small to be assessed without collecting a large number of trials.
Closed-set, multiple-choice tests, like the MRT (Brungart et al., 2021a; House et al., 1965), Triple Digit Task (Jansen et al., 2010; Watson et al., 2012), or Oldenburg Sentence Test (Kollmeier et al., 2015; Kollmeier and Wesselkamp, 1997), are more practical when a substantial amount of data collection is required. In these cases, test familiarization happens more quickly, such that learning effects are more likely to be mitigated with a short training period, and these tests can be scored automatically (Nye and Gaitenby, 1973; Schlueter et al., 2016; Smits et al., 2013; Williams and Hecker, 1967).

A final factor that is less frequently considered in the assessment of auditory perception is the influence of variations in the physical and psychological state of the listener on the test results at the time of assessment. Auditory testing that occurs during a single 1- or 2-h test period may be influenced by non-auditory factors that impact a listener's ability to perform well on a task (Nikhil et al., 2018; Veneman et al., 2013). A critical factor is the amount of sustained attention and effort the listener chooses to apply when completing each trial of the task. Effort is influenced by motivation, fatigue, and level of distraction during the test session. Other factors may play a role in the listener's capacity to perform well, such as sleep status, nutrition level, emotional state, and medical status (e.g., illness on day of testing, medications that affect alertness) (Makeig and Inlow, 1993; Makeig and Jung, 1996). These factors can result in significant differences between the listener's scores when they are tested at a “good time” on a “good day” and those when they are tested at a “bad time” on a “bad day,” a problem that can be especially detrimental to SiN scores for older adults (Veneman et al., 2013).

In sum, there are many potential sources that can contribute to variability in a listener's score on a SiN test that are unrelated to the listener's true underlying ability to understand SiN. Variability can stem from the methodological or statistical properties of the tests, the lexico-acoustic properties of the speech stimuli, or the environmental features of the testing environment and state of the listener, among others. When the goal is to track changes in SiN performance over time, clinicians and researchers must be able to ensure with some degree of confidence that any differences observed are due to true changes in the listener's abilities (before, during, and after rehabilitation strategies are implemented), rather than extraneous sources of variation. Given that even small changes in percentage-correct performance can represent significant real-world effects, a great degree of sensitivity is required for longitudinal assessment.

Given the variability in speech-perception testing described above, a reasonable alternative approach to assessing functional performance is the use of non-speech psychoacoustic tests. By measuring performance in auditory dimensions that are known to be critical to speech understanding, some of the pitfalls associated with speech-perception testing can be avoided while still assessing acuity in the relevant dimensions.

Some insights into the use of non-speech stimuli to evaluate speech-relevant dimensions come from the cochlear-implant (CI) literature. CI users require an extensive period to adapt to new speech cues, ranging from months (Lenarz et al., 2012; Massa and Ruckenstein, 2014) to years (Oh et al., 2003; Ruffin et al., 2007) after device activation (Cusumano et al., 2017) or changes to signal-processing strategy (Kleine Punte et al., 2014). The adaptation time limits the effectiveness of speech-perception testing to evaluate outcomes acutely. For this reason, an approach often employed in CI research is to employ psychoacoustic measures in lieu of speech-based testing. For hearing aids, evidence for an acclimatization period is less clear (Dawes et al., 2014; Humes et al., 2002; Humes and Wilson, 2003; Saunders and Cienkowski, 1997), although an adjustment period of 1–6 months has been reported for patients with severe losses (Dawes and Munro, 2017; Gatehouse, 1992; Kuk et al., 2003) or with advanced signal-processing algorithms like nonlinear frequency compression (Wolfe et al., 2011). It is unknown whether auditory changes associated with pharmaceutical interventions would require an extensive adaptation period (like CIs) or a briefer one (like hearing aids). In any case, the psychoacoustic-test strategy employed in CI research is potentially instructive for assessing the benefits of other forms of intervention, given the potential advantages of such an approach for avoiding the variability associated with speech tests.

1. Spectrotemporal modulation sensitivity

A common technique for acutely assessing CI signal-processing strategies is to use a spectral-ripple test to measure spectral contrast sensitivity. While CIs have many limitations relative to acoustic hearing, broad current spread and the associated lack of spectral resolution have been implicated as particularly problematic. This is evidenced by a number of studies that have identified correlations between speech-reception abilities and spectral-ripple detection or discrimination tasks that rely on spectral resolution (Gifford et al., 2018; Holden et al., 2016; Jeon et al., 2015; Lawler et al., 2017; Litvak et al., 2007; Saoji et al., 2009; Won et al., 2015). Spectral resolution has been a frequent target for signal-processing intervention, such as focused stimulation techniques (Arenberg et al., 2018; Bierer and Litvak, 2016) or the de-activation of certain electrodes to reduce spectral overlap (Goehring et al., 2019; Zhou, 2017). Because of the acclimatization problem, such investigations often assess spectral-ripple sensitivity rather than speech reception to acutely assess the potential for benefit (Aronoff et al., 2016; Goehring et al., 2019; Smith et al., 2013; Zhou, 2017). CI tests of spectral-ripple discrimination, using stimuli with static spectra (Jeon et al., 2015) or stimuli with combined spectrotemporal modulation (STM), in which spectra shift over time to provide a more salient cue (Archer-Boyd et al., 2018; Archer-Boyd et al., 2020; Aronoff and Landsberger, 2013), have been developed explicitly for this purpose.

A test of STM sensitivity is also a prime candidate for assessing auditory acuity in acoustic hearing along dimensions thought to be important for speech understanding. STM stimuli contain regularly spaced spectral peaks that move over time. This type of stimulus contains many of the important features of a speech signal, and any complex speech spectrogram can be broken down into a sum of STM stimuli, each with one combination of spectral modulation density and temporal modulation rate (Chi et al., 1999; Chi et al., 2005). Speech consists of spectral modulations in the range of approximately 0.5–4 cycles/octave and amplitude modulations of 2–8 Hz. The ability to detect STM, or to discriminate between stimuli with different patterns of STM, has been shown to be strongly correlated with speech understanding scores in noise for listeners with moderate-to-severe hearing loss, even after factoring out differences in audiometric threshold (Bernstein et al., 2013; Bernstein et al., 2016; Mehraei et al., 2014; Miller et al., 2018; Zaar et al., 2020).
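As a rough illustration of the stimulus itself, the Python sketch below synthesizes a ripple noise from random-phase, log-spaced tones whose spectral peaks drift over time. All parameter values and the function name are illustrative assumptions, not the specification of any published test:

```python
import numpy as np

def stm_noise(fs=16000, dur=0.5, f_lo=354.0, f_hi=2000.0, n_comp=200,
              rate_hz=4.0, density_cyc_oct=2.0, depth=1.0, seed=1):
    """Sketch of a spectrotemporally modulated (ripple) noise: many random-phase
    tones, log-spaced in frequency, with sinusoidal spectral peaks (density in
    cycles/octave) that drift over time (rate in Hz)."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(fs * dur)) / fs
    octs = np.linspace(0, np.log2(f_hi / f_lo), n_comp)  # component position in octaves
    freqs = f_lo * 2.0 ** octs
    phases = rng.uniform(0, 2 * np.pi, n_comp)
    sig = np.zeros_like(t)
    for f, x, ph in zip(freqs, octs, phases):
        # 1 + depth*sin(...) puts moving peaks on the log-frequency spectrum
        env = 1.0 + depth * np.sin(2 * np.pi * (rate_hz * t + density_cyc_oct * x))
        sig += env * np.sin(2 * np.pi * f * t + ph)
    return sig / np.max(np.abs(sig))                     # normalize to +/-1
```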

Because the STM stimulus is complex—containing both temporal and spectral modulations and covering a broad range of spectral frequencies—the precise underlying mechanisms revealed by this stimulus have not been settled. Acute frequency selectivity is likely required to detect the presence or movement of spectral peaks, and therefore performance on an STM test might reveal individual variability in spectral resolution (Mehraei et al., 2014), especially for high-frequency spectral regions where hearing loss is most likely to cause reduced frequency selectivity. Yet most of the power of an STM sensitivity task in accounting for variability in speech understanding is seen for low-pass-filtered stimuli that only include spectral content below 2 kHz, where frequency selectivity is known to be less affected by hearing loss (Summers et al., 2013). Mehraei et al. (2014) argued that STM sensitivity may reflect the ability to use temporal fine-structure information to extract information regarding the movement of spectral peaks (e.g., formant transitions in speech) at these low absolute frequencies, where phase locking is the strongest (Johnson, 1980). Regardless of the specific constellation of psychophysical dimensions being assessed, the correlations with speech understanding in noise, combined with the theoretical rationale whereby STM forms the building blocks of a complex speech signal, strongly motivate the use of an STM sensitivity metric to complement or stand in for speech tests in evaluating suprathreshold auditory function for listeners with hearing loss. The question then becomes how best to carry out such a test.

Many permutations of STM sensitivity measurements have been proposed. Most such tests involve the presentation of two or more intervals, with the reference interval or intervals containing an unmodulated carrier and the target interval containing the modulated stimulus; the listener is required to identify which of the intervals contained the modulated stimulus (detection task; Bernstein et al., 2013, 2016). In other cases, listeners are required to identify the difference between an upward-moving and a downward-moving STM stimulus by identifying the stimulus that was different in a three-interval forced-choice task (discrimination task; Archer-Boyd et al., 2018), although this design has mainly been employed for listeners with CIs. Some studies have employed static spectral-ripple stimuli with no temporal modulation dimension, and still found a correlation with speech understanding in noise (Davies-Venn et al., 2015; Miller et al., 2018), although this type of stimulus might be less desirable because in the absence of temporal modulation, the detection cue might be more difficult for listeners to learn. Even within these paradigms, the methodology chosen can have a major impact on the outcome of the test and its ability to account for variability in speech-understanding scores.

Early work by Bernstein et al. (2013) and Mehraei et al. (2014) found that STM stimuli with low-frequency carriers (<2000 Hz), low temporal modulation rates (4 Hz), and relatively high spectral densities (2 cyc/oct) resulted in the largest performance differences between listeners with normal hearing and listeners with hearing loss. However, these studies evaluated a large range of STM conditions and required hours of listening time to complete. In a group of naive listeners with normal hearing, Grant et al. (2013) found that STM sensitivity for a single condition could reach asymptotic performance after a single training run. Based on these findings, Bernstein et al. (2016) designed a fast version of the STM sensitivity task that tested only one condition (4 Hz, 2 cyc/oct with a 2000-Hz low-pass-filtered carrier) with one training run and three test runs, for a total test time of 15 min.
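Adaptive staircase procedures of this kind can be sketched generically. The following is a minimal 2-down/1-up track (illustrative step size, stopping rule, and parameter names only; the actual tracking rule and parameters of the fast STM test may differ): modulation depth is reduced after two consecutive correct responses, raised after each error, and the threshold is taken as the mean of the final reversal depths.

```python
def staircase(trial_correct, start_depth_db=0.0, step_db=2.0,
              easiest_db=0.0, hardest_db=-30.0, n_reversals=8):
    """Run a 2-down/1-up track. trial_correct(depth_db) -> bool runs one
    trial; 0 dB is full modulation depth, more negative is shallower."""
    depth, direction, reversals, run = start_depth_db, 0, [], 0
    for _ in range(1000):                       # safety cap on trial count
        if len(reversals) >= n_reversals:
            break
        if trial_correct(depth):
            run += 1
            if run == 2:                        # two correct: make it harder
                run = 0
                if direction == +1:             # easier -> harder = reversal
                    reversals.append(depth)
                direction = -1
                depth = max(hardest_db, depth - step_db)
        else:                                   # one error: make it easier
            run = 0
            if direction == -1:                 # harder -> easier = reversal
                reversals.append(depth)
            direction = +1
            depth = min(easiest_db, depth + step_db)
    tail = reversals[-6:]                       # mean of the last reversals
    return sum(tail) / len(tail)
```

For example, a deterministic listener who responds correctly whenever the depth is at or above −12 dB produces a track that oscillates between −12 and −14 dB, yielding a threshold of −13 dB. The sketch also illustrates the convergence problem discussed below: a listener who cannot detect even the full-depth (0 dB) modulation at the tracked performance level never generates the reversals needed for a threshold estimate.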

Though rapid, the STM sensitivity task used by Bernstein et al. (2016) did not properly converge to a threshold estimate for nearly a third of the listeners, namely those whose detection accuracy at full modulation depth was poorer than the performance level targeted by the adaptive staircase rule. Subsequent paradigms proposed by Zaar et al. (2020) and Souza et al. (2020) employed strategies to reduce the difficulty of the task and increase the likelihood of converging on a threshold estimate, even for people with poorer STM sensitivity. While these studies made several changes to the test that were not systematically evaluated, the use of three or four intervals instead of a two-interval paradigm might have been the most important strategy in allowing the test to converge. With more than two intervals, listeners need only identify which interval was different, and are not required to explicitly learn what the modulation sounds like.

2. Binaural tone-in-noise detection

While STM sensitivity has proven to be a useful psychoacoustic tool for predicting speech-in-noise (SiN) performance for individuals with greater than mild hearing loss, it is far less useful for differentiating among individuals with normal hearing or mild hearing loss. Bernstein et al. (2016) found that STM sensitivity had no predictive value for individuals with a high-frequency average audiometric threshold (2, 3, 4, and 6 kHz) better than 43 dB HL. This limits its potential application in evaluating pharmaceutical interventions or noise-protection systems in cases where the major focus is to prevent progressive hearing loss secondary to noise exposure.

One measure that has been proposed as a potentially useful tool for differentiating between individuals with normal or near-normal hearing thresholds is a binaural tone-in-noise detection task. Bernstein and Trahiotis (2016) found that listeners with normal or near-normal hearing thresholds and a 4-kHz threshold better than 7.5 dB HL performed better at binaural tone-in-noise detection than those with a 4-kHz threshold worse than 7.5 dB HL, suggesting that such a measurement might be useful in detecting early signs of cochlear damage.

To examine the utility of a binaural detection task as a psychoacoustic measure that could be substituted for a SiN perception test for individuals with mild threshold elevations, we re-analyzed data acquired from a large population of service members (SMs). The dataset contained approximately 3400 SMs with either normal audiometric thresholds (NHT) or mildly elevated audiometric thresholds (EHT) who were examined to estimate the prevalence of functional hearing difficulties (Grant et al., 2021). These two groups were further subdivided based on history of blast exposure: “none,” “far” (exposed to at least one blast at a distance), or “close” (exposed to at least one blast close enough to feel the heat or pressure wave). The participants were tested in both SiN and binaural detection tasks, among other measures.

Figure 2 shows the correlation coefficients for each of these six subgroups for the relationship between SiN performance (the average of two adaptive speech-reception threshold tests using the American English Matrix Test; Kollmeier et al., 2015) and the low-frequency average threshold (LFA; 500 and 1000 Hz), the high-frequency average threshold (HFA; 2000, 3000, 4000, and 6000 Hz), or the threshold for the detection of an interaurally correlated (S0) or inversely correlated (Sπ) tone in correlated noise (N0). The audiometric thresholds (light gray and dark gray bars) accounted for very little of the variance in SiN performance, and these correlations were almost all non-significant (Bonferroni corrections for six comparisons; p > 0.0083 in all cases except for the EHT, no-blast group). In contrast, tone-in-noise detection performance (white and black bars) was significantly correlated with SiN scores, especially for the three EHT groups. Even though these groups were categorized as having EHT because at least one audiometric threshold was greater than 20 dB HL, it is important to note that these individuals generally had only mildly elevated hearing thresholds, with a mean (±SD) HFA threshold of 12.6 ± 5.4 dB HL. While tone-detection thresholds for both N0S0 (white bars) and N0Sπ stimuli (black bars) were related to speech scores, the correlations were stronger for N0Sπ. These data suggest that binaural signal detection in noise is a good candidate for a psychoacoustic task that could stand in for a SiN test for listeners with mild or early signs of hearing loss.

FIG. 2.

Correlations between audiometric thresholds (LFA, HFA) or tone-in-noise detection thresholds (N0S0, N0Sπ) and speech-reception thresholds for matrix sentences in noise for six listener groups defined by their hearing thresholds (NHT, EHT) and history of blast exposure (none, or close or far from a blast). Asterisks indicate correlations that were significant after Bonferroni corrections for six comparisons (p < 0.0083). LFA indicates low-frequency average; HFA, high-frequency average; NHT, normal hearing thresholds; EHT, elevated hearing thresholds; N0S0, tone and noise both interaurally correlated; N0Sπ, noise interaurally correlated and tone interaurally inversely correlated (π phase).

For speech and non-speech tests of auditory function, it is critical to design assessments that minimize sources of variability without increasing the total testing time beyond what is feasible for the subject population targeted in the study. One way to reduce variability is to use a temporal sampling strategy, where each participant is tested over many short test sessions conducted on different days, rather than in a single long test session. Repeating the test multiple times will reduce statistical variability and, for speech tests, any variability associated with the presentation of different speech tokens in each condition of the study. Conducting these tests on separate days will allow the analysis to reveal learning effects that occur over the sampling time frame, while also limiting measurement bias that might result from sampling performance when the listener is in a particularly good or bad “state.”
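The statistical payoff of repetition can be illustrated with a toy simulation (illustrative numbers only, not study data): averaging a noisy per-session score over n sessions shrinks its standard error by roughly 1/sqrt(n).

```python
# Toy demonstration (assumed noise values, not study data): the spread of
# a participant's estimated score shrinks as more sessions are averaged.
import random
import statistics

random.seed(1)

def session_score(true_score=0.0, session_noise=1.0):
    # One session's measurement: true ability plus measurement noise.
    return random.gauss(true_score, session_noise)

def estimate(n_sessions):
    # Participant score estimated as the mean of n sessions.
    return statistics.mean(session_score() for _ in range(n_sessions))

# Spread of the estimate across 2000 simulated participants.
sd_1 = statistics.stdev(estimate(1) for _ in range(2000))
sd_9 = statistics.stdev(estimate(9) for _ in range(2000))
# sd_9 is approximately sd_1 / 3, per the 1/sqrt(n) rule.
```

Of course, this only captures the statistical benefit of repetition; the additional benefits of spreading those repetitions across days are taken up below.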

To be feasible, it is likely that shorter test sessions would need to be conducted at home, rather than in a clinical booth or laboratory. At-home testing has historically been viewed as more challenging due to increased difficulty controlling variables like background noise level and listener distraction. However, the restrictions imposed by the COVID-19 pandemic have accelerated efforts to develop at-home test techniques suitable for conducting carefully controlled experiments (Peng et al., 2022). This section discusses two experiments, one involving binaural tone-in-noise detection and the other involving speech understanding, that were conducted using at-home testing techniques during the COVID-19 pandemic. These experiments were carried out for two very different purposes—the evaluation of different methodological choices for binaural tone-in-noise detection, and the impact of non-native production on speech understanding—and were not specifically designed to assess the impact of a temporal sampling approach on evaluations of auditory performance. However, both longitudinal study designs allowed us to investigate the potential advantages of such an approach for reducing measurement variability. The at-home testing for the two studies was conducted with Galaxy A7 8-in. Android tablets (Samsung, Suwon-si, South Korea) using the TabSINT (Creare Inc., Hanover, NH) application for portable auditory testing (Shapiro et al., 2020). The tablets were attached to S2 stereo-isolating headphones (Vic Firth, Boston, MA) that were tested to ensure a consistent frequency response across the different headphones used for the study. Each participant was provided with a different tablet and set of headphones to use for the study, and they were given instructions on how to use the TabSINT software and conduct the at-home testing prior to the start of the experiment.
Because these studies have not been previously published, the methods for both experiments are briefly described before discussing their implications for temporal sampling.

1. Experiment 1: A comprehensive evaluation of binaural tone detection

The first experiment was designed to perform within-participant comparisons of different techniques for measuring binaural processing (in particular, the use of interaural phase differences for binaural tone detection). The major variables of the study included test procedure (three-interval, three-alternative forced-choice adaptive staircase; single-interval method of limits; or Békésy-style method of adjustment), tone frequency (125, 250, 500, 750, 1000, 1500, 2000, 3000, 4000, or 6000 Hz), tone interaural phase (correlated, N0S0; or inversely correlated, N0Sπ), and level of the noise used to mask the tones. The noise was threshold-equalizing noise (Moore et al., 2000) that was bandpass-filtered to span ±2 auditory-filter equivalent rectangular bandwidths (ERBs; Glasberg and Moore, 1990) around the tone frequency. Noise level was specified in terms of the level in a one-ERB-wide band around 1000 Hz (30, 40, 50, 51, 60, or 70 dB SPL). Each of six participants with normal hearing completed one tone-in-noise detection threshold measurement for each of 299–360 different combinations of test conditions, with data collection occurring over 15 to 18 test sessions conducted on different days.
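The ±2-ERB noise band can be computed from the Glasberg and Moore (1990) ERB-rate (Cam) scale. The sketch below assumes that "±2 auditory-filter ERBs" means ±2 steps on the ERB-number scale around the tone frequency; the exact filter-edge definition used in the experiment may differ.

```python
# Sketch of the noise-band computation, assuming band edges at +/-2 units
# on the Glasberg and Moore (1990) ERB-number (Cam) scale.
import math

def erb_number(f_hz):
    # ERB-number (Cam) of frequency f: 21.4 * log10(4.37 * f_kHz + 1).
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

def erb_number_to_hz(cam):
    # Inverse mapping from ERB-number back to frequency in Hz.
    return (10.0 ** (cam / 21.4) - 1.0) / 4.37 * 1000.0

def noise_band(f_hz, half_width_erbs=2.0):
    # Lower and upper band edges, half_width_erbs on either side of f.
    c = erb_number(f_hz)
    return (erb_number_to_hz(c - half_width_erbs),
            erb_number_to_hz(c + half_width_erbs))

# For a 1000-Hz tone, the band spans roughly 760 to 1300 Hz.
lo, hi = noise_band(1000.0)
```

Note that the band is asymmetric in Hz (wider above the tone than below it), as expected for a scale that approximates cochlear frequency spacing.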

2. Experiment 2: An evaluation of the perception of native and non-native speech

The second experiment examined how well native-English listeners with NHTs could understand standardized speech materials spoken by native and non-native English talkers. The talkers were recorded at a recent North Atlantic Treaty Organization (NATO) exercise held in Stavanger, Norway. The stimuli included sentences from the American English Matrix Test (Kollmeier et al., 2015) recorded by 60 talkers (16 native and 44 non-native English talkers). The sentences were mixed with speech-shaped noise and multitalker-babble maskers, at two SNRs per masker. Eight participants each heard 750 sentences (15 blocks of 50 trials), with the order of presentation within and across blocks randomized with regard to talker type, masker, and SNR. Four listeners finished the 15 blocks within four sessions, two required five, one required six, and one required seven sessions, with sessions typically separated by 1–2 calendar days.

3. Controls for minimizing variability across test sessions

Because Experiments 1 and 2 were performed at home, rather than under controlled laboratory conditions, there is some concern that any day-to-day variation in the results might be related to distractions or other problems arising from variations in the home test environment, rather than to any fundamental changes in the day-to-day performance of the individual listeners. Although the listeners were not observed during the test sessions, it is worth noting that the participants in the studies were experienced test subjects who worked part-time on-site at the Air Force Research Laboratory before on-site testing was curtailed by the COVID-19 global pandemic. They were allowed to continue as part-time telework employees during the time period when the data for these experiments were collected, with the expectation that they would find a quiet location where they would not be distracted in order to perform the testing. In the binaural tone experiment, the microphone on the tablet was also used to monitor the sound level in the room during each trial. These measurements showed that the median sound level during the experiment was 29.4 dBA, and that more than 95% of the trials were collected with ambient sound levels less than 43.2 dBA. There was no significant correlation between ambient noise level and task performance (r = –0.0449, p = 0.099).

4. Analysis

The assessment of day-to-day variability in performance was complicated by the fact that each experiment consisted of a number of sub-conditions of varying difficulty that were randomly assigned across the different listening sessions completed by each listener. In Experiment 1, there were 360 sub-conditions, each with a different threshold measurement method, tone frequency, and interaural phase condition (N0S0 or N0Sπ). In Experiment 2, the 750 sentences were arbitrarily grouped into 15 sub-conditions of 50 trials each, and each listener heard these 15 sub-conditions in a different random order. Thus, performance for each sub-condition was normalized by converting performance scores for each participant and day to a Z-score based on the overall mean and standard deviation for that condition. Subject scores for each listening session were then calculated by averaging across the Z-scores of all the sub-conditions that occurred in that session. This normalization procedure reduced performance differences related to the relative difficulty of the specific trials completed on a given day of the experiment and made it possible to focus only on differences in performance across listening sessions.
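The normalization just described can be sketched in a few lines. The data layout and field names below are illustrative (one record per measurement with participant, session, sub-condition, and raw-score fields), not the study's actual data structures.

```python
# Sketch of per-sub-condition Z-scoring followed by per-session averaging
# (hypothetical record fields; not the study's actual data format).
import statistics
from collections import defaultdict

def session_z_scores(records):
    # 1) Mean and SD of raw scores within each sub-condition.
    by_cond = defaultdict(list)
    for r in records:
        by_cond[r["condition"]].append(r["score"])
    stats = {c: (statistics.mean(v), statistics.stdev(v))
             for c, v in by_cond.items()}
    # 2) Convert each score to a Z-score relative to its sub-condition,
    # then 3) average the Z-scores within each (participant, session).
    by_session = defaultdict(list)
    for r in records:
        m, s = stats[r["condition"]]
        by_session[(r["participant"], r["session"])].append(
            (r["score"] - m) / s)
    return {k: statistics.mean(v) for k, v in by_session.items()}
```

Because each score is referenced to its own sub-condition's mean and SD, a session filled with hard sub-conditions and a session filled with easy ones yield comparable session scores for a listener of stable ability.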

5. Results

As was discussed in Sec. IV A, increasing the number of repetitions is one way to reduce measurement variability, and the ability to collect data across multiple test sessions at home, rather than in a single visit to a clinic or laboratory, is a way to achieve the goal of collecting more data from each listener. However, there are several other ways that temporal sampling across multiple test sessions conducted on different days may reduce measurement variability. In the following, the temporal sampling data collected in Experiments 1 and 2 are discussed with respect to (a) training effects, (b) day-to-day variation in performance, and (c) reduced variability by sampling on different days.

(a) Training effects. One important advantage of temporal sampling is that it is easier to control for the learning effects that may occur when participants are evaluated with the same test at multiple timepoints. Figure 3 shows how the Z-scores changed over time. In the left panel, each point on the x axis represents a different test session conducted on a different day. In the right panel, each point on the x axis represents a different block of 50 trials, with the color coding indicating session (day) in which the block was completed. Larger (positive) Z-scores indicate better performance.

FIG. 3.

(Color online) Estimates of the learning effects in Experiment 1 (left panel) and Experiment 2 (right panel). Data in each test condition were pooled across participants and converted to Z-scores to equate performance between sub-conditions. Positive Z-scores indicate better performance. Data from both experiments demonstrate a trend of improved performance in later test blocks and sessions. For Experiment 2 (right panel), symbols represent different test sessions (days).

For both experiments, there was evidence of a learning effect, although these effects were much more pronounced in Experiment 2 (speech identification) than in Experiment 1 (tone-in-noise detection). This is perhaps unsurprising since Experiment 2 involved identifying non-native speech spoken by many individuals from different countries of origin. In both cases, the availability of data across multiple days is extremely helpful for characterizing the effects of learning on task performance and potentially controlling for these effects in the analysis of the experiment. Possible options for this would include eliminating trials that are collected before an individual listener plateaus in performance in a listening task or using a statistical model to correct for the effects of learning over time in each individual task.
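One simple version of the statistical-model option is to regress scores on session (or block) number and analyze the residuals. The sketch below assumes a linear group-level trend; the learning curves in Fig. 3 may well be nonlinear, in which case a more flexible fit would be needed.

```python
# Sketch (not the authors' analysis): remove a least-squares linear
# learning trend of score vs session number, returning the residuals.
def detrend(sessions, scores):
    n = len(sessions)
    mx = sum(sessions) / n
    my = sum(scores) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(sessions, scores))
             / sum((x - mx) ** 2 for x in sessions))
    intercept = my - slope * mx
    # Residual = observed score minus the fitted learning trend.
    return [y - (intercept + slope * x) for x, y in zip(sessions, scores)]
```

A score series that improves steadily across sessions detrends to residuals near zero, leaving only the session-to-session fluctuations of interest.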

(b) Day-to-day variation. Another potential benefit of a temporal sampling strategy is the ability to control for systematic day-to-day changes in individual participant performance due to differences in motivation, arousal, or other extraneous factors. Figure 4 shows the residual scores obtained after subtracting the estimated group-mean learning curves shown in Fig. 3 from the Z-scores for each test session (Experiment 1) or block (Experiment 2). These results show that there were systematic changes in performance over time, independent of learning effects, that were different for each participant. Some of these changes—namely, the rapid increase in performance over the first several test blocks for some participants in Experiment 2—might be attributable to learning effects that differ across participants and are therefore not captured by the group mean. However, other fluctuations likely reflect other causes such as participant state on a given day. This variability means that, in the absence of temporal sampling, one would record a different score depending on when the test is completed.

FIG. 4.

(Color online) Residual Z-scores after regressing Z-scores on test session (left panel) or block number (right panel) to correct for global learning effects. Data from individual participants are shown in separate subplots. Curves indicate trends in the residuals using locally estimated scatterplot smoothing.

(c) Reduced variability by sampling from different days. To obtain a quantitative estimate of the potential benefits of temporal sampling when there is day-to-day variation in performance, a bootstrapping simulation was performed to compare the variations in individual participant scores that might be expected from a within-day approach, where all data are collected on the same day, to what would be expected from an across-days approach, where data collection for each participant is divided into smaller sessions conducted on different days. In each repetition, the individual mean Z-score for each participant in Experiment 1 (binaural tone detection) was estimated by averaging across 50 blocks that were randomly selected (with replacement) from the blocks completed by that participant. This process was repeated 500 times, resulting in 500 estimates of individual performance for each participant.

First, the Across-Day condition was designed to simulate the process of temporal sampling by dividing data collection across different days. The 50 blocks were selected from all the blocks completed by that participant, without regard to test day. This likely resulted in the most accurate and reliable estimates of individual participant scores, because each individual estimate incorporated data from all of the individual data points collected for each participant. The left panel of Fig. 5 shows the results of this simulation for the six individual participants in Experiment 1. The average within-subject standard deviation (σ) across the 500 repetitions was 0.11.
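The core resampling step of this bootstrap can be sketched as follows, with synthetic Z-scores standing in for the real per-block data.

```python
# Sketch of the Across-Day bootstrap: repeatedly average 50 blocks drawn
# with replacement from a participant's full pool of block scores.
import random
import statistics

def bootstrap_means(blocks, n_blocks=50, n_reps=500, rng=None):
    rng = rng or random.Random(0)
    return [statistics.mean(rng.choices(blocks, k=n_blocks))
            for _ in range(n_reps)]

# Synthetic stand-in for one participant's ~300 per-block Z-scores.
rng = random.Random(42)
blocks = [rng.gauss(0.0, 1.0) for _ in range(300)]
estimates = bootstrap_means(blocks)
# The spread of the 500 estimates is the analog of the sigma reported
# for each panel of Fig. 5 (values here are synthetic, not the study's).
spread = statistics.stdev(estimates)
```

With unit-variance synthetic blocks, the spread lands near 1/sqrt(50) ≈ 0.14, the familiar standard-error scaling for a 50-block average.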

FIG. 5.

(Color online) Monte Carlo simulations of the behavioral performance for individual participants in Experiment 1 (binaural tone-in-noise detection) to model the effects of across-day vs within-day sampling. Each point represents one of 500 simulations repeated for each subject. (Left panel) Each point represents 50 blocks sampled at random (with replacement) without respect to the day of testing, from the ∼300 total test blocks completed. (Center panel) Each point represents 50 blocks, sampled at random (with replacement) without respect to the day of testing, but from a smaller number of test blocks equivalent to the number carried out on a single randomly selected test day, to model the increased variance associated with fewer available trials. (Right panel) Each point represents 50 blocks, sampled at random (with replacement) from the blocks completed on a randomly selected single test day. The colors in the center and right panels each represent a different test day. The average standard deviation of the estimates across the six listeners is shown at the top of each panel. Comparing the right and center panels shows the increase in variance associated with sampling on a single day when there is day-to-day variability.

Next, to compare the Across-Day condition to a case where all trials are collected on the same day, we would ideally like to have a corresponding condition where the same number of trials were collected in a single test session per participant, rather than divided across multiple test sessions on different days. However, we only have access to a relatively small number of trials for each participant in each test session, which will tend to increase σ. The comparison is further complicated by the fact that the number of trials per test session varied across sessions and across participants. To account for the smaller (and variable) number of trials available within a given day, we ran an alternative version of the Across-Day simulation. For each of the 500 repetitions, the full set of ∼300 available blocks was reduced to only the number of blocks tested for a given participant on a randomly chosen day. The data were still randomly sampled from different days but from a smaller number of trials equivalent to those available on a given randomly chosen day. The center panel of Fig. 5 shows this Across-Day Subsampled simulation, with each color representing a different day that was used to set the sample size. The reduced data set used in this simulation increased σ to 0.23, but because the simulated scores were selected randomly from the entire data set there is no systematic pattern of performance with respect to the day selected for the simulation.

Finally, the Within-Day condition involved the random selection of 50 trials from one given day for a participant. For each of the 500 simulations, a particular day was chosen at random, and the 50 blocks were selected (with replacement) from the blocks conducted on that day. The idea was to estimate how accurate the score would be, relative to the actual underlying long-term mean, if a single test day were selected at random. If there were no systematic across-day differences in performance, then one would expect this simulation to result in levels of variability similar to the Across-Day Subsampled condition. To the extent that there are systematic variations in individual participant scores across days, one would expect higher variability in the individual scores in the Within-Day condition.

The right panel of Fig. 5 shows the results for the Within-Day condition. Different days are represented in the simulation with different colors, and it is clear from this color coding that there were systematic differences in performance across the different days of the experiment. This resulted in a standard deviation of the individual participant score estimates (σ = 0.30) that was 30% larger than in the Across-Day Subsampled condition.
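The logic of the Within-Day vs Across-Day Subsampled comparison can be reproduced with fully synthetic data. In the sketch below (assumed day counts, block counts, and noise magnitudes, chosen only for illustration), each day carries its own random offset; sampling all blocks from one randomly chosen day then inflates the spread of score estimates relative to sampling the same number of blocks across days.

```python
# Synthetic demonstration (not the study's data): a per-day "state"
# offset makes single-day score estimates more variable.
import random
import statistics

rng = random.Random(7)
n_days, blocks_per_day = 15, 20
days = {}
for d in range(n_days):
    offset = rng.gauss(0.0, 0.5)     # that day's systematic state
    days[d] = [offset + rng.gauss(0.0, 1.0) for _ in range(blocks_per_day)]
all_blocks = [z for blocks in days.values() for z in blocks]

def estimate(within_day):
    if within_day:
        pool = days[rng.randrange(n_days)]             # one random day
    else:
        pool = rng.choices(all_blocks, k=blocks_per_day)  # same count, mixed days
    return statistics.mean(rng.choices(pool, k=50))

sd_within = statistics.stdev(estimate(True) for _ in range(500))
sd_across = statistics.stdev(estimate(False) for _ in range(500))
# sd_within > sd_across: the day effect is a systematic error that
# collecting more trials on the same day cannot average away.
```

The key point mirrors the simulation above: the Within-Day penalty comes from the day offset itself, so it persists no matter how many blocks are resampled from that one day.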

Figure 6 shows the results of the simulation for Experiment 2. Here, the approach was the same, except that the simulation only included 15 blocks (with replacement) in calculating the average score for each of the 500 repetitions, because participants only completed 15 blocks over the course of the experiment. The left and center panels show the expected increase in variance as the number of blocks available to choose from in the Across-Day condition (σ = 0.04) was decreased in the Across-Day Subsampled condition (σ = 0.09) to match the number of trials available in the Within-Day condition. The increase in variability from the Across-Day Subsampled condition to the Within-Day condition (right panel, σ = 0.10) was less marked than the increase in Experiment 1, although it should be noted that the increase was stable and appeared every time the simulation was re-run. While this 10% increase in variability is modest, it is certainly large enough to impact the sensitivity of a clinical trial evaluating the effect of a particular intervention on individual participant performance. Moreover, it is notable that, like the 30% increase in Experiment 1, it represents a systematic error that cannot be addressed simply by collecting more trials, or even by using a more sensitive test, on a single day of data collection. One possible reason for the more modest increase in Experiment 2 is that any variability associated with testing on a specific day (right panel) was swamped by the variability associated with testing a small number of blocks on each day (center panel).

FIG. 6.

(Color online) Monte Carlo simulations of the behavioral performance for individual participants in Experiment 2 (intelligibility of speech produced by non-native talkers). Same as Fig. 5, except that participants completed fewer blocks in this experiment (a total of 15 blocks as compared to the ∼300 in Experiment 1), and thus each point represents only 15 blocks selected at random (with replacement) from the available set of test blocks. The average standard deviation of the estimates across the eight listeners is shown at the top of each panel. The difference in variance between the center (0.09) and right (0.10) panels can be attributed to systematic variations in across-day performance.

The results of these two studies indicate that, for both speech and non-speech tasks, temporal sampling across several days results in a more reliable indicator of listener performance than could be obtained in a single test session. While the logistics required for this type of sampling are not currently suitable for many clinics, there has been a recent push to develop feasible remote and at-home testing protocols. Ideally, this line of work will result in new strategies for extending the capabilities and sensitivity of auditory function evaluation.

Twenty-five million Americans are estimated to report hearing difficulties in the absence of a documented hearing loss (Edwards, 2020). Even if a patient has speech-perception scores within normal limits, there may nevertheless be subclinical damage within auditory neural pathways that impacts the self-perceived ability to hear (Tremblay et al., 2015). Furthermore, age- or injury-related changes within central cognitive systems may limit the extent to which top-down functions help listeners compensate when listening to a degraded signal (for a review, see Kuchinsky and Vaden, 2020). Thus, even subtle deficits or changes in hearing acuity or function, which may not manifest as clinically significant, can impose greater demands on the limited pool of mental resources that must be divided among all concurrent activities (Kahneman, 1973). The allocation of mental resources to an auditory task—known as listening effort—is affected by the task demands, the listener's mental resource capacity, and their level of fatigue or motivation (Pichora-Fuller et al., 2016). The more effort that is required for understanding speech, the less that may be available for other concurrent functions important for daily life (Lin and Albert, 2014). As an example, McCoy et al. (2005) demonstrated that, while intelligibility for words presented in noise was at ceiling, individuals with hearing impairment had poorer later recall of those words than NH listeners, suggesting that increased listening effort impeded concurrent memory processes.

High levels of listening effort are a common complaint among listeners, especially those who are noise-exposed (Hétu et al., 1988). Though audiologists have numerous diagnostic tools to measure the function of the peripheral and central auditory systems, they typically have few, if any, options for examining deficits in how these systems interact with the central linguistic and cognitive systems that contribute to effortful listening. Thus, elevated effort has been suggested as an important consideration for listeners, including those reporting hearing difficulties that are not explained by their audiometric thresholds (Beck et al., 2018). It has also been noted that measures of effort can explain unique variance in predicting intervention outcomes beyond measures of performance alone (Humes, 1999; Kuchinsky et al., 2014; Wendt et al., 2017).

1. Measures of listening effort

A number of indices of listening effort have been investigated with the goal of quantifying how hard listeners have to work to achieve a given level of performance on a listening task. These have included subjective report, measures of reaction time (including as part of dual-task experiments), physiological measures, and neural measures of listening effort, each of which has relative strengths and weaknesses in terms of the supporting evidence as well as the practicality of implementation in laboratory and clinical settings (Giuliani et al., 2021; McGarrigle et al., 2014; Teubner-Rhodes and Kuchinsky, 2020; Winn et al., 2018).

While survey measures have face validity and are easy to collect, they are subject to biases, may be more indicative of intelligibility than effort (Moore and Picou, 2018), and typically have low sampling resolution (i.e., collected at the end of an experiment or block). Dual-task cost measures (decrements in secondary-task accuracy or reaction time as a function of primary listening-task demands) are also easy to collect, require no specialized equipment beyond a computer, and have long been used in psychology (e.g., Pashler, 1994) and hearing science (e.g., Gosselin and Gagné, 2010) to examine how limited attentional resources are allocated. However, this approach also has relatively low temporal sampling resolution, and the strict set of assumptions required of the task design may be difficult to meet: the primary and secondary tasks must compete for an overlapping set of mental resources, and participants must successfully allocate their attention to the tasks as instructed (Fisk et al., 1986). Furthermore, the particular demands imposed by the two tasks and their relative timing can affect whether individuals are multitasking or task-switching (Koch et al., 2018), with such design choices impacting the ability to observe changes in listening effort (e.g., Picou and Ricketts, 2014).

Physiological measures such as changes in pupil dilation, cardiac measures (e.g., heart rate, heart rate variability), electrodermal skin conductance, and others may be less biased and provide greater temporal resolution, but they carry greater equipment and analysis requirements and more limitations in terms of exclusion criteria (e.g., individuals taking psychoactive or blood pressure medications). Neural measures, such as spectral (e.g., alpha) and event-related potential electroencephalography (EEG), functional magnetic resonance imaging (fMRI), and functional near-infrared spectroscopy (fNIRS), provide reasonable to excellent temporal and/or spatial resolution depending on the method, but are the most costly in terms of equipment and expertise requirements and may entail even more exclusion criteria (Teubner-Rhodes and Kuchinsky, 2020). To the extent that these various metrics have been collected simultaneously, studies have often observed that they do not co-vary with one another (e.g., Alhanbali et al., 2019; Kramer et al., 2016; Miles et al., 2017; Strand et al., 2018). This lack of correspondence suggests that each measure may tap into different components of what is broadly termed “listening effort” (McGarrigle et al., 2014; Alhanbali et al., 2019).

Ultimately, any researcher or clinician who wants to measure listening effort needs to weigh the expected sensitivity of a given measure to their specific question (Strand et al., 2018; for a review see Zekveld et al., 2018) against the practical or methodological challenges associated with that measure (McGarrigle et al., 2014; Teubner-Rhodes and Kuchinsky, 2020; Winn et al., 2018). The specific question here concerns measures that are sensitive to subtle differences in listening difficulty over time. For the purposes of assessing small changes in hearing performance longitudinally, pupillometry, a measurement of changes in pupil dilation over time, may be a particularly useful metric and is therefore the focus of the studies reviewed here. Other survey and behavioral measures of effort may also hold promise, particularly due to their relative ease of collection and scoring and their potential to quantify different aspects of listening effort (Alhanbali et al., 2019; McGarrigle et al., 2014); however, more work is needed to assess their sensitivity to the factors described below.

Pupillometry has been commonly used to index listening effort in laboratory research (Zekveld et al., 2018). Compared to a pre-stimulus baseline, pupil size reliably peaks following the onset of an attended stimulus such as a word, sentence, or passage (see Winn et al., 2018, for a review of the method). Reviews of the literature (Zekveld et al., 2018) have shown that the pupillary response is typically larger, slower, and/or more sustained with greater acoustic degradation (Winn et al., 2015) or background noise (Kuchinsky et al., 2013). Furthermore, pupillometry is sensitive to interactions among the acoustic, linguistic, and cognitive systems that are involved in real-world listening situations (Kuchinsky et al., 2013; Wagner et al., 2016; Zekveld et al., 2019). Differences in listening effort among individuals with hearing loss (Ayasse and Wingfield, 2020; Kuchinsky et al., 2014; Ohlenforst et al., 2017b; Zekveld et al., 2011) or tinnitus (Juul Jensen et al., 2018) compared to controls have been observed in pupillometry studies.

The rationale for using pupillometry as an index of effort stems in part from its close association with the engagement of a well-studied neural mechanism, the locus coeruleus-norepinephrine (LC-NE) system (Joshi et al., 2016). LC-NE activity modulates both the listener's general level of arousal and stimulus-driven attentional focus to optimize task performance (Aston-Jones and Cohen, 2005). Unlike behavioral measures and survey reports of effort, pupillometry is not under direct cognitive control and thus avoids issues related to listener bias and other demand characteristics. Pupillometry also provides temporal sensitivity to changes in effort across pre-stimulus, listening, retention, response, correct-answer feedback (when appropriate), and recovery periods (Winn and Moore, 2018; Winn and Teece, 2021). While neural measures can also provide such information (Teubner-Rhodes and Kuchinsky, 2020), they are often more costly to collect in terms of equipment, number of trials, and analytical requirements, as well as entailing more participant exclusion criteria (e.g., electrical interference or magnetic safety issues related to hearing aids, cochlear implants, or other surgical implants).

The literature reviewed in the sections that follow suggests that pupillometry may be well suited for assessing smaller differences in functional hearing in a way that vastly exceeds the information provided by survey and performance measures (accuracy, reaction time) and that may be more practical to collect than neural measures. Particularly when compared to other measures of effort, pupillometry (1) has been observed, in at least some listening conditions, to exhibit reasonable reliability and good sensitivity to functional hearing abilities; (2) provides temporal information that may be more sensitive to subtle differences in how listening effort and fatigue unfold both within and across trials; and (3) has the potential to be readily combined with existing standard measures used in the clinic and the lab, rather than requiring an additional, specially designed task (e.g., dual task) or survey. In particular, it may represent a middle ground between temporally sparse survey or behavioral measures and temporally and/or spatially dense neural measures of effort.

2. Effort reveals hearing difficulties that extend beyond decrements in performance

For measures of listening effort to be of value to audiologists, they must provide information about hearing difficulties beyond the performance-based tests that are already available to them. Measures of effort, including pupillometry, have been shown to diverge from speech intelligibility in ways that are meaningful to both the assessment and remediation of hearing difficulties (for a review see Lunner et al., 2020). As the schematic in Fig. 7 shows, the relationship between listening task demands and performance follows a monotonic psychometric function. However, maximal effort (pupil size) is observed at moderately challenging levels of speech intelligibility, with decreases in effort observed with increasing intelligibility as the task becomes easier, or with decreasing intelligibility as listeners begin to give up (Ayasse and Wingfield, 2018; Ohlenforst et al., 2017b).

FIG. 7.

(Color online) Schematic depicting the differential relationships between listening demands and performance versus effort in a SiN task. While a psychometric function generally relates SNR and accuracy (dashed curve), effort varies nonlinearly (solid curve). Maximal effort tends to occur at moderate levels of difficulty, with reductions as listening becomes easy or impossibly hard. Asterisks on the left side and right side each represent a pair of patients who may have similar levels of SiN performance but substantially different levels of effort to achieve those scores.

As effort and performance pattern differently with task demands, patients may obtain similar SiN test scores but require vastly different levels of effort to achieve those scores. In the case of two patients with similar performance scores (red asterisks in Fig. 7), good task performance could be associated with either relatively high or low levels of effort. Likewise, as indicated by the blue asterisks, poor SiN scores could reflect either complete disengagement from the task or an individual who is trying hard despite poor performance. While these examples are likely explained by ceiling and floor effects in the speech-perception measures, there are many other reasons, as described above, why a SiN test might lack the sensitivity required to detect changes in hearing. In cases where it is impractical to present a sufficient number of trials and a range of test conditions to allow for a sensitive SiN test, a measure that is sensitive to changes in effort could more comprehensively characterize hearing changes.
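The dissociation between performance and effort depicted in Fig. 7 can be sketched with a toy model. The logistic and Gaussian shapes and every parameter value below are illustrative assumptions, not fits to any dataset:

```python
import math

def accuracy(snr_db, midpoint=0.0, slope=0.5):
    # Monotonic psychometric function: proportion correct vs. SNR (dB).
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint)))

def effort(snr_db, peak_snr=0.0, width=6.0):
    # Inverted-U effort function: maximal at moderate demand, declining
    # when listening is easy (right tail) or listeners give up (left tail).
    return math.exp(-((snr_db - peak_snr) ** 2) / (2.0 * width ** 2))

# Two hypothetical patients, both near ceiling on the SiN test...
a_easy, a_very_easy = accuracy(10.0), accuracy(25.0)
# ...but exerting very different levels of effort to get there.
e_easy, e_very_easy = effort(10.0), effort(25.0)
```

In this toy model, both simulated listeners score above 99% correct, yet the modeled effort differs by orders of magnitude; they correspond to the matched-accuracy pair of asterisks in the schematic.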

Survey and behavioral measures of effort have been shown to be sensitive to changes in listening demands, such as with decreasing SNRs or with versus without the use of hearing aids (for reviews: Gosselin and Gagné, 2010; McGarrigle et al., 2014). However, there is evidence that pupillometry yields more sensitive and reliable results across different listening test sessions than other commonly used measures of effort, namely dual-task responses and skin conductance measures (Giuliani et al., 2021). Additionally, the strongest evidence for an impact of hearing loss on effort has come from physiological measures rather than behavioral measures such as dual-task responses (Ohlenforst et al., 2017a). Pupillometry has been further shown to identify differences in listening effort at varying levels of acoustic challenge even when individuals have normal hearing (Zekveld et al., 2010), when there are no reliable differences in performance (DeRoy Milvae et al., 2021), when all words are correctly identified (Kuchinsky et al., 2013), when performance is matched across individuals (Zekveld et al., 2010), or when reaction time variability is statistically controlled for at the trial level (Kuchinsky et al., 2013). Pupillometry studies have also revealed changes in listening effort associated with speech-perception training or with the use of hearing aid noise-reduction algorithms, even when controlling for improvements in performance (Kuchinsky et al., 2014; Wendt et al., 2017). Together, these results suggest that the temporal precision offered by pupillometry may provide marked advantages in terms of sensitivity to small effects compared to effort measures that yield a single value per trial or block (Hershman et al., 2022).

3. Quantifying moment-by-moment variation in effort and fatigue

Even when a listener has the capacity to meet the demands of a listening task (e.g., hearing thresholds or cognitive screeners that indicate performance within normal limits), fatigue, low arousal or alertness, low motivation, frustration, or boredom may limit the extent to which the listener can or will allocate that capacity to communication (Herrmann and Johnsrude, 2020). Because these factors can vary day-to-day or even moment-by-moment, the reliable measurement of auditory performance may depend on accounting for such variability. Ideally, data would be collected when a participant is attending to the task and therefore performing as well as their hearing allows. Temporal sampling can help to mitigate some of this variability by treating arousal and motivation as sources of noise to be averaged out across sessions. A less time-consuming approach could be to account for variation in fatigue and motivation within a single test session using a physiological measure, like pupillometry.
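The benefit of averaging out day-to-day fluctuations can be illustrated with a small simulation. The "true" speech-reception threshold and the session-to-session noise level below are arbitrary illustrative values, not empirical estimates:

```python
import random

random.seed(1)
TRUE_SRT = -6.0      # hypothetical "true" SNR-50 for one listener (dB)
DAY_NOISE_SD = 1.5   # assumed SD of session-level shifts (fatigue, motivation)

def measure_session():
    # One session's measured SRT = true ability + that day's state-related shift.
    return TRUE_SRT + random.gauss(0.0, DAY_NOISE_SD)

def estimate_srt(n_sessions):
    # Estimate ability by averaging across repeated test sessions.
    scores = [measure_session() for _ in range(n_sessions)]
    return sum(scores) / n_sessions

def rms_error(n_sessions, reps=2000):
    # Root-mean-square error of the estimate over many simulated "listeners."
    sq = [(estimate_srt(n_sessions) - TRUE_SRT) ** 2 for _ in range(reps)]
    return (sum(sq) / reps) ** 0.5
```

Because session-level noise averages out, the error of the mean shrinks roughly as one over the square root of the number of sessions: in this sketch, four sessions yield about half the error of a single visit.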

Pupillometry has been shown to be sensitive to changes in arousal and effort at multiple timescales, from single words (Kuchinsky et al., 2013) or pure tones (Bala et al., 2020), to longer spoken passages (McGarrigle et al., 2017) or tone streams (Zhao et al., 2019), and across blocks of experiments lasting minutes (McGarrigle et al., 2021) or hours (Hopstaken et al., 2015). Pupillary responses have also been shown to correlate with individual differences in the ability to maintain vigilant attention (Kuchinsky et al., 2016) and fatigue experienced in daily life (Wang et al., 2018). In particular, analyses that assess changes across the entire pupillary response time course, rather than just a peak or mean value, can pick up otherwise hidden effects (Hershman et al., 2022). Thus, pairing pupillometry, due to its greater temporal precision, with behavioral measures of task performance could help to identify those trials or time periods where a listener is most engaged in the task.
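Why the full time course can reveal effects that a single peak value misses can be shown with two made-up pupil traces. The sample values, baseline length, and millimeter units are illustrative assumptions, not real eye-tracker output:

```python
def baseline_correct(trace, n_baseline):
    # Subtract the mean of the pre-stimulus samples from every sample.
    base = sum(trace[:n_baseline]) / n_baseline
    return [x - base for x in trace]

# Two hypothetical trials: identical peak dilation, different time courses.
baseline = [3.0] * 10                                          # pre-stimulus pupil size (mm)
transient = baseline + [3.1, 3.3, 3.5, 3.3, 3.1] + [3.0] * 5   # effort released quickly
sustained = baseline + [3.1, 3.3, 3.5] + [3.5] * 7             # effort maintained

ct = baseline_correct(transient, 10)
cs = baseline_correct(sustained, 10)

peak_t, peak_s = max(ct), max(cs)                      # peak-only summary: identical
mean_t, mean_s = sum(ct) / len(ct), sum(cs) / len(cs)  # whole-trace summary: differs
```

A peak-only analysis treats these trials as equivalent, whereas any summary that weights the whole time course (here, simply the mean of the corrected trace) separates sustained from transient effort; growth-curve or time-window analyses exploit the same information more formally.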

4. Implications for clinical assessment and intervention

For pupillary measures to provide diagnostic utility, more research is needed to understand the reliability of pupillary measures at the individual level (Winn et al., 2018). Initial studies of pupillometry have shown reasonable reliability and low within-subject variability (Neagu et al., 2019; Wagner et al., 2019), especially when compared to other measures of effort such as subjective ratings, reaction times, and other physiological measures (Giuliani et al., 2021). Reliability will need to be further assessed in populations for whom pupillometry may exhibit reduced sensitivity as a result of increased noise in the data, such as older adults (Winn et al., 2018). Technical requirements are another important consideration. Pupillometry has largely been limited to research applications due to the need for specialized equipment, controlled lighting environments, and analytical expertise. However, pupillometry and other ocular measures may be achievable within a head-mounted display, potentially allowing such measurements outside of the laboratory (Imaoka et al., 2020; Juvrud et al., 2018; Sakamoto et al., 2020; Wilson et al., 2021).

In summary, as it becomes more feasible to record pupillometry in a variety of settings, there is the potential to use this physiological index to quantify listening effort with minimal added burden while listeners complete standard behavioral clinical evaluations (e.g., SiN tests) of a possible hearing intervention. Other behavioral and neural measures of effort may also provide critical information about listening difficulties over time; however, more work is needed to ensure they are sensitive to small effects and reliable across testing sessions (Giuliani et al., 2021). Additionally, more research is needed into whether neuroimaging data provide an increase in sensitivity substantial enough to warrant their additional complexity. Measures of effort may allow for better characterization of potential quality-of-life benefits of an intervention that are insufficiently captured by small to null changes in absolute SiN performance. Effort data could also be used as a validation tool to identify trials that should be excluded from analysis due to listener inattention.

To examine the potential benefits of a hearing-related intervention, it is critical to compare pre-intervention baseline evaluations and short- and long-term post-intervention outcomes. Hearing deficits can stem from reductions in hearing sensitivity as well as distortions to the perceived signal. Depending on the focus of a given intervention, a listener's deficits in audibility and distortion may be differentially affected. Thus, it is important that assessment methods are able to capture both components.

The focus of this paper is on the assessment of changes in SiN performance in cases where there are no significant changes in the audibility of the stimuli. Difficulty hearing in noise is the most common complaint of people with hearing loss (Arlinger, 2003). Improvement in this domain of listening would likely result in improvement in quality of life. Measuring longitudinal changes in SiN performance is challenging because small changes in measured performance can correspond to significant variations in real-world performance. Current SiN measurement tools are limited in their ability to reliably capture true, intervention-related fluctuations in performance over time because they are subject to variability that could mask intervention-related changes. In this paper, we recommend a number of strategies to improve longitudinal measurements of functional performance:

  1. Use of non-speech stimuli to target dimensions of auditory perception that correspond to SiN performance. While pure-tone thresholds alone are not a good proxy for SiN performance, some psychophysical measures, including detection/discrimination of spectral or spectrotemporal ripples and binaural tone detection, have been found to correlate with SiN performance for listeners with hearing loss. Clinically feasible methods have been developed to track spectrotemporal modulation (STM) sensitivity outcomes for cochlear-implant (CI) users, and similar methods under development are suitable for a wider clinical population of listeners with acoustic hearing.

  2. Temporal sampling of SiN performance. While SiN measures are a logical way to evaluate speech recognition abilities in noise, current test protocols are susceptible to a number of sources of variability unrelated to changes in a listener's “true” ability to recognize SiN. The inclusion of multiple test sessions can reduce this extraneous variability to provide a clearer representation of performance. Recent advances in remote testing may allow for increases in the amount of data collected from an individual listener while maintaining the efficiency and expediency of in-office visits.

  3. Incorporation of listening effort to index functional deficits. Subclinical deficits in hearing may nevertheless lead to a greater degree of effort associated with successful SiN listening. Objective measures of effort, such as pupillometry, can provide important information to clinicians and experimenters with regard to a listener's overall communication function. In some cases, these measures can identify differences between individuals when a test of behavioral performance reveals no measurable difference. Measures of effort might also help to refine behavioral results by distinguishing those time periods where an individual is or is not fully engaged in the task. While the logistics of pupillometry measurement are not currently well suited for the clinic, technological advances and the recent push for improvements in remote testing may also lead to greater ease of administration for these objective tests in the near future.

At the time of writing, there is no gold standard for tracking improvements in SiN performance following a targeted hearing intervention. Given improvements in the understanding of auditory function and technological advances in telehealth and remote testing, the adoption or incorporation of one or more of these measures may help to overcome SiN variability and provide a more comprehensive picture of the true effects of a given intervention on performance.

We thank Brian Simpson, Hilary Gallagher, Savannah Seals, and the 711th Human Performance Wing of the Air Force Research Laboratory for facilitating the experiments outlined in Sec. V B. We would also like to thank Coral Dirks for assistance with editing the manuscript. This research was supported in part by an appointment to the Department of Defense (DOD) Research Participation Program administered by the Oak Ridge Institute for Science and Education (ORISE) through an interagency agreement between the U.S. Department of Energy (DOE) and the DOD. ORISE is managed by ORAU under DOE Contract No. DE-SC0014664. All opinions expressed in this paper are the authors' and do not necessarily reflect the policies and views of DOD, DOE, or ORAU/ORISE. The views expressed in this article are those of the authors and do not reflect the official policy of the Department of Army/Navy/Air Force, Department of Defense or U.S. Government. The identification of specific products or scientific instrumentation is considered an integral part of the scientific endeavor and does not constitute endorsement or implied endorsement on the part of the authors, DOD, or any component agency.

1.
Abrams
,
H. B.
, and
Kihm
,
J.
(
2015
). “
An introduction to MarkeTrak IX: A new baseline for the hearing aid market
,”
Hear. Rev.
22
,
16
.
2.
Alhanbali
,
S.
,
Dawes
,
P.
,
Millman
,
R. E.
, and
Munro
,
K. J.
(
2019
). “
Measures of listening effort are multidimensional
,”
Ear Hear.
40
,
1084
1087
.
3.
Allen
,
J. S.
,
Miller
,
J. L.
, and
DeSteno
,
D.
(
2003
). “
Individual talker differences in voice-onset-time
,”
J. Acoust. Soc. Am.
113
,
544
552
.
4.
American National Standards Institute
(
2020
).
ANSI S3.2-2020, Method for Measuring the Intelligibility of Speech Over Communication Systems
(
American National Standards Institute
,
New York
).
5.
Amos
,
N. E.
, and
Humes
,
L. E.
(
2007
). “
Contribution of high frequencies to speech recognition in quiet and noise in listeners with varying degrees of high-frequency sensorineural hearing loss
,”
J. Speech. Lang. Hear. Res.
50
,
819
834
.
6.
Archer-Boyd
,
A. W.
,
Goehring
,
T.
, and
Carlyon
,
R. P.
(
2020
). “
The effect of free-field presentation and processing strategy on a measure of spectro-temporal processing by cochlear-implant listeners
,”
Trends Hear.
24
,
233121652096428
.
7.
Archer-Boyd
,
A. W.
,
Southwell
,
R. V.
,
Deeks
,
J. M.
,
Turner
,
R. E.
, and
Carlyon
,
R. P.
(
2018
). “
Development and validation of a spectro-temporal processing test for cochlear-implant listeners
,”
J. Acoust. Soc. Am.
144
,
2983
2997
.
8.
Arenberg
,
J. G.
,
Parkinson
,
W. S.
,
Litvak
,
L.
,
Chen
,
C.
,
Kreft
,
H. A.
, and
Oxenham
,
A. J.
(
2018
). “
A dynamically focusing cochlear implant strategy can improve vowel identification in noise
,”
Ear Hear.
39
,
1136
1145
.
9.
Arlinger
,
S.
(
2003
). “
Negative consequences of uncorrected hearing loss—A review
,”
Int. J. Audiol.
42
,
2S17
2S20
.
10.
Aronoff
,
J. M.
, and
Landsberger
,
D. M.
(
2013
). “
The development of a modified spectral ripple test
,”
J. Acoust. Soc. Am.
134
,
EL217
EL222
.
11.
Aronoff
,
J. M.
,
Stelmach
,
J.
,
Padilla
,
M.
, and
Landsberger
,
D. M.
(
2016
). “
Interleaved processors improve cochlear implant patients' spectral resolution
,”
Ear Hear.
37
,
e85
e90
.
12.
Aston-Jones
,
G.
, and
Cohen
,
J. D.
(
2005
). “
An integrative theory of locus coeruleus-norepinephrine function: Adaptive gain and optimal performance
,”
Annu. Rev. Neurosci.
28
,
403
450
.
13.
Ayasse
,
N. D.
, and
Wingfield
,
A.
(
2018
). “
A tipping point in listening effort: Effects of linguistic complexity and age-related hearing loss on sentence comprehension
,”
Trends Hear.
22
,
233121651879090
.
14.
Ayasse
,
N. D.
, and
Wingfield
,
A.
(
2020
). “
Anticipatory baseline pupil diameter is sensitive to differences in hearing thresholds
,”
Front. Psychol.
10
,
2947
.
15.
Bala
,
A. D. S.
,
Whitchurch
,
E. A.
, and
Takahashi
,
T. T.
(
2020
). “
Human auditory detection and discrimination measured with the pupil dilation response
,”
JARO
21
,
43
59
.
16.
Barrenäs
,
M.-L.
, and
Wikström
,
I.
(
2000
). “
The influence of hearing and age on speech recognition scores in noise in audiological patients and in the general population
,”
Ear Hear.
21
,
569
577
.
17.
Beck
,
D. L.
,
Danhauer
,
J. L.
,
Abrams
,
H. B.
,
Atcherson
,
S. R.
,
Brown
,
D. K.
,
Chasin
,
M.
,
Clark
,
J. G.
,
De Placido
,
C.
,
Edwards
,
B.
,
Fabry
,
D. A.
,
Flexer
,
C.
,
Fligor
,
B.
,
Frazer
,
G.
,
Galster
,
J. A.
,
Gifford
,
L.
,
Johnson
,
C. E.
,
Madell
,
J.
,
Moore
,
D. R.
,
Roeser
,
R. J.
,
Saunders
,
G. H.
,
Searchfield
,
G. D.
,
Spankovich
,
C.
,
Valente
,
M.
, and
Wolfe
,
J.
(
2018
). “
Audiologic considerations for people with normal hearing sensitivity yet hearing difficulty and/or speech-in-noise problems
,”
Hear. Rev.
25
,
28
38
.
18.
Bernstein
,
J. G.
,
Danielsson
,
H.
,
Hällgren
,
M.
,
Stenfelt
,
S.
,
Rönnberg
,
J.
, and
Lunner
,
T.
(
2016
). “
Spectrotemporal modulation sensitivity as a predictor of speech-reception performance in noise with hearing aids
,”
Trends Hear.
20
,
233121651667038
.
19.
Bernstein
,
J. G.
,
Mehraei
,
G.
,
Shamma
,
S.
,
Gallun
,
F. J.
,
Theodoroff
,
S. M.
, and
Leek
,
M. R.
(
2013
). “
Spectrotemporal modulation sensitivity as a predictor of speech intelligibility for hearing-impaired listeners
,”
J. Am. Acad. Audiol.
24
,
293
306
.
20.
Bernstein
,
L. R.
, and
Trahiotis
,
C.
(
2016
). “
Behavioral manifestations of audiometrically-defined ‘slight’ or ‘hidden’ hearing loss revealed by measures of binaural detection
,”
J. Acoust. Soc. Am.
140
,
3540
3548
.
21.
Bierer
,
J. A.
, and
Litvak
,
L.
(
2016
). “
Reducing channel interaction through cochlear implant programming may improve speech perception: Current focusing and channel deactivation
,”
Trends Hear.
20
,
233121651665338
.
22.
Blackwell
,
D. L.
,
Lucas
,
J. W.
, and
Clarke
,
T. C.
(
2014
). “
Summary health statistics for US adults: National health interview survey
, 2012”
Vital Health Stat.
10
,
1
161
.
23.
Borrie
,
S. A.
,
Lansford
,
K. L.
, and
Barrett
,
T. S.
(
2017
). “
Generalized adaptation to dysarthric speech
,”
J. Speech. Lang. Hear. Res.
60
,
3110
3117
.
24.
Bradlow
,
A. R.
, and
Bent
,
T.
(
2008
). “
Perceptual adaptation to non-native speech
,”
Cognition
106
,
707
729
.
25.
Brungart
,
D. S.
,
Makashay
,
M. J.
, and
Sheffield
,
B. M.
(
2021a
). “
Development of an 80-word clinical version of the modified rhyme test (MRT80)
,”
J. Acoust. Soc. Am.
149
,
3311
3327
.
26.
Brungart
,
D. S.
,
Sheffield
,
B.
,
Makashay
,
M. J.
, and
Galloza
,
H.
(
2021b
). “
Development of an auditory fitness-for-duty standard that predicts performance in military hearing tasks from auditory thresholds and performance on the 80-word modified rhyme test
,”
J. Acoust. Soc. Am.
150
,
A339
.
27.
Brungart
,
D. S.
,
Walden
,
B.
,
Cord
,
M.
,
Phatak
,
S.
,
Theodoroff
,
S. M.
,
Griest
,
S.
, and
Grant
,
K. W.
(
2017
). “
Development and validation of the Speech Reception in Noise (SPRINT) Test
,”
Hear. Res.
349
,
90
97
.
28.
Byrne
,
D.
, and
Dillon
,
H.
(
1986
). “
The National Acoustic Laboratories' (NAL) new procedure for selecting the gain and frequency response of a hearing aid
,”
Ear Hear.
7
,
257
265
.
29.
Carney
,
E.
, and
Schlauch
,
R. S.
(
2007
). “
Critical difference table for word recognition testing derived using computer simulation
,”
J. Speech. Lang. Hear. Res.
50
,
1203
1209
.
30.
Chi
,
T.
,
Gao
,
Y.
,
Guyton
,
M. C.
,
Ru
,
P.
, and
Shamma
,
S.
(
1999
). “
Spectro-temporal modulation transfer functions and speech intelligibility
,”
J. Acoust. Soc. Am.
106
,
2719
2732
.
31.
Chi
,
T.
,
Ru
,
P.
, and
Shamma
,
S. A.
(
2005
). “
Multiresolution spectrotemporal analysis of complex sounds
,”
J. Acoust. Soc. Am.
118
,
887
906
.
32.
Clarke
,
C. M.
, and
Garrett
,
M. F.
(
2004
). “
Rapid adaptation to foreign-accented English
,”
J. Acoust. Soc. Am.
116
,
3647
3658
.
33.
Clopper
,
C. G.
,
Pisoni
,
D. B.
, and
Tierney
,
A. T.
(
2006
). “
Effects of open-set and closed-set task demands on spoken word recognition
,”
J. Am. Acad. Audiol.
17
,
331
349
.
34.
Clopper
,
C. G.
,
Tierney
,
A. T.
, and
Pisoni
,
D. B.
(
2003
). “
Effects of response format on speech intelligibility in noise: Results obtained from open‐set, closed‐set, and delayed response tasks
,”
J. Acoust. Soc. Am.
113
,
2254
.
35.
Cusumano
,
C.
,
Friedmann
,
D. R.
,
Fang
,
Y.
,
Wang
,
B.
,
Roland
,
J. T.
, Jr.
, and
Waltzman
,
S. B.
(
2017
). “
Performance plateau in prelingually and postlingually deafened adult cochlear implant recipients
,”
Otol. Neurotol.
38
,
334
338
.
36.
Davies-Venn
,
E.
,
Nelson
,
P.
, and
Souza
,
P.
(
2015
). “
Comparing auditory filter bandwidths, spectral ripple modulation detection, spectral ripple discrimination, and speech recognition: Normal and impaired hearing
,”
J. Acoust. Soc. Am.
138
,
492
503
.
37.
Davis
,
M. H.
,
Johnsrude
,
I. S.
,
Hervais-Adelman
,
A.
,
Taylor
,
K.
, and
McGettigan
,
C.
(
2005
). “
Lexical information drives perceptual learning of distorted speech: Evidence from the comprehension of noise-vocoded sentences
,”
J. Exp. Psychol. Gen.
134
,
222
241
.
38.
Dawes
,
P.
, and
Munro
,
K. J.
(
2017
). “
Auditory distraction and acclimatization to hearing aids
,”
Ear Hear.
38
,
174
183
.
39.
Dawes
,
P.
,
Munro
,
K. J.
,
Kalluri
,
S.
, and
Edwards
,
B.
(
2014
). “
Acclimatization to hearing aids
,”
Ear Hear.
35
,
203
212
.
40.
DeRoy Milvae
,
K.
,
Kuchinsky
,
S. E.
,
Stakhovskaya
,
O. A.
, and
Goupell
,
M. J.
(
2021
). “
Dichotic listening performance and effort as a function of spectral resolution and interaural symmetry
,”
J. Acoust. Soc. Am.
150
,
920
935
.
41.
Dirks
,
D. D.
,
Morgan
,
D. E.
, and
Dubno
,
J. R.
(
1982
). “
A procedure for quantifying the effects of noise on speech recognition
,”
J. Speech Hear. Disord.
47
,
114
123
.
42.
Dryden
,
A.
,
Allen
,
H. A.
,
Henshaw
,
H.
, and
Heinrich
,
A.
(
2017
). “
The association between cognitive performance and speech-in-noise perception for adult listeners: A systematic literature review and meta-analysis
,”
Trends Hear.
21
,
233121651774467
.
43. Edwards, B. (2020). “Emerging technologies, market segments, and MarkeTrak 10 insights in hearing health technology,” Semin. Hear. 41, 37–54.
44. Emmett, S. D., and Francis, H. W. (2015). “The socioeconomic impact of hearing loss in US adults,” Otol. Neurotol. 36, 545–550.
45. Fisk, A. D., Derrick, W. L., and Schneider, W. (1986). “A methodological assessment and evaluation of dual-task paradigms,” Curr. Psychol. 5, 315–327.
46. Fletcher, H., and Galt, R. H. (1950). “The perception of speech and its relation to telephony,” J. Acoust. Soc. Am. 22, 89–151.
47. Frequency Therapeutics (2021). “Frequency Therapeutics releases new data from two FX-322 clinical studies; plans to advance single-dose regimen,” https://www.businesswire.com/news/home/20210323005208/en/Frequency-Therapeutics-Releases-New-Data-from-Two-FX-322-Clinical-Studies-Plans-to-Advance-Single-Dose-Regimen (Last viewed 12/17/2021).
48. Gatehouse, S. (1992). “The time course and magnitude of perceptual acclimatization to frequency responses: Evidence from monaural fitting of hearing aids,” J. Acoust. Soc. Am. 92, 1258–1268.
49. Gelfand, S. A. (1998). “Optimizing the reliability of speech recognition scores,” J. Speech Lang. Hear. Res. 41, 1088–1102.
50. Gelfand, S. A. (2003). “Tri-word presentations with phonemic scoring for practical high-reliability speech recognition assessment,” J. Speech Lang. Hear. Res. 46, 405–412.
51. Gifford, R. H., Noble, J. H., Camarata, S. M., Sunderhaus, L. W., Dwyer, R. T., Dawant, B. M., Dietrich, M. S., and Labadie, R. F. (2018). “The relationship between spectral modulation detection and speech recognition: Adult versus pediatric cochlear implant recipients,” Trends Hear. 22, 233121651877117.
52. Giuliani, N. P., Brown, C. J., and Wu, Y.-H. (2021). “Comparisons of the sensitivity and reliability of multiple measures of listening effort,” Ear Hear. 42, 465–474.
53. Glasberg, B. R., and Moore, B. C. (1990). “Derivation of auditory filter shapes from notched-noise data,” Hear. Res. 47, 103–138.
54. Goehring, T., Keshavarzi, M., Carlyon, R. P., and Moore, B. C. (2019). “Using recurrent neural networks to improve the perception of speech in non-stationary noise by people with cochlear implants,” J. Acoust. Soc. Am. 146, 705–718.
55. Gosselin, P. A., and Gagné, J.-P. (2010). “Use of a dual-task paradigm to measure listening effort,” Can. J. Speech Lang. Pathol. Audiol. 34, 43–51.
56. Grant, K. W., Bernstein, J. G., and Summers, V. (2013). “Conclusion: Predicting speech intelligibility by individual hearing-impaired listeners: The path forward,” J. Am. Acad. Audiol. 24, 329–336.
57. Grant, K. W., Kubli, L. R., Phatak, S. A., Galloza, H., and Brungart, D. S. (2021). “Estimated prevalence of functional hearing difficulties in blast-exposed service members with normal to near-normal-hearing thresholds,” Ear Hear. 42, 1615–1626.
58. Hannula, S., Bloigu, R., Majamaa, K., Sorri, M., and Mäki-Torkko, E. (2011). “Self-reported hearing problems among older adults: Prevalence and comparison to measured hearing impairment,” J. Am. Acad. Audiol. 22, 550–559.
59. Herrmann, B., and Johnsrude, I. S. (2020). “A model of listening engagement (MoLE),” Hear. Res. 397, 108016.
60. Hershman, R., Milshtein, D., and Henik, A. (2022). “The contribution of temporal analysis of pupillometry measurements to cognitive research,” Psychol. Res. (published online).
61. Hétu, R., Riverin, L., Lalande, N., Getty, L., and St-Cyr, C. (1988). “Qualitative analysis of the handicap associated with occupational hearing loss,” Br. J. Audiol. 22, 251–264.
62. Holden, L. K., Firszt, J. B., Reeder, R. M., Uchanski, R. M., Dwyer, N. Y., and Holden, T. A. (2016). “Factors affecting outcomes in cochlear implant recipients implanted with a perimodiolar electrode array located in scala tympani,” Otol. Neurotol. 37, 1662–1668.
63. Hopstaken, J. F., van der Linden, D., Bakker, A. B., and Kompier, M. A. J. (2015). “A multifaceted investigation of the link between mental fatigue and task disengagement: Mental fatigue and task disengagement,” Psychophysiology 52, 305–315.
64. House, A. S., Williams, C. E., Hecker, M. H., and Kryter, K. D. (1965). “Articulation-testing methods: Consonantal differentiation with a closed-response set,” J. Acoust. Soc. Am. 37, 158–166.
65. Humes, L. E. (1999). “Dimensions of hearing aid outcome,” J. Am. Acad. Audiol. 10, 26–39.
66. Humes, L. E. (2007). “The contributions of audibility and cognitive factors to the benefit provided by amplified speech to older adults,” J. Am. Acad. Audiol. 18, 590–603.
67. Humes, L. E., and Wilson, D. L. (2003). “An examination of changes in hearing-aid performance and benefit in the elderly over a 3-year period of hearing-aid use,” J. Speech Lang. Hear. Res. 46, 137–145.
68. Humes, L. E., Wilson, D. L., Barlow, N. N., and Garner, C. (2002). “Changes in hearing-aid benefit following 1 or 2 years of hearing-aid use by older adults,” J. Speech Lang. Hear. Res. 45, 772–782.
69. Imaoka, Y., Flury, A., and de Bruin, E. D. (2020). “Assessing saccadic eye movements with head-mounted display virtual reality technology,” Front. Psychiatry 11, 572938.
70. Jansen, S., Luts, H., Wagener, K. C., Frachet, B., and Wouters, J. (2010). “The French digit triplet test: A hearing screening tool for speech intelligibility in noise,” Int. J. Audiol. 49, 378–387.
71. Jeon, E. K., Turner, C. W., Karsten, S. A., Henry, B. A., and Gantz, B. J. (2015). “Cochlear implant users' spectral ripple resolution,” J. Acoust. Soc. Am. 138, 2350–2358.
72. Jerger, J. (2011). “Why do people without hearing loss have hearing complaints?,” J. Am. Acad. Audiol. 22, 490.
73. Johnson, D. H. (1980). “The relationship between spike rate and synchrony in responses of auditory-nerve fibers to single tones,” J. Acoust. Soc. Am. 68, 1115–1122.
74. Joshi, S., Li, Y., Kalwani, R. M., and Gold, J. I. (2016). “Relationships between pupil diameter and neuronal activity in the locus coeruleus, colliculi, and cingulate cortex,” Neuron 89, 221–234.
75. Juul Jensen, J., Callaway, S. L., Lunner, T., and Wendt, D. (2018). “Measuring the impact of tinnitus on aided listening effort using pupillary response,” Trends Hear. 22, 233121651879534.
76. Juvrud, J., Gredebäck, G., Åhs, F., Lerin, N., Nyström, P., Kastrati, G., and Rosén, J. (2018). “The immersive virtual reality lab: Possibilities for remote experimental manipulations of autonomic activity on a large scale,” Front. Neurosci. 12, 305.
77. Kahneman, D. (1973). Attention and Effort (Prentice Hall, Upper Saddle River, NJ).
78. Killion, M. C., Niquette, P. A., Gudmundsen, G. I., Revit, L. J., and Banerjee, S. (2004). “Development of a quick speech-in-noise test for measuring signal-to-noise ratio loss in normal-hearing and hearing-impaired listeners,” J. Acoust. Soc. Am. 116, 2395–2405.
79. Kleine Punte, A., De Bodt, M., and Van de Heyning, P. (2014). “Long-term improvement of speech perception with the fine structure processing coding strategy in cochlear implants,” ORL 76, 36–43.
80. Kleinschmidt, D. F., and Jaeger, T. F. (2015). “Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel,” Psychol. Rev. 122, 148–203.
81. Koch, I., Poljac, E., Müller, H., and Kiesel, A. (2018). “Cognitive structure, flexibility, and plasticity in human multitasking: An integrative review of dual-task and task-switching research,” Psychol. Bull. 144, 557–583.
82. Kollmeier, B., Warzybok, A., Hochmuth, S., Zokoll, M. A., Uslar, V., Brand, T., and Wagener, K. C. (2015). “The multilingual matrix test: Principles, applications, and comparison across languages: A review,” Int. J. Audiol. 54, 3–16.
83. Kollmeier, B., and Wesselkamp, M. (1997). “Development and evaluation of a German sentence test for objective and subjective speech intelligibility assessment,” J. Acoust. Soc. Am. 102, 2412–2421.
84. Kramer, S. E., Teunissen, C. E., and Zekveld, A. A. (2016). “Cortisol, chromogranin A, and pupillary responses evoked by speech recognition tasks in normally hearing and hard-of-hearing listeners,” Ear Hear. 37, 126S–135S.
85. Kuchinsky, S. E., Ahlstrom, J. B., Cute, S. L., Humes, L. E., Dubno, J. R., and Eckert, M. A. (2014). “Speech-perception training for older adults with hearing loss impacts word recognition and effort,” Psychophysiology 51, 1046–1057.
86. Kuchinsky, S. E., Ahlstrom, J. B., Vaden, K. I., Jr., Cute, S. L., Humes, L. E., Dubno, J. R., and Eckert, M. A. (2013). “Pupil size varies with word listening and response selection difficulty in older adults with hearing loss,” Psychophysiology 50, 23–34.
87. Kuchinsky, S. E., and Vaden, K. I. (2020). “Aging, hearing loss, and listening effort: Imaging studies of the aging listener,” in Aging and Hearing, Springer Handbook of Auditory Research, Vol. 72, edited by K. S. Helfer, E. L. Bartlett, A. N. Popper, and R. R. Fay (Springer, Cham, Switzerland), pp. 231–256.
88. Kuchinsky, S. E., Vaden, K. I., Jr., Ahlstrom, J. B., Cute, S. L., Humes, L. E., Dubno, J. R., and Eckert, M. A. (2016). “Task-related vigilance during word recognition in noise for older adults with hearing loss,” Exp. Aging Res. 42, 50–66.
89. Kujawa, S. G., and Liberman, M. C. (2015). “Synaptopathy in the noise-exposed and aging cochlea: Primary neural degeneration in acquired sensorineural hearing loss,” Hear. Res. 330, 191–199.
90. Kuk, F. K., Potts, L., Valente, M., Lee, L., and Picirrillo, J. (2003). “Evidence of acclimatization in persons with severe-to-profound hearing loss,” J. Am. Acad. Audiol. 14, 84–99.
91. Lakshmi, M. S. K., Rout, A., and O'Donoghue, C. R. (2021). “A systematic review and meta-analysis of digital noise reduction hearing aids in adults,” Disabil. Rehabil. Assist. Technol. 16, 120–129.
92. Lash, A., Rogers, C. S., Zoller, A., and Wingfield, A. (2013). “Expectation and entropy in spoken word recognition: Effects of age and hearing acuity,” Exp. Aging Res. 39, 235–253.
93. Lawler, M., Yu, J., and Aronoff, J. (2017). “Comparison of the spectral-temporally modulated ripple test with the Arizona Biomedical Institute Sentence Test in cochlear implant users,” Ear Hear. 38, 760–766.
94. Lenarz, M., Sönmez, H., Joseph, G., Büchner, A., and Lenarz, T. (2012). “Long-term performance of cochlear implants in postlingually deafened adults,” Otolaryngol. Head Neck Surg. 147, 112–118.
95. Lesica, N. A. (2018). “Why do hearing aids fail to restore normal auditory perception?,” Trends Neurosci. 41, 174–185.
96. Lin, F. R., and Albert, M. (2014). “Hearing loss and dementia–Who's listening?,” Aging Ment. Health 18, 671–673.
97. Litvak, L. M., Spahr, A. J., Saoji, A. A., and Fridman, G. Y. (2007). “Relationship between perception of spectral ripple and speech recognition in cochlear implant and vocoder listeners,” J. Acoust. Soc. Am. 122, 982–991.
98. Lorenzi, C., Gilbert, G., Carn, H., Garnier, S., and Moore, B. C. J. (2006). “Speech perception problems of the hearing impaired reflect inability to use temporal fine structure,” Proc. Natl. Acad. Sci. U.S.A. 103, 18866–18869.
99. Lunner, T., Alickovic, E., Graversen, C., and Ng, E. H. N. (2020). “Three new outcome measures that tap into cognitive processes required for real-life communication,” Ear Hear. 41, 39S–47S.
100. Makeig, S., and Inlow, M. (1993). “Lapse in alertness: Coherence of fluctuations in performance and EEG spectrum,” Electroencephalogr. Clin. Neurophysiol. 86, 23–35.
101. Makeig, S., and Jung, T.-P. (1996). “Tonic, phasic, and transient EEG correlates of auditory awareness in drowsiness,” Cogn. Brain Res. 4, 15–25.
102. Manheim, M., Lavie, L., and Banai, K. (2018). “Age, hearing, and the perceptual learning of rapid speech,” Trends Hear. 22, 233121651877865.
103. Massa, S. T., and Ruckenstein, M. J. (2014). “Comparing the performance plateau in adult cochlear implant patients using HINT and AzBio,” Otol. Neurotol. 35, 598–604.
104. Mattys, S. L., Davis, M. H., Bradlow, A. R., and Scott, S. K. (2012). “Speech recognition in adverse conditions: A review,” Lang. Cognit. Process. 27, 953–978.
105. McCoy, S. L., Tun, P. A., Cox, L. C., Colangelo, M., Stewart, R. A., and Wingfield, A. (2005). “Hearing loss and perceptual effort: Downstream effects on older adults' memory for speech,” Q. J. Exp. Psychol. Sec. A 58, 22–33.
106. McGarrigle, R., Dawes, P., Stewart, A. J., Kuchinsky, S. E., and Munro, K. J. (2017). “Pupillometry reveals changes in physiological arousal during a sustained listening task: Physiological changes during sustained listening,” Psychophysiology 54, 193–203.
107. McGarrigle, R., Munro, K. J., Dawes, P., Stewart, A. J., Moore, D. R., Barry, J. G., and Amitay, S. (2014). “Listening effort and fatigue: What exactly are we measuring? A British Society of Audiology Cognition in Hearing Special Interest Group ‘white paper’,” Int. J. Audiol. 53, 433–445.
108. McGarrigle, R., Rakusen, L., and Mattys, S. (2021). “Effortful listening under the microscope: Examining relations between pupillometric and subjective markers of effort and tiredness from listening,” Psychophysiology 58, e13703.
109. Mehraei, G., Gallun, F. J., Leek, M. R., and Bernstein, J. G. (2014). “Spectrotemporal modulation sensitivity for hearing-impaired listeners: Dependence on carrier center frequency and the relationship to speech intelligibility,” J. Acoust. Soc. Am. 136, 301–316.
110. Miles, K., McMahon, C., Boisvert, I., and Ibrahim, R. (2017). “Objective assessment of listening effort: Coregistration of pupillometry and EEG,” Trends Hear. 21, 233121651770639.
111. Miller, C. W., Bernstein, J. G., Zhang, X., Wu, Y.-H., Bentley, R. A., and Tremblay, K. (2018). “The effects of static and moving spectral ripple sensitivity on unaided and aided speech perception in noise,” J. Speech Lang. Hear. Res. 61, 3113–3126.
112. Moore, B. C. J., Huss, M., Vickers, D. A., Glasberg, B. R., and Alcántara, J. I. (2000). “A test for the diagnosis of dead regions in the cochlea,” Br. J. Audiol. 34, 205–224.
113. Moore, T. M., and Picou, E. M. (2018). “A potential bias in subjective ratings of mental effort,” J. Speech Lang. Hear. Res. 61, 2405–2421.
114. Mullennix, J. W., Pisoni, D. B., and Martin, C. S. (1989). “Some effects of talker variability on spoken word recognition,” J. Acoust. Soc. Am. 85, 365–378.
115. National Center for Environmental Health (2018). “Statistics about the public health burden of noise-induced hearing loss,” https://www.cdc.gov/nceh/hearing_loss/public_health_scientific_info.html (Last viewed 12/17/2021).
116. Neagu, M.-B., Dau, T., Hyvärinen, P., Bækgaard, P., Lunner, T., and Wendt, D. (2019). “Investigating pupillometry as a reliable measure of individual's listening effort,” in Proceedings of the International Symposium on Auditory and Audiological Research, August 22–25, Nyborg, Denmark, pp. 365–372.
117. Newman, R. S., Clouse, S. A., and Burnham, J. L. (2001). “The perceptual consequences of within-talker variability in fricative production,” J. Acoust. Soc. Am. 109, 1181–1196.
118. Nikhil, J., Megha, K. N., and Prabhu, P. (2018). “Diurnal changes in differential sensitivity and temporal resolution in morning-type and evening-type individuals with normal hearing,” World J. Otorhinolaryngol. Head Neck Surg. 4, 229–233.
119. Nilsson, M., Soli, S. D., and Sullivan, J. A. (1994). “Development of the Hearing in Noise Test for the measurement of speech reception thresholds in quiet and in noise,” J. Acoust. Soc. Am. 95, 1085–1099.
120. Nye, P. W., and Gaitenby, J. H. (1973). “Consonant intelligibility in synthetic speech and in a natural speech control (modified rhyme test results),” Haskins Laboratories Status Report SR 33 (Haskins Laboratories, New Haven, CT), pp. 77–91.
121. Oh, S.-H., Kim, C.-S., Kang, E. J., Lee, D. S., Lee, H. J., Chang, S. O., Ahn, S., Hwang, C. H., Park, H. J., and Koo, J. W. (2003). “Speech perception after cochlear implantation over a 4-year time period,” Acta Otolaryngol. 123, 148–153.
122. Ohlenforst, B., Zekveld, A. A., Jansma, E., Wang, Y., Naylor, G., Lorens, A., Lunner, T., and Kramer, S. (2017a). “Effects of hearing impairment and hearing aid amplification on listening effort: A systematic review,” Ear Hear. 38, 267–281.
123. Ohlenforst, B., Zekveld, A. A., Lunner, T., Wendt, D., Naylor, G., Wang, Y., Versfeld, N. J., and Kramer, S. E. (2017b). “Impact of stimulus-related factors and hearing impairment on listening effort as indicated by pupil dilation,” Hear. Res. 351, 68–79.
124. Pashler, H. (1994). “Dual-task interference in simple tasks: Data and theory,” Psychol. Bull. 116, 220–244.
125. Peelle, J. E., and Wingfield, A. (2005). “Dissociations in perceptual learning revealed by adult age differences in adaptation to time-compressed speech,” J. Exp. Psychol. Hum. Percept. Perform. 31, 1315–1330.
126.
Peng, Z.
E.
,
Waz
,
S.
,
Buss
,
E.
,
Shen
,
Y.
,
Richards
,
V.
,
Bharadwaj
,
H.
,
Stecker, G
,
C.
,
Beim, J.
A.
,
Bosen, A.
K.
,
Braza, M.
D.
, and
Diedesch
A. C.
(
2022
). “
Remote testing for psychological and physiological acoustics,
,”
J. Acoust. Soc. Am.
151
,
3116
3128
.
127. Phatak, S. A., and Grant, K. W. (2019). “Effects of temporal distortions on consonant perception with and without undistorted visual speech cues,” J. Acoust. Soc. Am. 146, EL381–EL386.
128. Pichora-Fuller, M. K., Kramer, S. E., Eckert, M. A., Edwards, B., Hornsby, B. W. Y., Humes, L. E., Lemke, U., Lunner, T., Matthen, M., Mackersie, C. L., Naylor, G., Phillips, N. A., Richter, M., Rudner, M., Sommers, M. S., Tremblay, K. L., and Wingfield, A. (2016). “Hearing impairment and cognitive energy: The framework for understanding effortful listening (FUEL),” Ear Hear. 37, 5S–27S.
129. Picou, E. M., and Ricketts, T. A. (2014). “The effect of changing the secondary task in dual-task paradigms for measuring listening effort,” Ear Hear. 35, 611–622.
130. Plomp, R. (1978). “Auditory handicap of hearing impairment and the limited benefit of hearing aids,” J. Acoust. Soc. Am. 63, 533–549.
131. Rönnberg, J., Holmer, E., and Rudner, M. (2019). “Cognitive hearing science and ease of language understanding,” Int. J. Audiol. 58, 247–261.
132. Ruffin, C. V., Tyler, R. S., Witt, S. A., Dunn, C. C., Gantz, B. J., and Rubinstein, J. T. (2007). “Long-term performance of Clarion 1.0 cochlear implant users,” Laryngoscope 117, 1183–1190.
133. Sakamoto, K., Shirai, S., Orlosky, J., Nagataki, H., Takemura, N., Alizadeh, M., and Ueda, M. (2020). “Exploring pupillometry as a method to evaluate reading comprehension in VR-based educational comics,” in Proceedings of the 2020 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, March 22–26, Atlanta, GA, pp. 422–426.
134. Saoji, A. A., Litvak, L., Spahr, A. J., and Eddins, D. A. (2009). “Spectral modulation detection and vowel and consonant identifications in cochlear implant listeners,” J. Acoust. Soc. Am. 126, 955–958.
135. Saunders, G. H., and Cienkowski, K. M. (1997). “Acclimatization to hearing aids,” Ear Hear. 18, 129–139.
136. Schlauch, R. S., Anderson, E. S., and Micheyl, C. (2014). “A demonstration of improved precision of word recognition scores,” J. Speech Lang. Hear. Res. 57, 543–555.
137. Schlauch, R. S., and Carney, E. (2018). “Clinical strategies for sampling word recognition performance,” J. Speech Lang. Hear. Res. 61, 936–944.
138. Schlueter, A., Lemke, U., Kollmeier, B., and Holube, I. (2016). “Normal and time-compressed speech: How does learning affect speech recognition thresholds in noise?,” Trends Hear. 20, 2331216516669889.
139. Schum, D. (1996). “Speech understanding in background noise,” in Hearing Aids: Standards, Options, and Limitations, edited by M. Valente (Thieme, New York), pp. 368–406.
140. Seist, R., Tong, M., Landegger, L. D., Vasilijic, S., Hyakusoku, H., Katsumi, S., McKenna, C. E., Edge, A. S. B., and Stankovic, K. M. (2020). “Regeneration of cochlear synapses by systemic administration of a bisphosphonate,” Front. Mol. Neurosci. 13, 87.
141. Shapiro, M. L., Norris, J. A., Wilbur, J. C., Brungart, D. S., and Clavier, O. H. (2020). “TabSINT: Open-source mobile software for distributed studies of hearing,” Int. J. Audiol. 59, S12–S19.
142. Smith, Z. M., Parkinson, W. S., and Long, C. J. (2013). “Multipolar current focusing increases spectral resolution in cochlear implants,” in Proceedings of the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), July 3–7, Osaka, Japan, pp. 2796–2799.
143. Smits, C., Theo Goverts, S., and Festen, J. M. (2013). “The digits-in-noise test: Assessing auditory speech recognition abilities in noise,” J. Acoust. Soc. Am. 133, 1693–1706.
144. Smoorenburg, G. F. (1992). “Speech reception in quiet and in noisy conditions by individuals with noise-induced hearing loss in relation to their tone audiogram,” J. Acoust. Soc. Am. 91, 421–437.
145. Sokal, R. R., and Rohlf, F. J. (1981). Biometry: The Principles and Practice of Statistics in Biological Research, 2nd ed. (Freeman, San Francisco, CA), p. 859.
146. Soli, S. D., and Wong, L. L. N. (2008). “Assessment of speech intelligibility in noise with the hearing in noise test,” Int. J. Audiol. 47, 356–361.
147. Sommers, M. S., Kirk, K. I., and Pisoni, D. B. (1997). “Some considerations in evaluating spoken word recognition by normal-hearing, noise-masked normal-hearing, and cochlear implant listeners. I: The effects of response format,” Ear Hear. 18, 89–99.
148. Souza, P., Gallun, F., and Wright, R. (2020). “Contributions to speech-cue weighting in older adults with impaired hearing,” J. Speech Lang. Hear. Res. 63, 334–344.
149. Strand, J. F., Brown, V. A., Merchant, M. B., and Brown, H. E. (2018). “Measuring listening effort: Convergent validity, sensitivity, and links with cognitive and personality measures,” J. Speech Lang. Hear. Res. 61, 1463–1486.
150. Summers, V., Makashay, M. J., Theodoroff, S. M., and Leek, M. R. (2013). “Suprathreshold auditory processing and speech perception in noise: Hearing-impaired and normal-hearing listeners,” J. Am. Acad. Audiol. 24, 274–292.
151. Suter, A. H. (1985). “Speech recognition in noise by individuals with mild hearing impairments,” J. Acoust. Soc. Am. 78, 887–900.
152. Teubner-Rhodes, S., and Kuchinsky, S. E. (2020). “Physiological approaches,” in The Handbook of Listening, edited by D. L. Worthington and G. D. Bodie (Wiley, Hoboken, NJ), pp. 9–26.
153. Thornton, A. R., and Raffin, M. J. (1978). “Speech-discrimination scores modeled as a binomial variable,” J. Speech Hear. Res. 21, 507–518.
154. Tremblay, K. L., Pinto, A., Fischer, M. E., Klein, B. E., Klein, R., Levy, S., Tweed, T. S., and Cruickshanks, K. J. (2015). “Self-reported hearing difficulties among adults with normal audiograms: The Beaver Dam Offspring Study,” Ear Hear. 36, e290–e299.
155. U.S. Department of the Army (2019). Pamphlet 40-502: Medical Readiness Procedures (U.S. Department of the Army, Washington, DC).
156. U.S. Department of Defense (2015). MIL-STD-1474E, Department of Defense Design Criteria Standard: Noise Limits, AMSC 9542 (DOD, Washington, DC).
157. Varnet, L., Léger, A. C., Boucher, S., Bonnet, C., Petit, C., and Lorenzi, C. (2021). “Contributions of age-related and audibility-related deficits to aided consonant identification in presbycusis: A causal-inference analysis,” Front. Aging Neurosci. 13, 640522.
158. Veneman, C. E., Gordon-Salant, S., Matthews, L. J., and Dubno, J. R. (2013). “Age and measurement time-of-day effects on speech recognition in noise,” Ear Hear. 34, 288–299.
159. Vermiglio, A. J. (2008). “The American English hearing in noise test,” Int. J. Audiol. 47, 386–387.
160. Vermiglio, A. J., Soli, S. D., Freed, D. J., and Fisher, L. M. (2012). “The relationship between high-frequency pure-tone hearing loss, hearing in noise test (HINT) thresholds, and the articulation index,” J. Am. Acad. Audiol. 23, 779–788.
161. Wagner, A. E., Nagels, L., Toffanin, P., Opie, J. M., and Başkent, D. (2019). “Individual variations in effort: Assessing pupillometry for the hearing impaired,” Trends Hear. 23, 233121651984559.
162. Wagner, A. E., Toffanin, P., and Başkent, D. (2016). “The timing and effort of lexical access in natural and degraded speech,” Front. Psychol. 7, 398.
163. Wang, Y., Naylor, G., Kramer, S. E., Zekveld, A. A., Wendt, D., Ohlenforst, B., and Lunner, T. (2018). “Relations between self-reported daily-life fatigue, hearing status, and pupil dilation during a speech perception in noise task,” Ear Hear. 39, 573–582.
164. Wardenga, N., Batsoulis, C., Wagener, K. C., Brand, T., Lenarz, T., and Maier, H. (2015). “Do you hear the noise? The German matrix sentence test with a fixed noise level in subjects with normal hearing and hearing impairment,” Int. J. Audiol. 54, 71–79.
165. Watson, C. S., Kidd, G. R., Miller, J. D., Smits, C., and Humes, L. E. (2012). “Telephone screening tests for functionally impaired hearing: Current use in seven countries and development of a US version,” J. Am. Acad. Audiol. 23, 757–767.
166. Wendt, D., Hietkamp, R. K., and Lunner, T. (2017). “Impact of noise and noise reduction on processing effort: A pupillometry study,” Ear Hear. 38, 690–700.
167. Williams, C. E., and Hecker, M. H. (1967). “Selecting an intelligibility test for communication system evaluation,” J. Acoust. Soc. Am. 42, 1198.
168. Wilson, J. C., Nair, S., Scielzo, S., and Larson, E. C. (2021). “Objective measures of cognitive load using deep multi-modal learning: A use-case in aviation,” Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 5, 1–35.
169. Wilson, R. H. (2004). “Adding speech-in-noise testing to your clinical protocol: Why and how,” Hear. J. 57, 10–12.
170. Winn, M. B., Edwards, J. R., and Litovsky, R. Y. (2015). “The impact of auditory spectral resolution on listening effort revealed by pupil dilation,” Ear Hear. 36, e153–e165.
171. Winn, M. B., and Moore, A. N. (2018). “Pupillometry reveals that context benefit in speech perception can be disrupted by later-occurring sounds, especially in listeners with cochlear implants,” Trends Hear. 22, 233121651880896.
172. Winn, M. B., and Teece, K. H. (2021). “Listening effort is not the same as speech intelligibility score,” Trends Hear. 25, 233121652110276.
173. Winn, M. B., Wendt, D., Koelewijn, T., and Kuchinsky, S. E. (2018). “Best practices and advice for using pupillometry to measure listening effort: An introduction for those who want to get started,” Trends Hear. 22, 233121651880086.
174. Wolfe, J., John, A., Schafer, E., Nyffeler, M., Boretzki, M., Caraway, T., and Hudson, M. (2011). “Long-term effects of non-linear frequency compression for children with moderate hearing loss,” Int. J. Audiol. 50, 396–404.
175. Won, J. H., Moon, I. J., Jin, S., Park, H., Woo, J., Cho, Y.-S., Chung, W.-H., and Hong, S. H. (2015). “Spectrotemporal modulation detection and speech perception by cochlear implant users,” PLoS One 10, e0140920.
176. Xu, S., and Yang, N. (2021). “Research progress on the mechanism of cochlear hair cell regeneration,” Front. Cell. Neurosci. 15, 732507.
177. Yu, T.-L. J., and Schlauch, R. S. (2019). “Diagnostic precision of open-set versus closed-set word recognition testing,” J. Speech Lang. Hear. Res. 62, 2035–2047.
178. Zaar, J., Simonsen, L. B., Behrens, T., Dau, T., and Laugesen, S. (2020). “Investigating the relationship between spectro-temporal modulation detection, aided speech perception, and directional noise reduction preference in hearing-impaired listeners,” in Proceedings of the International Symposium on Auditory and Audiological Research, August 21–23, 2019, Nyborg, Denmark, pp. 181–188.
179. Zekveld, A. A., Koelewijn, T., and Kramer, S. E. (2018). “The pupil dilation response to auditory stimuli: Current state of knowledge,” Trends Hear. 22, 233121651877717.
180. Zekveld, A. A., Kramer, S. E., and Festen, J. M. (2010). “Pupil response as an indication of effortful listening: The influence of sentence intelligibility,” Ear Hear. 31, 480–490.
181. Zekveld, A. A., Kramer, S. E., and Festen, J. M. (2011). “Cognitive load during speech perception in noise: The influence of age, hearing loss, and cognition on the pupil response,” Ear Hear. 32, 498–510.
182. Zekveld, A. A., Kramer, S. E., Rönnberg, J., and Rudner, M. (2019). “In a concurrent memory and auditory perception task, the pupil dilation response is more sensitive to memory load than to auditory stimulus characteristics,” Ear Hear. 40, 272–286.
183. Zhao, S., Bury, G., Milne, A., and Chait, M. (2019). “Pupillometry as an objective measure of sustained attention in young and older listeners,” Trends Hear. 23, 233121651988781.
184. Zhou, N. (2017). “Deactivating stimulation sites based on low-rate thresholds improves spectral ripple and speech reception thresholds in cochlear implant users,” J. Acoust. Soc. Am. 141, EL243–EL248.