Online auditory experiments use the sound delivery equipment of each participant, with no practical way to calibrate sound level or frequency response. Here, a method is proposed to control sensation level across frequencies: embedding stimuli in threshold-equalizing noise. In a cohort of 100 online participants, noise could equate detection thresholds from 125 to 4000 Hz. Equalization was successful even for participants with atypical thresholds in quiet, due either to poor-quality equipment or unreported hearing loss. Moreover, audibility in quiet was highly variable, as overall level was uncalibrated, but variability was much reduced with noise. Use cases are discussed.
The study of auditory perception has a long and distinguished history of rigorous control of experimental conditions, such as calibrating the stimulus delivery apparatus or standardizing the listening environment of the participant (Yost, 2015). This “psychoacoustics” tradition, drawing in part from the engineering, technological, and psychological innovations of the time, has been the source of a large body of fundamental and applied knowledge [e.g., Moore (2013) for a review].
Recently, a novel technique has been added to the toolbox of the behavioral researcher: online experiments. In online settings, potentially large groups of anonymous participants complete self-supervised automated experiments (de Leeuw, 2015). Online testing has several potential benefits. First, online testing allows time and cost savings for large pools of participants, because recruitment can be automated through specialized platforms and also because testing can be run in parallel across participants, with minimal involvement of the experimenter. Second, the demographics of the participant pool may be more diverse than for in-lab studies, in terms of age, geographic origin, or even socio-economic status. Third, as has been shown by the COVID-19 pandemic, online testing may provide alternative ways of collecting data in unfavorable circumstances. Such combined benefits enable larger-scale experiments, with more statistical power, allowing for cross-cultural comparisons or innovative testing methods.
For the psychoacoustics approach, however, online testing also presents a set of challenges. Online testing uses the sound delivery system of the participant. Its characteristics and overall quality are unknown. Uncalibrated sound delivery could distort stimuli and introduce uncontrolled variability across participants. Moreover, the sound presentation level, typically one of the first parameters that are controlled in in-lab experiments, is self-adjusted by the participant through operating system or sound card controls, which are inaccessible to online programming languages. Thus, presentation level cannot be set or even documented by the experimenter. Finally, the assessment of the hearing status of the participant, typically another routine part of in-lab experiments, cannot be performed accurately because the sound delivery system is uncalibrated.
As a result, there is an ongoing effort to adapt psychoacoustics to online settings. A special task force of the Psychological and Physiological Acoustics (P&P) Technical Committee of the Acoustical Society of America has been set up to this effect, collecting use cases and drawing recommendations for remote testing (Stecker et al., 2020). Screening techniques have been developed to enforce the use of headphones (Milne et al., 2021; Woods et al., 2017), as this should reduce a significant source of variability: computer loudspeakers come in all shapes and unexpected sizes, whereas headphones should be more comparable across models in terms of frequency response. Procedures for sound calibration and fast absolute threshold estimation have also been proposed, in an effort to promote “auditory hygiene” for online experiments (Zhao et al., 2022). Encouragingly, there are results suggesting that some classic psychoacoustical and audiological tests can be administered remotely (de Larrea-Mancera et al., 2022; Zhao et al., 2022).
In the present paper, we propose an additional procedure to reduce across-participant variability: controlling audibility across frequency regions. Indeed, there are currently no data regarding the variability of audibility with frequency for cohorts of online participants. Moreover, to the best of our knowledge, there has been no attempt so far to control the sensation level at which stimuli are presented. This could be an issue if the auditory feature under investigation depends on sensation level, as is the case for frequency discrimination, intensity discrimination, or gap detection, for instance (Moore, 2013). This could be an even more acute issue if differences across participants are of interest (Pressnitzer et al., 2018). Without controlling sensation level, such differences may be ascribed partly to the participants and partly to the sound delivery equipment.
To control audibility, we propose using threshold-equalizing noise (TEN) as partially masking noise. TEN was initially developed for the clinical diagnosis of dead regions in the cochlea (Moore et al., 2000). Briefly, the detection threshold of a pure tone embedded in TEN should be at a constant physical level (in dB SPL) irrespective of the frequency of the tone, for normal hearing listeners. To achieve this, the spectral shape of TEN is derived from a model that takes into account the noise power within each auditory critical band as well as an empirical detection efficiency for each frequency [Moore et al. (1997) and Moore et al. (2000); see also the accompanying online generation script]. Thus, with TEN, the sensation level of any frequency component should be under the experimenter's control, simply by manipulating the level of tonal components relative to the TEN.
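The shape of the TEN can be sketched as follows. If masked threshold is modelled as signal power equal to the detection efficiency K times the noise power falling within one equivalent rectangular bandwidth (ERB), then the spectrum level that holds threshold constant is the target threshold minus K minus 10·log10(ERB), all in dB. A minimal Python sketch, using the Glasberg and Moore ERB formula; the function names are ours, and the 0-dB default for K stands in for the empirical, frequency-dependent efficiency values, so this is an illustration of the principle rather than the actual generation script:

```python
import numpy as np

def erb_hz(f):
    """Equivalent rectangular bandwidth (Hz) of the auditory filter
    centred on f (Hz), Glasberg and Moore formula."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def ten_spectrum_level(f, threshold_db=0.0, k_db=0.0):
    """Relative TEN spectrum level (dB) placing a tone's masked
    threshold at `threshold_db` for every frequency f.
    Model: P_signal = K + N0 + 10*log10(ERB) (in dB), hence
    N0 = P_signal - K - 10*log10(ERB).
    `k_db` is the detection efficiency K; the true values are
    empirical -- 0 dB here is an illustrative stand-in."""
    return threshold_db - k_db - 10.0 * np.log10(erb_hz(f))
```

Because the ERB widens with frequency, the sketch correctly yields a spectrum level that falls with increasing frequency, as in the published TEN spectra.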
Several in-lab studies have made use of TEN to control for sensation level [e.g., Demany et al. (2021), Mehta and Oxenham (2020), and Traer et al. (2021)]. However, it is unknown whether the properties of TEN will be robust to the uncontrolled conditions of online testing. For instance, the overall level of the noise may change its masking properties. Even though TEN has been shown to operate as intended over a range of reasonable presentation levels (Moore et al., 2000), at the extreme, a very soft TEN will not have any masking power. Moreover, all previous in-lab uses of TEN have been performed via sound delivery equipment with a known frequency response, and on listeners with a known hearing status.
The aim of the present study is threefold: to provide an estimate of the variability in audibility of pure tones at different frequencies in an online setting; to test whether TEN is able to equate audibility across frequency in an online setting; and to discuss use cases (and provide software) for TEN in online psychoacoustic experiments.
2.1 Pre-registration, open data, ethical approval
The study was pre-registered (Pressnitzer et al., 2023). The sample size and statistical analyses exactly followed the pre-registration. There was a deviation from the pre-registration in terms of exclusion criteria, as described below. The full dataset is available on the ResearchBox platform, together with sound examples and MATLAB generation scripts (Bravard et al., 2023). The experimental protocol was approved by the INSERM Institutional Review Board (IRB00003888), Avis 21-858.
2.2 Stimuli

Audibility thresholds were measured for frequency-modulated pure tones (FM tones) with six carrier frequencies: 125, 250, 500, 1000, 2000, and 4000 Hz. The frequency modulation was sinusoidal on a log-frequency scale, spanning 0.5 octave at a rate of 4 Hz; it started and ended in sine phase. The carrier frequencies chosen are the standard frequencies used for clinical audiograms.
FM tones were presented either in quiet or embedded in noise (TEN). The spectral boundaries of the TEN were 74 and 6727 Hz, half an octave beyond the lowest and highest instantaneous frequencies of the FM tones. Note that the lowest carrier frequency used here, 125 Hz, is still within the parameter range of the model used to compute the TEN spectral shape (Moore et al., 1997), but, to the best of our knowledge, there are no experimental data yet about the effectiveness of TEN at such a low frequency.
Stimuli were presented in sequences within which each FM tone had a duration of 1 s and was gated on and off using cosine ramps lasting 50 ms. In the absence of noise, there was a 500-ms silent gap between consecutive FM tones. When noise was added, the noise started 125 ms before each FM tone and ended 125 ms after it. The level of the FM tones was self-adjusted by the participant while the sequence was played, as described below.
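The FM tones described above can be rendered with a short synthesis sketch. This is a minimal Python transcription (the actual generation scripts accompanying the paper are in MATLAB); the function name, sampling rate, and half-Hann shape chosen for the 50-ms cosine ramps are our assumptions:

```python
import numpy as np

def fm_tone(fc, dur=1.0, span_oct=0.5, fm_rate=4.0, ramp=0.05, fs=44100):
    """One FM tone as described in the text: carrier fc (Hz),
    sinusoidal modulation on a log-frequency scale spanning
    `span_oct` octaves at `fm_rate` Hz, starting and ending in
    sine phase, gated with `ramp`-s cosine on/off ramps."""
    t = np.arange(int(dur * fs)) / fs
    # instantaneous frequency: +/- a quarter octave around fc
    f_inst = fc * 2.0 ** (0.5 * span_oct * np.sin(2 * np.pi * fm_rate * t))
    # numeric phase integral of the instantaneous frequency
    phase = 2 * np.pi * np.cumsum(f_inst) / fs
    x = np.sin(phase)
    # raised-cosine gating
    n_ramp = int(ramp * fs)
    env = np.ones_like(x)
    env[:n_ramp] = 0.5 * (1 - np.cos(np.pi * np.arange(n_ramp) / n_ramp))
    env[-n_ramp:] = env[:n_ramp][::-1]
    return x * env
```

A sequence for one trial would then interleave such tones with 500-ms gaps (in quiet) or with TEN extending 125 ms before and after each tone.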
2.3 Main task procedure
In the main task, participants were asked to self-adjust the level of the FM tones so that they were just audible. Participants used their keyboard to increase or decrease the level of the FM tones in steps of 1 dB. The instructions encouraged them to bracket their audibility threshold and terminate the adjustment for the first audible step. The initial level of the FM tones was randomly jittered from trial to trial, between −14 and −10 dB relative to the maximal possible level, to reduce undesirable response strategies such as counting the number of steps before terminating the adjustment.
Two adjustments were performed for each FM tone carrier frequency. If the second adjustment differed from the first one by more than 4 dB, a third adjustment was performed, as pre-registered. The threshold for each frequency was taken as the mean of the two adjustments, or the mean of the two closest adjustments if a third adjustment had been required.
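The pre-registered combination rule can be made explicit with a short sketch (the function name is ours):

```python
def threshold_from_adjustments(adjs, tol_db=4.0):
    """Combine self-adjustments as pre-registered: mean of two
    adjustments, or mean of the two closest of three when the first
    two differed by more than `tol_db` dB."""
    a = list(adjs)
    if len(a) == 2:
        if abs(a[0] - a[1]) > tol_db:
            raise ValueError("adjustments differ by more than "
                             f"{tol_db} dB; a third one is required")
        return (a[0] + a[1]) / 2.0
    # three adjustments: average the closest pair
    pairs = [(abs(x - y), (x + y) / 2.0)
             for i, x in enumerate(a) for y in a[i + 1:]]
    return min(pairs)[1]
```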
Thresholds were measured in two distinct blocks: in quiet and in noise. The order of presentation of these blocks was counterbalanced across participants. In the noise blocks, it was emphasized that noise would remain audible throughout the trial and that participants had to self-report the audibility threshold of the FM tone only. The order of the frequencies of the FM carriers (the Frequency factor of the experiment) was randomized within each block.
2.4 Scale normalization
The outcome of the main task, for a given participant, is a set of attenuations (gains with negative values) applied to the FM in dB re: the maximal possible level. However, as sound delivery was not calibrated, the maximal possible level is unknown. To plot the results in a convenient format, we used an arbitrary dB scale intended to approximate the dB SPL scale. To do so, all attenuations at threshold in quiet within the range 1–4 kHz from all selected participants (N = 89, see below) were concatenated. The median value was found to be −57.0 dB. A constant +67.0 dB was then added to all attenuations, so that audibility thresholds in quiet on our arbitrary dB scale would be around 10 dB in the 1–4 kHz range, which is the approximate value in dB SPL for absolute thresholds in normal-hearing listeners (Moore, 2013). Note that the normalization is only intended for display purposes and that the resulting scale should not be identified as the dB SPL scale.
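The normalization amounts to a single additive constant derived from the pooled data; a sketch, with function names of our choosing:

```python
import numpy as np

def display_offset(quiet_attens_1to4k_db, target_db=10.0):
    """Additive constant mapping the pooled median attenuation at
    threshold in quiet (1-4 kHz, all selected participants) onto
    `target_db` of the arbitrary display scale, mimicking
    normal-hearing absolute thresholds in dB SPL."""
    return target_db - float(np.median(quiet_attens_1to4k_db))

def normalize_scale(attens_db, offset_db):
    """Apply the display offset to a set of attenuations (dB)."""
    return np.asarray(attens_db, dtype=float) + offset_db
```

With a pooled median of −57.0 dB, `display_offset` returns the +67.0 dB constant used in the paper.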
2.5 Full procedure
Participants first read an online information sheet about the experiment and provided their informed consent by ticking boxes. Then, several preliminary steps followed before the main task: (1) An initial volume calibration was performed. A sample of the TEN used in the main task was played, and participants were asked to adjust the volume of this noise to a loud but comfortable level, using the volume controls on their computer. (2) Participants were required to wear headphones to pass a “headphone check” (Woods et al., 2017). Progress to the next step was not possible until the headphone check was successfully completed. It was further emphasized that the headphones had to be kept on throughout the experiment. (3) A second volume calibration was performed. A single audibility threshold was measured for a carrier frequency of 1000 Hz. If the corresponding attenuation was not between −80 and −50 dB, participants were instructed to increase or decrease the overall volume on their computer accordingly. This step was repeated until successful completion. After this final volume calibration, participants were instructed not to change their computer volume for the rest of the experiment. Finally, the main task was initiated. The online test is available for demonstration purposes [see Bravard et al. (2023)].
2.6 Participants

As pre-registered, a cohort of 100 participants was recruited through the Prolific platform (Prolific, 2023). The inclusion criteria were: being over 18 years old, having English as their first language, and having no known hearing impairment. Participants were also required to use a desktop computer with the Chrome browser, for technical reasons. They were paid 3.5 € for the experiment, which lasted about 20 min. The cohort included 49 female and 51 male participants. The age distribution of the cohort, in years, was as follows: M = 38.5; SD = 14; Min = 20; Max = 78.
2.7 Exclusion criteria
Upon inspection of the data, the following changes were made to the pre-registered exclusion criteria. First, in spite of the broad range of attenuations available (−95 to −1 dB), some participants reached floor or ceiling in the main task. For those participants, the true threshold is unknown, and it is likely that they did not perform the main task according to the instructions, or even that they did not pay attention to the sounds. Thus, as a first criterion, we excluded all participants who provided at least one threshold adjustment, either in quiet or in TEN, that was either at floor or at ceiling relative to the measurement range. This first criterion excluded four participants.
Second, some participants exhibited lower (better) thresholds in noise than in quiet. Such a reversal of expected performance is extremely unlikely, if not impossible, if participants performed the main task according to the instructions. It could be that those participants did not pay attention to the task, or that contrary to the instructions they changed the overall volume of their computer between the two experimental blocks. Thus, as a second criterion, we excluded all participants who provided at least one threshold value that was lower in noise than in quiet. This second criterion excluded nine participants, some of them having already failed the first criterion.
Third, we initially pre-registered an exclusion criterion for participants who did not produce attenuations of at least −25 dB in quiet for all frequencies (thresholds better than 42 dB in our arbitrary scale). This criterion would have excluded 19 participants, 10 of whom passed the first two criteria. Upon inspection of the data, it appeared that the effect of noise for those 10 participants was in fact highly similar to the effect of noise observed for the rest of the cohort. We thus did not exclude these participants from the main analysis, but we provide an additional descriptive analysis to justify their inclusion in the main group.
3.1 Planned analyses
From the cohort of 100 participants recruited through the Prolific platform, 89 participants were retained for the main analysis. Those participants produced thresholds both away from floor or ceiling and always higher in noise than in quiet (see Sec. 2.7). The distributions of thresholds for those 89 participants are shown in Fig. 1. The mean across participants, together with the standard deviation about the mean, is shown on top of the individual data and violin plots. Tables 1 and 2 further report the median value and inter-quartile range (25%–75%) at each frequency.
Table 1. Median and inter-quartile range (25%–75%) of audibility thresholds in quiet, at 125, 250, 500, 1000, 2000, and 4000 Hz.

Table 2. Median and inter-quartile range (25%–75%) of audibility thresholds in TEN, at 125, 250, 500, 1000, 2000, and 4000 Hz.
Let us first consider audibility in quiet, shown in the left part of Fig. 1. As expected, there was an overall trend for lower thresholds for higher frequencies, which correspond to the more sensitive regions of the normal-hearing audiogram (Yost, 2021). Note that beyond the audiogram, there could be several additional factors affecting thresholds: the frequency response of the audio equipment used by each participant, the attention given to the instructions, and the overall presentation level. Unlike in laboratory experiments, such parameters cannot be reliably controlled in online settings. As a consequence, large variability of thresholds was expected across participants. This is precisely what was observed. The spread of the results for each frequency was always more than 40 dB, with a large interquartile range throughout (Table 1), even though all participants reported normal hearing.
To formally test whether frequency had an effect on audibility thresholds, we performed a Bayesian one-way analysis of variance (ANOVA) using the JASP analysis program (van den Bergh et al., 2020). Audibility threshold in quiet was the dependent variable, with Frequency the factor (6 levels). The prior probability of both the Frequency and Null models was set to 0.5. The Bayesian ANOVA provided overwhelming evidence for the model including Frequency, with a probability of essentially 1 (log-BFM for the Frequency model was 100.6), thus showing that frequency had an effect on audibility thresholds.
Consider next the results with TEN, shown in the right part of Fig. 1. It is immediately apparent that noise had a large impact on audibility thresholds. First, the audibility thresholds curve was essentially flat relative to frequency. Median values were within 1 dB of each other between 250 and 4000 Hz, and within 2.5 dB of each other if the lowest frequency of 125 Hz was included (Table 2). Another major difference with the quiet condition is that the variability across participants was much reduced. Apart from a few outliers, most results were close to the mode of the distribution, resulting in violin plot distributions tightly focused around the mean value. This was reflected in smaller inter-quartile ranges (Table 2).
To test for an effect of Frequency on audibility in TEN, we performed a second one-way Bayesian ANOVA. The ANOVA produced moderate evidence for the Null model, with a Bayes factor of BFM = 4.4 in favor of the Null model compared to the model including Frequency. As pre-registered, we also performed the same ANOVA but excluding the lowest frequency of 125 Hz, as TEN was not tested for such a low frequency (Moore et al., 2000). Over the 250 Hz to 4000 Hz range, the ANOVA found overwhelming evidence in favor of the Null model, with a Bayes factor of BFM = 78.7.
Next, we formally tested the reduction in variability for audibility thresholds in TEN compared to in quiet. To do so, we performed a Levene's test on audibility at all frequencies (Anderson, 2006). This was achieved by first computing the squared distance of each threshold value to the group mean at the corresponding frequency, in the quiet and TEN conditions separately. The squared distances obtained in quiet and TEN were then compared through a paired t-test. There was a significant reduction in variability for the thresholds in TEN: t(533) = 10.84, p < 0.001. For consistency with the previous analyses, we also performed a Bayesian version of the t-test, which returned overwhelming evidence for a reduction in variability when TEN was applied (log-BF10 = 49.8).
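The variability analysis described above can be sketched as follows. This computes only the paired t-statistic (the reported analysis also included a Bayesian version run in JASP); the function name and simulated data are ours:

```python
import numpy as np

def variance_reduction_t(quiet, ten):
    """Levene-type test of the variability reduction: `quiet` and
    `ten` are (participants x frequencies) arrays of thresholds.
    Squared distances to the per-frequency group mean are computed
    in each condition, then compared with a paired t-test."""
    d_quiet = (quiet - quiet.mean(axis=0)) ** 2
    d_ten = (ten - ten.mean(axis=0)) ** 2
    diff = (d_quiet - d_ten).ravel()
    t = diff.mean() / (diff.std(ddof=1) / np.sqrt(diff.size))
    return t, diff.size - 1  # t-statistic and degrees of freedom
```

With 89 participants and 6 frequencies, the degrees of freedom are 89 × 6 − 1 = 533, matching the t(533) reported in the text.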
Finally, for comparison purposes, we converted the audibility thresholds obtained in TEN to signal-to-TEN ratios (STRs). For in-lab testing, the STR between the power of a pure tone at threshold and the power per ERB at 1 kHz of the TEN is about 0 dB [Moore et al. (2000), their Fig. 2]. Here, we cannot know the overall presentation level of the stimuli, but the attenuations of both TEN and FM tones were known. Thus, we could convert the respective attenuation values to STRs (see accompanying scripts). Using the median values of Table 2, we observed STRs at threshold between −2.0 dB and −1.0 dB for frequencies above 125 Hz, and of −3.5 dB for 125 Hz. This is broadly in line with in-lab results.
In summary, the masking noise successfully equated audibility thresholds over the whole frequency range tested, with higher confidence over the range 250–4000 Hz. Over this range, audibility threshold was essentially constant with a 1-dB precision. The STR at threshold was comparable to in-lab findings. Furthermore, the addition of the TEN considerably reduced the variability of audibility thresholds across individuals.
3.2 Additional analyses
We further replotted, separately, the data from the sub-group of 10 participants who produced poor audibility thresholds in quiet, according to the criterion defined in Sec. 2.7 (threshold higher than 42 dB for at least one frequency). There could be several reasons for poor audibility in online experiments. First, the overall volume may have been self-adjusted to be too low. Second, the instructions may not have been properly understood. Finally, poor audibility in quiet could reflect some form of hearing loss.
The results for the sub-group of 10 participants are shown in Fig. 2. The shape of the audibility curves in quiet, shown on the left panel, differs from average for the whole cohort. Some of the poor thresholds were observed for high frequencies, resulting in a non-monotonic relation between audibility thresholds and frequency for some participants. Such a pattern is suggestive of presbycusis, the most prevalent form of sensorineural hearing loss associated with age. Obviously, given the lack of control of the audio reproduction equipment, such a speculation must remain tentative. Interestingly, this sub-group of participants had a higher mean age than the full cohort, but was not exclusively composed of older listeners: M = 48; Min = 21; Max = 78.
Importantly, however, audibility in TEN for this sub-group of participants exhibited the same features as for the whole cohort: audibility thresholds did not vary with frequency and individual variability was small. Thus, even for participants with atypical online audibility curves, the TEN seems to have succeeded in equating audibility over all tested frequencies and reducing individual variability.
In a large cohort of online participants (N = 100 overall, N = 89 after exclusion criteria), we measured the audibility of FM tones over a range of carrier frequencies (125–4000 Hz). In quiet, we observed an increase in average audibility with frequency, as expected from the normal-hearing audiogram. However, there was a large spread of audibility across participants. Our main result is that, even with such large variability in quiet, embedding tones in TEN successfully equated audibility over a broad frequency range. From 250 Hz to 4000 Hz, median audibility was constant within a 1-dB range, with reduced variability across participants. The STR for the median audibility threshold, about −1 dB, was fully consistent with in-lab reports (Moore et al., 2000). Thus, by embedding stimuli in TEN, online experiments can control sensation level.
The robustness of the TEN technique for online settings was unknown. We used a two-step volume calibration procedure, but even so, the presentation level self-adjusted by online participants likely varied over a broad range, as suggested by the results in quiet. Participants could use any kind of computer sound card combined with any kind of headphones, which was another source of variability. Participants reported normal hearing, but this could not be independently verified. In spite of all those sources of uncontrolled variability, using TEN successfully equated audibility. Moreover, the efficiency of TEN generalized to atypical listeners, who displayed poor audibility at some frequencies, potentially related to hearing loss. This may seem surprising, as hearing loss is often associated with a broadening of auditory filters (Glasberg and Moore, 1986), which deviates from the underlying model for spectrally shaping the TEN. However, we only have a small sample of such listeners (N = 10) and the source of the atypical audibility cannot be established online, so such an intriguing result needs to be further investigated. For practical purposes, nevertheless, this is another robust feature of the TEN.
There are limitations to the present study. The set of preliminary steps that was used resulted in successful equalization of audibility with TEN. However, we have not tested the relative importance of each step. For instance, it may be possible to simplify the volume calibration steps (Zhao et al., 2022) or even to allow the participants to employ loudspeakers rather than headphones. This remains to be tested. Also, FM tones were used with the aim of measuring audibility over half-octave frequency bands. Finer sampling is needed to check that equalization is not disrupted by micro-variations in the audiogram or the audio equipment of the participant.
Given these results, what are the benefits of using TEN in online experiments? An important use case from our perspective is to control the sensation level of tonal components in experiments where individual differences are investigated. On the one hand, the large and diverse participant pools enabled by online testing are ideally suited for the investigation of individual differences. On the other hand, without controlling for audibility, any perceptual difference observed may have a trivial origin. In the general case, controlling audibility is desirable to guide the interpretation of experiments where frequency is a parameter. It should be noted that TEN has other potential benefits. For instance, it will mask distortion products generated by harmonic sounds with missing fundamentals (Oxenham et al., 2011). TEN may also play the role of a portable “poor man's” audiometric booth, by isolating the participant from unexpected and fluctuating external noises, which may cause attention lapses or intermittent partial masking.
There are also drawbacks to embedding stimuli in TEN. The main issue we can identify is that partially masking the stimuli of interest may blur some details of complex sounds and generally limit the useful dynamic range over which stimuli can be presented. Indeed, the TEN has to be loud enough to provide partial masking, so the level of added acoustic components is limited by clipping and by comfortable listening levels. Also, the loudness of acoustic components above audibility threshold in TEN will increase faster with level than in quiet, because of loudness recruitment (Moore, 2013). Finally, the TEN introduces random amplitude modulations in all auditory filters, which produces modulation masking for complex signals such as speech (Stone et al., 2011).
These drawbacks are mitigated by two factors. First, the use of high sensation levels far above the noise would likely result in unequal loudness across frequencies anyway, because of loudness recruitment, which would defeat one purpose of using TEN in the first place. Second, even though dynamic range is clearly limited, it should still be sufficient for complex stimuli such as speech. As an illustration, we consider a speech sentence embedded in TEN [see Bravard et al. (2023)]. The STR is +15 dB if computed by considering the total power of the speech sentence and the noise power within a 1-ERB-wide band around 1 kHz (Moore et al., 2000), or −1 dB if computed considering the total power of noise. For normal-hearing listeners, there should be a broad range of usable STRs over which the speech remains fully intelligible.
In order to control for audibility without imposing dynamic range limitations, one could measure audibility in quiet for each participant at the frequencies of interest, and then apply inverse amplification for the main task. Such a strategy may be practical with fast and efficient threshold measurements. However, any measurement error in the audibility-in-quiet step will affect results for the main task. The trade-off between this strategy and the TEN-based method may be the subject of future investigations. In the meantime, our study shows that some features of what made the success of the psychoacoustics approach can be translated online, opening up practical ways to address questions that would be hampered by the time and cost associated with in-lab studies.
The full dataset is available on the ResearchBox platform, together with sound examples and MATLAB generation scripts (Bravard et al., 2023).
We would like to thank Brian C. J. Moore for suggesting the threshold scale conversion and for numerous helpful comments on a previous version of the manuscript. This work was supported by the Agence Nationale de la Recherche, Grants Nos. ANR-22-CE28-0023-01 and ANR-17-EURE-0017. The authors declare no conflict of interest. The experimental protocol was approved by the INSERM Institutional Review Board (IRB00003888), Avis 21-858.