This study assessed the detection of mistuning of a single harmonic in complex tones (CTs) containing either low-frequency harmonics or very high-frequency harmonics, for which phase locking to the temporal fine structure is weak or absent. CTs had F0s of either 280 or 1400 Hz and contained harmonics 6–10, the 8th of which could be mistuned. Harmonics were presented either diotically or dichotically (odd and even harmonics to different ears). In the diotic condition, mistuning-detection thresholds were very low for both F0s and consistent with detection of temporal interactions (beats) produced by peripheral interactions of components. In the dichotic condition, for which the components in each ear were more widely spaced and beats were not reported, the mistuned component was perceptually segregated from the complex for the low F0, but subjects reported no “popping out” for the high F0 and performance was close to chance. This is consistent with the idea that phase locking is required for perceptual segregation to occur. For diotic presentation, the perceived beat rate corresponded to the amount of mistuning (in Hz). It is argued that the beat percept cannot be explained solely by interactions between the mistuned component and its two closest harmonic neighbors.
Pitch is an important attribute of sound that is crucial for the perception of music and of speech intonation and that plays an important role in the segregation of sounds in complex auditory scenes [see, e.g., Brokx and Nooteboom (1982), Scheffers (1983), Hartmann (1996), Vliegen and Oxenham (1999)]. A series of simultaneous frequency components with frequencies corresponding to consecutive integer multiples of a common fundamental frequency (F0) is usually perceived as a single auditory object with a residue pitch corresponding to the F0 (Schouten, 1940; Darwin and Carlyon, 1995; Carlyon and Gockel, 2008). When a single low harmonic is mistuned sufficiently, it can be heard as a separate component, “popping out” from the rest of the complex tone, while when a high harmonic is mistuned, the dominant percept is a temporal fluctuation (beats) (Moore et al., 1985; Moore et al., 1986; Hartmann et al., 1990; Lee and Green, 1994). Research on this topic has been restricted to complex tones with low F0s and harmonic frequencies below about 4–5 kHz, i.e., to a frequency range for which some degree of phase locking is assumed to be present in the auditory nerve (Johnson, 1980; Palmer and Russell, 1986). The present study investigated mistuning perception for complex tones within a low-to-medium frequency range and for complex tones with component frequencies restricted to a very high frequency region (at or above 8.4 kHz), where phase locking to the temporal fine structure is generally assumed to be weak or absent (Verschooten et al., 2019). The main aim was to assess whether mistuning detection has similar properties in the presence and absence of phase locking. A second aim was to assess the perceived beat rate (when beats are perceived) and to relate this to the origin of the beats.
As noted above, for complex tones with low F0s and harmonics below about 4–5 kHz, mistuning of a single harmonic has several perceptual effects. When the mistuned component is of low rank (up to about the 6th harmonic) and is resolved in the auditory periphery, and when it is mistuned by a few percent, subjects report hearing a separate tone with a pitch corresponding approximately to the frequency of the mistuned harmonic, in addition to the lower residue pitch of the remainder of the complex, i.e., the mistuned harmonic is perceptually segregated and appears to “pop out” (Moore et al., 1985; Moore et al., 1986). This probably happens because the mistuned component is no longer consistent with a harmonic series; for example, it no longer passes through the “harmonic sieve” for that F0 (Duifhuis et al., 1982). The pop out may provide a cue for detection of the mistuning. In contrast, for mistuned harmonics of higher rank (above about the 8th), which are unresolved in the auditory periphery (Moore and Gockel, 2011), the mistuning may be detected via “beats” or temporal fluctuations in the signal that arise from the interaction between the mistuned component and the other components (see footnote 1). In this case, thresholds for mistuning detection decrease markedly with increasing signal duration (Moore et al., 1985); for long-duration stimuli, the thresholds can be less than 0.1% of the frequency of the mistuned component, which is below the threshold for detecting a change in frequency of the component when presented in isolation.
There is growing evidence that a residue pitch can be perceived for complex tones with (partially) resolved harmonics for which all component frequencies are above the presumed limit of phase locking and that this pitch is not based on the repetition rate of the temporal envelope, as can be the case for complex tones with low F0s containing only unresolved harmonics (Oxenham et al., 2011; Lau et al., 2017; Carcagno et al., 2019; Gockel et al., 2020). Gockel and Carlyon (2021) showed that the residue pitch derived from very high-frequency components (at or above 8.4 kHz) was sufficiently salient to allow musical interval adjustments by musically trained listeners, although the pitch was clearly less salient than that derived from complex tones with harmonics of the same rank but in a lower frequency region. Lau et al. (2017) suggested that the residue pitch evoked by harmonics with very high frequencies resulted from activation of central, possibly cortical, pitch-sensitive neurons that respond selectively to a combination of harmonics, but that do not respond to any individual harmonic, and that can be activated by rate-place information in the absence of phase locking.
The present study investigated the detection of mistuning imposed on a single harmonic within complex tones resembling those used by Lau et al. (2017), Gockel et al. (2020), and Gockel and Carlyon (2021). The complex tones contained harmonics 6–10 with an F0 of either 1400 Hz, with the lowest harmonic at 8400 Hz, for which phase locking is assumed to be weak or absent, or 280 Hz, with the lowest harmonic at 1680 Hz, for which phase locking would occur. In the first two experiments, the ability of subjects to detect mistuning imposed on the 8th harmonic was measured in a diotic condition, for which harmonics were expected to interact to a certain degree in the auditory periphery, and in a dichotic condition for which even harmonics and odd harmonics were presented to opposite ears and where the harmonics were expected to interact much less in the periphery. If the detection of mistuning for very high frequencies is based on the mistuned component not fitting with a central harmonic template, then performance should be similar for the diotic and dichotic conditions. If, on the other hand, it depends on temporal fluctuations produced by the interaction of harmonics, then performance should be worse for the dichotic than for the diotic condition, because of the greater separation of components in a given ear for dichotic presentation. The main objective of experiments 1 and 2 was to compare performance in the two frequency regions and to assess whether the same cues were used in the two regions.
Lee and Green (1994) suggested that the beats produced by mistuning a single harmonic, when heard, correspond to a modulation component that is apparent in the power spectrum of the envelope at the output of a simulated (rounded-exponential, roex) auditory filter (Patterson and Moore, 1986) centered on the mistuned harmonic, and that has a rate corresponding to twice the amount of frequency shift (ΔF) applied to the mistuned harmonic. However, they did not assess the number of beats perceived by their subjects and therefore it is unclear whether the detection and perception of mistuning are indeed determined by the dominant modulation at the output of a filter centered close to the frequency of the mistuned component. The third experiment of the present paper assessed the beat rate heard by the subjects in the diotic conditions, in an attempt to clarify the processes underlying the beat percept.
II. GENERAL METHODS
A. Subjects and screening procedure
Eight young normal-hearing subjects (5 females, 3 males) between 20 and 28 years of age (mean age of 23 yr) participated in the main experiments. All of them had some musical training, but none was a professional musician or had absolute pitch. The average number of years of musical training/practice was 14.5 (ranging from 5 to 21 yr). Subjects 1, 7, and 8 started playing the violin or cello from age 7 or earlier and played for at least 10 yr. Subjects 2, 4, and 6 started playing piano from ages 8, 9, and 4 and had played for at least 5 yr. Subject 3 started playing the trombone at age 10 and had played for 8 yr, while subject 5 had sung for 15 yr. All except subjects 3 and 4 were active hobby musicians.
To ensure audibility of the high-frequency tones and basic frequency discrimination ability, subjects had to pass a three-stage screening, as in Lau et al. (2017), Gockel et al. (2020), and Gockel and Carlyon (2018, 2021), to be eligible for the main part of the study. The screening comprised:
Measurement of pure-tone audiometric thresholds in quiet at octave frequencies from 0.25 to 8 kHz and at 6 kHz, using a Midimate 602 audiometer (Madsen Electronics, Minneapolis, Minnesota, USA). Thresholds had to be ≤ 15 dB HL at all frequencies for subjects to pass this stage.
Measurement of thresholds for detecting 210-ms sinusoidal tones (including 10-ms onset and offset Hanning-shaped ramps) at 10, 12, 14, and 16 kHz in a continuous threshold-equalizing noise (TEN) (Moore et al., 2000) extending from 0.02 to 22 kHz. This was done separately for each ear. The TEN had a level of 45 dB SPL/ERBN at 1 kHz, the same as used in the experiments (see below), where ERBN stands for the average value of the equivalent rectangular bandwidth of the auditory filter for young normal-hearing listeners tested at low sound levels (Glasberg and Moore, 1990). A two-interval two-alternative forced-choice task (2I-2AFC) with a 3-down 1-up adaptive procedure was used, estimating the 79.4% correct point on the psychometric function (Levitt, 1971). The initial step size was 5 dB, and the step size was reduced to 1 dB after two reversals. The adaptive track terminated after 10 reversals, and the threshold was calculated as the mean of the levels at the last six reversals. Three thresholds, i.e., three adaptive tracks, were obtained for each frequency and ear. The final threshold was the mean of these three thresholds. Subjects' masked thresholds had to be ≤ 45 dB SPL up to 14 kHz, and ≤ 50 dB SPL at 16 kHz.
Difference limens for F0 (F0DLs) and difference limens for frequency (FDLs) were measured in quiet for diotically presented complex tones containing harmonics 6–10 with an F0 of 280 or 1400 Hz (the same tones as used in the main experiment, i.e., with edge component levels 6 dB below that of the other components, but without level randomization; see below) and for the components of the complex tones presented in isolation, respectively. A 2I-2AFC task with a 3-down 1-up adaptive procedure was used. Subjects were asked to indicate the tone with the higher pitch and they received correct-answer feedback after each trial. The signal duration was 210 ms (including 10-ms onset and offset Hanning-shaped ramps). The inter-stimulus interval was 500 ms. At the beginning of each adaptive track, the difference in F0 (or frequency) was 20%. This was reduced (or increased) by a factor of 2 until two reversals occurred, by a smaller factor until two more reversals occurred, and by a factor of 1.2 thereafter. After 12 reversals, the adaptive track terminated and the threshold was calculated as the geometric mean of the frequency differences at the last eight reversals. Three adaptive tracks were obtained for each condition, and the final threshold was taken as the geometric mean of these three thresholds. FDLs and F0DLs had to be < 6% in the low frequency region and < 20% in the high frequency region.
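The 79.4% figure quoted for the 3-down 1-up rule follows from the equilibrium condition described by Levitt (1971): the track settles where three consecutive correct responses are as likely as not, i.e., where p³ = 0.5. A minimal sketch of that arithmetic is given below; the function name is illustrative and not taken from the study's software.

```python
# Convergence point of an n-down 1-up staircase (Levitt, 1971).
# The track is in equilibrium when the probability of n consecutive
# correct responses equals 0.5, i.e. p**n = 0.5.

def convergence_point(n_down: int) -> float:
    """Proportion correct tracked by an n-down 1-up rule."""
    return 0.5 ** (1.0 / n_down)


if __name__ == "__main__":
    # For the 3-down 1-up rule used in the screening and in experiment 1
    print(f"3-down 1-up tracks {100 * convergence_point(3):.1f}% correct")
```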
Initially, 25 musically trained subjects between 18 and 28 years old were tested, eight of whom passed all screening stages. Three failed at the first stage of the screening, 11 at the second stage, and three at the last stage. All eight subjects had participated in another experiment involving high-frequency tones before data collection for the present study commenced. Informed consent was obtained from all subjects. This study was carried out in accordance with the UK regulations governing biomedical research and was approved by the Cambridge Psychology Research Ethics Committee.
In all experiments, the harmonic complex tones contained harmonics 6–10. In the high-frequency condition, the F0 was 1400 Hz with the lowest frequency component at 8400 Hz, and in the low-frequency condition the F0 was 280 Hz with the lowest frequency component at 1680 Hz. If mistuning was present, the 8th harmonic was randomly shifted up or down from its nominal frequency of 11 200 Hz and 2240 Hz for the high and the low F0s, respectively. In the diotic condition, all components were presented to both ears. In the dichotic condition, odd harmonics were presented to the left ear and even harmonics to the right ear. For each presentation, the starting phases of all components were randomized. The mean level per component was 55 dB SPL for harmonics 7–9 and 49 dB SPL for harmonics 6 and 10; the level of the edge components was reduced to minimize edge pitches (Kohlrausch and Houtsma, 1992). The individual component levels were randomized independently by ±3 dB about the mean level for each presentation, following a uniform distribution. The randomization of individual component levels and starting phases was done to weaken envelope cues for pitch. The duration of the tones varied across conditions (see below), and always included 10-ms onset and offset Hanning-shaped ramps. To mask possible distortion products, the tones were presented in a background of a continuous TEN that extended from 0.02 to 22 kHz and had a level of 45 dB SPL/ERBN at 1 kHz. The TEN was presented diotically in the diotic condition. In the dichotic condition, independent TEN was presented to each ear. These stimuli were similar to those used by Lau et al. (2017), Gockel et al. (2020), and Gockel and Carlyon (2021).
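The component layout described above can be summarized in a short sketch. The code below is not the authors' MATLAB implementation; it is a hypothetical Python illustration that only builds the per-component parameters (frequency, roved level, random starting phase, and ear assignment) and leaves out waveform synthesis, ramps, and the TEN background. The function name and dictionary keys are assumptions made for illustration.

```python
import math
import random


def build_complex(f0, mistune_pct_f0=0.0, dichotic=False, rng=None):
    """Return one parameter dict per component for harmonics 6-10 of f0 (Hz).

    mistune_pct_f0: shift of the 8th harmonic in percent of F0 (0 = in tune).
    The direction of the shift (up or down) is drawn at random.
    """
    rng = rng or random.Random()
    sign = rng.choice((-1, 1))
    components = []
    for h in range(6, 11):
        freq = h * f0
        if h == 8:
            freq += sign * f0 * mistune_pct_f0 / 100.0
        # Edge components (6 and 10) are 6 dB below the inner components.
        mean_level = 49.0 if h in (6, 10) else 55.0
        level = mean_level + rng.uniform(-3.0, 3.0)  # +/-3 dB uniform rove
        phase = rng.uniform(0.0, 2.0 * math.pi)      # random starting phase
        if dichotic:
            # Odd harmonics to the left ear, even harmonics to the right ear.
            ears = ("left",) if h % 2 else ("right",)
        else:
            ears = ("left", "right")
        components.append({
            "harmonic": h,
            "freq_hz": freq,
            "level_db_spl": level,
            "phase_rad": phase,
            "ears": ears,
        })
    return components


if __name__ == "__main__":
    for c in build_complex(1400, mistune_pct_f0=50.0, dichotic=True):
        print(c["harmonic"], round(c["freq_hz"], 1), c["ears"])
```

In the dichotic case this leaves harmonics 7 and 9 in one ear and 6, 8, and 10 in the other, so the spacing between neighboring components within an ear is 2F0 rather than F0.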
All stimuli were generated digitally in MATLAB (The MathWorks, Natick, MA) with a sampling rate of 48 kHz. They were played out through four channels (two for the continuous noise and two for the signals) of a Fireface UCX (RME, Germany) soundcard using 24-bit digital-to-analog conversion, and were attenuated independently with four Tucker-Davis Technologies (TDT, Alachua, FL) PA4 attenuators. They were mixed using two TDT SM5 signal mixers and fed into a TDT HB7 headphone driver, which also applied some attenuation. Stimuli were presented via Sennheiser HD 650 headphones (Wedemark, Germany), which have an approximately diffuse-field response. The specified sound levels are approximate equivalent diffuse-field levels. Subjects were seated individually in a double-walled, sound-insulated booth (IAC, Winchester, UK).
Shapiro-Wilk tests were used to assess whether the data were normally distributed. For statistical analysis, repeated-measures analyses of variance (RM-ANOVA) were calculated using SPSS (Chicago, IL). Throughout the paper, if appropriate, the Huynh-Feldt correction was applied to the degrees of freedom (Howell, 2009). In such cases, the original degrees of freedom and the corrected significance values are reported. All modelling was done in MATLAB (The MathWorks, Natick, MA) version R2014a using customized scripts and freely available software (version 2.0 of the Auditory Toolbox of Slaney, 1998).
III. EXPERIMENT 1: THRESHOLDS FOR MISTUNING DETECTION
The objective was (i) to determine thresholds for the detection of mistuning of a single harmonic in a harmonic complex tone for which all component frequencies were above the presumed upper limit of phase locking and (ii) to compare these to thresholds obtained with the same subjects for similar stimuli but in a lower frequency region, where phase locking is presumably available. Of particular interest was whether the same cues would be used in the two frequency regions.
Stimuli were presented either diotically or dichotically. For harmonic complex tones with harmonics 6–10 as used here, the pitch corresponds to the F0 for both the dichotic and the diotic presentation, for both the low frequency region (Bernstein and Oxenham, 2008; Gockel and Carlyon, 2021) and the very high frequency region (Gockel and Carlyon, 2021). Therefore, in the dichotic condition, where the frequency spacing between components within a given ear was doubled, peripheral resolvability of harmonics was increased without doubling the F0 and pitch, and thus without reducing the harmonic ranks of components in a central pitch template, while keeping the component frequencies constant.
Three tone durations were used, with the expectation that thresholds would decrease markedly with increasing duration if beat cues were used, while thresholds should be less affected by duration in the absence of usable beats (Moore et al., 1985). For the diotic condition, beat cues were expected to be used for the low-F0 complex in the low frequency region, especially for longer-duration tones (Moore et al., 1985; Lee and Green, 1994). It has been suggested that at high frequencies the relative bandwidths of the auditory filters are much smaller than has commonly been assumed, at least at very low sound levels (Shera et al., 2002). For a given harmonic rank, this would lead to less interaction with neighboring harmonics in the auditory periphery for the high than for the low frequency region. If this characterization of auditory filter bandwidth were to hold at the medium sound levels used in the present study, one might expect less-salient beat cues and possibly higher thresholds for mistuning detection in the high than in the low frequency region. For the dichotic condition, beat cues were expected to be greatly reduced in salience and it was expected that for the low F0 the mistuned component would be perceptually segregated from the remainder of the complex and heard as a separate auditory object (Moore et al., 1985; Bernstein and Oxenham, 2003). For the high F0, a similar percept and thresholds (expressed as a percentage of the F0) as for the low frequency region would be expected if the process(es) underlying the segregation of the mistuned component did not require phase locking and the harmonic templates in the high and low frequency regions had similar tolerances for passing or rejecting a mistuned component, provided that processing of high-frequency information was not degraded for other reasons, as considered in Sec. V.
B. Stimuli and procedure
The tones had a duration of 0.21, 1.0, or 2.0 s, including 10-ms rise/fall ramps. Thresholds for the detection of mistuning were measured using a 2I-2AFC task. In one randomly chosen interval, the harmonic complex was presented. The other interval contained the complex in which the 8th harmonic was randomly and with equal probability shifted up or down from its nominal frequency by ΔF. The two tones were separated by a 500-ms silent interval. Subjects indicated which of the two intervals contained the complex with the mistuned harmonic and received visual feedback as to whether their answer was correct. A 3-down 1-up rule tracking 79% correct was used. In preliminary testing subjects reported that in the diotic condition they detected the mistuning on the basis of beats. Because beats are most salient, i.e., the fluctuation strength is highest, for rates around 4–5 Hz (Terhardt, 1968; Fastl, 1983; Kohlrausch et al., 2000), the starting value of the mistuning was set to ±0.1% and ±0.7% of the F0 for the high and the low F0, respectively; this gave shifts of about 2 Hz for the two F0s. The same starting mistuning was used in the dichotic condition, even though the thresholds were expected to be higher. The initial step size was a factor of two. The step size decreased to a factor of 1.41 after two reversals and to a factor of 1.2 after four reversals. Eight reversals were collected at the smallest step size, and the threshold was taken as the geometric mean of the frequency shifts at the last eight reversals. The final threshold was the geometric mean of the thresholds from six adaptive tracks, except for subject 4, for whom only three adaptive tracks were collected in the dichotic conditions, due to time constraints.
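The step-size schedule and threshold rule just described can be sketched as a small simulation. This is a hypothetical Python illustration, not the software used in the study: the listener is replaced by an arbitrary `respond` callable, the `max_trials` guard is an added safety assumption, and the exact trial on which a new step size first takes effect is one plausible reading of the text.

```python
import math


def step_factor(n_reversals):
    """Multiplicative step: 2 at first, 1.41 after two reversals, 1.2 after four."""
    if n_reversals < 2:
        return 2.0
    if n_reversals < 4:
        return 1.41
    return 1.2


def run_track(respond, start, max_trials=2000):
    """Simulate one 3-down 1-up track on the mistuning (any consistent unit).

    respond(delta) -> True if the (simulated) listener answers correctly.
    Returns the geometric mean of the last eight of twelve reversals.
    """
    delta = start
    reversals = []       # mistuning values at which the track changed direction
    run_of_correct = 0
    last_direction = 0   # -1 = mistuning decreasing, +1 = increasing
    for _ in range(max_trials):
        if respond(delta):
            run_of_correct += 1
            if run_of_correct < 3:
                continue             # need three in a row before stepping down
            direction = -1
        else:
            direction = +1
        run_of_correct = 0
        if last_direction and direction != last_direction:
            reversals.append(delta)
            if len(reversals) == 12:  # 4 at larger steps + 8 at the smallest
                break
        last_direction = direction
        factor = step_factor(len(reversals))
        delta = delta / factor if direction < 0 else delta * factor
    else:
        raise RuntimeError("track did not finish within max_trials")
    last_eight = reversals[-8:]
    return math.exp(sum(math.log(r) for r in last_eight) / len(last_eight))


if __name__ == "__main__":
    # Idealized listener who is correct whenever the mistuning is >= 1% of F0.
    print(run_track(lambda d: d >= 1.0, start=0.7))
```

With an idealized listener of this kind the track ends up alternating between one value just below and one just above the listener's limit, and the geometric mean of the reversals falls between the two.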
The maximum frequency shift was limited to 50% of the F0, to avoid the mistuned component getting close to one of its neighboring components. If the adaptive procedure required a mistuning larger than 50%, it was set to 50%, and if this happened more than six times in a row, the track was aborted and the threshold for that track was set to 50%. This happened mainly for the high F0 in the dichotic condition. In addition, the threshold was set to 50% for all completed individual tracks in which the upper limit for ΔF of 50% was reached more than once for the 1400-Hz F0 and more than twice for the 280-Hz F0. The different rule for the two F0s was chosen because the starting ΔF as a factor was closer to the upper limit for the low F0 than for the high F0, and therefore the upper limit was more likely to be hit by chance for the low F0. The final threshold across all tracks was set to 50% if half or more of the tracks gave a threshold of 50%. Simulations of random responses showed that with these settings, the probability of obtaining a final threshold value that was smaller than 50% of the F0 by chance was less than 1%.
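The ceiling rules above amount to a two-level censoring scheme: first per track, then across tracks. The sketch below is a hypothetical Python rendering of those rules, with illustrative names. It assumes that, when fewer than half of the tracks are at ceiling, the ceiling-valued tracks still enter the geometric mean, which is how the upward-pointing arrows in Fig. 1 are described.

```python
import math

CEILING_PCT = 50.0
# Number of ceiling hits tolerated in a completed track, per F0 (Hz).
MAX_CEILING_HITS = {1400: 1, 280: 2}


def track_threshold(raw_threshold_pct, ceiling_hits, f0, aborted=False):
    """Threshold for one track, in percent of F0, after applying the ceiling rule."""
    if aborted or ceiling_hits > MAX_CEILING_HITS[f0]:
        return CEILING_PCT
    return raw_threshold_pct


def final_threshold(track_thresholds_pct):
    """Combine tracks: 50% if half or more are at ceiling, else geometric mean."""
    n_tracks = len(track_thresholds_pct)
    n_at_ceiling = sum(t >= CEILING_PCT for t in track_thresholds_pct)
    if 2 * n_at_ceiling >= n_tracks:
        return CEILING_PCT
    log_mean = sum(math.log(t) for t in track_thresholds_pct) / n_tracks
    return math.exp(log_mean)


if __name__ == "__main__":
    # Illustrative values only, not data from the study.
    tracks = [track_threshold(12.0, 0, 280), track_threshold(9.0, 3, 280)]
    print(tracks, final_threshold(tracks))
```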
All subjects (except subject 4, who finished one practice track for each condition) received training until performance seemed to be stable, which took approximately 4 h. These data were discarded. The presentation order of the 12 conditions was quasi-random and counterbalanced across subjects. Three tracks were obtained within a condition before the next condition was run. For the experiment proper, data collection took on average about four to five sessions of 2 h each (including breaks) for each subject.
C. Results and discussion
Figure 1 shows thresholds for the detection of mistuning of the 8th harmonic (in Hz) as a function of the duration of the complex tone. Panels (A)–(H) show thresholds for each of the eight subjects. Cases where individual-track thresholds at 50% contributed to the final threshold estimate are marked with an upward-pointing arrow, as the true threshold might be underestimated. The geometric mean thresholds across all subjects are shown in Fig. 2. The black horizontal solid and dashed lines indicate the upper limit of mistuning at 50% of the low and of the high F0 in Hz, respectively. The pattern of thresholds across conditions was similar for most subjects.
For the diotic conditions (purple downward-pointing triangles and lines), thresholds decreased with increasing duration for both F0s, reaching values as small as 0.5–1 Hz for the 2-s duration (0.02%–0.04% and 0.004%–0.009% of the component frequency for the 280-Hz and the 1400-Hz F0, respectively). This is consistent with detection based on beat cues for both frequency regions and is consistent with subjects' reports of the cues used. There was no systematic difference between thresholds (in Hz) for the two F0s.
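Because thresholds are quoted in Hz, as a percentage of the F0, and as a percentage of the component frequency, a small conversion helper may make the relation between the three explicit. This is an illustrative sketch with a hypothetical function name; it assumes the mistuned component is the 8th harmonic, as in the experiments.

```python
def mistuning_as_percent(delta_hz, f0_hz, harmonic=8):
    """Express a mistuning in Hz as (percent of F0, percent of component frequency)."""
    pct_f0 = 100.0 * delta_hz / f0_hz
    pct_component = 100.0 * delta_hz / (harmonic * f0_hz)
    return pct_f0, pct_component


if __name__ == "__main__":
    for f0 in (280, 1400):
        for delta in (0.5, 1.0):
            pct_f0, pct_comp = mistuning_as_percent(delta, f0)
            print(f"F0 = {f0} Hz, shift = {delta} Hz: "
                  f"{pct_f0:.3f}% of F0, {pct_comp:.4f}% of the component")
```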
For the dichotic conditions (turquoise squares and lines), thresholds did not depend on the duration. Thresholds were markedly higher than in the diotic condition, at least for the longer durations. For the low F0, thresholds were between about 20 and 40 Hz (7%–14% of the F0; 0.9%–1.8% of the component frequency) for most subjects, but they were higher for subject 3 and not measurable for subject 4. These thresholds are somewhat larger than the previously reported difference limens for detecting a change in frequency of a single low harmonic in a complex tone, which were measured in lower background noise and for constant-amplitude sine-phase complexes (Moore et al., 1984; Gockel et al., 2007). Subjects reported that the mistuned component popped out from the remainder of the complex. Subject 4 found this condition very difficult and reported hearing out the component in both intervals, not just one. For the high F0, the adaptive tracks hit the upper limit most times and thresholds were unmeasurable; no popping-out or beating was reported by subjects. This is perhaps not surprising, since a mistuning of 0.5F0 corresponds to about a 6% change for the 8th harmonic, and this is close to or above the threshold for detecting a change in frequency of a single pure tone at 11.2 kHz, the frequency of the 8th harmonic for the high F0 (Lau et al., 2017; Gockel et al., 2020).
For statistical analysis, the thresholds (in Hz) were log-transformed to make them more normally distributed. The high-F0 dichotic condition, for which it was usually not possible to estimate a reliable mistuning threshold, was excluded from any statistical analysis. A repeated-measures analysis of variance (RM-ANOVA) on thresholds for the diotic conditions (with factors F0 and duration) showed a highly significant main effect of duration [F(2,14) = 31.87, p < 0.001]. The effect of F0 [F(1,7) = 0.09, p = 0.778] and the interaction between F0 and duration [F(2,14) = 1.94, p = 0.203] were not significant. These results suggest that on average the salience of beat cues in the diotic condition depended on the duration only and was not affected by the frequency region (or F0) of the complex. An RM-ANOVA on thresholds from the low-F0 conditions (with factors mode of presentation and duration) showed highly significant main effects of mode of presentation [F(1,7) = 65.83, p < 0.001], confirming that thresholds were significantly lower in the diotic than in the dichotic condition, and of duration [F(2,14) = 53.35, p < 0.001]. Importantly, the interaction between presentation mode and duration was also highly significant [F(2,14) = 61.70, p < 0.001], showing that the effect of duration differed significantly between diotic and dichotic conditions; for the low-F0 dichotic condition, there was no significant effect of duration [F(2,14) = 0.66, p = 0.534].
For the low-F0 conditions, the results can be compared with those from previous studies. Moore et al. (1985) measured thresholds for the detection of a mistuned harmonic in a monaurally presented 200-Hz F0 complex tone. For the eighth harmonic, the mean thresholds were about 25, 5, and less than 1 Hz, for stimulus durations of 110, 410, and 1610 ms, respectively. Given the differences between the two studies in stimulus duration, presence of background noise, and percent-correct values tracked, the results shown in Fig. 2 here for the diotic conditions are comparable to those of Moore et al. (1985).
For the dichotic conditions, the present data were compared to those obtained for detection of mistuning of the fourth harmonic in the study of Moore et al. (1985). This was done because dichotic presentation increases the resolvability of the components of a harmonic complex (Bernstein and Oxenham, 2003) and because mistuning detection in the dichotic condition was most likely based on the perceptual segregation of the mistuned component from the remainder of the complex. For the fourth harmonic, the mean thresholds in the study of Moore et al. (1985) were about 1.6%, 0.9%, and 0.2% of the harmonic frequency, for stimulus durations of 110, 410, and 1610 ms, respectively. In contrast, in the present study mean thresholds were about 1.8% of the harmonic frequency and did not decrease with increasing duration. It is possible that in the absence of background noise weak beating cues were available in the study of Moore et al. (1985), and that such cues were masked by the TEN background in the present study.
In a different study, Moore et al. (1986) measured thresholds for hearing out a mistuned harmonic as a separate tone, standing out from the harmonic complex as a whole. This task was different from the one used in Moore et al. (1985), where any available cue could be used to detect mistuning of a harmonic. For an F0 of 200 Hz, the mean thresholds for the fourth harmonic were about 2%, 2%, and 1.4% of the harmonic frequency, for stimulus durations of 110, 410, and 1610 ms, respectively. Thus, in the study of Moore et al. (1986), thresholds for hearing out the mistuned harmonic decreased slightly with increasing duration, but much less so than for the detection of mistuning. This supports the idea that in the present study in the dichotic condition, subjects used perceptual segregation of the mistuned harmonic as a cue. Given the differences between the procedures, the mix of conditions, and the general variability across subjects, there is good agreement between the studies of Moore et al. (1985) and Moore et al. (1986) and the results for the low-F0 conditions in the present study.
In summary, the main findings of experiment 1 were: (1) for the diotic conditions, where beat cues were available, there was no significant difference between thresholds in Hz for the two F0s, and (2) for the dichotic conditions, where beat cues appeared not to be usable, thresholds could be tracked only for the low-F0 conditions, while for the high-F0 conditions all subjects experienced difficulties.
IV. EXPERIMENT 2: DICHOTIC STIMULI – BLOCKED DESIGN FOR MISTUNING DETECTION
In experiment 1, the adaptive tracking procedure usually failed to converge on a threshold estimate for the high-F0 dichotic stimuli. The objective of experiment 2 was to assess whether subjects had some ability to detect mistuning of the 8th harmonic for the high-F0 dichotic complex tones, albeit at a performance level lower than that tracked by the adaptive procedure in experiment 1, and to compare performance with that for the low-F0 dichotic tones. In addition, it is possible that the adaptive change in ΔF increased the difficulty of the task by affecting the type of cue available to the subjects. Therefore, in experiment 2, a blocked design with selected fixed amounts of mistuning was used to minimize potential changes of the available cues from trial to trial. The results were intended to clarify whether weak cues for the detection of mistuning of a single harmonic were available in the very high frequency region.
A. Stimuli and procedure
The complex tones and the TEN background were the same as for the dichotic condition of experiment 1. The duration of the tones was 1 s. Percent correct was determined in a 2I-2AFC task (identical to that used in experiment 1) for several fixed values of ΔF in a blocked design. The maximum mistuning was set to 50% of the F0, so as to maximize the chance of perceptual segregation of the mistuned component. The minimum mistuning was set to 1.78% or 0.356% for the low and the high F0s, respectively. The two smaller percentages correspond to a shift of 5 Hz for the two F0s, and were chosen so as to maximize the salience of possible slow beats. Additional intermediate values of ΔF were employed for most of the subjects, depending on how much time they had available. This was done to check on the general form of the psychometric functions.
The same eight subjects as in experiment 1 took part. At least 200 trials were collected for each subject and each of the four main conditions (2 F0s × 2 ΔFs). Where additional intermediate values of ΔF were tested, at least 200 trials were collected for each ΔF value. Data were collected in blocks of 104 trials with a fixed condition. The first four trials within each block were considered warm-up trials and results for these were discarded. Within each block, 50 trials with negative and 50 trials with positive mistuning were randomly interleaved. The order of conditions across blocks and subjects was randomized. Data collection took on average about two to three sessions of 2 h each (including breaks) for each subject.
B. Results and discussion
Figure 3 shows percentage correct for the detection of mistuning of the 8th harmonic as a function of ΔF, with ΔF expressed as a percentage of the F0 of the complex tone. Panels (A)–(H) show results for the individual subjects. There is no indication that the general shape of the psychometric functions was unusual for either the low (solid symbols) or the high (open symbols) F0. This is consistent with the idea that, for a given F0, the type of cue used in the dichotic condition is probably constant across ΔF; the strength of that cue can of course vary with ΔF. For the low F0, performance increased to 80% correct or higher with increasing ΔF for all but one subject (subject 4). This is consistent with the results of experiment 1, where a threshold could be determined for all subjects except subject 4. In contrast, for the high F0, percentage-correct values were below 70% throughout, and any increases in percentage correct with increasing ΔF were very modest.
For the 280-Hz F0, the ΔF needed to achieve 79% correct was interpolated from the individual data. The geometric mean value across subjects of about 14.5% of the F0 (40.6 Hz) corresponded well to the mean threshold value of about 40 Hz found in experiment 1 for the 1-s duration.
Figure 4 shows the mean percentage correct across subjects for the four main conditions. Note that the data of subject 4 for the low-F0–50% condition were treated as outliers, and were not included in the average. For the mistuning of 5 Hz (0.2% and 0.04% of the component frequency for F0s of 280 and 1400 Hz, respectively), performance was at chance level for both F0s. This indicates that slow beat cues were not usable in the dichotic condition. When the mistuned component was shifted by 50% of the F0 (6.25% of the component frequency), performance was mostly very good (on average 94.6% correct) for the low F0. Subjects reported hearing the mistuned component as a separate tone. In contrast, for the 1400-Hz F0 performance was only just above chance; the average percentage correct of 57% corresponds to a d' value of 0.25, which is very small but significantly (more than three standard deviations) above zero (Jesteadt, 2005). Subjects did not report perceiving the mistuned component as segregated from the remainder of the complex. Instead, they reported using small timbre differences between the in-tune and the mistuned complex.
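The conversion from percentage correct to d' quoted above can be illustrated as follows. This is a sketch that assumes the standard equal-variance Gaussian model for an unbiased observer in a two-interval forced-choice task, for which d' = √2 · z(proportion correct); the function name is hypothetical, and the significance test attributed to Jesteadt (2005) is not reproduced here.

```python
from math import sqrt
from statistics import NormalDist


def dprime_2afc(prop_correct):
    """d' for an unbiased observer in a two-interval forced-choice task,
    assuming equal-variance Gaussian internal distributions."""
    z = NormalDist().inv_cdf(prop_correct)  # inverse of the standard normal CDF
    return sqrt(2.0) * z


# Mean proportion correct for the 1400-Hz F0 at a mistuning of 50% of the F0.
print(round(dprime_2afc(0.57), 2))  # 0.25
```

Under the same assumptions, chance performance (50% correct) maps onto a d' of zero.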
These results confirm and strengthen the findings of experiment 1. They also show that the difficulty of tracking a threshold for the high-F0 dichotic condition in experiment 1 was not due to the adaptive nature of the task per se, but instead was a consequence of very low levels of performance for all ΔFs in this condition.
V. INTERIM DISCUSSION
In the diotic conditions, there was no significant difference between the results for the two F0s. Thresholds for mistuning detection decreased markedly with increasing duration, suggesting that performance was based on beat cues, and consistent with subjective reports on the cues used. The beat cues seemed to be equally salient in the two frequency regions. Therefore, adjacent components clearly interacted with each other in the auditory periphery, even in the high frequency region, where the relative bandwidths of the auditory filters are smaller than at low and medium frequencies (Glasberg and Moore, 1990; Shera et al., 2002). It is possible that a different result would be obtained at lower stimulus levels.
In the dichotic conditions, where peripheral interaction of harmonics was much reduced, subjects reported perceiving the mistuned component as segregated from the remainder of the complex for the 280-Hz F0. For the 1400-Hz F0, performance was markedly worse than for the low F0 and the mistuned component was not reported as being perceived as a separate object from the complex tone. As mentioned above, the perceptual segregation of low-frequency harmonics presumably occurs because the mistuned harmonic is not consistent with a harmonic template or no longer passes through a “harmonic sieve” (Duifhuis et al., 1982). One might also think of mistuning detection as requiring detection of deviation from a regular spectral pattern: the “in-tune” harmonics form part of a regular pattern and the mistuned component deviates from that pattern (Roberts and Brunstrom, 1998). The pattern might be place-based or based on the time intervals between nerve spikes evoked by the different components.
In experiment 2 of the present study, the maximum ΔF used in the dichotic condition was 50% of the F0, but this did not lead to pop out for the high F0. As noted earlier, this corresponds to about 6% of the frequency of the mistuned component, which is close to or above the threshold for detecting a frequency change of the 8th harmonic (Lau et al., 2017; Gockel et al., 2020), and so, on the one hand, is perhaps not so surprising. On the other hand, the characteristics of a presumed template that allows for the “super-optimal” way that information from harmonics is combined in the high-F0 region (Lau et al., 2017; Gockel et al., 2020), are unclear. In particular, it was conceivable that it would be able to reject a component that was mistuned by 0.5F0, even when that component presented on its own does not have a very salient pitch. Nevertheless, pop out of a tone-like auditory object did not occur. One explanation is that the pop out requires the presence of a template based on phase locking, or that rejection by the template does not lead to pop out without phase locking.
Some alternative explanations for the lack of perceptual segregation of the mistuned component at high frequencies, which do not involve the absence of phase locking, should be considered. One is that harmonic-template neurons, which presumably form the basis of the hypothetical harmonic sieve, are poorly formed at very high frequencies, due to a lack of exposure to high-frequency resolved harmonics. However, many musical instruments produce tones that would have audible high-frequency resolved harmonics and bird song often contains tones with F0s above 3 kHz and with resolved harmonics in the frequency range 6–12 kHz (Hoese et al., 2000). Furthermore, natural soundscapes have considerable energy for frequencies above 5 kHz (Thoret et al., 2020). Finally, a template that is so poorly formed that it could not reject a component that is mistuned by 50% of the F0 would not seem to account for the finding that F0 discrimination for tones with very high harmonics is markedly better than the frequency discrimination of the individual components of the tones (Oxenham et al., 2011; Lau et al., 2017; Gockel et al., 2020).
Another explanation could be that the harmonic template in the high frequency region, while well formed, has a much larger tolerance, requiring a larger frequency deviation of a single component from its harmonic value before rejecting that component. However, Carcagno et al. (2019) argued that harmonic templates have similar tolerances for the high- and low-frequency regions. It is possible that a component in a high-frequency complex tone that is mistuned sufficiently is rejected by the harmonic sieve, and so does not contribute to the residue pitch of the complex tone, but is nevertheless not heard as a separate tone. This might happen because the pitch of pure tones with very high frequencies is less clear than the pitch of low-frequency pure tones [as indicated by larger frequency difference limens at very high frequencies (Sek and Moore, 1995; Moore and Ernst, 2012)], an effect that is enhanced in the presence of relatively high background TEN (Gockel et al., 2020), and the pitch would be even less clear when the mistuned component was presented in the presence of the other components of the complex tone (Moore et al., 1984; Gockel et al., 2007). Also, the TEN used here probably made it more difficult to hear the mistuned component as a separate tone.
The present results for the dichotic conditions support and extend the findings of a study by Hartmann et al. (1990). They asked subjects to match the pitch of an isolated sinusoidal tone to that of a mistuned harmonic in an otherwise harmonic complex tone. Their data showed a strong decrease in the ability to make these pitch matches when the frequency of the mistuned component increased to about 2–3.5 kHz. They attributed this decrease to the decline of phase-locking information with increasing frequency and the need for phase-locking information for perceptual segregation of the mistuned component from the remainder of the complex. Note, however, that the task used by Hartmann et al. (1990) requires the percept of a pitch corresponding approximately to the frequency of the mistuned component and the ability to do pitch matching. It is likely that the pitch of a very high frequency component presented within a complex tone is not clear, and also the pitch of the isolated comparison tone would be somewhat unclear, and this would affect the accuracy of pitch matching. In contrast, the task used in the present study made no additional demands on subjects and thus more unambiguously indicates the role of absolute frequency and possibly of phase-locking information in the segregation of a mistuned harmonic.
VI. EXPERIMENT 3
In the diotic conditions, thresholds for the detection of mistuning (in Hz) were similar for the high and the low frequency regions. Subjective reports and the dependence of thresholds on duration both strongly suggest that mistuning detection was based on beat cues. The objective of experiment 3 was to assess the perceived beat rate in the two frequency regions to elucidate the processes underlying the beat percept.
There are several ways in which beats could arise in principle. First, they might arise through periodic waveform fluctuations (including but not restricted to envelope fluctuations) arising from the interaction of stimulus components at the outputs of the auditory filters (Plomp, 1967). Second, beats might arise through interaction between combination tones produced in the cochlea and mistuned components. For example, for a 100-Hz F0 complex tone where the 8th harmonic is mistuned by 2 Hz, the 9th and the 10th harmonics would interact to give a combination tone at 800 Hz (the cubic difference tone, CDT, at frequency 2F1 − F2, where F1 and F2 are the frequencies of the two interacting components and F2 > F1), which in turn could interact with the mistuned harmonic at 802 Hz and result in a beat rate of 2 Hz. Both of the above possibilities involve peripheral mechanisms. A third way in which the beats might arise, possibly involving somewhat more central mechanisms, is that so-called “second-order” beats in the envelope may be produced through interaction between first-order beats (Lorenzi et al., 2001; Sinex et al., 2002; Sinex, 2008). For example, for a 100-Hz F0 complex tone where the 8th harmonic is mistuned by 2 Hz, components at 700 and 802 Hz would produce a (first-order) beat at 102 Hz, and components at 802 and 900 Hz would produce a (first-order) beat at 98 Hz. The beats between beats (the second-order beats) would have a rate of 4 Hz (102 Hz minus 98 Hz), corresponding to twice the amount of the frequency shift. Extending the range of harmonics involved can give a different prediction. For example, components at 900 and 1000 Hz would beat at 100 Hz, and components at 802 and 900 Hz would beat at 98 Hz. The beats between beats including channels with higher center frequencies would have a 2-Hz rate, corresponding to the amount of frequency shift. The same is true if one were to extend the range of harmonics involved to lower frequencies.
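The arithmetic of the worked example above (100-Hz F0, 8th harmonic mistuned by 2 Hz) can be reproduced with the short sketch below. It only computes the candidate beat rates named in the text; the function names and dictionary keys are hypothetical labels introduced for illustration, and nothing here models the auditory system.

```python
def cubic_difference_tone(f1_hz, f2_hz):
    """Frequency of the cubic difference tone 2*F1 - F2 (with F2 > F1)."""
    return 2.0 * f1_hz - f2_hz


def candidate_beat_rates(f0_hz, delta_f_hz, harmonic=8):
    """Candidate beat rates (Hz) when one harmonic is shifted up by delta_f_hz."""
    mistuned = harmonic * f0_hz + delta_f_hz
    lower = (harmonic - 1) * f0_hz     # next lowest harmonic
    upper = (harmonic + 1) * f0_hz     # next highest harmonic
    upper2 = (harmonic + 2) * f0_hz    # one harmonic further up

    # CDT generated by the two harmonics above the mistuned one.
    cdt = cubic_difference_tone(upper, upper2)

    beat_low = mistuned - lower        # first-order beat with lower neighbor
    beat_high = upper - mistuned       # first-order beat with upper neighbor
    beat_above = upper2 - upper        # first-order beat between in-tune harmonics

    return {
        "cdt_vs_mistuned": abs(mistuned - cdt),
        "second_order_adjacent": abs(beat_low - beat_high),
        "second_order_higher": abs(beat_above - beat_high),
    }


print(candidate_beat_rates(100.0, 2.0))
```

For this example, the CDT route and the second-order route that includes the higher channel both give a rate equal to ΔF (2 Hz), whereas the second-order route restricted to the two closest neighbors gives 2ΔF (4 Hz).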
Lee and Green (1994) investigated the detection of mistuning of a single harmonic in a harmonic complex tone in various conditions. The complex tone consisted of 12 harmonics with frequencies in a low- to mid-frequency range. They suggested that the beats resulting from the mistuning of the 9th harmonic in a monaural condition correspond to the largest low-frequency modulation component that is apparent in the power spectrum of the envelope at the output of the auditory filter centered on the mistuned harmonic. The auditory filters were approximated as linear roex filters (Patterson and Moore, 1986), and thus the calculated envelope did not include the effects of combination tones that are produced by nonlinearities in the auditory system. The largest low-frequency component observed in the envelope of the output of the filter centered on the mistuned harmonic had a rate corresponding to 2ΔF, and Lee and Green (1994) suggested that it arises from the interaction of the two components at F0 − ΔF and F0 + ΔF in the envelope spectrum, which in turn arise mainly from the interaction of the mistuned component with its next highest and next lowest neighbour. Basically, the largest low-frequency component in the envelope spectrum was considered to be a second-order beat. However, Lee and Green (1994) did not assess the number of beats perceived by their subjects and therefore it is unclear whether the percept was indeed determined by the dominant modulation at the output of this filter.
In experiment 3, subjects indicated how many beats they heard for the same diotic complex tones as used in experiment 1 in the low- and the high-frequency regions. The duration of the tones was fixed at 2 s, and the amount of mistuning of the 8th harmonic was varied. Of specific interest was whether the perceived beat rate would correspond to the amount of mistuning, ΔF, or to 2ΔF. If the perceived beat rate corresponded to 2ΔF, the beat might be dominated by the interaction of the mistuned component with its closest neighboring components as implied by Lee and Green (1994). If the perceived beat rate corresponded to ΔF, harmonics further away (beyond harmonics 7 and 9) would have to be involved. This could occur via combination tones resulting from interactions between more remote harmonics or via beats involving more remote harmonics.
B. Stimuli and procedure
The complex tones and the TEN background were the same as for the diotic conditions of experiment 1, i.e., harmonic complex tones containing harmonics 6–10 with F0s of either 280 or 1400 Hz. The duration of the tones was 2 s. All components had either cosine or random starting phases. The envelopes of the waveforms at the outputs of the auditory filters were expected to have somewhat higher crest factors for cosine than for random phase, possibly resulting in a somewhat clearer pitch, although the difference was expected to be small for these intermediate harmonic ranks (harmonics 6–10). The amount of mistuning of the 8th harmonic was either 1, 2, 3, or 4 Hz. Subjects were asked to write down the number of beats or cycles heard for each tone. In a given trial, subjects were allowed to listen as often as they liked to a given tone complex before writing down their beat count for this tone; the same tone, separated by a 500-ms silent interval, was presented twice and subjects could repeat this sequence as often as they liked with a button press. The first tone was meant to function as a cue, and subjects reported the number of beats that they heard over the 2-s duration of the tones. No feedback was provided.
Seven subjects participated, all of whom also took part in experiment 1; subject 3 was not available. All subjects were tested with cosine-phase stimuli. Subjects 1, 5, 7, and 8 were additionally tested with random-phase stimuli; the remaining three subjects were no longer available. Subjects were tested first in the cosine-phase conditions. For a given starting phase, testing was carried out in blocks of four trials. During a block, the F0 was fixed and ΔF was varied randomly across trials. The order of F0s was counterbalanced across subjects. At least two repeats were completed for each condition. Data collection took on average one session of 2 h (including breaks) for each subject.
C. Results and discussion
Figure 5 shows the results for the 280-Hz F0 complex. Each panel shows the number of beats reported for a given ΔF. The horizontal dashed line indicates the expected value of beat counts for the 2-s stimulus if the perceived beat rate corresponded to ΔF. Note that the expected value of beat counts for the 2-s duration is equal to twice the beat rate. While there was sometimes variability across trials, both within and across subjects, the majority of counts lay on the horizontal dashed lines or below, for both starting phases. There was not a single case where the beat count was equal to the value that would be expected if the perceived beat rate corresponded to 2ΔF. The data points to the far right in each panel show the median (M) values of the beat counts across subjects, and are based on subjects' individual medians. The interquartile ranges for the median (the vertical lines at M) did not exceed the expected beat count assuming a beat rate corresponding to ΔF.
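The expected beat counts under the two hypotheses follow directly from the stimulus duration. The sketch below tabulates them for the four mistunings used; the constant and function names are hypothetical and serve only to illustrate the reasoning.

```python
DURATION_S = 2.0               # tone duration in experiment 3
MISTUNINGS_HZ = (1, 2, 3, 4)   # amounts of mistuning of the 8th harmonic


def expected_beat_count(beat_rate_hz, duration_s=DURATION_S):
    """Number of beat cycles heard over the tone for a given beat rate."""
    return beat_rate_hz * duration_s


for delta_f in MISTUNINGS_HZ:
    count_if_df = expected_beat_count(delta_f)        # beat rate equals delta F
    count_if_2df = expected_beat_count(2 * delta_f)   # beat rate equals 2 * delta F
    print(f"dF = {delta_f} Hz: {count_if_df:.0f} beats if rate = dF, "
          f"{count_if_2df:.0f} beats if rate = 2dF")
```

The two hypotheses therefore predict counts that differ by a factor of two at every ΔF, which is what allows the reported counts to discriminate between them.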
The answers of subject 6 for the cosine-phase condition are shown by circles rather than triangles. This subject reported that she did not perceive any salient beats and was more or less guessing. This tallies with the finding that her thresholds for the detection of mistuning of the 8th harmonic in the 280-Hz complex were about 15 Hz (see Fig. 1) and were much higher than those for the other seven subjects. Therefore, her beat-counts data were treated as outliers and were excluded from the calculation of M.
Figure 6 shows the results for the 1400-Hz F0 complex. While the variability across trials was slightly larger than for the low F0, again the majority of counts lay on the horizontal dashed lines or below, for both starting phases and all ΔFs. There were only two (out of 103 total) individual trials where the beat count was equal to the value expected if the beat rate corresponded to 2ΔF. For all ΔFs, the median values of the beat counts were equal to the values expected assuming that the beat rate corresponds to ΔF, and the 75th percentiles were always below the values expected assuming that the beat rate corresponds to 2ΔF.
In summary, for both F0s the data clearly show that the perceived beat rate for the diotic complexes with a slightly mistuned 8th harmonic corresponds much more closely to ΔF than to 2ΔF. This in turn does not support the idea that the perceived beat is determined mainly by second-order beats between the first-order beats that result from the interaction of the mistuned component with its closest neighbors (see above), as implied by Lee and Green (1994). Instead, it suggests that interactions between the mistuned component and harmonics further away are involved, which could occur via beats involving more remote harmonics or via combination tones resulting from interactions between more remote harmonics.
D. Model predictions
A simple model (see supplementary material²) was used to analyze the slow periodic waveform fluctuations at the outputs of simulated auditory filters (AFs) for the diotic complex tones with mistuned 8th harmonic. The objective was to assess whether the perceived beat rate could be predicted from the envelopes at the outputs of simulated AFs, and to determine the center frequencies of the filters whose outputs reflect the perceived beat rate. The model was used to examine the possible explanations for the beats that were outlined in Sec. VI A; predictions were derived both in the absence and in the presence of a simulated CDT. Briefly, the model consisted of the following stages: (i) a gammatone filterbank to simulate the frequency selectivity of the peripheral auditory system, (ii) half-wave rectification followed by a squaring device, (iii) a compressive nonlinearity (exponent of 0.25) that reflects the nonlinear input/output function of the basilar membrane, and (iv) a sliding temporal window (STW) that simulates the reduced sensitivity of the auditory system to high-rate modulation (Viemeister, 1979) with parameters as in Oxenham (2001) and in McKay et al. (2013). This basic model structure has been used before [see, e.g., Oxenham and Moore (1994) and Plack and Oxenham (1998)]. The peak-to-valley ratio of the periodic output waveform after the STW was determined. To assess the detectability associated with a given peak-to-valley ratio at the output of the model, the external AM depth [expressed as 20 log10(m), where m is the modulation index] of a sinusoidal carrier with sinusoidal amplitude modulation (SAM) was found that produced the same peak-to-valley ratio at the output of the model. This external AM depth (referred to hereafter as effective AM depth) provided a “calibration value” that allowed comparison with published data on AM detection. The results were as follows.
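The calibration step described above, i.e., finding the SAM depth that yields a given peak-to-valley ratio at the model output, can be illustrated with a deliberately simplified sketch. It is not the model used in the study: the gammatone filterbank and the sliding temporal window are omitted (an assumption that is reasonable only for very slow modulation), and only squaring followed by compression with an exponent of 0.25 is applied to the SAM envelope. All function names are hypothetical.

```python
import math


def am_depth_db(m):
    """AM depth in dB, 20*log10(m), for modulation index m."""
    return 20.0 * math.log10(m)


def modulation_index(depth_db):
    """Inverse of am_depth_db."""
    return 10.0 ** (depth_db / 20.0)


def toy_peak_to_valley(m, exponent=0.25):
    """Peak-to-valley ratio of a SAM envelope (1 + m*cos) after squaring
    and compression. Toy stand-in: no filterbank, no temporal window."""
    peak = ((1.0 + m) ** 2) ** exponent
    valley = ((1.0 - m) ** 2) ** exponent
    return peak / valley


def effective_am_depth_db(target_ratio, iterations=60):
    """Bisect for the modulation index whose toy output has the target
    peak-to-valley ratio, and return it as an AM depth in dB."""
    lo, hi = 1e-6, 0.999999
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        # The ratio grows monotonically with m, so bisection applies.
        if toy_peak_to_valley(mid) < target_ratio:
            lo = mid
        else:
            hi = mid
    return am_depth_db(0.5 * (lo + hi))


# Round trip: a -20 dB SAM tone should calibrate back to about -20 dB.
ratio = toy_peak_to_valley(modulation_index(-20.0))
print(round(effective_am_depth_db(ratio), 2))
```

In a fuller implementation, `toy_peak_to_valley` would be replaced by the peak-to-valley ratio measured at the output of the complete model for a SAM input, while the search for the matching modulation index would stay the same.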
In the absence of a CDT, for envelope rate ΔF, which was the rate most commonly reported by subjects, the largest effective AM depths occurred for filters with center frequencies (CFs) at about 9.7*F0 (see Fig. S1 in supplementary material). For envelope rate 2ΔF, the largest effective AM depths occurred for filters with CFs at about 8.5*F0. The maximum AM depths were about −21 to −23 dB, and were very similar for the two envelope rates and F0s (see Table SI in supplementary material). The simultaneous presence of beats with rates r and 2r leads to an overall repetition rate of r and this might be perceived as beats with rate r (Warren, 1978, 1982). Thus, if the AM depths were sufficient to be detectable, a beat rate of ΔF would be perceived. However, the AM depths might be below threshold as the tones were presented at a relatively low level and in the presence of masking noise, and both factors are known to increase AM detection thresholds (Moore and Sek, 1994; Furukawa and Moore, 1997; Kohlrausch et al., 2000).
As mentioned above, the 9th and 10th harmonics of a complex tone interact to give a CDT with frequency corresponding to that of the 8th harmonic. Following Goldstein (1967) and Hall (1972), the level of this CDT is about −15 to −20 dB relative to the level of the primaries. Therefore, the CDT by itself would be inaudible in the continuous TEN background used in this experiment. Nevertheless, the CDT can interact with the mistuned harmonic, and this would result in a beat rate of ΔF. To assess the effect of the presence of a low-level CDT on the model predictions, a sinusoid with a frequency corresponding to that of the (in-tune) 8th harmonic and with a level of −20 dB relative to the level of the inner components was added to the complex tones that served as input stimuli for the model.
In the presence of the CDT, the envelopes of the outputs of all AFs showing modulation were periodic with rate ΔF (see Fig. S1 in supplementary material). A very similar pattern was observed for both F0s. Adding the CDT mainly changed the outputs of AFs with CFs in the range from about 7.5*F0 to 8.5*F0. The largest effective AM depths were about −17 to −18 dB (about 4–5 dB larger than in the absence of the CDT), and occurred for filters with CFs between 8.3*F0 and 8.4*F0. This amount of AM is very likely to be detectable in the presence of the TEN (Moore and Sek, 1994; Furukawa and Moore, 1997).
For both F0s, the beat-count data showed that the perceived beat rate for the diotic complexes with a slightly mistuned 8th harmonic corresponds to ΔF rather than to 2ΔF. This suggests that the perceived beat is not determined mainly by second-order beats between the first-order beats that result from the interaction of the mistuned component with its closest neighbors on either side.
The findings reported here are in general agreement with the interpretation of Plomp (1967) and Ter Kuile (1902) that the beats resulting from mistuning of a single component in harmonic complex tones are a consequence of waveform interaction on the basilar membrane, although for the stimuli used here the presence of CDTs probably increased the salience of the beats, whereas Plomp (1967) argued that CDTs played little role in the perception of beats for low-frequency two-tone complexes. For three harmonics with frequencies n − 1, n, and n + 1 times F0, Ter Kuile (1902) and previous researchers [reviewed in Plomp (1967)] reported that they heard a beat rate of ΔF when the lower or the upper harmonic was mistuned by ΔF, but heard a beat rate of 2ΔF when the middle harmonic was mistuned by ΔF. Here, subjects reported a beat rate of ΔF when the middle harmonic was mistuned by ΔF. This indicates that the critical factor is the number of harmonics falling above (or possibly below) the mistuned component. When two consecutive in-tune harmonics were adjacent to the mistuned component, the perceived beat rate was ΔF, both in the present experiment and as reported by Ter Kuile (1902).
The analysis of the envelopes at the outputs of the model showed that in the absence of a simulated CDT the effective modulation depths at rate ΔF and 2ΔF were similar. In the absence of a simulated CDT, the perceived beat rate was conveyed most strongly by AFs with CFs around 9.7*F0, suggesting that the percept would be dominated by the interaction between three components within a filter, i.e., the interaction between the mistuned component and the two next highest harmonics. However, in the absence of a simulated CDT the effective modulation depths may have been insufficient for the modulation to be detectable in the presence of the TEN (Furukawa and Moore, 1997). The addition of a low-level CDT with a frequency of 8*F0 increased the effective modulation depth at rate ΔF but not at rate 2ΔF. In the presence of the CDT, the perceived rate was conveyed most clearly by AFs with CFs at about 8.3*F0, suggesting that the percept would be dominated by the interaction between the mistuned component and the CDT. Thus, in both accounts (with or without a CDT) the percept relies crucially on the contribution from two harmonics above (and/or below) the mistuned component. In other words, the perceived beat rate is not solely determined by the interaction of the mistuned component with its two closest neighboring components.
In summary, the analyses show that envelope fluctuations at the outputs of simulated AFs (with an added low-level CDT) seem to be sufficient to account for the beats perceived due to mistuning of the 8th harmonic. However, this does not rule out the possibility that somewhat more central processes are involved. Physiological studies in animals have shown that, relative to the auditory nerve, the representation of slow-rate envelope-related information seems to be enhanced at later stages of the auditory system, where information might be integrated across channels in various ways (Langner, 1992; Sinex et al., 2005). For example, Sinex and colleagues (Sinex et al., 2002; Sinex et al., 2003; Sinex et al., 2005; Sinex, 2008) recorded single-unit discharge patterns in the chinchilla in response to complex tones with a mistuned harmonic (usually the 4th). They used relatively large amounts of mistuning (at least 3% of the nominal harmonic frequency), as they were interested in mechanisms underlying the perceptual segregation of mistuned components. Sinex et al. (2002) and Sinex et al. (2005) observed an enhanced representation of the slow envelopes in the inferior colliculus (IC) relative to that observed in the auditory nerve (Sinex et al., 2003). The rate of the slow envelopes corresponded to ΔF rather than 2ΔF, or was even lower, corresponding to the true repetition rate of the mistuned complex. For example, when the frequency of the 4th harmonic of a 250-Hz F0 complex was increased by 30 Hz, the peri-stimulus discharge pattern of an IC neuron tuned to about 1200 Hz showed peaks separated by about 4 ms and a strong slower modulation with a period of 33 ms [figure 3 in Sinex et al. (2002)]. This was explained in terms of higher-order beats that reflect the combination of beat information across channels. In particular, it involved channels with a beat rate of F0, which likely arose through interaction between adjacent harmonics in the stimulus.
Overall, the beat rates reported by the subjects in the present study agree with those shown by Sinex et al. (2002), despite the differences between the stimuli.
VII. SUMMARY AND CONCLUSIONS
In experiment 1, thresholds were measured for the detection of mistuning of the 8th harmonic in otherwise harmonic complex tones containing harmonics 6–10 for F0s of 280 and 1400 Hz. For the latter, phase locking to the temporal fine structure is weak or absent. Harmonics were presented either diotically or dichotically (odd and even harmonics to different ears). In the diotic condition and for the longer durations, mistuning detection thresholds were very low for both F0s (at about 1–2 Hz), consistent with detection of beats produced by peripheral interactions. In the diotic condition, there was no significant difference between thresholds in the two frequency regions, indicating no difference in peripheral interactions between the two regions. In the dichotic condition, the mistuned component was perceptually segregated from the complex for the low F0 and thresholds were measurable but higher (at about 40–50 Hz), while for the high F0, thresholds were unmeasurable. Thus, for the high F0, the inability of subjects to detect the mistuning of a single component when no beats are audible stands in marked contrast to the very good ability to detect the mistuning in the presence of audible beats. It also contrasts with the finding that at least some subjects had a reasonably good ability to match musical intervals between complex tones in the very high frequency region (Gockel and Carlyon, 2021), but is similar to, and possibly related to, the finding of larger FDLs in the very high than in the lower frequency region (Lau et al., 2017; Gockel et al., 2020).
In experiment 2, psychometric functions were measured for the detection of mistuning of the 8th harmonic in the dichotic condition for a stimulus duration of 1 s. For a mistuning of 5 Hz, performance was at chance level for both F0s, consistent with the absence of beat cues. For a mistuning of 50% of the F0, performance was very good for the low F0, but was close to chance for the high F0. There was no evidence that the psychometric functions had an unusual shape. For the high F0, subjects reported no “popping out” effect, consistent with a role of phase locking in perceptual segregation.
In experiment 3, the perceived beat rate for the diotic complex tones with a mistuned 8th harmonic was measured. The perceived beat rate corresponded to the amount of mistuning (in Hz) for both F0s. A phenomenological model consisting of a gammatone filterbank, half-wave rectification, a compressive nonlinearity, and an STW was used to analyse the slow envelope fluctuations of the waveforms that arise from interactions between components in the auditory periphery. The effect of the presence of a combination tone at the frequency of the 8th harmonic was also assessed. It is argued that an explanation of the perceived beat rate requires the interaction of the mistuned harmonic with components beyond its two closest harmonic neighbours. At a peripheral level, the interaction could occur directly in a filter tuned to a frequency between the two higher harmonics, i.e., away from the mistuned harmonic, and/or indirectly in a filter tuned closer to the mistuned harmonic via a CDT at the nominal harmonic frequency that arises from the interaction between two consecutive harmonic components. In addition, somewhat more centrally mediated higher-order beats across channels may contribute.
This research was supported by the Medical Research Council UK (SUAG/042/G101400). We thank Brian Moore for helpful comments on an earlier version of this paper. We also thank Joshua Bernstein and three reviewers for helpful comments.
A special case occurs when the components of the complex tone have relative phases (e.g., all starting in cosine phase or sine phase) that lead to a waveform on the basilar membrane with a high peak factor. For such stimuli, a high harmonic that has a sufficiently different phase from the other harmonics appears in the dips of the waveform of the basilar membrane and can be heard out (Moore and Glasberg, 1989). In this way, a slightly mistuned high harmonic may initially be inaudible, but after a certain time, when its phase has shifted sufficiently relative to that of the other components, it may pop out. However, in this paper the focus is on complex tones in which the component phases are random. For such stimuli, high harmonics are not perceived as popping out when they differ in relative phase from the other components (Moore and Glasberg, 1989).
See supplementary material at https://www.scitation.org/doi/suppl/10.1121/10.0012351 for details of the model implementation and its predictions.