Intelligibility of four-band speech stimuli was investigated (n = 18): either only one of the frequency bands was preserved while the other bands were locally time-reversed (segment duration: 75–300 ms), or only one band was locally time-reversed while the others were preserved. Intelligibility was best retained (82% at 75 ms) when the second lowest band (540–1700 Hz) was preserved, and the largest drop in intelligibility (10% at 300 ms) occurred when this same band was degraded. The lowest and second highest bands made comparable, smaller contributions to intelligibility, and the highest frequency band contributed least. These results suggest a close connection between the second lowest frequency band and sonority.
Sonority can be defined as “a unique type of relative, n-ary (non-binary) feature-like phonological element that potentially categorizes all speech sounds into a hierarchical scale” (Parker, 2012a). Since the concept was proposed, under various names, by several authors [for a review, see Clements (1988)], e.g., as aperture by de Saussure (1916), it has been studied for many years, and the framework of sonority has received broad support. One of the most influential theories in linguistics today, optimality theory (Prince and Smolensky, 2004), is based on the sonority sequencing principle (Clements, 1988), which states that the phoneme at the position of a syllable nucleus has the peak in sonority, whereas the phonemes before and after the nucleus have increasingly lower sonority the further they are from the nucleus. Nevertheless, controversy regarding the concept of sonority continues (e.g., Parker, 2012b; Rahilly, 2016). For example, the consonant /s/ behaves in ways that do not fit the sonority scales of Indo-European and other languages (Goad, 2016), the sonority scales proposed by phoneticians differ slightly from each other (e.g., de Saussure, 1916; Harris, 1994; Selkirk, 1984; Spencer, 1996), and arguments about the concept of sonority tend to be circular (Rahilly, 2016).
Nakajima and colleagues (Nakajima et al., 2017; Ueda and Nakajima, 2017; Zhang et al., 2020) took an approach to the issue entirely different from previous approaches in linguistics. Based on factor scores extracted from correlation coefficients between power fluctuations in critical-band filtered speech (Ueda and Nakajima, 2017), they examined the relationships between the factor scores and the phonemic labels assigned to spoken sentences. Nakajima et al. (2017) found high correlation coefficients (0.82–0.87) between the scores of one factor and the order of phonemic categories in previously proposed sonority scales (de Saussure, 1916; Harris, 1994; Selkirk, 1984; Spencer, 1996). Specifically, the factor closely related to the second lowest frequency band (540–1700 Hz), i.e., the mid-low factor (Nakajima et al., 2017), produced factor scores that correlated best with the sonority scales. If this is indeed the case, this frequency band (540–1700 Hz) should affect the intelligibility of speech more profoundly than other frequency bands. The present investigation aims to further clarify the relationship between the four frequency bands extracted from speech (Ueda and Nakajima, 2017) and speech intelligibility.
The four frequency bands (50–540, 540–1700, 1700–3300, and 3300–7000 Hz), which provided the basis for the arguments by Nakajima et al. (2017), were determined using factor analyses of 58–200 spoken sentences in eight languages/dialects by 10–20 talkers (Ueda and Nakajima, 2017). The spoken sentences were bandpass filtered with a critical-band filter bank. Each filter output was squared to obtain power fluctuations. Correlation coefficients were calculated between every possible combination of the power fluctuations. The correlation coefficient matrix was submitted to a principal component analysis, and the extracted components were varimax rotated to obtain factors. The cutoff frequencies for the four frequency bands came from the crossover frequencies of the factor loading curves. Therefore, the frequency bands were obtained without recourse to any linguistic processing (e.g., identifying phonemes or segmenting syllables). Rather, they originated from one of the most basic and common functions of the auditory system, namely, critical-band filtering. Putting the matter in a broader context, the frequency bands relate to other specific studies as well. One example is the investigation of how informational masking (e.g., Durlach et al., 2003; Shinn-Cunningham, 2008) of each of the first three formants affects intelligibility (Roberts and Summers, 2018), which suggested that the second formant is most important for the intelligibility of three-formant analogs of speech. Because the range of frequency variation for the second formant largely overlaps with the second lowest frequency band, i.e., 540–1700 Hz, the study by Roberts and Summers (2018) also predicts that the second lowest frequency band should be the most influential for intelligibility.
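The correlation-to-factor pipeline described above can be sketched as follows. This is an illustration only, not the code used in the original analyses: it assumes the power fluctuations have already been obtained from a critical-band filter bank, extracts principal components from their correlation matrix, and applies a standard varimax rotation; the function names are ours.

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of a (variables x factors) loading matrix."""
    p, k = loadings.shape
    R = np.eye(k)
    var = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # Gradient of the varimax criterion, solved via SVD (Kaiser-style update).
        B = loadings.T @ (L**3 - (gamma / p) * L @ np.diag(np.sum(L**2, axis=0)))
        u, s, vt = np.linalg.svd(B)
        R = u @ vt
        var_new = np.sum(s)
        if var_new < var * (1 + tol):  # converged
            break
        var = var_new
    return loadings @ R

def band_factors(power, n_factors=4):
    """power: (n_samples, n_bands) power fluctuations of critical-band
    filter outputs.  Returns varimax-rotated factor loadings obtained
    from a principal component analysis of the correlation matrix."""
    corr = np.corrcoef(power, rowvar=False)     # band-by-band correlations
    eigval, eigvec = np.linalg.eigh(corr)       # ascending eigenvalues
    order = np.argsort(eigval)[::-1][:n_factors]
    loadings = eigvec[:, order] * np.sqrt(np.maximum(eigval[order], 0.0))
    return varimax(loadings)
```

In the original study, the crossover frequencies of the resulting loading curves over critical bands then defined the four band edges.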
Matsuo et al. (2020) provided evidence suggesting that the second lowest frequency band may be the key to maintaining high speech intelligibility. They used chimeric locally time-reversed speech, in which original speech samples were divided into two frequency bands, i.e., an upper band and a lower band. One of the bands was locally time-reversed, while the other band was preserved as in the original. By chimeric, they meant a composite stimulus comprising both the locally time-reversed part and the original part from the same speech sample (we adopt the same meaning for this term in the present paper). Locally time-reversed speech is a kind of degraded speech in which original speech samples are periodically segmented, each segment is reversed in time, and the segments are then concatenated in the original order (e.g., Greenberg and Arai, 2004; Ishida et al., 2016; Saberi and Perrott, 1999; Steffen and Werani, 1994; Stilp et al., 2010; Teng et al., 2019; Ueda et al., 2017). Typically, intelligibility of locally time-reversed speech is almost perfect at a segment duration of about 40 ms under normalized speech rates; it falls to 50% at about 65 ms and to less than 10% at about 100 ms, irrespective of language (Ueda et al., 2017). In the study of Matsuo et al. (2020), a low-pass filter and a high-pass filter with the same cutoff frequency were employed. The cutoff frequencies were 570, 840, 1170, 1600, 2150, and 2900 Hz [five steps of two critical-bandwidth intervals (Ueda and Nakajima, 2017; Zwicker and Terhardt, 1980)]. The results of the experiment, in which intelligibility of the chimeric stimuli was examined, showed that intelligibility started to decline once the degradation included the frequency range of 840–1600 Hz.
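The local time reversal itself is simple to state in code. The following sketch is our illustration only (the original stimuli were generated with custom J-language software): it divides a signal into fixed-duration segments, reverses each segment in time, and concatenates the segments in their original order.

```python
import numpy as np

def locally_time_reverse(x, fs, segment_ms):
    """Return a locally time-reversed copy of signal x.

    x          : 1-D array of samples
    fs         : sampling frequency in Hz
    segment_ms : segment duration in milliseconds

    Each consecutive segment is reversed in time; any shorter final
    segment is reversed as well.  Segment order is unchanged.
    """
    seg = int(round(fs * segment_ms / 1000.0))
    return np.concatenate([x[i:i + seg][::-1] for i in range(0, len(x), seg)])
```

Applying the operation twice with the same segment duration recovers the original signal when the length is a multiple of the segment size, which is a convenient sanity check.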
However, the relationship between that particular frequency band and intelligibility was inferred indirectly, because the degraded portion of their chimeric speech stimuli always encompassed several frequency bands, except in the conditions in which a band at either end of the frequency axis was degraded. Therefore, we designed an experiment to address the relationship between each frequency band and intelligibility directly. Specifically, we adopted the four frequency bands that Ueda and Nakajima (2017) determined, targeting each band for preservation or degradation one by one (Fig. 1). We examined the intelligibility of these stimuli, in addition to that of control stimuli in which all bands were preserved as in the original or all degraded. The frequency range highlighted by Matsuo et al. (2020), i.e., 840–1600 Hz, is included in the second lowest frequency band (540–1700 Hz) determined by Ueda and Nakajima (2017). Thus, the purpose of this investigation was to examine whether any particular band was more influential than the others on intelligibility when each individual band was preserved or degraded.
A total of 18 unpaid listeners (ages 19–23) participated in the experiment. All were native Japanese listeners with normal hearing. Their hearing levels were tested with an audiometer (RION AA-56, RION, Kokubunji, Japan) within the frequency range of 250–8000 Hz. The research was conducted with the prior approval of the Ethics Committee of Kyushu University (approval ID: 70).
2.2 Stimuli and conditions
A total of 150 Japanese sentences spoken by a female talker were extracted from the Multilingual Speech Database 2002 (NTT Advanced Technology Corp., Kawasaki, Japan; 16 000-Hz sampling, 16-bit linear quantization). The speech rate of this particular talker (95%) was very close to the average across the ten talkers (five female and five male) included in the speech database. It has been shown that the primary determinant of intelligibility for locally time-reversed speech, apart from segment duration, is speech rate (Stilp et al., 2010; Ueda et al., 2017). Ueda et al. (2017) employed both a female and a male talker in four languages and found that the intelligibility curves along segment duration for the two talkers in each language were almost identical or only marginally different. Therefore, we employed only one female talker for the stimuli in this experiment. The sentences in the database were based on articles published in newspapers and magazines. The average number of morae (a mora is a syllable-like unit in Japanese) per sentence was 18 [standard deviation (SD) = 2.9]. Extracted spoken sentences were edited to eliminate unnecessary silences and noises. Edited speech samples were converted to 44 100-Hz sampling with 16-bit linear quantization with Praat (Boersma and Weenink, 2020) before further processing.
Three variables were manipulated: segment duration (75, 150, and 300 ms), stimulus types [ORG-n: a target band (n) was preserved as in the original except for filtering and segmenting, and the other bands were locally time-reversed; LTR-n: a target band (n) was locally time-reversed, and the other bands were preserved as in the original], and a target band (none, 1, 2, 3, or 4). Target bands were numbered from 1 (the lowest) to 4 (the highest). The target band “none” refers to the conditions in which no band was preserved or degraded. Thus, “ORG-none” represents the condition in which all frequency bands were locally time-reversed, and “LTR-none” represents the condition in which all frequency bands were preserved as in the original.
The speech samples were passed through a bank of bandpass filters, dividing the frequency range from 50 to 7000 Hz into four frequency bands: 50–540, 540–1700, 1700–3300, and 3300–7000 Hz. The filtered speech samples were segmented at the three segment durations, with 5-ms cosine ramps applied to each segment. Depending on the condition, each segment was locally time-reversed or preserved as it was. The segments were then concatenated in the original order, and the four frequency bands were summed. The signal processing was performed with custom software written in the J language (J Software, 2020).
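A minimal sketch of this construction follows, as an illustration only: it substitutes brick-wall FFT filters for the original filter bank and uses a raised-cosine ramp at both ends of each segment (the exact filter and windowing details of the original J software are not reproduced, and all function names are ours).

```python
import numpy as np

BAND_EDGES = [(50, 540), (540, 1700), (1700, 3300), (3300, 7000)]  # Hz

def fft_bandpass(x, fs, lo, hi):
    """Brick-wall band-pass via the FFT (a stand-in for the original
    filter bank).  Bins in [lo, hi) are kept, so adjacent bands do not
    overlap at the shared edge."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1.0 / fs)
    X[(f < lo) | (f >= hi)] = 0.0
    return np.fft.irfft(X, n=len(x))

def segment_with_ramps(x, fs, segment_ms, reverse, ramp_ms=5.0):
    """Window each segment with 5-ms cosine ramps, optionally reverse
    it, and concatenate.  Trailing samples shorter than one segment
    are dropped in this sketch."""
    seg = int(round(fs * segment_ms / 1000.0))
    ramp = int(round(fs * ramp_ms / 1000.0))
    win = np.ones(seg)
    win[:ramp] = 0.5 * (1 - np.cos(np.pi * np.arange(ramp) / ramp))
    win[seg - ramp:] = win[:ramp][::-1]
    out = []
    for i in range(0, len(x) - seg + 1, seg):
        s = x[i:i + seg] * win
        out.append(s[::-1] if reverse else s)
    return np.concatenate(out)

def chimeric_ltr(x, fs, segment_ms, target_band, preserve_target=True):
    """ORG-n (preserve_target=True): reverse every band except the target.
    LTR-n (preserve_target=False): reverse the target band only.
    Bands are numbered 1 (lowest) to 4 (highest)."""
    bands = []
    for n, (lo, hi) in enumerate(BAND_EDGES, start=1):
        b = fft_bandpass(x, fs, lo, hi)
        reverse = (n != target_band) if preserve_target else (n == target_band)
        bands.append(segment_with_ramps(b, fs, segment_ms, reverse))
    return np.sum(bands, axis=0)  # sum the four bands across frequency
```

Passing target_band=None reproduces the control conditions: all bands reversed with preserve_target=True (ORG-none), or all preserved with preserve_target=False (LTR-none).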
The stimuli were presented to participants diotically through headphones (DT 990 PRO, Beyerdynamic GmbH, Heilbronn, Germany) in a sound-attenuated booth (Music cabin SC3, Takahashi Kensetsu, Kawasaki, Japan). Custom software written with the LiveCode package (LiveCode Community, 2018) was used to present the stimuli. The headphones were driven with an optical interface (USB interface, Roland UA-4FX, Roland Corp., Shizuoka, Japan) and a headphone amplifier with a built-in digital-to-analog (D/A) converter (AT-DHA 3000, Audiotechnica, Machida, Japan). The sound pressure level of speech was adjusted to 72 dB (A), using a 1000-Hz calibration tone provided with the speech database. The sound pressure level was measured with an artificial ear (Brüel & Kjær type 4153, Brüel & Kjær Sound & Vibration Measurement A/S, Nærum, Denmark), a condenser microphone (Brüel & Kjær type 4192), and a sound level meter (Brüel & Kjær type 2250).
Participants were instructed to write down exactly what they heard with hiragana or katakana (sets of symbols that are used to represent Japanese morae) without guessing. Each mora was examined for whether it was correct or incorrect. The number of correct morae in each sentence was counted. A blank response was counted as incorrect, and homophone errors were permitted. The percentage of correct morae was calculated for summarizing and presenting the results in a figure. Statistical analysis was based on the binomial results, i.e., correct or incorrect.
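As an illustration of this scoring scheme (the paper does not spell out its alignment procedure, so this sketch simply compares morae position by position; the function names are ours):

```python
def mora_accuracy(response, target):
    """Count morae scored correct under a naive position-by-position
    comparison of two mora sequences.  A blank (empty) response counts
    every mora of the target as incorrect."""
    correct = sum(1 for r, t in zip(response, target) if r == t)
    return correct, len(target)

def percent_correct(pairs):
    """Aggregate (correct, total) pairs from many sentences into the
    percentage of correct morae."""
    correct = sum(c for c, _ in pairs)
    total = sum(t for _, t in pairs)
    return 100.0 * correct / total
```

Treating homophonous kana spellings as equivalent (as the scoring rule permits) would require an additional normalization step before comparison, which is omitted here.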
Percentages of mora accuracy are presented in Fig. 2. The shape of the distribution remained roughly the same as the segment duration was extended from 75 through 150 to 300 ms. Among the ORG-n stimuli, intelligibility was best preserved for the ORG-2 stimuli. Specifically, when the segment duration was 75 ms, intelligibility was 82% for the ORG-2 stimuli, whereas for the control stimuli (ORG-none) it dropped to 35%. When the segment duration was 150 ms, intelligibility was 42% for the ORG-2 stimuli but just 3% for the ORG-none stimuli. When the segment duration was 300 ms, intelligibility was still 27% for the ORG-2 stimuli, but only 1% for the ORG-none stimuli. For the ORG-1, -3, and -4 stimuli, intelligibility was always lower than for the ORG-2 stimuli. The intelligibility for ORG-1 and -3 was comparable. The intelligibility for the ORG-4 stimuli (50%, 8%, and 2% at 75-, 150-, and 300-ms segment duration, respectively) was slightly better than that for the control stimuli (ORG-none; 35%, 3%, and 1%, respectively).
For the LTR-n stimuli, intelligibility dropped most for the LTR-2 stimuli, for which it went down from 96% to 90% as segment duration increased from 75 to 300 ms. Intelligibility for the control stimuli (LTR-none), by contrast, was invariably high, i.e., more than 99%. The intelligibility for the LTR-1 and -3 stimuli was comparable, and the intelligibility for the LTR-4 stimuli was indistinguishable from that for the control stimuli (LTR-none).
The observations above were confirmed by analyses using a generalized linear mixed model (GLMM) with a logistic link function, as implemented in an add-in for JMP (SAS Institute Inc., 2018), applied to the results of the two stimulus types separately. The data were analyzed for fixed effects of segment duration (continuous predictor), target band (categorical predictor), and their interaction, and for a random effect of listener. For the ORG-n stimuli, the model revealed p values smaller than 0.001 for all effects: segment duration, target band, and their interaction. For the LTR-n stimuli, the model revealed significant effects of segment duration (p < 0.048) and target band (p < 0.001); the p value for the interaction was 0.6. To examine whether the differences between target bands were reliable, Tukey–Kramer honestly significant difference (HSD) tests were applied. For the ORG-n stimuli, p values were smaller than 0.05 for the differences between all combinations of target bands except for the difference between 1 and 3 (p = 0.21). For the LTR-n stimuli, p values were smaller than 0.01 for the differences between 1 and 2, 1 and none, 2 and 3, 2 and 4, 2 and none, and 3 and none. The other p values (for 1 and 3, 1 and 4, 3 and 4, and 4 and none) exceeded 0.05.
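The fixed-effects part of such a logistic model can be sketched as follows. This is an illustration only: the actual analysis was a GLMM fitted in JMP with a random listener effect, which this numpy sketch omits; it fits a plain fixed-effects logistic regression to binary (correct/incorrect) mora data by iteratively reweighted least squares.

```python
import numpy as np

def logistic_irls(X, y, n_iter=25):
    """Fit a logistic regression by iteratively reweighted least squares.

    X : (n, k) design matrix (e.g., intercept, segment duration,
        dummy-coded target band, and interaction columns)
    y : (n,) binary outcomes (1 = mora correct, 0 = incorrect)

    Returns the fixed-effect coefficient vector.  No random listener
    effect is modeled here, unlike the GLMM used in the study.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # predicted probabilities
        W = p * (1 - p)                             # IRLS weights
        z = X @ beta + (y - p) / np.maximum(W, 1e-12)  # working response
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta
```

Likelihood-ratio or Wald tests on these coefficients would then yield the kind of p values reported above, although accounting for the listener random effect requires a proper mixed-model fit.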
Summarizing the results, among the ORG-n stimuli, intelligibility was highest for the ORG-2 stimuli; among the LTR-n stimuli, intelligibility was lowest for the LTR-2 stimuli. These features were observed irrespective of segment duration. Thus, band 2 (540–1700 Hz) influenced the intelligibility of the chimeric locally time-reversed speech stimuli most strongly. At the same time, intelligibility was comparable within the pair of ORG-1 and -3 stimuli and within the pair of LTR-1 and -3 stimuli; thus, the contributions of bands 1 and 3 to intelligibility were comparable too. The contribution of band 4 was the smallest; it was nevertheless distinct in the ORG-n conditions, whereas in the LTR-n conditions it was not apparent, probably because of a ceiling effect. The intelligibility of the ORG-none control stimuli, for which all frequency bands were locally time-reversed, was 35%, 3%, and 1% at 75-, 150-, and 300-ms segment duration, respectively, which corresponds well with previous results obtained with the same speech database (Ueda et al., 2017).
This pattern of results cannot be attributed to average power level differences between the bands: an analysis of the average power levels in the same speech database and comparable frequency bands showed that the levels were highest in the lowest band (corresponding to band 1 in the current study), followed by the second lowest band (corresponding to band 2), and gradually decreased through the third and fourth bands in this order (Ueda et al., 2018). Thus, the order of the average power levels is at odds with the current results. In addition, an experiment in which the average power levels were exchanged between every possible pair of bands showed that the manipulation did not affect the intelligibility of four-band noise-vocoded speech when the levels of bands 1 and 2, 1 and 3, or 3 and 4 were exchanged; the reduction in intelligibility caused by the level exchange between bands 1 and 4, 2 and 3, or 2 and 4 was marginal, 20% at most (Ueda et al., 2018).
Thus, band 2 is the most informative band for speech intelligibility. When this band was degraded, the drop in intelligibility was the largest, whereas when it was preserved, intelligibility was retained best. The present results are in accord with our previous findings obtained with the combination of a lower frequency band and a higher frequency band, in which either one of the bands was degraded and the other band was preserved (Matsuo et al., 2020). The current results are also consistent with the results of the study on informational masking of three-formant speech analogs by extraneous formants (Roberts and Summers, 2018). Furthermore, the results are in line with the previous findings by Nakajima et al. (2017), suggesting that band 2 correlated most strongly with sonority. This band corresponds to a factor with a single peak (Ueda and Nakajima, 2017), the mid-low factor (Nakajima et al., 2017). It has been confirmed that vowels and sonorants dominate this factor (Nakajima et al., 2017; Zhang et al., 2020); thus, it is natural that band 2 has the closest connection with sonority and, hence, syllable formation. Bands 1 and 3 also contributed to intelligibility to some extent, although their contribution was less prominent than that of band 2. Nakajima et al. (2017), in fact, found moderate correlation coefficients (0.30–0.54) between sonority scales and the low and mid-high factor, which is bimodal and corresponds to bands 1 and 3 (Grange and Culling, 2018; Ueda and Nakajima, 2017). It is therefore plausible that the contributions of bands 1 and 3 to intelligibility were comparable, because the two bands are connected to the same bimodal factor. The negative correlation coefficients (–0.45 to –0.28) observed by Nakajima et al. (2017) between sonority scales and the high factor, which corresponds to band 4 in the current study, were not confirmed in the present results. This may be a limitation of the current experimental paradigm.
Further investigations are warranted to clarify the issue.
This work was supported by Japan Society for the Promotion of Science (JSPS) KAKENHI Grant No. JP19H00630. The authors would like to thank Yoshitaka Nakajima for providing J language software routines, Hikaru Eguchi for programming in LiveCode, Daiki Higuchi for running the experiment, Gerard B. Remijn for providing helpful comments on the draft, and Yoshitaka Nakajima and Hiroshige Takeichi for valuable discussion.