Although the telephone band (0.3–3 kHz) provides sufficient information for speech recognition, the contribution of the non-telephone band (<0.3 and >3 kHz) is unclear. To investigate its contribution, speech intelligibility and talker identification were evaluated using consonants, vowels, and sentences. The non-telephone band produced relatively good intelligibility for consonants (76.0%) and sentences (77.4%), but not vowels (11.5%). The non-telephone band supported good talker identification with sentences (74.5%), but not with vowels (45.8%) or consonants (10.8%). Furthermore, the non-telephone band could not produce satisfactory speech intelligibility in noise at the sentence level, suggesting the importance of full-band access in realistic listening.
1. Introduction
Speech sounds carry a wealth of information regarding the linguistic message as well as a talker's identity and emotional state. Speech perception is also remarkably robust against not only distortions in amplitude, time, and frequency, but also noise, reverberation, and other challenging environments (Licklider, 1948; Saberi and Perrott, 1999; Warren, 1995). A paradigm shift in the transmission of acoustic information was the invention of the telephone, which fundamentally changed speech communication from exclusively in-person interactions to additional remote exchanges. Although speech contains energy at high frequencies up to 15 kHz, early telephones adopted a narrow bandwidth between 0.3 and 3 kHz to optimally balance transmission efficiency and speech intelligibility (Fletcher, 1953; French and Steinberg, 1947).
The adoption of this narrow telephone bandwidth has not only paved the way for pervasive usage of modern telecommunication, but also established a prevailing dogma in speech research, asserting the paramount importance of the 0.3- to 3-kHz band in speech intelligibility (ANSI, 1997; Bell, 1992; DePaolis, 1996; Fletcher and Galt, 1950; French and Steinberg, 1947; Studebaker, 1987). The frequencies outside the 0.3- to 3-kHz band, on the other hand, have been mostly assumed to be responsible for speech quality or talker identity, but not intelligibility (Monson, 2011; Moore, 2008; Olson, 1947; Rodman, 2003; Schwartz, 1968). Consequently, contemporary wideband telephony uses frequencies outside the 0.3- to 3-kHz range to enhance speech quality and talker identification or mitigate listening fatigue (e.g., Rodman, 2003).
A few studies have shown that energy above 3 kHz contributes to speech intelligibility (Lippmann, 1996; Monson, 2019; Motlagh Zadeh, 2019), but the narrowband speech perception dogma remains, and a systematic study quantifying the relative contribution of the non-telephone bandwidth is lacking. Understanding the role of the non-telephone bandwidth can not only bridge this knowledge gap, but also facilitate potential practical applications such as speech delivery through alternative frequencies. This implementation would be particularly beneficial for listeners who have limited access to full-band information. The present study quantifies the contribution of the non-telephone bandwidth (i.e., <0.3 or >3 kHz) to speech intelligibility and talker identity using three types of speech materials: consonants, vowels, and sentences. Considering that most everyday conversations occur in noise, we conducted an additional test to measure the effect of noise on sentence intelligibility using the full-, telephone-, and non-telephone-band stimuli. Finally, we subjected the sentences to two modern objective measures (Rix, 2001; Taal, 2011), which are typically used to evaluate artificial intelligence (AI)-generated speech for intelligibility and quality. These two measures were compared against the human subjective evaluations to provide insights into potential AI algorithms for improving speech perception outcomes. We hypothesized that the non-telephone-band speech contains sufficient information to support good performance in speech intelligibility in quiet and in noise, as well as in talker identification.
2. Methods
2.1 Participants
A total of 15 adults with normal hearing [<20-dB hearing level (HL) at audiometric frequencies from 125 to 8000 Hz] and native English proficiency participated in the study. Informed consent was obtained from all participants under a human subject protocol approved by the University of California Irvine Institutional Review Board.
2.2 Speech intelligibility
In the speech intelligibility task, five participants (22 ± 4 years old; three females) recognized the three types of stimuli regardless of talker. The consonant stimuli consisted of 20 consonants spoken by two males and two females in the a/C/a structure (Shannon, 1999). The vowel stimuli consisted of 16 vowels spoken by two males and two females in the h/V/d structure (Hillenbrand, 1995). The sentence stimuli were 250 sentences spoken by a male talker [Hearing In Noise Test (HINT) sentences; Nilsson, 1994]. These sentences were divided into 25 equally intelligible lists with ten sentences per list. Each sentence had five keywords. The original bandwidths were 11, 11, and 8 kHz for consonants, vowels, and sentences, respectively. The sentences had a narrower bandwidth as a result of an antialiasing low-pass filter at 8 kHz (Nilsson, 1994). All speech materials were resampled to 22.05 kHz and then root-mean-square (RMS) equalized. Three speech-spectrum-shaped noise stimuli were generated to match the long-term spectrum of the consonant, vowel, and sentence materials, respectively, and to calibrate all stimuli to an 80-dB sound pressure level (SPL) presentation level. These calibrated stimuli were used as the full-band stimuli.
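As an illustration of the stimulus preparation described above, the spectrum shaping and RMS equalization steps can be sketched as follows. This is a minimal sketch, not the study's actual code: the randomized-phase shaping method and the function names (`speech_shaped_noise`, `rms_equalize`) are our assumptions, since the paper does not specify the exact algorithm.

```python
import numpy as np

def speech_shaped_noise(speech, rng=None):
    """Generate noise whose magnitude spectrum matches the long-term
    spectrum of `speech`, by keeping the speech magnitude spectrum and
    randomizing its phase (one plausible implementation; the paper does
    not specify the exact shaping method)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    mag = np.abs(np.fft.rfft(speech))                # long-term magnitude spectrum
    phase = rng.uniform(0.0, 2.0 * np.pi, mag.size)  # random phase per bin
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(speech))

def rms_equalize(x, target_rms):
    """Scale `x` to a target root-mean-square level."""
    return x * (target_rms / np.sqrt(np.mean(x ** 2)))
```

Because only the phase is randomized, the noise inherits the speech's long-term spectrum exactly, which is the property required of a speech-spectrum-shaped masker.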
Both the telephone-band and non-telephone-band stimuli were generated by applying 8th-order Butterworth low- or high-pass filters to the full-band stimuli. The telephone-band stimuli preserved energy between 0.3 and 3.15 kHz, while the non-telephone-band stimuli contained only energy below 0.3 kHz and above 3.15 kHz. Figure 1 displays the long-term spectra of the three stimulus types with the full [panel (a)], telephone [panel (b)], and non-telephone [panel (c)] bands. In total, three band conditions were constructed for each material, resulting in 240 consonant tokens (20 consonants × 4 talkers × 3 bands), 192 vowel tokens (16 × 4 × 3), and 750 sentences (250 × 1 × 3).
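The band filtering described above can be sketched with SciPy's Butterworth design: a high-pass/low-pass cascade for the telephone band, and the sum of a low-pass and a high-pass output for the band-stop non-telephone band. This is a sketch under stated assumptions: zero-phase filtering (`sosfiltfilt`) and the function names are ours, and the study does not state whether its filtering was causal or zero-phase.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 22050               # sampling rate used in the study (22.05 kHz)
LO, HI = 300.0, 3150.0   # telephone-band edges in Hz

def telephone_band(x, fs=FS, order=8):
    """Keep energy between 0.3 and 3.15 kHz (high-pass then low-pass)."""
    hp = butter(order, LO, btype='highpass', fs=fs, output='sos')
    lp = butter(order, HI, btype='lowpass', fs=fs, output='sos')
    return sosfiltfilt(lp, sosfiltfilt(hp, x))

def non_telephone_band(x, fs=FS, order=8):
    """Keep only energy below 0.3 kHz and above 3.15 kHz (band-stop)."""
    lp = butter(order, LO, btype='lowpass', fs=fs, output='sos')
    hp = butter(order, HI, btype='highpass', fs=fs, output='sos')
    return sosfiltfilt(lp, x) + sosfiltfilt(hp, x)
```

Note that `sosfiltfilt` filters forward and backward, which doubles the effective filter order relative to a single causal pass.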
In consonant recognition, participants selected the token they heard on each trial by clicking on 1 of the 20 consonant alternatives displayed on a computer screen. In vowel recognition, the participants performed the same task, except they selected from 1 of the 16 vowel alternatives. A training session was conducted with full-band consonants and vowels prior to the formal testing to help participants map the speech tokens to their written forms on the computer screen. Trial-by-trial feedback was provided during training. Participants were required to achieve at least 80% correct performance in training before proceeding to the formal testing. On average, three training sessions were needed for participants to reach ≥80% correct performance. In the formal testing session, no feedback was provided. All participants identified all vowel and consonant tokens in the testing session. The stimulus order was counterbalanced across participants.
In sentence recognition, participants verbally repeated as many words as possible from a sentence presented on each trial. A correct response was recorded only when all five keywords in a sentence were correctly identified. A training session using ten full-band sentences (from the same list) was conducted prior to the testing session. Trial-by-trial feedback was provided during training. The participants were required to identify at least eight out of ten sentences in the training session to qualify for the testing session (i.e., 80% correct). Only one training session was needed for all participants to reach the ≥80% correct criterion. In the testing session, all participants identified 60 sentences (i.e., 2 lists × 10 sentences × 3 bands) that were not used in the training session. The stimulus order was counterbalanced across participants. Altogether, the speech intelligibility test took about 1.5 h to complete.
2.3 Sentence intelligibility in noise
The speech reception threshold (SRT) was obtained as the signal-to-noise ratio (SNR) at which 50% of the sentences were correctly identified. Target sentences were the HINT sentences (lists used in the sentence recognition test were excluded). Three types of stimuli (i.e., full-, telephone-, and non-telephone-band stimuli) were generated using the same method described above. Three steady-state noise maskers were generated by spectrally matching white noise to the long-term average spectrum of the full-, telephone-, or non-telephone-band sentences, respectively (Fig. 2, gray lines). Five target-masker combinations were tested: full-band target with full-band masker [Fig. 2(a)], telephone-band target with full-band [Fig. 2(b)] or telephone-band [Fig. 2(d)] masker, and non-telephone-band target with full-band [Fig. 2(c)] or non-telephone-band [Fig. 2(e)] masker.
During the test, five participants (25 ± 2 years; one female) verbally repeated as many words as possible from a presented sentence while ignoring the noise. A correct response was recorded when all five keywords of the sentence were correctly identified. A one-down, one-up adaptive procedure was used to measure the SRT (Levitt, 1971). The target sentence was calibrated to and fixed at an 80-dB SPL, while the noise level was adjusted on each trial based on the participant's response. The SRT was measured following the same procedure as described in Nilsson (1994). The sentence intelligibility in noise test took about 40 min to complete.
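The one-down, one-up rule converges on the 50% point of the psychometric function. The simulation below is purely illustrative: the logistic listener model, step size, trial count, and averaging rule are our assumptions, not the exact HINT parameters of Nilsson (1994).

```python
import numpy as np

def track_srt(true_srt, slope=1.0, n_trials=24, start_snr=0.0,
              step=2.0, rng=None):
    """One-down, one-up adaptive track: SNR decreases after a correct
    response and increases after an error, converging on the SNR that
    yields 50% correct (Levitt, 1971). A logistic psychometric function
    stands in for the listener; all parameters here are illustrative."""
    rng = rng if rng is not None else np.random.default_rng(0)
    snr, visited = start_snr, []
    for _ in range(n_trials):
        # probability of repeating all keywords correctly at this SNR
        p_correct = 1.0 / (1.0 + np.exp(-slope * (snr - true_srt)))
        visited.append(snr)
        snr += -step if rng.random() < p_correct else step
    return float(np.mean(visited[4:]))   # discard initial approach trials
```

Averaging the visited SNRs after the initial descent gives an unbiased estimate of the 50% point because the up and down steps are equally likely there.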
2.4 Talker identification
In the talker identification task, five participants (21 ± 3 years; two females) identified talkers regardless of stimuli. Ten talkers produced three types of speech materials. In talker identification using consonants, ten adults (five females) produced three isolated consonants (/v/, /z/, /sh/, without consonant-vowel transition, selected from Shannon, 1999). The three fricative consonants—two voiced and one voiceless—were selected for their relatively long duration after removing the consonant-vowel transition from the original token. In talker identification using vowels, ten talkers (three adult females, three adult males, two female children, and two male children) produced three isolated vowels (/æ/, /o/, /i/, selected from Hillenbrand, 1995). These vowels were selected for their high/high, high/low, and low/high F1/F2 values (Vongphoe and Zeng, 2005). The ten talkers were different from those used in the speech intelligibility test. In talker identification using sentences, ten adults (five females) spoke the same three sentences from the voice cloning toolkit (VCTK) Corpus (Yamagishi, 2019).
All stimuli with an 11-kHz bandwidth were re-sampled and normalized to an 80-dB SPL. These full-band stimuli were then spectrally filtered to generate the telephone- and non-telephone-band stimuli. Overall, the talker identification test used 90 tokens for each stimulus type (10 talkers × 3 bands × 3 stimuli). Figure 3 displays the stimuli's long-term spectra.
In the talker identification test, all ten talkers were displayed on a computer screen, and the participants selected the talker after they listened to the stimulus on each trial. The same procedure was employed with ten different talkers for consonants, vowels, and sentences, respectively. A training session using the full-band stimuli was conducted prior to formal testing to help the participants learn the talker identities. Trial-by-trial feedback was provided during training. The participants were trained until they reached asymptotic performance. The participants then performed the talker identification test without any feedback. The order of speech material and band was counterbalanced across the participants. The talker identification test took about 1.5 h to complete.
2.5 Objective measure of speech intelligibility and quality
The short-time objective intelligibility, or STOI, was used as an objective measure of speech intelligibility (Taal, 2011). The STOI estimates the intelligibility of degraded speech based on the correlation of temporal envelopes in short time-frequency segments between the degraded speech and its original clean reference. The STOI score ranges from 0 to 1, corresponding to totally unintelligible speech (0) and fully intelligible clean speech (1).
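STOI itself uses a third-octave filterbank, 384-ms segments, and clipping of the degraded envelope; the toy function below captures only the core idea named above, short-time envelope correlation, and is not the published metric. The function name and all parameters are our own choices for illustration.

```python
import numpy as np

def envelope_similarity(clean, degraded, fs, frame_ms=16, seg_frames=30):
    """Toy intelligibility score: correlate short-time RMS envelopes of a
    degraded signal and its clean reference over successive segments, then
    average the per-segment correlations (a simplified STOI-like idea)."""
    n = int(fs * frame_ms / 1000)                 # samples per frame
    m = min(len(clean), len(degraded)) // n       # whole frames available
    env = lambda x: np.sqrt(np.mean(np.asarray(x, float)[:m * n]
                                    .reshape(m, n) ** 2, axis=1))
    ec, ed = env(clean), env(degraded)
    scores = []
    for i in range(0, m - seg_frames + 1, seg_frames):
        a = ec[i:i + seg_frames] - ec[i:i + seg_frames].mean()
        b = ed[i:i + seg_frames] - ed[i:i + seg_frames].mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom > 0:                             # skip flat/silent segments
            scores.append(float(np.dot(a, b) / denom))
    return float(np.mean(scores)) if scores else 0.0
```

A clean signal compared with itself scores near 1, while comparison with unrelated noise scores near 0, mirroring the qualitative behavior of the real metric.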
The perceptual evaluation of speech quality, or PESQ, was used as an objective measure of speech quality (Rix, 2001). The PESQ also requires a clean speech signal to estimate speech quality. The PESQ score ranges from –0.5 to 4.5, corresponding to a subjective mean opinion score from 0 (bad) to 5 (excellent). For both STOI and PESQ, the full-band sentences were used as the clean reference, and the telephone- and non-telephone-band sentences were used as the degraded signals.
2.6 Data analysis
A within-subjects repeated-measures analysis of variance (ANOVA) was used to compare performance across bands for each speech material. Information transmission analysis was conducted to evaluate the information received on speech features for consonants and vowels (Miller and Nicely, 1955). The information transmission analysis followed the sequential information analysis (SINFA) procedure (Wang and Bilger, 1973). A confusion matrix was extracted from the trial-by-trial data and was summed and averaged across participants. Consonant features included voicing, manner, and place, and vowel features were duration, F1, and F2 (Xu, 2005). All analyses were conducted using R version 4.2.0.
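The basic Miller and Nicely (1955) computation, relative transmitted information from a stimulus-by-response confusion matrix, can be sketched as follows. The full SINFA procedure of Wang and Bilger (1973) additionally partials out features sequentially; that step is omitted here, and the function name is ours.

```python
import numpy as np

def relative_info_transmitted(confusion):
    """Relative transmitted information (Miller and Nicely, 1955) from a
    confusion matrix with rows = stimuli and columns = responses.
    Returns T(x;y) / H(x), ranging from 0 (chance) to 1 (perfect)."""
    p = np.asarray(confusion, float)
    p = p / p.sum()                      # joint stimulus-response probabilities
    pi = p.sum(axis=1)                   # stimulus (input) probabilities
    pj = p.sum(axis=0)                   # response (output) probabilities
    mask = p > 0
    T = np.sum(p[mask] * np.log2(p[mask] / (pi[:, None] * pj[None, :])[mask]))
    H = -np.sum(pi[pi > 0] * np.log2(pi[pi > 0]))   # input entropy
    return float(T / H)
```

A diagonal confusion matrix (no errors) yields 1.0, and a uniform matrix (responses independent of stimuli) yields 0.0; feature-level scores are obtained by collapsing the matrix over each feature's categories before applying the same formula.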
3. Results
3.1 Speech intelligibility
Figure 1 shows the percentage of correct speech intelligibility results for consonants, vowels, and sentences under the full-band [panel (d)], telephone-band [panel (e)], and non-telephone-band [panel (f)] conditions. There was a significant effect of band for each stimulus [consonant, F(2, 8) = 33.55, P < 0.001; vowel, F(2, 8) = 151.50, P < 0.0001; sentence, F(2, 8) = 33.38, P < 0.001]. The full-band condition produced 97.6% correct recognition for consonants, 80.1% for vowels, and 98.6% for sentences [Fig. 1(d)]. The relatively low performance for vowels was due to two confusions: participants from California could not differentiate between /ɑ/ and /ɔ/ (Labov, 1998), and they confused /ʌ/ with /ɑ/, which have similar F1 and F2 values and are distinguished mainly by duration (Xu, 2005). Compared with the full-band condition, the telephone-band stimuli produced similar performance [Fig. 1(e)] (consonant, 93.8%, t = –1.40, P = 0.40; vowel, 81.8%, t = 0.40, P = 0.90; sentence, 98.6%, t = –0.001, P = 1.00). In contrast, the non-telephone-band stimuli produced significantly poorer performance than the full-band stimuli [Fig. 1(f)] (consonant, 76.0%, t = –7.70, P < 0.001; vowel, 11.5%, t = –14.10, P < 0.0001; sentence, 77.4%, t = –7.10, P < 0.001). Note that the 11.5% correct recognition for the non-telephone-band vowels was barely above chance (chance level = 8.3%).
Figure 1(g) shows the percentage of information received for consonants. The full-band stimuli supported nearly perfect information transmission for all three features (voicing, 96.5%; manner, 95.2%; place, 95.9%). The telephone-band stimuli provided slightly lower information transmission (voicing, 95.5%; manner, 91.3%; place, 86.4%). The non-telephone band provided much lower information transmission on all consonant features than the full band, with the lowest information being received for the place feature (voicing, 74.1%; manner, 70.1%; place, 47.5%). Compared to consonants, the information received on vowel features was much lower [Fig. 1(h)]. The full- and telephone-band stimuli produced comparable information transmission for vowels (duration, 66.4% vs 64.9%; F1, 77.1% vs 76.2%; F2, 75.7% vs 74.8%). In contrast, the non-telephone-band stimuli transferred virtually no information on any of the three features (duration, 0.4%; F1, 2.6%; F2, 3.5%).
3.2 Sentence recognition in noise
Figure 2 displays the relative difference in presentation levels between the target and masker at which 50% speech intelligibility was obtained for the five stimulus band conditions. There was a significant main effect of band [F(4, 16) = 67.42, P < 0.0001]. For noise with a full-band speech spectrum [Figs. 2(a)–2(c)], the full-band and telephone-band sentences yielded comparable speech reception thresholds (mean = –2.9 vs –0.9 dB SNR, t = 0.94, P = 0.84). Surprisingly, the non-telephone-band sentences produced a 23.8-dB worse threshold than the full-band sentences (mean = 20.9 dB, t = 10.89, P < 0.0001).
When the target and the noise had the same bandwidth [Figs. 2(d) and 2(e)], the telephone- and full-band speech produced similar SRTs (mean = –3.5 vs –2.9 dB, t = –0.29, P = 0.99). The non-telephone-band speech produced nearly identical SRTs with the non-telephone-band and full-band noise (mean = 20.5 vs 20.9 dB, t = 0.21, P = 0.99).
3.3 Talker identification
Figure 3 shows the talker identification results for consonants, vowels, and sentences under the full-band [panel (d)], telephone-band [panel (e)], and non-telephone-band [panel (f)] conditions. The effect of band was not significant for consonants [F(2, 8) = 1.35, P = 0.31] or sentences [F(2, 8) = 2.15, P = 0.18], but for opposite reasons. Band did not affect talker identification with consonants because performance was near chance in all conditions (16.0%, 13.3%, and 10.8% under the full-, telephone-, and non-telephone-band conditions, respectively). For sentences, band was not a significant factor because performance was uniformly good (83.1%, 74.9%, and 74.5%). However, talker identification with vowels depended on band [F(2, 8) = 103.01, P < 0.0001]. Similar performance was found between the full and telephone bands (85.0% vs 72.3%, t = 2.53, P = 0.051), whereas poorer performance was found in the non-telephone band (45.8%, t = –9.52, P < 0.0001; t = –14.06, P < 0.0001). See the supplementary material for a summary of the behavioral experiments.
3.4 Objective measure of sentence intelligibility and quality
Figures 4(a) and 4(b) display the STOI and PESQ scores for the sentences used in the speech intelligibility and talker identification experiments. The telephone-band sentences yielded higher STOI and PESQ scores than the non-telephone-band sentences (STOI, 0.90 vs 0.66; PESQ, 2.43 vs 1.40). While the STOI score of the telephone-band sentences (0.90) indicates nearly perfect speech intelligibility (i.e., 100% correct), the PESQ score (2.43) only approximates a "fair" speech quality on the mean opinion scale.
Figure 4(c) shows that the STOI score increases as a function of SNR for all band conditions. The STOI score depends primarily on the speech bandwidth, with the full-, telephone-, and non-telephone-band speech increasing from 0.3 to 0.4 at low SNRs to asymptotic scores of 1, 0.8, and 0.6 at high SNRs, respectively. However, the STOI score does not depend on noise bandwidth, as shown by the minimal difference between the full-band noise and the bandwidth-matched noise.
4. Discussion
The non-telephone band produced overall good performance for speech recognition in quiet and talker identification, but poor speech-in-noise performance. This result is partially consistent with our hypotheses. There was a dichotomy in the performance between consonants and vowels: The non-telephone band contained adequate information for recognition of consonants (76.0%), but not vowels (11.5%), whereas the opposite result was obtained for talker identification using consonants (10.8%) and vowels (45.8%). Despite this dichotomy, sentences, which consist of both consonants and vowels, provided super-combinatory information for good performance in both intelligibility (77.4%) and talker identification (74.5%). However, this super-combination did not translate into good performance in noise. Here, we discuss the acoustic and perceptual mechanisms underlying the dichotomy between consonants and vowels, the super-combination, and the noise susceptibility in the non-telephone band.
4.1 Consonant vs vowel dichotomy
Good intelligibility of the non-telephone-band consonants was expected because Lippmann (1996) reported ∼90% consonant intelligibility for a similar band-stop manipulation (<0.8 kHz, >3.15 kHz). Moreover, consonant features, especially voicing and manner, are extremely robust and distributed across frequencies (e.g., Van Tasell, 1987). On the contrary, vowel intelligibility depends critically on the first and second formants (Carlson, 1970; Carlson, 1975), whose frequencies lie mostly within the telephone band (Peterson and Barney, 1952). Indeed, most vowels in the non-telephone band were perceived as /i/ in "heed," which is the only vowel that has a low F1 (200–300 Hz) and a high F2 (3.1–3.5 kHz).
While the non-telephone band produced good consonant intelligibility, the noise-like bursts in the isolated consonants (/v/, /z/, /sh/) provided negligible information regarding talker identity. On the contrary, the unintelligible vowels (/æ/, /o/, /i/) in the non-telephone band still contained not only the F0, which is lower than 300 Hz, but also high harmonics (>3 kHz) that could be used to reconstruct F0. This F0 information likely contributed to the 45.8% correct talker identification using the non-telephone-band vowels.
The relative contributions of consonants and vowels to speech intelligibility and talker identity have been a long-standing topic of debate (e.g., Ladefoged, 2001; Owren and Cardillo, 2006; Fogerty and Humes, 2012). The present results show that this consonant-vowel dichotomy is frequency dependent: at least for the non-telephone band, consonants contribute to intelligibility, whereas vowels contribute to talker identity.
4.2 Combination vs redundancy
Sentences (S) consist of consonants (C) and vowels (V). The simplest combination model is S = C + V, whereas the simplest redundancy model is S = max(C, V). Our data, except for talker identification with the non-telephone band, were consistent with the redundancy model. Speech redundancy is well known, as evidenced by relatively high intelligibility with "consonant-only" or "vowel-only" words and sentences (e.g., Cole, 1996; Owren and Cardillo, 2006). Additionally, sentences contain contextual information that can help restore the missing spectral information. It is interesting to note that such redundancy, or high sentence intelligibility, can be obtained with either band-stop stimuli like the non-telephone band or narrowly filtered bandpass stimuli (Stickney and Assmann, 2001; Warren, 1995).
Even the combination model cannot fully explain the exceptional talker identification using the non-telephone-band sentences. The 74.5% correct performance is 17.9 percentage points higher than 56.6%, the sum of the consonant (10.8%) and vowel (45.8%) performances. One reason for this disparity may be limitations in the stimuli: the sentences contain more information than the selected consonants and vowels used in the experiments, so other consonants, formant transitions, syllable durations, and rhythmic patterns may have contributed to talker identification. Another reason is similar to the super-combination of two independent yet complementary cues, such as the low-frequency acoustic cue (i.e., <300 Hz) and cochlear implant stimulation (Chang, 2006).
4.3 Why is the non-telephone-band speech susceptible to noise?
A surprising finding of the present study was the 23.8-dB difference in speech reception thresholds between the full and non-telephone bands (Fig. 2). One reason for this difference was the baseline performance in quiet: 98.6% correct for the full band and 77.4% for the non-telephone band. This 21.2-percentage-point baseline difference translates into an 8-dB difference in speech reception threshold (Smits, 2021), still leaving a 15.8-dB (23.8 – 8 dB) full-band vs non-telephone-band difference unexplained. Note that such a conversion may not be accurate given the ceiling effect observed in quiet. Auditory object formation and separation might explain this remaining difference. In the full- and telephone-band stimuli, harmonics below 3 kHz in voiced speech form an auditory object that allows a listener to easily separate speech from noise. In contrast, the intelligibility-carrying high frequencies (>3 kHz) in the non-telephone-band speech sound like noise, thus requiring high SNRs to separate the two.
4.4 Implications
The present study has several implications. First, the relatively high speech intelligibility of the non-telephone band in quiet may be useful to special populations. For example, high-frequency speech similar to the non-telephone high band may be explicitly extracted and delivered to the rare listeners with residual hearing at high frequencies, or to those with auditory neuropathy, which selectively impairs temporal processing at low frequencies (Starr, 1996; Zeng, 2005). Second, the dichotomy between consonants and vowels may be exploited to deliver non-telephone-band speech to those using electro-acoustic stimulation (vonIlberg, 1999) so that the combined sounds can produce a super-additive benefit for speech intelligibility. Third, compared with the full band, both the telephone and non-telephone bands produced relatively low performance in talker identification. The objective quality measure also showed low PESQ scores—2.43 and 1.40 out of 4.5—for the telephone and non-telephone bands, respectively. Recent advances in generative AI (Tian, 2017) may be used to restore or even improve the quality of band-limited sounds.
4.5 Limitations
The current study has several limitations that could have weakened the findings. First, the sample size (n = 5 per task) was small; it revealed only major effects and likely missed relatively smaller effects. Second, there was only one talker for the sentence stimuli used in the speech intelligibility test. Although the task goal was to identify the speech content, the lack of talker variability could have reduced the task difficulty and led to overestimated sentence intelligibility.
5. Conclusions
(1) The non-telephone frequencies in sentence materials produced good sentence intelligibility (77.4%) and talker identification (74.5%). (2) Consonants contributed to speech intelligibility, while vowels contributed to talker identification. (3) The non-telephone-band speech is susceptible to noise, possibly due to the similarity between the high-frequency sounds in the non-telephone band and noise.
Supplementary Material
See the supplementary material for a summary of the behavioral experiments.
Acknowledgments
We express our gratitude to Antoinette Abdelmalek, Tianyi Jia, and Cindy Hoan-Tran for participating in the pilot study. This research was supported by the Center for Hearing Research at the University of California Irvine.
Author Declarations
Conflict of Interest
The authors have no conflicts to disclose.
Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.