Although the telephone band (0.3–3 kHz) provides sufficient information for speech recognition, the contribution of the non-telephone band (<0.3 and >3 kHz) is unclear. To investigate its contribution, speech intelligibility and talker identification were evaluated using consonants, vowels, and sentences. The non-telephone band produced relatively good intelligibility for consonants (76.0%) and sentences (77.4%), but not vowels (11.5%). The non-telephone band supported good talker identification with sentences (74.5%), but not with vowels (45.8%) or consonants (10.8%). Furthermore, the non-telephone band did not produce satisfactory speech intelligibility in noise at the sentence level, suggesting the importance of full-band access in realistic listening.

Speech sounds carry a wealth of information regarding a speaker's intelligibility, identity, and emotions. Speech perception is also remarkably robust against not only distortions in amplitude, time, and frequency, but also noise, reverberation, and other challenging environments (Licklider et al., 1948; Saberi and Perrott, 1999; Warren et al., 1995). A paradigm shift in the transmission of acoustic information was the invention of the telephone, which has fundamentally changed speech communication from exclusively in-person interactions to additional remote exchanges. Although speech contains energy in high frequencies up to 15 kHz, early telephones adopted a narrow bandwidth between 0.3 and 3 kHz to optimally balance transmission efficiency and speech intelligibility (Fletcher, 1953; French and Steinberg, 1947).

The adoption of this narrow telephone bandwidth has not only paved the way for pervasive usage of modern telecommunication, but also established a prevailing dogma in speech research, asserting the paramount importance of the 0.3- to 3-kHz band in speech intelligibility (ANSI, 1997; Bell et al., 1992; DePaolis et al., 1996; Fletcher and Galt, 1950; French and Steinberg, 1947; Studebaker et al., 1987). The frequencies outside the 0.3- to 3-kHz band, on the other hand, have been mostly assumed to be responsible for speech quality or talker identity, but not intelligibility (Monson et al., 2011; Moore et al., 2008; Olson, 1947; Rodman, 2003; Schwartz, 1968). Consequently, contemporary wideband telephony uses frequencies outside the 0.3- to 3-kHz range to enhance speech quality and talker identification or mitigate listening fatigue (e.g., Rodman, 2003).

A few studies have shown that energy above 3 kHz contributes to speech intelligibility (Lippmann, 1996; Monson et al., 2019; Motlagh Zadeh et al., 2019), but the narrowband speech perception dogma persists, and a systematic study quantifying the relative contribution of the non-telephone bandwidth is lacking. Understanding the role of the non-telephone bandwidth can not only bridge this knowledge gap, but also facilitate practical applications such as speech delivery through alternative frequencies. Such an implementation would be particularly beneficial for listeners who have limited access to full-band information. The present study quantifies the contribution of the non-telephone bandwidth (i.e., <0.3 or >3 kHz) to speech intelligibility and talker identification using three types of speech materials: consonants, vowels, and sentences. Considering that most everyday conversations occur in noise, we conducted an additional test to measure the effect of noise on sentence intelligibility using the full-, telephone-, and non-telephone-band stimuli. Finally, we subjected the sentences to two modern objective measures (Rix et al., 2001; Taal et al., 2011) that are typically used to evaluate artificial intelligence (AI)-generated speech for intelligibility and quality. These two measures were compared against the human subjective evaluations to provide insights into potential AI algorithms for improving speech perception outcomes. We hypothesized that non-telephone-band speech contains sufficient information to support good speech intelligibility in quiet and in noise, as well as good talker identification.

A total of 15 adults with normal hearing [<20-dB hearing level (HL) at audiometric frequencies from 125 to 8000 Hz] and native English proficiency participated in the study. Informed consent was obtained from all participants under a human subject protocol approved by the University of California Irvine Institutional Review Board.

In the speech intelligibility task, five participants (22 ± 4 years old; three females) recognized the three types of stimuli regardless of talker. The consonant stimuli consisted of 20 consonants spoken by two males and two females in the a/C/a structure (Shannon et al., 1999). The vowel stimuli consisted of 16 vowels spoken by two males and two females in the h/V/d structure (Hillenbrand et al., 1995). The sentence stimuli were 250 sentences spoken by a male talker (Hearing in Noise Test [HINT] sentences; Nilsson et al., 1994). These sentences were divided into 25 equally intelligible lists with ten sentences per list. Each sentence had five keywords. The original bandwidths were 11, 11, and 8 kHz for consonants, vowels, and sentences, respectively; the sentences had a narrower bandwidth because an anti-aliasing low-pass filter at 8 kHz had been applied (Nilsson et al., 1994). All speech materials were resampled to 22.05 kHz and then root-mean-square (RMS) equalized. Three speech-spectrum-shaped noise stimuli were generated to match the long-term spectrum of the consonant, vowel, and sentence materials, respectively, and to calibrate all stimuli to an 80-dB sound pressure level (SPL) presentation level. These calibrated stimuli were used as the full-band stimuli.
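
For illustration, the resampling and RMS-equalization steps could be scripted as in the following Python sketch. This is not the authors' code: the use of the soundfile and scipy packages and the reference RMS value are assumptions, and the actual 80-dB SPL calibration was performed at the playback chain rather than in software.

```python
from math import gcd

import numpy as np
import soundfile as sf  # assumed I/O library
from scipy.signal import resample_poly


def prepare_token(path, fs_target=22050, rms_target=0.05):
    """Resample a speech token to fs_target and equalize its RMS level.

    rms_target is an arbitrary digital reference; the absolute level
    (80 dB SPL) is set at the calibrated playback chain.
    """
    x, fs = sf.read(path)
    if x.ndim > 1:                    # collapse stereo to mono if needed
        x = x.mean(axis=1)
    if fs != fs_target:               # polyphase rational-rate resampling
        g = gcd(fs_target, fs)
        x = resample_poly(x, fs_target // g, fs // g)
    rms = np.sqrt(np.mean(x ** 2))
    return x * (rms_target / rms)     # equal RMS across all tokens
```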

Both the telephone-band and non-telephone-band stimuli were generated by applying 8th-order Butterworth low- or high-pass filters to the full-band stimuli. The telephone-band stimuli preserved energy between 0.3 and 3.15 kHz, while the non-telephone-band stimuli contained only energy below 0.3 kHz and above 3.15 kHz. Figure 1 displays the long-term spectra of the three stimulus types for the full [panel (a)], telephone [panel (b)], and non-telephone [panel (c)] bands. In total, 240 consonant tokens (20 consonants × 4 talkers × 3 bands), 192 vowel tokens (16 × 4 × 3), and 750 sentences (250 × 1 × 3) were constructed.
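
A minimal sketch of this band-filtering step, assuming scipy and zero-phase filtering (whether the original filters were applied forward-only or zero-phase is not stated in the text):

```python
from scipy.signal import butter, sosfiltfilt

FS = 22050  # stimulus sampling rate (Hz)

# 8th-order Butterworth filters in second-order-section form
sos_lp_300 = butter(8, 300.0, btype="low", fs=FS, output="sos")
sos_hp_300 = butter(8, 300.0, btype="high", fs=FS, output="sos")
sos_lp_3150 = butter(8, 3150.0, btype="low", fs=FS, output="sos")
sos_hp_3150 = butter(8, 3150.0, btype="high", fs=FS, output="sos")


def telephone_band(x):
    """Keep only energy between 0.3 and 3.15 kHz (cascaded LP and HP)."""
    return sosfiltfilt(sos_hp_300, sosfiltfilt(sos_lp_3150, x))


def non_telephone_band(x):
    """Keep only energy below 0.3 kHz plus energy above 3.15 kHz."""
    return sosfiltfilt(sos_lp_300, x) + sosfiltfilt(sos_hp_3150, x)
```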

Fig. 1.

(a)–(c) Long-term average spectra of consonants (blue), vowels (dark red), and sentences (orange) under (a) full-band, (b) telephone-band, and (c) non-telephone-band conditions. The gray solid line represents the schematic filter shape for the telephone-band (mid) and the non-telephone-band (lower) conditions, respectively. (d)–(f) Mean intelligibility scores for consonant (C), vowel (V), and sentence (S) recognition under (d) full-band, (e) telephone-band, and (f) non-telephone-band conditions. The horizontal dashed lines represent the chance level performance for consonants (5%), vowels (8%), and sentences (0%). Error bars represent 1 standard deviation from the mean. (g) Percentage of information received on speech features for consonants. (h) Percentage of information received on speech features for vowels.


In consonant recognition, participants selected the token they heard on each trial by clicking on 1 of the 20 consonant alternatives displayed on a computer screen. In vowel recognition, the participants performed the same task, except they selected from 1 of the 16 vowel alternatives. A training session was conducted with full-band consonants and vowels prior to the formal testing to help participants map the speech tokens to their written forms on the computer screen. Trial-by-trial feedback was provided during training. Participants were required to achieve at least 80% correct performance in training before proceeding to the formal testing. On average, three training sessions were needed for participants to reach ≥80% correct performance. In the formal testing session, no feedback was provided. All participants identified all vowel and consonant tokens in the testing session. The stimulus order was counterbalanced across participants.

In sentence recognition, participants verbally repeated as many words as possible from a sentence presented on each trial. A correct response was recorded only when all five keywords in a sentence were correctly identified. A training session using ten full-band sentences (from the same list) was conducted prior to the testing session. Trial-by-trial feedback was provided during training. The participants were required to identify at least eight out of ten sentences in the training session to qualify for the testing session (i.e., 80% correct). Only one training session was needed for all participants to reach the ≥80% correct criterion. In the testing session, all participants identified 60 sentences (i.e., 2 lists × 10 sentences × 3 bands) that were not used in the training session. The stimulus order was counterbalanced across participants. Altogether, the speech intelligibility test took about 1.5 h to complete.

The speech reception threshold (SRT) was obtained as the signal-to-noise ratio (SNR) at which 50% of the sentences were correctly identified. Target sentences were HINT sentences, excluding the lists used in the sentence recognition test in quiet. Three types of stimuli (i.e., full-, telephone-, and non-telephone-band stimuli) were generated using the same method described above. Three steady-state noise maskers were generated by spectrally matching white noise to the long-term average spectrum of the full-, telephone-, or non-telephone-band sentences, respectively (Fig. 2, gray lines). Five target-masker combinations were tested: full-band target with full-band masker [Fig. 2(a)], telephone-band target with full-band [Fig. 2(b)] or telephone-band [Fig. 2(d)] masker, and non-telephone-band target with full-band [Fig. 2(c)] or non-telephone-band [Fig. 2(e)] masker.
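
One plausible way to generate such spectrally matched maskers is to shape white noise by the long-term average spectrum (LTAS) of the speech material, as in the sketch below; the FFT size, windowing, and interpolation choices are assumptions, not details given in the text.

```python
import numpy as np


def speech_shaped_noise(speech, n_fft=2048, rng=None):
    """Generate steady-state noise whose spectrum matches the
    long-term average spectrum of `speech` (1-D float array)."""
    rng = np.random.default_rng() if rng is None else rng
    # long-term average magnitude spectrum from windowed frames
    n_frames = len(speech) // n_fft
    frames = speech[: n_frames * n_fft].reshape(n_frames, n_fft)
    ltas = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)).mean(axis=0)
    # shape white noise by the LTAS in the frequency domain
    noise = rng.standard_normal(len(speech))
    spec = np.fft.rfft(noise)
    bins_ltas = np.linspace(0.0, 1.0, ltas.size)
    bins_noise = np.linspace(0.0, 1.0, spec.size)
    spec *= np.interp(bins_noise, bins_ltas, ltas)
    shaped = np.fft.irfft(spec, n=len(speech))
    # equalize RMS to the speech so level offsets map directly to SNR
    return shaped * np.sqrt(np.mean(speech ** 2) / np.mean(shaped ** 2))
```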

Fig. 2.

Long-term frequency spectra of the target and masker at the mean speech reception threshold for (a) full-band, (b) telephone-band, and (c) non-telephone-band sentences in full-band noise, and for (d) telephone-band sentences in telephone-band noise and (e) non-telephone-band sentences in non-telephone-band noise.


During the test, five participants (25 ± 2 years; one female) verbally repeated as many words as possible from a presented sentence while ignoring the noise. A correct response was recorded when all five keywords of the sentence were correctly identified. A one-down, one-up adaptive procedure was used to measure the SRT (Levitt, 1971). The target sentence was fixed at 80 dB SPL, while the noise level was adjusted on each trial based on the participant's response. The SRT was calculated following the same procedure as described in Nilsson et al. (1994). The sentence intelligibility in noise test took about 40 min to complete.
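
The adaptive track can be sketched as follows; since the target level was fixed and the noise level varied, the code tracks SNR directly, which is equivalent. The starting SNR, step size, and averaging rule are illustrative assumptions (Nilsson et al., 1994, uses a larger step for the first few sentences and a specific averaging window).

```python
def measure_srt(present_trial, sentences, start_snr_db=0.0, step_db=2.0):
    """One-down, one-up adaptive track converging on the 50% SRT
    (Levitt, 1971).

    present_trial(sentence, snr_db) must play the sentence in noise at
    snr_db and return True only if all five keywords were repeated
    correctly.
    """
    snr = start_snr_db
    visited = []
    for sentence in sentences:
        visited.append(snr)
        if present_trial(sentence, snr):
            snr -= step_db  # correct -> make the next trial harder
        else:
            snr += step_db  # incorrect -> make it easier
    # average the SNRs of later trials as a simple SRT estimate
    tail = visited[4:]
    return sum(tail) / len(tail)
```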

In the talker identification task, five participants (21 ± 3 years; two females) identified talkers regardless of the stimulus. Ten talkers produced each of the three types of speech materials. In talker identification using consonants, ten adults (five females) produced three isolated consonants (/v/, /z/, /sh/, without consonant-vowel transition, selected from Shannon et al., 1999). These three fricatives (two voiced, one voiceless) were selected for their relatively long duration after the consonant-vowel transition was removed from the original token. In talker identification using vowels, ten talkers (three adult females, three adult males, two female children, and two male children) produced three isolated vowels (/æ/, /o/, /i/, selected from Hillenbrand et al., 1995). These vowels were selected for their high/high, high/low, and low/high F1/F2 values (Vongphoe and Zeng, 2005). The ten talkers were different from those used in the speech intelligibility test. In talker identification using sentences, ten adults (five females) spoke the same three sentences from the Voice Cloning Toolkit (VCTK) Corpus (Yamagishi et al., 2019).

All stimuli, with an 11-kHz bandwidth, were resampled and normalized to an 80-dB SPL presentation level. These full-band stimuli were then spectrally filtered to generate the telephone- and non-telephone-band stimuli. Overall, the talker identification test used 90 tokens for each speech material (10 talkers × 3 bands × 3 tokens). Figure 3 displays the stimuli's long-term spectra.

Fig. 3.

(a)–(c) Long-term average spectra of consonants (blue), vowels (dark red), and sentences (orange) under (a) full-band, (b) telephone-band, and (c) non-telephone-band conditions in talker identification tests. (d)–(f) Mean talker identification scores using consonants, vowels, and sentences under (d) full-band, (e) telephone-band, and (f) non-telephone-band conditions. The horizontal dashed line represents the chance-level performance (10%). Error bars represent 1 standard deviation from the mean.


In the talker identification test, all ten talkers were displayed on a computer screen, and the participants selected the talker after they listened to the stimulus on each trial. The same procedure was employed with ten different talkers for consonants, vowels, and sentences, respectively. A training session using the full-band stimuli was conducted prior to formal testing to help the participants learn the talker identities. Trial-by-trial feedback was provided during training. The participants were trained until they reached asymptotic performance. The participants then performed the talker identification test without any feedback. The order of speech material and band was counterbalanced across the participants. The talker identification test took about 1.5 h to complete.

The short-time objective intelligibility (STOI) measure was used as an objective measure of speech intelligibility (Taal et al., 2011). The STOI estimates the intelligibility of degraded speech based on the correlation of temporal envelopes in short time-frequency segments between the degraded speech and its original clean reference. The STOI score ranges from 0 (totally unintelligible speech) to 1 (100% intelligible clean speech).
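
As a usage sketch, a reference STOI implementation is available in the pystoi Python package; the file names below are placeholders, not files from this study.

```python
# pip install pystoi soundfile
import soundfile as sf
from pystoi import stoi

# placeholder names for a full-band reference and a filtered version
clean, fs = sf.read("sentence_fullband.wav")
degraded, _ = sf.read("sentence_nontelephone.wav")

# compares temporal envelopes in short time-frequency segments
score = stoi(clean, degraded, fs, extended=False)
print(f"STOI = {score:.2f}")  # 0 = unintelligible, 1 = clean speech
```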

The perceptual evaluation of speech quality (PESQ) measure was used as an objective measure of speech quality (Rix et al., 2001). The PESQ also requires a clean speech signal to estimate speech quality. The PESQ score ranges from –0.5 to 4.5, corresponding to a subjective mean opinion score from 0 (bad) to 5 (excellent). For both STOI and PESQ, the full-band sentences were used as the clean reference, and the telephone- and non-telephone-band sentences were used as the degraded signals.
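
A corresponding PESQ sketch using the pesq Python package; because this implementation accepts only 8- or 16-kHz input, the 22.05-kHz stimuli are first resampled (file names are again placeholders).

```python
# pip install pesq soundfile scipy
import soundfile as sf
from scipy.signal import resample_poly
from pesq import pesq

ref, fs = sf.read("sentence_fullband.wav")   # clean reference
deg, _ = sf.read("sentence_telephone.wav")   # degraded signal

# 22050 Hz -> 16000 Hz (ratio 320/441) for wideband PESQ
ref16 = resample_poly(ref, 320, 441)
deg16 = resample_poly(deg, 320, 441)

score = pesq(16000, ref16, deg16, "wb")      # wideband mode
print(f"PESQ = {score:.2f}")                 # range -0.5 to 4.5
```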

Within-subjects repeated-measures analysis of variance (ANOVA) was used to compare performance across bands for each speech material. Information transmission analysis was conducted to evaluate the information received on speech features for consonants and vowels (Miller and Nicely, 1955), following the sequential information analysis (SINFA) procedure (Wang and Bilger, 1973). A confusion matrix was extracted from the trial-by-trial data, then summed and averaged across participants. Consonant features included voicing, manner, and place; vowel features were duration, F1, and F2 (Xu et al., 2005). All analyses were conducted using R version 4.2.0.
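
For a single feature, the information transmitted can be computed from the confusion matrix as the mutual information between stimulus and response normalized by the stimulus entropy (Miller and Nicely, 1955). The sketch below shows only this basic computation; the full SINFA procedure (Wang and Bilger, 1973) additionally partials out features that have already been extracted, which is not reproduced here.

```python
import numpy as np


def relative_info_transmitted(confusion):
    """Relative information transmitted for one feature.

    `confusion` is a counts matrix with rows indexed by the presented
    feature value and columns by the responded feature value.
    Returns T(x; y) / H(x), i.e., the proportion of stimulus
    information received (0 to 1).
    """
    p = confusion / confusion.sum()
    px = p.sum(axis=1)                 # stimulus marginal
    py = p.sum(axis=0)                 # response marginal
    joint_indep = np.outer(px, py)
    mask = p > 0
    mutual_info = np.sum(p[mask] * np.log2(p[mask] / joint_indep[mask]))
    entropy_x = -np.sum(px[px > 0] * np.log2(px[px > 0]))
    return mutual_info / entropy_x
```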

Figure 1 shows the percent-correct speech intelligibility results for consonants, vowels, and sentences under full-band [panel (d)], telephone-band [panel (e)], and non-telephone-band [panel (f)] conditions. There was a significant effect of band for each stimulus [consonant, F(2, 8) = 33.55, P < 0.001; vowel, F(2, 8) = 151.50, P < 0.0001; sentence, F(2, 8) = 33.38, P < 0.001]. The full-band condition produced 97.6% correct recognition for consonants, 80.1% for vowels, and 98.6% for sentences [Fig. 1(d)]. The relatively low performance for vowels reflected two confusions: participants in California could not differentiate between /ɑ/ and /ɔ/ (Labov, 1998), and /ʌ/ and /ɑ/ have similar F1 and F2 values and are distinguished mainly by duration (Xu et al., 2005). Compared with the full-band condition, the telephone-band stimuli produced similar performance [Fig. 1(e)] (consonant, 93.8%, t = –1.40, P = 0.40; vowel, 81.8%, t = 0.40, P = 0.90; sentence, 98.6%, t = –0.001, P = 1.00). In contrast, the non-telephone-band stimuli produced significantly poorer performance than the full-band stimuli [Fig. 1(f)] (consonant, 76.0%, t = –7.70, P < 0.001; vowel, 11.5%, t = –14.10, P < 0.0001; sentence, 77.4%, t = –7.10, P < 0.001). Note that the 11.5% correct recognition for the non-telephone-band vowels was barely above chance (chance level = 8.3%).

Figure 1(g) shows the percentage of information received for consonants. The full-band stimuli allowed nearly perfect information received on all three features (voicing, 96.5%; manner, 95.2%; place, 95.9%). The telephone-band stimuli provided slightly lower information transferred (voicing, 95.5%; manner, 91.3%; place, 86.4%). The non-telephone band provided much lower information transferred on all consonant features than the full band, with the lowest information being received for the place feature (voicing, 74.1%; manner, 70.1%; place, 47.5%). Compared to consonants, the information received on vowel features was much lower [Fig. 1(h)]. The full- and telephone-band stimuli produced comparable information transferred for vowels (duration, 66.4% vs 64.9%; F1, 77.1% vs 76.2%; F2, 75.7% vs 74.8%). In contrast, the non-telephone-band stimuli could not transfer any significant information regarding any of the three features (duration, 0.4%; F1, 2.6%; F2, 3.5%).

Figure 2 displays the long-term spectra of the target and masker at the relative levels corresponding to 50% speech intelligibility for the five stimulus-band conditions. There was a significant main effect of band [F(4, 16) = 67.42, P < 0.0001]. For noise with a full-band speech spectrum [Figs. 2(a), 2(b), and 2(c)], the full-band and telephone-band sentences yielded comparable speech reception thresholds (mean = –2.9 vs –0.9 dB SNR, t = 0.94, P = 0.84). Surprisingly, the non-telephone-band sentences produced a threshold 23.8 dB worse than that of the full-band sentences (mean = 20.9 dB SNR, t = 10.89, P < 0.0001).

When the target and noise had the same bandwidth [Figs. 2(d) and 2(e)], the telephone-band and full-band speech produced similar SRTs (mean = –3.5 vs –2.9 dB, t = –0.29, P = 0.99). The non-telephone-band speech produced nearly identical SRTs in non-telephone-band and full-band noise (mean = 20.5 vs 20.9 dB, t = 0.21, P = 0.99).

Figure 3 shows the talker identification results for consonants, vowels, and sentences under full-band [panel (d)], telephone-band [panel (e)], and non-telephone-band [panel (f)] conditions. The effect of band was not significant for consonants [F(2, 8) = 1.35, P = 0.31] or sentences [F(2, 8) = 2.15, P = 0.18], but for opposite reasons. Band did not affect talker identification with consonants because performance was near chance throughout (16.0%, 13.3%, and 10.8% under the full-, telephone-, and non-telephone-band conditions, respectively). For sentences, band was not a significant factor because performance was uniformly good (83.1%, 74.9%, and 74.5%). In contrast, talker identification with vowels depended on band [F(2, 8) = 103.01, P < 0.0001]: performance was similar between the full and telephone bands (85.0% vs 72.3%, t = 2.53, P = 0.051), whereas the non-telephone band produced poorer performance (45.8%; t = –9.52, P < 0.0001; t = –14.06, P < 0.0001). See the supplementary material for a summary of the behavioral experiments.

Figures 4(a) and 4(b) display the STOI and PESQ scores for the sentences used in the speech intelligibility and talker identification experiments. The telephone-band sentences yielded higher STOI and PESQ scores than the non-telephone-band sentences (STOI, 0.90 vs 0.66; PESQ, 2.43 vs 1.40). While the STOI score of the telephone-band sentences (0.90) indicates nearly perfect speech intelligibility (i.e., 100% correct), the corresponding PESQ score (2.43) approximates only a "fair" speech quality on the mean opinion scale.

Fig. 4.

(a) Average STOI scores for telephone- and non-telephone-band sentences. (b) Average PESQ scores for telephone- and non-telephone-band sentences. (c) Average STOI scores for full-, telephone-, and non-telephone-band sentences as a function of signal-to-noise ratio. Solid symbols represent the full-band noise, whereas open symbols represent either the telephone- or non-telephone-band noise. Abbreviations: FB, full-band; TB, telephone-band; NTB, non-telephone-band; N, noise; S, sentence.


Figure 4(c) shows that the STOI score increases as a function of SNR for all band conditions. The STOI score depends primarily on the speech bandwidth, with the full-, telephone-, and non-telephone-band speech increasing from 0.3 to 0.4 at low SNRs to asymptotic scores of 1, 0.8, and 0.6 at high SNRs, respectively. However, the STOI score does not depend on noise bandwidth, as shown by the minimal difference between the full-band noise and the bandwidth-matched noise.

The non-telephone band produced overall good performance for speech recognition in quiet and talker identification, but poor speech-in-noise performance. This result is partially consistent with our hypotheses. There was a dichotomy in the performance between consonants and vowels: The non-telephone band contained adequate information for recognition of consonants (76.0%), but not vowels (11.5%), whereas the opposite result was obtained for talker identification using consonants (10.8%) and vowels (45.8%). Despite this dichotomy, sentences, which consist of both consonants and vowels, provided super-combinatory information for good performance in both intelligibility (77.4%) and talker identification (74.5%). However, this super-combination did not translate into good performance in noise. Here, we discuss the acoustic and perceptual mechanisms underlying the dichotomy between consonants and vowels, the super-combination, and the noise susceptibility in the non-telephone band.

Good intelligibility of the non-telephone-band consonants was expected because Lippmann (1996) reported ∼90% consonant intelligibility for a similar band-stop manipulation (<0.8 kHz, >3.15 kHz). Moreover, consonant features, especially voicing and manner, are extremely robust and distributed across frequencies (e.g., Van Tasell et al., 1987). In contrast, vowel intelligibility depends critically on the first and second formants (Carlson et al., 1970; Carlson et al., 1975), whose frequencies lie mostly within the telephone band (Peterson and Barney, 1952). Indeed, most vowels in the non-telephone band were perceived as the /i/ in "heed," which is the only vowel that has a low F1 (200–300 Hz) and a high F2 (3.1–3.5 kHz).

While the non-telephone band produced good consonant intelligibility, the noise-like bursts in the isolated consonants (/v/, /z/, /sh/) provided negligible information regarding talker identity. Conversely, the unintelligible vowels (/æ/, /o/, /i/) in the non-telephone band still contained not only the F0 below 300 Hz, but also high harmonics (>3 kHz) from which F0 could be reconstructed. This F0 information likely contributed to the 45.8% correct talker identification using the non-telephone-band vowels.

The relative contributions of consonants and vowels to speech intelligibility and talker identity have been a long-standing topic of debate (e.g., Ladefoged, 2001; Owren and Cardillo, 2006; Fogerty and Humes, 2012). The present results show a frequency dependency of the consonant-vowel dichotomy: at least for the non-telephone band, consonants contribute to intelligibility, whereas vowels contribute to talker identity.

Sentences (S) consist of consonants (C) and vowels (V). The simplest combination model is S = C + V, whereas the simplest redundancy model is S = max(C, V). Our data, except for talker identification with the non-telephone band, were consistent with the redundancy model. Speech redundancy is well documented, as evidenced by the relatively high intelligibility of "consonant-only" or "vowel-only" words and sentences (e.g., Cole et al., 1996; Owren and Cardillo, 2006). Additionally, sentences contain contextual information that can help restore the missing spectral information. It is interesting to note that this redundancy, or high sentence intelligibility, can be obtained with either band-stop stimuli such as the non-telephone band or narrowly filtered bandpass stimuli (Stickney and Assmann, 2001; Warren et al., 1995).

Even the combination model cannot fully explain the exceptional talker identification using non-telephone-band sentences: the 74.5% correct performance is 17.9 percentage points higher than 56.6%, the sum of the consonant (10.8%) and vowel (45.8%) performances. One reason for this disparity may be limitations in the stimuli. The sentences contain more information than the selected consonants and vowels used in the experiments; other consonants, formant transitions, syllable durations, and rhythmic patterns may have contributed to talker identification. Another reason is a super-combination of two independent yet complementary cues, analogous to combining the low-frequency acoustic cue (i.e., <300 Hz) with cochlear implant stimulation (Chang et al., 2006).

A surprising finding of the present study was the 23.8-dB difference in speech reception thresholds between the full and non-telephone bands (Fig. 2). One reason for this difference was the baseline performance in quiet: 98.6% correct for the full band vs 77.4% for the non-telephone band. This 21.2-percentage-point baseline difference translates into roughly an 8-dB difference in speech reception threshold (Smits et al., 2021), still leaving a 15.8-dB (23.8 − 8 dB) full-band vs non-telephone-band difference unexplained. Note that this conversion may not be accurate given the ceiling effect observed in quiet. Auditory object formation and separation might explain the remaining difference. In the full- and telephone-band stimuli, harmonics below 3 kHz in voiced speech form an auditory object that allows a listener to easily separate speech from noise. In contrast, the intelligibility-carrying high frequencies (>3 kHz) in the non-telephone-band speech sound like noise, thus requiring high SNRs to separate the two.

The present study has several implications. First, the relatively high speech intelligibility of the non-telephone band in quiet may be useful to special populations. For example, high-frequency speech similar to the non-telephone high band may be explicitly extracted and delivered to the rare individuals with residual hearing at high frequencies, or to those with auditory neuropathy, in whom temporal processing at low frequencies is selectively impaired (Starr et al., 1996; Zeng et al., 2005). Second, the dichotomy between consonants and vowels may be exploited to deliver non-telephone-band speech to users of electro-acoustic stimulation (von Ilberg et al., 1999) so that the combined sounds produce a super-additive benefit for speech intelligibility. Third, compared with the full band, both the telephone and non-telephone bands produced relatively low performance in talker identification, and the objective quality measure showed low PESQ scores (2.43 and 1.40 out of 4.5, respectively). Recent advances in generative AI (Tian et al., 2017) may be used to restore or even improve the quality of band-limited sounds.

The current study has several limitations that could have weakened the findings. First, the sample size (n = 5 per task) was small; it revealed only major effects and likely missed smaller ones. Second, only one talker produced the sentence stimuli used in the speech intelligibility test. Although the task goal was to identify the speech content, the lack of talker variability could have reduced task difficulty and led to overestimated sentence intelligibility.

In summary: (1) the non-telephone frequencies in sentence materials produced good sentence intelligibility (77.4%) and talker identification (74.5%); (2) consonants contributed to speech intelligibility, while vowels contributed to talker identification; and (3) non-telephone-band speech is susceptible to noise, possibly because its high-frequency sounds resemble noise.

See the supplementary material for a summary of the behavioral experiments.

We express our gratitude to Antoinette Abdelmalek, Tianyi Jia, and Cindy Hoan-Tran for participating in the pilot study. This research was supported by Center for Hearing Research at the University of California Irvine.

The authors have no conflicts to disclose.

The data that support the findings of this study are available from the corresponding author upon reasonable request.

1. ANSI (1997). S3.5-1997 (R2017), American National Standard Methods for Calculation of the Speech Intelligibility Index (American National Standards Institute, New York).
2. Bell, T. S., Dirks, D. D., and Trine, T. D. (1992). "Frequency-importance functions for words in high- and low-context sentences," J. Speech Lang. Hear. Res. 35(4), 950–959.
3. Carlson, R., Fant, G., and Granström, B. (1975). "Two-formant models, pitch and vowel perception," in Auditory Analysis and Perception of Speech (Academic Press, New York), pp. 55–82.
4. Carlson, R., Granström, B., and Fant, G. (1970). "Some studies concerning perception of isolated vowels," STL-QPSR 11(2–3), 19–35, available at https://www.speech.kth.se/~rolf/MyKTHpapers_6/www.speech.kth.se/prod/publications/files/qpsr/1970/1970_11_2-3_019-035.pdf.
5. Chang, J. E., Bai, J. Y., and Zeng, F.-G. (2006). "Unintelligible low-frequency sound enhances simulated cochlear-implant speech recognition in noise," IEEE Trans. Biomed. Eng. 53(12), 2598–2601.
6. Cole, R. A., Yan, Y., Mak, B., Fanty, M., and Bailey, T. (1996). "The contribution of consonants versus vowels to word recognition in fluent speech," in 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, Atlanta, GA (IEEE, New York), Vol. 2, pp. 853–856.
7. DePaolis, R. A., Janota, C. P., and Frank, T. (1996). "Frequency importance functions for words, sentences, and continuous discourse," J. Speech Lang. Hear. Res. 39(4), 714–723.
8. Fletcher, H. (1953). Speech and Hearing in Communication, 2nd ed. (Van Nostrand, New York).
9. Fletcher, H., and Galt, R. H. (1950). "The perception of speech and its relation to telephony," J. Acoust. Soc. Am. 22(2), 89–151.
10. Fogerty, D., and Humes, L. E. (2012). "The role of vowel and consonant fundamental frequency, envelope, and temporal fine structure cues to the intelligibility of words and sentences," J. Acoust. Soc. Am. 131(2), 1490–1501.
11. French, N. R., and Steinberg, J. C. (1947). "Factors governing the intelligibility of speech sounds," J. Acoust. Soc. Am. 19(1), 90–119.
12. Hillenbrand, J., Getty, L. A., Clark, M. J., and Wheeler, K. (1995). "Acoustic characteristics of American English vowels," J. Acoust. Soc. Am. 97(5), 3099–3111.
13. Labov, T. G. (1998). "English acquisition by immigrants to the United States at the beginning of the Twentieth Century," Am. Speech 73(4), 368–398.
14. Ladefoged, P. (2001). "Vowels and consonants: An introduction to the sound of language," Phonetica 58(3), 211–212.
15. Levitt, H. (1971). "Transformed up-down methods in psychoacoustics," J. Acoust. Soc. Am. 49(2B), 467–477.
16. Licklider, J. C. R., Bindra, D., and Pollack, I. (1948). "The intelligibility of rectangular speech-waves," Am. J. Psychol. 61(1), 1–20.
17. Lippmann, R. P. (1996). "Accurate consonant perception without mid-frequency speech energy," IEEE Trans. Speech Audio Process. 4(1), 66–69.
18. Miller, G. A., and Nicely, P. E. (1955). "An analysis of perceptual confusions among some English consonants," J. Acoust. Soc. Am. 27(2), 338–352.
19. Monson, B. B., Lotto, A. J., and Ternström, S. (2011). "Detection of high-frequency energy changes in sustained vowels produced by singers," J. Acoust. Soc. Am. 129(4), 2263–2268.
20. Monson, B. B., Rock, J., Schulz, A., Hoffman, E., and Buss, E. (2019). "Ecological cocktail party listening reveals the utility of extended high-frequency hearing," Hear. Res. 381, 107773.
21. Moore, B. C. J., Stone, M. A., Füllgrabe, C., Glasberg, B. R., and Puria, S. (2008). "Spectro-temporal characteristics of speech at high frequencies, and the potential for restoration of audibility to people with mild-to-moderate hearing loss," Ear Hear. 29(6), 907–922.
22. Motlagh Zadeh, L., Silbert, N. H., Sternasty, K., Swanepoel, D. W., Hunter, L. L., and Moore, D. R. (2019). "Extended high-frequency hearing enhances speech perception in noise," Proc. Natl. Acad. Sci. U.S.A. 116(47), 23753–23759.
23. Nilsson, M., Soli, S. D., and Sullivan, J. A. (1994). "Development of the Hearing in Noise Test for the measurement of speech reception thresholds in quiet and in noise," J. Acoust. Soc. Am. 95(2), 1085–1099.
24. Olson, H. F. (1947). "Frequency range preference for speech and music," J. Acoust. Soc. Am. 19, 549–555.
25. Owren, M. J., and Cardillo, G. C. (2006). "The relative roles of vowels and consonants in discriminating talker identity versus word meaning," J. Acoust. Soc. Am. 119(3), 1727–1739.
26. Peterson, G. E., and Barney, H. L. (1952). "Control methods used in a study of the vowels," J. Acoust. Soc. Am. 24(2), 175–184.
27. Rix, A. W., Beerends, J. G., Hollier, M. P., and Hekstra, A. P. (2001). "Perceptual evaluation of speech quality (PESQ): A new method for speech quality assessment of telephone networks and codecs," in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing Proceedings, Salt Lake City, UT (IEEE, New York), Vol. 2, pp. 749–752.
28. Rodman, J. (2003). The Effect of Bandwidth on Speech Intelligibility: White Paper (Polycom, Inc., Santa Cruz, CA).
29. Saberi, K., and Perrott, D. R. (1999). "Cognitive restoration of reversed speech," Nature 398(6730), 760.
30. Schwartz, M. F. (1968). "Identification of speaker sex from isolated, voiceless fricatives," J. Acoust. Soc. Am. 43(5), 1178–1179.
31. Shannon, R. V., Jensvold, A., Padilla, M., Robert, M. E., and Wang, X. (1999). "Consonant recordings for speech testing," J. Acoust. Soc. Am. 106(6), L71–L74.
32. Smits, C., De Sousa, K. C., and Swanepoel, D. W. (2021). "An analytical method to convert between speech recognition thresholds and percentage-correct scores for speech-in-noise tests," J. Acoust. Soc. Am. 150(2), 1321–1331.
33. Starr, A., Picton, T. W., Sininger, Y., Hood, L. J., and Berlin, C. I. (1996). "Auditory neuropathy," Brain 119(Pt. 3), 741–753.
34. Stickney, G. S., and Assmann, P. F. (2001). "Acoustic and linguistic factors in the perception of bandpass-filtered speech," J. Acoust. Soc. Am. 109(3), 1157–1165.
35. Studebaker, G. A., Pavlovic, C. V., and Sherbecoe, R. L. (1987). "A frequency importance function for continuous discourse," J. Acoust. Soc. Am. 81(4), 1130–1138.
36. Taal, C. H., Hendriks, R. C., Heusdens, R., and Jensen, J. (2011). "An algorithm for intelligibility prediction of time–frequency weighted noisy speech," IEEE Trans. Audio Speech Lang. Process. 19(7), 2125–2136.
37. Tian, Y.-h., Chen, X.-l., Xiong, H.-k., Li, H.-l., Dai, L.-r., Chen, J., Xing, J.-l., Chen, J., Wu, X.-h., Hu, W.-m., Huang, T.-j., and Gao, W. (2017). "Towards human-like and transhuman perception in AI 2.0: A review," Front. Inf. Technol. Electron. Eng. 18, 58–67.
38. Van Tasell, D. J., Soli, S. D., Kirby, V. M., and Widin, G. P. (1987). "Speech waveform envelope cues for consonant recognition," J. Acoust. Soc. Am. 82(4), 1152–1161.
39. Vongphoe, M., and Zeng, F.-G. (2005). "Speaker recognition with temporal cues in acoustic and electric hearing," J. Acoust. Soc. Am. 118(2), 1055–1061.
40. von Ilberg, C., Kiefer, J., Tillein, J., Pfenningdorff, T., Hartmann, R., Stürzebecher, E., and Klinke, R. (1999). "Electric-acoustic stimulation of the auditory system: New technology for severe hearing loss," ORL 61(6), 334–340.
41. Wang, M. D., and Bilger, R. C. (1973). "Consonant confusions in noise: A study of perceptual features," J. Acoust. Soc. Am. 54(5), 1248–1266.
42. Warren, R. M., Riener, K. R., Bashford, J. A., and Brubaker, B. S. (1995). "Spectral redundancy: Intelligibility of sentences heard through narrow spectral slits," Percept. Psychophys. 57(2), 175–182.
43. Xu, L., Thompson, C. S., and Pfingst, B. E. (2005). "Relative contributions of spectral and temporal cues for phoneme recognition," J. Acoust. Soc. Am. 117(5), 3255–3267.
44. Yamagishi, J., Veaux, C., and MacDonald, K. (2019). "CSTR VCTK Corpus: English multi-speaker corpus for CSTR Voice Cloning Toolkit (version 0.92)," https://doi.org/10.7488/ds/2645 (Last viewed August 15, 2022).
45. Zeng, F.-G., Kong, Y.-Y., Michalewski, H. J., and Starr, A. (2005). "Perceptual consequences of disrupted auditory nerve activity," J. Neurophysiol. 93(6), 3050–3063.
