Short-time processing was employed to manipulate the amplitude, bandwidth, and temporal fine structure (TFS) of sentences. Fifty-two native-English-speaking, normal-hearing listeners participated in four sentence-recognition experiments. Results showed that the recovered envelope (E) played an important role in speech recognition when the bandwidth was > 1 equivalent rectangular bandwidth. Removing TFS drastically reduced sentence recognition. Preserving TFS greatly improved sentence recognition when amplitude information was available at a rate ≥ 10 Hz (i.e., time segments ≤ 100 ms). Therefore, short-time TFS, together with the recovered E, facilitates speech perception, and it works with coarse amplitude cues to provide useful information for speech recognition.

Temporal information in an acoustic signal can be partitioned into two components: the rapidly varying temporal fine structure (TFS) and the slower changes in the amplitude envelope (E). Mathematically, this partitioning, obtained via the Hilbert transform, takes the form $S(t) = \sum_{i=1}^{M} a_i(t)\cos(\theta_i(t))$, in which the original signal $S(t)$ is decomposed into $M$ cosine waves weighted by the instantaneous amplitudes $a_i(t)$, with $\theta_i(t)$ being the instantaneous phase of the $i$th cosine. Thus, in this context, E and TFS correspond to the instantaneous amplitude $a_i(t)$ and the instantaneous phase $\theta_i(t)$ of the cosine function, respectively. The perceptual significance of acoustic TFS and E has been studied in many laboratories (Smith et al., 2002; Xu and Pfingst, 2003; Zeng et al., 2004; Lorenzi et al., 2006; Fogerty, 2011; Wang et al., 2011, 2015; Apoux et al., 2013; Li et al., 2015; Qi et al., 2017). It is often assumed that E contributes to speech perception, whereas TFS contributes to music or tone perception as well as to the perception of interaural time differences (Smith et al., 2002; Xu and Pfingst, 2003, 2008).
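
To make the decomposition concrete, the following Python sketch (the study itself used MATLAB) extracts E and TFS from one band-limited signal via the Hilbert transform; the 500-Hz carrier and 4-Hz modulator are illustrative values, not stimuli from the study.

```python
# Minimal sketch: Hilbert decomposition of a band-limited signal into
# envelope (E) and temporal fine structure (TFS). Carrier/modulator values
# are illustrative assumptions.
import numpy as np
from scipy.signal import hilbert

fs = 22050                                   # sampling rate of the AzBio stimuli
t = np.arange(0, 0.5, 1 / fs)
band = (1 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 500 * t)

analytic = hilbert(band)                     # analytic signal x(t) + j*H{x(t)}
envelope = np.abs(analytic)                  # instantaneous amplitude a(t): E
tfs = np.cos(np.angle(analytic))             # cos(theta(t)): TFS
reconstructed = envelope * tfs               # approximately recovers the band
```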

The role of acoustic TFS in speech perception is still under debate. Several studies have tested the perception of TFS speech (Lorenzi et al., 2006; Lorenzi et al., 2009; Moore, 2008; Hopkins and Moore, 2009). The TFS speech is created with the formula $\text{TFS speech} = \sum_{i=1}^{M} \sqrt{\overline{a_i(t)^2}}\,\cos(\theta_i(t))$, in which the E information is discarded by replacing it with the root-mean-square (RMS) of the E. Results from those studies have indicated that TFS conveys a fair amount of speech information and that TFS information is particularly helpful for speech recognition in fluctuating noise (Hopkins and Moore, 2009). However, the acoustic TFS can be reconverted into neural E cues at the output of the cochlear filters. Such E cues have been termed reconstructed or recovered E information. Several recent investigations indicated that the reconstructed E may account for the intelligibility of the TFS speech (Lorenzi et al., 2012; Shamma and Lorenzi, 2013; Swaminathan et al., 2014; Léger et al., 2015). Thus, the interpretation of previous studies that used TFS speech may have been confounded by the presence of recovered E information.

Most of the previous studies processed the speech material on the basis of the full length of the speech tokens or sentences. The present study adopted a method that implemented short-time segment TFS processing. This approach was first proposed by Chen and Guan (2013), who used segment durations of 50 and 100 ms instead of the full length of the sentences. In the present study, we processed the TFS in six different short-time segments (i.e., 50 to 300 ms) and termed the resulting TFS short-time temporal fine structure (STFS). By manipulating the segment duration, we could study the effect of the rate of amplitude change (20 to 3.33 Hz) on sentence recognition. Such slow rates of amplitude change represent gross E information. Four experiments were carried out in the present study to examine the contributions of TFS and of amplitude cues in short segments to sentence recognition and to evaluate the potential benefit of using STFS in speech perception.

Fifty-two normal-hearing, native-English-speaking young adults (21 males and 31 females) were recruited and randomly assigned to one of the four experiments, with N = 14, 13, 13, and 12, respectively. The subjects were aged between 18 and 31 years (mean age, 24.6 years). They were screened for normal hearing (≤20 dB hearing level) at octave frequencies between 250 and 8000 Hz. The use of human subjects was reviewed and approved by the Institutional Review Board of Ohio University.

The speech stimuli used were AzBio English sentences that consisted of 33 lists with 20 sentences in each list. The sentences ranged from 3 to 12 words (median = 7) in length. A list contained on average 140 words. All words were counted in scoring. All speech sentences were sampled at 22 050 Hz and were spoken by 2 male and 2 female adult speakers (Spahr et al., 2012).

Signal processing was performed in matlab (MathWorks, Natick, MA). To generate the short-time TFS (STFS), the entire sentence was first divided into segments of 50, 100, 150, 200, 250, or 300 ms. Each segment was then bandpass filtered into 8, 16, or 32 bands. With an overall frequency range of 64 to 8932 Hz, the bandpass filters were 4, 2, and 1 ERB wide, respectively [ERB stands for the equivalent rectangular bandwidth and is an approximation to the bandwidths of the auditory filters of the human cochlea (Glasberg and Moore, 1990)]. The Hilbert transform was used to extract the instantaneous amplitude and phase of each band.
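
As a rough illustration of this front end, the Python sketch below computes ERB-spaced band edges on the Glasberg and Moore (1990) ERB-number scale and cuts a signal into short-time segments. The Butterworth bandpass helper is an assumption; the paper does not specify its filter design.

```python
# Sketch (not the authors' MATLAB code) of the analysis front end:
# ERB-spaced band edges, short-time segmentation, and an assumed
# Butterworth bandpass stand-in for the study's filters.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def hz_to_erb_number(f):
    # Glasberg & Moore (1990) ERB-number scale: 21.4 * log10(1 + 0.00437 f)
    return 21.4 * np.log10(1 + 0.00437 * np.asarray(f, dtype=float))

def erb_number_to_hz(e):
    return (10 ** (np.asarray(e, dtype=float) / 21.4) - 1) / 0.00437

def band_edges(n_bands, f_lo=64.0, f_hi=8932.0):
    # Equal spacing on the ERB-number scale; 8, 16, and 32 bands over
    # 64-8932 Hz give bands roughly 4, 2, and 1 ERB wide, respectively.
    e = np.linspace(hz_to_erb_number(f_lo), hz_to_erb_number(f_hi), n_bands + 1)
    return erb_number_to_hz(e)

def segment(x, fs, seg_ms):
    # Cut the sentence into consecutive seg_ms frames (last may be shorter).
    n = int(round(fs * seg_ms / 1000))
    return [x[k:k + n] for k in range(0, len(x), n)]

def bandpass(x, fs, lo, hi, order=4):
    # Zero-phase Butterworth bandpass (an assumed design choice).
    sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)
```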

In experiment I, the amplitude information was removed by replacing it with 1. Thus, only the TFS of the original speech signal was maintained. The following equation shows the processing:

$$\mathrm{FTFS}_j = \sum_{i=1}^{M} \cos\!\big(\theta_{ij}(n)\big), \quad j = 1, 2, \ldots, N, \tag{1}$$

where FTFS stands for flattened TFS, i represents the ith subband, and j represents the jth segment of the processed sentence. M is the number of bands (8, 16, or 32) and N is the number of time segments, each 50 to 300 ms long.
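
A hedged sketch of this processing for one segment follows; it reuses the hypothetical `band_edges` and `bandpass` helpers from the filterbank sketch above.

```python
# Sketch of Eq. (1) for one segment: sum the flat-amplitude fine structure
# across bands; the amplitude of every band is replaced with 1.
import numpy as np
from scipy.signal import hilbert

def ftfs_segment(seg, fs, edges, bandpass):
    out = np.zeros(len(seg))
    for lo, hi in zip(edges[:-1], edges[1:]):
        phase = np.angle(hilbert(bandpass(seg, fs, lo, hi)))  # theta_ij(n)
        out += np.cos(phase)
    return out
```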

In experiment II, the signal processing was similar to experiment I except that the amplitude information was preserved by using the RMS value of the amplitude in each time segment. This process removed the detailed E information but kept the TFS of the original speech signal intact. The following equation shows the processing:

$$\mathrm{STFS}_j = \sum_{i=1}^{M} \sqrt{\overline{a_{ij}(n)^2}}\,\cos\!\big(\theta_{ij}(n)\big), \quad j = 1, 2, \ldots, N, \tag{2}$$

where i represents the ith subband and j represents the jth segment of the processed sentence. As mentioned above, M is the number of bands (8, 16, or 32) and N is the number of time segments, each 50 to 300 ms long. This was similar to the conventional way of processing TFS speech except that the processing was performed for each short-time segment, so that STFS = {STFS1, STFS2,…, STFSN}.
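
A corresponding sketch of Eq. (2) is shown below: as in experiment I, but each band's fine structure is scaled by the RMS of that band's envelope within the segment (same assumed `bandpass` helper as above).

```python
# Sketch of Eq. (2): fine structure scaled by the per-band, per-segment RMS.
import numpy as np
from scipy.signal import hilbert

def stfs_segment(seg, fs, edges, bandpass):
    out = np.zeros(len(seg))
    for lo, hi in zip(edges[:-1], edges[1:]):
        analytic = hilbert(bandpass(seg, fs, lo, hi))
        rms = np.sqrt(np.mean(np.abs(analytic) ** 2))  # sqrt(mean(a_ij(n)^2))
        out += rms * np.cos(np.angle(analytic))
    return out
```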

In experiment III, the signal processing was similar to experiment II except that the instantaneous phase was replaced with random values so that TFS information was removed. The equation below shows the signal processing,

$$\mathrm{RTFS}_j = \sum_{i=1}^{M} \sqrt{\overline{a_{ij}(n)^2}}\,\cos\!\big(\mathrm{rand}(\theta_{ij}(n))\big), \quad j = 1, 2, \ldots, N, \tag{3}$$

where RTFS stands for randomized TFS and rand(θij(n)) is a random value that replaces the phase of the ith frequency band at the jth time segment.
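
In sketch form (assumptions as above), Eq. (3) keeps the per-band, per-segment RMS amplitude but replaces the phase track with uniform random values, destroying the TFS.

```python
# Sketch of Eq. (3): RMS amplitude kept, phase replaced with random values.
import numpy as np
from scipy.signal import hilbert

def rtfs_segment(seg, fs, edges, bandpass, rng=None):
    rng = rng or np.random.default_rng()
    out = np.zeros(len(seg))
    for lo, hi in zip(edges[:-1], edges[1:]):
        analytic = hilbert(bandpass(seg, fs, lo, hi))
        rms = np.sqrt(np.mean(np.abs(analytic) ** 2))
        out += rms * np.cos(rng.uniform(-np.pi, np.pi, len(seg)))  # rand(theta)
    return out
```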

In experiment IV, the signal processing was similar to experiments II and III except that the instantaneous phase was replaced with sinusoids so that TFS information was removed. The equation below shows the signal processing,

$$\mathrm{CTFS}_j = \sum_{i=1}^{M} \sqrt{\overline{a_{ij}(n)^2}}\,\cos\!\big(2\pi f_i n + \varphi_{ij}\big), \quad j = 1, 2, \ldots, N, \tag{4}$$

where CTFS stands for constant-frequency TFS, f_i is the constant frequency assigned to the ith band, and φij is the random starting phase for the ith frequency band at the jth time segment.
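
A final sketch covers Eq. (4). Using each band's geometric center as f_i is an assumption here; the paper only states that f_i is constant within a band.

```python
# Sketch of Eq. (4): constant-frequency sinusoid per band with a random
# starting phase, scaled by the segment RMS.
import numpy as np
from scipy.signal import hilbert

def ctfs_segment(seg, fs, edges, bandpass, rng=None):
    rng = rng or np.random.default_rng()
    n = np.arange(len(seg))
    out = np.zeros(len(seg))
    for lo, hi in zip(edges[:-1], edges[1:]):
        analytic = hilbert(bandpass(seg, fs, lo, hi))
        rms = np.sqrt(np.mean(np.abs(analytic) ** 2))
        f_c = np.sqrt(lo * hi)                  # assumed center frequency f_i
        phi = rng.uniform(-np.pi, np.pi)        # random starting phase phi_ij
        out += rms * np.cos(2 * np.pi * f_c * n / fs + phi)
    return out
```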

Subjects listened under supra-aural headphones in a sound booth. The order of the 18 conditions (i.e., 3 bandwidths × 6 segment durations) was randomized. Each condition used a different AzBio sentence list. A total of 360 sentences were transcribed by each subject. The subjects listened to each sentence as many times as desired and then transcribed what they heard. Before the actual test, each subject completed a practice session of 72 processed sentences. Feedback was provided during practice but not during the test.

Figure 1 shows the average sentence-recognition performance as a function of segment duration. The four panels of Fig. 1 represent data from the four experiments. A generalized linear model (GLM) analysis was performed to examine the effects of segment duration and processing bandwidth on sentence-recognition scores. The percent-correct scores of sentence recognition were treated as binomial data, and a logit transformation was applied for the GLM analysis (Warton and Hui, 2011). Table 1 summarizes the results of the GLM analysis for the four experiments.
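
A minimal sketch of this kind of binomial GLM in Python using statsmodels follows; the condition-level counts are hypothetical, and the model specification (main effects of duration and bandwidth) is an assumption based on the description above, not the study's exact model.

```python
# Sketch of a binomial GLM with a logit link; the counts are hypothetical.
import numpy as np
import statsmodels.api as sm

duration = np.array([0.05, 0.10, 0.30, 0.05, 0.10, 0.30])  # segment duration (s)
bandwidth = np.array([1, 1, 1, 4, 4, 4])                   # filter width (ERBs)
correct = np.array([138, 126, 18, 139, 132, 125])          # words correct of 140
incorrect = 140 - correct

X = sm.add_constant(np.column_stack([duration, bandwidth]))
# A two-column endog [successes, failures] yields a binomial GLM whose
# default logit link matches the logit transformation described in the text.
fit = sm.GLM(np.column_stack([correct, incorrect]), X,
             family=sm.families.Binomial()).fit()
print(fit.summary())
```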

Fig. 1. (Color online) Group mean sentence-recognition performance of experiment I (left), experiment II (second), experiment III (third), and experiment IV (right) as a function of segment duration. The solid, dashed, and dotted lines represent the three bandwidths of the bandpass filters (i.e., 1, 2, and 4 ERBs). Error bars represent standard deviations (SDs).
Table 1. Results of the GLM analysis of the sentence-recognition performance of the four experiments.

Experiment   N     Segment duration               Bandwidth
                   β        t       p             β        t       p
I            14    −0.047   −7.5    <0.0001       2.167    112.7   <0.0001
II           13    −0.716   −70.7   <0.0001       1.511    71.7    <0.0001
III          13    −1.971   −76.6   <0.0001       −0.951   −37.0   <0.0001
IV           12    −2.468   −66.7   <0.0001       −1.351   −42.4   <0.0001

The four experiments yielded diverse results depending on the signal processing of the speech materials. In the following, the results are discussed in light of our current understanding of the E reconstructed at the output of the cochlear filters from TFS speech signals. We also emphasize the contribution of the amplitude information in the short-time segments to speech perception.

The TFS speech signals in experiment I were similar to those used in previous studies (Lorenzi et al., 2006; Swaminathan et al., 2014; Léger et al., 2015; Gilbert and Lorenzi, 2006). In those studies, the Hilbert transform was used to decompose the speech signal (consonant tokens in a VCV format) into E and TFS components. The E information was removed, and the TFS information of various numbers of bands was summed to form the TFS speech. The amplitude of the TFS was determined by the RMS values of the removed E. In experiment I of the present study, the amplitude of the TFS in all bands was set to 1. The short-time processing was irrelevant here, since the amplitude of all segments was set to the same level. The mean sentence-recognition scores across all segment durations were approximately 9%, 45%, and 88% correct for bandwidths of 1, 2, and 4 ERBs, respectively. Our results were consistent with previous reports in that TFS speech is more intelligible when it is extracted from wider bands using the Hilbert transform. Recovered E information is thought to be responsible for this effect: the wider the bandwidth, the more readily the E can be recovered from the TFS signal (Ghitza, 2001; Zeng et al., 2004). Thus, experiment I of the present study provided additional evidence that recovered E contributes to sentence recognition in TFS speech.

Experiment II of the present study showed an interesting interaction between bandwidth and segment duration. In this experiment, the amplitude of each segment was set to the RMS value of the E in that segment. This manipulation removed the detailed temporal fluctuation of the E but kept the overall amplitude contour of the sentence. The rate of amplitude change is the reciprocal of the segment duration. At a bandwidth of 4 ERBs, the sentence-recognition scores were all above 90% correct; the recovered E was mostly responsible for such high performance. In the 1- and 2-ERB bandwidth conditions, the contributions of recovered E were reduced. On the other hand, the amplitude information in each short-time segment appeared to provide useful information for sentence recognition. Performance improved from approximately 13% and 53% correct at the 300-ms segment duration for bandwidths of 1 and 2 ERBs, respectively, to nearly 100% correct at the 50-ms segment duration (Fig. 1, second panel). Therefore, an amplitude cue at a rate of 20 Hz can support nearly perfect sentence recognition for TFS speech extracted from 1- or 2-ERB-wide bands. At a 10-Hz rate, sentence recognition was around 90% correct. Xu et al. (2005) reported that English vowel and consonant recognition saturated when E information was at or above 4 and 16 Hz, respectively. The present study indicates that good sentence recognition requires E information at or above 10 Hz. This result is also reminiscent of that of Saberi and Perrott (1999), who demonstrated that the recognition of locally time-reversed sentences deteriorated when the time segment was greater than 100 ms.

In experiments III and IV, the TFS information was removed and replaced with noise and with sinusoids of constant frequencies, respectively. Both experiments produced very similar results. At the 50-ms segment duration, amplitude information at a 20-Hz rate produced nearly perfect sentence recognition when the number of bands was 32, but only approximately 50% correct when the number of bands was 8. Sentence-recognition performance dropped precipitously when the segment duration was 100 ms and further decreased to floor when the segment duration was 150 ms or longer. Note that the sentence-recognition performance with 8 bands at the 50-ms segment duration was equivalent to that with 32 bands at the 100-ms segment duration. This result provides evidence of a tradeoff between temporal and spectral cues for sentence recognition. Such a tradeoff has been demonstrated previously in English phoneme recognition as well as in lexical tone recognition (Xu and Pfingst, 2008; Xu et al., 2002; Xu et al., 2005).

It is interesting to compare the speech recognition performance across all four experiments. Without amplitude information (experiment I), sentence recognition performance was approximately 9%, 45%, and 88% correct for bandwidths of 1, 2, and 4 ERBs, respectively. Adding amplitude information improved the performance dramatically (experiment II). Yet, with only amplitude and no TFS information (experiments III and IV), sentence recognition was very poor. We could not attribute the much better performance in experiment II entirely to the recovered E information. It appears that the TFS information facilitates speech perception in addition to the amplitude information. Whether the TFS and the amplitude information work independently or constructively remains to be tested (cf. Seldran et al., 2011).

In the present study, we manipulated the amount of TFS and amplitude information in speech signals using short-time processing. Results showed interesting interactions among time segment, TFS, amplitude, and recovered E in sentence recognition. While earlier studies illustrated the contribution of TFS information to speech perception (Lorenzi et al., 2006; Lorenzi et al., 2009; Moore, 2008; Hopkins and Moore, 2009, 2010; Gilbert and Lorenzi, 2006), several newer studies suggested that recovered E in the TFS speech is responsible for its intelligibility (Swaminathan et al., 2014; Léger et al., 2015). The results of the present study suggest that the contributions of TFS to speech perception cannot be completely ruled out. They also indicate that amplitude information in short-time segments facilitates speech perception when TFS information is available. Therefore, short-time processing of TFS speech is probably beneficial for speech-processing strategies in hearing instruments such as hearing aids and cochlear implants. Because the ability of hearing-impaired listeners to utilize TFS information is a matter of debate (Moore, 2008; Hopkins and Moore, 2009; Léger et al., 2015), further testing of short-time-processed TFS speech in this population is necessary. Another area that might show a benefit of short-time processing of TFS speech is the perception of tonal languages such as Mandarin Chinese, as well as music perception (Smith et al., 2002; Xu and Pfingst, 2003). Only one study (Chen and Guan, 2013) has tested Mandarin-Chinese speech perception using short-time segments of 50 and 100 ms in TFS speech processing. The vowel duration in conversational Mandarin Chinese is approximately 100 ms but is close to 200 ms in text reading (Yang et al., 2017). Therefore, future studies using various segment durations in TFS speech processing for Mandarin Chinese are warranted.

The authors are grateful to Kyle Brown, Ali Colopy, Samantha Cross, Yitao Mao, Lauren Muscari, and Jing Yang for their technical and editorial assistance. The study was supported in part by the National Natural Science Foundation of China (Grant No. 81300820/61071187) (L.H.), the Special Research Fund for the Public Welfare Industry of Health (Grant No. 201202001) (L.H.), and the NIH/NIDCD Grant No. R15- DC014587 (L.X.).

1. Apoux, F., Yoho, S. E., Youngdahl, C. L., and Healy, E. W. (2013). "Role and relative contribution of temporal envelope and fine structure cues in sentence recognition by normal-hearing listeners," J. Acoust. Soc. Am. 134(3), 2205–2212.
2. Chen, F., and Guan, T. (2013). "Effect of temporal modulation rate on the intelligibility of phase-based speech," J. Acoust. Soc. Am. 134, EL520–EL526.
3. Fogerty, D. (2011). "Perceptual weighting of individual and concurrent cues for sentence intelligibility: Frequency, envelope, and fine structure," J. Acoust. Soc. Am. 129, 977–988.
4. Ghitza, O. (2001). "On the upper cutoff frequency of the auditory critical-band envelope detectors in the context of speech perception," J. Acoust. Soc. Am. 110(3), 1628–1640.
5. Gilbert, G., and Lorenzi, C. (2006). "The ability of listeners to use recovered envelope cues from speech fine structure," J. Acoust. Soc. Am. 119(4), 2438–2444.
6. Glasberg, B. R., and Moore, B. J. (1990). "Derivation of auditory filter shapes from notched-noise data," Hear. Res. 47, 103–138.
7. Hopkins, K., and Moore, B. J. (2009). "The contribution of temporal fine structure to the intelligibility of speech in steady and modulated noise," J. Acoust. Soc. Am. 125(1), 442–446.
8. Hopkins, K., and Moore, B. J. (2010). "Importance of temporal fine structure information in speech at different spectral regions for normal-hearing and hearing-impaired subjects," J. Acoust. Soc. Am. 127(3), 1595–1608.
9. Léger, A. C., Desloge, J. G., Braida, L. D., and Swaminathan, J. (2015). "The role of recovered envelope cues in the identification of temporal-fine-structure speech for hearing-impaired listeners," J. Acoust. Soc. Am. 137(1), 505–508.
10. Li, B., Hou, L., Xu, L., Yang, G., Feng, Y., and Yin, S. (2015). "The effect of steep high-frequency hearing loss on speech recognition using temporal fine structure in the low frequency regions," Hear. Res. 326, 66–74.
11. Lorenzi, C., Debruille, L., Garnier, S., Fleuriot, P., and Moore, B. J. (2009). "Abnormal processing of temporal fine structure in speech for frequencies where absolute thresholds are normal," J. Acoust. Soc. Am. 125, 27–30.
12. Lorenzi, C., Gilbert, G., Carn, H., Garnier, S., and Moore, B. J. (2006). "Speech perception problems of the hearing impaired reflect inability to use temporal fine structure," Proc. Natl. Acad. Sci. U.S.A. 103(49), 18866–18869.
13. Lorenzi, C., Wallaert, N., Gnansia, D., Léger, A., Ives, D. T., Chays, A., Garnier, S., and Cazals, Y. (2012). "Temporal-envelope reconstruction for hearing-impaired listeners," J. Assoc. Res. Otolaryngol. 13, 853–865.
14. Moore, B. J. (2008). "The role of temporal fine structure processing in pitch perception, masking, and speech perception for normal-hearing and hearing-impaired people," J. Assoc. Res. Otolaryngol. 9, 399–406.
15. Qi, B., Mao, Y., Liu, J., Liu, B., Zhang, L., and Xu, L. (2017). "Relative contributions of temporal fine structure and envelope cues for lexical tone perception in noise," J. Acoust. Soc. Am. 141(5), 3022–3029.
16. Saberi, K., and Perrott, D. R. (1999). "Cognitive restoration of reversed speech," Nature 398, 760.
17. Seldran, F., Micheyl, C., Truy, E., Berger-Vachon, C., Thai-Van, H., and Gallego, S. (2011). "A model-based analysis of the 'combined-stimulation advantage,'" Hear. Res. 282, 252–264.
18. Shamma, S., and Lorenzi, C. (2013). "On the balance of envelope and temporal fine structure in the encoding of speech in the early auditory system," J. Acoust. Soc. Am. 133, 2818–2833.
19. Smith, Z. M., Delgutte, B., and Oxenham, A. J. (2002). "Chimaeric sounds reveal dichotomies in auditory perception," Nature 416, 87–90.
20. Spahr, A. J., Dorman, M. F., Litvak, L. M., Van Wie, S., Gifford, R. H., Loizou, P. C., Loiselle, L. M., Oakes, T., and Cook, S. (2012). "Development and validation of the AzBio sentence lists," Ear Hear. 33(1), 112–117.
21. Swaminathan, J., Reed, C. M., Desloge, J. G., Braida, L. D., and Delhorne, L. A. (2014). "Consonant identification using temporal fine structure and recovered envelope cues," J. Acoust. Soc. Am. 135(4), 2078–2090.
22. Wang, S., Liu, D., Dong, R., Wang, Y., Chen, J., Zhang, L., and Xu, L. (2015). "The role of temporal envelope and fine structure in Mandarin lexical tone perception in auditory neuropathy spectrum disorder," PLoS One 10(6), e0129710.
23. Wang, S., Xu, L., and Mannell, R. (2011). "Relative contributions of temporal envelope and fine structure cues to lexical tone recognition in hearing-impaired listeners," J. Assoc. Res. Otolaryngol. 12, 783–794.
24. Warton, D. I., and Hui, F. K. (2011). "The arcsine is asinine: The analysis of proportions in ecology," Ecology 92, 3–10.
25. Xu, L., and Pfingst, B. E. (2003). "Relative importance of the temporal envelope and fine structure in tone perception," J. Acoust. Soc. Am. 114(6), 3024–3027.
26. Xu, L., and Pfingst, B. E. (2008). "Spectral and temporal cues for speech recognition: Implications for auditory prostheses," Hear. Res. 242, 132–140.
27. Xu, L., Thompson, C. S., and Pfingst, B. E. (2005). "Relative contributions of spectral and temporal cues for phoneme recognition," J. Acoust. Soc. Am. 117, 3255–3267.
28. Xu, L., Tsai, Y., and Pfingst, B. E. (2002). "Features of stimulation affecting tonal-speech perception: Implications for cochlear prostheses," J. Acoust. Soc. Am. 112(1), 247–258.
29. Yang, J., Zhang, Y., Li, A., and Xu, L. (2017). "On the duration of Mandarin tones," in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech, Stockholm, Sweden, 2017), pp. 1407–1411.
30. Zeng, F. G., Nie, K. B., Liu, S., Stickney, G., Del Rio, E., Kong, Y., and Chen, H. (2004). "On the dichotomy in auditory perception between temporal envelope and fine structure cues," J. Acoust. Soc. Am. 116(3), 1351–1354.