Effects of temporal distortions on consonant perception were measured using locally time-reversed nonsense syllables. Consonant recognition was measured in both audio and audio-visual modalities to assess whether the addition of visual speech cues can resolve consonant errors caused by time reversal. The degradation in consonant recognition depended strongly on the manner of articulation, with sibilant fricatives, affricates, and nasals showing the least degradation. Because the consonant errors induced by time reversal were primarily in voicing and place-of-articulation (and mostly limited to stop-plosives and non-sibilant fricatives), undistorted visual speech cues could resolve only about half of the errors (i.e., only the place-of-articulation errors).

Temporal dynamics of speech are crucial for its perception, and distorting temporal information degrades speech perception. For example, Remez et al. (2013) and Ueda et al. (2017) summarized several studies that reported a systematic degradation of speech recognition performance after locally time reversing short segments of the speech signal, i.e., a monotonic decrease in word recognition scores with increasing reversal duration, especially above 125 ms. This decrease was attributed to degradation of the strongest speech modulations, between 3 and 8 Hz, which correspond to the average syllable rate in conversational speech (Saberi and Perrott, 1999; Greenberg and Arai, 2001). In other words, speech cues shorter than 125 ms were thought to be relatively unimportant for recognizing sentences. However, later studies that used less predictable speech stimuli reported significant degradation in speech recognition when speech cues as short as 50 to 100 ms were disturbed (Kiss et al., 2008; Stilp et al., 2010; Boulenger et al., 2011; Remez et al., 2013), suggesting that sensitivity to temporal distortions significantly shorter than the purported speech syllable duration range (∼120–300 ms) becomes evident when context information is reduced. This was further supported by Ishida et al. (2016), who reported that the degradation in speech intelligibility occurs at shorter reversal durations for pseudowords that lack lexical context (30–50 ms) than for meaningful words (50–100 ms). Thus, the temporal boundaries over which speech can be reconstructed after temporally distorting speech cues depend on the amount of lexical, semantic, and syntactic context information in the speech materials.

For shorter speech segments such as consonants, presented in the absence of any linguistic context, the ability to recover from the degradation caused by local time reversing varies significantly across consonants. This is because some consonant cues are highly dynamic and asymmetric in time (e.g., formant transitions and envelope rise/fall rates), while other cues, such as frication energy (especially for sibilant fricatives) and spectral holes in nasals, are relatively stationary and symmetric in time (Stevens and Klatt, 1974; Mermelstein, 1977; Kewley-Port et al., 1983; Jongman et al., 2000; Mitani et al., 2006; Li et al., 2012; Remez et al., 2013; Ishida et al., 2016). Both Bharadwaj and Assmann (2000) and Ishida et al. (2016) measured consonant intelligibility in locally time-reversed nonsense words or phrases, using two different testing paradigms, and reported significantly more degradation for stop-plosives than for fricative consonants. However, the exact nature of the consonant errors induced by local time reversing is not known. A detailed analysis of these errors would potentially allow us to predict the intelligibility of speech with different amounts of context information when listening in acoustic environments, or over communication channels, that temporally perturb the signal. Furthermore, such an analysis could help in designing strategies to overcome those errors. For example, if the errors are primarily in place- and manner-of-articulation, then providing visual speech cues by displaying the face of the talker could significantly improve the recognition of temporally distorted speech audio (Grant and Walden, 1996). However, if the errors induced by temporal distortions are in voicing, then visual speech cues would be of little or no help.1

To test this hypothesis, the current study measured consonant recognition in both audio-only and audio-visual modalities. The effect of temporal distortions on consonant recognition was measured using locally time-reversed /ɑ/C/ɑ/ syllables, and consonant errors were analyzed using confusion matrix (CM) analysis (Miller and Nicely, 1955).

The speech stimuli were isolated /ɑ/C/ɑ/ syllables spoken by a female talker, with 18 consonants (/p/, /t/, /k/, /f/, /θ/, /s/, /ʃ/, /ʧ/, /b/, /d/, /ɡ/, /v/, /ð/, /z/, /ʒ/, /ʤ/, /m/, /n/) (Grant and Walden, 1996). Eight productions of each consonant were recorded, resulting in a total of 144 tokens. The audio waveforms of these tokens were locally time-reversed in segments of duration T = 0 (unprocessed), 20, 40, 60, 80, 100, 120, 140, and 160 ms. The start and end of each segment were chosen to be at zero crossings to minimize discontinuities. Thus, the actual duration of each segment varied slightly, but was within ±5.25% of the intended reversal duration. In the audio-visual modality, the audio stimuli were combined with the undistorted video of the talker's face.
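As an illustration, the following minimal Python sketch implements this kind of local time reversal with zero-crossing-aligned segment boundaries. The function name, the nearest-zero-crossing search, and the fallback behavior are our own choices for exposition, not a description of the stimulus-generation software actually used; the ±5.25% tolerance reported above is a property of the recorded stimuli rather than something the sketch enforces.

import numpy as np

def locally_time_reverse(x, fs, t_rev_ms):
    # Return a copy of waveform x (1-D float array, sample rate fs in Hz)
    # in which successive segments of roughly t_rev_ms milliseconds are
    # each reversed in time. Segment boundaries are snapped to the nearest
    # zero crossing to minimize waveform discontinuities, so the actual
    # segment durations deviate slightly from the nominal duration.
    if t_rev_ms <= 0:
        return x.copy()                        # T = 0: unprocessed
    seg = int(round(fs * t_rev_ms / 1000.0))   # nominal segment length (samples)
    sign = np.signbit(x).astype(np.int8)
    zc = np.flatnonzero(np.diff(sign)) + 1     # indices of zero crossings
    y = np.empty_like(x)
    start = 0
    while start < len(x):
        target = start + seg
        if target >= len(x) or zc.size == 0:
            end = len(x)
        else:
            # Snap the nominal boundary to the nearest zero crossing.
            end = int(zc[np.argmin(np.abs(zc - target))])
            if end <= start:                   # no usable crossing; fall back
                end = min(target, len(x))
        y[start:end] = x[start:end][::-1]      # reverse this segment
        start = end
    return y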

Figure 1 shows auditory spectrograms of /ɑʃɑ/ and /ɑtɑ/ tokens, before (top panels) and after (middle panels) time reversing (T = 120 ms). After time reversing, the vowels, formant transitions, and other consonant cues (i.e., the release burst or frication energy) changed their relative temporal locations, thereby disturbing cues such as the voice onset time (VOT) for stop consonants. Time-reversal processing also changed the direction of formant transitions (i.e., increasing vs decreasing in frequency). However, purely spectral cues, such as the bandwidths of the /ʃ/ and /t/ bursts, remained unaffected by time reversing. Long-term average spectra for the two tokens (bottom panels) illustrate that the spectral peaks were not affected by this kind of processing. However, the spectral valleys were somewhat filled in after processing, most likely due to the spectral spread of energy caused by discontinuities at the time-reversal boundaries.
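The long-term average spectra in the bottom panels of Fig. 1 can be approximated with any standard power spectral density estimate. A minimal sketch using Welch's method follows; the segment length is an arbitrary choice here, not a parameter taken from the paper.

import numpy as np
from scipy.signal import welch

def long_term_spectrum(x, fs, nperseg=1024):
    # Long-term average (Welch) power spectral density of a token, in dB,
    # comparable before and after local time reversal.
    f, pxx = welch(x, fs=fs, nperseg=nperseg)
    return f, 10.0 * np.log10(pxx + 1e-12)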

Fig. 1. Auditory spectrograms (i.e., logarithmically compressed envelopes of the outputs of a bank of 200 gammatone filters) of /ɑʃɑ/ (left) and /ɑtɑ/ (right) tokens, spoken by a female speaker, at time-reversal durations of T = 0 (original, top) and 120 ms (middle). The vertical dashed lines mark the boundaries of the initial vowel (V1), the leading formant transition (FT1), the consonant (C), the following formant transition (FT2), and the final vowel (V2). The gray dashed-dotted rectangles highlight the frication energy of /ʃ/ and the release burst and formant transitions of /t/. The two bottom panels show the power spectral density of the two tokens before and after time reversal.

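For readers who wish to reproduce this style of display, a minimal Python sketch of such an auditory spectrogram is given below. The caption of Fig. 1 specifies only a bank of 200 gammatone filters with logarithmically compressed envelopes, so the ERB-spaced center frequencies, the frequency limits, and the Hilbert-envelope extraction are assumptions of this sketch (which relies on scipy.signal.gammatone).

import numpy as np
from scipy.signal import gammatone, hilbert, lfilter

def auditory_spectrogram(x, fs, n_filters=200, fmin=80.0, fmax=8000.0):
    # Log-compressed Hilbert envelopes at the outputs of a bank of
    # n_filters gammatone filters, with center frequencies spaced on an
    # ERB-rate scale (Glasberg and Moore).
    erb = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)
    inv = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437
    cfs = inv(np.linspace(erb(fmin), erb(fmax), n_filters))
    spec = np.empty((n_filters, len(x)))
    for i, cf in enumerate(cfs):
        b, a = gammatone(cf, 'iir', fs=fs)    # 4th-order IIR gammatone
        band = lfilter(b, a, x)               # band-pass filtered signal
        spec[i] = np.abs(hilbert(band))       # Hilbert envelope
    return 20.0 * np.log10(spec + 1e-6), cfs  # envelopes in dB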

Five subjects (two male and three female) with normal hearing, aged between 29 and 55 years, participated in this study. All listeners had thresholds ≤20 dB hearing level (HL) at frequencies up to 4 kHz. Above 4 kHz, one listener had thresholds between 25 and 30 dB HL, while the others had thresholds ≤20 dB HL.

Listeners sat in a sound-treated room in front of a video touch screen that was used to display the stimulus video as well as the response options. Audio stimuli were presented diotically at 72 dB sound pressure level (linear weighting) through a TDT System 3 (Tucker-Davis Technologies, Alachua, FL) and HD-580 headphones (Sennheiser electronic GmbH & Co. KG, Germany). Listeners entered their response by choosing from 18 consonant labels displayed on the touch screen after each stimulus presentation. Listeners were first familiarized with the consonant sounds and response labels using unprocessed speech and, in some cases, locally time-reversed syllables with a reversal duration of 20 ms (the least distorted of the time-reversal conditions). During this training/familiarization session, correct-answer feedback was provided. Listeners were required to demonstrate 100% performance on the unprocessed speech condition before continuing on to the test phase of the study. The test data were collected in blocks of 72 trials each, consisting of four randomly drawn tokens of each consonant (18 consonants × 4 tokens). The reversal duration was fixed within each test block, and a total of five blocks was tested for each reversal duration. However, if a listener scored 100% in two consecutive test blocks at a given reversal duration, then no further blocks were tested for that duration. The presentation order of blocks was randomized across reversal duration and modality. CMs were generated by pooling data across subjects.
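The block structure described above can be summarized in a short Python sketch. The ASCII consonant labels and the function below are hypothetical bookkeeping for illustration only, not the experiment-control software actually used.

import random

# The 18 consonants of the /ɑ/C/ɑ/ syllables, written as ASCII labels.
CONSONANTS = ["p", "t", "k", "f", "th", "s", "sh", "ch", "b", "d",
              "g", "v", "dh", "z", "zh", "jh", "m", "n"]

def make_test_block(rng, n_recorded_tokens=8, tokens_per_consonant=4):
    # One 72-trial block: four tokens of each consonant, drawn at random
    # from the eight recorded productions, presented in random order.
    # The reversal duration is fixed for the whole block (not shown here).
    trials = [(c, tok) for c in CONSONANTS
              for tok in rng.sample(range(n_recorded_tokens),
                                    tokens_per_consonant)]
    rng.shuffle(trials)
    return trials

# Example: block = make_test_block(random.Random(0)); len(block) == 72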

Figure 2 shows the recognition scores of individual consonants as a function of the reversal duration, with the solid line in each panel representing the average consonant score. As expected, the temporal distortion caused by local time reversing did not affect all consonants equally. Consonants /t/, /θ/, /b/, /d/, /ɡ/, /v/, /ð/, /ʒ/, /ʧ/, and /ʤ/ (bottom panel) were the most affected, showing a noticeable degradation in scores starting from T = 80 ms and reaching scores as low as 40% to 60% at T = 160 ms. On the other hand, consonants /s/, /z/, /m/, and /n/ (top panel) were hardly affected and had scores greater than 94% at all measured reversal durations. Consonants /p/, /k/, /f/, and /ʃ/ (middle panel) showed a reduction in scores with increasing reversal duration, but the scores remained mostly above 80%.

Fig. 2. Recognition scores for individual consonants plotted as a function of the reversal duration, divided into three groups: high-scoring (top panel), medium-scoring (middle panel), and low-scoring (bottom panel) consonants. The thick solid line represents the average consonant recognition score in all three panels. Reversal durations in ms are shown at the bottom, and the corresponding reversal rates in Hz are shown at the top.


These results are consistent with Bharadwaj and Assmann (2000) and correlate with the spectro-temporal characteristics of various consonant cues. For example, perceptual cues for stop plosives depend highly on the temporal dynamics of the speech signal (Stevens and Klatt, 1974; Kewley-Port et al., 1983), while those for nasals are primarily spectral in nature, independent of the temporal sequence of events (Mermelstein, 1977). Similarly, the most prominent features of sibilant fricatives are predominantly spectral (Jongman et al., 2000; Li et al., 2012) and thus should be affected little by temporal scrambling of the waveform. However, only two of the four sibilants, viz., /s/ and /z/, were unaffected by time reversing, whereas scores for the sibilants /ʃ/ and /ʒ/ decreased. To understand these particular trends, the consonant confusion errors were analyzed.

Consonant confusions were analyzed using a row-normalized CM for each reversal duration. The row normalization converted each row of a CM to an empirically estimated probability distribution (in percent), with the diagonal element representing the correct-recognition probability and the off-diagonal elements representing confusion-error probabilities. Some of these row-normalized CMs, along with a graphical illustration, can be found in the supplementary material.2 The most prominent errors (>20%) were primarily the voicing confusions /p/-/b/, /t/-/d/, /k/-/ɡ/, /f/-/v/, and /θ/-/ð/. The other prominent confusions (>10%) included those across the manner of articulation (/ʃ/-/ʧ/ and /ʒ/-/ʤ/) or the place of articulation (/f/-/θ/, /p/-/t/), or both (/t/-/θ/ and /b/-/v/). Only one prominent error, viz., /d/-/p/, was across both voicing and the place of articulation.
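Row normalization of a raw-count CM is a one-line computation; the following minimal numpy sketch shows it, together with how the prominent-error criterion above could be applied (the function name and the zero-row guard are our own additions).

import numpy as np

def row_normalize(cm):
    # Convert a raw-count CM (rows = presented consonant, columns =
    # responded consonant) into row-wise percentages; the diagonal then
    # estimates the correct-recognition probability, and the off-diagonal
    # entries estimate the confusion-error probabilities.
    cm = np.asarray(cm, dtype=float)
    totals = cm.sum(axis=1, keepdims=True)
    return 100.0 * cm / np.where(totals == 0.0, 1.0, totals)

# Example of flagging "prominent" confusions, defined (as in the text)
# as off-diagonal entries exceeding 20%:
#   p = row_normalize(cm)
#   np.fill_diagonal(p, 0.0)
#   prominent = np.argwhere(p > 20.0)  # (presented, responded) index pairs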

These consonant confusions help explain the strong interaction between the manner of articulation and the degradation in recognition scores. For example, all of the prominent voicing and place-of-articulation errors listed above were only among stop-plosives or non-sibilant fricatives, with the result that most of the low-scoring consonants at the 160 ms reversal duration (Fig. 2, lower and middle panels) were from these two manner classes. On the other hand, there were no prominent errors among nasals, making them high-scoring consonants (Fig. 2, top panel). The affricates /ʧ/ and /ʤ/ have the same place of articulation and the same spectral shape as the sibilant consonants /ʃ/ and /ʒ/, but the envelope rise of the frication energy in the affricates is much shorter than that in the two sibilants (Mitani et al., 2006). If a temporal discontinuity caused by local time reversing falls in the middle of the frication of a sibilant, as shown in the middle left panel of Fig. 1, then the result is sharply rising frication energy, which may cause the sibilant to be perceived as an affricate. There were no such affricate counterparts for the sibilant consonants /s/ and /z/, which explains the high recognition scores for these two sibilants.

Voicing confusions, seen primarily in stop plosives and non-sibilant fricatives, were highly asymmetric: voiced consonants were much more likely to be recognized as their unvoiced counterparts than vice versa. A secondary but important voicing cue for stop plosives and non-sibilant fricatives is the voicing murmur that appears just before the release burst in voiced consonants, sometimes referred to as the "voice-bar," which is absent in unvoiced consonants (Fulop and Disner, 2011). The temporal scrambling caused by local time reversals not only disrupts the VOT, but can also place the voice-bar after the release burst. The absence of the voice-bar before the release burst is likely to cause a voiced consonant to be perceived as unvoiced, especially when the primary voicing cue (i.e., the VOT) is distorted, but not vice versa.

As expected, visual speech input was able to resolve most place-of-articulation errors and, to some extent, manner-of-articulation errors, owing to the mutual information between place and manner categories. Specifically, the /ʒ/-/ʤ/, /f/-/θ/, /p/-/t/, /t/-/θ/, /b/-/v/, and /d/-/p/ errors became negligible (i.e., below chance level) in the audio-visual modality. This resulted in large improvements in scores for some consonants, e.g., from 50% to 85% for /t/, from 40% to 86% for /θ/, and from 55% to 86% for /v/ at the 160 ms reversal duration. At this reversal duration, the other noticeable score improvements were for /b/, /d/, and /ʤ/ (20, 22, and 18 percentage points, respectively). However, visual speech cues did not reduce any voicing errors, as expected, resulting in no improvement for a consonant like /ɡ/, whose low scores were primarily due to voicing confusion with /k/. Overall, only about half of the errors (i.e., only the place errors, but not the voicing errors) were resolved by visual cues. For example, at the 160 ms reversal duration, visual speech cues approximately halved the error rate, from 30.5% to 16.8%.

The contribution of temporal cues to consonant recognition depends highly on the manner of articulation, with consonant errors mostly limited to stop plosives and non-sibilant fricatives. Because the consonant errors caused by local time reversing are primarily in voicing and place-of-articulation, visual speech cues could resolve only about half of the errors (i.e., only the place errors, but not the voicing errors), resulting in a moderate overall improvement in recognition scores. Thus, when the speech audio is temporally distorted, displaying the talker's face can provide only partial improvement, and additional techniques, such as the use of linguistic context, may be required for further error correction.

We thank Mary Cord for audiometric screening. This research was supported by a Cooperative Research and Development Agreement between the Clinical Investigation Regulatory Office, U.S. Army Medical Department and School, and the Oticon Foundation, Copenhagen, Denmark. The views expressed in this paper are those of the authors and do not reflect the official policy of the Department of the Army/Navy/Air Force, the Department of Defense (DoD), or the U.S. Government. The identification of specific products or scientific instrumentation does not constitute endorsement or implied endorsement on the part of the authors, the DoD, or any component agency.

1. The shape of the talker's mouth (jaw, lips, etc.) indicates the location of the constriction while producing a consonant sound, thus providing the place-of-articulation cue. The manner-of-articulation shares mutual information with the place-of-articulation; e.g., a bilabial consonant can be a stop plosive or a nasal, but not a fricative. Due to this mutual information, partial information about the manner-of-articulation can also be obtained from visual speech. However, there are no visual correlates of voicing.

2. See the supplementary material at https://doi.org/10.1121/1.5129562 for detailed confusion matrices. This document contains consonant confusion matrices at reversal durations of 120 and 160 ms, in both tabular and graphical form. The document also contains a table that lists the voicing, place-of-articulation, and manner-of-articulation categories of the consonants used in this study.

1. Bharadwaj, S. V., and Assmann, P. F. (2000). "Effects of time reversal on consonant identification," J. Acoust. Soc. Am. 108, 2604.
2. Boulenger, V., Hoen, M., Jacquier, C., and Meunier, F. (2011). "Interplay between acoustic/phonetic and semantic processes during spoken sentence comprehension: An ERP study," Brain Lang. 116, 51-63.
3. Fulop, S. A., and Disner, S. F. (2011). "Examining the voice bar," Proc. Mtgs. Acoust. 14(1), 060002.
4. Grant, K. W., and Walden, B. E. (1996). "Evaluating the articulation index for auditory-visual consonant recognition," J. Acoust. Soc. Am. 100(4), 2415-2424.
5. Greenberg, S., and Arai, T. (2001). "The relation between speech intelligibility and the complex modulation spectrum," in Proceedings of the 7th European Conference on Speech Communication Technology (Eurospeech-2001), pp. 473-476.
6. Ishida, M., Samuel, A. G., and Arai, T. (2016). "Some people are 'more lexical' than others," Cognition 151, 68-75.
7. Jongman, A., Wayland, R., and Wong, S. (2000). "Acoustic characteristics of English fricatives," J. Acoust. Soc. Am. 108(3), 1252-1263.
8. Kewley-Port, D., Pisoni, D. B., and Studdert-Kennedy, M. (1983). "Perception of static and dynamic acoustic cues to place of articulation in initial stop consonants," J. Acoust. Soc. Am. 73(5), 1779-1793.
9. Kiss, M., Cristescu, T., Fink, M., and Wittmann, M. (2008). "Auditory language comprehension of temporally reversed speech signals in native and non-native speakers," Acta Neurobiol. Exp. 68(2), 204-213.
10. Li, F., Trevino, A., Menon, A., and Allen, J. B. (2012). "A psychoacoustic method for studying the necessary and sufficient perceptual cues of American English fricative consonants in noise," J. Acoust. Soc. Am. 132(4), 2663-2675.
11. Mermelstein, P. (1977). "On detecting nasals in continuous speech," J. Acoust. Soc. Am. 61(2), 581-587.
12. Miller, G. A., and Nicely, P. E. (1955). "An analysis of perceptual confusions among some English consonants," J. Acoust. Soc. Am. 27(2), 338-352.
13. Mitani, S., Kitama, T., and Sato, Y. (2006). "Voiceless affricate/fricative distinction by frication duration and amplitude rise slope," J. Acoust. Soc. Am. 120(3), 1600-1607.
14. Remez, R. E., Thomas, E. F., Dubowski, K. R., Koinis, S. M., Porter, N. A. C., Paddu, N. U., Moskalenko, M., and Grossman, Y. S. (2013). "Modulation sensitivity in perceptual organization of speech," Atten. Percept. Psychophys. 75, 1353-1358.
15. Saberi, K., and Perrott, D. R. (1999). "Cognitive restoration of reversed speech," Nature 398, 760.
16. Stevens, K. N., and Klatt, D. H. (1974). "Role of formant transitions in the voiced-voiceless distinction for stops," J. Acoust. Soc. Am. 55(3), 653-659.
17. Stilp, C. E., Kiefte, M., Alexander, J. M., and Kluender, K. R. (2010). "Cochlea-scaled spectral entropy predicts rate-invariant intelligibility of temporally distorted sentences," J. Acoust. Soc. Am. 128(4), 2112-2126.
18. Ueda, K., Nakajima, Y., Ellermeier, W., and Kattner, F. (2017). "Intelligibility of locally time-reversed speech: A multilingual comparison," Sci. Rep. 7(1), 1782.
