Effects of temporal distortions on consonant perception were measured using locally time-reversed nonsense syllables. Consonant recognition was measured in both audio and audio-visual modalities to assess whether the addition of visual speech cues can resolve consonant errors caused by time reversing. The degradation in consonant recognition depended strongly on the manner of articulation, with sibilant fricatives, affricates, and nasals showing the least degradation. Because the consonant errors induced by time reversing were primarily in voicing and place of articulation (and mostly limited to stop-plosives and non-sibilant fricatives), undistorted visual speech cues could resolve only about half the errors (i.e., only the place-of-articulation errors).
1. Introduction
Temporal dynamics of speech are crucial for its perception, and distorting temporal information degrades speech perception. For example, Remez et al. (2013) and Ueda et al. (2017) summarized several studies that reported a systematic degradation of speech recognition performance after locally time reversing short segments of the speech signal, i.e., a monotonic decrease in word recognition scores with increasing reversal duration, especially above 125 ms. This decrease was attributed to degradation of the strongest speech modulations between 3 and 8 Hz, which correspond to the average syllable rate in conversational speech (Saberi and Perrott, 1999; Greenberg and Arai, 2001). In other words, speech cues shorter than 125 ms were thought to be relatively unimportant for recognizing sentences. However, later studies that used less predictable speech stimuli reported significant degradation in speech recognition when speech cues as short as 50 to 100 ms were disturbed (Kiss et al., 2008; Stilp et al., 2010; Boulenger et al., 2011; Remez et al., 2013), suggesting that sensitivity to temporal distortions significantly shorter than the purported speech syllable duration range (∼120–300 ms) becomes evident when context information is reduced. This was further supported by Ishida et al. (2016), who reported that the degradation in speech intelligibility occurred at shorter reversal durations for pseudowords that lack lexical context (30–50 ms) than for meaningful words (50–100 ms). Thus, the temporal boundaries over which speech can be reconstructed after temporally distorting speech cues depend on the amount of lexical, semantic, and syntactic context information in the speech materials.
For shorter speech segments such as consonants, presented in the absence of any linguistic context, the ability to recover from the degradation caused by local time reversing varies significantly across consonants. This is because some consonant cues are highly dynamic and asymmetric in time, e.g., formant transitions and envelope rise/fall rates, whereas other cues, such as frication energy (especially for sibilant fricatives) and spectral holes in nasals, are relatively stationary and symmetric in time (Stevens and Klatt, 1974; Mermelstein, 1977; Kewley-Port et al., 1983; Jongman et al., 2000; Mitani et al., 2006; Li et al., 2012; Remez et al., 2013; Ishida et al., 2016). Both Bharadwaj and Assmann (2000) and Ishida et al. (2016) measured consonant intelligibility in locally time-reversed nonsense words or phrases, using two different testing paradigms, and reported significantly more degradation for stop-plosives than for fricative consonants. However, the exact nature of the consonant errors induced by local time reversing is not known. A detailed analysis of these errors would potentially allow us to predict the intelligibility of speech with different amounts of context information when listening in acoustic environments or over communication channels that temporally perturb the signal. Furthermore, it could help in designing strategies to overcome those errors. For example, if the errors are primarily in place- and manner-of-articulation, then providing visual speech cues by displaying the face of the talker could significantly improve the recognition of temporally distorted speech audio (Grant and Walden, 1996). However, if the errors induced by temporal distortions are in voicing, then visual speech cues would be of little or no help.1
To test this hypothesis, the current study measured the effect of temporal distortions on consonant recognition in both audio-only and audio-visual modalities, using locally time-reversed /ɑ/C/ɑ/ syllables, and analyzed the resulting consonant errors using confusion matrix (CM) analysis (Miller and Nicely, 1955).
2. Methods
2.1 Stimuli
The speech stimuli were isolated /ɑ/C/ɑ/ syllables spoken by a female talker, with 18 consonants (/p/, /t/, /k/, /f/, /θ/, /s/, /ʃ/, /ʧ/, /b/, /d/, /ɡ/, /v/, /ð/, /z/, /ʒ/, /ʤ/, /m/, /n/) (Grant and Walden, 1996). Eight productions of each consonant were recorded, resulting in a total of 144 tokens. The audio waveforms of these tokens were locally time-reversed in segments of duration T = 0 (unprocessed), 20, 40, 60, 80, 100, 120, 140, and 160 ms. The start and end of each segment were chosen to be at zero crossings to minimize discontinuities. Thus, the actual duration of each segment varied slightly, but was within ±5.25% of the intended reversal duration. In the audio-visual modality, the audio stimuli were combined with the undistorted video of the talker's face.
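To make this processing concrete, the following is a minimal sketch of segment-wise reversal with zero-crossing-aligned boundaries. It assumes a mono waveform in a NumPy array; the function and variable names are illustrative, and this is not the code actually used to generate the stimuli. The T = 0 (unprocessed) condition corresponds to skipping this step.

```python
import numpy as np

def locally_time_reverse(x, fs, reversal_ms):
    """Reverse successive segments of x, each nominally reversal_ms long."""
    target = int(round(fs * reversal_ms / 1000.0))  # nominal segment length in samples
    # Upward zero crossings serve as candidate segment boundaries.
    zc = np.where((x[:-1] <= 0) & (x[1:] > 0))[0]
    out, start = [], 0
    while start < len(x):
        nominal_end = start + target
        if nominal_end >= len(x):
            end = len(x)  # last (possibly shorter) segment
        elif zc.size == 0:
            end = nominal_end  # no zero crossings to snap to
        else:
            # Snap the boundary to the nearest zero crossing to minimize
            # waveform discontinuities at the segment edges.
            end = int(zc[np.argmin(np.abs(zc - nominal_end))])
            if end <= start:  # guard against a degenerate (empty) segment
                end = nominal_end
        out.append(x[start:end][::-1])  # time-reverse this segment
        start = end
    return np.concatenate(out)
```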
Figure 1 shows auditory spectrograms of /ɑʃɑ/ and /ɑtɑ/ tokens, before (top panels) and after (middle panels) time reversing (T = 120 ms). After time reversing, vowels, formant transitions, and other consonant cues (i.e., the release burst or frication energy) changed their relative temporal locations, thereby disturbing cues such as the voice onset time (VOT) for stop consonants. Time-reversal processing also changed the direction of formant transitions (i.e., increasing vs decreasing in frequency). However, purely spectral cues such as the bandwidth of /ʃ/ and /t/ bursts remained unaffected by time reversing. Long-term average spectra for the two tokens (bottom panels) illustrate that the spectral peaks in the average spectra were not affected by this kind of processing. However, spectral valleys were somewhat filled after processing, most likely due to the spectral spread of energy caused by discontinuities at the time-reversal boundaries.
2.2 Listeners
Five subjects (two male and three female) with normal hearing, aged between 29 and 55 years, participated in this study. All listeners had thresholds ≤20 dB hearing level (HL) for frequencies up to 4 kHz. Above 4 kHz, only one listener had thresholds between 25 and 30 dB HL; the others had thresholds ≤20 dB HL.
2.3 Procedure
Listeners sat in a sound-treated room in front of a video touch screen that was used to display the stimulus video as well as the response options. Audio stimuli were presented diotically at 72 dB sound pressure level (linear weighting) through a TDT System 3 (Tucker-Davis Technologies, Alachua, FL) and HD-580 headphones (Sennheiser electronic GmbH & Co. KG, Germany). Listeners entered their response after each stimulus presentation by choosing from the 18 consonant labels displayed on the touch screen. Listeners were first familiarized with the consonant sounds and response labels using unprocessed speech and, in some cases, locally time-reversed syllables with a reversal duration of 20 ms (the least distorted of the time-reversal conditions). During this training/familiarization session, correct-answer feedback was provided. Listeners were required to demonstrate 100% performance on the unprocessed speech condition before continuing to the test phase of the study. The test data were collected in blocks of 72 trials each, consisting of four randomly drawn tokens of each consonant (18 consonants × 4 tokens). The reversal duration was fixed within each test block, and a total of five blocks was tested for each reversal duration. However, if a listener scored 100% in two consecutive test blocks of a reversal duration, no further blocks were tested for that duration. The presentation order of blocks was randomized across reversal duration and modality. CMs were generated by pooling data across subjects.
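For concreteness, the block construction described above can be sketched as follows. This is a hypothetical illustration of the design (18 consonants × 4 randomly drawn tokens per block, reversal duration fixed within a block); the names and sampling details are assumptions, not the experiment-control software.

```python
import random

CONSONANTS = ["p", "t", "k", "f", "θ", "s", "ʃ", "ʧ", "b",
              "d", "ɡ", "v", "ð", "z", "ʒ", "ʤ", "m", "n"]
N_RECORDED_TOKENS = 8  # productions recorded per consonant

def make_test_block(rng=random):
    """Return one shuffled block of 72 trials (18 consonants x 4 tokens)."""
    trials = [(c, tok)
              for c in CONSONANTS
              for tok in rng.sample(range(N_RECORDED_TOKENS), 4)]
    rng.shuffle(trials)  # reversal duration is fixed for the whole block
    return trials
```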
3. Results and discussion
3.1 Recognition scores
Figure 2 shows recognition scores of individual consonants as a function of the reversal duration, with the solid line in each panel representing the average consonant score. As expected, the temporal distortion caused by local time reversing did not affect all consonants equally. Consonants /t/, /θ/, /b/, /d/, /ɡ/, /v/, /ð/, /ʒ/, /ʧ/, and /ʤ/ (bottom panel) were the most affected, showing a noticeable degradation in scores starting from T = 80 ms and reaching scores as low as 40% to 60% at T = 160 ms. On the other hand, consonants /s/, /z/, /m/, and /n/ (top panel) were hardly affected and had scores greater than 94% at all measured reversal durations. Consonants /p/, /k/, /f/, and /ʃ/ (middle panel) showed a reduction in scores with increasing reversal duration, but their scores remained mostly above 80%.
These results are consistent with Bharadwaj and Assmann (2000) and correlate with the spectro-temporal characteristics of various consonant cues. For example, perceptual cues for stop plosives depend strongly on the temporal dynamics of the speech signal (Stevens and Klatt, 1974; Kewley-Port et al., 1983), while those for nasals are primarily spectral in nature, independent of the temporal sequence of events (Mermelstein, 1977). Similarly, the most prominent features of sibilant fricatives are predominantly spectral (Jongman et al., 2000; Li et al., 2012) and should therefore be affected little, if at all, by temporal scrambling of the waveform. However, only two of the four sibilants, viz., /s/ and /z/, were unaffected by time reversing, whereas the scores for the sibilants /ʃ/ and /ʒ/ decreased. To understand these particular trends, consonant confusion errors were analyzed.
3.2 Consonant errors
Consonant confusions were analyzed using a row-normalized CM for each reversal duration. The row normalization converted each row of a CM into an empirically estimated probability distribution (in percent), with the diagonal element representing the correct-recognition probability and the off-diagonal elements representing confusion-error probabilities. Some of these row-normalized CMs, along with a graphical illustration, can be found in the supplementary material.2 The most prominent errors (>20%) were primarily the voicing confusions /p/-/b/, /t/-/d/, /k/-/ɡ/, /f/-/v/, and /θ/-/ð/. The other prominent confusions (>10%) included those across the manner of articulation (/ʃ/-/ʧ/ and /ʒ/-/ʤ/) or the place of articulation (/f/-/θ/, /p/-/t/), or both (/t/-/θ/ and /b/-/v/). Only one prominent error, viz., /d/-/p/, was across both voicing and the place of articulation.
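For reference, the row normalization described above amounts to the following minimal sketch, assuming an 18 × 18 count matrix with stimuli as rows and responses as columns (the names are illustrative):

```python
import numpy as np

def row_normalize(cm_counts):
    """Convert a confusion-count matrix to row percentages (each row sums to 100)."""
    counts = np.asarray(cm_counts, dtype=float)
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # avoid division by zero for empty rows
    return 100.0 * counts / row_sums

# Entry (i, j) is the percentage of presentations of consonant i identified
# as consonant j; the diagonal holds correct-recognition percentages.
```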
These consonant confusions help explain the strong interaction between the manner of articulation and the degradation in recognition scores. For example, all the prominent voicing and place-of-articulation errors listed above occurred only among stop-plosives or non-sibilant fricatives, so most of the low-scoring consonants at the 160 ms reversal duration (Fig. 2, lower and middle panels) were from these two manner classes. On the other hand, there were no prominent errors among nasals, making them high-scoring consonants (Fig. 2, top panel). Affricates /ʧ/ and /ʤ/ have the same place of articulation and the same spectral shape as the sibilant consonants /ʃ/ and /ʒ/, but the envelope rise of the frication energy in the affricates is much shorter than that in the two sibilants (Mitani et al., 2006). If a temporal discontinuity caused by local time reversing falls in the middle of the frication of a sibilant, as shown in the middle left panel of Fig. 1, the result is sharply rising frication energy, which may cause the sibilant to be perceived as an affricate. The sibilants /s/ and /z/ have no such affricate counterparts, which explains their high recognition scores.
Voicing confusions, seen primarily in stop plosives and non-sibilant fricatives, were highly asymmetric: the probability of voiced consonants being recognized as their unvoiced counterparts was much higher than vice versa. A secondary but important voicing cue for stop plosives and non-sibilant fricatives is the voicing murmur that appears just before the release burst in voiced consonants, sometimes referred to as the “voice-bar,” which is absent in unvoiced consonants (Fulop and Disner, 2011). The temporal scrambling caused by local time reversal not only disrupts the VOT, but can also place the voice-bar after the release burst. The absence of a voice-bar before the release burst is likely to cause a voiced consonant to be perceived as unvoiced, especially when the primary voicing cue (i.e., the VOT) is distorted, but not vice versa.
3.3 Visual speech cues
As expected, visual speech input was able to resolve most place-of-articulation errors and, to some extent, manner-of-articulation errors due to mutual information between place and manner categories. Specifically, /ʒ/-/ʤ/, /f/-/θ/, /p/-/t/, /t/-/θ/, /b/-/v/, and /d/-/p/ errors became negligible (i.e., below the chance level) in the audio-visual modality. This resulted in large improvements in scores for some consonants, e.g., from 50% to 85% for /t/, from 40% to 86% for /θ/, and from 55% to 86% for /v/ at 160 ms reversal duration. At this reversal duration, the other noticeable score improvements were for /b/, /d/, and /ʤ/ (20, 22, and 18 percentage points, respectively). However, visual speech cues did not reduce any voicing errors, as expected, resulting in no improvement for a consonant like /ɡ/, which had low scores primarily due to voicing confusion with /k/. Overall, only half the errors (i.e., only place errors, but not voicing errors) were resolved by visual cues. For example, at 160 ms reversal duration, visual speech cues approximately halved the error rate from 30.5% to 16.8%.
4. Summary and conclusions
The contribution of temporal cues to recognizing a consonant depends strongly on the manner of articulation, with consonant errors mostly limited to stop plosives and non-sibilant fricatives. Because the consonant errors caused by local time reversing are primarily in voicing and place of articulation, visual speech cues could resolve only about half the errors (i.e., only place errors, but not voicing errors), resulting in a moderate overall improvement in recognition scores. Thus, when the speech audio is temporally distorted, displaying the talker's face can provide only partial improvement, and additional techniques, such as the use of linguistic context, may be required for further error correction.
Acknowledgments
We thank Mary Cord for audiometric screening. This research was supported by a Cooperative Research and Development Agreement between the Clinical Investigation Regulatory Office, U.S. Army Medical Department and School, and the Oticon Foundation, Copenhagen, Denmark. The views expressed in this paper are those of the authors and do not reflect the official policy of the Department of the Army/Navy/Air Force, Department of Defense (DoD), or the U.S. Government. The identification of specific products or scientific instrumentation does not constitute endorsement or implied endorsement on the part of the authors, DoD, or any component agency.
1The shape of the talker's mouth (jaw, lips, etc.) indicates the location of the constriction while producing a consonant sound, thus providing the place-of-articulation cue. The manner of articulation shares mutual information with the place of articulation, e.g., a bilabial consonant can be a stop plosive or a nasal, but not a fricative. Due to this mutual information, partial information about the manner of articulation can also be obtained from visual speech. However, there are no visual correlates of voicing.
2See the supplementary material at https://doi.org/10.1121/1.5129562 for detailed confusion matrices. This document contains consonant confusion matrices at reversal durations of 120 and 160 ms, in both tabular and graphical form. The document also contains a table that lists the voicing, place-of-articulation, and manner-of-articulation categories of the consonants used in this study.