This paper describes acoustic cues for classification of consonant voicing in a distinctive feature-based speech recognition system. Initial acoustic cues are selected by studying consonant production mechanisms. Spectral representations, band-limited energies, and correlation values, along with Mel-frequency cepstral coefficient (MFCC) features, are also examined. Analysis of variance is performed to assess the relative significance of the features. Overall, classification rates of 82.2%, 80.6%, and 78.4% are obtained on the TIMIT database for stops, fricatives, and affricates, respectively. Combining acoustic parameters with MFCCs improves performance in all cases. Furthermore, performance on NTIMIT telephone-channel speech shows that the acoustic parameters are more robust than MFCCs.

In a knowledge-based speech recognition system, linguistic information is extracted from the speech signal by finding acoustic correlates of the articulation of speech sounds.1 Such systems aim to be more robust to the environment and less computationally intensive than widely used statistical systems. An approach outlined by Stevens2 describes procedures for extracting linguistic units termed distinctive features from speech. The distinctive features describe the manner, the articulators involved, and the place of articulation for each phoneme. For obstruent consonants in English, i.e., stops, fricatives, and affricates, the features [body], [blade], and [lips] can be used to distinguish consonant place, with an additional distinction in consonant voicing. In the distinctive feature representation of speech, the primary features that describe voicing are [stiff]/[slack] vocal folds. These features are related to the state or configuration of the vocal folds that may encourage or discourage vocal fold vibration.

Numerous studies have examined consonant acoustics. For example, Stevens3,4 studied the acoustics of stop and affricate consonants with models of the underlying mechanical, aerodynamic, and acoustic events, and Shadle5 investigated fricative consonants using speech production models. Analyses of consonant voicing have also been carried out by many researchers: Bell-Berti6 investigated differences in speech production mechanisms between voiced and voiceless stops, and Crowther7 found that vocalic cues affect consonant voicing. A previous study by Choi8 described an early module for detection of consonant voicing in a distinctive feature-based speech recognition system.

This study aims to investigate the characteristics of consonant voicing and to evaluate acoustic parameters for consonant voicing classification in a distinctive feature-based speech recognition system. Performance on clean and telephone speech is examined to evaluate the robustness of the acoustic cues in degraded environments. It is assumed here that consonant manner class detection has already been completed, so that voicing classification is carried out on regions identified as stops, fricatives, and affricates. Various acoustic measurements are investigated, and Mel-frequency cepstral coefficients (MFCCs), which are widely used in statistical speech recognition systems, are included for comparison. Analysis of variance (ANOVA) tests are used to assess the significance of the measurements for classification of consonant voicing, and results are presented for clean and telephone speech.

In order to examine consonant voicing, stops, fricatives, and affricates were extracted from the TIMIT corpus,9 in all phonetic contexts. The excised consonant database is taken from 6300 continuous sentences spoken by 630 speakers from different dialect regions in the United States. The corpus includes word and phone labels, and a lexicon is provided. The recordings were made in a noise-free environment, and the signal is sampled at 16 kHz. The tokens are divided into training and test sets. The obstruent consonants comprise 54 144 tokens, with 39 933 tokens in the training set and 14 211 tokens in the test set. The training set contains 19 572 stop closures, 24 221 stop releases, 16 706 complete stops, 21 424 fricatives, 2031 affricate releases, and 1803 complete affricates. The test set includes 8490 stop closures, 6992 stop releases, 5918 complete stops, 7724 fricatives, 631 affricate releases, and 569 complete affricates.

In this paper, TIMIT phone labels are used to find consonant landmarks (stops, fricatives, and affricates) and to locate the points at closures and releases where features are extracted; this is equivalent to assuming that consonant landmarks have been detected in advance. Using these landmarks, closure measurements are extracted 20 ms after the start of stop, fricative, and affricate closures. Likewise, release measurements are extracted 20 ms before the end of stop, fricative, and affricate releases, in order to measure residual voicing, which is correlated with pharyngeal expansion to maintain phonation.6 The consonant voicing classification module is tested for three cases: closures only, releases only, and both. Where both closures and releases are available, the closure and release features are concatenated into a larger feature set. Performance is reported as the classification rate (the fraction of test tokens correctly classified), with equal error rates shown in parentheses after classification rates. The NTIMIT database,10 which consists of TIMIT utterances transmitted over a variety of telephone line conditions, is used to examine channel robustness.
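As an illustration, the following minimal sketch derives such extraction points from the phone labels, assuming TIMIT's standard .phn format (start sample, end sample, label, at 16 kHz); the stop label sets shown are illustrative and would be extended to fricative and affricate segments.

```python
# Sketch: derive feature-extraction points from a TIMIT .phn label file.
# Label sets below are illustrative, not the paper's exact inventory.
FS = 16000
OFFSET = int(0.020 * FS)  # 20 ms, as described in the text

CLOSURES = {"bcl", "dcl", "gcl", "pcl", "tcl", "kcl"}
RELEASES = {"b", "d", "g", "p", "t", "k"}

def extraction_points(phn_path):
    """Return (sample_index, kind, label) tuples for closures and releases."""
    points = []
    with open(phn_path) as f:
        for line in f:
            start, end, label = line.split()
            start, end = int(start), int(end)
            if label in CLOSURES:
                # closure measurement: 20 ms after the closure starts
                points.append((start + OFFSET, "closure", label))
            elif label in RELEASES:
                # release measurement: 20 ms before the release ends
                points.append((end - OFFSET, "release", label))
    return points
```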

Detecting acoustic cues for consonant voicing involves extracting several types of information from the signal. One such measure is an energy difference: during the closure interval of a voiced consonant, vocal fold vibration is observable at low frequencies, along with possible high-frequency frication noise, whereas this is less so for unvoiced consonants.8 In this paper, four energy-related features are measured to capture these low- and high-frequency properties: the overall rms energy, the band energies from 0 to 400 Hz (E1) and from 2000 to 7000 Hz (E2), and, to assess the difference between low- and high-frequency energy, the ratio of energy in 0–400 Hz to that in 2000–7000 Hz. Vocal fold vibration also produces high short-term correlation in the waveform, so peak normalized cross-correlation values (PNCC) were used as a further voicing measure. The intensity of aspiration following the release of the obstruent8 and the presence of glottalization at the end of voicing are also indicators of unvoiced obstruents; these can be assessed by examining the characteristics of the phonation source near the consonant landmarks, for which the amplitude of the first harmonic (H1) is a suitable measure of the strength of vocal fold vibration. Finally, duration was used as an acoustic cue: voice onset time carries information about voicing for a stop,11 and fricative duration differs between voiced and unvoiced fricatives.12
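The band-limited energy and correlation measures can be sketched as follows, assuming a 16 kHz signal and a 20 ms analysis frame; the filter order and pitch lag range are illustrative choices, and only the band edges (0–400 Hz and 2000–7000 Hz) come from the text.

```python
# Sketch of E1, E2, the E1/E2 ratio, and PNCC for one analysis frame.
import numpy as np
from scipy.signal import butter, sosfilt

FS = 16000

def band_energy_db(frame, band, fs=FS):
    """Energy (dB) in `band` Hz; a band starting at 0 uses a lowpass."""
    if band[0] <= 0:
        sos = butter(4, band[1], btype="lowpass", fs=fs, output="sos")
    else:
        sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    y = sosfilt(sos, frame)
    return 10.0 * np.log10(np.sum(y * y) + 1e-12)

def pncc(frame, fs=FS, f0_min=60.0, f0_max=400.0):
    """Peak normalized cross-correlation over plausible pitch lags."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / (ac[0] + 1e-12)                    # normalize by lag-0 energy
    lo, hi = int(fs / f0_max), int(fs / f0_min)  # lag range, assumed 60-400 Hz
    return float(ac[lo:hi].max())

def voicing_features(frame, fs=FS):
    e1 = band_energy_db(frame, (0, 400), fs)      # E1: 0-400 Hz
    e2 = band_energy_db(frame, (2000, 7000), fs)  # E2: 2000-7000 Hz
    # on the dB scale, the E1/E2 energy ratio becomes a difference
    return {"E1": e1, "E2": e2, "E1/E2": e1 - e2, "PNCC": pncc(frame, fs)}
```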

The measurements obtained for consonant voicing classification in the TIMIT training set are first examined using ANOVA. A one-way analysis is performed for each acoustic measurement, and features significant at P < 0.05 are retained. In all, seven significant features are found; F-values for consonant voicing features at stop, fricative, and affricate closures and releases are shown in Table I. The F-value is the ratio of the between-group variance to the within-group variance in the data, and indicates the relative discriminative power of the features. Entries that are not significant are marked with a dash. The results show that the energy ratio and PNCC are significant for all consonants. rms energy is discriminative for stops and for all releases, E1 and H1 are discriminative for all closures, and duration and E2 are significant for all releases.
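For illustration, the one-way ANOVA screening of a single measurement can be run as below; the voiced/unvoiced PNCC values are synthetic stand-ins for real tokens, and in practice each of the seven measurements would be tested per landmark type as in Table I.

```python
# Toy illustration of the ANOVA feature screening: f_oneway returns
# (F, p), and a measurement is kept when p < 0.05.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
voiced = rng.normal(0.8, 0.10, 500)    # hypothetical PNCC, voiced tokens
unvoiced = rng.normal(0.5, 0.15, 500)  # hypothetical PNCC, unvoiced tokens

F, p = f_oneway(voiced, unvoiced)
print(f"F = {F:.0f}, significant = {p < 0.05}")
```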

TABLE I.

ANOVA results (F-values) for seven acoustic measurements on the training data set. Entries with P > 0.05 are not significant and are marked with a dash.

Measurement                 Stop closure  Stop release  Fricative closure  Fricative release  Affricate closure  Affricate release
rms energy                  1319          1434          —                  1364               —                  265
0–400 Hz energy (E1)        3882          148           2125               569                241                —
2000–7000 Hz energy (E2)    187           2978          2672               2935               —                  248
0–400 Hz/2000–7000 Hz       2798          3991          2033               2077               200                —
PNCC                        2913          2784          2772               2999               200                —
H1                          2091          192           1692               1586               375                —
Duration                    —             3178          3213               3213               —                  553

Using the acoustic parameters, Gaussian mixture models (GMMs) with four mixtures are trained for each task on the TIMIT training data and tested on the TIMIT test data. The acoustic cues are divided into three groups. The first group (energy) contains rms energy, 0–400 Hz band-limited energy, 2000–7000 Hz band-limited energy, and the ratio of band-limited energies (0–400 Hz/2000–7000 Hz). The second group (pitch) includes H1 and PNCC. The third group (duration) uses the duration of releases.
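A minimal sketch of this classifier is given below: one four-mixture GMM per voicing class, with each token assigned to the class under which its likelihood is higher. The full covariance type and the placeholder feature matrices are assumptions, since the paper does not specify them.

```python
# Sketch: one 4-mixture GMM per voicing class over the selected cue groups.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_voicing_gmms(X_voiced, X_unvoiced, n_mix=4, seed=0):
    """Fit one GMM per class on (n_tokens, n_features) matrices."""
    gmm_v = GaussianMixture(n_components=n_mix, covariance_type="full",
                            random_state=seed).fit(X_voiced)
    gmm_u = GaussianMixture(n_components=n_mix, covariance_type="full",
                            random_state=seed).fit(X_unvoiced)
    return gmm_v, gmm_u

def classify(gmm_v, gmm_u, X):
    """Label each token by the class with the higher log-likelihood."""
    return np.where(gmm_v.score_samples(X) > gmm_u.score_samples(X),
                    "voiced", "unvoiced")
```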

The three groups (seven acoustic parameters) are next used to classify consonant voicing. Classification rates for stops are shown in Fig. 1(a). Pitch-related features are the most important, while closure duration appears less significant for classifying stop voicing; these classification rates are consistent with the ANOVA results. Among the landmarks, releases were more informative than closures. Classification rates using both locations are fairly consistent across the three groups, reaching 82.2% (81.2%) with all acoustic cues.

FIG. 1.

(a) Classification rates for stop closures, stop releases and both, (b) classification rates for fricative closures, fricative releases and both, and (c) classification rates for affricate closures, affricate releases, and both using energy properties (energy), pitch properties (pitch), duration, and all acoustic features (all). (d) Consonant voicing classification rate for stop closures, stop releases and both, (e) consonant voicing classification rate for fricative closures, fricative releases, and both, and (f) consonant voicing classification rate for affricate closures, affricate releases, and both using acoustic parameters, MFCC(13), MFCC(39), and using acoustic parameters and MFCC(39).


Next, Fig. 1(b) shows the classification rates for fricatives. A single measure of fricative duration is used, so there is no distinction between closure and release duration measurements. Duration was the most important cue, and pitch-related features are more significant at closures than at releases. The result using all parameters was 80.6% (80.3%). The best reported results for fricative voicing, by Ali13 on a selected subset of the TIMIT database, are around 95%, trained on 220 fricatives from six speakers and tested on 500 fricatives from 22 speakers.

Finally, results for the different acoustic properties for affricates are shown in Fig. 1(c). Compared to the other consonants, pitch-related features are less significant, especially at releases; energy properties at closures, however, appear to be significant features. Release duration was also important for classification. Overall, performance is slightly lower than for the other consonants, with a 78.4% (78.0%) classification rate for all acoustic parameters. Across all consonants, release durations were important for classifying consonant voicing, and results using features from both releases and closures show consistently good performance.

In the next experiment, various combinations of conventional Mel-frequency cepstral coefficients (MFCCs) and the acoustic features described above are investigated. Using the acoustic parameters together with cepstral features, Gaussian mixture models with four mixtures are trained for each task on the TIMIT training data. Two versions of the MFCC features are used: MFCC(13), denoting 13 cepstra without derivatives, and MFCC(39), denoting 13 cepstra with their first and second derivatives. MFCC features are extracted using tools available in the HTK toolkit.14 Where MFCCs and acoustic cues are used together, the MFCCs are simply concatenated to the acoustic cues after a normalization step.
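A sketch of assembling MFCC(39) and fusing it with the acoustic cues follows. The paper extracts MFCCs with HTK; librosa is substituted here for illustration, averaging over frames stands in for taking measurements at the landmark, and per-dimension z-scoring with training-set statistics stands in for the unspecified normalization.

```python
# Sketch: build MFCC(39) and concatenate it with normalized acoustic cues.
import numpy as np
import librosa

def mfcc39(y, sr=16000):
    """Return one 39-dim vector: 13 cepstra plus first/second derivatives."""
    c = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, n_frames)
    d1 = librosa.feature.delta(c)                    # first derivatives
    d2 = librosa.feature.delta(c, order=2)           # second derivatives
    return np.vstack([c, d1, d2]).mean(axis=1)       # average over frames

def fuse(acoustic, mfcc, mu_a, sd_a, mu_m, sd_m):
    """Concatenate acoustic cues and MFCCs, each z-scored per dimension
    with training-set means (mu) and standard deviations (sd)."""
    return np.concatenate([(acoustic - mu_a) / sd_a, (mfcc - mu_m) / sd_m])
```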

The acoustic parameters and MFCCs are next used to classify consonant voicing for the TIMIT test set. Overall classification rates for stops are shown in Fig. 1(d). Classification rates using the acoustic cues are 78.6% (77.3%) and 71.3% (71.3%) at stop releases and stop closures, respectively. These results are better than those of MFCC(13) but generally lower than those of MFCC(39). With the acoustic cues, classification rates differ by 7% between releases and closures, whereas MFCC(39) shows more consistent performance. However, when the acoustic cues are added to the MFCC(39) features, performance improves in all cases. Moreover, classification using both closures and releases shows a 7% improvement over using cues from the closures or the releases alone.

Similarly, Figs. 1(e) and 1(f) show the classification results for fricatives and affricates, respectively. The acoustic parameters yield 77.8% (77.5%) and 79.8% (78.1%) at fricative releases and fricative closures, respectively. MFCC(39) performs about 2% better than the acoustic parameters, which in turn outperform MFCC(13). Using the acoustic cues, classification rates of 72.1% (71.5%) and 71% (72.1%) were achieved at affricate releases and affricate closures, respectively; MFCC(39) again performed about 3% better than the acoustic cues.

In general, adding the acoustic features to MFCC(39) yields about a 2% performance improvement for all consonants, indicating that the acoustic features carry voicing-discriminative information complementary to the MFCCs.

Finally, experiments were performed to examine the effect of channel-degraded speech on the consonant voicing distinction, using the NTIMIT database as telephone speech data. As shown in Table II, overall results for telephone speech are 2%–15% lower for each case compared to the TIMIT results. Comparing feature sets, the MFCCs show 8%–15% performance degradation while the acoustic cues show only 2%–9%. This indicates that the acoustic features are more robust to spectral degradation than MFCCs; in particular, the results for stops and fricatives show a large difference, despite the larger dimensionality of the MFCCs compared to the acoustic parameters. These results are also comparable to those reported by Borys and Hasegawa-Johnson,15 who used support vector machines (SVMs) to detect distinctive features in the NTIMIT corpus. Their SVMs used MFCCs, formants, and other acoustic cues16 from several frames (651 features in total), and achieved 74.7% accuracy for consonant voicing classification. Later research17 with refinements reached around 80% accuracy within a similar framework.

TABLE II.

Classification rates for consonant voicing on channel-degraded NTIMIT speech.

Consonant   Features              Classification rate
Stop        acoustic              74.1%
            MFCC(39)              70.6%
            acoustic + MFCC(39)   74.7%
Fricative   acoustic              77.9%
            MFCC(39)              74.6%
            acoustic + MFCC(39)   77.7%
Affricate   acoustic              71.4%
            MFCC(39)              71.8%
            acoustic + MFCC(39)   72.1%

This work examines acoustic parameters for the classification of consonant voicing in English, as part of a distinctive feature-based speech recognition system that aims for lower computational load and better robustness than conventional statistical methods. First, consonant landmarks are found at the closures and releases of stops, fricatives, and affricates. Acoustic cues are then measured around these landmarks and used to determine whether the consonant is voiced or unvoiced.

The acoustic cues used to infer the values of the underlying features are initially selected by studying the production mechanisms involved in producing a consonant. Extended measurements are then examined, including spectral representations, band-limited energies, and correlation values, along with widely used cepstral coefficient features. ANOVA is performed on all acoustic measurements for each consonant. The energy ratio (0–400 Hz/2000–7000 Hz) and PNCC are found to be significant measurements, and the band-limited energies, H1, and duration are also discriminative at the closures or releases of particular consonants.

The classification rates using all acoustic features for stops, fricatives, and affricates are 82.2% (81.2%), 80.6% (80.3%), and 78.4% (78.0%), respectively. Overall, the results show that the acoustic parameters, with or without the addition of MFCCs, can classify consonant voicing for stops, fricatives, and affricates. Moreover, adding the acoustic parameters to the MFCCs improves performance in all cases. For telephone speech, the acoustic parameters are found to be more robust than MFCCs in classifying consonant voicing.

In this paper, adjacent vowel effects were not considered. However, an earlier study by Choi8 found that features of vowels adjacent to consonants, such as F0, H1–H2, and formant trajectories, are discriminative for consonant voicing. To further investigate these effects, acoustic cues at vowels adjacent to voiced and unvoiced consonants need to be studied. Finally, the consonant voicing module described in this paper will be evaluated within an overall distinctive feature-based speech recognition system.

1. C. Y. Espy-Wilson, T. Pruthi, A. Juneja, and O. Deshmukh, "Landmark-based approach to speech recognition: An alternative to HMMs," in Proceedings of Interspeech 2007, pp. 886–889.
2. K. N. Stevens, "Toward a model for lexical access based on acoustic landmarks and distinctive features," J. Acoust. Soc. Am. 111, 1872–1891 (2002).
3. K. N. Stevens, "Modelling affricate consonants," Speech Commun. 13, 33–43 (1993).
4. K. N. Stevens, "Models for the production and acoustics of stop consonants," Speech Commun. 13, 367–375 (1993).
5. C. Shadle, "The acoustics of fricative consonants," RLE Technical Report No. 506, MIT, Cambridge, MA, 1985.
6. F. Bell-Berti, "Control of pharyngeal cavity size for English voiced and voiceless stops," J. Acoust. Soc. Am. 57, 456–461 (1975).
7. C. S. Crowther and V. Mann, "Native language factors affecting use of vocalic cues to final consonant voicing in English," J. Acoust. Soc. Am. 92, 711–722 (1992).
8. J. Y. Choi, "Detection of consonant voicing: A module for a hierarchical speech recognition system," Ph.D. thesis, MIT, Cambridge, MA, 1999.
9. J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, "The DARPA TIMIT acoustic-phonetic continuous speech corpus CDROM," Linguistic Data Consortium (1993).
10. C. Jankowski, A. Kalyanswamy, S. Basson, and J. Spitz, "NTIMIT: A phonetically balanced, continuous speech, telephone bandwidth speech database," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (1990), pp. 109–112.
11. P. Auzou, C. Ozsancak, R. J. Morris, M. Jan, F. Eustache, and D. Hannequin, "Voice onset time in aphasia, apraxia of speech and dysarthria: A review," Clin. Linguist. Phon. 14, 131–150 (2000).
12. A. Jongman, R. Wayland, and S. Wong, "Acoustic characteristics of English fricatives," J. Acoust. Soc. Am. 108, 1252–1263 (2000).
13. A. A. Ali, J. van der Spiegel, and P. Mueller, "An acoustic-phonetic feature based system for the automatic recognition of fricative consonants," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (1998), pp. 961–964.
14. S. Young, G. Evermann, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (Cambridge University Press, Cambridge, 2002).
15. S. Borys and M. Hasegawa-Johnson, "Distinctive feature based SVM discriminant features for improvements to phone recognition on telephone band speech," in Proceedings of Interspeech (2005).
16. N. Bitar, "Acoustic analysis and modelling of speech based on phonetic features," Ph.D. thesis, Boston University, 1998.
17. S. Borys, "An SVM front end landmark speech recognition system," M.S. thesis, University of Illinois at Urbana-Champaign, 2008.