Numerous attempts have been made to find low-dimensional, formant-related representations of speech signals that are suitable for automatic speech recognition. However, it is often not known how these features behave in comparison with true formants. The purpose of this study was to compare two sets of automatically extracted formant-like features, i.e., robust formants and HMM2 features, to hand-labeled formants. The robust formant features were derived by means of the split Levinson algorithm, while the HMM2 features correspond to the frequency segmentation of speech signals obtained by two-dimensional hidden Markov models. Mel-frequency cepstral coefficients (MFCCs) were also included in the investigation as an example of state-of-the-art automatic speech recognition features. The feature sets were compared in terms of their performance on a vowel classification task. The speech data and hand-labeled formants that were used in this study are a subset of the American English vowels database presented in Hillenbrand et al. [J. Acoust. Soc. Am. 97, 3099–3111 (1995)]. Classification performance was measured on the original, clean data and in noisy acoustic conditions. When using clean data, the classification performance of the formant-like features compared very well to the performance of the hand-labeled formants in a gender-dependent experiment, but was inferior to the hand-labeled formants in a gender-independent experiment. The results that were obtained in noisy acoustic conditions indicated that the formant-like features used in this study are not inherently noise robust. For clean and noisy data, as well as for the gender-dependent and gender-independent experiments, the MFCCs achieved results equal to or better than those of the formant features, but at the price of a much higher feature dimensionality.

1.
Bazzi, I., Acero, A., and Deng, L. (2003). “An expectation maximization approach for formant tracking using a parameter-free non-linear predictor,” in Proceedings of ICASSP 2003, Hong Kong, pp. I.464–I.467.
2.
Davis, S. B., and Mermelstein, P. (1980). “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoust., Speech, Signal Process. ASSP-28, 357–366.
3.
Delsarte, P., and Genin, Y. V. (1986). “The Split Levinson Algorithm,” IEEE Trans. Acoust., Speech, Signal Process. ASSP-34, 470–478.
4.
de Wet, F., Cranen, B., de Veth, J., and Boves, L. (2000). “Comparing acoustic features for robust ASR in fixed and cellular network applications,” in Proceedings of ICASSP 2000, Istanbul, Turkey, pp. 1415–1418.
5.
Duda, R. O., Hart, P. E., and Stork, D. G. (2001). Pattern Classification, 2nd ed. (Wiley, New York).
6.
Flanagan, J. L. (1972). Speech Analysis Synthesis and Perception, 2nd ed. (Springer, Berlin).
7.
Garner, P., and Holmes, W. (1998). “On the robust incorporation of formant features into hidden Markov models for automatic speech recognition,” in Proceedings of ICASSP 1998, Seattle, WA, pp. 1–4.
8.
Hillenbrand, J. M., and Gayvert, R. T. (1993). “Vowel classification based on fundamental frequency and formant frequencies,” J. Speech Hear. Res. 36, 694–700.
9.
Hillenbrand, J. M., Getty, L. A., Clark, M. J., and Wheeler, K. (1995). “Acoustic characteristics of American English vowels,” J. Acoust. Soc. Am. 97, 3099–3111.
10.
Holmes, J., Holmes, W., and Garner, P. (1997). “Using formant frequencies in speech recognition,” in Proceedings of Eurospeech 1997, Rhodes, Greece, pp. 2083–2086.
11.
Hunt, M. J. (1999). “Spectral signal processing for ASR,” in Proceedings of ASRU 1999, Keystone, CO.
12.
Juang, B.-H., Chou, W., and Lee, C.-H. (1996). “Statistical and discriminative methods for speech recognition,” in Automatic Speech and Speaker Recognition, Advanced Topics, edited by C.-H. Lee, F. Soong, and K. Paliwal (Kluwer Academic, Boston).
13.
Ladefoged, P. (1975). A Course in Phonetics (Harcourt Brace Jovanovich, New York).
14.
Markel, J., and Gray, A. H. (1976). Linear Prediction of Speech (Springer, Berlin).
15.
Minifie, F. D., Hixon, T. J., and Williams, F., editors (1973). Normal Aspects of Speech, Hearing and Language (Prentice–Hall, Englewood Cliffs, NJ).
16.
Nadeu, C. (1999). “On the filter-bank-based parametrization front-end for robust HMM speech recognition,” in Proceedings of Nokia Workshop on Robust Methods for Speech Recognition in Adverse Conditions, Tampere, Finland, pp. 235–238.
17.
Noisex (1990). NOISE-ROM-0. NATO: AC243/(Panel 3)/RSG-10, ESPRIT: Project 2589-SAM.
18.
Peterson, G. E., and Barney, H. L. (1952). “Control methods used in a study of the vowels,” J. Acoust. Soc. Am. 24, 175–184.
19.
Pols, L. C. W., van der Kamp, L. J. T., and Plomp, R. (1969). “Perceptual and physical space of vowel sounds,” J. Acoust. Soc. Am. 46, 458–467.
20.
Rabiner, L. R., and Juang, B. H. (1993). Fundamentals of Speech Recognition (Prentice–Hall, Englewood Cliffs, NJ).
21.
Rabiner, L. R., and Schafer, R. W. (1978). Digital Processing of Speech Signals (Prentice–Hall, Englewood Cliffs, NJ).
22.
Stevens, K. N. (1998). Acoustic Phonetics (MIT, Cambridge, MA).
23.
Weber, K. (2003). “HMM mixtures (HMM2) for robust speech recognition,” PhD thesis, Swiss Federal Institute of Technology Lausanne (EPFL), Lausanne, Switzerland.
24.
Weber, K., Bengio, S., and Bourlard, H. (2000). “HMM2—A novel approach to HMM emission probability estimation,” in Proceedings of ICSLP 2000, Beijing, China, pp. (III)147–150.
25.
Weber, K., Bengio, S., and Bourlard, H. (2001a). “HMM2—extraction of formant structures and their use for robust ASR,” in Proceedings of Eurospeech 2001, Aalborg, Denmark, pp. 607–610.
26.
Weber, K., Bengio, S., and Bourlard, H. (2001b). “A pragmatic view of the application of HMM2 for ASR,” IDIAP-RR 23, IDIAP, Martigny, Switzerland.
27.
Weber, K., Bengio, S., and Bourlard, H. (2001c). “Speech recognition using advanced HMM2 features,” in Proceedings of ASRU 2001, Madonna di Campiglio, Trento, Italy.
28.
Weber, K., de Wet, F., Cranen, B., Boves, L., Bengio, S., and Bourlard, H. (2002). “Evaluation of formant-like features for ASR,” in Proceedings of ICSLP 2002, Denver, CO.
29.
Welling, L., and Ney, H. (1996). “A model for efficient formant estimation,” in Proceedings of ICASSP 1996, Atlanta, GA, pp. 797–800.
30.
Willems, L. F. (1986). “Robust formant analysis,” in IPO Annual Report 21, Eindhoven, The Netherlands, pp. 34–40.
31.
Young, S., Odell, J., Ollason, D., Valtchev, V., and Woodland, P. (1997). The HTK Book (for HTK Version 2.1) (Cambridge University, Cambridge).