This paper addresses the problem of automatic identification of vowels uttered in isolation by female and child speakers. In this case, the magnitude spectrum of voiced vowels is sparsely sampled, since only frequencies at integer multiples of F0 are significant. This degrades the performance of vowel identification techniques that either ignore pitch or rely on global shape models. A new pitch-dependent approach to vowel identification is proposed that emerges from the concept of timbre and that defines perceptual spectral clusters (PSC) of harmonic partials. A representative set of static PSC-related features is estimated, and their performance is evaluated in automatic classification tests using the Mahalanobis distance. Linear prediction features and Mel-frequency cepstral coefficients (MFCC) are used as a reference, and a database of five natural Portuguese vowel sounds uttered by 44 speakers (including 27 child speakers) is used for training and testing the Gaussian models. Results indicate that PSC features perform better than plain linear prediction features but slightly worse than MFCC features. However, PSC features have the potential to take full advantage of the pitch structure of voiced vowels, namely in the analysis of concurrent voices or by using pitch as a normalization parameter.
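
As an illustration of the classification step mentioned in the abstract, the following is a minimal sketch of a Mahalanobis-distance classifier built on one Gaussian model per vowel class. It is not the paper's implementation: the function names (`fit_class_models`, `mahalanobis_sq`, `classify`), the two-dimensional synthetic feature vectors, and the two class labels are all assumptions made for this example, and real PSC, linear prediction, or MFCC features would take the place of the random data.

```python
import numpy as np


def fit_class_models(features, labels):
    """Estimate a mean vector and an inverse covariance matrix per class."""
    labels = np.asarray(labels)
    models = {}
    for label in sorted(set(labels.tolist())):
        x = features[labels == label]
        mean = x.mean(axis=0)
        cov = np.atleast_2d(np.cov(x, rowvar=False))
        # The pseudo-inverse tolerates a singular covariance estimate.
        models[label] = (mean, np.linalg.pinv(cov))
    return models


def mahalanobis_sq(x, mean, inv_cov):
    """Squared Mahalanobis distance from x to a Gaussian class model."""
    d = x - mean
    return float(d @ inv_cov @ d)


def classify(x, models):
    """Return the label of the class model closest to x."""
    return min(models, key=lambda label: mahalanobis_sq(x, *models[label]))


# Synthetic stand-in data (assumption): two well-separated "vowel" classes.
rng = np.random.default_rng(0)
train_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
train_i = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2))
features = np.vstack([train_a, train_i])
labels = ["a"] * 50 + ["i"] * 50

models = fit_class_models(features, labels)
predicted = classify(np.array([2.8, 3.1]), models)
```

In this sketch each class is summarized only by its mean and covariance, so the decision depends on how far a test vector lies from each class center in units of that class's spread.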
