The influence of different sources of speech-intrinsic variation (speaking rate, effort, style, and dialect or accent) on human speech perception was investigated. In listening experiments with 16 listeners, confusions of consonant-vowel-consonant (CVC) and vowel-consonant-vowel (VCV) sounds in speech-weighted noise were analyzed. Experiments were based on the OLLO logatome speech database, which was designed for a man-machine comparison. It contains utterances spoken by 50 speakers from five dialect/accent regions and covers several intrinsic variations. By comparing results that depend on intrinsic variations with results that depend on extrinsic variations (i.e., different levels of masking noise), the degradation induced by these variabilities can be expressed in terms of the SNR. The spectral level distance between the respective speech segment and the long-term spectrum of the masking noise was found to be a good predictor of recognition rates, while phoneme confusions were influenced by the distance to spectrally close phonemes. An analysis based on the transmitted information of articulatory features showed that voicing and manner of articulation are comparatively robust cues in the presence of intrinsic variations, whereas the coding of place is more degraded. The database and detailed results have been made available for comparisons between human speech recognition (HSR) and automatic speech recognition (ASR).
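The abstract does not spell out how the transmitted information of articulatory features was computed. The following is a minimal sketch, assuming the common approach of pooling a phoneme confusion matrix into feature classes and taking the mutual information between presented and perceived classes. The function names `transmitted_information` and `pool_by_feature` are hypothetical, and the counts are invented toy values, not data from the study.

```python
import math


def transmitted_information(confusions):
    """Return (T, T_rel) for a confusion matrix of counts.

    Rows are presented classes, columns are perceived classes.
    T is the mutual information in bits; T_rel divides T by the
    entropy of the presented classes (assumed normalization).
    """
    total = sum(sum(row) for row in confusions)
    row_p = [sum(row) / total for row in confusions]
    col_p = [sum(col) / total for col in zip(*confusions)]
    t = 0.0
    for i, row in enumerate(confusions):
        for j, count in enumerate(row):
            if count > 0:  # empty cells contribute nothing
                p = count / total
                t += p * math.log2(p / (row_p[i] * col_p[j]))
    h_x = -sum(p * math.log2(p) for p in row_p if p > 0)
    return t, (t / h_x if h_x > 0 else 0.0)


def pool_by_feature(confusions, labels):
    """Sum phoneme-level counts into feature classes.

    labels[i] is the feature value (e.g. "voiced") of phoneme i.
    """
    classes = sorted(set(labels))
    idx = [classes.index(label) for label in labels]
    pooled = [[0] * len(classes) for _ in classes]
    for i, row in enumerate(confusions):
        for j, count in enumerate(row):
            pooled[idx[i]][idx[j]] += count
    return pooled


# Invented toy counts for /p/, /b/, /t/, /d/ (illustration only).
toy = [
    [30, 5, 12, 3],
    [6, 28, 2, 14],
    [11, 3, 31, 5],
    [2, 13, 6, 29],
]
voicing = ["unvoiced", "voiced", "unvoiced", "voiced"]
t_bits, t_rel = transmitted_information(pool_by_feature(toy, voicing))
print(f"voicing: T = {t_bits:.3f} bit, relative T = {t_rel:.3f}")
```

Pooling the same matrix with place labels instead of voicing labels would give the corresponding value for place of articulation, which allows the robustness of the two features to be compared on a common scale.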
