To test whether simultaneous spectral and temporal processing is required to extract robust features for automatic speech recognition (ASR), the robust spectro-temporal two-dimensional Gabor filter bank (GBFB) front-end from Schädler, Meyer, and Kollmeier [J. Acoust. Soc. Am. 131, 4134–4151 (2012)] was decomposed into a spectral one-dimensional Gabor filter bank and a temporal one-dimensional Gabor filter bank. A feature set extracted with these separate spectral and temporal modulation filter banks, the separate Gabor filter bank (SGBFB) features, was introduced and evaluated on the CHiME (Computational Hearing in Multisource Environments) keywords-in-noise recognition task. From the perspective of robust ASR, the results showed that spectral and temporal processing can be performed independently and need not interact with each other. Using SGBFB features permitted the signal-to-noise ratio (SNR) to be lowered by 1.2 dB while still performing as well as the GBFB-based reference system, which corresponds to a relative word-error-rate improvement of 12.8%. Additionally, the real-time factor of the spectro-temporal processing was reduced by more than an order of magnitude. Compared to human listeners, the SNR needed to be 13 dB higher with Mel-frequency cepstral coefficient features, 11 dB higher with GBFB features, and 9 dB higher with SGBFB features to achieve the same recognition performance.
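To make the separation of spectral and temporal processing concrete, the following minimal Python sketch applies one spectral and one temporal one-dimensional Gabor filter independently to a log-Mel spectrogram. The filter lengths, modulation frequencies, Hann envelope, and DC-removal step are illustrative assumptions, not the published SGBFB parameters; the authors' reference Matlab implementation (Schädler, 2014) defines the actual front-end.

```python
# Illustrative sketch of separate spectral and temporal 1-D Gabor filtering
# (SGBFB-style). All parameters below are assumptions for demonstration and
# do NOT reproduce the published SGBFB settings.
import numpy as np

def gabor_1d(omega, size):
    """Complex 1-D Gabor filter: Hann envelope times a complex carrier.
    omega: modulation frequency in radians per sample (bin or frame)."""
    n = np.arange(size) - (size - 1) / 2.0
    envelope = 0.5 - 0.5 * np.cos(2 * np.pi * (np.arange(size) + 1) / (size + 1))
    g = envelope * np.exp(1j * omega * n)
    # Project out the envelope component so the filter has near-zero DC
    # response (illustrative only, not the exact GBFB normalization).
    return g - envelope * (g @ envelope) / (envelope @ envelope)

def filter_along(spectrogram, g, axis):
    """Convolve each column (axis=0, spectral) or row (axis=1, temporal)
    of a log-Mel spectrogram with a 1-D filter."""
    return np.apply_along_axis(lambda x: np.convolve(x, g, mode="same"),
                               axis, spectrogram)

# Example input: 23 Mel bands x 200 frames of log-Mel energies (random stand-in).
log_mel = np.random.randn(23, 200)

spectral_out = np.real(filter_along(log_mel, gabor_1d(0.25 * np.pi, 15), axis=0))
temporal_out = np.real(filter_along(log_mel, gabor_1d(0.10 * np.pi, 25), axis=1))

# In an SGBFB front-end, the outputs of a set of spectral and a set of temporal
# modulation filters would be sub-sampled and concatenated into the feature
# vector; here only one filter per dimension is shown.
```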

1. Barker, J., Vincent, E., Ma, N., Christensen, H., and Green, P. (2013). "The PASCAL CHiME speech separation and recognition challenge," Comput. Speech Lang. 27, 621–633.
2. Chi, T., Ru, P., and Shamma, S. A. (2005). "Multiresolution spectrotemporal analysis of complex sounds," J. Acoust. Soc. Am. 118, 887–906.
3. Davis, S., and Mermelstein, P. (1980). "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust. Speech Signal Process. 28, 357–366.
4. De la Torre, A., Peinado, A. M., Segura, J. C., Pérez-Córdoba, J. L., Benítez, M. C., and Rubio, A. J. (2005). "Histogram equalization of speech representation for robust speech recognition," IEEE Trans. Speech Audio Process. 13, 355–366.
5. Depireux, D. A., Simon, J. Z., Klein, D. J., and Shamma, S. A. (2001). "Spectro-temporal response field characterization with dynamic ripples in ferret primary auditory cortex," J. Neurophysiol. 85, 1220–1234.
6. Ezzat, T., Bouvrie, J. V., and Poggio, T. (2007). "Spectro-temporal analysis of speech using 2-D Gabor filters," in Proceedings of Interspeech 2007, pp. 506–509.
7. Hermansky, H. (1990). "Perceptual linear predictive (PLP) analysis of speech," J. Acoust. Soc. Am. 87, 1738–1752.
8. Hermansky, H., and Fousek, P. (2005). "Multi-resolution RASTA filtering for TANDEM-based ASR," in Proceedings of Interspeech 2005, pp. 361–364.
9. Hermansky, H., Kohn, P., Morgan, N., and Bayya, A. (1992). "RASTA-PLP speech analysis technique," in Proceedings of ICASSP 1992, Vol. 1, pp. 121–124.
10. Hermansky, H., and Sharma, S. (1999). "Temporal patterns (TRAPS) in ASR of noisy speech," in Proceedings of ICASSP 1999, Vol. 1, pp. 289–292.
11. Kleinschmidt, M. (2002). "Methods for capturing spectro-temporal modulations in automatic speech recognition," Acta Acust. Acust. 88, 416–422.
12. Kleinschmidt, M., and Gelbart, D. (2002). "Improving word accuracy with Gabor feature extraction," in Proceedings of Interspeech 2002, pp. 25–28.
13. Lippmann, R. P. (1997). "Speech recognition by machines and humans," Speech Commun. 22, 1–15.
14. Mesgarani, N., Slaney, M., and Shamma, S. A. (2006). "Discrimination of speech from non-speech based on multiscale spectro-temporal modulations," IEEE Trans. Audio Speech Lang. Proc. 14, 920–930.
15. Meyer, B. T., Brand, T., and Kollmeier, B. (2011). "Effect of speech-intrinsic variations on human and automatic recognition of spoken phonemes," J. Acoust. Soc. Am. 129, 388–403.
16. Meyer, B. T., and Kollmeier, B. (2011). "Robustness of spectro-temporal features against intrinsic and extrinsic variations in automatic speech recognition," Speech Commun. 53, 753–767.
17. Moritz, N., Anemüller, J., and Kollmeier, B. (2011). "Amplitude modulation spectrogram based features for robust speech recognition in noisy and reverberant environments," in Proceedings of ICASSP 2011, pp. 5492–5495.
18. Moritz, N., Schädler, M. R., Adiloglu, K., Meyer, B. T., Jürgens, T., Gerkmann, T., Kollmeier, B., Doclo, S., and Goetze, S. (2013). "Noise robust distant automatic speech recognition utilizing NMF based source separation and auditory feature extraction," in Proceedings of CHiME Workshop 2013, Vancouver, British Columbia, Canada, pp. 1–6.
19. Nadeu, C., Macho, D., and Hernando, J. (2001). "Time and frequency filtering of filter-bank energies for robust HMM speech recognition," Speech Commun. 34, 93–114.
20. Qiu, A., Schreiner, C. E., and Escabí, M. A. (2003). "Gabor analysis of auditory midbrain receptive fields: Spectro-temporal and binaural composition," J. Neurophysiol. 90, 456–476.
21. Schädler, M. R. (2014). "Reference Matlab implementations of feature extraction algorithms," http://medi.uni-oldenburg.de/SGBFB (Last viewed January 14, 2015).
22. Schädler, M. R., Meyer, B. T., and Kollmeier, B. (2012). "Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition," J. Acoust. Soc. Am. 131, 4134–4151.
23. Schröder, J., Moritz, N., Schädler, M. R., Cauchi, B., Adiloglu, K., Anemüller, J., Doclo, S., Kollmeier, B., and Goetze, S. (2013). "On the use of spectro-temporal features for the IEEE AASP challenge 'Detection and classification of acoustic scenes and events,'" in Proceedings of Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2013, pp. 1–4.
24. Vertanen, K. (2006). "Baseline WSJ acoustic models for HTK and Sphinx: Training recipes and recognition experiments," Technical report, Cavendish Laboratory, University of Cambridge, Cambridge, UK.
25. Viikki, O., and Laurila, K. (1998). "Cepstral domain segmental feature vector normalization for noise robust speech recognition," Speech Commun. 25, 133–147.
26. Vincent, E., Barker, J., Watanabe, S., Le Roux, J., Nesta, F., and Matassoni, M. (2013). "The second 'CHiME' speech separation and recognition challenge: Datasets, tasks and baselines," in Proceedings of Workshop on Automatic Speech Recognition and Understanding (ASRU) 2013, pp. 126–130.
27. Weide, R. L., and Rudnicky, A. (2008). "The CMU pronouncing dictionary," available at http://www.speech.cs.cmu.edu/cgi-bin/cmudict (Last viewed January 14, 2015).
28. Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., and Povey, D. (2009). "The HTK Book" (for HTK version 3.4), Cambridge University Engineering Department, pp. 1–384.