The modulation filtering approach to robust automatic speech recognition (ASR) is based on enhancing perceptually relevant regions of the modulation spectrum while suppressing the regions susceptible to noise. In this paper, a data-driven unsupervised modulation filter learning scheme is proposed using a convolutional restricted Boltzmann machine. The initial filter is learned using the speech spectrogram, while subsequent filters are learned using residual spectrograms. The modulation-filtered spectrograms are used for ASR experiments on noisy and reverberant speech, where these features provide significant improvements over other robust features. Furthermore, the application of the proposed method to semi-supervised learning is investigated.
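The residual-based learning loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `learn_filter` is a hypothetical stand-in that uses the leading principal component of temporal patches instead of a convolutional restricted Boltzmann machine, and the residual is assumed to be the input spectrogram minus its filtered version. The function names, the filter length, and the number of filters are all illustrative.

```python
import numpy as np


def learn_filter(spec, length=11):
    """Hypothetical stand-in for the CRBM learner (assumption): return the
    leading principal component of temporal patches from every band."""
    # Shape (bands, frames - length + 1, length): one patch per time shift.
    patches = np.lib.stride_tricks.sliding_window_view(spec, length, axis=1)
    patches = patches.reshape(-1, length)
    patches = patches - patches.mean(axis=0)
    _, _, vt = np.linalg.svd(patches, full_matrices=False)
    filt = vt[0]
    return filt / np.linalg.norm(filt)


def apply_filter(spec, filt):
    """Filter each frequency band along time, keeping the frame count."""
    return np.stack([np.convolve(band, filt, mode="same") for band in spec])


def learn_filter_cascade(spec, n_filters=3, length=11):
    """Learn the first filter on the spectrogram and each later filter on
    the residual left after removing the previous filter's output."""
    filters = []
    residual = spec.copy()
    for _ in range(n_filters):
        filt = learn_filter(residual, length)
        filters.append(filt)
        # Assumed residual definition: input minus its filtered version.
        residual = residual - apply_filter(residual, filt)
    return filters, residual
```

Calling `learn_filter_cascade(spec)` on a spectrogram array of shape (bands, frames) returns the list of learned temporal filters and the final residual. Any learner that maps a spectrogram to a filter could replace `learn_filter` in this sketch.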
