The goal of this study is to investigate advanced signal processing approaches [single frequency filtering (SFF) and zero-time windowing (ZTW)] with modern deep neural networks (DNNs) [convolutional neural networks (CNNs), temporal convolutional networks (TCNs), time-delay neural networks (TDNNs), and emphasized channel attention, propagation and aggregation in TDNN (ECAPA-TDNN)] for classification of major dialects of English. Previous studies have indicated that the SFF and ZTW methods provide higher spectro-temporal resolution than the conventional short-time Fourier transform (STFT). To capture the intrinsic variations in articulation among dialects, four feature representations [spectrogram (SPEC), cepstral coefficients, mel filter-bank energies, and mel-frequency cepstral coefficients (MFCCs)] are derived from the SFF and ZTW methods. Experiments with and without data augmentation using CNN classifiers revealed that the proposed features performed better than the baseline STFT-based features on the UT-Podcast database [Hansen, J. H., and Liu, G. (2016). “Unsupervised accent classification for deep data fusion of accent and language information,” Speech Commun. 78, 19–33]. Even without data augmentation, all the proposed features showed approximate relative improvements of 15%–20% over the best baseline feature (SPEC-STFT). TCN, TDNN, and ECAPA-TDNN classifiers, which capture wider temporal context, further improved the performance for many of the proposed and baseline features. Among all the baseline and proposed features, the best performance was achieved with single frequency filtered cepstral coefficients for TCN (81.30%), TDNN (81.53%), and ECAPA-TDNN (85.48%). Replacing the fixed mel scale with data-driven filters improved performance by 2.8% and 1.4% (relative) for SPEC-STFT and SPEC-SFF, respectively, and gave nearly equal performance for SPEC-ZTW. To assist related work, we have made the code available [Kethireddy, R., and Kadiri, S. R. (2022). “Deep neural architectures for dialect classification with single frequency filtering and zero-time windowing feature representations,” https://github.com/r39ashmi/e2e_dialect (Last viewed 21 December 2021)].
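For readers who want a concrete starting point, the sketch below illustrates how an SFF envelope spectrogram of the kind used in this study can be computed, following the single frequency filtering formulation of Aneeja and Yegnanarayana (2015): the signal is frequency-shifted so that each analysis frequency lands on the pole of a single-pole filter near the unit circle, and the magnitude of the complex filter output gives a per-sample amplitude envelope at that frequency. The function name, the number of analysis frequencies, and the pole radius r = 0.995 are illustrative assumptions rather than the exact configuration of this study; the released code at the GitHub link above is the authoritative implementation.

```python
import numpy as np
from scipy.signal import lfilter

def sff_spectrogram(x, fs, n_freqs=256, r=0.995):
    """Single frequency filtering (SFF) envelope spectrogram (sketch).

    Parameters n_freqs and r are illustrative choices, not the paper's
    exact settings. Returns an (n_freqs, len(x)) envelope matrix and the
    analysis frequencies in Hz.
    """
    n = np.arange(len(x))
    freqs = np.linspace(0.0, fs / 2.0, n_freqs)   # analysis frequencies (Hz)
    env = np.empty((n_freqs, len(x)))
    for k, f_k in enumerate(freqs):
        # Shift f_k to fs/2, where the pole of the filter below (z = -r) sits.
        w_k = np.pi - 2.0 * np.pi * f_k / fs
        x_shift = x * np.exp(1j * w_k * n)
        # Single-pole filter: y[n] = -r * y[n-1] + x_shift[n].
        y = lfilter([1.0], [1.0, r], x_shift)
        # Magnitude of the complex output = amplitude envelope at f_k,
        # available at every sample (high temporal resolution).
        env[k] = np.abs(y)
    return env, freqs

# Cepstral features in the spirit of SFFCC could then be obtained by
# averaging the envelopes over short frames, taking the log, and
# applying a DCT per frame (again, a sketch, not the released recipe).
```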

1. Abdel-Hamid, O., Mohamed, A., Jiang, H., and Penn, G. (2012). “Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012, pp. 4277–4280.
2. Aneeja, G., and Yegnanarayana, B. (2015). “Single frequency filtering approach for discriminating speech and nonspeech,” IEEE Trans. Audio Speech Lang. Process. 23(4), 705–717.
3. Arslan, L. M., and Hansen, J. H. (1997). “A study of temporal features and frequency characteristics in American English foreign accent,” J. Acoust. Soc. Am. 102(1), 28–40.
4. Bai, S., Kolter, J. Z., and Koltun, V. (2018). “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv:1803.01271.
5. Behravan, H., Hautamäki, V., Siniscalchi, S. M., Kinnunen, T., and Lee, C. (2016). “i-vector modeling of speech attributes for automatic foreign accent recognition,” IEEE Trans. Audio Speech Lang. Process. 24(1), 29–41.
6. Bougrine, S., Cherroun, H., and Ziadi, D. (2018). “Prosody-based spoken Algerian Arabic dialect identification,” Procedia Comput. Sci. 128, 9–17.
7. Cai, W., Cai, D., Huang, S., and Li, M. (2019). “Utterance-level end-to-end language identification using attention-based CNN-BLSTM,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019, pp. 5991–5995.
8. Chen, N. F., Shen, W., Campbell, J. P., and Torres-Carrasquillo, P. A. (2011). “Informative dialect recognition using context-dependent pronunciation modeling,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011, pp. 4396–4399.
9. Chen, N. F., Tam, S. W., Shen, W., and Campbell, J. P. (2014). “Characterizing phonetic transformations and acoustic differences across English dialects,” IEEE Trans. Audio Speech Lang. Process. 22(1), 110–124.
10. Chennupati, N., Kadiri, S. R., and Yegnanarayana, B. (2019). “Spectral and temporal manipulations of SFF envelopes for enhancement of speech intelligibility in noise,” Comput. Speech Lang. 54, 86–105.
11. Cui, Y., Jia, M., Lin, T., Song, Y., and Belongie, S. (2019). “Class-balanced loss based on effective number of samples,” in Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, 16–20 June 2019, pp. 9260–9269.
12. DeMarco, A., and Cox, S. J. (2012). “Iterative classification of regional British accents in i-vector space,” in Proceedings of the Symposium on Machine Learning in Speech and Language Processing (MLSLP), Portland, OR, 14 September 2012, pp. 1–4.
13. Desplanques, B., Thienpondt, J., and Demuynck, K. (2020). “ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” arXiv:2005.07143.
14. Dhananjaya, N. (2011). “Signal processing for excitation-based analysis of acoustic events in speech,” Ph.D. thesis, IIT Madras, Chennai.
15. Dhananjaya, N., Yegnanarayana, B., and Bhaskararao, P. (2012). “Acoustic analysis of trill sounds,” J. Acoust. Soc. Am. 131(4), 3141–3152.
16. Gao, S., Cheng, M.-M., Zhao, K., Zhang, X.-Y., Yang, M.-H., and Torr, P. H. (2019). “Res2Net: A new multi-scale backbone architecture,” IEEE Trans. Pattern Anal. Mach. Intell.
17. Hansen, J. H., and Liu, G. (2016). “Unsupervised accent classification for deep data fusion of accent and language information,” Speech Commun. 78, 19–33.
18. He, K., Zhang, X., Ren, S., and Sun, J. (2016). “Deep residual learning for image recognition,” in Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 27–30 June 2016, pp. 770–778.
19. Johnson, R., and Zhang, T. (2017). “Deep pyramid convolutional neural networks for text categorization,” in Proceedings of the Association for Computational Linguistics (ACL), Vancouver, Canada, 31 July–2 August 2017, pp. 562–570.
20. Kadiri, S. R., and Alku, P. (2019). “Mel-frequency cepstral coefficients derived using the zero-time windowing spectrum for classification of phonation types in singing,” J. Acoust. Soc. Am. 146(5), EL418–EL423.
21. Kadiri, S. R., and Yegnanarayana, B. (2017). “Epoch extraction from emotional speech using single frequency filtering approach,” Speech Commun. 86, 52–63.
22. Kadiri, S. R., and Yegnanarayana, B. (2018a). “Analysis and detection of phonation modes in singing voice using excitation source features and single frequency filtering cepstral coefficients (SFFCC),” in Proceedings of Interspeech, Hyderabad, India, 2–6 September 2018, pp. 441–445.
23. Kadiri, S. R., and Yegnanarayana, B. (2018b). “Breathy to tense voice discrimination using zero-time windowing cepstral coefficients (ZTWCCs),” in Proceedings of Interspeech, Hyderabad, India, 2–6 September 2018, pp. 232–236.
24. Kat, L. W., and Fung, P. (1999). “Fast accent identification and accented speech recognition,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Phoenix, AZ, 15–19 March 1999, Vol. 1, pp. 221–224.
25. Kethireddy, R., Kadiri, S. R., Alku, P., and Gangashetty, S. V. (2020a). “Mel-weighted single frequency filtering spectrogram for dialect identification,” IEEE Access 8, 174871–174879.
26. Kethireddy, R., Kadiri, S. R., and Gangashetty, S. V. (2020b). “Learning filterbanks from raw waveform for accent classification,” in Proceedings of the International Joint Conference on Neural Networks, Glasgow, UK, 19–24 July 2020, pp. 1–6.
27. Kethireddy, R., Kadiri, S. R., Kesiraju, S., and Gangashetty, S. V. (2020c). “Zero-time windowing cepstral coefficients for dialect classification,” in Proceedings of Odyssey, The Speaker and Language Recognition Workshop, Tokyo, Japan, 1–5 November 2020, pp. 32–38.
28. Kethireddy, R., and Kadiri, S. R. (2022). “Deep neural architectures for dialect classification with single frequency filtering and zero-time windowing feature representations,” https://github.com/r39ashmi/e2e_dialect (Last viewed 21 December 2021).
29. Ko, T., Peddinti, V., Povey, D., and Khudanpur, S. (2015). “Audio augmentation for speech recognition,” in Proceedings of Interspeech, Dresden, Germany, 6–10 September 2015, pp. 3586–3589.
30. Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). “ImageNet classification with deep convolutional neural networks,” in Proceedings of Advances in Neural Information Processing Systems, Lake Tahoe, NV, 3–6 December 2012, pp. 1106–1114.
31. Lo, S. C. B., Chan, H. P., Lin, J. S., Li, H., Freedman, M. T., and Mun, S. K. (1995). “Artificial convolution neural network for medical image pattern recognition,” Neural Netw. 8(7–8), 1201–1214.
32. Lu, H., Zhang, H., and Nayak, A. (2020). “A deep neural network for audio classification with a classifier attention mechanism,” arXiv:2006.09815.
33. Nagrani, A., Chung, J. S., and Zisserman, A. (2017). “VoxCeleb: A large-scale speaker identification dataset,” arXiv:1706.08612.
34. Najafian, M., Khurana, S., Shan, S., Ali, A., and Glass, J. (2018). “Exploiting convolutional neural networks for phonotactic based dialect identification,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018, pp. 5174–5178.
35. Nellore, B. T., Prasad, R., Kadiri, S. R., Gangashetty, S. V., and Yegnanarayana, B. (2017). “Locating burst onsets using SFF envelope and phase information,” in Proceedings of Interspeech, Stockholm, Sweden, 20–24 August 2017, pp. 3023–3027.
36. Pandey, A., and Wang, D. (2019). “TCNN: Temporal convolutional neural network for real-time speech enhancement in the time domain,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019, pp. 6875–6879.
37. Pannala, V., Aneeja, G., Kadiri, S. R., and Yegnanarayana, B. (2016). “Robust estimation of fundamental frequency using single frequency filtering approach,” in Proceedings of Interspeech, San Francisco, CA, 8–12 September 2016, pp. 2155–2159.
38. Peddinti, V., Chen, G., Manohar, V., Ko, T., Povey, D., and Khudanpur, S. (2015a). “JHU ASpIRE system: Robust LVCSR with TDNNs, i-vector adaptation and RNN-LMs,” in Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Scottsdale, AZ, 13–17 December 2015, pp. 539–546.
39. Peddinti, V., Povey, D., and Khudanpur, S. (2015b). “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Queensland, Australia, 19–24 April 2015, pp. 3214–3218.
40. Qi, Z., Ma, Y., Gu, M., Jin, Y., Li, S., Zhang, Q., and Shen, Y. (2018). “End-to-end Chinese dialect identification using deep feature model of recurrent neural network,” in Proceedings of the International Conference on Computer and Communications (ICCC), Chengdu, China, 7–10 December 2018, pp. 2148–2152.
41. Rajpal, A., Patel, T. B., Sailor, H. B., Madhavi, M. C., Patil, H. A., and Fujisaki, H. (2016). “Native language identification using spectral and source-based features,” in Proceedings of Interspeech, San Francisco, CA, 8–12 September 2016, pp. 2383–2387.
42. Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L., Subakan, C., Dawalatabad, N., Heba, A., Zhong, J., Chou, J.-C., Yeh, S.-L., Fu, S.-W., Liao, C.-F., Rastorgueva, E., Grondin, F., Aris, W., Na, H., Gao, Y., Mori, R. D., and Bengio, Y. (2021). “SpeechBrain: A general-purpose speech toolkit,” arXiv:2106.04624.
43. Rouas, J. (2007). “Automatic prosodic variations modeling for language and dialect discrimination,” IEEE Trans. Audio Speech Lang. Process. 15(6), 1904–1911.
44. Seki, H., Yamamoto, K., and Nakagawa, S. (2017). “A deep neural network integrated with filterbank learning for speech recognition,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, 5–9 March 2017, pp. 5480–5484.
45. Shon, S., Ali, A., and Glass, J. (2018a). “Convolutional neural network and language embeddings for end-to-end dialect recognition,” in Proceedings of Odyssey, The Speaker and Language Recognition Workshop, Les Sables d'Olonne, France, 26–29 June 2018, pp. 98–104.
46. Shon, S., Hsu, W.-N., and Glass, J. (2018b). “Unsupervised representation learning of speech for dialect identification,” in Proceedings of the IEEE Spoken Language Technology Workshop, Athens, Greece, 18–21 December 2018, pp. 105–111.
47. Siddhant, A., Jyothi, P., and Ganapathy, S. (2017). “Leveraging native language speech for accent identification using deep Siamese networks,” in Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan, 16–20 December 2017, pp. 621–628.
48. Simonyan, K., and Zisserman, A. (2015). “Very deep convolutional networks for large-scale image recognition,” in Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, 7–9 May 2015.
49. Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and Khudanpur, S. (2018). “X-vectors: Robust DNN embeddings for speaker recognition,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018, pp. 5329–5333.
50. SoX (2021). “Audio manipulation tool,” http://sox.sourceforge.net/ (Last viewed 21 December 2021).
51. Waibel, A. (1989). “Modular construction of time-delay neural networks for speech recognition,” Neural Comput. 1(1), 39–46.
52. Wu, Y., Mao, H., and Yi, Z. (2018). “Audio classification using attention-augmented convolutional neural network,” Knowl. Based Syst. 161, 90–100.
53. Yegnanarayana, B., and Dhananjaya, N. (2013). “Spectro-temporal analysis of speech signals using zero-time windowing and group delay function,” Speech Commun. 55(6), 782–795.
54. Yu, H., Tan, Z.-H., Zhang, Y., Ma, Z., and Guo, J. (2017). “DNN filter bank cepstral coefficients for spoofing detection,” IEEE Access 5, 4779–4787.