This study proposes methods dedicated to the detection of allophonic variation in English. Its aim is to find an efficient method for the automatic evaluation of aspiration in the pronunciation of Polish second-language (L2) English speakers when whole words are analyzed instead of individual allophones extracted from words. Sample words containing aspirated and unaspirated allophones were prepared by experts in English phonetics and phonology. The resulting datasets include recordings of words pronounced by nine native English speakers with a standard southern British accent and by 20 Polish L2 English users. Complete, unedited words serve as input for feature extraction and for classification algorithms such as k-nearest neighbors (kNN), naive Bayes, long short-term memory (LSTM), and convolutional neural networks (CNNs). Various signal representations, including low-level audio features, the so-called mid-term features and feature trajectories, and spectrograms, are tested for their usability in aspiration detection. The results show high potential for automated pronunciation evaluation focused on a particular phonological feature (aspiration) when classifiers analyze whole words. Additionally, the CNN returns satisfactory results for the automated classification of words containing aspirated and unaspirated allophones produced by Polish L2 speakers.
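The pipeline described above (whole-word audio in, spectral features out, classical classifier on top) can be sketched in a minimal, purely illustrative form. The snippet below is not the authors' implementation: it uses synthetic signals in which an "aspirated" token begins with a broadband noise burst (a crude stand-in for the aspiration phase) and an "unaspirated" token begins with immediate voicing, summarizes each whole word by its time-averaged log spectrogram (a simple mid-term-style feature), and trains a kNN classifier. All names, durations, and parameter values are assumptions made for the sketch.

```python
import numpy as np
from scipy.signal import spectrogram
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
fs = 16000  # assumed sampling rate in Hz

def synth_word(aspirated: bool) -> np.ndarray:
    # Hypothetical 300-ms "word": a 200-Hz tone standing in for voicing.
    t = np.arange(int(0.3 * fs)) / fs
    x = np.sin(2 * np.pi * 200 * t)
    if aspirated:
        # Replace the first 50 ms with noise, mimicking an aspiration burst.
        burst = rng.standard_normal(int(0.05 * fs))
        x[: burst.size] = burst
    return x

def word_features(x: np.ndarray) -> np.ndarray:
    # Mean log spectrogram over time: one vector per whole, unedited word.
    f, t, Sxx = spectrogram(x, fs=fs, nperseg=256)
    return np.log(Sxx + 1e-10).mean(axis=1)

# Twenty tokens, alternating aspirated (1) and unaspirated (0).
X = np.array([word_features(synth_word(bool(i % 2 == 0))) for i in range(20)])
y = np.array([1 if i % 2 == 0 else 0 for i in range(20)])

clf = KNeighborsClassifier(n_neighbors=3).fit(X[:16], y[:16])
acc = clf.score(X[16:], y[16:])
print(f"held-out accuracy: {acc:.2f}")
```

The same feature matrix could instead feed a naive Bayes model, or the raw spectrograms could be passed as 2D inputs to a CNN, mirroring the comparison of representations and classifiers reported in the study.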
