Goodness of pronunciation (GOP) is the most widely used method for automatic mispronunciation detection. This paper proposes a transfer learning approach to GOP-based mispronunciation detection that applies maximum F1-score criterion (MFC) training to deep neural network (DNN)-hidden Markov model acoustic models. Rather than training the whole network with MFC, the DNN borrows its hidden layers from native speech recognition, and only the softmax layer is trained according to the MFC objective function. This yields a significant improvement in mispronunciation detection. In light of this, the two-stage transfer learning based GOP is investigated in depth. The first stage uses the hidden layer(s) to extract phonetically discriminative features. The second stage uses a trainable softmax layer to learn the human standard of judgment. The approach is validated by experimenting with different mispronunciation detection architectures, using acoustic models trained with different criteria. Frame-level cross-entropy is found to be the preferable criterion for training the hidden layer parameters. Classifier-based mispronunciation detection is also tested with features computed by the transfer learning based GOP, and these features likewise improve the results.
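
The two-stage idea can be illustrated with a minimal Python sketch. It is not the paper's implementation, and the abstract does not give the exact formulas, so several pieces are assumptions: GOP is taken to be the mean log posterior of the canonical phone over a segment's frames (one common DNN-based formulation), the F1 objective is made smooth with a sigmoid on the GOP margin, and a crude coordinate search stands in for gradient-based MFC training. All names, weights, and the toy data (`W_HIDDEN`, `W_SOFTMAX`, `SEGMENTS`, `THRESHOLD`) are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hidden_features(frame, w_hidden):
    """Stage 1: a frozen ReLU layer standing in for the borrowed hidden layers."""
    return [max(0.0, sum(w * x for w, x in zip(row, frame))) for row in w_hidden]

def gop(frames, phone, w_hidden, w_softmax):
    """Assumed GOP: mean log posterior of the canonical phone over the frames."""
    total = 0.0
    for frame in frames:
        h = hidden_features(frame, w_hidden)
        logits = [sum(w * v for w, v in zip(row, h)) for row in w_softmax]
        total += math.log(softmax(logits)[phone])
    return total / len(frames)

def f1(flags, labels):
    """F1 from (possibly soft) flags in [0, 1] and 0/1 mispronunciation labels."""
    tp = sum(f * y for f, y in zip(flags, labels))
    denom = sum(flags) + sum(labels)  # equals 2*TP + FP + FN
    return 2.0 * tp / denom if denom else 0.0

def flags_for(segments, w_hidden, w_softmax, threshold, alpha=None):
    """Hard flags (GOP < threshold), or sigmoid-smoothed flags when alpha is set."""
    out = []
    for frames, phone, _ in segments:
        margin = threshold - gop(frames, phone, w_hidden, w_softmax)
        if alpha is None:
            out.append(1.0 if margin > 0 else 0.0)
        else:
            out.append(1.0 / (1.0 + math.exp(-alpha * margin)))
    return out

def train_softmax(segments, w_hidden, w_softmax, threshold,
                  alpha=5.0, step=0.5, sweeps=5):
    """Stage 2: adjust only the softmax weights to raise the smoothed F1.

    Coordinate search is a stand-in for gradient training; w_hidden is never
    modified, and a change is kept only if the smoothed F1 improves.
    """
    labels = [y for _, _, y in segments]
    w = [row[:] for row in w_softmax]
    best = f1(flags_for(segments, w_hidden, w, threshold, alpha), labels)
    for _ in range(sweeps):
        for i in range(len(w)):
            for j in range(len(w[i])):
                for delta in (step, -step):
                    w[i][j] += delta
                    score = f1(flags_for(segments, w_hidden, w, threshold, alpha), labels)
                    if score > best:
                        best = score
                        break
                    w[i][j] -= delta  # revert a change that did not help
    return w, best

# Toy data: (frames, canonical phone index, 1 if judged mispronounced else 0).
W_HIDDEN = [[1.0, 0.0], [0.0, 1.0]]
W_SOFTMAX = [[1.0, 0.0], [0.0, 1.0]]
THRESHOLD = -1.0
SEGMENTS = [
    ([[2.0, 0.0], [2.0, 0.0]], 0, 0),
    ([[0.0, 2.0]], 0, 1),
    ([[0.0, 2.0], [0.0, 2.0]], 1, 0),
    ([[2.0, 0.0]], 1, 1),
    ([[1.0, 1.0]], 0, 1),
]
LABELS = [y for _, _, y in SEGMENTS]

hard_before = f1(flags_for(SEGMENTS, W_HIDDEN, W_SOFTMAX, THRESHOLD), LABELS)
soft_before = f1(flags_for(SEGMENTS, W_HIDDEN, W_SOFTMAX, THRESHOLD, alpha=5.0), LABELS)
w_trained, soft_after = train_softmax(SEGMENTS, W_HIDDEN, W_SOFTMAX, THRESHOLD)

print("hard F1 before training:", hard_before)
print("smoothed F1 before/after:", soft_before, soft_after)
```

With the untrained weights, the ambiguous fifth segment is missed, so the hard F1 on this toy set is 0.8 (two true positives, one false negative). The search can only keep changes that raise the smoothed F1, which mirrors the abstract's point that the softmax layer alone is adapted to the human judgments while the hidden-layer features stay fixed.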
