In speaker verification research, objective performance benchmarking of listeners and automatic speaker verification (ASV) systems is of key importance in understanding the limits of speaker recognition. While the adoption of common data and metrics has been instrumental to progress in ASV, there are two major shortcomings. First, the utterances lack intentional voice changes imposed by the speaker. Second, the standard evaluation metrics focus on average performance across all speakers and trials. As a result, a knowledge gap remains in how acoustic changes impact recognition performance at the level of individual speakers. This paper addresses the limits of speaker recognition in ASV systems under voice disguise, using a linear mixed effects model to analyze the impact of changes in the long-term statistics of selected features (formants F1–F4, bandwidths B1–B4, F0, and speaking rate) on the ASV log-likelihood ratio (LLR) score. The correlations between the proposed predictive model and the LLR scores are 0.72 for female and 0.81 for male speakers. Overall, the difference in long-term F0 between enrollment and test utterances was found to be the single most detrimental factor, even though the ASV system uses only spectral, rather than prosodic, features.
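
For orientation, a linear mixed effects model of the kind named above can be written in a generic form. This is only a sketch under stated assumptions: the abstract does not give the model specification, so the per-speaker random intercept, the symbol names, and the Gaussian error terms below are illustrative choices rather than the authors' formulation.

```latex
% Generic linear mixed effects sketch (assumed form, not taken from the paper).
% s_{ij}          : ASV LLR score for trial j of speaker i
% \Delta x_{ijk}  : enrollment-to-test change in the long-term statistic of feature k
%                   (k runs over F1-F4, B1-B4, F0, and speaking rate, so K = 10)
% \beta_k         : fixed-effect coefficient for feature k
% u_i             : random intercept for speaker i (an assumption)
\[
  s_{ij} = \beta_0 + \sum_{k=1}^{K} \beta_k \, \Delta x_{ijk} + u_i + \varepsilon_{ij},
  \qquad
  u_i \sim \mathcal{N}(0, \sigma_u^2), \quad
  \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\]
```

In a model of this form, the fixed effects describe how a change in each feature shifts the LLR score on average, while the speaker-level term absorbs differences between individual speakers.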

