The automatic analysis of conversational audio remains difficult, in part, because of the presence of multiple talkers speaking in turns, often with significant intonation variation and overlapping speech. The majority of prior work on psychoacoustic speech analysis and system design has focused on single-talker speech or on multi-talker speech with overlapping talkers (for example, the cocktail party effect). Much less attention has been paid to how listeners detect a change in talker, or to which acoustic features are most important in characterizing a talker's voice in conversational speech. This study examines human talker change detection (TCD) in multi-party speech utterances using a behavioral paradigm in which listeners indicate the moment of perceived talker change. Human reaction times in this task are well estimated by a model of the acoustic feature distance between the speech segments before and after a change in talker, and the estimates improve for models that incorporate longer durations of speech preceding the change. Further, human performance is superior to that of several online and offline state-of-the-art machine TCD systems.
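The central quantity in this account is a distance between acoustic features computed over the speech immediately preceding and following a talker change, with a context window whose length mirrors the reported benefit of longer pre-change durations. The Python sketch below illustrates one way such a pre/post-change distance could be computed; the short-time log-spectral features, the cosine metric, and the `context` parameter are illustrative assumptions only, not the specific features or model used in the study.

```python
# Minimal sketch (assumed features and metric, not the study's actual model):
# quantify how acoustically different the speech just before a talker change
# is from the speech just after it. A larger distance plausibly corresponds
# to a more salient change and hence a faster human reaction time.
import numpy as np

def log_spectrum_frames(x, sr, frame_len=0.025, hop=0.010):
    """Frame the signal and return log-magnitude spectra, one row per frame."""
    n = int(frame_len * sr)
    h = int(hop * sr)
    frames = np.asarray([x[i:i + n] for i in range(0, len(x) - n, h)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(n), axis=1))
    return np.log(spec + 1e-8)

def pre_post_distance(x, sr, change_time, context=1.0):
    """Cosine distance between the mean features of the `context` seconds of
    speech before and after a (known) talker-change instant."""
    c = int(change_time * sr)
    w = int(context * sr)
    pre = log_spectrum_frames(x[max(0, c - w):c], sr).mean(axis=0)
    post = log_spectrum_frames(x[c:c + w], sr).mean(axis=0)
    cos = np.dot(pre, post) / (np.linalg.norm(pre) * np.linalg.norm(post))
    return 1.0 - cos

if __name__ == "__main__":
    sr = 16000
    # Toy signal standing in for two talkers: the spectrum shifts at t = 1.0 s.
    t = np.arange(2 * sr) / sr
    x = np.where(t < 1.0,
                 np.sin(2 * np.pi * 120 * t),
                 np.sin(2 * np.pi * 220 * t))
    print(pre_post_distance(x, sr, change_time=1.0, context=0.5))
```

In this framing, varying `context` over a range of durations and regressing the resulting distances against measured reaction times would be one simple way to test whether longer pre-change context improves the fit, as the study reports for its models.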
