This paper introduces and evaluates an audio-driven, multimodal approach to speaker diarization in multimedia content. The proposed algorithm is based on semi-supervised clustering of audio-visual embeddings generated using deep learning techniques. The two modalities, audio and video, are addressed separately: a long short-term memory Siamese neural network produces embeddings from audio, whereas a pre-trained convolutional neural network generates embeddings from two-dimensional blocks representing the faces of speakers detected in video frames. In both cases, the models are trained using cost functions that favor smaller spatial distances between samples from the same speaker and greater spatial distances between samples from different speakers. A fusion stage, based on hypotheses derived from established practices in television content production, is deployed on top of the unimodal sub-components to improve speaker diarization performance. The proposed methodology is evaluated on VoxCeleb, a large-scale dataset with hundreds of available speakers, and on AVL-SD, a newly developed, publicly available dataset that aims to capture the peculiarities of TV news content under different scenarios. To promote reproducible research and collaboration in the field, the implemented algorithm is provided as an open-source software package.
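The training objective described above (small distances for same-speaker pairs, large distances for different-speaker pairs) can be illustrated with a pairwise contrastive loss, a common choice for Siamese networks. The abstract does not state which cost function is used, so the loss form, the `margin` parameter, and the function names below are assumptions made for illustration only, not the authors' implementation.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, same_speaker, margin=1.0):
    """Pairwise contrastive loss (hypothetical stand-in for the paper's cost function).

    Same-speaker pairs are penalized by their squared distance, which pulls
    them together. Different-speaker pairs are penalized only while they are
    closer than `margin`, which pushes them apart up to that distance.
    """
    d = euclidean_distance(emb_a, emb_b)
    if same_speaker:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


# Toy embeddings: the first pair is far apart, the second pair coincides.
far_same = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_speaker=True)
far_diff = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_speaker=False)
near_diff = contrastive_loss([0.0, 0.0], [0.0, 0.0], same_speaker=False)
```

In this sketch a distant same-speaker pair incurs a large loss, a distant different-speaker pair incurs none, and a coincident different-speaker pair is penalized. That is the behavior the abstract attributes to the cost functions of both the audio and the face embedding models.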
