In this paper, an audio-driven, multimodal approach for speaker diarization in multimedia content is introduced and evaluated. The proposed algorithm is based on semi-supervised clustering of audio-visual embeddings generated using deep learning techniques. The two modalities, audio and video, are addressed separately: a long short-term memory (LSTM) Siamese neural network produces embeddings from audio, whereas a pre-trained convolutional neural network generates embeddings from two-dimensional blocks representing the faces of speakers detected in video frames. In both cases, the models are trained with cost functions that favor smaller spatial distances between samples from the same speaker and greater spatial distances between samples from different speakers. A fusion stage, based on hypotheses derived from established practices in television content production, is deployed on top of the unimodal sub-components to improve speaker diarization performance. The proposed methodology is evaluated on VoxCeleb, a large-scale dataset with hundreds of speakers, and on AVL-SD, a newly developed, publicly available dataset aimed at capturing the peculiarities of TV news content under different scenarios. To promote reproducible research and collaboration in the field, the implemented algorithm is provided as an open-source software package.
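To make the training objective concrete, the sketch below illustrates a Siamese LSTM audio embedder trained with a contrastive-style loss that pulls same-speaker segment pairs together and pushes different-speaker pairs apart, as described above. This is a minimal, illustrative PyTorch example only: the feature dimensions, layer sizes, margin, and all identifiers are assumptions chosen for demonstration and do not reflect the authors' actual implementation or hyperparameters.

```python
# Illustrative sketch: Siamese LSTM audio embedder with a contrastive loss.
# All shapes, sizes, and the margin are placeholder assumptions, not the
# settings used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SiameseLSTMEmbedder(nn.Module):
    """Maps a sequence of audio frame features to a fixed-size embedding."""

    def __init__(self, n_features=40, hidden_size=128, embedding_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.projection = nn.Linear(hidden_size, embedding_dim)

    def forward(self, x):
        # x: (batch, time, n_features), e.g. filter-bank or MFCC frames
        _, (h_n, _) = self.lstm(x)
        emb = self.projection(h_n[-1])      # last hidden state -> embedding
        return F.normalize(emb, dim=1)      # unit-length embeddings


def contrastive_loss(emb_a, emb_b, same_speaker, margin=1.0):
    """Small distances for same-speaker pairs, large for different speakers."""
    dist = F.pairwise_distance(emb_a, emb_b)
    pos = same_speaker * dist.pow(2)                          # same speaker
    neg = (1 - same_speaker) * F.relu(margin - dist).pow(2)   # different speaker
    return (pos + neg).mean()


# Toy training step on random data (shapes are placeholders).
model = SiameseLSTMEmbedder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

seg_a = torch.randn(8, 100, 40)        # 8 segments, 100 frames, 40 features
seg_b = torch.randn(8, 100, 40)
same_speaker = torch.randint(0, 2, (8,)).float()

loss = contrastive_loss(model(seg_a), model(seg_b), same_speaker)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In such a setup, the resulting embeddings could then be clustered to assign segments to speakers; the face-embedding branch described in the abstract would follow the same distance-based training idea with a pre-trained convolutional backbone.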
Published: December 15, 2020
Semi-supervised audio-driven TV-news speaker diarization using deep neural embeddings
Special Collection: Machine Learning in Acoustics
Nikolaos Tsipas, Lazaros Vrysis, Konstantinos Konstantoudakis, and Charalampos Dimoulas
Aristotle University of Thessaloniki, Thessaloniki, Greece
J. Acoust. Soc. Am. 148, 3751–3761 (2020)
Article history
Received: August 25, 2020
Accepted: November 25, 2020
Citation
Nikolaos Tsipas, Lazaros Vrysis, Konstantinos Konstantoudakis, Charalampos Dimoulas; Semi-supervised audio-driven TV-news speaker diarization using deep neural embeddings. J. Acoust. Soc. Am. 1 December 2020; 148 (6): 3751–3761. https://doi.org/10.1121/10.0002924