While deep learning has driven recent improvements in audio speaker diarization, it often underperforms in challenging interaction scenarios and varied acoustic settings, such as interactions between a child and an adult (caregiver or examiner). In this work, the role of contextual factors that affect diarization performance in such interactions is analyzed. Factors that affect each type of diarization error are identified. Furthermore, a DNN is trained on diarization outputs in conjunction with these factors to improve diarization performance. The results demonstrate the usefulness of incorporating context to improve the diarization of child-adult interactions in clinical settings.

Speaker diarization is the process of identifying speaker identities and speaking times in audio recordings, i.e., determining who spoke when. A typical audio diarization system consists of multiple pipelined components: speech activity detection (SAD), speaker change detection, and speaker clustering. While research in speaker diarization has traditionally focused on meetings and broadcast news (Anguera et al., 2012), a growing number of applications involving naturalistic and varied audio recording scenarios have recently renewed interest in this task (Sell et al., 2018). One such application domain is clinical assessment and evaluation for children with autism spectrum disorder (ASD), which involves spoken interactions between a clinician and a child.

ASD is a neurodevelopmental disorder, often characterized by impairments in socio-communicative abilities, specifically idiosyncrasies in verbal communication such as awkward prosody and phrasing, neologism, and echolalia (Kim et al., 2014). The prevalence of ASD has increased continuously, with a current estimated rate of 1 in 59 children (Baio et al., 2018). Behavioral observations made during structured interactions between a child and a trained clinician, guided by well-established diagnostic measures such as the Autism Diagnostic Observation Schedule, offer clinically relevant information for diagnosis, assessment of autism symptom severity, and treatment planning.

Computational analysis of such autism diagnosis sessions using automatically extracted behavioral features, specifically speech and language features extracted from both participants, offers objective insights and has been shown to be significantly predictive of autism symptom severity (Bone et al., 2016b). However, speech feature extraction has typically depended on manual speaker diarization labels, which are time-consuming and expensive to obtain. Diarization is particularly challenging in such scenarios due to varied acoustic background conditions, short individual utterance lengths (Najafian and Hansen, 2016), and disfluencies arising from a developing vocabulary in young children and toddlers. Hence, state-of-the-art diarization systems trained primarily on adult speech corpora representing other interaction scenarios, such as meetings, can leave room for improvement when applied to such adult-child interactions. Straightforward adaptation of these systems using annotated child speech corpora may not be feasible due to the often small size of such data.

In this work, we utilize contextual information to improve an x-vector based diarization system applied to ASD-relevant guided behavioral observations. First, we study the association between diarization errors and acoustic and conversational factors, namely, utterance duration, speaker change proximity, signal-to-noise ratio (SNR), and speech intensity. Next, we train a deep neural network with a feed-forward attention mechanism using these factors along with diarization outputs from the x-vector system to improve diarization robustness and performance.

Recent efforts in speaker diarization have replaced traditional i-vectors with deep neural network embeddings such as x-vectors (Sell et al., 2018) obtained using supervised training. Typically, a time-delay neural network along with temporal pooling layers is trained to classify speakers. A bottleneck layer (usually close to the output) is used to extract speaker embeddings referred to as x-vectors. At test time, x-vectors are extracted from the voiced regions at uniform intervals followed by agglomerative hierarchical clustering (AHC).
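To make this pipeline concrete, the following is a minimal sketch of the sliding-window extraction and clustering stage. The embedding extractor `extract_xvector` is a hypothetical placeholder for a pretrained TDNN model, and cosine distance is used as a stand-in for the PLDA scoring described later; window and hop lengths are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def diarize(voiced_audio, sample_rate, extract_xvector,
            win=1.5, hop=0.75, n_speakers=2):
    """Embed uniformly spaced windows of voiced audio and cluster them."""
    win_len, hop_len = int(win * sample_rate), int(hop * sample_rate)
    xvectors, starts = [], []
    for s in range(0, len(voiced_audio) - win_len + 1, hop_len):
        xvectors.append(extract_xvector(voiced_audio[s:s + win_len]))
        starts.append(s / sample_rate)
    X = np.stack(xvectors)
    # Agglomerative hierarchical clustering (AHC); cosine distance here
    # stands in for the PLDA scores used in the actual system.
    Z = linkage(pdist(X, metric="cosine"), method="average")
    labels = fcluster(Z, t=n_speakers, criterion="maxclust")
    return list(zip(starts, labels))  # (window start time, speaker label)
```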

Compared to the number of prior efforts directed at improving diarization system performance, relatively few studies have systematically analyzed the different sources and types of diarization errors. Mirghafori and Wooters (2006) studied the relation between diarization performance and various session-level features such as speaker count and rate of conversational turns. In contrast, local features such as utterance duration and distance to the closest speaker change point were used to study missed speech and speaker errors in Knox et al. (2012).

We focus on child-adult interactions related to a recently proposed treatment outcome measure: BOSCC (Brief Observation of Social Communication Change) (Grzadzinski et al., 2016). BOSCC tracks changes in social-communication over the course of an ASD treatment. A BOSCC session consists of alternating talk and play activities between the participants, each focusing on a specific set of target behavioral characteristics.

We use BOSCC sessions collected from children across a varied age range and at multiple locations. The BOSCC-SchoolAge sessions were administered with verbal school-age children with complex language skills. Relatively clean audio was collected in all sessions using a lapel microphone worn by the psychologist interlocutor. For error analysis and for training the neural network for error correction, we use the BOSCC-ToddlerHigh and BOSCC-ToddlerLow sessions, which were collected from minimally verbal toddlers and preschoolers with limited language (nonverbal, single words, or phrase speech). These sessions were administered by a caregiver and represent a more naturalistic data collection setup aimed at behavioral assessment of the child with a familiar adult. Further details are reported in Table 1.

Table 1.

Data statistics of child-adult interactions.

Corpus             # Sess  Child age (yr)  Duration (min)  Speech fraction (%)
                                                           Child        Psych
BOSCC-SchoolAge    27      9.29 ± 3.30     17.76 ± 11.99   27.5 ± 7.8   40.8 ± 7.1
BOSCC-ToddlerHigh  20      2.21 ± 0.55     11.1 ± 3.9      13.9 ± 6.1   39.6 ± 5.8
BOSCC-ToddlerLow   18      1.80 ± 0.29     10.1 ± 0.3      7.8 ± 5.6    37.0 ± 10.1

We select context factors based on their relevance to child-adult interactions and/or to analyzing diarization system performance. Background noise is known to adversely affect diarization performance (Sell et al., 2018), and noise strength can be estimated using SNR. Speech intensity has been observed to be a significant indicator of "atypical prosody," a characteristic of the speech of children with autism (Bone et al., 2016a). Both SNR and speech intensity can be reliably estimated even in non-speech regions, unlike other features of interest in child-adult interactions, such as prosody. Additionally, we select two conversational factors (utterance length and speaker change proximity) used in a previous study (Knox et al., 2012) that analyzed speaker segments prone to diarization errors. While additional features could prove useful, we restrict ourselves to the above as a first step toward diarization error analysis and correction. Each factor is described below, followed by a sketch of the frame-level factor computation.

Utterance length: This is defined as the duration of the speaking turn that contains the current frame. Utterance length is zero for all non-speech frames. Short utterances typically do not contain enough speaker information for embedding extraction and are more prone to errors (Knox et al., 2012).

Speaker change proximity: This is the absolute distance (in time) to the nearest speaker change point. We define a speaker change point as the time instance at which speakers switch, or begin speaking, or end speaking. The minimum proximity considered in this work is 0.25 s, which is the standard no-score collar as defined by NIST evaluations.

SNR: Given a mixed audio signal, SNR measures the strength of the speech-only component relative to the noise-only component. We estimate SNR using the NIST-STNR tool (Ellis, 2011) and scale the values (in dB) to zero mean and unit variance within each session to aid training.

Speech intensity: We use the Praat toolkit (Boersma and Weenink, 2009) to estimate speech intensity as a smoothed version of the signal energy. Since absolute intensity values are not informative, we normalize intensity to zero mean and unit variance within each session.
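The sketch below illustrates one plausible frame-level computation of the four factors. The input format is an assumption: `segments` is a list of (start, end, speaker) tuples, `frame_rate` is in frames per second, and the `snr` and `intensity` streams are taken as precomputed per-frame arrays (e.g., from the NIST-STNR tool and Praat via a wrapper such as parselmouth).

```python
import numpy as np

def frame_factors(segments, n_frames, frame_rate, snr, intensity):
    """Return an (n_frames, 4) array of contextual factors per frame."""
    t = np.arange(n_frames) / frame_rate          # frame timestamps (s)
    utt_len = np.zeros(n_frames)                  # zero on non-speech frames
    change_points = []
    for start, end, _ in segments:
        inside = (t >= start) & (t < end)
        utt_len[inside] = end - start             # duration of current turn
        change_points += [start, end]             # turn begins/ends/switches
    # Absolute distance to the nearest speaker change point, floored at
    # the 0.25 s no-score collar used in NIST evaluations.
    dist = np.abs(t[:, None] - np.array(change_points)[None, :]).min(axis=1)
    proximity = np.maximum(dist, 0.25)

    def z(x):  # session-level z-normalization of the acoustic factors
        return (x - x.mean()) / (x.std() + 1e-8)

    return np.stack([utt_len, proximity, z(snr), z(intensity)], axis=1)
```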

We borrow SAD and x-vector models developed on out-of-domain corpora owing to the limited size of available child-adult speech corpora. We use a two-hidden-layer feed-forward DNN developed as part of the DARPA RATS project (DARPA, 2015) to classify speech from non-speech. The input consists of 13-dimensional MFCCs spliced with ±15 frames of context, and the output is a binary speech/non-speech label. The network is trained to minimize cross-entropy loss. For x-vector extraction, we use the pre-trained model provided with the CALLHOME recipe in Kaldi, similar to the best system in Sell et al. (2018). In this system, a sliding window of 1.5 s duration is used to extract x-vectors, followed by AHC on PLDA (Probabilistic Linear Discriminant Analysis) scores to obtain the diarization labels. We adapt the baseline system by training the PLDA transforms on BOSCC-SchoolAge, which is found to further improve performance.
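As an illustration of the SAD input described above, a minimal splicing routine might look as follows; the 13 × 31 = 403-dimensional output per frame follows from the stated context width, while edge padding is an assumed boundary treatment.

```python
import numpy as np

def splice(mfcc, context=15):
    """mfcc: (n_frames, 13) array -> (n_frames, 13 * (2*context + 1))."""
    # Repeat the first/last frame at the boundaries so every frame has
    # a full +/- context window.
    padded = np.pad(mfcc, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[i:i + 2 * context + 1].ravel()
                     for i in range(len(mfcc))])
```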

In the first experiment, we study the effect of each contextual factor listed in Sec. 4 on the different types of diarization errors produced by the baseline system. For each session in the BOSCC-ToddlerHigh and BOSCC-ToddlerLow datasets, we compute the factors and decisions (correct, missed speech, and speaker error) at the frame level. We note that false alarms are not accounted for in this analysis, since we are interested in the effect of factors on child/adult speech segments only (not silence regions). While plotting the results in Fig. 1, the maximum ranges for utterance duration and speaker change proximity are chosen so as to ensure a sufficient number of frames for analysis.
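A sketch of the frame-level scoring assumed in this analysis is shown below. The per-frame label alphabet {"child", "adult", "sil"} is an assumed representation; reference silence frames are excluded so that false alarms do not enter the counts, matching the text.

```python
import numpy as np

def frame_decisions(ref, hyp):
    """Classify each reference speech frame as correct/missed/speaker error."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    speech = ref != "sil"                             # score ref speech only
    missed = speech & (hyp == "sil")                  # speech marked silence
    spk_err = speech & (hyp != "sil") & (hyp != ref)  # wrong speaker
    correct = speech & (hyp == ref)
    return correct, missed, spk_err
```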

Fig. 1.

(Color online) Effect of contextual factors on adult speech (top) and child speech (bottom) for the baseline diarization system. For each speaker and factor range, all possible outcomes are normalized to sum to 1 so as to display the error distributions uniformly across the context ranges. Within each bar, the outcomes (from top to bottom) are: misclassified frames, missed frames, and correctly classified frames.


Given that contextual factors are associated with diarization errors, we hypothesize that this relation can be exploited to identify errors and improve diarization performance. We feed the outputs from the baseline diarization system (speaker labels along with silence) and the time-aligned contextual factors to a deep neural network. At the output, the network is provided with a single label representing the speaker. Hence, we pose the neural network training as a sequence classification problem where the output belongs to one of three classes: child, adult, or silence. The input is spliced with context frames to exploit temporal information, which cannot be captured by the frame-level error analysis in Sec. 5.2. We note that, since the labels from the baseline system do not include speech overlap, the proposed model cannot learn from these regions; errors from overlap regions thus remain in the final model.

We define an input sample as a contiguous block of frames spanning a duration of 2 s. At each instant, the diarization output is converted into a three-dimensional one-hot encoding and appended with the contextual factors. The SNR and intensity values are obtained directly from the audio, while utterance length and speaker change proximity are obtained from the diarization outputs, since oracle speaker labels are not available during testing. We experiment with three types of networks in this work: ffn (feed-forward network), blstm (bidirectional long short-term memory), and atten (attention). In ffn, the input is flattened across the time axis before being passed through 3 dense layers with 256 units each. Blstm employs bidirectional long short-term memory layers to capture forward and backward temporal information; the hidden state is fed to 3 dense layers with 256 units each. Atten makes use of feed-forward attention (Raffel and Ellis, 2015) to selectively attend to frames relevant to the output.
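For illustration, input samples might be assembled as follows, assuming a 100 frames/s rate so that a 2 s block spans 200 frames; `diar_out` holds per-frame baseline labels encoded as integers (0: child, 1: adult, 2: silence), and `factors` is the (n_frames, 4) array from the earlier factor-computation sketch.

```python
import numpy as np

def make_samples(diar_out, factors, block=200):
    """Return (n_blocks, block, 7) samples: one-hot labels + 4 factors."""
    diar_out = np.asarray(diar_out, dtype=int)
    one_hot = np.eye(3)[diar_out]                        # (n_frames, 3)
    feats = np.concatenate([one_hot, factors], axis=1)   # (n_frames, 7)
    n_blocks = len(feats) // block                       # drop the remainder
    return feats[:n_blocks * block].reshape(n_blocks, block, 7)
```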

All networks used in this work are optimized using rmsprop to minimize the cross-entropy loss between output labels and logits. Batch normalization and dropout (rate = 0.2) are used in the dense layers for regularization. We obtain the results using cross-validation, where the sessions are divided into four folds and the network is retrained from scratch within each fold. In this way, all sessions are treated as test data.
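A minimal PyTorch sketch of the atten variant is given below, following the structure in Fig. 2: the speaker labels and factors are summarized independently by feed-forward attention (Raffel and Ellis, 2015), the resulting context vectors are merged, and dense layers predict the output class. The dense-layer sizes, dropout rate, optimizer, and loss follow the text; all remaining choices (e.g., learning rate) are assumptions.

```python
import torch
import torch.nn as nn

class FFAttention(nn.Module):
    """Feed-forward attention: score each frame, softmax over time, pool."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(),
                                   nn.Linear(dim, 1))
    def forward(self, x):                             # x: (batch, time, dim)
        alpha = torch.softmax(self.score(x), dim=1)   # weights over time
        return (alpha * x).sum(dim=1)                 # context: (batch, dim)

class AttenNet(nn.Module):
    def __init__(self, n_classes=3, n_factors=4):
        super().__init__()
        self.att_labels = FFAttention(n_classes)    # attend over one-hot labels
        self.att_factors = FFAttention(n_factors)   # attend over factors
        dims = [n_classes + n_factors, 256, 256, 256]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out),
                       nn.ReLU(), nn.Dropout(0.2)]
        self.mlp = nn.Sequential(*layers, nn.Linear(256, n_classes))
    def forward(self, labels, factors):
        # Merge the two per-source context vectors, then classify.
        c = torch.cat([self.att_labels(labels),
                       self.att_factors(factors)], dim=1)
        return self.mlp(c)

model = AttenNet()
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)  # lr assumed
criterion = nn.CrossEntropyLoss()   # cross-entropy between logits and labels
```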

From Fig. 1, we observe significant differences in how contextual factors affect diarization errors on child and adult speech portions. The fraction of correctly classified frames marginally increases for both speakers with longer utterances and farther from speaker change points, similar to Knox et al. (2012). However, the improvement is reflected in lower fractions of speaker errors for child speech and lower fractions of missed speech for adult speech. Further, short child utterances are more likely to be diarized as adult speech than any other outcome, and the fraction of missed child speech is consistent across different utterance lengths and speaker change proximities. While adult speech performs better as SNR increases, child speech exhibits an optimal SNR with respect to correctly classified frames. Specifically, the fraction of child speech diarized as adult speech increases with SNR, suggesting that the baseline diarization system is likely to cluster clean speech segments from the child with the adult. Frames with low speech intensity from both speakers are missed by the SAD, while child speech is increasingly likely to be diarized as adult speech as intensity increases.

Fig. 2.

(Color online) (Left) Illustration of feed-forward attention mechanism. (Right) The attention network used during diarization error improvement. Speaker labels from the baseline diarization system and factors are independently attended to in time using feed-forward attention. Context vectors from each source are merged and passed to a fully connected network to predict the output labels.


From Table 2, the x-vector baseline yields relatively high diarization error rates (DERs) for a dyadic conversation with a known number of speakers. We surmise that this is due to the large fraction of silence/noisy regions (Pearson correlation between DER and silence fraction: r = 0.61, p < 0.001). With the use of contextual factors, all networks provide gains in DER, with atten yielding 8.2% and 15.8% relative improvements in DER on BOSCC-ToddlerHigh and BOSCC-ToddlerLow, respectively. The differences between networks underscore the importance of exploiting temporal information for diarization.
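As a quick arithmetic check of the relative improvements quoted above (DER values taken from Table 2; the commented pearsonr call only indicates how the reported session-level correlation could be computed from hypothetical per-session arrays):

```python
from scipy.stats import pearsonr  # for the session-level correlation

baseline = {"BOSCC-ToddlerHigh": 55.08, "BOSCC-ToddlerLow": 65.27}
atten = {"BOSCC-ToddlerHigh": 50.56, "BOSCC-ToddlerLow": 54.94}
for corpus in baseline:
    rel = 100 * (baseline[corpus] - atten[corpus]) / baseline[corpus]
    print(f"{corpus}: {rel:.1f}% relative DER improvement")
# -> BOSCC-ToddlerHigh: 8.2%, BOSCC-ToddlerLow: 15.8%

# r, p = pearsonr(session_ders, session_silence_fracs)  # reported: r = 0.61
```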

Table 2.

DER results from error improvement network.

Corpus             Baseline  +ffn   +blstm  +atten
BOSCC-ToddlerHigh  55.08     53.86  54.58   50.56
BOSCC-ToddlerLow   65.27     61.51  60.06   54.94

We showed that a state-of-the-art diarization system does not necessarily perform well on naturalistic child-adult interactions, especially for toddlers and preschoolers with limited language levels recorded with ordinary recording systems. We investigated the role of context in improving upon the results of such a system. First, we examined the effect of various contextual factors on diarization errors, analyzing their different effects on child speech and adult speech. Next, we trained an attention network to correct diarization errors using contextual factors. The results suggest the benefit of local context in improving child-adult speaker diarization. Additional contextual factors could be explored, especially from the visual modality, since audio-video diarization has been shown to outperform audio-only diarization. In the future, we would like to pose the DNN training as a sequence-to-sequence learning task, where the single output label is replaced with a label sequence.

This research was supported by the Simons Foundation and National Institute of Mental Health (NIMH Grant No. 1R01 MH114925-01).

1. Anguera, X., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., and Vinyals, O. (2012). "Speaker diarization: A review of recent research," IEEE Trans. Audio Speech Lang. Proc. 20(2), 356-370.
2. Baio, J., Wiggins, L., Christensen, D. L., Maenner, M. J., Daniels, J., Warren, Z., Kurzius-Spencer, M., Zahorodny, W., Rosenberg, C. R., White, T., Durkin, M. S., Imm, P., Nikolaou, L., Yeargin-Allsopp, M., Lee, L., Harrington, R., Lopez, M., Fitzgerald, R. T., Hewitt, A., Pettygrove, S., Constantino, J. N., Vehorn, A., Shenouda, J., Hall-Lande, J., Braun, K. V. N., and Dowling, N. F. (2018). "Prevalence of autism spectrum disorder among children aged 8 years—autism and developmental disabilities monitoring network, 11 sites, United States, 2014," MMWR Surveillance Summaries 67(6), 1.
3. Boersma, P., and Weenink, D. (2009). "Praat: Doing phonetics by computer," www.praat.org (Last viewed 02/11/2020).
4. Bone, D., Bishop, S., Gupta, R., Lee, S., and Narayanan, S. S. (2016a). "Acoustic-prosodic and turn-taking features in interactions with children with neurodevelopmental disorders," in Interspeech, ISCA, pp. 1185-1189.
5. Bone, D., Bishop, S. L., Black, M. P., Goodwin, M. S., Lord, C., and Narayanan, S. S. (2016b). "Use of machine learning to improve autism screening and diagnostic instruments: Effectiveness, efficiency, and multi-instrument fusion," J. Child Psychol. Psych. 57(8), 927-937.
6. DARPA (2015). "Robust Automatic Transcription of Speech (RATS)," https://www.darpa.mil/program/robust-automatic-transcription-of-speech (Last viewed 02/11/2020).
7. Ellis, D. (2011). "Objective measures of speech quality/SNR," https://labrosa.ee.columbia.edu/projects/snreval/ (Last viewed 02/11/2020).
8. Grzadzinski, R., Carr, T., Colombi, C., McGuire, K., Dufek, S., Pickles, A., and Lord, C. (2016). "Measuring changes in social communication behaviors: Preliminary development of the brief observation of social communication change (BOSCC)," J. Autism Dev. Dis. 46(7), 2464-2479.
9. Kim, S. H., Paul, R., Tager-Flusberg, H., and Lord, C. (2014). "Language and communication in autism," in Handbook of Autism and Pervasive Developmental Disorders, 4th ed. (American Cancer Society, Atlanta), Chap. 10.
10. Knox, M. T., Mirghafori, N., and Friedland, G. (2012). "Where did I go wrong?: Identifying troublesome segments for speaker diarization systems," in Interspeech, ISCA.
11. Mirghafori, N., and Wooters, C. (2006). "Nuts and flakes: A study of data characteristics in speaker diarization," in 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, IEEE, Vol. 1.
12. Najafian, M., and Hansen, J. H. L. (2016). "Speaker independent diarization for child language environment analysis using deep neural networks," in IEEE Spoken Language Technology Workshop (SLT), pp. 114-120.
13. Raffel, C., and Ellis, D. P. W. (2015). "Feed-forward networks with attention can solve some long-term memory problems," arXiv:1512.08756.
14. Sell, G., Snyder, D., McCree, A., Garcia-Romero, D., Villalba, J., Maciejewski, M., Manohar, V., Dehak, N., Povey, D., Watanabe, S., and Khudanpur, S. (2018). "Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge," in Interspeech, ISCA, pp. 2808-2812.