While deep learning has driven recent improvements in audio speaker diarization, performance often degrades in challenging interaction scenarios and varied acoustic settings, such as interactions between a child and an adult (caregiver or examiner). In this work, the role of contextual factors that affect diarization performance in such interactions is analyzed, and the factors associated with each type of diarization error are identified. Furthermore, a DNN is trained on diarization outputs in conjunction with these factors to improve diarization performance. The results demonstrate the usefulness of incorporating context in improving the diarization of child-adult interactions in clinical settings.
1. Introduction
Speaker diarization is the process of identifying speaker identities and speaking times in audio recordings, i.e., determining who spoke when. A typical audio diarization system consists of multiple pipelined components: speech activity detection (SAD), speaker change detection, and speaker clustering. While research in speaker diarization has typically focused on meetings and broadcast news (Anguera et al., 2012), a growing number of applications involving naturalistic and varied audio recording scenarios have recently renewed interest in this task (Sell et al., 2018). One such application domain is clinical assessments and evaluations for children with autism spectrum disorder (ASD), which involve spoken interactions between a clinician and a child.
ASD is a neuro-developmental disorder, often characterized by impairments in socio-communicative abilities, specifically idiosyncrasies in verbal communication such as awkward prosody and phrasing, neologism, and echolalia (Kim et al., 2014). The prevalence of ASD has increased steadily, with a current estimate of 1 in 59 children (Baio et al., 2018). Behavioral observations made during structured interactions between a child and a trained clinician, guided by well-established diagnostic measures such as the Autism Diagnostic Observation Schedule (ADOS), offer clinically relevant information for diagnosis, assessment of autism symptom severity, and treatment planning.
Computational analysis of such autism diagnosis sessions using automatically extracted behavioral features, specifically speech and language features extracted from both participants, offers objective insights and has been shown to be significantly predictive of autism symptom severity (Bone et al., 2016b). However, speech feature extraction has typically depended on manual speaker diarization labels, which are time-consuming and expensive to obtain. Diarization is particularly challenging in such scenarios due to varied acoustic background conditions, short individual utterance lengths (Najafian and Hansen, 2016), and disfluencies arising from a still-developing vocabulary in the case of young children and toddlers. Hence, state-of-the-art diarization systems trained primarily on adult speech corpora representing other interaction scenarios, such as meetings, can leave room for improvement when applied to such adult-child interactions. Straightforward adaptation of such systems using annotated child speech corpora may not be feasible given the often small size of such data.
In this work, we utilize contextual information to improve an x-vector based diarization system applied to ASD-relevant guided behavioral observations. First, we study the association between diarization errors and acoustic and conversational factors, namely, utterance duration, speaker change proximity, signal-to-noise ratio (SNR), and speech intensity. Next, we train a deep neural network with a feed-forward attention mechanism using these factors along with diarization outputs from the x-vector system to improve diarization robustness and performance.
2. Background
2.1 Speaker diarization using x-vectors
Recent efforts in speaker diarization have replaced traditional i-vectors with deep neural network embeddings such as x-vectors (Sell et al., 2018) obtained using supervised training. Typically, a time-delay neural network along with temporal pooling layers is trained to classify speakers. A bottleneck layer (usually close to the output) is used to extract speaker embeddings referred to as x-vectors. At test time, x-vectors are extracted from the voiced regions at uniform intervals followed by agglomerative hierarchical clustering (AHC).
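To make the pipeline concrete, the following is a minimal sketch of the clustering stage, assuming window-level x-vectors have already been extracted. Cosine-affinity AHC from scikit-learn stands in for the PLDA scoring typically used in practice, and the embedding dimensionality and window count are illustrative.

```python
# Minimal sketch of the clustering stage of an x-vector diarization
# system. Assumes x-vectors (one per sliding window over speech
# regions) are already extracted. Cosine-affinity AHC stands in for
# the PLDA scoring used in practice.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_xvectors(xvectors, n_speakers=2):
    """Group window-level x-vectors into speaker clusters via AHC."""
    ahc = AgglomerativeClustering(
        n_clusters=n_speakers,   # known a priori for dyadic sessions
        metric="cosine",         # stand-in for PLDA similarity
        linkage="average",
    )
    return ahc.fit_predict(xvectors)

# Illustrative call: 200 windows of 512-dimensional x-vectors.
window_labels = cluster_xvectors(np.random.randn(200, 512))
```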
2.2 Analyzing diarization errors
Compared to the number of prior efforts directed at improving diarization system performance, relatively few studies have systematically analyzed the different sources and types of diarization errors. In Mirghafori and Wooters (2006), the authors studied the relation between diarization performance and various session-level features such as speaker count and rate of conversational turns. Alternatively, local features such as utterance duration and distance to closest speaker change point were used to study missed speech and speaker errors in Knox et al. (2012).
3. Dataset
We focus on child-adult interactions related to a recently proposed treatment outcome measure: BOSCC (Brief Observation of Social Communication Change) (Grzadzinski et al., 2016). BOSCC tracks changes in social-communication over the course of an ASD treatment. A BOSCC session consists of alternating talk and play activities between the participants, each focusing on a specific set of target behavioral characteristics.
We use BOSCC sessions collected from children across a varied age range and at multiple locations. The BOSCC-SchoolAge sessions were administered with verbal school-age children with complex language skills. Relatively clean audio was collected using a lapel microphone worn by the psychologist interlocutor in all sessions. For error analysis and for training the neural network for error improvement, we use BOSCC-ToddlerHigh and BOSCC-ToddlerLow sessions, which were collected from minimally verbal toddlers and preschoolers with limited language (nonverbal, single words, or phrase speech). These sessions were administered by a caregiver and represent a more naturalistic data collection setup aimed at behavioral assessments of the child with a familiar adult. Further details are reported in Table 1.
Table 1. Summary of the BOSCC corpora (mean ± standard deviation).

Corpus              # Sess  Child age (yr)  Duration (min)   Speech fraction (%)
                                                             Child        Psych
BOSCC-SchoolAge     27      9.29 ± 3.30     17.76 ± 11.99    27.5 ± 7.8   40.8 ± 7.1
BOSCC-ToddlerHigh   20      2.21 ± 0.55     11.1 ± 3.9       13.9 ± 6.1   39.6 ± 5.8
BOSCC-ToddlerLow    18      1.80 ± 0.29     10.1 ± 0.3       7.8 ± 5.6    37.0 ± 10.1
4. Contextual factors
We select contextual factors based on their relevance to child-adult interactions and/or to analyzing diarization system performance. Background noise is known to adversely affect diarization performance (Sell et al., 2018), and noise strength can be estimated using SNR. Speech intensity has been observed to be a significant indicator of “atypical prosody,” characteristic of the speech of children with autism (Bone et al., 2016a). Both SNR and speech intensity can be reliably estimated even in non-speech regions, unlike other features of interest in child-adult interactions, such as prosody. Additionally, we select two conversational factors (utterance length and speaker change proximity) used in a previous study (Knox et al., 2012) that analyzed speaker segments prone to diarization errors. While additional features could prove useful, we restrict ourselves to the above as a first step toward diarization error analysis and improvement. We describe each factor below.
Utterance length: This is defined as the duration of the speaking turn containing the current frame. Utterance length is zero for all non-speech frames. Short utterances typically do not contain enough speaker information for embedding extraction and are more prone to errors (Knox et al., 2012).
Speaker change proximity: This is the absolute distance (in time) to the nearest speaker change point. We define a speaker change point as the time instant at which speakers switch, or a speaker begins or stops speaking. The minimum proximity considered in this work is 0.25 s, which is the standard no-score collar defined by the NIST evaluations.
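Both conversational factors can be computed directly from a frame-level speaker label sequence. Below is a minimal sketch, assuming a 10 ms frame shift and integer labels (0 = silence, 1 = child, 2 = adult); treating the 0.25 s collar as a floor on the proximity value is our interpretation.

```python
import numpy as np

def conversational_factors(labels, frame_shift=0.01):
    """Per-frame utterance length and speaker-change proximity from a
    frame-level label sequence (0 = silence, 1 = child, 2 = adult)."""
    labels = np.asarray(labels)
    n = len(labels)
    # Segment boundaries: every index where the label changes, i.e.
    # every point where a speaker switches, starts, or stops speaking.
    bounds = np.concatenate(([0], np.flatnonzero(np.diff(labels)) + 1, [n]))
    utt_len = np.zeros(n)
    for s, e in zip(bounds[:-1], bounds[1:]):
        if labels[s] != 0:                        # zero for non-speech frames
            utt_len[s:e] = (e - s) * frame_shift
    t = np.arange(n) * frame_shift
    change_times = bounds[1:-1] * frame_shift
    if change_times.size:
        proximity = np.abs(t[:, None] - change_times[None, :]).min(axis=1)
    else:
        proximity = np.full(n, np.inf)            # no change points at all
    return utt_len, np.maximum(proximity, 0.25)   # 0.25 s NIST collar floor
```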
SNR: Given a mixed audio signal, SNR is the relative strength of the speech-only component with respect to the noise-only component. We estimate SNR using the NIST-STNR tool (Ellis, 2011) and scale the values (in dB) to zero mean and unit variance within each session to assist training.
Speech intensity: We use the Praat toolkit (Boersma and Weenink, 2009) to estimate speech intensity as a smoothed version of the signal energy. Since absolute intensity values are not informative, we normalize intensity to zero mean and unit variance within each session.
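Both acoustic factors reduce to a per-session z-normalization once the raw values are estimated. The sketch below uses the parselmouth Python interface as a stand-in for the Praat toolkit (an assumption; the file name is a placeholder).

```python
import numpy as np
import parselmouth  # Python interface to Praat (our stand-in; the
                    # study uses the Praat toolkit directly)

def session_zscore(values, eps=1e-8):
    """Normalize a per-frame factor (SNR in dB, or intensity) to zero
    mean and unit variance within a single session."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / (values.std() + eps)

# Intensity via the Praat algorithm; "session.wav" is a placeholder.
intensity = parselmouth.Sound("session.wav").to_intensity().values.squeeze()
intensity_norm = session_zscore(intensity)
```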
5. Experiments
5.1 Baseline system
We borrow SAD and x-vector models developed on out-of-domain corpora owing to the limited size of available child-adult speech corpora. To classify speech from non-speech, we use a two-hidden-layer feed-forward DNN developed as part of the DARPA RATS project (DARPA, 2015). The input consists of spliced (±15 frames) 13-dimensional MFCCs, and the output consists of binary speech/non-speech labels. The network is trained to minimize the cross-entropy loss. For x-vector extraction, we use the pre-trained model provided with the CALLHOME recipe in Kaldi, similar to the best system in Sell et al. (2018). In this system, a sliding window of 1.5 s duration is used to extract x-vectors, followed by AHC on PLDA (probabilistic linear discriminant analysis) scores to obtain the diarization labels. We adapt the baseline system by training the PLDA transforms on BOSCC-SchoolAge, which is found to further improve performance.
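As an illustration of the SAD input layer, the following sketch builds the spliced ±15-frame MFCC input described above; the frame count is illustrative, and padding the edges by repeating the first/last frame is an assumption.

```python
import numpy as np

def splice(mfcc, context=15):
    """Stack each 13-dim MFCC frame with its +/-`context` neighbors,
    giving the (T, 31 * 13 = 403)-dim spliced input to the SAD
    network. Edge frames are repeated for padding (an assumption)."""
    T = len(mfcc)
    padded = np.pad(mfcc, ((context, context), (0, 0)), mode="edge")
    return np.stack(
        [padded[t:t + 2 * context + 1].ravel() for t in range(T)]
    )

spliced = splice(np.random.randn(1000, 13))   # shape (1000, 403)
```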
5.2 Error analysis
In the first experiment, we study the effect of each contextual factor listed in Sec. 4 on the different types of diarization errors produced by the baseline system. For each session in the BOSCC-ToddlerHigh and BOSCC-ToddlerLow datasets, we compute the factors and decisions (correct, missed speech, and speaker error) at the frame level. We note that false alarms are not accounted for in this analysis, since we are interested in the effect of the factors on child/adult speech segments only (not silence regions). When plotting the results in Fig. 1, the maximum ranges for utterance duration and speaker change proximity are chosen so as to ensure a sufficient number of frames for analysis.
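The frame-level decisions follow directly from comparing reference and hypothesized labels; a minimal sketch under the same label convention as above:

```python
import numpy as np

def frame_decisions(ref, hyp):
    """Categorize each reference speech frame as correct, missed
    speech, or speaker error (labels: 0 = silence, 1 = child,
    2 = adult). False alarms (hypothesized speech over reference
    silence) are excluded, as in the analysis above."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    speech = ref != 0
    missed = speech & (hyp == 0)
    correct = speech & (hyp == ref)
    speaker_error = speech & ~missed & ~correct
    return correct, missed, speaker_error
```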
5.3 Improving diarization errors
Given that contextual factors are associated with diarization errors, we hypothesize that this relation can be exploited to identify errors and improve diarization performance. We feed the outputs from the baseline diarization system (speaker labels along with silence) and the time-aligned contextual factors to a deep neural network. At the output, the network is provided with a single label representing the speaker. Hence, we pose the neural network training as a sequence classification problem where the output belongs to one of three classes: child, adult, or silence. The input is spliced with context frames to exploit temporal information, which could not be captured in the error analysis in Sec. 5.2. We note that, since the labels from the baseline system do not include speech overlap, the proposed model cannot learn from these regions; we acknowledge errors from overlap regions in the final model.
We define an input sample as a contiguous block of frames spanning a duration of 2 s. At each instant, the diarization output is converted into a three-dimensional one-hot encoding and appended with the contextual factors. The SNR and intensity values are obtained directly from the audio, while utterance length and speaker change proximity are obtained from the diarization outputs, since oracle speaker labels are not available during testing. We experiment with three types of networks in this work: ffn (feed-forward network), blstm (bidirectional long short-term memory), and atten (attention). In ffn, the input is flattened across the time axis before being passed through 3 dense layers with 256 units each. Blstm employs long short-term memory layers to capture forward and backward temporal information; the hidden states are fed to 3 dense layers with 256 units each. Atten makes use of feed-forward attention (Raffel and Ellis, 2015) to selectively attend to frames relevant to the output.
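As a sketch of the atten variant, one plausible realization in PyTorch is shown below: a per-frame encoder over the 7-dimensional input (3-dim one-hot baseline label plus 4 contextual factors), feed-forward attention pooling (Raffel and Ellis, 2015) over the 2 s block, and a 3-way classifier. The encoder depth and the placement of the pooling are our assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FeedForwardAttention(nn.Module):
    """Feed-forward attention (Raffel and Ellis, 2015): score each
    frame encoding, softmax-normalize the scores over time, and
    return the attention-weighted sum of the encodings."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(),
                                   nn.Linear(dim, 1))

    def forward(self, h):                             # h: (batch, time, dim)
        alpha = torch.softmax(self.score(h), dim=1)   # (batch, time, 1)
        return (alpha * h).sum(dim=1)                 # (batch, dim)

class AttenDiarizer(nn.Module):
    """One plausible 'atten' model: per-frame encoder over the 7-dim
    input, attention pooling over the 2 s block, 3-way classifier."""
    def __init__(self, in_dim=7, hidden=256, n_classes=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.BatchNorm1d(hidden), nn.Dropout(0.2))
        self.attention = FeedForwardAttention(hidden)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, x):                             # x: (batch, time, 7)
        b, t, d = x.shape
        h = self.encoder(x.reshape(b * t, d)).reshape(b, t, -1)
        return self.classifier(self.attention(h))     # logits: (batch, 3)
```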
All networks used in this work are optimized using rmsprop to minimize the cross-entropy loss between output labels and logits. Batch normalization and dropout (rate = 0.2) are used in the dense layers for regularization. We obtain the results using cross-validation, where the sessions are divided into four folds and the network is retrained from scratch within each fold. In this way, all sessions are treated as test data.
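A minimal sketch of the corresponding training setup is shown below; the epoch count and learning rate are assumptions, and `sessions`/`make_loader` are hypothetical placeholders for the session list and data loading.

```python
import torch
import torch.nn as nn
from sklearn.model_selection import KFold

def train_fold(model, loader, epochs=20, lr=1e-3):
    """Train one cross-validation fold with RMSprop and cross-entropy
    loss, as described above (epochs and lr are assumptions)."""
    opt = torch.optim.RMSprop(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:        # x: (batch, time, 7), y: (batch,)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
    return model

# Four session-level folds; the network is retrained from scratch in
# each fold so that every session is eventually scored as test data.
# `sessions` and `make_loader` are hypothetical.
# for train_idx, test_idx in KFold(n_splits=4, shuffle=True).split(sessions):
#     model = train_fold(AttenDiarizer(), make_loader(sessions, train_idx))
```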
6. Results
From Fig. 1, we observe significant differences in how contextual factors affect diarization errors on child and adult speech. The fraction of correctly classified frames increases marginally for both speakers with longer utterances and at greater distances from speaker change points, similar to Knox et al. (2012). Notably, the improvement is reflected in lower fractions of speaker errors for child speech and lower fractions of missed speech for adult speech. Further, short child utterances are more likely to be diarized as adult speech than any other outcome, and the fraction of missed child speech is consistent across different utterance lengths and speaker change proximities. While adult speech performs better as SNR increases, child speech exhibits an optimal SNR with respect to correctly classified frames. Specifically, the fraction of child speech diarized as adult speech increases with SNR, suggesting that the baseline diarization system is likely to cluster clean speech segments from the child with the adult. Frames with low speech intensity from both speakers are missed by the SAD, while child speech is increasingly likely to be diarized as adult speech as intensity increases.
From Table 2, the x-vector baseline results in relatively high diarization error rates (DERs) for a dyadic conversation with a known number of speakers. We surmise that this is due to the large fraction of silence/noisy regions (Pearson correlation between DER and silence fraction: r = 0.61, p < 0.001). With the use of contextual factors, all networks provide gains in DER, with atten yielding 8.2% and 15.8% relative improvements in DER. The differences between the networks underscore the importance of exploiting temporal information for diarization.
7. Conclusions
We showed that a state-of-the-art diarization system does not necessarily perform well on naturalistic child-adult interactions, especially for toddlers and preschoolers with limited language, using data obtained from ordinary recording systems. We investigated the role of context in improving upon the results of such a system. First, we examined the effect of various contextual factors on diarization errors, analyzing their different effects on child speech and adult speech. Next, we trained an attention network to reduce diarization errors using the contextual factors. The results suggest the benefit of local context in improving child-adult speaker diarization. In the future, we would like to explore additional contextual factors, especially from the visual modality, since audio-video diarization has been shown to outperform audio-only diarization. We would also like to pose the DNN training as a sequence-to-sequence learning task where the single output label is replaced with a label sequence.
Acknowledgments
This research was supported by the Simons Foundation and National Institute of Mental Health (NIMH Grant No. 1R01 MH114925-01).