A listening test is proposed in which human participants detect talker changes in two natural, multi-talker speech stimuli sets—a familiar language (English) and an unfamiliar language (Chinese). Miss rate, false-alarm rate, and response times (RT) showed a significant dependence on language familiarity. Linear regression modeling of RTs using diverse acoustic features derived from the stimuli showed that listeners recruit a pool of acoustic features for the talker change detection task. Further, benchmarking the same task against a state-of-the-art machine diarization system showed that the machine achieves human parity for the familiar language but not for the unfamiliar language.
1. Introduction
The perception (and decoding) of talker attributes is essential while listening to multi-talker speech conversations. In this paper, we present an experimental paradigm to probe talker change detection in human listeners with stimuli drawn from familiar and unfamiliar languages and find that change detection is dependent on language familiarity and specific acoustic features. A human-machine comparison using a diarization system shows that the performance of the machine system is on par with the human performance for the familiar language.
Previous behavioral studies suggest a substantial influence of indexical attributes, such as talker identity, dialect, and age (Laver, 1968), on speech intelligibility. For example, talker familiarity improves speech-in-noise perception (Johnsrude et al., 2013; Kitterick et al., 2010; Nygaard and Pisoni, 1998) and accent familiarity alters the perceived meaning of an utterance (Cai et al., 2017). These findings imply that the perception of talker cues aids parsing of the semantic message. Lavner et al. (2000) suggest that talker identification uses a distinct group of acoustic features. Yet, Sell et al. (2015) argue that a combination of vocal source, vocal tract, and cortical features fails to explain the perceived talker discrimination in a listening test with simple word-level utterances. Talker perception improves with increasing phonetic content in the speech signal, that is, from vowels to words and sentences (Goggin et al., 1991). Deafness to talker change (Neuhoff et al., 2014), as well as perceptual sensitivity in judging talker dissimilarity (Fleming et al., 2014; Perrachione et al., 2011; Perrachione et al., 2019), are both affected by language familiarity. Together, these studies suggest an interplay between phonetic, semantic, and talker perception while listening to speech.
Unlike single-talker speech, multi-talker conversations contain talker change instances, and detecting these instances is required for segregating the speech into time segments corresponding to who spoke what, and when. Human listeners, on average, take approximately 700 ms (from the instant of change) to report a talker change (Sharma et al., 2019). While acoustic features before and after the change instant influence change detection, it is not clear whether semantic processing in a familiar language impacts talker change detection (TCD). Hence, this paper compares TCD performance using stimuli from familiar and unfamiliar languages.
We designed two speech stimuli sets, one in a language familiar to the participants (English) and another in an unfamiliar language (Mandarin Chinese, henceforth referred to as Chinese). We assume that, compared to a familiar language, semantic processing is minimal while listening to the unfamiliar language. The participants took part in a listening test and indicated the number of talkers in multi-talker stimuli derived from these datasets. The collected data were analyzed to understand the impact of language familiarity on detection metrics, namely, miss and false-alarm rates, and, via regression modeling of the response times (RT), on the use of acoustic features in responding to the task. Further, talker change detection is identified as a crucial pre-processing step (Ryant et al., 2018; Ryant et al., 2019) for machine recognition of conversational speech. This step is primarily approached using diarization systems. We investigate the performance of a state-of-the-art diarization system based on x-vector embeddings (Snyder et al., 2018) on the stimuli sets used in the human listening task. In recent years, there have been claims of achieving human parity in applications such as automatic speech recognition (ASR) (Saon et al., 2017; Xiong et al., 2016) and machine translation (Hassan et al., 2018). In this context, highlighting the performance gap, if any, between humans and machines constitutes an important step toward achieving human parity for speaker diarization systems.
2. Methods
The study presented here extends our work in Sharma et al. (2020a) with a larger set of human participants and a detailed analysis of response time modeling.
2.1 Participants
A total of 28 human participants (21 male; age range 20–37 years, mean 24; all with self-reported normal hearing) took part in the listening test. All participants were proficient in English and had no prior exposure to Chinese. The protocol for the behavioral experiment was approved by the Indian Institute of Science Human Ethics Committee. All participants gave written consent for the test and received monetary compensation.
2.2 Stimuli
The English and Chinese speech recordings were taken from the LibriSpeech corpus (Panayotov et al., 2015) and the Aishell corpus (Bu et al., 2017), respectively. These corpora are composed of read speech audio data (audiobooks and news broadcasts) from more than 400 talkers and are freely available in the public domain. For our experiment, single-talker stimuli were formed by concatenating two utterances from the same talker, while two-talker stimuli were formed by concatenating two utterances from two different, gender-matched talkers. The utterances were chosen to avoid any contextual continuity, and each lasted 2.5–5 s, forming stimuli of 5–10 s. With this approach, two curated stimuli sets were constructed—one for English and one for Chinese, each with 50 single-talker and 50 two-talker stimuli. All the stimuli were manually checked for quality (absence of noise/channel distortions). To avoid any talker adaptation while listening to these stimuli, no talker appeared in more than one stimulus. A comparison of the distributions of a few acoustic features for the two stimuli sets is shown in Fig. 1(b). The acoustic features, namely, pitch, harmonic-to-noise ratio (correlated with perceived voice quality), and intensity (correlated with perceived loudness), are obtained from short-time 40 ms speech segments (with a temporal hop of 10 ms) derived from the speech signals [extracted using Praat (Boersma and Weenink, 2020)]. There is considerable overlap between the distributions, illustrating the acoustic feature similarity between the two stimuli sets. The bimodal distribution of pitch is due to male and female utterances in the stimuli sets.
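As a minimal sketch of this feature extraction, the snippet below uses parselmouth, a Python interface to Praat (the authors used Praat directly, so these exact analysis settings are an assumption); the 10 ms hop matches the text, while Praat derives its analysis window lengths from the pitch floor rather than from a fixed 40 ms value.

```python
# Sketch of the short-time pitch / HNR / intensity extraction (assumed settings).
import numpy as np
import parselmouth  # pip install praat-parselmouth

def short_time_features(wav_path):
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch(time_step=0.010)          # f0 contour, 10 ms hop
    hnr = snd.to_harmonicity(time_step=0.010)      # harmonic-to-noise ratio (dB)
    intensity = snd.to_intensity(time_step=0.010)  # intensity contour (dB)
    f0 = pitch.selected_array["frequency"]
    f0 = f0[f0 > 0]                                # keep voiced frames only
    return f0, hnr.values.flatten(), intensity.values.flatten()
```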
2.3 Listening test
The listening test for each participant was conducted in two sessions, each containing stimuli from only one language. The ordering of language presentation was randomized across participants, and the order of stimuli within each language session was randomized for every participant. The experiment was conducted in an isolated sound booth using high-fidelity headphones (Sennheiser HD 215). A graphical user interface designed in Python and HTML was used for stimulus presentation and response recording [stimulus material available at Sharma et al. (2020b)]. After presentation of a stimulus, the listener responded with a button press indicating the number of talkers (1 or 2). Visual feedback (correct/incorrect) was provided to the participant after every trial. An illustration of a trial is shown in Fig. 1(a). A practice set of 16 trials was provided to familiarize participants with the task. On average, the session for each language took 20 min, with a 10 min break between sessions, making the total experiment duration 50 min per participant.
2.4 Behavioral data pre-processing
The performance measures used are: (i) miss rate (%), the percentage of two-talker stimuli reported by the participant as single talker; (ii) false-alarm (FA) rate (%), the percentage of single-talker stimuli reported as two-talker; and (iii) response time (RT), the time between the end of the stimulus and the participant's button press [illustrated in Fig. 1(a)]. Any trial with RT < 20 ms (too fast) or RT > 2 s (too slow) was discarded from the analysis. The discarded trials constituted 6.7% of the collected responses.
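A minimal sketch of this trial-rejection rule, assuming the responses live in a pandas DataFrame (`trials`, with RT in seconds, is a hypothetical layout, not the authors' actual data structure):

```python
import pandas as pd

def keep_valid_trials(trials: pd.DataFrame) -> pd.DataFrame:
    """Drop trials with RT < 20 ms (too fast) or RT > 2 s (too slow)."""
    kept = trials[(trials["RT"] >= 0.020) & (trials["RT"] <= 2.0)]
    # The study reports 6.7% of responses discarded by this rule.
    print(f"discarded {1 - len(kept) / len(trials):.1%} of trials")
    return kept
```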
2.5 Machine system
We used an implementation of a state-of-the-art speech diarization system that employs x-vector embeddings as acoustic features. The x-vector embeddings from short speech segments are fed to a probabilistic linear discriminant analysis (PLDA) backend to generate an affinity matrix. This PLDA affinity matrix drives an agglomerative hierarchical clustering (AHC) framework that clusters the x-vector features, yielding a talker-level segmentation of the input speech signal. We consider the system's output hypothesis to be two talkers if more than one talker is present in the segmentation. The system implementation details are provided in Singh et al. (2019). The x-vector embeddings (Singh et al., 2019) are derived from a hidden layer of a time-delay neural network trained for a talker classification task on VoxCeleb-1 and VoxCeleb-2, celebrity speech corpora (Chung et al., 2018) comprising 7323 talkers. These 512-dimensional embeddings capture the talker attributes of 1 s segments of speech. The AHC stopping threshold was varied from –0.250 to 0.250, in increments of 0.005, to compute the miss and false-alarm probabilities. These values were used to obtain the detection error trade-off curve plotted in Fig. 3(d).
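The sketch below illustrates the threshold sweep under stated assumptions: the x-vector extraction and PLDA scoring of Singh et al. (2019) are abstracted as a precomputed affinity matrix, and ordinary average-linkage AHC from scipy stands in for the actual clustering implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def count_talkers(affinity: np.ndarray, stop_threshold: float) -> int:
    """AHC over a PLDA affinity matrix: merge clusters while the average
    between-cluster affinity stays above `stop_threshold`."""
    dist = -affinity                    # higher affinity = smaller distance
    shift = -dist.min()                 # make distances non-negative
    dist = dist + shift
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=shift - stop_threshold, criterion="distance")
    return len(np.unique(labels))

# Sweep the stopping threshold as described above; a stimulus counts as a
# "two talker" hypothesis whenever more than one cluster survives.
thresholds = np.arange(-0.250, 0.250 + 1e-9, 0.005)
```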
3. Results
3.1 Behavioral data
Scatter plots of miss rate and FA rate for the unfamiliar (Chinese) versus familiar (English) stimuli sets are shown in Figs. 2(a) and 2(b). A majority of the participants showed a higher miss rate for Chinese trials and a higher FA rate for English trials. The d-prime for the task [Fig. 2(c)] was greater than 1.5 for most of the participants, indicating that they performed the task effectively. The bias [Fig. 2(d)] was between 0.4 and 3, with a higher spread for Chinese trials. The miss and FA rates averaged across participants are shown in Figs. 2(e) and 2(f). The average miss rate is significantly higher for the unfamiliar language (Chinese), whereas the average FA rate is significantly higher for the familiar language (English). The distributions of pooled RTs (from all participants) for correct and incorrect responses are shown in Fig. 2(g); these are visually distinct for the two languages. The grand average of participants' mean RT is shown in Figs. 2(h) and 2(i). The average RT for the unfamiliar language (Chinese) is significantly smaller, for both correct and incorrect responses. These observations indicate a significant impact of language familiarity on human TCD performance.
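The d-prime and bias values in Figs. 2(c) and 2(d) follow standard signal detection theory; a sketch of their computation from per-participant hit and false-alarm proportions is shown below. Whether the authors used the likelihood-ratio bias β or the criterion c is not stated, so β, whose typical range matches the reported 0.4–3, is an assumption here.

```python
import math
from scipy.stats import norm

def dprime_and_bias(hit_rate: float, fa_rate: float):
    """hit_rate: correct 'two talker' responses on two-talker trials
    (i.e., 1 - miss rate); fa_rate: 'two talker' responses on
    single-talker trials."""
    z_h, z_f = norm.ppf(hit_rate), norm.ppf(fa_rate)
    d_prime = z_h - z_f
    beta = math.exp((z_f**2 - z_h**2) / 2.0)  # likelihood-ratio bias (assumed measure)
    return d_prime, beta
```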
3.2 Linear regression modeling of RTs
A linear regression model was constructed with acoustic feature distances as predictor variables and the RT as the dependent variable. As RT is always greater than zero and has a skewed distribution [see Fig. 2(g)], the natural logarithmic transformation of RT was used. The acoustic features included: mel-spectrogram (MEL; using 40 filters), mel-frequency cepstral coefficients (MFCC; 13 coefficients), intensity (INTENSITY), spectral centroid (SCENTROID), pitch (PITCH), harmonic-to-noise ratio (HNR), and x-vectors (XVEC, the features used in the machine system). Given a stimulus signal, for each feature type we obtain two representations, one for each of the concatenated utterances. These representations are averages of short-time frame-wise features (40 ms frames with a temporal hop of 10 ms; unvoiced frames were discarded) extracted using Praat (Boersma and Weenink, 2020) and librosa (McFee et al., 2020). The feature distance is measured as the Euclidean distance between the mean feature representations of the two utterances. Alongside the acoustic feature distances, we also included the stimulus duration (Td) as a predictor variable. As language type has a significant impact on RT (Sec. 3.1), we model RTs separately for different subsets of the pooled data, giving eight models based on language (Chinese/English), response (correct/incorrect trials), and stimulus type (two talker/single talker). Figure 3(a) shows the result of a type-II analysis of variance (ANOVA) on every model. RTs vary across subjects, making subject identity (SUB_ID), a categorical predictor variable, significant in all the models. With respect to acoustic features, more features are significant for English than for Chinese stimuli. The R² is also higher for English than for Chinese, implying that the predictors explain a relatively larger percentage of the observed data variance for English stimuli. Interestingly, the stimulus duration is also significant in most of the models. Surprisingly, MFCC and HNR were not significant in any model, whereas SCENTROID was significant in the majority of the models. XVEC was significant for two-talker correct English trials. This is interesting, as the x-vector features are designed to capture talker differences and have been shown to be useful in machine diarization systems.
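A minimal sketch of one such model is given below, assuming a hypothetical long-format DataFrame `df` (one row per retained trial of a given language/response/stimulus-type subset, with columns named after the predictors); the type-II ANOVA mirrors Fig. 3(a).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

def feature_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Euclidean distance between the frame-averaged feature vectors
    of the two concatenated utterances."""
    return float(np.linalg.norm(feats_a.mean(axis=0) - feats_b.mean(axis=0)))

# df: hypothetical table for one data subset; RT and Td in seconds,
# remaining columns are the per-trial acoustic feature distances.
model = smf.ols(
    "np.log(RT) ~ C(SUB_ID) + Td + MEL + MFCC + INTENSITY"
    " + SCENTROID + PITCH + HNR + XVEC",
    data=df,
).fit()
print(anova_lm(model, typ=2))  # type-II ANOVA, as in Fig. 3(a)
print(model.rsquared)          # variance explained (R²)
```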
3.3 Human-machine comparison
The machine system performance is shown in Fig. 3(b), alongside the human performance. The plot suggests that the machine system's performance on the familiar language (English) is on par with human performance for the same stimuli, whereas its performance on the unfamiliar language (Chinese) is worse than human performance. This indicates that the future design of machine diarization systems could target invariance to language mismatch.
4. Discussion
The listening test results show a significant impact of language familiarity on human talker change detection performance. Though the stimuli we used were short in duration (2.5–5 s utterances), each utterance contained roughly 8–10 words and hence was not devoid of semantic information. Such short-duration, sentence-level speech stimuli have previously been used for analyzing language familiarity effects on talker dissimilarity judgments by Perrachione et al. (2019) and Fleming et al. (2014). These studies highlight that even time-reversed speech, devoid of semantics, is sufficient to elicit a familiarity effect.
The results show a lower miss rate for the familiar language, suggesting that success in semantic processing (and understanding) benefits TCD. However, we also find that the FA rate is higher for the familiar language. This suggests that a majority of participants falsely associated a change in context between the utterances with a talker change. This is not the case for the unfamiliar language (significantly lower FA rate), where semantic understanding is absent. The RT for familiar language trials is significantly higher than for unfamiliar language trials. This finding suggests that comprehension of speech (which likely occurs for familiar language stimuli) adversely affects the TCD response time, whereas in the unfamiliar language case there is no competing semantic processing (and hence no added cognitive load). Past work by Neuhoff et al. (2014) presents an interesting interplay between semantic and indexical information extraction, showing greater change deafness for a familiar language (without participants being cued to attend to the change). In contrast to that study, the subjects in the current study were instructed to attend to talker changes and were also provided feedback after every trial. Therefore, even when participants are instructed to attend to the change, we still find effects of language familiarity on change detection. In particular, listening to familiar language speech distracts from the ability to attend to indexical information, which likely manifests as the increased response times observed in the familiar language trials.
We note that the subject pool recruited for the study consisted of non-native English speakers proficient in English. Data from our past study (Sharma et al., 2020) and from Köster and Schiller (1997) suggest that non-nativeness does not have an impact on talker perception tasks, though future studies may wish to manipulate this factor.
Turning to the regression analysis of RTs, we find that a majority of the acoustic features were not significant for the unfamiliar language trials. This was also reflected in a lower R² for the data drawn from unfamiliar language trials. We hypothesize that language familiarity enables the use of acoustic features different from those used for an unfamiliar language.
To the best of our knowledge, this study is the first of its kind to contrast human and machine performance on a talker counting task. The human-machine performance comparison shows that the diarization systems based on x-vector embeddings can achieve human-like performance even on short duration stimuli when the training and test data come from the same language. However, the results indicate that humans are superior in generalizing to unfamiliar languages. The future design of embeddings for diarization systems can target language invariance to overcome this limitation.
Acknowledgments
This work started at the Telluride Neuromorphic Workshop in Telluride, Colorado during the summer of 2019, supported by funds from the National Science Foundation (NSF). The work done by N.K.S., V.K., and S.G. was supported by grants from the British Telecom India Research Center (BTIRC).