Speakers tailor their speech to different types of interlocutors. For example, speech directed to voice technology has different acoustic-phonetic characteristics than speech directed to a human. The present study investigates the perceptual consequences of human- and device-directed registers in English. We compare two groups of speakers: participants whose first language is English (L1) and bilingual L1 Mandarin-L2 English talkers. Participants produced short sentences in several conditions: an initial production and a repeat production after a human or device guise indicated either understanding or misunderstanding. In experiment 1, a separate group of L1 English listeners heard these sentences and transcribed the target words. In experiment 2, the same productions were transcribed by an automatic speech recognition (ASR) system. Results show that transcription accuracy was highest for L1 talkers for both human and ASR transcribers. Furthermore, there were no overall differences in transcription accuracy between human- and device-directed speech. Finally, while human listeners showed an intelligibility benefit for coda repair productions, the ASR transcriber did not benefit from these enhancements. Findings are discussed in terms of models of register adaptation, phonetic variation, and human-computer interaction.
I. INTRODUCTION
Recently, with the growing prevalence of voice-activated artificially intelligent (voice-AI) devices, humans are increasingly talking with technology. This new type of interlocutor is able to both understand and generate fluent speech, albeit in ways that differ from humans. For example, automatic speech recognition (ASR) systems may contain errors (Koenecke et al., 2020), and device interlocutors are perceived as less competent (Oviatt et al., 1998; Cohn et al., 2022; Cowan et al., 2015). Previous research has demonstrated that people interact differently with computer interlocutors than they do with other humans. For instance, acoustic differences have been found between human-directed speech and computer-directed speech by speakers in their native first language [e.g., Cohn et al. (2022), Cohn and Zellou (2021), and Siegert and Krüger (2021)], but the consequences of these adjustments for intelligibility are understudied, particularly across different language varieties.
As voice technology increases in use, more people engage with voice-AI in their second language(s) (L2) to complete everyday tasks (e.g., setting calendar reminders, sending text messages) and to learn new languages [e.g., Tai and Chen (2022)]. While users tend to behave as if devices have lower communicative competence than human interlocutors (Cowan et al., 2015; Cohn et al., 2022), L2 speakers, in particular, face additional barriers when communicating with technology (Song et al., 2022). ASR systems, which transcribe spoken language into text and underpin voice-AI technology, have systematic biases against non-mainstream language varieties [e.g., for ethnolects in Koenecke et al. (2020) and Wassink et al. (2022); for L2 speakers in Chan et al. (2022), Choe et al. (2022), Song et al. (2022), and Moussalli and Cardoso (2017)]. For example, Chan et al. (2022) found that ASR transcription was less accurate for L2 than L1 English speakers when using an English language model. They additionally found differences based on L1 language type: L2 speakers whose L1 is a tonal language (e.g., Mandarin) were transcribed even less accurately, although the decline was smaller for L2 speakers who learned English earlier. As related work has shown that ASR biases can be internalized by speakers (Mengesha et al., 2021), there is a need for more studies exploring disparities across L1 and L2 speech directed toward devices. The current study tests how L1 and L2 English speakers adapt their speech when talking to technology, compared to talking to another person, and the perceptual consequences of those adaptations for both human and machine comprehenders.
A. Speakers' adaptations for addressee
Speakers tailor productions to different communicative situations, environments, and listeners. The hyper- and hypo-articulation (H&H) model proposes that this is the result of a real-time balance between two competing needs of the talker: minimizing articulatory effort while maximizing intelligibility for the listener (Lindblom, 1990). This results in speaking style variation which may range from hypospeech (casual or plain speech) to hyperspeech (clear or enhanced speech). When intelligibility is not a priority, speakers place greater emphasis on minimizing articulatory effort, producing casual speech. On the other hand, when there is a higher probability of not being understood, speakers maximize intelligibility at the cost of greater articulatory effort, producing clear speech. Under the H&H theory, talkers producing clear speech modify their speech with the goal of being more intelligible to a listener, such as changing their speech rate, intensity, and vowel productions (Bradlow et al., 2003; Ferguson and Kewley-Port, 2007; Krause and Braida, 2004). These adaptations may occur in a variety of situations where greater intelligibility is called for, including after a misunderstanding has occurred, or when speakers are interacting with interlocutors whom they judge to have lower communicative competence or different perceptual needs.
In this paper, we will focus on the types of clear speech adaptations that are directed to specific types of listeners. Speakers use different speech styles, or “registers,” to communicate with different types of interlocutors, adjusting their speech in consistent ways according to an interlocutor's characteristics and presumed needs (Clark and Murphy, 1982). We use the term register to refer to the different ways that we talk to different interlocutors. This includes infant-directed speech (DS) (Fernald et al., 1989), non-native speaker-DS (Uther et al., 2007; Aoki and Zellou, 2023a), and hearing-impaired individual-DS (Picheny et al., 1986; Aoki and Zellou, 2023b), which have, at times, distinct acoustic adaptations compared to speech directed to L1, normal-hearing adult listeners. L2 speakers also produce register differences, talking to different interlocutors in different, listener-targeted ways. For example, child-DS shares many similar features cross-linguistically (Fernald et al., 1989), and child-DS produced by L2 speakers shares characteristics with child-DS produced by L1 speakers, including a slower speech rate as compared to adult-DS (Fish et al., 2017). Additionally, Kato and Baese-Berk (2022) found acoustic modifications between plain speech and hearing-impaired-individual-DS produced by L2 talkers, although the size of the enhancements was smaller for L2 talkers with lower proficiency. Taken together, these findings suggest that the use of interlocutor-specific acoustic-phonetic registers, such as device-DS, is a cross-linguistic phenomenon. This leads us to ask whether L2 speakers produce device-DS in similar or different ways from L1 speakers.
In addition to adaptations for different types of addressees, speakers may also produce targeted adaptations based on feedback during interactions. For example, the presence of phonetically similar competitors (e.g., “bill” and “pill”) in a communicative context leads speakers to strategically enhance relevant segments acoustically (Buz et al., 2016; Seyfarth et al., 2016). Furthermore, feedback that a misunderstanding has taken place increases the size of this enhancement (Buz et al., 2016; Cohn and Zellou, 2021; Cohn et al., 2022). As with L1 talkers, L2 talkers also enhance speech in the presence of acoustic competitors (Kato and Baese-Berk, 2021), suggesting we might observe parallel adjustments in the current study across our bilingual and monolingual groups.
B. Perceptual consequences of speech adaptations
Speech produced in different speaking styles often has consequences for listeners in terms of intelligibility (Aoki et al., 2022; Aoki and Zellou, 2023c). For example, some work has shown improved intelligibility when listening to nonnative-speaker-DS (Kangatharan et al., 2023) and hearing-impaired-individual-DS (Picheny et al., 1985) as compared to speech directed to an L1, normal-hearing listener. Yet, the majority of studies examining the consequences of register adaptations have focused on L1 speakers, and few studies have examined the perceptual consequences of L2 register variation. In one such study, Kato and Baese-Berk (2022) found that L1 English listeners did show an intelligibility benefit from L2 English (L1 Mandarin) hearing-impaired-individual-DS. This intelligibility benefit varied by talker proficiency, with a greater benefit for hearing-impaired-individual-DS produced by higher-proficiency talkers as compared to lower-proficiency talkers. On the other hand, Kato and Baese-Berk (2023) found no intelligibility benefit from L2 English (L1 Mandarin) hearing-impaired-individual-DS for L1 listeners. Despite sizable acoustic differences between enhanced and plain speech produced by L2 English (L1 Mandarin) talkers, neither higher- nor lower-proficiency talkers' acoustic modifications resulted in a clear speech intelligibility benefit for L1 English (human) listeners, suggesting that acoustic modifications do not always lead to an intelligibility gain for listeners.
Since the study of the intelligibility consequences of L2 register variation is limited, insights can be taken from general research on L2 clear speech, which is often elicited by instructing talkers to imagine that they are speaking to someone with hearing loss or a non-native speaker. Several studies have found that L2 clear speech provides an intelligibility benefit to L1 listeners (Smiljanić and Bradlow, 2011; Jung and Dmitrieva, 2023). Rogers (2010) also found that L1 listeners benefited from L2 English (L1 Spanish) clear speech. This intelligibility benefit was greater for talkers who had an earlier age of immersion, with a greater clear speech benefit produced by L1 Spanish talkers who had moved to an English-speaking culture before the age of 12 as compared to talkers who moved after the age of 15. In summary, it seems that while L2 hyperspeech can provide intelligibility benefits for L1 listeners, the size of the intelligibility benefit varies by proficiency and age of immersion.
Beyond specific registers, targeted segmental speech enhancements may also result in intelligibility benefits. Kato (2020) investigated the perceptual consequences of acoustic modifications produced by L2 English (L1 Mandarin) talkers when a minimal pair competitor was or was not present in the production context. For example, participants read aloud “click on the peer now” when the words peer, beer, and town were on the screen. In a forced-choice perception task, L1 English listeners identified the target word more accurately for productions where a competitor was present in the production context (i.e., when talkers were making acoustic modifications to differentiate the target word from the competitor). However, this perceptual benefit held only for high-proficiency L2 talkers and not for lower-proficiency L2 talkers. Furthermore, the effect was found only for contrasts that exist in the L2 but not the L1 (the /i/ and /ɪ/ vowel contrast, and the word-final /p/ and /b/ consonant contrast). This study provides an example of a perceptual benefit in L2 speech when there is a need to enhance or differentiate specific segments.
Although some registers, including nonnative-speaker-DS and hearing-impaired-individual-DS, seem to afford intelligibility benefits for all listeners, not all registers may have a universal intelligibility benefit. Infant-DS has been shown to aid in infant word learning (Singh et al., 2009) and word segmentation (Thiessen et al., 2005). Yet, despite this benefit for infant listeners, words in infant-DS may be less intelligible to adults than those in adult-DS (Bard and Anderson, 1983). This suggests that some registers might provide an intelligibility benefit for the target listener, while non-target listeners experience a decrease in intelligibility.
C. Technology-directed registers and impact on intelligibility
A growing body of work has examined how people talk to technology, i.e., computer-DS (Oviatt et al., 1998; Burnham et al., 2010; Mayo et al., 2012). Although many modern voice-AI devices, such as Apple's Siri, are able to interact with users through speech in ways that are similar to interactions with humans, many users hold beliefs that AI systems are less communicatively competent than humans (Siegert and Krüger, 2018; Cohn et al., 2022; Oviatt et al., 1998). These beliefs may stem from prior experience with voice technology [e.g., Gambino et al. (2020)], including experience of being misunderstood by it (Koenecke et al., 2020). These perceptions that device and human addressees have differing perceptual needs could result in different registers for speech directed to these different types of interlocutors.
Indeed, acoustic differences have been found between device-DS and human-DS, with studies often reporting that device-DS is louder than human-DS (Raveh et al., 2019; Siegert and Krüger, 2021; Cohn et al., 2022). Findings for other acoustic characteristics distinguishing device- and human-DS are mixed [see Cohn et al. (2022) for a review]. Some studies report no difference in duration (Raveh et al., 2019; Siegert and Krüger, 2021), while others report a slower rate for voice-AI-DS (Cohn et al., 2021). Raveh et al. (2019) reported a higher mean f0 for voice-AI-DS, while Cohn et al. (2022) reported a lower mean f0 for voice-AI-DS. Cohn and Zellou (2021) reported a larger f0 range for voice-AI-DS, while Cohn et al. (2022) reported a smaller f0 range. Additionally, Cohn et al. (2022) reported that f0 range in voice-AI-DS increased over time, becoming more similar to human-DS, while mean f0 decreased over time for voice-AI-DS, diverging from human-DS.
These acoustic differences between the device-DS and human-DS registers may have perceptual consequences for listeners. Whether device-DS is more or less intelligible than human-DS to human listeners and ASR comprehenders is an open question, which we explore in the present study.
D. Present study
The present study takes a perceptual approach to extend Cohn et al. (2022), which examined the acoustic adjustments L1 English speakers make in a pseudo-interactive task with a human and a voice assistant (Apple's Siri). Cohn et al. (2022) used a purely acoustic approach to probe register differences. The present study builds upon that approach by examining the intelligibility of speech directed toward device and human addressees to their intended listeners (human listeners vs the Siri ASR). In addition, we investigate whether the talker's language background shapes these effects.
We selected a subset of the L1 English data from Cohn et al. (2022) and collected additional data from L1 Mandarin, L2 English speakers who completed the identical experiment. In the experiment, participants were told they were interacting with either a human or Siri and read a sentence off the screen. The interlocutor displayed the word it had understood on the screen and asked, via a pre-recorded human or Siri voice, for verbal confirmation of whether its perception was correct. The participant then repeated the sentence to either confirm or repair the human's/Siri's understanding. In the present study, L1 and L2 productions collected under this paradigm were transcribed by L1 English human listeners and an ASR transcriber.
The present study examines the device-DS and human-DS registers, as well as the effect of production condition (original production, confirmation of a correct understanding, repair of a misheard coda, and repair of a misheard vowel). We also investigate the perception of speech enhancements and device-DS as produced by L1 and L2 speakers. Thus, there were three variables: Speaker Language Background (L1, L2), Addressee (Human, Siri), and Condition (Original, Confirm Correct, Coda Repair, Vowel Repair).
The first research question is whether there is an intelligibility difference between the device-DS register and the human-DS register. It may be that humans perceive voice technology as having unique perceptual needs that require greater articulatory effort to accommodate; if so, speech produced in a device-DS register may be more intelligible than speech produced in a human-DS register, similar to nonnative-speaker-DS and hearing-impaired-individual-DS. This would align with Mayo et al. (2012), who found that (imagined) computer-DS was transcribed most accurately by native English listeners, while infant-DS was the least intelligible.
The second research question is whether the production condition within the pseudo-interactive task (i.e., production after a misunderstanding) results in an intelligibility difference for listeners. According to the H&H model, talkers expend more articulatory effort when listener understanding is a priority (Lindblom, 1990). Accordingly, speech produced after there has been a misunderstanding in conversation (i.e., Coda Repair, Vowel Repair) might be expected to be more enhanced, and thus more intelligible, while a repeat production after the listener has already indicated understanding (i.e., Confirm Correct) might be expected to be more reduced, and thus less intelligible.
The third research question is whether talker language background and speaking style have an impact on intelligibility. Previous research has shown that L2 speech is less intelligible than L1 speech to both L1 listeners (Smiljanić and Bradlow, 2007) and devices (Moussalli and Cardoso, 2017). However, the effect of language background may differ across speaking styles, as was found by Kato and Baese-Berk (2022). For instance, we may find smaller intelligibility differences across speaking styles for L2 talkers as compared to L1 talkers, due to less experience with the language.
Finally, we also ask whether the three previously discussed factors of Addressee, Condition, and Language Background result in intelligibility differences for two different types of listeners: L1 human listeners and a device comprehender, in this case the Apple ASR system. If listeners experience an intelligibility benefit when listening to a register intended for a listener of their type, then human listeners will find human-DS more intelligible than device-DS, while ASR systems will find the device-DS register more intelligible than human-DS. On the other hand, although there are acoustic differences between device-DS and human-DS, those differences may not be sufficient to induce intelligibility differences for listeners. This may especially be the case for L2 speech; Kato and Baese-Berk (2023) found that although L2 speakers produced acoustic clear speech modifications, those differences did not result in an intelligibility benefit for listeners. Thus, it may be that there is no intelligibility difference between registers.
II. EXPERIMENT 1
Experiment 1 investigates the intelligibility of device-DS and human-DS for L1 English human listeners. A recording task (Sec. II A) was used to elicit productions for the transcription task (Sec. II B). The methodology of the recording task followed that of experiment 2 (higher error rate) in Cohn et al. (2022).
A. Recording task
1. Talkers
Nine L1 Mandarin talkers were recruited from the UC Davis psychology subject pool to participate in a recording task. These talkers were a subset of participants in a larger study that included many language backgrounds. Recordings from two talkers were later removed because of technical difficulties during their data collection (leading to empty recordings), leaving a total of seven L1 Mandarin talkers (4 female, 3 male). The participants had an age range of 18–23, with a mean age of 20 years. The talkers reported beginning to learn English between the ages of 4 and 16 (mean age 8.9) and moving to the U.S. between the ages of 12 and 19 (mean age 15.7).
Participants completed a survey about their usage of voice-AI devices (Ammari et al., 2019). Six of the seven L1 Mandarin talkers had used Siri before, and all of them reported using Siri once a week or less. They reported using Siri for music (n = 3), setting alarms (n = 3), and looking up information (n = 2). One participant reported using Alexa once a week or less to play music.
Recordings of nine L1 English talkers from Cohn et al. (2022), also recruited from the UC Davis psychology subject pool, were age- and gender-matched to the nine original L1 Mandarin talkers (6 female, 3 male). The participants had an age range of 18–22, with a mean age of 19.9 years.
All nine L1 English talkers reported having used Siri before, and using Siri at a rate of once a week or less. They used Siri to find information (n = 6), call or text contacts (n = 4), set reminders (n = 1), or change phone settings (n = 1). Six participants reported using other voice-AI systems: Google (n = 2), Alexa (n = 1), Cortana (n = 1), and unspecified (n = 3). Of these, two participants reported using voice-AI systems once or twice daily, while one participant reported using a voice-AI system hourly.
2. Materials
Both short recordings and images were used to create the human and Siri guises for the recording task. An adult female speaker was recorded for the human guise, while Siri text-to-speech was used for the Siri guise. A short introduction was recorded for each guise (“I am Siri. I am a digital assistant on Apple products”/“I am Melissa. I work here in the Phonetics Lab”), as well as task instructions. Four short responses were generated to give initial feedback to the talkers (e.g., “Is this the word?”), as well as four short responses indicating understanding (e.g., “Okay, got it”). See Cohn et al. (2022) (Appendix C) for the complete list of productions. In addition to the recordings, a photo of a cell phone open to Siri and a stock photo of an adult female were presented.
This task used 44 sentences, each containing a monosyllabic target word in the frame “The word is ____.” Half of these target words had the form CVC, where the final consonant was a voiced oral stop at a bilabial or alveolar place of articulation. The other half of the target words were of the form CVN; these were matched with the CVC words and differed only in having a final nasal stop at the same place of articulation (e.g., paid and pain). For the full list of words, see the Target CVC and Target CVN lists in Appendix B in Cohn et al. (2022). An additional 12 sentences, also containing monosyllabic target words, were included in the experiment [see the Target CiC and Target CaC word lists in Appendix B in Cohn et al. (2022)]. These 12 sentences were used in Cohn et al. (2022) to examine the talkers' vowel spaces, but were not used as stimuli in the current study.
3. Procedure
Talkers from both the L1 and L2 speaker groups completed the recording task from Cohn et al. (2022). After providing informed consent, speakers were instructed about the nature of the task and told they would be interacting with Siri and a person. They also received auditory instructions along with an introduction from both Siri and the recorded human voice. These instructions were given at the start of the human-directed elicitation block (human voice) and the Siri-directed elicitation block (Siri voice). The experiment was conducted using E-Prime (Psychology Software Tools, 2016).
Each trial consisted of multiple turns with one of two types of interlocutors: a mock-human interlocutor with a recorded human voice, or a mock-device interlocutor with the voice of the voice-AI system Siri. In each trial, speakers first read a sentence off the screen (“The word is pain”). Next, a word was shown on the screen (e.g., PINE) and the interlocutor asked for confirmation of whether this was the word the speaker had intended to produce (e.g., “Is this correct?”). When the word was “heard” incorrectly, the displayed word differed from the target word either in its vowel or in its coda. The speaker then used a button box to indicate whether the interlocutor had heard them correctly or not, and produced the sentence a second time (“The word is pain”), whereupon the interlocutor indicated understanding (e.g., “Okay, got it”).
The recordings were thus produced under four different conditions. The Original condition comprises productions from the first time participants read a sentence off the screen. The Confirm Correct condition comprises productions made after the interlocutor indicated correct understanding of the talker's production. The Coda Repair condition refers to productions made after the interlocutor indicated that it had heard a word with an incorrect coda (e.g., “paid” instead of “pain”); likewise, the Vowel Repair condition refers to productions made after the interlocutor indicated that it had heard a word with an incorrect vowel (e.g., “pine” instead of “pain”).
The Siri-interlocutor and human-interlocutor trials were presented in two separate blocks, and the order of blocks was counterbalanced across participants. In each block, 12 vowel space trials were followed by 44 target trials. In these target trials, each of the 44 CVC or CVN words appeared once. There were 16 trials in the Confirm Correct condition, 14 in Coda Repair, and 14 in Vowel Repair. Participants thus produced four recordings of each target word: one original and one repeat for each of the two interlocutors (Human vs Siri). Trials alternated between CVC and CVN words, with words randomly selected from each group. Both the rate and type of errors were identical across the human and Siri blocks and occurred on identical trials (note that sentences were pseudorandomized in lists).
Overall, each talker made 224 productions: (12 vowel space trials + 44 target trials) * 2 productions * 2 interlocutors; the vowel space trials were not included in the analysis for the present study. In total, across all 16 talkers, 2816 instances of the target words were elicited.
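To make these counts explicit, the arithmetic can be verified with a minimal sketch in R (the variable names are ours, for illustration only):

# Per-talker counts (Sec. II A 3): two blocks (human, Siri), each with
# 12 vowel space trials and 44 target trials, and two productions per trial
vowel_space_trials    <- 12   # produced but not analyzed in the present study
target_trials         <- 44   # CVC/CVN target words
productions_per_trial <- 2    # original + repeat
interlocutors         <- 2    # human, Siri

per_talker <- (vowel_space_trials + target_trials) * productions_per_trial * interlocutors
per_talker       # 224

# Target-word tokens elicited across all talkers
n_talkers     <- 16
target_tokens <- target_trials * productions_per_trial * interlocutors * n_talkers
target_tokens    # 2816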
B. Transcription task
1. Listeners
A total of 117 L1 English listeners (2 nonbinary, 78 female, 37 male; mean age 19.7 years, age range 18–36) were recruited from the UC Davis Psychology subject pool to complete the task. All listeners specified English as their L1 and as their strongest language.
2. Materials
The sentences from the recording task were amplitude-normalized to 60 dB (Cohn et al., 2021) and embedded in white noise at a signal-to-noise ratio of +3 dB (Zellou et al., 2022), and then the entire sound file was amplitude-normalized to 60 dB in Praat. Sound files with missing words, coughs, yawns, and other artifacts were removed. If one sound file was missing for a given trial (e.g., the original production for Siri-directed “pain”), then the other sound file from that trial was also removed (e.g., the second production for Siri-directed “pain”). In total, 124 sound files were removed, leaving 154–174 (mean 168.3) productions per talker.
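Although this preparation was carried out in Praat, the signal processing involved can be illustrated with a minimal R sketch using the tuneR package; the file names, the 60 dB intensity convention (RMS re 2e-5), and the rounding details are our assumptions rather than the exact Praat procedure:

library(tuneR)

rms <- function(x) sqrt(mean(x^2))

mix_with_noise <- function(infile, outfile, snr_db = 3, target_db = 60) {
  w   <- readWave(infile)
  sig <- w@left / (2^(w@bit - 1))               # scale samples to [-1, 1]

  # Praat-style intensity scaling: 60 dB re 2e-5 corresponds to RMS = 0.02
  target_rms <- 2e-5 * 10^(target_db / 20)
  sig <- sig * (target_rms / rms(sig))

  # White noise scaled so that 20 * log10(rms(sig) / rms(noise)) = snr_db
  noise <- rnorm(length(sig))
  noise <- noise * (rms(sig) / rms(noise)) / 10^(snr_db / 20)

  mixed <- sig + noise
  mixed <- mixed * (target_rms / rms(mixed))    # re-normalize the mixture to 60 dB

  out <- Wave(left = round(mixed * (2^15 - 1)), samp.rate = w@samp.rate, bit = 16)
  writeWave(out, outfile)
}

# e.g., mix_with_noise("pain_siri_original.wav", "pain_siri_original_noise.wav")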
3. Procedure
Listeners completed a transcription task remotely using Qualtrics. They began with an audio calibration step, in which they listened to the sentence “Bill heard they asked about the host” and adjusted their audio settings to a comfortable level. They then identified the last word of the sentence from among phonological competitor options, to confirm that they could hear the audio at a comfortable listening level. In the experimental trials, they heard each noise-mixed sentence once and then typed the final word that they heard into a text box (“The word is: _____”). Each listener heard recordings from only one talker. Trials were presented in random order. Depending on which talker they were assigned, listeners transcribed between 154 and 174 recordings. Each talker was transcribed by a minimum of four listeners.
4. Statistical analysis
Transcription accuracy was coded as binomial data (correct = 1, incorrect = 0). Homophones and obvious spelling mistakes were coded as accurate. Nonwords that were judged by an L1 English speaker to have the same pronunciation as the target word were also coded as accurate (“boad” or “boed” for “bode”; “boen” for “bone”; “cawd” for “cod”; “daude,” “dawed,” “dawd,” or “dod” for “Dodd”; “lawd,” “lawed,” “lod,” or “lodd” for “laud”; “rubb” for “rub”; “schide,” “shide,” “shyd,” or “shyed” for “shied”; “sudd” for “sud”). These nonwords with comparable pronunciations were included in the intelligibility analysis because these types of responses provide evidence that the listener perceived what the talker intended to produce. Responses that were more than one syllable or word (e.g., “where's lawn” for “lawn,” “courseload” for “load,” “your side” for “side”) were coded as correct if the final syllable met the same criteria as a monosyllabic response.
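As an illustration of this scoring scheme, a minimal R sketch is given below; the whitelist is taken from the examples above, while the helper name and the model specification at the end (including its random-effects structure) are our own assumptions, based on how the results are reported relative to the grand mean:

# Accepted alternative spellings (from the list above), keyed by target word
alt_spellings <- list(
  bode = c("boad", "boed"), bone = "boen", cod = "cawd",
  dodd = c("daude", "dawed", "dawd", "dod"),
  laud = c("lawd", "lawed", "lod", "lodd"),
  rub = "rubb", shied = c("schide", "shide", "shyd", "shyed"), sud = "sudd"
)

score_response <- function(response, target) {
  resp <- tolower(trimws(response))
  # For multi-word responses, score only the final word ("your side" -> "side");
  # compounds such as "courseload" would additionally need final-syllable handling
  resp <- tail(strsplit(resp, "\\s+")[[1]], 1)
  as.integer(resp == target | resp %in% alt_spellings[[target]])
}

score_response("boed", "bode")          # 1
score_response("where's lawn", "lawn")  # 1
score_response("pine", "pain")          # 0

# The grand-mean (sum-coded) reporting of effects is consistent with a
# mixed-effects logistic regression; one hypothetical specification:
# library(lme4)
# contrasts(d$addressee)  <- contr.sum(2)
# contrasts(d$condition)  <- contr.sum(4)
# contrasts(d$background) <- contr.sum(2)
# m <- glmer(correct ~ addressee * condition * background +
#              (1 | talker) + (1 | listener) + (1 | word),
#            data = d, family = binomial)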
C. Results
Figure 1 shows the proportion of keywords correctly transcribed by L1 English listeners, split by language background of the talker (L1 or L2) and by each of the four conditions (original production, confirmation of a correct understanding, repair of a misheard coda, and repair of a misheard vowel). The left panel shows Human-DS and the right panel shows Siri-DS.
The model showed no main effect of Addressee (human- vs device-DS) (p = 0.14). As seen in Fig. 1, there was an effect of language background, with transcription accuracy higher for L1 speakers on average (Coef = 1.00, SE = 0.18, z = 5.60, p < 0.001).
There were several effects of condition. Although accuracy for trials in the Original condition did not differ from the grand mean (p = 0.06), accuracy for trials in the Confirm Correct condition was lower than the grand mean (Coef = −0.11, SE = 0.04, z = −3.19, p < 0.01), as seen in Fig. 1. As for the conditions after the addressee misunderstood the talker, accuracy for the Coda Repair condition was higher than the grand mean (Coef = 0.16, SE = 0.04, z = 4.27, p < 0.001), while accuracy for the Vowel Repair condition was lower than the grand mean (Coef = −0.10, SE = 0.04, z = −2.63, p = 0.01).
The model also showed interactions between talker Language Background and condition, specifically for the Confirm Correct and Coda Repair conditions. L1 talkers' productions in the Confirm Correct condition were transcribed less accurately on average (Coef = −0.08, SE = 0.04, z = −2.27, p = 0.02). On the other hand, L1 talkers' productions in the Coda Repair condition were transcribed more accurately on average (Coef = 0.08, SE = 0.04, z = 2.09, p = 0.04).
There were no other significant effects, including no significant interactions between Addressee and either condition, talker Language Background, or the three-way interaction between these factors. The full model output is provided in the OSF repository for the project.
III. EXPERIMENT 2
Experiment 2 investigates the intelligibility of device-DS and human-DS for an ASR transcriber to test the hypothesis that device-DS would be more intelligible for ASR than human-DS.
A. Procedure
A single audio file was created which concatenated the auditory stimuli from experiment 1. Each stimulus was separated by a keyword (the word “start” produced by the “Joanna” voice from Amazon Polly) so that it would be easier to segment the text after transcription. The audio file thus alternated between stimulus and keyword (“The word is pain,” “Start,” “The word is Todd,” “Start,” “The word is paid,” “Start,”…). One second of silence was included between each sentence and keyword. The stimulus order was randomized, and the stimuli were not masked by background noise.
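A minimal R sketch of this concatenation step (using the tuneR package; the file names, directory layout, and the assumption that all files share one sampling rate and bit depth are ours):

library(tuneR)

keyword <- readWave("start_joanna.wav")   # "start", Amazon Polly "Joanna" voice
stim_files <- sample(list.files("stimuli", pattern = "\\.wav$", full.names = TRUE))

sr      <- keyword@samp.rate
silence <- Wave(left = rep(0, sr), samp.rate = sr, bit = keyword@bit)  # 1 s of silence

pieces <- list()
for (f in stim_files) {
  # stimulus, 1 s silence, "start", 1 s silence, next stimulus, ...
  pieces <- c(pieces, list(readWave(f), silence, keyword, silence))
}
concatenated <- do.call(bind, pieces)
writeWave(concatenated, "experiment2_input.wav")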
Transcription was then conducted by routing the audio signal internally through Loopback (Matuschak, n.d.) on a Mac computer (macOS Catalina, Version 10.15.7). There were four steps to the internal transcription process, loosely based on an online tutorial (Geerling, 2022). First, a new “device” was created within Loopback and linked to QuickTime Player. Second, dictation was activated in System Preferences. Third, the file was opened in QuickTime Player and the “Play” button was clicked. Fourth, Microsoft Word for Mac (Version 16.661) was opened and the “Dictate” button was clicked. The sentences appeared in real time within a Microsoft Word document as the audio played. After transcription was complete, the sentences were copied and pasted into a TextEdit file for further analysis.
B. Data processing and analysis
The sentences were separated from each other using strsplit() in R. With a custom-made script, the sentence-final keywords were then extracted and manually corrected for homophones (e.g., if the target word was “dodd,” a transcription of “dod” was altered to match). Final keyword accuracy was coded binomially as correct (=1) or incorrect (=0), and a strict coding criterion was employed, such that all affixes needed to be in place to be counted as correct (e.g., “lie” was considered incorrect if the correct response was “lied”). As in experiment 1, homophones were considered to be correct.
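A minimal sketch of this processing step in R (object and file names are ours; the vector of intended target words, in presentation order, is assumed to be available):

# Split the dictated text at the "start" keyword and pull out the final word
# of each transcribed sentence
transcript <- tolower(paste(readLines("exp2_dictation.txt", warn = FALSE), collapse = " "))
transcript <- gsub("[[:punct:]]", " ", transcript)   # drop dictation punctuation

sentences <- trimws(strsplit(transcript, "\\bstart\\b")[[1]])
sentences <- sentences[sentences != ""]

final_words <- sapply(strsplit(sentences, "\\s+"), tail, 1)

# Strict scoring against the intended final words ("lie" is incorrect when the
# target is "lied"); homophones are hand-corrected before this step
# accuracy <- as.integer(final_words == targets)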
C. Results
Figure 2 shows aggregated accuracy proportions by speaker language background, condition, and interlocutor. The full model output is provided in the OSF repository for this project.
There were three significant effects. First, as seen in Fig. 2, the L1 speaker productions were recognized more accurately on average (Coef = 0.73, SE = 0.11, z = 6.36, p < 0.001). In addition, there were effects of condition. Across all combinations of interlocutor and speaker language background, the Confirm Correct productions were recognized less well than the other condition types, with accuracy for these trials lower than the grand mean (Coef = −0.16, SE = 0.06, z = −2.44, p = 0.01), while the Original productions were recognized better, with accuracy above the grand mean (Coef = 0.12, SE = 0.05, z = 2.51, p = 0.01). No other effects, including interactions, reached significance.
IV. DISCUSSION
The present study investigated whether potential speech style differences across human- and Siri-directed speech (“DS”) produced by L1 and L2 talkers would be borne out in differing levels of intelligibility for human and ASR comprehenders.
A. Human- and device-directed registers
First, the present study found no intelligibility differences between device-DS and human-DS for either human listeners or the ASR system. Previous research has found acoustic differences between device-DS and human-DS (Cohn and Zellou, 2021; Raveh et al., 2019; Siegert and Krüger, 2021), including acoustic differences for the L1 productions in the present study reported in Cohn et al. (2022). These acoustic differences show that talkers are making listener-oriented adjustments for human and ASR comprehenders. However, unlike listener-oriented adjustments for L2 speaker-DS (Uther et al., 2007) and hearing-impaired-individual-DS (Picheny et al., 1986), talkers' acoustic adjustments for device-DS do not appear to provide an intelligibility benefit for either human or device transcribers. It may be that the acoustic differences in the present study are not large enough to result in an intelligibility benefit, or that these acoustic differences are simply not the right kind to be helpful (Aoki and Zellou, 2023b). There is also precedent for acoustic modifications in L2 enhanced speech not resulting in an intelligibility benefit for listeners (Kato and Baese-Berk, 2023).
One limitation of the present study is that talkers produced their speech neither to a human nor to a voice-AI system, but instead to a computer program with recorded human and Siri voices. Scarborough and Zellou (2013) demonstrated that talkers show measurable differences in their speech when talking to real versus imagined interlocutors. The current methodology, which found no intelligibility differences between human- and device-directed speech, might be obscuring register or speech style differences that would emerge in more naturalistic interactions with humans versus voice-AI systems.
Overall, the L1 talkers were more intelligible than the L2 talkers (to both L1 listeners and ASR comprehenders), a well-established difference [e.g., Bent and Bradlow (2003) and Moussalli and Cardoso (2017)]. There was also no difference in the intelligibility of human- vs Siri-DS across L1 and L2 speakers for either human or ASR comprehenders. This suggests that the L1 and L2 groups might be using similar strategies to target the two different interlocutors with their speech styles.
B. Effect of production condition
Additionally, the present study found that the local context, namely whether a speaker had just been heard correctly or misunderstood, shaped intelligibility. Here, patterns differed between human and ASR comprehenders. Human listeners showed limited intelligibility benefits for the Coda Repair condition, while the ASR comprehender recognized the Original productions more accurately than those from any other condition.
Human listeners were more accurate, on average, in their transcriptions of productions which occurred after the interlocutor misheard the coda of the target word (Coda Repair condition). This is consistent with the H&H model prediction that articulatory effort increases when listener understanding is a priority, i.e., after a misunderstanding has occurred (Lindblom, 1990). In this case, L1 and L2 talkers were successful in modifying their speech in a way that is beneficial to listeners when repeating productions after a coda misunderstanding.
However, the ASR comprehender did not show an intelligibility benefit for productions after a coda misunderstanding. This indicates that the ASR system was not able to make use of either L1 or L2 talkers' acoustic enhancements when they made an error repair. It may be that the Original productions are closer than hyperarticulated repair productions to the citation-style forms on which the ASR was trained. It could also be that coda repairs led talkers to enhance vowel nasality (given the high proportion of CVN targets), and that human listeners benefited from this nasality cue while the ASR system did not. Future work could tease apart the cause of this lack of intelligibility benefit for the ASR system. Whatever the cause, while self-repetition can be a useful and effective strategy for L2 speakers to overcome misunderstandings during interactions with machines (Lee, 2009; Jepson, 2005), repetition is not enough to overcome all communication breakdowns, and when it fails, L2 speakers often fall back on other negotiation tactics when interacting with voice-AI, such as rephrasing (Moussalli and Cardoso, 2020). The findings of the present study support prior findings that repetition alone is not always an effective strategy for either L1 or L2 users to resolve communication breakdowns with voice-AI. We note, though, that one limitation of the present study is that we only examined L1 Mandarin-L2 English speakers; studies of L2 English speakers with other L1 backgrounds are an important future direction.
Interestingly, for productions after the interlocutor misheard the vowel of the target word, human listeners showed the opposite effect: listeners were actually less accurate in their transcriptions in the Vowel Repair condition, while the ASR comprehender showed no increase or decrease for this condition. This finding for the human listeners is consistent with the effects of the background noise masking the target signal; for example, related work has shown that vowels, and other sonorant elements, are more robust to white noise than consonants (Gordon-Salant, 1985; Wardrip-Fruin, 1982, 1985). In the Vowel Repair condition, participants produced the target words after the interlocutor had correctly understood the consonants (e.g., “bed” vs “bode”). Therefore, they might have produced more hypo-articulated consonants given this shared context (cf. H&H theory), and these hypo-articulated consonants may have been disproportionately masked by the white noise, leading to the reduction in intelligibility for human listeners. Future studies could use multi-talker babble, which obscures vowels and consonants more equally (Gordon-Salant, 1985), to investigate differences in intelligibility changes after vowel-error versus coda-error feedback.
Overall, while L1 talkers were transcribed more accurately than L2 talkers, there were some interactions between production condition and language background for human listeners. Human listeners were less accurate, on average, for L1 productions in the Confirm Correct condition. This may be because L1 talkers were more comfortable using a more casual, less effortful speech style once the interlocutor had indicated understanding, which would suggest that L1 talkers flexibly adapt their productions to the communicative needs of an interaction.
An additional limitation is the pseudo-interactive design of the experiment: the feedback participants received did not actually depend on their productions and was instead predetermined. The high error rate in the experiment may have led to enhancement across all conditions, obscuring differences [although there is some evidence that this changes over time (Cohn et al., 2022)]. Alternatively, participants may have felt that their initial productions were unlikely to have been misunderstood in the way indicated (e.g., heard with a different vowel). Future work with authentic listener feedback can allow us to explore these possibilities further.
C. Societal implications
More broadly, the finding that the ASR system recognizes speech from individuals of one language background less accurately than another adds to a concerning body of prior work showing discrepancies in ASR accuracy across speakers of different language varieties (Adda-Decker and Lamel, 2005; Tatman and Kasten, 2017; Koenecke et al., 2020). Unequal ASR performance across speakers with different language backgrounds has major societal implications. In particular, as voice-AI systems are increasingly used for everyday tasks and incorporated into areas ranging from language learning to clinical and emergency technologies, language-based inequalities in device performance will increasingly exacerbate inequalities in everyday life. Recent work has also shown that users of varieties that are more likely to be misunderstood by ASR systems report feeling that such technologies are not made for them (Mengesha et al., 2021). ASR systems can be successfully trained to recognize variation in accents (Nigmatulina et al., 2020), and both time and resources should be spent to ensure that voice-AI systems do not further perpetuate systematic social inequalities through unequal access to resources.
V. CONCLUSION
Speakers are dynamic, and communication breakdowns can trigger them to modify their speech in ways that boost intelligibility for listeners. Overall, L2 English speech was less intelligible to both human and ASR comprehenders. Yet, the L1 and L2 groups were similar in showing no intelligibility difference between human-DS and device-DS for either human listeners or the Siri ASR. While coda enhancement resulted in limited intelligibility boosts for human listeners, these enhancements were not effective for the Siri ASR.
ACKNOWLEDGMENTS
This material is based upon work supported by the National Science Foundation SBE Postdoctoral Research Fellowship under Grant No. 1911855 to M.C. Thank you to our undergraduate researchers who assisted with data collection for the project: Marlene Andrade, Jazmina Chavez, Melina Sarian, Divine Otico, Patricia Sandoval, and Eleanor Lacaze.
AUTHOR DECLARATIONS
Conflict of Interest
The authors report funding from the National Science Foundation and employment at Google Inc. (provided by Magnti) for M.C. No other conflicts of interest are reported.
Ethics Approval
Informed consent was obtained from all participants, and the study was approved by the UC Davis Institutional Review Board.
DATA AVAILABILITY
The data that support the findings of this study, including full model outputs, are openly available in an Open Science Framework (OSF) repository for the paper (OSF, 2024).