Speakers tailor their speech to different types of interlocutors. For example, speech directed to voice technology has different acoustic-phonetic characteristics than speech directed to a human. The present study investigates the perceptual consequences of human- and device-directed registers in English. We compare two groups of speakers: participants whose first language is English (L1) and bilingual L1 Mandarin-L2 English talkers. Participants produced short sentences in several conditions: an initial production and a repeat production after a human or device guise indicated either understanding or misunderstanding. In experiment 1, a separate group of L1 English listeners heard these sentences and transcribed the target words. In experiment 2, the same productions were transcribed by an automatic speech recognition (ASR) system. Results show that transcription accuracy was higher for L1 talkers for both human and ASR transcribers. Furthermore, there were no overall differences in transcription accuracy between human- and device-directed speech. Finally, while human listeners showed an intelligibility benefit for coda repair productions, the ASR transcriber did not benefit from these enhancements. Findings are discussed in terms of models of register adaptation, phonetic variation, and human-computer interaction.

Recently, with the growing prevalence of voice-activated artificially intelligent (voice-AI) devices, humans are increasingly talking with technology. This new type of interlocutor is able to both understand and generate fluent speech, albeit in ways that differ from humans. For example, automatic speech recognition (ASR) systems may make transcription errors (Koenecke et al., 2020), and device interlocutors are perceived as less competent (Oviatt et al., 1998; Cohn et al., 2022; Cowan et al., 2015). Previous research has demonstrated that people interact differently with computer interlocutors than they do with other humans. For instance, acoustic differences have been found between human-directed speech and computer-directed speech by speakers in their native first language [e.g., Cohn et al. (2022), Cohn and Zellou (2021), and Siegert and Krüger (2021)], but the consequences of these adjustments for intelligibility are understudied, particularly across different language varieties.

As voice technology increases in use, more people engage with voice-AI in their second language(s) (L2) to complete everyday tasks (e.g., setting calendar reminders, sending text messages) and to learn new languages [e.g., Tai and Chen (2022)]. While users tend to behave as if devices have a lower communicative competence than human interlocutors (Cowan et al., 2015; Cohn et al., 2022), L2 speakers, in particular, face additional barriers when communicating with technology (Song et al., 2022). ASR systems, which transcribe spoken language into text and underpin voice-AI technology, have systematic biases against non-mainstream language varieties [e.g., for ethnolects in Koenecke et al. (2020) and Wassink et al. (2022); L2 speakers in Chan et al. (2022), Choe et al. (2022), Song et al. (2022), and Moussalli and Cardoso (2017)]. For example, Chan et al. (2022) found that ASR transcription was less accurate for L2 than L1 English speakers when using an English language model. They additionally found differences based on L1 language type: L2 speakers whose L1 is a tonal language (e.g., Mandarin) were even less accurately transcribed, but there was less of a decline for L2 speakers who learned English earlier. As related work has shown that ASR biases can be internalized by speakers (Mengesha et al., 2021), there is a need for more studies exploring disparities across L1 and L2 speech directed toward devices. The current study tests how L1 and L2 English speakers adapt their speech when talking to technology, compared to talking to another person, and the perceptual consequences of those adaptations for both human and machine comprehenders.

The remainder of the introduction reviews speakers' adaptations for context (Sec. I A), perceptual consequences of those adaptations (Sec. I B), and technology-directed registers (Sec. I C). Section I D outlines the present study, research questions, and hypotheses.

Speakers tailor productions to different communicative situations, environments, and listeners. The hyper- and hypo-articulation (H&H) model proposes that this variation results from a real-time balance between two competing demands on the talker: minimizing articulatory effort while maximizing intelligibility for the listener (Lindblom, 1990). This results in speaking style variation which may range from hypospeech (casual or plain speech) to hyperspeech (clear or enhanced speech). When intelligibility is not a priority, speakers place greater emphasis on minimizing articulatory effort, producing casual speech. On the other hand, when there is a higher probability of not being understood, speakers maximize intelligibility at the cost of greater articulatory effort, producing clear speech. Under the H&H theory, talkers producing clear speech modify their speech with the goal of being more intelligible to a listener, such as changing their speech rate, intensity, and vowel productions (Bradlow et al., 2003; Ferguson and Kewley-Port, 2007; Krause and Braida, 2004). These adaptations may occur in a variety of situations where greater intelligibility is called for, including after a misunderstanding has occurred, or when speakers are interacting with interlocutors whom they judge to have lower communicative competence or different perceptual needs.

In this paper, we focus on the types of clear speech adaptations that are directed to specific types of listeners. Speakers use different speech styles, or “registers,” to communicate with different types of interlocutors, adjusting their speech in consistent ways according to an interlocutor's characteristics and presumed needs (Clark and Murphy, 1982). In this paper, we use the term register to refer to the different ways that we talk to different interlocutors. This includes infant-directed speech (DS) (Fernald et al., 1989), non-native speaker-DS (Uther et al., 2007; Aoki and Zellou, 2023a), and hearing-impaired individual-DS (Picheny et al., 1986; Aoki and Zellou, 2023b), which have, at times, distinct acoustic adaptations compared to speech directed to L1, normal-hearing adult listeners. L2 speakers also produce register differences, talking to different interlocutors in distinct, listener-targeted ways. For example, child-DS shares many similar features cross-linguistically (Fernald et al., 1989), and child-DS produced by L2 speakers shares characteristics with child-DS produced by L1 speakers, including a slower speech rate as compared to adult-DS (Fish et al., 2017). Additionally, Kato and Baese-Berk (2022) found acoustic modifications between plain speech and hearing-impaired-individual-DS produced by L2 talkers, although the size of the enhancements was smaller for L2 talkers with lower proficiency. Taken together, these findings suggest that interlocutor-specific acoustic-phonetic registers, such as device-DS, are a cross-linguistic phenomenon. This leads us to ask whether L2 speakers produce device-DS in similar or different ways from L1 speakers.

In addition to adaptations for different types of addressees, speakers may also produce targeted adaptations based on feedback during interactions. For example, the presence of phonetically similar competitors (e.g., “bill” and “pill”) in a communicative context leads speakers to strategically enhance relevant segments acoustically (Buz et al., 2016; Seyfarth et al., 2016). Furthermore, feedback that a misunderstanding has taken place increases the size of this enhancement (Buz et al., 2016; Cohn and Zellou, 2021; Cohn et al., 2022). As with L1 talkers, L2 talkers also enhance speech in the presence of acoustic competitors (Kato and Baese-Berk, 2021), suggesting we might observe parallel adjustments in the current study across our bilingual and monolingual groups.

Speech produced in different speaking styles often has consequences for listeners in terms of intelligibility (Aoki et al., 2022; Aoki and Zellou, 2023c). For example, some work has shown improved intelligibility when listening to nonnative-speaker-DS (Kangatharan et al., 2023) and hearing-impaired-individual-DS (Picheny et al., 1985) as compared to speech directed to an L1, normal-hearing listener. Yet, the majority of studies examining the consequences of register adaptations have focused on L1 speakers, and few studies have examined the perceptual consequences of L2 register variation. In one study, Kato and Baese-Berk (2022) found that L1 English listeners did show an intelligibility benefit from L2 English (L1 Mandarin) hearing-impaired-individual-DS. This intelligibility benefit varied by talker proficiency, with a greater benefit of hearing-impaired-individual-DS produced by higher-proficiency talkers as compared to lower-proficiency talkers. On the other hand, Kato and Baese-Berk (2023) found no intelligibility benefit from L2 English (L1 Mandarin) hearing-impaired-individual-DS for L1 listeners. Despite sizable acoustic differences between enhanced and plain speech produced by L2 English (L1 Mandarin) talkers, neither higher- nor lower-proficiency talkers' acoustic modifications resulted in a clear speech intelligibility benefit for L1 English (human) listeners, suggesting that acoustic modifications do not always lead to an intelligibility gain for listeners.

Since the study of the intelligibility consequences of L2 register variation is limited, insights can be taken from general research on L2 clear speech, which is often elicited by instructing talkers to imagine that they are speaking to someone with hearing loss or a non-native speaker. Several studies have found that L2 clear speech provides an intelligibility benefit to L1 listeners (Smiljanić and Bradlow, 2011; Jung and Dmitrieva, 2023). Rogers et al. (2010) also found that L1 listeners benefited from L2 English (L1 Spanish) clear speech. This intelligibility benefit was greater for talkers who had an earlier age of immersion, with a greater clear speech benefit produced by L1 Spanish talkers who had moved to an English-speaking culture before the age of 12 as compared to talkers who moved after the age of 15. In summary, it seems that while L2 hyperspeech can provide intelligibility benefits for L1 listeners, the size of the intelligibility benefit varies by proficiency and age of immersion.

Beyond specific registers, targeted segmental speech enhancements may also result in intelligibility benefits. Kato (2020) investigated the perceptual consequences of acoustic modifications produced by L2 English (L1 Mandarin) talkers when a minimal pair competitor was or was not present in the production context. For example, participants read aloud “click on the peer now” when the words peer, beer, and town were on the screen. In a forced-choice perception task, L1 English listeners identified the target word more accurately for productions where a competitor was present in the production context (i.e., when talkers were making acoustic modifications to differentiate the target word from the competitor). However, this perceptual benefit only held for high-proficiency L2 talkers and not for lower-proficiency L2 talkers. Furthermore, this effect was only found for contrasts that exist only in the L2 (the /i/-/ɪ/ vowel contrast and the word-final /p/-/b/ consonant contrast). This study provides an example of a perceptual benefit in L2 speech when there is a need to enhance or differentiate specific segments.

Although some registers, including nonnative-speaker-DS and hearing-impaired-individual-DS, seem to afford intelligibility benefits for all listeners, not all registers may have a universal intelligibility benefit. Infant-DS has been shown to aid in infant word learning (Singh et al., 2009) and word segmentation (Thiessen et al., 2005). Yet, despite this benefit for infant listeners, words in infant-DS may be less intelligible to adults than words in adult-DS (Bard and Anderson, 1983). This suggests that some registers might provide an intelligibility benefit for the target listener, while non-target listeners experience a decrease in intelligibility.

A growing body of work has examined how people talk to technology, i.e., computer-DS (Oviatt et al., 1998; Burnham et al., 2010; Mayo et al., 2012). Although many modern voice-AI devices, such as Apple's Siri, are able to interact with users using speech in ways that are similar to interactions with humans, many users hold beliefs that AI systems are less communicatively competent than humans (Siegert and Krüger, 2018; Cohn et al., 2022; Oviatt et al., 1998). These beliefs may result from prior experience with voice technology [e.g., Gambino et al. (2020)], including experience of voice technology misunderstanding their speech (Koenecke et al., 2020). These perceptions that device versus human addressees have differing perceptual needs could result in different registers for speech directed to these different types of interlocutors.

Indeed, acoustic differences have been found between device-DS and human-DS, with studies often reporting that device-DS is louder than human-DS (Raveh et al., 2019; Siegert and Krüger, 2021; Cohn et al., 2022). The directionality of findings for other acoustic characteristics distinguishing device- and human-DS is mixed [see Cohn et al. (2022) for a review]. Some studies report no difference in duration (Raveh et al., 2019; Siegert and Krüger, 2021) and others report a slower rate for voice-AI-DS (Cohn et al., 2021). Raveh et al. (2019) reported a higher mean f0 for voice-AI-DS, while Cohn et al. (2022) reported a lower mean f0 for voice-AI-DS. Cohn and Zellou (2021) reported a larger f0 range for voice-AI-DS, while Cohn et al. (2022) reported a smaller f0 range for voice-AI-DS. Additionally, Cohn et al. (2022) reported that f0 range in voice-AI-DS increased over time, becoming more similar to human-DS, while mean f0 decreased over time for voice-AI-DS, diverging from human-DS.

It is possible that the acoustic differences between device-DS and human-DS registers may have perceptual consequences for listeners. Whether device-DS is more or less intelligible than human-DS to human listeners and ASR comprehenders is an open question, which will be explored in the present study.

The present study takes a perceptual approach to extend Cohn et al. (2022), which examined the acoustic adjustments L1 English speakers make in a pseudo-interactive task with a human and a voice assistant (Apple's Siri). Cohn et al. (2022) used a purely acoustic approach to probe register differences. The present study builds upon that approach by examining the intelligibility of speech directed toward device and human addressees for each type of intended comprehender (human listeners vs the Siri ASR). In addition, we investigate whether the talker's language background shapes these effects.

We selected a subset of L1 English data from Cohn et al. (2022) and collected additional data from L1 Mandarin, L2 English speakers who completed the identical experiment. In the experiment, participants were told they were interacting with either a human or Siri and read a sentence off the screen. The interlocutor displayed the word it understood on the screen and asked, via a pre-recorded human or Siri voice, for verbal confirmation of whether its perception was correct. The participant then repeated the sentence to either confirm or repair the human's/Siri's understanding. In the present study, L1 and L2 productions collected under this paradigm were transcribed by L1 English human listeners and an ASR transcriber.

The present study examines the registers of device-DS and human-DS, as well as the effect of production condition (original production, confirmation of a correct understanding, repair of a misheard coda, and repair of a misheard vowel). We also investigate the perception of speech enhancements and device-DS as produced by L1 and L2 speakers. Thus, there were three variables: Speaker Language Background (L1, L2), Addressee (Human, Siri), and condition (Original, Confirm Correct, Coda Repair, Vowel Repair).

The first research question is whether there is an intelligibility difference for the device-DS register as compared to the human-DS register. It may be that humans perceive voice technology as having unique perceptual needs that require greater articulatory effort to accommodate; if so, speech produced in a device-DS register may be more intelligible than speech produced in a human-DS register, similar to nonnative-speaker-DS and hearing-impaired-individual-DS. This would align with Mayo et al. (2012), who found that (imagined) computer-DS had the most accurate transcription by native English listeners, while infant-DS was the least intelligible.

The second research question is whether the production condition within the pseudo-interactive task (i.e., production after a misunderstanding) results in an intelligibility difference for listeners. According to the H&H model, talkers expend more articulatory effort when listener understanding is a priority (Lindblom, 1990). Accordingly, speech produced after there has been a misunderstanding in conversation (i.e., Coda Repair, Vowel Repair) might be expected to be more enhanced and thus more intelligible, while a repeat production after the listener has already indicated understanding (i.e., Confirm Correct) might be expected to be more reduced, and thus less intelligible.

The third research question is whether talker language background and speaking style have an impact on intelligibility. Previous research has shown that L2 speech is less intelligible than L1 speech to both L1 listeners (Smiljanić and Bradlow, 2007) and devices (Moussalli and Cardoso, 2017). However, it may be that the effect of language background on intelligibility differs across speaking styles, as was found by Kato and Baese-Berk (2022). For instance, we may find smaller differences in intelligibility by speaking style for L2 talkers as compared to L1 talkers, due to less experience with the language.

Finally, we also ask whether the three previously discussed factors of Addressee, condition, and Language Background result in intelligibility differences for two different types of listeners: L1 human listeners and a device comprehender, in this case the Apple ASR system. If listeners experience an intelligibility benefit when listening to a register intended for a listener of that type, then human listeners will find human-DS more intelligible than device-DS, while ASR systems will find the device-DS register more intelligible than human-DS. On the other hand, although there are acoustic differences between device-DS and human-DS, those differences may not be sufficient to induce intelligibility differences for listeners. This may especially be the case for L2 speech; Kato and Baese-Berk (2023) found that although L2 speakers produced acoustic clear speech modifications, those differences did not result in an intelligibility benefit for listeners. Thus, it may be that there is no register difference for intelligibility.

Experiment 1 investigates the intelligibility of device-DS and human-DS for L1 English human listeners. A recording task (Sec. II A) was used to elicit productions for the transcription task (Sec. II B). The methodology of the recording task followed that of experiment 2 (higher error rate) in Cohn et al. (2022).

1. Talkers

Nine L1 Mandarin talkers were recruited from the UC Davis psychology subject pool to participate in a recording task. These talkers were a subset of talkers from a larger study with many language backgrounds. Recordings from two talkers were later removed because of technical difficulties in their data collection (leading to empty recordings), leaving a total of seven L1 Mandarin talkers (4 female, 3 male). The participants had an age range of 18–23, with a mean age of 20 years. The talkers reported beginning to learn English between the ages of 4 and 16 (mean age 8.9) and moving to the U.S. between the ages of 12 and 19 (mean age 15.7).

Participants completed a survey about their usage of voice-AI devices (Ammari et al., 2019). Six of the seven L1 Mandarin talkers had used Siri before, all of whom reported using Siri once a week or less. They reported using Siri for music (n = 3), setting alarms (n = 3), and looking up information (n = 2). One participant reported using Alexa once a week or less to play music.

Recordings of nine L1 English talkers from Cohn et al. (2022), also recruited from the UC Davis psychology subject pool, were age- and gender-matched to the nine originally recruited L1 Mandarin talkers (6 female, 3 male). The participants had an age range of 18–22, with a mean age of 19.9 years.

All nine L1 English talkers reported having used Siri before, and using Siri at a rate of once a week or less. They used Siri to find information (n = 6), call or text contacts (n = 4), set reminders (n = 1), or change phone settings (n = 1). Six participants reported using other voice-AI systems: Google (n = 2), Alexa (n = 1), Cortana (n = 1), and unspecified (n = 3). Of these, two participants reported using voice-AI systems once or twice daily, while one participant reported using a voice-AI system hourly.

2. Materials

Both short recordings and images were used to create human and Siri guises for the recording task. An adult female speaker recorded the materials for the human guise, while Siri text-to-speech was used for the Siri guise. A short introduction was recorded for each guise (“I am Siri. I am a digital assistant on Apple products”/“I am Melissa. I work here in the Phonetics Lab”), as well as task instructions. Four short responses were generated to give initial feedback to the talkers (e.g., “Is this the word?”), as well as four short responses indicating understanding (e.g., “Okay, got it”). See Appendix C of Cohn et al. (2022) for the complete list of productions. In addition to the recordings, a photo of a cell phone open to Siri and a stock photo of an adult female were also presented.

This task used 44 sentences, each containing a monosyllabic target word in the frame “The word is ____.” Half of these target words had the form CVC, where the final consonant was a voiced oral stop at a bilabial or alveolar place of articulation. The other half of the target words had the form CVN. These were matched with the CVC words and differed only in having a final nasal stop at the same place of articulation (e.g., paid and pain). For the full list of words, see the Target CVC and Target CVN lists in Appendix B in Cohn et al. (2022). An additional 12 sentences, also containing monosyllabic target words, were included in the experiment [see the Target CiC and Target CaC word lists in Appendix B in Cohn et al. (2022)]. These 12 sentences were used in Cohn et al. (2022) to examine the vowel space of talkers, but were not used as stimuli in the current study.

3. Procedure

Talkers from both L1 and L2 speaker groups participated in the recording task from Cohn et al. (2022). After completing informed consent, speakers were instructed about the nature of the task and told they would be interacting with Siri and a person. They also received auditory instructions along with an introduction from both Siri and the recorded human voice. These instructions were given at the start of the human-directed elicitation block (human voice) and the Siri-directed elicitation block (Siri voice). The experiment was conducted using E-Prime (Psychology Software Tools, 2016).

Each trial consisted of multiple turns with one of two types of interlocutors: a mock-human interlocutor with a recorded human voice, or a mock-device interlocutor with the voice of the voice-AI system Siri. In each trial, speakers first read a sentence off the screen (“The word is pain”). Next, a word was shown on the screen (e.g., PINE) and the interlocutor asked for confirmation that this was the word the speaker had intended to produce (e.g., “Is this correct?”). If the word was “heard” incorrectly, the incorrect word differed from the target word in either its vowel or its coda. Next, the speaker used a button box to indicate whether the interlocutor had heard them correctly or not, then produced the sentence a second time (“The word is pain”), whereupon the interlocutor indicated understanding (e.g., “Okay, got it”).

The recordings were thus produced under four different conditions. The Original condition contains the productions made the first time participants read the sentence off the screen. The Confirm Correct condition contains productions made after the interlocutor indicated correct understanding of the talker's production. The Coda Repair condition refers to productions made after the interlocutor indicated that it had heard a word with an incorrect coda (e.g., “paid” instead of “pain”); likewise, the Vowel Repair condition refers to productions made by the talker after the interlocutor indicated that it had heard a word with an incorrect vowel (e.g., “pine” instead of “pain”).

The Siri-interlocutor and human-interlocutor trials were presented in two separate blocks, and the order of blocks was counterbalanced across participants. In each block, 12 vowel space trials were followed by 44 target trials. In these target trials, each of the 44 CVC or CVN words appeared once. There were 16 trials in the Confirm Correct condition, 14 in Coda Repair, and 14 in Vowel Repair. Participants thus produced four recordings of each target word: an original and a repeat production for each of the two interlocutors (Human vs Siri). Trials alternated between CVC and CVN words, with words randomly selected from each group. Both the rate and type of errors were identical across the human and Siri blocks and occurred on identical trials (note that sentences were pseudorandomized in lists).

Overall, each talker made 224 productions: (12 vowel space trials + 44 target trials) * 2 productions * 2 interlocutors; the vowel space trials were not included in the analysis for the present study. In total, across all 16 talkers, 2816 instances of the target words were elicited.
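For concreteness, the counts above can be verified with a short arithmetic check in R; this is a sketch, and the variable names are ours rather than the authors':

n_talkers <- 9 + 7                 # 9 L1 English + 7 retained L1 Mandarin talkers
vowel_space_trials <- 12
target_trials <- 44
productions_per_trial <- 2         # original + repeat
interlocutors <- 2                 # Human, Siri
per_talker <- (vowel_space_trials + target_trials) * productions_per_trial * interlocutors  # 224
total_target_tokens <- n_talkers * target_trials * productions_per_trial * interlocutors    # 2816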

1. Listeners

A total of 177 L1 English listeners were recruited from the UC Davis psychology subject pool (2 nonbinary, 78 female, 37 male; mean age 19.7 years, age range 18–36) to complete the task. All listeners specified English as their L1 and as their strongest language.

2. Materials

The sentences from the recording task were amplitude-normalized to 60 dB (Cohn et al., 2021) and embedded in white noise at a signal-to-noise ratio of +3 dB (Zellou et al., 2022), and then the entire sound file was amplitude-normalized to 60 dB in Praat. Sound files with missing words, coughs, yawns, and other artifacts were removed. If one sound file was missing for a given trial (e.g., the original production for Siri-directed “pain”), then the other sound file from that trial was also removed (e.g., the second production for Siri-directed “pain”). In total, 124 sound files were removed, leaving a total of 154–174 (mean 168.3) productions per talker.
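The noise-mixing step can be illustrated with a minimal sketch in R; this is an assumed stand-in (base R, Gaussian white noise, peak renormalization) for the authors' Praat procedure, not their actual script:

# Mix a waveform with white noise at a +3 dB signal-to-noise ratio, then rescale.
set.seed(1)  # arbitrary seed, for reproducibility of the illustration only
mix_with_noise <- function(signal, snr_db = 3) {
  noise <- rnorm(length(signal))                      # white Gaussian noise
  p_signal <- mean(signal^2)                          # signal power
  p_noise  <- mean(noise^2)                           # noise power
  # scale the noise so that 10 * log10(p_signal / p_noise_scaled) equals snr_db
  noise <- noise * sqrt(p_signal / (p_noise * 10^(snr_db / 10)))
  mixture <- signal + noise
  mixture / max(abs(mixture))  # peak normalization as a stand-in for the 60 dB scaling in Praat
}
# usage: noisy <- mix_with_noise(waveform_samples, snr_db = 3)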

3. Procedure

Listeners completed the transcription task remotely using Qualtrics. They began with an audio calibration step, in which they listened to the sentence “Bill heard they asked about the host” and adjusted their audio settings to a comfortable level. They then identified the last word of the sentence from a set of phonological competitor options, to confirm that they could hear the audio at a comfortable listening level. In the experimental trials, they heard each noise-mixed sentence once and then typed the final word that they heard in a text box (“The word is: _____”). Each listener heard recordings from only one talker. Trials were presented in random order. Depending on which talker the listeners were assigned, they transcribed between 154 and 174 recordings. Each talker was transcribed by a minimum of four listeners.

4. Statistical analysis

Transcription accuracy was coded as binomial data (correct = 1, incorrect = 0). Homophones and obvious spelling mistakes were coded as accurate. Nonwords that were judged by an L1 English speaker to have the same pronunciation as the target word were also coded as accurate (“boad” or “boed” for “bode”; “boen” for “bone”; “cawd” for “cod”; “daude,” “dawed,” “dawd,” or “dod” for “Dodd”; “lawd,” “lawed,” “lod,” or “lodd” for “laud”; “rubb” for “rub”; “schide,” “shide,” “shyd,” or “shyed” for “shied”; “sudd” for “sud”). These nonwords with comparable pronunciations were included in the intelligibility analysis because these types of responses provide evidence that the listener perceived what the talker intended to produce. Responses that were more than one syllable or word (e.g., “where's lawn” for “lawn,” “courseload” for “load,” “your side” for “side”) were coded as correct if the final syllable met the same criteria as a monosyllabic response.
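A minimal sketch of this coding scheme in R is given below; the object names and the excerpt of the accepted-spelling list are hypothetical, and the final-word check is a simplification of the final-syllable criterion described above:

# Score a typed response as 1 (correct) or 0 (incorrect) against a target word,
# allowing accepted spellings and judging multiword answers by their final word.
accepted <- list(bode = c("boad", "boed"), bone = c("boen"), cod = c("cawd"))  # excerpt only
score_response <- function(response, target) {
  resp  <- tolower(gsub("[[:punct:]]", "", trimws(response)))
  final <- tail(strsplit(resp, "\\s+")[[1]], 1)
  as.integer(final %in% c(target, accepted[[target]]))
}
# usage: score_response("where's lawn", "lawn")  # returns 1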

Accuracy was modeled with a mixed-effects logistic regression model using the lme4 R package (Bates et al., 2015). The model contained fixed effects of Speaker Language Background (L1, L2), Addressee (Human, Siri), and condition (Original, Confirm Correct, Coda Repair, Vowel Repair). All effects were sum-coded. Random intercepts were included for each word, listener, and speaker, as well as Addressee and condition slopes for each speaker and listener. The lmer syntax for this model structure is provided in Eq. (1). This original model structure resulted in singularity and convergence errors. The random effects structure of the model was systematically simplified until the model converged, following Barr et al. (2013) and Cohn et al. (2022). The final retained model structure is provided in Eq. (2). All model outputs, including models with releveled factors, are provided in an OSF repository (OSF, 2024):
(1) Accuracy ~ Language_Background * Addressee * Condition + (1 + Addressee + Condition | Speaker) + (1 + Addressee + Condition | Listener) + (1 | Word)
(2)
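As a rough illustration, the full model in Eq. (1) could be specified in lme4 as sketched below, adding the sum coding and binomial link described above; the data frame d and its column names are hypothetical, and this is not the authors' script:

library(lme4)
# sum-code the three predictors (assumed to be factors in a hypothetical data frame d)
contrasts(d$Language_Background) <- contr.sum(2)
contrasts(d$Addressee)           <- contr.sum(2)
contrasts(d$Condition)           <- contr.sum(4)
m_full <- glmer(Accuracy ~ Language_Background * Addressee * Condition +
                  (1 + Addressee + Condition | Speaker) +
                  (1 + Addressee + Condition | Listener) +
                  (1 | Word),
                data = d, family = binomial)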

Figure 1 shows the proportion of keywords correctly transcribed by L1 English listeners, split by language background of the talker (L1 or L2) and by each of the four conditions (original production, confirmation of a correct understanding, repair of a misheard coda, and repair of a misheard vowel). The left panel shows Human-DS and the right panel shows Siri-DS.

FIG. 1.

(Color online) Experiment 1. Proportion of keywords correctly transcribed by human listeners by speaker language background (L1, L2) and production condition (Original, Confirm Correct, Coda Repair, Vowel Repair) as a function of interlocutor (left panel: Human; right panel: Siri). Error bars represent standard errors.


The model showed no main effect of Addressee, i.e., human- versus device-DS (p = 0.14). As seen in Fig. 1, there was an effect of language background, with transcription accuracy higher for L1 speakers on average (Coef = 1.00, SE = 0.18, z = −5.60, p < 0.001).

There were several effects of condition. Although accuracy for trials in the Original condition did not differ from the grand mean (p = 0.06), accuracy for trials in the Confirm Correct condition was lower than the grand mean (Coef = −0.11, SE = 0.04, z = −3.19, p < 0.01), as seen in Fig. 1. As for the conditions after the addressee misunderstood the talker, accuracy for the Coda Repair condition was higher than the grand mean (Coef = 0.16, SE = 0.04, z = 4.27, p < 0.001), while accuracy for the Vowel Repair condition was lower than the grand mean (Coef = −0.10, SE = 0.04, z = −2.63, p = 0.01).

The model also showed interactions between talker Language Background and condition, specifically for the Confirm Correct and Coda Repair conditions. L1 talkers' productions in the Confirm Correct condition were transcribed less accurately on average (Coef = −0.08, SE = 0.04, z = −2.27, p = 0.02). On the other hand, L1 talkers' productions in the Coda Repair condition were transcribed more accurately on average (Coef = 0.08, SE = 0.04, z = 2.09, p = 0.04).

There were no other significant effects, including no significant interaction between Addressee and condition, no interaction between Addressee and talker Language Background, and no three-way interaction among these factors. The full model output is provided in the OSF repository for the project.

Experiment 2 investigates the intelligibility of device-DS and human-DS for an ASR transcriber to test the hypothesis that device-DS would be more intelligible for ASR than human-DS.

A single audio file was created which concatenated the auditory stimuli from experiment 1. Each stimulus was separated with a keyword (the word “start” produced by the “Joanna” voice from Amazon Polly) so that it would be easier to segment the text after transcription. The audio file thus alternated between stimulus and keyword (“The word is pain,” “Start,” “The word is todd,” “Start,” “The word is paid,” “Start,”…). One second of silence was included between each sentence and keyword. The stimulus order was randomized and the stimuli were not masked by background noise.
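A minimal sketch of this concatenation step is shown below using the tuneR package in R; the package choice and file names are assumptions for illustration and not the authors' tooling:

library(tuneR)
files   <- list.files("stimuli", pattern = "\\.wav$", full.names = TRUE)  # hypothetical folder
keyword <- readWave("start_joanna.wav")    # "start" keyword from the Amazon Polly "Joanna" voice
sr      <- keyword@samp.rate
gap     <- Wave(left = rep(0, sr), samp.rate = sr, bit = 16)              # 1 s of silence
pieces  <- list()
for (f in sample(files)) {                 # randomized stimulus order
  pieces <- c(pieces, list(readWave(f), gap, keyword, gap))
}
writeWave(do.call(bind, pieces), "asr_input.wav")                         # single concatenated file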

Transcription was then conducted by routing the audio signal internally through Loopback (Matuschak, n.d.) on a Mac computer (macOS Catalina, Version 10.15.7). There were four steps to the internal transcription process, which were loosely based on an online tutorial (Geerling, 2022). First, a new “device” was created within Loopback and linked to QuickTime Player. Second, dictation was activated in System Preferences. Third, the file was opened in QuickTime Player and the “Play” button was clicked. Fourth, Microsoft Word for Mac (Version 16.661) was opened and the “Dictate” button was clicked. The sentences appeared in real time within a Microsoft Word document as the audio played. After transcription was complete, the sentences were copied and pasted into a TextEdit file for further analysis.

The sentences were separated from each other using strsplit() in R. With a custom-made script, the sentence-final keywords were then extracted and manually corrected (e.g., if the target word was “dodd,” the homophone “dod” was manually altered). Final keyword accuracy was coded binomially as correct (=1) or incorrect (=0), and a strict coding criterion was employed, such that all affixes needed to be in place for a response to be counted as correct (e.g., “lie” was considered incorrect if the correct response was “lied”). As in experiment 1, homophones were considered to be correct.
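The splitting and scoring steps could look roughly like the following sketch in R; the file name and the target_keywords vector are assumed for illustration:

# Split the dictated transcript on the "start" separator and score each sentence-final keyword.
transcript <- tolower(paste(readLines("dictation_output.txt"), collapse = " "))
sentences  <- trimws(unlist(strsplit(transcript, "\\bstart\\b", perl = TRUE)))
sentences  <- sentences[sentences != ""]
final_word <- function(s) tail(strsplit(s, "\\s+")[[1]], 1)
asr_keywords <- gsub("[[:punct:]]", "", vapply(sentences, final_word, character(1)))
# strict 1/0 scoring against the intended targets (after homophone/affix hand-correction)
accuracy <- as.integer(asr_keywords == target_keywords)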

A mixed-effects logistic regression model was employed to analyze the data (Bates et al., 2015). The model contained fixed effects of Speaker Language Background (L1, L2), Addressee (Human, Siri), and condition (Confirm Correct, Coda Repair, Vowel Repair, Original). All effects were sum-coded. By-word and by-speaker random intercepts were added to the model, as well as by-speaker random slopes for Addressee, following the model in experiment 1. The model structure is provided in Eq. (3):
(3) Accuracy ~ Language_Background * Addressee * Condition + (1 + Addressee | Speaker) + (1 | Word)
Note that the model for automatic speech recognition (experiment 2) has a different structure from the model for human listener recognition (experiment 1). Experiment 1 had more than one listener (warranting by-Listener random effects), while experiment 2 had only one “listener” (meaning that by-Listener random effects would be inappropriate). Given these differences, the results in experiments 1 and 2 were not directly compared statistically.

Figure 2 shows aggregated accuracy proportions by speaker language background, word condition, and interlocutor. The full model output is provided in the OSF repository for this project.

FIG. 2.

(Color online) Experiment 2. Proportion of keywords correctly transcribed by the Apple automatic speech recognition (ASR) system by speaker language background (L1, L2) and word condition (Confirm Correct, Coda Repair, Vowel Repair, Original) as a function of interlocutor (left panel: Human; right panel: Siri). Error bars represent standard errors.


There were three significant effects. First, as seen in Fig. 2, the L1 speaker productions were recognized more accurately on average (Coef = 0.73, SE = 0.11, z = 6.36, p < 0.001). In addition, the effect of condition was significant. Across all combinations of interlocutor and speaker language background, the Confirm Correct productions were not understood as well as the other condition types, as accuracy for these trials was lower than the grand mean (Coef = –0.16, SE = 0.06, z = –2.44, p = 0.01), while the Original productions were understood better, as accuracy for these trials was above the grand mean (Coef = 0.12, SE = 0.05, z = 2.51, p = 0.01). No other effects, including interactions, reached significance.

The present study investigated whether potential speech style differences across human- and Siri-directed speech (“DS”) produced by L1 and L2 talkers would be borne out in differing levels of intelligibility for human and ASR comprehenders.

First, the present study found no intelligibility differences across device-DS and human-DS for either human listeners or the ASR system. Previous research has found acoustic differences between device-DS and human-DS (Cohn and Zellou, 2021; Raveh et al., 2019; Siegert and Krüger, 2021), including for the L1 productions analyzed in the present study (Cohn et al., 2022). These acoustic differences show that talkers are making listener-oriented adjustments for human and ASR comprehenders. However, unlike listener-oriented adjustments for L2 speaker-DS (Uther et al., 2007) and hearing-impaired-individual-DS (Picheny et al., 1986), talkers' acoustic adjustments for device-DS do not appear to provide an intelligibility benefit for either human or device transcribers. It may be that the acoustic differences in the present study are not large enough to result in an intelligibility benefit, or that these acoustic differences are simply the wrong differences to be helpful (Aoki and Zellou, 2023b). There is also precedent for acoustic modifications in L2 enhanced speech not resulting in an intelligibility benefit for listeners (Kato and Baese-Berk, 2023).

One limitation of the present study is that talkers produced their speech neither to a human nor to a voice-AI system, but instead to a computer program with recorded human and Siri voices. Scarborough and Zellou (2013) demonstrated that talkers show measurable differences in their speech when talking to real versus imagined listeners. The current methodology, which found no intelligibility differences between human- and device-directed speech, might therefore be obscuring register or speech style differences that would emerge in more naturalistic interactions with humans versus voice-AI systems.

Overall, the L1 talkers were more intelligible than L2 talkers (to both L1 and ASR comprehenders), a well-established difference [e.g., Bent and Bradlow (2003) and Moussalli and Cardoso (2017)]. There was also no difference in the intelligibility of human- vs Siri-DS across L1 and L2 speakers for both humans and ASR systems. This suggests that L1 and L2 groups might be using similar strategies to target the two different interlocutors with their speech styles.

Additionally, the present study found that the local context, i.e., whether a speaker had just been heard correctly or misunderstood, shaped intelligibility. Here, patterns differed between human and ASR comprehenders. Human listeners showed limited intelligibility benefits for the Coda Repair condition, while the ASR comprehender showed a benefit for the Original productions relative to the other conditions.

Human listeners were more accurate, on average, in their transcriptions of productions which occurred after the interlocutor misheard the coda of the target word (Coda Repair condition). This is consistent with the H&H model prediction that articulatory effort increases when listener understanding is a priority, i.e., after a misunderstanding has occurred (Lindblom, 1990). In this case, L1 and L2 talkers were successful in modifying their speech in a way that is beneficial to listeners for repeat productions after a coda misunderstanding.

However, the ASR comprehender did not experience an intelligibility benefit for productions after a coda misunderstanding. This indicates that the ASR system was not able to make use of either L1 or L2 talkers' acoustic enhancements when they made an error repair. It may be that the Original productions are closer to the citation forms on which the ASR was trained than the hyperarticulated repair productions are. It could also be that coda misunderstandings resulted in nasalized vowels (given the high proportion of CVNs), and that human listeners benefited from vowel nasality while the ASR system did not. Future work could tease apart the cause of this lack of intelligibility benefit for the ASR system. Whatever the cause, while self-repetition can be a useful and effective strategy for L2 speakers to overcome misunderstandings during interactions with machines (Lee, 2009; Jepson, 2005), repetition is not enough to overcome all communication breakdowns, and when it fails, L2 speakers often fall back on other negotiation tactics when interacting with voice-AI, such as rephrasing (Moussalli and Cardoso, 2020). The findings of the present study add to prior evidence that repetition alone is not always an effective strategy for either L1 or L2 users to resolve communication breakdowns with voice-AI. We note, though, that one limitation of the present study is that we only looked at L1 Mandarin-L2 English speakers; studies of L2 English speakers with other language backgrounds are an important future direction.

Interestingly, for productions after the interlocutor misheard the vowel of the target word, human listeners experienced the opposite effect: listeners were actually less accurate in their transcriptions in the Vowel Repair condition, while the ASR comprehender showed no increase or decrease for this condition. This finding for the human listeners is consistent with the effects of background noise masking the target signal; for example, related work has shown that vowels, and other sonorant elements, are more robust to white noise than consonants (Gordon-Salant, 1985; Wardrip-Fruin, 1982, 1985). In the Vowel Repair condition, participants produced the target words after the interlocutor had correctly perceived the consonants (e.g., “bed” vs “bode”). Therefore, they might have produced more hypo-articulated consonants given this shared context (cf. H&H theory), and these hypo-articulated consonants may have been disproportionately masked by the white noise, leading to the reduction in intelligibility for human listeners. Future studies could use multi-talker babble, which obscures vowels and consonants more equally (Gordon-Salant, 1985), to investigate differences between intelligibility changes after vowel-error versus coda-error feedback.

Overall, while L1 talkers were transcribed more accurately than L2 talkers, there were some interactions between production condition and language background for human listeners. Human listeners were less accurate on average for L1 productions in the Confirm Correct condition. This may be because L1 talkers were more comfortable using a more casual or less effortful speech style once the interlocutor had indicated understanding. This would suggest that L1 talkers' productions are more flexibly adapted to the communicative needs of an interaction.

An additional limitation is the pseudo-interactive design of the experiment: the feedback participants received was not actually dependent on their productions and was instead predetermined. The high error rate in the experiment may have led to enhancement across all conditions, obscuring differences [although there is some evidence that this changes over time (Cohn et al., 2022)]. Alternatively, participants may have felt that their initial productions were unlikely to have been misunderstood in the way indicated (e.g., as a word with a different vowel). Future work with authentic listener feedback can allow us to explore these possibilities further.

More broadly, the finding that the ASR system less accurately identifies speech from individuals of one language background than another is a further concerning addition to prior work showing discrepancies in ASR accuracy across speakers of different language varieties (Adda-Decker and Lamel, 2005; Tatman and Kasten, 2017; Koenecke et al., 2020). Unequal ASR performance across speakers with different language backgrounds has major societal implications. In particular, since voice-AI systems are increasingly used for everyday tasks and are being incorporated into areas ranging from language learning to clinical or emergency technologies, language-based disparities in device performance will increasingly exacerbate inequalities in everyday life. Recent work has also shown that users of varieties that are more likely to be misunderstood by ASR systems report feeling that such technologies are not made for them (Mengesha et al., 2021). ASR systems can be successfully trained to recognize variation in accents (Nigmatulina et al., 2020), and both time and resources should be spent to ensure that voice-AI systems do not further perpetuate systematic social inequalities through unequal access to resources.

Speakers are dynamic, and communication breakdowns can trigger them to modify their speech in ways that boost intelligibility for listeners. Overall, L2 English was less intelligible than L1 English for both human and ASR comprehenders. Yet, for both L1 and L2 groups, there was no intelligibility difference between human-DS and device-DS for either human or ASR comprehenders. While coda enhancement resulted in limited intelligibility boosts for human listeners, these enhancements were not effective for the Siri ASR.

This material is based upon work supported by the National Science Foundation SBE Postdoctoral Research Fellowship under Grant No. 1911855 to M.C. Thank you to our undergraduate researchers who assisted with data collection for the project: Marlene Andrade, Jazmina Chavez, Melina Sarian, Divine Otico, Patricia Sandoval, and Eleanor Lacaze.

The authors report funding from the National Science Foundation and employment at Google Inc. (provided by Magnti) for M.C. No other conflicts of interest are reported.

Informed consent was obtained from all participants, and the study was approved by the UC Davis Institutional Review Board.

The data that support the findings of this study, including full model outputs, are openly available in an Open Science Framework (OSF) repository for the paper (OSF, 2024).

1. Adda-Decker, M., and Lamel, L. (2005). "Do speech recognizers prefer female speakers?," in Ninth European Conference on Speech Communication and Technology.
2. Ammari, T., Kaye, J., Tsai, J. Y., and Bentley, F. (2019). "Music, search, and IoT: How people (really) use voice assistants," ACM Trans. Comput. Hum. Interact. 26, 1–28.
3. Aoki, N. B., Cohn, M., and Zellou, G. (2022). "The clear speech intelligibility benefit for text-to-speech voices: Effects of speaking style and visual guise," JASA Express Lett. 2, 045204.
4. Aoki, N. B., and Zellou, G. (2023a). "Speakers talk more clearly when they see an East Asian face: Effects of visual guise on speech production," in Proceedings of the 20th International Congress of Phonetic Sciences, Guarant International, Prague, Czech Republic (August 7–11, 2023), pp. 2294–2298.
5. Aoki, N., and Zellou, G. (2023b). "When speaking clearly does not enhance comprehension: Comparing intelligibility of hard-of-hearing- and non-native-directed speech for native and non-native listeners," J. Acoust. Soc. Am. 154, A157.
6. Aoki, N. B., and Zellou, G. (2023c). "When clear speech does not enhance memory: Effects of speaking style, voice naturalness, and listener age," Proc. Mtgs. Acoust. 51, 060002.
7. Bard, E. G., and Anderson, A. H. (1983). "The unintelligibility of speech to children," J. Child Lang. 10, 265–292.
8. Barr, D. J., Levy, R., Scheepers, C., and Tily, H. J. (2013). "Random effects structure for confirmatory hypothesis testing: Keep it maximal," J. Mem. Lang. 68, 255–278.
9. Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015). "Fitting linear mixed-effects models using lme4," J. Stat. Soft. 67, 1–48.
10. Bent, T., and Bradlow, A. R. (2003). "The interlanguage speech intelligibility benefit," J. Acoust. Soc. Am. 114, 1600–1610.
11. Bradlow, A. R., Kraus, N., and Hayes, E. (2003). "Speaking clearly for children with learning disabilities," J. Speech. Lang. Hear. Res. 46, 80–97.
12. Burnham, D. K., Joeffry, S., and Rice, L. (2010). "Computer- and human-directed speech before and after correction," in Proceedings of the 13th Australasian International Conference on Speech Science and Technology, Melbourne, Australia, http://handle.uws.edu.au:8081/1959.7/504796 (Last viewed April 29, 2024), pp. 13–17.
13. Buz, E., Tanenhaus, M. K., and Jaeger, T. F. (2016). "Dynamically adapted context-specific hyper-articulation: Feedback from interlocutors affects speakers' subsequent pronunciations," J. Mem. Lang. 89, 68–86.
14. Chan, M. P. Y., Choe, J., Li, A., Chen, Y., Gao, X., and Holliday, N. (2022). "Training and typological bias in ASR performance for world Englishes," in Proceedings of Interspeech 2022.
15. Choe, J., Chen, Y., Chan, M. P. Y., Li, A., Gao, X., and Holliday, N. (2022). "Language-specific effects on automatic speech recognition errors for world Englishes," in Proceedings of the 29th International Conference on Computational Linguistics, pp. 7177–7186.
16. Clark, H. H., and Murphy, G. L. (1982). "Audience design in meaning and reference," in Advances in Psychology, Language and Comprehension, edited by J.-F. Le Ny and W. Kintsch (Elsevier, The Netherlands), Vol. 9, pp. 287–299.
17. Cohn, M., Ferenc Segedin, B., and Zellou, G. (2022). "Acoustic-phonetic properties of Siri- and human-directed speech," J. Phon. 90, 101123.
18. Cohn, M., Pycha, A., and Zellou, G. (2021). "Intelligibility of face-masked speech depends on speaking style: Comparing casual, clear, and emotional speech," Cognition 210, 104570.
19. Cohn, M., and Zellou, G. (2021). "Prosodic differences in human- and Alexa-directed speech, but similar local intelligibility adjustments," Front. Commun. 6, 675704.
20. Cowan, B. R., Branigan, H. P., Obregón, M., Bugis, E., and Beale, R. (2015). "Voice anthropomorphism, interlocutor modelling and alignment effects on syntactic choices in human-computer dialogue," Int. J. Hum.-Comput. Stud. 83, 27–42.
21. Ferguson, S. H., and Kewley-Port, D. (2007). "Talker differences in clear and conversational speech: Acoustic characteristics of vowels," J. Speech. Lang. Hear. Res. 50, 1241–1255.
22. Fernald, A., Taeschner, T., Dunn, J., Papousek, M., De Boysson-Bardies, B., and Fukui, I. (1989). "A cross-language study of prosodic modifications in mothers' and fathers' speech to preverbal infants," J. Child Lang. 16, 477–501.
23. Fish, M. S., García-Sierra, A., Ramírez-Esparza, N., and Kuhl, P. K. (2017). "Infant-directed speech in English and Spanish: Assessments of monolingual and bilingual caregiver VOT," J. Phon. 63, 19–34.
24. Gambino, A., Fox, J., and Ratan, R. A. (2020). "Building a stronger CASA: Extending the computers are social actors paradigm," Hum. Mach. Commun. 1, 71–85.
25. Geerling, J. (2022). "How to transcribe audio to text using Dictation on a Mac," https://www.jeffgeerling.com/blog/2022/how-transcribe-audio-text-using-dictation-on-mac (Last viewed September 1, 2023).
26. Gordon-Salant, S. (1985). "Some perceptual properties of consonants in multitalker babble," Percept. Psychophys. 38, 81–90.
27. Jepson, K. (2005). "Conversations—and negotiated interaction—in text and voice chat rooms," Language Learn. Technol. 9, 79–98.
28. Jung, Y.-J., and Dmitrieva, O. (2023). "Non-native talkers and listeners and the perceptual benefits of clear speech," J. Acoust. Soc. Am. 153, 137–148.
29. Kangatharan, J., Uther, M., and Gobet, F. (2023). "The effect of clear speech to foreign-sounding interlocutors on native listeners' perception of intelligibility," Speech Commun. 150, 66–72.
30. Kato, M. (2020). "Production and perception of native and non-native speech enhancements," Ph.D. dissertation, University of Oregon, Eugene, OR.
31. Kato, M., and Baese-Berk, M. M. (2021). "Contextually-relevant enhancement of non-native phonetic contrasts," J. Phon. 88, 101099.
32. Kato, M., and Baese-Berk, M. M. (2022). "Perceptual consequences of native and non-native clear speech," J. Acoust. Soc. Am. 151, 1246–1258.
33. Kato, M., and Baese-Berk, M. M. (2023). "The effects of acoustic and semantic enhancements on perception of native and non-native speech," Lang. Speech 67, 40–71.
34. Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J. R., Jurafsky, D., and Goel, S. (2020). "Racial disparities in automated speech recognition," Proc. Natl. Acad. Sci. U.S.A. 117, 7684–7689.
35. Krause, J. C., and Braida, L. D. (2004). "Acoustic properties of naturally produced clear speech at normal speaking rates," J. Acoust. Soc. Am. 115, 362–378.
36. Lee, J. (2009). The Effect of Computer-Mediated Communication (CMC) Interaction on L2 Vocabulary Acquisition: A Comparison Study of CMC Interaction and Face-to-Face Interaction (Iowa State University, Ames, IA).
37. Lindblom, B. (1990). "Explaining phonetic variation: A sketch of the H&H theory," in Speech Production and Speech Modeling (Springer, Dordrecht, The Netherlands), Vol. 55, pp. 403–439.
38. Mayo, C., Aubanel, V., and Cooke, M. (2012). "Effect of prosodic changes on speech intelligibility," in Proceedings of the 13th Annual Conference of the International Speech Communication Association: INTERSPEECH 2012, Portland, Oregon, pp. 1706–1709, available at https://www.researchgate.net/publication/267363352_Effect_of_prosodic_changes_on_speech_intelligibility (Last viewed April 29, 2024).
39. Mengesha, Z., Heldreth, C., Lahav, M., Sublewski, J., and Tuennerman, E. (2021). "'I don't think these devices are very culturally sensitive.'—Impact of automated speech recognition errors on African Americans," Front. Artif. Intell. 4, 725911.
40. Moussalli, S., and Cardoso, W. (2017). "Can you understand me? Speaking robots and accented speech," in CALL in a climate of change: Adapting to turbulent global conditions, short papers from EuroCALL.
41. Moussalli, S., and Cardoso, W. (2020). "Intelligent personal assistants: Can they understand and be understood by accented L2 learners?," Comput. Assisted Language Learn. 33, 865–890.
42. Nigmatulina, I., Kew, T., and Samardzic, T. (2020). "ASR for non-standardised languages with dialectal variation: The case of Swiss German," in Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 15–24.
43. OSF (2024). Open Science Framework repository for the present study, containing the data and full model outputs.
44. Oviatt, S., MacEachern, M., and Levow, G.-A. (1998). "Predicting hyperarticulate speech during human-computer error resolution," Speech Commun. 24, 87–110.
45. Picheny, M. A., Durlach, N. I., and Braida, L. D. (1985). "Speaking clearly for the hard of hearing I: Intelligibility differences between clear and conversational speech," J. Speech. Lang. Hear. Res. 28, 96–103.
46. Picheny, M. A., Durlach, N. I., and Braida, L. D. (1986). "Speaking clearly for the hard of hearing II: Acoustic characteristics of clear and conversational speech," J. Speech. Lang. Hear. Res. 29, 434–446.
47. Psychology Software Tools, Inc. (2016). "E-Prime 3.0," https://support.pstnet.com/ (Last viewed April 29, 2024).
48. Raveh, E., Steiner, I., Siegert, I., Gessinger, I., and Möbius, B. (2019). "Comparing phonetic changes in computer-directed and human-directed speech," in Elektronische Sprachsignalverarbeitung 2019, Studientexte zur Sprachkommunikation (Electronic Speech Signal Processing 2019, Study Texts on Speech Communication) (TUDpress, Dresden, Germany), pp. 42–49.
49. Rogers, C. L., DeMasi, T. M., and Krause, J. C. (2010). "Conversational and clear speech intelligibility of /bVd/ syllables produced by native and non-native English speakers," J. Acoust. Soc. Am. 128, 410–423.
50. Scarborough, R., and Zellou, G. (2013). "Clarity in communication: 'Clear' speech authenticity and lexical neighborhood density effects in speech production and perception," J. Acoust. Soc. Am. 134, 3793–3807.
51. Seyfarth, S., Buz, E., and Jaeger, T. F. (2016). "Dynamic hyperarticulation of coda voicing contrasts," J. Acoust. Soc. Am. 139, EL31–EL37.
52. Siegert, I., and Krüger, J. (2018). "How do we speak with Alexa: Subjective and objective assessments of changes in speaking style between HC and HH conversations," Kognitive Systeme 2018 (Cognitive Systems 2018).
53. Siegert, I., and Krüger, J. (2021). "'Speech melody and speech content didn't fit together'—Differences in speech behavior for device directed and human directed interactions," in Advances in Data Science: Methodologies and Applications, Intelligent Systems Reference Library (ISRL), 1st ed. (Springer, Cham, Switzerland), Vol. 189, pp. 65–95.
54. Singh, L., Nestor, S., Parikh, C., and Yull, A. (2009). "Influences of infant-directed speech on early word recognition," Infancy 14, 654–666.
55. Smiljanić, R., and Bradlow, A. R. (2007). "Clear speech intelligibility: Listener and talker effects," in Proceedings of the XVIth International Congress of Phonetic Sciences, Saarbrücken, Germany.
56. Smiljanić, R., and Bradlow, A. R. (2011). "Bidirectional clear speech perception benefit for native and high-proficiency non-native talkers and listeners: Intelligibility and accentedness," J. Acoust. Soc. Am. 130, 4020–4031.
57. Song, J. Y., Pycha, A., and Culleton, T. (2022). "Interactions between voice-activated AI assistants and human speakers and their implications for second-language acquisition," Front. Commun. 7, 995475.
58. Tai, T.-Y., and Chen, H. H.-J. (2022). "The impact of intelligent personal assistants on adolescent EFL learners' listening comprehension," Comput. Assist. Lang. Learn. (published online).
59. Tatman, R., and Kasten, C. (2017). "Effects of talker dialect, gender and race on accuracy of Bing Speech and YouTube automatic captions," in Proceedings of Interspeech, pp. 934–938.
60. Thiessen, E. D., Hill, E. A., and Saffran, J. R. (2005). "Infant-directed speech facilitates word segmentation," Infancy 7, 53–71.
61. Uther, M., Knoll, M. A., and Burnham, D. (2007). "Do you speak E-NG-L-I-SH? A comparison of foreigner- and infant-directed speech," Speech Commun. 49, 2–7.
62. Wardrip-Fruin, C. (1982). "On the status of temporal cues to phonetic categories: Preceding vowel duration as a cue to voicing in final stop consonants," J. Acoust. Soc. Am. 71, 187–195.
63. Wardrip-Fruin, C. (1985). "The effect of signal degradation on the status of cues to voicing in utterance-final stop consonants," J. Acoust. Soc. Am. 77, 1907–1912.
64. Wassink, A. B., Gansen, C., and Bartholomew, I. (2022). "Uneven success: Automatic speech recognition and ethnicity-related dialects," Speech Commun. 140, 50–70.
65. Zellou, G., Lahrouchi, M., and Bensoukas, K. (2022). "Clear speech in Tashlhiyt Berber: The perception of typologically uncommon word-initial contrasts by native and naive listeners," J. Acoust. Soc. Am. 152, 3429–3443.