This paper examines the adaptations African American English speakers make when imagining talking to a voice assistant, compared to a close friend/family member and to a stranger. Results show that speakers slowed their rate and produced less pitch variation in voice-assistant-directed speech (“DS”), relative to human-DS. These adjustments were not mediated by how often participants reported experiencing errors with automatic speech recognition. Overall, this paper addresses a limitation in the types of language varieties explored when examining technology-DS registers and contributes to our understanding of the dynamics of human-computer interaction.
1. Introduction
Increasingly, people engage with voice technology (e.g., Google Assistant, Amazon's Alexa, Apple's Siri; Ammari et al., 2019) to complete everyday tasks. Yet, interactions with voice technology at times involve errors in generating and understanding speech. For example, automatic speech recognition (ASR) systems, which convert the spoken signal to text, in some cases mistranscribe speakers’ utterances (Ngueajio and Washington, 2022). As a result, people often assume that technology will have more difficulty understanding them than a (normal hearing) adult listener would (Oviatt et al., 1998; Cohn et al., 2022).
Speakers across language and dialect varieties have long been shown to “style shift” (Rickford and McNair-Knox, 1994; Labov, 2001), adapting their productions based on a number of social factors (e.g., addressee type, social closeness). For example, speakers make distinct acoustic-phonetic adjustments, known as “registers,” for infants (Fernald, 1992; Narayan and McDermott, 2016), non-native speakers (Uther et al., 2007), and hearing-impaired listeners (Picheny et al., 1986). A growing body of work provides evidence for a technology-directed speech (“DS”) register as well. Compared to human-DS, technology-DS often includes increased intensity (loudness) (Raveh et al., 2019; Cohn et al., 2022; Lunsford et al., 2006; Siegert and Krüger, 2021), a slower speech rate (Cohn and Zellou, 2021), longer segment durations (Burnham et al., 2010; Mayo et al., 2012), and, in some cases, decreased pitch variation (Mayo et al., 2012; Cohn et al., 2022). Yet, prior research examining technology-DS registers has largely focused on mainstream varieties of US English, British English, and German (Cohn et al., 2022; Cohn et al., 2021a; Mayo et al., 2012; Raveh et al., 2019), but not on speaker groups more commonly misunderstood by technology.
The current study addresses this gap, comparing technology- and human-DS adjustments made by African American English (AAE) speakers. AAE is a language variety spoken in the United States, with roots in the African American and Black communities (Wolfram and Thomas, 2008; Mufwene et al., 2021; Rickford et al., 2015). There is a large body of work documenting the rule-governed and systematic phonetic, prosodic, and lexico-syntactic features of AAE, with differences along these dimensions both from mainstream varieties of US English (e.g., Rickford, 1999; Thomas, 2007; Bailey and Thomas, 1998; Thomas and Lanehart, 2015) and across varieties of AAE (e.g., across regions; Wolfram, 2007; Holliday, 2019). In a seminal study, Koenecke et al. (2020) demonstrated that across five commercially available ASR systems, AAE speakers were misunderstood at a higher rate than mainstream US English speakers; this finding has been replicated across other ASR systems as well (e.g., Wassink et al., 2022; Martin and Tang, 2020; Lai and Holliday, 2023; for a review, see Ngueajio and Washington, 2022). Because ASR error rates can be higher for AAE speakers, these errors can lead to downstream linguistic discrimination in technology (e.g., in healthcare, jobs; for a discussion, see Martin and Wright, 2023).
The current paper tests AAE technology-DS and human-DS registers for two acoustic properties: speech rate and pitch variation (variation in fundamental frequency, f0), paralleling related work in human-human (e.g., Narayan and McDermott, 2016) and human-computer interaction (e.g., Cohn et al., 2021a; Cohn et al., 2021b; Cohn et al., 2022). On the one hand, theories of technology equivalence (e.g., Nass et al., 1997; Lee, 2008) propose that people apply behaviors from human-human interaction to technology. For example, people mirror the emotional expressiveness of both a human and a text-to-speech (TTS) voice, producing parallel changes in duration and pitch variation (Cohn et al., 2021b). One possibility, in line with technology equivalence, is that speakers will show no differences in their rate and pitch variation across technology- and human-DS. On the other hand, routinized interaction theories of human-computer interaction (Gambino et al., 2020) propose that people have a “routinized” way of engaging with technology based on their real experiences with the systems. For example, Mayo et al. (2012) found that a British English male speaker produced longer utterances with a reduced pitch range when imagining talking to a computer, compared to reading target sentences “plainly.” Cohn et al. (2022) found that, when talking to a TTS voice (Apple's Siri), California English speakers were often louder and displayed less pitch variation, compared to when talking to a human voice. In line with routinized interaction theories, one prediction is that AAE speakers will produce features consistent with technology-DS and will produce even larger technology- and human-DS distinctions if they report being misunderstood by ASR more frequently.
2. Methods
The current study tests three conditions (voice assistant-, familiar human-, unfamiliar human-DS), comparing two features: speech rate and pitch variation. Across the three conditions, participants produced utterances in response to the identical set of prompts, providing a controlled context for the comparisons.
2.1 Participants
A total of n = 21 participants were recruited through Dope Labs,1 an agency that leads recruitment and facilitation of interviews with underrepresented communities. Recruitment criteria included identifying as Black or African American, currently living in the United States, being between 18 and 64 years old, indicating that their first language is English, and reporting prior experience with voice technology, including experience with errors.2 Following King (2020), the current study used an inclusive definition of AAE speakers: any speaker within the African American community.
Data were excluded for one participant whose recordings had a high level of background noise and one participant who did not complete the entire study. The retained sample consisted of n = 19 participants (n = 10 female, n = 8 male, n = 1 nonbinary; age range: 18–55 years3) from several regions across the United States (n = 2 East Coast, n = 3 Midwest, n = 8 South, n = 6 West Coast4). All participants reported using a voice assistant or speech-to-text frequently (n = 15 “a few times a day,” n = 1 “once a day,” n = 3 “a few times a week”). All participants had experience with at least one voice assistant (n = 19 Siri, n = 18 Google Assistant), as well as with speech-to-text (n = 14 sending a message on a social media service; e.g., WhatsApp, Facebook Messenger). All participants completed informed consent and were compensated for their time.
All participants reported prior experience with ASR errors. In response to the question, “When using voice technology, how often do errors occur?” n = 7 responded “most of the time,” while n = 12 responded “some of the time.” When asked about the reasons for these errors, participants most commonly indicated that the system “does not understand the way I speak” (n = 12), “misinterprets what I want” (n = 11), and “does not understand my context” (n = 11). Two participants explicitly mentioned the AAE variety as a source of ASR errors (examples shown in Table 1).
2.2 Procedure
Participants completed the experiment from a quiet room in their homes via a Zoom web-conferencing call with one of n = 7 Dope Labs researchers (all identifying as Black/African American). First, participants were asked questions about their experiences with voice assistants, including “When using voice technology, how often do errors occur?” on a Likert scale (“Never,” “Some of the time,” “Most of the time,” or “Every time”). Next, participants completed three imagined addressee conditions (in separate blocks5): (1) voice assistant-DS (“Imagine you are talking to your voice assistant”); (2) familiar human-DS (“Imagine talking to a close friend or family member”); (3) unfamiliar human-DS (“Imagine talking to a stranger, such as a person in your community, or a person you just met”).
In each addressee condition, participants produced directives based on a set list of prompts (e.g., “Imagine you are talking to your voice assistant; how would you ask your assistant for ‘The weather for a future date in a specific location’?”). Prompts were selected to cover five categories common to interactions with voice assistants (example shown in Table 2; full list provided in the supplementary material, Table S1):
- Answering questions (n = 10; e.g., “The weather for a future date in a specific location”).
- Getting stuff done (n = 11; e.g., “Create calendar event for a future date with others”).
- Playing media (n = 10; e.g., “Play a specific song”).
- Communicating (n = 10; e.g., “Call a friend”).
- Just for fun (n = 10; e.g., “Hear an impression of a given animal”).
Table 1. Example responses to “Describe the most common failure you experience when using voice technology. What are the main reasons you believe these failures or errors occur?”

| Participant | Response |
|---|---|
| P3 | “When trying to send texts with voice typing I run into a lot of errors and sometimes have to retype my messages because the voice tech doesn't understand my slang (AAVE) or misspells names and phrases. This mostly happens when I try to send texts to other black people because I speak differently with them than I do with everyone else.” |
| P12 | “Sometimes if I ask them to call someone it will call the wrong person, or isn't understanding my directions fully.” |
| P25 | “I have error with my friends and family's names. I also have issues when trying to help my grandfather use Google Assistant. He exclusively speaks in AAVE so sometimes Google doesn't understand. I also have issues playing certain songs with ‘slang titles.’ ” |
| P26 | “The technology is not understanding my voice, my vocal tone.” |
Table 2. Example queries for the weather prompt, by addressee condition.

| Condition | Query |
|---|---|
| Voice-assistant-DS | Hey Google, what's the weather in LA for tomorrow? |
| Familiar-DS | Hey dad, do you know the weather for tomorrow? |
| Unfamiliar-DS | Hi, excuse me. Do you know the weather in LA for tomorrow? |
Each participant produced n = 153 utterances (n = 51 in each addressee condition: assistant-, familiar-, unfamiliar-DS). In total, the study took roughly 1 h.
2.3 Acoustic analysis
The audio tracks were extracted and participants’ utterances were segmented with ffmpeg. The first author listened to all of the experimental trials (n = 2894) and annotated the presence of noise, other speakers, interference by the interviewer, or laughter; these trials (n = 697) were excluded from the analysis. The retained trials (n = 2197) were acoustically analyzed with Parselmouth (Jadoul , 2018), the Python extension of Praat (Boersma and Weenink, 2021).
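For illustration, the extraction and segmentation step might look like the following minimal Python sketch; the paper reports only that ffmpeg was used, so the file names, timestamps, and audio settings here are hypothetical.

```python
# Hypothetical sketch of the ffmpeg extraction/segmentation step (the paper
# does not report the exact invocation); file names, timestamps, and audio
# settings are illustrative assumptions.
import subprocess

def extract_utterance(recording, start_s, end_s, out_wav):
    """Trim one utterance from a session recording to a mono 16 kHz WAV."""
    subprocess.run(
        ["ffmpeg",
         "-i", recording,      # e.g., the Zoom session file
         "-ss", str(start_s),  # utterance start (seconds)
         "-to", str(end_s),    # utterance end (seconds)
         "-vn",                # drop any video stream
         "-ac", "1",           # mono
         "-ar", "16000",       # 16 kHz sample rate
         out_wav],
        check=True)

extract_utterance("p03_session.mp4", 12.4, 15.9, "p03_trial_001.wav")
```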
Over each utterance, several measurements6 were taken, including speaking rate (mean number of syllables per second; adapted from De Jong et al., 2017). For pitch measurements, fundamental frequency (f0) was measured over the utterance (at ten equidistant points; DiCanio, 2007), based on plausible maxima and minima by speaker gender (female, 100–350 Hz; male, 60–210 Hz; nonbinary, 60–350 Hz). The f0 values were then converted to semitones (base = 75 Hz), and values more than three standard deviations from a speaker's mean were excluded; trial-level f0 variation was then calculated from the retained values (cf. Cohn et al., 2022; Cohn and Zellou, 2021).
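As an illustration of these measurements, a minimal Parselmouth sketch is given below; the file name is hypothetical, and computing trial f0 variation as a standard deviation is an assumption (the paper reports only that f0 variation was calculated).

```python
# Minimal Parselmouth sketch of the per-trial f0 measurements described
# above; the file name is hypothetical, and the standard deviation is an
# assumed operationalization of "f0 variation."
import numpy as np
import parselmouth
from parselmouth.praat import call

def f0_semitones(wav_path, floor_hz, ceiling_hz, base_hz=75.0, n_points=10):
    """f0 at ten equidistant points over the utterance, in semitones re base_hz."""
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch(pitch_floor=floor_hz, pitch_ceiling=ceiling_hz)
    times = np.linspace(snd.xmin, snd.xmax, n_points + 2)[1:-1]  # interior points
    f0 = np.array([call(pitch, "Get value at time...", t, "Hertz", "linear")
                   for t in times])
    f0 = f0[~np.isnan(f0)]             # drop unvoiced points
    return 12 * np.log2(f0 / base_hz)  # Hz -> semitones re 75 Hz

# Example: one trial from a female speaker, using the 100-350 Hz search range
st = f0_semitones("p03_trial_001.wav", floor_hz=100, ceiling_hz=350)

# After pooling a speaker's semitone values to obtain their mean and SD,
# exclude points more than 3 SD from the speaker mean and summarize the trial:
def trial_f0_variation(st_values, speaker_mean, speaker_sd):
    kept = st_values[np.abs(st_values - speaker_mean) < 3 * speaker_sd]
    return float(np.std(kept))
```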
2.4 Statistical analysis
Each acoustic feature (speech rate, pitch variation) was analyzed in a separate linear mixed effects model with the lme4 R package (Bates et al., 2015). Fixed effects included Addressee (three levels: voice assistant-DS, familiar human-DS, unfamiliar human-DS; treatment coded, reference = voice-assistant-DS), Reported Error Frequency (two levels: most of the time, some of the time; sum coded), and their interaction. Additionally, the fixed effect of Proportion of Experiment (time-normalized; centered) was included in each model. Random effects included by-Participant random intercepts and by-Participant random slopes for Addressee and Proportion of Experiment.
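The models were fit in R with lme4; as a rough Python stand-in for the specification above, a statsmodels MixedLM sketch might look as follows (the data file and column names are hypothetical).

```python
# Rough Python stand-in for the lme4 specification described above; the
# original analysis used R's lme4, and the file/column names here are
# hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("trials.csv")  # one row per retained trial (hypothetical file)

model = smf.mixedlm(
    # Treatment coding with voice-assistant-DS as the reference level, sum
    # coding for Reported Error Frequency, their interaction, and centered
    # Proportion of Experiment:
    "rate ~ C(addressee, Treatment('assistant')) * C(error_freq, Sum)"
    " + prop_experiment",
    data=df,
    groups=df["participant"],  # by-participant random intercepts
    # by-participant random slopes for Addressee and Proportion of Experiment:
    re_formula="~C(addressee, Treatment('assistant')) + prop_experiment",
)
result = model.fit()
print(result.summary())
```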
Model comparisons were used to assess the inclusion of demographic predictors: Age (two levels:7 younger (18–34 years) and older (35–55 years); sum coded), Gender (female, male, nonbinary; sum coded), and Geographic Region (East Coast, Midwest, South, West Coast; sum coded) in each model. Using backward selection, the model that improved fit, relative to the increase in parameterization, was retained [as assessed by the corrected Akaike Information Criterion (AICc) with the MuMIn R package; Bartoń, 2017]. In cases of singularity or convergence errors, the random effects structure was simplified following Barr (2013) and Cohn et al. (2022). Finally, the collinearity of the predictors was assessed with the performance R package (Lüdecke et al., 2021).
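As a hedged Python analogue of these checks (the paper used R's MuMIn and performance packages), the small-sample corrected AIC and a VIF-based collinearity check could be sketched as follows; the design formula and data frame are hypothetical, as above.

```python
# Hedged Python analogues of the R-based checks described above (MuMIn's
# AICc; performance's collinearity check); formula and data are hypothetical.
import pandas as pd
import patsy
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("trials.csv")  # as in the previous sketch (hypothetical file)

def aicc(llf, k, n):
    """Small-sample corrected AIC: -2*llf + 2k + 2k(k+1)/(n - k - 1)."""
    return -2 * llf + 2 * k + (2 * k * (k + 1)) / (n - k - 1)

# Collinearity check mirroring the VIF < 5 criterion reported in Sec. 3
# (the intercept column's VIF can be ignored):
X = patsy.dmatrix("C(addressee) + C(error_freq) + prop_experiment",
                  df, return_type="dataframe")
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns)}
print(vifs)
```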
3. Results
Model outputs, including the retained model structures, are provided in the supplementary material (Tables S2 and S3). Predictors in the retained models all had low collinearity (all variance inflation factors < 5). Summarized raw acoustic measurements are plotted in Figs. 1 and 2.
The speech rate model showed effects of Addressee. As seen in Fig. 1, relative to voice assistant-DS, speech rate was faster in familiar human-DS [Coef = 0.81, t = 7.86, p < 0.001] and unfamiliar human-DS [Coef = 1.41, t = 8.28, p < 0.001]. While there was no effect of Reported Error Frequency and no interactions with Addressee, there was an effect of Proportion of Experiment: over the course of the experiment, participants tended to slow their speech rate [Coef = −0.96, t = −3.72, p < 0.001].
The pitch variation model revealed effects of Addressee. As seen in Fig. 2, there was more pitch variation in familiar human-DS [Coef = 0.36, t = 2.90, p < 0.01] and unfamiliar human-DS [Coef = 0.49, t = 2.94, p < 0.01] than in assistant-DS. There was no effect of Reported Error Frequency, no interactions with Addressee, and no effect of Proportion of Experiment.
4. Discussion
The current study tested AAE speakers’ adjustments in rate and pitch variation when imagining talking to a voice assistant, compared to imagining talking to a friend/family member or a stranger. First, there were consistent adaptations for voice technology: speech directed toward an imagined voice assistant was slower and had less pitch variation, suggesting that speakers “style shift” (Rickford and McNair-Knox, 1994; Labov, 2001) between technology- and human-DS. These consistent adjustments provide support for routinized interaction theories of human-computer interaction (Gambino et al., 2020).
In related work, a slower speech rate has been observed in response to a misunderstanding (e.g., Stent et al., 2008) as well as for addressees with communicative barriers (for a review, see Smiljanić and Bradlow, 2009), including voice assistants (e.g., Cohn and Zellou, 2021). In the current study, speakers produced faster speech when imagining talking to a friend or family member (∼5.2 syllables per second) than when imagining talking to a voice assistant (∼4.6 syllables per second), consistent with prior work examining differences in clear and casual speech (Smiljanić and Bradlow, 2005).
Observing less pitch variation in voice-assistant-DS is also consistent with findings in technology-DS (e.g., Cohn et al., 2022; Mayo et al., 2012). In Cohn et al. (2022), California English speakers produced more monotone speech when talking to a Siri voice, compared to a human voice, with a 0.24 semitone decrease in pitch range, paralleling the pitch reduction observed in the present study (0.28 semitones) when directing utterances to a voice assistant.
Mayo et al. (2012) also found a reduced pitch range in speech directed toward an imagined computer, compared to when reading a sentence “plainly.” While speculative, these adjustments might stem from convergence toward the stereotype of voice assistants, which are often reported as sounding “robotic,” “monotone,” and “emotionless” (Siegert and Krüger, 2021; Cohn et al., 2023). For example, in human-human interaction, speakers adopt idealized patterns of their interlocutor, even if those features are not actually present (Wade, 2022). In human-computer interaction, speakers have been shown to adopt the pronunciation patterns of TTS voices as well (e.g., Cohn et al., 2023; Cohn et al., 2021b; Gessinger et al., 2021; Raveh et al., 2019), suggesting that convergence toward the expectation of a voice assistant might be part of the technology-DS register observed in the current study.
While the higher error rate for ASR systems transcribing AAE has been well documented in the literature (e.g., Koenecke et al., 2020), participants’ reported frequency of ASR errors did not shape speakers’ technology-DS adaptations in the current study. One possibility is that the two levels, with reported ASR errors occurring either “most of the time” or “some of the time,” were not distinct enough experiences to warrant differing degrees of technology-DS adaptation. As all participants reported frequent ASR errors, their collective adaptation strategies might instead reflect a more systematized way of engaging with technology.
Comparing voice-assistant-DS to the familiar- and unfamiliar-DS conditions, there were also differences observed, paralleling style shifting in human-human interactions (Rickford and McNair-Knox, 1994; Labov, 2001). Voice-assistant-DS was slowest, followed by familiar human-DS, and then unfamiliar human-DS. While more formal contexts can lead to a slower speech rate (e.g., a request to a supervisor in Duran et al., 2023), faster speech in the current study was observed toward an unfamiliar community member. Here, one difference is that participants were making a request of a peer. One possible explanation is that participants provided more context surrounding the request to someone they did not know (e.g., “Hi there, sorry to bother you. My phone just died. Would you mind if I used your phone to call my mom?”). Accordingly, participants might have spoken more quickly to convey respect for the (imagined) stranger's time. At the same time, pitch variation was larger for both human-DS registers than for voice-assistant-DS, to a similar degree, illustrating that speakers independently adjust rate and pitch variation as they shift across different addressee styles.
This experiment has several limitations that can be addressed in future work. First, the current study examined one variety of English, AAE, but there are many other language varieties that are commonly misunderstood by ASR both within the United States (e.g., Wassink et al., 2022) and other countries (e.g., Markl, 2022). Future work exploring additional dialects and languages is needed for a more complete understanding of the effects of ASR errors on speakers’ adaptations for speech technology. Second, the current study examined two features, speech rate and pitch variation, to probe the effects of a technology-DS register for AAE speakers. Yet, AAE has distinct prosodic, segmental, lexical, and syntactic differences from mainstream US English (e.g., Rickford, 1999; Thomas and Lanehart, 2015) that speakers might adopt when talking to technology. As each individual did not produce an identical set of sentences, it is possible that the specific lexical and syntactic choices in the present study could have further shaped speech rate and pitch variation adaptations to imagined addressees. Future studies comparing naturalistic productions, as well as controlled sets of target utterances, could shed light on these potential interactions.
Third, participants completed the experiment from their homes and their technological set-ups varied. For example, most participants did not have an external, head-mounted microphone. Furthermore, each participant completed the study from a different space, including in the presence of other types of background noise (note, however, that trials containing any noise were removed from analysis). Future studies comparing in-lab and at-home adaptations can probe whether technology-DS adaptations are consistent across more controlled environments.
Fourth, participants in the current study only completed imagined scenarios. Related research has shown that phonetic adjustments for a “real” addressee can vary from those in imagined contexts (e.g., Scarborough and Zellou, 2013), suggesting that the register adjustments might be even larger in a real context. For example, future studies examining how AAE speakers correct an error made by an ASR system can probe how code switching (e.g., between AAE and mainstream US English) and technology-DS registers interact. There is work showing that AAE speakers report code-switching toward mainstream US English with technology (Mengesha et al., 2021; Harrington et al., 2022), suggesting that inequalities in voice technology shape the way AAE speakers engage with it in more naturalistic interactions as well. Overall, this paper addresses a limitation in the types of language varieties explored when examining technology-DS registers, and contributes to our understanding of the dynamics of human-computer interaction in an increasingly technological world.
Supplementary Material
See the supplementary material for model outputs and stimuli.
Acknowledgment
Thank you to Dope Labs for assistance with data collection.
Author Declarations
Conflict of Interest
The authors report employment at Google Inc. (provided by Magnit for M.C.). No other conflicts of interest are reported.
Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Footnotes
1. See https://dopelabs.org.
2. That is, responding with anything other than “Never” to “When using voice technology, how often do errors occur?”
3. n = 8, 18–24 year-olds; n = 6, 25–34 year-olds; n = 4, 35–44 year-olds; n = 1, 45–55 year-old.
4. Geographical region was identified based on the state in which participants indicated living.
5. Note that the order of blocks was fixed for n = 18 participants (unfamiliar-, familiar-, assistant-DS) and differed slightly for n = 1 participant (familiar-, unfamiliar-, assistant-DS). Therefore, Proportion of Experiment (time-normalized) was included as a predictor in all models to account for changes over time.
6. Note that intensity was not measured in the current study, as Zoom auto-normalized intensity to 70 dB in the call.
7. Age was binned based on the age range each participant identified as being in; the “Younger” age group included participants who identified as being in the 18–24 and 25–34 age ranges, while the “Older” age group included participants in the 35–44 and 45–55 age ranges.