This paper examines the adaptations African American English speakers make when imagining talking to a voice assistant, compared to a close friend/family member and to a stranger. Results show that speakers slowed their rate and produced less pitch variation in voice-assistant-directed speech ("DS"), relative to human-DS. These adjustments were not mediated by how often participants reported experiencing errors with automatic speech recognition. Overall, this paper addresses a limitation in the types of language varieties explored when examining technology-DS registers and contributes to our understanding of the dynamics of human-computer interaction.

Increasingly, people engage with voice technology (e.g., Google Assistant, Amazon's Alexa, Apple's Siri; Ammari et al., 2019) to complete everyday tasks. Yet, interactions with voice technology at times involve errors in generating and understanding speech. For example, automatic speech recognition (ASR) systems, which convert the spoken signal to text, in some cases mistranscribe speakers' utterances (Ngueajio and Washington, 2022). As a result, people often assume that technology will have more difficulty understanding them than a (normal-hearing) adult listener would (Oviatt et al., 1998; Cohn et al., 2022).

Speakers across language and dialect varieties have long been shown to "style shift" (Rickford and McNair-Knox, 1994; Labov, 2001), adapting their productions based on a number of social factors (e.g., addressee type, social closeness). For example, speakers make distinct acoustic-phonetic adjustments, known as "registers," for infants (Fernald, 1992; Narayan and McDermott, 2016), non-native speakers (Uther et al., 2007), and hearing-impaired listeners (Picheny et al., 1986). A growing body of work provides evidence for a technology-directed speech ("DS") register as well. Compared to human-DS, technology-DS often includes increased intensity (loudness) (Raveh et al., 2019; Cohn et al., 2022; Lunsford et al., 2006; Siegert and Krüger, 2021), a slower speech rate (Cohn and Zellou, 2021), longer segment durations (Burnham et al., 2010; Mayo et al., 2012), and, in some cases, reduced pitch variation (Mayo et al., 2012; Cohn et al., 2022). Yet, prior research examining technology-DS registers has largely focused on mainstream varieties of US English, British English, and German (Cohn et al., 2022; Cohn et al., 2021a; Mayo et al., 2012; Raveh et al., 2019), but not on speaker groups more commonly misunderstood by technology.

The current study addresses this gap, comparing technology- and human-DS adjustments made by African American English (AAE) speakers. AAE is a language variety spoken in the United States, with roots in the African American and Black communities (Wolfram and Thomas, 2008; Mufwene et al., 2021; Rickford et al., 2015). A large body of work documents the rule-governed and systematic phonetic, prosodic, and lexico-syntactic features of AAE, which differ along these dimensions from mainstream varieties of US English (e.g., Rickford, 1999; Thomas, 2007; Bailey and Thomas, 1998; Thomas and Lanehart, 2015), as well as across varieties of AAE (e.g., across regions; Wolfram, 2007; Holliday, 2019). In a seminal study, Koenecke et al. (2020) demonstrated that across five commercially available ASR systems, AAE speakers were misunderstood at a higher rate than mainstream US English speakers; this finding has been replicated with other ASR systems as well (e.g., Wassink et al., 2022; Martin and Tang, 2020; Lai and Holliday, 2023; for a review, see Ngueajio and Washington, 2022). Because the rate of ASR errors can be higher for AAE speakers, these errors can lead to downstream linguistic discrimination in technology (e.g., in healthcare and employment; for a discussion, see Martin and Wright, 2023).

The current paper tests AAE technology-DS and human-DS registers for two acoustic properties: speech rate and pitch variation (measured via fundamental frequency, f0), paralleling related work in human-human (e.g., Narayan and McDermott, 2016) and human-computer interaction (e.g., Cohn et al., 2021a; Cohn et al., 2021b; Cohn et al., 2022). On the one hand, theories of technology equivalence (e.g., Nass et al., 1997; Lee, 2008) propose that people apply behaviors from human-human interaction to technology. For example, people mirror emotional expressiveness in a human and a text-to-speech (TTS) voice, producing parallel changes in duration and pitch variation (Cohn et al., 2021b). One possibility, in line with technology equivalence, is that speakers will show no differences in rate and pitch variation between technology- and human-DS. On the other hand, routinized interaction theories of human-computer interaction (Gambino et al., 2020) propose that people have a "routinized" way of engaging with technology based on their real experiences with the systems. For example, Mayo et al. (2012) found that a British English male speaker produced longer utterances with a reduced pitch range when imagining talking to a computer, compared to reading target sentences "plainly." Cohn et al. (2022) found that, when talking to a TTS voice (Apple's Siri), California English speakers were often louder and displayed less pitch variation than when talking to a human voice. In line with routinized interaction theories, one prediction is that AAE speakers will produce features consistent with technology-DS and will produce even larger technology- and human-DS distinctions if they report being misunderstood by ASR more frequently.

The current study tests three conditions (voice assistant-, familiar human-, unfamiliar human-DS), comparing two features: speech rate and pitch variation. Across the three conditions, participants produced utterances in response to the identical set of prompts, providing a controlled context for the comparisons.

A total of n = 21 participants were recruited through Dope Labs,1 an agency that leads recruitment and facilitation of interviews with underrepresented communities. Recruitment criteria included identifying as Black or African American, currently living in the United States, being between 18 and 64 years old, indicating that English was their first language, and reporting prior experience using voice technology and with its errors.2 Following King (2020), the current study used an inclusive definition of AAE, encompassing any speaker within the African American community.

Data were excluded for one participant whose recordings had a high level of background noise and one participant who did not complete the entire study. The retained sample consisted of n = 19 participants (n = 10 female, n = 8 male, n = 1 nonbinary; age range: 18–55 years3) from several regions across the United States (n = 2 East Coast, n = 3 Midwest, n = 8 South, n = 6 West Coast4). All participants reported using a voice assistant or speech-to-text frequently (n = 15 "a few times a day," n = 1 "once a day," n = 3 "a few times a week"). All participants had experience with at least one voice assistant (n = 19 Siri, n = 18 Google Assistant), as well as with speech-to-text (n = 14 sending a message on a social media service; e.g., WhatsApp, Facebook Messenger). All participants completed informed consent and were compensated for their time.

All participants reported prior experience with ASR errors. In response to the question, "When using voice technology, how often do errors occur?" n = 7 responded "most of the time," while n = 12 responded "some of the time." As reasons for these errors, participants most commonly indicated that the system "does not understand the way I speak" (n = 12), "misinterprets what I want" (n = 11), and "does not understand my context" (n = 11). Two participants explicitly mentioned the AAE variety as a source of ASR errors (examples shown in Table 1).

Participants completed the experiment from a quiet room in their homes via a Zoom web-conferencing call with a Dope Labs researcher (n = 7 researchers in total; all identifying as Black/African American). First, participants were asked questions about their experiences with voice assistants, including "When using voice technology, how often do errors occur?" on a Likert scale ("Never," "Some of the time," "Most of the time," or "Every time"). Next, participants completed three imagined addressee conditions (in separate blocks5): (1) voice assistant-DS ("Imagine you are talking to your voice assistant"); (2) familiar human-DS ("Imagine talking to a close friend or family member"); and (3) unfamiliar human-DS ("Imagine talking to a stranger, such as a person in your community, or a person you just met").

In each addressee condition, participants produced directives based on a set list of prompts (e.g., "Imagine you are talking to your voice assistant; how would you ask your assistant for 'The weather for a future date in a specific location'?"). Prompts were selected to cover five categories common to interactions with voice assistants (example shown in Table 2; full list provided in the supplementary material, Table S1):

  1. Answering questions (n = 10; e.g., “The weather for a future date in a specific location”).

  2. Getting stuff done (n = 11; e.g., “Create calendar event for a future date with others”).

  3. Playing media (n = 10; e.g., “Play a specific song”).

  4. Communicating (n = 10; e.g., “Call a friend”).

  5. Just for fun (n = 10; e.g., “Hear an impression of a given animal”).

Table 1. Examples of participants' typed rationales for ASR errors, in response to the prompt: "Describe the most common failure you experience when using voice technology. What are the main reasons you believe these failures or errors occur?"

P3: "When trying to send texts with voice typing I run into a lot of errors and sometimes have to retype my messages because the voice tech doesn't understand my slang (AAVE) or misspells names and phrases. This mostly happens when I try to send texts to other black people because I speak differently with them than I do with everyone else."

P12: "Sometimes if I ask them to call someone it will call the wrong person, or isn't understanding my directions fully."

P25: "I have error with my friends and family's names. I also have issues when trying to help my grandfather use Google Assistant. He exclusively speaks in AAVE so sometimes Google doesn't understand. I also have issues playing certain songs with 'slang titles.'"

P26: "The technology is not understanding my voice, my vocal tone."
Table 2. Example productions for the prompt "The weather for a future date in a specific location."

Voice-assistant-DS: "Hey Google, what's the weather in LA for tomorrow?"
Familiar-DS: "Hey dad, do you know the weather for tomorrow?"
Unfamiliar-DS: "Hi, excuse me. Do you know the weather in LA for tomorrow?"

Each participant produced n = 153 utterances (n = 51 in each addressee condition: assistant-, familiar-, and unfamiliar-DS). In total, the study took roughly 1 h.

The audio tracks were extracted and participants' utterances were segmented with ffmpeg. The first author listened to all of the experimental trials (n = 2894) and annotated the presence of noise, other speakers, interference by the interviewer, or laughter; these trials (n = 697) were excluded from the analysis. The retained trials (n = 2197) were acoustically analyzed with Parselmouth (Jadoul et al., 2018), the Python extension of Praat (Boersma and Weenink, 2021).
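
To make the pipeline concrete, the following is a minimal sketch of this kind of segmentation step, not the authors' released code; the file names, trial boundaries, and function name are hypothetical:

```python
import subprocess
import parselmouth  # Python interface to Praat (Jadoul et al., 2018)

def extract_trial(session_wav, start_s, end_s, out_wav):
    """Cut one utterance out of a session recording with ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", session_wav,
         "-ss", str(start_s), "-to", str(end_s), out_wav],
        check=True,
    )

# Hypothetical session file and trial boundaries, for illustration only.
extract_trial("session_P3.wav", 12.4, 15.1, "P3_trial001.wav")
sound = parselmouth.Sound("P3_trial001.wav")  # ready for acoustic analysis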

Over each utterance, several measurements6 were taken, including speaking rate (mean number of syllables per second; adapted from De Jong et al., 2017). For pitch measurements, fundamental frequency (f0) was measured over the utterance (at ten equidistant points; DiCanio, 2007), based on plausible maxima and minima by speaker gender (female, 100–350 Hz; male, 60–210 Hz; nonbinary, 60–350 Hz). f0 values were then converted to semitones (base = 75 Hz), and values more than three standard deviations from a speaker's mean were excluded; trial f0 variation was then calculated from these values (cf. Cohn et al., 2022; Cohn and Zellou, 2021).
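
The f0 processing can be illustrated with a short sketch consistent with this description, assuming the gender-based search ranges above and the semitone conversion st = 12 * log2(f0/75); the function names and the use of the standard deviation as the variation measure are our own assumptions, not the authors' code:

```python
import numpy as np
import parselmouth  # Python interface to Praat (Jadoul et al., 2018)

# Gender-based f0 search ranges from the paper (Hz).
F0_RANGE = {"female": (100, 350), "male": (60, 210), "nonbinary": (60, 350)}

def trial_f0_semitones(wav_path, gender, n_points=10, base_hz=75.0):
    """Sample f0 at ten equidistant points over the utterance and
    convert to semitones relative to 75 Hz: st = 12 * log2(f0 / 75)."""
    snd = parselmouth.Sound(wav_path)
    floor_hz, ceil_hz = F0_RANGE[gender]
    pitch = snd.to_pitch(pitch_floor=floor_hz, pitch_ceiling=ceil_hz)
    times = np.linspace(snd.xmin, snd.xmax, n_points)
    f0 = np.array([pitch.get_value_at_time(t) for t in times])
    f0 = f0[~np.isnan(f0)]                  # drop unvoiced sample points
    return 12 * np.log2(f0 / base_hz)

def pitch_variation(trial_st, speaker_mean, speaker_sd):
    """Trial f0 variation: spread of the semitone values, after excluding
    points more than 3 SD from the speaker's overall mean."""
    kept = trial_st[np.abs(trial_st - speaker_mean) <= 3 * speaker_sd]
    return float(np.std(kept))
```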

Each acoustic feature (speech rate, pitch variation) was analyzed in a separate linear mixed effects model with the lme4 R package (Bates et al., 2015). Fixed effects included Addressee (three levels: voice assistant-DS, familiar human-DS, unfamiliar human-DS; treatment coded, reference = voice-assistant-DS), Reported Error Frequency (two levels: most of the time, some of the time; sum coded), and their interaction. Additionally, the fixed effect of Proportion of Experiment (time-normalized; centered) was included in each model. Random effects included by-Participant random intercepts and by-Participant random slopes for Addressee and Proportion of Experiment.
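
The models were fit in R with lme4 (roughly: rate ~ Addressee * ErrorFreq + PropExp + (1 + Addressee + PropExp | Participant), per the description above). Purely as an illustrative analogue, not the authors' method, the same structure can be sketched in Python with statsmodels' MixedLM; the data frame and column names are hypothetical:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per retained trial, with columns
# rate, addressee, error_freq, prop_exp, participant.
df = pd.read_csv("trials.csv")

# Treatment coding with the assistant condition as reference; sum coding
# for reported error frequency; random intercepts and slopes by participant.
model = smf.mixedlm(
    "rate ~ C(addressee, Treatment('assistant')) * C(error_freq, Sum) + prop_exp",
    data=df,
    groups=df["participant"],
    re_formula="~C(addressee, Treatment('assistant')) + prop_exp",
)
print(model.fit().summary())
```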

Model comparisons were used to assess the inclusion of demographic predictors in each model: Age (two levels:7 younger, 18–34 years, and older, 35–55 years; sum coded), Gender (female, male, nonbinary; sum coded), and Geographic Region (East Coast, Midwest, South, West Coast; sum coded). Using backward selection, the model that improved fit, relative to the increase in parameterization, was retained [as assessed by the corrected Akaike Information Criterion (AICc) with the MuMIn R package; Bartoń, 2017]. In cases of singularity or convergence errors, the random effects structure was simplified following Barr et al. (2013) and Cohn et al. (2022). Finally, the collinearity of the predictors was assessed with the performance R package (Lüdecke et al., 2021).
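
The AICc criterion behind this selection step can be written out directly; this is a generic sketch of the standard formula, not the MuMIn implementation, and the helper names are ours:

```python
def aicc(loglik, k, n):
    """Corrected Akaike Information Criterion:
    AICc = -2*logLik + 2k + 2k(k+1)/(n - k - 1),
    for n observations and k estimated parameters."""
    return -2.0 * loglik + 2.0 * k + (2.0 * k * (k + 1)) / (n - k - 1)

def retain_predictor(llf_full, k_full, llf_reduced, k_reduced, n):
    """One backward-selection step: keep the extra predictor (e.g., Age)
    only if the fuller model has the lower AICc."""
    return aicc(llf_full, k_full, n) < aicc(llf_reduced, k_reduced, n)
```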

Model outputs, including the retained model structures, are provided in the supplementary material (Tables S2 and S3). Predictors in the retained models all had low collinearity (all variance inflation factors < 5). Summarized raw acoustic measurements are plotted in Figs. 1 and 2.

Fig. 1. Mean speaking rate at the utterance level across the imagined conditions (voice assistant-DS, familiar human-DS, and unfamiliar human-DS). Individual points show participant means. Error bars indicate the standard error of the mean.
Fig. 2. Mean pitch variation at the utterance level across the imagined conditions (voice assistant-DS, familiar human-DS, and unfamiliar human-DS). Individual points show participant means. Error bars indicate the standard error of the mean.

The speech rate model showed effects of Addressee. As seen in Fig. 1, relative to voice assistant-DS, speech rate was faster in familiar human-DS [Coef = 0.81, t = 7.86, p < 0.001] and unfamiliar human-DS [Coef = 1.41, t = 8.28, p < 0.001]. While there was no effect of Reported Error Frequency and no interactions with Addressee, there was an effect of Proportion of Experiment: over the course of the experiment, participants tended to slow their speech rate [Coef = −0.96, t = −3.72, p < 0.001].

The pitch variation model revealed effects of Addressee. As seen in Fig. 2, there was more pitch variation in familiar human-DS [Coef = 0.36, t = 2.90, p < 0.01] and unfamiliar human-DS [Coef = 0.49, t = 2.94, p < 0.01] than in assistant-DS. There was no effect of Reported Error Frequency, no interactions with Addressee, and no effect of Proportion of Experiment.

The current study tested AAE speakers' adjustments in rate and pitch variation when imagining talking to a voice assistant, compared to imagining talking to a friend/family member or a stranger. First, there were consistent adaptations for voice technology: speech directed toward an imagined voice assistant was slower and had less pitch variation, suggesting "style shifting" (Rickford and McNair-Knox, 1994; Labov, 2001) across technology- and human-DS. These consistent adjustments provide support for routinized interaction theories of human-computer interaction (Gambino et al., 2020).

In related work, a slower speech rate has been observed in response to a misunderstanding (e.g., Stent et al., 2008), as well as for addressees with communicative barriers (for a review, see Smiljanić and Bradlow, 2009), including voice assistants (e.g., Cohn and Zellou, 2021). In the current study, speakers produced faster speech when imagining talking to a friend or family member (∼5.2 syllables per second) than when imagining talking to a voice assistant (∼4.6 syllables per second), consistent with prior work examining differences between clear and casual speech (Smiljanić and Bradlow, 2005).

Observing less pitch variation in voice-assistant-DS is also consistent with findings in technology-DS (e.g., Cohn et al., 2022; Mayo et al., 2012). In Cohn et al. (2022), California English speakers produced more monotone speech when talking to a Siri voice, compared to a human voice, with a decrease of 0.24 semitones in pitch range, paralleling the pitch reduction observed in the present study (0.28 semitones) when directing utterances to a voice assistant.

Mayo et al. (2012) also found a reduced pitch range in speech directed toward an imagined computer, compared to reading a sentence "plainly." While speculative, these adjustments might stem from convergence toward the stereotype of voice assistants, which are often reported as sounding "robotic," "monotone," and "emotionless" (Siegert and Krüger, 2021; Cohn et al., 2023). For example, in human-human interaction, speakers adopt idealized patterns of their interlocutor, even if those features are not actually present (Wade, 2022). In human-computer interaction, speakers have been shown to adopt the pronunciation patterns of TTS voices as well (e.g., Cohn et al., 2023; Cohn et al., 2021b; Gessinger et al., 2021; Raveh et al., 2019), suggesting that convergence toward the expectation of a voice assistant might be part of the technology-DS register observed in the current study.

While the higher error rate for ASR systems transcribing AAE has been well documented in the literature (e.g., Koenecke et al., 2020), participants' reported frequency of ASR errors did not shape speakers' technology-DS adaptations in the current study. One possibility is that the two levels, with reported ASR errors occurring either "most of the time" or "some of the time," were not distinct enough experiences to warrant differing degrees of technology-DS adaptations. As all participants reported frequent ASR errors, their collective adaptation strategies might instead reflect a more systematized way of engaging with technology.

Comparing voice-assistant-DS to the familiar- and unfamiliar-DS conditions, there were also differences paralleling style shifting in human-human interactions (Rickford and McNair-Knox, 1994; Labov, 2001). Voice-assistant-DS was slowest, followed by familiar human-DS, and then unfamiliar human-DS. While more formal contexts can lead to a slower speech rate (e.g., a request to a supervisor in Duran et al., 2023), faster speech in the current study was observed toward an unfamiliar community member. Here, one difference is that participants were making a request of a peer. One possible explanation is that participants provided more context surrounding the request to someone they did not know (e.g., "Hi there, sorry to bother you. My phone just died. Would you mind if I used your phone to call my mom?"). Accordingly, participants might have spoken more quickly to convey respect for the (imagined) stranger's time. At the same time, the degree of pitch variation was similarly larger for both human-DS registers than for voice-assistant-DS, illustrating that speakers independently adjust rate and pitch variation as they shift across addressee styles.

This experiment has several limitations that can be addressed in future work. First, the current study examined one variety of English, AAE, but there are many other language varieties that are commonly misunderstood by ASR, both within the United States (e.g., Wassink et al., 2022) and in other countries (e.g., Markl, 2022). Future work exploring additional dialects and languages is needed for a more complete understanding of the effects of ASR errors on speakers' adaptations for speech technology. Second, the current study examined two features, speech rate and pitch variation, to probe the effects of a technology-DS register for AAE speakers. Yet, AAE has distinct prosodic, segmental, lexical, and syntactic differences from mainstream US English (e.g., Rickford, 1999; Thomas and Lanehart, 2015) that speakers might adopt when talking to technology. As each individual did not produce an identical set of sentences, it is possible that the specific lexical and syntactic choices in the present study could have further shaped speech rate and pitch variation adaptations to imagined addressees. Future studies comparing naturalistic productions, as well as controlled sets of target utterances, could shed light on these potential interactions.

Third, participants completed the experiment from their homes and their technological set-ups varied. For example, most participants did not have an external, head-mounted microphone. Furthermore, each participant completed the study from a different space, including in the presence of other types of background noise (note, however, that trials containing any noise were removed from analysis). Future studies comparing in-lab and at-home adaptations can probe whether technology-DS adaptations are consistent across more controlled environments.

Fourth, participants in the current study only completed imagined scenarios. Related research has shown that phonetic adjustments for a "real" addressee can vary from those in imagined contexts (e.g., Scarborough and Zellou, 2013), suggesting that the register adjustments might be even larger in a real context. For example, future studies examining how AAE speakers correct an error made by an ASR system could probe how code-switching (e.g., between AAE and mainstream US English) and technology-DS registers interact. Prior work shows that AAE speakers report code-switching toward mainstream US English with technology (Mengesha et al., 2021; Harrington et al., 2022), suggesting that inequalities in voice technology shape the way AAE speakers engage with it in more naturalistic interactions as well. Overall, this paper addresses a limitation in the types of language varieties explored when examining technology-DS registers and contributes to our understanding of the dynamics of human-computer interaction in an increasingly technological world.

See the supplementary material for model outputs and stimuli.

Thank you to Dope Labs for assistance with data collection.

The authors report employment at Google Inc. (provided by Magnit for M.C.). No other conflicts of interest are reported.

The data that support the findings of this study are available from the corresponding author upon reasonable request.

2. "When using voice technology, how often do errors occur?" not equal to "Never."

3. n = 8, 18–24 year-olds; n = 6, 25–34 year-olds; n = 4, 35–44 year-olds; n = 1, 45–55 year-old.

4. Geographic region was identified based on the state in which participants indicated living.

5. Note that the order of blocks was fixed for n = 18 participants (unfamiliar-, familiar-, assistant-DS) and differed slightly for n = 1 participant (familiar-, unfamiliar-, assistant-DS). Therefore, Proportion of Experiment (time-normalized) was included as a predictor in all models to account for changes over time.

6. Note that intensity was not measured in the current study, as Zoom auto-normalized intensity to 70 dB in the call.

7. Age was binned based on the age range each participant identified as being in; the "Younger" age group included participants who identified as being in the 18–24 and 25–34 age ranges, while the "Older" age group included participants in the 35–44 and 45–55 age ranges.

1. Ammari, T., Kaye, J., Tsai, J. Y., and Bentley, F. (2019). "Music, search, and IoT: How people (really) use voice assistants," ACM Trans. Comput.-Hum. Interact. 26, 1–28.
2. Bailey, G., and Thomas, E. (1998). "Some aspects of African-American Vernacular English phonology," in African-American English: Structure, History and Use, edited by S. Mufwene, J. R. Rickford, G. Bailey, and J. Baugh (Routledge, New York), pp. 85–109.
3. Barr, D. J., Levy, R., Scheepers, C., and Tily, H. J. (2013). "Random effects structure for confirmatory hypothesis testing: Keep it maximal," J. Mem. Lang. 68, 255–278.
4. Bartoń, K. (2017). "MuMIn: Multi-model inference. R package," https://ci.nii.ac.jp/naid/10030918982/ (Last viewed June 2018).
5. Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015). "Fitting linear mixed-effects models using lme4," J. Stat. Softw. 67, 1–48.
6. Boersma, P., and Weenink, D. (2021). "Praat: Doing phonetics by computer," http://www.praat.org/ (Last viewed May 2023).
7. Burnham, D. K., Joeffry, S., and Rice, L. (2010). "Computer- and human-directed speech before and after correction," in Proceedings of the 13th Australasian International Conference on Speech Science and Technology, December 14–16, Melbourne, Australia, pp. 13–17.
8. Cohn, M., Ferenc Segedin, B., and Zellou, G. (2022). "Acoustic-phonetic properties of Siri- and human-directed speech," J. Phon. 90, 101123.
9. Cohn, M., Keaton, A., Beskow, J., and Zellou, G. (2023). "Vocal accommodation to technology: The role of physical form," Lang. Sci. 99, 101567.
10. Cohn, M., Liang, K.-H., Sarian, M., Zellou, G., and Yu, Z. (2021a). "Speech rate adjustments in conversations with an Amazon Alexa Socialbot," Front. Commun. 6, 671429.
11. Cohn, M., Predeck, K., Sarian, M., and Zellou, G. (2021b). "Prosodic alignment toward emotionally expressive speech: Comparing human and Alexa model talkers," Speech Commun. 135, 66–75.
12. Cohn, M., and Zellou, G. (2021). "Prosodic differences in human- and Alexa-directed speech, but similar local intelligibility adjustments," Front. Commun. 6(1), 675704.
13. De Jong, N. H., Wempe, T., Quené, H., and Persoon, I. (2017). "Praat script speech rate v2," https://sites.google.com/site/speechrate/Home/praat-script-syllable-nuclei-v2 (Last viewed January 2021).
14. DiCanio, C. (2007). "Extract pitch averages," https://www.acsu.buffalo.edu/~cdicanio/scripts/Get_pitch.praat (Last viewed May 12, 2019).
15. Duran, D., Weirich, M., and Jannedy, S. (2023). "Assessing register variation in local speech rate," in Proceedings of the 20th ICPhS, August 7–11, Prague, Czech Republic, pp. 2315–2318.
16. Fernald, A. (1992). "Meaningful melodies in mothers' speech to infants," in Nonverbal Vocal Communication: Comparative and Developmental Approaches (Cambridge University Press, Cambridge, UK), pp. 262–282.
17. Gambino, A., Fox, J., and Ratan, R. A. (2020). "Building a stronger CASA: Extending the computers are social actors paradigm," Hum.-Mach. Commun. 1, 71–85.
18. Gessinger, I., Raveh, E., Steiner, I., and Möbius, B. (2021). "Phonetic accommodation to natural and synthetic voices: Behavior of groups and individuals in speech shadowing," Speech Commun. 127, 43–63.
19. Harrington, C. N., Garg, R., Woodward, A., and Williams, D. (2022). "'It's kind of like code-switching': Black older adults' experiences with a voice assistant for health information seeking," in Proceedings of CHI '22, April 29–May 5, New Orleans, LA.
20. Holliday, N. R. (2019). "Variation in question intonation in the corpus of regional African American language," Am. Speech 94, 110–130.
21. Jadoul, Y., Thompson, B., and de Boer, B. (2018). "Introducing Parselmouth: A Python interface to Praat," J. Phon. 71, 1–15.
22. King, S. (2020). "From African American Vernacular English to African American language: Rethinking the study of race and language in African Americans' speech," Annu. Rev. Linguist. 6, 285–300.
23. Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J. R., Jurafsky, D., and Goel, S. (2020). "Racial disparities in automated speech recognition," Proc. Natl. Acad. Sci. U.S.A. 117, 7684–7689.
24. Labov, W. (2001). "Applying our knowledge of African American English to the problem of raising reading levels in inner-city schools," in Sociocultural and Historical Contexts of African American English (John Benjamins, Philadelphia, PA), pp. 299–317.
25. Lai, L.-F., and Holliday, N. (2023). "Exploring sources of racial bias in automatic speech recognition through the lens of rhythmic variation," in Proceedings of Interspeech 2023, August 20–24, Dublin, Ireland, pp. 1284–1288.
26. Lee, K. M. (2008). "Media equation theory," in International Encyclopedia of Communication (John Wiley & Sons, Malden, MA), pp. 1–4.
27. Lüdecke, D., Ben-Shachar, M. S., Patil, I., Waggoner, P., and Makowski, D. (2021). "performance: An R package for assessment, comparison and testing of statistical models," J. Open Source Softw. 6, 3139.
28. Lunsford, R., Oviatt, S., and Arthur, A. M. (2006). "Toward open-microphone engagement for multiparty interactions," in Proceedings of the 8th International Conference on Multimodal Interfaces, November 2–4, New York, NY, pp. 273–280.
29. Markl, N. (2022). "Language variation and algorithmic bias: Understanding algorithmic bias in British English automatic speech recognition," in Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 521–534.
30. Martin, J. L., and Tang, K. (2020). "Understanding racial disparities in automatic speech recognition: The case of habitual 'be,'" in Proceedings of Interspeech 2020, October 25–29, Shanghai, China, pp. 626–630.
31. Martin, J. L., and Wright, K. E. (2023). "Bias in automatic speech recognition: The case of African American language," Appl. Ling. 44(4), 613–630.
32. Mayo, C., Aubanel, V., and Cooke, M. (2012). "Effect of prosodic changes on speech intelligibility," in Proceedings of Interspeech 2012, September 9–13, Portland, OR, pp. 1706–1709.
33. Mengesha, Z., Heldreth, C., Lahav, M., Sublewski, J., and Tuennerman, E. (2021). "'I don't think these devices are very culturally sensitive.'—Impact of automated speech recognition errors on African Americans," Front. Artif. Intell. 4, 725911.
34. Mufwene, S. S., Rickford, J. R., Bailey, G., and Baugh, J. (2021). African-American English: Structure, History, and Use (Routledge, New York).
35. Narayan, C. R., and McDermott, L. C. (2016). "Speech rate and pitch characteristics of infant-directed speech: Longitudinal and cross-linguistic observations," J. Acoust. Soc. Am. 139, 1272–1281.
36. Nass, C., Moon, Y., Morkes, J., Kim, E.-Y., and Fogg, B. J. (1997). "Computers are social actors: A review of current research," Hum. Values Des. Comput. Technol. 72, 137–162.
37. Ngueajio, M. K., and Washington, G. (2022). "Hey ASR system! Why aren't you more inclusive? Automatic speech recognition systems' bias and proposed bias mitigation techniques. A literature review," in International Conference on Human-Computer Interaction (Springer Nature, Cham, Switzerland), pp. 421–440.
38. Oviatt, S., MacEachern, M., and Levow, G.-A. (1998). "Predicting hyperarticulate speech during human-computer error resolution," Speech Commun. 24, 87–110.
39. Picheny, M. A., Durlach, N. I., and Braida, L. D. (1986). "Speaking clearly for the hard of hearing II: Acoustic characteristics of clear and conversational speech," J. Speech. Lang. Hear. Res. 29, 434–446.
40. Raveh, E., Steiner, I., Siegert, I., Gessinger, I., and Möbius, B. (2019). "Comparing phonetic changes in computer-directed and human-directed speech," in Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung (Study Texts on Speech Communication: Electronic Speech Signal Processing) (TUDpress, Dresden, Germany), pp. 42–49.
41. Rickford, J. R. (1999). "Phonological and grammatical features of African American Vernacular English (AAVE)," in African American Vernacular English (Blackwell, Oxford, UK), pp. 3–14.
42. Rickford, J. R., Duncan, G. J., Gennetian, L. A., Gou, R. Y., Greene, R., Katz, L. F., Kessler, R. C., Kling, J. R., Sanbonmatsu, L., Sanchez-Ordonez, A., Sciandra, M., Thomas, E., and Ludwig, J. (2015). "Neighborhood effects on use of African-American vernacular English," Proc. Natl. Acad. Sci. U.S.A. 112(38), 11817–11822.
43. Rickford, J. R., and McNair-Knox, F. (1994). "Addressee- and topic-influenced style shift: A quantitative sociolinguistic study," in Sociolinguistic Perspectives on Register (Oxford University Press, New York), pp. 235–276.
44. Scarborough, R., and Zellou, G. (2013). "Clarity in communication: 'Clear' speech authenticity and lexical neighborhood density effects in speech production and perception," J. Acoust. Soc. Am. 134, 3793–3807.
45. Siegert, I., and Krüger, J. (2021). "'Speech melody and speech content didn't fit together'—Differences in speech behavior for device directed and human directed interactions," in Advances in Data Science: Methodologies and Applications, 1st ed. (Springer, Switzerland), pp. 65–95.
46. Smiljanić, R., and Bradlow, A. R. (2005). "Production and perception of clear speech in Croatian and English," J. Acoust. Soc. Am. 118, 1677–1688.
47. Smiljanić, R., and Bradlow, A. R. (2009). "Speaking and hearing clearly: Talker and listener factors in speaking style changes," Lang. Linguist. Compass 3, 236–264.
48. Stent, A. J., Huffman, M. K., and Brennan, S. E. (2008). "Adapting speaking after evidence of misrecognition: Local and global hyperarticulation," Speech Commun. 50, 163–178.
49. Thomas, E. R. (2007). "Phonological and phonetic characteristics of African American Vernacular English," Lang. Linguist. Compass 1, 450–475.
50. Thomas, E. R., and Lanehart, S. L. (2015). "Prosodic features of African American English," in The Oxford Handbook of African American Language (Oxford University Press, Oxford, UK).
51. Uther, M., Knoll, M. A., and Burnham, D. (2007). "Do you speak E-NG-LI-SH? A comparison of foreigner- and infant-directed speech," Speech Commun. 49, 2–7.
52. Wade, L. (2022). "Experimental evidence for expectation-driven linguistic convergence," Language 98, 63–97.
53. Wassink, A. B., Gansen, C., and Bartholomew, I. (2022). "Uneven success: Automatic speech recognition and ethnicity-related dialects," Speech Commun. 140, 50–70.
54. Wolfram, W. (2007). "Sociolinguistic folklore in the study of African American English," Lang. Linguist. Compass 1, 292–313.
55. Wolfram, W., and Thomas, E. (2008). The Development of African American English (John Wiley & Sons, New York).

Supplementary Material