This work explores the relationship between phonetic and perceptual metrics for convergence in shadowed productions by adults and 6-year-old children by isolating the role of voice onset time (VOT) in listeners' similarity judgments. Results show a small but independent role for VOT: listeners were less likely to identify shadowed tokens as more similar to the model when natural VOT convergence present in the stimulus set had been artificially removed (experiments 1 and 2). However, VOT equivalence alone, when accompanied by naturally occurring variation along other dimensions, was not sufficient to drive listeners' judgments of similarity (experiment 3).

When repeating after, or “shadowing” another talker, speakers tend to approximate the characteristics of the speech of the “model” talker, a phenomenon that also appears in conversational interaction (Pardo et al., 2017).1 While there appears to be an automatic component to convergence (Goldinger, 1998), it is also linguistically and socially selective, with different linguistic features and different pairs of talkers eliciting different amounts of convergence (Clopper and Dossey, 2020). Despite the importance of these types of studies in understanding the role of interactional factors in speech communication, it is still unclear how best to quantify convergence.

Convergence has typically been measured in two general ways: acoustic metrics examine how speakers adjust their productions along measurable acoustic dimensions, while perceptual measures are based on listeners' holistic judgments of similarity between the shadower and the model talker, incorporating multiple dimensions simultaneously. These two metrics have some degree of overlap, but they are sensitive to different elements of what speakers are doing and therefore give different types of information. In this study, we explore the relationship between these two metrics, using data from a shadowing task done by children and adults. By comparing participants' judgments across conditions where a single acoustic dimension has been manipulated, we isolate the independent contribution of a single dimension, voice onset time (VOT), in predicting listeners' perceptual similarity judgments.2

Acoustic convergence has been found on multiple phonetic dimensions, including vowel formants (Babel et al., 2013; Dufour and Nguyen, 2013), duration (Pardo, 2013; Pardo et al., 2017), f0 (Babel and Bulatov, 2012; Kim, 2012), and VOT (Fowler et al., 2003; Sanchez et al., 2010; Shockley et al., 2004). However, convergence effects are often small and inconsistent, with substantial differences across individual shadowers and model talkers (Pardo et al., 2017). Furthermore, if the goal is to investigate convergence more generally, the scope of the conclusions that can be drawn from this type of analysis is limited to the subset of acoustic dimensions chosen by the researcher and can never encompass the quasi-infinite number of dimensions on which shadowers could in principle be imitating the model talker.

A more holistic measure of convergence draws on listeners' perceptual judgments of similarity between the shadower and model talker [see review in Pardo et al. (2017)]: if acoustic convergence has occurred, listeners should perceive shadowed productions as more similar to the model talker than baseline productions. Under the assumptions that (1) perceptual judgments are based on the integration of perceived similarity across multiple acoustic dimensions and (2) listeners are sensitive to similarity along all of these dimensions, a perceptual measure can provide a method of quantifying convergence that avoids some of the limitations of acoustic measures. However, these assumptions may not hold: talkers could be converging along a dimension that listeners do not consider when assessing similarity. In this case, relying solely on perceptual judgments risks missing systematic patterns of convergence.

Several studies have examined to what extent listeners' judgments reflect measurable acoustic convergence, using correlational analyses. The most comprehensive of these (Pardo et al., 2017) tested whether changes in three phonetic properties of vowels (duration, formants, and f0) predicted listeners' judgments of similarity, using data from 92 shadowers and 12 model talkers. Although results were complex, the extent of acoustic convergence was predictive of listeners' responses [see also Pardo et al. (2013) and Walker and Campbell-Kibler (2015)]. These studies provide evidence that listeners are sensitive to similarity along these acoustic dimensions and provide some support for the idea that similarity judgments can be used as a global metric to assess convergence. However, not all studies have found evidence for this sort of relationship (e.g., Babel and Bulatov, 2012; Babel et al., 2013). Furthermore, even if there is a correlation between perceptual and acoustic metrics, this does not mean that they are capturing the same thing, but rather that there is at least some overlap in what they capture.

Moreover, while a correlation between perceptual measures and acoustic convergence on a given dimension is consistent with the idea that listeners are using that dimension to inform their judgments, the use of naturalistic productions leaves open the possibility that it is not actually the target dimension driving listeners' judgments. If a shadower shows imitation on the target dimension, they are likely also converging with the model along other dimensions, so it is difficult to determine whether listeners' judgments are based on the target dimension or one of the other co-varying acoustic dimensions. In the current work, we eliminate this confound by comparing similarity judgments across conditions where VOT has been manipulated independently, with all other acoustic information held constant. This allows us to test more directly the independent contribution of acoustic similarity on this dimension to listeners' perceptions of convergence.

An additional question is whether listeners use the same strategies for different types of talkers. Children and adults have been shown to have different production patterns for VOT: for example, children's productions have been found to be more acoustically variable (including in VOT; Lowenstein and Nittrouer, 2008), presumably due to still-developing control of their articulators. Furthermore, Nielsen (2014) found that children (preschoolers and third-graders) showed more VOT imitation than adults in a shadowing task [though see also Paquette-Smith (2018), who found no difference in VOT convergence across age groups]. Because of these differences, we might expect listeners to use different strategies when assessing similarity for child vs adult shadowers. Overall, if perceptual measures are used to assess convergence, it is important to have a thorough understanding of which dimensions are used by listeners in assessing similarity and whether this differs depending on the talker.

While VOT convergence has been found in several shadowing studies using acoustic metrics (Fowler et al., 2003; Sanchez et al., 2010; Shockley et al., 2004), we are not aware of any work examining whether it influences perceptual judgments of convergence. This work explores the contribution of VOT to listeners' similarity judgments and extends beyond the correlational methodology of previous work to test for an independent effect of a single dimension, using a diverse set of shadowers (children and adults). In experiment 1, we test whether listeners' similarity judgments are related to the extent of naturally occurring VOT convergence, with results suggesting that VOT may inform their judgments. In experiment 2, we replicate experiment 1 with stimuli that have been manipulated such that the natural VOT convergence was removed. If VOT plays an independent role, listeners should show a decrease in accuracy when VOT differences have been neutralized. Finally, in experiment 3, we test whether VOT similarity, in and of itself, can drive perceptual similarity, even in the absence of any other cues to convergence, by asking listeners to assess similarity of two non-shadowed tokens, one of which had VOT artificially extended to approximate the VOT of the comparison talker but with all other natural acoustic variability intact. A secondary question across all experiments is whether listeners rely less on VOT when assessing similarity of child vs adult productions.

Perception stimuli were drawn from a shadowing task reported in Paquette-Smith (2018), in which child and adult shadowers (native speakers of Canadian English) repeated stop-initial words after two female model talkers (one American English, one Australian English) whose VOT had been systematically extended by an average of 50 ms to elicit convergence in VOT (Nielsen, 2011, 2014). The stimulus set consisted of 26 English words familiar to children; all were mono- or bi-syllabic, began with /p/, and had initial stress. Stimuli fell into three categories: there were two types of tokens produced by shadowers (baseline tokens, produced prior to shadowing, and shadowed tokens, repeated directly after a model talker) as well as the corresponding tokens produced by the model talkers (characterized by extended VOT). To ensure that convergence was actually present in the set of stimuli used in this study, we chose a subset of eight talkers who showed the most VOT imitation in the shadowing task, as quantified by the greatest difference between shadowed and baseline productions (children: 1 male, 3 female, age 5–7; adults: 2 male, 2 female, age 19–20). Within this subset, the children and adults did not differ significantly in the extent of convergence in production.3 VOTs of the shadowers' productions were annotated from the stop release burst to the onset of periodicity in the following vowel. All VOT manipulations were done using the PSOLA duration manipulation algorithm (Moulines and Charpentier, 1990) in Praat (Boersma and Weenink, 2018).

Ninety-six native English listeners participated (65 female, 31 male, ages 17–34), 32 in each of the three experiments. All experiments were XAB tasks: in each trial, listeners naive to the purpose of the study heard a model talker's production of a word (X), followed by two productions of the same word by another talker (A and B), and were asked to indicate whether A or B sounded most like X. There were 200 trials in experiments 1 and 2 and 174 in experiment 3 (8 talkers × 9–13 words × 2 repetitions).4 Trials were blocked by model talker and shadower age (e.g., in one block, X was always the same model talker, and A and B were productions of the two adults who had shadowed that model talker), resulting in four blocks. The order of blocks and order of trials within each block were randomized by participant. The task took about 15 min.
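The blocking and randomization scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the software used to run the experiments: the function name `build_session`, the block labels, and the placeholder word lists are all hypothetical.

```python
import random

def build_session(blocks, seed=None):
    """Return blocks in random order, each with its trials shuffled.

    `blocks` maps a (model talker, shadower age) label to a list of trials.
    Both the order of blocks and the order of trials within each block are
    randomized, which would be done once per participant.
    """
    rng = random.Random(seed)
    labels = list(blocks)
    rng.shuffle(labels)                # randomize block order
    session = []
    for label in labels:
        trials = list(blocks[label])   # copy so the input is not mutated
        rng.shuffle(trials)            # randomize trial order within the block
        session.append((label, trials))
    return session

# Hypothetical block contents: 2 model talkers x 2 shadower age groups.
blocks = {
    ("model_1", "adult"): ["word_a", "word_b", "word_c"],
    ("model_1", "child"): ["word_a", "word_b", "word_c"],
    ("model_2", "adult"): ["word_d", "word_e", "word_f"],
    ("model_2", "child"): ["word_d", "word_e", "word_f"],
}
session = build_session(blocks, seed=1)
```

Passing a different seed for each participant would give each listener an independent order while keeping every session reproducible.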

In experiment 1 (natural VOT), the A and B tokens to be compared to the model talker were one baseline and one shadowed token, in randomized order, produced by the same shadower. Experiment 2 (equal VOT) was identical except for the fact that the VOT differences present in experiment 1 had been neutralized by manipulating the VOT of each shadowed token to be equivalent to the VOT of the corresponding baseline token presented in the same trial. In experiment 3 (lengthened VOT), the A and B tokens to be compared to the model talker were two different baseline tokens (B1 and B2) from a shadower. The VOT of each baseline token was manipulated to create three versions: (1) the average VOT of B1 and B2, (2) a “long” token with the average plus 28 ms (the average shadowed-minus-baseline difference across the dataset), and (3) an “extra-long” token, which was equal to the corresponding model's production for that trial (on average, 50 ms longer than the original). If VOT is a strong contributor to listeners' similarity judgments, then listeners should be more likely to choose lengthened tokens as more similar to the model, and this effect should be even stronger when VOT is the same duration as that of the model (extra-long), compared to the more subtle long condition. In each trial, listeners heard the model token (X) followed by the “average” baseline and one of the lengthened tokens (A and B, in randomized order). Each pair of baseline tokens was heard twice during the experiment, e.g., first with the average version of B1 paired with long B2 and second with the average version of B2 paired with extra-long B1. The long and extra-long trials were randomized within each block and undifferentiated from each other.
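As a rough illustration of the experiment 3 manipulation, the target VOT values for the three versions can be computed as below. The 28 ms increment comes from the text; the function names, the example VOT values, and the idea of expressing each manipulation as a relative duration factor are assumptions made for this sketch, not a description of the Praat scripts actually used.

```python
LONG_INCREMENT_MS = 28.0  # average shadowed-minus-baseline difference reported in the text

def experiment3_targets(vot_b1, vot_b2, vot_model):
    """Target VOTs (ms) for the three versions of a baseline token pair."""
    average = (vot_b1 + vot_b2) / 2
    return {
        "average": average,                     # mean VOT of B1 and B2
        "long": average + LONG_INCREMENT_MS,    # average plus 28 ms
        "extra_long": vot_model,                # matches the model's VOT for that trial
    }

def duration_factor(original_vot, target_vot):
    """Relative stretch to apply to the VOT interval (assumed representation)."""
    return target_vot / original_vot

# Hypothetical VOT values in ms, not taken from the study.
targets = experiment3_targets(vot_b1=60.0, vot_b2=70.0, vot_model=115.0)
print(targets)                                   # {'average': 65.0, 'long': 93.0, 'extra_long': 115.0}
print(duration_factor(60.0, targets["long"]))    # 1.55
```

The same `duration_factor` helper would cover the experiment 2 manipulation, where the target for each shadowed token is simply the VOT of the corresponding baseline token.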

For all analyses, we used logistic mixed-effects regression models to examine the effect of the predictor variable of interest (which differs by experiment), shadower age (binary factor, child vs adult), and their interaction on listeners' responses. Models were implemented using the lme4 package in R (Bates et al., 2015; R Core Team, 2019). Unless otherwise noted, all models included random intercepts for participant, word, and shadower and random slopes for the predictor variable of interest for each of these with uncorrelated intercepts and slopes. Categorical predictors were simple-coded (–0.5, 0.5), such that beta-coefficients represent the (change in) log-odds of the relevant response between the two levels of a factor, with positive values indicating higher accuracy. Full model summaries are shown in Table 1.

Table 1.

Summaries of the three statistical models used in our analysis. Significant effects are shown in bold, and reference levels are given in italics.

Model 1 (experiment 1)                       β       SE      z       p
Intercept                                    0.94    0.17    5.51   <0.001
Difference-in-distance: DIDVOT               0.20    0.07    2.71    0.007
Age: child vs adult                          0.07    0.27    0.27    0.785
DID * age                                   −0.02    0.11   −0.19    0.847

Model 2 (experiment 1 vs experiment 2)       β       SE      z       p
Intercept                                    0.78    0.15    5.17   <0.001
Experiment (natural vs equalized)           −0.23    0.09   −2.50    0.013
Age: child vs adult                          0.07    0.25    0.29    0.775
Experiment * age                             0.03    0.08    0.33    0.742

Model 3 (experiment 3)                       β       SE      z       p
Intercept                                    0.06    0.04    1.51    0.132
Lengthening condition: long vs extra-long    0.15    0.13    1.11    0.267
Age: child vs adult                         −0.06    0.07   −0.91    0.364
Lengthening condition * age                 −0.01    0.17   −0.07    0.941

First, we tested whether listeners' responses were correlated with the extent of VOT imitation in experiment 1 (natural), as shown in Fig. 1. We computed a difference-in-distance (DIDVOT) measure to represent the extent of approximation of the model's VOT value in the shadowed vs baseline condition, following previous work (Pardo et al., 2013; Pardo et al., 2017): $\mathrm{DID}_{\mathrm{VOT}} = |\mathrm{VOT}_{\mathrm{Base}} - \mathrm{VOT}_{\mathrm{Mod}}| - |\mathrm{VOT}_{\mathrm{Shad}} - \mathrm{VOT}_{\mathrm{Mod}}|$, such that positive difference scores represent closer proximity to the model talker in shadowed vs baseline productions. We then used a logistic mixed-effects regression model to test the effect of DIDVOT (scaled to z-scores) and age on listeners' “shadowed” responses (model 1).5 The effect of DIDVOT was significant, indicating that the likelihood of listeners' accurately choosing the shadowed production as closer to the model increased with greater values of DIDVOT, consistent with the idea that listeners are sensitive to VOT in assessing similarity. This VOT effect did not differ across ages; nor did overall accuracy differ by age, as indicated by non-significant effects of age and the age by DIDVOT interaction.
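The difference-in-distance measure and its z-scoring can be illustrated with a short sketch. The function names and the VOT values below are hypothetical and chosen only to show the sign convention. The sample standard deviation is assumed for scaling; the text does not specify which was used.

```python
from statistics import mean, stdev

def did_vot(vot_base, vot_shad, vot_model):
    """Difference-in-distance: positive when the shadowed token is closer
    to the model's VOT than the baseline token is."""
    return abs(vot_base - vot_model) - abs(vot_shad - vot_model)

def z_scores(values):
    """Scale values to z-scores (sample standard deviation assumed)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical (baseline, shadowed, model) VOT triples in ms, not study data.
tokens = [(60, 85, 110), (70, 72, 120), (65, 58, 105)]
dids = [did_vot(b, s, m) for b, s, m in tokens]
print(dids)  # [25, 2, -7]
scaled = z_scores(dids)
```

In the third triple the shadowed token moves away from the model (58 ms vs 65 ms against a model value of 105 ms), so the score is negative, that is, divergence rather than convergence.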

Fig. 1.

Results of experiment 1: Listeners' choice of shadowed (vs baseline) token in natural productions by child (black) and adult (gray) shadowers, as a function of the extent of imitation, quantified by a difference-in-distance value (DIDVOT, see text for details). The regression line shows the best-fit logistic curve. The dashed line shows chance performance.

To test whether VOT contributed independently to this correlation, we tested whether having the natural VOT convergence present in the stimuli (as in experiment 1) vs equalized (as in experiment 2) affected listeners' choice of shadowed (vs baseline) response (model 2).6 The effect of VOT condition was significant: listeners were less likely to consider the shadowed response more similar to the model in the condition where VOT differences had been artificially removed, as shown in Fig. 2. Again, no differences were found depending on shadower age, either in overall accuracy or in the effect of VOT across ages.

Fig. 2.

Left: Distribution of listeners' responses: percentage choice of the shadowed (vs baseline) token for experiments 1 and 2 and of the lengthened (vs average) token for experiment 3. Error bars show 95% confidence intervals of by-listener means. Right: Percentage choice of shadowed/lengthened tokens, broken down by shadower. Each dot represents the mean accuracy for a given shadower (black representing children, gray representing adults) in a given condition.

These results indicate an independent role for VOT. However, the difference in accuracy across experiments was very small (equalized: 66%; natural: 71%), and listeners were well above chance even when VOT convergence was neutralized, consistent with the idea that there are other converging features used by listeners in assessing similarity.
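As a worked check, the reported accuracies can be approximately recovered from the model 2 coefficients in Table 1. The sketch below assumes that the natural condition was coded −0.5 and the equalized condition +0.5 (inferred from the sign of the coefficient, not stated in the text), sets age to its midpoint of 0, and ignores random effects, so the values are approximations of the observed means rather than exact reproductions.

```python
import math

def inv_logit(x):
    """Convert log-odds to a probability."""
    return 1 / (1 + math.exp(-x))

def cell_probabilities(intercept, beta):
    """Predicted probability at each level of a simple-coded (-0.5, +0.5)
    factor, with all other predictors at 0 and random effects ignored."""
    low = inv_logit(intercept - 0.5 * beta)    # level coded -0.5
    high = inv_logit(intercept + 0.5 * beta)   # level coded +0.5
    return low, high

# Model 2 coefficients from Table 1: intercept 0.78, experiment effect -0.23.
natural, equalized = cell_probabilities(0.78, -0.23)
print(round(natural, 2), round(equalized, 2))  # 0.71 0.66
```

Under these assumptions the coefficient of −0.23 log-odds corresponds to the drop from roughly 71% to 66% "shadowed" responses.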

Experiment 3 was designed to test whether extending VOT to approximate the model's lengthened VOT value would increase perceptual similarity, even in the context of other natural variability. We tested whether listeners chose the lengthened token at a rate greater than chance and whether the effect differed depending on lengthening condition via a model in which the response variable was listeners' choice of the lengthened (vs average) token, with a predictor variable of lengthening condition (model 3).7 Although the results were in the predicted direction, with 53% lengthened response for extra-long vs 49% for long (Fig. 2), the effect of lengthening condition was not significant, and the non-significant intercept of the model indicates that listeners were not above chance overall. The effects of age and the length × age interaction were not significant. Taken together with the results of experiment 2, this is an indication that convergence in VOT alone may not be enough to drive judgments of similarity, even when the “converged” VOT values are identical to those of the model.

Previous work has demonstrated that acoustic and perceptual metrics of convergence are related and that listeners' perceptions of similarity are correlated with similarity in specific acoustic features [e.g., Pardo et al. (2017) and references therein], but we are not aware of work testing the independent contribution of a single dimension to perceptual similarity. We explored this through a series of three experiments testing the effect of systematic manipulations of VOT on similarity judgments. We found that VOT contributes independently to perceived similarity across a diverse group of shadowers (children and adults). The results of experiment 1 showed that a percept of similarity between a shadower and model talker is enhanced by convergence in VOT. Furthermore, we found that listeners' performance decreased when the natural convergence was removed by equalizing the VOT of the shadowed production to match the value of the baseline production.

The fact that perceived similarity was reduced when VOT differences were eliminated shows that VOT convergence is not only correlated with similarity judgments, as shown in previous work (Pardo et al., 2017; Walker and Campbell-Kibler, 2015), but that it also contributes independently to these judgments. Correlation alone does not indicate an independent role for a dimension, because it is possible that there are features co-varying with the target dimension that listeners could be attending to. Only by manipulating a dimension independently is it possible to isolate its role.

While our findings demonstrate that VOT plays an independent role, they also suggest that other features contribute to listeners' similarity judgments. There was only a small decrease in accuracy when VOT convergence was removed from natural productions, and accuracy remained well above chance, strongly suggesting that other converging features were being used by listeners. Furthermore, the role of VOT was not strong enough to exert a significant influence on listeners' similarity judgments in the face of more variability, as seen in experiment 3. Even when VOT values of a production matched the model's values exactly (in the extra-long condition), that token was not chosen more than a non-lengthened token. It is possible that differences in the stimuli or design of the two studies could have contributed to the discrepancy between the apparent use of VOT in experiment 3 and in the earlier analyses.8 However, taken at face value, these findings suggest that listeners' use of VOT may depend on the extent of variability present in the stimuli. Results from experiment 2 suggest that shadowed tokens were more similar to model productions than baseline tokens, even in dimensions other than VOT. Therefore, in experiments 1 and 2, where listeners were choosing between a baseline and a shadowed token, there was overall more acoustic similarity within the stimuli than in experiment 3, where listeners were choosing between two baseline tokens. This greater variability in experiment 3 may have distracted listeners' attention from the VOT differences, leading to an effect too small to be detected with the current design. Given that many acoustic dimensions (e.g., f0, duration, formants) contribute to judgments of perceptual similarity (Babel et al., 2013; Pardo et al., 2017), future work could explore how salience of a single dimension is affected by the extent of variability in other dimensions.

Previous work (e.g., Babel et al., 2013; Pardo et al., 2017) has highlighted how convergence differs across model/shadower pairs, leading to the idea that we might expect listeners to weight cues, or assess similarity, differently depending on the shadower. We included both child and adult shadowers to maximize generalizability and to test whether listeners might show systematically different strategies based on the age of the shadower, given previous findings of age-related differences in VOT, both in the extent of variability in production in general (Lowenstein and Nittrouer, 2008) and in the extent of VOT convergence in shadowing tasks (Nielsen, 2014). We did not find evidence of differences based on shadower age, indicating that if there is a difference in listeners' strategies across ages, it was not large enough to be detected with the current number of talkers (four per group) and methodology.

There were, however, noticeable differences in effects across individual shadowers. This variability was particularly striking in experiment 3, where the numerically larger (but non-significant) use of VOT in the extra-long compared to the long condition appears to be driven by several individual shadowers (see right panel of Fig. 2). There are multiple explanations for this variability. It is possible that listeners do indeed show differing degrees of reliance on certain dimensions in different talkers. At the same time, even if the reliance on VOT is constant across all shadowers, some shadowers may show less variability on other dimensions, resulting in the appearance of a relatively greater role for VOT. Exploring the factors underlying this variability is an interesting direction for future work; regardless of the reason, these individual differences reinforce the importance of using multiple shadowers in perception tasks (Pardo, 2013; Pardo et al., 2017). While more work is necessary to come to a full understanding of how perceptual strategies for assessing similarity may differ across populations, these results show general consistency across children and adults.

Our results demonstrate that VOT is one member of the constellation of acoustic dimensions that informs listeners' similarity judgments and that it plays a small but independent role. Using manipulated stimuli provides a fresh perspective on how to tease apart the individual role of acoustic features in perceptual judgments, and future work along similar lines examining different acoustic dimensions will move us toward an accurate understanding of how much overlap and divergence exists between acoustic and perceptual metrics for convergence.

The results of experiments 1 and 2 were presented at the International Congress of Phonetic Sciences (ICPhS) 2019. We would like to thank Fatima Adil, Crystal Chow, Tamim Fattah, and Anna Lyashenko for their help running experiments and Thomas St Pierre for helpful feedback. This work was supported by a grant from the Natural Sciences and Engineering Research Council of Canada.

1

Our study explores results of a non-interactive shadowing task, but phonetic convergence also has been extensively studied in conversational interaction. We discuss phonetic convergence in interactive and shadowing contexts together, but it should be noted that convergence may not operate identically in these two types of scenarios (e.g., Pardo et al., 2018).

2

Following previous work, we use the term VOT as a shorthand to refer to the acoustic consequences of the articulatory gesture defining VOT: the time from release of the stop to the onset of voicing.

3

This was tested via a linear mixed-effects model with VOT (s) as the response variable and age, baseline vs shadowed condition, and their interaction, as fixed effects. There was a significant effect of condition, with shadowed tokens on average 28 ms longer in VOT than baseline [β = 0.28, standard error (SE) = 0.003, t = 7.96, p < 0.001]. The interaction of age and condition was not significant (β = 0.11, SE = 0.007, t = 1.68, p = 0.12), indicating a lack of evidence to conclude there is a difference in the magnitude of the effect across age groups.

4

The number of trials was smaller in experiment 3 because not all shadowers successfully produced two baseline tokens for all words; when one token of a set was not available, the whole set for that word/participant was omitted.

5

glmer(response ~ DIDVOT * Age + (DIDVOT || listener) + (DIDVOT || shadower) + (DIDVOT || word)).

6

glmer(response ~ VOT.cond * Age + (VOT.cond || listener) + (VOT.cond || shadower) + (VOT.cond || word)).

7

glmer(response ~ Length.cond * Age + (Length.cond || listener) + (Length.cond || shadower) + (Length.cond || word)).

8

For example, statistical power differed between the two analyses, since one was between- and one was within-subjects. Furthermore, as suggested by a reviewer, it is possible that the manipulations applied in experiment 2 to the shadowed, but not the baseline, stimuli could have introduced artefacts that affected their perception.

1. Babel, M., and Bulatov, D. (2012). “The role of fundamental frequency in phonetic accommodation,” Lang. Speech 55(2), 231–248.
2. Babel, M., McAuliffe, M., and Haber, G. (2013). “Can mergers-in-progress be unmerged in speech accommodation?,” Front. Psychol. 4, 653.
3. Bates, D., Maechler, M., Bolker, B. M., and Walker, S. C. (2015). “lme4: Linear mixed-effects models using ‘Eigen’ and S4,” J. Stat. Softw. 67, 470–474.
4. Boersma, P., and Weenink, D. (2018). “Praat: Doing phonetics by computer,” http://www.praat.org (Last viewed April 9, 2021).
5. Clopper, C. G., and Dossey, E. (2020). “Phonetic convergence to Southern American English: Acoustics and perception,” J. Acoust. Soc. Am. 147(1), 671–683.
6. Dufour, S., and Nguyen, N. (2013). “How much imitation is there in a shadowing task?,” Front. Psychol. 4, 346.
7. Fowler, C. A., Brown, J. M., Sabadini, L., and Weihing, J. (2003). “Rapid access to speech gestures in perception: Evidence from choice and simple response time tasks,” J. Mem. Lang. 49(3), 396–413.
8. Goldinger, S. D. (1998). “Echoes of echoes? An episodic theory of lexical access,” Psychol. Rev. 105(2), 251–279.
9. Kim, J. (2012). “Some aspects in mimicry of lexical pitch accent by children and adults,” Korean J. Linguist. 37(2), 285–300.
10. Lowenstein, J. H., and Nittrouer, S. (2008). “Patterns of acquisition of native voice onset time in English-learning children,” J. Acoust. Soc. Am. 124(2), 1180–1191.
11. Moulines, E., and Charpentier, F. (1990). “Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones,” Speech Commun. 9, 453–467.
12. Nielsen, K. (2011). “Specificity and abstractness of VOT imitation,” J. Phon. 39(2), 132–142.
13. Nielsen, K. (2014). “Phonetic imitation by young children and its developmental changes,” J. Speech Lang. Hear. Res. 57(6), 2065–2075.
14. Paquette-Smith, M. (2018). “The effect of accent exposure on social cognition and language acquisition in early childhood,” Ph.D. thesis, University of Toronto, Canada.
15. Pardo, J. S. (2013). “Measuring phonetic convergence in speech production,” Front. Psychol. 4, 559.
16. Pardo, J. S., Jordan, K., Mallari, R., Scanlon, C., and Lewandowski, E. (2013). “Phonetic convergence in shadowed speech: The relation between acoustic and perceptual measures,” J. Mem. Lang. 69(3), 183–195.
17. Pardo, J. S., Urmanche, A., Wilman, S., and Wiener, J. (2017). “Phonetic convergence across multiple measures and model talkers,” Atten. Percept. Psychophys. 79(2), 637–659.
18. Pardo, J. S., Urmanche, A., Wilman, S., Wiener, J., Mason, N., Francis, K., and Ward, M. (2018). “A comparison of phonetic convergence in conversational interaction and speech shadowing,” J. Phon. 69, 1–11.
19. R Core Team (2019). “R: A language and environment for statistical computing,” R Foundation for Statistical Computing, Vienna, http://www.R-project.org (Last viewed April 9, 2021).
20. Sanchez, K., Miller, R. M., and Rosenblum, L. D. (2010). “Visual influences on alignment to voice onset time,” J. Speech Lang. Hear. Res. 53(2), 262–272.
21. Shockley, K., Sabadini, L., and Fowler, C. A. (2004). “Imitation in shadowing words,” Percept. Psychophys. 66(3), 422–429.
22. Walker, A., and Campbell-Kibler, K. (2015). “Repeat what after whom? Exploring variable selectivity in a cross-dialectal shadowing task,” Front. Psychol. 6, 546.