This article reports an acoustic study analysing the time-varying spectral properties of word-initial English liquids produced by 31 first-language (L1) Japanese and 14 L1 English speakers. While it is widely accepted that L1 Japanese speakers have difficulty in producing English /l/ and /ɹ/, the temporal characteristics of L2 English liquids are not well-understood, even in light of previous findings that English liquids show dynamic properties. In this study, the distance between the first and second formants (F2–F1) and the third formant (F3) are analysed dynamically over liquid-vowel intervals in three vowel contexts using generalised additive mixed models (GAMMs). The results demonstrate that L1 Japanese speakers produce word-initial English liquids with stronger vocalic coarticulation than L1 English speakers. L1 Japanese speakers may have difficulty in dissociating F2–F1 between the liquid and the vowel to a varying degree, depending on the vowel context, which could be related to perceptual factors. This article shows that dynamic information uncovers specific challenges that L1 Japanese speakers have in producing L2 English liquids accurately.
I. INTRODUCTION
A. Acquisition of English /l/ and /ɹ/ by L1 Japanese speakers
The current study investigates time-varying spectral properties of English liquids produced by first-language (L1) Japanese speakers. Numerous studies have shown that the acquisition of English liquids is particularly challenging for L1 Japanese speakers (e.g., Aoyama et al., 2019; Best and Strange, 1992; Flege et al., 1995; Saito and Munro, 2014; Sheldon and Strange, 1982). They typically perceive English /l/ and /ɹ/ as instances of a single L1 category, Japanese /r/ (e.g., Best and Strange, 1992; Guion et al., 2000). This corresponds to the learning of “similar” phones between L1 and L2 in the Speech Learning Model (SLM) (Flege, 1995; Flege and Bohn, 2021) and to the single-category (SC) or category-goodness (CG) assimilation scenarios in the Perceptual Assimilation Model of Second Language (L2) Speech Learning (PAM-L2) (Best and Strange, 1992; Best and Tyler, 2007; Hattori and Iverson, 2009), both of which predict moderate to substantial difficulty in acquiring the L2 sounds. The SLM posits that perceptual accuracy lays the foundation for accurate L2 speech production because L2 learners develop articulatory rules in the L2 phonetic categories that are established over the course of L2 speech learning (Flege and Bohn, 2021).
The difficulty L1 Japanese speakers face in acquiring English /l/ and /ɹ/ is associated with their sensitivity to the phonetic cues used to distinguish the contrast. The key spectral dimension that contrasts English /l/ and /ɹ/ is the frequency of the third formant (F3); American English /ɹ/ is associated with a notably low F3, around 1300 Hz for male speakers and 1800 Hz for female speakers, whereas laterals show a high F3 at approximately 2500–2800 Hz (Espy-Wilson, 1992; Stevens, 2000). The F2 frequency is associated with the resonance of the vocal tract cavity posterior to the primary constriction for both laterals and rhotics, which are commonly produced with a backed tongue body configuration (Stevens, 2000). Laterals are generally characterised by clear-dark allophony according to syllabic position; “clear” /l/s are often associated with laterals in pre-vocalic, syllable-initial position, and they typically have higher F2 values and a greater separation between F2 and F1 (F2–F1) than their post-vocalic “dark” counterparts (Carter and Local, 2007; Recasens, 2012). American English exhibits relatively darker realisations of liquids than British English overall, but syllable-initial laterals in American English are still somewhat “clearer” than their syllable-final counterparts (Recasens, 2012). This clear-dark allophony according to syllable position results from different articulatory configurations, such that the degree of tongue body retraction is greater for final laterals than for initial laterals (Recasens, 2012).
L1 Japanese speakers tend to rely more on the less reliable cue of F2 than on the more robust cue of F3 in their perception of English /l ɹ/ (Iverson et al., 2003; Saito and Munro, 2014). As a result, they tend to produce the distinction along the F2 dimension instead of learning to make a contrast along F3 (Aoyama et al., 2019; Saito and van Poeteren, 2018). For instance, they produce word-initial English /l/ with a somewhat higher F2 (approximately 1500–1800 Hz) than L1 English speakers (approximately 1200–1500 Hz), whereas F2 frequencies for English /ɹ/ are similar between the two speaker populations (Aoyama et al., 2019; Flege et al., 1995). As for F3, they produce English /ɹ/ with a relatively high F3 (2000–2600 Hz) but produce /l/ with F3 values comparable to those of L1 English speakers (Aoyama et al., 2019; Flege et al., 1995; Saito and Munro, 2014). Nevertheless, previous research claims that L1 Japanese speakers could learn to use these acoustic cues as L1 English speakers do, especially F1 and F2; several studies have reported similar F1 values in the production of English liquids between L1 Japanese and L1 English speakers (Aoyama et al., 2019; Flege et al., 1995; Saito and Munro, 2014). Saito and Munro (2014) also argue that the use of F2 is easier for L1 Japanese speakers to acquire than that of F3 for English /ɹ/, based on findings that L1 Japanese speakers who had resided in Canada for longer than 2.5 months produced native-like F2 values for English /ɹ/ compared to those with less overseas experience.
The degree of difficulty in L1 Japanese speakers' acquisition of English liquids also varies with the vowel context: in perception, they are better at correctly identifying word-initial English liquids adjacent to front vowels than to back vowels (Shimizu and Dantsuji, 1983). This might be because L1 Japanese speakers may also perceive English /l/ and /ɹ/ as a sequence of a back vowel and a tap (i.e., [ɯɾ]), possibly due to the vocalic nature of English liquids (Guion et al., 2000). L1 Japanese speakers are also more likely than L1 English speakers to hear a /w/-like percept when perceiving English /l/ and /ɹ/ (Best and Strange, 1992; Mochizuki, 1981; Yamada and Tohkura, 1992). These results overall suggest that L1 Japanese speakers are sensitive not only to the phonemic status but also to the phonetic details of English /l/ and /ɹ/. In particular, Shimizu and Dantsuji (1983) speculate that coarticulatory properties may play a role in explaining the vocalic contextual effects on L1 Japanese speakers' correct identification of English /l/ and /ɹ/.
B. Dynamic analysis of English liquids
Although the errors in segmental realisation in L2 speech are claimed to be rooted in perception, accurate perception does not always entail accurate production (Flege and Bohn, 2021; Sheldon and Strange, 1982). While this does not mean that the role of perceptual accuracy should be discounted, it implies that L2 speech production may be shaped by a combination of factors in addition to perceptual accuracy.
One such factor is the dynamic nature of the production of English liquids. Articulation of English liquids requires the coordination of multiple articulatory gestures for accurate production (Campbell et al., 2010; Sproat and Fujimura, 1993). English laterals, for instance, involve coordination of tongue tip and tongue dorsum gestures, and their timing and magnitude interact with syllabic position; a tongue tip gesture precedes a tongue dorsum gesture with a greater magnitude for clear /l/, whereas the two gestures could be timed synchronously for dark /l/ (Sproat and Fujimura, 1993). English rhotics show similar patterning of gestural timing and magnitude, where labial gestures precede the tongue tip and tongue body gestures (Campbell et al., 2010; Proctor et al., 2019). The dynamic nature of articulation in English liquids suggests that their acoustic characteristics are inherently non-static, and it is therefore often challenging to select a single point in time that adequately represents liquid quality (Kirkham et al., 2019; Ying et al., 2012).
In addition, acoustic realisations of liquids interact with neighbouring segments as a result of coarticulation. While coarticulation is often viewed as a consequence of the physiological mechanisms involved in the transition between segmental targets, some aspects of coarticulation may be language-specific and thus need to be learned (Beristain, 2022; Keating, 1985). Word-initial /ɹ/ in English, for instance, shows lower F3 values when followed by back vowels compared to other vowel conditions (King and Ferragne, 2020). Similarly, vowel context influences realisations of American English /l/, particularly word-initial /l/s, such that F2 values are higher in the /i/ context than in the /a/ context (Recasens, 2012). Coarticulatory effects of liquids could also extend beyond the domain of the liquid segment itself and provide a perceptual basis for listeners to distinguish English /l/ and /ɹ/ (West, 1999a,b).
The findings regarding the dynamic nature of liquid production and liquid-vowel coarticulation may account for the specific difficulties that L1 Japanese speakers have in producing English /l/ and /ɹ/. L1 Japanese speakers tend to substitute English /l/ and /ɹ/ with an alveolar tap or flap [ɾ], a canonical realisation of Japanese /r/ (Riney et al., 2000). Previous articulatory studies show that alveolar taps/flaps exhibit stronger coarticulatory effects with neighbouring vowels than English laterals and rhotics; while the tongue dorsum gesture is actively involved in the production of English /l/ and /ɹ/, taps and flaps [ɾ] show either less involvement of the tongue dorsum or a “stabilization” tongue dorsum gesture, resulting in stronger coarticulation with the vowel (Morimoto, 2020; Proctor, 2011; Recasens, 1991; Yamane et al., 2015). Furthermore, an x-ray study suggests that L1 Japanese speakers' articulation of English liquids shows greater variability according to the vocalic environment (Zimmermann et al., 1984). In sum, Japanese and English liquids differ in the way they are coarticulated with vowels, and it can be predicted that L1 Japanese speakers will exhibit liquid-vowel coarticulatory patterns different from those of L1 English speakers.
Despite these findings regarding the complexity involved in the production of English liquids, our understanding of the specific mechanisms whereby L1 Japanese speakers struggle to produce English /l/ and /ɹ/ remains relatively limited. This may be because previous research commonly evaluates liquid quality based on a single-point measurement, in which formant frequencies are measured at one point in time, such as the F3 minimum, the spectral onset, or the spectral release (Aoyama et al., 2019; Flege et al., 1995; Saito and Munro, 2014). Analysis of liquids based on a single measurement, however, inevitably averages out temporal information that may be important for understanding the dynamic characteristics of English liquids.
In the current study, I show that dynamic formant measurement of English liquids allows us to better understand the specific challenges that L1 Japanese speakers face in producing English /l/ and /ɹ/. Previous research suggests that (1) L1 Japanese speakers' acquisition of English liquids may be influenced by phonetic details, such as vowel environments, and (2) English liquids show dynamic characteristics and interactions with neighbouring vowels. Given these findings, I hypothesise that L1 Japanese speakers' production of English liquids will exhibit different dynamic acoustic properties compared to that of L1 English speakers. This study therefore asks what dynamic acoustic properties L1 Japanese speakers show in their production of English /l ɹ/ compared to L1 English speakers.
I combine static and dynamic analyses of the acoustic properties of English liquids in this study. The static analysis investigates the distance between the second and first formants (F2–F1) and the third formant (F3) extracted at the liquid midpoint. The inclusion of this measure allows me to discuss the results in light of previous research in which single-measurement analyses have been widely used (e.g., Aoyama et al., 2019; Flege et al., 1995; Saito and Munro, 2014; Saito and van Poeteren, 2018). In addition, the time-varying changes in F2–F1 and F3 values capture the complex nature of liquid acoustics and the coarticulatory interactions between the liquid and the vowel (Howson and Redford, 2021; Kirkham et al., 2019; Sproat and Fujimura, 1993).
II. METHODS
A. Participants
The data for the current study are obtained from 45 speakers: 31 L1 Japanese learners of English (17 female and 14 male) aged between 18 and 22 years [M = 19.81 years, standard deviation (SD) = 1.05] and 14 L1 North American English speakers (11 female and three male) aged between 21 and 43 years (M = 28.93 years, SD = 6.08).
All of the L1 Japanese speakers were undergraduate students recruited from two universities in Japan, located near the cities of Nagoya and Kobe, respectively. Their profile is considered typical of Japanese university students who study English as a foreign language; all of them studied English primarily through the school curriculum in primary and/or secondary school and continued it at the tertiary level, with a mean length of English study of 9.31 years (SD = 2.42). None of them had an extended stay in an English-speaking country, with the length of overseas experience ranging from none to 4.25 months (M = 0.77 months, SD = 1.35).
To evaluate the L1 Japanese speakers' L2 English proficiency, participants were asked to rate their own oral fluency on a seven-point scale, with 1 being “I do not speak English at all” and 7 being “No problems in using English in daily life.” Self-ratings were used because no common proficiency measure was available across participants: students had taken different kinds of English tests, and first-year students had not yet taken any. Nevertheless, judging from the test scores that some of the participants were able to provide and from observations by the researcher, who has experience in English language teaching in Japan, their English proficiency is considered to be lower to upper intermediate, which largely agrees with their subjective evaluation of their fluency in English (M = 3.84, SD = 1.10) (see supplementary material for further details about the participants).1
The 14 L1 English speakers identify themselves as fluent L1 speakers of North American English who grew up using English until 13 years of age. Five of them are from Canada and nine are from the United States. They resided in the United Kingdom (UK) at the time of recording; six were postgraduate students enrolled at a UK university and the rest worked for companies in the UK. The recruitment of L1 North American English speakers reflects the fact that American English tends to be chosen as a pedagogical model in English language teaching in Japan; it is therefore appropriate to compare L1 Japanese speakers' production with that of L1 North American English speakers (Setter and Jenkins, 2005).
B. Data collection
The audio recordings analysed in this study are a subset of the data collected for a larger study, in which both articulatory and acoustic data were obtained in a simultaneous high-speed ultrasound-audio recording setting. For this reason, the participants wore an ultrasound headset while recording the stimuli for the current study. The L1 North American English speakers were recorded in a sound-attenuated booth at universities in the UK, and the L1 Japanese speakers in a quiet room at universities in Japan. For some of the L1 Japanese speakers, however, there was minor background fan noise because of Covid-19 restrictions mandating air ventilation at the time of recording. Acoustic signals were pre-amplified, digitized, and recorded onto a laptop computer via a Sound Devices (Reedsburg, WI) USB-Pre2 audio interface at 44.1 kHz with 16-bit quantisation.
The participants were asked to sit in front of the laptop screen and read the stimulus words in isolation as they were displayed one by one orthographically using Articulate Assistant Advanced (AAA) software, version 220.4.1 (Articulate Instruments, Edinburgh, UK, 2022). No carrier phrases were used because (1) carrier phrases would have imposed additional difficulty on L1 Japanese speakers, especially those who were less proficient in English, and (2) the experiment had to be as short as possible due to time constraints in the data collection sessions.
In light of the language mode hypothesis (Grosjean, 2008), which holds that the language setting in an experiment can influence participants' speech perception and possibly production, the recording sessions for the L1 Japanese speakers were structured as follows. The first half of the experiment, including briefing, equipment setup, and recording of the Japanese words (not presented in this paper), was conducted with instructions given in Japanese. I then switched the language of instruction to English, and the participant engaged in a short English conversation activity. This included a semi-structured dialogue in which I asked the participants five simple questions (e.g., “What do you study?,” “What do you like the best about the university?,” etc.). Finally, the Japanese participants recorded the English words while I gave all the instructions in English. While it would have been theoretically desirable for an L1 English speaker to lead the data collection session for the English words, this was not feasible for reasons of time and room availability, given that each session for the L1 Japanese speakers took up to 90 min.
The recording sessions with the L1 North American English speakers did not require such considerations because they recorded English words only. All procedures were therefore conducted in English, and each session took up to approximately 60 min. The participants were compensated for their time and participation with 2000 Japanese yen or 15 British pounds sterling in the form of cash or vouchers, in accordance with the regulations at each of the recording venues. The research project was reviewed and approved by the ethics committees at Lancaster University, Kobe Gakuin University, and Meijo University. Informed consent to take part in the study was obtained in written form from all participants.
C. Materials
Word-initial English /l/ and /ɹ/ were elicited from 16 monosyllabic words (eight minimal pairs), in which the liquid was followed by the close front vowel /i/, the open front vowel /æ/, or the close back vowel /u/ (see Table I). The coda consonants were restricted to bilabials /p b m/ or labiodentals /f v/ to minimise anticipatory coarticulatory effects on the word-initial liquids. All target words were checked against the Longman Pronunciation Dictionary (Wells, 2008) to ensure that they have the intended vowel environment in American English.
D. Segmentation and data processing
Prior to segmentation, audio recordings were low-pass filtered at 11 000 Hz and downsampled to 22 050 Hz. Automatic segmentation was carried out at the phoneme level with the Montreal Forced Aligner (MFA), version 2.0.6 (McAuliffe et al., 2017). I then inspected the aligned data visually and manually corrected the segmentation in Praat (Boersma and Weenink, 2022) where necessary.
I classified the liquid tokens into two broad categories, approximants and non-approximants, based on spectrographic representations aided by auditory impressions. This decision reflects the consideration that the L1 Japanese speakers' production of liquids might show a wide range of variation due to the allophonic variation of Japanese /r/ and their articulatory strategies for English /l/ and /ɹ/. Realisations of Japanese /r/ include sounds other than English-type liquids, such as the canonical tap [ɾ], a retroflex flap [ɽ], a retroflex lateral approximant [ɭ], and a lateral flap [ɺ] (Akamatsu, 1997; Arai, 2013). Speakers may also use a single strategy or produce reversed realisations for English /l ɹ/. It could be the case, for instance, that they produce a lateral liquid for both English /l ɹ/; it is also possible that they use [l] for English /ɹ/ and [ɹ] for English /l/. Classification into these two broad categories therefore guides the choice of an appropriate type of analysis while maximising the chance of capturing diverse acoustic properties in the L1 and L2 English liquids.
Based on these considerations, I first broadly labelled tokens as approximants if the liquid token in question showed a vowel-like formant structure (Ladefoged and Johnson, 2010). The spectral analysis focuses only on the tokens classified as approximants; it thus excludes 281 non-approximant tokens (e.g., taps or flaps [ɾ]) out of a total of 2914 tokens, leaving 2633 tokens for further processing. Spectrographic examples of an approximant and a non-approximant token are shown in Figs. 1 and 2.
Following this, I segmented the liquid approximant tokens based on the primary cues of a steady state, or an approximately steady state, of F2 and an abrupt change in amplitude in the waveform (Lawson et al., 2011). Laterals and rhotics in English involve various stages, including the transition into the liquid, the steady state, and the transition into the following vowel (Carter and Local, 2007). The current study uses the steady-state portion to define the liquid, as in previous studies (Flege et al., 1995; Kirkham, 2017). Although the liquid steady state is an approximation given the various stages involved in liquid acoustics mentioned above, this issue is mitigated in the dynamic analysis because it shows holistic time-varying trajectories across the liquid and vowel.
E. Acoustic analysis
This study analyses 2306 liquid tokens for the midpoint analysis and 2515 liquid-vowel tokens for the dynamic analysis. The detailed breakdown is shown in Table II. The current study compares two acoustic parameters between L1 Japanese and L1 English speakers' production of English liquids: (1) the distance between the second (F2) and first (F1) formants (F2–F1) and (2) the third formant (F3). F2–F1 is used as a measure of acoustic liquid quality; lower F2–F1 values can be related to darker realisations of liquids, resulting from a greater degree of tongue retraction (Howson and Redford, 2021; Sproat and Fujimura, 1993). F3 is a primary acoustic dimension that distinguishes English /l/ and /ɹ/, and previous research reports robust differences between L1 Japanese and L1 English speakers' production of English liquids along this dimension.
TABLE II. Number of tokens analysed by speaker group, liquid, and vowel context.

Vowel context | /i/ | /æ/ | /u/
---|---|---|---
L1 English | | |
Liquidᵃ | 155 / 187 | 188 / 173 | 119 / 130
Liquid-vowelᵇ | 177 / 197 | 199 / 192 | 130 / 134
L1 Japanese | | |
Liquidᵃ | 205 / 246 | 298 / 286 | 149 / 170
Liquid-vowelᵇ | 231 / 284 | 312 / 310 | 169 / 180
ᵃ /l/ tokens on the left; /ɹ/ tokens on the right.
ᵇ /l/+vowel tokens on the left; /ɹ/+vowel tokens on the right.
F1, F2, and F3 values were estimated and extracted with Fast Track, an automatic formant estimation plug-in for Praat (Barreda, 2021). Fast Track samples formant frequencies every 2 ms throughout the interval, resulting in smooth trajectories for F1 through F3. It then outputs the estimated formant frequencies aggregated into a specified number of bins; the current analysis uses 11 data points over the liquid-vowel interval for each formant trajectory. The advantage of Fast Track is that it performs multiple-step formant estimations while adjusting the maximum formant frequency and obtains the “winning” analysis based on regression analyses predicting formant frequency as a function of time (Barreda, 2021). This improves formant estimation accuracy by allowing different formant frequency ranges according to speakers' age and gender.
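For illustration only, the following R sketch shows one way of reducing a densely sampled formant track to 11 equidistant points. It is a simplified stand-in using linear interpolation rather than Fast Track's own bin-based aggregation, and the object names and toy values are hypothetical.

# Hypothetical F3 track sampled every 2 ms over a 120 ms liquid-vowel interval
time_ms <- seq(0, 120, by = 2)
f3_hz <- 2200 + 400 * sin(pi * time_ms / 120)  # toy trajectory for illustration only

# Reduce to 11 equidistant time points (a stand-in for Fast Track's binned output)
points_11 <- approx(x = time_ms, y = f3_hz, xout = seq(0, 120, length.out = 11))
round(points_11$y)  # 11 values summarising the trajectory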
In the current study, the female and male speakers were analysed separately with different ranges for the upper formant frequency ceiling: 5000–7000 Hz for female speakers and 4500–6500 Hz for male speakers. Fast Track then performs 24 formant estimation steps with varying upper-frequency ceilings and estimates the formant frequencies at 11 equidistant points during (1) the liquid and (2) the liquid-vowel intervals, with a 25 ms window padded before and after the segment. After formant tracking, formant estimation errors can be corrected based on visual inspection of the 24-step analyses. Using this facility, I visually inspected all the tokens one by one and either improved the formant measurement by nominating a different winning analysis or omitted the token when none of the analyses looked reasonable. At this visual inspection stage, 118 of the 2633 tokens (see Sec. II D) were excluded due to poor formant estimation accuracy.
Finally, Fast Track automatically omits tokens shorter than 30 ms, as formant estimation can be challenging for extremely short tokens. As a result, 209 tokens were excluded from the dataset for the static analysis, leaving 2306 tokens for the static analysis and 2515 tokens for the dynamic analysis. The difference in token numbers reflects the fact that more liquid-only tokens were omitted automatically by Fast Track, as they were inevitably shorter than the liquid-vowel intervals (see supplementary material for the data processing procedure described here).1
F. Statistical analysis
All statistical analyses were performed using R version 4.2.2 (R Core Team, 2022), and data visualisation was performed using the tidyverse suite (Wickham et al., 2019). Prior to the statistical analysis, the formant values were transformed to the Bark scale using the bark function in the emuR package to allow for cross-speaker comparisons (Jochim et al., 2023).
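A minimal sketch of this conversion step is given below, assuming a data frame with per-sample formant columns in Hz; the data frame and column names are hypothetical, and computing the F2–F1 difference after conversion is one plausible reading of “Bark F2–F1,” not necessarily the exact procedure used in the study.

library(emuR)   # provides bark() for Hz-to-Bark conversion
library(dplyr)

# "liquids" is a hypothetical data frame with F1, F2, and F3 columns in Hz
liquids <- liquids %>%
  mutate(F2_F1_bark = bark(F2) - bark(F1),  # Bark-scaled F2-F1 distance
         F3_bark = bark(F3))                # Bark-scaled F3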
For the static analysis, Bark-converted F2–F1 (Bark F2–F1) and F3 (Bark F3) at the liquid midpoint were modelled using linear mixed-effects models (LMEs) fitted with the lme4::lmer function (Bates et al., 2015). Separate models were constructed for /l/ and /ɹ/. The fixed effects included (1) the speaker's first language (L1: i.e., English vs Japanese), (2) vowel context (vowel), and (3) the speaker's gender (gender). No interactions were included because initial explorations suggested that the current dataset does not have the statistical power to detect them.
Furthermore, an anonymous reviewer suggested classifying the participants into groups according to their English proficiency and including this variable in the analysis. Following this suggestion, I classified the participants into four groups based on the distribution of their subjective fluency rating scores. L1 Japanese speakers were classified into advanced (rating 5–6, n = 7), intermediate (rating 4, n = 14), and beginner (rating 1–3, n = 10) groups. L1 English speakers constitute a group of their own (L1 English; rating 7, n = 14). The L1 English speaker group, however, confounds the proficiency variable with the L1 variable, making the inclusion of the proficiency variable problematic. The issue is manifested in rank-deficiency warnings for the LMEs when both L1 and proficiency are included in the same model, indicating that two or more variables are not linearly independent of each other. A further analysis using the caret::findLinearCombos function confirms the collinearity between L1 and proficiency and suggests excluding the L1 English level from the proficiency variable.
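The following sketch illustrates this kind of collinearity diagnosis on a fixed-effects design matrix. The data frame speakers and its columns L1 and proficiency are hypothetical stand-ins for the variables described above, not the study's actual objects.

library(caret)

# Dummy-code the two predictors; the "L1 English" proficiency level is fully
# determined by the L1 variable, so the design matrix is rank-deficient
X <- model.matrix(~ L1 + proficiency, data = speakers)

findLinearCombos(X)
# $linearCombos lists the linearly dependent column sets;
# $remove suggests which columns to drop to restore full rank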
For this reason, I perform a separate analysis focussing only on the L1 Japanese speakers' data to investigate the effects of proficiency and summarise the results at the end of the static analysis. I have included L1 Japanese speakers only here because inclusion of L1 English speakers might reduce the magnitude of between-group differences among L1 Japanese speakers. The visualisation includes L1 English speakers' data only for the purpose of comparison. I will not explore this extensively as this is not the main focus of the study (see supplementary material for further details of the analysis and results).1
The random-effects structure for the linear models included by-participant random intercepts and random slopes for vowel context, as well as by-word random intercepts. The following specification is therefore used for the four final models (i.e., models predicting Bark F2–F1 and Bark F3 for /l/ and /ɹ/):
lmer(Bark F2–F1 or Bark F3 ∼ L1 + vowel + gender + (1 | word) + (1 + vowel | speaker)).
The significance of the fixed effects was tested via likelihood ratio testing by comparing the full model and the nested model excluding the fixed effect in question (Winter, 2020). If the full model significantly improved the model fit, I concluded that the main effect significantly influenced the outcome variable. The patterns associated with the vowel contexts are interpreted via data visualisation for the sake of model simplicity (see supplementary material for additional statistical comparisons).1
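A sketch of this likelihood-ratio test for the L1 effect is given below (the same template applies to vowel and gender); the data frame lateral_mid and the outcome column F2_F1_bark are hypothetical names, not the study's actual objects.

library(lme4)

full <- lmer(F2_F1_bark ~ L1 + vowel + gender +
               (1 | word) + (1 + vowel | speaker),
             data = lateral_mid, REML = FALSE)
no_L1 <- update(full, . ~ . - L1)  # nested model excluding the fixed effect in question

anova(no_L1, full)  # likelihood-ratio test reporting a chi-squared statistic and p value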
Second, the dynamic formant analysis used generalised additive mixed models (GAMMs) fitted with the mgcv::bam function (Wood, 2017). Non-linear differences between contours can be evaluated in terms of the height and shape of the trajectories; the height dimension can be modelled via parametric terms, and the shape dimension via so-called smooth terms that specify the degree of wiggliness of the contours (Sóskuthy et al., 2018). Differences between a set of contours can also be modelled directly by incorporating a reference smooth (i.e., a contour at the reference level) and a difference smooth (i.e., a term that models the degree of by-group difference between contours) (Sóskuthy, 2017). For more details about GAMMs, the reader is referred to existing tutorial papers (e.g., Sóskuthy, 2017; Sóskuthy et al., 2018; Wieling, 2018).
In the current study, I focus on differences in trajectory height and shape between the speaker groups (i.e., English vs Japanese). Separate models were constructed for each liquid-vowel pairing. Each model predicts the formant values, either Bark F2–F1 or Bark F3, from parametric terms for the speaker's first language and gender, as well as a time-varying reference smooth, a time-varying by-L1 difference smooth, and a time-varying by-gender difference smooth. It also includes time-by-speaker and time-by-word random smooths.
Note, again, that English proficiency was not included in the GAMM models together with L1, as this resulted in inaccurate predictions of the formant trajectories compared to visualisations of the raw data. Instead, as in the linear mixed-effects model analysis, I conducted a separate analysis of the effects of proficiency using the L1 Japanese speakers' data only and summarise the relevant results at the end of the dynamic analysis. The choice of including only L1 Japanese speakers reflects the consideration that L1 English speakers' trajectories may differ in both shape and height, which would make it difficult to interpret whether statistically significant differences result from speakers' L1 or from L1 Japanese speakers' proficiency. This is clear in the visualisations in Figs. 9 and 10, in which L1 English speakers' trajectories are distinct from those of the three groups of L1 Japanese speakers (see supplementary material for further details).1
Residual autocorrelation in the trajectories was corrected using an autoregressive error model (AR(1) model). The autoregressive parameter (rho: ρ) was set to the amount of autocorrelation at lag 1 in the model, estimated using the itsadug package (van Rij et al., 2020). While this is usually an adequate estimate, the residual autocorrelations were negative in some cases, indicating that a lower value would be optimal (Sóskuthy et al., 2018; Wieling, 2018). In such cases, a new rho value was determined by exploring a range of values and visualising the autocorrelation at lag 1 for each rho value. The final model specification across the 12 models (two outcome variables, i.e., Bark F2–F1 and Bark F3, for two liquids, i.e., /l/ and /ɹ/, in three vowel contexts, i.e., /æ/, /i/, and /u/) is
bam(Bark F2–F1 or Bark F3 ∼ L1 + gender + s(time, bs = "cr") + s(time, by = L1, bs = "cr") + s(time, by = gender, bs = "cr") + s(time, speaker, bs = "fs", xt = "cr", m = 1) + s(time, word, bs = "fs", xt = "cr", m = 1), method = "ML").
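The AR(1) correction described above could be incorporated along the following lines, e.g., via itsadug::start_value_rho. This is a sketch under stated assumptions, not the study's actual script: the initial model m0, the data frame traj, and its start_of_trajectory column (a logical flag marking the first sample of each token) are hypothetical.

library(mgcv)
library(itsadug)

rho_est <- start_value_rho(m0)  # lag-1 autocorrelation of the residuals of an initial fit m0

# L1 and gender are assumed to be coded as ordered factors so that the
# by-terms act as difference smooths relative to the reference smooth
m1 <- bam(F2_F1_bark ~ L1 + gender +
            s(time, bs = "cr") +
            s(time, by = L1, bs = "cr") +
            s(time, by = gender, bs = "cr") +
            s(time, speaker, bs = "fs", xt = "cr", m = 1) +
            s(time, word, bs = "fs", xt = "cr", m = 1),
          data = traj, method = "ML",
          rho = rho_est, AR.start = traj$start_of_trajectory)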
Trajectory height and shape were compared through model comparisons using the itsadug::compareML function, following previous research (Kirkham et al., 2019; Sóskuthy, 2017; Sóskuthy et al., 2018), as follows:
(1) I first compared the full model and a nested model excluding both the parametric and the smooth terms associated with the speaker's L1 or gender. This allows a comparison of the overall differences associated with these effects, in both height and shape, between the two contours.

(2) If the above comparison showed a significantly improved model fit for the full model, I then compared the full model and a nested model including the parametric term of L1 or gender but still excluding the by-L1 or by-gender difference smooth. This tests whether the two contours differ significantly in shape.
If the full model still provided a better fit in procedure 2 above, I concluded that both trajectory height and shape differed at a statistically significant level. If the full model improved the model fit in procedure 1 but not in procedure 2, then there was a difference in trajectory height only. Otherwise, I concluded that there was little evidence that the two trajectories are significantly different.
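A sketch of this two-step comparison for the L1 effect is given below, continuing from the full model m1 and the hypothetical objects in the preceding sketch. The nested models carry over the same AR(1) settings so that the comparisons are like-for-like; this is an illustration, not the study's actual script.

library(mgcv)
library(itsadug)

# Step 1: drop both the parametric term and the difference smooth for L1
m_no_L1 <- bam(F2_F1_bark ~ gender +
                 s(time, bs = "cr") +
                 s(time, by = gender, bs = "cr") +
                 s(time, speaker, bs = "fs", xt = "cr", m = 1) +
                 s(time, word, bs = "fs", xt = "cr", m = 1),
               data = traj, method = "ML",
               rho = rho_est, AR.start = traj$start_of_trajectory)
compareML(m1, m_no_L1)       # overall difference in height and/or shape

# Step 2: keep the parametric L1 term but drop the by-L1 difference smooth
m_L1_height <- bam(F2_F1_bark ~ L1 + gender +
                     s(time, bs = "cr") +
                     s(time, by = gender, bs = "cr") +
                     s(time, speaker, bs = "fs", xt = "cr", m = 1) +
                     s(time, word, bs = "fs", xt = "cr", m = 1),
                   data = traj, method = "ML",
                   rho = rho_est, AR.start = traj$start_of_trajectory)
compareML(m1, m_L1_height)   # remaining difference attributable to trajectory shape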
III. RESULTS
A. Liquid static analysis
In this section, I first present the liquid midpoint analysis of F2–F1 and F3 using LMEs in order to investigate the overall trends in liquid quality. The static analysis tests the main effects of L1, vowel, and gender, while the liquid-vowel interactions are interpreted via data visualisation. Note that the model intercept corresponds to female L1 English speakers in the /æ/ context; gender is mentioned only where the gender effect is discussed (see supplementary material for an additional analysis of vowel midpoints).1
1. F2–F1 midpoint
The model summaries for the F2–F1 models are shown in Table III. The lateral F2–F1 model predicts that L1 Japanese speakers produce laterals with a higher F2–F1 (8.83 Bark) than L1 English speakers (6.74 Bark). F2–F1 for laterals varies slightly according to the vowel context; it is highest in the /i/ context (8.02 Bark on average), followed by /u/ (7.54 Bark) and /æ/ (6.74 Bark). Male speakers produce laterals with lower F2–F1 values (6.06 Bark).
TABLE III. Summary of the LMEs predicting Bark F2–F1 at the liquid midpoint.

Variable | β | SE | t | p(χ²)
---|---|---|---|---
Lateral /l/ | | | |
Intercept | 6.74 | 0.33 | 20.36 |
L1 | | | | <0.001
Japanese | 1.99 | 0.38 | 5.25 |
Vowel | | | | <0.001
/i/ | 1.28 | 0.16 | 8.23 |
/u/ | 0.80 | 0.18 | 4.50 |
Gender | | | | 0.072
Male | −0.68 | 0.36 | −1.86 |
Rhotic /ɹ/ | | | |
Intercept | 6.38 | 0.34 | 18.53 |
L1 | | | | <0.001
Japanese | 1.86 | 0.40 | 4.68 |
Vowel | | | | <0.001
/i/ | 1.15 | 0.16 | 7.09 |
/u/ | 0.60 | 0.13 | 4.58 |
Gender | | | | 0.070
Male | −0.75 | 0.38 | −1.97 |
The rhotic F2–F1 model predicts that L1 English speakers produce rhotics in the /æ/ context at 6.38 Bark, whereas L1 Japanese speakers produce them at 8.24 Bark overall. It also predicts higher F2–F1 overall in the /i/ context (7.53 Bark) and in the /u/ context (6.98 Bark) than in the /æ/ context. As with the laterals, male speakers produce rhotics with lower F2–F1 values (5.63 Bark).
Overall, L1 Japanese speakers produce both English /l/ and /ɹ/ with consistently higher F2–F1 than L1 English speakers across vowel contexts (Fig. 3), and this is supported by the significant main effect of L1 for both /l/ [χ²(1) = 17.58, p < 0.001] and /ɹ/ [χ²(1) = 15.68, p < 0.001]. The main effect of vowel is also significant for both /l/ [χ²(2) = 22.74, p < 0.001] and /ɹ/ [χ²(1) = 22.35, p < 0.001]. While male speakers produce liquids with lower F2–F1 values than female speakers, this difference is not statistically significant for either laterals [χ²(1) = 3.23, p = 0.073] or rhotics [χ²(1) = 3.28, p = 0.070].
2. F3 midpoint
The model summaries for the F3 models are shown in Table IV. The lateral F3 model predicts that L1 English speakers produce F3 at 15.83 Bark for /l/, while L1 Japanese speakers have a slightly lower F3 at 15.54 Bark. Although model comparisons suggest a significant effect of vowel for /l/ [χ²(2) = 13.05, p = 0.001], the difference is quite minor; the model predicts 15.65 Bark for /l/ in the /i/ context and 15.44 Bark in the /u/ context. Finally, female speakers produce laterals with F3 values 1.12 Bark higher than male speakers overall.
TABLE IV. Summary of the LMEs predicting Bark F3 at the liquid midpoint.

Variable | β | SE | t | p(χ²)
---|---|---|---|---
Lateral /l/ | | | |
Intercept | 15.83 | 0.18 | 89.35 |
L1 | | | | 0.016
Japanese | −0.29 | 0.20 | −1.44 |
Vowel | | | | 0.001
/i/ | −0.18 | 0.08 | −2.12 |
/u/ | −0.39 | 0.08 | −4.82 |
Gender | | | | <0.001
Male | −1.12 | 0.19 | −5.79 |
Rhotic /ɹ/ | | | |
Intercept | 12.17 | 0.25 | 48.56 |
L1 | | | | <0.001
Japanese | 1.88 | 0.26 | 7.18 |
Vowel | | | | 0.001
/i/ | 0.37 | 0.08 | 4.47 |
/u/ | 0.04 | 0.10 | 0.41 |
Gender | | | | <0.001
Male | −1.15 | 0.25 | −4.53 |
The rhotic F3 model predicts that L1 English speakers produce F3 at 12.17 Bark for /ɹ/, whereas L1 Japanese speakers produce a higher F3 at 14.05 Bark. As with the laterals, slight differences are found for /ɹ/ in the /i/ and /u/ contexts compared to /æ/; the model predicts 12.54 Bark in the /i/ context and 12.21 Bark in the /u/ context. The main effect of vowel is also significant here [χ²(2) = 13.78, p = 0.001].
While the main effect of vowel influences the F3 values only slightly for both /l/ and /ɹ/, the effects of L1 are significant for /ɹ/ [χ²(1) = 30.62, p < 0.001] but not for /l/ [χ²(1) = 1.97, p = 0.161]. Figure 4 seems to suggest a bimodal distribution in F3 (Bark) for L1 English speakers, especially for /l/ in the /i/ and /u/ contexts. This seems to result from gender-related differences, in which male speakers produced liquids with lower F3 values than female speakers. Indeed, the effects of gender are statistically significant for both laterals [χ²(1) = 22.70, p < 0.001] and rhotics [χ²(1) = 15.87, p < 0.001].
3. Effects of L2 proficiency on the midpoint formant measurement
In addition to the main analysis, the effects of proficiency were tested for the three groups of L1 Japanese speakers. Grouping is based on their subjective fluency ratings: beginner (n = 10, rating 1–3), intermediate (n = 14, rating 4), and advanced (n = 7, rating 5–6). As in the main analysis, separate LMEs were specified in which Bark F2–F1 or Bark F3 is predicted by fixed effects of proficiency, vowel, and gender, with by-item random intercepts and by-speaker random intercepts and slopes for vowel. The results are visualised in Figs. 5 and 6.
The F2–F1 models suggested statistically significant effects of proficiency on Bark F2–F1 for /ɹ/ [χ²(2) = 7.52, p = 0.002], in which the advanced L1 Japanese learners of English produce rhotics with lower F2–F1 than those in the beginner and intermediate groups. No statistically significant proficiency effects are found for /l/ [χ²(2) = 0.12, p = 0.94]. For Bark F3, no statistically significant effects of proficiency are found for either /l/ [χ²(2) = 0.81, p = 0.67] or /ɹ/ [χ²(2) = 0.057, p = 0.97].
4. Summary: Static analysis
L1 Japanese speakers produce higher F2–F1 than L1 English speakers for both /l/ and /ɹ/ across vowel contexts. Their F3 values for /l/ are only slightly lower than those of L1 English speakers, whereas they produce /ɹ/ with higher F3 across vowel contexts. Male speakers produce liquids with lower F2–F1 and F3 values, particularly for F3. Finally, L1 Japanese speakers in the advanced group produced lower F2–F1 for /ɹ/ than the other groups.
B. Dynamic analysis
The dynamic analysis in this section focuses on variation in F2–F1 and F3 trajectories across the liquid-vowel interval using GAMMs. In the visualisation of the liquid-vowel trajectories (Figs. 7 and 8), the liquid portion corresponds roughly to the first third of the interval, whereas the vowel corresponds to the remaining two-thirds. Note that the visualisations show predictions based on the full models.
1. F2–F1 liquid-vowel trajectory
The results of the model comparisons for the F2–F1 dynamic analysis are shown in Table V for laterals and Table VI for rhotics. The visualisations are shown in Fig. 7. The model comparisons show that the height and shape of the F2–F1 trajectories are significantly different between L1 English and L1 Japanese speakers for both liquids in all vowel contexts. The visualisations of the GAMMs show that the trajectories for L1 English and L1 Japanese speakers are similar in the /i/ context (the middle panels in Fig. 7) but look quite different in the /æ/ (left) and /u/ (right) contexts. L1 English speakers follow a similar tendency across the vowel contexts such that they start from lower F2–F1 values at the onset of the liquid, showing an increase towards the vowel target and a slight decrease towards the offset of the vowel.
TABLE V. GAMM model comparisons for the F2–F1 lateral-vowel trajectories.

Comparison | χ² | df | p(χ²)
---|---|---|---
/l/: /æ/ context | | |
Overall: L1 | 69.88 | 3 | <0.001
Shape: L1 | 66.63 | 2 | <0.001
Overall: gender | 6.67 | 3 | 0.004
Shape: gender | 1.10 | 2 | 0.333
/l/: /i/ context | | |
Overall: L1 | 16.54 | 3 | <0.001
Shape: L1 | 9.68 | 2 | <0.001
Overall: gender | 18.91 | 3 | <0.001
Shape: gender | 0.34 | 2 | 0.712
/l/: /u/ context | | |
Overall: L1 | 25.41 | 3 | <0.001
Shape: L1 | 23.67 | 2 | <0.001
Overall: gender | 4.02 | 3 | 0.045
Shape: gender | 0.07 | 2 | 0.929
TABLE VI. GAMM model comparisons for the F2–F1 rhotic-vowel trajectories.

Comparison | χ² | df | p(χ²)
---|---|---|---
/ɹ/: /æ/ context | | |
Overall: L1 | 53.57 | 3 | <0.001
Shape: L1 | 45.94 | 2 | <0.001
Overall: gender | 4.10 | 3 | 0.042
Shape: gender | 0.06 | 2 | 0.938
/ɹ/: /i/ context | | |
Overall: L1 | 39.40 | 3 | <0.001
Shape: L1 | 24.09 | 2 | <0.001
Overall: gender | 21.90 | 3 | <0.001
Shape: gender | 0.33 | 2 | 0.723
/ɹ/: /u/ context | | |
Overall: L1 | 21.62 | 3 | <0.001
Shape: L1 | 17.83 | 2 | <0.001
Overall: gender | 4.00 | 3 | 0.046
Shape: gender | 0.02 | 2 | 0.985
L1 Japanese speakers, on the other hand, show distinct trajectory patterns depending on the vowel context. In the /i/ context, their trajectories follow a similar tendency to that of L1 English speakers, but with an earlier rise from the liquid onset towards the vowel, resulting in a consistently higher trajectory than that of L1 English speakers in the first half of the interval. In the /æ/ context, by contrast, the L1 Japanese speakers show the opposite pattern to L1 English speakers: F2–F1 values peak during the first third of the interval and then decrease towards the vowel, with a small rise towards the end of the interval. Finally, the L1 Japanese speakers' trajectories in the /u/ context show smaller fluctuations than those of L1 English speakers; the trajectory shows an almost linear, monotonic decrease in this vowel context.
Differences associated with gender are statistically significant for trajectory height but not for shape for both laterals and rhotics across the vowel contexts. This suggests essentially linear differences between female and male speakers' trajectories, with female speakers showing consistently higher trajectories than male speakers, as is evident in Fig. 7.
2. F3 liquid-vowel trajectory
The model comparisons for F3 are shown in Table VII for laterals and in Table VIII for rhotics. The visualisations are shown in Fig. 8. The lateral-vowel trajectories (the top half of Fig. 8) show similarities between L1 English and L1 Japanese speakers. The model comparisons suggest that, while the trajectory shape and height are different between L1 English and L1 Japanese speakers in the /i/ context, the trajectories in the /æ/ and /u/ contexts are not statistically significantly different, with the L1 Japanese speakers' trajectories being slightly lower, especially in the first half of the interval.
TABLE VII. GAMM model comparisons for the F3 lateral-vowel trajectories.

Comparison | χ² | df | p(χ²)
---|---|---|---
/l/: /æ/ context | | |
Overall: L1 | 3.12 | 3 | 0.100
Shape: L1 | — | — | —
Overall: gender | 17.57 | 3 | <0.001
Shape: gender | 1.22 | 2 | 0.295
/l/: /i/ context | | |
Overall: L1 | 4.43 | 3 | 0.031
Shape: L1 | 2.53 | 2 | 0.080
Overall: gender | 33.71 | 3 | <0.001
Shape: gender | 5.67 | 2 | 0.003
/l/: /u/ context | | |
Overall: L1 | 1.81 | 3 | 0.306
Shape: L1 | — | — | —
Overall: gender | 29.91 | 3 | <0.001
Shape: gender | 0.00 | 2 | 1.000
TABLE VIII. GAMM model comparisons for the F3 rhotic-vowel trajectories.

Comparison | χ² | df | p(χ²)
---|---|---|---
/ɹ/: /æ/ context | | |
Overall: L1 | 17.36 | 3 | <0.001
Shape: L1 | 10.32 | 2 | <0.001
Overall: gender | 8.26 | 3 | <0.001
Shape: gender | 1.05 | 2 | 0.350
/ɹ/: /i/ context | | |
Overall: L1 | 43.55 | 3 | <0.001
Shape: L1 | 26.89 | 2 | <0.001
Overall: gender | 22.40 | 3 | <0.001
Shape: gender | 3.21 | 2 | 0.041
/ɹ/: /u/ context | | |
Overall: L1 | 27.42 | 3 | <0.001
Shape: L1 | 8.31 | 2 | <0.001
Overall: gender | 25.96 | 3 | <0.001
Shape: gender | 2.87 | 2 | 0.057
Even in the lateral-/i/ context, where differences in trajectory height and shape are statistically significant, a closer look at the GAMM model summaries and comparisons suggests that the difference between the two trajectories is marginal. Neither the parametric term nor the difference smooth associated with L1 was statistically significant in the model summary [β = 0.18, standard error (SE) = 0.10, t = 1.83, p = 0.07 for the parametric term; F(6.05) = 1.79, p = 0.09 for the difference smooth]. The model comparison also suggests only a marginal improvement in the Akaike Information Criterion (AIC) values (1561.42 for the full model and 1565.84 for the nested model). Figure 8 also shows that the 95% confidence intervals of the two trajectories overlap substantially throughout the liquid-vowel interval.
The /ɹ/-vowel trajectories for F3, on the other hand, show statistically significant differences in both trajectory height and shape in all vowel contexts, although both L1 English and L1 Japanese speakers share a similar trend in the visualisation in Fig. 8. Both groups show lower F3 values at the liquid onset, which then increase towards the vowel, where the two groups' trajectories seem to converge. L1 Japanese speakers' trajectories are overall flatter and higher than those of L1 English speakers across all vowel contexts.
Finally, similarly to the F2–F1 results, the gender effect seems to be statistically significant only for the trajectory height. This again suggests that the difference between trajectories for female and male speakers is close to linear (see Fig. 8).
3. Effects of L2 proficiency on formant trajectories
As in the static analysis, the effects of L1 Japanese speakers' proficiency were tested separately from the main analysis. For each liquid-vowel pairing, the model predicts Bark F2–F1 or Bark F3 from parametric terms of proficiency and gender, a time-varying reference smooth, a time-varying by-proficiency difference smooth, and a time-varying by-gender difference smooth. Random effects are accounted for by time-by-speaker and time-by-word random smooths. The visualisations are shown in Figs. 9 and 10; note that the predictions shown in these figures are based on models excluding the parametric and smooth terms associated with gender, because the plots would otherwise be too crowded to interpret.
The analyses for Bark F2–F1 suggest a statistically significant effect of proficiency on trajectory height for /ɹ/ in the /u/ context [χ²(6) = 10.24, p = 0.002], in which the F2–F1 trajectory for the advanced group is lower than those for the beginner and intermediate groups. The visualisation in Fig. 9, however, shows that the trajectory shape is quite different between L1 English speakers and the advanced L1 Japanese speakers. For Bark F3, no statistically significant effects of proficiency are found for either /l/ or /ɹ/ for the L1 Japanese speakers.
4. Summary: Dynamic analysis
The dynamic analysis shows substantial variability in the liquid-vowel realisations between L1 English and L1 Japanese speakers. Shape and height are significantly different for the F2–F1 trajectories for both /l/ and /ɹ/, with differences associated not only with the liquid portion corresponding to the first third of the interval but also with the transition patterns into the vowel. The F3 trajectories for /l/ are largely comparable between L1 English and L1 Japanese speakers, with little evidence of statistically significant differences. The F3 trajectories for /ɹ/, on the other hand, differ substantially in the first half of the interval corresponding to the liquid portion. The effects of gender are manifested almost exclusively in trajectory height, indicating an essentially linear difference between the trajectories of female and male speakers. Although the advanced L1 Japanese speakers produced lower F2–F1 trajectories in the /ɹ/-/u/ context than the beginner and intermediate groups, their trend is quite different from that of L1 English speakers.
IV. DISCUSSION
A. Spectro-temporal variability in L2 English liquids
The current paper aims to capture time-varying acoustic properties of English liquids produced by L1 English and L1 Japanese speakers. It combines two analyses of F2–F1 and F3: a static analysis at the liquid midpoint and a dynamic analysis over the liquid-vowel interval. The liquid midpoint analysis suggests that L1 Japanese speakers consistently produce higher F2–F1 for both English /l/ and /ɹ/ and higher F3 for /ɹ/ than L1 English speakers across vowel contexts. The dynamic analysis, on the other hand, shows that the between-L1 differences are non-linear, highlighting the complexity associated with the production of liquids and liquid-vowel coarticulation.
Comparing the effects of speaker gender and L1 demonstrates the importance of dynamic information in the liquid-vowel sequences. The static analysis shows that male speakers generally produce English liquids with lower F2–F1 and F3 frequencies than female speakers, and the gender difference is statistically significant for F3. The dynamic analysis further shows that the spectral difference between female and male speakers is essentially linear; GAMM model comparisons suggest statistically significant differences in trajectory height but not in trajectory shape, and it is clear from the visualisations in Figs. 7 and 8 that the differences between female and male speakers' trajectories are (almost) linear.
The dynamic differences associated with speaker L1, on the other hand, paint a much more complicated picture. While the time-varying analysis of F3 for /l/ indicates little difference between L1 English and L1 Japanese speakers, the F3 values for /ɹ/ show a clear between-L1 difference in the first half of the interval, indicating differences in the acoustic realisation of the liquid and the transition into the vowel. Also, the trajectory shape associated with L1 Japanese speakers' /ɹ/ is flatter, resulting in a smaller distinction between /ɹ/ and the vowels. The two language groups also differ slightly in the point in time at which F3 reaches its maximum, such that L1 Japanese speakers seem to achieve the vowel target earlier than L1 English speakers do.
The F2–F1 trajectories further highlight the non-linear between-L1 differences (Fig. 7). In particular, L1 Japanese speakers show distinct trajectory patterns across vowel contexts, suggesting that their production of English liquids is subject to greater influence from the following vowel than that of L1 English speakers. The liquid-/i/ trajectories, for example, suggest that L1 Japanese speakers reach the vowel target earlier than L1 English speakers, given the early onset of the plateau, despite a similar overall trajectory pattern. The linear trend of the liquid-/u/ trajectories also indicates that L1 Japanese speakers do not clearly distinguish the liquid and the vowel on the F2–F1 dimension.
The separate static analyses of the effects of L1 Japanese speakers' English proficiency demonstrated that advanced L1 Japanese-speaking learners of English produced lower F2–F1 values for /ɹ/ than the other two groups. Given that L1 English speakers produced lower F2–F1 values for /ɹ/ at the liquid midpoint, the findings support the previous claims that English /ɹ/ is easier for L1 Japanese speakers to learn than English /l/ (Aoyama et al., 2004), and that the use of F2 and F1 may be easier for them to acquire than that of F3 (Saito and Munro, 2014). The dynamic analysis in the current study further demonstrates that the advanced L1 Japanese speakers' F2–F1 trajectory is statistically significantly lower in the /ɹ/-/u/ context than those of the other two groups. While this could be taken as evidence of proficiency effects, the linear trend of the trajectories across proficiency groups also suggests that even advanced L1 Japanese speakers do not seem to differentiate /ɹ/ and /u/. Fundamentally, this lack of liquid-vowel differentiation might reflect a general influence from their L1 (i.e., Japanese). Further research is clearly needed to investigate the effects of L2 proficiency on formant dynamics by employing more rigorous measures of L2 proficiency, especially given that the acoustic profiles of L2 English liquids can be complex (Aoyama et al., 2019).
Overall, the dynamic analysis suggests that L1 Japanese speakers differ not only in the acoustic targets of English liquids, as captured in the static analysis, but also in the transition between the liquid and the vowel. The results are in line with previous findings that the magnitude and timing of spectral changes in the production of English liquids by L1 English-speaking children (Howson and Redford, 2021) and by L2 learners of English (Espinal et al., 2020) differ from those of adult L1 English speakers. These non-linear between-language differences point to possible mechanisms underlying L1 Japanese speakers' difficulty in producing English liquids accurately, considered in light of L2 speech learning.
B. Acquisition of English /l/ and /ɹ/ by L1 Japanese speakers
The overarching question in this study concerns how L1 Japanese speakers differ from L1 English speakers in the dynamic acoustic realisation of word-initial English liquids as a function of the following vowel. The static analysis suggests that both the speaker's L1 and the vowel context influence the acoustic realisations of word-initial English /l/ and /ɹ/. The L1 effect is unsurprising, given that it largely agrees with previous findings that L1 Japanese speakers produce both English /l/ and /ɹ/ with higher F2 and F3 values than L1 English speakers (Aoyama et al., 2019; Flege et al., 1995; Saito and van Poeteren, 2018). Regarding the vowel effect, the static analysis suggests a general tendency for liquids in the /i/ context to be produced with higher F2–F1 values than in the /u/ context, whereas liquids in the /æ/ context show the lowest F2–F1 values. This could be explained in light of previous findings that F2 values in English liquids tend to be higher before the high vowel /i/ than before the low vowel /a/, owing to different articulatory demands on the tongue dorsum configuration (Recasens, 2012).
The dynamic results demonstrate that L1 Japanese speakers show different liquid-vowel coarticulatory patterns depending on the following vowel, in contrast to L1 English speakers, whose trajectory patterns are consistent across vowel contexts. The liquid-/u/ trajectories, in particular, suggest that L1 Japanese speakers make a less clear distinction between the liquid and the vowel in the /u/ context. This could corroborate previous perceptual findings that L1 Japanese speakers are more likely to report a /w/-like percept when perceiving English /l/ and /ɹ/, resulting in confusion between English /l ɹ/ and other categories (e.g., /w/ or [ɯɾ]) and therefore in less success in identifying word-initial liquids in the back vowel context than in the front vowel context (Best and Strange, 1992; Guion et al., 2000; Mochizuki, 1981; Shimizu and Dantsuji, 1983). The data in this study demonstrate that such confusion arising from the vocalic component of English liquids in perception can also be observed in L1 Japanese speakers' production.
Generally, L1 Japanese speakers produce higher F3 for English /ɹ/ (Aoyama et al., 2019; Flege et al., 1995; Saito and Munro, 2014). This is apparent in both the static and dynamic analyses; in particular, the dynamic analysis of F3 in Fig. 8 shows that the by-group difference largely lies in the liquid portion, suggesting that the difference in F3 can be attributed to the realisation of the liquid itself. Previous research claims that F2 is an easier acoustic cue for L1 Japanese speakers to acquire than F3 (e.g., Saito and Munro, 2014; Saito and van Poeteren, 2018). Given that variation in F1 is likely to be negligible between the two speaker populations (e.g., Flege et al., 1995; Saito and Munro, 2014), between-group differences in F2–F1 should largely derive from variation in F2; this claim therefore does not explain well why the F2–F1 trajectories differ significantly in both height and shape between L1 Japanese and L1 English speakers (see Fig. 7). It could thus be argued that the static analysis captures only a snapshot of the acoustic realisation of English liquids, when, in fact, L1 Japanese speakers differ from L1 English speakers in the dynamic spectral characteristics over the liquid-vowel interval.
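The reasoning above rests on a simple relation: if the between-group difference in F1 is negligible, the between-group difference in F2–F1 reduces, to a first approximation, to the difference in F2,

\[
\Delta(F_2 - F_1) \;=\; \Delta F_2 - \Delta F_1 \;\approx\; \Delta F_2 \quad \text{when } \Delta F_1 \approx 0 .
\]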
In addition, an anonymous reviewer suggested the possibility that L1 Japanese speakers might use different dynamic strategies from L1 English speakers to realise the contrast (e.g., through F2). It would therefore be worthwhile to investigate how L1 Japanese speakers use dynamic information to make such a phonological contrast, especially given that PAM-L2 makes predictions about how L2 speakers assimilate L2 phonological contrasts into their L1 phonology (Best and Tyler, 2007).
Theoretically, the SLM posits that L2 learners store representations of L2 sounds at the level of position-sensitive allophones (Flege, 1995; Flege and Bohn, 2021), and previous studies show that L1 Japanese speakers' perception of English /l/ and /ɹ/ is highly subject to the phonetic context and to coarticulatory effects with neighbouring segments (Mochizuki, 1981; Sheldon and Strange, 1982). Taken together, the current results demonstrate that L1 Japanese speakers are influenced by the phonetic details of L2 English liquids not only in perception but also in production; L1 Japanese speakers differ in how clearly they dissociate the liquid from the vowel, especially in the /u/ context, which is manifested in their production as distinct patterns of liquid-vowel coarticulation.
To summarise, the present study shows that the temporal spectral changes over the liquid-vowel interval differ significantly between L1 English and L1 Japanese speakers along F2–F1 for both liquids and along F3 for /ɹ/. The liquid-vowel F2–F1 trajectories in the /i/ and /u/ contexts show particularly notable temporal variability in the L1 Japanese speakers' data, suggesting that liquid-vowel coarticulation can be considered one of the properties that L1 Japanese speakers need to acquire in the production of English liquids.
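One way to make "liquid-vowel differentiation" tangible is to compare F2–F1 at an assumed liquid point and an assumed vowel point on a sampled trajectory. The hypothetical index below is not a measure used in this study; it is only a sketch of how such a property might be quantified, with the proportional measurement points chosen arbitrarily.

```python
# Hypothetical liquid-vowel differentiation index: absolute change in F2-F1 (Hz)
# between an assumed liquid point and an assumed vowel point on the trajectory.
import numpy as np


def differentiation_index(f2f1_trajectory, liquid_prop=0.25, vowel_prop=0.75):
    """f2f1_trajectory: F2-F1 values sampled at equal proportional steps over the
    liquid-vowel interval; liquid_prop/vowel_prop mark the (assumed) measurement points."""
    traj = np.asarray(f2f1_trajectory, dtype=float)
    i_liq = int(round(liquid_prop * (len(traj) - 1)))
    i_vow = int(round(vowel_prop * (len(traj) - 1)))
    return abs(traj[i_vow] - traj[i_liq])


# A flat trajectory (little liquid-vowel differentiation) vs. a changing one
print(differentiation_index([700, 720, 730, 740, 750]))     # small index
print(differentiation_index([700, 900, 1300, 1700, 1900]))  # large index
```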
V. CONCLUSION
The present study examines the acoustics of L1 Japanese and L1 English speakers' production of word-initial English liquids. The key finding is that L1 Japanese speakers differ from L1 English speakers in the coarticulatory pattern between the liquid and the following vowel. The dynamic analysis using GAMMs not only generally agrees with the findings of the static analysis but also highlights robust yet complex differences between L1 and L2 speech in the formant dynamics. Overall, this study illustrates that dynamic characteristics are an important aspect of the production of English liquids in the context of L2 speech learning. Directly studying formant dynamics opens discussions of the specific mechanisms underlying L2 speech production under the influence of the speakers' L1, and future research should complement the current results with articulatory methods for a better understanding of the factors that may underlie the differences in acoustic dynamics shown in this study.
ACKNOWLEDGMENTS
I thank Professor Claire Nance and Dr. Sam Kirkham for their comments and support. Professor Noriko Nakanishi, Professor Yuri Nishio, and Dr. Bronwen Evans helped me with data collection. The research was financially supported by a Graduate Scholarship for Degree-Seeking Students from the Japan Student Services Organization (JASSO) and the 2022 Research Grant from the Murata Science Foundation. The data and code that support the findings of this study are openly available in the Open Science Framework (OSF) repository at https://osf.io/2phx5/. The author has no conflicts to disclose. This research was approved by the ethics committees at Lancaster University, Kobe Gakuin University, and Meijo University. Informed consent was obtained from all participants.
See supplementary material at https://osf.io/2phx5/ for further details about the participants, the data processing procedure, the analysis and results, additional statistical comparisons, and an additional analysis of vowel midpoints.