This study investigates the articulatory correlates of consonantal length contrasts in Japanese mimetic words using electromagnetic articulography data. Regression and dynamic time warping analyses applied to intragestural timing, kinematic properties, and intergestural timing reveal that Japanese geminates are characterized by longer closure phases, longer gestural plateaus, higher tongue tip positions, larger movements, and lower stiffness. Geminates also exhibit distinct timing relationships with adjacent vowels, specifically, longer times to target that allow for longer preceding vowels. These findings shed light on the articulatory mechanisms underlying Japanese geminate production, their relationship to acoustics, and their characterization in a broader cross-linguistic perspective.
1. Introduction
In many of the world's languages, consonant duration can be employed contrastively.1,2 A well-known example is Japanese. The production of Japanese long or “geminate” consonants has been primarily studied through their acoustic manifestations.3 Acoustically, Japanese geminates are distinguished from singletons mainly by their longer constriction durations.4 Geminates are also accompanied by slightly longer preceding vowels and by other non-durational cues: for example, larger intensity differences surrounding geminates, larger pitch-accent f0 drops across geminates, vowels with lower F1 after geminates, etc. (cf. Ref. 3).
Unlike their acoustic correlates, the articulatory mechanisms involved in Japanese geminate production have not been thoroughly explored.5–12 Three key areas require further investigation: (i) intragestural timing properties, which encompass the duration and timing characteristics of geminate articulation; (ii) kinematic properties, which are responsible for the shape of the trajectories of articulatory movements during geminate production; and (iii) intergestural timing properties, i.e., the timing relationships between geminate articulation and the articulation of surrounding sounds.
With respect to intragestural timing, one understudied aspect is how the longer closure duration of geminates is implemented kinematically. Two main strategies have been reported. The first strategy is the realization of geminates by holding constriction targets for longer periods of time: in effect, lengthening the plateau of kinematic trajectories.5 A second strategy is the realization of geminates by slowing down articulator movements, specifically, tongue movements for lingual consonants.9
With respect to kinematic features, one study of Japanese bilabial geminate plosives7 has uncovered that geminates display (i) higher articulator positions, (ii) larger movement amplitude, (iii) similar closure peak velocities, and (iv) lower stiffness. These findings point to distinct kinematic specifications for articulator control during geminate production, beyond simply longer durations (cf. also Refs. 3 and 13). More extended articulator contacts, compatible with more constricted targets, have also been reported for lingual consonants using electropalatography.13,14 However, more investigations of kinematic features in various manners and places of articulation are necessary to fully characterize geminate articulation in Japanese.
Finally, with respect to intergestural timing, pioneering comparative work on Japanese and Italian geminates by Smith5,6 has uncovered a potential kinematic basis for the longer duration of vowels before geminates in Japanese. This longer duration may be in part due to intrinsically longer kinematic movements, as also suggested in subsequent work.9 Crucially, however, Japanese geminates also display longer times to target compared to singletons, which may allow preceding vowels to be acoustically manifested for a longer period of time. This timing feature sets Japanese apart from other languages, for example, from Italian. In Italian, geminates are realized by anticipating closure onset during the preceding vowel, thus, in effect shortening the acoustic manifestation of the vowel.5,6,15
Smith5,6 also reported longer trans-consonantal lags between preceding and following vowels across geminates than singletons. This finding suggests a more sequential organization for vowels across a geminate in Japanese than in Italian, in which the trans-consonantal vocalic lag is unaffected or perhaps even shortened, at least with bilabial geminates.5,6,15 Smith's5,6 works offered an intriguing basis for the intergestural timing organization of geminates. However, subsequent work on Japanese geminates did not show clear intergestural timing differences8,9 and has also called into question the relationship between a longer time to target and duration of the preceding vowel.11
With these issues in mind, we investigated the articulatory correlates of geminate production in Japanese using electromagnetic articulography (EMA). We analyzed geminate production along the three main lines discussed above, namely, in terms of their intragestural timing, kinematic properties, and intergestural timing. The questions we addressed are as follows: (1) What are the kinematic correlates that underlie the longer duration of Japanese geminates? (2) What kinematic features differentiate the articulatory trajectories of singletons and geminates? (3) How are geminates timed with respect to surrounding segments, and is there a kinematic basis for the longer duration of preceding vowels?
Given that both acoustic and articulatory properties can be influenced by a host of lexical factors,16–19 we studied the properties of Japanese geminates by comparing them to singletons within the same lexical item. This is possible because Japanese mimetic (i.e., sound symbolic) words can be produced with gemination of a medial consonant to express emphasis.20 Japanese mimetic words thus offer a controlled testing ground for geminate articulation in a situation that is largely independent of lexical influences.
2. Methods
2.1 Participants and materials
Seven native Japanese speakers (3 female, 4 male) participated in the experiment. All reported normal speech and hearing and daily use of the Tokyo dialect.
The speech material consisted of 21 existing Japanese mimetic words, with three items for each of seven consonantal types (Table 1). Each target consonant was produced as a singleton or a geminate (for emphasis).20
Target | Item 1 | Item 2 | Item 3 |
---|---|---|---|
[t] | kɑt(ː)ɑ–kɑtɑ | ɡɑt(ː)ɑ–ɡɑtɑ | pet(ː)ɑ–petɑ |
[d] | kud(ː)o–kudo | gud(ː)a–guda | od(ː)o–odo |
[ɾ] | paɾ(ː)a–paɾa | peɾ(ː)a–peɾa | doɾ(ː)o–doɾo |
[z]–[ʝ] | giz(ː)a–giza | oz(ː)u–ozu | uʝ(ː)i–uʝi |
[s] | kos(ː)o–koso | kas(ː)a–kasa | ɸus(ː)a–ɸusa |
[ts] | ɸut͡(ː)su–ɸut͡su | kat͡(ː)su–kat͡su | gut͡(ː)su–gut͡su |
[t⌢ɕ] | net͡(ː)ɕi–net͡ɕi | kat͡(ː)ɕa–kat͡ɕa | gut͡(ː)ɕa–gut͡ɕa |
Given that previous work has focused on bilabial consonants5–8,12 and our interest also lies in the interaction between geminates and preceding vowels, we focused on consonants produced with the same articulator as vowels, the tongue. Thus, we examined apical/laminal consonants, as they are produced with the front of the tongue, which is straightforward to track with EMA.
In each trial, participants produced singleton mimetic words in isolation, followed by their emphatic geminate variant, also produced in isolation. Items were elicited in isolation to minimize total experiment duration and confounds that could arise from different prosodic realizations of a carrier phrase. The target consonants were never adjacent to word boundaries, thus, also limiting the need for a carrier phrase. The order of lexical items was fully randomized. In total, 21 unique words × 2 realizations (singleton/geminate) × 10 repetitions × 7 speakers yielded a total of 2940 tokens for analysis.
2.2 Experimental procedure
Articulatory movements were captured using an NDI Wave System [NDI (Northern Digital, Inc.), Waterloo, Canada] sampling at 100 Hz (S1–S3, 1F, 2M) and a Carstens AG501 system (Carstens, Chicago, IL) sampling at 1250 Hz (S4–S7, 2F, 2M). Acoustic data were simultaneously recorded at 22.05 kHz. The two datasets were first analyzed separately. Since no substantial differences were found, we present combined analyses of the two datasets. For interested readers, our supplementary material contains scripts that run our analyses on the datasets both separately and combined.
Both systems used a nearly identical sensor setup: five tongue sensors (three on the sagittal midline, two parasagittally). Only the most anterior sensor on the tongue, attached less than 1 cm from the tongue apex, entered into the analyses reported here. Additional sensors were placed on the lips, jaw, left/right mastoid processes, nasion, and maxilla (for AG501 sessions only).
Participants sat in a sound-attenuated room; stimuli were displayed in Japanese orthography on a monitor ∼25 cm outside the EMA field. Stimulus presentation was controlled using eprime (NDI Wave) or matlab21 (AG501), enabling real-time monitoring for errors. Rare hesitations or mispronunciations were marked, and the affected items were reinserted randomly among the remaining items. Following the main recording session, we also recorded the bite plane of each participant by having them hold a rigid object, with three (NDI Wave) or two (AG501) sensors attached to it, between their teeth. Head movements were corrected computationally after data collection with reference to three sensors on the head and the two or three sensors on the bite plane and maxilla. The head-corrected data were rotated so that the origin of the spatial coordinates corresponded to the occlusal plane at the front teeth.
2.3 Data processing and analyses
The acoustic signals were manually segmented at the word and segmental level in praat22 using standard criteria based on waveform and spectrogram characteristics by two research assistants naive to the purpose of the study (see Fig. 4 below). The segmentation was later double-checked by the first author. The acoustic segmentation was used as a starting point for articulatory landmarking of target consonants. The only acoustic boundaries used in the analyses are those that delimit the target consonants and the preceding and following vowels. Given the wide variety of vocalic contexts and the impossibility of consistently identifying vocalic landmarks from the kinematics alone, we adopt acoustic landmarks as events that reasonably correlate with kinematic events, especially target achievement, following previous work (e.g., Ref. 23).
The articulatory analyses reported in this paper focus on apical/laminal consonants; thus, the main sensor of interest is the tongue tip, henceforth TT. The wide variety of vocalic contexts in which target consonants appear has a strong influence on horizontal tongue position; thus, our landmarking was based on the dimension that we found to be most reliable for identifying consonantal articulation, namely the TT vertical movement (henceforth TTy) (cf. Ref. 24 for a similar choice). Support for using TTy as the main landmarking dimension also comes from a principal component analysis (PCA). We entered all three dimensions of TT movement in a PCA and found that the direction of maximum movement, the 1st principal component (PC) (explaining 95.7% of variance), correlates almost perfectly with vertical movement (median r = 0.97 over all tokens). These results suggest that tongue movement during target consonant production takes place robustly in TTy and that this dimension is reliable for tracking articulation.
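The logic of this PCA check can be illustrated with a minimal Python sketch. It is not the analysis code of the study: the token below is synthetic, the function names are hypothetical, and the first principal component is obtained by simple power iteration on the 3 × 3 covariance matrix rather than by a library routine.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def first_pc(points, iters=200):
    """Return (direction, explained-variance share, scores) of the 1st PC."""
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    centered = [[p[d] - means[d] for d in range(dims)] for p in points]
    # Sample covariance matrix of the centered coordinates.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * dims
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(val * val for val in w))
        v = [val / norm for val in w]
    eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(dims))
                 for i in range(dims))
    share = eigval / sum(cov[i][i] for i in range(dims))
    scores = [sum(c[d] * v[d] for d in range(dims)) for c in centered]
    return v, share, scores


# Synthetic token (assumption): mostly vertical TT movement plus small noise.
rng = random.Random(0)
n_samples = 200
token = []
for k in range(n_samples):
    lift = math.sin(math.pi * k / (n_samples - 1))
    token.append((0.5 * lift + rng.gauss(0, 0.2),  # horizontal (mm)
                  8.0 * lift,                      # vertical, TTy (mm)
                  rng.gauss(0, 0.2)))              # lateral (mm)

direction, share, scores = first_pc(token)
# The sign of a principal component is arbitrary, hence the absolute value.
r_vertical = abs(pearson(scores, [p[1] for p in token]))
```

In a per-token analysis of the kind described above, `r_vertical` would be computed for every token and then summarized, for instance by its median.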
All articulatory signals were smoothed and interpolated using the algorithm of Ref. 25. For landmarking, TTy trajectories were upsampled to 1000 Hz for the NDI Wave sessions and kept at 1250 Hz for the AG501 sessions. All trajectories were smoothed using a Savitzky–Golay finite filter of polynomial order 4 and frame length of 21. Consonantal gestures were automatically landmarked using a custom matlab routine based on peak velocity thresholding. Acoustic boundaries were used to locate the consonant midpoint, which defined a symmetric window spanning twice the segment's acoustic duration. Within this window, velocity extrema were identified for upward closure and downward release movements in TTy. Gestural onset and target were defined as the first and last timepoints surpassing 20% of peak closure velocity, while release and offset were similarly defined for release velocity. Landmarks were visually inspected and adjusted for ∼50 tokens as needed. Thirty tokens without clear gestural boundaries were excluded, along with their paired singleton or geminate, resulting in the exclusion of 60 tokens (∼2% of the data). Excluding the paired token was necessary because some of our analyses require pairing singletons and geminates obtained from the same trial, as described below.
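The velocity-thresholding step can be sketched in Python as follows. This is an illustration under stated assumptions, not the custom matlab routine used in the study: the trajectory is synthetic, the function names are hypothetical, smoothing and acoustic windowing are omitted, and the threshold crossings are searched as a contiguous run around each velocity peak.

```python
import math


def velocity(signal, fs):
    """Central-difference velocity (signal units per second)."""
    n = len(signal)
    vel = [0.0] * n
    for i in range(1, n - 1):
        vel[i] = (signal[i + 1] - signal[i - 1]) * fs / 2.0
    vel[0] = (signal[1] - signal[0]) * fs
    vel[-1] = (signal[-1] - signal[-2]) * fs
    return vel


def span_above(vel, peak, frac):
    """First and last index of the contiguous run around `peak` in which the
    velocity keeps the sign of the peak and exceeds frac * |peak velocity|."""
    sign = 1.0 if vel[peak] > 0 else -1.0
    threshold = frac * abs(vel[peak])
    first = last = peak
    while first > 0 and sign * vel[first - 1] > threshold:
        first -= 1
    while last < len(vel) - 1 and sign * vel[last + 1] > threshold:
        last += 1
    return first, last


def landmark(tty, fs, frac=0.20):
    """Return onset, target, release, and offset times (s) of one gesture."""
    vel = velocity(tty, fs)
    indices = range(len(vel))
    closing_peak = max(indices, key=lambda i: vel[i])  # fastest upward movement
    release_peak = min(indices, key=lambda i: vel[i])  # fastest downward movement
    onset, target = span_above(vel, closing_peak, frac)
    release, offset = span_above(vel, release_peak, frac)
    return {"onset": onset / fs, "target": target / fs,
            "release": release / fs, "offset": offset / fs}


# Synthetic TTy trajectory (assumption): 80 ms rise, 100 ms hold, 80 ms fall.
fs = 1000
rise = [4 * (1 - math.cos(math.pi * k / 80)) for k in range(80)]
fall = [4 * (1 + math.cos(math.pi * k / 80)) for k in range(80)]
tty = [0.0] * 50 + rise + [8.0] * 100 + fall + [0.0] * 50

lm = landmark(tty, fs)
```

Because the landmarks sit at 20% of peak velocity rather than at zero velocity, the interval between `lm["target"]` and `lm["release"]` is slightly longer than the 100 ms during which the synthetic trajectory is held constant.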
From the acoustic and articulatory signals, we derived a set of intragestural timing, kinematic, and intergestural timing measures. For intragestural timing (Fig. 1), we investigated the following: (i) The closure phase duration, defined as the lag between gestural onset and target; (ii) the plateau duration, defined as the duration of the lag between gestural target and release; (iii) the release phase duration, defined as the lag between gestural release and offset.
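For concreteness, measures (i)–(iii) reduce to differences between consecutive landmarks. The Python sketch below uses a hypothetical `GestureLandmarks` container and illustrative landmark times; the values are not taken from the data.

```python
from dataclasses import dataclass


@dataclass
class GestureLandmarks:
    """Landmark times of one consonantal gesture, in seconds."""
    onset: float
    target: float
    release: float
    offset: float

    def durations_ms(self):
        """Closure phase, plateau, and release phase durations in ms."""
        return {
            "closure": (self.target - self.onset) * 1000,
            "plateau": (self.release - self.target) * 1000,
            "release": (self.offset - self.release) * 1000,
        }


# Illustrative landmark times (assumptions, not measured values).
singleton = GestureLandmarks(onset=0.050, target=0.136, release=0.177, offset=0.260)
geminate = GestureLandmarks(onset=0.050, target=0.147, release=0.296, offset=0.379)

singleton_durations = singleton.durations_ms()
geminate_durations = geminate.durations_ms()
```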
We also investigated the relationship between plateau duration and acoustic consonant duration using Pearson correlation.
Additionally, we also developed holistic analyses of the kinematic trajectories that rely on dynamic time warping (DTW). Specifically, we took each singleton/geminate pair in a trial and used DTW to derive a pairwise warping function that identifies which portions of a singleton articulatory trajectory need to be stretched to “derive” a geminate. We inspected the shape of the warping functions and obtained localized average warps over normalized duration, following Ref. 26.
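The idea behind the pairwise warping function can be illustrated with a textbook DTW implementation in Python. This is a minimal sketch with toy trajectories and hypothetical function names; it does not reproduce the specific DTW variant or the localized averaging of Ref. 26.

```python
def dtw_path(a, b):
    """Classic DTW with absolute-difference cost; returns (cost, path).

    The path is a list of (index in a, index in b) pairs."""
    n, m = len(a), len(b)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
    # Backtrack from the end, preferring the diagonal step in case of ties.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(acc[i - 1][j - 1], i - 1, j - 1),
                 (acc[i - 1][j], i - 1, j),
                 (acc[i][j - 1], i, j - 1)]
        _, i, j = min(steps, key=lambda step: step[0])
    path.reverse()
    return acc[n][m], path


def repetitions(path, n):
    """How many samples of the second series each sample of the first maps to."""
    counts = [0] * n
    for i, _ in path:
        counts[i] += 1
    return counts


# Toy TTy trajectories (assumption): same movement, longer hold in the geminate.
singleton = [0, 2, 4, 6, 8, 8, 6, 4, 2, 0]
geminate = [0, 2, 4, 6, 8, 8, 8, 8, 8, 8, 6, 4, 2, 0]

total_cost, path = dtw_path(singleton, geminate)
counts = repetitions(path, len(singleton))
```

In this toy example only the two plateau samples of the singleton are matched to more than one geminate sample, which is the pattern a plateau-lengthening strategy would produce.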
For kinematic properties (Fig. 2), we investigated the following: (iv) Maximum vertical and horizontal tongue position, as a proxy for constriction target and location; (v) peak velocity during closure; (vi) movement amplitude from onset to maximum constriction during closure; (vii) kinematic stiffness, defined as the ratio of peak velocity to movement amplitude, providing a normalized measure of movement speed independent of amplitude. Higher stiffness indicates shorter times to reach peak velocity and target attainment.
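The relationship between amplitude, peak velocity, and stiffness can be made concrete with a short Python sketch. The closing movements below are synthetic raised-cosine trajectories (an assumption chosen for illustration, with hypothetical function names); for such a movement of duration T, stiffness is approximately π/(2T) regardless of amplitude.

```python
import math


def closing_movement(amplitude_mm, duration_s, fs=1000):
    """Synthetic raised-cosine closing movement starting at 0 mm."""
    n = round(duration_s * fs)
    return [amplitude_mm / 2 * (1 - math.cos(math.pi * k / n)) for k in range(n + 1)]


def closure_kinematics(tty, fs=1000):
    """Return (amplitude in mm, peak velocity in mm/s, stiffness in 1/s)."""
    amplitude = max(tty) - tty[0]
    peak_velocity = max((tty[i + 1] - tty[i - 1]) * fs / 2
                        for i in range(1, len(tty) - 1))
    return amplitude, peak_velocity, peak_velocity / amplitude


# Two hypothetical movements with nearly identical peak velocity:
# the larger, longer movement has the lower stiffness.
short_move = closure_kinematics(closing_movement(8.0, 0.08))
long_move = closure_kinematics(closing_movement(9.0, 0.09))
```

The second movement mimics the qualitative profile discussed in this paper (larger amplitude, similar peak velocity, lower stiffness), but its numbers are purely illustrative.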
For intergestural timing properties (see Fig. 4 below), we investigated the following: (viii) The duration of the vowel preceding the target consonants estimated from acoustics (V1 Dur.); (ix) the duration of the lag between the preceding vowel acoustic onset and the consonantal gesture onset (V1 Ons. – C Ons.); (x) the duration of the lag between the preceding vowel acoustic onset and the consonantal gesture target (V1 Ons. – C Targ.); (xi) the duration of the trans-consonantal lag between the acoustic onsets of the preceding and following vowels (V1 Ons. – V2 Ons.).
We also investigated the relationship between consonantal and preceding vowel duration using Pearson correlation, following Ref. 11.
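A correlation analysis of this kind, combined with the percentile-based trimming described at the end of this section, can be sketched in Python as follows. The data are simulated, the function names are hypothetical, and the percentile definition (the default of `statistics.quantiles`) is an assumption that may differ from the one used in the study.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def trim_pairs(x, y, lo_pct=1, hi_pct=99):
    """Drop pairs in which either value lies outside its own percentile bounds."""
    def bounds(values):
        cuts = statistics.quantiles(values, n=100)  # 99 percentile cut points
        return cuts[lo_pct - 1], cuts[hi_pct - 1]

    (x_lo, x_hi), (y_lo, y_hi) = bounds(x), bounds(y)
    kept = [(a, b) for a, b in zip(x, y)
            if x_lo <= a <= x_hi and y_lo <= b <= y_hi]
    return [a for a, _ in kept], [b for _, b in kept]


# Simulated data (assumption): time to target and V1 duration in ms,
# positively related, plus one extreme point that distorts the raw correlation.
rng = random.Random(1)
time_to_target = [rng.gauss(90, 20) for _ in range(300)]
v1_duration = [0.5 * t + rng.gauss(25, 5) for t in time_to_target]
time_to_target.append(400.0)
v1_duration.append(10.0)

r_raw = pearson(time_to_target, v1_duration)
x_trim, y_trim = trim_pairs(time_to_target, v1_duration)
r_trimmed = pearson(x_trim, y_trim)
```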
All properties (i)–(xi) were analyzed using linear mixed-effects regression models in matlab's fitlme() function. The effect of gemination was assessed via log-likelihood ratio tests, comparing a baseline model (without the geminate effect) to an alternative model including a fixed effect for gemination (reference coded as singleton). Both models featured the same random effect structure, with by-subject and by-item random intercepts and a random slope for gemination; they differed only in the fixed effect for gemination. Outliers were identified as values with z-scores below –4 or above 4, resulting in the exclusion of 0% to 1.46% of the data across variables. For correlation analyses, values below the 1st percentile and above the 99th percentile were excluded to prevent extreme points from inflating or deflating observed correlations.
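The final step of such a likelihood ratio test can be illustrated in Python. The sketch below assumes that the two nested models differ by exactly one parameter, so that the test statistic is referred to a χ2 distribution with one degree of freedom; the log-likelihood values are hypothetical, and model fitting itself is not shown.

```python
import math


def chi2_sf_df1(statistic):
    """Upper-tail probability of a chi-square distribution with 1 df.

    For 1 df, P(X > s) = erfc(sqrt(s / 2))."""
    return math.erfc(math.sqrt(statistic / 2.0))


def likelihood_ratio_test(loglik_baseline, loglik_alternative):
    """Return (chi-square statistic, p-value) for nested models differing by 1 df."""
    statistic = 2.0 * (loglik_alternative - loglik_baseline)
    return statistic, chi2_sf_df1(statistic)


# Hypothetical log-likelihoods chosen so that the statistic equals 8.73.
stat, p_value = likelihood_ratio_test(-5120.000, -5115.635)

# A larger statistic, such as 12.61, yields a smaller p-value.
p_value_larger = chi2_sf_df1(12.61)
```

With a statistic of 8.73, this conversion gives a p-value of about 0.003.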
All data and scripts necessary to replicate our analyses and figures and to run separate analyses on the datasets are publicly available in an OSF repository at https://osf.io/27nyz/?view_only=4f93c383e24642e48d027c58fd945a27.
3. Results
3.1 Intragestural timing
We found a small but statistically reliable difference in closure phase duration [χ2(1) = 8.73, p = 0.003]. The singleton closure phase is 86 ms (95% confidence interval [CI] [73–100] ms), while the geminate closure phase is +11 ms longer (95% CI [6–17] ms) (Fig. 1, top right).
A more robust difference was observed for plateau duration [χ2(1) = 12.61, p = 0.0004]. The singleton plateau duration was 41 ms (95% CI [26–56] ms), while the geminate plateau duration was +108 ms longer (95% CI [72–144] ms) (Fig. 1, bottom left). No reliable differences were found in terms of the release phase duration [χ2(1) = 0.04, p = 0.84] (Fig. 1, bottom right).
The fact that singletons and geminates differ primarily in terms of their plateau duration is also evident from DTW analyses. When each mimetic singleton is stretched to its geminate counterpart, a strong distortion of time is observed around the midpoint of the consonant, 0.2 to 0.6 of its proportional duration (Fig. 2F). Each singleton sample between 33% and 66% of the singleton trajectory is repeated between two and almost 3.5 times to derive a geminate, indicating a stretching of the plateau region (Fig. 2G).
The observed longer acoustic duration for geminates is closely related to the longer plateau, as indicated by a robust correlation between target consonant acoustic duration and plateau duration (r = 0.84, p < 1 × 10−215) (see Fig. 6 in the supplementary material).
3.2 Kinematic properties
We found a higher TTy position for geminates than singletons [χ2(1) = 8.76, p = 0.003]. Compared to singletons (6.6 mm, 95% CI [1.6–11.6]), geminates are produced with higher/more constricted targets, +0.80 mm (95% CI [0.39–1.19] mm) (Fig. 3, mid left). In the horizontal position, the difference was not significant [χ2(1) = 3.25, p = 0.07]; however, geminates tend to be more fronted, +0.23 mm (95% CI [0–0.47] mm).
The difference in movement amplitude was significant [χ2(1) = 8.5, p = 0.003]. Compared to singletons (8.16 mm, 95% CI [6.7–9.7]), geminates are produced with greater closure movement amplitude, +0.99 mm (95% CI [0.45–1.53] mm) (Fig. 3, mid right). We found no significant differences in closure peak velocity [χ2(1) = 1.45, p = 0.22] (Fig. 3, bottom left).
Finally, we found a significantly lower stiffness for geminates [χ2(1) = 5.13, p = 0.02]. Compared to singletons (19.92 s−1, 95% CI [16.17–23.67 s−1]), geminates are produced with lower stiffness, −2.61 s−1 (95% CI [−4.54 to −0.69] s−1), indicating a slower movement and a longer time to target (Fig. 3, bottom right).
3.3 Intergestural timing properties
We found that vowels preceding geminates had longer duration than vowels preceding singletons [χ2(1) = 7.57, p = 0.006]. Specifically, compared to vowels preceding singletons (70 ms, 95% CI [60–79]), vowels preceding geminates were +13 ms longer (95% CI [6–20] ms) (Fig. 4, mid left panel).
The lag between vowel acoustic onset and consonantal gesture onset (V1 Ons. – C Ons.) differed significantly between geminates and singletons, but the difference was very small [χ2(1) = 4.12, p = 0.04] (Fig. 4, mid right panel). Compared to the lag of vowels preceding singletons (–4 ms, 95% CI [–1–7]), the lag of vowels preceding geminates is slightly longer by +7 ms (95% CI [1–13] ms). Note, however, that the 95% CIs are very close to overlapping with 0. Moreover, this finding only emerges when our datasets are pooled, suggesting that the difference between singletons and geminates in V1 Ons. – C Ons. is very small and variable. Future work should further investigate the robustness of this finding.
On the other hand, the duration of the lag between vowel acoustic onset and consonantal gesture target (V1 Ons. – C Targ.) was clearly longer in geminates than in singletons [χ2(1) = 12.6, p = 0.0004]. Compared to the singleton lag (83 ms, 95% CI [69–96] ms), the geminate V1 Ons. – C Targ lag was +20 ms longer (95% CI [13–27] ms) (Fig. 4, bottom left panel).
Finally, the trans-consonantal lag between the preceding vowel acoustic onset and following vowel acoustic onset (V1 Ons. – V2 Ons.) was much longer in geminate than in singleton production [χ2(1) = 19.43, p < 0.0001]. Compared to the lag in singleton production (218 ms, 95% CI [158–279] ms), the V1 Ons. – V2 Ons. lag in geminate production was +134 ms longer (95% CI [108–160] ms) (Fig. 4, bottom right panel).
Given the findings presented above, we hypothesized that the longer duration of vowels preceding geminates may be due to a longer time to target compared to singletons, effectively allowing for a longer steady-state period for vowel production. Evidence in favor of this hypothesis is offered by a strong correlation (r = 0.70, p < 1 × 10−215) observed between time to target and preceding vowel duration (see Fig. 7 in the supplementary material).
4. Discussion
Returning to our research questions, our analyses revealed robust differences between Japanese geminates and singletons in terms of intragestural timing, kinematic properties, and intergestural timing organization.
With respect to intragestural timing, geminates were produced with slightly longer closure phases and much longer gestural plateaus, a finding confirmed by DTW analyses. No differences were observed for release duration, in line with previous acoustic analyses, which reported no difference in voice onset time.13 We also found a very strong correlation between consonantal acoustic duration and TTy plateau duration. Taken together, our findings align with previous work,5,9 indicating that Japanese speakers produce geminates with slightly longer closure movements and much longer gestural plateaus. Findings of longer closures and plateaus for geminates are not unique to Japanese. They have also been reported in other languages, like Italian.15
Turning to kinematic features, we found that Japanese speakers produce (lingual) geminates with a higher TT, i.e., with a more constricted posture, slightly larger movements, similar peak velocity, and lower stiffness, in line with previous reports.7 Taken together, the differences in kinematic features suggest that Japanese geminates are not just longer versions of singleton consonants. Their articulation is characterized by a different set of kinematic parameters. In this respect, our findings support the view that geminates differ from singletons not only in terms of durational properties but also in terms of more general articulatory strategies, an idea that has been proposed for Tashlhiyt Berber2 and Italian2,15 and that has also been entertained for Japanese3,13 and other Japonic languages.27
Finally, in terms of intergestural timing, we found that Japanese speakers produced slightly longer vowels before geminate consonants. Geminates also start roughly at the same time as singletons with respect to the preceding vowel, but their targets are reached later. Longer trans-consonantal lags between preceding and following vowels across geminates are also observed. These findings suggest that Japanese speakers produce geminates and singletons with relatively similar timing organization with respect to the preceding vowel. This has been noted in previous work5,6 and is a feature that sets Japanese geminate production apart from languages like Italian, where geminate closure robustly “intrudes” into the preceding vowel, resulting in shortening of pre-geminate vowels.5,6,15
Additionally, unlike some previous work,11 our findings also suggest a robust positive correlation between the V1 Ons. – C Targ. lag and V1 duration. This pattern provides a possible kinematic basis for the longer vowel durations preceding Japanese geminates. Longer times to target allow the vocalic gesture to be acoustically exposed for a longer period of time before the acoustic consequences of consonantal articulation “kick in” during the consonantal plateau, which, as we have demonstrated, is the main articulatory correlate of consonantal acoustic duration. The longer time to target also reinforces other potential bases for longer vocalic acoustic duration. Namely, the slower and longer tongue body movements associated with vowels during geminate production5,6,9 can contribute to longer acoustic vowel durations in the presence of delayed consonantal target achievements for geminates and their associated acoustic consequences. Such slower tongue body movements are also observed in our data. This is illustrated in Fig. 8 in the supplementary material, where we present DTW analyses to show that there is a generalized need to warp time. Specifically, to obtain the combined vertical and horizontal tongue movement observed during pre- and post-geminate vowel production from the tongue movement observed during pre- and post-singleton vowel production, time needs to be slowed down, especially while the consonant is being produced.
5. Conclusion
To conclude, our analyses have revealed that geminates, as produced in Japanese mimetic words, exhibit longer closure phases, extended gestural plateaus, higher TT positions, and more constricted postures. These articulatory profiles accord well with the acoustic properties of Japanese geminates, like their longer closure durations and lengthened preceding vowels. Additionally, our analyses situated Japanese geminate production in a wider cross-linguistic context. Japanese geminates seem to be primarily produced by lengthening gestural plateaus compared to singletons, as demonstrated by our DTW analyses. Even so, they are not simply extended versions of singletons: Some of their kinematic parameters are also different. These considerations lend plausibility to the proposal that even “canonical” geminates like those of Japanese and Italian are actually implemented by speakers using dimensions beyond duration, such as tighter constrictions and generally different kinematic profiles that have larger movements and lower stiffness. Finally, our results show that languages can differ substantially in the timing of geminates.5 Japanese geminates and singletons start around the same time with respect to the preceding vowels, yet geminates reach their targets later, allowing for longer acoustic vowel durations. This is unlike other languages where geminate production starts earlier with respect to the preceding vowel, in effect, shortening it.15
Supplementary Material
See the supplementary material for additional analyses and figures.
Acknowledgments
We thank Piyapath Srisomyos and Teerawee Sukanchanon for help with data annotation, as well as Jeff Moore, Lia Bučar Shigemori, and Ulrike Rupprecht for help with data acquisition. Data collection was supported by Japan Society for the Promotion of Science (JSPS) Grant Nos. 15F15715 to the second and third authors and 26770147 and 26284059 to the second author.
Author Declarations
Conflict of Interest
The authors have no conflicts to disclose.
Ethics Approval
The authors obtained ethical approval from the institutional review boards (IRBs) of Western Sydney University and Keio University (Protocol No. HREC 9482) and from the IRB of the University of Munich. Informed consent was obtained from all participants.
Data Availability
The data that support the findings of this study are openly available in an OSF repository at https://osf.io/27nyz/?view_only=4f93c383e24642e48d027c58fd945a27.