The relationship between prosodic structure and segmental realisation is a central question within phonetics. For vowels, this has been typically examined in terms of duration, leaving largely unanswered how prosodic boundaries influence spectral realisation. This study examines the influence of prosodic boundary strength—as well as duration and pauses—on vowel dynamics in spontaneous Japanese. While boundary strength has a marginal effect on dynamics, increased duration and pauses result in greater vowel peripherality and spectral change. These findings highlight the complex relationship between prosodic and segmental structure, and illustrate the importance of multifactorial analysis in corpus research.
1. Introduction
A central question in phonetic research concerns the relationship between prosodic structure and the acoustic realisation of speech segments, and how the organisation of words into hierarchical units modulates the production and perception of speech. Within this enterprise, much attention has been paid to the articulatory strengthening that occurs at the beginnings of prosodic units (“domain-initial strengthening”) (Cho, 2005, 2009; Fougeron and Keating, 1997), as well as to the temporal lengthening of segments preceding a prosodic boundary (“final lengthening”) (Cho, 2015; Turk and Shattuck-Hufnagel, 2007; Wightman et al., 1992). With respect to the spectral characteristics of vowels, the vowel's position in F1-F2 space and the dynamic change in formant values over the vowel's timecourse have been shown to be expanded in domain-initial position (Georgeton and Fougeron, 2014) and under various forms of prosodic prominence (Fridland et al., 2014; Jacewicz et al., 2009; Mo et al., 2009; Wouters and Macon, 2002), though it remains less clear how the presence of a prosodic boundary following a vowel modulates its spectral characteristics. Given the importance of final lengthening and the presence of pauses to the production and perception of prosodic boundaries (e.g., Ferreira, 1993; Krivokapić, 2007; Petrone et al., 2017; Steffman and Jun, 2021; Streeter, 1978), the question remains open as to how the presence of the boundary itself influences spectral realisation independent of other factors associated with the edges of prosodic domains.
This study addresses this question by focusing on the influence of prosodic boundary strength and its associated phenomena (increased duration and post-boundary pauses) on the spectral properties of vowels, both as independent and interacting factors, using a corpus of spontaneous Japanese. Japanese has five vowels (/a/, /i/, /u/, /e/, /o/, with long vowel counterparts), and maintains two levels of prosodic structure: the Accentual Phrase (AP) and the Intonational Phrase (IP) (Kubozono, 1993; Pierrehumbert and Beckman, 1988; Venditti, 2005). APs are characterised by an F0 rise on the second mora,¹ with downstep over the remainder of the phrase. IPs are delimited via F0 reset, as well as “pitch boundary movements” (PBMs), which perform a range of pragmatic and intentional functions (Venditti et al., 1998). Vowels have been shown to exhibit dynamic formant change over their timecourse, where even monophthongs change in their spectral characteristics in production (Harrington and Cassidy, 1994; Hillenbrand et al., 1995; Hirata and Tsukada, 2004; Yazawa and Kondo, 2019). It has been demonstrated that these dynamic properties capture additional detail in the phonetic realisation of vowels compared with single point measurements (Farrington et al., 2018; Renwick and Stanley, 2020), which may in turn reflect acoustic variation under different prosodic contexts. The dynamics of vowels have previously been shown to be affected by duration, where longer vowel duration results in greater formant change (Fox and Jacewicz, 2009; Fridland et al., 2014; Yazawa and Kondo, 2019). With respect to how prosodic phenomena affect the dynamics of vowels, most studies have focused on the presence of prominence (Fridland et al., 2014; Jacewicz et al., 2009), leaving unaddressed the relationship between boundaries and formant dynamics (Brandt et al., 2018).
2. Methods
The data used for this study come from the Corpus of Spontaneous Japanese-Core (CSJ) (Koiso et al., 2014; Maekawa et al., 2000), containing ∼45 h of speech (recorded 1999–2001) from 137 speakers (58 female) born 1930–1979. Relevant to the research questions of the study, the CSJ contains extensive intonational and prosodic annotation based on the X-JToBI annotation scheme (Kikuchi and Maekawa, 2003; Maekawa et al., 2002). Prosodic boundaries are annotated in the CSJ by way of “Break Index” (BI) labels, which reflect the prosodic association and disjuncture between words, aligning with the tonal and segmental cues to the boundary (Kikuchi and Maekawa, 2003; Venditti, 2005). This scheme uses four BI levels, each mapping to a separate level of prosodic disjuncture. Values of BI = 0 represent minimal disjuncture, such as within words; BI = 1 reflects word boundaries within an AP; BI = 2 reflects AP-level boundaries; and BI = 3 reflects the right edges of IPs.² The presence of pauses is annotated in the CSJ by means of chunking speech into inter-pausal units (IPUs), delimited by silences of 200 ms or greater.
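The IPU criterion can be illustrated with a short sketch. This is not the CSJ's own segmentation procedure: the function name, the millisecond interval format, and the example values are all hypothetical, and only the 200 ms threshold comes from the description above.

```python
def chunk_into_ipus(speech_intervals_ms, min_pause_ms=200):
    """Merge (start, end) speech intervals into inter-pausal units.

    Assumption: times are integer milliseconds. Two intervals belong to the
    same IPU when the silence between them is shorter than min_pause_ms;
    a silence of min_pause_ms or longer starts a new IPU.
    """
    ipus = []
    for start, end in sorted(speech_intervals_ms):
        if ipus and start - ipus[-1][1] < min_pause_ms:
            # Short silence: extend the current IPU.
            ipus[-1][1] = max(ipus[-1][1], end)
        else:
            # First interval, or a silence long enough to count as a pause.
            ipus.append([start, end])
    return [tuple(ipu) for ipu in ipus]


# Hypothetical intervals: the silences are 50 ms, 200 ms, and 100 ms long.
example = [(0, 500), (550, 900), (1100, 1600), (1700, 2000)]
print(chunk_into_ipus(example))
```

Under this sketch, a vowel would count as pre-pausal when it is the last segment of an IPU, which is one plausible reading of "following pause" in the analysis.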
All vowel data were extracted from the relational database version of the CSJ (Koiso et al., 2014), including the BI label, duration, and the presence of a following pause, as well as speaker and word information. Vowels marked as devoiced (59 632 tokens) were excluded due to their unknown durational and spectral properties: it should be noted that the exclusion of devoiced vowels disproportionately affects the high vowels /i, u/, which are uniquely targeted by phonological devoicing processes within words and at prosodic boundaries (Fujimoto, 2015). So as not to conflate phonetic vowel duration with phonological vowel length, this study focuses exclusively on phonemically short vowels in Japanese: as such, all multi-vowel sequences (i.e., phonemically long vowels like /e:/, sequences of different vowel qualities such as /ai/, etc.) were excluded from the analysis. Focusing only on short vowels means that “duration”, for the purpose of this analysis, refers specifically to phonetic vowel duration, consistent with previous studies examining the relationship between duration and formant dynamics (e.g., Fridland et al., 2014). Vowel formants (F1 and F2) were extracted using the parselmouth Python package (Jadoul et al., 2018), using separate maximum formant values for male and female speakers (5000 and 5500 Hz, respectively), and values were extracted at 21 equidistant points across each vowel's timecourse. The first and last 20% of points were excluded to avoid coarticulatory effects of surrounding segments (Renwick and Stanley, 2020; Williams and Escudero, 2014). Each formant point was Lobanov (Z) normalised within-speaker using all included tokens for the given speaker. In total, 300 257 vowel tokens (11 231 word types) were used in the final analysis (Table 1).
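The sampling, trimming, and normalisation steps described above can be sketched as follows. This is a minimal illustration rather than the study's extraction code: the function names and example values are hypothetical, formant measurement itself is omitted, and two details are assumptions not stated in the text, namely that the points at exactly 20% and 80% are retained and that the sample standard deviation is used for Lobanov normalisation.

```python
import statistics


def sample_times(onset, offset, n_points=21):
    """Return n_points equidistant times from vowel onset to offset (inclusive)."""
    step = (offset - onset) / (n_points - 1)
    return [onset + i * step for i in range(n_points)]


def trim_edges(points, proportion=0.2):
    """Drop the points before `proportion` and after 1 - `proportion`.

    Assumption: the points exactly at 20% and 80% are kept, so 21 points
    become 13. The treatment of these edge points is not stated in the text.
    """
    drop = round((len(points) - 1) * proportion)
    return points[drop:len(points) - drop]


def lobanov(values):
    """Z-score one speaker's values for one formant (sample SD assumed)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(value - mean) / sd for value in values]


# Hypothetical vowel spanning 1.00 s to 1.10 s.
retained_times = trim_edges(sample_times(1.00, 1.10))
print(len(retained_times))

# Hypothetical F1 values (Hz) pooled over one speaker's retained points.
print(lobanov([400.0, 500.0, 600.0]))
```

In practice, `lobanov` would be applied once per speaker and per formant, over all retained points from all of that speaker's included tokens.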
Table 1. Counts of vowel tokens used in the analysis, grouped by upcoming Break Index value and by the presence of an upcoming pause.
| Grouping | Level | /a/ | /i/ | /u/ | /e/ | /o/ |
|---|---|---|---|---|---|---|
| Break Index | 0 (word-internal) | 62 763 | 11 215 | 8380 | 25 503 | 28 514 |
| Break Index | 1 (AP-internal) | 21 100 | 10 947 | 8015 | 13 036 | 23 847 |
| Break Index | 2 (AP-final) | 11 857 | 5203 | 2275 | 4090 | 11 487 |
| Break Index | 3 (IP-final) | 18 007 | 5454 | 3426 | 12 541 | 12 615 |
| Pause presence | No pause | 101 128 | 28 750 | 19 327 | 45 208 | 68 276 |
| Pause presence | Pause | 12 599 | 4069 | 2769 | 9962 | 8187 |
To examine the distinct but overlapping effects of prosodic boundary strength, vowel duration, and following pause presence on vowel formant trajectories, the F1 and F2 trajectories were modelled using generalised additive mixed models (GAMMs) (Wood, 2017), which have been utilised in numerous recent analyses of dynamic formant trajectories (e.g., Kirkham et al., 2019; Renwick and Stanley, 2020; Sóskuthy et al., 2018; Stanley et al., 2021; Strycharczuk and Scobbie, 2017). While there are numerous other approaches to the statistical analysis of formant trajectories (e.g., Farrington et al., 2018; Risdal and Kohn, 2014; Williams and Escudero, 2014), GAMMs are well suited to addressing the research questions of the study, as they make it possible to evaluate directly the distinct roles of prosodic boundaries, duration, and pauses on vowel position and trajectory shape, as well as the relationships between predictors on the outcome variable.
GAMMs for the normalised F1 and F2 trajectories were fit separately for each vowel using the mgcv package (Wood, 2011) in R (R Core Team, 2021), with parametric (linear) terms for BI label and the presence of a following pause, and non-parametric (smooth) terms for the sampling timepoint, duration, BI label (by timepoint), and following pause (by timepoint). To model the relationship between predictors, the models also included separate tensor product interactions between duration and sampling timepoint, which were also fit by BI label and by following pause presence. To control for the potential effect of confounding variables, each model was fit with parametric and non-parametric terms for the presence of PBMs and lexical pitch accents. To control for speaking rate (as a confounder to duration), each model also included a parametric term for local speech rate (syllables per second within an IP, subtracted from the speaker's mean speech rate), which was then scaled within-speaker (so that the speech rate term is equivalent across speakers). Each model also included a random smooth by speaker. As including a random smooth by word proved too computationally demanding, each model was instead fit with a parametric term for log-transformed word frequency. To compensate for non-independence of the formant points within a trajectory, each model was first fit and then refit with an AR1 error model, using the ρ value estimated from the original fitted model (Sóskuthy, 2021). The statistical significance of the predictors of interest (prosodic boundary strength, duration, pause, and their interactions) was assessed by means of model comparison using the itsadug package (van Rij et al., 2020), where models fit without the term(s) of interest are compared with the full model on their minimised smoothing parameter selection score (and evaluated via a test of the difference between scores), as well as by visual examination of predicted model trajectories (Renwick and Stanley, 2020).
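The ρ value used in the AR1 refit is commonly taken to be the lag-1 autocorrelation of the residuals from the initial model. The sketch below assumes that definition; the function name and the residual values are hypothetical, and the study itself was carried out in R rather than Python.

```python
def lag1_autocorrelation(residuals):
    """Estimate rho as the lag-1 autocorrelation of a residual sequence.

    Assumption: the residuals are ordered by timepoint, and the estimate
    follows the usual sample autocorrelation formula (covariance at lag 1
    divided by the variance, both computed around the overall mean).
    """
    n = len(residuals)
    mean = sum(residuals) / n
    centred = [r - mean for r in residuals]
    variance_sum = sum(c * c for c in centred)
    lag1_sum = sum(centred[i] * centred[i + 1] for i in range(n - 1))
    return lag1_sum / variance_sum


# Hypothetical residuals from an initial (non-AR1) model fit.
residuals = [0.30, 0.25, 0.10, -0.05, -0.20, -0.15, 0.05, 0.20]
print(lag1_autocorrelation(residuals))
```

In mgcv, the `bam` function accepts a `rho` argument for the AR1 parameter together with an `AR.start` indicator marking the first point of each trajectory, so that the correlation is not carried across separate vowel tokens.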
Code and models for this study are available (Tanner, 2023).
3. Results
The results of comparing the fully-specified F1 and F2 GAMMs with those without each term of interest (prosodic boundary strength, duration, and pause) can be seen in Table 2. Looking first at the influence of prosodic boundary strength, including Break Index information resulted in a significant improvement in model fit for all vowels and formant types, indicating that the upcoming prosodic boundary has some modulating effect on the position and shape of the formant trajectory (Table 2, rows 1–2). Figure 1(A) illustrates this observation, showing the marginal effect of Break Index on each vowel's F1 and F2 trajectories. In contrast with prior studies on prosodic prominence (Jacewicz et al., 2009), where vowels with greater prominence are more peripheral and have longer trajectories, there does not appear to be a clear order to the boundary effect on each vowel's acoustic realisation. Instead, trajectories appear similar in length across BI levels, with the position in vowel space not following any obvious “stronger > weaker” pattern with respect to boundary strength. Like prosodic boundary strength, the duration of the vowel also significantly improves model fit (Table 2, rows 3–4). As Fig. 1(B) shows, the duration of the vowel, independent of other prosodic factors, has a strong and ordered effect on the vowel, where vowels with longer durations are both more peripheral in formant space and exhibit substantially longer trajectories, which supports previous observations regarding vowel duration and spectral characteristics (Fox and Jacewicz, 2009; Fridland et al., 2014; Mayr and Davies, 2011). The presence of a pause also has a significant independent effect on all vowels, with the exception of F2 for /u/ (Table 2, rows 5–6). Figure 1(C) shows that vowels preceding a pause appear to have a longer trajectory than those not preceding a pause, as well as appearing more peripheral (particularly in the starting points of the trajectories).
Table 2. Model comparisons between the full model and different subset models for each vowel. The first column denotes the set of terms removed in the subset model; the second column denotes the degrees of freedom (total difference in the number of terms between the subset and full model); the third column denotes the formant modelled. The value for the model comparison is reported for each vowel, with significant results (p < 0.05) in bold.
| Model vs. full | Degrees of freedom | Formant | /a/ | /i/ | /u/ | /e/ | /o/ |
|---|---|---|---|---|---|---|---|
| No BI | 18 | F1 | 685.2 | 117.1 | 67.6 | 676.5 | 578.5 |
| No BI | 18 | F2 | 951.1 | 144.1 | 412.2 | 2196.2 | 747.9 |
| No duration | 28 | F1 | 4462.3 | 77.7 | 23.8 | 2210.4 | 1437.1 |
| No duration | 28 | F2 | 1570.1 | 256.9 | 440.1 | 640.5 | 2316.9 |
| No pause | 11 | F1 | 3896.6 | 16.6 | 43.6 | 820.3 | 1570.9 |
| No pause | 11 | F2 | 513.9 | 27.5 | 1.1 | 137.3 | 129.1 |
| No BI × duration | 9 | F1 | 277.1 | 38 | 0.9 | 81.2 | 117.2 |
| No BI × duration | 9 | F2 | 80.9 | 5.9 | 14.4 | 72.3 | 137.3 |
| No pause × duration | 6 | F1 | 284.4 | 3.5 | 9.5 | 24.8 | 267.6 |
| No pause × duration | 6 | F2 | 29.6 | 4.3 | 0.7 | 25.7 | 7.2 |
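The general form of a score-based model comparison can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of the analysis: it assumes that twice the difference in scores between the subset and full model is referred to a chi-square distribution with the stated degrees of freedom, and the function names and score values are hypothetical. It is not applied to the values in Table 2, since the exact quantity reported there is not specified.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^i / i!.
        term = 1.0
        total = 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: two-sided normal tail plus a finite correction series.
    root = math.sqrt(x)
    term = root
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    return math.erfc(root / math.sqrt(2.0)) + 2.0 * density * total


def compare_scores(score_subset, score_full, df_difference):
    """Return the score difference and its chi-square p-value.

    Assumption: lower scores indicate better fit, and the test statistic
    is twice the difference between the subset and full model scores.
    """
    difference = score_subset - score_full
    return difference, chi2_sf(2.0 * difference, df_difference)


# Hypothetical scores for a subset and a full model differing by 6 terms.
difference, p_value = compare_scores(1520.0, 1500.0, df_difference=6)
print(difference, p_value)
```

A closed-form tail probability is used here only to keep the sketch free of external dependencies; a statistics library would normally supply it.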
Fig. 1. Predicted normalised formant trajectories by Break Index (A), duration (B), and presence of following pause (C), estimated as the median value from 10 000 draws from each model's posterior distribution. Trajectories reflect the “marginal effect” of the term of interest, where all other model terms are held at their average values. For duration, “average” corresponds to the average normalised duration, while “short” and “long” correspond to approximately 1 standard deviation below or above the average normalised duration, respectively.
Having examined the independent effects of prosodic boundary strength, duration, and following pause presence, the second goal of this study is to consider how these effects interact to condition each vowel's positional and dynamic realisation, given the strong overlap of these effects in speech. First considering how duration modulates the influence of prosodic boundary strength, Table 2 (rows 7–8) shows that including this interaction as a non-parametric effect (i.e., its effect on trajectory shape) results in a significantly improved model fit for all vowels except /u/. Figure 2(A) shows the model-predicted vowel trajectories at different duration values, and demonstrates that while /a/ (F1) and /e/ (F1 and F2) appear to exhibit greater trajectory differences between BI levels at longer vowel durations, such distinctions (in F2) for /o/ and /u/ are more apparent at shorter vowel durations. The relationship between duration and the presence of a following pause is more constrained, being significant for both F1 and F2 only for /a/ and /e/, and for F1 only for /o/ and /u/ (Table 2, rows 9–10). Compared with the interaction with Break Index, Fig. 2(B) shows a more complicated relationship between pause presence and duration, where the difference between pause and non-pause trajectories is greater at longer durations for /a/ F1, while shorter durations for /o/ F2, /a/ F2, and /e/ F2 differ substantially by the presence of a following pause.
Fig. 2. Predicted normalised F1 (solid) and F2 (dashed) trajectories for Break Index (A) and following pause (B) at different vowel durations. For duration, “average” corresponds to the average normalised duration, while “short” and “long” correspond to approximately 1 standard deviation below or above the average normalised duration, respectively. Lines reflect the median estimated value, and shaded areas correspond to 95% confidence intervals based on 10 000 draws from each model's posterior distribution.
4. Discussion
The goal of this study has been to examine the relationship between prosodic structure and the phonetic realisation of speech segments, focusing specifically on how the spectral properties of vowels (both overall position in formant space and the degree of spectral change) are modulated by the strength of an upcoming prosodic boundary. Given that the edges of prosodic domains are also closely related to other contextual factors, including increased duration due to pre-boundary lengthening (Cho, 2015; Turk and Shattuck-Hufnagel, 2007; Wightman et al., 1992) and the presence of pauses following the boundary (Ferreira, 1993), this study attempts to examine the influence of prosodic boundary strength on vowel realisation whilst accounting for the presence of these co-occurring effects, as well as considering how these effects interact to modulate the phonetic realisation of vowels.
Through modelling the effects of prosodic boundary strength, vowel duration, and pause presence in a corpus of spontaneous Japanese speech, it is found that prosodic boundary strength does exhibit an effect on the formant position and dynamic properties of Japanese phonemically short vowels, though this effect does not appear to follow a clear pattern of incremental strength (i.e., where stronger boundaries would result in more peripheral or dynamic vowels). This contrasts with previous studies examining the influence of prosodic prominence on vowel dynamics (Jacewicz et al., 2009; Wouters and Macon, 2002), and suggests that different dimensions of prosodic structure may have distinct effects on phonetic realisation. Instead, the duration of the vowel is shown to have a strong directional effect on the vowel's spectral properties, consistent with previous observations (Fridland et al., 2014; Jacewicz et al., 2009); it would be tempting, then, to consider the possibility that observations of increased vowel peripherality and spectral change in prosodically strong positions may instead reflect the confounding effect of duration. In other words, vowels in prosodically strong positions are typically longer in duration, which would in turn drive the observed effects on a vowel's formant position and trajectory. However, as Table 2 and Fig. 2(A) seem to suggest, some distinctions in prosodic boundary strength are more apparent at shorter durations (e.g., /o/ and /u/ F2), indicating that phonetic vowel duration alone cannot account for the subtle role of prosodic boundary strength observed in this analysis.
A possible explanation for this subtle boundary strength effect may be that Japanese makes little use of vowel dynamics for signalling the presence of a boundary, given that Japanese maintains a robust system of intonational cues for marking the edges of prosodic domains (Igarashi, 2014; Venditti et al., 1998; Venditti et al., 2008) and given the wide range of cross-linguistic variation in the perception of acoustic cues to prosodic structure (Kim, 2020). As such, it is possible that cue-weighting for the perception of boundaries is oriented towards durational lengthening, F0 reset, pauses, and the presence of PBMs (Krivokapić, 2007; Swerts, 1997; Wightman et al., 1992). Further research concerning the perception of prosody in Japanese would be needed to validate this hypothesis. Finally, as this study focuses exclusively on phonemically short vowels in Japanese, one possibility is that these vowels may be too phonetically short for a speaker to articulatorily manifest any boundary-related effects. In this sense, further research may wish to examine the role of boundary strength on phonemically long vowels (e.g., /e:/, /a:/, etc.) or other vowel–vowel adjacent sequences (e.g., /ai/, /oi/).
With respect to the relationship between prosody and the phonetic realisation of segments, this study has demonstrated that prosodic structure influences not just the temporal properties of segments, but also the spectral properties of vowels, including their position in formant space and their degree of dynamic formant change. Specifically, however, it is shown that the factors that often co-occur with prosodic boundaries in natural speech (segmental lengthening and the presence of pauses) appear to have the most pronounced effects on a vowel's spectral properties, which may reflect the articulatory and perceptual overlap in these factors. In this sense, this study illustrates the importance of multifactorial approaches to the study of phonetic phenomena, particularly in natural-speech settings (Tomaschek et al., 2018), where the analysis of potentially collinear or confounding variables may reveal patterns otherwise unobserved when single factors are examined in isolation.
Acknowledgments
The author wishes to thank Yōsuke Igarashi, Morgan Sonderegger, and Jane Stuart-Smith for helpful comments. Computational resources were provided by Calcul Québec and the Digital Research Alliance of Canada. The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to licensing restrictions (see https://clrd.ninjal.ac.jp/csj/en/index.html).
¹ The mora is generally considered to be the primary timing unit in Japanese (Warner and Arai, 2001), while the syllable plays a separate role in constraining phonological well-formedness (Kawahara, 2016).
² Previous studies have made use of a “final” or “utterance” label based on the presence of a final pause (e.g., Martin et al., 2016); since examining the effect of a final pause is one goal of this study, an utterance-level label is not used and the presence of pauses is instead examined separately.