Phonological categories are often differentiated by multiple phonetic cues. This paper reports a production and perception study of a laryngeal contrast in Shanghai Wu that is not only cued in multiple dimensions, but also cued differently on different manners (stops, fricatives, sonorants) and in different positions (non-sandhi, sandhi). Acoustic results showed that, although this contrast has been described as phonatory in earlier literature, its primary cue is in tone in the non-sandhi context, with vowel phonation and consonant properties appearing selectively for specific manners of articulation. In the sandhi context where the tonal distinction is neutralized, these other cues may remain depending on the manner of articulation. Sonorants, in both contexts, embody the weakest cues. The perception results were largely consistent with the aggregate acoustic results, indicating that speakers adjust the perceptual weights of individual cues for a contrast according to manner and context. These findings support the position that phonological contrasts are formed by the integration of multiple cues in a language-specific, context-specific fashion and should be represented as such.
I. INTRODUCTION
A standard assumption about phonological contrast is that it is categorical, based on either segments (/p/ vs /b/) or features ([−voice] for /p/, [+voice] for /b/; Jakobson et al., 1952; Chomsky and Halle, 1968; Stevens, 2002; Clements, 2009). A major challenge for phoneticians and phonologists alike is to account for how speakers categorize gradient and variable acoustic signals into such discrete entities. Two salient aspects of this challenge relate to how featural contrasts are instantiated acoustically. First, contrasts are often differentiated by multiple acoustic cues. The stop voicing contrast in English, for example, is associated with differences in voice-onset time (VOT), closure duration, f0 of the following vowel, and a host of other acoustic properties (Lisker, 1986). Second, the acoustic cues for the same contrast often depend on the phonological context in which the contrast appears. For instance, the English voicing contrast would not benefit from the f0 cue of the following vowel in the final position, but would benefit from a duration difference on the vowel preceding it (Chen, 1970; Raphael, 1972). 
The investigations of how a contrast is acoustically realized in a multidimensional fashion, how the different acoustic cues are weighted in the perception of the contrast, and how the weighting is affected by the acoustic dimensions along which the cues vary, the distributional characteristics of the acoustic cues, the context in which the contrast appears, and the listeners' language background have contributed to significant theoretical issues in phonetics and phonology, such as the mode of speech perception (Repp, 1983; Parker et al., 1986; Massaro, 1987), the nature of distinctive features (Halle and Stevens, 1971; Kingston, 1992; Stevens and Keyser, 2010), the production-perception link (Newman, 2003; Shultz et al., 2012; DiCanio, 2014), the influence of phonological knowledge of a language on perception (Massaro and Cohen, 1983; Flege and Wang, 1989; Dupoux et al., 1999; Hallé and Best, 2007), the theories of perceptual contribution of secondary cues (Holt et al., 2001; Francis et al., 2008; Kingston et al., 2008; Llanos et al., 2013), and the mechanisms of phonetic category learning (Clayards et al., 2008; Toscano and McMurray, 2010; McMurray et al., 2011).
This paper contributes to this scholarship by presenting a case study on the cue realization and cue weighting of a laryngeal contrast on different segments in different contexts in Shanghai Wu. Like many Wu dialects of Chinese, Shanghai has a three-way distinction among voiceless aspirated, voiceless unaspirated, and voiced stops. The voiced series, however, is not realized with typical closure voicing, but is known as “voiceless with voiced aspiration” (Chao, 1967), indicating the involvement of breathy phonation. On fricatives, there is a two-way voicing contrast, whereby the voiced fricatives are truly voiced, and on sonorants, there is a modal-murmured distinction that corresponds to the voiceless-voiced distinction in obstruents (Chao, 1967; Xu and Tang, 1988; Zhu, 1999, 2006).
Shanghai Wu, like other Chinese dialects, is also tonal. There are three phonetic tones on open or sonorant-closed syllables, transcribed as 53, 34, and 13, and two phonetic tones on ʔ-closed syllables, 55 and 12. But there is a co-occurrence restriction between tones and onset laryngeal features in that the higher tones 53, 34, and 55 only occur on syllables with voiceless obstruent or modal sonorant onsets, and the lower tones only occur with phonologically voiced obstruent or murmured sonorant onsets (Xu and Tang, 1988; Zhu, 1999, 2006). Therefore, in Shanghai, there is a minimal contrast between tɔ34 “to arrive” and dɔ13 “news,” and this contrast is cued by both the voice quality of the initial consonant and f0. The examples in Table I illustrate the co-occurrence of the two rising tones 34 and 13 with the laryngeal features in Shanghai.
Examples of laryngeal and tone co-occurrence restrictions in Shanghai. Voiceless obstruents or modal sonorants co-occur with the high-rising tone 34; voiced obstruents or murmured sonorants co-occur with the low-rising tone 13.
| Stops | | Fricatives | | Sonorants | |
|---|---|---|---|---|---|
| pu34 | “cloth” | fi34 | “fee” | mε34 | “beautiful” |
| phu34 | “tattered” | | | | |
| bu13 | “division” | vi13 | “fat” | m̤ε13 | “plum” |
Tones in connected speech are affected by a tone change process called tone sandhi in Shanghai. Polysyllabic compound words undergo a rightward spreading tone sandhi process by extending the tone on the first syllable over the entire compound domain and consequently wiping out the tonal contrasts in non-initial syllables (Zee and Maddieson, 1980; Xu and Tang, 1988; Zhu, 1999, 2006). For example, tɔ34 “to arrive” and dɔ13 “news,” when appearing as the second syllable of a disyllabic compound, are reported to lose their tonal difference, as shown in the following examples: /pɔ34-tɔ34/ → [pɔ33-tɔ44] “check-in”; /pɔ34-dɔ13/ → [pɔ33-dɔ44] “news report.” The voicing difference between the onset consonants on the second syllable, however, remains, and the voiced stops have been reported to have closure voicing in this position (Cao and Maddieson, 1992; Ren, 1992; Shen and Wang, 1995; Chen, 2011; Wang, 2011; Gao, 2015; Gao and Hallé, 2017).
The data pattern in Shanghai, therefore, presents a clear example in which a phonological contrast is realized differently on different manners and different positions: stops, fricatives, and sonorants can all carry the contrast, but via different sets of cues; the monosyllabic context is significant in that it is the only context in which the phonation-tone co-occurrence, as illustrated in Table I, is fully manifested, while the second syllable of disyllables constitutes a position where the cues for the contrast are considerably altered by a tone sandhi process. We specifically focus on the contrast between voiceless unaspirated/modal and voiced/murmured consonants co-occurring with a high-rising and a low-rising tone, respectively (e.g., tɔ34 vs dɔ13; mε34 vs m̤ε13). As we review in Sec. I B below, although previous studies have established the multidimensional nature of this contrast, as well as the fact that the cues for the contrast vary by prosodic position, no study has expressly compared the realization of cues in different manners or studied how the cues are weighted in perception across manners and positions. This study aims to achieve these goals. In so doing, it has the potential to make the following unique contributions. First, previous studies on the perceptual contributions of voicing and f0 of a contrast have primarily been conducted on non-tone languages like English and Spanish, and in these languages, voicing has been found to be the primary cue (Abramson and Lisker, 1985; Shultz et al., 2012; Llanos et al., 2013). Shanghai, being from a tone-language family, could work in the opposite way with tone as a primary cue and voicing/voice quality a secondary cue, similar to Southern Vietnamese (Brunelle, 2009) and Eastern Cham (Brunelle, 2012).
This provides an opportunity to observe the influence of language background on how cues are weighted and the limit and potential reasons for the primacy of a particular cue (see also Francis et al., 2008; Llanos et al., 2013). Second, the positional dependency of the realization of this contrast results from not only the position per se, but also a phonological alternation process that, at least according to the descriptive literature, categorically neutralizes one of the cues (tone) in the non-initial context. This puts the context scenario here, phonologically, between full realization (e.g., voicing in final position in English) and full neutralization (e.g., manner contrast in final position in Korean; Kim and Jongman, 1996) and allows it to contribute to the large literature on incomplete neutralization (e.g., Dinnsen and Charles-Luce, 1984; Port and Crawford, 1989; Warner et al., 2004; Dmitrieva et al., 2010). Third, phonetic studies of phonation have primarily focused on vowels (e.g., Huffman, 1987; Andruski and Ratliff, 2000; Blankenship, 2002; Wayland and Jongman, 2003; Esposito, 2010a, 2012; Khan, 2012) and obstruent consonants (e.g., Davis, 1994; Mikuteit and Reetz, 2007; Dutta, 2009; Berkson, 2016a); studies on sonorant consonant phonation (e.g., Aoki, 1970; Traill and Jackson, 1988; Berkson, 2016b) are relatively rare, presumably due to their typological rarity and the weak acoustic cues they embody (Berkson, 2016b). Shanghai furnishes an example that has a laryngeal contrast in both obstruents and sonorants, and thus provides a rare venue to compare the acoustics and perception of the contrast on the two types of segments.
A. Acoustic correlates of breathiness
During the production of breathy phonation, the vocal folds are in a relatively abducted configuration with low longitudinal tension. Articulatorily, this results in a higher open quotient of the glottal cycle and a less abrupt glottal closing gesture; aerodynamically, the increased airflow volume and the loose vibratory mode of the vocal fold cause turbulence noise at the glottis, which gives the auditory perception of breathy voice (Gordon and Ladefoged, 2001).
A host of acoustic parameters that result from these articulatory and aerodynamic properties have been identified in the literature. In terms of spectral measures, Klatt and Klatt (1990) and Holmberg et al. (1995) showed that a higher open quotient correlates with a greater difference between the amplitude of the first two harmonics (H1-H2), and Stevens (1977) and Hanson et al. (2001) demonstrated that the more gradual glottal closure results in a steeper spectral tilt that can be measured by the amplitude differences between f0 and F1-F3 (H1-A1, H1-A2, H1-A3). In terms of periodicity measures, Hillenbrand et al. (1994) advocated the use of cepstral-peak prominence (CPP), a measure of peak harmonic amplitude adjusted for the overall amplitude, of which breathy phonation is expected to have lower values than modal phonation; the harmonics-to-noise ratio (HNR) has also been used, with breathy phonation having lower HNR values (de Krom, 1993). In studies of phonological breathiness crosslinguistically, these measures have often been shown to be relevant acoustic and perceptual correlates. For instance, increased H1-H2 and spectral tilt measures have been found to be acoustic correlates of breathy vowels in Hmong (Huffman, 1987; Andruski and Ratliff, 2000; Esposito, 2012; Garellek et al., 2013), Khmer (Wayland and Jongman, 2003), Ju|'hoansi (Miller, 2007), Hindi (Dutta, 2009), Gujarati (Khan, 2012), Jalapa Mazatec (Blankenship, 2002; Esposito, 2010b; Garellek and Keating, 2011), and Santa Ana del Valle Zapotec (Esposito, 2010a). Esposito (2010b) and Garellek et al. (2013), in addition, found that these measures directly contribute to the perception of breathiness. Lower CPP values have been found for breathy vowels in Jalapa Mazatec (Blankenship, 2002; Garellek and Keating, 2011), White Hmong (Esposito, 2012), and Gujarati (Khan, 2012). Lower HNR values were found for breathy vowels in Ju|'hoansi (Miller, 2007), but not in Khmer (Wayland and Jongman, 2003).
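To make two of these measures concrete, the sketch below estimates H1-H2 as the dB difference between the first two harmonic amplitudes in the FFT magnitude spectrum, and CPP as the height of the cepstral peak above a regression line fitted through the cepstrum. This is our own simplified illustration (hypothetical function names, no windowing or peak interpolation), not the implementation used in any of the studies cited above:

```python
import numpy as np

def h1_h2_db(frame, sr, f0):
    """H1-H2: dB difference between the amplitudes of the first two
    harmonics, read off the FFT magnitude spectrum."""
    spec = np.abs(np.fft.rfft(frame))
    bin_hz = sr / len(frame)                      # frequency resolution
    h1 = spec[round(f0 / bin_hz)]                 # amplitude near f0
    h2 = spec[round(2 * f0 / bin_hz)]             # amplitude near 2*f0
    return 20 * np.log10(h1 / h2)

def cpp_db(frame, sr, f0_range=(60, 400)):
    """Cepstral peak prominence: height of the cepstral peak above a
    regression line fitted through the cepstrum (simplified)."""
    log_spec = 20 * np.log10(np.abs(np.fft.rfft(frame)) + 1e-12)
    cep = np.fft.irfft(log_spec)                  # real cepstrum
    q = np.arange(len(cep)) / sr                  # quefrency in seconds
    lo, hi = int(sr / f0_range[1]), int(sr / f0_range[0])
    peak = lo + int(np.argmax(cep[lo:hi]))        # peak in the pitch range
    slope, intercept = np.polyfit(q[lo:hi], cep[lo:hi], 1)
    return cep[peak] - (slope * q[peak] + intercept)
```

On this logic, a breathy vowel is expected to show a larger H1-H2 (weaker second harmonic relative to the first) and a smaller cepstral peak (lower CPP, more aperiodicity) than a modal vowel.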
Duration measures have also been found to correlate with breathiness. For stops, breathy stops have shorter closure durations than their plain counterparts in Bengali (Mikuteit and Reetz, 2007), Hindi (Dutta, 2009), and Marathi (Berkson, 2016a), and the shorter closure duration of voiced stops compared to voiceless stops is well known (e.g., Lisker, 1986). For fricatives, Jongman et al. (2000) showed that voiced fricatives generally have shorter frication duration than their voiceless counterparts. The duration pattern for sonorant phonation is scantily documented, but there is some evidence that breathy sonorants tend to be longer than their modal counterparts, as reported for Marathi (Berkson, 2013).
Finally, the phonological co-occurrence between breathy phonation and lower tones found in Shanghai is attested elsewhere as well, e.g., in Santa Ana del Valle Zapotec (Esposito, 2010a) and Hmong (Andruski and Ratliff, 2000; Esposito, 2012). This may be rooted in the general f0 lowering effect of breathiness (Laver, 1980; Gordon and Ladefoged, 2001), which has been well attested, e.g., in Khmu' (Abramson et al., 2007), Hindi (Dutta, 2009), and Marathi (Berkson, 2013). But whether this effect is a phonetic universal remains controversial, as there are studies that have shown either an f0 raising effect (Wayland and Jongman, 2003, for Khmer) or the lack of an f0 correlate (Garellek and Keating, 2011, for Jalapa Mazatec) for breathiness.
B. Previous research on the phonation–tone interaction in Shanghai Wu
As previously stated, existing literature on phonation–tone interaction in Shanghai has firmly established that the cues for the laryngeal contrast of interest here are multidimensional in both non-sandhi and sandhi positions. Cao and Maddieson (1992) showed that for syllables in isolation, i.e., the non-sandhi context, H1-H2 and H1-A1 were significantly higher at vowel onset after the voiced stop than after the voiceless unaspirated stop, but the differences disappeared at the mid and end points of the vowel; for syllables in the sandhi context (e.g., second syllable in disyllables), only the H1-H2 difference remained at vowel onset, and the magnitude of the difference was smaller; but the voiced stops were “phonetically voiced.” The acoustic study by Ren (1992) also showed tapering H1-H2 and H1-A1 differences on the vowel after voiced and voiceless unaspirated stops in the non-sandhi position; but in the sandhi position, Ren found an H1-A1 difference instead of an H1-H2 difference. Ren (1992) also conducted a perception study in which H1-H2 was varied in ten steps and f0 in three steps on the initial portion of the vowel after a stop in the sandhi position (ɦa13–ta34 “shoelace” to ɦa13–da13 “shoe (is) big”). Results showed that both H1-H2 and f0 had an effect on the perception of the second syllable: the /d/ response was more likely with a higher H1-H2; a raised f0 shifted the response toward /t/, while a lowered f0 shifted the response toward /d/. Shen and Wang (1995) focused on the roles of the closure and release durations of the stop as the acoustic correlates of stop voicing. They showed that, although the two types of stops did not differ in their release duration (duration between the stop burst and the beginning of vowel periodicity), the voiceless stops had a significantly longer closure duration than the voiced stops in both initial and medial positions, and the voiced stops had closure voicing medially.
The acoustic study by Wang (2011) returned similar duration results to Shen and Wang's except that she did not find a closure duration difference based on voicing in the initial position. In a series of perception studies that manipulated closure duration and f0, Wang showed that when restricting the tones to the two rising tones 34 (co-occurs with voiceless) and 13 (co-occurs with voiced), f0 was the primary perceptual cue for the contrast in initial position; in the medial position, both f0 and closure duration were used perceptually for the contrast, but closure voicing was not. Chen (2011) focused on the f0 perturbation effect from the stop voicing contrast in the sandhi context and found that the effect was minimal, and that its size was partly determined by the underlying tone of the preceding syllable. Chen argued that these patterns potentially serve the purpose of maximizing the tonal contrast on the preceding syllable, which determines the pitch contour of the entire sandhi domain; therefore, the f0 perturbation here is speaker controlled, at least in part. For H1-H2, Chen only found the expected difference in the /o/ context, with the voiced stops inducing greater H1-H2; for the /i/ context, the effect was the reverse.
In a dissertation (Gao, 2015) and a series of related publications (Gao and Hallé, 2013, 2015, 2016, 2017), Gao and Hallé presented the most comprehensive study of Shanghai phonation–tone interaction to date. Their acoustic investigation included all three manners (stops, fricatives, nasals) as onsets in monosyllables as well as both syllables of disyllables. In terms of duration, a consonant-vowel (CV) syllable with a voiceless fricative onset had a longer consonant and a shorter vowel than one with a corresponding voiced fricative onset (Gao, 2015); voiced stops had a significantly longer VOT than voiceless unaspirated stops by around 2–4 ms (Gao, 2015; Gao and Hallé, 2017). In terms of voicing in the initial position, voiced stops rarely had voicing, while voiced fricatives had voicing ratios (percentages of consonant duration being voiced) of around 30%–40%; in medial position, voiced stops and fricatives had over 90% voicing ratios, compared to around 20%–30% for voiceless ones (Gao, 2015; Gao and Hallé, 2017). For spectral and periodicity measures, they showed that for monosyllables, H1-H2, H1-A1, and H1-A2 were generally higher, while CPP was generally lower following voiced/murmured onsets than voiceless/modal ones, but the differences were the greatest and the most consistent for elder male speakers; linear discriminant analyses (LDAs) showed that H1-H2 was the most consistent cue across age and gender groups and in different tonal contexts. Only H1-H2 results were reported for the two syllables in disyllables. Results showed that for the first syllable, H1-H2 was higher after voiced/murmured onsets, but the difference was less clear-cut than in monosyllables; for the second syllable, no H1-H2 difference based on the voicing difference was found (Gao, 2015; Gao and Hallé, 2017). Perceptually, two experiments were conducted to investigate the effect of duration and voicing patterns on the identification of the laryngeal contrast. 
The first experiment created “congruent” and “incongruent” monosyllabic stimuli by imposing the f0 of one CV onto another when the two onsets differed in voicing, and the results showed that the congruence factor significantly affected the accuracy and reaction time of tone identification when the onsets were labial fricatives, which had the largest voicing difference. The second experiment created tonal continua between the two rising tones on both the long C-short V and short C-long V duration patterns and showed that the duration pattern shifted the listeners' identification response toward the category with that duration pattern, and the incongruence between tone and duration pattern slowed down the reaction time (Gao and Hallé, 2013; Gao, 2015). An additional experiment was carried out to investigate the effect of voice quality on perception. Tonal continua were again created between the two rising tones and imposed onto modal and breathy syllables (both synthesized and naturally produced modal and breathy syllables were used). Identification results showed that the voice quality of the syllable shifted the listeners' identification response toward the category with that phonation type, and the incongruence between tone and phonation slowed down the reaction time with the exception of naturally produced tokens with nasal onsets (Gao and Hallé, 2015; Gao, 2015).
With the exception of the work of Gao and Hallé, the previous studies only investigated a subset of the cues for stops. But even in the studies by Gao and Hallé, there was no direct comparison among the different manners, and their perception studies were restricted to monosyllables. The goal of the present work is to provide a comprehensive look at the acoustic realization and perception of the contrast between voiceless unaspirated/modal and voiced/murmured consonants co-occurring with a high-rising and a low-rising tone, respectively, across different manners (e.g., tɔ34 vs dɔ13; fi34 vs vi13; mε34 vs m̤ε13) and different contexts (sandhi, non-sandhi) using a consistent set of methods. In so doing, we aim to shed light on the language- and context-dependent nature of contrast realization and perceptual cue weighting, especially when a phonological alternation process is involved, as well as on the production-perception link. In Secs. II and III, a production study and a perception study conducted to this end are reported.
II. EXPERIMENT 1: PRODUCTION STUDY
A. Methods
Thirteen monosyllabic voiceless/modal vs voiced/murmured minimal pairs were used for the non-sandhi context (six stop pairs, four fricative pairs, three sonorant pairs); all voiceless/modal syllables occurred with the high rising tone 34 and all voiced/murmured syllables with the low rising tone 13 (e.g., pu34 and bu13). The same pairs were then used as the second syllable of disyllabic compounds with matched first syllable for the sandhi context (e.g., fən53-pu34 and fən53-bu13). Both the monosyllabic and disyllabic words were embedded in the carrier sentence ŋu34 ɕja34 __ ɡəʔ12 əʔ55 zɨ13 “I write the character/word ___.” The reason the target stimuli were put in sentence-medial position was to allow the measurement of closure duration for onset stops, as duration has been more consistently shown as a perceptual cue for the contrast in previous studies (Wang, 2011; Gao and Hallé, 2013; Gao, 2015). The trade-off, however, is that this creates an environment that may also facilitate consonant voicing for the voiced obstruents even for the monosyllables. Tone sandhi (or lack thereof) on the target words, however, is not expected to be affected by the sentential context, as the preceding verb ɕja34 “to write” and the following demonstrative ɡəʔ12 “this” do not belong to the same prosodic word as the target. The full word list is given in Table II.
Word list used in the production experiment. Tone transcriptions reflect the base tones before the application of tone sandhi.
| | Monosyllables | | | | Disyllables | | | |
|---|---|---|---|---|---|---|---|---|
| | Voiceless/modal | | Voiced/murmured | | Voiceless/modal | | Voiced/murmured | |
| Stops | pin34 | “pancake” | bin13 | “bottle” | ma13-pin34 | “to sell pancakes” | ma13-bin13 | “to sell bottles” |
| | pu34 | “to spread” | bu13 | “section” | fən53-pu34 | “distribution” | fən53-bu13 | “division” |
| | tɔ34 | “to arrive” | dɔ13 | “news” | pɔ34-tɔ34 | “check-in” | pɔ34-dɔ13 | “news report” |
| | ti34 | “emperor” | di13 | “brother” | ɦuɑ̃13-ti34 | “emperor” | ɦuɑ̃13-di13 | “royal brother” |
| | kuε34 | “rail” | guε13 | “hoop” | thiɪʔ55-kuε34 | “rail” | thiɪʔ55-guε13 | “iron hoop” |
| | koŋ34 | “arch” | goŋ13 | “together” | iɪʔ55-koŋ34 | “an arch” | iɪʔ55-goŋ13 | “all together” |
| Fricatives | fi34 | “fee” | vi13 | “fat” | kε34-fi34 | “to reduce the fee” | kε34-vi13 | “to lose weight” |
| | fən34 | “hard work” | vən13 | “article” | faʔ55-fən34 | “to work hard” | faʔ55-vən13 | “to publish an article” |
| | sɨ34 | “water” | zɨ13 | “porcelain” | dɑ̃13-sɨ34 | “sugar water” | dɑ̃13-zɨ13 | “porcelain” |
| | su34 | “lock” | zu13 | “seat” | tɕin53-su34 | “golden lock” | tɕin53-zu13 | “golden seat” |
| Sonorants | min34 | “chirp” | m̤in13 | “name” | ɳjɔ34-min34 | “bird's chirps” | ɳjɔ34-m̤in13 | “bird's name” |
| | mε34 | “America” | m̤ε13 | “plum” | ly34-mε34 | “traveling in the US” | ly34-m̤ε13 | proper name |
| | ɳjɔ34 | “bird” | ɳ̤jɔ13 | “around” | lε13-ɳjɔ34 | “blue bird” | lε13-ɳ̤jɔ13 | “indiscriminate” |
Ten native speakers (5 male, 5 female) with an age range of 19–30 and a mean age of 25 were recorded in a quiet room in Shanghai using an Electro-Voice N/D767 cardioid microphone (Burnsville, MN) and a Marantz portable solid state recorder (PMD 671, Cumberland, RI). Each of them read the stimuli twice. Subsequent measurements for the two repetitions were averaged before the statistical analyses.
Consonant durations were measured in Praat (Boersma and Weenink, 2012) by the second author. The duration for stops was the closure duration and was measured from the end of the previous syllable to the stop release. For fricatives and sonorants, the segments themselves were identified from the spectrograms and their durations measured. Durations were analyzed with linear mixed-effects models, with the laryngeal feature (referred to as voicing for brevity below) as fixed effects and subject and item as random effects. P-values were calculated using the lmerTest package in R (Kuznetsova et al., 2016). Monosyllables and disyllables were analyzed separately. Stops and fricatives were classified as “voiced” or “voiceless” depending on whether 50% or more of the consonant duration (closure for stops, frication duration for fricatives) had voicing, as determined from the waveforms and spectrograms in Praat by the first author.
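The 50%-voicing classification criterion can be illustrated programmatically. In the study itself, voicing was determined by visual inspection of waveforms and spectrograms in Praat; the sketch below is our own simplification, which flags frames as voiced with a crude normalized-autocorrelation periodicity check and then applies the 50% threshold over the consonant interval (function names and parameter values are ours):

```python
import numpy as np

def frame_is_voiced(frame, sr, f0_range=(60, 400), threshold=0.5):
    """Crude periodicity check: does the normalized autocorrelation,
    within the plausible pitch-period range, exceed a threshold?"""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:                       # silent frame
        return False
    ac = ac / ac[0]                      # normalize so ac[0] == 1
    lo = int(sr / f0_range[1])           # shortest plausible period
    hi = min(int(sr / f0_range[0]), len(ac) - 1)
    return ac[lo:hi].max() > threshold

def voicing_category(signal, sr, frame_ms=25):
    """Label an interval "voiced" if >= 50% of its frames are periodic."""
    n = int(sr * frame_ms / 1000)
    frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    ratio = float(np.mean([frame_is_voiced(f, sr) for f in frames]))
    return ("voiced" if ratio >= 0.5 else "voiceless"), ratio
```

A fully periodic stretch (e.g., closure voicing throughout) yields a voicing ratio of 1.0 and the label “voiced”; aperiodic frication yields a low ratio and “voiceless.”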
The spectral measure H1*-H2* (corrected H1-H2 based on the frequencies and bandwidths of formants; Shue et al., 2011) and the periodicity measure CPP were selected to estimate the breathiness induced by the contrast. H1*-H2* and CPP values were measured every millisecond in VoiceSauce v1.12 (Shue et al., 2011), and the measurements over every 9.1% of the vowel duration were averaged, yielding 11 data points for each vowel for statistical analysis. The Snack Sound Toolkit (Sjölander, 2004) was used by VoiceSauce to find the frequencies and bandwidths of the formants with the covariance method, a pre-emphasis of 0.96, and a window length of 25 ms with a frame shift of 1 ms. Fundamental frequencies were measured at 10% intervals during the vowel using the ProsodyPro Praat script (Xu, 2005–2013). The Maxf0 and Minf0 parameters in the script, as well as the octave-jump cost, were adjusted for each speaker, and the f0 measurements were manually checked by the second author against pitch tracks and narrowband spectrograms in Praat to correct any measurement errors by the script. The f0 values in Hz were then converted into semitones and z-scored. Growth curve analyses (Mirman, 2014) were conducted on the H1*-H2*, CPP, and f0 curves over the vowel using third-order (cubic) orthogonal polynomials. The models were built up from the base model that only included subject, item, and subject-by-voicing random effects. Voicing and its interaction with the time terms were subsequently added step-wise, and their effects on model fit were evaluated using log-likelihood model comparison. Parameter estimates for the full model were then tested for significance using t-tests, and p-values were again estimated by the lmerTest package. Different manners and different positions were analyzed separately, and the voiceless/modal category was used as the baseline.
H1*-H2* and CPP were similarly compared for sonorant consonants, but the measurements were averaged over every 20% of the sonorant duration, yielding only five data points for each sonorant. All statistical analyses were performed using the lme4 package (Bates et al., 2015) in R (R Core Team, 2014).
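The f0 normalization and the construction of orthogonal polynomial time terms for the growth curve models can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (an arbitrary 100 Hz semitone reference, which the z-scoring renders irrelevant; the time terms are analogous to those produced by R's poly(), obtained here by QR-orthogonalizing a Vandermonde matrix):

```python
import numpy as np

def semitone_z(f0_hz, ref_hz=100.0):
    """Convert f0 (Hz) to semitones relative to ref_hz, then z-score.
    The choice of reference cancels out in the z-scores."""
    st = 12 * np.log2(np.asarray(f0_hz, dtype=float) / ref_hz)
    return (st - st.mean()) / st.std()

def orthogonal_time_terms(n_points, degree=3):
    """Orthogonal polynomial time terms (linear, quadratic, cubic):
    QR-orthogonalize [1, t, t^2, t^3] and drop the constant column."""
    t = np.arange(n_points, dtype=float)
    X = np.vander(t, degree + 1, increasing=True)
    Q, _ = np.linalg.qr(X)
    return Q[:, 1:]          # mutually orthogonal, unit-norm columns
```

With the 11 averaged measurement points per vowel, orthogonal_time_terms(11) yields the linear, quadratic, and cubic predictors; their mutual orthogonality means the estimated curvature terms can be interpreted independently of the linear trend.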
To investigate the relative contribution of the different acoustic cues in the laryngeal contrast for each manner in monosyllables and disyllables, LDAs were conducted to explore the extent to which the laryngeal category can be predicted from the acoustic cues. The greedy.wilks function in the klaR package (Weihs et al., 2005) in R was used to conduct stepwise forward variable selection for significant predictors (p < 0.05), and the lda function in the MASS package (Venables et al., 2002) was used to derive the coefficients for the variables for the linear discriminant functions. The overall Wilks's lambda values (from 0 to 1) for the discrimination (0 means total discrimination, 1 means no discrimination), as well as their F and p values, were calculated using the manova function (see also Gao, 2015).
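As a rough illustration of what the discriminant analysis measures, the sketch below computes Wilks's lambda and Fisher discriminant coefficients for a two-category case in NumPy. This is our own simplified re-implementation for exposition, not the klaR/MASS code used in the analysis; smaller lambda indicates better discrimination of the two laryngeal categories from the acoustic cues:

```python
import numpy as np

def wilks_lambda(X, y):
    """Wilks's lambda = det(W) / det(T): within-group scatter over total
    scatter. 0 means total discrimination, 1 means none."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    centered = X - X.mean(axis=0)
    T = centered.T @ centered            # total scatter matrix
    W = np.zeros_like(T)                 # pooled within-group scatter
    for g in np.unique(y):
        cg = X[y == g] - X[y == g].mean(axis=0)
        W += cg.T @ cg
    return np.linalg.det(W) / np.linalg.det(T)

def lda_coefficients(X, y):
    """Fisher discriminant coefficients for a two-category contrast:
    inverse within-group scatter times the group mean difference."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    g0, g1 = np.unique(y)
    X0, X1 = X[y == g0], X[y == g1]
    W = np.zeros((X.shape[1], X.shape[1]))
    for Xg in (X0, X1):
        cg = Xg - Xg.mean(axis=0)
        W += cg.T @ cg
    return np.linalg.solve(W, X1.mean(axis=0) - X0.mean(axis=0))
```

The relative magnitudes of the resulting coefficients (on standardized cues) play the role of the LDA coefficients reported below: cues that separate the two categories well receive large weights, while uninformative cues receive weights near zero.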
B. Results
1. Duration and voicing measures
The consonant duration results are given in Fig. 1. For both the monosyllables and the second syllable of disyllables, the best model included the interaction between voicing and manner. An analysis with voicing nested under manner as fixed effects for monosyllables and disyllables, respectively, was then conducted to get voicing estimates for the different manners in the same model. For monosyllables, the effect of voicing is significant for fricatives (estimate = −59.168, Standard Error (SE) = 13.344, degrees of freedom (df) = 25.246, t = −4.434, p < 0.001), but not for stops (estimate = −11.073, SE = 10.930, df = 25.544, t = −1.013, p = 0.321) or sonorants (estimate = −0.783, SE = 15.393, df = 25.151, t = −0.051, p = 0.960). For the second syllable of disyllables, likewise, the effect of voicing is significant for fricatives (estimate = −66.554, SE = 12.558, df = 25.792, t = −5.300, p < 0.001), but not for stops (estimate = −8.870, SE = 10.250, df = 25.755, t = −0.865, p = 0.395) or sonorants (estimate = 0.182, SE = 14.484, df = 25.676, t = 0.013, p = 0.990).
Duration of onset consonants in monosyllables and the second syllable of disyllables. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
In terms of voicing, 89% of the voiced stops and 100% of the voiced fricatives in the second syllable of disyllable tokens were classified as voiced. This generally agrees with the results of earlier studies (Shen and Wang, 1995; Chen, 2011; Wang, 2011; Gao, 2015; Gao and Hallé, 2017). In monosyllables, where the consonant appears intervocalically in the sentential context, 33% of the voiced stop onsets were also classified as voiced. For fricatives, 100% of the voiced bilabial fricatives and 50% of the voiced coronal fricatives were classified as voiced. The tendency for bilabial fricatives to show more voicing in this position has also been documented in Gao (2015) and Gao and Hallé (2017). Voiceless obstruents were occasionally voiced (11% in monosyllables, 31% in the second syllable of disyllables), contrary to traditional descriptions; we have no good explanation for this except that the intervocalic or post-nasal positions in which they appear may have encouraged phonetic voicing.
2. Spectral and periodicity measures
The H1*-H2* and CPP results for the vowels after the three consonant manners in monosyllables are given in Figs. 2 and 3, respectively. Model comparisons for H1*-H2* showed that the model did not significantly improve with the addition of voicing or its interactions with the linear, quadratic, and cubic time terms for any manner (p > 0.15 for all comparisons). For CPP, the interaction between voicing and the quadratic time term did significantly improve the model for fricatives [χ2(1) = 8.455, p = 0.004]. Parameter estimates for the quadratic interaction (estimate = 2.160, SE = 0.610, t = 3.538, p = 0.006) indicated that voiceless fricatives induced a sharper peak for the CPP curve on the following vowel than voiced ones; no other model comparisons were significant (all p > 0.07).
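The growth curve models used here are built on orthogonal polynomial time terms. As a minimal sketch (an analogue of R's poly(), not the authors' code; the ten normalized time points are an assumption for illustration), such terms can be obtained from a QR decomposition of the Vandermonde matrix:

```python
import numpy as np

def ortho_poly(t, degree):
    """Orthogonal polynomial time terms, analogous to R's poly(t, degree):
    the returned columns are mutually orthonormal and exclude the intercept."""
    t = np.asarray(t, dtype=float)
    V = np.vander(t, degree + 1, increasing=True)  # columns: 1, t, t^2, ...
    Q, _ = np.linalg.qr(V)
    return Q[:, 1:]  # drop the constant column

# Ten normalized time points over the vowel; cubic terms as in the growth curves.
time = np.linspace(0, 1, 10)
P = ortho_poly(time, 3)  # columns: linear, quadratic, cubic
```

Because the columns are orthogonal, the effects of voicing on the intercept, linear, quadratic, and cubic terms can be tested independently, which is what the model comparisons above rely on.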
H1*-H2* results over the duration of the vowels after stops, fricatives, and sonorants for monosyllables. Symbols represent observed data (vertical lines indicate ± SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
CPP results over the duration of the vowels after stops, fricatives, and sonorants for monosyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
For the second syllable of disyllables, stops and sonorants again did not exhibit any phonatory difference in H1*-H2* or CPP on the following vowel based on their laryngeal features (p > 0.18 for all model comparisons). For fricatives, however, model comparisons showed that for H1*-H2* the effect of voicing on the intercept significantly improved the model [χ2(1) = 9.564, p = 0.002], and parameter estimates (estimate = 2.241, SE = 0.568, t = 3.942, p = 0.002) indicated that voiceless fricatives induced a lower H1*-H2* than voiced fricatives; for CPP, the effects of the laryngeal feature on the intercept and quadratic time terms both significantly improved the model [intercept: χ2(1) = 8.752, p = 0.003; quadratic: χ2(1) = 6.353, p = 0.012], and parameter estimates showed a significant effect for the quadratic interaction (estimate = 2.881, SE = 0.943, t = 3.054, p = 0.011), indicating that voiceless fricatives again induced a sharper peak for the CPP curve on the following vowel. These results are given in Figs. 4 and 5.4
H1*-H2* results over the duration of the vowels after stops, fricatives, and sonorants for the second syllable of disyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
CPP results over the duration of the vowels after stops, fricatives, and sonorants for the second syllable of disyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
For the spectral and periodicity measures on the sonorant consonants themselves, for monosyllables, the model for H1*-H2* did not significantly improve with the addition of voicing or its interactions with the linear, quadratic, and cubic time terms (p > 0.75 for all comparisons), but the model for CPP did improve with the addition of voicing on the intercept [χ2(1) = 4.818, p = 0.028] and the quadratic time term [χ2(1) = 4.064, p = 0.044]. Parameter estimates indicated that the modal sonorants had an overall higher CPP value than the murmured sonorants (voicing intercept: estimate = −1.815, SE = 0.510, t = −3.561, p = 0.005) and that the murmured sonorants had a more U-shaped CPP curve than the modal sonorants (voicing by quadratic time term interaction: estimate = 0.890, SE = 0.395, t = 2.256, p = 0.041). For sonorant onsets on the second syllable of disyllables, the models for H1*-H2* and CPP did not significantly improve with the addition of voicing or its interactions with the linear, quadratic, and cubic time terms (p > 0.33 for all comparisons). The monosyllabic and disyllabic results are given in Figs. 6 and 7, respectively.
H1*-H2* and CPP results over the duration of sonorant onsets for monosyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
H1*-H2* and CPP results over the duration of sonorant onsets for the second syllable of disyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
3. f0
The f0 results for the monosyllables and the second syllable of disyllables are given in Figs. 8 and 9, respectively. For monosyllables, the addition of voicing improved the model for the stops [χ2(1) = 8.350, p = 0.004] and fricatives [χ2(1) = 15.153, p < 0.001], and the addition of its interaction with the linear time term improved the model for the fricatives [χ2(1) = 11.224, p < 0.001] and sonorants [χ2(1) = 4.472, p = 0.034]. Parameter estimates for the full model, which includes the effects of voicing and its interactions with the linear, quadratic, and cubic time terms for the three manners, are summarized in Table III. With the voiceless/modal category as the baseline, the negative intercepts indicated that the f0s after the voiced/murmured consonants were significantly lower than those after the voiceless/modal consonants, and the positive coefficients for the interaction between voicing and the linear time term indicated that the f0s after the voiced/murmured consonants had sharper rising slopes; the f0 difference between the two types of onsets therefore decreased over the duration of the vowel. For the second syllable in disyllables, however, only for the fricatives did the addition of the laryngeal feature significantly improve the model [χ2(1) = 3.849, p = 0.050]. No other model comparisons were significant (all p > 0.12). Parameter estimates for the full models indicated that the effects of voicing on the intercept or higher time terms were not significant for any manner, including the fricatives.
Normalized f0 results over the duration of the vowels after stops, fricatives, and sonorants for monosyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
Normalized f0 results over the duration of the vowels after stops, fricatives, and sonorants for the second syllable of disyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
Parameter estimates for the monosyllable f0 analysis. Baseline = voiceless.
| Manner | Term | Estimate | SE | t | p |
|---|---|---|---|---|---|
| Stop | Voicing: Intercept | −0.805 | 0.190 | −4.228 | <0.001 |
| | Voicing: Linear | 0.533 | 0.271 | 1.967 | 0.068 |
| | Voicing: Quadratic | 0.330 | 0.208 | 1.588 | 0.137 |
| | Voicing: Cubic | −0.144 | 0.140 | −1.028 | 0.314 |
| Fricative | Voicing: Intercept | −1.180 | 0.176 | −6.699 | <0.001 |
| | Voicing: Linear | 1.153 | 0.228 | 5.045 | <0.001 |
| | Voicing: Quadratic | 0.353 | 0.184 | 1.917 | 0.078 |
| | Voicing: Cubic | −0.228 | 0.177 | −1.283 | 0.231 |
| Sonorant | Voicing: Intercept | −0.682 | 0.244 | −2.789 | 0.019 |
| | Voicing: Linear | 0.973 | 0.400 | 2.431 | 0.043 |
| | Voicing: Quadratic | 0.331 | 0.324 | 1.022 | 0.338 |
| | Voicing: Cubic | −0.175 | 0.142 | −1.233 | 0.232 |
4. Linear discriminant analysis
Consonant duration and CPP and f0 values averaged over the entire vowel duration were used as the acoustic variables in the linear discriminant analysis. These variables were selected as representatives of the acoustic properties of the consonant, vowel phonation, and vowel f0. Consonant duration was selected as the consonant cue as previous studies have primarily shown the perceptual effect of duration (e.g., Wang, 2011; Gao and Hallé, 2013; Gao, 2015), and Wang (2011) has shown that listeners did not use closure voicing as a perceptual cue for stops. CPP was selected as the phonation cue as our acoustic results above showed stronger CPP effects than H1*-H2*. The variables were centered and scaled before being submitted to the discriminant analysis.
Table IV summarizes the coefficients for the variables for the linear discriminant functions as well as the Wilks's lambda, F, and p values for the discriminations. Significant predictors, as indicated by stepwise variable selection, are given in bold. “Voiceless/modal” was dummy coded as 0. Therefore, a negative coefficient for a factor indicates that a higher value for that factor is more likely to lead to a “voiceless/modal” classification. For monosyllables (non-sandhi), the only consistent predictor was f0; but for fricatives, both CPP and duration were significant as well, and the stepwise analysis selected f0 first, then CPP, followed by duration. For the second syllable in disyllables (sandhi), only the fricatives could be significantly discriminated, and the stepwise analysis selected duration first, then CPP.
Coefficients for the variables for the linear discriminant functions, as well as the Wilks's lambda, F, and p values for the discriminations. Significant predictors (p < 0.05) are in bold.

| Context | Manner | Duration | CPP | f0 | Wilks's lambda | F | p |
|---|---|---|---|---|---|---|---|
| Monosyllable (non-sandhi) | Stop | −0.124 | −0.061 | **−1.245** | 0.684 | 16.627 | <0.001 |
| | Fricative | **−0.604** | **−0.761** | **−1.207** | 0.303 | 56.080 | <0.001 |
| | Sonorant | 0.026 | −0.156 | **−1.314** | 0.716 | 7.402 | <0.001 |
| Disyllable (sandhi) | Stop | −0.973 | 0.200 | −0.723 | 0.961 | 1.531 | 0.210 |
| | Fricative | **−1.464** | **−0.403** | −0.041 | 0.434 | 30.013 | <0.001 |
| | Sonorant | 2.136 | 0.490 | −0.377 | 0.998 | 0.042 | 0.988 |
C. Discussion
The acoustic results above indicate that this laryngeal contrast in Shanghai is primarily a tone contrast in the non-sandhi context (monosyllables). Although the H1*-H2* and CPP comparisons between the voiceless/modal and voiced/murmured categories were generally in the expected direction, with the voiceless/modal consonants exhibiting numerically lower H1*-H2* and higher CPP on the following vowel than the voiced/murmured ones, only the CPP comparison for fricatives reached significance under the growth curve analysis. The f0 curves on the vowels after voiceless/modal and voiced/murmured consonants, however, differed significantly on the intercept for all three manners, and on the slope for all but the stops. There are indications that the consonants themselves still played a role in the contrast: the fricatives exhibited a duration difference, and the sonorants a CPP difference, based on the contrast. Moreover, the attenuation of the f0 difference over the vowel after voiceless/modal vs voiced/murmured consonants suggests that the f0 difference, at least in part, stems from the onset consonants. The LDAs provided the relative weighting of the acoustic cues from consonant duration, vowel phonation, and vowel f0, and corroborated the acoustic finding that the laryngeal contrast in the non-sandhi context is primarily tonal, with secondary cues from CPP and consonant duration for the fricatives.
In the sandhi context (second syllable of disyllables), the f0 difference was neutralized, but the stops gained a voicing difference despite losing the closure duration difference, and the fricatives exhibited both duration and voicing differences. For the sonorants, however, no difference between the modal and murmured categories was detected in consonant duration, consonant phonation, vowel phonation, or f0. The LDAs did not encode the effect of voicing, but confirmed that f0 cannot be used to discriminate the contrast, and that fricatives have enough secondary cues in duration and CPP to be differentiated.
These results show that the acoustic cues for the contrast indeed vary by the manner and position in which the contrast is realized. In the sandhi position, where a phonological process presumably neutralizes the main cue for the contrast (f0), the contrast itself is incompletely neutralized for fricatives and arguably for stops, but completely neutralized for sonorants, as far as the measures included here are concerned. The weakness of this contrast on sonorants hence finds some support in the results.
Unlike in previous studies (e.g., Cao and Maddieson, 1992; Ren, 1992; Gao, 2015), the H1*-H2* and CPP results here generally did not show a significant effect of the laryngeal feature. For f0, although we showed that it significantly covaried with the consonant feature in the non-sandhi context, a result shared by all previous research, we did not find the incomplete neutralization in the sandhi context reported by Ren (1992), Chen (2011), and Wang (2011). There are two potential reasons for these disparities. One is that, given that our speakers were considerably younger than the speakers in earlier studies, Shanghai may be gradually losing the phonation difference, with the contrast now primarily cued by tone in the younger generations (see Gao, 2015; Gao and Hallé, 2016, 2017, for age- and gender-based differences that support this contention). Another possibility is that the different results are partly due to the different statistical methods used. In the linear mixed-effects-based growth curve analyses here, the random effects structure included not only subject and item, but also the subject-by-voicing interaction. This helps reduce the type I error in hypothesis testing (Barr et al., 2013), in this case, for the effect of voicing.
III. EXPERIMENT 2: PERCEPTION STUDY
A. Methods
The perception study investigated how the different acoustic cues for the laryngeal contrast are weighted in perception and how the weightings are affected by the manner and position of the contrast. The stimuli were monosyllabic and disyllabic words in which the target syllables had a full cross-classification of three sets of cues—consonant properties, vowel phonation, and vowel f0. These syllables were constructed by cross-splicing consonant and vowel portions of different syllables and superimposing the f0 contour from one vowel onto another in Praat. For instance, from two base tokens [pu34] (no. 1) and [bu13] (no. 8), six additional stimuli (no. 2–no. 7) were constructed, as shown in Table V. Three monosyllabic pairs, one from each manner, were selected as the original base tokens—pu34∼bu13, fi34∼vi13, and mε34∼m̤ε13, and their corresponding disyllabic pairs—fən53pu34 ∼ fən53bu13, kε34fi34 ∼ kε34vi13, and ly34mε34∼ly34m̤ε13—were selected as the originals for the disyllables. Therefore, there were 24 monosyllables and 24 disyllables in total as the perceptual stimuli. There are three main reasons why we used the cross-spliced stimuli in the perception experiment instead of the acoustic continua often used in similar studies. First, this method allows for a complete parallel for the investigation of different manners in different positions. The acoustic-continuum method necessitates the use of different values along the continuous scale due to different acoustic properties depending on context, and hence loses some of the parallelism. Second, the manipulation is easily executable. The acoustic-continuum method may not allow effective continua to be built due to the small acoustic differences in some contexts. Third, the method is symmetrical among the three sets of cues and hence makes no assumption about the importance of any particular one.
Examples of stimulus construction for the perception experiment from original tokens [pu34] and [bu13].
| No. | C properties | V phonation | V f0 | Method |
|---|---|---|---|---|
| 1 | pu34 | pu34 | pu34 | Original |
| 2 | pu34 | pu34 | bu13 | Superimpose f0 of [bu13] onto [pu34] |
| 3 | pu34 | bu13 | pu34 | Cross-splice C of [pu34] to the V of [bu13], then superimpose f0 of [pu34] onto the vowel |
| 4 | pu34 | bu13 | bu13 | Cross-splice C of [pu34] to the V of [bu13] |
| 5 | bu13 | pu34 | pu34 | Cross-splice C of [bu13] to the V of [pu34] |
| 6 | bu13 | pu34 | bu13 | Cross-splice C of [bu13] to the V of [pu34], then superimpose f0 of [bu13] onto the vowel |
| 7 | bu13 | bu13 | pu34 | Superimpose f0 of [pu34] onto [bu13] |
| 8 | bu13 | bu13 | bu13 | Original |
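The eight-way design in Table V is simply the full cross-classification of three binary cue sources. A minimal Python sketch (the token labels are from the example pair; the dictionary representation is our own) generates the design in the same order as the table:

```python
from itertools import product

# The two base tokens of a hypothetical stimulus pair.
base = {0: "pu34", 1: "bu13"}  # 0 = voiceless source, 1 = voiced source

# Full cross-classification of the three cue sets gives 2**3 = 8 stimuli,
# matching rows 1-8 of Table V.
stimuli = [
    {"number": i + 1, "C": base[c], "V_phonation": base[v], "V_f0": base[f]}
    for i, (c, v, f) in enumerate(product([0, 1], repeat=3))
]

for s in stimuli:
    print(s)
```

With three manners in two positions, this yields the 24 monosyllabic and 24 disyllabic stimuli described above.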
The base tokens were selected from a female speaker's production data, and a number of considerations went into the selection of these tokens. First, it was ensured that these tokens were representative of the overall acoustic patterns reported in Sec. II. Second, given that the f0 contour was either stretched or compressed when superimposed onto a vowel of a different duration, the original syllable pairs were selected such that their vowel durations were as similar as possible. Third, after f0 was superimposed onto a different vowel, H1*-H2* and CPP of the new token were remeasured, and we selected the base tokens for which these measures were minimally affected by the f0 manipulation. A summary of the acoustic measures for the 12 base tokens, as well as when the f0 of the base tokens was switched to that of the other laryngeal category, is given in Table VI, and all 48 test stimuli are provided as supplemental material online.5 All stimuli were embedded in the same carrier sentence and auditorily presented to the subjects through headphones for a two-alternative forced choice (2AFC) task, where they had to choose on a monitor the Chinese character(s) they heard.6 The entire stimulus list was presented four times, and the order of the stimuli was randomized each time. Forty-one native speakers (16 male, 25 female) with an age range of 19–37 yr and a mean age of 24.4 yr participated in the experiment in a quiet office at Fudan University in Shanghai.
Acoustic measures of the base tokens for the perception experiment as well as when the f0 of the base tokens was switched to that of the other laryngeal category (given in parentheses). H1*-H2*, CPP, and f0 were the average values over the vowel.
| Context | Token | C duration (ms) | H1*-H2* (dB) | CPP (dB) | f0 (Hz) |
|---|---|---|---|---|---|
| Monosyllable (non-sandhi) | pu34 | 126 | −1.55 (0.75) | 16.66 (17.28) | 217 (201) |
| | bu13 | 124 | 2.72 (1.77) | 16.01 (18.24) | 201 (217) |
| | fi34 | 196 | −1.18 (1.22) | 18.90 (19.28) | 229 (191) |
| | vi13 | 126 | 3.45 (0.58) | 17.24 (18.86) | 191 (229) |
| | mε34 | 122 | −1.83 (4.19) | 22.98 (24.00) | 211 (211) |
| | m̤ε13 | 118 | 8.38 (10.03) | 17.44 (21.56) | 172 (172) |
| Disyllable (sandhi) | fən53-pu34 | 57 | −1.28 (0.09) | 17.38 (19.41) | 217 (198) |
| | fən53-bu13 | 39 | 6.12 (6.43) | 17.47 (18.92) | 198 (217) |
| | kε34-fi34 | 147 | −3.21 (4.79) | 19.11 (21.37) | 205 (187) |
| | kε34-vi13 | 70 | 7.81 (2.03) | 20.73 (23.60) | 187 (205) |
| | ly34-mε34 | 136 | 8.37 (8.47) | 20.54 (22.42) | 192 (194) |
| | ly34-m̤ε13 | 99 | 12.27 (11.53) | 20.88 (21.77) | 194 (192) |
For each stimulus type defined by manner and position, a mixed-effects logistic regression was conducted with the subjects' binary responses as the dependent variable and the voicing specifications of consonant, phonation, and f0 cues as categorical predictors with random intercept by subject.7 A non-parametric analysis—the Classification and Regression Tree (CART) analysis (Breiman et al., 1984)—was also conducted using the rpart package in R to further investigate how the listeners classified the stimuli based on these cues. CART is a recursive partitioning technique that outlines the decision process for a category membership based on categorical predictors. The splits in a classification tree are selected so that the descendant subsets are “purer” than the current set, and the parameters for the splits can be considered as significant predictors for the classification. For our analysis, we constructed the classification trees by using consonant, phonation, and f0 cues as categorical predictors for the subjects' response for each manner and position by using the rpart function. We then conducted cost-complexity pruning for each tree based on the relative errors generated by tenfold cross-validation using the plotcp and prune functions (Baayen, 2008).
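The cost-complexity pruning step can be illustrated with a rough Python analogue of rpart's plotcp/prune procedure (this uses sklearn's implementation on hypothetical simulated 2AFC data, not the authors' R analysis; the 90% cue-following rate is an assumption):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Hypothetical 2AFC data: three binary cue predictors and a binary response
# in which f0 drives the classification (as in the monosyllable results).
n = 400
X = rng.integers(0, 2, size=(n, 3))  # columns: consonant, phonation, f0
y = np.where(rng.random(n) < 0.9, X[:, 2], 1 - X[:, 2])  # 90% follow the f0 cue

# Grow a full tree, then choose the cost-complexity penalty (ccp_alpha)
# by tenfold cross-validation, analogous to rpart's plotcp/prune procedure.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
candidates = path.ccp_alphas[:-1] if len(path.ccp_alphas) > 1 else path.ccp_alphas
best_alpha = max(
    candidates,  # the final alpha would prune the tree back to the root
    key=lambda a: cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=10
    ).mean(),
)
tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print("chosen ccp_alpha:", best_alpha, "tree depth:", tree.get_depth())
```

On data like these, pruning removes the spurious branches on the uninformative cues and retains the split on f0, mirroring the pruning of the phonation branches described for the monosyllabic fricatives below.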
B. Results
The accuracy and d′ results for the listeners' classification of the natural tokens are given in Fig. 10. These results indicate that the subjects had near-perfect identification of the contrast in the non-sandhi context regardless of manner, as well as in the sandhi context for fricatives. For stops in the sandhi context, identification was weaker but well above chance; for sonorants, however, identification was at chance.
Perceptual accuracy and d′ for the natural tokens in the perception experiment.
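The d′ values in Fig. 10 follow the standard signal-detection definition, d′ = z(hit rate) − z(false-alarm rate). A minimal sketch (the example counts and the 1/(2N) boundary correction are our own assumptions, not taken from the paper):

```python
from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate),
    with a standard 1/(2N) correction for rates of exactly 0 or 1."""
    n_signal = hits + misses
    n_noise = false_alarms + correct_rejections
    hr = min(max(hits / n_signal, 1 / (2 * n_signal)), 1 - 1 / (2 * n_signal))
    far = min(max(false_alarms / n_noise, 1 / (2 * n_noise)),
              1 - 1 / (2 * n_noise))
    return norm.ppf(hr) - norm.ppf(far)

# Hypothetical counts, treating "voiced/murmured" as the signal category:
print(d_prime(90, 10, 5, 95))   # high sensitivity
print(d_prime(50, 50, 50, 50))  # chance performance: d' = 0
```

Chance identification, as found for sonorants in the sandhi context, corresponds to d′ near zero.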
The coefficients for the consonant, phonation, and f0 cues in the mixed-effects logistic regressions for the different manners and positions are given in Tables VII and VIII. "Voiceless/modal" was dummy coded as 0 for both the response variable and all the categorical predictors. Therefore, the intercept in the models indicates the log odds [ln(p/(1 − p))] of the segment being given a "voiced/murmured" response when the consonant, phonation, and f0 cues all came from the voiceless/modal category, and the coefficient for each cue indicates the increase in the log odds when that cue came from the voiced category. For monosyllables (non-sandhi), f0 was the only consistent factor that significantly affected the response, and its coefficient was the largest among the three cues for all three manners; but for stops, phonation also had a significant effect, and for fricatives, both the consonant and phonation cues were significant as well. For the second syllable in disyllables (sandhi), all factors contributed significantly to the response for stops and fricatives, with the phonation cue having the largest coefficient for stops and the consonant cue for fricatives; for sonorants, none of the factors was significant. All significant effects were in the expected direction, i.e., cues from the voiced/murmured category elicited more voiced/murmured responses.
Parameter estimates for the mixed-effects logistic regressions for monosyllables (non-sandhi context). Baseline = voiceless.
| Manner | Term | Estimate | SE | z | p |
|---|---|---|---|---|---|
| Stop | (Intercept) | −5.007 | 0.429 | −11.667 | <0.001 |
| | Consonant | −0.0984 | 0.222 | −0.443 | 0.658 |
| | Phonation | 0.945 | 0.232 | 4.059 | <0.001 |
| | f0 | 6.816 | 0.396 | 17.195 | <0.001 |
| Fricative | (Intercept) | −0.945 | 0.213 | −4.428 | <0.001 |
| | Consonant | 2.523 | 0.204 | 12.384 | <0.001 |
| | Phonation | 1.126 | 0.177 | 6.374 | <0.001 |
| | f0 | 2.551 | 0.205 | 12.464 | <0.001 |
| Sonorant | (Intercept) | −4.429 | 0.419 | −10.585 | <0.001 |
| | Consonant | 0.284 | 0.286 | 0.992 | 0.321 |
| | Phonation | −0.284 | 0.286 | −0.992 | 0.321 |
| | f0 | 7.411 | 0.450 | 16.486 | <0.001 |
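To illustrate how these log-odds coefficients translate into response probabilities, the following worked check applies the logistic function to the stop estimates from Table VII (the choice of cue configurations is ours):

```python
import math

def logistic(x):
    """Inverse of the log-odds transform: p = 1 / (1 + e^(-x))."""
    return 1 / (1 + math.exp(-x))

# Stop, monosyllables (Table VII): intercept -5.007; a cue's coefficient is
# added when that cue comes from the voiced/murmured category.
intercept, b_cons, b_phon, b_f0 = -5.007, -0.0984, 0.945, 6.816

# All cues voiceless: predicted P("voiced" response) is near zero.
p_all_voiceless = logistic(intercept)
# Only the f0 cue voiced: the large f0 coefficient flips the percept.
p_f0_voiced = logistic(intercept + b_f0)

print(round(p_all_voiceless, 3), round(p_f0_voiced, 3))
```

Switching the f0 cue alone moves the predicted probability of a "voiced" response from under 1% to roughly 86%, consistent with f0 being the dominant perceptual cue for stops in the non-sandhi context.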
Parameter estimates for the mixed-effects logistic regressions for the second syllable of disyllables (sandhi context). Baseline = voiceless.
| Manner | Term | Estimate | SE | z | p |
|---|---|---|---|---|---|
| Stop | (Intercept) | −2.8715 | 0.298 | −9.632 | <0.001 |
| | Consonant | 0.292 | 0.146 | 1.996 | 0.046 |
| | Phonation | 1.484 | 0.155 | 9.577 | <0.001 |
| | f0 | 1.015 | 0.150 | 6.756 | <0.001 |
| Fricative | (Intercept) | −2.270 | 0.244 | −9.323 | <0.001 |
| | Consonant | 4.517 | 0.267 | 16.952 | <0.001 |
| | Phonation | 0.957 | 0.179 | 5.343 | <0.001 |
| | f0 | 2.406 | 0.202 | 11.885 | <0.001 |
| Sonorant | (Intercept) | 0.762 | 0.224 | 3.402 | <0.001 |
| | Consonant | 0.057 | 0.127 | 0.450 | 0.652 |
| | Phonation | −0.221 | 0.127 | −1.734 | 0.083 |
| | f0 | 0.172 | 0.127 | 1.350 | 0.177 |
The CART analyses after pruning are given in Fig. 11. The only pruning necessary was for fricatives in monosyllables, for which the original tree from the rpart function also included branches based on phonation. Relative errors generated by tenfold cross-validation under different cost-complexity measures using the plotcp function indicated that the structural complexity introduced by these branches was not warranted, and they were therefore pruned using the prune function.
CART analyses for stops, fricatives, and sonorants in monosyllables and fricatives in the second syllable of disyllables.
For stops and sonorants in disyllables, only the root node was obtained, indicating that none of the cues was a significant factor in the partition. For stops and sonorants in monosyllables, f0 was the sole significant predictor for the subjects' classification (to read the Monosyllable_Stop graph, for instance: among the 656 tokens with f0 cues coming from voiceless stop onsets, 641 were classified as voiceless and 15 were classified as voiced; among the 656 tokens with f0 cues coming from voiced stop onsets, 549 were classified as voiced and 107 were classified as voiceless); for fricatives in monosyllables, f0 and consonant cues contributed significantly, but their roles differed: f0 > consonant; for fricatives in disyllables, only the consonant and f0 cues were relevant, and the former was more important.
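The node counts just described translate directly into classification rates; as a quick arithmetic check (using only the counts reported above):

```python
# Classification rates implied by the Monosyllable_Stop node counts:
# 641 of 656 voiceless-cue tokens classified voiceless, and
# 549 of 656 voiced-cue tokens classified voiced.
voiceless_correct = 641 / 656
voiced_correct = 549 / 656
print(round(voiceless_correct, 3))  # 0.977
print(round(voiced_correct, 3))     # 0.837
```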
C. Discussion
Both the logistic regression and CART analyses of the perception data showed that f0 was the primary cue that the listeners relied on in making category judgments for the laryngeal contrast in monosyllables (non-sandhi context). For the second syllable of disyllables (sandhi context), both analyses showed that the consonant and f0 cues contributed significantly to the voicing classification of fricatives. The logistic regression analysis, however, identified additional significant predictors: phonation for stops and fricatives in monosyllables; consonant, phonation, and f0 for stops in disyllables; and phonation for fricatives in disyllables. For a relatively small dataset with only a few predictors like ours, the CART analysis appears to return a more conservative estimate of which predictors are significant in the classification. Logistic regression and CART differ in that the former provides an estimate of the average effect of a predictor while accounting for the other predictors, whereas the hierarchical structure of the latter does not, in general, allow the net effect of a predictor to be estimated (Lemon et al., 2003). Without a priori assumptions about how our perception data would pattern, it is worthwhile to consider both analyses for a more comprehensive view of the data.
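To make the regression estimates concrete: the coefficients reported in the table above are on the log-odds scale, so exponentiating a coefficient yields an odds ratio. A minimal Python illustration using the fricative coefficients from the sandhi-context model:

```python
import math

# The logistic-regression estimates reported above are log-odds;
# exponentiating a coefficient gives the multiplicative change in the odds
# of a "voiced" response when the corresponding cue comes from a voiced token.
def odds_ratio(log_odds):
    return math.exp(log_odds)

# Fricative coefficients from the sandhi-context model:
consonant_or = odds_ratio(4.517)  # ≈ 91.6: the consonant cue dominates
f0_or = odds_ratio(2.406)         # ≈ 11.1
phonation_or = odds_ratio(0.957)  # ≈ 2.6
```

The roughly 92-fold odds change for the consonant cue, compared with the 11-fold change for f0, is one way to see why the consonant cue outranks f0 for fricatives in this position.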
The perception results were generally consistent with the aggregate production results: the laryngeal contrast in question was primarily cued by f0 in the non-sandhi context, and the f0 cue was able to override conflicting cues in the consonant or vowel phonation; for the sandhi context, f0 became ineffective in stops and sonorants, but still had an effect on fricative classification. Different manners relied on different cues, and classification was the most robust for fricatives. For stops in the sandhi context, the fact that the speakers were able to classify the natural tokens at a high rate indicates the relevance of the consonant cue, but the effect of the cue was not strong enough to override conflicting cues from f0 and phonation, if any. For sonorants in this context, however, both the natural token identification and the classification of all stimuli demonstrated that there was simply no reliable cue for the contrast.
It is worth noting that the coefficients in the LDA performed on the acoustic data are not directly comparable with the coefficients in the logistic regression analysis of the perception data, as they mean very different things in the two analyses (logistic regression was not used for the acoustic data due to convergence problems). Moreover, the predictors in the acoustic study were continuous, while those in the perception study were categorical. However, comparisons among the coefficients within each analysis consistently point to how the cues are implemented and perceptually used differently depending on manner and position, and to the importance of the f0 cues in monosyllables and of the consonant cues for fricatives in the second syllable of disyllables.
IV. GENERAL DISCUSSION
Both the production and perception results here clearly show that, at least for the younger speakers that we tested, the laryngeal contrast in question in Shanghai is primarily realized as a tone difference acoustically in the non-sandhi position, and listeners accordingly attend to the f0 cues in classifying the contrast in this position. However, the fact that the f0 difference over the vowel diminishes over time indicates that the voicing/voice quality property of the onset consonant contributes to the contrast. This is also consistent with the weakness of the contrast for sonorants, which is a known crosslinguistic tendency for laryngeal contrasts for consonants, but would be difficult to explain if the contrast were purely tonal. Taken together with the acoustic and perceptual results of voicing and f0 cues in tonal and non-tonal languages elsewhere, the findings are consistent with the position that the perceptual system is tuned to the distribution of cues in the particular language.
Our results also shed light on whether certain cues are inherently better perceptually for a contrast. For instance, there is some evidence that consonant voicing is better cued on fricatives than on stops: in the non-initial position, although both stops and fricatives exhibited an acoustic voicing difference for the contrast, stop voicing did not seem to be a strong perceptual cue and was not able to override conflicting cues from the vowel, a finding also reported in Wang (2011), whereas fricative voicing stood out as a cue for the listeners even when conflicting cues were present. This is potentially because the voicing contrast on fricatives is cued not only by consonant voicing, but also by the spectral peak and spectral moments of the frication noise (Jongman et al., 2000). It is also interesting to note that, in the second-syllable sandhi position, the voicing difference on fricatives is concomitant with a larger f0 difference than the voicing difference on stops is, as shown in the growth curve analysis, and the perception results showed that this f0 difference can be used by listeners. This indicates that the strength of one cue for a contrast may enhance another cue for the contrast realized elsewhere.
The presence of phonological tone sandhi had an interesting effect on the acoustic realization and perception of the laryngeal contrast in question. Although the intervocalic position is typically a prime position for laryngeal contrasts on consonants due to the transitional cues that vowels provide (Steriade, 1997), the fact that this contrast in Shanghai is primarily cued by f0 in non-sandhi contexts, and that the f0 cue can potentially be lost to tone sandhi in this position, makes this a special case. The f0 results on the second syllable of disyllables indicate that the tonal difference concomitant with the voicing difference of the onset consonant was indeed neutralized, with fricative-onset syllables as marginal exceptions, but the contrast was only fully lost for the sonorants. For fricatives, there was a voicing and duration difference between the voiceless and voiced consonants, and the vowels also differed in the phonation and periodicity measures; perceptually, both consonant and f0 cues were able to drown out conflicting cues. For stops, the voiceless and voiced stops differed in closure voicing in this position; this voicing difference potentially led to the high d′ score for the classification of the natural tokens, but was ineffective when there were conflicting cues. The complexity of the situation indicates that there is more nuance to the incomplete neutralization of a phonological contrast, as the "neutralizing" context, e.g., the non-initial sandhi context, may need to be further subdivided, in this case by the manner of articulation of the onset consonant.
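The d′ score referred to here is the standard sensitivity index from signal detection theory, the z-transformed hit rate minus the z-transformed false-alarm rate. A minimal Python sketch, using invented rates rather than our experimental values:

```python
from statistics import NormalDist

# d' from signal detection theory: z(hit rate) - z(false-alarm rate).
# The rates below are invented for illustration and are not our data.
def d_prime(hit_rate, false_alarm_rate):
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)

sensitivity = d_prime(0.95, 0.10)  # ≈ 2.93: high discriminability
```

A d′ near zero indicates chance-level discrimination, while values around 3 indicate near-ceiling sensitivity.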
The weakness of the voice quality contrast for sonorant consonants was evident in both the production and perception results. In the non-sandhi position, the contrast was cued by f0 on the following vowel, and there was a CPP difference on the consonant itself, but the CPP cue was so weak that it could not compete with conflicting cues in the perception experiment. In the sandhi position, sonorants were the only manner for which all of the acoustic cues reported here were lost between the contrasting pair, and the perceptual results likewise showed no discriminability between the modal and murmured sonorants in this position. These results, on the one hand, support the contention of Berkson (2016b) that phonation contrasts tend to be more weakly cued on sonorants than on obstruents, which potentially contributes to their typological rarity (see also Gao, 2015; Gao and Hallé, 2015); on the other hand, they also support the phonological theory of "licensing-by-cue" and its variations (Steriade, 1997, 2008), which contend that phonological contrasts are better licensed in contexts of better perceptibility and are more susceptible to loss when their cues are endangered. The complete loss of the laryngeal contrast for sonorants in the sandhi position in Shanghai is a case in point. A caveat to the current results is that the acoustic and perceptual data both come from nasals, and it is possible that other sonorants, such as liquids, may behave differently, especially given that nasalization and spread glottis share similar acoustic consequences, namely an increased amplitude of the first harmonic and an increased bandwidth of the first formant (Keyser and Stevens, 2006), and have been shown to be perceptually confusable with each other (Klatt and Klatt, 1990).
However, the confusion in the source of an increased first harmonic reported in Klatt and Klatt (1990) was for a female voice whose first harmonic is close to the value of the nasal pole; and Berkson (2013) showed in her study of breathiness in Marathi that only males cued breathiness with H1*-H2*, while females used CPP. This indicates that the confusion between nasalization and breathiness can potentially be avoided. Moreover, if the weakness of the phonation cues for sonorants is entirely due to the confusability between nasalization and breathiness, then the typological rarity of phonation contrasts on sonorants, in general, remains unaccounted for.
Although the results of our perception study by and large match those of the acoustic study in the aggregate, we are not in a position to make generalizations about how the production and perception of this laryngeal contrast in Shanghai are related at the level of the individual speaker, as the two experiments involved two distinct sets of subjects. It is possible that individual subjects tune their perception to the aggregate input in their environment, but we do not exclude the possibility that individual subjects' perception is disproportionately biased by their own production. It must also be acknowledged that both our production and perception studies were conducted with relatively young speakers of Shanghai, and, as previously mentioned, it is possible that the voicing/voice quality contrast has undergone or is undergoing restructuring (see Gao, 2015; Gao and Hallé, 2016, 2017), the investigation of which requires a design that incorporates sociolinguistic factors, which our study does not.
Finally, if we consider this contrast to stem from a single distinctive feature, then this set of data clearly lays out the challenges of acquiring how this feature is instantiated in a particular language, as the issue concerns not only the weighting and integration of multiple cues in potentially unsupervised learning, but also how this learning can overcome the contextual dependency of cue weighting, especially when phonological processes intervene. The current work does not provide an answer to this difficult problem, but it does suggest that the learning of phonological contrast realization is likely guided by the morphophonological alternations in the language as well as by the distributional properties of the acoustic dimensions along which the contrast manifests itself.
V. CONCLUSION
This paper presents a case study on how a phonological contrast is cued in multiple phonetic dimensions, both acoustically and perceptually. What is of particular interest is that the contrast in question—a laryngeal contrast in Shanghai Wu—is cued differently when realized on different manners (stops, fricatives, sonorants) and in different positions (non-sandhi, sandhi). Acoustic results showed that, although this contrast has been described as phonatory in earlier literature, its primary cue is in tone, at least in the younger speakers that were tested. In the non-sandhi position, phonation correlates only appear on fricative-onset syllables and sonorant consonants; stops and fricatives have consonant duration cues, and fricatives also have a frication voicing cue. In the sandhi position, tone sandhi neutralizes the f0 difference, but the contrast is maintained in fricatives by both consonant and vowel phonation cues, marginally maintained in stops by closure voicing, and lost in sonorants. The perception results were largely consistent with the aggregate acoustic results, indicating that speakers adjust the perceptual weights of individual cues for a contrast according to manner and context. These findings support the position that phonological contrasts are formed by the integration of multiple cues in a language-specific, context-specific fashion and should be represented as such.
ACKNOWLEDGMENTS
We are grateful to Dan Yuan and Zhongmin Chen for hosting us at Fudan University for data collection, Yifeng Li and Zhenzhen Xu for serving as our Shanghai consultants, Kelly Berkson, Christina Esposito, and Goun Lee for helping us with VoiceSauce, Mingxing Li for helping us with the linear discriminant analysis, and the University of Kansas General Research Fund No. 2301618 for financial support. We also thank the Associate Editor Megha Sundara and four anonymous reviewers for their many insightful comments, which helped improve both the content and the presentation of the paper. All remaining errors are our own.
We focus on the closure instead of the post-release portion of the stops here, as the previous literature on Shanghai has shown that the difference in release duration between voiceless unaspirated and voiced stops in either initial or medial position is minimal (Shen and Wang, 1995; Chen, 2011).
The authors reported H2-H1 and F1-H1. These were converted to H1-H2 and H1-F1, and F1 was changed, notationally, to A1, to be consistent with the rest of the paper.
For spectral measures, we focus on H1*-H2* for two reasons. First, although different spectral measures have been shown to be effective voice quality measures in different languages, H1-H2 is the most consistently used parameter in the literature and is found to be effective in the majority of languages with phonation contrasts. Gao (2015) and Gao and Hallé (2017) also found that H1-H2 was the most consistently used acoustic parameter for the laryngeal contrast in Shanghai by speakers of different age groups and genders and in different tonal contexts. Second, H1*-A1*, H1*-A2*, and H1*-A3* were also measured and analyzed for our study, and they did not reveal additional differences for the contrast in question not shown by H1*-H2*.
An anonymous reviewer asked whether the word pairs with a nasal coda behaved similarly to the open-syllable pairs in the phonation measures, given the potential confusion between breathiness and nasality reported in the literature (Klatt and Klatt, 1990; Keyser and Stevens, 2006). We reran the growth curve analyses for H1*-H2* and CPP on the vowels for the stimuli without nasal codas, and the statistical patterns were identical to the ones reported here, except that for CPP in sonorant onsets in the monosyllabic (no-sandhi) context, the addition of the voicing intercept [χ2(1) = 4.451, p = 0.035] and the interaction between voicing and the linear time term [χ2(1) = 4.522, p = 0.033] both significantly improved the model, with the modal sonorants inducing a greater CPP and a slower CPP decrease on the following vowel.
See supplementary material at https://doi.org/10.1121/1.5052364 E-JASMAN-144-014809 for the acoustic files used in the perception experiment.
An anonymous reviewer raised the issue of whether any prosodic effects on the onset consonants (e.g., as documented in Chen, 2011) could have influenced the perception results. In the production study, the speaker read all items in the same carrier sentence, effectively putting the items in a focus position. In the perception study, the listeners also heard the same carrier sentence and therefore performed the identification in the same focus position. The entire study can therefore be conceived of as an investigation of this laryngeal contrast in the focus position.
Additional analyses that included by-subject random slopes for each factor were also conducted for the different manners in the two positions. Models that included random slopes for all factors failed to converge. Models that included a subset of the random slopes were attempted as well, but no consistent random slope structure converged across all manners and positions. We therefore opted to report the models with a by-subject random intercept only.