Phonological categories are often differentiated by multiple phonetic cues. This paper reports a production and perception study of a laryngeal contrast in Shanghai Wu that is not only cued in multiple dimensions, but also cued differently on different manners (stops, fricatives, sonorants) and in different positions (non-sandhi, sandhi). Acoustic results showed that, although this contrast has been described as phonatory in earlier literature, its primary cue is in tone in the non-sandhi context, with vowel phonation and consonant properties appearing selectively for specific manners of articulation. In the sandhi context where the tonal distinction is neutralized, these other cues may remain depending on the manner of articulation. Sonorants, in both contexts, embody the weakest cues. The perception results were largely consistent with the aggregate acoustic results, indicating that speakers adjust the perceptual weights of individual cues for a contrast according to manner and context. These findings support the position that phonological contrasts are formed by the integration of multiple cues in a language-specific, context-specific fashion and should be represented as such.

A standard assumption about phonological contrast is that it is categorical, based on either segments (/p/ vs /b/) or features ([−voice] for /p/, [+voice] for /b/; Jakobson et al., 1952; Chomsky and Halle, 1968; Stevens, 2002; Clements, 2009). A major challenge for phoneticians and phonologists alike is to account for how speakers categorize gradient and variable acoustic signals into such discrete entities. Two salient aspects of this challenge relate to how featural contrasts are instantiated acoustically. First, contrasts are often differentiated by multiple acoustic cues. The stop voicing contrast in English, for example, is associated with differences in voice-onset time (VOT), closure duration, f0 of the following vowel, and a host of other acoustic properties (Lisker, 1986). Second, the acoustic cues for the same contrast often depend on the phonological context in which the contrast appears. For instance, the English voicing contrast would not benefit from the f0 cue of the following vowel in the final position, but would benefit from a duration difference on the vowel preceding it (Chen, 1970; Raphael, 1972). 
The investigations of how a contrast is acoustically realized in a multidimensional fashion, how the different acoustic cues are weighted in the perception of the contrast, and how the weighting is affected by the acoustic dimensions along which the cues vary, the distributional characteristics of the acoustic cues, the context in which the contrast appears, and the listeners' language background have contributed to significant theoretical issues in phonetics and phonology, such as the mode of speech perception (Repp, 1983; Parker et al., 1986; Massaro, 1987), the nature of distinctive features (Halle and Stevens, 1971; Kingston, 1992; Stevens and Keyser, 2010), the production-perception link (Newman, 2003; Shultz et al., 2012; DiCanio, 2014), the influence of phonological knowledge of a language on perception (Massaro and Cohen, 1983; Flege and Wang, 1989; Dupoux et al., 1999; Hallé and Best, 2007), the theories of perceptual contribution of secondary cues (Holt et al., 2001; Francis et al., 2008; Kingston et al., 2008; Llanos et al., 2013), and the mechanisms of phonetic category learning (Clayards et al., 2008; Toscano and McMurray, 2010; McMurray et al., 2011).

This paper contributes to this scholarship by presenting a case study on the cue realization and cue weighting of a laryngeal contrast on different segments in different contexts in Shanghai Wu. Like many Wu dialects of Chinese, Shanghai has a three-way distinction among voiceless aspirated, voiceless unaspirated, and voiced stops. The voiced series, however, is not realized with typical closure voicing, but is known as “voiceless with voiced aspiration” (Chao, 1967), indicating the involvement of breathy phonation. On fricatives, there is a two-way voicing contrast, whereby the voiced fricatives are truly voiced, and on sonorants, there is a modal-murmured distinction that corresponds to the voiceless-voiced distinction in obstruents (Chao, 1967; Xu and Tang, 1988; Zhu, 1999, 2006).

Shanghai Wu, like other Chinese dialects, is also tonal. There are three phonetic tones on open or sonorant-closed syllables, transcribed as 53, 34, and 13, and two phonetic tones on ʔ-closed syllables, 55 and 12. But there is a co-occurrence restriction between tones and onset laryngeal features in that the higher tones 53, 34, and 55 only occur on syllables with voiceless obstruent or modal sonorant onsets, and the lower tones only occur with phonologically voiced obstruent or murmured sonorant onsets (Xu and Tang, 1988; Zhu, 1999, 2006). Therefore, in Shanghai, there is a minimal contrast between tɔ34 "to arrive" and dɔ13 "news," and this contrast is cued by both the voice quality of the initial consonant and f0. The examples in Table I illustrate the co-occurrence of the two rising tones 34 and 13 with the laryngeal features in Shanghai.

TABLE I.

Examples of laryngeal and tone co-occurrence restrictions in Shanghai. Voiceless obstruents or modal sonorants co-occur with the high-rising tone 34; voiced obstruents or murmured sonorants co-occur with the low-rising tone 13.

Stops	Fricatives	Sonorants
pu34 "cloth"	fi34 "fee"	mε34 "beautiful"
phu34 "tattered"		
bu13 "division"	vi13 "fat"	m̤ε13 "plum"

Tones in connected speech are affected by a tone change process called tone sandhi in Shanghai. Polysyllabic compound words undergo a rightward spreading tone sandhi process by extending the tone on the first syllable over the entire compound domain and consequently wiping out the tonal contrasts in non-initial syllables (Zee and Maddieson, 1980; Xu and Tang, 1988; Zhu, 1999, 2006). For example, 34 “to arrive” and 13 “news,” when appearing as the second syllable of a disyllabic compound, are reported to lose their tonal difference, as shown in the following examples: /pɔ34-tɔ34/ → [pɔ33-tɔ44] “check-in”; /pɔ34-dɔ13/ → [pɔ33-dɔ44] “news report.” The voicing difference between the onset consonants on the second syllable, however, remains, and the voiced stops have been reported to have closure voicing in this position (Cao and Maddieson, 1992; Ren, 1992; Shen and Wang, 1995; Chen, 2011; Wang, 2011; Gao, 2015; Gao and Hallé, 2017).

The data pattern in Shanghai, therefore, presents a clear example in which a phonological contrast is realized differently on different manners and different positions: stops, fricatives, and sonorants can all carry the contrast, but via different sets of cues; the monosyllabic context is significant in that it is the only context in which the phonation-tone co-occurrence, as illustrated in Table I, is fully manifested, while the second syllable of disyllables constitutes a position where the cues for the contrast are considerably altered by a tone sandhi process. We specifically focus on the contrast between voiceless unaspirated/modal and voiced/murmured consonants co-occurring with a high-rising and a low-rising tone, respectively (e.g., tɔ34 vs dɔ13; mε34 vs m̤ε13). As we review in Sec. I B below, although previous studies have established the multidimensional nature of this contrast, as well as the fact that the cues for the contrast vary by prosodic position, no study has expressly compared the realization of cues in different manners or studied how the cues are weighted in perception across manners and positions. This study aims to achieve these goals. In so doing, it has the potential to make the following unique contributions. First, previous studies on the perceptual contributions of voicing and f0 of a contrast have primarily been conducted on non-tone languages like English and Spanish, and in these languages, voicing has been found to be the primary cue (Abramson and Lisker, 1985; Shultz et al., 2012; Llanos et al., 2013). Shanghai, being from a tone-language family, could work in the opposite way with tone as a primary cue and voicing/voice quality a secondary cue, similar to Southern Vietnamese (Brunelle, 2009) and Eastern Cham (Brunelle, 2012). 
This provides an opportunity to observe the influence of language background on how cues are weighted, as well as the limits of, and potential reasons for, the primacy of a particular cue (see also Francis et al., 2008; Llanos et al., 2013). Second, the positional dependency of the realization of this contrast results from not only the position per se, but also a phonological alternation process that, at least according to the descriptive literature, categorically neutralizes one of the cues (tone) in the non-initial context. This puts the context scenario here, phonologically, between full realization (e.g., voicing in final position in English) and full neutralization (e.g., manner contrast in final position in Korean; Kim and Jongman, 1996) and allows it to contribute to the large literature on incomplete neutralization (e.g., Dinnsen and Charles-Luce, 1984; Port and Crawford, 1989; Warner et al., 2004; Dmitrieva et al., 2010). Third, phonetic studies of phonation have primarily focused on vowels (e.g., Huffman, 1987; Andruski and Ratliff, 2000; Blankenship, 2002; Wayland and Jongman, 2003; Esposito, 2010a, 2012; Khan, 2012) and obstruent consonants (e.g., Davis, 1994; Mikuteit and Reetz, 2007; Dutta, 2009; Berkson, 2016a); studies on sonorant consonant phonation (e.g., Aoki, 1970; Traill and Jackson, 1988; Berkson, 2016b) are relatively rare, presumably due to their typological rarity and the weak acoustic cues they embody (Berkson, 2016b). Shanghai furnishes an example that has a laryngeal contrast in both obstruents and sonorants, and thus provides a rare venue to compare the acoustics and perception of the contrast on the two types of segments.

During the production of breathy phonation, the vocal folds are in a relatively abducted configuration with low longitudinal tension. Articulatorily, this results in a higher open quotient of the glottal cycle and a less abrupt glottal closing gesture; aerodynamically, the increased airflow volume and the loose vibratory mode of the vocal fold cause turbulence noise at the glottis, which gives the auditory perception of breathy voice (Gordon and Ladefoged, 2001).

A host of acoustic parameters that result from these articulatory and aerodynamic properties have been identified in the literature. In terms of spectral measures, Klatt and Klatt (1990) and Holmberg et al. (1995) showed that a higher open quotient correlates with a greater difference between the amplitude of the first two harmonics (H1-H2), and Stevens (1977) and Hanson et al. (2001) demonstrated that the more gradual glottal closure results in a steeper spectral tilt that can be measured by the amplitude differences between f0 and F1-F3 (H1-A1, H1-A2, H1-A3). In terms of periodicity measures, Hillenbrand et al. (1994) advocated the use of cepstral-peak prominence (CPP), a measure of peak harmonic amplitude adjusted for the overall amplitude, of which breathy phonation is expected to have lower values than modal phonation; the harmonics-to-noise ratio (HNR) has also been used, with breathy phonation having lower HNR values (de Krom, 1993). In studies of phonological breathiness crosslinguistically, these measures have often been shown to be relevant acoustic and perceptual correlates. For instance, increased H1-H2 and spectral tilt measures have been found to be acoustic correlates of breathy vowels in Hmong (Huffman, 1987; Andruski and Ratliff, 2000; Esposito, 2012; Garellek et al., 2013), Khmer (Wayland and Jongman, 2003), Ju|'hoansi (Miller, 2007), Hindi (Dutta, 2009), Gujarati (Khan, 2012), Jalapa Mazatec (Blankenship, 2002; Esposito, 2010b; Garellek and Keating, 2011), and Santa Ana del Valle Zapotec (Esposito, 2010a). Esposito (2010b) and Garellek et al. (2013), in addition, found that these measures directly contribute to the perception of breathiness. Lower CPP values have been found for breathy vowels in Jalapa Mazatec (Blankenship, 2002; Garellek and Keating, 2011), White Hmong (Esposito, 2012), and Gujarati (Khan, 2012). Lower HNR values were found for breathy vowels in Ju|'hoansi (Miller, 2007), but not in Khmer (Wayland and Jongman, 2003).
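To make the spectral measure concrete, the sketch below estimates H1-H2, the dB difference between the amplitudes of the first two harmonics, from an FFT of a toy signal. This is only an illustration of the measure's definition, not the VoiceSauce pipeline used in the study, and it omits the formant-based correction that distinguishes H1*-H2* from raw H1-H2; the signal, sampling rate, and ±20 Hz search band are assumptions for the example.

```python
import numpy as np

def h1_h2(signal, fs, f0):
    """Estimate H1-H2 (dB): the amplitude difference between the first
    two harmonics, a common acoustic measure of breathiness."""
    n = len(signal)
    spec = np.abs(np.fft.rfft(signal * np.hanning(n)))
    freqs = np.fft.rfftfreq(n, 1.0 / fs)

    def harmonic_amp_db(k):
        # search within +/- 20 Hz of the expected harmonic frequency
        band = (freqs > k * f0 - 20) & (freqs < k * f0 + 20)
        return 20 * np.log10(spec[band].max())

    return harmonic_amp_db(1) - harmonic_amp_db(2)

# Toy "glottal" source at 200 Hz whose second harmonic is 10 dB weaker
# than the first (amplitude ratio 10 ** (-10 / 20)).
fs, f0, dur = 16000, 200.0, 0.5
t = np.arange(int(fs * dur)) / fs
sig = np.sin(2 * np.pi * f0 * t) + 10 ** (-10 / 20) * np.sin(2 * np.pi * 2 * f0 * t)
print(round(h1_h2(sig, fs, f0), 1))  # → 10.0
```

A breathier source, with its higher open quotient, would show a larger H1-H2 value by this measure; CPP and HNR would instead be computed from the cepstrum and the periodic/aperiodic energy split, respectively.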

Duration measures have also been found to correlate with breathiness. For stops, breathy stops have shorter closure durations than their plain counterparts in Bengali (Mikuteit and Reetz, 2007), Hindi (Dutta, 2009), and Marathi (Berkson, 2016a), and the shorter closure duration of voiced stops compared to voiceless stops is well known (e.g., Lisker, 1986). For fricatives, Jongman et al. (2000) showed that voiced fricatives generally have shorter frication duration than their voiceless counterparts. The duration pattern for sonorant phonation is scantily documented, but there is some evidence that breathy sonorants tend to be longer than their modal counterparts, as reported for Marathi (Berkson, 2013).

Finally, the phonological co-occurrence between breathy phonation and lower tones found in Shanghai is attested elsewhere as well, e.g., in Santa Ana del Valle Zapotec (Esposito, 2010a) and Hmong (Andruski and Ratliff, 2000; Esposito, 2012). This may be rooted in the general f0 lowering effect of breathiness (Laver, 1980; Gordon and Ladefoged, 2001), which has been well attested, e.g., in Khmu' (Abramson et al., 2007), Hindi (Dutta, 2009), and Marathi (Berkson, 2013). But whether this effect is a phonetic universal remains controversial, as there are studies that have shown either an f0 raising effect (Wayland and Jongman, 2003, for Khmer) or the lack of an f0 correlate (Garellek and Keating, 2011, for Jalapa Mazatec) for breathiness.

As previously stated, existing literature on phonation–tone interaction in Shanghai has firmly established that the cues for the laryngeal contrast of interest here are multidimensional in both non-sandhi and sandhi positions. Cao and Maddieson (1992) showed that for syllables in isolation, i.e., the non-sandhi context, H1-H2 and H1-A1 were significantly higher at vowel onset after the voiced stop than after the voiceless unaspirated stop, but the differences disappeared at the mid and end points of the vowel; for syllables in the sandhi context (e.g., second syllable in disyllables), only the H1-H2 difference remained at vowel onset, and the magnitude of the difference was smaller; but the voiced stops were "phonetically voiced." The acoustic study by Ren (1992) also showed tapering H1-H2 and H1-A1 differences on the vowel after voiced and voiceless unaspirated stops in the non-sandhi position; but in the sandhi position, Ren found an H1-A1 difference instead of an H1-H2 difference. Ren (1992) also conducted a perception study in which H1-H2 was varied in ten steps and f0 in three steps on the initial portion of the vowel after a stop in the sandhi position (ɦa13ta34 "shoelace" to ɦa13da13 "shoe (is) big"). Results showed that both H1-H2 and f0 had an effect on the perception of the second syllable: the /d/ response was more likely with a higher H1-H2; a raised f0 shifted response toward /t/, while a lowered f0 shifted the response toward /d/. Shen and Wang (1995) focused on the roles of the closure and release durations of the stop as the acoustic correlates of stop voicing. They showed that, although the two types of stops did not differ in their release duration (duration between the stop burst and the beginning of vowel periodicity), the voiceless stops had a significantly longer closure duration than the voiced stops in both initial and medial positions, and the voiced stops had closure voicing medially. 
The acoustic study by Wang (2011) returned similar duration results to Shen and Wang's except that she did not find a closure duration difference based on voicing in the initial position. In a series of perception studies that manipulated closure duration and f0, Wang showed that when restricting the tones to the two rising tones 34 (co-occurs with voiceless) and 13 (co-occurs with voiced), f0 was the primary perceptual cue for the contrast in initial position; in the medial position, both f0 and closure duration were used perceptually for the contrast, but closure voicing was not. Chen (2011) focused on the f0 perturbation effect from the stop voicing contrast in the sandhi context and found that the effect was minimal, and that its size was partly determined by the underlying tone of the preceding syllable. Chen argued that these patterns potentially serve the purpose of maximizing the tonal contrast on the preceding syllable, which determines the pitch contour of the entire sandhi domain; therefore, the f0 perturbation here is speaker controlled, at least in part. For H1-H2, Chen only found the expected difference in the /o/ context, with the voiced stops inducing greater H1-H2; for the /i/ context, the effect was the reverse.

In a dissertation (Gao, 2015) and a series of related publications (Gao and Hallé, 2013, 2015, 2016, 2017), Gao and Hallé presented the most comprehensive study of Shanghai phonation–tone interaction to date. Their acoustic investigation included all three manners (stops, fricatives, nasals) as onsets in monosyllables as well as both syllables of disyllables. In terms of duration, a consonant-vowel (CV) syllable with a voiceless fricative onset had a longer consonant and a shorter vowel than one with a corresponding voiced fricative onset (Gao, 2015); voiced stops had a significantly longer VOT than voiceless unaspirated stops by around 2–4 ms (Gao, 2015; Gao and Hallé, 2017). In terms of voicing in the initial position, voiced stops rarely had voicing, while voiced fricatives had voicing ratios (percentages of consonant duration being voiced) of around 30%–40%; in medial position, voiced stops and fricatives had over 90% voicing ratios, compared to around 20%–30% for voiceless ones (Gao, 2015; Gao and Hallé, 2017). For spectral and periodicity measures, they showed that for monosyllables, H1-H2, H1-A1, and H1-A2 were generally higher, while CPP was generally lower following voiced/murmured onsets than voiceless/modal ones, but the differences were the greatest and the most consistent for elder male speakers; linear discriminant analyses (LDAs) showed that H1-H2 was the most consistent cue across age and gender groups and in different tonal contexts. Only H1-H2 results were reported for the two syllables in disyllables. Results showed that for the first syllable, H1-H2 was higher after voiced/murmured onsets, but the difference was less clear-cut than in monosyllables; for the second syllable, no H1-H2 difference based on the voicing difference was found (Gao, 2015; Gao and Hallé, 2017). Perceptually, two experiments were conducted to investigate the effect of duration and voicing patterns on the identification of the laryngeal contrast. 
The first experiment created “congruent” and “incongruent” monosyllabic stimuli by imposing the f0 of one CV onto another when the two onsets differed in voicing, and the results showed that the congruence factor significantly affected the accuracy and reaction time of tone identification when the onsets were labial fricatives, which had the largest voicing difference. The second experiment created tonal continua between the two rising tones on both the long C-short V and short C-long V duration patterns and showed that the duration pattern shifted the listeners' identification response toward the category with that duration pattern, and the incongruence between tone and duration pattern slowed down the reaction time (Gao and Hallé, 2013; Gao, 2015). An additional experiment was carried out to investigate the effect of voice quality on perception. Tonal continua were again created between the two rising tones and imposed onto modal and breathy syllables (both synthesized and naturally produced modal and breathy syllables were used). Identification results showed that the voice quality of the syllable shifted the listeners' identification response toward the category with that phonation type, and the incongruence between tone and phonation slowed down the reaction time with the exception of naturally produced tokens with nasal onsets (Gao and Hallé, 2015; Gao, 2015).
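The tonal continua used in these perception experiments are typically constructed by interpolating between the f0 contours of the two endpoint tones before resynthesis. A minimal sketch of that step is below; the Hz values for the 34 and 13 contours, the five sample points, and the seven-step count are purely hypothetical placeholders, and interpolation is sometimes done in semitone rather than Hz space.

```python
import numpy as np

def f0_continuum(contour_a, contour_b, n_steps):
    """Linearly interpolate between two equal-length f0 contours (Hz),
    returning n_steps contours from contour_a to contour_b inclusive."""
    a = np.asarray(contour_a, float)
    b = np.asarray(contour_b, float)
    weights = np.linspace(0.0, 1.0, n_steps)
    return [(1 - w) * a + w * b for w in weights]

# Hypothetical endpoint contours for the high-rising (34) and
# low-rising (13) tones, sampled at five points over the vowel.
tone_34 = [180, 185, 195, 210, 230]   # illustrative Hz values only
tone_13 = [140, 138, 142, 155, 175]
steps = f0_continuum(tone_34, tone_13, 7)
print(len(steps))  # → 7
```

Each intermediate contour would then be imposed on a fixed carrier syllable (modal or breathy, as in the voice-quality experiment) via resynthesis.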

With the exception of the work of Gao and Hallé, the previous studies only investigated a subset of the cues for stops. But even in the studies by Gao and Hallé, there was no direct comparison among the different manners, and their perception studies were restricted to monosyllables. In the present work, the goal is to provide a comprehensive look at the acoustic realization and perception of the contrast between voiceless unaspirated/modal and voiced/murmured consonants co-occurring with a high-rising and a low-rising tone, respectively, across different manners (e.g., tɔ34 vs dɔ13; fi34 vs vi13; mε34 vs m̤ε13) and different contexts (sandhi, non-sandhi) using a consistent set of methods. In so doing, we aim to shed light on the language- and context-dependent nature of contrast realization and perceptual cue weighting, especially when a phonological alternation process is involved, as well as on the production-perception link. In Secs. II and III, a production study and a perception study conducted to this end are reported.

Thirteen monosyllabic voiceless/modal vs voiced/murmured minimal pairs were used for the non-sandhi context (six stop pairs, four fricative pairs, three sonorant pairs); all voiceless/modal syllables occurred with the high rising tone 34 and all voiced/murmured syllables with the low rising tone 13 (e.g., pu34 and bu13). The same pairs were then used as the second syllable of disyllabic compounds with matched first syllables for the sandhi context (e.g., fən53-pu34 and fən53-bu13). Both the monosyllabic and disyllabic words were embedded in the carrier sentence ŋu34 ɕja34 __ ɡəʔ12 əʔ55 zɨ13 "I write the character/word ___." The target stimuli were put in sentence-medial position to allow the measurement of closure duration for onset stops, as duration has been more consistently shown as a perceptual cue for the contrast in previous studies (Wang, 2011; Gao and Hallé, 2013; Gao, 2015). The trade-off, however, is that this creates an environment that may also facilitate consonant voicing for the voiced obstruents even for the monosyllables. Tone sandhi (or lack thereof) on the target words, however, is not expected to be affected by the sentential context, as the preceding verb ɕja34 "to write" and the following demonstrative ɡəʔ12 "this" do not belong to the same prosodic word as the target. The full word list is given in Table II.

TABLE II.

Word list used in the production experiment. Tone transcriptions reflect the base tones before the application of tone sandhi.

	Monosyllables		Disyllables	
	Voiceless/modal	Voiced/murmured	Voiceless/modal	Voiced/murmured
Stops	pin34 "pancake"	bin13 "bottle"	ma13-pin34 "to sell pancakes"	ma13-bin13 "to sell bottles"
	pu34 "to spread"	bu13 "section"	fən53-pu34 "distribution"	fən53-bu13 "division"
	tɔ34 "to arrive"	dɔ13 "news"	pɔ34-tɔ34 "check-in"	pɔ34-dɔ13 "news report"
	ti34 "emperor"	di13 "brother"	ɦuɑ̃13-ti34 "emperor"	ɦuɑ̃13-di13 "royal brother"
	k34 "rail"	g13 "hoop"	thiɪʔ55-k34 "rail"	thiɪʔ55-g13 "iron hoop"
	k34 "arch"	g13 "together"	iɪʔ55-k34 "an arch"	iɪʔ55-g13 "all together"
Fricatives	fi34 "fee"	vi13 "fat"	34-fi34 "to reduce the fee"	34-vi13 "to lose weight"
	fən34 "hard work"	vən13 "article"	faʔ55-fən34 "to work hard"	faʔ55-vən13 "to publish an article"
	sɨ34 "water"	zɨ13 "porcelain"	dɑ̃13-sɨ34 "sugar water"	dɑ̃13-zɨ13 "porcelain"
	su34 "lock"	zu13 "seat"	tɕin53-su34 "golden lock"	tɕin53-zu13 "golden seat"
Sonorants	min34 "chirp"	m̤in13 "name"	ɳjɔ34-min34 "bird's chirps"	ɳjɔ34-m̤in13 "bird's name"
	mε34 "America"	m̤ε13 "plum"	ly34-mε34 "traveling in the US"	ly34-m̤ε13 (proper name)
	ɳjɔ34 "bird"	ɳ̤jɔ13 "around"	13-ɳjɔ34 "blue bird"	13-ɳ̤jɔ13 "indiscriminate"

Ten native speakers (5 male, 5 female) with an age range of 19–30 and a mean age of 25 were recorded in a quiet room in Shanghai using an Electro-Voice N/D767 cardioid microphone (Burnsville, MN) and a Marantz portable solid state recorder (PMD 671, Cumberland, RI). Each of them read the stimuli twice. Subsequent measurements for the two repetitions were averaged before the statistical analyses.

Consonant durations were measured in Praat (Boersma and Weenink, 2012) by the second author. The duration for stops was the closure duration, measured from the end of the previous syllable to the stop release. For fricatives and sonorants, the segments themselves were identified from the spectrograms and their durations measured. Durations were analyzed with linear mixed-effects models, with the laryngeal feature (referred to as voicing for brevity below) as a fixed effect and subject and item as random effects. The p-values were calculated using the lmerTest package in R (Kuznetsova et al., 2016). Monosyllables and disyllables were analyzed separately. Stops and fricatives were classified as "voiced" or "voiceless" depending on whether 50% or more of the consonant duration (closure for stops, frication duration for fricatives) had voicing, as determined from the waveforms and spectrograms in Praat by the first author.
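The 50% voicing criterion amounts to a simple threshold rule; a minimal Python sketch is below, assuming frame-level voiced/voiceless decisions have already been made from the waveforms and spectrograms (the actual judgments in the study were made by hand in Praat, not by this function).

```python
def classify_voicing(frame_voiced, threshold=0.5):
    """Label a consonant "voiced" if the proportion of voiced frames
    within its closure/frication interval meets the threshold."""
    ratio = sum(frame_voiced) / len(frame_voiced)
    label = "voiced" if ratio >= threshold else "voiceless"
    return label, ratio

# E.g., 10-ms frames over a 60-ms closure, 4 of 6 frames voiced:
label, ratio = classify_voicing([1, 1, 1, 1, 0, 0])
print(label, round(ratio, 2))  # → voiced 0.67
```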

The spectral measure H1*-H2* (corrected H1-H2 based on the frequencies and bandwidths of formants; Shue et al., 2011) and the periodicity measure CPP were selected to estimate the breathiness induced by the contrast. H1*-H2* and CPP values were measured every millisecond in VoiceSauce v1.12 (Shue et al., 2011), and the measurements over every 9.1% of the vowel duration were averaged, yielding 11 data points for each vowel for statistical analysis. The Snack Sound Toolkit (Sjölander, 2004) was used by VoiceSauce to find the frequencies and bandwidths of the formants with the covariance method, a pre-emphasis of 0.96, and a window length of 25 ms with a frame shift of 1 ms. Fundamental frequencies were measured at 10% intervals during the vowel using the ProsodyPro Praat script (Xu, 2005–2013). The Maxf0 and Minf0 parameters in the script, as well as the octave-jump cost, were adjusted for each speaker, and the f0 measurements were manually checked by the second author against pitch tracks and narrowband spectrograms in Praat to correct any measurement errors by the script. The f0 values in Hz were then converted into semitones and z-scored. Growth curve analyses (Mirman, 2014) were conducted on the H1*-H2*, CPP, and f0 curves over the vowel using third-order (cubic) orthogonal polynomials. The models were built up from the base model that only included subject, item, and subject-by-voicing random effects. Voicing and its interaction with the time terms were subsequently added step-wise, and their effects on model fit were evaluated using log-likelihood model comparison. Parameter estimates for the full model were then tested for significance using t-tests, and p-values were again estimated by the lmerTest package. Different manners and different positions were analyzed separately, and the voiceless/modal category was used as the baseline. 
H1*-H2* and CPP were similarly compared for sonorant consonants, but the measurements were averaged over every 20% of the sonorant duration, yielding only five data points for each sonorant. All statistical analyses were performed using the lme4 package (Bates et al., 2015) in R (R Core Team, 2014).
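The preprocessing steps above, binning the per-millisecond tracks into 11 averaged points, converting f0 to semitones and z-scoring, and building cubic orthogonal time terms, can be sketched in Python as follows. This is an illustrative analogue only (the study itself used VoiceSauce, ProsodyPro, and R's poly()/lmer); the ref_hz reference is an arbitrary assumption that cancels out after z-scoring, and the QR-based time terms match R's poly() up to column sign.

```python
import numpy as np

def bin_average(track, n_bins=11):
    """Average a densely sampled track (e.g., 1-ms H1*-H2* or CPP values)
    over n_bins equal portions of its duration (11 bins ~ every 9.1%)."""
    chunks = np.array_split(np.asarray(track, float), n_bins)
    return np.array([c.mean() for c in chunks])

def semitone_z(f0_hz, ref_hz=100.0):
    """Convert f0 from Hz to semitones re an arbitrary reference,
    then z-score; the reference only shifts values, so it cancels out."""
    st = 12 * np.log2(np.asarray(f0_hz, float) / ref_hz)
    return (st - st.mean()) / st.std()

def orthogonal_poly(n_points, degree=3):
    """Orthogonal polynomial time terms up to `degree` (cf. R's poly()):
    QR-orthogonalize a Vandermonde matrix of the centered time index."""
    t = np.arange(n_points) - (n_points - 1) / 2.0
    X = np.vander(t, degree + 1, increasing=True)
    Q, _ = np.linalg.qr(X)
    return Q[:, 1:]  # drop the constant column

pts = bin_average(np.random.default_rng(1).normal(5.0, 1.0, 220), 11)
T = orthogonal_poly(11, 3)  # linear, quadratic, cubic time terms
print(pts.shape, T.shape)  # → (11,) (11, 3)
```

The columns of T are mutually orthonormal, so the linear, quadratic, and cubic growth-curve terms can be estimated without collinearity among the time predictors.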

To investigate the relative contribution of the different acoustic cues in the laryngeal contrast for each manner in monosyllables and disyllables, LDAs were conducted to explore the extent to which the laryngeal category can be predicted from the acoustic cues. The greedy.wilks function in the klaR package (Weihs et al., 2005) in R was used to conduct stepwise forward variable selection for significant predictors (p < 0.05), and the lda function in the MASS package (Venables et al., 2002) was used to derive the coefficients for the variables for the linear discriminant functions. The overall Wilks's lambda values (from 0 to 1) for the discrimination (0 means total discrimination, 1 means no discrimination), as well as their F and p values, were calculated using the manova function (see also Gao, 2015).
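As a rough illustration of what the discriminant analysis and Wilks's lambda quantify, the sketch below computes both from scratch on toy two-cue data. This is not the study's klaR/MASS/manova pipeline: lda_direction is the plain two-class Fisher discriminant without the stepwise greedy.wilks variable selection, and the data, with the first cue (think f0) separating the categories and the second barely doing so, are invented for the example.

```python
import numpy as np

def wilks_lambda(X, y):
    """Wilks's lambda = det(W) / det(T): within-class scatter over total
    scatter (0 means total discrimination, 1 means none)."""
    X, y = np.asarray(X, float), np.asarray(y)
    Xc = X - X.mean(0)
    T = Xc.T @ Xc                       # total scatter about the grand mean
    W = np.zeros_like(T)                # pooled within-class scatter
    for g in np.unique(y):
        Xg = X[y == g] - X[y == g].mean(0)
        W += Xg.T @ Xg
    return np.linalg.det(W) / np.linalg.det(T)

def lda_direction(X, y):
    """Two-class Fisher discriminant direction: solve Sw w = mu1 - mu0."""
    X0, X1 = X[y == 0], X[y == 1]
    Sw = ((X0 - X0.mean(0)).T @ (X0 - X0.mean(0))
          + (X1 - X1.mean(0)).T @ (X1 - X1.mean(0)))
    return np.linalg.solve(Sw, X1.mean(0) - X0.mean(0))

# Toy data: two acoustic cues for two laryngeal categories.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (50, 2)),
               rng.normal([4.0, 0.5], 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

w = lda_direction(X, y)
print(round(wilks_lambda(X, y), 2), bool(abs(w[0]) > abs(w[1])))
```

The small lambda reflects good overall separability, and the larger absolute weight on the first cue is the analogue of that cue dominating the discriminant function.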

1. Duration and voicing measures

The consonant duration results are given in Fig. 1. For both the monosyllables and the second syllable of disyllables, the best model included the interaction between voicing and manner. An analysis with voicing nested under manner as fixed effects for monosyllables and disyllables, respectively, was then conducted to get voicing estimates for the different manners in the same model. For monosyllables, the effect of voicing is significant for fricatives (estimate = −59.168, standard error (SE) = 13.344, degrees of freedom (df) = 25.246, t = −4.434, p < 0.001), but not for stops (estimate = −11.073, SE = 10.930, df = 25.544, t = −1.013, p = 0.321) or sonorants (estimate = −0.783, SE = 15.393, df = 25.151, t = −0.051, p = 0.960). For the second syllable of disyllables, likewise, the effect of voicing is significant for fricatives (estimate = −66.554, SE = 12.558, df = 25.792, t = −5.300, p < 0.001), but not for stops (estimate = −8.870, SE = 10.250, df = 25.755, t = −0.865, p = 0.395) or sonorants (estimate = 0.182, SE = 14.484, df = 25.676, t = 0.013, p = 0.990).

FIG. 1.

Duration of onset consonants in monosyllables and the second syllable of disyllables. *: p < 0.05; **: p < 0.01; ***: p < 0.001.


In terms of voicing, 89% of the voiced stops and 100% of the voiced fricatives in the second syllable of disyllable tokens were classified as voiced. This generally agrees with the results of earlier studies (Shen and Wang, 1995; Chen, 2011; Wang, 2011; Gao, 2015; Gao and Hallé, 2017). In monosyllables, due to the intervocalic position in which the consonant appears in the sentential context, 33% of the voiced stop onsets were also classified as voiced. For fricatives, 100% of the voiced labial fricatives and 50% of the coronal fricatives were classified as voiced. The tendency for labial fricatives to have more voicing in this position has also been documented in Gao (2015) and Gao and Hallé (2017). Voiceless obstruents were occasionally voiced (11% for monosyllables, 31% for the second syllable of disyllables), contrary to traditional descriptions; we have no good explanation for this, except that the intervocalic or post-nasal positions in which they appear may have encouraged phonetic voicing.

2. Spectral and periodicity measures

The H1*-H2* and CPP results for the vowels after the three consonant manners in monosyllables are given in Figs. 2 and 3, respectively. Model comparisons for H1*-H2* showed that the model did not significantly improve with the addition of voicing or its interactions with the linear, quadratic, and cubic time terms for any manner (p > 0.15 for all comparisons). For CPP, the interaction between voicing and the quadratic time term did significantly improve the model for fricatives [χ2(1) = 8.455, p = 0.004]. Parameter estimates for the quadratic interaction (estimate = 2.160, SE = 0.610, t = 3.538, p = 0.006) indicated that voiceless fricatives induced a sharper peak for the CPP curve on the following vowel than voiced ones; no other model comparisons were significant (all p > 0.07).
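The cubic orthogonal polynomial time terms used in these growth curve analyses can be constructed as follows. This is a minimal sketch analogous to R's poly() (the paper's actual models were fit with linear mixed-effects software); the time points are illustrative.

```python
import numpy as np

def ortho_poly(t, degree=3):
    """Return orthogonal polynomial predictors (linear, quadratic, cubic)
    over the time points t. The columns are mutually orthogonal, so the
    time terms enter a growth curve model without collinearity."""
    t = np.asarray(t, dtype=float)
    V = np.vander(t, degree + 1, increasing=True)  # columns 1, t, t^2, t^3
    Q, _ = np.linalg.qr(V)                         # orthonormalize columns
    return Q[:, 1:]                                # drop the constant column

time_points = np.linspace(0.0, 1.0, 10)  # e.g., ten samples over the vowel
T = ortho_poly(time_points)              # columns: linear, quadratic, cubic
```

A measure such as H1*-H2* or CPP sampled at these time points can then be regressed on the three columns plus an intercept; the voicing effect on the quadratic term, for instance, captures a difference in the peakedness of the curve.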

FIG. 2.

H1*-H2* results over the duration of the vowels after stops, fricatives, and sonorants for monosyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
FIG. 3.

CPP results over the duration of the vowels after stops, fricatives, and sonorants for monosyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.

For the second syllable of disyllables, stops and sonorants again did not exhibit any phonatory difference in H1*-H2* or CPP on the following vowel based on their laryngeal features (p > 0.18 for all model comparisons). For fricatives, however, model comparisons showed that for H1*-H2* the effect of voicing on the intercept significantly improved the model [χ2(1) = 9.564, p = 0.002], and parameter estimates (estimate = 2.241, SE = 0.568, t = 3.942, p = 0.002) indicated that voiceless fricatives induced a lower H1*-H2* than voiced fricatives; for CPP, the effects of the laryngeal feature on the intercept and quadratic time terms both significantly improved the model [intercept: χ2(1) = 8.752, p = 0.003; quadratic: χ2(1) = 6.353, p = 0.012], and parameter estimates showed a significant effect for the quadratic interaction (estimate = 2.881, SE = 0.943, t = 3.054, p = 0.011), indicating that voiceless fricatives again induced a sharper peak for the CPP curve on the following vowel. These results are given in Figs. 4 and 5.4

FIG. 4.

H1*-H2* results over the duration of the vowels after stops, fricatives, and sonorants for the second syllable of disyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
FIG. 5.

CPP results over the duration of the vowels after stops, fricatives, and sonorants for the second syllable of disyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.

For the spectral and periodicity measures on the sonorant consonants themselves, for monosyllables, the model for H1*-H2* did not significantly improve with the addition of voicing or its interactions with the linear, quadratic, and cubic time terms (p > 0.75 for all comparisons), but the model for CPP did improve with the addition of voicing on the intercept [χ2(1) = 4.818, p = 0.028] and the quadratic time term [χ2(1) = 4.064, p = 0.044]. Parameter estimates indicated that the modal sonorants had an overall higher CPP value than the murmured sonorants (voicing intercept: estimate = −1.815, SE = 0.510, t = 3.561, p = 0.005), and the murmured sonorants had a more U-shaped curve than the modal sonorants (voicing and quadratic time term interaction: estimate = 0.890, SE = 0.395, t = 2.256, p = 0.041). For sonorant onsets on the second syllable of disyllables, the models for H1*-H2* and CPP did not significantly improve with the addition of voicing or its interactions with the linear, quadratic, and cubic time terms (p > 0.33 for all comparisons). The monosyllabic and disyllabic results are given in Figs. 6 and 7, respectively.

FIG. 6.

H1*-H2* and CPP results over the duration of sonorant onsets for monosyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
FIG. 7.

H1*-H2* and CPP results over the duration of sonorant onsets for the second syllable of disyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.

3. f0

The f0 results for the monosyllables and the second syllable of disyllables are given in Figs. 8 and 9, respectively. For monosyllables, the addition of voicing improved the model for the stops [χ2(1) = 8.350, p = 0.004] and fricatives [χ2(1) = 15.153, p < 0.001], and the addition of its interaction with the linear time term improved the model for the fricatives [χ2(1) = 11.224, p < 0.001] and sonorants [χ2(1) = 4.472, p = 0.034]. Parameter estimates for the full model, which include the effects of voicing and its interaction with the linear, quadratic, and cubic time terms for the three manners, are summarized in Table III. With the voiceless/modal category as the baseline, the negative intercepts indicated that the f0s after the voiced/murmured consonants were significantly lower than those after the voiceless/modal consonants, and the positive coefficients for the interaction between voicing and the linear time term indicated that the f0s after the voiced/murmured consonants had sharper rising slopes; the f0 difference between the two types of onsets therefore decreased over the duration of the vowel. For the second syllable in disyllables, however, only for the fricatives did the addition of the laryngeal feature significantly improve the model [χ2(1) = 3.849, p = 0.050]. No other model comparisons were significant (all p > 0.12). Parameter estimates for the full models indicated that the effects of voicing on the intercept or higher time terms were not significant for any manner, including the fricatives.

FIG. 8.

Normalized f0 results over the duration of the vowels after stops, fricatives, and sonorants for monosyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
FIG. 9.

Normalized f0 results over the duration of the vowels after stops, fricatives, and sonorants for the second syllable of disyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
TABLE III.

Parameter estimates for the monosyllable f0 analysis. Baseline = voiceless.

                                Estimate      SE         t        p
Stop       Voicing: Intercept     −0.805   0.190    −4.228   <0.001
           Voicing: Linear         0.533   0.271     1.967    0.068
           Voicing: Quadratic      0.330   0.208     1.588    0.137
           Voicing: Cubic         −0.144   0.140    −1.028    0.314
Fricative  Voicing: Intercept     −1.180   0.176    −6.699   <0.001
           Voicing: Linear         1.153   0.228     5.045   <0.001
           Voicing: Quadratic      0.353   0.184     1.917    0.078
           Voicing: Cubic         −0.228   0.177    −1.283    0.231
Sonorant   Voicing: Intercept     −0.682   0.244    −2.789    0.019
           Voicing: Linear         0.973   0.400     2.431    0.043
           Voicing: Quadratic      0.331   0.324     1.022    0.338
           Voicing: Cubic         −0.175   0.142    −1.233    0.232

4. Linear discriminant analysis

Consonant duration and CPP and f0 values averaged over the entire vowel duration were used as the acoustic variables in the linear discriminant analysis. These variables were selected as representatives of the acoustic properties of the consonant, vowel phonation, and vowel f0. Consonant duration was selected as the consonant cue as previous studies have primarily shown the perceptual effect of duration (e.g., Wang, 2011; Gao and Hallé, 2013; Gao, 2015), and Wang (2011) has shown that listeners did not use closure voicing as a perceptual cue for stops. CPP was selected as the phonation cue as our acoustic results above showed stronger CPP effects than H1*-H2*. The variables were centered and scaled before being submitted to the discriminant analysis.
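A two-class linear discriminant of the kind used here can be sketched as follows. This is a hedged illustration on synthetic data (the group means, sample sizes, and random seed are invented; the study's LDAs were run on the real acoustic measurements).

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic cue values (consonant duration, CPP, mean f0) for 50
# voiceless/modal and 50 voiced/murmured tokens.
voiceless = rng.normal([0.6, 0.6, 1.0], 1.0, size=(50, 3))
voiced = rng.normal([-0.6, -0.6, -1.0], 1.0, size=(50, 3))
X = np.vstack([voiceless, voiced])
X = (X - X.mean(0)) / X.std(0)          # center and scale, as in the text
a, b = X[:50], X[50:]

# Fisher discriminant: w = pooled-within-class-covariance^{-1} (mu_a - mu_b).
Sw = np.cov(a, rowvar=False) + np.cov(b, rowvar=False)
w = np.linalg.solve(Sw, a.mean(0) - b.mean(0))
scores = X @ w                           # project onto the discriminant axis
threshold = (scores[:50].mean() + scores[50:].mean()) / 2
pred_voiceless = scores > threshold      # class-a scores sit above the midpoint
accuracy = (pred_voiceless == np.repeat([True, False], 50)).mean()
```

The entries of w play the role of the discriminant coefficients reported in Table IV: a large-magnitude coefficient means that cue carries much of the between-class separation after accounting for the others.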

Table IV summarizes the coefficients for the variables for the linear discriminant functions as well as the Wilks's lambda, F, and p values for the discriminations. Significant predictors, as indicated by stepwise variable selection, are given in bold. “Voiceless/modal” was dummy coded as 0. Therefore, a negative coefficient for a factor indicates that a higher value for that factor is more likely to lead to a “voiceless/modal” classification. For monosyllables (non-sandhi), the only consistent predictor was f0; but for fricatives, both CPP and duration were significant as well, and the stepwise analysis selected f0 first, then CPP, followed by duration. For the second syllable in disyllables (sandhi), only the fricatives could be significantly discriminated, and the stepwise analysis selected duration first, then CPP.

TABLE IV.

Coefficients for the variables for the linear discriminant functions, as well as the Wilks's lambda, F, and p values for the discriminations. Significant predictors (p <0.05) are in bold.

Coefficients                             Duration      CPP       f0   Wilks's lambda        F        p
Monosyllable (non-sandhi)   Stop           −0.124   −0.061   −1.245            0.684   16.627   <0.001
                            Fricative      −0.604   −0.761   −1.207            0.303   56.080   <0.001
                            Sonorant        0.026   −0.156   −1.314            0.716    7.402   <0.001
Disyllable (sandhi)         Stop           −0.973    0.200   −0.723            0.961    1.531    0.210
                            Fricative      −1.464   −0.403   −0.041            0.434   30.013   <0.001
                            Sonorant        2.136    0.490   −0.377            0.998    0.042    0.988

The acoustic results above indicate that this laryngeal contrast in Shanghai is primarily a tone contrast in the non-sandhi context (monosyllables). Although the H1*-H2* and CPP comparisons between the voiceless/modal and voiced/murmured categories were generally in the expected direction, with the voiceless/modal consonants exhibiting numerically lower H1*-H2* and higher CPP on the following vowel, only the CPP comparison for fricatives reached significance under the growth curve analysis. The f0 curves on the vowels after voiceless/modal and voiced/murmured consonants, by contrast, differed significantly on the intercept for all three manners and on the slope for fricatives and sonorants. There are indications that the consonants themselves still played a role in the contrast: the fricatives exhibited a duration difference, and the sonorants a CPP difference, based on the contrast. Moreover, the attenuation of the f0 difference over the vowel after voiceless/modal vs voiced/murmured consonants suggests that the f0 difference, at least in part, stems from the onset consonants. The LDAs provided the relative weighting of the acoustic cues from consonant duration, vowel phonation, and vowel f0, and corroborated the acoustic finding that the laryngeal contrast in the non-sandhi context is primarily tonal, with secondary cues from CPP and consonant duration for the fricatives.

In the sandhi context (second syllable of disyllables), the f0 difference was neutralized, but the stops gained a voicing difference despite losing the closure duration difference, and the fricatives exhibited both duration and voicing differences. For the sonorants, however, no difference between the modal and murmured categories was detected in consonant duration, consonant phonation, vowel phonation, or f0. The LDAs did not encode the effect of voicing, but confirmed that f0 cannot be used to discriminate the contrast, and that fricatives have enough secondary cues in duration and CPP to be differentiated.

These results show that the acoustic cues for the contrast indeed vary by the manner and position in which the contrast is realized. In the sandhi position, where a phonological process presumably neutralizes the main cue for the contrast (f0), the contrast itself is incompletely neutralized for fricatives, and arguably for stops, but completely neutralized for sonorants as far as the measures included here are concerned. The weakness of this contrast on sonorants thus finds some support in the results.

Unlike in previous studies (e.g., Cao and Maddieson, 1992; Ren, 1992; Gao, 2015), the H1*-H2* and CPP results here generally did not show a significant effect of the laryngeal feature. For f0, although we showed that it significantly covaried with the consonant feature in the non-sandhi context, a result shared by all previous research, we did not find the incomplete neutralization in the sandhi context reported by Ren (1992), Chen (2011), and Wang (2011). There are two potential reasons for these disparities. One is that, given that our speakers were considerably younger than those in earlier studies, Shanghai may be gradually losing the phonation difference, with the contrast now primarily cued by tone in the younger generations (see Gao, 2015; Gao and Hallé, 2016, 2017, for age- and gender-based differences that support this contention). The other is that the different results are partly due to the different statistical methods used. In the linear mixed-effects-based growth curve analyses, the random-effects structure included not only subject and item, but also the subject-by-voicing interaction, which helps reduce the Type I error in testing the effect of interest (Barr et al., 2013), in this case, the effect of voicing.

The perception study investigated how the different acoustic cues for the laryngeal contrast are weighted in perception and how the weightings are affected by the manner and position of the contrast. The stimuli were monosyllabic and disyllabic words in which the target syllables had a full cross-classification of three sets of cues: consonant properties, vowel phonation, and vowel f0. These syllables were constructed by cross-splicing consonant and vowel portions of different syllables and superimposing the f0 contour from one vowel onto another in Praat. For instance, from two base tokens [pu34] (no. 1) and [bu13] (no. 8), six additional stimuli (no. 2–no. 7) were constructed, as shown in Table V. Three monosyllabic pairs, one from each manner, were selected as the original base tokens (pu34∼bu13, fi34∼vi13, and mε34∼m̤ε13), and their corresponding disyllabic pairs (fən53pu34∼fən53bu13, kε34fi34∼kε34vi13, and ly34mε34∼ly34m̤ε13) were selected as the originals for the disyllables. Therefore, there were 24 monosyllables and 24 disyllables in total as the perceptual stimuli. There are three main reasons why we used cross-spliced stimuli in the perception experiment instead of the acoustic continua often used in similar studies. First, this method allows for a complete parallel in the investigation of different manners in different positions; the acoustic-continuum method necessitates the use of different values along the continuous scale depending on the acoustic properties of each context, and hence loses some of this parallelism. Second, the manipulation is easily executable; the acoustic-continuum method may not allow effective continua to be built when the acoustic differences in some contexts are small. Third, the method is symmetrical among the three sets of cues and hence makes no assumption about the importance of any particular one.

TABLE V.

Examples of stimulus construction for the perception experiment from original tokens [pu34] and [bu13].

Stimulus number   C properties   V phonation   V f0   Method
1.                pu34           pu34          pu34   Original
2.                pu34           pu34          bu13   Superimpose f0 of [bu13] onto [pu34]
3.                pu34           bu13          pu34   Cross-splice C of [pu34] to V of [bu13], then superimpose f0 of [pu34] onto the vowel
4.                pu34           bu13          bu13   Cross-splice C of [pu34] to the V of [bu13]
5.                bu13           pu34          pu34   Cross-splice C of [bu13] to the V of [pu34]
6.                bu13           pu34          bu13   Cross-splice C of [bu13] to V of [pu34], then superimpose f0 of [bu13] onto the vowel
7.                bu13           bu13          pu34   Superimpose f0 of [pu34] onto [bu13]
8.                bu13           bu13          bu13   Original

The base tokens were selected from a female speaker's production data, and a number of considerations went into the selection of these tokens. First, it was ensured that these tokens were representative of the overall acoustic patterns reported in Sec. II. Second, given that the f0 contour was either stretched or compressed when superimposed onto a vowel of a different duration, the original syllable pairs were selected such that their vowel durations were as similar as possible. Third, after f0 was superimposed onto a different vowel, H1*-H2* and CPP of the new token were remeasured, and we selected the base tokens for which these measures were minimally affected by the f0 manipulation. A summary of the acoustic measures for the 12 base tokens, as well as when the f0 of the base tokens was switched to that of the other laryngeal category, is given in Table VI, and all 48 test stimuli are provided as supplemental material online.5 All stimuli were embedded in the same carrier sentence and auditorily presented to the subjects through headphones for a two-alternative forced choice (2AFC) task, where they had to choose on a monitor the Chinese character(s) they heard.6 The entire stimulus list was presented four times, and the order of the stimuli was randomized each time. Forty-one native speakers (16 male, 25 female) with an age range of 19–37 yr and a mean age of 24.4 yr participated in the experiment in a quiet office at Fudan University in Shanghai.
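The stretching or compressing of an f0 contour onto a vowel of a different duration amounts to a linear time-warp, which can be sketched as simple interpolation. The resynthesis in the study was done in Praat; the frame counts and Hz values below are illustrative only.

```python
import numpy as np

def stretch_f0(f0_contour, new_len):
    """Linearly time-warp an f0 contour (Hz per frame) to a new number of
    frames, preserving its shape while rescaling its time base."""
    old_axis = np.linspace(0.0, 1.0, len(f0_contour))
    new_axis = np.linspace(0.0, 1.0, new_len)
    return np.interp(new_axis, old_axis, f0_contour)

rising = np.linspace(191.0, 229.0, 80)  # an 80-frame rising contour (Hz)
compressed = stretch_f0(rising, 60)     # the same contour on a shorter vowel
```

The endpoints and overall shape of the contour are preserved; only its duration changes, which is why base pairs with similar vowel durations were preferred to keep the warp minimal.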

TABLE VI.

Acoustic measures of the base tokens for the perception experiment as well as when the f0 of the base tokens was switched to that of the other laryngeal category (given in parentheses). H1*-H2*, CPP, and f0 were the average values over the vowel.

                                           C duration (ms)   H1*-H2* (dB)    CPP (dB)        f0 (Hz)
Monosyllable (non-sandhi)   pu34                       126   −1.55 (0.75)    16.66 (17.28)   217 (201)
                            bu13                       124    2.72 (1.77)    16.01 (18.24)   201 (217)
                            fi34                       196   −1.18 (1.22)    18.90 (19.28)   229 (191)
                            vi13                       126    3.45 (0.58)    17.24 (18.86)   191 (229)
                            mε34                       122   −1.83 (4.19)    22.98 (24.00)   211 (211)
                            m̤ε13                       118    8.38 (10.03)   17.44 (21.56)   172 (172)
Disyllable (sandhi)         fən53-pu34                  57   −1.28 (0.09)    17.38 (19.41)   217 (198)
                            fən53-bu13                  39    6.12 (6.43)    17.47 (18.92)   198 (217)
                            kε34-fi34                  147   −3.21 (4.79)    19.11 (21.37)   205 (187)
                            kε34-vi13                   70    7.81 (2.03)    20.73 (23.60)   187 (205)
                            ly34-mε34                  136    8.37 (8.47)    20.54 (22.42)   192 (194)
                            ly34-m̤ε13                   99   12.27 (11.53)   20.88 (21.77)   194 (192)

For each stimulus type defined by manner and position, a mixed-effects logistic regression was conducted with the subjects' binary responses as the dependent variable and the voicing specifications of consonant, phonation, and f0 cues as categorical predictors with random intercept by subject.7 A non-parametric analysis—the Classification and Regression Tree (CART) analysis (Breiman et al., 1984)—was also conducted using the rpart package in R to further investigate how the listeners classified the stimuli based on these cues. CART is a recursive partitioning technique that outlines the decision process for a category membership based on categorical predictors. The splits in a classification tree are selected so that the descendant subsets are “purer” than the current set, and the parameters for the splits can be considered as significant predictors for the classification. For our analysis, we constructed the classification trees by using consonant, phonation, and f0 cues as categorical predictors for the subjects' response for each manner and position by using the rpart function. We then conducted cost-complexity pruning for each tree based on the relative errors generated by tenfold cross-validation using the plotcp and prune functions (Baayen, 2008).
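The core of the CART procedure, choosing the categorical predictor whose split most purifies the response, can be sketched with a Gini-impurity calculation. This is a toy illustration of the split criterion, not rpart itself, and the response data below are invented to make the f0 cue decisive.

```python
def gini(labels):
    """Gini impurity of a list of binary responses (0 or 1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_split(rows, predictors):
    """Pick the binary categorical predictor whose split most reduces the
    weighted Gini impurity, as CART does when growing a node."""
    labels = [r["response"] for r in rows]
    parent = gini(labels)
    best = None
    for pred in predictors:
        left = [r["response"] for r in rows if r[pred] == 0]
        right = [r["response"] for r in rows if r[pred] == 1]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        gain = parent - child
        if best is None or gain > best[1]:
            best = (pred, gain)
    return best

# Hypothetical responses in which listeners track the f0 cue exactly:
rows = [{"consonant": c, "phonation": p, "f0": f, "response": f}
        for c in (0, 1) for p in (0, 1) for f in (0, 1) for _ in range(10)]
pred, gain = best_split(rows, ["consonant", "phonation", "f0"])
```

Here the f0 split separates the responses perfectly, so it yields the maximal impurity reduction and would form the root node, mirroring the monosyllable trees reported below.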

The accuracy and d′ results for the listeners' classification of the natural tokens are given in Fig. 10. These results indicate that the subjects had near perfect identification of the contrast in the non-sandhi context regardless of manner and in the sandhi context for fricatives. For stops in the sandhi context, the identification was weaker, but well above chance; for sonorants, however, identification was at chance.
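The d′ values in Fig. 10 follow the standard signal-detection computation, d′ = z(hit rate) − z(false-alarm rate). The sketch below uses the common 1/(2n) correction for perfect rates; the trial count is an assumption for illustration, not taken from the paper.

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate, n_trials=164):
    """d' = z(hit rate) - z(false-alarm rate); rates of exactly 0 or 1 are
    nudged inward by 1/(2n) so the inverse-normal transform stays finite."""
    clamp = lambda r: min(max(r, 1 / (2 * n_trials)), 1 - 1 / (2 * n_trials))
    z = NormalDist().inv_cdf
    return z(clamp(hit_rate)) - z(clamp(fa_rate))

near_perfect = d_prime(0.98, 0.02)  # a well-identified contrast
at_chance = d_prime(0.50, 0.50)     # no sensitivity: d' = 0
```

Near-perfect identification thus corresponds to d′ of roughly 4 or more, while chance-level identification, as for sonorants in the sandhi context, gives d′ near 0.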

FIG. 10.

Perceptual accuracy and d′ for the natural tokens in the perception experiment.

The coefficients for the consonants, phonation, and f0 cues in the mixed-effects logistic regressions for different manners and positions are given in Tables VII and VIII. “Voiceless/modal” was dummy coded as 0 for both the response variable and all the categorical predictors. Therefore, the intercept in the models indicates the log odds [ln(p/(1 − p))] of the segment being given a “voiced/murmured” response when the consonant, phonation, and f0 cues all came from the voiceless/modal category, and the coefficients for consonant, phonation, and f0 indicate the increase of the log odds when these cues came from the voiced category, respectively. For monosyllables (non-sandhi), f0 was the only consistent factor that significantly affected the response, and its coefficient was the largest among the three cues for all three manners; but for stops, phonation also had a significant effect, and for fricatives, both the consonant and phonation cues were significant as well. For the second syllable in disyllables (sandhi), all factors contributed significantly to the response for stops and fricatives, with phonation and consonant cues having the largest coefficient for stops and fricatives, respectively; for sonorants, none of the factors was significant. All significant effects were in the expected direction, i.e., the cues from the voiced/murmured category elicited more voiced/murmured responses.
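To make the log-odds interpretation concrete, the monosyllabic stop estimates from Table VII can be converted into response probabilities. This is a worked illustration using the reported coefficients, not part of the original analysis.

```python
import math

# Monosyllabic stops (Table VII): intercept and cue coefficients.
INTERCEPT = -5.007
COEFS = {"consonant": -0.0984, "phonation": 0.945, "f0": 6.816}

def p_voiced_response(cues):
    """Probability of a 'voiced/murmured' response, given which cues
    (1 = taken from the voiced/murmured category) are present."""
    logit = INTERCEPT + sum(COEFS[name] * value for name, value in cues.items())
    return 1 / (1 + math.exp(-logit))

all_voiceless = p_voiced_response({"consonant": 0, "phonation": 0, "f0": 0})
f0_voiced_only = p_voiced_response({"consonant": 0, "phonation": 0, "f0": 1})
# With all cues voiceless, the predicted probability of a voiced response is
# near zero; switching only the f0 cue raises it to roughly 0.86, reflecting
# the dominance of f0 for stops in the non-sandhi context.
```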

TABLE VII.

Parameter estimates for the mixed-effects logistic regressions for monosyllables (non-sandhi context). Baseline = voiceless.

                          Estimate      SE         z        p
Stop       (Intercept)      −5.007   0.429   −11.667   <0.001
           Consonant       −0.0984   0.222    −0.443    0.658
           Phonation         0.945   0.232     4.059   <0.001
           f0                6.816   0.396    17.195   <0.001
Fricative  (Intercept)      −0.945   0.213    −4.428   <0.001
           Consonant         2.523   0.204    12.384   <0.001
           Phonation         1.126   0.177     6.374   <0.001
           f0                2.551   0.205    12.464   <0.001
Sonorant   (Intercept)      −4.429   0.419   −10.585   <0.001
           Consonant         0.284   0.286     0.992    0.321
           Phonation        −0.284   0.286    −0.992    0.321
           f0                7.411   0.450    16.486   <0.001
TABLE VIII.

Parameter estimates for the mixed-effects logistic regressions for the second syllable of disyllables (sandhi context). Baseline = voiceless.

                          Estimate      SE        z        p
Stop       (Intercept)     −2.8715   0.298   −9.632   <0.001
           Consonant         0.292   0.146    1.996    0.046
           Phonation         1.484   0.155    9.577   <0.001
           f0                1.015   0.150    6.756   <0.001
Fricative  (Intercept)      −2.270   0.244   −9.323   <0.001
           Consonant         4.517   0.267   16.952   <0.001
           Phonation         0.957   0.179    5.343   <0.001
           f0                2.406   0.202   11.885   <0.001
Sonorant   (Intercept)       0.762   0.224    3.402   <0.001
           Consonant         0.057   0.127    0.450    0.652
           Phonation        −0.221   0.127   −1.734    0.083
           f0                0.172   0.127    1.350    0.177

The CART analyses after pruning are given in Fig. 11. The only pruning necessary was for fricatives in monosyllables, for which the original tree from the rpart function also included branches based on phonation. Relative errors generated by tenfold cross-validation under different cost-complexity measures using the plotcp function indicated that the structural complexity introduced by these branches was not warranted, and they were subsequently pruned using the prune function.

FIG. 11.

CART analyses for stops, fricatives, and sonorants in monosyllables and fricatives in the second syllable of disyllables.

For stops and sonorants in disyllables, only the root node was obtained, indicating that none of the cues was a significant factor in the partition. For stops and sonorants in monosyllables, f0 was the sole significant predictor for the subjects' classification (to read the Monosyllable_Stop graph, for instance: among the 656 tokens with f0 cues coming from voiceless stop onsets, 641 were classified as voiceless and 15 as voiced; among the 656 tokens with f0 cues coming from voiced stop onsets, 549 were classified as voiced and 107 as voiceless). For fricatives in monosyllables, both the f0 and consonant cues contributed significantly, with f0 weighted more heavily; for fricatives in disyllables, only the consonant and f0 cues were relevant, and the consonant cue was the more important of the two.

Both the logistic regression and CART analysis of the perception data showed that f0 was the primary cue that the listeners relied on in making category judgments for the laryngeal contrast in monosyllables (non-sandhi context). For the second syllable of disyllables (sandhi context), both analyses showed that the consonant and f0 cues contributed significantly to the voicing classification of fricatives. The logistic regression analysis, however, identified additional significant predictors: phonation for stops and fricatives in monosyllables; consonant, phonation, and f0 for stops in disyllables; and phonation for fricatives in disyllables. For a relatively small dataset with only a few predictors like ours, the CART analysis thus returned a more conservative estimate of which predictors are significant in the classification. Logistic regression and CART differ in that the former provides an estimate of the average effect of a predictor while accounting for other predictors, whereas the latter's hierarchical structure generally does not allow the net effect of a predictor to be estimated (Lemon et al., 2003). Without a priori assumptions about how our perception data would pattern, it is worthwhile to consider both analyses for a more comprehensive view of the data.

The perception results were generally consistent with the aggregate production results: the laryngeal contrast in question was primarily cued by f0 in the non-sandhi context, and the f0 cue was able to override conflicting cues in the consonant or vowel phonation; in the sandhi context, f0 became ineffective for stops and sonorants, but still had an effect on fricative classification. Different manners relied on different cues, and classification was most robust for fricatives. For stops in the sandhi context, the fact that the listeners were able to classify the natural tokens at a high rate indicates the relevance of the consonant cue, but the cue was not strong enough to override conflicting cues, if any, from f0 and phonation. For sonorants in this context, however, both the natural-token identification and the classification of all stimuli demonstrated that there was simply no reliable cue for the contrast.

It is worth noting that the coefficients in the LDA performed on the acoustic data are not directly comparable with the coefficients in the logistic regression analysis of the perception data, as they mean very different things in the two analyses (logistic regression was not used for the acoustic data due to convergence problems). Moreover, the predictors in the acoustic study were continuous, while those in the perception study were categorical. However, comparisons among the coefficients within each analysis consistently point to how the cues are implemented and perceptually weighted differently depending on manner and position, and to the importance of f0 cues in monosyllables and of consonant cues for fricatives in the second syllable of disyllables.

Both the production and perception results here clearly show that, at least for the younger speakers that we tested, the laryngeal contrast in question in Shanghai is primarily realized as a tone difference acoustically in the non-sandhi position, and listeners accordingly attend to the f0 cues in classifying the contrast in this position. However, the fact that the f0 difference over the vowel diminishes over time indicates that the voicing/voice quality property of the onset consonant contributes to the contrast. This is also consistent with the weakness of the contrast for sonorants (a known crosslinguistic tendency for laryngeal contrasts on consonants), which would be difficult to explain if the contrast were purely tonal. Taken together with the acoustic and perceptual results of voicing and f0 cues in tonal and non-tonal languages elsewhere, the findings are consistent with the position that the perceptual system is tuned to the distribution of cues in the particular language.

Our results also shed light on whether certain cues are inherently better perceptually for a contrast. For instance, there is some evidence that consonant voicing is a better cue on fricatives than on stops: in the non-initial position, although both stops and fricatives exhibited an acoustic voicing difference for the contrast, stop voicing did not seem to be a strong perceptual cue and was unable to override conflicting cues from the vowel (a finding also reported in Wang, 2011), whereas fricative voicing stood out as a cue for the listeners even when conflicting cues were present. This is potentially because the voicing contrast on fricatives is cued not only by consonant voicing, but also by the spectral peak and spectral moments provided by the frication noise (Jongman et al., 2000). It is also interesting to note that, as shown in the growth curve analysis, the voicing difference on fricatives is accompanied by a larger f0 difference than the voicing difference on stops in the second-syllable sandhi position, and the perception results showed that this f0 difference can be used by listeners. This indicates that the strength of one cue for a contrast may enhance another cue for the contrast realized elsewhere.

The presence of phonological tone sandhi had an interesting effect on the acoustic realization and perception of the laryngeal contrast in question. The intervocalic position is typically a prime position for laryngeal contrasts on consonants because of the transitional cues that vowels provide (Steriade, 1997), but this contrast in Shanghai is primarily cued by f0 in non-sandhi contexts, and the f0 cue can potentially be lost to tone sandhi in this position, making this a special case. The f0 result on the second syllable of disyllables indicates that the tonal difference concomitant with the voicing difference of the onset consonant was indeed neutralized, with fricative-onset syllables as marginal exceptions, but the contrast was only fully lost for the sonorants. For fricatives, there was a voicing and duration difference between the voiceless and voiced consonants, and the vowels also differed in the phonation and periodicity measures; perceptually, both consonant and f0 cues were able to override conflicting cues. For stops, the voiceless and voiced stops differed in closure voicing in this position; this voicing difference potentially led to the high d′ score for the classification of the natural tokens, but was ineffective when there were conflicting cues. The complexity of the situation indicates that there is more nuance to incomplete neutralization of a phonological contrast: the “neutralizing” context, e.g., the non-initial sandhi context, may need to be further divided up, in this case by manner of articulation of the onset consonant.
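Since d′ figures repeatedly in these perceptual comparisons, a minimal sketch of its computation may be helpful. The response counts below are hypothetical, not taken from our experiment:

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index: d' = z(hit rate) - z(false-alarm rate).

    Rates of exactly 0 or 1 are nudged inward by half a count (a common
    correction) so the inverse-normal transform stays finite.
    """
    z = NormalDist().inv_cdf
    n_signal = hits + misses
    n_noise = false_alarms + correct_rejections
    hit_rate = min(max(hits / n_signal, 0.5 / n_signal), 1 - 0.5 / n_signal)
    fa_rate = min(max(false_alarms / n_noise, 0.5 / n_noise), 1 - 0.5 / n_noise)
    return z(hit_rate) - z(fa_rate)

# Hypothetical counts: 36/40 "voiced" tokens identified as voiced,
# 6/40 "voiceless" tokens misidentified as voiced.
print(round(d_prime(36, 4, 6, 34), 2))  # prints 2.32
```

A d′ near zero, as for the sonorants in the sandhi position, indicates no discriminability, while values above 2, as in this invented example, indicate robust classification.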

The weakness of the voice quality contrast for sonorant consonants was evident in both the production and perception results. In the non-sandhi position, the contrast was cued by f0 on the following vowel, and there was a CPP difference on the consonant itself, but the CPP cue was so weak that it was not able to compete with conflicting cues in the perception experiment. In the sandhi position, the sonorants were the only manner that lost all acoustic cues reported here between the contrasting pair, and the perceptual results also showed that there was no discriminability between the modal and murmured sonorants in this position. These results, on the one hand, support the contention of Berkson (2016b) that phonation contrasts tend to be more weakly cued on sonorants than on obstruents, which potentially contributes to their typological rarity (see also Gao, 2015; Gao and Hallé, 2015); on the other hand, they also support the phonological theory of “licensing-by-cue” and its variations (Steriade, 1997, 2008), which contend that phonological contrasts are better licensed in contexts of better perceptibility and more susceptible to loss when the cues are endangered. The complete loss of the laryngeal contrast for sonorants in the sandhi position in Shanghai is a case in point. A caveat to the current results is that the acoustic and perceptual data both come from nasals; other sonorants, such as liquids, may behave differently, especially given that nasalization and a spread glottis share the acoustic consequences of an increased amplitude of the first harmonic and an increased bandwidth of the first formant (Keyser and Stevens, 2006), and have been shown to be perceptually confusable with each other (Klatt and Klatt, 1990).
However, the confusion in the source of an increased first harmonic reported in Klatt and Klatt (1990) was for a female voice whose first harmonic is close in frequency to the nasal pole, and Berkson (2013) showed in her study of breathiness in Marathi that only males cued breathiness with H1*-H2*, while females used CPP. This indicates that the confusion between nasalization and breathiness can potentially be avoided. Moreover, if the weakness of the phonation cues for sonorants were entirely due to the confusability between nasalization and breathiness, then the typological rarity of phonation contrasts on sonorants in general would remain unaccounted for.

Although the results of our perception study by and large match the results of the acoustic study in the aggregate, we are not in a position to make generalizations about how the production and perception of this laryngeal contrast in Shanghai are related at the level of individual speakers, as the subjects in the two experiments were two distinct sets. It is possible that individual subjects tune their perception to the aggregate input in their environment, but we do not exclude the possibility that individual subjects' perception is disproportionately biased by their own production. It must also be acknowledged that both our production and perception studies were conducted with relatively young speakers of Shanghai, and, as previously mentioned, it is possible that the voicing/voice quality contrast has undergone or is undergoing restructuring (see Gao, 2015; Gao and Hallé, 2016, 2017); investigating this possibility requires a design that incorporates sociolinguistic factors, which our study does not.

Finally, if we take this contrast to stem from a single distinctive feature, then this set of data lays out clearly the challenges of how the instantiation of this feature in a particular language can be acquired: the issue concerns not only the weighting and integration of multiple cues in potentially unsupervised learning, but also how this learning can overcome the contextual dependency of cue weighting, especially when phonological processes intervene. The current work does not provide an answer to such a difficult problem, but it does suggest that the learning of phonological contrast realization is likely guided by the morphophonological alternations in the language as well as the distributional properties of the acoustic dimensions along which the contrast manifests itself.

This paper presents a case study on how a phonological contrast is cued in multiple phonetic dimensions, both acoustically and perceptually. What is of particular interest is that the contrast in question—a laryngeal contrast in Shanghai Wu—is cued differently when realized on different manners (stops, fricatives, sonorants) and in different positions (non-sandhi, sandhi). Acoustic results showed that, although this contrast has been described as phonatory in earlier literature, its primary cue is in tone, at least in the younger speakers that were tested. In the non-sandhi position, phonation correlates only appear on fricative-onset syllables and sonorant consonants; stops and fricatives have consonant duration cues, and fricatives also have a frication voicing cue. In the sandhi position, tone sandhi neutralizes the f0 difference, but the contrast is maintained in fricatives by both consonant and vowel phonation cues, marginally maintained in stops by closure voicing, and lost in sonorants. The perception results were largely consistent with the aggregate acoustic results, indicating that speakers adjust the perceptual weights of individual cues for a contrast according to contexts. These findings support the position that phonological contrasts are formed by the integration of multiple cues in a language-specific, context-specific fashion and should be represented as such.

We are grateful to Dan Yuan and Zhongmin Chen for hosting us at Fudan University for data collection, Yifeng Li and Zhenzhen Xu for serving as our Shanghai consultants, Kelly Berkson, Christina Esposito, and Goun Lee for helping us with VoiceSauce, Mingxing Li for helping us with the linear discriminant analysis, and the University of Kansas General Research Fund No. 2 301 618 for financial support. We also thank the Associate Editor Megha Sundara and four anonymous reviewers for their many insightful comments, which helped improve both the content and the presentation of the paper. All remaining errors are our own.

1. We focus on the closure instead of the post-release portion of the stops here, as the previous literature on Shanghai has shown that the difference in release duration between voiceless unaspirated and voiced stops in either initial or medial position is minimal (Shen and Wang, 1995; Chen, 2011).

2. The authors reported H2-H1 and F1-H1. These were converted to H1-H2 and H1-F1, and F1 was changed, notationally, to A1, to be consistent with the rest of the paper.

3. For spectral measures, we focus on H1*-H2* for two reasons. First, although different spectral measures have been shown to be effective voice quality measures in different languages, H1-H2 is the most consistently used parameter in the literature and is found to be effective in the majority of languages with phonation contrasts. Gao (2015) and Gao and Hallé (2017) also found that H1-H2 was the most consistently used acoustic parameter for the laryngeal contrast in Shanghai by speakers of different age groups and genders and in different tonal contexts. Second, H1*-A1*, H1*-A2*, and H1*-A3* were also measured and analyzed for our study, and they did not reveal additional differences for the contrast in question not shown by H1*-H2*.

4. An anonymous reviewer asked whether the word pairs with a nasal coda behaved similarly in the phonation measures to those with open syllables, given the potential confusion between breathiness and nasality reported in the literature (Klatt and Klatt, 1990; Keyser and Stevens, 2006). We reran the growth curve analyses for H1*-H2* and CPP on the vowels for the stimuli without nasal codas, and the statistical patterns were identical to the ones reported here, except that for the CPP in sonorant onsets in the monosyllabic (no sandhi) context, the addition of the voicing intercept [χ2(1) = 4.451, p = 0.035] and the interaction between voicing and the linear time term [χ2(1) = 4.522, p = 0.033] both significantly improved the model, with the modal sonorants inducing a greater CPP and a slower CPP decrease on the following vowel.

5. See supplementary material at https://doi.org/10.1121/1.5052364 E-JASMAN-144-014809 for the acoustic files used in the perception experiment.

6. An anonymous reviewer raised the issue of whether any prosodic effects on the onset consonants (e.g., as documented in Chen, 2011) could have influenced the perception results. In the production, the speaker read all items in the same carrier sentence, effectively putting the items in a focus position. In the perception study, the listeners also heard the same carrier sentence and therefore performed the identification in the same focus position. Therefore, the entire study can be conceived of as an investigation of this laryngeal contrast in focus position.

7. Additional analyses that included random slopes by subject for each factor were also conducted for the different manners in the two positions. Models that included random slopes for all factors failed to converge. Models that included a subset of the random slopes were attempted as well, but no consistent random slope structure converged across all manners and positions. We therefore opted to report the models with a random intercept by subject only.

1. Abramson, A. S., and Lisker, L. (1985). "Relative power of cues: F0 shift versus voice timing," in Linguistic Phonetics, edited by V. Fromkin (Academic, New York), pp. 25–33.
2. Abramson, A. S., Nye, P. W., and Luangthongkum, T. (2007). "Voice register in Khmu': Experiments in production and perception," Phonetica 64, 80–104.
3. Andruski, J., and Ratliff, M. (2000). "Phonation types in production of phonological tone: The case of Green Mong," J. Int. Phon. Assoc. 30, 37–61.
4. Aoki, H. (1970). "A note on glottalized consonants," Phonetica 21, 65–75.
5. Baayen, R. H. (2008). Analyzing Linguistic Data—A Practical Introduction to Statistics Using R (Cambridge University Press, Cambridge, UK).
6. Barr, D. J., Levy, R., Scheepers, C., and Tily, H. J. (2013). "Random effects structure for confirmatory hypothesis testing: Keep it maximal," J. Mem. Lang. 68, 255–278.
7. Bates, D., Maechler, M., Bolker, B., and Walker, S. (2015). "Fitting linear mixed-effects models using lme4," J. Stat. Software 67, 1–48.
8. Berkson, K. (2013). "Phonation types in Marathi: An acoustic investigation," Ph.D. dissertation, University of Kansas.
9. Berkson, K. (2016a). "Durational properties of Marathi obstruents," Indian Linguist. 76(3–4), 7–25.
10. Berkson, K. (2016b). "Production, perception, and distribution of breathy sonorants in Marathi," in Formal Approaches to South Asian Languages, Vol. 2, edited by M. Menon and S. Syed (Open Journal Systems 2.4.6.0, University of Konstanz, Konstanz, Germany), pp. 4–14.
11. Blankenship, B. (2002). "The timing of nonmodal phonation in vowels," J. Phonetics 30, 163–191.
12. Boersma, P., and Weenink, D. (2012). "Praat: Doing phonetics by computer (version 5.3.14) [computer program]," http://www.praat.org/ (Last viewed 5 July 2012).
13. Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees (Wadsworth, Belmont, CA).
14. Brunelle, M. (2009). "Tone perception in Northern and Southern Vietnamese," J. Phonetics 37, 79–96.
15. Brunelle, M. (2012). "Dialect experience and perceptual integrality in phonological registers: Fundamental frequency, voice quality and the first formant in Cham," J. Acoust. Soc. Am. 131, 3088–3102.
16. Cao, J.-F., and Maddieson, I. (1992). "An exploration of phonation types in Wu dialects of Chinese," J. Phonetics 20, 77–92.
17. Chao, Y.-R. (1967). "Contrastive aspects of the Wu dialects," Language 43, 92–101.
18. Chen, M. Y. (1970). "Vowel length variation as a function of the voicing of the consonant environment," Phonetica 22, 129–159.
19. Chen, Y.-Y. (2011). "How does phonology guide phonetics in segment-f0 interaction?," J. Phonetics 39, 612–625.
20. Chomsky, N., and Halle, M. (1968). The Sound Pattern of English (Harper and Row, New York).
21. Clayards, M., Tanenhaus, M. K., Aslin, R. N., and Jacobs, R. A. (2008). "Perception of speech reflects optimal use of probabilistic speech cues," Cognition 108, 804–809.
22. Clements, G. N. (2009). "The role of features in phonological inventories," in Contemporary Views on Architecture and Representations in Phonology, edited by E. Raimy and C. Cairns (MIT Press, Cambridge, MA), pp. 19–68.
23. Davis, K. (1994). "Stop voicing in Hindi," J. Phonetics 22, 177–193.
24. de Krom, G. (1993). "A cepstrum-based technique for determining a harmonics-to-noise ratio in speech signals," J. Speech Hear. Res. 36, 224–266.
25. DiCanio, C. (2014). "Cue weight in the perception of Trique glottal consonants," J. Acoust. Soc. Am. 135, 884–895.
26. Dinnsen, D. A., and Charles-Luce, J. (1984). "Phonological neutralization, phonetic implementation, and individual differences," J. Phonetics 12, 49–60.
27. Dmitrieva, O., Jongman, A., and Sereno, J. (2010). "Phonological neutralization by native and non-native speakers: The case of Russian final devoicing," J. Phonetics 38, 483–492.
28. Dupoux, E., Kakehi, K., Hirose, Y., Pallier, C., and Mehler, J. (1999). "Epenthetic vowels in Japanese: A perceptual illusion?," J. Exp. Psych.: Human Percept. Perform. 25, 1568–1578.
29. Dutta, I. (2009). Acoustics of Stop Consonants in Hindi: Voicing, Fundamental Frequency and Spectral Intensity (Verlag Dr. Müller, Saarbrücken, Germany).
30. Esposito, C. M. (2010a). "Variation in contrastive phonation in Santa Ana Del Valle Zapotec," J. Int. Phonetic Assoc. 40, 181–198.
31. Esposito, C. M. (2010b). "The effects of linguistic experience on the perception of phonation," J. Phonetics 38, 306–316.
32. Esposito, C. M. (2012). "An acoustic and electroglottographic study of White Hmong tone and phonation," J. Phonetics 40, 466–476.
33. Flege, J. E., and Wang, C. (1989). "Native-language phonotactic constraints affect how well Chinese subjects perceive the word-final /t/-/d/ contrast," J. Phonetics 17, 299–315.
34. Francis, A. L., Kaganovich, N., and Driscoll-Huber, C. (2008). "Cue-specific effects of categorization training on the relative weighting of acoustic cues to consonant voicing in English," J. Acoust. Soc. Am. 124, 1234–1251.
35. Gao, J.-Y. (2015). "Interdependence between tones, segments and phonation types in Shanghai Chinese: Acoustics, articulation, perception and evolution," Ph.D. dissertation, Université Sorbonne Nouvelle–Paris III, Paris, France.
36. Gao, J.-Y., and Hallé, P. (2013). "Duration as a secondary cue for perception of voicing and tone in Shanghai Chinese," in Proc. of Interspeech 14, Lyon, France, pp. 3157–3162.
37. Gao, J.-Y., and Hallé, P. (2015). "The role of voice quality in Shanghai tone perception," in Proc. of ICPhS 18, Glasgow, Scotland, UK, paper no. 448.
38. Gao, J.-Y., and Hallé, P. (2016). "Sociolinguistic motivations in sound change: On-going loss of low tone breathy voice in Shanghai Chinese," Papers Hist. Phonology 1, 166–186.
39. Gao, J.-Y., and Hallé, P. (2017). "Phonetic and phonological properties of tones in Shanghai Chinese," Cahiers de Linguistique Asie Orientale 46, 1–31.
40. Garellek, M., and Keating, P. (2011). "The acoustic consequences of phonation and tone interactions in Jalapa Mazatec," J. Int. Phonetic Assoc. 41, 185–205.
41. Garellek, M., Keating, P., Esposito, C. M., and Kreiman, J. (2013). "Voice quality and tone identification in White Hmong," J. Acoust. Soc. Am. 133, 1078–1089.
42. Gordon, M., and Ladefoged, P. (2001). "Phonation types: A crosslinguistic overview," J. Phonetics 29, 383–406.
43. Halle, M., and Stevens, K. (1971). "A note on laryngeal features," Q. Progress Rep. Res. Lab. Electron. (MIT) 101, 198–213.
44. Hallé, P., and Best, C. (2007). "Dental-to-velar perceptual assimilation: A cross-linguistic study of the perception of dental stop+/l/ clusters," J. Acoust. Soc. Am. 121, 2899–2914.
45. Hanson, H. M., Stevens, K. N., Kuo, H.-K. J., Chen, M. Y., and Slifka, J. (2001). "Towards models of phonation," J. Phonetics 29, 451–480.
46. Hillenbrand, J. M., Cleveland, R. A., and Erickson, R. L. (1994). "Acoustic correlates of breathy vocal quality," J. Speech Hear. Res. 37, 769–778.
47. Holmberg, E., Hillman, R., Perkell, J., Guiod, P., and Goldman, S. (1995). "Comparisons among aerodynamic, electroglottographic, and acoustic spectral measures of female voice," J. Speech Hear. Res. 38, 1212–1223.
48. Holt, L. L., Lotto, A. J., and Kluender, K. R. (2001). "Influence of fundamental frequency on stop-consonant voicing perception: A case of learned covariation or auditory enhancement," J. Acoust. Soc. Am. 109, 764–774.
49. Huffman, M. K. (1987). "Measures of phonation in Hmong," J. Acoust. Soc. Am. 81, 495–504.
50. Jakobson, R., Fant, G., and Halle, M. (1952). Preliminaries to Speech Analysis (MIT Press, Cambridge, MA).
51. Jongman, A., Wayland, R., and Wong, S. (2000). "Acoustic characteristics of English fricatives," J. Acoust. Soc. Am. 108, 1252–1263.
52. Keyser, S. J., and Stevens, K. N. (2006). "Enhancement and overlap in the speech chain," Language 82, 33–63.
53. Khan, S. D. (2012). "The phonetics of contrastive phonation in Gujarati," J. Phonetics 40, 780–795.
54. Kim, H., and Jongman, A. (1996). "Acoustic and perceptual evidence for complete neutralization of manner of articulation in Korean," J. Phonetics 24, 295–312.
55. Kingston, J. (1992). "The phonetics and phonology of perceptually motivated articulatory covariation," Lang. Speech 35, 99–113.
56. Kingston, J., Diehl, R. L., Kirk, C. J., and Castleman, W. A. (2008). "On the internal perceptual structure of distinctive features: The [voice] contrast," J. Phonetics 36, 28–54.
57. Klatt, D. H., and Klatt, L. C. (1990). "Analysis, synthesis and perception of voice quality variations among male and female talkers," J. Acoust. Soc. Am. 87, 820–856.
58. Kuznetsova, A., Brockhoff, B., and Christensen, H. (2016). "Tests in linear mixed effects models," available at https://cran.r-project.org/web/packages/lmerTest/index.html (Last viewed August 3, 2018).
59. Laver, J. (1980). The Phonetic Description of Voice Quality (Cambridge University Press, Cambridge, UK).
60. Lemon, S. C., Roy, J., Clark, M. A., Friedmann, P. D., and Rakowski, W. (2003). "Classification and regression tree analysis in public health: Methodological review and comparison with logistic regression," Ann. Behav. Med. 36, 172–180.
61. Lisker, L. (1986). "'Voicing' in English: A catalogue of acoustic features signalling /b/ versus /p/ in trochees," Lang. Speech 29, 3–11.
62. Llanos, F., Dmitrieva, O., Shultz, A., and Francis, A. L. (2013). "Auditory enhancement and second language experience in Spanish and English weighting of secondary voicing cues," J. Acoust. Soc. Am. 134, 2213–2224.
63. Massaro, D. W. (1987). "Psychophysics versus specialized processes in speech perception: An alternative perspective," in The Psychophysics of Speech Perception, edited by M. E. H. Schouten (Martinus Nijhoff, Boston), pp. 46–65.
64. Massaro, D., and Cohen, M. (1983). "Phonological context in speech perception," Percept. Psychophys. 34, 338–348.
65. McMurray, B., Cole, J. S., and Munson, C. (2011). "Features as an emergent product of computing perceptual cues relative to expectations," in Where Do Phonological Features Come From?: Cognitive, Physical and Developmental Bases of Distinctive Speech Categories, edited by G. N. Clements and R. Ridouane (John Benjamins, Amsterdam/Philadelphia), pp. 197–235.
66. Mikuteit, S., and Reetz, H. (2007). "Caught in the ACT: The timing of aspiration and voicing in Bengali," Lang. Speech 50, 247–277.
67. Miller, A. L. (2007). "Guttural vowels and guttural co-articulation in Ju|'hoansi," J. Phonetics 35, 56–84.
68. Mirman, D. (2014). Growth Curve Analysis and Visualization Using R (CRC Press, Boca Raton, FL).
69. Newman, R. S. (2003). "Using links between speech perception and speech production to evaluate different acoustic metrics: A preliminary report," J. Acoust. Soc. Am. 113, 2850–2860.
70. Parker, E. M., Diehl, R. L., and Kluender, K. R. (1986). "Trading relations in speech and nonspeech," Percept. Psychophys. 39, 129–142.
71. Port, R., and Crawford, P. (1989). "Incomplete neutralization and pragmatics in German," J. Phonetics 17, 257–282.
72. R Core Team (2014). "R: A language and environment for statistical computing (version 3.1.0)," (R Foundation for Statistical Computing, Vienna), available at http://www.R-project.org/ (Last viewed October 10, 2017).
73. Raphael, L. J. (1972). "Preceding vowel duration as a cue to the perception of the voicing characteristic of word-final consonants in American English," J. Acoust. Soc. Am. 51, 1296–1303.
74. Ren, N.-Q. (1992). "Phonation types and stop consonant distinctions: Shanghai Chinese," Ph.D. dissertation, University of Connecticut, Storrs.
75. Repp, B. H. (1983). "Trading relations among acoustic cues in speech perception are largely a result of phonetic categorization," Speech Commun. 2, 341–361.
76. Shen, Z.-W., and Wang, W. S. (1995). "Wuyu zhuoseyin de yanjiu—Tongji shang de fenxi he lilun shang de kaolü" ("A study of voiced stops in the Wu dialects—Statistical analysis and theoretical considerations"), in Wuyu Yanjiu (Studies of the Wu Dialects), edited by E. Zee (New Asia Books, Hong Kong), pp. 219–238.
77. Shue, Y.-L., Keating, P., Vicenik, C., and Yu, K. (2011). "VoiceSauce: A program for voice analysis," available at http://www.ee.ucla.edu/~spapl/voicesauce/ (Last viewed November 1, 2015).
78. Shultz, A. A., Francis, A. L., and Llanos, F. (2012). "Differential cue weighting in perception and production of consonant voicing," J. Acoust. Soc. Am. 132, EL95–EL101.
79. Sjölander, K. (2004). "The Snack Sound Toolkit," available at http://www.speech.kth.se/snack/ (Last viewed March 2, 2018).
80. Steriade, D. (1997). "Phonetics in phonology: The case of laryngeal neutralization," UCLA Work. Pap. Phonetics 3, 25–146.
81. Steriade, D. (2008). "The phonology of perceptibility effects: The P-map and its consequences for constraint organization," in The Nature of the Word: Essays in Honor of Paul Kiparsky, edited by K. Hanson and S. Inkelas (MIT Press, Cambridge, MA), pp. 151–180.
82. Stevens, K. N. (1977). "Physics of laryngeal behavior and larynx modes," Phonetica 34, 264–279.
83. Stevens, K. N. (2002). "Toward a model for lexical access based on acoustic landmarks and distinctive features," J. Acoust. Soc. Am. 111, 1872–1891.
84. Stevens, K. N., and Keyser, S. J. (2010). "Quantal theory, enhancement, and overlap," J. Phonetics 38, 10–19.
85. Toscano, J. C., and McMurray, B. (2010). "Cue integration with categories: Weighting acoustic cues in speech using unsupervised learning and distributional statistics," Cogn. Sci. 34, 434–464.
86. Traill, A., and Jackson, M. (1988). "Speaker variation and phonation type in Tsonga nasals," J. Phonetics 16, 385–400.
87. Venables, W. N., and Ripley, B. D. (2002). Modern Applied Statistics with S, 4th ed. (Springer, New York).
88. Wang, Y.-Z. (2011). "Acoustic measurements and perceptual studies on initial stops in Wu dialects—Take Shanghainese for example," Ph.D. dissertation, Zhejiang University, China.
89. Warner, N., Jongman, A., Sereno, J., and Kemper, R. (2004). "Incomplete neutralization of sub-phonemic durational differences in production and perception of Dutch," J. Phonetics 32, 251–276.
90. Wayland, R., and Jongman, A. (2003). "Acoustic correlates of breathy and clear vowels: The case of Khmer," J. Phonetics 31, 181–201.
91. Weihs, C., Ligges, U., Luebke, K., and Raabe, N. (2005). "klaR analyzing German business cycles," in Data Analysis and Decision Support, edited by D. Baier, R. Decker, and L. Schmidt-Thieme (Springer, Berlin), pp. 335–343.
92. Xu, B.-H., and Tang, Z.-Z. (1988). Shanghai Shiqu Fangyan Zhi (A Description of the Urban Shanghai Dialect) (Shanghai Educational Press, Shanghai).
93. Xu, Y. (2005–2013). "ProsodyPro.praat," available at http://www.phon.ucl.ac.uk/home/yi/ProsodyPro/ (Last viewed November 1, 2015).
94. Zee, E., and Maddieson, I. (1980). "Tones and tone sandhi in Shanghai: Phonetic evidence and phonological analysis," Glossa 14, 45–88.
95. Zhu, X.-N. (1999). Shanghai Tonetics (Lincom Europa, München).
96. Zhu, X.-N. (2006). A Grammar of Shanghai Wu (Lincom Europa, München).

Supplementary Material