A listener's judgement on the perceptual distance between two sounds usually draws on their psychoacoustic difference, but can also be subject to L1-specific perception. This study investigates the interplay between these two aspects when evaluating the perceptual distance of sound pairs. Mandarin and English listeners rated the perceptual distance of consonant-vowel pairs corresponding to sequences legal or illegal in their L1s. The results suggested that a similarity rating task can provide a finer assessment of distinctiveness between sound pairs as compared to a discrimination task. The results also showed how psychoacoustic perception may interact with L1-specific perception in this process.

The auditory perception of speech sounds has been shown to involve multiple stages (Pisoni, 1973; Pisoni and Tash, 1974), from the processing of low-level acoustic information to the mapping of sound categories in one's native language (Kuhl et al., 1992; Flege et al., 1996) and the recognition of sounds in a sequence, which can be subject to the influence of lexical items in a language (Ganong, 1980). When a listener evaluates the perceptual distance between two sounds, the judgement may draw on their psychoacoustic difference at the earlier stage; if the listener is not under high time pressure, so that L1-specific perception has the chance to take effect, the assessment may also be influenced by sound categories and sound combinations in the listener's native language (Boomershine et al., 2008; Babel and Johnson, 2010).

The perceptual distinction of speech sounds has been extensively examined through various experimental methods, e.g., similarity rating (Greenberg and Jenkins, 1964; Mohr and Wang, 1968), AX discrimination (Pisoni and Tash, 1974; Babel and Johnson, 2010), free classification (Clopper, 2008; Clopper and Bradlow, 2009), and brain imaging (Näätänen, 2001; Näätänen and Kreegipuu, 2009; Fong et al., 2014). Among these methods, two are most frequently used: one is similarity rating (e.g., Greenberg and Jenkins, 1964; Mohr and Wang, 1968), in which listeners rate the magnitude of dissimilarity between two sounds on a scale; the other is AX discrimination (e.g., Pisoni, 1973; Pisoni and Tash, 1974), in which listeners decide as quickly and accurately as possible whether two sounds are the same or different, with the listeners' response time assumed to be negatively correlated with the distance between the two sounds. As an offline task, similarity rating usually focuses on the outcome of the participants' judgment, e.g., their assessment of the similarity of sound pairs, whereas AX discrimination, as an online task, can also evaluate the participants' behavior when performing the task, e.g., their speed in making decisions on sound pairs. Extensive studies have shown that an offline similarity rating task is subject to L1-specific perception (Best et al., 2001; Boomershine et al., 2008). For example, the same two phonetic sounds could be judged as more similar if they correspond to allophones in the listener's L1 but as more distinct if they correspond to separate phonemes. Similarly, an online AX discrimination task can also be influenced by L1-specific perception. In the Boomershine et al. (2008) study, the results of AX discrimination with a short inter-stimulus interval (ISI) of 100 ms showed the same influence from English and Spanish allophony as in similarity rating.

The relative psychoacoustic distance between a sound pair is shown to be accessible in a speeded-AX discrimination (Babel and Johnson, 2010; Johnson and Babel, 2010), the settings of which involve a short ISI such as 100 ms and a high time pressure, e.g., encouraging the listeners to respond within 500 ms to reduce the chance for L1-specific perception to take effect. With these settings, Babel and Johnson (2010) observed that English and Dutch listeners behaved the same in discriminating fricatives when their native phonologies differ in the phonemic inventories of the fricatives. Similarly, Li and Zhang (2017) observed no significant difference between English and Mandarin listeners when they discriminated sibilant place contrasts in different vowel contexts even though the sibilant systems are different between the two languages (English has an alveolar vs post-alveolar place contrast in all vowel contexts, while Mandarin has a three-way contrast among dentals/alveolars, palatals, and retroflexes, with vowel allophony in high front vowels after the sibilants of different places). Their stimuli were consonant-vowel (CV) pairs that minimally contrast in alveolar vs palatal sibilant onsets (e.g., [s] vs [ɕ]) in the vowel contexts [_i] [_a] [_ou], plus an allophonic context [_ˌɹ/i], in which a sibilant precedes a homorganic vocalic segment (e.g., [sˌɹ] vs [ɕi]). A short ISI of 100 ms was used and the listeners were encouraged to respond within 500 ms. With a longer response time corresponding to less perceptual distinction, their results indicated a reduced perceptual distinctiveness of sibilant place contrast in the [_i] context than in the other contexts, with no difference among the contexts [_a] [_ou] [_ˌɹ/i], i.e., [_i] < [_a], [_ou], [_ˌɹ/i] in terms of perceptual distinctiveness. 
This pattern held across American English and Mandarin listeners, which indicated that speeded-AX discrimination has obscured the influence from the two groups of listeners' native languages despite the difference in their L1 phonologies. As for the psychoacoustic basis of the vowel difference, Li and Zhang (2017) showed that the acoustic difference of the sibilants in a [si-ɕi] pair, for example, is larger than that in a [sa-ɕa] pair; however, the formant transition (F2) difference in a [si-ɕi] pair is smaller than that in a [sa-ɕa] pair. Thus, the reduced distinctiveness in the [_i] context indicated that the smaller formant transition difference, e.g., in [si-ɕi], has overridden the larger acoustic difference of alveolar vs palatal sibilants, e.g., as compared with [sa-ɕa], when the listeners evaluate the distinctiveness of CV pairs.

While a speeded-AX discrimination task has been shown in the literature to access the psychoacoustic distance between sounds independently of a listener's native language, it remains unclear whether similarity rating can reveal the psychoacoustic distance of sounds more effectively than speeded-AX discrimination. This study examines whether and how the joint influence of psychoacoustic perception and L1-specific perception can be assessed in a similarity rating task, i.e., how the two may interact when a listener evaluates the perceptual distance between a sound pair, which may further inform us of the difference between a similarity rating task and an AX discrimination task. In addition, we explore whether similarity rating can access certain aspects of the psychoacoustic distance between sound pairs that AX discrimination may not uncover.

There is an extensive literature that has examined consonant pairs in the same vowel context (Greenberg and Jenkins, 1964; Mohr and Wang, 1968; Boomershine et al., 2008; Babel and Johnson, 2010; Johnson and Babel, 2010). This study focuses on the perceptual distinctiveness of sibilant place contrasts across different vowel contexts. In particular, a similarity rating task was conducted using the sound pairs in the speeded-AX discrimination experiment in Li and Zhang (2017).

The stimuli of the experiment are listed in Table 1. They involve two sibilant pairs, the fricatives [s-ɕ] and the aspirated affricates [tsh-tɕh], in three vowel contexts: the allophonic [_ˌɹ/i] (i.e., [s] or [tsh] before a homorganic syllabic approximant [ˌɹ]; [ɕ] or [tɕh] before [i]), [_a], and [_i]. The stimulus syllables were produced by a male native speaker of Mandarin Chinese, in which [sˌɹ, ɕi, tshˌɹ, tɕhi] and [sa, ɕa, tsha, tɕha] are legal; he was also a trained performer of Peking Opera, in which [si, tshi] are legal.1

Table 1.

Stimulus pairs for the perceptual experiment.

                             a. [_ˌɹ/i]     b. [_a]      c. [_i]
Fricative onsets             sˌɹ-ɕi         sa-ɕa        si-ɕi
Aspirated affricate onsets   tshˌɹ-tɕhi     tsha-tɕha    tshi-tɕhi

Six tokens were recorded for each syllable, and the one whose acoustic properties were closest to the average of the six was selected and manipulated. First, the durations of the sibilants were manipulated to match their mean values in natural speech (Feng, 1985), with [s, ɕ] as 125 ms and [tsh, tɕh] as 100 ms. Second, in the vocalic portions, the consonant-vowel transition was normalized to 50 ms (Delattre et al., 1955) and the steady vowel to 70 ms, i.e., a total of 120 ms, also to be close to the mean duration values in natural speech (Feng, 1985). Third, a level f0 at 200 Hz was superimposed onto the vocalic parts, whose intensities were leveled to 70 dB.

Within the pairs with the onsets [s, ɕ], the ISI was set as 100 ms to facilitate responses referring to the psychoacoustic difference between two sounds (Pisoni, 1973; Werker and Logan, 1985; Johnson and Babel, 2010); within those with [tsh, tɕh], an additional 50 ms was added to compensate for the oral closure in the affricates. For each pair in Table 1, two different pairs are formed, e.g., [si-ɕi] [ɕi-si], as well as two identical ones, e.g., [si-si] [ɕi-ɕi].
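The pairing scheme described above can be sketched as follows. This is a minimal illustration, not the script used in the experiment: the dictionary keys and field names are hypothetical labels, and the 150 ms ISI for affricate pairs assumes that the additional 50 ms was added to the 100 ms ISI.

```python
# Syllable pairs from Table 1, keyed by (onset type, vowel context).
# Keys and field names are illustrative labels, not taken from the study.
TABLE_1 = {
    ("fricative", "allophonic"): ("sˌɹ", "ɕi"),
    ("fricative", "a"): ("sa", "ɕa"),
    ("fricative", "i"): ("si", "ɕi"),
    ("affricate", "allophonic"): ("tshˌɹ", "tɕhi"),
    ("affricate", "a"): ("tsha", "tɕha"),
    ("affricate", "i"): ("tshi", "tɕhi"),
}

# ISI in ms: 100 for fricative pairs; 100 + 50 for affricate pairs (assumed reading).
ISI_MS = {"fricative": 100, "affricate": 150}


def build_trials(table):
    """List the two different orders and the two identical pairs for each cell."""
    trials = []
    for (onset, vowel), (a, b) in table.items():
        for first, second in [(a, b), (b, a), (a, a), (b, b)]:
            trials.append({
                "onset": onset,
                "vowel": vowel,
                "first": first,
                "second": second,
                "same": first == second,
                "isi_ms": ISI_MS[onset],
            })
    return trials


trials = build_trials(TABLE_1)
different = [t for t in trials if not t["same"]]
# Identical pairs such as [ɕi-ɕi] arise from two cells, so count distinct ones.
identical = {(t["first"], t["second"]) for t in trials if t["same"]}
print(len(different), "different ordered pairs")
print(len(identical), "distinct identical pairs")
```

Note that [ɕi] and [tɕhi] each occur in two cells of Table 1, so the identical pairs built from them coincide; whether such repeats were presented once or twice is not stated in the text.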

The participants included 30 native American English listeners, who had no previous exposure to Mandarin or other Chinese dialects, and 30 native Mandarin listeners, who were born to Mandarin-speaking parents and learned English as a second language.

The perceptual experiment was programmed in Paradigm (Perception Research Systems, 2007). The participants were told that they would hear sound pairs from a new language. After hearing a pair of CV syllables, they rated the perceptual distinctiveness of the pair on a 5-point scale, which had no labels except for “extremely similar” and “totally different” at the two ends. After hearing a pair, a listener had 5 s to make a judgment by clicking on the scale (Babel and Johnson, 2010). The 5 s threshold was intended to exclude slow responses, for which a participant's memory of the sounds may have decayed. A practice session with stimulus pairs not tested in the main experiment preceded the main experiment. The listeners were encouraged to make use of the full scale. Each stimulus pair was repeated five times, giving ten responses for each CV pair in Table 1 (two orders × five repetitions) per listener.

The participants' ratings were coded with “extremely similar” as 1, “totally different” as 5, and the other choices as 2, 3, and 4 in between, so that a higher value indicates a larger perceptual distance. In studies such as Boomershine et al. (2008), normalized ratings within each participant were used to compensate for different use of the 5-point scale. In our study, the stimuli included identical pairs such as [si-si] and different pairs such as [sˌɹ-ɕi], which involved differences in both consonants and vowels. These pairs were expected to correspond respectively to 1 and 5 on the scale, thus providing two reference points for the perceptual similarity of other pairs. Therefore, participants' raw ratings were analyzed.
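For readers unfamiliar with within-participant normalization, the sketch below shows one common form, z-scoring each listener's ratings. It is offered only as an illustration of what such normalization does; that the cited study used exactly this form is an assumption, the ratings are made-up values, and the present study analyzed raw ratings instead.

```python
from statistics import mean, pstdev


def zscore_within_participant(ratings):
    """Map each participant's raw 1-5 ratings to z-scores (hypothetical helper).

    ratings: dict of participant id -> list of raw ratings.
    """
    normalized = {}
    for pid, values in ratings.items():
        m = mean(values)
        sd = pstdev(values)
        # A participant who gives one constant rating has sd == 0; return zeros.
        normalized[pid] = [(v - m) / sd if sd > 0 else 0.0 for v in values]
    return normalized


# Made-up example: listener A uses the full scale, listener B only the middle.
raw = {"A": [1, 3, 5], "B": [2, 3, 4]}
z = zscore_within_participant(raw)
for pid, values in z.items():
    print(pid, [round(v, 3) for v in values])
```

After z-scoring, the two hypothetical listeners have identical profiles even though they used different parts of the scale, which is the compensation that normalization provides and that the two reference points are meant to make unnecessary here.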

For the different pairs, the 30 American English and 30 Mandarin listeners gave a total of 3600 responses, ten of which were invalid as the participant did not respond within 5 s. Based on the remaining responses, Fig. 1 presents the ratings of the six pairs separately for American English listeners (upper) and Mandarin listeners (lower), with the mean ratings of each pair.
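The response count follows directly from the design, as the short check below illustrates. The variable names are illustrative; the number of valid responses is simply the stated total minus the ten invalid ones.

```python
# Design parameters as stated in the text.
listeners = 30 + 30          # American English + Mandarin
pairs = 6                    # different pairs in Table 1
orders = 2                   # e.g., [si-ɕi] and [ɕi-si]
repetitions = 5

responses_per_pair_per_listener = orders * repetitions
total_responses = listeners * pairs * responses_per_pair_per_listener
invalid = 10                 # no response within 5 s
valid_responses = total_responses - invalid

print(responses_per_pair_per_listener, total_responses, valid_responses)
```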

Fig. 1.

Ratings of the stimulus pairs by American English listeners (upper) and Mandarin listeners (lower). A higher value indicates greater distinctiveness. Mean ratings are marked for each pair, with error bars around the means.

The rating data were analyzed by Linear Mixed Effects models using the lmer function in the R package lme4 (Bates et al., 2015), and the p-values were determined using the R package lmerTest (Kuznetsova et al., 2017). The dependent variable was the rating (1–5); the fixed factors were Vowel ([_i], [_a], and [_ˌɹ/i]), Onset ([s-ɕ], [tsh-tɕh]), and NativeLg (English vs Mandarin); the random factors included (1|Participant), (1|Participant:Onset), and (1|Participant:Vowel). A series of model comparisons arrived at the final model that includes the three-way interaction of the predicting variables, i.e., Similarity ∼ Onset * Vowel * NativeLg + (1|Participant) + (1|Participant:Onset) + (1|Participant:Vowel). The fixed effects of this model are provided in the Appendix. With the significant three-way interaction, the data of the American English listeners and Mandarin listeners were analyzed separately.
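Written out, the final model formula corresponds to the following structure. The notation is ours and is only a sketch of the standard reading of the lmer formula, with normally distributed random intercepts as the usual assumption: p indexes participants, j the onset pair, k the vowel context, r the individual response, and ℓ(p) the native language of participant p.

```latex
% Requires amsmath. Symbols are illustrative notation, not taken from the paper.
\begin{align*}
y_{pjkr} &= \mu_{jk\ell(p)} + a_{p} + b_{pj} + c_{pk} + \varepsilon_{pjkr},\\
a_{p} &\sim \mathcal{N}(0,\sigma_a^2),\quad
b_{pj} \sim \mathcal{N}(0,\sigma_b^2),\quad
c_{pk} \sim \mathcal{N}(0,\sigma_c^2),\quad
\varepsilon_{pjkr} \sim \mathcal{N}(0,\sigma^2).
\end{align*}
```

Here μ is the fixed-effect mean of each Onset × Vowel × NativeLg cell. The full factorial has 2 × 3 × 2 = 12 cells, which matches the 12 fixed-effect coefficients of the model, while a, b, and c are the random intercepts for (1|Participant), (1|Participant:Onset), and (1|Participant:Vowel), respectively.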

The two groups of listeners showed different patterns of vowel contexts. As shown in Table 2(a), the American English listeners gave a three-way difference of the vowel contexts, corresponding to a decreasing order of distinctiveness in Fig. 1: [_ˌɹ/i] (means = 4.32 and 4.36, respectively, for [sˌɹ-ɕi] and [tshˌɹ-tɕhi]) > [_a] (means = 3.85 and 3.86) > [_i] (3.69 and 3.34). As shown in Table 2(b), the Mandarin listeners gave a two-way difference of [_i] vs [_a] and [_ˌɹ/i] with no significant difference between the latter two, corresponding to a decreasing order of distinctiveness in Fig. 1: [_ˌɹ/i] (means = 4.70 and 4.72) and [_a] (4.74 and 4.81) > [_i] (4.20 and 2.84).

Table 2.

Vowel differences in the mixed-effect linear regressions of English and Mandarin listeners' ratings: t values, with p values in parentheses. Asterisks mark significance.

                                 [_a]                [_i]
(a) American English listeners
[_ˌɹ/i]                          4.28 (<0.001***)    7.28 (<0.001***)
[_a]                                                 3.00 (0.004**)
(b) Mandarin listeners
[_ˌɹ/i]                          0.64 (0.52)         11.37 (<0.001***)
[_a]                                                 12.01 (<0.001***)

A significant interaction of Onset * Vowel was observed for both English listeners (χ² = 17.7, df = 2, p < 0.001) and Mandarin listeners (χ² = 251, df = 2, p < 0.001). Follow-up analyses showed that, for both groups, the interaction was rooted exclusively in a significant difference between [si-ɕi] and [tshi-tɕhi], with no difference between [s-ɕ] and [tsh-tɕh] in the other two vowel contexts [_ˌɹ/i] and [_a]. The two groups' ratings of [si-ɕi] and [tshi-tɕhi] were then submitted to Linear Mixed Effects models, with Onset ([s-ɕ], [tsh-tɕh]) and NativeLg (English vs Mandarin) as the predicting variables and the same random factors as in the earlier analyses. There was a significant difference between [si-ɕi] and [tshi-tɕhi] (χ² = 31, df = 1, p < 0.001), by which [si-ɕi] had a higher rating (i.e., was more distinctive) than [tshi-tɕhi], as well as a significant interaction of Onset * NativeLg (χ² = 16.6, df = 1, p < 0.001), by which Mandarin listeners gave a larger difference between [si-ɕi] and [tshi-tɕhi] (means = 4.20 vs 2.84) than English listeners (means = 3.69 vs 3.34), as shown in Fig. 1.
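As a quick consistency check, the reported test statistics can be converted to upper-tail probabilities. The sketch below assumes the statistics are referred to a chi-square distribution with the stated degrees of freedom and uses the closed forms available for df = 1 and df = 2; the function name and the labels are illustrative.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square statistic (df = 1 or 2 only)."""
    if df == 1:
        # P(Z^2 > x) for a standard normal Z
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        # Chi-square with 2 df is exponential with mean 2
        return math.exp(-x / 2.0)
    raise ValueError("this sketch handles only df = 1 or df = 2")


# Statistics as reported in the text.
reported = [
    ("Onset * Vowel, English listeners", 17.7, 2),
    ("Onset * Vowel, Mandarin listeners", 251.0, 2),
    ("Onset, [_i] pairs", 31.0, 1),
    ("Onset * NativeLg, [_i] pairs", 16.6, 1),
]

p_values = {label: chi2_sf(stat, df) for label, stat, df in reported}
for label, p in p_values.items():
    print(f"{label}: p = {p:.2e}")
```

Under this assumption, all four probabilities fall below 0.001, in line with the reported significance levels.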

To sum up, the two groups' responses were similar in that (i) [_i] was the least distinct context and (ii) [si-ɕi] was more distinct than [tshi-tɕhi]. They differed in that (iii) there was a three-way vowel effect [_ˌɹ/i] > [_a] > [_i] for American English listeners, but only a two-way vowel effect [_ˌɹ/i], [_a] > [_i] for Mandarin listeners, and (iv) there was a larger distance in [si-ɕi] than [tshi-tɕhi] for Mandarin listeners as compared with English listeners.

The English listeners' lowest ratings in the [_i] context are consistent with the results of the speeded-AX discrimination by Li and Zhang (2017) and are psychoacoustic in nature. For the reduced distinctiveness of sibilant place contrast in the [_i] context, Li and Zhang (2017) argued that it is rooted in the psychoacoustic properties of the sound pairs, in which a smaller formant transition difference overrides a larger sibilant difference, instead of the listeners' L1-specific perception: the American English listeners were not expected to have relied on their L1 phonology since sibilants such as [ɕ tɕh] are not available in English; alternatively, even if there was a chance for them to have mapped [s ɕ] to English /s ʃ/, the reduced distinctiveness in [_i] was not predicted by English-specific phonology since English /s/ vs /ʃ/ contrast in the [_i] context (e.g., “see” vs “she”) as well as other vowel contexts (e.g., “sock” vs “shock”) (Li and Zhang, 2017). The same interpretation applies to the English listeners' lowest ratings of [_i] in the current study, as shown in Fig. 1.

The English listeners' three-way differentiation of [_ˌɹ/i] > [_a] > [_i] in their similarity rating in this study, as shown in Table 2(a), has a psychoacoustic interpretation as well. The larger perceptual distance in [_ˌɹ/i] than in [_a] is consistent with the larger acoustic differences in [sˌɹ-ɕi] and [tshˌɹ-tɕhi], in which a CV pair has different onsets and different vocalic parts, as compared with [sa-ɕa] and [tsha-tɕha], respectively, in which only the onsets differ.2 In [sˌɹ tshˌɹ], in particular, the vocalic parts are homorganic to the preceding sibilants, which provides a strong place cue for the alveolar sibilants (Stevens et al., 2004); the same holds true for [ɕi tɕhi], in which the onsets and the vowels are homorganic. Thus, the three-way difference of [_ˌɹ/i] > [_a] > [_i] for the American English listeners' ratings is rooted in the relative psychoacoustic distances of the sound pairs.

For the English listeners, the difference reflected in the similarity rating between [_ˌɹ/i] and [_a] did not show up in the response time data in the Li and Zhang (2017) speeded-AX discrimination, in which the English listeners gave a two-way differentiation of [_ˌɹ/i] [_a] > [_i]. The lack of difference between [_ˌɹ/i] and [_a] in the results of the speeded-AX discrimination indicates that the speeded nature of an online discrimination and the response time data may not be able to reveal certain details in the relative perceptual distinctiveness between different sound pairs. In contrast, the difference of [_ˌɹ/i] vs [_a] is accessible in an offline similarity rating if the consonants are not contrastive in the listeners' native language, e.g., [s] vs [ɕ] for American English listeners, and a short ISI helps facilitate perception based on the psychoacoustic difference in a sound pair. With these settings, a listener is able to indicate detailed relative perceptual distinctiveness of different sound pairs (Babel and Johnson, 2010; Johnson and Babel, 2010).

For the Mandarin listeners, the lack of a significant difference between [_ˌɹ/i] and [_a] showed the influence of L1-specific perception, even though the psychoacoustic information of the sounds is available at the earlier stage of perception. More specifically, Mandarin has CV syllables corresponding to morphemes, such as [sˌɹ] “silk” vs [ɕi] “rare” and [tshˌɹ] “uneven” vs [tɕhi] “seven,” in the [_ˌɹ/i] context, as well as [sa] “to release” vs [ɕa] “shrimp” and [tsha] “to wipe” vs [tɕha] “to pinch,” in the [_a] context, all bearing a high-level tone. The psychoacoustic difference in [_ˌɹ/i] vs [_a] is reflected in American English listeners' ratings, while the absence of their difference in Mandarin listeners' ratings indicated the influence of the listeners' L1-specific perception. It seems that, in the more offline similarity rating task, Mandarin listeners' L1-specific perception has the chance to take effect and override the finer psychoacoustic difference in the sound pairs, despite the use of a short ISI to facilitate perception based on their acoustic differences.

The Mandarin listeners rated the sibilant place difference the lowest in the [_i] context, i.e., for [si-ɕi] and [tshi-tɕhi], even though /s/, /ɕ/, /tsh/, and /tɕh/ are all phonemic in Mandarin. This confirms the observation in the literature that, even for contrastive sound categories, a phonological context not allowing the contrasts may lead to difficulty in their discrimination (Mitterer and Blomert, 2003; Sun et al., 2015). This is likely rooted in an interaction between psychoacoustic perception and L1-specific perception: the four onsets are categorical in their L1 and [ɕi] [tɕhi] are legal CV syllables in Mandarin, which should lead to L1-specific perception of the onsets and the two syllables; [si] and [tshi] involve Mandarin-illegal combinations, in which the coarticulations between segments lead to context-specific properties of the onsets. With the short ISI of 100 ms, an evaluation of the psychoacoustic distance between [si] and [ɕi], for example, becomes available to Mandarin listeners as the two syllables do not map onto two distinct legal CV syllables in Mandarin. Assuming this to be the case, it is not surprising that Mandarin listeners behaved similarly to English listeners in rating the [_i] context as the least distinct, consistent with the prediction of the relative acoustic differences involved in the three vowel contexts.

A more subtle interaction between psychoacoustic perception and L1-specific perception can be seen from English and Mandarin listeners' ratings of [si-ɕi] vs [tshi-tɕhi]. On the one hand, the two groups of listeners both perceived [si-ɕi] as more distinct than [tshi-tɕhi], which is consistent with the relative psychoacoustic differences between the two sounds in these pairs. For the onset sibilants, [s] and [ɕ] are longer (125 ms), thus involving stronger internal cues, than [tsh] and [tɕh], respectively (100 ms); the acoustic differences in [s] vs [ɕ] are larger than those in [tsh] vs [tɕh] in terms of center of gravity (COG) and intensity (Li and Zhang, 2017); in addition, the aspiration in the latter halves of [tsh] and [tɕh] might have further obscured the formant transition cues for the onset sibilants. The English and Mandarin listeners' ratings of [si-ɕi] as more distinct than [tshi-tɕhi] are thus predicted by the psychoacoustic properties of the sound pairs. The judgement of [si-ɕi] as more distinct than [tshi-tɕhi] cannot be rooted in the listeners' L1-specific perception because, for English listeners, [si-ɕi] and [tshi-tɕhi] involved sibilant place contrasts not available in their L1 phonology, while for Mandarin listeners, the four onset sibilants are all available consonant categories.

On the other hand, Mandarin listeners turned out to give a larger difference between [si-ɕi] and [tshi-tɕhi] than the English listeners, as shown by the significant interaction of NativeLg*Onset when focusing on [si-ɕi] and [tshi-tɕhi]. This difference between the two groups of listeners is likely to have stemmed from their L1-specific perception. In Mandarin, the onset sibilants /s/, /ɕ/, /tsh/, and /tɕh/, are all phonemic even though they do not contrast in the [_i] context. In an offline similarity rating task without time pressure, for Mandarin listeners, if the response was solely due to L1-specific perception and psychoacoustic perception was completely overridden, then we would not expect a difference between the fricative contrast (/s/ vs /ɕ/) and the affricate contrast (/tsh/ vs /tɕh/). The manner difference indicated that Mandarin-specific perception interacted with psychoacoustic perception, by which Mandarin listeners seemed to have exaggerated the psychoacoustic difference between [si-ɕi] and [tshi-tɕhi], as compared to English listeners, whose responses are predominantly based on psychoacoustic properties of the relevant sounds. 
A potential explanation for Mandarin listeners' exaggeration of the psychoacoustic difference could be that, as they recognize /s/ /ɕ/ /tsh/ and /tɕh/ as consonant categories in their native language, their tacit knowledge about the acoustic properties of fricatives vs affricates, as based on their language experience, might have influenced their judgement: the fricatives /s/ and /ɕ/ have relatively strong internal cues (Wright, 2004) and thus their difference may remain relatively salient, as expected based on the listeners' L1 knowledge of the fricative category, even if they contrast in the [_i] context, as in the stimuli, while not in Mandarin; in contrast, the affricates /tsh/ and /tɕh/ have a relatively weaker internal cue and their perceptual difference would be smaller than the fricative pairs, which might be even diminished when /tsh/ and /tɕh/ form a contrast in an [_i] context in the stimuli, which does not exist in the listeners' L1. In such a scenario, the Mandarin listeners may be biased to disperse their responses to the fricative pairs and the affricate pairs, respectively towards the two ends of the scale, as compared with the English listeners. Simply put, this suggests that the recognition of different sounds as phonemic in one's native language may bias a listener towards a larger perceptual distinctiveness of the relevant sounds on top of their psychoacoustic differences. It should be added that this interaction was observed in the [_i] context only, not in the [_a] and [_ˌɹ/i] contexts, both of which involve mapping to pairs of L1-legal CV syllables.

While an online task such as a speeded-AX discrimination is found to be able to assess the psychoacoustic difference between two sounds, there is also indication that an offline similarity rating may access certain aspects of the psychoacoustic differences of two sounds, as observed in the data of Babel and Johnson (2010) and Johnson and Babel (2010). The results of the similarity rating in this study confirm that it can access the relative psychoacoustic distance of sound pairs and show that an offline similarity rating task could possibly provide a finer assessment of the relative perceptual distances between different sound pairs, as compared with an online AX discrimination task. Related to the overall perceptual distance of pairs of sound sequences, this study also demonstrates how psychoacoustic perception may interact with L1-specific perception when a listener evaluates the perceptual distinctiveness of two sound sequences that may or may not be legal in one's native language.

This research was partially supported by Hong Kong University Grants Committee's Research Matching Grant Scheme (Grant No. RMGS2019_1_18), Hong Kong Baptist University Faculty Research Grant (FRG) Category II (Grant No. FRG2/17–18/076), and Hong Kong Baptist University Research Committee's Start-up Grant for New Academics. The authors acknowledge the support from The University of Kansas Phonetics and Psycholinguistics Laboratory (KUPPL) and Hong Kong Baptist University (HKBU) Phonology Laboratory.

Table 3.

Fixed effects in the mixed-effect linear regression for ratings. Model: Similarity ∼ Onset * Vowel * NativeLg + (1|Participant) + (1|Participant:Onset) + (1|Participant:Vowel). Baselines: Onset = [s-ɕ], Vowel = [_i], and NativeLg = Mandarin. Signif. codes: ‘***’ 0.001, ‘**’ 0.01, ‘*’ 0.05, ‘.’ 0.1, ‘ ’ 1….

                                              Estimate   SE      df        t-value   Pr(>|t|)
(Intercept)                                   4.203      0.121   143.76    34.83     <0.001***
Onset(tsh-tɕh)                                −1.365     0.086   189.92    −15.84    <0.001***
Vowel(_a)                                     0.540      0.120   169.50    4.51      <0.001***
Vowel(_ˌɹ/i)                                  0.493      0.120   169.50    4.12      <0.001***
NativeLg(Eng)                                 −0.519     0.171   143.85    −3.04     0.003
Onset(tsh-tɕh):Vowel(_a)                      1.431      0.100   3348.69   14.28     <0.001***
Onset(tsh-tɕh):Vowel(_ˌɹ/i)                   1.388      0.100   3348.21   13.87     <0.001***
Onset(tsh-tɕh):NativeLg(Eng)                  1.021      0.122   189.90    8.37      <0.001***
Vowel(_a):NativeLg(Eng)                       −0.378     0.170   169.60    −2.23     0.027*
Vowel(_ˌɹ/i):NativeLg(Eng)                    0.142      0.170   169.60    0.84      0.403
Onset(tsh-tɕh):Vowel(_a):NativeLg(Eng)        −1.074     0.142   3348.49   −7.59     <0.001***
Onset(tsh-tɕh):Vowel(_ˌɹ/i):NativeLg(Eng)     −1.007     0.142   3348.20   −7.12     <0.001***
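To read Table 3, each cell's predicted mean rating is the intercept plus every coefficient whose terms all apply to that cell. The sketch below does this arithmetic, assuming treatment (dummy) coding with [s-ɕ], [_i], and Mandarin as the reference levels; the short labels ("aff", "r", "eng", etc.) are illustrative. The reconstructed values agree with the mean ratings quoted in the Results to within 0.01.

```python
# Fixed-effect estimates copied from Table 3. Each key lists the dummy
# variables that must all be "on" for the coefficient to apply. Labels are
# illustrative: "aff" = Onset(tsh-tɕh), "a" = Vowel(_a),
# "r" = Vowel(_ˌɹ/i), "eng" = NativeLg(Eng).
COEF = {
    (): 4.203,
    ("aff",): -1.365,
    ("a",): 0.540,
    ("r",): 0.493,
    ("eng",): -0.519,
    ("aff", "a"): 1.431,
    ("aff", "r"): 1.388,
    ("aff", "eng"): 1.021,
    ("a", "eng"): -0.378,
    ("r", "eng"): 0.142,
    ("aff", "a", "eng"): -1.074,
    ("aff", "r", "eng"): -1.007,
}

# Mean ratings quoted in the Results text, keyed by (onset, vowel, language).
# "fric", "i", and "man" are the reference levels and carry no coefficient.
REPORTED = {
    ("fric", "r", "eng"): 4.32, ("aff", "r", "eng"): 4.36,
    ("fric", "a", "eng"): 3.85, ("aff", "a", "eng"): 3.86,
    ("fric", "i", "eng"): 3.69, ("aff", "i", "eng"): 3.34,
    ("fric", "r", "man"): 4.70, ("aff", "r", "man"): 4.72,
    ("fric", "a", "man"): 4.74, ("aff", "a", "man"): 4.81,
    ("fric", "i", "man"): 4.20, ("aff", "i", "man"): 2.84,
}


def predicted_mean(onset, vowel, lang):
    """Sum all coefficients whose terms are a subset of the cell's levels."""
    active = {onset, vowel, lang}
    return sum(b for terms, b in COEF.items() if set(terms) <= active)


for cell, reported in REPORTED.items():
    print(cell, round(predicted_mean(*cell), 3), reported)
```

For example, the Mandarin listeners' [tshi-tɕhi] cell is 4.203 − 1.365 = 2.838, matching the reported mean of 2.84, and the English listeners' [tshi-tɕhi] cell is 4.203 − 1.365 − 0.519 + 1.021 = 3.340.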

1 The syllables [ɕa tɕha] are conventionally transcribed as [ɕia tɕhia] or [ɕja tɕhja], in which the [i/j] part is literally the formant transition from the palatal sibilants to the following vowel (Ladefoged and Maddieson, 1996).

2 Detailed measurements and analysis of the stimulus syllables, including the consonants and the vowels, are available in Li and Zhang (2017). Within each pair, the vocalic portions are close in their acoustic properties, which makes them unlikely to lead to a perceptual difference within the pair.

1. Babel, M., and Johnson, K. (2010). “Accessing psycho-acoustic perception with speech sounds,” J. Lab. Phonol. 1(1), 179–205.
2. Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015). “Fitting linear mixed-effects models using lme4,” J. Stat. Softw. 67(1), 1–48.
3. Best, C. T., McRoberts, G. W., and Goodell, E. (2001). “Discrimination of non-native consonant contrasts varying in perceptual assimilation to the listener's native phonological system,” J. Acoust. Soc. Am. 109, 775–794.
4. Boomershine, A., Hall, K. C., Hume, E., and Johnson, K. (2008). “The impact of allophony versus contrast on speech perception,” in Contrast in Phonology: Theory, Perception, Acquisition, edited by P. Avery, E. Dresher, and K. Rice (Mouton de Gruyter, New York), pp. 146–172.
5. Clopper, C. G. (2008). “Auditory free classification: Methods and analysis,” Behav. Res. Methods 40, 575–581.
6. Clopper, C. G., and Bradlow, A. R. (2009). “Free classification of American English dialects by native and non-native listeners,” J. Phon. 37(4), 436–451.
7. Delattre, P. C., Liberman, A. M., and Cooper, F. S. (1955). “Acoustic loci and transitional cues for consonants,” J. Acoust. Soc. Am. 27, 769–773.
8. Feng, L. (1985). “Beijinghua yuliu zhong shengyundiao de shichang” (“Duration of consonants, vowels and tones in colloquial Beijing Mandarin”), in Beijing Yuyin Shiyan Lu (Experimental Studies in Beijing Mandarin), edited by T. Lin and L. Wang (Peking University Press, Beijing), pp. 131–195.
9. Flege, J. E., Takagi, N., and Mann, V. (1996). “Lexical familiarity and English-language experience affect Japanese adults' perception of /r/ and /l/,” J. Acoust. Soc. Am. 99, 1161–1173.
10. Fong, M. C.-M., Minett, J. W., Blu, T., and Wang, W. S.-Y. (2014). “Towards a neural measure of perceptual distance: Classification of electroencephalographic responses to synthetic vowels,” in Proceedings of Interspeech 2014, September 14–18, Singapore, pp. 2595–2599.
11. Ganong, W. F. (1980). “Phonetic categorization in auditory word perception,” J. Exp. Psychol. Hum. Percept. Perform. 6(1), 110–125.
12. Greenberg, J. H., and Jenkins, J. J. (1964). “Studies in the psychological correlates of the sound system of American English,” Word 20, 157–177.
13. Johnson, K., and Babel, M. (2010). “On the perceptual basis of distinctive features: Evidence from the perception of fricatives by Dutch and English speakers,” J. Phon. 38(1), 127–136.
14. Kuhl, P. K., Williams, K. A., Lacerda, F., Stevens, K. N., and Lindblom, B. (1992). “Linguistic experiences alter phonetic perception in infants by 6 months of age,” Science 255, 606–608.
15. Kuznetsova, A., Brockhoff, P. B., and Christensen, R. H. B. (2017). “lmerTest package: Tests in linear mixed effects models,” J. Stat. Softw. 82(13), 1–26.
16. Ladefoged, P., and Maddieson, I. (1996). The Sounds of the World's Languages (Blackwell Ltd., New York).
17. Li, M., and Zhang, J. (2017). “Perceptual distinctiveness between dental and palatal sibilants in different vowel contexts and its implications for phonological contrasts,” Lab. Phonol. J. Assoc. Lab. Phonol. 8(1), 18.
18. Mitterer, H., and Blomert, L. (2003). “Coping with phonological assimilation in speech perception: Evidence for early compensation,” Percept. Psychophys. 65(6), 956–969.
19. Mohr, B., and Wang, W. S.-Y. (1968). “Perceptual distance and the specification of phonological features,” Phonetica 18, 31–45.
20. Näätänen, R. (2001). “The perception of speech sounds by the human brain as reflected by the mismatch negativity (MMN) and its magnetic equivalent (MMNm),” Psychophysiology 38, 1–21.
21. Näätänen, R., and Kreegipuu, K. (2009). “The mismatch negativity as an index of different forms of memory in audition,” in Memory, Aging and the Brain: A Festschrift in Honour of Lars-Göran Nilsson, edited by L. Bäckman and L. Nyberg (Psychology Press, London), pp. 287–299.
22. Perception Research Systems (2007). “Paradigm stimulus presentation,” http://www.paradigmexperiments.com (Last viewed 23 June 2022).
23. Pisoni, D. (1973). “Auditory and phonetic codes in the discrimination of consonants and vowels,” Percept. Psychophys. 13, 253–260.
24. Pisoni, D., and Tash, J. (1974). “Reaction times to comparisons within and across phonetic categories,” Percept. Psychophys. 15(2), 285–290.
25. Stevens, K., Li, Z., Lee, C.-Y., and Keyser, S. J. (2004). “A note on Mandarin fricatives and enhancement,” in From Traditional Phonology to Modern Speech Processing, edited by G. Fant, H. Fujisaki, J. Cao, and Y. Xu (Foreign Language Teaching and Research Press, Beijing), pp. 393–404.
26. Sun, Y., Giavazzi, M., Adda-Decker, M., Barbosa, L. S., Kouider, S., Bachoud-Lévi, A. C., and Peperkamp, S. (2015). “Complex linguistic rules modulate early auditory brain responses,” Brain Lang. 149, 55–65.
27. Werker, J. F., and Logan, J. S. (1985). “Cross-language evidence for three factors in speech perception,” Percept. Psychophys. 37, 35–44.
28. Wright, R. (2004). “A review of perceptual cues and cue robustness,” in Phonetically Based Phonology, edited by B. Hayes, R. Kirchner, and D. Steriade (Cambridge University Press, Cambridge, UK), pp. 34–57.