This study compared the f0 of 14 German vowels in monosyllabic words (/dVt/) embedded in carrier sentences produced by 30 native speakers and 30 Mandarin Chinese learners. Appropriate techniques were employed to robustly measure f0 values and reliably analyze f0 profiles. The results showed that Mandarin learners produced the vowels bearing sentence stress with significantly larger f0 ranges and steeper f0 slopes but comparable f0 mean and maximum in comparison to German natives. Moreover, lax vowels produced by both groups demonstrated narrower ranges with faster f0 changes than tense vowels, which was stronger for Mandarin learners.
Compared with the extensive studies on Mandarin learners speaking English (Chen et al., 2001a; Jin and Liu, 2013; Yuan and Liberman, 2014), acoustic analyses of their production of German as a second language (L2) are still limited. Studies on L2 speech learning usually start with the acquisition of vowel segments in the target language. In German, without consideration of schwa /ə/ or long tense /ɛː/, there are 14 monophthongs that can be grouped into seven pairs, the members of which differ exclusively with respect to tenseness. The phonetic-acoustic differences of the German tense-lax opposition in production may manifest through changes in vowel formants, duration, and f0 (Schneeberg and Schlüßler, 2006). The former two aspects have been compared between German native speakers and Mandarin Chinese learners (Gao et al., 2020), while the last factor (f0 difference) remains to be investigated. Though previous studies have revealed that L2 German produced by Mandarin learners has higher f0 mean and larger f0 range on both sentence and phoneme levels compared with that produced by German natives (Ding et al., 2006), they have rarely addressed the f0 profiles associated with the tense-lax vowel contrast. Vowel intrinsic f0 (IF0) has been shown to be a language universal (Whalen and Levitt, 1995). It has been shown that high vowels have a higher intrinsic f0 than low vowels and that intrinsic f0 also plays an important role in distinguishing vowel identity. Therefore, the tense-lax contrast of vowels should be evident not only in formants and duration, but also in f0.
Unlike the role of f0 in German vowels to signal stress and possibly tenseness, f0 in Mandarin vowels is associated with a lexical tone and employed to distinguish lexical meanings. Moreover, Mandarin Chinese monophthongs are usually classified as tense vowels, that is, there are no tense-lax contrasts in Mandarin Chinese. Such f0 contrasts are supposed to be employed by native German speakers to distinguish between tense and lax vowels (Schneeberg and Schlüßler, 2006), while L2 Mandarin Chinese learners may not use the same f0-related strategy due to the different roles of f0 in their native tone language. In addition, vowel intrinsic f0 differences are also dependent on the prosodic context in running speech, and the intrinsic f0 difference should be maintained when the vowels bear the main phrasal stress (Shadle, 1985). Regarding the interaction between the intrinsic f0 and the prosodic environment, we predicted that Mandarin Chinese learners might have some difficulties in using f0 properly when they speak the non-tone language German. To address this specific issue, the current study aims to compare the f0 profiles of German vowels produced by German native speakers and Mandarin Chinese learners with a particular interest in the tense-lax contrast under sentence stress.
Two groups of speakers were recruited for the study, namely, a German native group (DEU) and a Chinese L2 learner group (CHN). The DEU group consisted of 30 German students studying at the TU Dresden with a mean age of 23.6 years (range: 18–38), while the CHN group included 30 Chinese L2 learners of German with an average age of 24.1 years (range: 18–31). Some CHN speakers were students majoring in German at Shanghai Jiao Tong University who had passed the nationwide unified examination for German students at the senior level of PGH (Prüfung für das Germanistik-Hauptstudium), and the others had passed the required German language examination (up to DSH-2 or DAF-16) before taking up their studies in German at TU Dresden. Speakers were exactly gender-balanced, i.e., 15 male and 15 female speakers in each group. Although they were born in different regions of their countries, the speakers in both groups had no strong regional accents. For example, all CHN speakers had achieved Grade Two Level B or above on the national standard Mandarin proficiency test (Putonghua Shuiping Ceshi), and most of them had less than one year of experience living in Germany after the age of 18 years. All participants, according to their self-reports, had normal speech and hearing functions with no history of any communication disorders.
2.2 Data collection
First, we embedded all 14 German vowels in monosyllabic words (/dVt/) to ensure that the speakers could produce the target vowels in a natural way. To create a systematic orthographic contrast, an “h” or an additional “t” was placed after the target vowel to indicate a tense or a lax vowel, respectively. Thus, we obtained 14 words, and most of them were nonsense but legal phoneme strings according to German phonotactic rules. These 14 words consisted of seven pairs with their International Phonetic Alphabet (IPA) transcriptions in parentheses as follows: daht-datt (/daːt/-/dat/), deht-dett (/deːt/-/dɛt/), diht-ditt (/diːt/-/dɪt/), döht-dött (/døːt/-/dœt/), doht-dott (/doːt/-/dɔt/), düht-dütt (/dyːt/-/dʏt/), duht-dutt (/duːt/-/dʊt/). The regular spelling patterns facilitated the correct grapheme-to-phoneme conversion for the speakers, so that the target vowels could be easily elicited. Moreover, we put each of the target words in a carrier sentence, “Ich habe /dVt/ gesagt (I have said /dVt/),” to ensure a stable prosodic context. By randomizing each set of the 14 sentences five times, we created a reading list of 70 sentences, and thus it was guaranteed that each vowel was produced five times with different intra-group orders by each speaker. The speakers were told that they should read all the sentences as naturally as possible with a short pause between them. After a period of familiarization and practice, all the speakers chose to place a pitch accent on the target word automatically. This way, f0 values of target vowels under sentence stress could be elicited in an implicit way with well-controlled prosody. Though several CHN recordings were made in Shanghai and the others were in Germany, we ensured the same instructions and conditions. All recordings took place in a studio equipped with a recording console (Behringer Eurorack MX1602). The microphone (Microtech Gefell M930) was placed at a distance of approximately 20 cm from the speaker's mouth.
All utterances were recorded with a sampling rate of 44.1 kHz and a quantization of 16 bits. The experiment lasted for about 5 min for each speaker, and they were financially compensated for their participation.
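The construction of the reading list can be illustrated with a minimal Python sketch. It assumes that each of the five blocks contains every one of the 14 carrier sentences exactly once in an independently shuffled order; the function name `build_reading_list`, the `seed` argument, and the plain-text carrier template are hypothetical and are not taken from the study materials.

```python
import random

# The seven tense-lax word pairs described in the text.
TENSE = ["daht", "deht", "diht", "döht", "doht", "düht", "duht"]
LAX = ["datt", "dett", "ditt", "dött", "dott", "dütt", "dutt"]
CARRIER = "Ich habe {word} gesagt."


def build_reading_list(n_blocks=5, seed=None):
    """Return n_blocks shuffled blocks of the 14 carrier sentences.

    Assumption: every block contains each sentence exactly once, so each
    vowel is read n_blocks times and the order differs from block to block.
    """
    rng = random.Random(seed)
    words = TENSE + LAX
    sentences = []
    for _ in range(n_blocks):
        block = words[:]  # copy so the master list stays untouched
        rng.shuffle(block)
        sentences.extend(CARRIER.format(word=w) for w in block)
    return sentences


reading_list = build_reading_list(seed=1)
print(len(reading_list))  # 14 words x 5 blocks
```

Shuffling within blocks, rather than shuffling all 70 sentences at once, keeps the five repetitions of a vowel spread over the whole session.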
2.3 Acoustic measurements
In the first step, an automatic forced-alignment was carried out via the WebMAUS service (Kisler et al., 2017) on both word and phoneme levels, outputting a TextGrid format annotation of Praat (Boersma and Weenink, 2019). Based on the derived word-level boundaries, we segmented all the recordings into 4200 individual sentences (14 vowels × 5 repetitions × 15 speakers × 2 genders × 2 language groups). Inaccurate alignments from the automatic phoneme annotation were manually adjusted by a phonetic expert by taking into account changes in both waveforms and spectrograms as well as perception cues if necessary.
A five-step procedure was applied to achieve a high accuracy of f0 estimation. The first step was to extract the fundamental frequencies by the pitch-tracker developed by Shi et al. (2019), the analysis window of which was set at a length of 30 ms with 5 ms shift. A robust f0 estimate and a voicing probability for each frame of speech were obtained. In the second step, we carried out the pitch tracking through a two-pass procedure following the strategy proposed by Hirst (2011) by calculating a more accurate f0 range for f0 estimation. In the first pass, we inspected our data and set a more accurate search range of 150–400 Hz and 75–300 Hz for female and male speakers, respectively, to cover all reasonable f0 samples, and we extracted the f0 with this range. Then we calculated the first and third quartiles (i.e., q1 and q3) across all f0 samples for each speaker. In the second pass, the f0 floor and ceiling for each speaker were set to speaker-specific multiples of q1 and q3, respectively (0.75 q1 and 1.5 q3 in Hirst's original recommendation). By using a personalized search range, we greatly reduced the estimation errors of f0 extraction. This was confirmed by comparing speakers' f0 histograms, in which long tails disappeared and samples were more centralized around the mean values. In the next step, a frame of speech with a voicing probability smaller than 0.5 was automatically removed from the data for f0 extraction because these f0 values were considered unreliable according to Shi et al. (2019). The fourth step was incorporated to deal with creaky voice that was frequently produced by several speakers. In glottalized periods, the corresponding f0 was estimated by another pitch-tracker (Drugman and Alwan, 2011), which was more robust to glottalization. Finally, we applied a median filter with a window of seven f0 samples to smooth the f0 contour. Manual corrections were only applied when the values were still wrong during the final check.
For example, if the f0 samples with voicing probabilities slightly larger than 0.5 were actually voiceless, we had to make necessary corrections manually.
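Three of the steps described above (the speaker-specific search range, the voicing-probability threshold, and the median smoothing) can be sketched in Python as follows. This is only an illustration under stated assumptions: the factors 0.75 and 1.5 are the values commonly attributed to Hirst's two-pass strategy and are assumed here, the quartiles use the default method of `statistics.quantiles`, and the shrinking window at the contour edges is our choice. The pitch trackers themselves are not reproduced, and all function names are hypothetical.

```python
import statistics


def personalized_range(f0_first_pass, floor_factor=0.75, ceiling_factor=1.5):
    """Derive a speaker-specific f0 floor and ceiling from first-pass f0.

    Assumption: the default factors follow the commonly cited form of
    Hirst's two-pass strategy; they are not confirmed by the text.
    """
    q1, _, q3 = statistics.quantiles(f0_first_pass, n=4)
    return floor_factor * q1, ceiling_factor * q3


def drop_unvoiced(f0, voicing_prob, threshold=0.5):
    """Remove frames whose voicing probability is smaller than the threshold."""
    return [f for f, p in zip(f0, voicing_prob) if p >= threshold]


def median_smooth(f0, window=7):
    """Median-filter an f0 contour; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(f0)):
        lo = max(0, i - half)
        hi = min(len(f0), i + half + 1)
        smoothed.append(statistics.median(f0[lo:hi]))
    return smoothed


# Hypothetical first-pass samples (Hz) with one octave-jump outlier.
track = [200, 202, 204, 400, 206, 208, 210]
floor_hz, ceiling_hz = personalized_range([100, 110, 120, 130, 140, 150, 160])
voiced = drop_unvoiced([100, 110, 120], [0.9, 0.4, 0.5])
smoothed = median_smooth(track)
```

In this toy contour the isolated 400 Hz sample is replaced by the local median, which is the kind of spurious jump the smoothing step is meant to suppress.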
2.4 Acoustic analysis
Based on the optimized f0 samples, we calculated the f0 mean, f0 range (maximal f0 minus minimal f0), and f0 slope of each target vowel. Following Lehiste and Peterson (1961), we also measured the f0 maximum as a complement of f0 mean. We further measured the positions of the f0 maximum and minimum of each target vowel. Each vowel variable for a specific speaker was the average of his/her five repetitions of this vowel. We adopted two approaches to make the acoustic analysis more precise and robust.
One statistical approach to alleviate the influence of physiological differences efficiently was to convert the physical measurement of f0 to the perceptual variable of f0 using speaker-specific bases, which made the f0 produced by all speakers comparable across gender. In previous studies, f0 was usually converted from the Hz scale to the semitone (St) scale with a fixed value as a reference, which did not change the relative relationship between them due to the monotonic property of the logarithmic function (Chen et al., 2001b; Ding et al., 2006; Zhang et al., 2008). In the current study, we adopted the reference proposed by Yuan and Liberman (2014), where each f0 in Hz was transformed to a St value according to Eq. (1), in which $f_{0,\mathrm{base}}$ was the speaker-specific 5th percentile of all f0 in Hz scale:

$$\mathrm{St} = 12 \log_2\left(\frac{f_0}{f_{0,\mathrm{base}}}\right). \tag{1}$$
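A minimal Python sketch of this normalization follows, assuming the standard semitone formula (12 times the base-2 logarithm of the ratio to the reference) and a linearly interpolated 5th percentile. The helper names and the interpolation rule are our assumptions, not details given in the text.

```python
import math


def percentile(values, pct):
    """Linearly interpolated percentile (an assumed interpolation rule)."""
    data = sorted(values)
    rank = (len(data) - 1) * pct / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    frac = rank - lower
    return data[lower] * (1 - frac) + data[upper] * frac


def hz_to_semitones(f0_hz, base_hz):
    """Convert f0 in Hz to semitones relative to base_hz."""
    return 12.0 * math.log2(f0_hz / base_hz)


def normalize_speaker(f0_samples_hz):
    """Express one speaker's f0 samples in St above the speaker's 5th percentile."""
    base = percentile(f0_samples_hz, 5)
    return [hz_to_semitones(f, base) for f in f0_samples_hz]


# Hypothetical samples: the second speaker is exactly one octave lower.
speaker_a = [200.0, 220.0, 240.0, 260.0, 280.0]
speaker_b = [f / 2 for f in speaker_a]
st_a = normalize_speaker(speaker_a)
st_b = normalize_speaker(speaker_b)
```

Because both the samples and the reference scale together, the two hypothetical speakers end up with identical St values, which illustrates why a speaker-specific base can neutralize a register difference that a fixed reference would preserve.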
Another statistical approach was to measure the f0 slope by conducting a linear regression with time as an independent variable and f0 as the dependent variable, which was more robust than the usual practice of dividing the absolute f0 range (difference between the maximal and minimal f0 values in St) by the duration of the vowel. The slope we obtained could thus characterize the dynamic movements of f0 contours, where a positive slope represented an overall rising pattern, and a negative slope indicated an overall falling one. The absolute value of the slope reflected the steepness of the rising or falling. In the case of an f0 contour containing two parts (LH plus HL or HL plus LH), the value of the slope was dominated by the longer part or the relatively steeper part, i.e., the dominant part contributed more to the direction of the estimated slopes.
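The difference between the regression-based slope and the range-over-duration measure can be sketched in Python as below. This is an ordinary least-squares fit written out by hand; the function names and the toy contours are hypothetical.

```python
def f0_slope(times_ms, f0_st):
    """Least-squares slope of f0 (St) regressed on time (ms), in St/ms."""
    n = len(times_ms)
    mean_t = sum(times_ms) / n
    mean_f = sum(f0_st) / n
    sxx = sum((t - mean_t) ** 2 for t in times_ms)
    sxy = sum((t - mean_t) * (f - mean_f) for t, f in zip(times_ms, f0_st))
    return sxy / sxx


def range_over_duration(times_ms, f0_st):
    """The simpler measure: absolute f0 range divided by vowel duration."""
    return (max(f0_st) - min(f0_st)) / (times_ms[-1] - times_ms[0])


# A long rise followed by a short fall (LH plus HL): the rise dominates.
times = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
contour = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 3.5, 3.0]
rise_fall_slope = f0_slope(times, contour)

# A symmetric rise-fall: the two halves cancel in the regression.
symmetric_slope = f0_slope([0, 10, 20, 30, 40], [0.0, 1.0, 2.0, 1.0, 0.0])
```

The regression slope is signed and uses every sample, whereas the range-over-duration measure is unsigned and depends on only two samples, so a single outlying f0 value affects it more strongly.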
A series of linear mixed-effects (LME) models were run in MATLAB (MathWorks, 2019), where SpeakerGroup, Gender, VowelIdentity, or Tenseness and their interactions were treated as fixed effects and Subject as a random effect for intercept, while the acoustic parameters (f0 mean, maximum, range, or slope) were the dependent variables. We first fitted the LME models to the data using the “fitlme” function and then computed the analysis of variance (ANOVA) statistics for each variable using the “anova” function. The variables of the best models were selected through the backward selection procedure using the “compare” function.
3.1 f0 mean and maximum
Average group values for f0 mean (in Hz), maximum (in Hz), and f0 mean (in St) of the target vowels are shown in the first, second, and third rows, respectively, in Table 1. It can be observed that the female speakers produced higher f0 mean (245 Hz) and maximum (261 Hz) (measured in Hz) than the male speakers (137 Hz; 147 Hz); also, high vowels (199 Hz for /iː ɪ yː ʏ uː ʊ/) were associated with higher f0 than low vowels (179 Hz for /aː a/).
Table 1. Group values of f0 mean (in Hz), f0 maximum (in Hz), and f0 mean (in St) for the 14 target vowels /iː ɪ yː ʏ uː ʊ eː ɛ øː œ oː ɔ aː a/.
The results for f0 mean or maximum (in Hz) showed similar patterns: significant effects were found for Gender and VowelIdentity but not for SpeakerGroup. The LME regression for f0 in St revealed significant effects of VowelIdentity but non-significant effects of both SpeakerGroup and Gender. However, the effect of Gender was statistically significant for f0 mean in Hz. In other words, by transforming f0 from Hz to St with Eq. (1), we preserved the difference of f0 mean due to vowel intrinsic f0 effects but reduced the difference due to the gender effect. A significant interaction effect on f0 mean in St was found between SpeakerGroup and VowelIdentity. For example, the f0 mean (in St) of vowels /uː/ and /ʏ/ ranked as the first and seventh among 14 monophthongs for the DEU speakers, respectively, while they ranked as the sixth and fourth for the CHN speakers, respectively. Moreover, the effect of Tenseness was significant, with lax vowels generally having a higher f0 than their tense counterparts (6.06 St versus 5.6 St). This test was conducted for group mean differences between tense and lax vowels, and no post hoc test was applied to each individual pair.
The results further showed that, compared to the DEU speakers, the CHN speakers produced vowels with comparable f0 mean (both in Hz and St) and maximum (in Hz), and they demonstrated similar intrinsic f0 values among different vowels. As f0 expressed in St could neutralize anatomy-based acoustic differences while retaining phonemic differences, we analyzed f0-related parameters in St hereafter.
3.2 f0 range
The f0 ranges of each vowel produced by the DEU and CHN speakers are shown in Fig. 1. The effect of SpeakerGroup was significant, with the CHN speakers' target vowels having larger f0 ranges than those of the DEU speakers (3 St versus 1.75 St), which reflected the same trend as with the Hz scale (32 Hz versus 20 Hz). Furthermore, the f0 difference between maximum and mean in Hz for each vowel was consistently larger for the CHN speakers than that for the DEU speakers (Table 1). When f0 range was represented in St, there was a significant effect for VowelIdentity but no significant effect for Gender. Also, the f0 ranges of the tense vowels were considerably larger than their lax counterparts (3.05 St versus 1.71 St), indicating the significant effect of Tenseness. In each German tense-lax vowel pair, the tense vowel has an inherently longer duration than its lax counterpart (215 ms versus 105 ms across pairs in this study). The smaller f0 ranges of lax vowels were deemed related to their shorter duration. However, whether the tense and lax vowels had the same rate of f0 change was still unclear. Therefore, we further compared the slope of f0 contours between tense and lax vowels.
3.3 f0 slope
The f0 slope was used to represent the dynamic changes of the vowel f0 contour, including the direction (rising or falling) and the steepness (i.e., f0 change rate). To analyze the directions of f0 contours, we measured the positions of the f0 maximum and minimum for each vowel. Fig. 2 depicts the f0 maximum/minimum position relative to vowel onset in percent, where 0% and 100% correspond to the onset and offset positions of vowels, respectively. As can be seen from the figure, the patterns of f0 contours are similar between the DEU-F and DEU-M speakers, that is, minimums preceded maximums, suggesting a rising trend in general. The CHN-F speakers exaggerated this pattern, since their f0 minimums and maximums were closer to the onset and offset of vowels, respectively, compared to the DEU speakers. The vowels produced by the CHN-M speakers showed a reverse pattern, in which f0 maximums occurred generally earlier than f0 minimums, resulting in a roughly overall falling direction.
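The relative positions of the f0 extrema can be computed along the following lines. This Python sketch assumes that the first and last samples of a token coincide with the vowel onset and offset and that the first occurrence is taken when an extremum is tied; the function names and the sample contour are hypothetical.

```python
def relative_extrema_positions(times_ms, f0_st):
    """Return (min_position, max_position) in percent of vowel duration.

    Assumption: times_ms[0] and times_ms[-1] are the vowel onset and offset.
    """
    onset, offset = times_ms[0], times_ms[-1]
    i_min = min(range(len(f0_st)), key=lambda i: f0_st[i])
    i_max = max(range(len(f0_st)), key=lambda i: f0_st[i])

    def to_percent(t):
        return 100.0 * (t - onset) / (offset - onset)

    return to_percent(times_ms[i_min]), to_percent(times_ms[i_max])


def contour_direction(min_pct, max_pct):
    """Classify the overall trend from the order of the two extrema."""
    if min_pct < max_pct:
        return "rising"
    if min_pct > max_pct:
        return "falling"
    return "level"


# Hypothetical token: the minimum is early and the maximum is late.
times = [100, 120, 140, 160, 180, 200]
f0 = [2.0, 1.5, 1.8, 2.6, 3.4, 3.1]
min_pct, max_pct = relative_extrema_positions(times, f0)
direction = contour_direction(min_pct, max_pct)
```

Expressing the positions in percent of vowel duration makes tokens of different length comparable, which matters here because tense and lax vowels differ strongly in duration.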
We further examined the steepness of the f0 slope. The proportions of negative slopes were 31.24% and 24.76% for tense and lax vowels produced by the CHN female speakers, respectively. The CHN male speakers produced even more negative slopes with 54.67% and 58.1% of tense and lax vowels, respectively. Averaging these slopes resulted in a “cancellation” effect in which the negative and positive slopes nullified each other. Therefore, we plotted the absolute values of slopes in Fig. 3(a), in which the average duration was also included to reflect the interactions between the f0 slope and duration of the vowels. All tense or lax tokens were averaged over seven vowels of each group. All lines were normalized to the same starting point for convenient comparisons, and the end points represent the average duration and slope. The LME regression for the absolute f0 slope revealed that there were significant effects of SpeakerGroup and Tenseness. As can be seen from Fig. 3(a), the CHN speakers used a greater rate of f0 change (in St/ms) for the vowels bearing sentence stress than the DEU speakers. Also, the lax vowels generally had shorter duration but steeper slopes than their tense counterparts. However, the effect of Gender was not significant, although the male speakers produced vowels with generally steeper f0 contours than the female speakers (with the exception of lax vowels for the DEU speakers). There was also a significant interaction effect of SpeakerGroup × Tenseness. As shown in Fig. 3(b), both speaker groups had greater rates of f0 change for lax vowels than for tense vowels, and the tendency was stronger for the CHN speakers than for the DEU speakers.
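The “cancellation” effect that motivates the use of absolute slopes can be made concrete with a short Python sketch. The slope values below are invented for illustration and are not data from the study; the function name is hypothetical.

```python
def summarize_slopes(slopes):
    """Return the share of negative slopes and the signed and absolute means."""
    n = len(slopes)
    return {
        "negative_pct": 100.0 * sum(1 for s in slopes if s < 0) / n,
        "mean_signed": sum(slopes) / n,
        "mean_abs": sum(abs(s) for s in slopes) / n,
    }


# Hypothetical slopes in St/ms: half rising, half falling.
summary = summarize_slopes([0.02, -0.02, 0.01, -0.01])
```

With these values the signed mean is zero although every token has a clearly non-zero slope, whereas the mean absolute slope still reflects the rate of f0 change.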
In our current study of f0 profiles of German vowels under sentence stress, apart from the general comparison of f0 mean and maximum, range, and slope, a particular focus was the tense-lax contrast of German vowels produced by Mandarin learners of German compared to native German speakers. The following new findings emerged.
First, after transforming f0 from Hz to St with the speaker-specific base, the effect of Gender on f0 mean was no longer significant, whereas the effect of VowelIdentity remained significant. It is also clear that CHN learners demonstrate intrinsic f0 patterns similar to those of German native speakers under sentence stress, namely, high vowels are produced with an intrinsically higher f0 than low vowels. We have thus provided further evidence to support the universality of the intrinsic f0 pattern in L2 speech. Furthermore, we have found that lax vowels are also associated with a higher f0 mean than their tense counterparts in the same stressed context, especially for peripheral vowels. However, Schneeberg and Schlüßler (2006) examined 14 German vowels produced by six speakers and found that only one tense-lax vowel pair showed a significant difference with the lax vowel having a higher f0, while other pairs showed no significant differences or tense vowels had significantly higher f0 than lax vowels. Other studies suggested that lax vowels had about the same (i.e., statistically non-significantly different) f0 as their tense counterparts, for example, Fischer-Jørgensen (1990) examining six vowels (/iː ɪ eː ɛ aː a/) produced by six speakers and Pape and Mooshammer (2006) examining six vowels (/iː ɪ uː ʊ aː a/) produced by three speakers. Whether and how the mixed findings result from different reading materials, individual speakers, or measuring approaches of f0 remains to be examined in the future.
Moreover, we have shown that both CHN learners and DEU speakers produce a larger f0 range for tense vowels than for lax vowels, which probably results from the longer inherent duration of tense vowels. We have also found that CHN learners produce vowels with a larger f0 range than DEU speakers, which echoes the findings in the previous studies for Chinese learners speaking German or English (Ding et al., 2006; Zhang et al., 2008). Due to negative first language (L1) transfer, many CHN learners may attach a lexical rising or falling tone to the vowel, which may enlarge the f0 range at the syllable level. More specifically, male CHN learners tend to produce more vowels with negative f0 slopes, while female CHN learners tend to produce more vowels with positive f0 slopes, which are more likely realized as lexical falling and rising tones in their L1, respectively. This difference may also result in much larger f0 ranges of vowels for male CHN learners than for female CHN learners (see the bar heights in Fig. 1), which could be explained by the fact that the Mandarin high-falling tone has the largest f0 range among lexical tones in general. In addition, the CHN learners produce all 14 vowels with longer duration than the DEU speakers, and the difference is statistically significant for 10 of them (Gao et al., 2020). The longer duration together with the steeper slope may also contribute to the larger f0 range of CHN learners' vowels.
Finally, we have found that both DEU and CHN speakers produce lax vowels with greater steepness than tense ones, and CHN speakers increase steepness more than DEU speakers when they produce lax vowels. Having inherently shorter duration than their tense counterparts, lax vowels may require a larger f0 change rate to achieve sentence prominence. Besides, we have shown that CHN learners produce the target vowels bearing sentence stress with different directions of f0 slope. Like DEU speakers, CHN female speakers produce target vowels with an overall rising f0 contour, while CHN male speakers produce those with an overall falling f0 contour, which could be attributed to the negative L1 transfer of Mandarin Chinese. Though all CHN learners recruited for the study had comparable L2 German proficiency, we observed that the L2 German speech produced by the males was more Chinese-accented than that produced by the females. Their Chinese accent may result from their frequent use of high-falling tones to achieve the prominence of the target vowel. This is in line with the previous finding that Chinese students tend to use a falling tone to signal an English stressed syllable (Juffs, 1990), which supports Ohala's argument that falling tones are more perceptually salient and can be accomplished more quickly (Ohala, 1978). Similar explanations are also found in previous studies of L2 English, e.g., Mandarin learners use a sharply falling f0 contour for strongly emphatic stress (Zhang et al., 2008).
This research work is partially sponsored by the China Scholarship Council and Shanghai Social Science Project (Grant No. 2018BYY003). We would like to express our gratitude to the anonymous reviewers and the editor for their constructive comments and suggestions to improve this work.