This study examined accent rating of speech samples collected from 12 Mandarin-accented English talkers and two native English talkers. The speech samples were processed with noise- and tone-vocoders at 1, 2, 4, 8, and 16 channels. The accentedness of the vocoded and unprocessed signals was judged by 53 native English listeners on a 9-point scale. The foreign-accented talkers were judged as having a less strong accent in the vocoded conditions than in the unprocessed condition. The native talkers and foreign-accented talkers with varying degrees of accentedness demonstrated different patterns of accent rating changes as a function of the number of channels.
In speech communication, listeners do not just perceive the linguistic content, like sounds, words, and sentences, but also identify the indexical information from the talker's voice. A variety of identity-related information, such as sex, age, height, language background, and social class, can be conveyed in the speech signal.1–3 The indexical information is realized in variations of speech properties such as duration, amplitude, and fundamental frequency (F0).2,4–9
In natural, clear speech signals, there are abundant acoustic–phonetic cues encoding talker identity, and those cues are salient to listeners. However, speech signals in the real world are usually compromised as a result of environmental noise, reverberation, transmission through low-fidelity devices, etc. Vocoded speech is one form of simplified speech signal that is degraded in the spectral and/or temporal domains. Vocoder processing is the core technology used in cochlear implants (CIs). A few previous studies reported that talkers' foreign accents had a greater detrimental effect on speech perception in CI users than in normal-hearing (NH) listeners;10–12 compared to NH listeners, CI users were less sensitive to accent information presented in different regional dialects or foreign accents (talkers speaking different native languages).13 However, it remains largely unknown how CI users perceive or judge foreign-accented talkers from the same language background but differing in the strength of accentedness. To provide reference data for the investigation of accent perception by CI users, in the present study, we used CI acoustic simulations, i.e., vocoded speech, to examine how NH listeners' judgement of talkers' accentedness is affected under various conditions of spectral degradation.
Depending on the type of carrier used in vocoder processing, there are two forms of vocoded speech: noise- or tone-vocoded speech. For noise-vocoder processing, the amplitude envelope extracted from the filtered frequency bands of the original speech is used to modulate noise that is filtered into the same frequency bands. For tone-vocoder processing, the extracted envelope is used to modulate sinusoids whose frequencies are equal to the center frequencies of the frequency bands. Xu et al.14 showed that the differences in sentence recognition scores between the noise- and tone-vocoders were small. However, for talker characteristics, several studies revealed that vocoder processing, especially noise-carrier vocoder processing, has a detrimental effect on talker sex identification.15–17 For example, Gonzales and Oliver15 found that the accuracy of gender and speaker identification was consistently higher for tone-vocoded speech than for noise-vocoded speech. With only three channels in the tone-vocoded signals, listeners achieved ≥90% accuracy in gender identification. For noise-vocoded signals, the identification performance improved as the number of channels increased from 4 to 10.
Unlike talker sex, which is primarily reflected in the voice features of F0 and vocal tract length (VTL),18 talker accent involves acoustic–phonetic deviations in both segmental and suprasegmental aspects.19–25 Talker accent acts as a vital source of talker variability and induces intelligibility problems.26,27 A large number of studies have been conducted to determine the perceptual relevance and contribution of segmental and suprasegmental features to listeners' judgement of foreign accentedness.28–32 Researchers reported that the perception of talker accent is determined by both segmental information, such as voice onset time and vowel quality,19,30,33 and suprasegmental features, including intonation, speech rate, and speech rhythm.29,31,34–36 While most studies used manipulated speech presented in quiet conditions, only a few examined the perception of talkers' accentedness in noise.37–39 Some researchers found that foreign-accented talkers were rated as less accented in the presence of noise than in the quiet condition,37,39 but others reported no difference in accent rating between quiet and noise conditions.38
Distinct from the speech-in-noise condition, in which the acoustic–phonetic cues are obscured by the background noise, vocoder processing degrades signals in the spectral and temporal domains. Kapolowicz et al.40 used an accent-detection task to examine the impact of the spectral resolution of tone-vocoder processing on speech intelligibility and accent rating. Sentences were recorded from six native English talkers and six foreign-accented talkers who received the poorest intelligibility scores in a pilot experiment. The sentences were processed into 3, 4, 5, and 9 frequency bands and presented to 11 native English listeners. The listeners were required to specify whether the talkers had a foreign accent or not. The results revealed that as the number of frequency bands increased, the accent-detection score decreased for the native talkers but increased for the foreign-accented talkers. This result suggested that listeners required a greater number of channels for accurate accent detection in the non-native accent condition than in the native accent condition.
Instead of using the accent-detection task reported in the Kapolowicz et al.40 study, we adopted an accent rating task in which the listeners were required to rate the degree of accentedness on a Likert scale. This approach enabled us to examine listeners' sensitivity to accent strength. In the present study, we recruited 12 non-native English talkers with varying degrees of foreign accent and divided them into groups based on the severity of the accent. Two types of vocoder processing (i.e., noise-vocoder and tone-vocoder) were used, with the number of channels ranging from 1 to 16. A larger sample (N = 56) of native English-speaking listeners was recruited for the rating task. The main purpose of the present study was to examine the effects of spectral degradation through vocoder processing on accent rating of talkers with varying degrees of foreign accent. We hypothesized that, due to the lack of the full range of spectral and temporal information in vocoded speech as compared to natural speech, listeners would be less sensitive to varying degrees of accentedness in the vocoder conditions than in the unprocessed condition. Additionally, we aimed to determine whether listeners rated the accentedness differently in noise- and tone-vocoded speech. Based on previous findings of differences between tone- and noise-vocoders in talker sex identification, we predicted that the vocoder type might also have an effect on accentedness rating. However, given the distinct acoustic correlates of talker accent and talker sex, it is possible that the effect of vocoder type on talker sex identification might not be reflected in accentedness rating.
The perception stimuli were reading passages recorded from 12 native Mandarin talkers (six males and six females) who learned English as a second language and two native English talkers (one male and one female). The 12 Mandarin-accented English talkers were between 24 and 56 years old [M = 37.3 years; standard deviation (SD) = 10.4 years], with lengths of residency in the U.S. ranging from 0.5 to 27 years (M = 11.3 years, SD = 9.2 years). None of the talkers reported having speech, language, or hearing problems. Each talker was recorded reading the Rainbow Passage41 in a quiet room. The talkers were instructed to read the passage with a regular speaking voice at their normal speaking rate. For each talker, the recorded passage was segmented into 12 sections. All sections had similar durations and were peak-amplitude normalized. Each section included one or two consecutive sentences, and each sentence occurred in only one section. For each talker, 11 sections of the Rainbow Passage were used for the accent-rating task: 10 for the vocoded conditions and one for the unprocessed clean condition.
2.2 Speech processing
The segmented utterances were randomly selected for noise-carrier or tone-carrier vocoder processing as used in Xu et al.14 In brief, the speech signal within the frequency range of 150–5500 Hz was bandpass filtered into 1, 2, 4, 8, and 16 frequency channels based on the Greenwood formula (see Ref. 42 for details). The temporal envelope of each frequency channel was extracted by half-wave rectification with lowpass filtering using a second-order Butterworth filter with the cutoff frequency set at 160 Hz. For the noise-excited vocoder, the extracted envelope was used to modulate bandpass-filtered white noise that was divided into the same number of frequency bands as the speech signal. For the tone-excited vocoder, the extracted envelope was used to modulate sinusoids with frequencies equal to the center frequencies of the frequency bands of the speech stimuli. Finally, the modulated noise bands or sinusoids were summed to generate the noise-vocoded or tone-vocoded speech stimuli. In total, there were 10 vocoded conditions (5 channel conditions × 2 carrier types).
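The channel allocation described above can be sketched as follows. This is a minimal illustration, assuming the standard human cochlear-map constants from Greenwood (1990) (A = 165.4, a = 2.1, k = 0.88); the exact parameter values and filter implementation used in the study (Ref. 42) may differ.

```python
import math

# Standard human cochlear-map constants from Greenwood (1990);
# assumed here, since Ref. 42 specifies the exact implementation.
A, a, k = 165.4, 2.1, 0.88

def greenwood_position(f_hz):
    """Invert the Greenwood map: frequency (Hz) -> relative cochlear position."""
    return math.log10(f_hz / A + k) / a

def greenwood_frequency(x):
    """Greenwood map: relative cochlear position -> frequency (Hz)."""
    return A * (10 ** (a * x) - k)

def channel_edges(n_channels, f_lo=150.0, f_hi=5500.0):
    """Cutoff frequencies for n_channels analysis bands spaced equally
    along the cochlea between f_lo and f_hi (the study's 150-5500 Hz range)."""
    x_lo, x_hi = greenwood_position(f_lo), greenwood_position(f_hi)
    step = (x_hi - x_lo) / n_channels
    return [greenwood_frequency(x_lo + i * step) for i in range(n_channels + 1)]

edges = channel_edges(4)  # e.g., band edges for the 4-channel condition
```

Each adjacent pair of edges defines one analysis band; in the actual processing, the band-limited envelope of each band then modulates a noise band or sinusoid spanning (or centered in) that same band.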
A total of 56 native English listeners (40 females, 16 males) were recruited to participate in the talker accent rating task. The listeners were all monolingual English speakers between 21 and 52 years old (M = 23.8 years, SD = 5.9 years). None of the listeners self-reported as having speech, language, or hearing impairments. All listeners were recruited from the University of Wisconsin-Milwaukee (N = 36) and Ohio University (N = 20). They were compensated with a nominal incentive for their participation. The use of human subjects was reviewed and approved by the institutional review boards of both the University of Wisconsin-Milwaukee and Ohio University.
The accent rating task was carried out in a sound booth using a custom matlab (MathWorks, Natick, MA) program. The presentation order of talkers and vocoded (including the unprocessed) conditions was randomized, but the stimuli from the same talker were presented together in a block. The pseudo-randomized stimuli were presented to listeners through headphones, and the listeners were required to rate the severity of the talkers' foreign accent on a 9-point Likert scale, with "1" representing no accent and "9" representing an extremely strong accent. Prior to the real test, a practice session was carried out with no feedback provided. The practice session had a total of 22 sections, including both noise-vocoded and tone-vocoded conditions as well as unprocessed natural speech, from two different Mandarin-accented English talkers who were not included in the real test. The purpose of the practice session was to help listeners familiarize themselves with vocoded speech and develop rating criteria. In the real test, each listener rated 154 speech sections, composed of 11 sections of the Rainbow Passage from each of the 14 talkers. For both the practice and the real test, the listeners were encouraged to use the full range of the rating scale. Meanwhile, they were instructed to focus on accentedness only and to disregard the intelligibility of the speech samples. For each listener, the rating task lasted ∼1 h.
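The blocked randomization above (talker order randomized, and the 11 conditions shuffled within each talker's block) can be sketched as below. The talker and condition labels are placeholders for illustration, not the actual stimulus files.

```python
import random

def build_playlist(talkers, conditions, seed=None):
    """Randomize talker order; within each talker's block, randomize the
    condition order. All stimuli from one talker stay together in a block."""
    rng = random.Random(seed)
    talker_order = list(talkers)
    rng.shuffle(talker_order)
    playlist = []
    for talker in talker_order:
        block = list(conditions)
        rng.shuffle(block)  # condition order re-randomized per talker
        playlist.extend((talker, cond) for cond in block)
    return playlist

# 14 talkers x 11 conditions = 154 trials, as in the rating task
talkers = [f"T{i:02d}" for i in range(1, 15)]
conditions = [f"{carrier}-{ch}ch"
              for carrier in ("noise", "tone")
              for ch in (1, 2, 4, 8, 16)] + ["unprocessed"]
playlist = build_playlist(talkers, conditions, seed=1)
```

Blocking by talker lets listeners anchor their judgement of a given voice across conditions, while the shuffled condition order within each block guards against order effects.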
Of the 56 listeners, three of them (all females) rated the unprocessed natural speech for both native and foreign-accented talkers with equal scores (e.g., rated all talkers with 1 or 8), which indicated that the rating reliability of these listeners was questionable or that they failed to understand the rating task. Therefore, the three listeners were removed from further analysis. With the data from the remaining 53 listeners, the average rating score of the unprocessed stimuli for each foreign-accented talker across all listeners ranged from 4.9 to 7.5 (shown in Fig. 1). Based on the listeners' ratings of the unprocessed stimuli, a K-means clustering was used to divide the 12 foreign-accented talkers into two accent levels: moderately accented (N = 7) and strongly accented (N = 5). The overall average rating score was 5.53 and 7.22 for the moderately and strongly accented group, respectively, in the unprocessed condition.
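The two-group split can be illustrated with a plain one-dimensional K-means (K = 2) over the per-talker mean ratings. The twelve values below are illustrative scores within the reported 4.9–7.5 range, not the actual data, and the min/max centroid initialization is an assumption of this sketch.

```python
def kmeans_1d(values, iters=100):
    """Minimal 1-D K-means with K = 2; centroids start at the min and max."""
    centroids = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign each value to its nearest centroid
        labels = [min((0, 1), key=lambda j: abs(v - centroids[j]))
                  for v in values]
        # Recompute each centroid as the mean of its assigned values
        new = [sum(v for v, l in zip(values, labels) if l == j) /
               max(1, sum(1 for l in labels if l == j)) for j in (0, 1)]
        if new == centroids:  # converged
            break
        centroids = new
    return labels, centroids

# Illustrative per-talker mean ratings within the reported 4.9-7.5 range
mean_ratings = [4.9, 5.3, 5.5, 5.5, 5.7, 5.9, 6.0, 6.9, 7.1, 7.2, 7.4, 7.5]
labels, centroids = kmeans_1d(mean_ratings)
# labels -> cluster 0 = moderately accented, cluster 1 = strongly accented
```

With these illustrative values the algorithm recovers a 7/5 split with centroids near the reported group means (5.53 and 7.22).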
An average accent rating across all listeners was obtained for each group of talkers in each tested condition. The listeners rated the native talkers with very low accent scores in the unprocessed condition (Fig. 2, left panel). With 1 or 2 channels, regardless of tone- or noise-vocoder conditions, the listeners rated the native talkers with relatively high accent scores. As the number of channels increased, the rating scores decreased and approximated the rating of unprocessed stimuli when the number of channels increased to 16. The listeners rated tone-vocoded signals with slightly higher scores than the noise-vocoded signals for the native talkers in the 1- and 2-channel conditions. When the number of channels increased to four or above, the listeners tended to rate the tone-vocoded signals with lower accent scores than they did for the noise-vocoded signals, but the rating difference between the two types of vocoded stimuli was small. The overall mean rating across all channel conditions was 2.91 for the noise-vocoded signals and 2.87 for the tone-vocoded signals.
For the foreign-accented talkers (Fig. 2, middle and right panels), regardless of the degree of accentedness, the rating scores of both noise-vocoded and tone-vocoded signals were lower than those of the unprocessed signals. Unlike for the native talkers, the listeners rated the foreign-accented talkers, both moderately accented and strongly accented, with slightly higher accent scores for the tone-vocoded stimuli than for the noise-vocoded stimuli. However, it is noteworthy that the rating difference between the two types of vocoded signals was small and was not consistently shown in all channel conditions. The difference in the overall mean rating across all conditions between the noise- and tone-vocoded signals was 0.45 for the moderately accented talkers and 0.29 for the strongly accented talkers. Figure 3 presents the change of accent rating score as a function of spectral resolution with data collapsed between the noise- and tone-vocoded conditions. The talkers with strong accents showed a clear pattern of increased accentedness scores with an increasing number of channels. By contrast, for the talkers with moderate accents, the rating scores did not show considerable change across the five channel conditions.
We used a linear mixed-effects model (LMM) with the factors of accentedness, signal type, and channel condition defined as fixed effects, and listener and talker factors defined as random effects with by-talker intercepts and by-listener intercepts included. The factor of talker accentedness included three levels: no accent, moderate accent, and strong accent. For the factor of signal type, there were three levels: unprocessed, noise-vocoded, and tone-vocoded. For the channel condition, there were six levels: 1-, 2-, 4-, 8-, 16-channel, and infinite (unprocessed). The LMM results yielded significant main effects of accentedness [F(2, 11.3) = 95.78, p < 0.001], signal type [F(1, 8066.1) = 21.44, p < 0.001], and channel condition [F(4, 8066.1) = 21.66, p < 0.001]. Meanwhile, there were significant interaction effects between talker accentedness and signal type [F(2, 8066.1) = 7.17, p < 0.001], talker accentedness and channel condition [F(8, 8066.1) = 79.86, p < 0.001], and signal type and channel condition [F(4, 8066.1) = 5.78, p < 0.001].
Because the accent rating demonstrated distinct patterns among the three accent levels, separate LMMs were conducted for the native, moderately accented, and strongly accented talkers, respectively. For the native talkers, the LMM revealed a significant effect of channel condition [F(4, 1102) = 137.46, p < 0.001] and a significant signal type by channel interaction [F(4, 1102) = 2.44, p = 0.045] but no significant effect of signal type. The post hoc analysis on channel condition revealed no significant difference between the 1- and 2-channel conditions or between the 16-channel and unprocessed conditions. All other comparisons were significant. For the moderately accented talkers, the LMM yielded significant effects of channel condition [F(4, 4012) = 7.23, p < 0.001] and signal type [F(1, 4012) = 50.02, p < 0.001] and a significant type by channel interaction [F(4, 4012) = 2.62, p = 0.033]. The post hoc analysis revealed significant differences between the unprocessed and both types of vocoded signals as well as between the noise-vocoded and tone-vocoded signals (all p < 0.001). For the effect of channel condition, the post hoc analysis revealed a significant difference between the unprocessed and all other channel conditions (all p < 0.001). Among the five vocoded channel conditions, significant differences were found only between the 1- and 2-channel conditions and between the 1- and 16-channel conditions (both p < 0.05). For the strongly accented talkers, the LMM yielded significant effects of channel condition [F(4, 2848) = 57.18, p < 0.001] and signal type [F(1, 2848) = 16.75, p < 0.001] and a significant type by channel interaction [F(4, 2848) = 3.41, p = 0.009]. The post hoc analysis revealed significant differences between the unprocessed and both types of vocoded signals as well as between the noise-vocoded and tone-vocoded signals (all p < 0.001).
For the effect of channel condition, the post hoc analysis revealed significant differences among all channel conditions (all p < 0.05) except between 1- and 2-channel conditions as well as between 2- and 4-channel conditions. The LMM results revealed that the amount of acoustic information as a function of frequency channels had a determining role in accent rating for both native and non-native talkers. Meanwhile, listeners' judgement of accentedness was affected by the vocoder type and this factor interacted with the number of frequency channels.
The first research question focused on whether and how spectral degradation affects listeners' judgement of talkers' accentedness. The results showed that listeners judged foreign-accented talkers, regardless of the talkers' accent levels, as having a less strong accent in the vocoded conditions than in the unprocessed condition. This result echoed the findings reported in previous studies.37,39,40 It is noteworthy that for both native and foreign-accented talkers, the average rating in the 1- and 2-channel conditions was in the range of 4–5, which is located in the middle of the rating scale. This suggested that the listeners' judgement was uncertain and likely amounted to guessing when the spectral resolution was very low. Previous studies reported poor performance in talker sex identification for vocoded speech with a very low number of frequency channels.15–17 This was because the important cues for talker sex identification reside in F0 and vocal tract resonance features, and these acoustic cues were largely undermined by vocoder processing. Besides the cues for talker sex identification, the acoustic information for speech recognition was greatly compromised when the number of spectral channels was very low. Previous studies revealed that the recognition performance for speech segments and sentences was also low in low-channel conditions.14,42–45 With the extremely deprived acoustic information at both segmental and suprasegmental levels, the listeners could hardly make a reliable judgement about the talkers' accents. Gittleman and Van Engen39 reported a similar finding: the foreign-accented talkers were rated as having a less strong accent, while the native talkers were rated as having a stronger accent, in lower signal-to-noise ratio conditions in comparison to quiet conditions.
The authors explained that the uncertainty of obscured signals made listeners “willing to give non-native talkers the benefit of the doubt” while “be less confident that the speech is fully native-like” for native talkers (p. 3142).39 Together with the findings from the current study, adverse sources, such as noise or vocoder processing that result in substantially obscured or distorted acoustic–phonetic cues, have an opposite influence on listeners' judgement of the accentedness for native and non-native talkers.
When the number of spectral channels increased to four or above, the listeners started to show a clear dichotomy of rating for native talkers and foreign-accented talkers and they could better differentiate accent levels (as shown in Figs. 2 and 3). The accent rating scores decreased for native talkers but increased for strongly accented talkers. These findings indicate that listeners require at least 4 channels of spectral information to differentiate the accent between native and foreign-accented talkers. Meanwhile, the spectral degradation affected foreign-accented talkers at varying accent levels in a different manner. For the strongly accented talkers, the accent rating scores consistently increased with the increasing number of frequency channels. By contrast, the accent rating was relatively stable across all vocoded channel conditions for the moderately accented talkers. The relatively stable rating across all channel conditions for the moderately accented talkers might be because the accentedness level of these talkers in the natural condition was rated in the middle of the accent scale. When the number of channels was very low, the listeners were uncertain and rated all talkers, including the moderately accented talkers, in the middle of the scale. As the acoustic information increased, the listeners received more cues to make a reliable judgement about the accentedness and approximated the rating of the natural condition that was also located in the middle range of the scale.
The second question of interest was the potential difference in accent rating between noise- and tone-vocoded signals. As introduced earlier, listeners performed better in talker sex identification with tone-vocoded speech than with noise-vocoded speech.15–17 Researchers proposed that the randomly fluctuating envelope of noise carriers likely distorted the speech envelope, which was less disturbed by tone-vocoder processing. Additionally, noise-vocoder processing did not produce the sidebands of amplitude modulation that tone-vocoder processing did.15,17 In the present study, the statistical analysis revealed a significant difference between the two types of vocoder processing. Meanwhile, there was a significant vocoder type by channel condition interaction for all three talker groups. When the number of frequency bands was very low, the listeners rated the tone-vocoded stimuli with higher accent scores than the noise-vocoded stimuli. As the number of channels increased to eight or above, compared with the noise-vocoded signals, the tone-vocoded signals were rated with higher accent scores for foreign-accented talkers but lower accent scores for native talkers. This pattern indicated that the ratings of the tone-vocoded signals were closer to those of the unprocessed speech when there was sufficient acoustic information for the listeners to make a reliable judgement. One possible explanation is that noise carriers might introduce interfering acoustic information that influenced listeners' judgement of the speech characteristics of the target signals.
The findings of the present study provide valuable insights to our understanding of accented speech perception by CI users. With the limited acoustic–phonetic information conveyed through the implants, CI users may perceive less difference in varying degrees of accents in foreign-accented talkers, depending on their usable spectral channels. Compared to NH listeners who showed increasing difficulties in speech recognition with increasing accentedness of foreign-accented talkers,40 CI users may show less difference in their performance with accented talkers varying in the degree of accentedness, although accented speech in general causes a greater negative impact on them.
This study has several limitations. First, the listeners rated the foreign-accented talkers over a relatively compressed range. The average rating of the unprocessed speech varied from 4.9 to 7.5. Part of the reason might be that the stimuli being judged were relatively short sections segmented from a passage, which did not present all of the accent features the talkers had. Another potential reason is that, although the listeners were instructed to rate on the full scale, some listeners likely avoided the ends of the scale and instead rated within a narrow range. Second, to better control the total length of the rating task, the present study included only two native talkers as experimental controls, who cannot represent the entire native population. Future studies should recruit a greater number of native talkers.
This study was supported in part by the Acoustical Society of America Robert W. Young Award for Undergraduate Student Research in Acoustics.