The present study examined the changes in spectral properties of Mandarin vowels and fricatives caused by nonlinear frequency compression (NLFC) as used in hearing instruments and how these changes affect the perception of speech sounds in normal-hearing listeners. Speech materials, including a list of Mandarin monosyllables in the form of /dV/ (12 vowels) and /Ca/ (five fricatives), were recorded from 20 normal-hearing, native Mandarin-speaking adults (ten males and ten females). NLFC was based on the Phonak SoundRecover algorithms. The speech materials were processed with six different NLFC parameter settings. Detailed acoustic analysis revealed that the high front vowel /i/ and certain compound vowels containing /i/ showed positional deviation in certain processed conditions in comparison to the unprocessed condition. All five fricatives showed changes in spectral features in all processed conditions. Fourteen Mandarin-speaking, normal-hearing adult listeners performed phoneme recognition on stimuli from the unprocessed condition and the six NLFC processing conditions. When the cut-off frequency was set relatively low, recognition of /s/ was detrimentally affected, whereas none of the NLFC processing configurations affected the other phonemes. The discrepancy between the considerable acoustic changes and the negligible adverse effects on perceptual outcomes is partially accounted for by the phonological system and phonotactic constraints of Mandarin.
I. INTRODUCTION
Speech sounds include two categories: vowels and consonants. Acoustically, vowels are primarily characterized by spectral prominences located in frequency regions below 4 kHz. Consonants can be categorized by manner and place of articulation. Among the different manners of articulation, sonorants such as nasals, glides, and liquids have vowel-like formant structures with acoustic energy concentrated in relatively low-frequency regions. Stops are characterized by spectral patterns with the acoustic energy peak located in low- or mid-frequency (around 1–4 kHz) regions. Fricatives are characterized by aperiodic turbulence noise with spectral energy concentrated over a wide range of relatively high-frequency regions. In particular, fricatives with a more anterior constriction have higher spectral peaks. For example, the spectral peaks of the English labiodental fricative /f/ and the interdental fricative /θ/ are around 8 kHz. The alveolar fricative /s/ has a spectral peak around 7 kHz and the palato-alveolar /ʃ/ has a spectral peak around 4 kHz (Jongman et al., 2000; Fox and Nissen, 2005). Normal-hearing (NH) listeners have access to acoustic information over the entire frequency range of vowels and consonants. However, for individuals with sensorineural hearing loss (SNHL) in the high-frequency regions, limited access to high-frequency information leads to poor perception and discrimination of fricatives and affricates (Stelmachowicz et al., 2004; Zeng and Turner, 1990). Previous studies have shown that some individuals with more severe hearing loss have poor access even to the low-frequency regions. In this case, the perception and recognition of vowels and sonorants can also be affected (Molis and Leek, 2011; van Tasell et al., 1987).
To help hearing-impaired individuals gain audibility over a wider range of acoustical information, a variety of hearing aid techniques have been proposed and implemented (see Simpson, 2009; Simpson et al., 2017; Mao et al., 2017). Contemporary frequency-lowering techniques are designed to bring high-frequency information into low-frequency regions so that otherwise inaudible speech information becomes accessible to hearing-impaired listeners. Among the various frequency-lowering strategies, nonlinear frequency compression (NLFC) has been widely used in modern hearing aids and has been investigated in clinical and basic research [e.g., SoundRecover (SR) by Phonak Naída]. NLFC algorithms involve a number of basic parameters that determine the input-output relationship. Specifically, the cut-off frequency divides the entire frequency range into two parts: the frequency components above the cut-off are compressed and those below the cut-off are retained. A lower cut-off means that a wider range of frequencies is compressed. The compression ratio determines the extent to which the frequency components above the cut-off are compressed. A greater compression ratio indicates stronger compression and a greater downward shift in frequency. The input bandwidth represents the maximum frequency range of input signals that is processed by the algorithm. The output bandwidth represents the audible frequency range after NLFC processing. For a given hearing-impaired listener, the output bandwidth is fixed and is determined by the nature of the hearing loss; listeners with more severe hearing loss have narrower output bandwidths. Compared to linear frequency compression, which lowers all frequency components by the same degree, NLFC disproportionally compresses high-frequency components to a greater extent than low-frequency components. Through this signal processing strategy, the inaudible high frequencies are progressively compressed and shifted into adjacent low-frequency regions.
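To make the parameter definitions above concrete, the following sketch implements a simple static input-output frequency map of the kind described here. The power-law (log-domain) compression above the cut-off is an illustrative assumption; the actual SoundRecover mapping is proprietary and is not specified in this paper.

```python
# Minimal sketch of a static NLFC input-output frequency map (illustrative;
# the true SoundRecover mapping is proprietary).

def nlfc_map(f_in, cutoff, ratio, max_output):
    """Map an input frequency (Hz) to an output frequency (Hz).

    Frequencies at or below `cutoff` pass through unchanged; frequencies
    above `cutoff` are compressed toward the cut-off (here, by `ratio` in
    the log-frequency domain) and limited to `max_output`, i.e., the
    output bandwidth.
    """
    if f_in <= cutoff:
        return f_in                               # below cut-off: retained
    f_out = cutoff * (f_in / cutoff) ** (1.0 / ratio)
    return min(f_out, max_output)                 # respect output bandwidth


# Example: a 2136-Hz cut-off, a 1.6 compression ratio, and a 4800-Hz maximum
# output (the SR-1 setting in Table II) move a 7-kHz fricative peak down to
# roughly 4.5 kHz.
print(round(nlfc_map(7000.0, cutoff=2136.0, ratio=1.6, max_output=4800.0)))
```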
The efficacy of NLFC has been tested on the recognition and perception of speech segments, monosyllabic words, and sentences in both adults and children with varying severity of hearing loss (e.g., Alexander et al., 2014; Bentler et al., 2014; Bohnert et al., 2010; Ching et al., 2013; Glista et al., 2009; Hillock-Dunn et al., 2014; McCreery et al., 2014; Wolfe et al., 2010, 2011). Although NLFC preserves more spectral detail than other frequency-lowering schemes, it does not provide the same amount of auditory improvement to all hearing-impaired listeners (McDermott, 2011). Previous studies revealed inconsistent perceptual outcomes with NLFC across different perceptual tasks and different groups of listeners. Some studies reported improved speech-sound detection thresholds and improved consonant and plural recognition with NLFC activated, in comparison to NLFC deactivated or conventional processing (Glista et al., 2009; Hopkins et al., 2014; McCreery et al., 2014; Wolfe et al., 2010, 2011). Other studies reported limited benefit of NLFC for speech recognition in noise (Hopkins et al., 2014) or no positive impact of NLFC on consonant or sentence recognition (Hillock-Dunn et al., 2014; Picou et al., 2015).
Among the studies on NLFC, multiple factors may account for the individual variability and inconsistent results in the recognition of frequency-lowered speech (Mao et al., 2017). Researchers have identified two types of sources: intrinsic and extrinsic factors (Alexander, 2013). Intrinsic characteristics such as the age at fitting, the age of onset of hearing loss, listening experience, and cognitive processing determine listeners' ability to use the perceptual cues available in the processed signals to identify speech. Extrinsic factors involve the configurations of the signal processing strategies, which determine how speech features are modified and how speech information is presented to listeners. Among the parameters of NLFC, input and output bandwidths are positively related to speech recognition accuracy (McCreery et al., 2013; Alexander et al., 2014; Alexander, 2016), and the cut-off frequency influences speech recognition to a greater extent than the compression ratio (Parsa et al., 2013; Souza et al., 2013; Alexander, 2016). Alexander (2016) tested the recognition of consonants, vowels, and high-frequency sounds (fricatives and affricates) in native English listeners with high-frequency hearing loss. Each participant was tested with stimuli processed under six combinations of cut-off frequencies and compression ratios. The author found that both vowel and consonant recognition accuracy decreased with a low cut-off (e.g., 1.6 kHz), especially when the compression ratio was increased. This is because more aggressive NLFC parameter settings cause greater distortion of the spectral features of both vowels and consonants and substantially alter the distribution of acoustic energy along the frequency axis. In addition, more aggressive parameter settings with low cut-offs exert a more negative impact on perceived sound quality (Parsa et al., 2013). Because of these constraints, the minimum cut-off setting is 1.5 kHz, which may still lie beyond the usable frequency range of many people with severe-to-profound hearing loss.
Under this scenario, in order to provide better accessibility to high frequencies (e.g., fricatives and affricates) while simultaneously maintaining the original spectral patterns of low-frequency sounds (e.g., vowels and sonorants) for people with severe hearing loss who require more aggressive parameter settings, an adaptive NLFC [SoundRecover2 (SR2) by Phonak Naída] was developed (Rehmann et al., 2016). In comparison to traditional NLFC, which is a static algorithm with a fixed cut-off frequency and compression ratio, SR2 is an adaptive frequency-lowering algorithm that involves two predefined cut-off frequencies, a lower cut-off (CT1) and a higher cut-off (CT2). The system automatically chooses one of these two cut-off frequencies at any given moment depending on the short-term energy distribution in the input signal. As in the original SR algorithm, the applied compression ratio is predefined and constant, regardless of the value of the cut-off frequency. Figure 1 depicts the basic functional principle of the adaptive SR2 in comparison with the static SR.
FIG. 1. Schematic diagrams of the signal processing for SR (top panel) and SR2 (bottom panel). In SR, all input frequencies below the cut-off frequency (CT) remain uncompressed and all input frequencies above the cut-off are compressed. SR2 is an adaptive algorithm in which the processing depends on the energy distribution of the input signal. When the input signal comprises more low-frequency energy, the higher cut-off frequency (CT2) is active, and frequency compression takes place above CT2 so that the vowel structure of the input signal remains intact. On the other hand, when the input signal comprises more high-frequency energy, compression starts at the lower cut-off frequency (CT1) so that a wider range of high-frequency input is compressed into the lower frequency region.
The adaptive nature of SR2 results in conditionally active frequency compression within the output frequency range between CT1 and CT2 (Rehmann et al., 2016). When most of the energy of the incoming signal is detected in relatively low-frequency regions (e.g., vowel harmonics), the upper cut-off frequency (CT2) applies, and the frequencies below CT2 remain unprocessed. In this case, only input frequencies between CT2 and the maximum hearing aid output are compressed. This "protects" the important low-frequency components of vowel-like structures from being compressed. Conversely, when most of the energy of the incoming signal is detected in relatively high-frequency regions (e.g., fricatives), the lower cut-off frequency (CT1) applies, and all input frequencies above CT1 are compressed. This helps restore the audibility of important high-frequency fricative components within a wider portion of the output frequency range. Thus, in SR2, the upper cut-off frequency (CT2) allows vowel structures to be maintained, whereas the lower cut-off frequency (CT1) enables better access to high-frequency sounds. Taken as a whole, the adaptive nature of SR2 allows a lower cut-off frequency and a weaker compression ratio than SR and might potentially yield better sound quality and an expanded fitting range.
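The adaptive choice between CT1 and CT2 described above can be sketched as a per-frame decision driven by the energy distribution of the input. The split frequency and decision threshold below are illustrative assumptions; the actual SR2 decision rule is not specified here.

```python
import numpy as np

def select_cutoff(frame, fs, ct1, ct2, split_hz=2000.0, threshold=1.0):
    """Choose the active cut-off (Hz) for one short-time frame.

    Compares the energy below and above `split_hz` in the frame's power
    spectrum. If low-frequency energy dominates (vowel-like input), the
    higher cut-off CT2 is returned so that formant structure stays
    uncompressed; otherwise (fricative-like input) the lower cut-off CT1
    is returned so that a wider high-frequency range is lowered.
    `split_hz` and `threshold` are illustrative, not published SR2 values.
    """
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    low_energy = spectrum[freqs <= split_hz].sum()
    high_energy = spectrum[freqs > split_hz].sum()
    return ct2 if low_energy >= threshold * high_energy else ct1
```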
Wolfe et al. (2017) evaluated plural recognition, word recognition, and phoneme perception in English using the traditional NLFC scheme and a prototype of the above-described adaptive NLFC scheme (i.e., SR2). Data from 14 English-speaking children with severe-to-profound hearing loss showed significantly better plural and word recognition with the adaptive NLFC scheme than with the traditional NLFC scheme.
As introduced above, a large number of clinical studies have tested NLFC in the hearing-impaired population. While researchers are embarking on investigations of the newly developed adaptive NLFC in hearing-impaired patients, it is also of great scientific value to understand the acoustic characteristics of the processed speech signals and to evaluate to what extent the acoustic changes affect the perceptual behavior of NH listeners. On the one hand, investigating the acoustic-phonetic features of the processed signals helps us better understand the nature of the auditory input received by listeners. On the other hand, the perceptual performance of NH listeners approximates the best outcome that hearing-impaired listeners could achieve with the assistance of hearing aids, so testing NH listeners can help us understand the perceptual behavior of hearing-impaired listeners. If a parameter setting causes a detrimental effect on certain sounds in NH listeners, listeners with hearing loss are likely to show less accurate recognition of these sounds than of other sounds. Several NLFC studies compared the perceptual performance of hearing-impaired and NH listeners, or tested the perceptual outcomes of NH listeners alone with various NLFC parameter settings, to find the optimal setting that maximizes recognition accuracy (e.g., Alexander et al., 2014; McCreery et al., 2013; Parsa et al., 2013). Detailed acoustic analysis and perceptual testing of NLFC-processed speech materials in NH listeners will help build a solid foundation for future research on the effects of NLFC processing on speech recognition in hearing-impaired listeners.
While most of the current investigations of NLFC have focused on the perception of English speech by native English listeners, this signal processing feature has rarely been investigated in non-English-speaking populations. Mandarin Chinese is a tonal language that shows systematic differences from English. Regarding the high-frequency sounds, compared to the two-way sibilant fricative contrast of /s/-/ʃ/ in English, the Mandarin phonetic system is characterized by a more complex three-way contrast of sibilant fricatives: alveolar /s/, alveolopalatal /ɕ/, and retroflex postalveolar /ʂ/. These fricatives have distinctive patterns of acoustic energy concentration. Specifically, /s/ has its spectral peak in the 7–9 kHz region, /ɕ/ in the 5–8 kHz region, and /ʂ/ in the 3–4 kHz region (Lee et al., 2014; Li, 2008). Improving speech recognition at high frequencies (e.g., fricative perception) is of primary interest for frequency-lowering technology. For these fricatives, more confusion and reduced recognition accuracy can be expected because NLFC algorithms with relatively low cut-offs are likely to shift the distinctive high-frequency spectral prominences down into similar low- or mid-frequency regions. From this perspective, the three-way contrast of Mandarin sibilant fricatives provides a good test case for examining the effects of current frequency-lowering strategies and configurations on the perception of high-frequency sounds. In addition, Mandarin is used by approximately one-fifth of the world's population. It has been estimated that approximately 28.7 million Chinese have a "hearing disability," defined as a pure-tone average (PTA) threshold at 0.5, 1, 2, and 4 kHz greater than 40 dB hearing level (HL) (Sun et al., 2008), many of whom could potentially benefit from the use of hearing aids. However, there is a scarcity of published data on the use of the frequency-lowering techniques of modern hearing aids with the Mandarin-speaking population.
Given the abovementioned rationales, the present study was implemented to examine how clinically-adopted NLFC settings affect acoustic profiles of fricatives and vowels in Mandarin Chinese and to test the extent to which the acoustic changes affect the phoneme recognition in native Mandarin-speaking NH listeners. In addition to the fricatives, the present study also examined the acoustic changes of Mandarin vowels in different NLFC settings because vowels occupy the nucleus position of the syllable and contain a large amount of intelligibility information in Mandarin Chinese (Chen and Chan, 2016). Previous research suggested that certain NLFC settings (low cut-off and high compression ratio) detrimentally impact vowel recognition (Alexander, 2016; Souza et al., 2013). As the adaptive NLFC is a newly developed algorithm and has not been extensively investigated, the present study adopted both versions of NLFC algorithms (SR and SR2) to process the speech materials. This investigation will help inform future evaluations of the speech recognition performance of Mandarin-speaking listeners with hearing impairment using contemporary NLFC techniques.
Undoubtedly, both static and adaptive NLFC schemes change the natural spectral shapes of speech segments. More aggressive parameter settings, with low cut-offs and high compression ratios, subject a wider range of frequencies to stronger processing. Therefore, we expect the spectral features of the speech materials to be substantially modified under aggressive settings. Given the production-perception relationship, we expect the dramatic change of acoustic profiles to be reflected in the recognition performance of NH listeners, who will show more confusion and reduced recognition accuracy with more aggressive settings. When comparing the original and adaptive NLFC, because the adaptive NLFC selectively uses higher cut-offs for low-frequency dominant sounds and lower cut-offs for high-frequency dominant sounds, we expect vowel acoustic features to be less affected, while fricative features will be modified more substantially under the adaptive NLFC, which has a lower cut-off than the static NLFC. Perceptually, the adaptive nature is assumed to minimize the detrimental effect on vowel recognition. However, more confusion in the recognition of fricatives is likely in NH listeners because high-frequency information is shifted down to a lower frequency region under the lower cut-offs of the adaptive NLFC. With regard to the three-way contrast of sibilant fricatives in Mandarin, as Mandarin /s/ and /ɕ/ both have acoustic energy concentrated in similar high-frequency regions above 6 kHz, we hypothesize that NLFC, and especially adaptive NLFC with more aggressive settings, will induce greater acoustic modification for these two fricatives than for the other sibilant fricative /ʂ/. At the same time, the lowering of the spectral prominence will make the acoustic representations of /s/ and /ɕ/ similar to that of /ʂ/. Therefore, more confusion among /s/, /ɕ/, and /ʂ/ is expected.
II. ACOUSTICAL ANALYSIS OF NLFC PROCESSED MANDARIN SPEECH MATERIALS
A. Methods
1. Participants
Twenty NH native Mandarin speakers (ten male and ten female speakers) aged between 18 and 29 yr [Mean = 24.05, Median = 24, standard deviation (SD) = 3.02 yr] were recruited from the Beijing area. All speakers spoke Mandarin Chinese as their first language. Pure-tone audiometry confirmed normal hearing (thresholds ≤15 dB HL) in both ears at octave frequencies between 250 and 8000 Hz. The use of human subjects was reviewed and approved by the Institutional Review Boards of Beijing Tongren Hospital and Ohio University.
2. Speech materials and recording
The speech materials included 12 /dV/ syllables containing the 12 Mandarin vowels and five /Ca/ syllables containing the five Mandarin fricatives. All syllables were produced with Mandarin tone 1. Table I lists the IPA (International Phonetic Alphabet) transcription, Mandarin Pinyin, and Chinese character of each of the 17 monosyllabic words. Each speaker was seated in a sound booth equipped with a high-quality Sony DVCAM recorder (DSR 1800P) and a D3-Edit nonlinear editing system designed for broadcast training. Each speaker was given a printed list of the Chinese characters corresponding to the 17 target syllables and was asked to repeat each syllable five times directly into the recorder. All speech samples were stored on a hard disk at a 44.1-kHz sampling rate with 16-bit quantization. The recorded speech samples were then segmented and saved as individual syllables using the acoustic analysis software CoolEdit 2000 (Syntrillium Software, Scottsdale, AZ).
TABLE I. The word list used to collect speech samples.
| IPA | Syllable (in Pinyin) | Chinese character |
|---|---|---|
| a | dā | 搭 |
| i | dī | 低 |
| u | dū | 督 |
| ɤ | dē | 嘚 |
| aɪ | dāi | 呆 |
| ɑʊ | dāo | 刀 |
| oʊ | dōu | 兜 |
| iε | diē | 爹 |
| uo | duō | 多 |
| iɑʊ | diāo | 雕 |
| ioʊ | diū | 丢 |
| ueɪ | duī | 堆 |
| f | fā | 发 |
| s | sā | 仨 |
| ɕ | xiā | 虾 |
| ʂ | shā | 杀 |
| x | hā | 哈 |
3. Nonlinear frequency compression processing
The NLFC technique SoundRecover (Phonak Naída) was used to process all speech tokens produced by the 20 speakers. In particular, two generations of the SoundRecover technique, denoted SR and SR2, were used to process each token under six conditions (see Table II; the settings are also summarized in the sketch following the table). No gain or receiver adjustments were applied at this stage. For both SR and SR2, the configurations with the lower, middle, and higher cut-offs are applied to patients with profound, severe, and moderate hearing loss, respectively.
TABLE II. The NLFC settings of cut-off frequency (CT) and compression ratio (CR) used for signal processing. SR represents the original NLFC algorithm with a fixed CT and CR. In the adaptive NLFC, represented by SR2, there are two cut-offs: CT1 denotes the lower cut-off and CT2 the higher cut-off. The cut-off in use is determined by the energy distribution of the incoming signal: when the signal has more low-frequency energy, the higher cut-off is active; when more high-frequency content is present, the lower cut-off is used.
| Setting | CT (SR) / CT1 (SR2) (Hz) | CT2 (Hz) | CR | Max. output (Hz) | Used for patients with hearing loss |
|---|---|---|---|---|---|
| SR-1 | 2136 | – | 1.6 | 4800 | profound |
| SR-2 | 3943 | – | 2.5 | 5800 | severe |
| SR-3 | 5825 | – | 3.4 | 6800 | moderate |
| SR2-1 | 1624 | 2966 | 1.3 | 4600 | profound |
| SR2-2 | 3725 | 4965 | 1.3 | 6600 | severe |
| SR2-3 | 5626 | 6449 | 1.8 | 8900 | moderate |
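For reference, the six processing conditions in Table II can be captured in a small data structure, as sketched below; the values are copied from the table and the field names are our own.

```python
# The six NLFC conditions from Table II. For SR, "ct1" holds the single
# cut-off and "ct2" is None; for SR2 the algorithm switches between ct1
# and ct2 adaptively. Frequencies are in Hz.
NLFC_SETTINGS = {
    "SR-1":  {"ct1": 2136, "ct2": None, "cr": 1.6, "max_out": 4800, "loss": "profound"},
    "SR-2":  {"ct1": 3943, "ct2": None, "cr": 2.5, "max_out": 5800, "loss": "severe"},
    "SR-3":  {"ct1": 5825, "ct2": None, "cr": 3.4, "max_out": 6800, "loss": "moderate"},
    "SR2-1": {"ct1": 1624, "ct2": 2966, "cr": 1.3, "max_out": 4600, "loss": "profound"},
    "SR2-2": {"ct1": 3725, "ct2": 4965, "cr": 1.3, "max_out": 6600, "loss": "severe"},
    "SR2-3": {"ct1": 5626, "ct2": 6449, "cr": 1.8, "max_out": 8900, "loss": "moderate"},
}
```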
4. Acoustic measurements
The landmarks of onset and offset of the target vowels and fricatives were located manually using CoolEdit 2000. The vowel onset was set at the zero-crossing point of the first period of voicing following the stop release. The vowel offset was set at the zero-crossing point of the final period of voicing in which both F1 and F2 were clearly present. The fricative onset was set at the beginning of the frication and the fricative offset was set at the point of cessation of the frication.
On the basis of the landmark locations, the frequency values of the first two formants, F1 and F2, were extracted at five equidistant time points (20%, 35%, 50%, 65%, and 80% of the vowel duration) for the vowel productions using the spectrographic analysis program TF32. The formant frequency values were normalized to eliminate the effects of differences in vocal tract size. First, the F1 and F2 values were converted to their respective z-scores following Lobanov's method (Lobanov, 1971). The z-scores were then rescaled back to Hz-like values following Thomas and Kendall (2007) to facilitate comparison and interpretation (see also Yang et al., 2015).
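A minimal sketch of this two-step normalization is given below. It assumes that all samples of a given formant from one speaker are normalized together; the Hz-like rescaling range is an illustrative assumption rather than the exact constants used by Thomas and Kendall (2007).

```python
import numpy as np

def lobanov_normalize(formant_hz):
    """Lobanov (1971) normalization: convert one speaker's values of a
    given formant (e.g., all F2 samples, in Hz) to z-scores relative to
    that speaker's own mean and standard deviation."""
    f = np.asarray(formant_hz, dtype=float)
    return (f - f.mean()) / f.std(ddof=1)

def rescale_to_hz_like(z, low=250.0, high=2500.0):
    """Map z-scores linearly onto a nominal frequency range so that the
    normalized values read as Hz-like numbers. The end points used here
    are illustrative, not the published rescaling constants."""
    z = np.asarray(z, dtype=float)
    return low + (high - low) * (z - z.min()) / (z.max() - z.min())
```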
For the fricative productions, because background noise in the low-frequency regions might interfere with the acoustic analysis, all fricative samples were high-pass filtered at 1000 Hz using a sixth-order Butterworth filter. Then, a set of seven acoustic measures (i.e., spectral peak, rise time, normalized amplitude, spectral mean, spectral variance, spectral skewness, and spectral kurtosis) was obtained using a custom MATLAB program (Lee et al., 2014; Yang et al., 2017). The rise time was defined as the time interval from the onset of frication to the maximum amplitude of the frication noise (Dorman et al., 1980; Mahmoodzade and Bijankham, 2007). The normalized amplitude was defined as the difference in dB between the root-mean-square (RMS) amplitude of the entire frication portion and the RMS amplitude of the entire following vowel (Fox and Nissen, 2005; Jongman et al., 2000). The spectral peak, spectral mean, spectral variance, spectral skewness, and spectral kurtosis were all computed from a 40-ms full Hamming window placed over the middle portion of the frication (Jongman et al., 2000; Lee et al., 2014; Yang et al., 2017).
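The sketch below illustrates how the windowed spectral measures described above could be computed. The exact windowing and normalization conventions of the custom MATLAB routine are not reproduced here; in particular, treating the power spectrum as a probability distribution for the moment calculations is an assumption.

```python
import numpy as np

def fricative_spectral_moments(signal, fs, mid_sample, win_ms=40.0):
    """Compute the spectral peak and four spectral moments from a
    Hamming-windowed segment centered at `mid_sample` (the midpoint of
    the frication). Frequencies are returned in Hz."""
    n = int(round(win_ms / 1000.0 * fs))
    start = max(0, mid_sample - n // 2)
    seg = np.asarray(signal[start:start + n], dtype=float)
    seg = seg * np.hamming(len(seg))
    power = np.abs(np.fft.rfft(seg)) ** 2
    freqs = np.fft.rfftfreq(len(seg), d=1.0 / fs)
    p = power / power.sum()                    # spectrum as a distribution
    mean = np.sum(freqs * p)                   # spectral mean (centroid)
    var = np.sum((freqs - mean) ** 2 * p)      # spectral variance
    skew = np.sum((freqs - mean) ** 3 * p) / var ** 1.5
    kurt = np.sum((freqs - mean) ** 4 * p) / var ** 2 - 3.0
    return {"peak": freqs[np.argmax(power)], "mean": mean,
            "variance": var, "skewness": skew, "kurtosis": kurt}
```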
B. Results
Figure 2 shows the mean vowel formant trajectories of all speakers in the unprocessed and all six processed conditions. The left panel shows the comparison between the original signal and the signals processed with the SR algorithms. Among the three SR settings, the formant trajectories in SR-1 showed greater positional changes from the original formant trajectories than those in SR-2 or SR-3. Among the 12 vowels, the vowel /i/ showed a retraction along the F2 axis. Two other vowels, /iε/ and /aɪ/, also showed positional changes, particularly in the SR-1 setting. For the vowel /iε/, the first half of the vowel deviated to a further back position relative to the original formant trajectory. For the vowel /aɪ/, the second half of the vowel shifted to a further back position than the first half in comparison to the original formant trajectory. The vowel /iε/ starts from a high-front position, whereas the vowel /aɪ/ moves toward a high-front position. Therefore, the influence of SR-1 on the vowel acoustics was manifested in the /i/ target along the F2 axis. The right panel shows the comparison between the original signal and the signals processed with the SR2 algorithms. Similar to SR, the formant trajectories in SR2-1 showed greater positional changes from the original trajectories than those in SR2-2 or SR2-3. The vowel /i/ and the vowels containing the vowel target /i/ showed evident positional changes from the original vowel trajectories.
FIG. 2. Vowel trajectories of the 12 Mandarin vowels in unprocessed and six processed conditions in the F1 × F2 space. The six panels show the group mean formant data (N = 20) of the six processed conditions as indicated in the lower right corner of each panel. The unprocessed vowels are plotted in gray in the background in every panel. Different symbols represent different vowels. The larger symbol of each formant track indicates the ending point of the track.
Among the seven acoustic measures for fricatives, spectral peak and spectral mean have been widely acknowledged as the most important features characterizing the acoustic profile of fricatives (Lee et al., 2014; Li, 2008; Nissen and Fox, 2005). Specifically, the spectral mean describes the global distribution of acoustic energy along the entire frequency range of the spectrum, while the spectral peak specifies the local spectral prominence associated with the location of the articulatory constriction (Jongman et al., 2000; Lee et al., 2014; Fox and Nissen, 2005). Figure 3 shows the scatter plots of spectral peak and spectral mean of the fricative productions in the original and six processed conditions. In the original condition, among the five Mandarin fricatives, /f/ had widely distributed spectral energy and its spectral peak was scattered along the entire frequency range from 1 to 10 kHz. The fricatives /s/ and /ɕ/ had the highest spectral peak and spectral mean, whereas /x/ had the lowest spectral peak and spectral mean. These patterns were consistent with the findings of previous studies of Mandarin fricatives (Lee et al., 2014; Li, 2008). In the processed conditions, spectral means decreased greatly for all five fricatives and showed a greater magnitude of downward shift than spectral peaks. The fricatives /f/, /s/, and /ɕ/ exhibited a clear decrease in the spectral peak. However, the spectral peaks of the fricatives /x/ and /ʂ/ did not show any observable changes in the NLFC-processed conditions relative to the original condition. Among the six processed conditions, the spectral peak and spectral mean in SR2-1 showed the greatest downward shift relative to the original condition. The two measures in SR-3 and SR2-3 were closest to those in the original condition.
FIG. 3. Dispersion of the five Mandarin fricatives defined by the spectral peak (plotted along the x-axis) and spectral mean (plotted along the y-axis) in the unprocessed and each processed condition. Each fricative is plotted in a panel. Each symbol represents data from one token. Different symbols represent the seven different processing conditions.
Statistical analyses were implemented to examine the differences in each acoustic measure among the seven conditions for the individual vowels and fricatives. For the vowel formant values, a two-way repeated-measures analysis of variance (ANOVA) was run for F1 and F2, respectively, with measurement point and condition as the two within-subject factors. For the fricative measures, a one-way repeated-measures ANOVA was used with condition as the within-subject factor. An alpha-level adjustment (Bonferroni correction) was applied for multiple comparisons. As varying formant values at different measurement points were expected for the vowels, especially the compound vowels, only the main effect of condition was of interest and is reported in the present study. Summaries of the vowel and fricative statistics are presented in Tables III and IV, respectively. The results revealed significant differences in F1 for the vowels /i, ioʊ, ueɪ/ (p < 0.001) and significant differences in F2 for the vowels /a, ɤ, i, aɪ, aʊ, iε, uo, ueɪ/ among the seven conditions (p < 0.001). For the fricatives, all acoustic measures except rise time showed significant differences among the seven conditions for all five fricatives (all p < 0.0001 except the spectral peak of the fricative /x/, for which p = 0.007).
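As an illustration, the one-way repeated-measures ANOVA for a single fricative measure could be run as sketched below with statsmodels. The long-format column names are assumptions, and the data are assumed to have been averaged to one value per speaker per condition.

```python
# Sketch of the one-way repeated-measures ANOVA for one fricative measure
# (e.g., the spectral mean of /s/), with "speaker", "condition", and
# "value" as assumed column names in a long-format data frame.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

def condition_effect(df: pd.DataFrame) -> pd.DataFrame:
    """df: one row per speaker x condition, with the measure in 'value'
    (averaged over that speaker's tokens). Returns the ANOVA table with
    the F value, degrees of freedom, and p value for the condition effect."""
    return AnovaRM(data=df, depvar="value", subject="speaker",
                   within=["condition"]).fit().anova_table

# Bonferroni-corrected alpha as used in the paper: 0.05/5 for the five
# fricatives (0.05/12 for the twelve vowels).
ALPHA_FRICATIVES = 0.05 / 5
```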
TABLE III. Summary of statistical results for vowel acoustics. The F, p, and ηp² values for the main effect of parameter setting and subsequent pairwise comparisons between each processed condition and the unprocessed condition are provided. * represents significant results after Bonferroni correction (p = 0.05/12).
| Vowel | F1: F | F1: p | F1: ηp² | F1: pairwise comparison | F2: F | F2: p | F2: ηp² | F2: pairwise comparison |
|---|---|---|---|---|---|---|---|---|
| a | 0.82 | 0.509 | 0.041 | | 5.66 | <0.0001* | 0.229 | SR-1, SR2-1 vs unprocessed |
| ɤ | 1.29 | 0.286 | 0.064 | | 5.97 | <0.0001* | 0.239 | SR-1, SR2-2 vs unprocessed |
| i | 7.80 | <0.0001* | 0.291 | SR-1, SR-2, SR2-2, SR2-3 vs unprocessed | 49.57 | <0.0001* | 0.723 | SR-1, SR-2, SR2-1 vs unprocessed |
| u | 4.77 | 0.014 | 0.201 | | 1.81 | 0.146 | 0.087 | |
| aɪ | 0.21 | 0.870 | 0.011 | | 5.41 | <0.0001* | 0.222 | SR-1, SR-2, SR2-1 vs unprocessed |
| aʊ | 1.16 | 0.336 | 0.057 | | 5.76 | 0.001* | 0.233 | SR-1 vs unprocessed |
| oʊ | 3.95 | 0.008 | 0.172 | | 4.42 | 0.006 | 0.189 | |
| iε | 4.59 | 0.01 | 0.194 | | 8.56 | 0.003* | 0.311 | SR2-2 vs unprocessed |
| iaʊ | 0.51 | 0.647 | 0.026 | | 5.29 | 0.013 | 0.218 | |
| ioʊ | 6.45 | 0.001* | 0.253 | SR-1, SR-2, SR-3, SR2-2, SR2-3 vs unprocessed | 5.30 | 0.004 | 0.218 | |
| uo | 3.25 | 0.027 | 0.146 | | 6.05 | <0.0001* | 0.241 | SR-2, SR-3, SR2-1, SR2-2, SR2-3 vs unprocessed |
| ueɪ | 9.95 | <0.0001* | 0.344 | SR-1, SR-2, SR-3 vs unprocessed | 14.14 | <0.0001* | 0.427 | SR-1, SR2-1 vs unprocessed |
TABLE IV. Summary of statistical results for each acoustic measure of the five Mandarin fricatives. The F and p values for the main effect of parameter setting and subsequent pairwise comparisons between each processed condition and the unprocessed condition are provided. * represents significant results after Bonferroni correction (p = 0.05/5).
| Measure | f /f/: F | f /f/: p | f /f/: pairwise comparison | h /x/: F | h /x/: p | h /x/: pairwise comparison | s /s/: F | s /s/: p | s /s/: pairwise comparison | sh /ʂ/: F | sh /ʂ/: p | sh /ʂ/: pairwise comparison | x /ɕ/: F | x /ɕ/: p | x /ɕ/: pairwise comparison |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| spectral peak | 27.10 | <0.0001* | 5 processed (no SR-3) vs unprocessed | 4.98 | 0.007* | SR2-1 vs unprocessed | 173.97 | <0.0001* | 6 processed vs unprocessed | 67.17 | <0.0001* | SR-1, SR-2, SR2-1, SR2-2 vs unprocessed | 162.21 | <0.0001* | 6 processed vs unprocessed |
| rise time | 0.35 | 0.805 | | 3.13 | 0.039 | | 2.30 | 0.054 | | 1.00 | 0.401 | | 0.79 | 0.547 | |
| normalized amplitude | 50.45 | <0.0001* | 6 processed vs unprocessed | 99.20 | <0.0001* | 6 processed vs unprocessed | 197.61 | <0.0001* | 6 processed vs unprocessed | 134.14 | <0.0001* | 6 processed vs unprocessed | 150.42 | <0.0001* | 6 processed vs unprocessed |
| spectral mean (centroid) | 330.19 | <0.0001* | 6 processed vs unprocessed | 134.29 | <0.0001* | 6 processed vs unprocessed | 711.17 | <0.0001* | 6 processed vs unprocessed | 280.41 | <0.0001* | 6 processed vs unprocessed | 489.30 | <0.0001* | 6 processed vs unprocessed |
| skewness | 222.98 | <0.0001* | 6 processed vs unprocessed | 120.07 | <0.0001* | 6 processed vs unprocessed | 535.85 | <0.0001* | 6 processed vs unprocessed | 237.58 | <0.0001* | 6 processed vs unprocessed | 396.89 | <0.0001* | 6 processed vs unprocessed |
| kurtosis | 35.11 | <0.0001* | 6 processed vs unprocessed | 23.97 | <0.0001* | 6 processed vs unprocessed | 50.01 | <0.0001* | 6 processed vs unprocessed | 24.22 | <0.0001* | 6 processed vs unprocessed | 28.52 | <0.0001* | 6 processed vs unprocessed |
| variance | 112.45 | <0.0001* | 6 processed vs unprocessed | 60.02 | <0.0001* | 5 processed (no SR2-1) vs unprocessed | 140.38 | <0.0001* | 6 processed vs unprocessed | 58.57 | <0.0001* | 5 processed (no SR2-1) vs unprocessed | 70.62 | <0.0001* | 6 processed vs unprocessed |
The acoustic analyses of the Mandarin vowels and fricatives revealed that the NLFC procedure introduced fundamental changes in the spectral profiles of speech sounds. While the changes in vowel formants mainly occurred in the SR-1 and SR2-1 conditions for the vowels containing /i/, almost all spectral features of all Mandarin fricatives were changed. Fricatives are characterized by spectral information located in relatively high-frequency regions, while vowels are characterized by the first three formants, which are normally located below 4 kHz for adult speakers. For certain vowels such as /i/ that have a relatively high F2, or vowels that contain the component /i/, the low cut-off configurations of the SR-1 and SR2-1 conditions changed the acoustic profile to a certain extent. Compared to the fricatives, however, the change in vowel acoustic profiles caused by NLFC was limited. Among the five fricatives, those with relatively low spectral peaks and spectral means (e.g., /ʂ/) were affected less than those with high spectral peaks and spectral means (e.g., /s/ and /ɕ/). Further, as shown in Fig. 4, the five fricatives were less well separated in the acoustic space defined by spectral peak and spectral mean after frequency compression, especially when the cut-off frequency was relatively low. This was because the consonants with high-frequency energy underwent a greater downward spectral shift and thus overlapped spectrally with the consonants whose energy was concentrated at lower frequencies and was less modified by the compression.
FIG. 4. Scatter plots showing the spectral peak (plotted along the x-axis) and spectral mean (plotted along the y-axis) of the five Mandarin fricatives in the unprocessed and six processed conditions. Each symbol represents data from one token. Different symbols represent the five different fricatives. The top panel shows the unprocessed condition and the lower six panels show the processed conditions as indicated in the upper right corner of each panel.
Both vowels and consonants exhibited acoustic changes to some extent. As the purpose of NLFC is to help people with hearing loss gain audibility and improve their hearing ability, it is necessary to understand how the processed stimuli are perceived and to test to what extent the acoustic changes affect perceptual behavior in NH listeners before the scheme is tested in hearing-impaired listeners. The following experiment describes a phoneme recognition test of both the unprocessed and processed conditions with Mandarin-speaking NH listeners.
III. MANDARIN PHONEME RECOGNITION OF NLFC PROCESSED SPEECH MATERIALS
A. Methods
1. Participants
Fourteen native Mandarin-speaking, NH adult listeners (six males and eight females) were recruited to participate in the phoneme recognition experiment. The listeners were aged between 21 and 35 yr (Mean = 25.87, Median = 25, SD = 3.31 yr). None of the participants reported having a speech, language, or hearing impairment.
2. Stimuli
Syllables containing 12 vowels and five fricatives (see Table I) in both unprocessed and six processed conditions (see Table II) were used for phoneme recognition. Vowel tokens from ten randomly selected speakers from the group of 20 speakers and fricative tokens from all 20 speakers were used for the perception experiment. The total number of vowel stimuli was 840 (i.e., 12 vowels × 7 conditions × 10 speakers) and the total number of fricative stimuli was 700 (i.e., 5 fricatives × 7 conditions × 20 speakers).
3. Procedure
The phoneme recognition test was conducted in a sound-treated room using a custom MATLAB program. The phoneme recognition task (including both vowel and fricative recognition) took approximately 90 min for each participant. In both the vowel and the fricative recognition tests, the stimuli were randomized and delivered binaurally to the listener through headphones at a comfortable volume level. Listeners were instructed to choose, from the Chinese characters presented on a computer screen, the word that best matched what they had heard. Each stimulus was played only once and no feedback was provided. A short, 5-min training session was provided before the test to familiarize the participants with the procedure.
B. Results
We found no differences in the recognition accuracy of either vowels or fricatives between tokens produced by male speakers and those produced by female speakers. The results reported below are therefore based on the data collapsed across all speakers. Figure 5 shows the group mean phoneme recognition scores in the unprocessed and six processed conditions. The overall recognition accuracy for vowels and fricatives in the original condition was 97.0% and 97.1% correct, respectively. While the overall recognition accuracy of vowels in all six NLFC processing conditions and that of consonants in four of the six NLFC processing conditions were nearly perfect (all >95% correct), the overall recognition accuracy for fricatives dropped to 88.4% and 86.9% correct in the SR-1 and SR2-1 conditions, respectively.
FIG. 5. Group mean and SD of recognition accuracy of Mandarin vowels (top) and fricatives (bottom) in unprocessed and six processed conditions by Mandarin-speaking NH listeners (N = 14).
Figure 6 shows the confusion matrices of the fricative recognition by the group of 14 NH listeners. The NH listeners recognized all five fricatives with high accuracy in the original, unprocessed condition as well as in the SR-2, SR-3, SR2-2, and SR2-3 conditions. However, in the SR-1 and SR2-1 conditions, listeners showed relatively poor recognition of /s/ because of the misperception of /s/ as /ʂ/ or /f/. The recognition accuracy of /ɕ/ also decreased slightly in these two processed conditions due to the misperception of /ɕ/ as /ʂ/. The recognition accuracy of /ʂ/ decreased slightly in SR2-1 due to the misperception of /ʂ/ as /s/; no noticeable confusion was found between /ʂ/ and the other fricatives in SR-1. Note that listeners tended to respond with /ʂ/ to stimuli from all three sibilants in SR-1 and SR2-1. Because these two conditions had very low cut-off frequencies, the spectral peaks and spectral means of /s/ and /ɕ/ shifted to low-frequency regions and became similar to those of /ʂ/. The three sibilants therefore showed similar spectral patterns and became less distinctive perceptually. Regarding the two non-sibilant fricatives /f/ and /x/, the recognition accuracy of /f/ was above 90% in most conditions and the recognition of /x/ was not affected by the frequency compression.
FIG. 6. Confusion matrices of the five Mandarin fricatives in the unprocessed and each processed condition. The stimulus is represented on the ordinate and the response on the abscissa. Each value in the matrices is the percentage of stimuli identified as a particular fricative.
IV. GENERAL DISCUSSION
The present study examined the change of phonetic-acoustic properties of Mandarin vowels and fricatives as a function of cut-off frequency and compression ratio in NLFC, and how these changes affect the phoneme recognition in Mandarin-speaking NH listeners. Six conditions of cut-off frequency and compression ratio combinations from two generations of NLFC (i.e., SR and SR2) were used to process the Mandarin speech signals (see Table II). The spectral features of the selected Mandarin vowels and fricatives after NLFC processing were compared to those from the original, unprocessed signals. In the perceptual experiment, the unprocessed and processed signals were both presented to Mandarin-speaking NH listeners for identification.
Consistent with our predictions, the acoustic analysis revealed a fundamental change in spectral features for the vowel /i/ and for all fricatives, mainly under the settings with a low cut-off frequency. Spectral changes as a result of frequency compression have been reported in previous studies (e.g., McDermott, 2011). Not surprisingly, as NLFC processing manipulates the speech signal in the frequency domain, the output spectra differ from those of the original signal. Moreover, as the signal was compressed to a greater extent with a lower cut-off frequency, as in the SR-1 and SR2-1 conditions, the spectral features generally deviated more from those of the unprocessed original signals. In contrast to the substantial changes in spectral features, however, phoneme recognition by the NH listeners was not significantly compromised. For vowel recognition, although the acoustic profiles of the vowels /i, iε, aɪ/ were changed (Fig. 2), the recognition accuracy was not affected (Fig. 5). The high recognition accuracy can be partially explained by the small vowel inventory and the absence of a front-central contrast of high vowels in Mandarin Chinese. Unlike some languages that have a crowded vowel space with a large number of vowel sounds, Mandarin Chinese has a small inventory of monophthongal vowel phonemes. Each Mandarin vowel phoneme can be recognized with high accuracy even when it shows a relatively wide range of positional variation. Moreover, Mandarin lacks a phonemic contrast between front and central high vowels. Even though the acoustic properties of /i/ were modified and the vowel shifted to a central position in the acoustic space, listeners still accurately recognized it as an /i/ vowel. It is worth noting that the native-Mandarin-speaking authors as well as the participants in the perceptual experiments all noticed a quality change in certain NLFC-processed tokens that contained the /i/ vowel.
The basic acoustic features of all five Mandarin fricatives were modified to some extent. All five fricatives showed a dramatic downward shift of the spectral mean, indicating that the global spectral shape was altered. Consistent with our expectation, the fricatives /s/ and /ɕ/ showed a substantial downward shift of the spectral peak in comparison to the fricative /ʂ/. In addition, the magnitude of the downward shift of the spectral mean and spectral peak changed as a function of the cut-off frequency. However, the recognition of Mandarin fricatives in all six processed conditions was not greatly affected. Only /s/ in the SR-1 and SR2-1 conditions was relatively poorly recognized, being confused particularly with /ʂ/. Acoustically, both /s/ and /ɕ/ demonstrated acoustic manifestations similar to /ʂ/ due to the compression and downward shift of the high-frequency energy above the cut-off frequency. However, the recognition of /ɕ/ did not show much confusion with /ʂ/ even in the conditions with low cut-offs. We speculate that this may be attributable to the phonological constraints and the vowel contexts following the three sibilants in Mandarin. In particular, Mandarin /s/ and /ʂ/ can both be followed directly by the vowel /a/. When the spectral features of /s/ were modified to resemble those of /ʂ/, listeners misperceived /sa/ as /ʂa/. The Mandarin fricative /ɕ/ is an alveolopalatal consonant that is always followed by the high front vowel /i/ before the nucleus /a/. This different vowel context might have helped listeners distinguish /ɕ/ from /ʂ/ even though the two fricatives had similar acoustic manifestations after frequency compression. It is noteworthy that, other than the misperception of /s/ as /ʂ/, there was a symmetrical confusion pattern between /s/ and /f/. That is, /f/ was mostly confused with /s/ in all conditions and /s/ also showed some confusion with /f/ in all conditions. As shown in Figs. 3 and 4, the similar distributions of spectral peak and spectral mean of the two fricatives after frequency compression may account for this symmetrical confusion pattern between /f/ and /s/.
Besides the phonological constraints, another factor that may explain the high recognition accuracy after frequency compression is the consistency of the F2 onset before and after compression. Formant transitions provide vital perceptual cues for recognizing the place of articulation of stop and fricative consonants (Blumstein and Stevens, 1980; Walley and Carrel, 1983; Whalen, 1981). Although NLFC altered the spectral features of the fricatives, the patterns of the F2 onset of the following vowel were largely maintained, especially in the configurations with relatively high cut-offs. The well-maintained F2 transition might provide a reliable cue to help listeners recognize the fricatives.
Among the six NLFC conditions, the acoustic features of vowels and fricatives were affected to a greater extent in the SR-1 and SR2-1 conditions than in the other conditions. Perceptually, the low cut-offs caused a measurable decrease in fricative recognition accuracy in the NH listeners. Alexander (2016) found that a cut-off of 2.2 kHz with a high compression ratio might adversely affect speech recognition and that the compression ratio was not as important as the cut-off frequency in affecting speech recognition in hearing-impaired listeners. A similar detrimental effect of low cut-offs on speech recognition has also been reported in other studies (Parsa et al., 2013; Souza et al., 2013). In the present study, the SR-1 condition used a low cut-off of 2136 Hz, similar to the low cut-off setting in Alexander (2016). SR2-1 had an even lower cut-off of 1624 Hz for high-frequency dominant sounds. The lower recognition accuracy in the SR-1 and SR2-1 conditions in the NH listeners provides support for the negative effects of low cut-off frequencies on speech recognition shown in previous studies.
On the other hand, when we compared SR and SR2, the settings with higher cut-offs in the two generations of the algorithm (SR-2, SR-3 and SR2-2, SR2-3) did not show significant differences in acoustic properties or recognition accuracy. However, SR2-1 caused greater modification of the acoustic properties of both vowels and fricatives than did SR-1. As specified earlier, SR2 has two cut-offs, only one of which is adaptively used based on the energy distribution of the incoming signal. As fricatives demonstrated less energy in the conditional frequency region, the input frequencies were compressed starting from CT1 (the lower cut-off), which caused greater spectral distortion. Therefore, fricatives processed by SR2-1 showed a greater downward frequency shift and less separation of spectral peak and spectral mean in comparison to SR-1. The greater spectral change resulted in an observably lower overall consonant recognition accuracy (see Fig. 5), significantly reduced recognition accuracy for certain fricatives (/ʂ/ and /ɕ/), and more confusion among the three sibilant fricatives in SR2-1 than in SR-1. These results were consistent with our expectations. In contrast to fricatives, most vowels have spectral energy concentrated in the low frequencies. Thus, higher cut-offs were automatically selected for these vowels, and the original formant structure could be well maintained. However, not all acoustic properties of the vowels were well retained. For the high front vowel /i/ and the compound vowels containing the component /i/, significant backward movement along the F2 axis was observed. That is, the spectral features of F2 for these vowels were significantly changed in SR2-1. We speculate that this is associated with the high F2 of the vowel /i/. Because the spectral energy in the F2 region of /i/ was sufficiently high, the system might fail to detect acoustic energy in the conditional frequency region, which might cause the algorithm to automatically select the lower cut-off (CT1) for this vowel. Consequently, the very low cut-off of 1624 Hz in SR2-1 produced a spectral prominence in the low-frequency region that had been shifted from a higher frequency region, which was realized as a retracted F2 in the vowel space (Fig. 2, top right panel). Note that SR-1 had a cut-off of 2136 Hz, which is close to the F2 values of the vowel /i/ in normal adult speakers (approximately 2300 Hz for adult males; Peterson and Barney, 1952). Therefore, compression starting from 2136 Hz in SR-1 caused only a slight retraction of F2, in comparison to the significant retraction of F2 in SR2-1 (Fig. 2, top panels).
Among the six configurations, SR-1 and SR2-1 caused some misperception of /s/ as /ʂ/ in the NH listeners. It is noteworthy that these NLFC settings are appropriate for listeners with moderate to profound hearing loss. Such hearing-impaired listeners typically have little, if any, audibility of high-frequency acoustic information, and many are not able to hear the fricatives /s/ or /ɕ/ at all. When applying NLFC to listeners with high-frequency hearing loss, lower cut-offs and higher compression ratios would help the patients gain access to a wider range of auditory information. Under this scenario, even though low cut-off configurations such as SR-1 and SR2-1 may cause greater modification of the acoustic features of certain speech sounds and some perceptual confusion of certain fricatives in NH listeners, these configurations may help listeners with high-frequency hearing loss gain audibility of fricatives, which might be beneficial to them. Recently, Alexander and Rallapalli (2017) showed that NLFC processing resulted in better English fricative recognition in a group of listeners with mild to moderately-severe SNHL.
In summary, the acoustic profiles of the fricatives were greatly modified after NLFC, whereas the vowel acoustic features were relatively well maintained. The cut-off frequency played a key role in determining the magnitude of the modification of the acoustic features. For both SR and SR2, Mandarin vowels and fricatives showed the greatest changes in their acoustic profiles when the cut-offs were lower than 3 kHz. Perceptually, the NLFC configurations did not lead to noticeable confusion in the NH listeners except for those with the lowest cut-offs in both the SR and SR2 conditions. The discrepancy between the considerable acoustic changes and the negligible adverse effects on perceptual outcomes could be partially accounted for by the phonological system and phonotactic constraints of Mandarin Chinese. The negligible perceptual confusion in the NH listeners supports the clinical use of NLFC in people with hearing loss. It is worth noting, however, that results derived from NH listeners should be applied to hearing-impaired listeners with caution. NH listeners typically have intact spectral resolution and do not require the amplitude compression applied in clinical hearing aid fittings. These factors may result in different patterns of speech recognition and sound quality ratings between normal-hearing and hearing-impaired listeners. A future study evaluating the efficacy of static and adaptive NLFC on speech recognition and sound-quality ratings in Mandarin-speaking hearing-impaired listeners is warranted. Moreover, when testing NH listeners, one direction for future study would be to add speech-shaped noise to the phoneme recognition tasks. The low-frequency emphasis of speech-shaped noise can mask formant transitions and at the same time help avoid the ceiling effect seen in the present study.
ACKNOWLEDGMENTS
The research was supported in part by a research contract provided by Phonak, LLC.