As well as background noise and reverberation, speaker-to-listener relative location affects the binaural speech transmission index (BSTI) considerably, especially in the near field. To highlight how speaker location influences the BSTI, binaural room impulse responses measured in a low-reverberation listening room are used to obtain the BSTI indirectly and analyze its near-field dependence on distance and direction. The results show that the BSTI based on the better-ear rule is higher when the virtual speaker is located laterally rather than in the anterior or posterior. When the distance-dependent intensity factor is introduced, the distance is the dominant factor, not the azimuth.
1. Introduction
Speech intelligibility (SI) is an important metric for predicting the loss of speech information transmitted between speakers and listeners in a room. In practice, SI is an objective physical quantity influenced by (i) the room acoustic conditions, including background noise [or signal-to-noise ratio (SNR)] and reverberation;1 (ii) the target speaker's characteristics, such as directivity and frequency response;2 and (iii) the binaural effect (BE), such as head shadows (HSs) and binaural interactions.3,4 Because of the significant effect of binaural cues attributed to HSs and binaural interactions on speech perception, binaural SI (BSI) tends to be better than single-channel SI without accounting for head-induced effects. In particular, BSI has been used recently to deal with spatial masking for multi-speakers5,6 and speech reception by people with impaired hearing.7,8
Moreover, in association with the BE, the azimuth and distance of the speaker significantly influence how the target speech is perceived. Accordingly, some studies have investigated how BSI is influenced by speaker direction and distance relative to the listener in the far field (FF),9–11 where the speaker is usually located outside the reverberation radius and thus the room acoustic conditions are more important to BSI than is the BE. However, if the speaker is located in the near field (NF) [i.e., at a distance of no more than 1.0 m, following the definition used for the NF head-related transfer function (HRTF)] or the listener is within the reverberation radius, then the detrimental effect of reverberation on BSI is attenuated.1 In other words, compared with the FF situation, the HS effect is more pronounced in the NF while the detrimental effect of reverberation is almost nonexistent, so BSI changes more markedly with speaker location. The same HS effect is also observed in the NF HRTF.12
For a steady indoor acoustic environment that can be treated as a linear time-invariant system, the speech transmission index (STI) proposed by Houtgast and Steeneken13 is used to evaluate objectively how the transmission system affects SI, and the indirectly measured STI14,15 allows SI to be evaluated more conveniently and effectively. The indirect method requires only the room impulse responses (RIRs) to be measured and avoids time-consuming measurements involving modulated speech signals, which makes it convenient and repeatable. Correspondingly, the binaural STI (BSTI) model, which accounts for the BE in speech perception, has been proposed to improve the predictive accuracy of BSI.16 Further, when performing BSTI measurements using an artificial head, the recommended approach is to use the STI results for the better ear, i.e., selecting the better (larger) value from the pair of STIs.15
As is well known, the general rule for BSTIs with a spatialized speaker is that the ipsilateral BSTI is larger than the contralateral one, corresponding to the STI of the better ear. However, comprehensive understanding and quantitative results regarding BSTIs are lacking in the entire NF region. This motivates the present systematic study of how speaker location influences the BSTI, especially how it depends on distance and direction between speaker and listener in the NF. The present work helps to bridge the gaps in BSTI research for speakers located in different spatial regions, and it is important regarding the comprehensive assessment of spatial masking and speech reception for speakers in different locations from the NF to the FF.
To clarify the quantitative influence of the BE on BSTIs with different speaker locations in the NF, the effect of reverberation must be eliminated by enhancing sound absorption and weakening boundary reflections. Therefore, the binaural RIRs (BRIRs) are measured in a listening room with a low reverberation time. Then, following the calculation procedure of the indirect STI method,15 the SNRs based on the speech spectrum transmission characteristics and stationary (i.e., pink) noise are simulated by using the measured BRIRs, whereupon the BSTIs are assessed indirectly. Additionally, to distinguish the roles of the listener's BE and the distance-dependent intensity variation in BSTIs, both constant-intensity and variable-intensity speech signals are simulated.
2. Method
2.1 Indirect method of BSTI
The STI is based on the fact that the degradation of SI is related to the reduction in temporal modulation caused by the transmission system, which can be described by the modulation transfer function (MTF). From the MTF, the apparent SNR can be calculated, whereupon the STI can be derived as the weighted sum of the apparent SNR in different frequency bands.15 The MTF is evaluated at 14 modulation frequencies $F_l$, ranging from 0.63 to 12.50 Hz, and in seven octave bands with center frequencies $f_k$, ranging from 125 Hz to 8 kHz. The traditional (direct) method for determining the MTF is based on test signals modulated sinusoidally in intensity, but that method is time consuming because all 98 (7 × 14) MTF values must be measured separately. To reduce measurement time and enhance repeatability, Schroeder14 developed a method based on a single impulse response measurement (the indirect method).
When the virtual speaker (VS) is located at distance $r$ from the listener's head center and azimuth $\theta$ relative to the listener's front, the MTF resulting from reverberation is denoted by $m_{\mathrm{rev}}(F_l, f_k; r, \theta)$ and expressed as the Fourier transform of the squared BRIR, normalized by the total energy of the impulse response,
$$m_{\mathrm{rev}}(F_l, f_k; r, \theta) = \frac{\left| \int_0^{\infty} h_k^2(t; r, \theta)\, e^{-j 2 \pi F_l t}\, dt \right|}{\int_0^{\infty} h_k^2(t; r, \theta)\, dt}, \tag{1}$$
where $k$ is the octave-band sequence number, $l$ is the modulation-frequency sequence number, and $t$ is time in seconds. The transmission function $h_k(t; r, \theta)$ is obtained from the BRIR through octave filtering at center frequencies $f_k$. Because the influence of noise (i.e., SNR) on SI is not normally included in the BRIR, the additional MTF $m_{\mathrm{SNR}}(f_k; r, \theta)$, which is independent of the modulation frequency, is calculated as
$$m_{\mathrm{SNR}}(f_k; r, \theta) = \frac{1}{1 + 10^{-\mathrm{SNR}_k(r, \theta)/10}}. \tag{2}$$
If the speech signals of the VS arriving from various azimuths and distances are obtained by convolving the monaural speech samples with the corresponding BRIRs, then the octave-band SNR, $\mathrm{SNR}_k(r, \theta)$, is computed [Eq. (3)] from the intensity factor $Q$, the octave-band sound pressure level, and the speech-signal length $T_0$.
Here, the intensity factor $Q$ is either a constant $Q_0$ (corresponding to a constant-intensity speech signal) or a function of the distance $r$ (corresponding to a variable-intensity speech signal whose level rises as the VS approaches the listener), the sound pressure level is taken in the octave band with center frequency $f_k$, and $T_0$ is the length of the speech signal. The combined MTF is calculated as the product of $m_{\mathrm{rev}}$ and $m_{\mathrm{SNR}}$ to include the combined effect of noise and reverberation. Herein, the detrimental effect of reverberation in the listening room with acoustic absorption treatment is neglected (see Sec. 2.2 for the specific measurement environment), so only the SNR-dependent term $m_{\mathrm{SNR}}$ is considered.
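As a rough illustration of the two MTF terms described above, the following Python sketch evaluates the reverberation term as the normalized Fourier transform of a squared impulse response (Schroeder's relation) and the noise term from an octave-band SNR using the standard form 1/(1 + 10^(−SNR/10)). The function names, the sampled-sum approximation of the integrals, and the assumption that the input response has already been octave filtered are illustrative choices, not the authors' implementation.

```python
import cmath
import math

# The 14 modulation frequencies (Hz), in one-third-octave steps from 0.63 to 12.5 Hz.
MOD_FREQS_HZ = [0.63, 0.8, 1.0, 1.25, 1.6, 2.0, 2.5,
                3.15, 4.0, 5.0, 6.3, 8.0, 10.0, 12.5]


def mtf_reverb(h, fs, mod_freq):
    """Reverberation MTF of an (assumed octave-filtered) impulse response.

    h        -- list of impulse-response samples
    fs       -- sampling rate in Hz
    mod_freq -- modulation frequency F_l in Hz
    """
    energy = sum(x * x for x in h)
    if energy == 0.0:
        raise ValueError("impulse response has zero energy")
    # Discrete approximation of the Fourier transform of h^2(t) at mod_freq.
    acc = sum(x * x * cmath.exp(-2j * math.pi * mod_freq * n / fs)
              for n, x in enumerate(h))
    return abs(acc) / energy


def mtf_noise(snr_db):
    """Noise MTF for a given octave-band SNR in dB (no modulation-frequency dependence)."""
    return 1.0 / (1.0 + 10.0 ** (-snr_db / 10.0))


def mtf_combined(h, fs, mod_freq, snr_db):
    """Combined MTF: product of the reverberation and noise terms."""
    return mtf_reverb(h, fs, mod_freq) * mtf_noise(snr_db)
```

For a response that is effectively a single impulse, `mtf_reverb` returns 1 at every modulation frequency, so the combined MTF reduces to the noise term alone, which mirrors the simplification adopted here for the low-reverberation listening room.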
For each azimuth θ and distance r, a 7 × 14 matrix of MTF values is obtained and then converted into apparent SNRs. Each apparent SNR is clipped to the range from −15 to 15 dB and mapped linearly onto a transmission index (TI) ranging from 0 to 1. The modulation transfer index (MTI) of each octave band is then obtained as the average of the TIs over the 14 modulation frequencies. Finally, the STI is calculated as the weighted sum of the MTIs, with weighting coefficients of 0.13, 0.14, 0.11, 0.12, 0.17, 0.19, and 0.14 for the octave bands from 125 Hz to 8 kHz, respectively.15
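The conversion from the 7 × 14 MTF matrix to a single STI value can be sketched as follows. This is a minimal Python illustration under stated assumptions: the apparent SNR is taken as 10 log10[m/(1 − m)], which is the standard STI conversion but is not written out in the text; the octave weights are those listed above; and the function names and the small clamp that avoids taking the logarithm of zero are hypothetical conveniences.

```python
import math

# Octave-band weights for 125 Hz to 8 kHz, as listed in the text.
WEIGHTS = [0.13, 0.14, 0.11, 0.12, 0.17, 0.19, 0.14]


def apparent_snr_db(m):
    """Apparent SNR (dB) for one MTF value m, assumed to lie in [0, 1]."""
    m = min(max(m, 1e-6), 1.0 - 1e-6)  # clamp so the logarithm stays finite
    return 10.0 * math.log10(m / (1.0 - m))


def transmission_index(m):
    """Clip the apparent SNR to [-15, 15] dB and map it linearly onto [0, 1]."""
    snr = max(-15.0, min(15.0, apparent_snr_db(m)))
    return (snr + 15.0) / 30.0


def sti_from_mtf(mtf):
    """STI from a 7 x 14 matrix: rows are octave bands, columns modulation frequencies."""
    if len(mtf) != len(WEIGHTS):
        raise ValueError("expected one row per octave band")
    # MTI per octave band: mean TI over the modulation frequencies.
    mti = [sum(transmission_index(m) for m in row) / len(row) for row in mtf]
    # STI: weighted sum of the MTIs.
    return sum(w * x for w, x in zip(WEIGHTS, mti))


# Example: an MTF of 0.5 everywhere corresponds to an apparent SNR of 0 dB.
print(sti_from_mtf([[0.5] * 14 for _ in range(7)]))
```

Because the listed weights sum to 1, a uniform MTF of 0.5 yields an STI of 0.5 (up to floating-point rounding), the midpoint of the scale.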
2.2 Speech and noise materials
The BRIRs were measured in a listening room with background noise of less than 30 dBA and a reverberation time of less than 0.15 s after sound absorption and isolation treatment. Previous work has shown that binaural responses recorded with a Knowles Electronics Manikin for Acoustic Research (KEMAR) can achieve SI comparable to that recorded with an individual subject.17 Therefore, the present binaural signals were recorded on a KEMAR. The BRIRs were measured at speaker distances of 0.2, 0.25, 0.3, 0.4, 0.5, 0.75, and 1.0 m. At each distance, measurements were conducted at 72 azimuths from 0° to 355° at an interval of 5° in the horizontal plane. The azimuths of 0°, 90°, 180°, and 270° represent the directions of front, right, rear, and left, respectively. Also, an appropriate time window was applied to the BRIRs to further eliminate the influence of room reflections, and the BRIRs were equalized so that the sound intensity at the position of the KEMAR head center was the same at each speaker distance.
The speech sample was obtained by generating a pink noise signal and then filtering it to shape its spectrum according to GB/T 7347–1987, the standard spectrum of Chinese speech.18 The VS's binaural speech signal is obtained by convolving the speech sample with the KEMAR BRIRs measured at the location that corresponds to the position of the VS relative to the listener. To distinguish the roles of the listener's BE and the distance-dependent intensity variation in BSTIs, two groups of speech signals are considered, namely, those with constant intensity and those with variable intensity. For the variable-intensity group, a distance-based compensation is first applied to the speech sample, and the compensated speech sample is then convolved with the BRIRs; for the constant-intensity group, the speech sample is convolved directly with the BRIRs. In this study, pink noise was used as stationary noise with constant relative magnitude in each octave band. With the binaural speech signal containing the spatial information and the stationary pink noise, the BSTIs were calculated using the impulse-response-based indirect method with the VS located at different distances and azimuths.
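The two signal groups can be sketched in Python as follows. This is a minimal illustration, not the authors' processing chain. The compensation law is not specified in the text, so the sketch assumes an amplitude gain of `r_ref_m / r_m` (inverse-distance pressure, i.e., inverse-square intensity) relative to a hypothetical 1.0 m reference. The function names are likewise illustrative, and the levels are broadband rather than per octave band.

```python
import math


def convolve(x, h):
    """Direct-form linear convolution of two sample lists."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def level_db(sig):
    """Mean-square level of a signal in dB (arbitrary reference)."""
    return 10.0 * math.log10(sum(v * v for v in sig) / len(sig))


def binaural_signal(speech, brir_left, brir_right, r_m=None, r_ref_m=1.0):
    """Left and right ear signals for one VS position.

    r_m=None  -> constant-intensity group (speech convolved directly).
    r_m given -> variable-intensity group; the gain r_ref_m / r_m is an
                 ASSUMED inverse-distance law, not taken from the text.
    """
    gain = 1.0 if r_m is None else r_ref_m / r_m
    scaled = [gain * v for v in speech]
    return convolve(scaled, brir_left), convolve(scaled, brir_right)


def snr_db(signal, noise):
    """Broadband SNR in dB; the method in the text works per octave band."""
    return level_db(signal) - level_db(noise)
```

Under the assumed law, moving the VS from 1.0 m to 0.5 m raises the signal level by about 6 dB in the variable-intensity group, whereas the constant-intensity group is unchanged apart from the differences between the BRIRs themselves.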
3. Results and discussion
To reveal how the BE influences the BSTI, Eq. (3) was used to analyze the SNR in seven octave bands as well as the overall SNR of the right ear when the VS was located at different azimuths and distances, based on constant-intensity speech signals and pink noise. As shown in Figs. 1(a)–1(g), the SNR decreased gradually as the VS was moved from ipsilateral to contralateral regarding the target ear, and the variation increased with frequency but decreased slightly with distance. For instance, in the 8-kHz octave band [Fig. 1(g)], the variation between ipsilateral and contralateral exceeded 40 dB at 0.2 m, but the difference between 1.0 and 0.2 m was only around 11 dB. The larger variation in higher bands reveals that the SNR variation due to changing the location of the VS is related mainly to the reduction of high-frequency sound due to the HS for the contralateral VS, especially in the NF; in other words, HSs act mainly on high-frequency sound. The overall SNR [Fig. 1(h)] changes markedly with distance in the NF, but the variation weakens gradually with increasing distance.
Note that in the 8-kHz octave band, the SNR curve has a dip at certain azimuths [Fig. 1(g)]. Therefore, the relative magnitude spectrum of the BRIRs in the 8-kHz octave band was analyzed at different azimuths [Fig. 1(i)]. Taking r = 1.0 m as an example, the BRIR spectral amplitude at 6–9 kHz is relatively low around these azimuths, which is related to auricular reflection. Also, when the speaker is located contralateral to the right ear (i.e., on the listener's left side), the spectral amplitude at the right ear is not the smallest in Fig. 1(i); this corresponds to the slight SNR peak on the contralateral side in Fig. 1(h) and is attributed to the so-called “contralateral bright-spot effect,” which arises from the interference of sound waves arriving from the contralateral speaker and can also be found in the NF HRTF.12
Furthermore, the BSTIs of the left, right, and better ear were calculated and analyzed. To express clearly the opposite trends of the BSTI for ipsilateral and contralateral VSs, the BSTI for the front VS (at a distance of 1.0 m and an azimuth of 0°) was kept at an intermediate value near 0.5 by setting the SNR to around 0 dB. These discrete BSTIs were then used to draw contour maps by linear interpolation. Figures 2(a)–2(c) show the BSTI contour maps of the left, right, and better ear for the VS located at different distances and azimuths, based on constant-intensity speech signals.
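The link between an SNR of about 0 dB and a BSTI near 0.5 follows from the indirect STI procedure of Sec. 2.1. As a worked check, assume that reverberation is neglected, that the noise-only MTF takes its standard form, and that the SNR is exactly 0 dB in every octave band (an idealization, since the actual band SNRs differ):

```latex
\begin{align*}
m &= \frac{1}{1 + 10^{-0/10}} = 0.5, \\
\mathrm{SNR}_{\mathrm{app}} &= 10 \log_{10} \frac{m}{1 - m} = 0\ \mathrm{dB}, \\
\mathrm{TI} &= \frac{\mathrm{SNR}_{\mathrm{app}} + 15}{30} = 0.5, \\
\mathrm{STI} &= \sum_{k=1}^{7} w_k \cdot 0.5 = 0.5
  \quad \text{since } \sum_{k=1}^{7} w_k = 1 .
\end{align*}
```

Here $w_k$ denotes the octave-band weights of Sec. 2.1 and $\mathrm{SNR}_{\mathrm{app}}$ the apparent SNR; both symbols are introduced only for this check.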
Figures 2(a) and 2(b) show that the BSTIs of the left and right ear change significantly with VS azimuth and distance. With decreasing distance, the variation of the BSTI with azimuth increases to a maximum of 0.66 at 0.2 m, from an ipsilateral BSTI of 0.79 to a contralateral BSTI of 0.13. Because of physiological symmetry, the BSTIs do not differ appreciably between the left and right ear; i.e., the BSTIs of the two ears exhibit approximately symmetrical behavior. In addition, the BSTI of the ipsilateral ear is always larger than that of the contralateral ear; i.e., to some extent, the better ear is equivalent to the ipsilateral ear, which is consistent with Ref. 16. The BSTI for the ipsilateral VS increases with decreasing distance, whereas the trend is exactly the opposite for the contralateral VS. These trends could be due to ipsilateral reflection (from both the head and the ipsilateral ear of the listener) and the contralateral HS, respectively.
Under the better-ear rule, namely, selecting the larger value from each pair of BSTIs for the left and right ear, as shown in Fig. 2(c), the BSTI changes only slightly with distance except in specific directions, whereas the BE causes it to change significantly with azimuth. The BSTI is larger when the VS is located laterally rather than in front of or behind the listener, which is attributed to stronger reflection from the head and ipsilateral ear of the listener with a lateral VS and to the HS effect with an anterior or posterior VS, especially in the NF; this is similar to previous research.10,11 When the VS is located laterally (i.e., directly to the right or left of the listener), the BSTI increases with decreasing distance, whereas the trend is the opposite in the anterior and posterior regions, which is caused by the stronger HS effect with a closer VS.
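The better-ear selection itself is a pointwise maximum over the two single-ear maps. A minimal Python sketch, with a hypothetical function name and a dictionary layout chosen only for illustration:

```python
def better_ear(bsti_left, bsti_right):
    """Better-ear BSTI: the larger of the left and right values at each position."""
    return {pos: max(bsti_left[pos], bsti_right[pos]) for pos in bsti_left}


# Illustrative values only. Keys are (distance in m, azimuth in degrees).
# 0.79 and 0.13 are the ipsilateral and contralateral extremes quoted in the
# text at 0.2 m; assigning them to the 90 and 270 degree entries ASSUMES
# left-right symmetry and the convention 90 = right, 270 = left.
left = {(0.2, 90): 0.13, (0.2, 270): 0.79}
right = {(0.2, 90): 0.79, (0.2, 270): 0.13}

print(better_ear(left, right))
```

With these illustrative inputs, the better-ear value is 0.79 at both lateral positions, which shows why the better-ear map varies far less between the two sides than either single-ear map does.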
To highlight these trends in the contour maps, we drew two guidelines [Fig. 2(c)], namely, the contour lines for the BSTI values of 0.75 (purple dashed line) and 0.50 (purple dot-dashed line), above which the BSTI is considered excellent or acceptable, respectively. Note that, even under the better-ear rule, the BSTIs for a VS located in the front or rear region do not become fully acceptable with decreasing VS distance (when the speech power is not increased, so that the intensity at the listener remains constant), which leaves poor regions below the 0.50 guideline. The rear poor region is much larger than the front poor region, which is caused by the listener's auricles masking the high-frequency speech power of the rear VS.
The above discussion concerned BSTIs with a spatialized VS whose speech-signal intensity did not increase as the VS approached the listener, i.e., the intensity at the listener was kept constant, which is easy to implement by convolving the unscaled speech sample with the equalized BRIRs. However, in practice, the intensity perceived by the listener usually varies with the distance from the VS, which can be realized by convolving the distance-compensated speech sample with the equalized BRIRs. Therefore, Figs. 2(d)–2(f) show the BSTIs of the spatialized VS when a distance-dependent intensity change is introduced. As before, two guidelines are added to the contour maps. Unlike the constant-intensity results, when the distance-dependent intensity factor for the speech sample is introduced, the BSTIs increase more markedly with decreasing distance for both the ipsilateral and contralateral VSs [Figs. 2(d) and 2(e)]. Owing to the combination of distance-dependent intensity changes and the BE (i.e., the HS effect), the BSTIs for a closer VS are higher overall than their constant-intensity values. For instance, the region of excellent BSTI (above the 0.75 guideline) is extended and the region of poor BSTI (below the 0.50 guideline) is reduced [Fig. 2(f)]. As shown in Fig. 2(f), when the distance-dependent intensity factor is introduced, closed, roughly circular contour lines appear that represent different BSTI values, each with an indentation in both the anterior and posterior regions. In other words, when distance-dependent intensity changes are accounted for, they dominate the change in the BSTI with speaker location. In addition, similar to the constant-intensity group, the BSTI values are again lower in the front and rear regions than in the lateral regions.
4. Conclusion
In the present work, the BSTIs of the left, right, and better ear were calculated and analyzed for a speaker located at different azimuths and distances, using the measured NF BRIRs together with constant-intensity and variable-intensity speech signals. The results indicate that (i) the BSTIs of the left and right ear decrease significantly when the VS is moved from ipsilateral to contralateral of the target ear, with the variation decreasing with distance, and (ii) the BSTIs based on the better-ear rule are higher when the VS is located laterally rather than in front of or behind the listener. When no distance-dependent intensity factor is accounted for, the BSTI changes only slightly with distance, except in specific directions, but significantly with azimuth. In addition, the BSTI increases with decreasing distance for a lateral speaker but does the opposite for an anterior or posterior speaker. However, if distance-dependent intensity changes are accounted for, then the BSTIs increase further and rise with decreasing distance at all azimuths, in combination with the NF effect (i.e., the HS effect). Furthermore, the distance-dependent intensity changes dominate the variation of the BSTI with speaker location when the intensity factor is accounted for.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant No. 11574090) and the Natural Science Foundation of Guangdong Province (Grant No. 2018B030311025).