Sound masking can diminish the performance impairment due to background speech in open-plan offices. This paper compares a steady-state masking sound with the spectrum of the disturbing speech signal to a time-reversed speech masker. As part of a laboratory experiment subjects have to complete a digit span task and a questionnaire. Both masking sounds improve the number recall performance as compared to unmasked speech. When the speech-to-noise ratio is reduced, the error rates decrease only during stationary sound masking. Sound masking with time-reversed speech increases the speech privacy at higher speech-to-noise ratios but it is perceived as more annoying.
1. Introduction
Sound masking in combination with a high level of room absorption and high sound screens can cover disturbing background sounds and improve the speech privacy within open-plan offices.1 However, artificial sound masking is often perceived as annoying and the occurring background speech levels are too high to be masked efficiently. This paper compares two masking sounds with different temporal characteristics and resulting speech intelligibility with regard to their effect on working memory performance and annoyance perception.
Exposure to background sounds with sufficient variation in time and frequency impairs working memory performance substantially.2 Background speech affects working memory performance more than speech-like noise with the same temporal-spectral characteristics and is perceived as more annoying regardless of whether listeners hear the speech-like noise as speech or non-speech.3,4 This outcome is attributed to the more complex acoustic constituents of natural speech. Meaningful irrelevant speech disrupts free recall of words more than meaningless irrelevant sound.5 However, these effects are only observed when the task is free recall, not serial recall. The effect of semantic content on serial recall is still subject to research.
It is likely that both temporal-spectral characteristics and semantic content have an impact on working memory performance. Spectral fluctuations seem to be required to trigger the effect while mere amplitude fluctuations do not cause the same effect.6 The effect is observed for native speech and speech in unfamiliar languages, but it appears to be more pronounced for one's native language.6 Since sound masking typically requires a negative signal-to-noise ratio (SNR) to improve working memory performance (e.g., Ref. 7), the background sound may be dominated by the masking sound characteristics. In the following, the SNR refers to the ratio of the A-weighted sound pressure level (SPL) of distracting single-voice speech to the A-weighted SPL of a masking sound. As part of this study two masking sounds with different characteristics are compared: a time-reversed speech masker with temporal-spectral variations and a steady-state sound with the spectrum of the disturbing speech signal.
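The remark that the background sound may be dominated by the masker at negative SNRs can be illustrated with a small level calculation. The sketch below is not part of the study: it assumes that speech and masker add energetically (incoherent summation), and the function names are hypothetical.

```python
import math


def snr_db(speech_level_dba: float, masker_level_dba: float) -> float:
    """SNR as the difference between two A-weighted levels, in dB."""
    return speech_level_dba - masker_level_dba


def level_increase_over_masker_db(snr: float) -> float:
    """Increase of the total level over the masker level, in dB, when speech
    at the given SNR is added to the masker.

    Assumption: the two signals are incoherent, so their energies add.
    """
    return 10.0 * math.log10(1.0 + 10.0 ** (snr / 10.0))


if __name__ == "__main__":
    # SNRs tested in this study
    for snr in (-6.0, -9.0, -12.0):
        increase = level_increase_over_masker_db(snr)
        print(f"SNR {snr:+.0f} dB: total level is {increase:.2f} dB above the masker")
```

Under this assumption the speech raises the overall level by less than 1 dB at all three tested SNRs, which is why the overall level of a masked condition is governed almost entirely by the masker.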
With respect to speech intelligibility, time-reversed speech has superior speech masking efficiency compared to stationary masking sounds due to additional informational masking (e.g., Ref. 8). Such masking sounds reduce speech intelligibility at higher SNRs than stationary masking sounds, which may be beneficial because these sounds could be used at lower SPLs.9,10 However, such speech-like sounds may catch the listener's attention because they have temporal-spectral variations that are similar to disturbing speech. Thus, reversed speech may even impair short-term memory performance to a similar extent as normal speech.11 A sound condition that consists of a disturbing speech signal and a time-reversed speech masker may be as distracting as disturbing speech because one and two voices produce roughly the same disruption of short-term memory in a laboratory experiment.12 However, when the SNR is set appropriately, sound masking with time-reversed speech reduces the speech intelligibility at relatively high SNRs, and hence the working memory performance may improve as compared to unmasked disturbing speech.
A time-reversed speech masker can be created near real-time by cutting a speech recording into short segments, reversing these segments, and concatenating them. When speech recognition is applied, such a masking sound could be used for local sound masking of distracting background speech in office environments. The segment duration has an impact on the speech intelligibility of the time-reversed speech signal and of the disturbing speech signal that is masked by the time-reversed speech signal.10,13–15 The segment duration of time-reversed speech with 20–200 ms segments has a strong impact on speech intelligibility and approaches 0% speech intelligibility for segment durations above 200 ms.13,14 At SNRs between −15 and 0 dB, a segment duration of 200 ms results in relatively low speech intelligibility of a speech signal that is masked by time-reversed speech.15
Jiang et al.10 analyze whether the speech masking efficiency of time-reversed speech with respect to speech intelligibility also has a positive effect on working memory performance. At an SNR of −5 dB subjects perform significantly worse during speech masked by a time-reversed masker with a frame length of 150 ms than during speech masked by a noise-like masker with a slope of −5 dB per octave, but they do not perform significantly worse during speech that is masked by a time-reversed masker with a frame length of 500 ms.10 Contrary to Refs. 11 and 12, these results demonstrate that using time-reversed speech as masking sound may improve the cognitive performance to a similar extent as using noise-like masking sounds. However, Jiang et al.10 do not consider a control condition with unmasked speech, which makes a comparison of the positive effect of these masking sounds on working memory performance difficult. They vary both SNR and masker type, but they only use one negative SNR of −5 dB. Since in most cases stationary maskers only show an improvement of working memory performance at negative SNRs, varying the SNR below 0 dB and varying the masker type at the same time would enable a comparison of the effect of SNR changes between different masking sounds such as noise- and speech-like signals.
Since sound masking systems are frequently not accepted in Germany because employees perceive them as annoying, the subjective perception of sound masking is as important as the effect on working memory performance.9 Perceived annoyance refers to ambient acoustic conditions and is assessed by a 5-point verbal scale (not at all, slightly, moderately, very, and extremely annoying). Standard guideline ISO/TS 15666:2003 provides recommendations on the assessment of noise annoyance and suggests the use of two questions on annoyance in each questionnaire with both annoyance scales, a 5-point verbal and an 11-point numerical scale that are based on the recommendations of the International Commission on Biological Effects of Noise. In contrast to the effect on working memory performance, the subjective perception may be quite different between time-reversed speech and noise-like masking sounds. Comparisons of the subjective annoyance perception of noise-like and time-reversed maskers reveal that steady-state sound masking is perceived as less annoying by subjects.10,16 Especially at negative SNRs, this may be attributed to the lower temporal-spectral variability of sound conditions with steady-state sound masking at the same SNR. The annoyance perception of sound masking environments and the impact of different aspects like SPL, spectral properties, temporal fluctuations, and speech intelligibility are still subject to research.
A laboratory experiment is performed to compare the effects of a steady-state and time-reversed speech masker on working memory performance and annoyance. A masking sound with a slope of −5 dB per octave is tested as well but the results are outside the scope of this paper. Based on past studies,10,16 it is expected that the time-reversed masker reduces speech intelligibility at higher SNRs. Since the impact of semantic content on working memory performance is still subject to research, the positive impact of sound masking with time-reversed speech on serial recall performance is uncertain. It is not clear if the lower temporal-spectral variability of the sound conditions with steady-state sound masking increases working memory performance more effectively than the reduced speech intelligibility of the sound conditions with the time-reversed masker. The masking sounds are tested at SNRs of −6, −9, and −12 dB. The fluctuating time-reversed masker is not expected to restore working memory performance to the silent baseline condition while the steady-state noise-like masking sound is expected to approach the baseline condition at decreasing SNRs.
The annoyance ratings of sound conditions with stationary masking are expected to be between both control conditions and to be lower at decreasing SNRs because the speech sound is quieter at lower SNRs. The sound conditions with fluctuating time-reversed speech are expected to be perceived as more annoying than the conditions with stationary masking but it is unknown if they are perceived as less annoying than unmasked speech. In particular, two research questions are addressed:
Does sound masking with time-reversed speech improve serial recall performance more efficiently than sound masking with steady-state noise?
Is sound masking with time-reversed speech perceived as less annoying than unmasked speech?
2. Methods and materials
Twenty-four students [6 female, median (Mdn) = 24 yrs, standard deviation (SD) = 2.5] participate in the experiment. All participants are native German speakers and receive a small stipend.
The experimental design is a one-way repeated measures design with 12 levels according to the 12 sound conditions during which working memory performance and subjective ratings are tested. The further analysis with the eight mentioned sound conditions is based on a two-way repeated measures design with two factors, masker type and SNR. The design and procedure of this study are described in more detail in a previous paper that compares the effect of the speech-shaped stationary masker with −5 dB per octave shaped noise (see Ref. 7).
The analyzed sound conditions are listed in Table 1. The time-reversed signal is created by dividing the speech signal into 200 ms time frames, reversing these segments, and concatenating them in the original order without any delay. All masker signals are calibrated to 45 dB(A) while the SPL of the speech signal is varied between 33 and 39 dB(A), resulting in SNRs from −12 to −6 dB. Masker SPLs of 45 dB(A) are commonly suggested because higher levels can annoy employees and impair communication.1 The sound condition with unmasked speech is calibrated to 42 dB(A). The SPL refers to an A-weighted energy-equivalent SPL LAeq averaged over 40 s and measured using a sound level meter Norsonic Sound Analyzer type 110 (Norsonic AS, Tranby, Norway) and an artificial ear with G.R.A.S. 40AG 1/2 in pressure microphone (G.R.A.S. Sound and Vibration A/S, Holte, Denmark).
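The segment-wise reversal described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and its parameters are hypothetical, the signal is assumed to be a plain sequence of samples, and a final segment shorter than 200 ms is assumed to be reversed as it is, which the text does not specify.

```python
def time_reverse_segments(samples, sample_rate_hz, segment_ms=200.0):
    """Reverse each consecutive segment of a signal while keeping the
    segments themselves in their original order.

    samples        -- sequence of audio samples (list or tuple)
    sample_rate_hz -- sampling rate in Hz
    segment_ms     -- segment duration in milliseconds (200 ms in this study)
    """
    seg_len = int(round(sample_rate_hz * segment_ms / 1000.0))
    if seg_len < 1:
        raise ValueError("a segment must contain at least one sample")

    out = []
    for start in range(0, len(samples), seg_len):
        segment = samples[start:start + seg_len]
        # Assumption: a trailing segment shorter than seg_len is reversed as is.
        out.extend(reversed(segment))
    return out


if __name__ == "__main__":
    # Toy signal: at 10 Hz, a 200 ms segment spans two samples.
    print(time_reverse_segments([0, 1, 2, 3, 4], sample_rate_hz=10))
```

Because the segment boundaries do not move, applying the function twice returns the original signal, which is a convenient sanity check.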
Table 1. Name of sound condition, used masker, and SNR [dB]. The acronyms REF (reference), HSM (masking sound that is spectrally matched to Hochmair-Schulz-Moser speech recordings), and TRM (time-reversed speech masker) are used in the following to refer to the sound conditions. The SNR is printed as a subscript.
Sound condition | Masker type | SNR (dB) |
---|---|---|
REF0 | (Silence) | — |
REF∞ | (Unmasked distracting speech) | ∞ |
HSM−6 | Speech-shaped steady-state noise | −6 |
HSM−9 | Speech-shaped steady-state noise | −9 |
HSM−12 | Speech-shaped steady-state noise | −12 |
TRM−6 | Time-reversed speech | −6 |
TRM−9 | Time-reversed speech | −9 |
TRM−12 | Time-reversed speech | −12 |
Temporal-spectral variability can be estimated by the hearing sensation fluctuation strength (cf. Ref. 17). A comparison of fluctuation strength values shows that sound conditions with the time-reversed masker have values similar to those of unmasked speech while the sound conditions with the steady-state noise have notably lower values.
The speech intelligibility according to the Hochmair-Schulz-Moser sentence test18 is tested in a subsequent experiment. The same speech recordings are used to determine the speech intelligibility. All participants report normal hearing. Unmasked speech and the sound conditions with steady-state masking are tested with 17 subjects (Mdn = 24 yrs, SD = 3.1) while the sound conditions with reversed speech are tested with 12 subjects (Mdn = 24 yrs, SD = 2.8).
3. Results
Figure 1(a) depicts the mean error rates that are observed in the number recall task. A repeated measures analysis of variance (ANOVA) shows a significant effect of sound condition on mean error rate in serial recall [F(4.8,109.3) = 5.87, mean squared error (MSE) = 0.0074, p < 0.001, η² = 0.20]. Mauchly's test indicates a violation of the assumption of sphericity. Consequently, the Greenhouse-Geisser correction is applied. A two-way ANOVA reveals a significant main effect of masker type [F(1,23) = 7.39, MSE = 0.0084, p < 0.05, η² = 0.24] but neither a significant effect of SNR nor an interaction effect. One-tailed t-tests are calculated to compare the sound conditions with masked speech at the same SNR and to compare the conditions with silence and unmasked distracting speech. Table 2 shows the results of all t-tests with Benjamini-Hochberg corrected p-values.19 Significantly more errors are made during distracting speech than under silence. Both masking sounds, steady-state noise and reversed speech, result in lower mean error rates as compared to unmasked speech at −6 dB SNR. Only the time-reversed masker leads to a higher mean error rate than under silence at an SNR of −12 dB. The mean error rate decreases significantly from −6 to −12 dB SNR under steady-state masking but not under reversed speech. At −12 dB SNR, the sound condition with speech-shaped steady-state noise produces lower error rates than the sound condition with time-reversed speech.
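For readers unfamiliar with the Benjamini-Hochberg procedure, the step-up adjustment of a family of p-values can be sketched as below. This is a generic illustration and not the authors' analysis code; the function name is hypothetical, and the p-values in the usage example are made up for demonstration and are not taken from this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted in ascending order
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw p-values, for illustration only
    raw = [0.01, 0.04, 0.03, 0.005]
    print(benjamini_hochberg(raw))
```

Note that the p-values listed in Table 2 are already corrected, so the function should only be applied to uncorrected values.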
Fig. 1. Results plotted as means with standard errors. (a) Mean error rates in number recall (n = 24); (b) mean annoyance ratings on a 5-point Likert scale (n = 21); (c) speech intelligibility (n = 12–17).
Table 2. Overview of t-tests with corrected p-values and Cohen's d effect sizes. H1 indicates the alternative hypothesis; (1) mean error rates; (2) mean annoyance ratings. *, **, and *** represent significance at the p < 0.05, 0.01, and 0.001 levels.
H1 | p-value (1) | Cohen's d (1) | p-value (2) | Cohen's d (2) |
---|---|---|---|---|
REF0 < REF∞ | 2.45 × 10−04*** | 1.0 | 1.63 × 10−13*** | 4.1 |
HSM−6 < REF∞ | 3.69 × 10−03** | 0.74 | 2.64 × 10−02* | 0.47 |
TRM−6 < REF∞ | 1.14 × 10−02* | 0.56 | 5.56 × 10−01 | 0.0 |
REF0 < HSM−12 | 3.50 × 10−01 | 0.10 | 1.78 × 10−08*** | 2.0 |
REF0 < TRM−12 | 7.71 × 10−03** | 0.62 | 2.32 × 10−13*** | 3.9 |
HSM−12 < HSM−6 | 2.31 × 10−02* | 0.48 | 2.70 × 10−03** | 0.73 |
TRM−12 < TRM−6 | 8.84 × 10−01 | 0.25 | 7.68 × 10−01 | 0.16 |
HSM−6 < TRM−6 | 3.45 × 10−01 | 0.12 | 2.57 × 10−02* | 0.49 |
HSM−9 < TRM−9 | 1.72 × 10−01 | 0.25 | 9.75 × 10−04*** | 0.84 |
HSM−12 < TRM−12 | 7.71 × 10−03** | 0.63 | 2.26 × 10−05*** | 1.2 |
The subjects are asked to rate the perceived annoyance after each sound condition. The datasets of three subjects are not considered due to problems with the web browser while replying to the questions. The mean annoyance ratings of the sound conditions are depicted in Fig. 1(b). A one-way ANOVA reaches statistical significance [F(4.3,86.9) = 28.4, MSE = 0.58, p < 0.001, η² = 0.59]. The Greenhouse-Geisser correction is applied to the degrees of freedom because Mauchly's test for sphericity is significant. A two-way ANOVA shows a significant main effect of masker type [F(1,20) = 26.2, MSE = 1.4, p < 0.001, η² = 0.57] and an interaction effect between SNR and masker type [F(2,40) = 5.06, MSE = 0.47, p < 0.05, η² = 0.20] but no main effect of SNR. Follow-up t-tests for paired samples are calculated and Benjamini-Hochberg correction19 is applied (see Table 2). Background speech is rated as significantly more annoying than silence. Only the sound condition with stationary sound masking is perceived as significantly less annoying than unmasked speech at −6 dB SNR. Both sound conditions at −12 dB SNR are rated more annoying than the silent condition. The annoyance perception decreases significantly from −6 to −12 dB SNR when steady-state noise is used as masking sound. The conditions with steady-state masking are rated as significantly less annoying than the conditions with reversed speech at all tested SNRs, −6, −9, and −12 dB.
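The Greenhouse-Geisser correction multiplies both degrees of freedom of the F-test by an estimate ε of the departure from sphericity, which explains the fractional degrees of freedom reported above. The following sketch shows one common way to estimate ε from a subjects-by-conditions data matrix. It is a generic illustration, not the authors' analysis code, and the function name and the example ratings are hypothetical.

```python
def greenhouse_geisser_epsilon(data):
    """Estimate the Greenhouse-Geisser epsilon.

    data -- list of rows, one row per subject, each holding the k scores
            of that subject in the k repeated-measures conditions.
    """
    n = len(data)
    k = len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(k)]

    # Sample covariance matrix of the k conditions
    cov = [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (n - 1)
            for j in range(k)] for i in range(k)]

    # Double-centre the covariance matrix (row means equal column means
    # because the matrix is symmetric).
    row_means = [sum(cov[i]) / k for i in range(k)]
    grand_mean = sum(row_means) / k
    centred = [[cov[i][j] - row_means[i] - row_means[j] + grand_mean
                for j in range(k)] for i in range(k)]

    trace = sum(centred[i][i] for i in range(k))
    sum_sq = sum(centred[i][j] ** 2 for i in range(k) for j in range(k))
    return trace ** 2 / ((k - 1) * sum_sq)


if __name__ == "__main__":
    # Hypothetical ratings of five subjects in three conditions
    ratings = [[1, 2, 4], [2, 1, 7], [3, 5, 5], [4, 4, 9], [5, 7, 6]]
    eps = greenhouse_geisser_epsilon(ratings)
    n, k = len(ratings), len(ratings[0])
    # Corrected degrees of freedom: eps * (k - 1) and eps * (k - 1) * (n - 1)
    print(eps, eps * (k - 1), eps * (k - 1) * (n - 1))
```

The estimate lies between 1/(k − 1) and 1, where 1 means that sphericity holds and no correction is needed.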
The measured speech intelligibility during the different sound conditions is depicted in Fig. 1(c). At the same SNR, the speech intelligibility under time-reversed speech masking (4% to 32%) is much lower than under steady-state masking (29% to 89%).
4. Discussion
This study compares the effects of masking sounds with different temporal-spectral characteristics on working memory performance and annoyance. Both masking sounds improve the serial recall performance relative to unmasked speech at −6 dB SNR. Only when time-reversed speech is used as the masking sound is the mean error rate at −12 dB SNR significantly higher than in the silent condition. The error rates during steady-state sound masking are significantly lower than during time-reversed speech masking at −12 dB SNR. Time-reversed speech can diminish the disturbing impact of semantic content of speech at higher SNRs because it reduces the speech intelligibility more efficiently. Since time-reversed speech may have similar temporal-spectral variability as unmasked speech, the temporal-spectral variability of a signal that consists of distracting speech and a time-reversed speech signal is expected to be similar to that of unmasked speech at equal SPLs. The results indicate that a time-reversed masker enables similar performance improvements in number recall as a steady-state speech-shaped noise at SNRs around −6 dB. This may be attributed to lower speech intelligibility or lower frequency fluctuations of sound conditions with speech and reversed speech as compared to speech only. In contrast, when the SNR of speech that is masked by steady-state noise is decreased, both temporal-spectral variability and speech intelligibility decline, and working memory performance approaches that observed in silence.
The results show that short-term memory performance in serial recall is sensitive to background sounds between −6 and −12 dB SNR in sound masking applications. A time-reversed masker can increase the speech privacy at higher SNRs but it does not show any advantage with regard to the expected work performance. While all sound conditions with the steady-state masker reduce the perceived annoyance relative to unmasked speech, the sound conditions with reversed speech are perceived as just as annoying as unmasked speech. Sound masking with reversed speech is not expected to create a favorable sound environment that enables one to concentrate at work while being perceived as pleasant at the same time. Since the employees in German open-plan offices are usually still skeptical about the application of sound masking, the SPL settings are commonly chosen to improve user acceptance. In general, the SPL has to be set to a level that provides a good trade-off between speech privacy and annoyance. This study follows a different approach by analyzing the working memory performance in a serial recall task. The results suggest that reducing the speech intelligibility of distracting speech may be almost as important as reducing the temporal-spectral variability when sound masking is intended to improve cognitive performance. This implies that semantic content has an impact on working memory performance (cf. Ref. 6).