Listeners often experience challenges understanding a person (target) in the presence of competing talkers (maskers). This difficulty is reduced by the availability of visual speech information (VSI; lip movements, degree of mouth opening) and by linguistic release from masking (LRM; masking decreases when the masker is in a dissimilar language). We investigate whether and how LRM occurs when VSI is available. We presented English targets with either Dutch or English maskers in audio-only and audiovisual conditions to 62 American English participants. The signal-to-noise ratio (SNR) was easy in Experiment 1 (0 dB audio-only and −8 dB audiovisual) and hard in Experiment 2 (−8 and −16 dB) to assess the effects of modality on LRM across the same and different SNRs. We found LRM in the audiovisual condition at all SNRs and in the audio-only condition at −8 dB, demonstrating reliable LRM for audiovisual conditions. Results also revealed that LRM is modulated by modality, with larger LRM in audio-only, indicating that introducing VSI weakens LRM. Furthermore, participants showed higher performance for Dutch maskers compared to English maskers both with and without VSI. This establishes that listeners use both VSI and dissimilar-language maskers to overcome masking. Our study shows that LRM persists in the audiovisual modality and that its strength depends on the modality.

Listeners often experience challenges understanding a person (target) in the presence of competing talkers (maskers). This challenge arises because the background interferes with (masks) the target speech. Most studies assessing speech-in-speech perception do so in auditory-only conditions. However, we must investigate both auditory and visual speech information (VSI) to fully understand perception in challenging listening conditions. VSI can provide essential information about acoustic-phonetic cues such as place of articulation for consonants and characteristics of vowels (e.g., Summerfield, 1992; Assmann and Summerfield, 2004). In addition, listeners take advantage of auditory speech information during speech-in-speech listening situations. Generally, listener performance improves when there is a mismatch between the target and masker, an improvement sometimes termed "release from masking." One well-attested phenomenon is linguistic release from masking (LRM), in which masking of a target language (e.g., English) decreases with increasingly dissimilar language backgrounds (e.g., Dutch).

Listeners show improved intelligibility for target speech when the target and masker languages differ. This has been explained by the target-masker linguistic similarity hypothesis, which states that the more linguistically dissimilar the target and masker speech streams, the easier they are to segregate (Brouwer et al., 2012). This improved intelligibility for different-language target-masker pairs is known as LRM. For instance, studies have shown increased masking when a target language is presented with a known-language masker (e.g., Van Engen and Bradlow, 2007; Calandruccio et al., 2013). Van Engen and Bradlow (2007) showed that native English listeners' intelligibility for English targets was higher when the competing speech was Mandarin relative to English. LRM effects have also been demonstrated for English targets presented in Spanish (Garcia Lecumberri and Cooke, 2006), Greek (Calandruccio and Zhou, 2014), and Mandarin-accented English maskers (Calandruccio et al., 2010; Calandruccio et al., 2014), as well as for regional Dutch variants (Brouwer, 2017) and German-accented Dutch maskers (Brouwer, 2019).

1. Factors affecting LRM

Energetic masking (reduced intelligibility due to spectro-temporal overlap of target and masker) and informational masking (reduced intelligibility due to factors beyond those attributable to energetic masking) have both been invoked to explain LRM. LRM may arise partly from energetic masking differences between conditions: linguistic differences at the segmental (phonetic/phonological) and suprasegmental (stress patterns, prosody) levels of speech reduce the overlap in the spectral and temporal distribution of acoustic energy between target-masker pairs. For example, target-masker combinations that differ widely on a linguistic-phonetic dimension (e.g., English-in-Mandarin vs English-in-English) show a notable LRM (Calandruccio et al., 2010), consistent with reduced spectro-temporal overlap between different-language target-masker pairs.

Another factor contributing to LRM is a difference in informational masking, that is, any reduction in intelligibility that remains after energetic masking has been accounted for. Informational masking occurs when the listener is unable to perceptually segregate the target from the masker stream due to interference beyond energetic masking. In speech-in-speech situations, it can arise when the phonological, lexical, syntactic, and semantic characteristics of one language interfere with those of another, even when energetic differences are accounted for. For example, Brouwer et al. (2012) demonstrated a decrease in semantic interference for semantically anomalous maskers relative to meaningful maskers. Thus, semantic overlap between the target and masker affects the size of LRM, highlighting the influence of informational masking on LRM effects.

The components of informational masking are competing attention to the masker, interference from a known language, higher cognitive load, similarity, and uncertainty (Durlach et al., 2003; Cooke et al., 2008; Mattys et al., 2009; Brouwer and Bradlow, 2014). Table I describes the components of informational masking that we apply to LRM.

  1. Uncertainty. Uncertainty occurs when target-masker characteristics change unpredictably across trials (Shinn-Cunningham, 2008). It has been defined as the listener being unable to pinpoint what exactly in the complex sound to listen to (Watson et al., 1976). By reducing uncertainty, the listener can pull useful information from the acoustic signal to help them attend to the target. Uncertainty significantly influences the amount of informational masking in speech-in-speech tasks: reduced uncertainty results in reduced informational masking, which boosts listener performance (e.g., Watson et al., 1976).

  2. Similarity. Similarity is how alike the target and masker speech are with regard to spatial location, spectral or linguistic content, and so on. This similarity effect is not completely reducible to energetic overlap. For example, intelligibility is better when the target and masker speech come from different-sex talkers rather than same-sex talkers because the target and masker voices are distinctive (Brungart, 2001; Helfer and Freyman, 2008). These perceptual differences in voice characteristics minimize informational masking because of reduced similarity. In addition, the target-masker linguistic similarity hypothesis holds that more linguistically dissimilar target and masker speech streams are easier to segregate (Calandruccio et al., 2013; Brouwer et al., 2012). Informational masking decreases when there is reduced similarity between the target and masker.

  3. Higher cognitive load. Higher cognitive load occurs when listeners process complex acoustic signals in a speech-in-speech task. Listeners have to tune into the relevant information that either speech stream might contain while ignoring the irrelevant information when attempting to understand the target. Mattys et al. (2009) explain that any processing resources allotted to the masker will reduce performance during speech perception. Higher cognitive load increases informational masking.

  4. Interference from a known masker language. Interference from a known masker language arises because the listener follows the acoustic-phonetic and semantic information in both the target and the masker, which interferes with their ability to identify the target signal. Several studies have shown that maskers intelligible to the listener can mask the target more than different-language or unintelligible maskers (Garcia Lecumberri and Cooke, 2006; Van Engen and Bradlow, 2007). Informational masking increases when the listener experiences interference from a known masker language.

  5. Attention to the masker. In this situation, the listener has difficulty selectively attending to the target while ignoring the masker, which increases informational masking. Competing attention to the masker often occurs due to some of the factors outlined above (e.g., known masker language) and can also occur due to nonlinguistic or semantic features of the masker capturing the listener's attention. Broadly, the competing masker itself can diminish the listener's attentional resources that were intended for identifying the target talker (Mattys et al., 2009).

TABLE I.

Five components of informational masking as described in the literature and how VSI will affect each component.

Informational masking component           | Definition                                                                      | Effect of adding VSI
Uncertainty                               | Being unable to pinpoint where exactly in the complex sound to listen to       | Reduces due to mouth gestures matching the audio input of the target
Similarity                                | How alike the target and masker speech are across all dimensions               | Unchanged
Higher cognitive load                     | Load due to any processing resources allotted to the masker                    | Reduces due to perceiver being able to direct attention to the target
Interference from a known masker language | Acoustic phonetic and semantic information from the masker that is distracting | Unchanged
Attention to the masker                   | Difficulty selectively attending to the target while ignoring the masker       | Reduces due to redundant information about the target

In conclusion, uncertainty, similarity, higher cognitive load, interference from a known masker language, and attention to the masker, as outlined in Table I, all decrease target intelligibility. These components contribute to informational masking and can be manipulated in order to further investigate speech-in-speech perception. The current study will explore LRM using audiovisual speech stimuli. The addition of VSI will reduce uncertainty, cognitive load, and attention to the masker, which might further uncover factors contributing to LRM.

2. LRM in different listening situations

Recent studies have examined how LRM responds to changing listening situations. Testing LRM under more ecologically valid listening conditions provides complementary evidence of factors essential to LRM. For example, manipulating factors like spatial location affects the magnitude of LRM (Viswanathan et al., 2016, 2018). Viswanathan et al. (2016) found that LRM still persists when the masker is spatially separated from the target (i.e., easier segregability), but the size of the LRM effect diminishes relative to the co-located (i.e., harder segregability) condition. Similarly, Viswanathan et al. (2018) investigated whether LRM effects persisted under a different set of altered listening conditions, using speech streams spectrally degraded via noise-vocoding. Results showed that the magnitude of LRM again varied when the listening situation was changed. Viswanathan and Kokkinakis (2019) further demonstrated that altering listening situations through varying extents of reverberation can determine whether LRM effects are obtained with the same target-masker pairs. In addition, Williams and Viswanathan (2020) manipulated talker sex in the target and found LRM effects in the opposite direction (rLRM), i.e., performance was higher for English maskers relative to Dutch, for different-sex talkers. However, they obtained typical LRM effects when talker sex was manipulated in the masker. The rLRM effects in the different-sex talker condition reinforce the idea that target-masker linguistic similarity alone cannot predict the size and direction of LRM. Thus, LRM effects differed depending on the listening context, indicating the need to study LRM in varying situations to fully understand the phenomenon. Given this, the addition of VSI offers an additional listening situation that may aid in understanding how LRM generalizes.

Perceivers understand speech better when they can see and hear the talker. When noise reduces speech intelligibility, seeing the talker improves speech understanding (Schubotz et al., 2021; Holler and Levinson, 2019; Dekle et al., 1992; Sumby and Pollack, 1954). Perceivers use visual information from the speaker's articulators, i.e., the lips, tongue, and teeth, to aid perception (Sumby and Pollack, 1954). VSI about articulatory gestures can provide information about temporal dynamics as well as segment identity. For example, access to visible articulators through the degree of mouth opening could help predict the timing of onsets and peaks in the acoustic energy of the speech signal, reducing spectro-temporal uncertainty (Grant and Seitz, 2000). This allows the perceiver to more readily identify and segregate the intensity patterns of the target and masker speech. In addition to the temporal properties of the acoustic signal, visible articulators also provide the perceiver with information about the specific sounds being produced. For instance, the perceiver associates the sounds /p/, /b/, and /m/ with lips that close and suddenly open, and associates lip rounding with the production of the sound /u/ (Peelle and Sommers, 2015). This place of articulation information allows the perceiver to better understand the talker by filling in missed phonetic information via changes in the mouth, tongue, and lips during speech production (Sumby and Pollack, 1954). VSI also contains time-varying lip and jaw movements, which allow the listener to tune into the temporal characteristics of the speech signal (Summerfield, 1987). In sum, simultaneous auditory information and VSI provide a perceptual benefit (i.e., audiovisual benefit) over auditory speech alone during speech perception. This benefit derives from the redundant and complementary information provided by visual speech, as described above.

VSI can help perceivers attend to the target by allowing them to separate target speech from competing speech maskers (Helfer and Freyman, 2005). This improved perception of the target talker due to VSI can be called visual release from masking. For example, Helfer and Freyman (2005) examined whether the availability of VSI reduced interference caused by the informational components of target-masker uncertainty and similarity during speech-in-speech perception. They tested spatial release from masking (SRM; better intelligibility when target and masker are in different spatial locations compared to the same location) in either audiovisual or audio-only conditions. Findings with SRM might inform how LRM effects operate in such listening situations because both types of masking release provide additional information that allows for easier segregation of speech streams. Results showed an audiovisual benefit of 6.2 dB (50% threshold) in the spatially separated condition compared to a larger audiovisual benefit of 8.8 dB (50% threshold) in the co-located speech masker condition. That is, performance improved when audiovisual stimuli of English target speech were presented with a competing English masker compared to audio-only stimuli. The researchers concluded that the informational masking components at play during SRM and audiovisual speech contribute complementary advantages (i.e., segregation of the target from the masker and maintained attention to the target) that reduce uncertainty during speech-in-speech perception. Given these results with SRM, we might expect a similar outcome with LRM effects in the current study. These findings demonstrate that VSI reduces informational masking effects and is especially beneficial to the perceiver in speech-in-speech tasks.

A recent study by Brown et al. (2021) using visual information to examine LRM-like effects offers insight into the current study. The researchers investigated the target-masker linguistic similarity hypothesis in a cross-modal paradigm with no energetic masking. That is, they asked whether LRM-like effects (i.e., LRM with no auditory speech signal for the target) would occur in a visual-only condition. They used two-talker maskers of English, Dutch, or Mandarin, and the participants' task was to lipread English target sentences in the presence of these speech maskers. Same-language speech maskers interfered with the lipread target speech more than different-language maskers, indicating that LRM-like effects occur even when visual-only targets are presented with speech maskers. Furthermore, these effects occur in the absence of energetic masking. The researchers attributed them to informational masking components, namely, difficulty segregating same- or similar-language target-masker streams and/or greater attentional demand with same- or similar-language target-masker streams. This explanation suggests that uncertainty and attention to the masker, respectively, are likely to contribute to the LRM effects in the current study. However, the Brown et al. study does not show typical LRM effects because there was no auditory speech signal for the target stimuli. Brown et al. implemented visual-only target stimuli to address a different research question than our current study: they focused on testing the target-masker linguistic similarity hypothesis with no energetic masking.

It is currently unknown whether listeners will still experience LRM when VSI is available. Studying LRM in the presence of visual information is critical for the following reasons. First, it provides a more realistic view of LRM, since most speech is perceived face-to-face. Second, studying LRM in audiovisual conditions allows us to manipulate informational masking components while keeping energetic masking components the same. The goal of the present research is to investigate whether and how VSI influences the size and presence of LRM. Through this investigation, we may be able to pinpoint the contribution of different components of informational masking as outlined in Table I. We accomplish this by testing native American English participants in the following target-masker conditions: (1) manipulating the masker language (i.e., English-English and English-Dutch) and (2) manipulating the presence of VSI by presenting either a video of the talker or only the audio of the talker.

English and Dutch were chosen as the masker languages due to their typological similarity and their membership in the Germanic genus of the Indo-European language family (Dryer and Haspelmath, 2013). The two languages share cognates with Germanic roots, similar verb tense systems, and similar stress and intonation patterns. In addition, Dutch and English allow complex syllable structures; both commonly reduce vowels to schwa in unstressed syllables; both mark lexical stress with altered pitch and lengthened syllable duration; and both have similar numbers of consonants and vowels (Booij, 2019, 1999; Collins and Mees, 2003; Carr, 2019). Implementing Dutch and English masker languages aligns the current study with existing LRM research (Brouwer et al., 2012; Calandruccio et al., 2013; Brown et al., 2021), which allows for a better understanding of how VSI influences LRM.

We presented the stimuli at easier signal-to-noise ratios (SNRs) in Experiment 1 (0 dB SNR audio-only and −8 dB SNR audiovisual) and at harder SNRs in Experiment 2 (−8 dB SNR audio-only and −16 dB SNR audiovisual). The aim was to investigate whether LRM effects occur at the easier SNRs and whether these effects persist at the harder SNRs. Critically, this design allows for a between-experiment comparison of audiovisual and audio-only conditions at the same SNR, which is necessary to evaluate the size of LRM effects across modality and to account for baseline intelligibility differences between audio-only and audiovisual conditions.

1. Outcomes and implications

Based on the studies discussed above investigating the effect of VSI and visual-only information on speech perception in the presence of noise and speech maskers, we expect VSI to affect informational masking components as outlined in Table I. VSI, such as mouth gestures matching the auditory input of the target speech, reduces uncertainty and cognitive load. VSI also provides redundant information about the target, which reduces overall attention to the masker. Consistent with past studies, we predict that participants will show an overall audiovisual benefit (i.e., improved speech intelligibility due to VSI) during masking (e.g., Helfer and Freyman, 2005). We also predict that participants will show better overall performance and smaller LRM effects at easier SNRs compared to harder SNRs. We expect the hardest condition, −8 dB SNR audio-only, to show the strongest LRM effects.

There are three possibilities for whether LRM effects will be present in the audiovisual condition. First, there may be no LRM in the audiovisual condition. If so, VSI may increase the ability to extract missing phonetic cues and pull out acoustic cues that guide the perceiver in segregating the target talker from the masker and maintaining attention to the target. This would create a situation in which the visual benefits far outweigh the benefits of a mismatched-language masker. That is, LRM might be neutralized in the presence of VSI, since the latter provides redundant speech information when paired with a foreign-language background. Such a neutralization would suggest that the informational masking components underlying LRM and visual release from masking completely overlap.

Second, it is possible that LRM might occur unchanged in size (i.e., typical LRM effects) in the audiovisual condition. This outcome would suggest that the improvement in performance due to VSI rests on components that do not affect LRM. For instance, it might suggest that manipulating the uncertainty component of informational masking through audiovisual means does not affect LRM. If this is the case, then the informational masking components that lead to LRM may not overlap with those responsible for visual release from masking.

Third, LRM effects in the audiovisual condition may be weaker than typically demonstrated. If a much weaker LRM effect is present, then VSI such as lip and jaw movements, together with otherwise missing phonetic information such as place of articulation and mouth shape, might be highly synchronized with the target signal. This would allow the perceiver to better segregate and maintain attention to the target talker compared to either LRM or VSI alone. Here, some informational masking components would be shared between LRM and visual release from masking. This outcome is the most probable and is supported by research investigating a similar phenomenon, SRM, in which performance improves when the maskers are spatially separated from the target. Helfer and Freyman (2005) showed that SRM is smaller when visual information is present. Although LRM does not operate in the same way as SRM, the two can be compared because both types of release provide additional information for easier segregation of target-masker speech streams. Perhaps LRM effects will still occur, but to a lesser extent, given the availability of VSI. When VSI is considered alongside LRM, we expect the following. First, there might be a differential effect in which the English masker benefits much more than the Dutch masker when uncertainty is reduced. Second, adding VSI reduces cognitive load and attention to the masker, thereby weakening LRM. Altogether, VSI reduces informational masking by reducing uncertainty, cognitive load, and attention to the masker.

In this experiment, we presented audio-only stimuli at 0 dB SNR and audiovisual stimuli at −8 dB SNR to participants to examine whether LRM would still occur when the perceiver had access to VSI.

1. Participants

Thirty-two subjects (25 females and 7 males) ranging from 19 to 69 years old (median age = 31 years) participated in the experiment. Eight participants were 40 years old or above.1 All participants were native speakers of American English with no knowledge of Dutch and provided informed consent. Participants reported no hearing or speech deficits, normal or corrected-to-normal vision, and provided information on exposure to foreign-accented speech. Participants were recruited from Prolific®, an online participant recruitment database, and received compensation for their participation. All participants passed headphone checks to ensure they were wearing headphones (Milne et al., 2021).

2. Materials and procedure

Similar to past studies, participants completed an intelligibility task in which they listened to stimuli over headphones and reported the words they heard. The stimuli consisted of 80 target sentences, with each recording made up of one target sentence and two background sentences (masker). One female native American English speaker produced syntactically simple sentences as target stimuli (Bamford-Kowal-Bench Revised Sentence List, BKB-R; Bench et al., 1979). Each target sentence contained three or four keywords (e.g., The clown had a funny face) for a total of 60 to 80 keywords per language per modality. Per Gallun et al. (2018), the target stimuli were normalized to the same intensity level of 39.5 dB. In addition, we piloted this intensity level to identify suitable SNRs for all conditions and to ensure there were no ceiling or floor effects. However, this intensity level will not match the presentation levels in the online study, given that participants controlled their own volume. Half of the sentences were accompanied by a video of the talker presented on the computer screen (audiovisual condition), whereas the other half were presented without an accompanying video (audio-only condition). The target talker was instructed to maintain a neutral expression while speaking at a conversational pace with natural intonation. The talker in the video recordings was seated in front of a blue background and repeated lists of sentences heard via an earbud hidden from view. She spoke directly into the camera in front of her. The video recordings were edited in Adobe Premiere Pro.

Two female native speakers each of American English and Dutch recorded masker sentences (e.g., The wrong shot led the farm) from the Syntactically Normal Sentence Test (SNST; Nye and Gaitenby, 1974). All stimuli were recorded in a sound-attenuating booth at a 44.1 kHz sampling rate and 16-bit resolution. The combined target and masker stimuli were set to SNR levels of 0 dB for the audio-only condition and −8 dB for the audiovisual condition. The SNRs differed in steps of 8 dB to account for the visual release from masking that results from VSI. Masker language and presentation modality were manipulated within subjects. The presentation consisted of 20 sentences per masker language per presentation modality, so each participant was presented with 80 sentences during the experiment.
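
The SNR manipulation can be illustrated with a short sketch. The snippet below is not the authors' processing pipeline; it is a minimal R example, assuming the target and masker waveforms are available as numeric vectors at the same sampling rate, of how a masker can be rescaled relative to a fixed-level target to achieve a desired SNR before mixing.

```r
# Minimal sketch (not the authors' pipeline): rescale the masker so that
# 20 * log10(rms(target) / rms(masker)) equals the desired SNR, then mix.
mix_at_snr <- function(target, masker, snr_db) {
  rms <- function(x) sqrt(mean(x^2))
  gain <- rms(target) / (rms(masker) * 10^(snr_db / 20))
  # Assumes the masker is at least as long as the target
  target + gain * masker[seq_along(target)]
}

# e.g., audio-only at 0 dB vs audiovisual at -8 dB, as in Experiment 1:
# mixed_ao <- mix_at_snr(target_wav, masker_wav, 0)
# mixed_av <- mix_at_snr(target_wav, masker_wav, -8)
```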

The tasks were conducted online using LabVanced (Finger et al., 2017). Headphone calibration took place prior to the start of the experiment: participants were instructed to wear headphones, turn their volume to zero, play the headphone calibration sound, and gradually increase the volume to a comfortable listening level. Next, a headphone check (approximately 3 min) was administered using the dichotic Huggins pitch approach, in which a faint pitch in noise can only be detected with dichotically presented stimuli (Milne et al., 2021). Once participants passed the headphone check, they were presented with a total of fourteen audio-only and audiovisual practice trials (from a BKB-R list) to familiarize themselves with the target talker and the task. They received written instructions to type the target sentence they heard to the best of their ability and to report individual words if they were unable to understand the entire sentence. They were presented with English sentences in the presence of two-talker background speech. Each sentence was played once, and participants typed their responses before moving to the next trial. The written responses were scored as incorrect if a keyword was missing, incomplete, or wrong. Each keyword was checked for spelling errors and corrected during scoring prior to analysis; for example, the keyword potatoes spelled "potatos" was corrected, retained, and counted as correct. Presentation modalities (audiovisual and audio-only) were blocked and counterbalanced across participants: half started with the audiovisual block and the other half with the audio-only block. The presentation order for each masker language (English and Dutch) was randomized within each block for all participants.
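
As an illustration of the keyword scoring rule described above, the following hedged R sketch counts a keyword as correct when it appears in the typed response, tolerating a one-character spelling slip. The authors corrected spelling by hand; the edit-distance criterion here is only an approximation of that manual step, and the function name is hypothetical.

```r
# Hypothetical scoring helper: proportion of keywords reported correctly,
# with an edit distance of 1 standing in for manual spelling correction.
score_response <- function(response, keywords) {
  typed <- tolower(strsplit(response, "\\s+")[[1]])
  hit <- function(kw) any(adist(tolower(kw), typed) <= 1)  # adist() is base R
  mean(vapply(keywords, hit, logical(1)))
}

score_response("the clown had a funy face", c("clown", "funny", "face"))
# [1] 1  ("funy" is counted as "funny")
```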

The data were analyzed in RStudio (R Core Team, 2021; RStudio Team, 2021) using the lme4 package (Bates et al., 2015). Linear mixed-effects regression analyses (see Baayen et al., 2008) were conducted to examine the effect of modality (audio-only and audiovisual), masker language (Dutch and English), and their interaction on intelligibility. Proportion correct intelligibility score was the dependent variable. Masker language (English vs Dutch) and modality (audio-only vs audiovisual) were entered as fixed effects. Simple contrast coding was used for all fixed effects, so that each level is compared against the intercept, the grand mean across both factors. Random intercepts for participant and item were included to account for across-participant and across-item differences in average proportion correct intelligibility score. Note that the most parsimonious model did not include random slopes. Table II shows the parameters for the full model that was run to test the hypotheses.
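
For concreteness, a minimal sketch of this model in lme4 follows. The data frame and column names are hypothetical (the paper does not list them), simulated scores stand in for the real data, and centered ±0.5 contrasts implement the simple coding described above so that the intercept estimates the grand mean.

```r
library(lme4)

# Hypothetical trial-level data: one row per sentence per participant
set.seed(1)
d <- expand.grid(subject = factor(1:32), item = factor(1:80))
d$modality        <- factor(ifelse(as.integer(d$item) <= 40, "AO", "AV"))
d$masker_language <- factor(rep(c("Dutch", "English"), length.out = nrow(d)))
d$score <- pmin(pmax(rnorm(nrow(d), mean = 0.74, sd = 0.25), 0), 1)

# Simple (centered) contrasts: the intercept is the grand mean
contrasts(d$modality)        <- c(-0.5, 0.5)
contrasts(d$masker_language) <- c(-0.5, 0.5)

m_full <- lmer(score ~ modality * masker_language +
                 (1 | subject) + (1 | item),  # random intercepts only
               data = d)
summary(m_full)
```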

TABLE II.

Experiment 1 intelligibility score parameters in proportion correct.

                                 Empty model                       Full model
Parameter                        Estimate  SE    t       p         Estimate  SE    t       p
Fixed effects
  Intercept                      0.79      0.03  23.23   <0.001    0.74      0.03  24.20   <0.001
  Modality                                                         −0.30     0.01  −23.90  <0.001
  Masker language                                                  0.07      0.01  5.80    <0.001
  Modality * masker language                                       0.08      0.02  3.60    <0.001
Random effects
  Subject variance (intercept)   0.02                              0.02
  Item variance (intercept)      0.01                              0.01
  Residual variance              0.09                              0.07

Participant performance was measured by scoring the correct keywords reported per sentence to obtain an intelligibility score. The left panel in the first column of Fig. 1 shows that participants scored high in both the audio-only Dutch and English conditions. The right panel of the first column of Fig. 1 shows higher intelligibility for the different-language background compared to the same-language background in the audiovisual condition, indicating an LRM effect: participants reported more keywords correctly in the Dutch condition relative to English. In addition to intelligibility scores, we also report LRM scores to clearly show the effects across modality.

FIG. 1.

The effect of modality on intelligibility plotted in percentage points (instead of proportion correct to ease readability) for all perceivers for the Dutch and English conditions in Experiment 1 (left column) and Experiment 2 (right column). The difference between Dutch and English indicates LRM in the audiovisual modality for Experiment 1 and LRM in audio-only and audiovisual for Experiment 2. Note: This figure plots intelligibility as percentage correct. However, intelligibility was proportion correct in the analysis.

The model's intercept, corresponding to the grand mean proportion correct score, is 0.74 with a standard error of 0.03. The analysis showed that intelligibility was significantly influenced by modality (estimate = −0.30, standard error, SE = 0.01, t = −23.90, p < 0.001), with higher overall performance in the audio-only condition than in the audiovisual condition. While this seems like an opposite effect, it is not: recall that we had attempted to equalize performance by presenting the modalities at different SNRs. This effect shows that the chosen SNR for the audiovisual condition went beyond simply neutralizing the audiovisual benefit. Intelligibility was also influenced by masker language (estimate = 0.07, SE = 0.01, t = 5.80, p < 0.001). Participants showed LRM effects such that Dutch maskers produced higher target intelligibility than English maskers overall. There was also a significant interaction between modality and masker language (estimate = 0.08, SE = 0.02, t = 3.60, p = 0.0003). This interaction indicates that LRM is affected by modality: LRM is present in the audiovisual condition (estimate = −0.11, SE = 0.02, t = −7.40, p < 0.001) but not in the audio-only condition (estimate = −0.03, SE = 0.02, t = −1.43, p = 0.633).

In this experiment, participants were presented with audiovisual stimuli to investigate LRM effects under such conditions. LRM occurred in the audiovisual condition, indicating that VSI does not neutralize the factors that drive LRM. In other words, perceivers use both VSI and differing target-masker languages to overcome masking effects. The interaction between modality and masker language suggests that LRM is modulated by modality. Performance in the audio-only condition was better than in the audiovisual condition, indicating that the 8 dB difference overcompensated for the audiovisual benefit; that is, 0 dB SNR for audio-only and −8 dB for audiovisual were not comparable. Participant performance in the audiovisual condition fell short of expectations because the chosen SNR (−8 dB) was too difficult. In Experiment 1, we observed LRM effects in the audiovisual condition; however, it is unclear whether this effect is due to SNR level or modality, because the SNR levels for the audio-only and audiovisual conditions differed in steps of 8 dB to account for increased performance due to visual release from masking. We made adjustments in Experiment 2 to address this issue and to facilitate a direct comparison of audio-only and audiovisual conditions at the same SNR.

In Experiment 2, we examined whether it is truly modality or the SNR level driving this interaction. To test this, we used an audio-only SNR level (−8 dB) that overlapped with the audiovisual SNR (−8 dB) in Experiment 1. This was implemented to evaluate the size of LRM between modalities at the same SNR level. In Experiment 2, the audiovisual condition was presented at a −16 dB SNR level to again equate performance for the audiovisual and audio-only conditions by using an 8 dB difference.

In Experiment 2, we tested whether we would replicate LRM for the audiovisual condition using a harder SNR pair (−8 dB SNR audio-only and −16 dB SNR audiovisual) relative to the easier SNR pair used in Experiment 1 (0 dB SNR audio-only and −8 dB SNR audiovisual). This maintains the 8 dB difference while affording the possibility of directly comparing audiovisual and audio-only conditions across the two groups at the same SNR (−8 dB). We expected stronger effects of VSI at the harder SNR compared to Experiment 1; that is, the benefit of VSI should be larger in the −16 dB SNR audiovisual condition than in the −8 dB SNR audiovisual condition, due to participants relying more heavily on the visual information.

1. Participants

Thirty participants (20 females and 10 males) ranging from 18 to 42 years old (median age = 26 years) participated in the experiment. Four additional participants were excluded because of extremely low performance (predetermined cutoff of less than 10%; the same exclusion criterion as in Experiment 1). Three participants were 40 or above.2 All participants were native speakers of American English with no knowledge of Dutch and provided informed consent. Participants reported no hearing or speech deficits, normal or corrected-to-normal vision, and provided information on exposure to foreign-accented speech. Participants were recruited from Prolific®, an online participant recruitment database, and received compensation for their participation. All participants passed headphone checks to ensure they were wearing headphones (Milne et al., 2021).

2. Materials and procedure

All materials and procedures were the same as in Experiment 1. The only exception was that different SNR levels were used in Experiment 2. The audio-only condition used an SNR of −8 dB and the audiovisual condition used an SNR of −16 dB.

The data were analyzed using the same method as in Experiment 1. Table III shows the parameters for the full model that was run to test the hypotheses.

TABLE III.

Experiment 2 intelligibility score parameters in proportion correct.

                                 Empty model                       Full model
Parameter                        Estimate  SE    t      p          Estimate  SE    t      p
Fixed effects
  Intercept                      0.27      0.04  7.20   <0.001     0.31      0.03  10.21  <0.001
  Modality                                                         0.10      0.02  5.50   <0.001
  Masker language                                                  0.25      0.02  13.70  <0.001
  Modality * masker language                                       −0.24     0.05  −5.00  <0.001
Random effects
  Subject variance (intercept)   0.01                              0.01
  Item variance (intercept)      0.03                              0.02
  Residual variance              0.10                              0.09

Participant performance was again measured by scoring the correct keywords reported per sentence to obtain an intelligibility score. The right column of Fig. 1 shows that participants scored higher in the Dutch than the English conditions, suggesting an LRM effect for both audio-only and audiovisual. The right panel in the second column of Fig. 1 shows that participants recognized more keywords overall in the audiovisual condition relative to audio-only, which is especially apparent for the English masker.

The data were then submitted to a linear mixed-effects model to evaluate the visually apparent effects. The model's intercept, corresponding to the grand mean proportion correct score, is 0.31 with a standard error of 0.03. The lower mean score relative to Experiment 1 reflects lower performance at the comparatively harder SNRs used in Experiment 2. Intelligibility was significantly influenced by modality, estimate = 0.10, SE = 0.02, t = 5.50, p < 0.001, such that participants showed higher overall performance in the audiovisual condition than in the audio-only condition. Intelligibility was also influenced by masker language, estimate = 0.25, SE = 0.02, t = 13.70, p < 0.001, indicating LRM across modalities: Dutch maskers produced higher intelligibility than English maskers overall. Finally, a significant interaction between modality and masker language, estimate = −0.24, SE = 0.05, t = −5.00, p < 0.001, indicates that the effect of modality is weaker in the Dutch condition. To further analyze this interaction, a post hoc analysis of simple effects was performed, examining the effect of masker language on intelligibility separately within the audio-only and audiovisual conditions and comparing the estimated means. Results showed larger LRM effects in the audio-only condition, where participants performed better in the Dutch condition than in the English condition (estimate = −0.37, SE = 0.03, t = −11.07, p < 0.001). LRM effects were smaller in the audiovisual condition, again with higher performance for Dutch than English (estimate = −0.13, SE = 0.03, t = −4.60, p < 0.001). In this analysis, we cannot directly assess the size of LRM across modalities because the modalities were presented at different SNRs. The analysis in Sec. III C 1 assesses the effects of modality at the same SNR.
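
The simple-effects follow-up could be run, for example, with the emmeans package; the paper does not name the tool used, so this is an assumption, and the object names carry over from the hypothetical model sketch above.

```r
library(emmeans)

# Masker-language effect within each modality (m_full is a fitted lmer
# model as sketched earlier; column names are hypothetical)
emm <- emmeans(m_full, ~ masker_language | modality)
pairs(emm)  # Dutch - English contrast in audio-only and in audiovisual
```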

1. Assessing effects of modality using the same SNR

To examine the relationship between the size of LRM and modality, we submitted intelligibility scores from the −8 dB SNR audiovisual condition in Experiment 1 and intelligibility scores from the −8 dB SNR audio-only condition in Experiment 2 to a linear mixed-effects model with random intercepts by item and by subject. Table IV shows the parameters for the full model that was run to test the hypotheses. The model's intercept, corresponding to the grand mean proportion correct score, was 0.44 with a standard error of 0.03. This analysis revealed a significant main effect of modality (estimate = 0.35, SE = 0.02, t = 19.70, p < 0.001), indicating better performance in the audiovisual condition relative to audio-only even when both modalities were presented at the same SNR, as seen in the middle two panels of Fig. 1. There was also a main effect of masker language (estimate = 0.25, SE = 0.02, t = 14.22, p < 0.001), indicating better performance for Dutch compared to English in both modalities. Finally, there was a significant interaction between modality and masker language (estimate = −0.28, SE = 0.04, t = −7.93, p < 0.001), with a larger LRM effect in the audio-only condition (estimate = −0.39, SE = 0.03, t = −12.60, p < 0.001) than in the audiovisual condition (estimate = −0.11, SE = 0.02, t = −6.38, p < 0.001), as seen in Fig. 2. This suggests that LRM is modulated by modality even when the SNR level is constant across modality.
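
The per-participant LRM scores plotted in Fig. 2 amount to a simple difference score. A hedged sketch follows, assuming a combined data frame d8 holding the two −8 dB SNR conditions with the hypothetical column names used above; this is an illustration, not the authors' code.

```r
library(dplyr)

# LRM per participant and modality: Dutch minus English intelligibility.
# d8 is assumed to combine the -8 dB audiovisual (Expt 1) and
# -8 dB audio-only (Expt 2) trials.
lrm <- d8 %>%
  group_by(subject, modality) %>%
  summarise(
    lrm = mean(score[masker_language == "Dutch"]) -
          mean(score[masker_language == "English"]),
    .groups = "drop"
  )
boxplot(lrm ~ modality, data = lrm)  # cf. Fig. 2
```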

TABLE IV.

The −8 dB audio-only vs −8 dB audiovisual intelligibility score parameters in proportion correct.

                                 Empty model                       Full model
Parameter                        Estimate  SE    t      p          Estimate  SE    t      p
Fixed effects
  Intercept                      0.36      0.04  8.08   <0.001     0.44      0.03  13.70  <0.001
  Modality                                                         0.35      0.02  19.70  <0.001
  Masker language                                                  0.25      0.02  14.22  <0.001
  Modality * masker language                                       −0.28     0.04  −7.93  <0.001
Random effects
  Subject variance (intercept)   0.02                              0.02
  Item variance (intercept)      0.05                              0.01
  Residual variance              0.11                              0.10

FIG. 2.

(Color online) The LRM effect plotted in percentage points for all perceivers for each modality at −8 dB SNR. The lower horizontal lines (medians) in the audiovisual condition compared to the audio-only condition indicate larger LRM effects in the audio-only condition. The boxes depict the values between the 25th and 75th percentiles, and the whiskers denote ±1.5 times the interquartile range.

In this pair of experiments, we investigated whether LRM is modulated by modality. To do so, participants in Experiment 2 were presented with more challenging SNR levels, including an SNR level that overlapped with Experiment 1 across modality. Results showed that participants performed better overall with Dutch as the masker language in both the audio-only and audiovisual conditions, and that LRM still occurred in the audiovisual condition. These findings reinforce the idea that the factors driving LRM are not redundant with those responsible for visual release from masking. It is interesting that LRM is smaller in the audiovisual condition compared to audio-only when presented at the same SNR level. While LRM is reliably detected in both conditions, its size is modulated by modality. This suggests that reducing the uncertainty, cognitive load, and attention-to-the-masker components of informational masking through audiovisual means does affect the size of LRM.

As expected, the availability of VSI that matched the auditory input of the target speech reduced uncertainty and cognitive load. It also provided redundant information about the target that reduced overall attention to the masker. This reduction in informational masking components allowed participants to show improved speech intelligibility due to VSI. In addition, when uncertainty was reduced by VSI in Experiment 2 (at the decreased SNR levels), participants showed increased performance for the audiovisual English-English condition compared to audio-only. Overall, the results showed that adding VSI reduced cognitive load and attention to the masker, which resulted in a weaker LRM effect.

Altogether, these results are consistent with the hypothesis that LRM is weaker in the audiovisual condition. They also agree with the Helfer and Freyman (2005) study, wherein SRM was smaller with audiovisual stimuli. LRM effects can be compared with the SRM examined by Helfer and Freyman because both types of release provide additional information for easier segregation of target-masker speech streams, and the current study likewise showed that the release (here, LRM) was smaller with audiovisual stimuli. These analogous results suggest that the reduction in informational masking provided by VSI reduces both types of masking release.

The obtained LRM effects may be due to the overall reduced cognitive load that VSI provides for the perceiver. In a follow-up experiment, we will further manipulate cognitive load by testing whether linguistic experience with both masker languages (i.e., higher cognitive load) influences LRM effects in the presence of VSI. We will test Dutch-English bilinguals and American English monolinguals because of their differing knowledge of the masker languages. The task imposes a higher cognitive load on the bilinguals, who know both masker languages: they must inhibit linguistic information in both the English and Dutch maskers, whereas monolinguals inhibit linguistic information from only one background masker. If the LRM effects in this experiment are due to a reduction in cognitive load resulting from the reduced uncertainty provided by VSI, then comparing participants with higher cognitive load (i.e., linguistic knowledge of both masker languages) to those with lower cognitive load (i.e., linguistic knowledge of one masker language) might help us better understand these findings.

In this study, we investigated the effect of manipulating informational masking components while keeping energetic masking the same. To do this, we introduced VSI to participants during an intelligibility task. Experiments 1 and 2 examined the size and presence of LRM effects when presented with VSI. In this study, we found LRM effects in both the audiovisual and audio-only conditions. That is, native English participants showed higher performance for Dutch maskers compared to English maskers with and without VSI. This demonstrates that listeners use both VSI and different language backgrounds to overcome masking effects instead of solely using either VSI or differing language backgrounds. This suggests that reducing uncertainty, cognitive load, and attention to the masker by using VSI does not neutralize LRM effects but instead weakens LRM.

This study showed that LRM persists in the audiovisual modality. It is not immediately clear from the Experiment 1 results how LRM is modulated by modality because of the covariation of SNR and modality. To address this, we conducted Experiment 2 to investigate whether the obtained effects would replicate under more challenging listening conditions. Experiment 2 showed LRM effects in the audio-only condition, replicating typical effects (e.g., Calandruccio et al., 2010; Brouwer et al., 2012). Importantly, this replication confirms that moving the test to an online environment did not fundamentally alter LRM effects. The audiovisual condition also showed LRM effects again; critically, the audiovisual LRM effects from Experiment 1 were replicated under more challenging listening conditions. These effects thus generalize across SNRs and changes in listening difficulty, suggesting that they are stable. Furthermore, by comparing the −8 dB SNR audiovisual condition in Experiment 1 and the −8 dB SNR audio-only condition in Experiment 2, we evaluated the modulation of LRM effects by modality. This comparison showed that LRM is modulated by modality when the SNR level is held constant, with larger LRM effects in the audio-only condition. This supports the explanation that reducing the uncertainty, cognitive load, and attention-to-the-masker components of informational masking by introducing VSI weakens LRM. These experiments demonstrate that LRM and VSI seem to share some informational masking components, though not entirely overlapping ones, that modulate the size of LRM.

The findings from these experiments help guide our understanding of LRM. This study demonstrates that LRM still occurs with reduced overall informational masking contributions: reducing uncertainty and cognitive load by using audiovisual relative to audio-only speech does not eliminate LRM effects. However, the size of LRM changes depending on the modality. This further supports the hypothesis that LRM is present but changes across listening situations. To conclude, our results clearly demonstrate that LRM occurs in the audiovisual modality and that reducing certain components of informational masking, such as uncertainty and cognitive load, influences the size of LRM.

1

Eight participants were 40, 41, 43, 51, 52, 53, 56, and 69 years old. The effects did not qualitatively change when these participants were excluded.

2

One participant was 40 years old and two were 42 years old. The effects did not qualitatively change when these participants were excluded.

1. Assmann, P., and Summerfield, Q. (2004). "The perception of speech under adverse conditions," in Speech Processing in the Auditory System: Springer Handbook of Auditory Research, edited by S. Greenberg, W. A. Ainsworth, A. N. Popper, and R. R. Fay (Springer, New York), Vol. 14, pp. 231–308.
2. Baayen, R. H., Davidson, D. J., and Bates, D. M. (2008). "Mixed-effects modeling with crossed random effects for subjects and items," J. Memory Lang. 59(4), 390–412.
3. Bates, D., Maechler, M., Bolker, B., and Walker, S. (2015). "Fitting linear mixed-effects models using lme4," J. Stat. Softw. 67(1), 1–48.
4. Bench, J., Kowal, Å., and Bamford, J. (1979). "The BKB (Bamford-Kowal-Bench) sentence lists for partially-hearing children," Br. J. Audiol. 13(3), 108–112.
5. Booij, G. (1999). The Phonology of Dutch (Oxford University Press, Oxford, UK).
6. Booij, G. (2019). The Morphology of Dutch (Oxford University Press, Oxford, UK).
7. Brouwer, S. (2017). "Masking release effects of a standard and a regional linguistic variety," J. Acoust. Soc. Am. 142, EL237–EL243.
8. Brouwer, S. (2019). "The role of foreign accent and short-term exposure on speech-in-speech recognition," Atten. Percept. Psychophys. 81(6), 2053–2062.
9. Brouwer, S., and Bradlow, A. R. (2014). "Contextual variability during speech-in-speech recognition," J. Acoust. Soc. Am. 136, EL26–EL32.
10. Brouwer, S., Van Engen, K. J., Calandruccio, L., and Bradlow, A. R. (2012). "Linguistic contributions to speech-on-speech masking for native and non-native listeners: Language familiarity and semantic content," J. Acoust. Soc. Am. 131(2), 1449–1464.
11. Brown, V. A., Dillman-Hasso, N., Li, Z., Ray, L., Mamantov, E., Van Engen, K., and Strand, J. (2021). "Lipreading in noise: Cross-modal analysis of the target-masker linguistic similarity hypothesis," preprint, doi.org/10.31234/osf.io/a425d.
12. Brungart, D. S. (2001). "Informational and energetic masking effects in the perception of two simultaneous talkers," J. Acoust. Soc. Am. 109(3), 1101–1109.
13. Calandruccio, L., Brouwer, S., Van Engen, K. J., Dhar, S., and Bradlow, A. R. (2013). "Masking release due to linguistic and phonetic dissimilarity between the target and masker speech," Am. J. Audiol. 22(1), 157–164.
14. Calandruccio, L., Dhar, S., and Bradlow, A. R. (2010). "Speech-on-speech masking with variable access to the linguistic content of the masker speech," J. Acoust. Soc. Am. 128(2), 860–869.
15. Calandruccio, L., and Zhou, H. (2014). "Increase in speech recognition due to linguistic mismatch between target and masker speech: Monolingual and simultaneous bilingual performance," J. Speech Lang. Hear. Res. 57(3), 1089–1097.
16. Carr, P. (2019). English Phonetics and Phonology: An Introduction (John Wiley & Sons, Hoboken, NJ).
17. Collins, B. D., and Mees, I. (2003). The Phonetics of English and Dutch (Brill, Leiden, Netherlands).
18. Cooke, M., Garcia Lecumberri, M. L., and Barker, J. (2008). "The foreign language cocktail party problem: Energetic and informational masking effects in non-native speech perception," J. Acoust. Soc. Am. 123(1), 414–427.
19. Dekle, D. J., Fowler, C. A., and Funnell, M. G. (1992). "Audiovisual integration in perception of real words," Percept. Psychophys. 51(4), 355–362.
20. Dryer, M. S., and Haspelmath, M. (eds.) (2013). The World Atlas of Language Structures Online (Max Planck Institute for Evolutionary Anthropology, Leipzig), available at http://wals.info (Last viewed 6/1/2022).
21. Durlach, N. I., Mason, C. R., Kidd, G., Jr., Arbogast, T. L., Colburn, H. S., and Shinn-Cunningham, B. G. (2003). "Note on informational masking (L)," J. Acoust. Soc. Am. 113(6), 2984–2987.
22. Finger, H., Goeke, C., Diekamp, D., Standvoß, K., and König, P. (2017). "LabVanced: A unified JavaScript framework for online studies," in Proceedings of the International Conference on Computational Social Science, October 19–22, Santa Fe, NM.
23. Gallun, F. J., Seitz, A., Eddins, D. A., Molis, M. R., Stavropoulos, T., Jakien, K. M., Kampel, S. D., Diedesch, A. C., Hoover, E. C., Bell, K., Souza, P. E., Sherman, M., Calandruccio, L., Xue, G., Taleb, N., Sebena, R., and Srinivasan, N. (2018). "Development and validation of Portable Automated Rapid Testing (PART) measures for auditory research," Proc. Mtgs. Acoust. 33(1), 050002.
24. Garcia Lecumberri, M., and Cooke, M. (2006). "Effect of masker type on native and non-native consonant perception in noise," J. Acoust. Soc. Am. 119(4), 2445–2454.
25. Grant, K. W., and Seitz, P. F. (2000). "The use of visible speech cues for improving auditory detection of spoken sentences," J. Acoust. Soc. Am. 108(3), 1197–1208.
26. Helfer, K. S., and Freyman, R. L. (2005). "The role of visual speech cues in reducing energetic and informational masking," J. Acoust. Soc. Am. 117(2), 842–849.
27. Helfer, K. S., and Freyman, R. L. (2008). "Aging and speech-on-speech masking," Ear Hear. 29(1), 87–98.
28. Holler, J., and Levinson, S. C. (2019). "Multimodal language processing in human communication," Trends Cogn. Sci. 23(8), 639–652.
29. Mattys, S. L., Brooks, J., and Cooke, M. (2009). "Recognizing speech under a processing load: Dissociating energetic from informational factors," Cogn. Psychol. 59(3), 203–243.
30. Milne, A. E., Bianco, R., Poole, K. C., Zhao, S., Oxenham, A. J., Billig, A. J., and Chait, M. (2021). "An online headphone screening test based on dichotic pitch," Behav. Res. Methods 53(4), 1551–1562.
31. Nye, P. W., and Gaitenby, J. H. (1974). "The intelligibility of synthetic monosyllabic words in short, syntactically normal sentences," Haskins Lab. Status Rep. Speech Res. 37(38), 169–190, available at https://files.eric.ed.gov/fulltext/ED094445.pdf#page=168.
32. R Core Team (2021). "R: A language and environment for statistical computing," R Foundation for Statistical Computing, Vienna, Austria, http://www.R-project.org/.
33. RStudio Team (2021). "RStudio: Integrated development for R" (RStudio, Inc., Boston, MA), available at http://www.rstudio.com.
34. Peelle, J. E., and Sommers, M. S. (2015). "Prediction and constraint in audiovisual speech perception," Cortex 68, 169–181.
35. Schubotz, L., Holler, J., Drijvers, L., and Özyürek, A. (2021). "Aging and working memory modulate the ability to benefit from visible speech and iconic gestures during speech-in-noise comprehension," Psychol. Res. 85, 1997–2011.
36. Shinn-Cunningham, B. G. (2008). "Object-based auditory and visual attention," Trends Cogn. Sci. 12(5), 182–186.
37. Sumby, W. H., and Pollack, I. (1954). "Visual contribution to speech intelligibility in noise," J. Acoust. Soc. Am. 26(2), 212–215.
38. Summerfield, Q. (1987). "Some preliminaries to a comprehensive account of audio-visual speech perception," in Hearing by Eye: The Psychology of Lip-Reading (Lawrence Erlbaum Associates, Inc., Mahwah, NJ), pp. 3–52.
39. Summerfield, Q. (1992). "Lipreading and audio-visual speech perception," Philos. Trans. R. Soc. London B Biol. Sci. 335(1273), 71–78.
40. Van Engen, K. J., and Bradlow, A. R. (2007). "Sentence recognition in native- and foreign-language multi-talker background noise," J. Acoust. Soc. Am. 121(1), 519–526.
41. Viswanathan, N., and Kokkinakis, K. (2019). "Listening benefits in speech-in-speech recognition are altered under reverberant conditions," J. Acoust. Soc. Am. 145(5), EL348–EL353.
42. Viswanathan, N., Kokkinakis, K., and Williams, B. T. (2016). "Spatially separating language masker from target results in spatial and linguistic masking release," J. Acoust. Soc. Am. 140(6), EL465–EL470.
43. Viswanathan, N., Kokkinakis, K., and Williams, B. T. (2018). "Listeners experience linguistic masking release in noise-vocoded speech-in-speech recognition," J. Speech Lang. Hear. Res. 61(2), 428–435.
44. Watson, C. S., Kelly, W. J., and Wroton, H. W. (1976). "Factors in the discrimination of tonal patterns. II. Selective attention and learning under various levels of stimulus uncertainty," J. Acoust. Soc. Am. 60(5), 1176–1186.
45. Williams, B. T., and Viswanathan, N. (2020). "The effects of target-masker sex mismatch on linguistic release from masking," J. Acoust. Soc. Am. 148(4), 2006–2014.