The role of working memory (WM) and long-term lexical-semantic memory (LTM) in the perception of interrupted speech with and without visual cues was studied in 29 native English speakers. Perceptual stimuli were periodically interrupted sentences in which the interruptions were filled with speech-shaped noise. The memory measures included an LTM semantic fluency task and verbal and visuo-spatial WM tasks. Whereas perceptual performance in the audio-only condition demonstrated a significant positive association with listeners' semantic fluency, perception in the audio-video condition did not. These results imply that when listening to distorted speech without visual cues, listeners rely on lexical-semantic retrieval from LTM to restore missing speech information.

In real-life adverse listening situations, auditory information alone is not adequate to achieve ideal speech intelligibility. Visual cues congruent with the auditory signal provide compensatory information for speech understanding.1 These compensatory visual cues are manifested in several ways. First, seeing a speaker's face enables us to effectively track the speech source. Second, visual cues convey articulatory gestures, such as tongue height and tongue movement for vowels and place of articulation for consonants.2,3 Third, the temporal continuity of the audio-visual stimulus aids the segmentation of continuous speech into individual words.1 Finally, visual cues supplement supra-segmental information, including the intonation, stress, and rhythmic properties of the auditory stimuli.4,5

One type of controlled adverse listening condition used to study auditory closure ability is created by introducing periodic interruptions to a continuous speech stream. The interruptions can be silent gaps (i.e., removed segments of speech) or fillers, such as noise or tones. Warren6 was the first to report the phenomenon of “phonemic restoration” (also known as auditory closure), based on the observation that listeners reported hearing all speech sounds even when a 120 ms section of an interrupted sentence was removed and filled with a cough or a tone. Following this classic study, several studies have reported that speech interrupted with regular intervals of noise was more intelligible than speech with silence-filled intervals.7–9 This is attributed to the illusion of perceptual continuity of speech, which relies on Gestalt principles.10,11 Perceived continuity of speech behind the noise is believed to activate a larger semantic network in listeners' lexical-semantic long-term memory (LTM). LTM in the context of this paper refers to one's declarative memory/stored knowledge over an extended duration, which includes semantic memory (meaning/concepts), lexical memory (word forms), and episodic memory (event memory). Activation of a larger semantic network in LTM potentially improves a listener's ability to identify partially heard words in a sentence.12 Although a few studies have focused on the role of the LTM system in perceptual restoration with auditory-only input, it remains largely unknown whether and to what extent listeners rely on LTM when visual cues are added during recognition of interrupted speech.

Grossberg and Kazerounian13 proposed that the interaction between low-level acoustic input and high-level contextual information during perception of interrupted speech is linked to listeners' working memory (WM), because WM helps listeners retain and update low-level acoustic information to match high-level lexical and sentence units (i.e., meaning and grammar). WM refers to the concurrent maintenance and manipulation of information in real time.14 Inconsistent results have been reported regarding whether working memory capacity (WMC) predicts auditory closure ability measured using interrupted speech perception paradigms. Some researchers found no association between listeners' WMC and their performance on interrupted speech with silent intervals in both normal-hearing adults and adults with hearing impairment.15–17 However, Nagaraj and Magimairaj8 found that when the silent gaps were replaced by low-pass filtered speech noise, listeners' receptive vocabulary and WMC explained unique variance in the perceptual restoration of missing speech. These differences may be partially attributed to methodological factors, for example, noise-filled gaps promoting Gestalt perception (perceptual continuity) and thereby activating semantic memory and WM. While studies have examined the relationship between WM and interrupted speech perception in the auditory modality, there is a paucity of research on how the integration of auditory and visual information during interrupted speech perception is driven by listeners' WMC.

As an extension of previous research, in the present study we aimed to address the role of WMC and lexical-semantic LTM retrieval ability in processing visual cues when recognizing auditorily distorted sentences. Studies of visual-only speech reading have shown a direct relationship between speech reading performance and LTM but not WMC.18 The Ease of Language Understanding (ELU) model has been developed and tested over time to explain the role of WMC and LTM in speech understanding, especially in adverse listening conditions.19,20 According to the ELU model, supportive visual input reduces uncertainties in adverse listening conditions, thereby decreasing the demand on memory resources relative to listening to speech without supplemental visual cues. Therefore, we hypothesized that integrating visual cues to restore missing speech information in the interrupted speech paradigm would not be related to individuals' WMC. We also predicted that when listening to interrupted sentences with visual cues, listeners would rely less on the search and retrieval of desired lexical items from LTM.

Participants who passed hearing screening21 qualified for the study. Twenty-nine normal-hearing adults (11 males, 18 females) participated (age range = 18–40 years, mean age = 22 years). All participants had normal or corrected vision and spoke English as their native language. Participants received course credit or financial compensation. This study was approved by ethical committees at Utah State University and the University of Wisconsin–Milwaukee as it was a collaborative research project.

Each participant completed an interrupted speech perception task, two WM measures, and a semantic fluency task. All experimental tasks were conducted in a sound-treated room and were administered in pseudo-random order. The order of the perception and memory tests was fixed, but the conditions in the speech perception test and subtests in the WM tests were both randomized. Testing was carried out in one session lasting 2 h with sufficient breaks.

Speech stimuli consisted of audio and video recordings of Quick Speech-in-Noise (QuickSIN;22 sample sentence: “A white silk jacket goes with any shoes”) and AzBio sentences23 (sample sentence: “She relied on him for transportation”). All sentences were spoken by a male native speaker of English in a neutral accent using a conversational style with limited contextual cues. The digitized sentences were processed using MATLAB version 8.5 and a Tucker Davis Technologies (Alachua, FL) RZ-6 signal processor to create the interrupted speech conditions. Sentences were gated with a 50% duty cycle square wave at a rate of 2.5 Hz to create alternating 200-ms speech segments and silent intervals. To minimize distortion associated with abrupt gating of speech, 5-ms raised cosine ramps were applied to the onset and offset of the square wave. Concatenated sentences were fed to the RZ-6 signal processor at regular intervals and multiplied with the square wave, which was generated in real time. Finally, the silent intervals were filled with speech-shaped noise (SSN), which was bandpass filtered between 80 and 8000 Hz using a fourth-order Butterworth filter. The interrupted filler noise was generated by gating the filtered SSN with the inverse of the square wave used for gating the sentences. The SSN used to fill the silent intervals had the same long-term power spectrum as the sentences. The root mean square (rms) level of the filler SSN was normalized to be 8 dB higher than that of the unprocessed sentences.
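
The gating scheme can be summarized in a short sketch. The original processing was implemented in MATLAB with the RZ-6 hardware; the R code below is only an illustrative reconstruction of the same steps (square-wave gating, raised cosine ramps, band-pass filtered SSN, and an 8 dB level offset). The function and variable names, and the use of the signal package for the Butterworth filter, are assumptions for illustration rather than the authors' code.

```r
library(signal)  # assumed available; provides butter() and filtfilt()

gate_sentence <- function(sentence, ssn, fs = 44100, rate_hz = 2.5,
                          ramp_ms = 5, noise_gain_db = 8) {
  n <- length(sentence)
  # 50% duty-cycle square wave at 2.5 Hz: alternating 200-ms on/off segments
  seg  <- round(fs / (2 * rate_hz))
  gate <- rep(rep(c(1, 0), each = seg), length.out = n)
  # 5-ms raised cosine ramps smooth each on/off transition of the gate
  nr  <- round(ramp_ms / 1000 * fs)
  win <- (1 - cos(pi * seq(0, 1, length.out = nr))) / 2
  for (i in which(diff(gate) != 0)) {
    idx <- i + seq_len(nr)
    idx <- idx[idx <= n]
    gate[idx] <- if (gate[i + 1] == 1) win[seq_along(idx)] else rev(win)[seq_along(idx)]
  }
  # Band-pass filter the speech-shaped noise between 80 and 8000 Hz
  bpf <- butter(4, c(80, 8000) / (fs / 2), type = "pass")
  ssn <- filtfilt(bpf, ssn[seq_len(n)])
  # Set the filler noise rms level 8 dB above that of the unprocessed sentence
  rms <- function(x) sqrt(mean(x^2))
  ssn <- ssn * (rms(sentence) / rms(ssn)) * 10^(noise_gain_db / 20)
  # Speech passes where the gate is on; SSN fills the complementary intervals
  sentence * gate + ssn * (1 - gate)
}
```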

Processed sentences were presented from a computer using Sennheiser (Wedemark, Germany) HD 280 pro circumaural headphones at a 44 100 Hz sampling rate. Participants were instructed to adjust the volume to their most comfortable listening level during the practice trials. Prior to the actual experiment, all participants were trained with a separate set of sentences. On practice trials, participants were provided feedback that comprised the interrupted sentence followed by the original uninterrupted sentence.

Following practice, 40 AzBio and 36 QuickSIN sentences were presented, with half of each set used for the audio-only condition and the other half for the audio-video condition. Therefore, separate lists of sentences were used for the audio-only and audio-video conditions. During the audio-visual presentation of the sentences, participants saw the speaker's face on the computer monitor. Participants were given standard instructions and were encouraged to guess and repeat as many words as possible after every sentence. The next sentence was not presented until the participant finished responding to the previous one. No feedback was provided following the responses. The order of the AzBio and QuickSIN tests was randomized, but the sentences within each test were presented in a fixed order. Each participant's responses were audio-recorded, and correct identification of all the words within each sentence was scored offline to obtain percent correct scores. Six participants at the University of Wisconsin–Milwaukee site provided a written response after each sentence. An independent samples t-test comparing the scores of the participants recruited from the two sites revealed no significant difference.

To index WMC, we used the automated verbal [operation span (Ospan)] and visuo-spatial [symmetry span (Sspan)] WM tasks, which have been reported to have good reliability.24,25 The tasks were administered using E-prime software following standard procedures.24 

For the Ospan task, participants were required to solve a simple arithmetic equation [e.g., “(8 × 2) − 8 = 11”] and quickly determine whether it was “True” or “False” (by selecting a box on the screen) while trying to remember a set of unrelated letters (F, H, J, K, L, N, P, Q, R, S, T, Y). All participants received practice with feedback after each practice trial. The time taken to solve the problems and respond during practice was used to account for individual differences in math-solving speed during the actual experimental trials.

In the experimental trials, participants performed the letter recall and math judgment together. Participants first saw the equation and responded True/False. After each True/False response, a letter appeared on the screen for 1 s. Participants were asked to remember and recall the letters at the end of each set. There were three trials at each set size, with list lengths ranging from three to seven. Half of the experimental trial equations were “False.” The order of set sizes was random for each participant. Each participant's Ospan score was the number of correctly recalled letters in the correct position.

For the Sspan task, participants judged the symmetry of a grid displayed on the screen while remembering the position of red squares on a matrix for later recall. The task structure was the same as the Ospan task. The duration for which the symmetry judgment pictures were displayed on experimental trials was based on the average time taken by the participant on judgments during practice.

Participants first saw an 8 × 8 matrix with some squares filled in solid black color. Participants were to decide if the matrix was symmetrical on both sides of the central vertical axis and provide a touch screen response. Next, participants saw a red square for 650 ms on a 4 × 4 matrix. Then another symmetry judgment picture appeared for the participant to judge. This was followed by a red square on the 4 × 4 grid. At the end of a set of symmetry judgments, participants were required to recall the positions of the red squares in the same order that they were presented on a blank 4 × 4 matrix on the touch screen. Test trials included three trials at each set size. Set size ranged from 2 to 5. Each participant's Sspan score was the number of correctly recalled squares in the correct position.
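
Both span tasks were scored by counting items recalled in their original serial positions. A minimal sketch of that scoring rule is given below; the function name and example items are hypothetical.

```r
# Position-based scoring used for both span tasks: one point for each item
# recalled in its original serial position.
score_span_trial <- function(presented, recalled) {
  len <- min(length(presented), length(recalled))
  sum(presented[seq_len(len)] == recalled[seq_len(len)])
}

# Example: letters F, K, Q presented; participant recalls F, Q, K
score_span_trial(c("F", "K", "Q"), c("F", "Q", "K"))  # returns 1
```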

The semantic fluency task measured access to and retrieval of information from LTM. Participants were asked to name animals as quickly as possible for 2 min. Accurately named exemplars were summed to obtain the total score. Any repeated or incorrect exemplar received a score of 0. Test-retest reliability in adults is reported to be 0.82 (Woodcock-Johnson III Tests of Cognitive Abilities).26 The task has minimal confounds given its simple administration and is known to be a strong measure of semantic fluency (i.e., accuracy and speed of access and retrieval of lexical-semantic information from LTM).

Percent correct scores were transformed into rationalized arcsine units (RAU) for statistical analysis.27 Aggregated RAU scores for the audio-only and audio-video conditions are presented in Fig. 1. As expected, a paired-samples t-test revealed that participants performed significantly better in the audio-video condition than in the audio-only condition, t(28) = 13.83, p < 0.001.
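
For reference, the transform and the paired comparison can be sketched in a few lines of R. The rau() helper below implements Studebaker's rationalized arcsine formula for X words correct out of N; the data frame scores and its column names are hypothetical and not the authors' analysis script, which is documented in the supplementary materials.43

```r
# Rationalized arcsine transform (Studebaker, 1985): X words correct out of N
rau <- function(X, N) {
  theta <- asin(sqrt(X / (N + 1))) + asin(sqrt((X + 1) / (N + 1)))
  (146 / pi) * theta - 23  # rescales the arcsine units to a percent-like scale
}

# Hypothetical per-participant data frame 'scores' with aggregated RAU values
# in columns 'audio' and 'audio_video'; paired comparison of the two modes.
# t.test(scores$audio_video, scores$audio, paired = TRUE)
```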

Fig. 1. Aggregated RAU scores in audio-only and audio-video conditions. Each dot represents the RAU accuracy averaged across AzBio and QuickSIN sentences for one participant.

Correlation analysis revealed that WMC measured using the Ospan was not associated with sentence recognition in either the audio-only or the audio-video condition. However, WMC measured using the Sspan was positively associated with the recognition of sentences in the audio-only condition (r = 0.49, p < 0.01). LTM retrieval ability measured using the semantic fluency task was positively related to audio-only sentence recognition (r = 0.58, p < 0.001) but was not associated with the audio-video condition (r = 0.22, p = 0.26). The partial correlation between LTM retrieval and audio-only sentence recognition, controlling for Sspan, remained positive and significant (rp = 0.44, p < 0.05).
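
A sketch of these correlational analyses in base R follows. The data frame dat and its column names are hypothetical; the partial correlation is computed here from regression residuals, which yields the same coefficient as a dedicated routine (a package such as ppcor would additionally adjust the degrees of freedom of the significance test).

```r
# Hypothetical per-participant data frame 'dat' with columns:
# audio, audio_video (aggregated RAU), ospan, sspan, ltm (semantic fluency).
cor.test(dat$sspan, dat$audio)        # visuo-spatial WMC vs audio-only scores
cor.test(dat$ltm,   dat$audio)        # LTM retrieval vs audio-only scores
cor.test(dat$ltm,   dat$audio_video)  # LTM retrieval vs audio-video scores

# Partial correlation of LTM retrieval and audio-only recognition,
# controlling for Sspan, via the residuals of two simple regressions.
cor.test(resid(lm(ltm   ~ sspan, data = dat)),
         resid(lm(audio ~ sspan, data = dat)))
```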

Following the correlational analysis, the dataset was fitted with linear mixed effects (LME) models to investigate the influence of verbal WMC, visuo-spatial WMC, and LTM retrieval on the recognition of words in sentences with and without visual cues. The advantages of the LME approach are that the hierarchical nature of the repeated measures can be captured more accurately and that random effects (subjects and sentences) can be correctly modeled, thereby avoiding inflation of error rates and spurious results. A series of two-level, random intercept nested models was fit based on the theoretical framework of the study, and the likelihood ratio test was used to assess the significance of model terms.28 Analyses were conducted in R 3.6.1,29 and the “lmer()” function in the “lme4” package30 was used for the LME analysis. A significance level of 0.05 was applied unless otherwise stated. Full documentation of all code and the statistical output is provided in the supplementary materials.43
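
A minimal sketch of this model comparison with lme4 is shown below. The data frame word_dat, its column names, and the exact random effects structure are assumptions for illustration (Table 1 reports a two-level random intercept structure with participants as the level 2 units; a crossed intercept for sentences could be added analogously); the authors' full code is in the supplementary materials.43

```r
library(lme4)

# Hypothetical long-format data frame 'word_dat' with one row per scored word:
# rau, modality (audio vs audio_video), ltm, sspan, subject.
m1 <- lmer(rau ~ modality + ltm + sspan + (1 | subject), data = word_dat, REML = FALSE)
m2 <- lmer(rau ~ modality + ltm         + (1 | subject), data = word_dat, REML = FALSE)
m3 <- lmer(rau ~ modality + sspan       + (1 | subject), data = word_dat, REML = FALSE)
m4 <- lmer(rau ~ modality * ltm         + (1 | subject), data = word_dat, REML = FALSE)

# Likelihood ratio tests between nested maximum-likelihood fits
anova(m2, m1)  # does adding WMC (Sspan) improve on the LTM-only model?
anova(m2, m4)  # does the modality x LTM interaction improve the fit?
```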

Visuo-spatial WMC measured using the Sspan was used in the models. The LME results showed that the model with only LTM fit better than the model with both LTM and WMC. Parameter estimates for all tested models of the RAU accuracy of correct word recognition in sentences are shown in Table 1. The best fit model parameters indicate a significant two-way interaction between modality (audio-only vs audio-video) and LTM, b = −0.27, p < 0.01. Figure 2 illustrates the interaction between modality (audio-only vs audio-video) and LTM retrieval.

Table 1. Parameter estimates for linear mixed effects models for sentence recognition. ***, p < 0.001; **, p < 0.01; *, p < 0.05.

                                       Model 1           Model 2           Model 3           Model 4 (best fit)
Fixed effects, b (SE)a
  Intercept                            57.61 (5.75)***   61.80 (4.19)***   58.96 (6.04)***   57.03 (4.48)***
  Main effects
    Modality (audio vs audio-video)    19.83 (3.02)***   19.84 (3.02)***   19.78 (3.02)***   29.32 (4.33)***
    LTM                                0.22 (0.11)*      0.28 (0.10)**     —                 0.42 (0.11)***
    WMC (Sspan)                        0.21 (0.20)       —                 0.42 (0.18)*      —
  Cross-level interactions
    Modality × LTM                     —                 —                 —                 −0.27 (0.09)**
Random effects, Var
  Intercepts                           6.79              17.62             19.90             17.76
Sample size, N
  Level 2 macro-units (participants)   29
  Level 1 micro-units (words)          2204

a Standard error (SE).

Fig. 2. Best fit linear mixed effects model for sentence recognition (RAU scores) by condition (audio-only vs audio-video), focusing on the effect of LTM retrieval, with 95% confidence bands.

The main goal of this study was to investigate the importance of listeners' memory abilities (lexical-semantic retrieval from LTM and WMC) in restoring missing speech with and without supplemental visual cues. As expected and consistent with the findings from previous studies, audio-visual perception of interrupted sentences was significantly better than audio-only perception.31–33 Listeners with better LTM retrieval ability, as measured using the time-constrained semantic fluency measure, were also better at restoring missing speech in interrupted sentences in the audio-only condition, regardless of their WM abilities. The semantic fluency task taxes the lexical-semantic LTM retrieval system, primarily the temporal lobe-mediated associative vocabulary network.34 Stored vocabulary knowledge in LTM is fundamental to generating names within a category (e.g., animals). During recall, sub-categorization or clustering (e.g., farm animals, carnivores) of items occurs due to the inherent nature of encoding and organization of words in the mental lexicon. Therefore, semantic knowledge and its organization can be crucial to perceptual restoration because they help activate and retrieve a target word (i.e., missing speech) based on cues available in the partial speech information. A cue automatically triggers activation of related words in the lexical network.35 Thus, better LTM retrieval ability reflects the organization and efficiency of lexical networks wherein spreading activation is more precise and faster. Consequently, auditory closure ability improved with better LTM retrieval ability in the current sample, which is consistent with the ELU model prediction.19 Spreading activation triggered by cues in the available speech information may have allowed listeners with better LTM retrieval ability to fill in missing speech more effectively.

Even though a significant positive correlation between visuo-spatial WM and audio-only sentence recognition was observed, WMC was not a significant predictor of auditory closure ability in the best fit LME model; as shown in Table 1, model 1, WMC was no longer a significant predictor once LTM was added to the model. LTM retrieval remained significantly correlated with interrupted speech perception even after controlling for WMC, as revealed by the partial correlation result. These findings contrast with a previous study8 that found WMC to be a unique predictor of interrupted speech perception even after controlling for lexical knowledge. WMC may not have emerged as a unique predictor here because of methodological differences between the two studies, especially the interruption rate, which was 2.5 Hz in this study compared to 1.5 Hz in the previous study.8 At the 1.5 Hz interruption rate, the duration of each missing speech segment (333 ms) is greater than at the 2.5 Hz rate (200 ms). This difference may have created much greater uncertainty in the perception of interrupted speech at 1.5 Hz, which would require greater WM resources.

The findings of this study are similar to those of previous studies that found no association between WMC and speech perception in noise.36 It appears that the recruitment of attention-controlled WMC during listening may depend on the nature of the task demands. When listening in adverse situations, WMC may play a larger role only when the listening situation is highly demanding. Listening to coherent, meaningful sentences may not engage such executive control when the available partial speech cues automatically activate meaningful words and clauses. Furthermore, in neurotypical individuals, it is a robust language system, characterized by a larger vocabulary size and stronger LTM retrieval, that appears to be most crucial to the perception of degraded sentences.8

With the addition of the speaker's face, restoring missing speech sounds in sentences was not related to WMC or LTM. Furthermore, there was no association between visual gain (i.e., the difference in perception scores between the audio-video and audio-only conditions) and any of the memory measures. These results align with the ELU model20 in that visual cues reduce uncertainty and confusion when speech is distorted by interruptions, hence reducing the demand on memory resources. Cumulative evidence shows that the integration of audio-visual information occurs at multiple stages through multiple cognitive mechanisms.37 Integration and processing of cross-modal sensory information are not necessarily more resource-demanding.38,39 Some recent studies have reported reduced cognitive demands for processing speech in suboptimal conditions when combined audio and visual information is available as compared to auditory-only information.40–42 These results suggest that listeners rely heavily on visual cues, if available, to understand degraded speech rather than searching the stored lexicon. The shift in listeners' reliance from semantic LTM to available visual cues found in the current study has significant implications for listening in ecologically valid situations (e.g., noisy restaurants, classrooms). The results suggest that adults with sensory and/or cognitive impairment might benefit substantially from visual cues in adverse listening situations. Visual cues from a speaker's face may enhance speech perception by modulating attentional and emotional networks, thereby increasing recognition accuracy and reducing reliance on lexical search in LTM.

In summary, our data suggest that the perceptual restoration of interrupted sentences benefits substantially from the addition of visual cues. The perception of interrupted sentences with visual cues did not show a significant relationship with listeners' semantic LTM or WMC. This implies that listeners depend on visual cues, when available, to fill in missing speech information rather than taxing their memory system. Without visual cues, however, listeners' ability to retrieve semantic information from LTM, but not their WMC, was crucial for sentence recognition. These results indicate that for listeners with significant limitations in semantic knowledge and retrieval, speech perception in adverse conditions would pose a greater challenge, and providing facial expressions or other types of visual cues would aid their speech perception.

1. T. Cunillera, E. Camàra, M. Laine, and A. Rodríguez-Fornells, “Speech segmentation is facilitated by visual cues,” Q. J. Exp. Psychol. 63(2), 260–274 (2010).
2. R. Campbell, “The processing of audio-visual speech: Empirical and neural bases,” Philos. Trans. R. Soc. London B Biol. Sci. 363(1493), 1001–1010 (2008).
3. Q. Summerfield, “Lipreading and audio-visual speech perception,” Philos. Trans. R. Soc. London B Biol. Sci. 335(1273), 71–78 (1992).
4. C. Chandrasekaran, A. Trubanova, S. Stillittano, A. Caplier, and A. A. Ghazanfar, “The natural statistics of audiovisual speech,” PLoS Comput. Biol. 5(7), e1000436 (2009).
5. J. E. Peelle and M. H. Davis, “Neural oscillations carry speech rhythm through to comprehension,” Front. Psychol. 3, 320 (2012).
6. R. M. Warren, “Perceptual restoration of missing speech sounds,” Science 167(3917), 392–393 (1970).
7. J. A. Bashford, R. M. Warren, and C. A. Brown, “Use of speech-modulated noise adds strong ‘bottom-up’ cues for phonemic restoration,” Percept. Psychophys. 58(3), 342–350 (1996).
8. N. K. Nagaraj and B. M. Magimairaj, “Role of working memory and lexical knowledge in perceptual restoration of interrupted speech,” J. Acoust. Soc. Am. 142(6), 3756–3766 (2017).
9. V. Shafiro, S. Sheft, and R. Risley, “Perception of interrupted speech: Effects of dual-rate gating on the intelligibility of words and sentences,” J. Acoust. Soc. Am. 130(4), 2076–2087 (2011).
10. A. S. Bregman, “Auditory scene analysis and the role of phenomenology in experimental psychology,” Can. Psychol. 46(1), 32–40 (2005).
11. B. G. Shinn-Cunningham and D. Wang, “Influences of auditory object formation on phonemic restoration,” J. Acoust. Soc. Am. 123(1), 295–301 (2008).
12. S. Srinivasan and D. L. Wang, “A schema-based model for phonemic restoration,” Speech Commun. 45(1), 63–87 (2005).
13. S. Grossberg and S. Kazerounian, “Laminar cortical dynamics of conscious speech perception: Neural model of phonemic restoration using subsequent context in noise,” J. Acoust. Soc. Am. 130(1), 440–460 (2011).
14. A. D. Baddeley, “Working memory: Theories, models, and controversies,” Annu. Rev. Psychol. 63, 1–29 (2012).
15. M. R. Benard, J. S. Mensink, and D. Başkent, “Individual differences in top-down restoration of interrupted speech: Links to linguistic and cognitive abilities,” J. Acoust. Soc. Am. 135(2), EL88–EL94 (2014).
16. N. K. Nagaraj and A. N. Knapp, “No evidence of relation between working memory and perception of interrupted speech in young adults,” J. Acoust. Soc. Am. 138(2), EL145–EL150 (2015).
17. V. Shafiro, S. Sheft, R. Risley, and B. Gygi, “Effects of age and hearing loss on the intelligibility of interrupted speech,” J. Acoust. Soc. Am. 137(2), 745–756 (2015).
18. B. Lyxell and J. Rönnberg, “Information-processing skill and speech-reading,” Br. J. Audiol. 23(4), 339–347 (1989).
19. J. Rönnberg, E. Holmer, and M. Rudner, “Cognitive hearing science: Three memory systems, two approaches, and the ease of language understanding model,” J. Speech Lang. Hear. Res. 64(2), 359–370 (2021).
20. J. Rönnberg, T. Lunner, A. Zekveld, P. Sörqvist, H. Danielsson, B. Lyxell, O. Dahlström, C. Signoret, S. Stenfelt, M. K. Pichora-Fuller, and M. Rudner, “The Ease of Language Understanding (ELU) model: Theoretical, empirical, and clinical advances,” Front. Syst. Neurosci. 7, 31 (2013).
21. American Speech-Language-Hearing Association, “Adult hearing screening,” https://www.asha.org/practice-portal/professional-issues/adult-hearing-screening/ (Last viewed May 6, 2021).
22. M. C. Killion, P. A. Niquette, G. I. Gudmundsen, L. J. Revit, and S. Banerjee, “Development of a quick speech-in-noise test for measuring signal-to-noise ratio loss in normal-hearing and hearing-impaired listeners,” J. Acoust. Soc. Am. 116(4), 2395–2405 (2004).
23. A. J. Spahr, M. F. Dorman, L. M. Litvak, S. Van Wie, R. H. Gifford, P. C. Loizou, L. M. Loiselle, T. Oakes, and S. Cook, “Development and validation of the AzBio sentence lists,” Ear Hear. 33(1), 112–117 (2012).
24. T. S. Redick, J. M. Broadway, M. E. Meier, P. S. Kuriakose, N. Unsworth, M. J. Kane, and R. W. Engle, “Measuring working memory capacity with automated complex span tasks,” Eur. J. Psychol. Assess. 28(3), 164–171 (2012).
25. N. Unsworth, R. P. Heitz, J. C. Schrock, and R. W. Engle, “An automated version of the operation span task,” Behav. Res. Methods 37(3), 498–505 (2005).
26. R. Woodcock, K. McGrew, and N. Mather, “The Woodcock-Johnson III tests of cognitive abilities in cognitive assessment courses,” in WJ III Clinical Use and Interpretation: Scientist-Practitioner Perspectives (Academic, San Diego, CA, 2007), pp. 377–401.
27. G. A. Studebaker, “A ‘rationalized’ arcsine transform,” J. Speech Hear. Res. 28(3), 455–462 (1985).
28. J. J. Hox, M. Moerbeek, and R. van de Schoot, Multilevel Analysis: Techniques and Applications, 3rd ed. (Routledge, New York, 2018).
29. R Project, “The R Project for Statistical Computing,” https://www.r-project.org/ (Last viewed October 2, 2020).
30. D. Bates, M. Mächler, B. Bolker, and S. Walker, “Fitting linear mixed-effects models using lme4,” J. Stat. Softw. 67(1), 1–48 (2015).
31. K. S. Helfer and R. L. Freyman, “The role of visual speech cues in reducing energetic and informational masking,” J. Acoust. Soc. Am. 117(2), 842–849 (2005).
32. A. Jesse and E. Janse, “Audiovisual benefit for recognition of speech presented with single-talker noise in older listeners,” Lang. Cogn. Process. 27(7), 1167–1191 (2012).
33. N. Tye-Murray, M. S. Sommers, and B. Spehar, “Audiovisual integration and lipreading abilities of older adults with normal and impaired hearing,” Ear Hear. 28(5), 656–668 (2007).
34. J. Hall, K. K. McGregor, and J. Oleson, “Weaknesses in lexical-semantic knowledge among college students with specific learning disabilities: Evidence from a semantic fluency task,” J. Speech Lang. Hear. Res. 60(3), 640–653 (2017).
35. V. M. Rosen and R. W. Engle, “The role of working memory capacity in retrieval,” J. Exp. Psychol. Gen. 126(3), 211–227 (1997).
36. C. Füllgrabe and S. Rosen, “On the (un)importance of working memory in speech-in-noise processing for listeners with normal hearing thresholds,” Front. Psychol. 7, 1268 (2016).
37. J. E. Peelle and M. S. Sommers, “Prediction and constraint in audiovisual speech perception,” Cortex 68, 169–181 (2015).
38. R. J. Allen, A. D. Baddeley, and G. J. Hitch, “Is the binding of visual features in working memory resource-demanding?,” J. Exp. Psychol. Gen. 135(2), 298–313 (2006).
39. M. Rudner and J. Rönnberg, “Explicit processing demands reveal language modality-specific organization of working memory,” J. Deaf Stud. Deaf Educ. 13(4), 466–484 (2008).
40. S. K. Mishra and M. E. Lutman, “Repeatability of click-evoked otoacoustic emission-based medial olivocochlear efferent assay,” Ear Hear. 34(6), 789–798 (2013).
41. S. K. Mishra and M. E. Lutman, “Top-down influences of the medial olivocochlear efferent system in speech perception in noise,” PLoS One 9(1), e85756 (2014).
42. S. Moradi, B. Lidestam, H. Danielsson, E. H. N. Ng, and J. Rönnberg, “Visual cues contribute differentially to audiovisual perception of consonants and vowels in improving recognition and reducing cognitive demands in listeners with hearing impairment using hearing aids,” J. Speech Lang. Hear. Res. 60(9), 2687–2703 (2017).
43. See supplementary material at https://www.scitation.org/doi/suppl/10.1121/10.0006297 for full documentation of all code and the statistical output.
