Studies supporting learning-induced reductions in listening-related cognitive load have lacked procedural learning controls, making it difficult to determine the extent to which effects arise from perceptual or procedural learning. Here, listeners were trained in the coordinate response measure (CRM) task under unfiltered (UT) or degraded low-pass filtered (FT) conditions. Improvements in low-pass filtered CRM performance were larger for FT. Both conditions showed training-related reductions in cognitive load as indexed by a secondary working memory task. However, only the FT condition showed a correlation between CRM improvement and secondary task performance, suggesting that effects can be driven by perceptual and procedural learning.
1. Introduction
Degraded listening conditions can decrease speech comprehension accuracy and make listening feel excessively effortful (McGarrigle et al., 2014; Strauss and Francis, 2017). This is commonly observed in individuals with hearing loss (e.g., Bernarding et al., 2013; Downs, 1982; Mishra et al., 2014), but degraded conditions produced by poor quality cell phone speakers (Rakauskas et al., 2004), rooms with long reverberation times (Rennies et al., 2014), and hearing protection devices (Smalt et al., 2020) can yield similar detriments. The current work evaluates auditory training's potential for improving speech comprehension accuracy and reducing cognitive load under such degraded listening conditions.
Beneficial impacts of auditory training on speech perception have received considerable attention in the literature (for review, see Samuel and Kraljic, 2009). Most studies, however, have focused on unrealistic listening tasks such as phoneme discrimination (e.g., Clarke-Davidson et al., 2008) and lexical decision (e.g., Norris et al., 2003). Fewer have investigated learning under conditions of signal degradation. Of these, training has been shown to improve performance with speech degraded via temporal compression (Banai and Lavner, 2012), shifting of pitch range (Fu and Galvin, 2007), vocoding (Hervais-Adelman et al., 2011), the addition of masking noise (Spencer et al., 2016), and interruptions to speech (Benard and Baskent, 2013). Still, the impacts of training on listening-related cognitive load have received little attention. Sommers et al. (2015) trained listeners to perform various phoneme discrimination and speech comprehension tasks masked by 4-talker babble. Pre- and post-tests entailed a modified N-back task with masked speech. Memory for 3-back words significantly improved from pre- to post-test, suggesting that training had freed more spare capacity to retain words in memory after comprehension was completed (i.e., a reduction in cognitive load; McCoy et al., 2005). Using pupillometry, Kuchinsky et al. (2014) trained listeners on word recognition in noise and found that training changed the pupil's response to word presentations. However, neither study included a control group for procedural learning, making it difficult to conclude that the effects reflected more than learning of task demands and of experimental procedures in general. Further, Kuchinsky et al. (2014) found that training yielded an increase in pupil diameter, an effect that has frequently been associated with increased, not decreased, cognitive load (Borghini and Hazan, 2018; Miles et al., 2017; Piquado et al., 2010; Strand et al., 2018; Zekveld et al., 2014).
Here, we used the coordinate response measure (CRM) task (Bolia et al., 2000; Thompson et al., 2015) combined with a secondary melody comparison task to form a dual-task listening scenario. Performance on the primary (CRM) and secondary (melody comparison) tasks was assessed before (pre-test) and after (post-test) a period of training. CRM stimuli in the tests were degraded with low-pass filtering to simulate the attenuation created by common hearing protection devices, which yields poorer speech perception accuracy and greater listening-related cognitive load (Gallagher et al., 2014; Reddy et al., 2014; Smalt et al., 2020). For training, individuals performed the CRM under conditions of low-pass filtering (filtered training; FT) or no filtering (unfiltered training; UT). Note that the UT condition received training on the relevant task procedures and with speech-on-speech masking, but was not exposed to low-pass filtered speech. We hypothesized that FT would improve CRM performance and decrease cognitive load, as indexed by better secondary task performance in the post-test. Because the difference between the UT and FT conditions was whether or not test-relevant low-pass filtered stimuli were heard in training, FT improvements beyond those of the UT condition were expected to stem from perceptual learning specific to low-pass filtering conditions.
2. Methods
2.1 Participants
Forty-one individuals at Kansas State University participated for course credit. All self-reported normal hearing. Procedures were approved by the local ethics committee and all participants signed an informed consent document. One participant was dropped from the UT condition for exceptionally poor single-task CRM performance (<15% correct). The final sample contained 20 individuals in each training condition (FT, UT).
2.2 CRM task
The CRM corpus employed contains spoken sentences of the form “Ready <call sign>, go to <color> <number> now” (for details, see Bolia et al., 2000; Thompson et al., 2015). Here, we used eight talkers (four female), eight call signs (Arrow, Baron, Charlie, Eagle, Hopper, Laker, Ringo, Tiger), four colors (red, blue, white, green), and four numbers (1–4). On each trial, listeners were presented with a CRM target sentence containing the “Baron” call sign, spoken by one of the four male talkers selected at random. A masker sentence with one of the other call signs, spoken by a female (i.e., different-sex) talker, was presented at a 0 dB target-to-masker ratio. The talkers, masker call sign, colors, and numbers were selected randomly on each trial, with replacement across trials. Listeners were tasked with indicating the color/number of the “Baron” sentence. Depending upon the condition and block within the experiment, CRM stimuli were either left unfiltered¹ or digitally low-pass filtered with a finite impulse response (FIR) filter having a passband edge of 20 Hz and a stopband attenuation of –60 dB. The stopband edge was 1048, 532, 276, or 148 Hz, creating filters with roll-offs ranging from relatively shallow to relatively steep. The 148 Hz stopband edge was chosen because it mimics the attenuation that occurs with standard hearing protection devices (Gallagher et al., 2014). The other filters allowed training with an easy-to-hard progression known to induce strong perceptual learning effects (e.g., Church et al., 2013; Wisniewski et al., 2017a; Wisniewski et al., 2019).
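The filter specification above gives band edges and attenuation but not the design routine. As a minimal sketch, the code below designs comparable low-pass filters with the Kaiser-window (windowed-sinc) method, a standard technique substituted here and not necessarily the one used in the study, at an assumed sampling rate of 44.1 kHz (`FS` is hypothetical). It illustrates why lower stopband edges give steeper roll-offs: the same –60 dB attenuation must be reached over a narrower transition band, which requires many more taps.

```python
import math

FS = 44100.0  # assumed sampling rate in Hz (hypothetical; not reported in the text)


def bessel_i0(x, terms=60):
    """Zeroth-order modified Bessel function of the first kind, via its power series."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / (2 * k)) ** 2
        total += term
    return total


def kaiser_beta(atten_db):
    """Kaiser's empirical window shape parameter for a stopband attenuation in dB."""
    if atten_db > 50:
        return 0.1102 * (atten_db - 8.7)
    if atten_db >= 21:
        return 0.5842 * (atten_db - 21) ** 0.4 + 0.07886 * (atten_db - 21)
    return 0.0


def design_lowpass(pass_hz, stop_hz, atten_db, fs=FS):
    """Kaiser-windowed-sinc FIR low-pass; symmetric taps scaled to unity gain at DC."""
    delta_w = 2 * math.pi * (stop_hz - pass_hz) / fs       # transition width, rad/sample
    order = math.ceil((atten_db - 8) / (2.285 * delta_w))  # Kaiser's filter-order estimate
    beta = kaiser_beta(atten_db)
    i0_beta = bessel_i0(beta)
    fc = 0.5 * (pass_hz + stop_hz) / fs                    # cutoff mid-transition, cycles/sample
    taps = []
    for n in range(order + 1):
        t = n - order / 2
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        r = 2 * n / order - 1
        window = bessel_i0(beta * math.sqrt(max(0.0, 1 - r * r))) / i0_beta
        taps.append(ideal * window)
    dc = sum(taps)
    return [h / dc for h in taps]


def gain_db(taps, freq_hz, fs=FS):
    """Magnitude response (dB) of the FIR filter at a single frequency."""
    w = 2 * math.pi * freq_hz / fs
    re = sum(h * math.cos(w * n) for n, h in enumerate(taps))
    im = -sum(h * math.sin(w * n) for n, h in enumerate(taps))
    return 20 * math.log10(math.hypot(re, im))


if __name__ == "__main__":
    for stop in (1048, 532, 276, 148):
        taps = design_lowpass(20, stop, 60)
        print(f"stopband edge {stop} Hz: {len(taps)} taps, "
              f"{gain_db(taps, stop):.1f} dB at the stopband edge")
```

Running the loop shows the tap count growing as the stopband edge drops from 1048 to 148 Hz; the symmetric taps give a linear-phase filter, so filtering delays the sentence without distorting its timing.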
2.3 Melody comparison task
Before and after the presentation of CRM sentences, listeners could be presented with four-note sequences made up of pure tones (400 ms, 20 ms on- and off-ramps) mapping onto musical notes A3–A4 (f0 range of 220–440 Hz). The first and second sequences were separated by 500 ms from the onset and offset of the CRM sentence, respectively. On each such trial, notes for the sequence presented before the CRM were selected at random (with replacement). Half of the trials were “Same,” in which the second sequence was identical to the first. Half were “Different,” in which one randomly selected note of the second sequence was replaced with a different note in the A3–A4 range (also selected at random). Participants were tasked with indicating whether the melodies were the same or different. Figure 1(a) depicts an example waveform combining the CRM with the melody comparison task. This task was chosen because of its potential generalizability to non-speech studies and to avoid the ambiguity, present in tasks like the N-back, about whether listeners are prioritizing memory content or listening to speech.
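The trial-generation rules for the melody comparison task can be sketched as follows. This is an illustration under stated assumptions, not the study's code: it assumes the A3–A4 note set is the 13 equal-tempered semitones from 220 to 440 Hz, and that the 50/50 split of “Same” and “Different” trials was balanced within each 20-trial block. The function name `make_melody_block` is hypothetical.

```python
import random

# Assumption: "notes A3-A4" is read as the 13 equal-tempered semitones from
# A3 (220 Hz) to A4 (440 Hz) inclusive; the scale is not specified in the text.
NOTE_FREQS = [220.0 * 2 ** (k / 12) for k in range(13)]


def make_melody_block(rng, n_trials=20, n_notes=4):
    """Return (first, second, is_same) tuples for one block (hypothetical helper).

    Assumes exactly half of the trials in a block are 'Same'.
    """
    labels = [True] * (n_trials // 2) + [False] * (n_trials - n_trials // 2)
    rng.shuffle(labels)
    block = []
    for is_same in labels:
        # Notes of the pre-CRM sequence are drawn at random, with replacement.
        first = [rng.choice(NOTE_FREQS) for _ in range(n_notes)]
        second = list(first)
        if not is_same:
            # One randomly chosen position gets a different note from the same range.
            pos = rng.randrange(n_notes)
            second[pos] = rng.choice([f for f in NOTE_FREQS if f != first[pos]])
        block.append((first, second, is_same))
    return block


if __name__ == "__main__":
    rng = random.Random(1)
    for first, second, is_same in make_melody_block(rng)[:3]:
        print("Same" if is_same else "Different",
              [round(f) for f in first], [round(f) for f in second])
```

Excluding the original note when drawing the replacement guarantees that every “Different” trial actually differs in exactly one position.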
Fig. 1. (a) Depiction of audio from an example trial where the CRM task is combined with the melody comparison task. Arrows highlight the note within the melodies that changes from before to after the CRM sentence presentation. (b) Illustration of the design and procedures used.
2.4 Apparatus
Sounds were presented using Focusrite Scarlett audio interfaces (Focusrite, UK) and Sennheiser HD-280 closed-back headphones (Sennheiser, Germany). Listeners sat in sound-attenuating listening booths (WhisperRoom, Knoxville, TN). Sound levels did not exceed 79 dB sound pressure level (SPL). Procedures were controlled with MATLAB (MathWorks, Natick, MA). Responses were made with custom X-Keys response pads (P.I. Engineering, Williamston, MI).
2.5 Design and procedures
A 2 (training condition: UT vs FT) × 2 (task condition: Single vs Dual) × 2 (test: pre-, post-test) mixed design was used [see Fig. 1(b)]. Training condition was a between-subjects factor; task condition and test were within-subject factors. Each individual participated on three consecutive days: the pre-test was on Day 1, training on Day 2, and the post-test on Day 3.
Testing was identical for the UT and FT conditions. In the pre- and post-tests, there were three types of blocks: Single CRM, Single Melody, and Dual. For each block type, listeners were exposed to stimuli similar to those shown in Fig. 1(a). In Single CRM blocks, listeners responded with the color/number pair of the “Baron” sentence using a response pad with a 4 × 4 grid of color/number options. In Single Melody blocks, they instead reported their same/different judgment with other keys on the same response pad. In Dual blocks, listeners performed both tasks, reporting the color/number pair and then the same/different judgment. Instructions were to prioritize the CRM in Dual blocks. To promote this prioritization, an 8 s penalty timeout followed any incorrect CRM response (in all blocks, conditions, and days). In Dual blocks, this penalty was incurred after the response to the melody comparison task. Further, feedback was presented only for the CRM, never for the melody comparison task. Listeners were told which type of block they were about to perform before it started. The different block types allowed dual-task costs to be assessed for both the CRM and the melody comparison task. There were two blocks of each type, with 20 trials per block, pseudorandomly ordered so that no block type was encountered a second time before all block types had been experienced at least once. Listeners had 10 s to respond to each prompt before a timeout occurred. All testing employed CRM sentences filtered with the most difficult, 148 Hz stopband edge.
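The block-ordering constraint can be made concrete with a short sketch. With three block types and two blocks of each, the rule forces the first three blocks to include every type once, so two independently shuffled passes through the types produce exactly the orderings that satisfy it. The function names here are hypothetical; the study's own randomization code is not described.

```python
import random

BLOCK_TYPES = ("Single CRM", "Single Melody", "Dual")


def order_blocks(rng, types=BLOCK_TYPES):
    """Two shuffled passes through all block types (hypothetical helper).

    No type can recur until every type has occurred once.
    """
    first_pass = list(types)
    second_pass = list(types)
    rng.shuffle(first_pass)
    rng.shuffle(second_pass)
    return first_pass + second_pass


def satisfies_constraint(order):
    """True if no block type's second occurrence precedes the first occurrence of every type."""
    all_types = set(order)
    seen = set()
    for block in order:
        if block in seen and seen != all_types:
            return False
        seen.add(block)
    return True


if __name__ == "__main__":
    print(order_blocks(random.Random(3)))
```

Note that an ordering such as Dual, Single CRM, Single Melody, Single Melody, Dual, Single CRM is permitted: the same type may appear back to back across the boundary between the two passes.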
Training for both conditions entailed nine blocks (20 trials per block) of the CRM without melody comparison stimuli, all with feedback after each response: “Correct” or “Wrong” was displayed on screen, along with the 8 s timeout if the response was wrong. For the UT condition, trials employed unfiltered CRM stimuli. Training in the FT condition was identical, except that CRM stimuli were low-pass filtered, with the stopband edge decreasing across blocks in an easy-to-hard progression [see Fig. 1(b) for the specific progression].
3. Results
3.1 CRM accuracy
UT (M = 0.94, SD = 0.06) and FT (M = 0.92, SD = 0.05) proportion correct was similar during training (across all trials), with no significant difference between conditions, t < 1. Thus, large differences in the difficulty of training were not evident.
Proportion correct in the tests (color and number) is shown in Fig. 2(a). Data were analyzed with a 2 (training condition: FT, UT) × 2 (task condition: single, dual) × 2 (test: pre-, post-test) mixed-model analysis of variance (ANOVA), with Greenhouse-Geisser corrections applied to degrees of freedom before significance was interpreted (uncorrected dfs reported). Both the UT and FT conditions appeared to improve from pre- to post-test (pre-test: M = 0.82, SD = 0.12; post-test: M = 0.90, SD = 0.06), as supported by a main effect of test, F(1,38) = 24.63, p < 0.001, partial η² = 0.38. However, there was also a significant training condition × test interaction, F(1,38) = 5.35, p = 0.026, partial η² = 0.15. Independent-samples t-tests comparing the UT and FT conditions at each test revealed a difference at post-test (UT: M = 0.88, SD = 0.07; FT: M = 0.93, SD = 0.03), t(38) = 2.53, p = 0.015, Cohen's d = 0.80. There was no difference at pre-test, t < 1. Thus, FT yielded accuracy improvements beyond those attributable to procedural learning and to any perceptual learning of speech-on-speech masking in general. There was also a main effect of task condition, F(1,38) = 15.44, p < 0.001, partial η² = 0.27, indicating a cost to performing the CRM under the dual-task scenario (single: M = 0.88, SD = 0.09; dual: M = 0.84, SD = 0.08). Other main effects and interactions were not significant, Fs < 2.
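The effect sizes reported in the Results can be related to their test statistics with standard conversions, sketched below. The text does not state how the effect sizes were computed, so treating these formulas as the ones used is an assumption: partial η² from an F statistic and its degrees of freedom, Cohen's d for an independent-samples t test with equal-variance pooling, and d_z (mean difference over the SD of the differences) for a paired t test.

```python
import math


def partial_eta_squared(f, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return f * df_effect / (f * df_effect + df_error)


def cohens_d_independent(t, n1, n2):
    """Cohen's d for an independent-samples t test (pooled-SD standardizer)."""
    return t * math.sqrt(1 / n1 + 1 / n2)


def cohens_dz_paired(t, n):
    """Cohen's d_z for a paired-samples t test (mean difference / SD of differences)."""
    return t / math.sqrt(n)


if __name__ == "__main__":
    # Post-test CRM group comparison: t(38) = 2.53 with n = 20 per training condition.
    print(round(cohens_d_independent(2.53, 20, 20), 2))
```

With n = 20 per group, t(38) = 2.53 converts to d of about 0.80, matching the value reported for the post-test comparison.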
Fig. 2. (Color online) (a) Accuracy for the CRM task. (b) d' for the melody comparison task. Error bars show standard errors of the mean.
3.2 Melody comparison accuracy
The signal detection measure d' was used as the index of accuracy to minimize the impact of any bias toward responding “same” or “different.” Figure 2(b) shows d' for each test, training, and task condition. There was a main effect of task condition, F(1,38) = 52.04, p < 0.001, partial η² = 0.58, such that d' was lower in the dual-task than in the single-task condition (single: M = 1.86, SD = 0.67; dual: M = 1.33, SD = 0.65). This indicated the expected dual-task cost. There was also a significant task condition × test interaction, F(1,38) = 6.33, p = 0.016, partial η² = 0.14, likely reflecting better dual-task melody comparison performance after training (pre-test: M = 1.18, SD = 0.70; post-test: M = 1.49, SD = 0.85). Indeed, dual-task d' was significantly lower in the pre-test than in the post-test, t(39) = 2.28, p = 0.028, Cohen's d = 0.36. Single-task d' did not differ between pre- and post-test, t < 1. All other main effects and interactions were not significant, Fs < 2.
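A minimal sketch of the d' computation follows. It makes two assumptions not stated in the text: “Different” trials are treated as signal trials and scored with the simple yes/no formula d' = z(hit rate) − z(false-alarm rate), and an optional log-linear correction (adding 0.5 to each count) keeps perfect scores from producing infinite z values. Whether the original analysis applied any correction is not reported.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections, loglinear=True):
    """Sensitivity d' = z(hit rate) - z(false-alarm rate).

    Hits: 'different' responses on Different trials.
    False alarms: 'different' responses on Same trials.
    The log-linear correction adds 0.5 to each count (an assumption here).
    """
    pad = 0.5 if loglinear else 0.0
    hit_rate = (hits + pad) / (hits + misses + 2 * pad)
    fa_rate = (false_alarms + pad) / (false_alarms + correct_rejections + 2 * pad)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


if __name__ == "__main__":
    # Hypothetical 20-trial block: 10 Different trials (8 hits), 10 Same trials (2 false alarms).
    print(round(d_prime(8, 2, 2, 8), 2))
    print(round(d_prime(8, 2, 2, 8, loglinear=False), 2))
```

Because d' subtracts the false-alarm rate's z score, a listener who simply answers “different” more often raises both rates together and gains no sensitivity, which is why d' is preferred here over raw proportion correct.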
We next performed a correlational analysis (cf. Sommers et al., 2015) to examine the relationship between gains in CRM accuracy and reductions in cognitive load, as quantified by dual-task melody comparison d'. Figure 3 shows scatterplots of individual improvements on the CRM (x axis; post-test minus pre-test) against improvements in dual-task melody comparison d' (y axis; post-test minus pre-test). The UT condition showed no significant relationship between CRM gains and gains in dual-task melody comparison performance, r(18) = 0.03, p = 0.903. The FT condition did show a significant positive correlation, r(18) = 0.68, p < 0.001. To compare r values between conditions, we generated a surrogate distribution of r differences by randomly reassigning condition labels to participants' data over 1000 permutations. A p value for the observed r difference was then computed as the proportion of surrogate r differences larger than the observed difference. The difference in r was greater than expected by chance, p = 0.030. Thus, improvements on the CRM were related to reductions in cognitive load, but only in the condition that received relevant training with low-pass filtered speech.
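The permutation comparison of correlations can be sketched as follows. Each participant contributes one (CRM gain, dual-task d' gain) pair; shuffling the pooled pairs and splitting them back into groups of the original sizes randomly reassigns condition labels while keeping each participant's pair intact. The function names are hypothetical, and the sketch counts ties as exceedances (`>=`), a common, slightly conservative convention; the text describes the proportion of surrogate differences “larger than” the observed value.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def r_difference(group_a, group_b):
    """r(group_a) - r(group_b); each group is a list of (x, y) pairs."""
    return pearson_r(*zip(*group_a)) - pearson_r(*zip(*group_b))


def permutation_p(group_a, group_b, n_perm=1000, seed=0):
    """One-sided permutation p value for r(group_a) - r(group_b) (hypothetical helper)."""
    rng = random.Random(seed)
    observed = r_difference(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random condition labels; each (x, y) pair stays together
        if r_difference(pooled[:n_a], pooled[n_a:]) >= observed:
            exceed += 1
    return observed, exceed / n_perm
```

With real data, `group_a` would hold the FT participants' (ΔCRM, Δd') pairs and `group_b` the UT pairs; a small p value then indicates that the FT correlation exceeds the UT correlation by more than random relabeling typically produces.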
Fig. 3. (Color online) Scatterplot of improvement in dual-task melody comparison accuracy (y axis) as a function of improvement in CRM task accuracy (x axis). Lines show least-squares fits to the UT and FT data.
4. Discussion
This study aimed to assess auditory training's impact on the perception of low-pass filtered speech and on the cognitive load required for comprehension. The main findings were as follows: (1) FT improved accuracy beyond that produced by UT, (2) training yielded a decrease in CRM-related cognitive load regardless of training condition, and (3) CRM accuracy improvements were correlated with reductions in cognitive load, but only for the FT condition. Our data show that some apparent reductions in cognitive load can come from procedural learning, or from perceptual learning that generalizes across similar masking tasks. Future training studies will need to address this possibility. However, the fact that a relationship between the extent of learning and cognitive load reductions was observed only in the FT group suggests that there is also a perceptual learning component specific to the trained acoustic conditions. The extent of learning appears to matter only for individuals trained with the test-relevant signal degradation.
The low-pass filtering used here produces attenuation similar to that produced by hearing protection devices (Gallagher et al., 2014). Interestingly, poor speech understanding and excessive effort are chief reasons individuals give for not wearing hearing protection (Reddy et al., 2014; Smalt et al., 2020). Hearing-protection-related training programs for workers typically entail not listening training but conceptual training focused on the benefits of using hearing protection (Hong and Csaszar, 2005; Lusk et al., 1995; Neitzel et al., 2008; Trabeau et al., 2008). Given the training-related benefits to perception and cognitive load observed here, it may be useful to include perceptual training in such programs. Such training could also address the problems that accompany hearing loss (e.g., excessive listening effort) preemptively, by making hearing protection easier to tolerate and thereby reducing the likelihood of hearing loss.
The current data also suggest that training can serve as a means to improve multitasking performance. This could be important for operators who must listen under specific types of degraded conditions. For instance, military operators who listen to specific types of low-bandwidth signals through a specific hearing protection device, or who work in environments with unique spectral characteristics, could receive appropriate listening training. Cognitive control of resource allocation, confidence and error monitoring judgments, decision making, and prediction have effortful components that are important for guiding behavior in complex scenarios (for review, see Schwartz, 2008; Yeung and Summerfield, 2012). Relieving the cognitive load needed for speech comprehension may free resources for these important processes and increase operator situational awareness and performance.
Though this work is promising with regard to hearing-loss prevention, rehabilitation, and performance enhancement, extensions will be needed to advance training-related strategies. Any worthwhile training regimen needs to yield generalization beyond the circumstances of training. Future work will need to examine generalization of learning to untrained tasks and novel talkers. We employed a single-talker masker of a different sex than the target talker, showing effects in a largely informational masking scenario (i.e., target and masker content overlapped, but talker sex differed). More work is needed to evaluate the impacts of training on informational and energetic masking, both of which are problems in the context of degraded speech listening. Relatedly, different dual tasks can yield different conclusions regarding cognitive load during listening (Strand et al., 2018), perhaps because different measures tap different cognitive resources (Strauss and Francis, 2017). Future research could characterize learning's impacts in more detail by using measures that map more clearly onto specific resources (e.g., Wisniewski, 2017; Wisniewski et al., 2017b). The effects observed here could also be larger if training entailed multiple sessions (cf. Sommers et al., 2015) or if training regimen parameters were optimized (e.g., Wisniewski et al., 2019). Future work will benefit from exploring how the length and content of training can be altered to maximize learning's beneficial outcomes. The current data indicate that these are worthwhile research directions and that auditory training deserves more attention for its potential benefits.
Acknowledgments
We thank Michelle Wheeler, Victoria Robinson, Raelynn Slipke, Francis Guffy, Kelly Wilkerson, and Emma Harmon for help with data collection. Research reported in this publication was supported by the Cognitive and Neurobiological Approaches to Plasticity (CNAP) Center of Biomedical Research Excellence (COBRE) of the National Institutes of Health under Grant No. P20GM113109.
¹The employed corpus recordings were already bandpass filtered (80–8,000 Hz; Bolia et al., 2000).