This study assessed musical scene analysis (MSA) performance and subjective quality ratings of multi-track mixes as a function of spectral manipulations using the EQ-transform (% EQT). This transform exaggerates or reduces the spectral shape changes in a given track with respect to a relatively flat, smooth reference spectrum. Data from 30 younger normal hearing (yNH) and 23 older hearing-impaired (oHI) participants showed that MSA performance was robust to changes in % EQT. However, audio quality ratings elicited from yNH participants were more sensitive to % EQT than those of oHI participants. A significant positive correlation between MSA performance and quality ratings among oHI participants showed that those with better MSA performance gave higher quality ratings, whereas there was no significant correlation for yNH listeners. Overall, these data indicate the complementary value of MSA measures and audio quality ratings for assessing the suitability of music mixes for hearing-impaired listeners.
1. Introduction
Musical scene analysis (MSA) in multi-track music refers to perceptual and cognitive processes that allow listeners to discern and focus on specific musical elements within a complex multi-track musical arrangement. Multi-track mixing is a common component of contemporary music production. In a coherent multi-track mix, an assortment of vocal tracks and accompanying instruments are consolidated to create a richly layered mixture. The clarity of the mix, allowing listeners to identify and focus on individual instruments, seems important for both casual listeners and professional mixing engineers. Understanding how different listener groups, specifically those with a diagnosed hearing impairment, discern and appreciate mixes, may aid in creating specialized mixes for such listeners.
A handful of previous studies suggest that cochlear implant users may benefit from bespoke mixes that differ from the commercially distributed ones (Buyens et al., 2014; Gajecki and Nogueira, 2018; Tahmasebi et al., 2020). However, few studies have tested mixing properties in terms of mix clarity or preference for listeners with cochlear hearing loss who do not use cochlear implants. One such study (Benjamin and Siedenburg, 2023) showed that hearing-impaired listeners with moderate to severe hearing loss had distinct level and balance preferences for mixes. In that study, spectral manipulation using the so-called EQ-transform (EQT) was introduced. This transform, applied to individual tracks of a mix, enhanced or reduced their spectral coloration. The EQT had a significant effect on the objective frequency-domain sparsity of the tracks, measured using the robust Gini index (Hurley and Rickard, 2009). It was also shown that participants with higher levels of hearing loss preferred mixes with spectrally sparser tracks. A study conducted by Hake (2023) tested MSA performance, that is, the ability to detect a target instrument in a multi-track mix, for participants with varying levels of hearing impairment. Performance depended strongly on the level differences between the target and the corresponding mix (that may or may not include the target), the type of the target instrument (e.g., lead vocals, drums, guitar), and the number of tracks in the mix. However, the effects of spectral manipulations were not considered. In a more recent study (Benjamin and Siedenburg, 2024), the EQT was used to assess the effects of spectral manipulations of multi-track music on MSA performance. Although the MSA performance of normal hearing participants was unaffected by frequency-domain sparsity, hearing-impaired listeners performed better for mixes that were objectively sparser.
However, the relation between subjective quality ratings of musical mixes and MSA performance remains poorly understood, especially for hearing-impaired listeners.
Auditory scene analysis deals with individual abilities to discern and segregate sounds within a complex auditory scene. For musical stimuli, MSA is paramount in the context of appreciating and discerning individual instruments or lead vocals in polyphony (McAdams and Bregman, 1979). Notably, Hake (2023) showed that there were relatively weak improvements in MSA abilities with increasing musical training, but stronger effects of cochlear hearing loss. The influence of cochlear hearing loss on MSA remains poorly explored, especially for multi-track music perception. Sensorineural hearing impairment, which mostly manifests itself with aging, affects auditory perception. This condition, widely referred to as presbycusis, is characterized by the gradual bilateral loss of hearing with increasing age, especially at high frequencies (Wu et al., 2020). According to Pichora-Fuller (1995), presbycusis may compromise the ability to segregate sounds in noisy environments. Alain (2001) suggested that among older individuals affected by presbycusis, neuroplasticity may aid in sound localization and pitch perception. Helfer and Freyman (2008) suggested that older individuals with hearing loss tend to rely considerably on spectral and contextual cues in a manner different from that for younger people with no hearing impairment. Importantly, Rasidi and Seluakumaran (2024) showed that even mild hearing impairment may reduce frequency selectivity. This may negatively impact the ability to detect spectral changes. Overall, these studies suggest that sensorineural hearing impairment may modify the perceptual weights allocated to different acoustic features, such as spectral content, in scene-analysis tasks.
In multi-track mixing and production, equalization or EQing remains a fundamental method of spectral manipulation (Izhaki, 2017). Using EQing, the mixing engineer may manipulate the frequency content of a component track by enhancing or attenuating specific components in the frequency domain to achieve the desired audio quality (Senior, 2018). According to Izhaki (2017), EQing can be pivotal in determining the perceived quality of music. Studies by Gabrielsson (1988) and Zielinski (2008) demonstrate the sensitivity of listeners to changes in frequency balance and how it may dictate their judgments of audio quality. Yet, research on the influence of spectral manipulations and hearing loss on perceived audio quality in multi-track music remains scarce. More importantly, even fewer studies explore the connection between scene analysis abilities and perceived quality. Freyman (2001) showed that the ability to separate target speech from a masker may be influenced by the quality of the speech and spatial cues. However, the manner in which scene analysis ability interacts with perceived audio quality of an auditory scene remains unclear, especially in the context of music perception. In this study, we aimed to understand the relationship between MSA and quality ratings as a function of hearing impairment and spectral manipulations using the EQT in multi-track mixes.
2. Methods
2.1 Participants
In the present study, 30 young normal hearing (yNH) (ages, mean = 27 years, SD = 6 years) and 23 older hearing-impaired (oHI) participants (ages, mean = 73 years, SD = 7 years) with predominantly moderate to severe cochlear hearing loss were tested. The musical training of all of the participants was assessed using the Goldsmiths Musical Sophistication Index (Gold-MSI) musical training subscale questionnaire proposed by Müllensiefen (2014). The questionnaire consists of 7 questions. For each of the questions, a score between 1 and 7 was calculated based on the answers. The final score was calculated by summing the individual scores, with a higher score indicating more musical training. Two specimen examples are provided in the supplementary material. The yNH participants were significantly more musically trained than the oHI participants [t(51) = 2.2, p = 0.03, d = 0.6 (medium effect)].
Pure-tone audiometry was conducted using a portable AD528 audiometer from Interacoustics GmbH (www.interacoustics.com/ad528; Middelfart, Denmark). Hearing loss was quantified by the better-ear mean hearing threshold (BEMHT), that is, the arithmetic mean threshold of the ear with the lower mean (taken over the pure tone frequencies 125 Hz, 250 Hz, 500 Hz, 1 kHz, 2 kHz, 4 kHz, and 8 kHz). On this measure, the oHI (mean = 43, SD = 5 dB HL) had approximately 40 dB higher hearing loss on average than yNH (mean = 4.3, SD = 6 dB HL). Figure 1(A) shows audiograms for both groups for the better ear. According to Clark (1981), the following classifications were made: normal if BEMHT ≤ 25 dB, mild hearing loss if 25 dB < BEMHT ≤ 40 dB, and moderate to severe hearing loss if BEMHT > 40 dB. Thus, 16 oHI participants had moderate to severe hearing loss (mean = 45, SD = 4 dB HL) with only 7 having mild hearing loss (mean = 38.3, SD = 2 dB HL). Although a very strong trend of increasing hearing loss with age was apparent among yNH [r = 0.6, p < 0.001, d = 1.5 (very large effect)], the ages of oHI participants and their BEMHT were uncorrelated (p = 0.5). Figure 1(B) is a scatterplot of ages and BEMHT values.
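The threshold averaging and the classification boundaries described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `bemht` and `classify` are hypothetical, and the example thresholds are invented for demonstration rather than taken from the study data.

```python
from statistics import mean

# Audiometric frequencies (Hz) over which thresholds are averaged, as listed in the text.
FREQS_HZ = (125, 250, 500, 1000, 2000, 4000, 8000)


def bemht(left_db_hl, right_db_hl):
    """Return the lower of the two per-ear mean thresholds (dB HL)."""
    if len(left_db_hl) != len(FREQS_HZ) or len(right_db_hl) != len(FREQS_HZ):
        raise ValueError("expected one threshold per audiometric frequency")
    return min(mean(left_db_hl), mean(right_db_hl))


def classify(bemht_db):
    """Apply the category boundaries stated in the text (25 and 40 dB)."""
    if bemht_db <= 25:
        return "normal"
    if bemht_db <= 40:
        return "mild"
    return "moderate to severe"


# Invented example audiogram (dB HL), one value per frequency in FREQS_HZ.
left = [10, 10, 15, 20, 30, 45, 60]
right = [15, 15, 20, 30, 40, 55, 70]
category = classify(bemht(left, right))
```

In this invented example the left ear has the lower mean threshold, so it determines the category.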
2.2 Stimuli, apparatus, and procedure
The experiment was conducted in a low-reflection chamber with a pair of ESI active 8-inch near-field studio monitors by ESI Audiotechnik GmbH (www.esi-audio.de/activ8; Leonberg, Germany) at the University of Oldenburg, Germany. These monitors were separated by 90° and were 2 m equidistant from the participant. The audio playback levels were calibrated using noise with the same long-term spectrum as the ensemble average of the power spectra taken from a wide range of instruments and lead vocals available in the open-source Medley database (Bittner et al., 2014). The calibration was such that the sum of sound pressures from both monitors at the position of the participant was 80 dB SPL(A). The stimuli were processed on a stand-alone desktop terminal using MATLAB R2023a. The terminal was linked to the monitors using an RME Fireface UFX audio interface. The stimuli were taken from the aforementioned Medley database.
The experiment was conducted in two parts. In the first part, referred to as the MSA part, a target track or musical target consisting of a single instrument or vocal was presented followed by a mix after a 1-s pause. The mix was an ensemble of instruments and lead vocal tracks. The participant was asked to indicate whether they heard the target in the mix that followed it. Both the target and the mix were of 2-s duration.
The target-to-mix level difference was maintained at −10 dB, and all mixes contained five distinct tracks. All oHI participants wearing hearing aids were requested to complete both parts of the experiment unaided. As shown in our previous study (Benjamin and Siedenburg, 2023), hearing-impaired listeners preferred boosting higher frequencies when unaided to compensate for the reduced audibility at these frequencies. Nevertheless, in this study, no such high-frequency amplification was provided as we aimed to investigate the effect of the EQT alone on MSA performance and quality ratings.
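As an illustration of how a fixed level difference between target and mix can be set, the following Python sketch scales a target so that its level sits a given number of decibels relative to the mix. It rests on assumptions not stated in the text: levels are taken as broadband RMS levels, and the mix used as the reference is whatever signal is passed in (the text does not say whether the reference mix includes the target). The function names are hypothetical.

```python
import math


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(v * v for v in signal) / len(signal))


def level_db(signal):
    """RMS level in dB relative to an amplitude of 1."""
    return 20 * math.log10(rms(signal))


def scale_to_level_difference(target, mix, diff_db=-10.0):
    """Return a scaled copy of target whose RMS level is diff_db relative to mix."""
    current_diff_db = level_db(target) - level_db(mix)
    gain = 10 ** ((diff_db - current_diff_db) / 20)
    return [gain * v for v in target]


# Synthetic example signals (not study stimuli).
target = [math.sin(0.10 * n) for n in range(1000)]
mix = [0.5 * math.sin(0.23 * n) + 0.3 * math.sin(0.05 * n) for n in range(1000)]
scaled = scale_to_level_difference(target, mix)
```

After scaling, the difference `level_db(scaled) - level_db(mix)` equals the requested −10 dB up to floating-point error.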
Both the target and component tracks of the mix were subjected to the EQT investigated by Benjamin and Siedenburg (2023). The EQT exaggerates or reduces the spectral variation of a given track in the frequency domain. This is achieved by linearly interpolating or extrapolating between the power spectrum of the stimulus undergoing the EQT (input signal) and a smooth reference spectrum, which is an average spectrum of individual power spectra of more than 100 tracks taken from the aforementioned database. To perform the EQT operation, first the power spectrum of the input signal was calculated using the fast Fourier transform applied over its entire duration with a sampling frequency of 44.1 kHz. A rectangular window was used without any zero padding. Using the power spectrum of the input signal and the reference spectrum whose energy was normalized to that of the input signal, the power spectrum of the transformed signal was calculated by linearly interpolating between the two spectra. To mitigate audible artifacts in the transformed signal, first the difference between the transformed and the input power spectra was extracted. This noisy representation was then smoothed using a Savitzky-Golay filter (Schafer, 2011). The power spectrum of the input signal was colored with the smoothed power difference in the penultimate step to obtain the spectrum of the EQT signal. The EQT signal was obtained by applying the inverse Fourier transform to this spectrum. To illustrate, a 200% EQT would double the power differences between the reference spectrum and the power spectrum of the original stimulus, as illustrated in Fig. 1(C). Conversely, a 0% transform would reduce the coloration in the original spectrum. Transformed stimuli thus have an altered spectral contrast compared to the original. A detailed step-by-step explanation of the EQT process is provided in the supplementary material.
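The core spectral step of the procedure described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the spectra are assumed to be level spectra in dB, so that extrapolation beyond 100% cannot produce negative power; the reference is assumed to be already energy-normalized; and a simple moving average stands in for the Savitzky-Golay filter. The Fourier analysis and resynthesis steps are omitted, and all function names are hypothetical.

```python
def moving_average(values, width):
    """Centered moving average with a shrinking window at the edges.

    Used here only as a simple stand-in for the Savitzky-Golay smoothing
    named in the text.
    """
    half = width // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def eq_transform_db(input_db, reference_db, percent_eqt, smooth_bins=1):
    """Scale the dB difference between an input spectrum and a reference.

    percent_eqt = 100 leaves the input unchanged, 0 maps it onto the
    reference, and 200 doubles the input-minus-reference difference.
    """
    if len(input_db) != len(reference_db):
        raise ValueError("spectra must have the same number of bins")
    alpha = percent_eqt / 100
    # Linear interpolation (alpha < 1) or extrapolation (alpha > 1).
    transformed_db = [r + alpha * (x - r) for x, r in zip(input_db, reference_db)]
    # Difference between transformed and input spectra, smoothed before use.
    gain_db = [t - x for t, x in zip(transformed_db, input_db)]
    gain_db = moving_average(gain_db, smooth_bins)
    # Color the input spectrum with the smoothed difference.
    return [x + g for x, g in zip(input_db, gain_db)]


# Toy four-bin spectra in dB (invented values).
toy_input = [0.0, 6.0, -3.0, 12.0]
toy_reference = [2.0, 2.0, 2.0, 2.0]
doubled = eq_transform_db(toy_input, toy_reference, 200)
```

With `smooth_bins=1` no smoothing is applied, which makes the interpolation itself easy to inspect; larger odd values smooth the gain curve across neighboring bins.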
In part 1 of the present experiment, both the target and the component tracks in the mix following it were subjected to 0%, 100% (original), 200%, or 300% EQT. These degrees of spectral manipulation were chosen based on our previous work showing that unaided bilateral hearing aid users had % EQT preferences between 0% and 300%.
The quality rating task in the second part of the experiment commenced after a voluntary break. Here, the participants were presented with a Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) interface (International Telecommunication Union, 2015) where they were tasked with providing their subjective audio quality ratings of 15 different music excerpts of 2-s duration, as in the first part. In every trial, the participant would rate one of the 15 excerpts. The presentation order of excerpts was randomized. During a trial, the participant listened to EQT versions of that excerpt by clicking the play button and provided a quality rating for each version on a scale from 0 to 100 with the rating slider. A rating of 0 corresponds to the worst quality and 100 to the best. The % EQT included in each trial were: –500% serving as the anchor, –200%, 0%, 50%, 100%, 200%, and 300%. The participant was not allowed to provide their rating for a stimulus prior to listening to it at least once. Furthermore, the participant was only allowed to rate one item with a rating of 0 and one other item with a rating of 100 per trial. All of the 5 remaining items could only be rated between 0 and 100. The positions of the stimuli on the interface were randomized between trials so that the mixes transformed with a given % EQT did not always appear in the same locations. Only after providing the subjective quality ratings of all the EQT versions of a mix in one trial could the participant proceed to the next trial, where the process was repeated for another excerpt. The experiment as a whole came to an end once all of the 15 different excerpts were rated. Figure 1(D) illustrates the interface. The aim of the two parts was to ascertain the relationship between quality ratings and musical scene analysis ability as a function of the varying degrees of spectral manipulation provided by the EQT.
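The rating constraints of a trial can be made explicit with a small validity check. The sketch below reads the constraint as requiring exactly one rating of 0 and exactly one rating of 100 among the seven versions, which is an interpretation of the description above rather than a documented detail of the interface. The function names are hypothetical.

```python
import random

# % EQT conditions presented in each trial; -500 serves as the anchor.
EQT_CONDITIONS = (-500, -200, 0, 50, 100, 200, 300)


def shuffled_conditions(rng):
    """Return the conditions in a random on-screen order for one trial."""
    return rng.sample(EQT_CONDITIONS, k=len(EQT_CONDITIONS))


def is_valid_trial(ratings):
    """Check one trial's ratings, given as a dict mapping % EQT to a rating."""
    if set(ratings) != set(EQT_CONDITIONS):
        return False  # every version must be rated
    values = list(ratings.values())
    if any(not 0 <= v <= 100 for v in values):
        return False  # ratings must lie on the 0 to 100 scale
    # Assumed reading: exactly one item at each end of the scale.
    return values.count(0) == 1 and values.count(100) == 1


# Invented example ratings for one trial.
example = {-500: 0, -200: 20, 0: 60, 50: 80, 100: 100, 200: 70, 300: 40}
order = shuffled_conditions(random.Random(3))
```

A trial with two ratings of 0, or with a missing version, would be rejected by `is_valid_trial`.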
2.3 Statistical analysis
A non-parametric approach was used to analyze the MUSHRA data, which are prone to type I errors with parametric testing (Mendonça and Delikaris-Manias, 2018). We calculated bootstrapped means using 1000 iterations of resampling with replacement (Davison and Hinkley, 1997). Last, to ascertain independent and interaction effects of musical training, % EQT, BEMHT, and MSA performance on the subjective quality ratings, a linear mixed effects model was used (Shek and Ma, 2011).
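Bootstrapping of means by resampling with replacement can be sketched as follows. The iteration count is a parameter (defaulting here to 1000 as an assumption), and the percentile method used for the interval is likewise an assumption, since the text does not specify how intervals were derived from the bootstrap distribution. The ratings in the example are invented.

```python
import random
from statistics import mean


def bootstrap_means(ratings, n_iter=1000, seed=0):
    """Means of n_iter resamples drawn with replacement from ratings."""
    rng = random.Random(seed)
    n = len(ratings)
    return [mean(rng.choices(ratings, k=n)) for _ in range(n_iter)]


def percentile_interval(samples, level=0.95):
    """Percentile interval of the bootstrap distribution (assumed method)."""
    ordered = sorted(samples)
    lo_idx = int((1 - level) / 2 * len(ordered))
    hi_idx = int((1 + level) / 2 * len(ordered)) - 1
    return ordered[lo_idx], ordered[hi_idx]


# Invented quality ratings for one condition.
ratings = [55, 60, 72, 48, 80, 66, 59, 70, 62, 75]
boot = bootstrap_means(ratings, n_iter=1000, seed=1)
ci_low, ci_high = percentile_interval(boot)
```

Fixing the seed makes the resampling reproducible, which is convenient when the same intervals need to be regenerated for a figure.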
3. Results
3.1 Data
We first aimed at understanding the association between quality rating and MSA performance as a function of % EQT applied to the stimuli. Figure 2(A) shows the mean quality rating preferences and MSA performance and 95% confidence interval plots. For both groups, MSA performance was hardly affected by % EQT. However, the quality rating scores showed a quadratic dependence on % EQT. Furthermore, quality ratings were more affected by % EQT for yNH [χ²(6) = 149, p < 0.0001, η² = 0.45 (large effect)] than for oHI [χ²(6) = 69.3, p < 0.0001, η² = 0.33 (medium effect)]. Among the oHI, quality rating scores were more closely related to MSA performance. This can be seen in Fig. 2(B) where quality scores and MSA performance were averaged for each participant over % EQT values. Among the yNH, mean MSA performance was not correlated with the quality rating scores (p = 0.75), whereas for the oHI, there was a positive correlation [r = 0.54, p = 0.009 < 0.01, d = 1.3 (very large effect)]. Last, mean quality ratings were significantly higher for the yNH than for the oHI participants [t(51) = 4.7, p < 0.0001, d = 1.3 (very large effect)].
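The d values reported alongside the correlation coefficients in this paper are numerically consistent with the standard conversion d = 2r / sqrt(1 − r²). Whether this conversion was actually used to obtain them is an assumption; the sketch below only shows that the reported pairs agree with it. The function name is hypothetical.

```python
import math


def cohens_d_from_r(r):
    """Convert a correlation coefficient r to Cohen's d via d = 2r / sqrt(1 - r^2)."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return 2 * r / math.sqrt(1 - r * r)


d_age_bemht = cohens_d_from_r(0.60)    # r reported for age and BEMHT among yNH
d_msa_quality = cohens_d_from_r(0.54)  # r reported for MSA and quality among oHI
```

Rounded to one decimal place, the two results match the reported values of 1.5 and 1.3.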
3.2 Statistical model
A linear mixed-effect model was used to assess the influence of musical training, MSA ability, BEMHT, and % EQT on the subjective quality ratings obtained in the second part of the experiment. Owing to the parabolic pattern of quality rating scores with respect to % EQT observed in Fig. 2(A), % EQT was included as a quadratic term in the model. Based on the model output, musical training did not have any effect on the quality ratings (p = 0.4). BEMHT had a significant independent effect on the quality ratings [F(1,199) = 21.1, p < 0.0001, effect size = 0.1 (small effect)]. % EQT had a rather strong independent effect on the quality ratings [F(1,199) = 132.3, p < 0.0001, effect size = 0.4 (large effect)]. The quadratic term of % EQT had a weaker albeit significant effect on the quality ratings [F(1,199) = 55.4, p < 0.0001, effect size = 0.22 (small effect)]. Although MSA performance had no significant impact on the quality ratings independently (p = 0.23), it had a significant interaction effect with BEMHT and % EQT [F(1,199) = 4.4, p = 0.036 < 0.05, effect size = 0.02 (very small effect)]. Last, there was a modest yet significant two-way interaction effect between the quadratic % EQT term and BEMHT [F(1,199) = 4.21, p = 0.042 < 0.05, effect size = 0.02 (very small effect)]. Figure 2(C) provides the model prediction of quality ratings with respect to % EQT for different levels of BEMHT. Increasing BEMHT was associated with smaller changes in quality rating with changes in % EQT.
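For orientation, a model containing the terms named above could be written as follows. This is a sketch under assumptions, not the fitted model: only the effects mentioned in the text are listed, any further lower-order interaction terms are left out because they are not reported, and the participant-level random intercept is an assumed random-effects structure. Here Q is the quality rating of participant i at EQT level j, E is % EQT, H is BEMHT, T is the musical training score, and M is MSA performance.

```latex
\begin{align}
Q_{ij} ={}& \beta_0 + \beta_1 E_j + \beta_2 E_j^{2} + \beta_3 H_i + \beta_4 T_i + \beta_5 M_{ij} \nonumber \\
          & + \beta_6\, M_{ij} H_i E_j + \beta_7\, E_j^{2} H_i + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^{2}), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^{2})
\end{align}
```

In this notation, the reported interaction between the quadratic % EQT term and BEMHT corresponds to the coefficient on E²H, which modulates the curvature of the rating function with hearing loss.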
4. Discussion
In this study, we investigated the relationship between MSA performance and audio quality ratings for multi-track music stimuli subjected to spectral manipulations with the EQT. The EQT manipulates the spectral contrast and therefore the spectral shape of a signal, as shown in an earlier study (Benjamin and Siedenburg, 2023). In that study, oHI participants also preferred mixes with higher % EQT settings than yNH individuals. In the present study, MSA performance among both groups was robust across varying degrees of % EQT. Therefore, the ability to detect a target track in a complex musical mix was largely unaffected by spectral manipulations. This observed robustness could mean that spectral contrast alone may not influence MSA abilities in multi-track music, notwithstanding hearing impairment. However, the quality rating scores were more affected by % EQT for yNH than for oHI, suggesting that yNH listeners are more sensitive to alterations in the frequency domain and therefore more critical in their quality assessment of multi-track music. Lentz and Leek (2003) showed that hearing-impaired listeners had a reduced ability to process alterations in spectral shape compared to normal hearing listeners. This was supported by Narne (2020), where it was shown that hearing-impaired listeners had broader psychophysical tuning curves that correlated with a poorer ability to discriminate the ripple glide direction of narrow-band signals, and hence poorer discriminability of spectral shape. Huber (2019) showed that age had a negative impact when detecting linear and non-linear distortions in speech while improved selective attention contributed positively. For music, only working memory had a significant positive effect on perceiving such distortions. In the present study, it was shown that MSA performance among yNH was independent of their quality rating scores, whereas a strong correlation between these two music perception metrics was observed for oHI.
This suggests that oHI participants who were more adept at detecting the target track within the mix tended to give higher quality ratings. For individuals with cochlear hearing loss, better scene analysis abilities may therefore facilitate better listening experiences, or vice versa. Future research should further investigate the validity of such a relationship.
Acknowledgments
This work was supported by Deutsche Forschungsgemeinschaft Project No. ID 352015383–SFB 1330 A6; and a Freigeist Fellowship to K.S. from the Volkswagen Stiftung. We would like to thank Brian Moore for highly valuable comments on the manuscript, all of the participants for their interest in the study, and Hörzentrum Oldenburg gGmbH for their support.
Author Declarations
Conflict of Interest
The authors do not have any conflicts to disclose.
Ethics Approval
This study received approval from the Commission for Research Assessment and Ethics of the Carl von Ossietzky University of Oldenburg in Germany (Drs.EK/2019/092-01). Written informed consent was obtained from every participant.
Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.