Multitrack mixing is an essential practice in modern music production. Research on automatic-mixing paradigms, however, has mostly tested samples of trained, normal-hearing (NH) participants. The goal of the present study was to explore mixing paradigms for hearing-impaired (HI) listeners. In two experiments, the mixing preferences of NH and HI listeners with respect to the parameters of the lead-to-accompaniment level ratio (LAR) and the low-to-high-frequency spectral energy balance were investigated. Furthermore, preferences for a transformed equalization (EQ-transform) were assessed, achieved by linearly extrapolating between the power spectrum of individual tracks and a reference spectrum. Multitrack excerpts of popular music were used as stimuli. Results from experiment 1 indicate that HI participants preferred an elevated LAR compared to NH participants but did not suggest distinct preferences regarding spectral balancing or the EQ-transform. Results from experiment 2 showed that bilateral hearing aid (HA) disuse among the HI participants yielded higher LAR values, stronger weighting of higher frequencies, and sparser EQ-transform settings compared to a condition with HA use. Overall, these results suggest that adjusting multitrack mixes may be a valuable way of making music more accessible for HI listeners.
Abbreviations: HA, bilateral hearing aid; HL, hearing loss level; LAR, lead-to-accompaniment level ratio; spectral balance, spectral energy balance about 1 kHz; wHA, hearing-impaired participants with bilateral hearing aids; woHA, hearing-impaired participants without bilateral hearing aids.
Multitrack mixing is a practice whereby separate audio recordings intended for an envisioned piece of music, or mix, are combined after being spectro-temporally modified by a mixing engineer. The mixing process typically follows the recording phase, in which raw recordings are made in a studio. For ease of processing, the recordings are conducted in a manner in which different sources or instruments are recorded as separate tracks. With the separate tracks from the recording phase, the mixing engineer is then tasked with creating a coherent mixdown version whilst emphasizing the artistic visions of the parties involved (Case, 2011). Furthermore, it is also incumbent on them to consider the audibility or transparency of all sources in the mix (Moylan, 2014). For a number of reasons, including time and costs, it may be beneficial to automate certain steps of this practice (Reiss, 2016). In fact, research on automatic mixing has made significant progress in the last 15 years (De Man et al., 2019). However, to the best of our knowledge, these approaches have mostly been studied on expert listeners and have not considered hearing-impaired (HI) listeners. Given that hearing aid users continue to be dissatisfied with how hearing aids transmit music (Greasley et al., 2020; Madsen and Moore, 2014), developing strategies to preprocess music signals for hearing aid users is an important task for music processing research. Here, we evaluate the preferences of normal-hearing (NH) and HI listeners with regard to level- and spectrum-based mixing effects.
To understand the processing chain of mixing from a mathematical perspective, several authors have put forward models to automate the mixing process. Izhaki (2017) describes mixing as a correlative process in which the processing needed on one track depends entirely on the presence or introduction of other tracks. Jillings and Stables (2017) monitored user interactions in a browser-based digital audio workstation (DAW) and observed that the degree of changes made by users decreased as they moved toward the latter stages of their mixing projects. Accordingly, an iterative model of the mixing process was suggested, progressing from an early coarse stage to a later fine stage through continuous refinement of earlier mixing decisions. Alternatively, Ma (2016) modeled mixing as an optimization problem in which, given a finite set of variable parameters, the final mix is created by arriving at an optimal solution to a system of equations that best describes the process. Reiss (2011) proposed topologies with which automatic mixing can be achieved. In these topologies, the necessary spectral and temporal features of the raw or unmixed tracks are first extracted and then processed to create modified tracks, which make up the final mix.
The field of automatic mixing has been growing in significance with the advent of novel techniques in engineering and signal processing (Moffat and Sandler, 2019). Several studies have been conducted on intelligent mixing systems, employing techniques to create mixes that mimic those of a trained engineer. An important question in automatic mixing concerns the reduction of auditory masking, that is, the mutual overshadowing of sound sources in a mix. Hafezi and Reiss (2015) explored an automatic equalization (EQ)-based masking reduction paradigm, which was tested on a sample of 11 NH participants aged between 20 and 42 years. The paradigm was implemented using an offline and a real-time approach. To objectively evaluate the efficacy of masking reduction, the masker-to-unmasked ratio (Aichinger et al., 2011) was evaluated. Based on this measure, the paradigms showed improvements in masking reduction when compared to the unmixed or raw versions of eight songs. A subjective evaluation conducted on four songs showed that some of the implementations created mixes that were preferred over manual mixes created by novice mixing engineers. Notably, listeners with sensorineural hearing impairment have been shown to have higher masking thresholds even at frequencies where they show less than 30 dB hearing loss (Smits and Duifhuis, 1982). This suggests that such listeners may benefit from greater masking reduction through higher masker-to-unmasked ratios.
In a later study, Ronan et al. (2018) presented a paradigm to optimize masking reduction via EQ and dynamic range compression (DRC) controls. The optimization of the controls was achieved with the aid of a particle swarm optimizer, with and without sub-grouping of the tracks in the mix. Particle swarm optimization is performed by moving a set of candidate solutions, called particles, through a search space, which is usually a high-dimensional Cartesian space. Each particle's velocity and step size are iteratively updated over time and also depend on the location of the particle with the best or most optimal position at a given time (Kennedy and Eberhart, 1995). An optimal solution is reached by minimizing a cost function, which, in this case, is the L2 norm of multitrack masking at frequencies between 500 Hz and 2 kHz. In sub-grouping, tracks belonging to similar instruments were grouped, and the controls were applied to each of these groups to create secondary mixes, which were then summed to create the final mix. When there was no sub-grouping, the controls were applied to all of the unmixed tracks at once. Objective evaluation of the cross-adaptive multitrack masking reduction showed that the use of sub-grouping achieved a greater reduction in masking with fewer iterations of the optimizer. This was supported by a listening test with 24 NH participants (aged between 23 and 52 years) and five songs. The mixes created using this paradigm with sub-grouping elicited better preference and clarity ratings than those created without sub-grouping. As an alternative, Tom et al. (2019) demonstrated a masking reduction paradigm via frequency-dependent panning. Here, two approaches were explored, one a real-time implementation and the other an offline implementation.
In the real-time implementation, a palindromic Siegel-Tukey type ordering (Siegel, 1956) with respect to masking was performed such that the last stem in the ordering would require the least or no panning intervention for masking reduction. In the offline implementation, the panning position of a track that improves masking reduction was estimated using a particle swarm optimizer. Objectively, the resulting panning positions were comparable to those in professional mixes. A subjective evaluation conducted on 25 trained NH participants showed that both modes outperformed existing panning-based masking reduction paradigms. Steinmetz et al. (2021) demonstrated an early application of deep learning for creating automatic mixes: In a subjective evaluation conducted with 16 audio engineers, the implementation showed promise by being rated close to the factory or target mixes, especially when tested on audio samples from the ENST drums database (Gillet and Richard, 2006).
Although successful paradigms have been formulated for automatic mixing, these have only been tested on listeners with no reported hearing impairment and with a relatively small number of audio samples. This raises the question of whether HI listeners would benefit from mixes specifically tailored toward their needs. Kohlberg et al. (2015) showed that cochlear implant (CI) users prefer reduced musical complexity with fewer instruments in the mix than NH listeners. Similarly, Pons et al. (2016) indicated that CI users may benefit from individualized mixes and higher vocal levels specifically. Buyens et al. (2014) also showed that CI users preferred the lead vocal levels to be enhanced with respect to the rest of the instruments in the mix and suggested that the mixes generally available to the public might not be suitable for such listeners. More recently, Tahmasebi et al. (2020) proposed a deep-neural-network-based, real-time capable source separation for music remixing to enhance the music perception of CI listeners. The implementation was aimed at separating lead vocals from the rest of the mix in popular western music. The stimuli were presented to 13 bilateral CI users and 10 NH participants under realistic conditions with and without visual cues. The CI users preferred elevated lead vocal levels with respect to those of the other instruments, irrespective of the reverberance or the presence of visual cues. Specifically, the lead vocal level preferences were 8 dB higher among the CI users than among the NH participants tested.
Nagathil et al. (2017) demonstrated a paradigm whereby the spectral complexity of classical chamber music pieces was reduced through a low-rank approximation of their constant-Q transforms (CQTs). This approach yielded significantly higher preference ratings among a sample of 14 CI users than a source separation and remixing paradigm in which the lead vocals were separated from the mix and remixed at elevated levels, as suggested by Buyens et al. (2014). Because the low-rank paradigm introduces timbre distortions and yet outperformed the source separation and remixing paradigm that faithfully reconstructs the lead vocals, the study underpins the priority of reduced spectral complexity over timbral fidelity in chamber music among CI users.
Whether such findings generalize to HI participants wearing bilateral hearing aids (HAs) instead of CIs is yet to be ascertained. It is well known that hearing impairment entails psychoacoustic limitations, including impaired frequency selectivity (Glasberg and Moore, 1986), impaired temporal fine structure sensitivity (Hopkins and Moore, 2011), and impaired sound localization abilities (Warnecke et al., 2020). Studies of music perception have shown that listeners with moderate hearing impairment have drastically reduced abilities to hear out melodies or instruments from a mixture (Siedenburg et al., 2020). In a later study, Siedenburg (2021b) investigated the ability of young NH listeners and older HI listeners to track whether a reference voice in a mixture had tremolo artificially introduced by amplitude modulation. The older HI listeners were tested with and without their hearing aids, and their ability to track the tremolo in the reference voice did not improve with the use of their hearing aids.
Several of the aforementioned studies indicate that CI users prefer an enhancement of the lead vocal level relative to the other instruments in the mix. Relatedly, Bürgel et al. (2021) showed that lead vocals serve as powerful attractors of auditory attention in popular music. In a more recent study, Knoll and Siedenburg (2022) showed that a mixed group of NH and HI participants preferred lead vocal levels elevated above those of the original mixes. To investigate whether the distinct preferences reported in the literature on CI listeners hold true when NH and moderately HI listeners are compared directly, individual preferences regarding the level-based mixing effect of the lead-to-accompaniment level ratio (LAR) were recorded in this study.
Concerning spectral mixing effects, a study by Hornsby and Ricketts (2006) showed that the use of high-frequency speech information did not improve with frequency-dependent amplification among participants with sensorineural hearing impairment. To assess the contribution of frequency-dependent loudness weighting to music preference, we use a spectrum-based effect that changes the balance of spectral energy of the audio signal in this paper. Furthermore, the consequences for music preference of the reduced frequency selectivity brought on by hearing impairment (Florentine et al., 1980) need to be assessed from the perspective of mixing. Equalizing may well serve as the effect to manipulate for such an assessment. However, Izhaki (2017) highlights the challenges associated with the appropriate application of EQing in popular music, emphasizing its effect on tonality, which is achieved by accentuating and attenuating levels in different frequency bands. Izhaki (2017) also notes that EQing serves as the cardinal tool with which the engineer may alter the timbre of the instruments in the mix. It can be used to convey different emotions to the listener, so much so that it must be performed with the greatest care by a trained professional.
In this study, we aim at emphasizing or downplaying the spectral distinctiveness of individual tracks via an equalization transformation (EQ-transform). By assessing preferences for these mixing effects, our goal is to characterize the mixing preferences of HI listeners in comparison to NH listeners, as well as the effect of HA use on these preferences from a general perspective, as opposed to doing so as a function of individualized HA settings. Overall, these concerns indicate several open issues in the production of multitrack mixes for HI listeners, let alone in its automation. Research in this direction should therefore explore the mixing preferences of HI listeners prior to creating dedicated automatic-mixing paradigms. Moreover, whether the use of HAs influences mixing preferences among HI listeners is still an open question. To answer these questions, this study evaluated listeners' preferences with regard to characteristic mixing effects. In experiment 1, we tested a sample of NH and HI participants with and without HAs. In experiment 2, a sample of HA users was tested to compare their preferences with and without HAs. Based on previous studies, we hypothesize that HI participants may show elevated lead vocal level preferences and a stronger weighting of high frequencies in the mixes compared to NH participants. Furthermore, reduced frequency selectivity among HI participants may manifest as a greater affinity toward spectrally sparser mixes. HA use is also hypothesized to bring about significant differences in these preferences among HI listeners.
II. MIXING EFFECTS
Here, we outline the mixing effects used in the present study.1 The rationale is to provide effects that have only one free parameter and, thus, may be easily understood and used by the participants of our study. As motivated above, we seek to test the LAR and to explore spectrum-based effects and the way in which participants adjust these effects according to their preferences.
A. Lead-to-accompaniment level ratio

The LAR in dB is varied by accentuating or attenuating the broadband level of the lead vocal track with respect to that of the accompanying instruments considered as a whole. Here, we consider all tracks other than the lead vocal tracks in the mix as the accompaniment; backing vocal tracks are disregarded entirely. The manipulation affects only the level of the lead vocals in the multitrack excerpts, leaving the relative levels of the accompanying tracks as they were in the original mix. This was done to avoid introducing unnatural level relationships between the accompanying tracks (e.g., to prevent accentuating the level of low-energy or transient tracks, which may inadvertently raise the audibility of background noise). A LAR of 0 dB implies that the level of the lead vocals and that of the accompaniment are identical. To avoid altering the panning of the mix, identical weighting was applied to the left and right channels of the lead vocal tracks, that is, the lead vocal levels of both channels were altered in unison. Furthermore, as the broadband levels were averaged over the entire duration of the excerpts, silent or low-energy portions were also part of the calculation.
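To make the level manipulation concrete, the following minimal Python sketch derives the gain required to reach a target LAR from broadband root-mean-square levels. The function names and the toy sine tracks are illustrative assumptions, not the study's implementation (which also handled stereo channels and time-averaged levels over full excerpts).

```python
import math

def rms_db(samples):
    """Broadband level in dB: 10*log10 of the mean squared amplitude."""
    return 10 * math.log10(sum(s * s for s in samples) / len(samples))

def apply_lar(lead, accomp, target_lar_db):
    """Scale the lead vocal track so its broadband level exceeds the
    accompaniment level by target_lar_db; the accompaniment is untouched."""
    current_lar = rms_db(lead) - rms_db(accomp)
    gain = 10 ** ((target_lar_db - current_lar) / 20)  # linear amplitude gain
    return [s * gain for s in lead]

# Toy example: a quiet "lead" and a louder "accompaniment" (0.1 s at 44.1 kHz)
lead = [0.1 * math.sin(2 * math.pi * 220 * n / 44100) for n in range(4410)]
accomp = [0.4 * math.sin(2 * math.pi * 110 * n / 44100) for n in range(4410)]
lead_adj = apply_lar(lead, accomp, 0.0)  # 0 dB LAR: equal broadband levels
```

Note that, as in the study, averaging squared amplitude over the full signal means silent portions lower the computed broadband level.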
B. Spectral balance
In this spectral filtering effect, described by Siedenburg (2021a), the weightings of the bands of a filter bank are altered. Effectively, this shifts the spectral slope of the final mix between 125 Hz and 8 kHz; positive values for the spectral balance increase the auditory brightness of the signal. A spectral balance of 0 dB/oct means that the applied filter is an all-pass filter with no change in the spectral centroid. The slope is only applied between 125 Hz and 8 kHz with a balancing point at 1 kHz (implying a gain of 0 dB at this frequency). The filter is applied to the final stereo mix, which encompasses all tracks.
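The per-frequency gain implied by such a tilt can be sketched as follows. This assumes the slope is realized as slope × log2(f / 1 kHz) within the stated band, with the band-edge gain held constant outside it; the exact filter-bank realization of the study is not reproduced here.

```python
import math

def spectral_balance_gain_db(freq_hz, slope_db_per_oct):
    """Gain in dB for a spectral-balance slope pivoting at 1 kHz.
    The tilt is applied between 125 Hz and 8 kHz only; outside this
    range the band-edge gain is held (an assumption of this sketch)."""
    f = min(max(freq_hz, 125.0), 8000.0)
    return slope_db_per_oct * math.log2(f / 1000.0)

# +1 dB/oct brightens: +3 dB at 8 kHz, -3 dB at 125 Hz, 0 dB at 1 kHz
```

A positive slope thus boosts energy above the 1 kHz balancing point and attenuates energy below it, raising the spectral centroid.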
C. EQ-transform

In transformed mixing effects, the effect of interest is acquired from a factory mix made originally available by the mixing engineer (factory effect). The effect in question is then linearly extrapolated with respect to a reference for that effect, with the reference effect corresponding to 0% and the factory effect corresponding to 100%. A participant wishing to double the effect present in the factory mix would do so by choosing a 200% transform. That is, if the power level of a given track at a given frequency bin was 5 dB above the reference level (the level at the 0% transform), a 200% EQ-transform would transform this difference to 10 dB, a 300% transform to 15 dB, and so on. In this way, power levels above the reference in each track are accentuated and notches falling below the reference are commensurately attenuated, thus affecting the frequency domain or spectral sparsity of the track. To gauge the changes in sparsity brought on by the transform, a CQT-based method evaluating the Gini index (or coefficient) as a measure of sparsity was used. The Gini-index-based measure was chosen for its robustness (Hurley and Rickard, 2009). Rickard and Fallon (2004) showed by using the Gini coefficient that speech samples taken from the TIMIT Acoustic-Phonetic Continuous Speech Corpus (Garofolo, 1993) were sparser in the time-frequency domain than in the time domain. In a later study, Rickard (2006) showed that the Gini index as a measure of time-frequency sparsity serves as a reliable indicator of the mathematical separability of sources in a mixture by serving as a reasonable surrogate for the disjoint orthogonality of the sources. In this context, disjoint orthogonality indicates how strongly the energy of one source dominates the other sources in the mix at a particular point in time and frequency.
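The extrapolation rule itself amounts to a single affine operation per frequency bin in the dB domain. A minimal sketch follows; the function name is illustrative, and the spectra are assumed to be given as per-bin power levels in dB.

```python
def eq_transform_db(factory_db, reference_db, percent):
    """Linear extrapolation in the dB domain: 0% returns the reference
    spectrum, 100% the factory spectrum, and 200% doubles every dB
    deviation of the factory spectrum from the reference."""
    t = percent / 100.0
    return [r + t * (f - r) for f, r in zip(factory_db, reference_db)]

reference = [-20.0, -20.0, -20.0]
factory = [-15.0, -20.0, -25.0]   # +5 dB peak, 0 dB, -5 dB notch vs. reference
doubled = eq_transform_db(factory, reference, 200)  # peaks and notches doubled
```

Consistent with the text, the 5 dB peak becomes 10 dB above the reference at 200% while the 5 dB notch deepens to 10 dB below it.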
Zonoobi et al. (2011) demonstrated the superiority of the Gini coefficient as a measure of sparsity over conventional norm-based measures when reconstructing randomly generated one-dimensional signals and real images with and without additive white Gaussian noise. The signals and images were reconstructed from compressed samples using sparse estimates obtained with the aid of the norm-based measures and the Gini coefficient. Those reconstructed with the aid of the latter had the lowest mean square errors with respect to their original versions. Furthermore, noise-corrupted images reconstructed with the aid of the Gini coefficient had the highest peak signal-to-noise ratios. More recently, Orović (2022) demonstrated the reliability of the Gini coefficient as a measure of the energy distribution obtained from time-frequency representations of nonstationary signals.
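For illustration, the Gini coefficient of a nonnegative sequence (e.g., per-bin spectral power) can be computed with the standard rank-weighted formulation below. This is a generic textbook estimator, not necessarily the exact variant used in the study.

```python
def gini(values):
    """Gini coefficient of a nonnegative sequence: 0 for a perfectly
    uniform distribution of energy, approaching 1 for a maximally
    sparse one (all energy in a single bin)."""
    x = sorted(values)
    n = len(x)
    total = sum(x)
    # rank-weighted sum: weights run from -(n-1) to +(n-1) in steps of 2
    weighted = sum((2 * (k + 1) - n - 1) * v for k, v in enumerate(x))
    return weighted / (n * total)

# A flat spectrum is maximally dense; a single active bin is maximally sparse.
```

With four bins, a flat sequence yields 0 and a single-spike sequence yields (n - 1)/n = 0.75, matching the interpretation that higher values indicate greater sparsity.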
Figure 2(A) illustrates the manner in which sparsity was objectively measured in this study. Glasberg and Moore (1986) showed that hearing impairment leads to broader auditory filters, indicating poorer frequency selectivity. This was also shown by Zurek and Formby (1981), where sinusoidal frequency modulation discriminability deteriorated with higher levels of hearing loss. The EQ-transform was therefore conceived as a potentially valuable tool to assess the effects of spectral sparsity on multitrack mix preferences among HI listeners in light of such limitations. In the EQ-transform, each stereo channel of each track was processed independently. Specifically, the one-sided power spectra [using a 352 800-point fast Fourier transform (FFT), corresponding to the 8 s duration and sampling frequency of 44 100 Hz] of the factory tracks were linearly extrapolated between themselves, serving as the 100% transform, and the reference power spectrum, corresponding to a 0% transform, as shown in Fig. 1. The reference spectrum was the ensemble average of the spectra of the lead vocals and of the most commonly occurring instruments (piano, guitar, bass guitar, drums, percussion, and synth instruments), extracted from the open source databases used in this study (discussed in Sec. III). As the EQ-transform process depends heavily on the factory power levels relative to the reference levels, the latter is normalized such that it has the same overall power as the track undergoing the transform.
The EQ-transform was completed in eight steps, which are illustrated in Fig. 1(C). Each track was independently subjected to these steps in the following manner: (1) The power spectrum of the track was obtained using the FFT (analysis). (2) The transformed power spectrum of the track was derived by linearly extrapolating between the reference spectrum and the factory power spectrum of that track, as shown in Fig. 1(D) (extrapolation). (3) The residual power spectrum [the difference between the transformed spectrum from step (2) and the factory spectrum from step (1)] was evaluated (envelope extraction). (4) The residual spectrum was then smoothed using a Savitzky-Golay filter (Schafer, 2011) to avoid temporal smearing in the transformed mix (smoothing). (5) Bands of the factory spectrum with power not less than 90 dB below the global maximum for the track were identified (thresholding). (6) The smoothed spectrum from step (4) was used to color the factory spectrum in the bands identified in step (5) (coloring). (7) The time domain representation of the colored spectrum was then obtained using the inverse Fourier transform (synthesis). (8) The resulting signal was normalized such that it has the same broadband level as the factory track from which it was derived (normalization). Finally, the EQ-transformed mixdowns were created by simply adding the normalized signals of all tracks.
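Steps (2) through (6) can be condensed into the following Python sketch operating on per-bin dB values. A plain moving average stands in for the Savitzky-Golay smoother, and the function name and smoothing length are illustrative assumptions rather than the study's implementation; only the 90 dB threshold below the global maximum follows the text.

```python
def color_spectrum_db(factory_db, reference_db, percent,
                      smooth_len=3, threshold_db=90.0):
    """Condensed sketch of EQ-transform steps (2)-(6) on per-bin dB values."""
    t = percent / 100.0
    # (2) extrapolate between reference (0%) and factory (100%) spectrum
    transformed = [r + t * (f - r) for f, r in zip(factory_db, reference_db)]
    # (3) residual between transformed and factory spectrum
    residual = [tr - f for tr, f in zip(transformed, factory_db)]
    # (4) smooth the residual (moving average stands in for Savitzky-Golay)
    half = smooth_len // 2
    smoothed = []
    for i in range(len(residual)):
        window = residual[max(0, i - half):i + half + 1]
        smoothed.append(sum(window) / len(window))
    # (5) threshold: retain only bins within threshold_db of the global peak
    peak = max(factory_db)
    # (6) color the retained bins of the factory spectrum with the residual
    return [f + s if f >= peak - threshold_db else f
            for f, s in zip(factory_db, smoothed)]
```

With a trivial smoother (length 1) and all bins above threshold, the result reduces to the plain extrapolation of step (2); bins far below the peak are left uncolored.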
As depicted in Fig. 2(A), the CQT was applied to an audio input with a frequency resolution of a third of an octave, using a Hamming window as the windowing function. After performing the CQT, the median power across time was obtained for each frequency bin, and the Gini coefficient was then evaluated for the resulting sequence of median power values. Figure 2(B) illustrates the Gini coefficient evaluated in this manner as a function of the applied EQ-transform for 75 lead vocal, 78 bass, 84 drum, 35 guitar, 18 percussion, 40 piano, and 24 synth tracks extracted from both databases used in this study. In Fig. 2(B), the mean Gini coefficient increases monotonically with overmixing. A higher Gini coefficient indicates higher frequency domain sparsity, showing that the EQ-transform implementation used in this study gives rise to greater spectral sparsity with overmixing of the individual tracks. Conversely, a noticeable increase in spectral density with under-mixing, particularly between 100% and 0%, is apparent. Figure 2(C) depicts the envelopes of the power spectral densities of an example mixdown after 0%, 100% (factory mix), and 500% EQ-transform. The excerpt contained four tracks: lead vocals, bass guitar, drums, and guitar. The increase in the spectral sparsity of the mixdown with overmixing, as shown for the individual tracks, is also visible here.
III. EXPERIMENT 1
In this first experiment, we compared effect preferences in a between-subjects design with distinct groups of NH participants, HI participants who did not wear hearing aids, and HI participants who wore hearing aids.
1. Participants

A sample of 25 NH and 20 HI participants took part. Among the HI participants, ten were bilateral hearing aid users (wHA) and the other ten did not use any hearing aids (woHA). HI participants were recruited via Hörzentrum Oldenburg gGmbH (Oldenburg, Germany); the distinction between wHA and woHA was made post-recruitment. The NH participants were recruited using an online advertisement with no reference to hearing impairment whatsoever. However, a few woHA participants were also recruited through this advertisement; they were classified as hearing impaired on the basis of the pure-tone audiometry performed on all participants. All participants were compensated at a rate of 12 Euros per hour for their involvement in the study.
The NH participants were, on average, 28 years old [standard deviation (SD) = 9.6], the wHA participants 70 years old (SD = 9.5), and the woHA participants 67 years old (SD = 14.5). The age difference between the wHA and woHA participants was not significant. Among the NH participants, there were 15 female and 10 male participants; among the wHA participants, seven male and three female participants; and in the woHA group, four female and six male participants. Figure 3(A) displays the individual and median hearing loss levels (HLs) across the respective participant groups. The HL was assessed via pure-tone audiometry using pure tones at 125 Hz, 250 Hz, 500 Hz, 1 kHz, 2 kHz, 4 kHz, and 8 kHz. From the audiogram, it can be observed that the wHA participants indeed had more than 10 dB greater HL, on average (M = 42 dB, SD = 11 dB), compared to the woHA participants (M = 29 dB, SD = 3 dB). This can be further observed in Fig. 3(B), which displays age and average HL (arithmetic mean across all tested frequencies). Here, it is apparent that all of the woHA participants had a mild hearing impairment (25 dB < HL ≤ 40 dB), whereas 50% of the wHA participants had a moderate to severe hearing impairment (HL > 40 dB; Clark, 1981). As expected, the NH participants did not have elevated HLs (M = 1.5 dB, SD = 5 dB).
Figure 3(B) shows the relationship between average hearing loss and age among the participants; a significant positive correlation was observed, r(43) = 0.9, p < 0.001. According to the Goldsmiths Musical Sophistication Index musical training subscale (Müllensiefen et al., 2014), participants had mean scores of M = 29 (SD = 10) for the NH, M = 26 (SD = 12) for the wHA, and M = 31 (SD = 13) for the woHA participants, with no significant differences between any of the three groups (p > 0.3).
Concerning the HAs of the wHA participants, nine of the ten participants for whom data were made available after the study wore behind-the-ear (BTE) hearing aids. Among them, five wore closed-fit HAs (with a tube connecting the BTE case to an ear mold customized for the participant) and reported having used them for between 2 and 15 yr (M = 9.4 yr, SD = 4.9 yr). The other four participants used open-fit hearing aids (with a tube connecting the BTE case to a dome, leaving the ear canal open and thus allowing unamplified low-frequency sound to enter) and reported having used them for between 4 and 17 yr (M = 7.8 yr, SD = 6.2 yr). Information on the HA use of the remaining wHA participant was unavailable after the study was completed. Refer to the supplementary material for information pertaining to individual HA use.2
2. Stimuli and apparatus
The audio excerpts used as stimuli were 8 s long and taken from the Medley database (Bittner et al., 2014). All audiometry in this study was conducted using a portable AD528 audiometer by Interacoustics (Dortmund, Germany). The audio playback was realized over a pair of ESI active 8″ near-field studio monitors in a low-reflection chamber at the University of Oldenburg, Germany. The monitors were separated by a 90° angle and a 2 m distance from the listener's seat. The overall playback level was adjusted to 80 dBA at the participant position. The monitor levels at the participant position were calibrated to this level with a stationary noise shaped with an average spectrum acquired from commonly occurring instruments from both aforementioned databases, as with the reference spectrum used for the EQ-transform discussed earlier. See Sec. 5 of the supplementary material for the sound pressure levels of the individual excerpts presented in the study at the participant position.2 Due to the fluctuating levels of music signals, we also provide the minimum (LAFmin) and maximum (LAFmax) levels of our stimuli. The measurements were made using a Nor140 precision sound level meter from Norsonic AS (Tranby, Norway). The sound pressure levels were measured over a full minute during which the excerpts were looped; none of the excerpts used in this study exceeded 90 dBA at the maximum level. The audio transforms were realized using a standalone desktop computer running MATLAB 2021b (The MathWorks, Inc., Natick, MA). The desktop computer was connected to the monitors via an RME Fireface UFX audio interface (RME Intelligent Audio Solutions, Haimhausen, Germany).
3. Procedure

Prior to commencing with the training phase of the experiment, the participant was asked to fill in the Gold-MSI musical training questionnaire (musical training subscale; Müllensiefen et al., 2014) to estimate their level of musical training. On completion of the questionnaire, the participant was guided to the training phase of the experiment, which served merely to acquaint the participant with the graphical user interface and the concept of the experiment.
The training phase consisted of a single block of ten trials in which an audio excerpt that was not included in the main experiment was repeated. In each trial, one of the three previously discussed mixing effects was manipulated by the participant via the rotation of an ungraduated virtual dial, whose initial position was randomized between trials. As the mixing effect took effect immediately upon a dial change, the participant was instructed to set the dial to where they most preferred the audio playback over the loudspeakers.
After completion of the training phase and a short break, the main phase of the experiment began; hearing aid users were requested to keep their hearing aids on throughout the rest of the experiment. The main phase was grouped into three blocks, each dedicated to one of the mixing effects. Each block comprised ten trials with a different multitrack excerpt presented per trial; within a given block, no excerpt was presented more than once. There were 20 distinct multitrack excerpts, taken only from the Medley database: the first ten were used exclusively for the level-based effect, and the other ten were used for the two spectrum-based effects. To avoid order biases, the order of the blocks and the order in which the excerpts were presented within a block were randomized. The starting position of the virtual dial in each trial was also assigned randomly. The dial position set by the participant was stored when they proceeded to the next trial.
4. Data analysis
A linear mixed effects (LME) model was used to estimate the relationships between mixing effect preferences and participant group, the ten audio excerpts, and the interaction between the two factors. An LME model was chosen over repeated-measures analyses of variance (ANOVAs) because it accommodates unbalanced sample sizes between groups and avoids list-wise deletion (Lohse et al., 2020). To summarize the main effects and interactions, results are presented as classic ANOVA statistics for ease of interpretation, derived from the LME models via MATLAB's anova function.
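The study fit LME models in MATLAB and derived ANOVA statistics from them; as a self-contained illustration of the fixed-effects decomposition that those statistics summarize, the sketch below computes classic two-way ANOVA F statistics (group, excerpt, interaction) for a balanced synthetic design in plain NumPy. It is a hypothetical helper, not the paper's analysis code, and omits the random participant effects of a full LME.

```python
import numpy as np

def two_way_anova(y, a, b):
    """Two-way ANOVA F statistics for a balanced design (sketch).

    y: observations (e.g., preference values); a, b: integer factor
    labels (e.g., participant group and excerpt index).
    Returns (F_a, F_b, F_ab).
    """
    y = np.asarray(y, float)
    a, b = np.asarray(a), np.asarray(b)
    la, lb = np.unique(a), np.unique(b)
    grand = y.mean()
    n_cell = len(y) / (len(la) * len(lb))          # replicates per cell
    ma = np.array([y[a == i].mean() for i in la])  # factor-a means
    mb = np.array([y[b == j].mean() for j in lb])  # factor-b means
    mab = np.array([[y[(a == i) & (b == j)].mean() for j in lb] for i in la])
    ss_a = n_cell * len(lb) * np.sum((ma - grand) ** 2)
    ss_b = n_cell * len(la) * np.sum((mb - grand) ** 2)
    ss_ab = n_cell * np.sum((mab - ma[:, None] - mb[None, :] + grand) ** 2)
    ss_err = np.sum((y - grand) ** 2) - ss_a - ss_b - ss_ab
    df_a, df_b = len(la) - 1, len(lb) - 1
    ms_err = ss_err / (len(y) - len(la) * len(lb))
    return (ss_a / df_a / ms_err,
            ss_b / df_b / ms_err,
            ss_ab / (df_a * df_b) / ms_err)
```

With unbalanced group sizes, as in the present data, a dedicated mixed-model routine (e.g., MATLAB's fitlme) is the appropriate tool; this sketch only conveys the structure of the reported F statistics.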
To follow up on the differences indicated by the mixed effects model, post hoc independent-samples t-tests were applied to each participant's preferences averaged across the ten presented excerpts. The resulting p-values were subjected to a Holm correction, which is uniformly more powerful than the Bonferroni correction (VanderWeele and Mathur, 2019). Cohen's d (Lakens, 2013) was used as a measure of effect size.
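The Holm step-down adjustment and Cohen's d for two independent samples can be sketched as follows (a minimal NumPy sketch, not the study's MATLAB code):

```python
import numpy as np

def holm_adjust(pvals):
    """Holm step-down adjusted p-values; controls the family-wise error
    rate and is uniformly more powerful than Bonferroni."""
    p = np.asarray(pvals, float)
    order = np.argsort(p)
    m = len(p)
    adj = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k), enforce monotonicity.
        running_max = max(running_max, (m - rank) * p[idx])
        adj[idx] = min(1.0, running_max)
    return adj

def cohens_d(x, y):
    """Cohen's d with pooled standard deviation for two independent samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    pooled = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                     / (nx + ny - 2))
    return (x.mean() - y.mean()) / pooled
```

For example, raw p-values of 0.01, 0.04, and 0.03 adjust to 0.03, 0.06, and 0.06, so only the first comparison survives at the 0.05 level.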
To assess individual differences within groups, a test of variance compared the mean preferences taken across all excerpts for each participant with the mean preferences taken across all participants for each excerpt within the group. In other words, the variances of the distributions in Fig. 4(B) for a given participant group were compared with those in Fig. 4(C) for the same group. A one-tailed test of variance was then used to determine whether the variance of the former was significantly larger than that of the latter; a significantly higher variance of participant-wise mean preferences would indicate substantial individual differences.
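The paper used a parametric one-tailed F-test for this comparison; the sketch below frames the same variance-ratio comparison as a permutation test, which needs no distributional assumption. The matrix layout (participants × excerpts) mirrors the distributions plotted in Figs. 4(B) and 4(C); the helper and its permutation scheme are illustrative assumptions, not the study's exact procedure.

```python
import numpy as np

def variance_ratio_perm_test(prefs, n_perm=2000, seed=0):
    """One-tailed permutation test of participant-wise vs. excerpt-wise
    variance of mean preferences.

    prefs: (n_participants, n_excerpts) preference matrix.
    Returns the observed variance ratio and a permutation p-value for
    the hypothesis that participant means vary more than excerpt means.
    """
    prefs = np.asarray(prefs, float)
    rng = np.random.default_rng(seed)

    def ratio(m):
        # Variance of participant means over variance of excerpt means.
        return m.mean(axis=1).var(ddof=1) / m.mean(axis=0).var(ddof=1)

    observed = ratio(prefs)
    flat = prefs.ravel()
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(flat).reshape(prefs.shape)
        if ratio(perm) >= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)
```

A small p-value indicates that the spread of participant-wise means exceeds what random assignment of the same values would produce, i.e., substantial individual differences.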
B. Results and discussion
As mentioned earlier, preferences for a given mixing effect were elicited from each participant for ten distinct audio excerpts. Figure 4(B) shows 95% confidence intervals and mean LAR preferences elicited from each participant across the ten excerpts. LAR preferences for each excerpt, averaged across the participants of the respective groups, are presented in Fig. 4(C). Figure 4(A) shows the per-participant averages across all excerpts. The preferences recorded for the two spectral mixing effects are presented in the same manner in Figs. 5 and 6. According to one-sample t-tests conducted on the means depicted in Fig. 4(A), LAR preferences among NH participants were slightly negative (M = –0.92 dB, SD = 2.33 dB), whereas wHA participants preferred positive LAR values (M = 1.68 dB, SD = 2.76 dB), similar to woHA participants (M = 1.32 dB, SD = 4.78 dB). However, none of these deviations from zero were statistically robust, neither for NH participants [t(24) = 2, p = 0.06] nor for HI participants [t(9) > 0.9, p > 0.4].
In a direct comparison between groups, the LME model showed significant inter-group effects, F(2,420) = 3.64, p = 0.02 < 0.05. Furthermore, there were inter-excerpt effects, F(9,420) = 16.52, p < 0.001, together with a significant interaction effect, F(18,420) = 1.67, p = 0.04 < 0.05. The post hoc independent-samples t-test showed that mean LAR preferences among wHA participants were significantly higher than those among NH participants, t(33) = 2.83, p = 0.008 < 0.02, d = 1.06 (large effect size), as can be observed in Fig. 4(A). When comparing NH and woHA participants, no significant differences were found, t(11) = 1.4, p = 0.18. However, the two-tailed test of variance showed that the variances of the mean LAR preferences of NH and woHA participants were significantly different (F = 0.24, p = 0.005 < 0.01), and the degrees of freedom were, therefore, adjusted from 33 to 11. Finally, no significant differences between wHA and woHA participants were found, t(18) = 0.2, p = 0.84. None of the groups showed significant within-group individual differences (p > 0.08).
Similarly, the one-sample t-test performed on the mean spectral balance preferences illustrated in Fig. 5(A) did not reveal significant deviations from the 0 dB/Oct reference (at which the unweighted factory mix-down is presented) for NH participants (M = 0.2 dB/Oct, SD = 0.8 dB/Oct), t(24) = 1.3, p = 0.2. This was likewise the case for wHA participants (M = 0.1 dB/Oct, SD = 0.6 dB/Oct), t(9) = 0.6, p = 0.6. However, woHA participants (M = 0.6 dB/Oct, SD = 0.7 dB/Oct) showed a significantly elevated preference, favoring a stronger weighting of frequencies above 1 kHz, t(9) = 2.4, p = 0.04 < 0.05, d = 0.77 (medium effect).
Here, the LME model did not show significant inter-group effects of preferences, F(2,420) = 1.13, p = 0.32. Furthermore, significant inter-excerpt effects, F(9,420) = 11.07, p < 0.001, but no interaction effects, F(18,420) = 1.1, p = 0.34, were shown. However, unlike for the LAR preferences earlier, the variance of the mean spectral balance about 1 kHz (SPBal) preferences taken across all excerpts for each participant was significantly larger than that taken across all participants for each excerpt for NH participants (F = 6, p = 0.004 < 0.01). This indicates stark differences in SPBal preferences among the NH participants, which can be seen in Figs. 5(B) and 5(C): there are drastic differences between the preferences of individual NH participants but relatively small differences across excerpts.
As for the EQ-transform, the mean preferences illustrated in Fig. 6(A) for wHA (M = 83.5%, SD = 33.8%) and woHA participants (M = 89.1%, SD = 43.5%) were not significantly different from the factory settings [t(9) = 1.5, p = 0.15 and t(9) = 0.8, p = 0.4, respectively]. However, NH participants (M = 80.5%, SD = 28.3%) showed significantly reduced EQ-transform preferences compared to the original mix presented to them, t(24) = 3.5, p = 0.002 < 0.01, d = 0.7 (medium effect).
The LME model did not show an effect of participant group, F(2,420) = 0.26, p = 0.8. However, inter-excerpt effects, F(9,420) = 6, p < 0.001, and interaction effects, F(18,420) = 1.8, p = 0.02 < 0.05, were observed. Despite the nonsignificant group effects, interesting trends can be observed, as illustrated in Figs. 6(A) and 6(B): participants from all groups mostly preferred under-mixing or, more specifically, a transform between 0% and 100%. All participant groups showed similar mean EQ-transform preferences with no significant differences between their variances. Furthermore, no significant individual differences were observed within the groups (p > 0.09). Refer to the supplementary material for an illustration of the p-values for inter-excerpt comparisons of the respective preferences for the three groups pooled in experiment 1.2 An illustration of the raw error residuals from the LME model is also shown in the supplementary material.2
In summary, we find that wHA participants preferred a significantly elevated level of the lead vocals in the presented mixes compared to NH participants. These preferences were considerably more diverse among woHA participants than among NH participants. When spectral balance preferences were assessed, there were significant individual differences among NH participants. On average, wHA participants preferred the factory settings of the mixes, with preferences of almost 0 dB/Oct. However, woHA participants favored a stronger weighting of higher frequencies, with SPBal preferences significantly elevated relative to the factory settings. All three participant groups preferred spectrally denser mixes than those presented to them, as reflected by EQ-transform preferences below 100%; this observation was statistically significant only for NH participants.
A clear limitation of experiment 1 was that the degree of hearing loss confounded the between-subjects distinctions of hearing aid use; see Fig. 3(B). For that reason, we sought to follow up on these findings using a more controlled within-subjects design, where the mixing effect preferences were assessed with and without HA use among a sample of HI participants different from those in experiment 1. Here, we aimed at assessing the role of hearing aid use alone on preferences of music processing strategies.
IV. EXPERIMENT 2
This experiment was identical to the previous one in setup and implemented to evaluate the effect of hearing aid use on the mixing effects preferences. To that end, this experiment only targeted participants who were HA users who completed the experiment once with (wHA) and once without (woHA) their hearing aids on.
A sample of 18 participants with a mean age of 73 years took part in this experiment; 14 of them were moderately to severely HI (M = 50 dB HL, SD = 7 dB HL) with a mean age of 75 years, and four were mildly HI (M = 33 dB HL, SD = 5 dB HL) with a mean age of 67 years. Participants with HAs were specifically recruited via a subjects database of Hörzentrum gGmbH. There were 12 male and 6 female participants. Musical training, estimated as in experiment 1, was, on average, 18 points on the Gold-MSI musical training subscale (Müllensiefen et al., 2014) (SD = 9). Figure 7(A) shows the median hearing loss of the participants evaluated through pure-tone audiometry. Similar to the participant pool in experiment 1, there was a significant linear correlation between age and mean hearing loss, r(16) = 0.7, p = 0.003 < 0.01, as visible in Fig. 7(B).
Among the participants, 16 wore a behind-the-ear (BTE) type HA; 12 of those wore open-fit HAs with a reported duration of use between 1 and 23 yr (M = 9 yr, SD = 5.4 yr), and four wore closed-fit HAs with a reported duration of use between 5 and 18 yr (M = 13.5 yr, SD = 5.8 yr). One participant wore a full-shell in-the-ear (ITE) type HA with a reported duration of use of 35 yr. Hearing aid information for the one remaining participant was unavailable at collection post-study. Refer to the supplementary material for individual information pertaining to HA use.2
Unlike in the previous experiment, where HA users were advised to wear their hearing aids for the entire duration of the experiment, here, the participants were asked to keep their HAs on during one phase, consisting of the ten trials for each of the three mixing effects, and to remove them in the other phase of the experiment. Whether a participant wore their hearing aids in the first or second phase was counterbalanced across participants. Furthermore, the order of the blocks and the choice of the respective audio excerpts were completely randomized.
B. Results and discussion
Figures 8(A)–8(C) illustrate the results of experiment 2. Paired t-tests were performed to assess differences between preferences in the wHA and woHA conditions. For LAR, the test indicated significantly higher average preferences without HAs (M = 13 dB, SD = 8 dB) than with HAs (M = 8 dB, SD = 8 dB), t(17) = 2.4, p = 0.028 < 0.05, d = 0.6 (medium effect). The same trend was observed for spectral balance preferences, with an elevated SPBal without HA use (M = 0.64 dB/Oct, SD = 1 dB/Oct) compared to with HA use (M = –0.74 dB/Oct, SD = 1.2 dB/Oct), t(17) = 3.62, p = 0.002 < 0.01, d = 0.9 (large effect).
Finally, EQ-transform preferences were similarly elevated without HA use (M = 156%, SD = 101.3%) compared to with HA use (M = 74%, SD = 70.4%), t(17) = 2.3, p = 0.03 < 0.05, d = 0.5 (medium effect). That is, a shift from under-mixing to over-mixing in EQ-transform preferences can be observed upon removal of the HAs. Together with the previous observation, this suggests that the removal of HAs led the participants to prefer spectrally sparser mixes.
To analyze the relationship between mixing effect preferences and the level of hearing loss, the data from both experiments were pooled, using the data of the NH and woHA participants from experiment 1 and the data of the woHA condition from experiment 2. Note that the participants in the second experiment (M = 46 dB HL, SD = 9 dB HL) had around 17 dB HL higher hearing loss than the woHA participants in experiment 1. A significant positive correlation between mean HL and LAR preferences was observed, r(51) = 0.7, p < 0.001. Mean HL and EQ-transform preferences were similarly positively correlated, r(51) = 0.5, p < 0.001, and a marginal correlation was observed for SPBal preferences, r(51) = 0.3, p = 0.06. Figures 9(A)–9(C) illustrate these correlations in the pooled data from the two experiments. Taken together, the data from both experiments suggest a monotonic relationship between the degree of hearing loss and mixing effect preferences for LAR and the EQ-transform. With participants wearing HAs, somewhat expectedly, this relation no longer holds because HAs arguably compensate for the hearing impairment. Consequently, it seems highly beneficial to seek different music processing strategies for HI participants with and without HAs.
V. GENERAL DISCUSSION
In this study, we sought to establish preferences with regard to basic mixing effects in HI individuals. To this end, individual preferences for the LAR, spectral balance, and EQ-transform effects were assessed in a sample of NH and HI listeners. In experiment 1, the HI participants were grouped into those who did and those who did not wear HAs. We observed that HI participants with and without HAs preferred an increased LAR compared to NH participants. We did not observe pronounced effects of spectral balance between groups but rather substantial individual differences for NH participants. The EQ-transform implemented in this study (linearly extrapolating between the available EQ in the mix and a reference spectrum) was shown to significantly affect spectral sparsity, measured here using the Gini index. Results showed that all three participant groups preferred under-mixing or an EQ-transform setting of less than 100% on average (i.e., less sparse than the original mix). Yet, this observation was only significant among the NH participants, who preferred mean EQ-transforms of 20% below the factory settings. In experiment 2, targeting only participants with HAs, preferences were recorded with and without their HAs. The use of HAs resulted in a 5 dB reduction in LAR. Moreover, HA use also yielded a significant reduction in the spectral balance preference: a balance of –0.7 dB/Oct favoring low frequencies was observed with HA use, and a similar positive balance of around +0.7 dB/Oct favoring high frequencies without HA use. When the NH and woHA data from experiment 1 and the woHA data from experiment 2 were pooled, significant positive correlations of LAR and EQ-transform preferences with mean hearing loss were observed. These results suggest that with increasing hearing loss, participants had a greater affinity for louder lead vocals in the mix. Moreover, with increasing hearing loss, spectrally sparser mixes were also favored.
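The Gini index used here as a sparsity measure can be computed from a magnitude spectrum as follows. This is a common definition from the sparsity literature (0 for a perfectly flat, dense spectrum; approaching 1 for a single spectral peak); the paper's exact computation over the mix spectra is not specified here, so the helper is illustrative.

```python
import numpy as np

def gini_index(spectrum):
    """Gini index of a magnitude spectrum as a sparsity measure.

    Returns 0 for a flat (maximally dense) spectrum and values
    approaching 1 as the energy concentrates in few bins.
    """
    x = np.sort(np.abs(np.asarray(spectrum, float)))  # ascending order
    n = len(x)
    k = np.arange(1, n + 1)
    # Standard sorted-sample formula for the Gini index.
    return 1.0 - 2.0 * np.sum((x / x.sum()) * (n - k + 0.5) / n)
```

Under this measure, an EQ-transform setting above 100% (extrapolating past the reference spectrum) increases the index, i.e., yields a sparser mix spectrum.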
From experiment 1, it was evident that, on average, HI participants from both groups preferred a LAR of 2 dB, with a statistically significant difference between NH and wHA participants. According to Pons et al. (2016), a small sample of CI users, on average, preferred an instruments-to-vocals ratio of −1.92 dB (translating into a 1.92 dB LAR), similar to that found here. This also underpins similar findings by Buyens et al. (2014). In the present experiment, however, NH participants preferred the lead vocals to be merely a decibel lower than the accompaniment, consistent with other recent findings (Tahmasebi et al., 2020). In experiment 2, wHA participants also preferred the lead vocals louder than the accompaniment, but even more so when the HAs were not used (yielding an increase of around 5 dB). Here, the fact that wHA participants preferred the lead vocals to be 8 dB louder than the accompaniment is similar to what Tahmasebi et al. (2020) showed for CI users. Together with the within-subjects comparison of experiment 2, our results suggest that the LAR, in its effect on vocal level, is an important feature to consider when adjusting music mixes for mildly to severely unaided HI listeners, with a tendency toward higher preferred LARs for listeners with higher degrees of hearing loss. This appears to be very plausible, given the exceptional status of vocals in popular music and their role in conveying the lyrics of songs (Condit-Schultz and Huron, 2015).
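As a concrete illustration, applying a target LAR to a two-stem mix amounts to a simple gain on the lead track. The helper below and its broadband-RMS level definition are assumptions for the sake of the example, not the study's exact implementation.

```python
import numpy as np

def apply_lar(lead, accompaniment, lar_db):
    """Scale the lead stem so that the lead-to-accompaniment level ratio
    of the mix equals lar_db, with levels defined as broadband RMS
    (assumed). Returns the mix and the linear gain applied to the lead."""
    lead = np.asarray(lead, float)
    acc = np.asarray(accompaniment, float)
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    current_db = 20.0 * np.log10(rms(lead) / rms(acc))   # current LAR
    gain = 10.0 ** ((lar_db - current_db) / 20.0)        # gain to target
    return gain * lead + acc, gain
```

For instance, a target of +2 dB (the average HI preference in experiment 1) raises the lead until its RMS level exceeds that of the accompaniment by 2 dB, regardless of the stems' original balance.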
Spectral balance preferences dictate the frequency weighting of the audio excerpt, the perceptual effects of which have been described in terms of brightness perception (Saitis and Siedenburg, 2020). In experiment 1, it was observed that woHA participants preferred an elevated spectral balance favoring higher frequencies in the mix, whereas wHA participants, on average, preferred a spectral balance of 0 dB/Oct, corresponding to the unperturbed factory mix. Although NH participants also preferred similar spectral balance settings on average, their choices varied widely, and the mean participant-specific choices over excerpts bore a significantly greater variance than the mean excerpt-specific choices. The strength of these individual differences is rather surprising, given that previous work depicted rather steep psychometric functions of spectral balance in the range of −2 to 2 dB (Siedenburg et al., 2021a). Experiment 2 followed up on this observation by revealing a significant reduction of spectral balance values resulting from hearing aid use: wHA participants preferred negative spectral balances (–0.7 dB/Oct) on average, and when they removed their HAs, a commensurately positive value of +0.7 dB/Oct was preferred on average. According to Thrailkill et al. (2019), loudness perception among hearing aid users with sensorineural hearing impairment is dominated by higher frequencies and does not change with experience with these devices. The reduced spectral balance preference of such listeners shown here may indicate that they counter the compensation of high-frequency hearing loss provided by the hearing aids.
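A spectral balance of this kind can be sketched as an FFT-domain gain that is linear in dB per octave about a 1 kHz pivot: positive tilts emphasize frequencies above 1 kHz, negative tilts those below it, and the gain at the pivot is 0 dB. The study's actual filter design is not given here, so this is an illustrative sketch.

```python
import numpy as np

def spectral_balance(signal, tilt_db_per_oct, fs=44100, pivot_hz=1000.0):
    """Apply a spectral tilt of tilt_db_per_oct about a pivot frequency.

    A tilt of 0 dB/Oct leaves the signal unchanged; +1 dB/Oct boosts
    each octave above the pivot by 1 dB and attenuates each octave
    below it by 1 dB (FFT-domain sketch, zero-phase).
    """
    x = np.asarray(signal, float)
    n = len(x)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    # Distance from the pivot in octaves; clamp DC to the first bin.
    octaves = np.log2(np.maximum(freqs, freqs[1]) / pivot_hz)
    gain = 10.0 ** (tilt_db_per_oct * octaves / 20.0)
    return np.fft.irfft(np.fft.rfft(x) * gain, n=n)
```

With a +6 dB/Oct tilt, for example, a 2 kHz tone (one octave above the pivot) is boosted by 6 dB while a 250 Hz tone (two octaves below) is attenuated by 12 dB.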
The EQ-transform implementation discussed here was shown to bring about significant changes in the spectral sparsity of the multitrack mixes. Although we did not observe statistically robust effects in experiment 1, experiment 2 showed a significant elevation of EQ-transform preferences when HAs were removed, implying that the participants preferred spectrally sparser mixes without their HAs. Furthermore, the pooled data showed a significant positive correlation between mean HL and preferred EQ-transform settings, indicating a preference for greater spectral sparsity with increasing mean HL. Recent studies (Abdulla and Jayakumari, 2022; O'Grady et al., 2005) have demonstrated that spectro-temporal sparsity appears to be a pivotal factor in source separability, in that sparse representations in the time or frequency domain improve the performance of blind source separation algorithms. Here, we first observed that HI listeners may prefer higher levels of spectral sparsity, and with the so-called EQ-transform, we presented an algorithm that yields corresponding changes in multitrack mixes.
We acknowledge that a major limitation of this study was that it did not include a more comprehensive input/output characterization of the hearing aids, including insertion gains, which might have allowed us to derive a deeper understanding of the auditory mechanisms involved in the perception of multitrack music. However, our primary goal was to suggest and explore novel strategies for remixing music that may lead to higher degrees of music appreciation among HI subjects. Furthermore, this study did not address issues pertaining to individual differences in loudness growth functions (e.g., Marozeau and Florentine, 2007). Specifically, Florence (2017) showed that HI participants had steeper loudness growth than NH participants for sounds of low to moderate intensity, even when the former were fitted with compression HAs aimed at compensating for the reduced compressive nonlinearity of the cochlea resulting from sensorineural hearing impairment. Evaluating the loudness growth curves of the participants in this study may have provided additional insight into individual preferences for the mixing effects. Finally, we wish to acknowledge that this study did not specifically control for loudness saturation in the HAs, brought on by the possibly high crest factors of music signals. This can be a critical issue for music listening with HAs and is especially problematic in older HAs with narrower input headroom. It should be noted, however, that the maximal presentation levels of our stimuli did not appear to be problematic (see the supplementary material2), such that we do not expect hearing aid saturation to have been a critical issue in the present study, even though it is certainly an important factor to consider for real-life music listening with HAs.
Overall, with this study, we made a first attempt to explore the mixing preferences of NH and HI listeners with and without HAs. Despite substantial individual differences among NH and HI listeners, we observed consistent choices of mixing parameters that extend previous work on CI listeners (Buyens et al., 2014; Tahmasebi et al., 2020) toward mildly to moderately HI listeners. Furthermore, with the so-called EQ-transform, a straightforward spectral effect was introduced that appears to be a promising tool for the individualization of multitrack mixes. In follow-up studies, we seek to test the objective performance of HI and NH listeners in music scene analysis tasks to assess the effects of our implementation on source transparency. In particular, participants' ability to determine whether a cued instrument or vocal is present within a given excerpt of a mix will be assessed with different EQ-transform settings, which, as shown in this study, yield significant differences in spectral sparsity. Furthermore, we seek to explore whether combinations of the mixing effects, which were only tested in isolation in this study, may provide synergistic results that could offer participants a richer palette of audio manipulations to adjust according to their preferences.
The main contribution of this study was to evaluate music mixing preferences in a sample of HI listeners with and without HAs. In addition to suggesting that previous findings on LAR preferences of CI listeners extend toward HA users, it was also shown that there are distinct preferences regarding the setting of spectral mixing effects. Importantly, with the EQ-transform, we proposed a new spectral transformation of multitrack music signals. Our results suggest that HI listeners prefer spectrally sparser mixes as evinced by preferences toward increased EQ-transform settings. Generally, our findings indicate that the individualization of level- and spectral-based mixing effects may yield enhanced music appreciation for listeners with hearing loss.
The authors would like to thank all of the participants for having participated in this study and Hörzentrum Oldenburg gGmbH for their support. This study was funded by a Freigeist Fellowship to K.S. from the Volkswagen Stiftung.
For sound examples, see https://uol.de/en/music-perception/sound-examples/mixing-transforms-for-hearing-impaired-listeners#c457967 (Last viewed July 13, 2023).
See Sec. 1 of the supplementary material at https://doi.org/10.1121/10.0020269 for the list of audio excerpts used in experiment 1; see Sec. 6.1 for information pertaining to HA use among wHA in experiment 1; see Sec. 5 for minimum and maximum sound pressure levels of the excerpts measured at the participant position; see Sec. 3 for an illustration of p-values from inter-excerpt comparisons in experiment 1; see Sec. 4 for the error residual plots of the LME (linear mixed effects) model used; see Sec. 6.2 for information pertaining to HA use among wHA in experiment 2; and see Sec. 2 for the list of audio excerpts used in experiment 2.