Because a reference signal is often unavailable in real-world scenarios, reference-free speech quality and intelligibility assessment models are important for many speech processing applications. Although a great number of deep-learning models have been applied to build non-intrusive speech assessment approaches and have achieved promising performance, studies focusing on hearing-impaired (HI) subjects remain limited. This paper presents HASA-Net+, a multi-objective non-intrusive hearing-aid speech assessment model, building upon our previous work, HASA-Net. HASA-Net+ improves HASA-Net in several ways: (1) inclusivity for both normal-hearing and HI listeners, (2) integration with pre-trained speech foundation models and fine-tuning techniques, (3) expansion of predictive capabilities to cover speech quality and intelligibility in diverse conditions, including noisy, denoised, reverberant, dereverberated, and vocoded speech, thereby evaluating its robustness, and (4) validation of the generalization capability using an out-of-domain dataset.
I. INTRODUCTION
Speech quality and intelligibility assessments serve as important tools for a variety of speech-related applications (Cooper et al., 2024), such as speech enhancement (SE) (Loizou, 2007), teleconferencing (Yi et al., 2022), voice conversion and text-to-speech (Huang et al., 2022), and hearing aids (Barker et al., 2022). Speech quality refers to the pleasantness or naturalness of a speech signal, while speech intelligibility measures how well the content of the speech can be understood. A straightforward approach to measure speech quality or intelligibility is to conduct listening tests, where speech signals are presented to a group of listeners who are then asked to score the quality or recognize the words. The mean opinion score (MOS), which ranges from one to five, is a widely used criterion to assess speech quality. Although listening tests are considered the most accurate method, they are time-consuming and expensive when conducted on many subjects. Therefore, objective metrics have been proposed and used as substitutes for listening tests.
Objective speech assessments can be roughly divided into two categories, namely, intrusive and non-intrusive. Intrusive methods compare degraded or processed speech to the clean reference to estimate perceived speech quality or intelligibility. Representatives for speech quality assessment include perceptual evaluation of speech quality (PESQ) (Rix et al., 2001), perceptual objective listening quality analysis (POLQA) (Beerends et al., 2013), hearing aid speech quality index (HASQI) (Kates and Arehart, 2014), and signal-to-distortion ratio (SDR) (Vincent et al., 2006). For speech intelligibility assessment, some commonly used methods include the articulation index (AI) (French and Steinberg, 1947), speech intelligibility index (SII) (Pavlovic, 1987), speech transmission index (STI) (Steeneken and Houtgast, 1980), short-time objective intelligibility (STOI) (Taal et al., 2011), extended short-time objective intelligibility (ESTOI) (Jensen and Taal, 2016), and hearing aid speech perception index (HASPI) (Kates and Arehart, 2021). Although intrusive methods show a higher correlation with human ratings, they are not practical for real-world scenarios, as clean speech may not always be available. On the other hand, non-intrusive methods estimate perceived speech quality or intelligibility directly from the degraded or processed speech without a clean reference. ITU-T P.563 (Malfait et al., 2006), the speech-to-reverberation modulation ratio (SRMR) (Falk et al., 2010), and SRMR for hearing aids (SRMR-HA) (Suelzle et al., 2013) are examples of non-intrusive speech quality measures. Non-intrusive speech intelligibility measures include non-intrusive short-time objective intelligibility (NI-STOI) (Andersen et al., 2017).
In recent years, non-intrusive models based on deep learning (DL) have shown significant progress in speech quality and intelligibility assessment. These models aim to minimize the loss between the predicted values and the ground-truth values of various measures without requiring a clean reference. These approaches can be classified into two categories based on their assessment target. The first category focuses on predicting human subjective ratings. A MOS prediction model (Lo et al., 2019) was designed to predict MOS for converted speech, and several neural network architectures were investigated, including bidirectional long short-term memory (BLSTM), convolutional neural network (CNN), and convolutional neural network-bidirectional long short-term memory (CNN-BLSTM). The deep noise suppression mean opinion score (DNSMOS) (Reddy et al., 2021) used a multi-stage self-teaching approach to predict speech quality. NISQA (Mittag et al., 2021), a CNN-based model, concentrated on communication network distortions and assessed speech quality across five dimensions: overall quality, noisiness, coloration, discontinuity, and loudness. MBNet (Leng et al., 2021) comprised a MeanNet and a BiasNet, which predicted the mean score of an utterance and the difference between the mean score and the listener score, respectively. LDNet (Huang et al., 2022) directly predicted the listener score based on the input speech and listener identity, and integrated listener-dependent modeling for MOS prediction. In Andersen (2018) and Pedersen (2020), CNN architectures were employed for the task of predicting speech intelligibility. The second category focuses on predicting objective metrics. Quality-Net (Fu et al., 2018) predicted PESQ based on BLSTM. Metric-Net (Yu et al., 2021) transformed regression-based PESQ estimation into a multi-class single-label classification problem. STOI-Net (Zezario et al., 2020) utilized the CNN-BLSTM architecture with an attention mechanism to predict STOI. The attention-enhanced multi-task speech assessment model (Dong and Williamson, 2020) was a unified model that predicted multiple objective speech quality and intelligibility scores, including PESQ, STOI, HASQI, and SDR.
Inspired by the human ability to distinguish between the quality of two speech signals regardless of their content differences, a new DL-based framework has been developed. This framework uses non-matching references to predict relative speech assessment scores, in contrast to previous DL-based reference-free methods. Non-matching reference based speech quality assessment (NORESQA) (Manocha et al., 2021) predicted the signal-to-noise ratio (SNR) and scale-invariant signal-to-distortion ratio (SI-SDR) for quality assessment between two signals whose speech content and speakers may differ. Additionally, NORESQA-MOS (Manocha and Kumar, 2022) was a MOS estimation method based on the principles of NORESQA. The use of non-matching references makes these approaches applicable in real-world scenarios, as any arbitrary speech signal can be used as the reference input.
Recently, speech foundation models (SFMs) have attracted considerable attention. Among them, self-supervised learning (SSL) models stand out, demonstrating promising performance across various speech processing applications. SSL models learn feature representations from large amounts of unlabeled data and are applied to downstream tasks. In the realm of speech assessment, several studies have investigated the use of SSL models. For example, Tseng (2021) predicted MOS values utilizing SSL models and a listener identifier termed BiasNet to model the bias of listeners. Cooper (2022) also used SSL models for MOS prediction, demonstrating good generalization through simple fine-tuning. Additionally, Yang (2022) proposed a MOS prediction fusion framework that employs seven SSL models. Furthermore, some studies have leveraged diverse acoustic information from multiple domains. For example, a CNN-BLSTM architecture was utilized to process diverse acoustic features, encompassing inputs from the time-frequency domain, the time domain, and SSL embeddings (Zezario et al., 2022a; Zezario et al., 2022b; Zezario et al., 2022c). Specifically, one study estimated PESQ, STOI, and SDI (Zezario et al., 2022a), another predicted subjective intelligibility scores for binaural hearing aid users (Zezario et al., 2022b), and a third estimated both subjective and objective intelligibility scores (Zezario et al., 2022c). Chen and Tsao (2021) combined the scattering transform and SSL models to predict subjective quality and intelligibility using a Chinese dataset named TMHINT-QI. In addition to SSL models, embeddings from Whisper (Radford et al., 2023), a large-scale pre-trained automatic speech recognition (ASR) model, have been employed to extract representations for speech assessment, showing significant robustness across various speech processing methods (Zezario et al., 2024).
Despite the recent breakthroughs of DL for speech assessment, research has focused primarily on normal-hearing (NH) listeners, with limited attention given to hearing-impaired (HI) listeners. Moreover, studies targeting HI listeners (Salehi et al., 2018; Zezario et al., 2022b; Tu et al., 2022a,b) have mostly predicted either quality or intelligibility rather than both criteria, and the datasets used to train these models have been limited to noisy or enhanced conditions.
This study proposes HASA-Net+, a multi-objective non-intrusive hearing-aid speech assessment model that builds on our previous work, HASA-Net (a non-intrusive hearing-aid speech assessment network) (Chiang et al., 2021). In HASA-Net, spectrograms and hearing-loss patterns were used as input to predict speech quality and intelligibility in noisy conditions for HI listeners. Both HASA-Net and HASA-Net+ are designed to estimate HASQI and HASPI scores, two well-known speech quality and intelligibility assessment metrics designed for HI listeners. While both HASQI and HASPI compare the degraded or processed signal with its clean reference, HASA-Net and HASA-Net+ predict scores without the need for a clean reference. HASA-Net+ improves on HASA-Net in several ways. First, it is a general model that accounts for both NH and HI listeners. Second, it incorporates SFMs (SSL and large-scale weakly supervised pre-trained models) and fine-tuning approaches. Third, it evaluates the model's robustness across five different speech conditions, including noisy, denoised, reverberant, dereverberated, and vocoded speech. Fourth, the model's generalization capability is validated using an out-of-domain (OOD) dataset in zero-shot, few-shot, and full dataset settings. In the zero-shot setting, the model trained on the in-domain dataset is directly tested on OOD data without any adjustments. In the few-shot setting, the model is fine-tuned using a limited amount of OOD data. Finally, in the full dataset setting, the model is trained using the entire set of OOD training data.
The remainder of this paper is organized as follows: We present a brief introduction to HASQI and HASPI and provide an overview of prior research on DL-based speech assessment for HI listeners in Sec. II. We present the proposed HASA-Net+ in Sec. III. Subsequently, we present the experiments in Sec. IV. In Sec. V, we discuss the findings and limitations of this study. Finally, we conclude this work in Sec. VI.
II. RELATED WORK
A. HASQI and HASPI
In this section, we provide an introduction to the current versions of HASQI (version 2) (Kates and Arehart, 2014) and HASPI (version 2) (Kates and Arehart, 2021). HASQI and HASPI are objective metrics commonly used to evaluate speech quality and intelligibility for both NH and HI listeners. Both metrics produce scores between 0 and 1, where higher scores indicate better speech quality or intelligibility. They rely on comparing the output of a model of the auditory periphery for a processed or degraded speech signal to the model output for the corresponding clean reference signal. The auditory periphery model (Kates, 2013) used in these metrics simulates hearing loss in an audiogram-dependent manner, allowing it to represent both NH and HI conditions.
To compute HASQI, the reference signal used is the clean speech passed through a model of the listener's periphery. The degraded or processed signal is also passed through a peripheral model that corresponds to the degree of hearing loss. The assumption for HASQI is that the clean, undistorted signal processed through the listener's periphery with linear amplification (Byrne and Dillon, 1986) to compensate for any loss of audibility will result in the highest speech quality for that listener. The outputs of the auditory models are then used to measure changes in the time-frequency envelope modulation, temporal fine structure, and long-term spectrum. The nonlinear term of HASQI measures the time-frequency envelope modulation and temporal fine structure modifications, while the linear term measures the difference in the long-term spectrum. The final HASQI score is calculated as the product of the nonlinear and linear terms. More detailed information on HASQI can be found in Kates and Arehart (2014).
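Schematically, writing the nonlinear and linear terms described above as Q_nonlin and Q_lin (symbols introduced here only for illustration; see Kates and Arehart, 2014, for the exact formulation), the final score is their product:

```latex
\mathrm{HASQI} = Q_{\mathrm{nonlin}} \times Q_{\mathrm{lin}}, \qquad 0 \le \mathrm{HASQI} \le 1 .
```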
In HASPI, a peripheral model based on the listener's audiogram is used to process the degraded or processed speech, while the reference signal is the clean speech that has been passed through a model of the normal auditory periphery. The assumption for HASPI is that the highest speech intelligibility will be obtained by using the sharp auditory filters and wide dynamic range that are characteristic of the normal auditory periphery (Kates and Arehart, 2022). The outputs of the models are analyzed using an envelope modulation-rate filterbank, and an ensemble neural network is employed to fit the subjective intelligibility scores. This differs from the original version of HASPI (version 1), which measures the outputs using a lowpass filter to determine the cepstral correlation and temporal fine structure, and employs a parametric model to fit the subjective intelligibility scores using the combined measurements of cepstral correlation and temporal fine structure. For more comprehensive details regarding both HASPI versions, please refer to Kates and Arehart (2014) and Kates and Arehart (2021).
It should be noted that due to its improved performance in reverberant environments, the latest version of HASPI (version 2) is employed to evaluate speech intelligibility in HASA-Net+, whereas the original version of HASPI (version 1) was used in HASA-Net. For quality assessment, both HASA-Net and HASA-Net+ employ the current version of HASQI (version 2).
B. DL-based speech assessment method for hearing-aid users
In this section, we review a number of DL-based methods for speech assessment focusing on hearing-aid users. Considering that several studies (Schädler et al., 2015; Fontan et al., 2017; Arai et al., 2020) have demonstrated that DL-based ASR models achieve speech recognition performance comparable to that of humans and exhibit similar speech recognition patterns, a feasible strategy is to utilize DL-based ASR for speech assessment. Karbasi (2020) proposed a no-reference intelligibility (NORI) framework based on a hidden Markov model (HMM)-based ASR model, which included two ASR-based discriminant measures for predicting speech intelligibility in noisy environments for individuals with normal and impaired hearing. Tu and Barker (2022b) calculated the similarity between the hidden representations, obtained from ASR models, of the clean reference and the processed signal to predict intelligibility in the first round of the Clarity Prediction Challenge (Graetzer et al., 2021).
Another approach utilizes DL models to non-intrusively estimate the output of subjective or objective metrics. For speech quality, Liang (2023) used a CNN to extract features from gammatone filterbank energies and employed multi-task learning, with quality classification as an auxiliary task, to aid in predicting speech quality. For speech intelligibility, a non-intrusive multi-branched speech intelligibility prediction model for hearing aids (Zezario et al., 2022b) was developed, which includes an MSBG hearing loss model, a cross-domain feature extraction module, and a speech intelligibility prediction model to predict subjective intelligibility scores for binaural hearing-aid users. Kamo (2022) presented a prediction model based on the Conformer architecture that integrates audio, transcription, and various listener characteristics (audiogram, age, gender, etc.) to predict subjective intelligibility scores of noisy speech processed by hearing aids. Our previous model, HASA-Net, jointly predicted speech quality and intelligibility scores, specifically HASQI and HASPI, by using the spectrogram together with the hearing-loss pattern as an extra input.
III. HASA-NET+
A. Speech foundation models
SFMs have achieved great success in speech processing. Among SFMs, SSL models learn meaningful representations from vast amounts of unlabeled data and can be categorized into three groups: generative modeling, discriminative modeling, and multi-task learning. Generative modeling relies on the network to reconstruct masked frames (Liu et al., 2020) or predict future frames (Chung et al., 2019; Chung and Glass, 2020). Discriminative modeling employs contrastive learning (Schneider et al., 2019; Baevski et al., 2020) or classifies pseudo labels (Hsu et al., 2021; Chen et al., 2022a) to learn meaningful speech information. Multi-task learning has been applied in Ravanelli (2020), where meaningful speech information is learned via multiple training objectives.
Another notable SFM, Whisper (Radford et al., 2023), a large pre-trained model based on weak supervision, has recently been developed and has demonstrated strong potential in producing robust acoustic features across diverse datasets. This capability stems from the use of a vast number of audio transcripts spanning various languages and tasks. In contrast to SSL models, the weak supervision method incorporates actual transcripts during training. Consequently, the audio features produced by Whisper are expected to carry richer phonetic information, making it an intriguing prospect to explore which of these features are most reliable for building non-intrusive speech assessment models.
B. Hearing-loss patterns
Hearing loss is measured with an audiogram, a clinical measure of the degree of inaudibility at different frequency regions. The audiogram has the hearing threshold, measured in dB HL, on the y axis and frequency, measured in Hertz, on the x axis. A threshold above 20 dB HL at any frequency is considered indicative of hearing loss. We selected six features from the audiogram and used them to form a hearing-loss pattern, where each feature describes the hearing threshold at a specific frequency. The six frequencies we chose were 250, 500, 1000, 2000, 4000, and 6000 Hz. We incorporated these hearing-loss patterns as another input of the HASA-Net+ model.
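As a minimal sketch (the helper below is hypothetical and not taken from the original implementation), the hearing-loss pattern is simply the vector of the six audiogram thresholds in a fixed frequency order:

```python
import numpy as np

# Six audiogram frequencies used to form the hearing-loss pattern (Hz).
AUDIOGRAM_FREQS_HZ = [250, 500, 1000, 2000, 4000, 6000]

def hearing_loss_pattern(audiogram: dict) -> np.ndarray:
    """Return the 6-dim hearing-loss pattern (thresholds in dB HL) in fixed frequency order."""
    return np.array([audiogram[f] for f in AUDIOGRAM_FREQS_HZ], dtype=np.float32)

# Example: a sloping loss with thresholds rising toward high frequencies;
# an all-zero vector (0 dB HL at every frequency) denotes normal hearing.
pattern = hearing_loss_pattern({250: 10, 500: 15, 1000: 25, 2000: 40, 4000: 55, 6000: 65})
```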
C. HASA-Net+ framework
HASA-Net+ is designed to take both the SFM latent representations obtained from the speech signal and the hearing-loss pattern extracted from the audiogram as inputs. These inputs are then utilized to predict the corresponding objective quality and intelligibility scores, namely HASQI and HASPI, which serve as the ground-truth metrics. HASA-Net+ is a modified version of HASA-Net (Chiang et al., 2021) with three changes to its initial design. First, HASA-Net+ operates on the raw waveform, whereas HASA-Net utilizes time-frequency spectrograms. The second modification involves the processing of the hearing-loss pattern. In contrast to HASA-Net, where the hearing-loss pattern is used directly, in HASA-Net+ it undergoes additional processing: the hearing-loss pattern is passed through a dense layer, which increases its dimensionality from 6 to 256. The third modification involves a different approach to combining the inputs before feeding them into the BLSTM layer. In HASA-Net, the input to the BLSTM layer was obtained by concatenating the time-frequency spectrogram and the hearing-loss pattern. In contrast, in HASA-Net+, the latent representations obtained from the SFM and the hearing-loss pattern processed through the dense layer are combined by addition, and the resulting merged features are passed into the BLSTM layer.
In Fig. 1, the left branch of the input corresponds to the SFM representations obtained by feeding the raw waveform into the SFM model. For each frame, the SFM representation is a weighted sum of the representations from all transformer encoder layers, and it is passed through a dense layer to obtain a dimensionality of 256. The right branch of the input represents the hearing-loss pattern, which describes the hearing thresholds at six specific frequencies based on the audiogram. This six-dimensional pattern is passed through a dense layer, increasing its dimensionality to 256. After additively merging the SFM representation and the hearing-loss representation, the resulting merged features are fed into a BLSTM with 100 nodes, followed by a dense layer with 128 rectified linear unit (ReLU) nodes. Finally, the output of the dense layer is used for two separate tasks: quality prediction and intelligibility prediction. For each task, a multi-head attention mechanism, a dense layer consisting of one node with a sigmoid function, and a global average pooling layer are applied to generate the final prediction. The outputs of the dense layer and the global average pooling layer are frame-level predictions and utterance-level predictions, respectively.
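The following PyTorch sketch illustrates the prediction head described above. It is an approximation under stated assumptions rather than the authors' implementation: the stack of SFM transformer-layer outputs is assumed to have been extracted beforehand, the softmax normalization of the learnable layer weights and the number of attention heads are our own choices, and all module names are illustrative.

```python
import torch
import torch.nn as nn

class HASANetPlusHead(nn.Module):
    """Sketch of the HASA-Net+ prediction head (quality and intelligibility)."""

    def __init__(self, num_layers: int, feat_dim: int, n_heads: int = 4):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # learnable layer weights
        self.feat_proj = nn.Linear(feat_dim, 256)                   # SFM branch -> 256
        self.hl_proj = nn.Linear(6, 256)                            # hearing-loss branch -> 256
        self.blstm = nn.LSTM(256, 100, batch_first=True, bidirectional=True)
        self.dense = nn.Sequential(nn.Linear(200, 128), nn.ReLU())
        # One attention block and one frame-level output layer per task.
        self.att_q = nn.MultiheadAttention(128, n_heads, batch_first=True)
        self.att_i = nn.MultiheadAttention(128, n_heads, batch_first=True)
        self.out_q = nn.Sequential(nn.Linear(128, 1), nn.Sigmoid())
        self.out_i = nn.Sequential(nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, layer_reps, hl_pattern):
        # layer_reps: (num_layers, batch, frames, feat_dim); hl_pattern: (batch, 6)
        w = torch.softmax(self.layer_weights, dim=0)
        fused = (w[:, None, None, None] * layer_reps).sum(dim=0)          # weighted sum over layers
        x = self.feat_proj(fused) + self.hl_proj(hl_pattern)[:, None, :]  # additive merge
        x, _ = self.blstm(x)
        x = self.dense(x)
        frame_q = self.out_q(self.att_q(x, x, x)[0]).squeeze(-1)          # frame-level quality
        frame_i = self.out_i(self.att_i(x, x, x)[0]).squeeze(-1)          # frame-level intelligibility
        return frame_q.mean(dim=1), frame_i.mean(dim=1)                   # global average pooling
```

Swapping in a different SFM only changes how `layer_reps` is produced; the head itself is unchanged.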
IV. EXPERIMENTS
A. Dataset creation
We utilized the clean speech utterances from the VCTK-DEMAND corpus (Valentini et al., 2016) for the in-domain task and the TIMIT corpus (Garofolo et al., 1988) for the OOD task. For both tasks, the data used in the experiments comprised a variety of speech types, including noisy, enhanced, reverberant, dereverberated, and vocoded speech. We provide a comprehensive description of the setup and data creation process for each task below.
1. In-domain task
In the original VCTK-DEMAND corpus, the training set contains 11 572 utterances from 28 British-accent speakers, and the testing set (hereafter referred to as Test) contains 824 speech utterances from the other two speakers. We selected two speakers (p226 and p287) from the training set to form a validation set of 770 utterances (hereafter referred to as Val; these validation data were used to ensure that the speech enhancement model was well trained), while the remaining 10 802 utterances (hereafter referred to as Train) were used for training the DL-based enhancement and dereverberation models. Note that the utterances used to train the DL-based enhancement and dereverberation models were not used in the HASA-Net+ experiments. We combined Val and Test to construct a new dataset (hereafter referred to as SetHASA) containing 1594 (770 + 824) clean speech utterances to generate data for testing HASA-Net+.
To train HASA-Net+, we needed to prepare enhanced speech signals. This required developing an SE model to process noisy data. We adopted MetricGAN+ (Fu et al., 2021) as the SE model, and trained it using 10 802 noisy-clean utterance pairs constructed from Train. Each clean speech utterance in Train and Val was corrupted with one of ten different noises (Noise10) at one of four SNRs (15, 10, 5, and 0 dB) to generate a corresponding simulated noisy speech utterance, as in Valentini et al. (2016). The best model was determined by the performance on Val, and then used to generate enhanced speech for SetHASA (Val + Test). Note that the noisy speech utterances in Test were generated in the same way as Train and Val, but with the other four SNRs (17.5, 12.5, 7.5, and 2.5 dB) and five noises (Noise5) from the DEMAND database (Thiemann et al., 2013), as in Valentini et al. (2016).
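As an illustration of the noisy-data generation step (a generic sketch, not the authors' scripts), a clean utterance can be mixed with a noise segment at a target SNR as follows:

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that the clean/noise power ratio equals `snr_db`, then add it to `clean`."""
    noise = np.resize(noise, clean.shape)                      # crop or tile the noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

# Example with random placeholders standing in for real waveforms: a 5 dB SNR mixture.
noisy = mix_at_snr(np.random.randn(16000), np.random.randn(16000), snr_db=5.0)
```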
In addition to enhanced speech, we also prepared dereverberated speech to train HASA-Net+. We adopted MetricGAN-U (Fu et al., 2022) as the dereverberation model, and trained it using reverberant-clean utterance pairs constructed from Train. We generated reverberant speech utterances by convolving each clean speech utterance in Train and Val with one of 315 real room impulse responses (RIRs) (RIR315) and one rir_scale_factor in (0.75, 0.85, 0.95, 1.05, and 1.15) using the AddReverb function in the SpeechBrain (Ravanelli et al., 2021) toolkit, following the setting in Fu (2022). The best model was determined by the performance on Val, and then used to generate dereverberated speech for SetHASA (i.e., Val + Test). Note that the reverberant speech utterances in Test were generated in the same way as Train and Val, but with the other ten RIRs (RIR10) and five rir_scale_factors (0.8, 0.9, 1.0, 1.1, and 1.2), as in Fu (2022). To create the vocoded sets, we applied the tone vocoder and noise vocoder to SetHASA. Specifically, we generated half of the data using the tone vocoder, while the other half were produced using the noise vocoder.
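Similarly, a reverberant utterance can be simulated by convolving clean speech with an RIR. The sketch below is a generic approximation; the actual data were generated with SpeechBrain's AddReverb, and the handling of rir_scale_factor is omitted here.

```python
import numpy as np
from scipy.signal import fftconvolve

def add_reverb(clean: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve clean speech with an RIR and trim the result to the original length."""
    reverberant = fftconvolve(clean, rir, mode="full")[: len(clean)]
    # Rescale to roughly preserve the original peak level.
    return reverberant * (np.max(np.abs(clean)) / (np.max(np.abs(reverberant)) + 1e-12))

# Example with a synthetic exponentially decaying noise burst standing in for a measured RIR.
reverb = add_reverb(np.random.randn(16000),
                    np.random.randn(4000) * np.exp(-np.arange(4000) / 800.0))
```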
As mentioned above, the dataset used in the HASA-Net+ experiments was constructed based on SetHASA (i.e., Val + Test). Each clean speech utterance in SetHASA was associated with five different conditions: noisy, denoised, reverberant, dereverberated, and vocoded. In total, the dataset contained 7970 (1594 × 5) utterances. Fivefold cross-validation was employed in the HASA-Net+ experiments. Since the training and evaluation of HASA-Net+ are performed on different partitions of the same dataset, this is called an in-domain task.
2. OOD task
In the standard setting of the TIMIT corpus, the training set contains 4680 utterances from 90 speakers, and the testing set contains 1560 utterances from the other 30 speakers. We selected 200 clean utterances from the testing set to train the DL-based enhancement and dereverberation models.
We employed a BLSTM-based model to generate enhanced speech. The model has an architecture similar to the generator of the MetricGAN+ (Fu et al., 2021) model used in the in-domain task. It consists of two BLSTM layers (300 nodes each) followed by a dense layer (257 nodes) with sigmoid activation for mask estimation. The estimated masks are multiplied with the input noisy magnitude spectrogram to generate an enhanced magnitude spectrogram. The inverse short-time Fourier transform (ISTFT) was used to convert the enhanced magnitude spectrogram back to a speech waveform using the noisy speech phase. We generated 4000 (200 × 5 × 4) noisy-clean utterance pairs by corrupting the 200 clean utterances with five noises (Noise5, as in the in-domain task) at four SNRs (15, 10, 5, and 0 dB), and randomly selected 90% for training and the remaining 10% for validation. The dereverberation model follows the same architecture as the SE model, but takes as input the magnitude spectrogram of reverberant speech. The ISTFT was used to convert the dereverberated magnitude spectrogram back to a speech waveform using the reverberant speech phase. We generated 10 000 (200 × 10 × 5) reverberant-clean utterance pairs by convolving the 200 utterances with ten RIRs (RIR10, as in the in-domain task) and five rir_scale_factors (0.75, 0.85, 0.95, 1.05, and 1.15), and randomly selected 90% for training and the remaining 10% for validation.
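A minimal PyTorch sketch of the BLSTM-based mask estimator described above (two 300-node BLSTM layers and a 257-node sigmoid output layer) is shown below; it illustrates the stated architecture and is not the authors' code.

```python
import torch
import torch.nn as nn

class BLSTMMaskEstimator(nn.Module):
    """Sketch of the BLSTM mask-based enhancement (or dereverberation) model."""

    def __init__(self, n_freq: int = 257):
        super().__init__()
        self.blstm = nn.LSTM(n_freq, 300, num_layers=2, batch_first=True, bidirectional=True)
        self.mask = nn.Sequential(nn.Linear(600, n_freq), nn.Sigmoid())

    def forward(self, noisy_mag):
        # noisy_mag: (batch, frames, 257) magnitude spectrogram
        h, _ = self.blstm(noisy_mag)
        return self.mask(h) * noisy_mag  # masked (enhanced) magnitude spectrogram

# The enhanced magnitude is then combined with the noisy (or reverberant) phase
# and converted back to a waveform with the ISTFT.
```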
The original training set (4680 clean utterances) and the remaining testing set (1360 clean utterances) of the TIMIT corpus were used to generate data for the HASA-Net+ experiments. The original training set was divided into a training set of 4220 clean utterances and a validation set of 460 clean utterances. For each clean utterance in the training and validation sets, the corresponding noisy speech was generated by corrupting it with one of ten noises (Noise10, as in the in-domain task) at one of four SNRs (15, 10, 5, and 0 dB), while for each clean speech utterance in the testing set, the corresponding noisy speech was generated by corrupting it with one of the other five noises (Noise5, as in the in-domain task) at one of seven SNRs (17.5, 12.5, 7.5, 2.5, −2.5, −7.5, and −12.5 dB). The BLSTM-based enhancement model was used to generate the corresponding enhanced utterances from these noisy utterances. For each clean utterance in the training and validation sets, the corresponding reverberant speech was generated by convolving it with one of 315 RIRs (RIR315, as in the in-domain task) and one rir_scale_factor in (0.75, 0.85, 0.95, 1.05, and 1.15), while for each clean speech utterance in the testing set, the corresponding reverberant speech was generated by convolving it with one of the other ten RIRs (RIR10, as in the in-domain task) and one rir_scale_factor in (0.8, 0.9, 1.0, 1.1, and 1.2). The BLSTM-based dereverberation model was used to generate the corresponding dereverberated utterances from these reverberant utterances. The vocoded versions of the clean utterances were generated in the same way as in the in-domain task, using both the tone vocoder and the noise vocoder: half of the data were processed with the tone vocoder, while the other half were processed with the noise vocoder.
Each utterance had five different conditions (noisy, enhanced, reverberant, dereverberated, and vocoded), and the total numbers of utterances in the training, validation, and testing sets in the HASA-Net+ experiments were 21 100 (4220 × 5), 2300 (460 × 5), and 6800 (1360 × 5), respectively. Since HASA-Net+ is evaluated on this dataset in zero-shot and few-shot settings, in addition to the full dataset setting, this is called an OOD task.
B. Audiograms
In this study, we aim to encompass a broad spectrum of scenarios and therefore generated a diverse range of audiograms. Additionally, we seek to investigate the effectiveness of employing an SFM as compared to our previous work (Chiang et al., 2021). For a fair comparison, we selected audiograms that align with those utilized in Chiang (2021), ensuring consistency with established research parameters and facilitating comparisons with prior studies. The audiograms used in this study are divided into six categories: flat, sloping, rising, cookie-bite, noise-notched, and high-frequency, as shown in Fig. 2. Each category contains seven different audiograms. In addition, we also considered the case where the hearing thresholds at all frequencies are 0 dB HL, indicating NH. As a result, there were 43 audiograms in total: seven hearing-loss audiograms for each of the six categories (42 in total) plus one NH audiogram.
In the HASA-Net+ experiments, these 42 hearing-loss audiograms were divided into different groups for training, validation, and testing. The training audiogram set contained 30 patterns (five audiograms per category), while the testing audiogram set contained 12 patterns (the remaining two audiograms per category). In addition, 12 patterns were selected from the training audiogram set to form the validation audiogram set, with two audiograms for each category. The NH audiogram can be used in training, validation, and testing.
C. Results
In this section, we start by performing preliminary tests on in-domain data to identify the best pre-trained SFM and fine-tuning approach. These evaluations help finalize the configuration of HASA-Net+, which is then used in Sec. IV C 3 to evaluate its generalization capability through OOD data.
1. Evaluation of pre-trained SFMs
The HASA-Net+ model takes both the SFM latent representations extracted from the raw waveform and the hearing-loss pattern obtained from the audiogram as input, and predicts the corresponding objective quality and intelligibility scores. The ground-truth values for quality and intelligibility were determined using HASQI and HASPI, respectively. All clean speech utterances were presented at 65 dB sound pressure level (SPL). To simulate the effect of hearing aid processing, the stimuli were amplified using the National Acoustic Laboratories revised (NAL-R) (Byrne and Dillon, 1986) linear fitting prescriptive formula based on the hearing-loss profiles determined by the audiograms. The evaluation criteria included the mean square error (MSE), linear correlation coefficient (LCC), and Spearman's rank correlation coefficient (SRCC).
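For reference, the three evaluation criteria can be computed from utterance-level predictions and ground-truth scores with NumPy and SciPy, as in the sketch below (the function name is ours):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(pred: np.ndarray, target: np.ndarray) -> dict:
    """Compute MSE, LCC, and SRCC between predicted and ground-truth scores."""
    return {
        "MSE": float(np.mean((pred - target) ** 2)),
        "LCC": float(pearsonr(pred, target)[0]),    # linear correlation coefficient
        "SRCC": float(spearmanr(pred, target)[0]),  # Spearman's rank correlation coefficient
    }
```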
To evaluate the performance of HASA-Net+ on the in-domain task, fivefold cross-validation was employed due to the limited number of unique clean speech utterances (1594). The dataset (consisting of 1594 × 5 utterances) was randomly divided into five partitions, one of which was used for testing, while the remaining four partitions were used for training. During training, each utterance was paired with three audiograms: two audiograms randomly selected from the training audiogram set and the NH audiogram. This pairing yielded 1594 × 4 × 3 combinations. During testing, each utterance was paired with three audiograms: two audiograms randomly selected from the testing audiogram set and the NH audiogram. This pairing yielded 1594 × 1 × 3 combinations. The cross-validation process was repeated five times, and the five results were averaged to obtain the overall performance.
First, we compared two models: HASA-Net+ (integrating a pre-trained SFM) and HASA-Net (using spectrograms as input and serving as our baseline). We used the pre-trained WavLM Large model (Chen et al., 2022a) as the pre-trained SFM and investigated two ways to integrate its representations into the downstream task: using the representations from the last layer (LL) or the weighted sum of the representations of all transformer encoder layers with learnable weights (WS). As a result, we have two versions of HASA-Net+: HASA-Net+ (WavLM-LL) and HASA-Net+ (WavLM-WS). We trained the models using the Adam optimizer with a learning rate of 10^−4.
As shown in Table I, both versions of HASA-Net+ outperform the baseline HASA-Net, indicating the superiority of the representation from the pre-trained SFM. Comparing the two versions of HASA-Net+, it is clear that HASA-Net+ (WavLM-WS) achieves higher correlation values and lower MSE values than HASA-Net+ (WavLM-LL). This suggests that each transformer encoder layer in the WavLM model contains valuable information, and making full use of information from different layers is crucial to achieve the best performance.
TABLE I. Overall quality and intelligibility prediction performance on the in-domain task.

| Model | Quality MSE | Quality LCC | Quality SRCC | Intelligibility MSE | Intelligibility LCC | Intelligibility SRCC |
|---|---|---|---|---|---|---|
| HASA-Net | 0.007 | 0.947 | 0.958 | 0.026 | 0.737 | 0.756 |
| HASA-Net+ (WavLM-LL) | 0.006 | 0.951 | 0.960 | 0.024 | 0.733 | 0.764 |
| HASA-Net+ (WavLM-WS) | 0.005 | 0.955 | 0.969 | 0.019 | 0.808 | 0.829 |
Next, we examined detailed evaluation results for each condition, namely noisy, enhanced, reverberant, dereverberated, and vocoded speech. Tables II and III present the prediction results for speech quality and intelligibility, respectively. It is observed that HASA-Net+ (WavLM-WS) consistently demonstrates superior performance compared to HASA-Net and HASA-Net+ (WavLM-LL) across all conditions. It is also worth noting that the performance of all three models shows a similar trend. Specifically, in terms of quality prediction, all models achieve higher correlation and lower MSE values under enhanced and vocoded conditions compared to noisy, reverberation, and dereverberation conditions. Meanwhile, for intelligibility prediction, all models show better performance under reverberation and dereverberation conditions compared to noisy, enhanced, and vocoded conditions.
TABLE II. Quality prediction performance for each condition on the in-domain task. WavLM-LL and WavLM-WS denote HASA-Net+ (WavLM-LL) and HASA-Net+ (WavLM-WS), respectively.

| Condition | MSE (HASA-Net) | LCC (HASA-Net) | SRCC (HASA-Net) | MSE (WavLM-LL) | LCC (WavLM-LL) | SRCC (WavLM-LL) | MSE (WavLM-WS) | LCC (WavLM-WS) | SRCC (WavLM-WS) |
|---|---|---|---|---|---|---|---|---|---|
| Noisy | 0.008 | 0.871 | 0.936 | 0.008 | 0.836 | 0.937 | 0.005 | 0.878 | 0.958 |
| Enhanced | 0.010 | 0.888 | 0.921 | 0.006 | 0.935 | 0.952 | 0.004 | 0.955 | 0.967 |
| Reverberation | 0.009 | 0.843 | 0.881 | 0.006 | 0.883 | 0.921 | 0.004 | 0.909 | 0.941 |
| Dereverberation | 0.007 | 0.852 | 0.885 | 0.006 | 0.861 | 0.900 | 0.005 | 0.890 | 0.922 |
| Vocoded | 0.004 | 0.908 | 0.967 | 0.005 | 0.912 | 0.957 | 0.004 | 0.933 | 0.970 |
TABLE III. Intelligibility prediction performance for each condition on the in-domain task. WavLM-LL and WavLM-WS denote HASA-Net+ (WavLM-LL) and HASA-Net+ (WavLM-WS), respectively.

| Condition | MSE (HASA-Net) | LCC (HASA-Net) | SRCC (HASA-Net) | MSE (WavLM-LL) | LCC (WavLM-LL) | SRCC (WavLM-LL) | MSE (WavLM-WS) | LCC (WavLM-WS) | SRCC (WavLM-WS) |
|---|---|---|---|---|---|---|---|---|---|
| Noisy | 0.025 | 0.653 | 0.678 | 0.022 | 0.653 | 0.707 | 0.015 | 0.740 | 0.811 |
| Enhanced | 0.021 | 0.641 | 0.642 | 0.018 | 0.664 | 0.701 | 0.014 | 0.738 | 0.778 |
| Reverberation | 0.022 | 0.749 | 0.763 | 0.021 | 0.759 | 0.782 | 0.014 | 0.837 | 0.855 |
| Dereverberation | 0.028 | 0.801 | 0.792 | 0.028 | 0.787 | 0.794 | 0.022 | 0.841 | 0.841 |
| Vocoded | 0.028 | 0.665 | 0.726 | 0.028 | 0.673 | 0.722 | 0.024 | 0.711 | 0.773 |
2. Evaluation of different fine-tuning methods
We used HASA-Net+ (WavLM-WS) to investigate the effectiveness of simultaneously fine-tuning the pre-trained model during HASA-Net+ training, as it achieved the best performance in previous experiments. We evaluated three fine-tuning methods: partial fine-tuning (PF), entire fine-tuning (EF), and the two-stage fine-tuning (2-stage FT) method in Chen (2022b).
In the PF method, the convolutional feature extractor of the WavLM model was frozen, and only the transformer layers were fine-tuned during HASA-Net+ training. In the EF approach, the convolutional feature extractor and transformer layers of the WavLM model were fine-tuned simultaneously during HASA-Net+ training. For the 2-stage FT method, in the first stage, the pre-trained model was fixed and only the parameters of the remaining modules of HASA-Net+ were optimized; in the second stage, the entire HASA-Net+, including the pre-trained model, was fine-tuned. In the PF and EF methods, we used the Adam optimizer with a learning rate of 10^−5 for the pre-trained model and 10^−4 for the remaining modules of HASA-Net+. In the 2-stage FT approach, we applied the Adam optimizer with a learning rate of 10^−4 in the first stage and a lower learning rate of 10^−5 in the second stage. To the best of our knowledge, this is the first attempt to study the performance of various fine-tuning methods applied to a speech assessment model for HI listeners.
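The three fine-tuning schemes can be summarized as different optimizer configurations. The sketch below is illustrative only: the attribute names `feature_extractor` and `encoder` and the helper `configure_optimizer` are our own assumptions, not taken from the WavLM or HASA-Net+ code.

```python
import torch

def configure_optimizer(sfm, head, method: str, stage: int = 1):
    """Return an Adam optimizer reflecting one of the three fine-tuning schemes."""
    if method == "PF":  # partial fine-tuning: freeze the convolutional feature extractor
        for p in sfm.feature_extractor.parameters():
            p.requires_grad = False
        return torch.optim.Adam([
            {"params": sfm.encoder.parameters(), "lr": 1e-5},  # transformer layers
            {"params": head.parameters(), "lr": 1e-4},         # remaining HASA-Net+ modules
        ])
    if method == "EF":  # entire fine-tuning: extractor and transformer layers together
        return torch.optim.Adam([
            {"params": sfm.parameters(), "lr": 1e-5},
            {"params": head.parameters(), "lr": 1e-4},
        ])
    if method == "2-stage":  # stage 1: SFM frozen; stage 2: everything at a lower rate
        if stage == 1:
            for p in sfm.parameters():
                p.requires_grad = False
            return torch.optim.Adam(head.parameters(), lr=1e-4)
        for p in sfm.parameters():
            p.requires_grad = True
        return torch.optim.Adam(list(sfm.parameters()) + list(head.parameters()), lr=1e-5)
    raise ValueError(f"unknown method: {method}")
```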
From Table IV, we can see that for quality prediction, all three fine-tuning methods are effective. However, for intelligibility prediction, the EF and 2-stage FT methods are effective, whereas the PF method provides no benefit. Overall, the 2-stage FT method shows superiority over PF and EF, as evidenced by the lowest MSE and highest correlations, indicating better performance in both quality and intelligibility prediction.
TABLE IV. Performance of HASA-Net+ (WavLM-WS) with different fine-tuning methods on the in-domain task.

| Fine-tuning method | Quality MSE | Quality LCC | Quality SRCC | Intelligibility MSE | Intelligibility LCC | Intelligibility SRCC |
|---|---|---|---|---|---|---|
| Pre-trained | 0.005 | 0.955 | 0.969 | 0.019 | 0.808 | 0.829 |
| PF | 0.003 | 0.974 | 0.979 | 0.019 | 0.804 | 0.823 |
| EF | 0.003 | 0.970 | 0.980 | 0.018 | 0.841 | 0.848 |
| 2-stage FT | 0.002 | 0.983 | 0.987 | 0.013 | 0.869 | 0.885 |
3. Generalization capability
The generalization capability of HASA-Net+ was studied on the OOD task. The training set contained 21 100 (4220 × 5) utterances, the validation set contained 2300 (460 × 5) utterances, and the testing set contained 6800 (1360 × 5) utterances. Each utterance was paired with an audiogram randomly selected from the corresponding training, validation, or testing audiogram set or with the NH audiogram. To create more challenging conditions, all utterances in the testing set were presented at one of three levels (55, 65, or 75 dB SPL), while the training and validation utterances were consistently presented at 65 dB SPL.
We evaluated HASA-Net+'s generalization capability in zero-shot, few-shot, and full dataset settings by varying the number of training utterances. HASA-Net served as the baseline. We chose the HASA-Net+ (WavLM-WS) model trained with the 2-stage FT method, as it demonstrated the best performance in the in-domain evaluation. Additionally, we compared the results with different SFMs, including HuBERT (Hsu et al., 2021) and Whisper (Radford et al., 2023). This resulted in two additional versions of HASA-Net+: HASA-Net+ (HuBERT-WS) and HASA-Net+ (Whisper-WS). Similar to WavLM, HuBERT takes the waveform as input and processes it through a convolutional feature encoder followed by transformer layers; we used the weighted sum of the outputs of all transformer layers as input to HASA-Net+. For Whisper, the raw audio waveform is transformed into a log-Mel spectrogram, which then passes through convolutional layers and transformer layers; the weighted sum of the outputs of all transformer layers is similarly used as input to HASA-Net+. Essentially, this involves replacing the WavLM model with other SFMs, as illustrated in Fig. 1, demonstrating the flexibility of HASA-Net+ with different feature representations. Following the setup of HASA-Net+ (WavLM-WS), we first trained HASA-Net+ (HuBERT-WS) and HASA-Net+ (Whisper-WS) on the in-domain data. For the generalization evaluation, we used three settings with varying numbers of training utterances. In the zero-shot setting, we directly applied the models trained on the in-domain dataset to the OOD testing data. In the few-shot setting, we fine-tuned the models using a limited amount of data (100–6400 utterances). In the full dataset setting, we used the full OOD training set (21 100 utterances) to train the models.
The results are shown in Table V, and several observations can be drawn. First, in the zero-shot setting, the three HASA-Net+ models leveraging different SFMs consistently outperform HASA-Net in both quality and intelligibility prediction, demonstrating that models built on SFMs possess superior generalization capabilities compared to HASA-Net. Among the HASA-Net+ variants, HASA-Net+ (WavLM-WS) achieves the highest performance, followed by HASA-Net+ (Whisper-WS) and HASA-Net+ (HuBERT-WS). It should also be noted that the testing set included various dB SPL levels that were not present in the in-domain training data, which further underscores the ability of the HASA-Net+ models to handle utterances at different dB SPL levels. Second, in the few-shot setting, all models benefit from an increase in the amount of training data. Since the full dataset setting has access to the complete training data, all models achieve their respective best performance in this setting. Third, all three versions of HASA-Net+ outperform HASA-Net in all settings, highlighting the strong generalization capability and practical applicability of SFMs in real-world scenarios, with HASA-Net+ (WavLM-WS) achieving the highest performance.
TABLE V. Generalization performance on the OOD task under different amounts of OOD training data.

| Training data size | Model | Quality MSE | Quality LCC | Quality SRCC | Intelligibility MSE | Intelligibility LCC | Intelligibility SRCC |
|---|---|---|---|---|---|---|---|
| Zero-shot | HASA-Net | 0.056 | 0.570 | 0.558 | 0.042 | 0.299 | 0.488 |
| Zero-shot | HASA-Net+ (WavLM-WS) | 0.024 | 0.834 | 0.814 | 0.026 | 0.644 | 0.776 |
| Zero-shot | HASA-Net+ (HuBERT-WS) | 0.052 | 0.620 | 0.610 | 0.038 | 0.447 | 0.679 |
| Zero-shot | HASA-Net+ (Whisper-WS) | 0.060 | 0.634 | 0.628 | 0.054 | 0.486 | 0.716 |
| 100 | HASA-Net | 0.057 | 0.563 | 0.550 | 0.043 | 0.301 | 0.493 |
| 100 | HASA-Net+ (WavLM-WS) | 0.023 | 0.836 | 0.816 | 0.026 | 0.644 | 0.776 |
| 100 | HASA-Net+ (HuBERT-WS) | 0.041 | 0.705 | 0.687 | 0.036 | 0.451 | 0.688 |
| 100 | HASA-Net+ (Whisper-WS) | 0.047 | 0.661 | 0.649 | 0.054 | 0.486 | 0.715 |
| 400 | HASA-Net | 0.048 | 0.577 | 0.562 | 0.043 | 0.318 | 0.505 |
| 400 | HASA-Net+ (WavLM-WS) | 0.014 | 0.901 | 0.894 | 0.028 | 0.654 | 0.791 |
| 400 | HASA-Net+ (HuBERT-WS) | 0.030 | 0.801 | 0.782 | 0.036 | 0.474 | 0.700 |
| 400 | HASA-Net+ (Whisper-WS) | 0.042 | 0.683 | 0.663 | 0.352 | 0.487 | 0.716 |
| 1600 | HASA-Net | 0.056 | 0.596 | 0.584 | 0.040 | 0.366 | 0.533 |
| 1600 | HASA-Net+ (WavLM-WS) | 0.014 | 0.912 | 0.912 | 0.026 | 0.662 | 0.789 |
| 1600 | HASA-Net+ (HuBERT-WS) | 0.029 | 0.824 | 0.810 | 0.035 | 0.521 | 0.728 |
| 1600 | HASA-Net+ (Whisper-WS) | 0.036 | 0.740 | 0.713 | 0.034 | 0.495 | 0.717 |
| 6400 | HASA-Net | 0.034 | 0.788 | 0.798 | 0.041 | 0.374 | 0.641 |
| 6400 | HASA-Net+ (WavLM-WS) | 0.012 | 0.924 | 0.927 | 0.025 | 0.675 | 0.796 |
| 6400 | HASA-Net+ (HuBERT-WS) | 0.015 | 0.903 | 0.905 | 0.026 | 0.647 | 0.765 |
| 6400 | HASA-Net+ (Whisper-WS) | 0.025 | 0.828 | 0.808 | 0.032 | 0.551 | 0.756 |
| Full dataset | HASA-Net | 0.030 | 0.809 | 0.816 | 0.038 | 0.457 | 0.688 |
| Full dataset | HASA-Net+ (WavLM-WS) | 0.011 | 0.928 | 0.929 | 0.023 | 0.709 | 0.889 |
| Full dataset | HASA-Net+ (HuBERT-WS) | 0.013 | 0.918 | 0.919 | 0.025 | 0.674 | 0.781 |
| Full dataset | HASA-Net+ (Whisper-WS) | 0.024 | 0.839 | 0.826 | 0.031 | 0.579 | 0.764 |
Tables VI and VII further present detailed performance for quality and intelligibility prediction for various types of hearing loss. We again observe a consistent trend: for all types of hearing loss, as the training data size increases, the correlation values increase and the MSE decreases. In terms of quality prediction, notable improvements were achieved with only 100 training samples for each hearing-loss type, and increasing the amount of training data led to further notable improvements. In contrast, increasing the training data size improves intelligibility prediction less than quality prediction. Overall, HASA-Net+ (WavLM-WS) achieves strong generalization with limited OOD training data, which makes it a valuable tool for practical applications.
TABLE VI. Quality prediction performance of HASA-Net+ (WavLM-WS) on the OOD task for each hearing-loss type and training data size.

| Hearing-loss type | MSE (Zero-shot) | LCC (Zero-shot) | SRCC (Zero-shot) | MSE (100) | LCC (100) | SRCC (100) | MSE (1600) | LCC (1600) | SRCC (1600) | MSE (Full) | LCC (Full) | SRCC (Full) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Flat | 0.055 | 0.643 | 0.619 | 0.025 | 0.833 | 0.807 | 0.015 | 0.912 | 0.918 | 0.011 | 0.928 | 0.931 |
| Sloping | 0.059 | 0.627 | 0.617 | 0.023 | 0.854 | 0.836 | 0.018 | 0.907 | 0.909 | 0.013 | 0.928 | 0.929 |
| Rising | 0.057 | 0.645 | 0.640 | 0.028 | 0.809 | 0.792 | 0.016 | 0.907 | 0.909 | 0.012 | 0.924 | 0.923 |
| Cookie-bite | 0.058 | 0.636 | 0.629 | 0.020 | 0.862 | 0.838 | 0.014 | 0.917 | 0.930 | 0.009 | 0.946 | 0.952 |
| Noise-notched | 0.055 | 0.682 | 0.655 | 0.032 | 0.849 | 0.816 | 0.016 | 0.928 | 0.900 | 0.014 | 0.943 | 0.927 |
| High-frequency | 0.058 | 0.645 | 0.633 | 0.026 | 0.847 | 0.821 | 0.014 | 0.923 | 0.926 | 0.011 | 0.934 | 0.927 |
| Normal | 0.015 | 0.879 | 0.850 | 0.015 | 0.889 | 0.855 | 0.013 | 0.895 | 0.879 | 0.011 | 0.906 | 0.895 |
TABLE VII. Intelligibility prediction performance of HASA-Net+ (WavLM-WS) on the OOD task for each hearing-loss type and training data size.

| Hearing-loss type | MSE (Zero-shot) | LCC (Zero-shot) | SRCC (Zero-shot) | MSE (100) | LCC (100) | SRCC (100) | MSE (1600) | LCC (1600) | SRCC (1600) | MSE (Full) | LCC (Full) | SRCC (Full) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Flat | 0.028 | 0.599 | 0.733 | 0.030 | 0.603 | 0.747 | 0.028 | 0.621 | 0.761 | 0.026 | 0.642 | 0.784 |
| Sloping | 0.031 | 0.595 | 0.544 | 0.028 | 0.652 | 0.586 | 0.028 | 0.651 | 0.619 | 0.023 | 0.723 | 0.650 |
| Rising | 0.041 | 0.595 | 0.678 | 0.042 | 0.587 | 0.676 | 0.042 | 0.636 | 0.677 | 0.035 | 0.690 | 0.676 |
| Cookie-bite | 0.029 | 0.569 | 0.571 | 0.025 | 0.656 | 0.659 | 0.030 | 0.611 | 0.662 | 0.024 | 0.678 | 0.692 |
| Noise-notched | 0.024 | 0.720 | 0.626 | 0.024 | 0.660 | 0.658 | 0.025 | 0.655 | 0.692 | 0.021 | 0.698 | 0.678 |
| High-frequency | 0.042 | 0.585 | 0.570 | 0.040 | 0.611 | 0.607 | 0.039 | 0.649 | 0.641 | 0.033 | 0.712 | 0.679 |
| Normal | 0.005 | 0.498 | 0.646 | 0.005 | 0.540 | 0.649 | 0.005 | 0.612 | 0.654 | 0.005 | 0.567 | 0.664 |
In addition to the quantitative results reported above, we also performed a qualitative analysis of the performance of HASA-Net+ (WavLM-WS) in the zero-shot, few-shot (with 100 and 1600 training samples), and full dataset settings. The results are presented as scatterplots in Fig. 3. In each scatterplot, the x axis represents the ground-truth HASQI (or HASPI) score, and the y axis represents the predicted HASQI (or HASPI) score. As prediction accuracy increases, the points in the scatterplot align more closely with the diagonal. We can observe that in the zero-shot setting, it is difficult for HASA-Net+ to provide accurate estimates for test samples with low HASQI scores. However, as the amount of training data increases, the prediction performance improves, as evidenced by the scatterplots becoming more diagonal.
D. Characteristics of HASQI and HASPI
Our experimental results on both the in-domain dataset and the OOD dataset show that intelligibility prediction is more challenging than quality prediction. The difference in sensitivity between the HASQI and HASPI metrics may be the main reason for the relatively poor performance of intelligibility prediction. Kates (2018) also pointed out that HASQI is more sensitive to noise than HASPI. According to the findings in Kates (2018), HASPI is close to one when the SNR increases to 10 dB, while HASQI only reaches around 0.3 at the same SNR level. Even at a higher SNR of 40 dB, HASQI can only reach 0.9. Notably, HASPI shows a transition from zero to one between −10 and 10 dB, indicating that for most SNR levels above 10 dB, HASPI is expected to be close to its maximum value. These characteristics are reflected in our training data: since the lowest SNR in our datasets is 2.5 dB, most HASPI scores fall in the higher-value region. Figure 4 displays the percentage distribution of HASQI and HASPI scores on the in-domain dataset, with scores divided into five intervals, each with a range of 0.2. The distribution of HASQI scores is fairly even, with fewer scores falling in the range of 0–0.2. On the other hand, the distribution of HASPI scores is heavily skewed, with most scores concentrated between 0.6 and 1.0, which means that data with lower intelligibility scores are limited. This unbalanced distribution of HASPI scores in the training set poses a challenge to the training process. Despite slightly lower prediction performance for intelligibility compared to quality, HASA-Net+ shows promising results on the in-domain and OOD datasets, with SRCC values of 0.885 (cf. Table IV), 0.802 (zero-shot), and 0.866 (full dataset) (cf. Table V), respectively.
V. DISCUSSION
Since HASQI and HASPI scores are used as the ground truth to train HASA-Net+, the intrusive HASQI and HASPI metrics themselves represent an upper bound on the attainable prediction performance. The main advantage of HASA-Net+ is that it does not require a clean reference signal, thereby overcoming the limitation of HASQI and HASPI, which require a clean reference to calculate scores. This facilitates numerous applications and designs of hearing aid algorithms. Furthermore, with the availability of more training data, HASA-Net+ can be further enhanced to achieve higher prediction accuracy. When a reference speech signal is available, or for listeners with hearing loss who do not wear hearing aids, HASQI and HASPI remain the more reliable choice.
Please also note that the main focus of this work is to derive a neural metric that can predict speech quality and intelligibility for HI users “with restored audibility.” With this model, we can design improved front-end processors, such as speech enhancement, active noise control, and adaptive echo cancellation, while taking into account the amplification stage.
It is noted that optimal fine-tuning techniques may vary across different scenarios, such as varying amounts of training data, speech data diversity, and audiogram distributions. The fine-tuning experiments in this study aim to demonstrate that our proposed system can deliver acceptable performance under zero-shot scenarios and can be further enhanced with fine-tuning techniques. Detailed investigations into factors such as data diversity and audiogram distributions and their impact on fine-tuning techniques constitute separate avenues for future research. It is important to note that we utilized two distinct datasets: one as the in-domain dataset and the other as the OOD dataset. The in-domain dataset was used to determine the optimal SFM configuration and fine-tuning strategy, while the OOD dataset was employed to evaluate HASA-Net+'s generalization capability. We ensured a strict separation between the two datasets, with the in-domain data based on the VCTK corpus and the OOD data based on the TIMIT corpus. Additionally, to further reinforce this separation, we generated denoised and dereverberated utterances for the OOD dataset using different DL models from those applied to the in-domain dataset. This process was specifically designed to rigorously test HASA-Net+'s generalization capability across different datasets.
VI. CONCLUSION
In this paper, we introduce HASA-Net+, a multi-objective non-intrusive hearing-aid speech assessment model that builds upon our previous work, HASA-Net. Like HASA-Net, HASA-Net+ predicts two well-known objective metrics for hearing-aid users: HASQI and HASPI. While HASQI and HASPI are based on a comparison of the processed speech with its clean reference, HASA-Net+ predicts these scores without needing a clean reference. HASA-Net+ improves upon HASA-Net in several ways. First, it is a general model that takes into account both NH and HI listeners. Second, it incorporates SFM pre-training and fine-tuning techniques, demonstrating the benefit of state-of-the-art pre-trained SFMs in enhancing the model's performance. Third, it evaluates the robustness of the model across five different speech conditions, including noisy, enhanced, reverberant, dereverberated, and vocoded speech. Fourth, the generalization capability of the model is validated on an OOD dataset in zero-shot, few-shot, and full dataset scenarios. The experimental results reveal that incorporating SFMs yields better performance compared to the previous HASA-Net, which used spectrograms as input features. We also explored various fine-tuning approaches for the quality and intelligibility prediction tasks. Moreover, HASA-Net+ demonstrates improved generalization capability under various training data sizes, indicating its applicability in real-world scenarios. We also identify the difficulty posed by the imbalance in the distribution of intelligibility scores; addressing this imbalance to improve prediction accuracy is an ongoing effort. These findings validate that the proposed HASA-Net+ has the potential to serve as a universal model for practical applications. For future work, we aim to collect data from human subjects to further validate and refine HASA-Net+'s effectiveness in real-world settings.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.