AI-based soundscape analysis: Jointly identifying sound sources and predicting annoyance

Soundscape studies typically attempt to capture the perception and understanding of sonic environments by surveying users. However, for long-term monitoring or assessing interventions, sound-signal-based approaches are required. To this end, most previous research focused on psycho-acoustic quantities or automatic sound recognition. Few attempts were made to include appraisal (e.g., in circumplex frameworks). This paper proposes an artificial intelligence (AI)-based dual-branch convolutional neural network with cross-attention-based fusion (DCNN-CaF) for automatic soundscape characterization, including sound recognition and appraisal. Using the DeLTA dataset containing human-annotated sound source labels and perceived annoyance, the DCNN-CaF is used to perform sound source classification (SSC) and human-perceived annoyance rating prediction (ARP). Experimental findings indicate that (1) the proposed DCNN-CaF using loudness and Mel features outperforms the DCNN-CaF using only one of them; (2) the proposed DCNN-CaF with cross-attention fusion outperforms other typical AI-based models and soundscape-related traditional machine learning methods on the SSC and ARP tasks; (3) correlation analysis reveals that the relationship between sound sources and annoyance is similar for humans and the proposed AI-based DCNN-CaF model; and (4) generalization tests show that the proposed model's ARP in the presence of model-unknown sound sources is consistent with expert expectations and can explain previous findings from the literature on soundscape augmentation.


I. INTRODUCTION
To mitigate the effect of urban sound on the health and well-being of city dwellers, previous research has classically focused on treating noise as a pollutant. For a couple of decades, researchers have gradually changed their focus to a more holistic approach to urban sound, referred to as the soundscape approach (Brambilla and Maffei, 2010; Kang et al., 2016; Nilsson and Berglund, 2006; Raimbault and Dubois, 2005). Overall, the soundscape approach offers a more comprehensive understanding of urban sound and has the potential to lead to more effective interventions for improving the health and well-being of city dwellers (Abraham et al., 2010; Tsaligopoulos et al., 2021).
Previous studies on the categorization and quantification of soundscapes mostly rely on assessments of participant perceptions. In these studies (Acun and Yilmazer, 2018; Bruce and Davies, 2014; Mackrill et al., 2013; Maristany et al., 2016), participants are usually guided to complete questionnaires about soundscapes. For example, based on investigations, Yilmazer and Acun (2018) explore the relationship among the sound factors, spatial functions, and properties of soundscapes. Using field questionnaires, Fang et al. (2021) explore how participants' perceptions of and preferences for soundscapes differ. Questionnaires may include a direct assessment of the soundscape quality, but the appraisal is often indicated in the two-dimensional plane spanned by pleasantness and eventfulness (Axelsson et al., 2010). To benchmark soundscape emotion recognition in a valence-arousal plane, Fan et al. (2017) created the Emo-soundscapes dataset based on 6-s excerpts from Freesound.org and online labeling by 1182 annotators. They later used it for constructing a deep learning model for automatic classification (Fan et al., 2018). To automatically recognize the eventfulness and pleasantness of the soundscape, Fan et al. (2015) build a gold standard model and test the correlation between the level of pleasure and the level of eventfulness. In an everyday context, uneventful and unpleasant soundscapes are often not noticed and do not contribute to the experience of the place. Hence, Sun et al. (2019) propose a soundscape classification that acknowledges that sonic environments can be pushed into the background. Only foregrounded soundscapes contribute to the appraisal and are classified as disruptive or supportive, the latter being either calming or stimulating (Sun et al., 2019). Based on audio recordings containing implicit information in soundscapes, Thorogood et al. (2016) established the background and foreground classification task within a musicological and soundscape context. For urban park soundscapes, Gong et al. (2022) introduce the concepts of "importance" and "performance" and position the soundscape elements in this two-dimensional plane. The importance dimension reflects to what extent a particular sound is an essential part of this soundscape. The perception study underlying this paper (Mitchell et al., 2022) can be seen as a foregrounded soundscape assessment with annoyance as its primary dimension, which is a negative dimension of soundscape assessment. This type of assessment is often used to identify sources of noise, and it allows researchers to identify sources of annoyance that can cause negative health effects.
Research on annoyance has been carried out based on non-acoustic approaches from different perspectives in the fields of psychology, sociology, medicine, human-computer interaction, and vision. In psychology, researchers primarily focus on the effects of emotion and mood on annoyance (Timmons et al., 2023). The findings of the DEBATS study (Lefèvre et al., 2020) also confirm that considering non-acoustic factors such as situational, personal, and attitudinal factors will improve annoyance predictions. Sociological studies tend to pay more attention to the impact of social support, social relationships, and cultural factors on annoyance (Beyer et al., 2017). In medical studies, the relationship between annoyance and health is emphasised (Eek et al., 2010). The study of Carlsson et al. (2005) indicates that the correlation between subjective health and functional ability increases with increasing annoyance levels. Human-computer interaction usually utilises user experience studies, visual eye tracking, and virtual reality techniques to analyse and predict the annoyance of users when interacting with machines (Mount et al., 2012). On the other hand, acoustic-based annoyance research focuses more on the effects of sound and auditory stimulation on an individual's psychological and emotional state (Nering et al., 2020). The relationship between the appraisal of the soundscape and the assessment of annoyance on the community level is still under-researched, although it was first explored in 2003 (Lercher and Schulte-Fortkamp, 2003). Several non-acoustic factors influence community noise annoyance, and some of them, such as noise sensitivity (Das et al., 2021), are so strongly rooted in human auditory perception (Kliuchko et al., 2016) that they probably also contribute to soundscape appraisal.
The formal definition of "soundscape" refers to an understanding of the sonic environment, hence recognizing sources. The influence of perceived sounds on the appraisal of soundscapes is found to depend on context (Hong and Jeon, 2015). The sounds that people hear in their environment can have a substantial impact on their overall appraisal of that environment, and they can influence people's emotional and cognitive responses to their living surroundings. Hence, any acoustic signal processing attempting to predict this appraisal is very likely to benefit from automatic sound recognition. Using automatic sound recognition to predict people's perception of the soundscape has the potential to improve our understanding of how the acoustic environment affects our perceptions, and can inform the development of more effective interventions to promote positive outcomes. Therefore, Boes et al. (2018) propose to use an artificial neural network to predict both the sound source (human, natural, and mechanical) perceived by users of public parks and their appraisal of the soundscape quality. It was shown that sound recognition outperforms psychoacoustic indicators in predicting each of these perceptual outcomes. Identifying specific sounds in urban soundscapes is relevant for assisting drivers or self-driving cars (Marchegiani and Posner, 2017), and for urban planning and environment improvement (Ma et al., 2021).
In this paper, a new artificial intelligence (AI) method, inspired by the approach presented in Mitchell et al. (2023), is introduced to identify various sound sources and predict one of the components in a circumplex appraisal of the sonic environment: annoyance. More specifically, this paper proposes a deep-learning model based on the cross-attention mechanism to simultaneously perform sound source classification (SSC) and annoyance rating prediction (ARP) for end-to-end inference of sound sources and annoyance ratings in soundscapes. SSC has been widely used for audio event recognition (Kong et al., 2020; Ren et al., 2017) and acoustic scene classification (Barchiesi et al., 2015; Hou et al., 2022a; Mesaros et al., 2018b). In this work, we will augment it with ARP, aiming to predict the overall appraisal of the soundscape along the annoyance axis.
In soundscapes with complex acoustic environments, source-related SSC and human-perception-related ARP are commonly used techniques for understanding how people perceive and respond to sounds in soundscapes. To accurately identify these various audio events, deep learning-based convolutional neural networks (Li et al., 2019; Xu et al., 2017), recurrent neural networks (Parascandolo et al., 2016), convolutional recurrent neural networks (Li et al., 2020), and Transformer (Vaswani et al., 2017) with multi-head attention are used in SSC-related detection and classification of acoustic scenes and events (DCASE) challenges (Mesaros et al., 2018a; Politis et al., 2021). Recently, with the aid of large-scale audio datasets, e.g., AudioSet (Gemmeke et al., 2017), and diverse audio pre-trained models [such as convolution-based PANNs (Kong et al., 2020) and Transformer-based AST (Gong et al., 2021)], deep learning-based approaches have made great improvements in SSC tasks. However, most of these SSC-related studies focus on recognizing sound sources without considering whether they are annoying to humans. This paper proposes a joint SSC and ARP approach, expanding SSC to include subjective human perception.
An intuitive observation is that in real-life soundscapes, loud sounds naturally attract more human attention than quieter sounds. For example, on the side of the street, the sound of roaring cars will capture people's attention more than the sound of small conversations on the corner. Therefore, this paper exploits the loudness-related root mean square value (RMS) (Mulimani and Koolagudi, 2018) and Mel spectrogram (Bala et al., 2010) features, which conform to human hearing characteristics, to predict the objective sound sources and perceived annoyance ratings. The proposed model uses convolutional blocks to extract high-level representations of the two features and a cross-attention module to fuse their semantic representations. Based on the proposed model, this paper explores the following research questions (RQs): (1) RQ1: Can the model's performance be improved using two acoustic features? (2) RQ2: How does the performance of the proposed model compare with other models on the ARP task and the SSC task, as well as the joint ARP and SSC tasks? Does the cross-attention-based fusion module in the model work well? (3) RQ3: Does the proposed model capture the relationships between sound sources and annoyance ratings? What are the relationships between sound sources, annoyance ratings, and sound levels? (4) RQ4: How does the proposed model respond to adding unknown sounds to the soundscape?
The paper is organized as follows. Section II introduces the proposed method. Section III describes the baselines, dataset, and training setup. Section IV analyzes and discusses the results with respect to the research questions. Section V draws conclusions.

II. METHOD
This section introduces the proposed model DCNN-CaF: the dual-branch convolutional neural network (DCNN) with cross-attention-based fusion (CaF). First, we introduce how to extract audio representations from the input audio clips, and then perform CaF on the audio representations. Finally, we use different loss functions to train the task-dependent branches of the model to complete the classification-based SSC task and the regression-based ARP task.

A. Audio representation extraction
Since this paper uses both Mel spectrograms, which are common in sound source-related tasks, and RMS features, which reflect the energy of sound sources, the DCNN-CaF model has two input branches that extract high-level representations of the two acoustic features separately, as shown in Fig. 1. Inspired by the excellent performance of pure convolution-based pretrained audio neural networks (PANNs) (Kong et al., 2020) in audio-related tasks, a convolutional structure similar to that in PANNs is used in Fig. 1 to extract the representation of the input acoustic features. Specifically, the dual-input model in Fig. 1 uses 4-layer convolutional blocks. Each convolutional block contains two convolutional layers with global average pooling (GAP). The representations of Mel spectrograms and RMS features generated by the convolutional blocks, $R_m$ and $R_r$, are fed to the attention-based fusion module to generate representations suitable for the ARP task. The sound source embeddings, obtained by mapping $R_m$ through the embedding layer, are fed into the final sound source classification layer to complete the SSC task.

B. Cross-attention-based fusion
The cross-attention fusion module in this paper is based on the multi-head attention (MHA) in Transformer (Vaswani et al., 2017). MHA allows models to jointly focus on representations at different positions in different subspaces. Following the description in Transformer (Vaswani et al., 2017), MHA is calculated on a set of queries (Q), keys (K), and values (V),

$$\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}, \tag{1}$$

where

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}), \tag{2}$$

where $\mathrm{head}_i$ represents the output of the $i$th attention head for a total number of $h$ heads. $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are learnable weights. For MHA in the encoder, Q, K, and V come from the same place; at this point, the attention in MHA is called self-attention (Vaswani et al., 2017). All the parameters (such as $h = 8$, $d_k$, and $d_v$) follow the default settings of Transformer (Vaswani et al., 2017).

FIG. 1. (Color online) The proposed dual-branch convolutional neural network with cross-attention-based fusion (DCNN-CaF). The dimension of the output of each layer is shown.
From the corresponding dimensions of the output of each layer in Fig. 1, it can be seen that the dimensions of $R_m$ and $R_r$ are both (512, 30), which correspond to the number of filters of the previous convolutional layer and the number of frames, respectively. After a series of convolutional operations, the 480 input frames of Mel spectrograms and RMS features are reduced to audio representations with a time length of 30 frames. This means that in MHA, the number of time steps of each head $\mathrm{head}_i$ is also 30. To obtain the representations of $R_m$ and $R_r$ based on their mutual attention, in MHA1 in Fig. 1,

$$Q = R_r, \quad K = V = R_m. \tag{3}$$

In contrast, in MHA2,

$$Q = R_m, \quad K = V = R_r. \tag{4}$$

The cross-attention-adjusted representations of $R_m$ and $R_r$ are simply concatenated and fed into the fusion layer to obtain higher-level acoustic representations containing the semantics of both $R_m$ and $R_r$.
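The fusion step described in this subsection can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that MHA1 takes its queries from the RMS representation and its keys and values from the Mel representation (one reading of "using R_r to adjust R_m"), and the reverse for MHA2. The weights are random placeholders, the function names are hypothetical, and the representations are stored as (frames, channels) = (30, 512) rather than (512, 30).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q_in, kv_in, weights, num_heads=8):
    """Scaled dot-product MHA. q_in and kv_in have shape (T, d_model)."""
    T, d_model = q_in.shape
    d_k = d_model // num_heads

    def split_heads(x):
        # (T, d_model) -> (num_heads, T, d_k)
        return x.reshape(x.shape[0], num_heads, d_k).transpose(1, 0, 2)

    Q = split_heads(q_in @ weights["W_q"])
    K = split_heads(kv_in @ weights["W_k"])
    V = split_heads(kv_in @ weights["W_v"])
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (num_heads, T, T)
    attn = softmax(scores, axis=-1)
    heads = attn @ V                                    # (num_heads, T, d_k)
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)
    return concat @ weights["W_o"], attn

rng = np.random.default_rng(0)
T, D = 30, 512  # 30 frames, 512 channels, as stated for R_m and R_r

def random_weights():
    # Placeholder (untrained) projection matrices, for illustration only.
    return {name: rng.normal(scale=D ** -0.5, size=(D, D))
            for name in ("W_q", "W_k", "W_v", "W_o")}

R_m = rng.normal(size=(T, D))  # stand-in for the Mel-branch representation
R_r = rng.normal(size=(T, D))  # stand-in for the RMS-branch representation

out1, attn1 = multi_head_attention(R_r, R_m, random_weights())  # assumed MHA1
out2, attn2 = multi_head_attention(R_m, R_r, random_weights())  # assumed MHA2
fused_input = np.concatenate([out1, out2], axis=-1)  # input to the fusion layer
```

Each head yields a 30 × 30 attention matrix whose rows sum to one, and the concatenated tensor has shape (30, 1024) before any fusion layer is applied.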

C. The loss function of the DCNN-CaF model
The model proposed in this paper performs two tasks simultaneously, SSC and ARP. Given that the output of the sound source classification layer is $\hat{y}_s$ and its corresponding label is $y_s$, following previous work (Hou and Botteldooren, 2022), the binary cross-entropy (BCE) is used as the loss function for the SSC task,

$$\mathcal{L}_{SSC} = \mathrm{BCE}(\hat{y}_s, y_s). \tag{5}$$

Given that the prediction output from the annoyance rating prediction layer is $\hat{y}_a$ and its corresponding label is $y_a$, the mean squared error (MSE) (Wallach and Goffinet, 1989) is used as the loss function for the ARP task to measure the distance between the predicted and the human-annotated annoyance ratings,

$$\mathcal{L}_{ARP} = \mathrm{MSE}(\hat{y}_a, y_a). \tag{6}$$

Then, the final loss function of the DCNN-CaF model in this paper is

$$\mathcal{L} = \mathcal{L}_{SSC} + \mathcal{L}_{ARP}. \tag{7}$$
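To make the two loss terms concrete, the following sketch writes out the standard definitions of BCE and MSE in plain Python. It assumes that the joint loss is the unweighted sum of the two terms and that BCE is averaged over the sound source classes. Both are assumptions made for illustration, and the function names are hypothetical.

```python
import math

def bce_loss(y_pred, y_true, eps=1e-7):
    """Binary cross-entropy averaged over classes (multi-label SSC)."""
    total = 0.0
    for p, y in zip(y_pred, y_true):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(y_pred)

def mse_loss(a_pred, a_true):
    """Mean squared error between predicted and annotated ratings (ARP)."""
    return sum((p - t) ** 2 for p, t in zip(a_pred, a_true)) / len(a_pred)

def joint_loss(y_s_pred, y_s_true, y_a_pred, y_a_true):
    """Assumed unweighted sum of the SSC and ARP losses."""
    return bce_loss(y_s_pred, y_s_true) + mse_loss(y_a_pred, y_a_true)

# Toy example: two sound source classes and one annoyance rating.
ssc = bce_loss([0.9, 0.2], [1.0, 0.0])
arp = mse_loss([6.0], [5.0])
total = joint_loss([0.9, 0.2], [1.0, 0.0], [6.0], [5.0])
```

If the two terms need different weights, a scalar factor on either term would be the natural extension.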

III. DATASET, BASELINE, AND EXPERIMENTAL SETUP

A. Dataset
To the best of our knowledge, DeLTA (Mitchell et al., 2022) is the only publicly available dataset that includes both ground-truth sound source labels and human annoyance rating scores, so we use it in this paper. DeLTA comprises 2890 15-s binaural audio clips collected in urban public spaces across London, Venice, Granada, and Groningen. A remote listening experiment performed by 1221 participants was used to label the DeLTA recordings. In the listening experiment, each participant listened to ten 15-s binaural recordings of urban environments, assessed whether they contained any of the 24 classes of sound sources, and then provided an annoyance rating (continuously from 1 to 10). Participants were given labels for 24 classes of sound sources: Aircraft, Bells, Bird tweets, Bus, Car, Children, Construction, Dog bark, Footsteps, General traffic, Horn, Laughter, Motorcycle, Music, Non-identifiable, Rail, Rustling leaves, Screeching brakes, Shouting, Siren, Speech, Ventilation, Water, and Other, adapted from the taxonomy developed by Salamon et al. (2014). In the listening experiment, each recording was evaluated by two to four participants, with an average of 3.1 recognized sound sources per recording. For more detailed information about DeLTA, please see Mitchell et al. (2022). For the models trained in this paper, the training, validation, and test sets contain 2081, 231, and 578 audio clips, respectively.

B. Baseline for annoyance rating prediction (ARP) task
To compare the performance of the proposed deep-learning-based method with traditional approaches in soundscape-related studies, we employ five regression methods inspired by their performance in annoyance prediction in soundscape research (Al-Shargabi et al., 2023; Iannace et al., 2019; Morrison et al., 2003; Szychowska et al., 2018; Zhou et al., 2018) to perform the ARP task based on A-weighted equivalent sound pressure levels. They are linear regression, support vector regression (SVR), decision tree (DT), k-nearest neighbours (KNN), and random forest. Linear regression is a fundamental and interpretable model that assumes a linear relationship between input features (in this case, sound levels) and the target variable (annoyance ratings). SVR is particularly effective when dealing with complex relationships between input features and target variables. Decision tree regression is known for its ability to handle non-linear relationships and interactions among features. Random forest regression is an ensemble method that combines multiple decision trees to improve predictive accuracy and reduce overfitting. KNN regression can work well when there is a relatively small dataset and in low-dimensional spaces.
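As an illustration of what a sound-level-based baseline looks like, the sketch below fits two of the five regressors (linear regression and KNN) on a single clip-level feature. The sound levels and ratings are made-up toy values, not DeLTA data, and the helper names are hypothetical; an actual experiment would more likely use an established machine learning library.

```python
def fit_linear(levels, ratings):
    """Ordinary least squares for one feature: rating = slope * level + intercept."""
    n = len(levels)
    mean_x = sum(levels) / n
    mean_y = sum(ratings) / n
    sxx = sum((x - mean_x) ** 2 for x in levels)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(levels, ratings))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def knn_predict(levels, ratings, query, k=3):
    """Average rating of the k training clips closest in sound level."""
    nearest = sorted(zip(levels, ratings), key=lambda pair: abs(pair[0] - query))[:k]
    return sum(rating for _, rating in nearest) / k

# Hypothetical clip-level sound levels (dBA) and annoyance ratings (1 to 10).
toy_levels = [50.0, 55.0, 60.0, 65.0, 70.0]
toy_ratings = [2.0, 3.0, 4.0, 5.0, 6.0]

slope, intercept = fit_linear(toy_levels, toy_ratings)
linear_prediction = slope * 62.5 + intercept
knn_prediction = knn_predict(toy_levels, toy_ratings, query=61.0, k=3)
```

Because the only input is one scalar per clip, such baselines cannot distinguish sound sources of equal level, which is the main contrast with the frame-level models.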

C. Baseline for sound source classification (SSC) task
In SSC-related research, deep learning convolutional neural network (CNN)-based models have achieved widespread success, and recently, Transformer-based models have become dominant. Therefore, for the SSC task, the classical CNN-based YAMNet (Plakal and Ellis, 2023) and PANN (Kong et al., 2020), and Transformer-based AST (Gong et al., 2021) are used as baselines. Since YAMNet, PANN, and AST are trained on the large-scale AudioSet (Gemmeke et al., 2017), the last layers of YAMNet, PANN, and AST have 521, 527, and 527 output units, respectively. In contrast, the SSC task in this paper has only 24 classes of audio events, so we change the number of units in the last layer of all three to 24, and then fine-tune the models on the DeLTA dataset.

D. Baseline for joint ARP and SSC task
This paper is a first attempt to use an artificial intelligence (AI)-based model to simultaneously perform sound source classification and annoyance rating prediction. Therefore, this paper adopts deep neural networks (DNN), convolutional neural networks (CNN), and CNN-Transformer as baselines for comparison.

Deep neural networks (DNN)
The DNN consists of two branches. Each branch contains four fully connected layers with ReLU functions (Boob et al., 2022), where the number of units in each layer is 64, 128, 256, and 512, respectively. The outputs of the final fully connected layers of the two branches are concatenated and fed to both the SSC and ARP layers.

Convolutional neural networks (CNN)
Similar to the DNN, the compared CNN also consists of two branches. Each branch includes two convolutional layers, where the number of filters in each convolutional layer is 32 and 64, respectively. The outputs of the convolutional layers of the two branches are concatenated and fed to both the SSC and ARP layers.

CNN-Transformer
The CNN-Transformer is based on the CNN, with an Encoder from Transformer (Vaswani et al., 2017) added after the final convolutional layer. The output of the Encoder is flattened and then fed to both the SSC and ARP layers.

E. Training setup and metric
The 64-filter-bank logarithmic Mel-scale spectrograms (Bala et al., 2010) and frame-level root mean square values (RMS) (Mulimani and Koolagudi, 2018) are used as the acoustic features in this paper. A Hamming window length of 46 ms and a window overlap of 1/3 (Hou et al., 2022a) are used for each frame. A batch size of 64 and the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-3 are used to minimize the loss in the proposed model. The model is trained for 100 epochs.
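The frame-level RMS feature can be sketched as follows. The window length (46 ms) and overlap (1/3) are taken from the setup above; the sample rate, the choice to apply the Hamming window before computing the RMS, and the function name `frame_rms` are assumptions made for illustration only.

```python
import numpy as np

def frame_rms(signal, sample_rate, win_seconds=0.046, overlap=1 / 3):
    """Return frame-level RMS values with shape (T, 1)."""
    win = int(round(win_seconds * sample_rate))   # samples per frame
    hop = win - int(round(win * overlap))         # step between frame starts
    window = np.hamming(win)                      # assumed to be applied to each frame
    n_frames = 1 + (len(signal) - win) // hop
    rms = np.empty((n_frames, 1))
    for t in range(n_frames):
        frame = signal[t * hop: t * hop + win] * window
        rms[t, 0] = np.sqrt(np.mean(frame ** 2))
    return rms

# Toy input: 1 s of a 440 Hz tone at a hypothetical 16 kHz sample rate.
sr = 16000
time_axis = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * time_axis)
rms_features = frame_rms(tone, sr)
```

The output has one value per frame, matching the (T, 1) layout of the RMS input, whereas a 64-band Mel spectrogram computed on the same frames would have shape (T, 64).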
SSC is a classification task, so accuracy (Acc), F-score, and threshold-free area under curve (AUC) are used to evaluate the classification results. ARP is viewed as a regression task in this paper, so mean absolute error (MAE) and root mean square error (RMSE) are used to measure the regression results. Higher Acc, F-score, and AUC and lower MAE and RMSE indicate better performance. Models and more details are available on the project webpage (Hou, 2023).
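For reference, the two regression metrics follow their standard definitions, sketched here in plain Python with made-up ratings (the numbers are illustrative, not results from the paper).

```python
import math

def mean_absolute_error(predicted, target):
    """MAE: average absolute difference between predictions and labels."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)

def root_mean_square_error(predicted, target):
    """RMSE: square root of the average squared difference."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target))

# Toy annoyance ratings.
predicted_ratings = [1.0, 2.0, 3.0]
annotated_ratings = [2.0, 2.0, 5.0]
mae = mean_absolute_error(predicted_ratings, annotated_ratings)
rmse = root_mean_square_error(predicted_ratings, annotated_ratings)
```

RMSE penalizes large errors more strongly than MAE, so reporting both gives a rough picture of how the errors are distributed.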

IV. RESULTS AND ANALYSIS
This section analyzes the performance of the proposed method based on the following research questions.
A. Can the model's performance be improved using two acoustic features?
Two kinds of acoustic features are used in this paper: the Mel spectrograms that approximate the characteristics of human hearing and the RMS features that characterise the acoustic level. Table I shows the ablation experiments of the two acoustic features on the proposed DCNN-CaF model to specifically present the performance of the DCNN-CaF model based on different features. When only a single feature is used, the input of the DCNN-CaF model is the corresponding single branch.
As shown in Table I, the DCNN-CaF model performs the worst on the ARP and SSC tasks when using only the 46 ms interval RMS features, which are related to instantaneous loudness. This is apparently caused by the lack of spectral information, which is embedded in the Mel spectrograms and omitted from the RMS features. The dimension of the frame-level RMS used in this paper is (T, 1), where T is the number of frames. Compared with Mel spectrograms with a dimension of (T, 64), the loudness-related one-dimensional RMS features carry far less spectral information. This makes it difficult for the model to distinguish the 24 types of sound sources and predict annoyance from the different real-life sound sources in the DeLTA dataset based on the RMS features alone. The DCNN-CaF using Mel spectrograms outperforms its RMS-only counterpart overall, while the DCNN-CaF combining Mel spectrograms and RMS features achieves the best results, which shows that using both acoustic features benefits the model's performance on the SSC and ARP tasks. Thus, adding energy level-related information to the sound recognition improves annoyance prediction as expected, but it also slightly improves sound source recognition.
B. How does the performance of the proposed model compare with other models on the ARP task and the SSC task, as well as the joint ARP and SSC tasks? Does the cross-attention-based fusion module in the model work well?
Table II presents the results of the classical pure convolution-based YAMNet and PANN (Kong et al., 2020), and the Transformer-based AST (Gong et al., 2021), on the SSC task. YAMNet, PANN, and AST are trained based on Mel spectrograms. For a fair comparison, the proposed DCNN-CaF only uses the left SSC branch with Mel spectrograms as input.
In Table II, both YAMNet and DCNN-CaF are lightweight models compared to PANN and AST. Relative to the Transformer-based AST, the number of parameters of DCNN-CaF is reduced by (86.207 - 4.961)/86.207 × 100% ≈ 94%. Compared to YAMNet, PANN, and AST, which have deeper layers than DCNN-CaF, the shallow DCNN-CaF achieves better results on the SSC task. This may be due to the relatively small dataset used in this paper, on which large and deep models are prone to overfitting during training.
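The parameter reduction quoted above can be checked with a few lines of arithmetic. The two values are the ones given in the text and are assumed here to be parameter counts in millions.

```python
# Parameter counts as quoted in the text (assumed to be in millions).
ast_params = 86.207
dcnn_caf_params = 4.961

# Relative reduction of DCNN-CaF with respect to AST, in percent.
reduction_percent = (ast_params - dcnn_caf_params) / ast_params * 100.0
rounded_reduction = round(reduction_percent)
```

The unrounded value is slightly above 94%, so the rounded figure in the text is consistent with the two quoted counts.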
Table III shows the joint ARP and SSC baselines proposed in Sec. III D. For a fairer comparison, the DNN, CNN, and CNN-Transformer in Sec. III D also use a dual-input branch structure to simultaneously use the two acoustic features of Mel spectrograms and RMS to complete the SSC and ARP tasks.
As shown in Table III, the CNN based on the convolutional structure outperforms the DNN based on multi-layer perceptrons (MLP) (Kruse et al., 2022) on both tasks, which reflects that the convolutional structure is more effective than the MLP structure in extracting acoustic representations. While performing well on the ARP regression task, the CNN-Transformer combining convolution and an Encoder from Transformer has the worst result on the SSC task for real-life 24-class sound source recognition. This may be because the DeLTA dataset used in this paper is not large enough to allow the Transformer Encoder with MHA (Vaswani et al., 2017) to play its expected role. Previous work has also shown that Transformer-based models tend to perform less well on small datasets (Hou et al., 2022b). Finally, compared to these common baseline models, DCNN-CaF achieves better results on both SSC and ARP.
Next, we explore the performance of traditional A-weighted equivalent sound pressure level ($L_{Aeq}$)-based methods for annoyance prediction (ARP task). To this end, we extract the sound levels of audio clips in the DeLTA dataset and utilize them as features to predict annoyance ratings, as shown in Table IV. Note that sound level is a clip-level feature, while the proposed DCNN-CaF only accepts frame-level features as input. Therefore, the proposed DCNN-CaF, which cannot take coarse-grained clip-level $L_{Aeq}$-based sound level features as input, is omitted from Table IV.
Compared to the other models in Table IV, the support vector regression (SVR) achieves the best performance on the ARP task. This may be attributed to its robustness in handling outliers and its ability to effectively model nonlinear relationships (Izonin et al., 2021). In summary, the $L_{Aeq}$-based traditional approaches in Table IV show competitive performance on the ARP task, and their performance is close to that of the deep learning neural network-based methods in Table III.
To intuitively present the results of DCNN-CaF for annoyance rating prediction, Fig. 2 visualizes the gap between the annoyance ratings predicted by DCNN-CaF and the corresponding ground-truth annoyance ratings. The red points representing the predicted values and the blue points indicating the true labels in Fig. 2 mostly match well, indicating that the proposed model successfully regresses the annoyance ratings in the real-life soundscape.
For an in-depth analysis of the performance of the DCNN-CaF, Fig. 3 further visualizes the attention distribution from the cross-attention-based fusion module on some test samples. As described in Sec. II B, the number of time steps of each head in the multi-head attention (MHA) is 30, which comes from the 30 frames in the dimensions of the representations of Mel spectrograms and RMS features before MHA. Therefore, in DCNN-CaF, the dimension of the attention matrix of each head in MHA is 30 × 30. Figure 3 visualizes the distribution of attention for the same head index in MHA1 and MHA2. From the distribution of attention in the subgraphs of Fig. 3, it can be seen that MHA1, which uses $R_r$ to adjust $R_m$, and MHA2, which uses $R_m$ to adjust $R_r$, complement each other. For example, for sample #1 in Fig. 3, the attention of MHA1 in subfigure (1) is mainly distributed on the left side, while the attention of MHA2 in subfigure (2) is predominantly distributed on the right side. For the same sample, MHA1 and MHA2 with different attention perspectives match each other well. The results in Fig. 3 illustrate that the proposed DCNN-CaF model successfully pays different attention to the information at different locations of the two kinds of acoustic features based on the cross-attention module, which is beneficial for the fusion of these acoustic features. For more visualizations of the attention distributions of all 8 heads of MHA1 and MHA2, please see the project webpage.
C. Does the proposed model capture the relationships between sound sources and annoyance ratings? What are the relationships between sound sources, annoyance ratings, and sound levels?
To identify which of the 24 classes of sounds is most likely to cause annoyance, we first analyze the relationship between the sound identified by the model and the annoyance it predicts. Then, the predictions from the model are compared to the human classification. Specifically, we first use Spearman's rho (David and Mallows, 1961) to analyze the correlation between the probability of various sound sources predicted by the model and the corresponding annoyance ratings. Then, we calculate the distribution of sound sources at different annoyance ratings, and further verify the model's predictions based on human-annotated sound sources and annoyance rating labels.
Correlation analysis between the model's sound source classification and annoyance ratings. A Shapiro-Wilk test (with α = 0.05) (Hanusz et al., 2016) is performed before a correlation analysis of the model's predicted sound sources and annoyance ratings on the test set. The results of the Shapiro-Wilk test showed no evidence that the model's predictions conform to a normal distribution. Therefore, a non-parametric method, Spearman's rho (David and Mallows, 1961), is used for correlation analysis. The Spearman's rho correlation analysis in Table V shows that the recognition of some sounds is significantly correlated with the predicted annoyance rating. Specifically, the presence of sound sources such as Children, Water, Rail, Construction, Siren, Shouting, Bells, Motorcycle, Music, Car, General traffic, Screeching brakes, Horn, and Bus is positively correlated with the annoyance rating. The presence of sound sources such as Ventilation, Footsteps, Dog bark, Bird tweets, Rustling leaves, Non-identifiable, and Other is negatively correlated with the annoyance rating. As for sound sources such as Speech, Aircraft, and Laughter, there is no significant correlation between their presence and the annoyance rating. Further correlation analysis indicates that the sound source Bus shows the highest positive correlation with the annoyance rating, with a correlation coefficient of 0.712. In contrast, the sound source Rustling leaves shows the strongest negative correlation with the annoyance rating, with a coefficient of -0.731.
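Spearman's rho is the Pearson correlation computed on ranks. The plain-Python sketch below uses average ranks for ties and returns only the coefficient, not the significance level. The function names and the toy probabilities and ratings are illustrative and are not taken from Table V.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_rho(x, y):
    return pearson(average_ranks(x), average_ranks(y))

# Toy example: predicted probability of one source vs. predicted annoyance.
source_probability = [0.1, 0.2, 0.3, 0.4, 0.5]
predicted_annoyance = [2.0, 1.0, 4.0, 3.0, 5.0]
rho = spearman_rho(source_probability, predicted_annoyance)
```

Because only ranks enter the computation, the coefficient does not require the normality that the Shapiro-Wilk test failed to support.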
Verifying the model's predictions based on human-perceived manually annotated labels. Based on the correlation analysis of the model's predictions, some sound sources are more likely to cause annoyance than others. To investigate whether the model-based correlation analysis is consistent with the human-annotated labels, we calculate the distribution of sound sources at different annoyance rating levels based on the human-annotated labels to explore the correlations between the sound sources and the annoyance levels, as shown in Fig. 4.
Given that the mean value of the annoyance rating by humans on the test set is $\mu$, for the $i$th class of sound source $s_i$, the total number of occurrences in audio samples with an annoyance rating less than or equal to $\mu$ is $n_{i,l}$, and the total number of occurrences in audio samples with an annoyance rating greater than $\mu$ is $n_{i,h}$. Let $N_i = n_{i,l} + n_{i,h}$ be the total number of samples containing the sound source $s_i$. Then, the probability of the sound source occurring in the samples where annoyance is lower than or equal to $\mu$ is

$$P(x \le \mu \mid s_i) = \frac{n_{i,l}}{N_i},$$

where $x$ represents the annoyance rating for fragments containing the sound source $s_i$. Correspondingly, the probability of it occurring in samples with annoyance higher than $\mu$ is

$$P(x > \mu \mid s_i) = \frac{n_{i,h}}{N_i}.$$

Table V comprehensively shows the probability distribution of 24 classes of sound sources at different levels of annoyance rating according to human perception. The probability distribution of sound sources at different annoyance rating levels in Table V reveals that, according to people's real feelings, Rustling leaves sounds have the highest probability in the low annoyance rating level ($x \le \mu$), while Bus sounds have the highest probability in the high annoyance rating level ($x > \mu$). This successfully verifies the correctness of the above model-based correlation analysis between sound sources and annoyance ratings. Furthermore, Table V also shows that Children, Water, Rail, Construction, Shouting, Bells, Motorcycle, Music, Car, General traffic, Screeching brakes, Horn, and Bus are more likely to occur in the high annoyance rating level, while Footsteps, Dog bark, Bird tweet, Rustling leaves, Non-identifiable, and Other are more likely to occur in the low annoyance rating level. Speech, Aircraft, and Ventilation have a similar probability of occurring in the high and low annoyance levels, implying that they may be more prevalent in the soundscape of the test set. In short, both the model-based and the human-perception-based analyses showed similar trends regarding which sound sources are most strongly associated with annoyance levels. The consistency between the two analyses in identifying sound sources most strongly associated with annoyance ratings indicates that the proposed model performs well in predicting the relationships between sound sources and annoyance ratings.
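The per-source split described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the function name `annoyance_split_probabilities`, the binary label matrix layout (clips by sources), and the toy data are all hypothetical and not taken from the paper.

```python
import numpy as np


def annoyance_split_probabilities(labels, ratings):
    """Return (mu, P(x <= mu | s_i), P(x > mu | s_i)) for every source i.

    labels:  (n_clips, n_sources) binary matrix, 1 if the source is present.
    ratings: (n_clips,) human annoyance rating per clip.
    """
    labels = np.asarray(labels, dtype=bool)
    ratings = np.asarray(ratings, dtype=float)
    mu = ratings.mean()                  # mean annoyance over the set
    low = ratings <= mu                  # clips at or below the mean
    n_low = labels[low].sum(axis=0)      # n_{i,l}
    n_high = labels[~low].sum(axis=0)    # n_{i,h}
    total = n_low + n_high               # N_i
    with np.errstate(divide="ignore", invalid="ignore"):
        # Sources that never occur get NaN instead of a division error.
        p_low = np.where(total > 0, n_low / total, np.nan)
    return mu, p_low, 1.0 - p_low


# Toy example: 4 clips, 2 sources (assumed data).
ratings = [2.0, 4.0, 6.0, 8.0]
labels = [[1, 0],
          [1, 0],
          [1, 0],
          [0, 1]]
mu, p_low, p_high = annoyance_split_probabilities(labels, ratings)
print(mu, p_low, p_high)
```

In the toy data the first source occurs in two low-annoyance clips and one high-annoyance clip, and the second source occurs only in a high-annoyance clip.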
Correlations between sound level and annoyance rating. In addition to exploring the correlation between sound sources and annoyance ratings, we further analyze the correlation between the fragment-level A-weighted equivalent sound pressure level ($L_{Aeq}$) and human-perceived annoyance based on Kendall's Tau (often referred to as Kendall's Tau rank correlation coefficient). The corresponding result is $\tau = 0.42$, $p < 0.001$. That is, there is a significant correlation between sound level and annoyance rating in the DeLTA dataset. This is not unexpected, as the ARP baseline models based only on $L_{Aeq}$ in Table IV have some predictive power.
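For readers who want to reproduce this kind of check on their own data, the sketch below computes Kendall's Tau with SciPy. The sound levels and ratings are synthetic placeholders generated under an assumed linear trend plus noise; they are not DeLTA values, and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic placeholder data (assumption): levels in dBA and integer ratings.
rng = np.random.default_rng(1)
laeq_db = rng.uniform(50.0, 90.0, size=200)
noise = rng.normal(0.0, 1.5, size=200)
annoyance = np.clip(np.round(0.1 * laeq_db - 2.0 + noise), 1, 10)

# kendalltau uses the tau-b variant by default, which accounts for tied ratings.
tau, p_value = stats.kendalltau(laeq_db, annoyance)
print(f"tau = {tau:.2f}, p = {p_value:.3g}")
```

Kendall's Tau is a natural choice here because annoyance ratings are ordinal and contain many ties.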
Next, we delve into the relationship between the probability of the presence of sound sources predicted by the model and sound levels. Table V shows that there is no significant Pearson correlation between sound sources and sound levels in the DeLTA dataset. That is, the 24 different classes of sound sources in the DeLTA dataset cannot be identified solely by relying on fragment-level sound level information.
Case study. Notably, there is a significant positive correlation between Music and annoyance ratings in Table V. According to the statistical results in DeLTA (Mitchell et al., 2022), the average annoyance score for clips with Music sources is 4.01, while the average annoyance score for clips without Music sources is 3.29, which implies that in most cases the presence of Music in DeLTA increases annoyance rather than being relaxing. Previous studies also show that there are various types of annoying music in daily life (Trotta, 2020).
To analyze this in depth, we filter out all audio clips containing music in DeLTA, totaling 222 15-s clips, with an average sound level of 81.8 dBA. We then analyze the relationship between the sound levels and annoyance ratings of these 222 music clips. The results show that, for clips containing a music source, there is a significant positive correlation between sound level and human-perceived annoyance ($\tau = 0.18$, $p < 0.001$) in the DeLTA dataset. In summary, even though Music is not significantly correlated with the sound level, it is weakly positively correlated with the sound level in Table V. In addition, the overall sound level in the audio clips where the Music source exists is high, and the sound level is significantly related to annoyance, which may contribute to the significant positive correlation between Music and annoyance presented in Table V.
In addition to sound level, characteristics of music, such as its style or genre, give people different listening experiences. Previous research reveals the role of music in inciting or mitigating antisocial behaviour, and that certain music genres can soothe or agitate individuals. Additionally, the perception of annoyance may also be affected, depending on the choice of music. For example, genres featuring heavy rhythms, known for their potential to evoke angry emotions (Areni, 2003; Cowen et al., 2020), are often not favoured by listeners in an urban environment context. As highlighted by Landström et al. (1995), a key contributor to annoyance is the presence of a tonal component in the noise. Individuals exposed to intruding noise containing tonal elements tend to report higher levels of annoyance than those exposed to non-tonal noise. Furthermore, reported levels of annoyance tend to increase when the noise contains multiple tonal components. This observation suggests that tonal characteristics present in the sound source (and possibly also in the music) may be a contributing factor to the positive correlation between music and annoyance ratings in Table V.

To investigate the generalization performance of the proposed DCNN-CaF, we randomly add 20 classes of sound sources as noise to the test set in this paper to explore the model's performance in predicting annoyance ratings in soundscapes with added unknown sound sources. To add a variety of sound source samples to the 578 audio clips in the test set, we first use 20 sound sources from the public ESC-50 dataset (Piczak, 2015) as additional noise sources, each source containing 40 5-s audio samples. Then, we randomly add the 5-s noise source samples to the 15-s audio files in the test set, and each 15-s audio file is randomly assigned 1 to 3 5-s audio samples from the same noise source. During the synthesis process, the signal-to-noise ratio (SNR) defaults to 0.
In this way, we get 20 test sets containing different types of noise sources. Therefore, the total number of audio clips containing model-unknown noise is $20 \times 578 = 11\,560$, and the corresponding audio duration is about 48.2 h ($15\ \mathrm{s} \times 11\,560 = 173\,400\ \mathrm{s}$).
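The mixing step can be sketched as follows. This is only one plausible reading of the procedure, not the authors' synthesis code: it assumes the SNR is expressed in dB, that signal power is measured over the whole 15-s clip and noise power over the 5-s sample, and that signals are mono NumPy arrays. The function name `mix_at_snr`, the sample rate, and the random test signals are hypothetical.

```python
import numpy as np


def mix_at_snr(clip, noise, snr_db=0.0, offset=0):
    """Add `noise` to `clip` from sample `offset`, scaled to reach `snr_db`."""
    clip = np.asarray(clip, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if offset < 0 or offset + noise.size > clip.size:
        raise ValueError("noise segment must fit inside the clip")
    clip_power = np.mean(clip ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0.0:
        raise ValueError("noise segment is silent")
    # Choose the gain so that clip_power / (gain^2 * noise_power) = 10^(snr_db/10).
    gain = np.sqrt(clip_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    mixed = clip.copy()
    mixed[offset:offset + noise.size] += gain * noise
    return mixed, gain


# Stand-in signals (assumption): white noise instead of real recordings.
sr = 16000                                  # assumed sample rate
rng = np.random.default_rng(0)
clip = rng.standard_normal(15 * sr)         # 15-s test clip
noise = 0.1 * rng.standard_normal(5 * sr)   # 5-s added source
mixed, gain = mix_at_snr(clip, noise, snr_db=0.0, offset=5 * sr)

# Size of the synthetic evaluation set described in the text.
n_clips = 20 * 578
total_hours = n_clips * 15 / 3600
print(n_clips, round(total_hours, 1))
```

Placing between one and three such 5-s samples in a clip would amount to calling the function repeatedly with different offsets.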
Figure 5 shows the average human-annotated annoyance rating for sounds in the test set, the average annoyance rating predicted by the model for the test set without external noise added, and the average annoyance rating predicted by the model for the test set with each of the 20 classes of noise added. As shown in Fig. 5, without additional noise, the average annoyance rating predicted by the model for the 578 15-s audio clips in the test set is similar to the human-perceived annoyance rating. The standard deviation of our model's predictions (the yellow line on the bars in Fig. 5) is smaller than that of the corresponding human-perceived annoyance, which intuitively demonstrates that the proposed model achieves a similar effect on the test set as the annoyance ratings from human perception.
Adding the 20 types of sources at an SNR of 0 increases the sound level (i.e., RMS) and, therefore, would most probably increase the annoyance rating. If the model relied purely on the RMS value, as some other noise annoyance models do, it would predict the same increase for all sources. However, different annoyance levels are predicted depending on the source added, which intuitively corresponds better to human perception. Subtle sounds, such as Pouring water and Keyboard typing, are less likely to increase annoyance much. Compared to sound sources that are less likely to introduce human annoyance, such as Water drops and Clapping, the results in Fig. 5 illustrate that sound sources related to machines or engines increase annoyance ratings more strongly for the same increase in average sound level. Overall, the model's predictions in Fig. 5 are consistent with what can be expected, but their validity has not been confirmed by experiments with human participants.
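A short worked calculation makes the RMS argument concrete. It assumes that an SNR of 0 means 0 dB and that the added source is uncorrelated with the original recording; neither is stated explicitly in the text. The symbols $P_s$ (power of the original recording) and $P_n$ (power of the added source) are notation introduced here for illustration.

$$\mathrm{SNR} = 10 \log_{10} \frac{P_s}{P_n} = 0\ \mathrm{dB} \quad \Rightarrow \quad P_n = P_s$$

$$\Delta L = 10 \log_{10} \frac{P_s + P_n}{P_s} = 10 \log_{10} 2 \approx 3.0\ \mathrm{dB}$$

Under these assumptions, over the segment where the added source is active, a predictor driven only by RMS level would see the same increase of about 3 dB whichever of the 20 sources was added, so source-dependent differences in predicted annoyance must come from other information in the input.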
Figure 5 presents the performance of the DCNN-CaF model under artificially added unknown noise sources. However, the synthetic dataset in Fig. 5 is difficult to compare with real-life audio in terms of the realism and naturalness of the sound. To evaluate the performance of the proposed model on unknown data and, in particular, to investigate its performance in predicting positive effects of soundscape augmentation, we test the DCNN-CaF on a real-world acoustic dataset from a road traffic noise environment (denoted as RTNoise) (Coensel et al., 2011). The experiment in Coensel et al. (2011) shows that adding bird song and fountain sound can reduce human-perceived traffic noise loudness and increase perceived pleasantness. RTNoise contains recordings of freeway, major road, and minor road sounds and mixtures of these sounds with two bird choruses and two fountain sounds.
As shown in Fig. 6, whether on freeways, major roads, or minor roads, the predicted annoyance ratings of the audio clips with added bird sounds or fountain sounds are reduced to varying degrees compared to the source audio clips. Using the same sounds, tests with human listeners in Coensel et al. (2011) show a similar tendency for perceived traffic noise loudness for freeway sound, but this effect was not so prominent for major roads and minor roads. Human-rated pleasantness in the same experiment shows a trend opposite to the predicted trend in annoyance, where annoyance can be seen as the opposite of pleasantness. However, this prior experimental work also showed lower perceived loudness and higher pleasantness for the minor and major roads compared to freeway sound. The DCNN-CaF model does not seem to be able to distinguish between these types of sound. This could be caused by the poor control of noise levels in the online playback during the DeLTA data collection or by the shortness of the audio fragments (15 s), which does not allow the model to learn the difference between short car passages and continuous traffic. Other studies show that natural sounds, especially birdsong, can relax people (Van Renterghem, 2019), and that under similar noise exposure conditions, respondents in neighbourhoods with more bird songs and fountains reported lower levels of annoyance (Qu et al., 2023). The response of the proposed model in Fig. 6 to the sounds of birdsong and fountains in a real soundscape is consistent with this existing research.

V. CONCLUSION
Soundscape characterization involves identifying sound sources and assessing human-perceived emotional qualities along multiple dimensions. This can be a time-consuming and expensive process when relying solely on human listening tests and questionnaires. This paper investigates the feasibility of using artificial intelligence (AI) to perform soundscape characterization without human questionnaires. Predictive soundscape models based on measurable features, such as the model proposed here, can enable perception-focused soundscape assessment and design in an automated and distributed manner, beyond what traditional soundscape methods can achieve. This paper proposes the cross-attention-based DCNN-CaF, which uses two kinds of acoustic features to ensure the accuracy and reliability of the AI model and simultaneously performs both the sound source classification task and the annoyance rating prediction task. The proposed AI model is trained on the DeLTA dataset, which contains sound source labels and human-annotated labels along one of the emotional dimensions of perception: annoyance.
Our experimental analysis demonstrates the following findings: (1) The proposed DCNN-CaF with dual-input branches using Mel spectrograms and loudness-related RMS features outperforms models using only one of these features. (2) On the sound source classification and annoyance rating prediction tasks, the DCNN-CaF with the attention-based fusion of two features outperforms DNN, CNN, and CNN-Transformer, which concatenate two features directly. In addition, attention visualization in the DCNN-CaF model shows that the cross-attention-based fusion module successfully pays different attention to the information of different acoustic features, which is beneficial for the fusion of these acoustic features. (3) Correlation analysis shows that the model successfully predicts the relationships between various sound sources and annoyance ratings, and these predicted relationships are consistent with those perceived by humans in the soundscape. (4) Generalization tests show that the model's ARP in the presence of model-unknown sources is consistent with expert expectations and can explain previous findings from the literature on soundscape augmentation. Future work involves extending the soundscape appraisal with other dimensions and taking into account more practical factors, such as participants' hearing and cultural and linguistic differences, to expand the training dataset to cover more acoustic scenarios. Furthermore, to improve the interpretability of the proposed model, follow-up work will try to visualize the learned weights of the model through heatmap analysis to clarify which neurons play a more critical role in learning, helping to explain the decision-making process of the model.

FIG. 4. (Color online) Distribution of the number of samples from different sources at low and high annoyance rating levels, that is, $n_{i,l}$ and $n_{i,h}$ (shown as Nl and Nh), and the corresponding $P(x \le \mu \mid s_i)$ curves.

D. How does the proposed model respond to adding unknown sounds to the soundscape?

FIG. 5. (Color online) The mean and standard deviation of the predicted annoyance rating under different model-unknown noise sources.

FIG. 6. (Color online) The performance of the proposed DCNN-CaF on real-life audio recordings.

TABLE I. Ablation study on the acoustic features.

TABLE II. Performance of models trained only for the SSC task.

TABLE IV. Comparison of sound level-based approaches on the ARP task.
FIG. 2. (Color online) Scatter plot of annoyance ratings of model predictions and human-annotated labels on some samples in the test set. Black vertical lines indicate gaps between two points.

TABLE V. Spearman's rho correlation coefficients on DeLTA.
a Statistical significance at the 0.01 level.