Soundscapes are an important part of urban landscapes and play a key role in the health and well-being of citizens. However, predicting soundscapes over a large area at fine resolution remains a great challenge, and traditional methods are time-consuming, requiring laborious large-scale noise measurement work. Therefore, this study utilized machine learning algorithms and street-view images to estimate a large-area urban soundscape. First, a computer vision method was applied to extract landscape visual feature indicators from large-area streetscape images. Second, 15 collected soundscape indicators were correlated with the landscape visual indicators to construct a prediction model, which was applied to estimate large-area urban soundscapes. Empirical evidence from 98 000 street-view images of Fuzhou City indicated that street-view images can be used to predict street soundscapes, validating the effectiveness of machine learning algorithms in soundscape prediction.
I. INTRODUCTION
Urban streets are an important part of urban public space, not only as transportation corridors but also as an important means of strengthening urban social ties, promoting social interaction, and improving the quality of life of urban residents (Appleyard and Lintell, 1972; Hassen and Kaufman, 2016). The urban street acoustic environment is a key influence on the streetscape experience, impacting quality of life and reflecting the city's culture and environment (Skånberg and Öhrström, 2002; Goines and Hagler, 2007; Sun et al., 2019). Studies have shown that unpleasant sounds can lead to cardiovascular disease, sleep problems, irritability, and cognitive impairment in children (Meecham and Smith, 1977; Jenkins et al., 1981; Daiber et al., 2019). Conventional acoustic meters generally measure the physical properties of sound; however, the human perception of sound and its impact on health depends not only on the physical properties of sound but also on people's subjective perception and mental state (Nilsson and Berglund, 2006). ISO 12913-1 (ISO, 2014; Brooks, 2016) emphasizes that the environment plays a key role in soundscape assessment and design, highlighting the importance of considering the human perception of the environment rather than physical measurements alone. Therefore, improving the perceived quality of a soundscape is important for improving health (Herranz-Pascual et al., 2010; Hasegawa and Lau, 2022).
How people perceive the acoustic environment in soundscapes is studied from three main perspectives: (1) analyzing recordings using physical parameters to objectively obtain soundscape information (Barber et al., 2011), (2) obtaining subjective soundscape information through questionnaires, interviews, and field observations (Liu et al., 2013a,b), and (3) combining subjective and objective methods (Jeon et al., 2010). Combining emotional responses to specific sound scenes with objective acoustic parameter analysis allows comprehensive and accurate sound scene information to be obtained. Various methods have been proposed to measure and evaluate soundscapes to improve the quality of the urban soundscape, including arranging sound level meters and noise sensors in specific locations to provide accurate data; however, this approach has several limitations (Verma et al., 2019; Gasco et al., 2020). First, the cost of purchasing and installing sensors is high. Second, sensors can only cover a limited area. To overcome these problems, researchers have developed inexpensive and large-scale soundscape assessment methods that utilize new data sources such as smartphones and social media (Gasco et al., 2017; Gasco et al., 2019). While these methods have the advantages of being real-time, large-scale, low-cost, and individualized, smartphone and social media data may be subject to sampling bias because not everyone uses a smartphone or social media and usage habits differ.
Urban street imagery creates opportunities to advance multiscale urban research owing to its broad coverage and fine spatial sampling. It has been used to quantify urban greenery (Long and Liu, 2017; Wu et al., 2020; Hawes et al., 2022), urban climate (Ignatius et al., 2022), tourist behavior (Guo and Loo, 2013; Ning et al., 2022), building characteristics and distribution (Kelly et al., 2013; Nguyen et al., 2019; Keralis et al., 2020), traffic (Wang et al., 2022), road accessibility (Ewing and Cervero, 2010; Hara et al., 2013), safety (Song et al., 2020; Zhanjun et al., 2022), knowledge of crime (Perkins et al., 1992; McKee et al., 2017; Branas et al., 2018), and urban perception (Dubey et al., 2016; Kruse et al., 2021; Guan et al., 2022). Computer vision (CV) techniques and algorithms play important roles in street-view image processing and analysis. Semantic segmentation is a core deep-learning task in CV that is widely used for urban feature extraction: a convolutional network assigns a class label to every pixel of a two-dimensional image, enabling the segmentation and classification of different objects and regions in the image. Commonly used segmentation models include SegNet (Cambridge, UK), DeepLab (Google Inc., Mountain View, CA), and networks built on VGGNet backbones (Visual Geometry Group, Oxford, UK); YOLO (You Only Look Once, Darknet, Joseph Redmon, Seattle, WA) is widely used for the related task of object detection. CV models, such as object detection and image classification, can also efficiently extract high-level features from images (Verma et al., 2020), and studies have used detection and classification to automatically identify hazardous scenes related to non-motorized transportation and their immediate causes from street-view images (SVI) (Wang et al., 2022). Furthermore, urban features extracted from SVI by CV models can efficiently estimate hidden community socioeconomic conditions, such as travel behavior, poverty status, health outcomes and behaviors, and crime, thus providing the basis for this project to predict the urban soundscape through street-view imagery (Fan et al., 2023).
Human visual and auditory perceptions are inextricably linked and streetscape perception is influenced not only by visual components but also by acoustic components (Einhäuser et al., 2020; Verma et al., 2020). Previous research has demonstrated a strong correlation between soundscapes and visual aesthetics (Schroeder and Anderson, 1984; Carles et al., 1999; Meng and Kang, 2015; Meng et al., 2017; Salem et al., 2018). For example, Carles et al. (1999) used 36 sounds and images to study the interaction between visual and auditory stimuli, and their results suggested that consistency (or coherence) between sounds and images affects landscape preferences. These studies mainly explored the correlation between sounds and images; however, studies on predicting and quantifying soundscape metrics are lacking.
Therefore, this study investigated how streetscape images can be utilized for soundscape assessments and predictions, focusing on high-resolution quantification and prediction at the city level. Specifically, this study aimed to determine (1) methods for acquiring soundscape metrics at a high resolution at the city level, and (2) the relationships between visual landscape elements and soundscape metrics in streetscape images. To achieve this, we extracted pixel features, semantic segmentation results, and object detection results from urban streetscape images using CV and deep-learning models and constructed 15 soundscape indicators based on sound intensity, soundscape quality, sound source, and human perception. Machine learning algorithms were then trained on the soundscape ratings of street images from 45 sampling points, the best algorithm was selected, and the soundscape indicators of street images at 24 636 sampling points in Fuzhou were predicted.
Our work enables soundscape visualization, which helps us understand the distribution of soundscapes, reveals the relationship between the urban visual environment and the soundscape, and facilitates urban planning and design, environmental improvement, health and well-being, city marketing and attractiveness, and community participation and decision making. These benefits contribute to the creation of livable and sustainable cities that enhance the quality of life of residents and the competitiveness of cities.
II. METHODOLOGY
The integrated framework proposed in this study comprised three main steps (Fig. 1). First, visual features of street panorama images were extracted using CV algorithms and deep-learning models at three levels: pixel-, object-, and semantic-level features. Second, street soundscape indicators were constructed from four aspects: sound intensity, sound quality, sound source, and perceived emotion. Third, random forest (RF) regression was used to construct a soundscape prediction model to measure street soundscapes created by humans at the city level.
A. Soundscape indicators
A soundscape is a conceptual framework for acoustic- or sound-related issues involving the physical properties of sound, spatial distribution, environmental factors, and perceptual and emotional responses to human hearing (Hasegawa and Lau, 2022). As shown in Fig. 2, to construct a soundscape indicator system spanning from the sound environment to the human emotional response for evaluating the urban soundscape, 15 perceptual indicators were identified from the literature, covering four main aspects: sound intensity, sound source, human perception, and sound quality (Axelsson et al., 2014; Liu et al., 2019). The sound types were classified according to Zhao (2023), Ryu (2018), and Schafer (1993). This study categorized sound sources into five subcategories: traffic noise, human sounds, natural sounds, mechanical noise, and musical noise. Musical noise generally refers to the sound from music or music-related activities, such as music from store promotional activities. The main human emotional response is the evaluation of overall sound quality, and the secondary response is the perceived emotion toward different sounds, categorized into eight subcategories: "pleasant," "chaotic," "exciting," "uneventful," "calm," "annoying," "eventful," and "monotonous" (Axelsson et al., 2010). As shown in Table I, we established a soundscape indicator system with four categories and 15 subcategories.
Table I. Soundscape perception survey questions, indicators, and rating scales.

| Question | Indicator | Scale (from 1 to 5) |
|---|---|---|
| 1. Overall, how do you feel about the overall sound intensity (noisy or quiet) from the audio? | Sound intensity | [Very noisy, Noisy, …, Quiet, Very quiet] |
| 2. Overall, how do you feel about the overall sound quality (good or bad) from the audio? | Sound quality | [Very bad, Bad, …, Good, Very good] |
| 3. How much do you currently experience the following sound types in the above scene? | Traffic noise, Human sounds, Natural sound, Mechanical noise, Musical noise | [No sensation at all, Don't feel dominant, …, Dominant, Completely dominant] |
| 4. To what extent do you agree or disagree with the consistency of the following feelings about the sound environment with the above scenario? | Pleasant, Chaotic, Exciting, Uneventful, Calm, Annoying, Eventful, Monotonous | [Completely disagree, Disagree, …, Agree, Completely agree] |
The street soundscape perception survey was designed to collect soundscape metrics. The acquired images and recorded audio were used to score each street scene on every metric. The study adopted a pilot experimental design combining an on-site (offline) survey and an off-site (online) survey. The on-site survey was conducted at the designated survey points: 20 members of the public were invited as subjects to rate the 15 indicators for the audio and visual scenes in the field. The off-site survey was conducted in a laboratory environment: panoramic photos and audio were first recorded at each survey point, and 200 volunteers were recruited as subjects. The off-site procedure comprised the following steps: (1) a slide was made from the isometrically projected panoramic image to show the scene to the subjects, (2) the sound recorded on site was played, with a total display time of about 3 min, and (3) the subjects rated the scene on the 15 soundscape indicators.
This design allowed us to compare on-site and off-site assessments while obtaining a larger sample through the off-site survey. Presenting off-site participants with panoramic images and audio provides a more immersive experience, thereby increasing the accuracy and reliability of the assessments. In total, 220 people participated in the survey; their demographic structure is shown in Table II.
Table II. Demographic structure of the survey participants (N = 220).

| Feature | Options | Quantity | Ratio (%) |
|---|---|---|---|
| Gender | Male | 116 | 53 |
| | Female | 104 | 47 |
| Age | Under 20 years old | 12 | 5 |
| | 21–30 years old | 49 | 22.2 |
| | 31–40 years old | 70 | 31.8 |
| | 41–50 years old | 63 | 28.6 |
| | Over 50 years old | 26 | 12.4 |
| Educational background | Primary school and below | 5 | 2.3 |
| | Middle school and high school | 54 | 24.5 |
| | Diploma or undergraduate | 80 | 36.4 |
| | Master's degree or above | 81 | 36.8 |
| Occupation | Farmer | 27 | 12.3 |
| | Individual operators | 47 | 21.4 |
| | Government staff | 56 | 25.5 |
| | Landscape industry practitioners | 68 | 30.9 |
| | Other | 22 | 9.9 |
B. Visual characteristics of streetscape images
SVI provides a unique view of ground-level urban landscapes with extensive coverage and fine spatial sampling and has been widely used in urban built environment studies at multiple scales (Biljecki and Ito, 2021). These images can be labeled according to different research purposes and CV techniques and utilized to construct visual features at the pixel, object, and semantic levels. Pixel-level features characterize the overall impression of the SVI (e.g., brightness and saturation) and influence the emotional perception; object-level visual features refer to operations, such as detecting, recognizing, and tracking objects in an image, for example, cars or people; and semantic-level visual features refer to the semantic segmentation and understanding of an image used to extract the semantic information of different regions in an image. Examples include the proportions of vegetation, sky, and roads.
Thus, pixel-, object-, and semantic-level visual features were extracted (Table III; see the sketch following the table). Pixel-level features were extracted with the OpenCV (Intel Corporation, Santa Clara, CA) library by converting the images from the red-green-blue (RGB) color space to the hue, saturation, and value color space and calculating histograms of the different color channels to obtain the color features of each image. Object-level features were extracted by identifying and counting elements of 91 object types (e.g., buses, people, trucks) using the YOLOv5 algorithm (Ultralytics LLC, San Diego, CA), a deep-learning object detection technique, with the COCO (Common Objects in Context, Microsoft Corporation, Redmond, WA) dataset. Semantic-level features were extracted using the FCN-8s model (Fully Convolutional Network, Berkeley, CA) trained on the Cityscapes dataset (Daimler AG, Stuttgart, Germany), which categorized the SVI data according to 18 types of labels (including sky, vegetation, roads, and buildings). This study explored the relationship between street-scene visual features and human perception, aiming to identify the key visual features that affect human perception.
Table III. Models, datasets, and features used to extract visual characteristics from the SVI.

| Level | Model/Library | Dataset | Features |
|---|---|---|---|
| Pixel-level | OpenCV | — | Hue, saturation, lightness, hue_std^a, saturation_std, lightness_std |
| Object-level | YOLOv5 | COCO | 91 object types (person, bus, truck, motorcycle, etc.) |
| Semantic-level | FCN-8s | Cityscapes | 18 categories (building, sky, road, etc.) |
^a Standard deviation (std).
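To make this extraction pipeline concrete, the following is a minimal Python sketch of the pixel- and object-level steps, assuming OpenCV, PyTorch, and the pretrained YOLOv5 model loaded via torch.hub; the image path and model size ("yolov5s") are illustrative, and the FCN-8s semantic-segmentation step is omitted for brevity.

```python
import cv2
import torch

def pixel_features(path: str) -> dict:
    """Mean/std of hue, saturation, and lightness (the HSV value channel).
    Note: OpenCV stores hue in [0, 179] for 8-bit images."""
    hsv = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    return {
        "hue_mean": float(h.mean()), "hue_std": float(h.std()),
        "saturation_mean": float(s.mean()), "saturation_std": float(s.std()),
        "lightness_mean": float(v.mean()), "lightness_std": float(v.std()),
    }

def object_counts(path: str) -> dict:
    """Counts of COCO object classes detected by a pretrained YOLOv5 model."""
    model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
    detections = model(path).pandas().xyxy[0]           # one row per detection
    return detections["name"].value_counts().to_dict()  # e.g., {"car": 7, "person": 2}

features = {**pixel_features("panorama.jpg"), **object_counts("panorama.jpg")}
```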
C. Soundscape prediction model
The prediction of each soundscape metric was treated as a supervised regression task. RF is an ensemble learning method that performs classification and regression tasks by combining multiple decision trees. Its main feature is that each decision tree is trained on a randomly selected subset of samples and features.
The basic steps of the RF were as follows: (1) a decision tree was constructed by randomly selecting a portion of the samples from the training set (sampling with replacement, i.e., bootstrap sampling), (2) for each decision tree, a subset of features was randomly selected for training, (3) steps 1 and 2 were repeated to construct multiple decision trees, (4) for a classification task, each decision tree voted on the prediction, whereas for a regression task, the predictions of the decision trees were averaged, and (5) the final prediction was synthesized from the predictions of the multiple decision trees (Fig. 3).
The prediction accuracy of an RF model is mainly affected by the number and depth of the regression trees and by the training samples. In general, prediction accuracy improves as the number of trees increases; however, excessive tree depth may lead to overfitting, which reduces accuracy. An imbalance in the number of samples across categories in the training data may cause the model to perform well on the larger categories and poorly on the smaller ones. We used the 115 street visual features as input variables and the corresponding soundscape metrics as targets, yielding models for the 15 different soundscape metrics.
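As an illustration, the following is a minimal scikit-learn sketch of this training step, assuming a DataFrame `features` with the 115 visual features, a DataFrame `ratings` with the 15 indicator scores for the sampling points, and a `city_features` matrix for the city-wide image points; all variable names and hyperparameter values are illustrative rather than the exact settings used in this study.

```python
from sklearn.ensemble import RandomForestRegressor

models = {}
for indicator in ratings.columns:       # e.g., "sound_intensity", "pleasant", ...
    rf = RandomForestRegressor(
        n_estimators=500,  # more trees generally help, at higher compute cost
        max_depth=None,    # cap this to guard against overfitting deep trees
        random_state=42,
    )
    rf.fit(features, ratings[indicator])
    models[indicator] = rf

# Rank which visual features drive a given indicator
importances = models["sound_intensity"].feature_importances_

# Predict the indicator for the 24 636 city-wide image points
city_pred = models["sound_intensity"].predict(city_features)
```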
D. Entropy weight method (EWM)
The EWM is a purely objective evaluation method. It follows the principle that the greater the dispersion of an indicator, the lower its information entropy and the greater the amount of information it carries. If all values of an indicator are equal, the indicator contributes no information to the comprehensive evaluation. In this study, the weights of the 15 sound indicators of the SVI were determined objectively using the EWM, calculated as follows.
1. Dimensionless processing of data
When indicators have inconsistent measurement units and directions, the data must first be standardized. To avoid a meaningless logarithm when calculating the entropy, a small positive constant (e.g., 0.01, 0.001, or 0.0001) can be added to each zero value:

$$x'_{ij} = \frac{x_{ij} - \min_i x_{ij}}{\max_i x_{ij} - \min_i x_{ij}} \ \ \text{(positive indicators)}, \qquad x'_{ij} = \frac{\max_i x_{ij} - x_{ij}}{\max_i x_{ij} - \min_i x_{ij}} \ \ \text{(negative indicators)}.$$

There are 15 indicators in this article and 45 survey points, each rated by 220 participants, giving 9900 samples. In the formula, $x_{ij}$ is the value of the $j$th indicator for the $i$th sample, with $i = 1, 2, \ldots, 9900$ and $j = 1, 2, \ldots, 15$; $\min_i x_{ij}$ and $\max_i x_{ij}$ refer to the minimum and maximum values of the $j$th indicator across all samples $i$.
2. Calculate the proportion of the ith sample value under the jth indicator

$$p_{ij} = \frac{x'_{ij}}{\sum_{i=1}^{n} x'_{ij}}.$$
3. Calculate entropy and coefficient of difference

$$e_j = -\frac{1}{\ln n} \sum_{i=1}^{n} p_{ij} \ln p_{ij}, \qquad d_j = 1 - e_j,$$

where n is the number of samples.
4. Calculate entropy weight

$$w_j = \frac{d_j}{\sum_{j=1}^{15} d_j}.$$
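The four steps above translate directly into code; the following is a minimal NumPy sketch, assuming a ratings matrix of shape (9900, 15) in which all indicators are positively oriented and each column has nonzero spread.

```python
import numpy as np

def entropy_weights(X: np.ndarray, eps: float = 1e-4) -> np.ndarray:
    """Entropy weights for an (n_samples, n_indicators) ratings matrix."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Step 1: min-max normalization; shift by a small constant so log(0) never occurs
    Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)) + eps
    # Step 2: proportion of each sample under each indicator
    P = Xn / Xn.sum(axis=0)
    # Step 3: entropy e_j and coefficient of difference d_j = 1 - e_j
    e = -(P * np.log(P)).sum(axis=0) / np.log(n)
    d = 1.0 - e
    # Step 4: entropy weights, normalized to sum to 1
    return d / d.sum()

# weights = entropy_weights(ratings)   # ratings: shape (9900, 15)
```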
III. STUDY AREA AND DATASET
A. Study area
Fuzhou City is located on the southeast coast of China, at longitude 119°17′18″E and latitude 26°04′08″N, and is the capital of Fujian Province (Fig. 4). The climate is a subtropical humid maritime climate with an average annual temperature of 16 °C–20 °C and average annual precipitation of 900–1200 mm. The study area is located in the central city of Fuzhou and covers approximately 313 km2. The central urban area of Fuzhou is the most densely populated area in Fujian Province and one of its most economically developed. Traffic noise is the main noise source, and the Fuzhou dialect is a distinctive human sound; it is therefore of great significance to explore the visual elements and spatial patterns of the regional soundscape to promote the sustainable development of the city.
B. Street-view images
The Baidu Street View map was selected as the source of street-view data. The Generate Points Along the Line tool in the geographic information system (GIS) was used to generate uniformly distributed observation points at 100 m intervals. Then, the geographic coordinates of the observation points were used as a reference to obtain the Baidu SVI at different locations through the Baidu Map Application Programming Interface. Finally, panoramic images were obtained from 24 636 locations in the main city of Fuzhou. These images were used for the investigation and analysis using CV.
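The point-generation step can be reproduced programmatically; the following is a minimal sketch using GeoPandas and Shapely, assuming a road-centerline shapefile in a projected (metric) coordinate system with single-part line geometries. The input path is hypothetical, and the subsequent Baidu Map API request is omitted.

```python
import geopandas as gpd

roads = gpd.read_file("fuzhou_roads.shp")    # hypothetical centerline file, metric CRS
points = []
for line in roads.geometry:                  # assumes single-part LineStrings
    d = 0.0
    while d <= line.length:
        points.append(line.interpolate(d))   # point at distance d along the line
        d += 100.0                           # 100 m spacing
pts = gpd.GeoDataFrame(geometry=points, crs=roads.crs).to_crs(epsg=4326)
# pts.geometry.x / pts.geometry.y give the longitudes/latitudes used to request
# panoramas from the street-view API.
```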
C. Audio data
The audio data were collected from the main urban area of Fuzhou City, covering different types of environments, such as commercial areas, residential areas, parks, and transportation hubs, to ensure that the audio data reflected the sound characteristics of different areas of the city, for a total of 45 sample points. The collected data mainly included 3 min videos, 4–10 panoramic images, and 3 min recordings of changes in sound levels. The audio and images were used for the online questionnaire, and the sound level data were used to validate the RF model. Sound level meters (UT353BT, Zhongshan Xinyi Electronic Instrument Co., Guangdong, China) were used to measure sound levels, and a smartphone (iPhone 12, Apple Inc., Cupertino, CA) was used to capture videos and panoramic images. Notably, the UT353BT sound level meter measured the average sound intensity at each sampling point, i.e., the equivalent continuous sound level (Leq). These objective measurements were compared with participants' subjectively perceived sound intensity. Although the term "sound intensity" was used in the questionnaire for easier understanding by participants, the actual physical quantity measured is the equivalent continuous sound level.
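For reference, the equivalent continuous sound level over an interval is Leq = 10 log10(mean(p²)/p0²) with p0 = 20 μPa. The following is a minimal sketch of computing it from a recording, assuming a mono WAV file whose samples have already been calibrated to sound pressure in pascals; the file name and calibration are illustrative, as they depend on the recording chain.

```python
import numpy as np
from scipy.io import wavfile

P_REF = 2e-5                                   # reference pressure: 20 micropascals
rate, p = wavfile.read("site_recording.wav")   # hypothetical 3 min mono recording
p = p.astype(float)                            # assumed already scaled to pascals
leq = 10.0 * np.log10(np.mean(p**2) / P_REF**2)  # Leq over the whole clip, in dB
print(f"Leq = {leq:.1f} dB")
```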
IV. RESULTS AND ANALYSIS
A. Soundscape prediction results
1. Model comparison
To demonstrate the superiority of the RF model, we compared it with k-nearest neighbors (KNN) regression, back propagation (BP) neural network regression, and support vector machine (SVM) regression. The dataset was constructed using SVI from Fuzhou City, with 70% used for training and 30% for testing. The mean absolute percentage error (MAPE) and coefficient of determination (R2) were used to evaluate model performance. Taking sound intensity as an example, we used the Leq averaged over 3 min of on-site sound measurements for correlation analysis with the model predictions. As shown in Table IV, the MAPE for Fuzhou City ranged from 3.443 to 7.759, and the R2 values were between 0.421 and 0.776. The KNN model performed worst on the dataset, while the RF model exhibited the best performance on both the MAPE and R2 metrics. Consequently, we selected the RF model as our final predictive model.
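The comparison can be sketched as follows with scikit-learn, assuming a feature matrix X and target y for one soundscape indicator; MLPRegressor stands in for the BP neural network, and all model settings are defaults rather than the tuned configurations behind Table IV.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor   # stands in for the BP network
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error, r2_score

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
models = {
    "KNN": KNeighborsRegressor(),
    "BP":  MLPRegressor(max_iter=2000, random_state=0),
    "SVM": SVR(),
    "RF":  RandomForestRegressor(random_state=0),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: MAPE={mean_absolute_percentage_error(y_te, pred):.3f}, "
          f"R2={r2_score(y_te, pred):.3f}")
```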
2. Assessment of forecast results
The MAPE and R2 are commonly used to assess predicted outcomes in RFs. The K-fold cross-validation method was used to evaluate model performance; specifically, tenfold cross-validation was used, in which the dataset was randomly divided into ten subsets and ten independent rounds of training and testing were performed.
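A minimal sketch of this evaluation, assuming X and y as above and using scikit-learn's built-in scorers:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=10, shuffle=True, random_state=0)   # tenfold cross-validation
r2 = cross_val_score(RandomForestRegressor(random_state=0), X, y,
                     cv=cv, scoring="r2")
mape = -cross_val_score(RandomForestRegressor(random_state=0), X, y,
                        cv=cv, scoring="neg_mean_absolute_percentage_error")
print("median R2:", np.median(r2), "median MAPE:", np.median(mape))
```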
As shown in Figs. 5 and 6, the MAPE values of different soundscape metrics varied considerably. The median MAPE for musical noise is 19.43, which is the lowest among the metrics, while the median MAPE for chaos is 31.19, one of the highest. This difference may be attributed to the nature of these soundscape attributes. Musical noise, being a more specific and potentially less frequent occurrence, might be easier for participants to identify and rate consistently, leading to better predictive accuracy and lower MAPE. Chaos, on the other hand, is a more subjective and complex attribute that could vary widely in interpretation among participants, resulting in higher prediction errors and MAPE. R2 also varied by soundscape metric. Higher R2 median values were obtained for sound intensity and traffic noise (0.63 and 0.58, respectively), indicating better model fit for these attributes. In contrast, musical noise and calmness had lower R2 median values (0.26 and 0.29 respectively), suggesting these attributes may be more challenging to predict accurately.
These results indicate that individuals exhibit varying degrees of sensitivity to different sound attributes, with higher sensitivity observed for sound intensity, traffic noise, and chaotic sounds. Conversely, attributes such as "calm" and "monotonous" are perceived with relatively lower sensitivity. These findings align with our expectations and echo the observations of previous research (Axelsson et al., 2010).
3. Validation of forecast results
To validate the accuracy of predicting street soundscapes from the SVI, we examined the correlation between the predicted sound intensity and field measurements, where the predicted soundscape metrics represent the acoustic environment as perceived by people from the SVI. Pixel-, object-, and semantic-level visual features were extracted from the street-scene and panoramic images captured on site with a cell phone and used to train the model to obtain the predicted sound intensity. The correlation between the subjective sound intensity collected from participants and the actual Leq measurements averaged over 3 min is shown in Fig. 7, with an R2 of 0.5471. According to previous research (Lionello et al., 2020), the use of the SVI to assess the soundscape in the main urban area of Fuzhou City was therefore reliable. Some of the field sound measurements differed significantly from the predicted values, possibly because a sound level measured over only 3 min is not always a representative proxy for subjective sound intensity.
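A minimal sketch of this validation step, assuming arrays `leq_measured` (the 3 min Leq at each sample point) and `intensity_pred` (the model's predicted sound intensity at the same points); the variable names are illustrative.

```python
from scipy import stats

# leq_measured: field Leq per sample point; intensity_pred: model output
slope, intercept, r, p_value, stderr = stats.linregress(leq_measured, intensity_pred)
print(f"R2 = {r**2:.4f} (the study reports R2 = 0.5471), p = {p_value:.3g}")
```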
B. Soundscape map and spatial analysis
The EWM revealed that the entropy weight of sound intensity was significantly higher than that of the other indicators, followed by sound quality (Fig. 8). The entropy weights of indicators such as "uneventful," "music noise," and "calm" were lower. These results suggest that, within our specific dataset, sound intensity and quality showed greater variability across the sampled locations than the other indicators. It is important to note that these weights reflect the distribution and variability of the indicators in our samples rather than directly indicating people's sensitivity to these factors. As shown in Fig. 8, the first seven indicators have a cumulative weight of 64% and thus explain most of the variability in the data. Therefore, the study focuses on the first seven indicators: sound intensity, sound quality, traffic noise, natural sound, chaotic, eventful, and exciting.
1. Sound intensity distribution map
Sound intensity refers to the amount of energy in a sound and is one of the most important indicators for assessing sound. It affects the quality and clarity of sound and is related to hearing protection and environmental noise control. The sound intensity distribution in the main urban area of Fuzhou City is shown in Fig. 9. Overall, the sound intensity distribution was low in the center and high in the north and south. Most of the high-intensity areas were concentrated along highways and development zones, while the low-intensity areas were mostly concentrated in parks and along the Wulong and Min rivers, which was consistent with our expectations. Specifically:

- The areas with higher sound intensity included highways and construction sites, such as infrastructure construction in development zones and the Third Ring Expressway.
- Surprisingly, the sound intensity in the business district located in Dongjiekou, Fuzhou City, was lower than expected. This may be because these areas are also well vegetated, as shown in the corresponding SVI, which may attenuate the perception of sound intensity. This is consistent with the findings of Van Renterghem (2019), who suggested that vegetation can strongly improve environmental noise perception.
- Noise levels in residential areas, such as Huangshan New Town, were at low to medium levels.
- Low-intensity areas were identified as parks with more vegetation and mountain forests.
- In general, the distribution of sound intensity was highly correlated with urban functions, which is consistent with the study by Monazzam et al. (2015), who revealed that noise levels vary across land uses.
2. Typical soundscape indicator distribution
Further exploration of the sound quality, traffic noise, natural sound, chaotic, eventful, and exciting metrics is presented in Fig. 10. The areas with better sound quality were mainly located near parks and scenic areas, such as West Lake Park, Minjiang Park, and the Gushan Scenic Area. The areas with poor sound quality were mainly concentrated in suburban areas with more highways and construction sites. Natural sound values were usually higher in central park areas with more vegetation. Traffic noise had a distribution similar to that of the chaotic and eventful metrics, with higher values concentrated near freeways and downtown attractions. Surprisingly, developed areas, such as Sanfangqixiang, Dongjiekou, and Wanda, had higher vibrancy values despite being busy and having more traffic noise. This is because the developed areas in the main urban area of Fuzhou City are greener and more orderly, providing a more pleasant environment for residents.
C. Relationship between soundscape indicators and visual features
A multiple regression model was used to explore the contribution of the visual features to the soundscape indicators. To improve the interpretability of the model and minimize variable redundancy, this study aggregated the set of 115 visual features into 19 variables (Table V). A stepwise backward regression method was used to select the variables (a code sketch follows Table V). The process comprised (1) selecting a significance level (e.g., 0.05), with variables whose p-values fell below this level retained in the model, (2) removing the variable with the largest p-value from the model and refitting, (3) evaluating the fit of the reduced model using a statistical index and, if the assessment was unsatisfactory, returning to step 2 to remove the variable with the next-largest p-value, and (4) repeating steps 2 and 3 until the end condition was met, i.e., the p-values of all remaining variables were below the significance level.
Table V. Definitions of the 19 aggregated visual feature variables.

| Visual features | Variables | Definitions |
|---|---|---|
| Pixel-level | lightness_mean | The mean value of the lightness dimension in the SVI |
| | saturation_mean | The mean value of the saturation dimension in the SVI |
| | hue_mean | The mean value of the hue dimension in the SVI |
| | lightness_std^a | The standard deviation of the lightness dimension in the SVI |
| | saturation_std | The standard deviation of the saturation dimension in the SVI |
| | hue_std | The standard deviation of the hue dimension in the SVI |
| Object-level | person_object | Total number of people in the SVI |
| | bicycle_object | Total number of bicycles in the SVI |
| | car_object | Total number of cars in the SVI |
| | motorcycle_object | Total number of motorcycles in the SVI |
| | bus_object | Total number of buses in the SVI |
| | truck_object | Total number of trucks in the SVI |
| | other_object | Total number of other remaining objects in the COCO dataset in the SVI |
| Semantic-level | sky_semantic | Percentage of sky pixels in the SVI |
| | nature_semantic | Percentage of vegetation pixels in the SVI |
| | human_semantic | Percentage of human pixels in the SVI |
| | vehicle_semantic | Percentage of vehicle pixels in the SVI |
| | building_semantic | Percentage of building pixels in the SVI |
| | other_semantic | Percentage of pixels from other categories in the Cityscapes dataset in the SVI |
^a Standard deviation (std).
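The following is a minimal sketch of the backward-elimination procedure described above, using ordinary least squares from statsmodels; the DataFrame `visual_vars` (the 19 variables of Table V) and target `soundscape_score` are illustrative names.

```python
import statsmodels.api as sm

def backward_select(X, y, alpha=0.05):
    """Drop the least significant variable until all p-values are below alpha."""
    cols = list(X.columns)
    while cols:
        model = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvals = model.pvalues.drop("const")   # ignore the intercept term
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:             # every remaining variable significant
            return model, cols
        cols.remove(worst)                    # remove the weakest variable and refit
    return None, []

# final_model, kept = backward_select(visual_vars, soundscape_score)
```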
The visual features of the streetscape and the soundscape indicators were analyzed using multiple regression, and the results are shown in Fig. 11. For each indicator, we selected the six visual features with the highest contribution rates. The bar length indicates the normalized coefficient. Overall, street-scene visual features contribute differently to different sound indicators (Lu et al., 2023).
For sound intensity, vehicle_semantic, car_object, and bus_object had significant positive correlations, while lightness_mean and lightness_std (standard deviation) had the strongest negative correlations. Nature_semantic was positively correlated with the sound quality score, while vehicle_semantic, building_semantic, and truck_object were negatively correlated. Two pixel-level features, saturation_std and lightness_mean, appeared in the sound quality list, suggesting that these two visual features can significantly affect the human perception of sound quality.
Traffic noise and mechanical noise were affected similarly by the visual features; for example, sky_semantic and building_semantic had the same positive effect on both. However, truck_object was absent from the mechanical noise model because the number of trucks in the main urban area of Fuzhou City is low and the camera captured fewer such images. Human sounds and musical noise were positively affected by person_object and building_semantic. The visual elements with the strongest positive and negative correlations with natural sound were nature_semantic and building_semantic, respectively. The assessment of sound sources is mainly based on human a priori knowledge rather than immersive experiences, which may bias some perceptions (Paes et al., 2021). For example, even when there are no moving vehicles on a highway, people perceive significant traffic noise in such a scene because of their prior knowledge.
Regarding perceived emotions, "pleasant" and "exciting" had some similarities. For example, "pleasant" was positively correlated with nature_semantic, building_semantic, and lightness_mean and negatively correlated with vehicle_semantic and bus_object. "Exciting" was positively correlated with nature_semantic, sky_semantic, lightness_mean, and saturation_std and negatively correlated with bus_object and other_object. This result corroborates the findings of Chesnokova and Purves (2018), demonstrating a human tendency to perceive natural sounds favorably and vehicle sounds unfavorably. "Chaotic," "eventful," and "annoying" were positively influenced by similar visual features, such as person_object, vehicle_semantic, and car_object. This is because the richer the object targets within a street scene, the more complex the scene, the more crowded humans perceive the street to be, and the lower their perceived emotion. "Uneventful," "calm," and "monotonous" showed strong associations with most visual features; for example, sky_semantic and building_semantic both had positive effects, whereas car_object negatively affected these soundscape metrics.
D. Relevance of sound indicators
To explore the relationships between the different soundscape metrics in streetscape images, we performed a correlation analysis, as shown in Fig. 12 (a minimal computation sketch follows the list below). We categorized the soundscape metrics into four groups: sound intensity (I), sound quality (Q), sound source (S), and perception (P). Based on the correlation patterns, the indicators cluster into three main groups:

- Urban noise cluster: This group shows strong positive correlations among the sound intensity (I), traffic noise (S), chaotic (P), mechanical noise (S), annoying (P), and eventful (P) metrics. Sound intensity strongly correlates with traffic noise (r = 0.71), chaotic perception (r = 0.69), and mechanical noise (r = 0.68). These metrics generally indicate urban noise pollution and negative sound perceptions; an increase in sound intensity is accompanied by an increase in mechanical noise.
- Human activity sound cluster: This group exhibits positive correlations among the human sounds (S), musical noise (S), exciting (P), and pleasant (P) metrics. Notably, human sounds correlate positively with musical noise (r = 0.44) and pleasant perception (r = 0.54), showing that these sounds often co-occur.
- Natural quality cluster: This group shows positive correlations among the sound quality (Q), nature sound (S), and calm (P) metrics. Sound quality correlates positively with natural sounds (r = 0.51) and calmness (r = 0.58). These metrics represent high-quality, natural, and tranquil soundscapes; scenes with better sound quality are often accompanied by natural sounds. Importantly, we observed moderate negative correlations between the urban noise cluster and the natural quality cluster: sound intensity negatively correlates with sound quality (r = –0.42) and calmness (r = –0.58), and traffic noise shows negative correlations with nature sounds (r = –0.34) and calmness (r = –0.52). That is, as sound intensity and traffic noise increase, perceived sound quality and natural sounds decrease and the environment feels less calm.
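The correlation matrix behind Fig. 12 can be computed in a few lines; the following is a minimal pandas sketch, assuming a DataFrame `scores` with one column per soundscape metric (column names are illustrative).

```python
import pandas as pd

# scores: DataFrame with one column per soundscape metric (15 columns)
corr = scores.corr(method="pearson")                  # 15 x 15 correlation matrix
print(corr.loc["sound_intensity", "traffic_noise"])   # r = 0.71 reported in Fig. 12
```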
In summary, sound intensity is correlated with factors such as the noise type, sound quality, and the calmness of the environment. These correlation results can inform environmental noise management and acoustic design: (1) reducing urban noise sources (e.g., traffic, mechanical) could significantly improve perceived sound quality and calmness, (2) incorporating natural sounds and human-oriented acoustic elements may enhance the pleasantness of urban soundscapes, and (3) balancing the presence of exciting, musical elements with the overall sound intensity could create more appealing urban acoustic environments. This structured analysis of soundscape correlations offers a foundation for developing targeted strategies in urban planning, noise control, and the creation of more comfortable and attractive sonic landscapes (Salem et al., 2018).
V. DISCUSSION
A. Advantages of using street-view imagery to predict soundscape
SVI data provide significant advantages for assessing urban street soundscapes as a data source with wide coverage and easy access. (1) The large number of data samples covering a wide range of urban areas enables large-scale assessment. This enabled us to conduct city-level soundscape assessments and obtain more comprehensive and accurate results. (2) The high-resolution visual information captures the subtle visual elements of a landscape, which may be associated with soundscape indicators. By analyzing landscape features in streetscape images, we can better understand the mechanisms of soundscape formation. (3) Pre-existing SVI data save time and cost by avoiding field surveys and manual data collection, making soundscape assessment and prediction more efficient and feasible. (4) The close correlation between visual and auditory perception can be utilized to predict sound (Salem et al., 2018); thus, it is feasible to use visual data to assess soundscapes. (5) By combining the predicted soundscape metrics with the GIS, high-resolution maps of the distribution of soundscape metrics can be generated. These visualization results can provide urban planners, environmental protection agencies, and the public with decision support regarding soundscape quality, thereby promoting the improvement of the urban environment and people's quality of life. Therefore, the use of streetscape imagery to predict soundscapes has several advantages, including large-scale assessment, high-resolution information, time and cost effectiveness, and visualization and decision support, providing a powerful tool and methodology for research and practice.
B. Limitations and future work
This study had some limitations, which may be addressed in future studies. The first is the impact of environmental factors. The soundscape of a street is influenced by factors such as weather and time; for example, streets are noisier during peak commuting hours. Images provide only static information, so these factors may not be accurately predicted from the SVI. The second is the diversity of sounds. Urban soundscapes arise not only from roads and highways but also from parks, residential areas, and urban squares, and relying solely on SVI may not fully capture them. While our 45 sample points provide a diverse representation of Fuzhou's urban soundscape, we acknowledge that a larger dataset may provide more comprehensive insights; collecting large amounts of audio data in urban environments poses significant time and resource challenges. Future work on predicting soundscapes from street-view images should therefore expand and diversify the dataset: the present study took the street-view imagery of Fuzhou as an example, but future work should broaden the research scope and collect street-view image data from other cities, covering different urban spaces and geographical and cultural backgrounds. This would make the prediction model more universal and adaptable to a wider range of urban environments. In addition, SVI-based prediction methods built on machine learning algorithms are effective in predicting soundscapes and landscapes; however, there is still room for improvement. These algorithms can be further optimized to improve the accuracy and stability of the models; for example, more complex deep-learning models, such as convolutional neural networks and recurrent neural networks, could be used to improve the performance of the predictive models.
VI. CONCLUSION
This study used CV methods to extract landscape visual feature indicators from large-scale SVI. The 15 soundscape indicators were then correlated with the landscape visual indicators to construct a prediction model, which was applied to 98 000 SVI in Fuzhou for empirical analysis. The results indicated that SVI can be used to predict street soundscapes, thereby verifying the effectiveness of machine learning–based street-view image methods for predicting soundscapes and landscapes. The contributions of street-view visual features to the different soundscape indicators varied. Taking sound intensity as an example, vehicle_semantic, car_object, and bus_object exhibited significant positive correlations, whereas lightness_mean and lightness_std were the most strongly negatively correlated visual features. This study provides an alternative to traditional noise measurement for the fine-resolution prediction of large-scale sound scenes. The contributions of this study are as follows:
- Streetscape images can be used as powerful tools for assessing soundscape quality. By analyzing elements, such as buildings, greenery, and traffic, in streetscape images, we can obtain visual features of the urban environment. These features are related to the propagation and reflection of sound; therefore, they can be used as important indicators for assessing the quality of soundscapes.
- There is a correlation between the visual features of the urban environment and soundscape quality. We found a certain correlation between the green area, building height, traffic density, and other factors in the streetscape image and indicators such as sound clarity and noise level. This suggests that by analyzing streetscape images, we can initially predict the quality of the soundscape.
- The method of assessing soundscape quality using the SVI can provide a reference for urban planning and environmental improvement. Using streetscape images to assess soundscape quality, we can obtain a more comprehensive understanding of the distribution and influencing factors of sound in urban environments. This will help urban planners consider soundscape quality when designing urban environments and provide more comfortable and livable spaces.
This study demonstrated that the soundscape of a large urban area can be effectively predicted using machine learning algorithms and streetscape imagery. This approach bypasses cumbersome ground-based measurements, can be deployed on a large scale with fine spatial resolution, and can be analyzed comparatively across multiple cities. It provides strong support for the prediction and planning of urban soundscapes, helping create higher-quality urban soundscape environments that play a key role in the health and well-being of citizens.
ACKNOWLEDGEMENTS
We would like to express our gratitude to the editors and anonymous reviewers for their invaluable comments on this manuscript.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
DATA AVAILABILITY
The authors do not have permission to share data.