Soundscapes are an important part of urban landscapes and play a key role in the health and well-being of citizens. However, predicting soundscapes over a large area at fine resolution remains challenging: traditional methods are time-consuming and require laborious, large-scale noise measurement campaigns. Therefore, this study used machine learning algorithms and street-view images to estimate a large-area urban soundscape. First, a computer vision method was applied to extract landscape visual feature indicators from large-area streetscape images. Second, the 15 collected soundscape indicators were correlated with the landscape visual indicators to construct a prediction model, which was applied to estimate large-area urban soundscapes. Empirical evidence from 98 000 street-view images in Fuzhou City indicated that street-view images can be used to predict street soundscapes, validating the effectiveness of machine learning algorithms in soundscape prediction.

Urban streets are an important part of urban public space, not only as transportation corridors but also as an important means of strengthening urban social ties, promoting social interaction, and improving the quality of life of urban residents (Appleyard and Lintell, 1972; Hassen and Kaufman, 2016). The urban street acoustic environment is a key influence on the streetscape experience, impacting quality of life and reflecting the city's culture and environment (Skånberg and Öhrström, 2002; Goines and Hagler, 2007; Sun et al., 2019). Studies have shown that unpleasant sounds can lead to cardiovascular disease, sleep problems, irritability, and cognitive impairment in children (Meecham and Smith, 1977; Jenkins et al., 1981; Daiber et al., 2019). Conventional acoustic meters generally measure the physical properties of sound; however, the human perception of sound and its impact on health depends not only on the physical properties of sound but also on people's subjective perception and mental state (Nilsson and Berglund, 2006). ISO 12913-1 (ISO, 2014; Brooks, 2016) emphasizes that the environment plays a key role in soundscape assessment and design, highlighting the importance of considering the human perception of the environment rather than physical measurements. Therefore, improving the perceived quality of a soundscape is important for improving health (Herranz-Pascual et al., 2010; Hasegawa and Lau, 2022).

How people perceive the acoustic environment in soundscapes is studied from three main perspectives: (1) analyzing recordings using physical parameters to objectively obtain soundscape information (Barber et al., 2011), (2) obtaining subjective soundscape information through questionnaires, interviews, and field observations (Liu et al., 2013a,b), and (3) combining subjective and objective methods (Jeon et al., 2010). Combining emotional responses to specific sound scenes with objective acoustic parameter analysis allows comprehensive and accurate sound scene information to be obtained. Various methods have been proposed to measure and evaluate soundscapes to improve the quality of the urban soundscape, including arranging sound level meters and noise sensors in specific locations to provide accurate data; however, there are several limitations to this approach (Verma et al., 2019; Gasco et al., 2020). First, the cost of purchasing and installing sensors is high. Second, sensors can only cover a limited area. To overcome these problems, researchers have developed inexpensive and large-scale soundscape assessment methods that utilize new data sources such as smartphones and social media (Gasco et al., 2017; Gasco et al., 2019). While these methods are real-time, large-scale, low-cost, and individualized, smartphone and social media data may be subject to sampling bias because not everyone uses a smartphone or social media and usage habits differ.

Urban street imagery creates opportunities to advance multiscale urban research owing to its broad coverage and fine spatial sampling. It has been used to quantify urban greenery (Long and Liu, 2017; Wu et al., 2020; Hawes et al., 2022), urban climate (Ignatius et al., 2022), tourist behavior (Guo and Loo, 2013; Ning et al., 2022), building characteristics and distribution (Kelly et al., 2013; Nguyen et al., 2019; Keralis et al., 2020), traffic (Wang et al., 2022), road accessibility (Ewing and Cervero, 2010; Hara et al., 2013), safety (Song et al., 2020; Zhanjun et al., 2022), knowledge of crime (Perkins et al., 1992; McKee et al., 2017; Branas et al., 2018), and urban perception (Dubey et al., 2016; Kruse et al., 2021; Guan et al., 2022). Computer vision (CV) techniques and algorithms play important roles in street-view image processing and analysis. Semantic segmentation is an important deep-learning task in CV that is mainly used for urban feature extraction: a convolutional network converts a two-dimensional image into pixel-level class labels, enabling the segmentation and classification of different objects and regions in the image. Commonly used models include SegNet (Cambridge, UK), VGGNet (Visual Geometry Group, Oxford, UK), and DeepLab (Google Inc., Mountain View, CA) for semantic segmentation, and YOLO (You Only Look Once, Darknet, Joseph Redmon, Seattle, WA) for object detection. CV models, such as target detection and image classification, can also efficiently extract high-level features from images (Verma et al., 2020), and studies have automatically identified hazardous scenes related to non-motorized transportation and their immediate causes from street-view images (SVI) using target detection and classification (Wang et al., 2022). Furthermore, urban features extracted from SVI by CV models can efficiently estimate hidden community socioeconomic conditions, such as travel behavior, poverty status, health outcomes and behaviors, and crime, thus providing the basis for this project to predict the urban soundscape through street-view imagery (Fan et al., 2023).

Human visual and auditory perceptions are inextricably linked, and streetscape perception is influenced not only by visual components but also by acoustic components (Einhäuser et al., 2020; Verma et al., 2020). Previous research has demonstrated a strong correlation between soundscapes and visual aesthetics (Schroeder and Anderson, 1984; Carles et al., 1999; Meng and Kang, 2015; Meng et al., 2017; Salem et al., 2018). For example, Carles et al. (1999) used 36 sounds and images to study the interaction between visual and auditory stimuli, and their results suggested that consistency (or coherence) between sounds and images affects landscape preferences. These studies mainly explored the correlation between sounds and images; however, studies predicting and quantifying soundscape metrics are lacking.

Therefore, this study investigated how streetscape images can be utilized for soundscape assessment and prediction, focusing on high-resolution quantification and prediction at the city level. Specifically, this study aimed to determine (1) methods for acquiring soundscape metrics at a high resolution at the city level, and (2) the relationships between visual landscape elements and soundscape metrics in streetscape images. To achieve this, we extracted the pixel feature, semantic segmentation, and object detection results from urban streetscape images using CV and deep-learning models and constructed 15 soundscape indicators based on sound intensity, soundscape quality, sound source, and human perception. Then, machine learning algorithms were trained on the sound intensity of street images from 45 sampling points, the best algorithm was selected, and the soundscape indicators of street images at 24 636 sampling points in Fuzhou were predicted.

Our work enables soundscape visualization, which helps us understand the distribution of soundscapes and reveals the relationship between the urban visual environment and the soundscape. It facilitates the optimization of urban planning and design, improvement of the urban environment, enhancement of health and well-being, strengthening of urban marketing and attractiveness, and greater community participation in decision making. These benefits contribute to the creation of livable and sustainable cities that enhance the quality of life of residents and the competitiveness of cities.

The integrated framework proposed in this study comprised three main steps (Fig. 1). First, visual features of street panorama images were extracted using CV algorithms and deep-learning models at three levels: pixel-, object-, and semantic-level features. Second, street soundscape indicators were constructed from four aspects: sound intensity, sound quality, sound source, and perceived emotion. Third, random forest (RF) regression was used to construct a soundscape prediction model to estimate human-perceived street soundscapes at the city level.

FIG. 1. (Color online) Research framework.

A soundscape is a conceptual framework for acoustic- or sound-related issues involving the physical properties of sound, spatial distribution, environmental factors, and perceptual and emotional responses of human hearing (Hasegawa and Lau, 2022). As shown in Fig. 2, to construct a soundscape indicator system spanning the sound environment through to the human emotional response for evaluating the urban soundscape, 15 perceptual indicators were identified from the literature, covering four aspects: sound intensity, sound source, human perception, and sound quality (Axelsson et al., 2014; Liu et al., 2019). The sound types were classified according to Zhao et al. (2023), Ryu et al. (2018), and Schafer (1993). This study categorized sound sources into five subcategories: traffic noise, human sounds, natural sound, mechanical noise, and musical noise. Musical noise generally refers to sound from music or music-related activities, such as music from store promotions. The primary human emotional response is the evaluation of overall sound quality, and the secondary response is the perceived emotion toward different sounds, categorized into eight subcategories: "pleasant," "chaotic," "exciting," "uneventful," "calm," "annoying," "eventful," and "monotonous" (Axelsson et al., 2010). As shown in Table I, we established a soundscape indicator system with four categories and 15 subcategories.

FIG. 2. (Color online) Urban soundscape indication system from acoustic environment to human response.
TABLE I.

The content of soundscape perception survey.

Question Indicator Scale (from 1 to 5)
1. Overall, how do you feel about the overall sound intensity (noisy or quiet) from the audio?  Sound intensity  [Very noisy, Noisy, …, Quiet, Very quiet] 
2. Overall, how do you feel about the overall sound quality (good or bad) from the audio?  Sound quality  [Very bad, Bad, …, Good, Very good] 
3. How much do you currently experience the following sound types in the above scene?  Traffic noise, Human sounds, Natural sound, Mechanical noise, Musical noise  [No sensation at all, Don't feel dominant, …, Dominant, Completely dominant] 
4. To what extent do you agree or disagree with the consistency of the following feelings about the sound environment with the above scenario?  Pleasant, Chaotic, Exciting, Uneventful, Calm, Annoying, Eventful, Monotonous  [Completely disagree, Disagree, …, Agree, Completely agree] 

The street soundscape perception survey was designed to collect soundscape metrics. The acquired images and recorded audio were used to score each street-scene image on each metric. The study adopted a pilot experimental design combining an on-site (offline) survey and an off-site (online) survey. The on-site survey was conducted at designated survey points, where 20 members of the public were invited as subjects to rate the 15 indicators for each audio-visual scene. The off-site survey was conducted in a laboratory environment: panoramic photos and audio were first recorded at each survey point, and 200 volunteers were recruited as subjects. The off-site procedure comprised the following steps: (1) a slide was made from the isometrically projected panoramic image to show the scene to the subjects, (2) the sound recorded on site was played, with a total display time of about 3 min, and (3) the subjects rated the scene on the 15 soundscape indicators.

This design allowed us to compare differences between on-site and off-site assessments while obtaining a larger sample through the off-site survey. Presenting panoramic images and audio to off-site participants provides a relatively immersive experience, thereby increasing the accuracy and reliability of the assessments. In total, 220 people participated in the survey; the sample structure is shown in Table II.

TABLE II.

Sample structure.

Feature Options Quantity Ratio (%)
Gender  Male  116  53
  Female  104  47
Age  Under 20 years old  12  5.5
  21–30 years old  49  22.2
  31–40 years old  70  31.8
  41–50 years old  63  28.6
  Over 50 years old  26  12.4
Educational background  Primary school and below  5  2.3
  Middle school and high school  54  24.5
  Diploma or undergraduate  80  36.4
  Master's degree or above  81  36.8
Occupation  Farmer  27  12.3
  Individual operators  47  21.4
  Government staff  56  25.5
  Landscape industry practitioners  68  30.9
  Other  22  9.9

SVI provides a unique view of ground-level urban landscapes with extensive coverage and fine spatial sampling and has been widely used in urban built environment studies at multiple scales (Biljecki and Ito, 2021). These images can be labeled according to different research purposes and CV techniques and utilized to construct visual features at the pixel, object, and semantic levels. Pixel-level features characterize the overall impression of the SVI (e.g., brightness and saturation) and influence the emotional perception; object-level visual features refer to operations, such as detecting, recognizing, and tracking objects in an image, for example, cars or people; and semantic-level visual features refer to the semantic segmentation and understanding of an image used to extract the semantic information of different regions in an image. Examples include the proportions of vegetation, sky, and roads.

Thus, pixel, object, and semantic visual features were extracted (Table III). Pixel-level features were extracted using the algorithm retrieved from the OpenCV (Intel Corporation, Santa Clara, CA) library to convert the images from the red-green-blue (RGB) color space to the hue, saturation, and value (HSV) color space and calculate histograms of the different color channels to obtain the color features of the image. Object-level features were extracted by identifying and counting instances of 91 object types (e.g., buses, people, trucks) using the yolov5-master algorithm (Ultralytics LLC, San Diego, CA) with the deep-learning target detection technique and the COCO (Common Objects in Context, Microsoft Corporation, Redmond, WA) dataset. Semantic-level features were extracted using the FCN-8s model (Fully Convolutional Network, Berkeley, CA) trained on the Cityscapes dataset (Daimler AG, Stuttgart, Germany), which categorized the SVI data according to 18 types of labels (including sky, vegetation, roads, and buildings). This study explored the relationship between street-scene visual features and human perception, aiming to identify the key visual features that affect human perception.
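As a minimal sketch of the pixel-level step, the six color features in Table III could be computed with OpenCV as follows; the file path and function name are illustrative, and the HSV value (V) channel is taken as "lightness":

```python
# Minimal sketch of pixel-level feature extraction, assuming street-view
# images stored as local JPEG files; names are illustrative, not the
# study's exact code.
import cv2
import numpy as np

def pixel_level_features(image_path: str) -> dict:
    """Compute the six pixel-level features of Table III (HSV means/stds)."""
    bgr = cv2.imread(image_path)                 # OpenCV loads images as BGR
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)   # convert to HSV color space
    h, s, v = cv2.split(hsv)
    return {
        "hue_mean": float(np.mean(h)),
        "saturation_mean": float(np.mean(s)),
        "lightness_mean": float(np.mean(v)),     # V channel used as lightness
        "hue_std": float(np.std(h)),
        "saturation_std": float(np.std(s)),
        "lightness_std": float(np.std(v)),
    }

features = pixel_level_features("pano_000001.jpg")
print(features)
```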

TABLE III.

Summary of feature extraction models and algorithms.

Level Model/Library Dataset Features
Pixel-level  OpenCV  —  Hue, saturation, lightness, hue_std,a saturation_std, lightness_std 
Object-level  Yolov5-master  COCO  91 object types (person, bus, truck, motorcycle, etc.) 
Semantic-level  FCN-8s  Cityscapes  18 categories (building, sky, road, etc.) 
a Standard deviation (std).

The prediction of each soundscape metric was treated as a supervised regression task. RF is an ensemble learning method that performs classification and regression tasks by combining multiple decision trees. Its main feature is that each decision tree is trained on a randomly selected subset of samples and features.

The basic steps of the RF were as follows: (1) a decision tree was constructed by randomly selecting a portion of samples from the training set (sampling with replacement), (2) for each decision tree, a subset of features was randomly selected for training, (3) steps 1 and 2 were repeated to construct multiple decision trees, (4) for the classification task, each decision tree voted on the prediction, and for the regression task, the predictions of the decision trees were averaged, and (5) the final prediction was synthesized from the predictions of the multiple decision trees (Fig. 3).

FIG. 3. Random forest flow chart.

The prediction accuracy of the RF model is mainly affected by the number of trees, the tree depth, and the training samples. In general, prediction accuracy improves as the number of trees increases; however, if the tree depth is too large, the model may overfit, reducing its accuracy. An imbalance in the number of samples across categories in the training data may cause the model to perform well for well-represented categories and poorly for under-represented ones. We used the 115 street visual features as input variables, paired with the corresponding soundscape metrics, to predict the 15 soundscape metrics.
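As an illustrative sketch (not the study's exact configuration), the RF regression for one soundscape metric could be set up with scikit-learn as follows; the hyperparameter values and the placeholder data are assumptions:

```python
# Sketch of RF regression for a single soundscape metric, assuming a
# (samples x 115) feature matrix; data and hyperparameters are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((9900, 115))        # 115 street visual features per sample
y = rng.random(9900) * 5           # one soundscape metric on a 1-5 scale

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

rf = RandomForestRegressor(
    n_estimators=500,   # more trees generally improve accuracy (assumed value)
    max_depth=None,     # cap the depth if overfitting is observed
    random_state=0,
)
rf.fit(X_train, y_train)
print("Test R^2:", rf.score(X_test, y_test))
```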

The entropy weight method (EWM) is a purely objective evaluation method based on the principle that the greater the dispersion of an indicator, the lower its information entropy and the greater the amount of information it carries. If all values of an indicator are equal, the indicator contributes nothing to the comprehensive evaluation. In this study, the weights of the 15 sound indicators of the SVI were determined objectively using the EWM, calculated as follows:

1. Dimensionless processing of data

When the measurement units and directions of the indicators are inconsistent, the data must be standardized. To avoid taking the logarithm of zero when calculating the entropy value, a small decimal constant, such as 0.01, 0.001, or 0.0001, can be added to each zero value.

For positive indicators (larger is better):
$$x'_{ij} = \frac{x_{ij} - \mathrm{Min}(x_{ij})}{\mathrm{Max}(x_{ij}) - \mathrm{Min}(x_{ij})}, \quad (1)$$
For negative indicators (smaller is better):
$$x'_{ij} = \frac{\mathrm{Max}(x_{ij}) - x_{ij}}{\mathrm{Max}(x_{ij}) - \mathrm{Min}(x_{ij})}. \quad (2)$$

There are 15 indicators, 45 survey points, and 220 participants rating each survey point, giving 9900 samples. Therefore, xij in the formula is the value of the jth indicator of the ith sample, i = 1, 2, 3,…, 9900; j = 1, 2, 3,…, 15; and Min(xij) and Max(xij) refer to the minimum and maximum values of the jth indicator over all samples i.

2. Calculate the proportion of indicators

The proportion of the ith sample value under the jth indicator is

$$p_{ij} = \frac{x'_{ij}}{\sum_{i=1}^{n} x'_{ij}}. \quad (3)$$

3. Calculate entropy and coefficient of difference

The entropy value of the jth indicator is

$$e_j = -\frac{1}{\ln n} \sum_{i=1}^{n} p_{ij} \ln p_{ij}, \quad (4)$$

where n is the number of samples.

The coefficient of difference for the jth indicator is

$$d_j = 1 - e_j. \quad (5)$$

4. Calculate entropy weight

$$w_j = \frac{d_j}{\sum_{j=1}^{m} d_j}, \quad (6)$$

where wj is the weight of the jth indicator, dj is the coefficient of difference of the jth indicator, and m is the number of indicators.
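The EWM computation in Eqs. (1)–(6) can be condensed into a short numpy sketch; this assumes all indicators are positive ("larger is better"), uses an illustrative zero-offset constant, and is not the study's exact code:

```python
# Sketch of the entropy weight method of Eqs. (1)-(6) for a
# samples-by-indicators matrix of positive indicators.
import numpy as np

def entropy_weights(X: np.ndarray, eps: float = 1e-4) -> np.ndarray:
    """X: (n samples, m indicators). Returns one weight per indicator."""
    n, m = X.shape
    # Eq. (1): min-max normalization; eps avoids log(0) as described above
    # (a constant indicator would need special handling: zero range)
    Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)) + eps
    p = Xn / Xn.sum(axis=0)                          # Eq. (3): proportions
    e = -(p * np.log(p)).sum(axis=0) / np.log(n)     # Eq. (4): entropy
    d = 1.0 - e                                      # Eq. (5): difference
    return d / d.sum()                               # Eq. (6): weights

weights = entropy_weights(np.random.rand(9900, 15))
print(weights.round(3), weights.sum())
```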

Fuzhou City is located on the southeast coast of China at longitude 119°17′18″E and latitude 26°04′08″N and is the capital of Fujian Province (Fig. 4). The climate is humid subtropical maritime, with an average annual temperature of 16 °C–20 °C and average annual precipitation of 900–1200 mm. The study area is located in the central city of Fuzhou and covers approximately 313 km2. The central urban area of Fuzhou is the most densely populated and one of the most economically developed areas in Fujian Province. Traffic noise is the main noise source and the Fuzhou dialect is a distinctive human sound; therefore, exploring the visual elements and spatial patterns of the regional soundscape is of great significance for promoting the sustainable development of the city.

FIG. 4. (Color online) Overview of geographical location of the study area.

The Baidu Street View map was selected as the source of street-view data. The Generate Points Along the Line tool in the geographic information system (GIS) was used to generate uniformly distributed observation points at 100 m intervals. Then, the geographic coordinates of the observation points were used as a reference to obtain the Baidu SVI at different locations through the Baidu Map Application Programming Interface. Finally, panoramic images were obtained at 24 636 locations in the main city of Fuzhou. These images were used for the CV-based investigation and analysis.
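A hedged sketch of this collection step is shown below; the endpoint URL, parameter names, and API key are assumptions about Baidu's panorama static-image API rather than the study's exact code, and the official documentation should be consulted before use:

```python
# Hedged sketch of fetching a panorama for one sampled observation point.
# ENDPOINT and the parameter names are assumptions; BAIDU_AK is a
# hypothetical placeholder for a real API key.
import requests

BAIDU_AK = "your-api-key"                              # placeholder
ENDPOINT = "https://api.map.baidu.com/panorama/v2"     # assumed endpoint

def fetch_panorama(lng: float, lat: float, out_path: str) -> None:
    params = {
        "ak": BAIDU_AK,
        "location": f"{lng},{lat}",  # observation point from the GIS sampling
        "width": 1024,
        "height": 512,
        "fov": 360,                  # assumed full horizontal field of view
    }
    resp = requests.get(ENDPOINT, params=params, timeout=30)
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)

fetch_panorama(119.29, 26.07, "pano_000001.jpg")
```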

The audio data were collected from the main urban area of Fuzhou City, covering different types of environments, such as commercial areas, residential areas, parks, and transportation hubs, to ensure that the audio data could reflect the sound characteristics of different areas of the city, with a total of 45 sample points. The collected data mainly included 3 min videos, 4–10 panoramic image shots, and 3 min recordings of changes in sound levels. The audio and images were used as the online questionnaire content, and the sound level data were used to validate the RF model. Sound level meters (UT353BT, Zhongshan Xinyi Electronic Instrument Co., Guangdong, China) were used to measure sound levels, and a smartphone (iPhone 12, Apple Inc., Cupertino, CA) was used to capture videos and panoramic images. Notably, the UT353BT sound level meter measured the average sound intensity at each sampling point, i.e., the equivalent continuous sound level (Leq). These objective measurements were compared with the participants' subjectively perceived sound intensity. Although the term "sound intensity" was used in the questionnaire for easier understanding by participants, the physical quantity actually measured is the equivalent continuous sound level.
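For reference, the equivalent continuous sound level is the energy average of the measured levels, Leq = 10 log10[(1/N) Σ 10^(Li/10)]. A minimal sketch, assuming a series of short-interval dB readings over the 3 min window (sampling interval and values are illustrative):

```python
# Sketch of computing Leq from a series of short-interval sound level
# readings; the example readings are placeholders.
import numpy as np

def leq(levels_db: np.ndarray) -> float:
    """Energy-average a series of sound levels (dB) into Leq."""
    return 10.0 * np.log10(np.mean(10.0 ** (levels_db / 10.0)))

readings = np.array([62.1, 65.4, 70.2, 58.9, 66.7])  # example dB readings
print(f"Leq = {leq(readings):.1f} dB")
```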

1. Model comparison

To compare the RF model against alternatives, we evaluated k-nearest neighbors (KNN) regression, back propagation (BP) neural network regression, and support vector machine (SVM) regression. The dataset was constructed using the SVI from Fuzhou City, with 70% used for training and 30% for testing. The mean absolute percentage error (MAPE) and coefficient of determination (R2) were used to evaluate model performance. Taking sound intensity as an example, we used the Leq averaged over the 3 min on-site sound measurements for correlation analysis with the model predictions. As shown in Table IV, the MAPE for Fuzhou City ranged from 3.443 to 7.759, and R2 values were between 0.421 and 0.776. The KNN model performed worst on the dataset, while the RF model exhibited the best performance in both MAPE and R2. Consequently, we selected the RF model as our final predictive model.

TABLE IV.

Sound intensity prediction accuracy in different models.

Model MAPE (%) R2
KNN  7.759  0.421 
BP  5.608  0.534 
SVR  7.729  0.425 
RF  3.443  0.776 
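A sketch of this comparison using scikit-learn stand-ins is given below; the BP network is approximated by MLPRegressor, the data are placeholders, and the hyperparameters are defaults rather than the study's tuned settings:

```python
# Sketch of the four-model comparison on a 70/30 split, with scikit-learn
# stand-ins and placeholder data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X, y = rng.random((9900, 115)), rng.random(9900) * 5   # placeholder data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "KNN": KNeighborsRegressor(),
    "BP": MLPRegressor(max_iter=1000, random_state=0),  # BP stand-in
    "SVR": SVR(),
    "RF": RandomForestRegressor(random_state=0),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    mape = 100 * mean_absolute_percentage_error(y_te, pred)
    print(f"{name}: MAPE={mape:.2f}%  R2={r2_score(y_te, pred):.3f}")
```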

2. Assessment of forecast results

MAPE and R2 are commonly used to assess RF prediction results. K-fold cross-validation was used to evaluate model performance; specifically, tenfold cross-validation was applied, in which the dataset was randomly divided into ten subsets and ten independent training and testing runs were performed.
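A minimal sketch of this tenfold evaluation, assuming the same placeholder feature matrix as above and scikit-learn's built-in scorers:

```python
# Sketch of tenfold cross-validation of the RF, scoring each fold with
# MAPE and R2; data are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(0)
X, y = rng.random((9900, 115)), rng.random(9900) * 5   # placeholder data

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(
    RandomForestRegressor(random_state=0), X, y, cv=cv,
    scoring=("neg_mean_absolute_percentage_error", "r2"),
)
print("MAPE (%):", -100 * scores["test_neg_mean_absolute_percentage_error"].mean())
print("R2:", scores["test_r2"].mean())
```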

As shown in Figs. 5 and 6, the MAPE values of the different soundscape metrics varied considerably. The median MAPE for musical noise was 19.43, the lowest among the metrics, while the median MAPE for chaotic was 31.19, one of the highest. This difference may be attributed to the nature of these soundscape attributes. Musical noise, being a more specific and potentially less frequent occurrence, might be easier for participants to identify and rate consistently, leading to better predictive accuracy and lower MAPE. Chaos, on the other hand, is a more subjective and complex attribute that could vary widely in interpretation among participants, resulting in higher prediction errors and MAPE. R2 also varied by soundscape metric. Higher median R2 values were obtained for sound intensity and traffic noise (0.63 and 0.58, respectively), indicating a better model fit for these attributes. In contrast, musical noise and calmness had lower median R2 values (0.26 and 0.29, respectively), suggesting that these attributes may be more challenging to predict accurately.

FIG. 5. (Color online) The MAPE of soundscape indicators prediction model.
FIG. 6. (Color online) The R2 of soundscape indicators prediction model.

These results indicate that individuals exhibit varying degrees of sensitivity to different sound attributes, with higher sensitivity observed for sound intensity, traffic noise, and chaotic sounds, which are considered "medium" in the context of this study. Conversely, attributes such as "calm" and "monotonous" are perceived with relatively lower sensitivity. These findings align with our expectations and echo the observations of previous research (Axelsson et al., 2010).

3. Validation of forecast results

To validate the accuracy of predicting street soundscapes from the SVI, we examined the correlation between the predicted sound intensity and the field measurements, where the predicted soundscape metrics represent the acoustic environment as perceived by people from the SVI. Pixel-, object-, and semantic-level visual features were extracted from the street-scene and panoramic images captured on site with cell phones and used to train the model to obtain the predicted sound intensity. The correlation between the subjective sound intensity collected from participants and the measured Leq averaged over 3 min is shown in Fig. 7, where R2 is 0.5471. According to previous research (Lionello et al., 2020), the use of the SVI to assess the soundscape in the main urban area of Fuzhou City was reliable. Some of the field sound measurements differed considerably from the predicted values, possibly because a sound level measured over 3 min is not always representative of subjectively perceived sound intensity.

FIG. 7. (Color online) Relationship between predicted and measured sound intensity.

The EWM revealed that the entropy weight of sound intensity was significantly higher than that of the other indicators, followed by sound quality (Fig. 8). The entropy weights of indicators such as "uneventful," "music noise," and "calm" were lower. These results suggest that, within our dataset, sound intensity and quality showed greater variability across the sampled locations than the other indicators. Importantly, these weights reflect the distribution and variability of the indicators in our samples rather than directly indicating people's sensitivity to these factors. As shown in Fig. 8, the first seven indicators account for a cumulative weight of 64% and thus explain most of the variability in the data. Therefore, the study focuses on the first seven indicators: sound intensity, sound quality, traffic noise, natural sound, chaotic, eventful, and exciting.

FIG. 8. (Color online) Sound indicator weights.

1. Sound intensity distribution map

Sound intensity refers to the amount of energy in a sound and is one of the most important indicators for assessing sound. It affects the quality and clarity of sound and is related to hearing protection and environmental noise control. The sound intensity distribution in the main urban area of Fuzhou City is shown in Fig. 9. Overall, the sound intensity distribution was low in the center and high in the north and south. Most of the high-intensity areas were concentrated along highways and development zones, while the low-intensity areas were mostly concentrated in parks and along the Wulong and Min rivers, which was consistent with our expectations. Specifically, the areas with higher sound intensity included highways and construction sites, such as the following:

  1. The construction of infrastructure in development zones.

  2. The Third Ring Expressway.

Surprisingly, the sound intensity in the business district located in Dongjiekou, Fuzhou City, was lower than expected. This may be because these areas are also well vegetated, as shown in the corresponding SVI, which may attenuate the perception of sound intensity. This is consistent with the findings of Van Renterghem (2019), who suggested that vegetation can strongly improve the perception of environmental noise. Noise levels in residential areas, such as Huangshan New Town, were at low to medium levels, and the low-intensity areas were identified as parks with more vegetation and mountain forests. In general, the distribution of sound intensity was highly correlated with urban functions, which is consistent with the study by Monazzam et al. (2015), who revealed that noise levels vary across land uses.

FIG. 9. (Color online) Distribution of sound intensity in the main urban area of Fuzhou.

2. Typical soundscape indicator distribution

Further exploration of the sound quality, traffic noise, natural sound, chaotic, eventful, and exciting metrics is presented in Fig. 10. The areas with better sound quality were mainly located near parks and scenic areas, such as West Lake Park, Minjiang Park, and the Gushan Scenic Area. The areas with poor sound quality were mainly concentrated in suburban areas with more highways and construction sites. Natural sound values were usually higher in central park areas with more vegetation. Traffic noise had a distribution similar to those of the chaotic and eventful metrics, with higher values concentrated near freeways and downtown attractions. Surprisingly, developed areas, such as Sanfangqixiang, Dongjiekou, and Wanda, had higher vibrancy values despite being busy and having more traffic noise. This is because the developed areas in the main urban area of Fuzhou City are greener and more orderly, providing a more pleasant environment for residents.

FIG. 10. (Color online) Spatial distribution of typical soundscape indicators.

A multiple regression model was used to explore the contribution of the visual features to the soundscape indicators. To improve the interpretability of the model and minimize variable redundancy, this study aggregated the 115 visual features into 19 variables (Table V). A stepwise backward regression method was used to select the variables: (1) a significance level (e.g., 0.05) was chosen, and variables with p-values below this level were retained in the model, (2) the variable with the largest p-value was removed and the model was refitted, and (3) the fit of the reduced model was evaluated using a statistical index; if the assessment was unsatisfactory, the process returned to step 2 to remove the variable with the next-largest p-value. (4) Steps 2 and 3 were repeated until the p-values of all remaining variables were below the significance level.
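A minimal sketch of this backward elimination using statsmodels OLS p-values follows; the data, column names, and helper function are illustrative rather than the study's exact code:

```python
# Sketch of stepwise backward regression: repeatedly drop the variable with
# the largest p-value until all remaining p-values fall below alpha.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_select(X: pd.DataFrame, y: pd.Series, alpha: float = 0.05):
    cols = list(X.columns)
    while cols:
        model = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvals = model.pvalues.drop("const")   # ignore the intercept
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:             # all remaining are significant
            return model, cols
        cols.remove(worst)                    # drop least significant variable
    return None, []

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((200, 5)), columns=[f"v{i}" for i in range(5)])
y = 2 * X["v0"] - X["v1"] + rng.normal(0, 0.1, 200)
model, kept = backward_select(X, y)
print("Retained variables:", kept)
```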

TABLE V.

Regression analysis variables.

Visual features Variables Definitions
Pixel-level  lightness_mean  The mean values of brightness dimensions in the SVI 
saturation_mean  The average value of the saturation dimension in the SVI 
hue_mean  The mean value of the hue dimension in the SVI 
lightness_stda  The standard deviation of the lightness dimension in the SVI 
saturation_std  The standard deviation of the saturation dimension in the SVI 
hue_std  The standard deviation of the hue dimension in the SVI 
Object-level  person_object  Total number of people in the SVI 
bicycle_object  Total number of bicycles in the SVI 
car_object  Total number of cars in the SVI 
motorcycle_object  Total number of motorcycles in the SVI 
bus_object  Total number of buses in the SVI 
truck_object  Total number of trucks in the SVI 
other_object  Total number of other remaining objects in the COCO dataset in the SVI 
Semantic-level  sky_semantic  Percentage of sky pixels in the SVI 
nature_semantic  Percentage of vegetation pixels in the SVI 
human_semantic  Percentage of human pixels in the SVI 
vehicle_semantic  Percentage of vehicle pixels in the SVI 
building_semantic  Percentage of building pixels in the SVI 
other_semantic  Percentage of pixels from other categories in the Cityscapes dataset in the SVI 
a Standard deviation (std).

The visual features of the streetscape and the soundscape indicators were analyzed using multiple regression, and the results are shown in Fig. 11. For each indicator, we selected the six visual features with the highest contribution rates. The bar length indicates the normalized coefficient. Overall, the street-scene visual features contributed differently to the different sound indicators (Lu et al., 2023).

FIG. 11. (Color online) The results of the multivariate regression analysis between the visual features and soundscape indicators.

For sound intensity, vehicle_semantic, car_object, and bus_object were the visual features with significant positive correlations, while lightness_mean and lightness_std (standard deviation) had the strongest negative correlations. Nature_semantic was positively correlated with the sound quality score, while vehicle_semantic, building_semantic, and truck_object were negatively correlated. Two pixel-level features, saturation_std and lightness_mean, appeared in the sound quality list, suggesting that these two visual features can significantly affect the human perception of sound quality.

Traffic noise and mechanical noise were affected similarly by the visual feature metrics; for example, sky_semantic and building_semantic had the same positive effect. However, truck_object was not present for mechanical noise because the number of trucks in the main urban area of Fuzhou City is low and the camera captured fewer such images. Human sounds and musical noise were positively affected by person_object and building_semantic. The visual elements with the strongest positive and negative correlations with natural sound were nature_semantic and building_semantic, respectively. The assessment of sound sources is based mainly on human a priori knowledge rather than on immersive experience, which may bias some perceptions (Paes et al., 2021). For example, even when there are no moving vehicles on a highway, people perceive significant traffic noise in such a scene because of their prior knowledge.

Regarding perceived emotions, "pleasant" and "exciting" had some similarities. For example, "pleasant" was positively correlated with nature_semantic, building_semantic, and lightness_mean and negatively correlated with vehicle_semantic and bus_object. "Exciting" was positively correlated with nature_semantic, sky_semantic, lightness_mean, and saturation_std and negatively correlated with bus_object and other_object. This result corroborates the findings of Chesnokova and Purves (2018), demonstrating a human tendency to perceive natural sounds favorably and vehicle sounds unfavorably. "Chaotic," "eventful," and "annoying" were positively influenced by similar visual features, such as person_object, vehicle_semantic, and car_object. This is because the richer the object targets within a street scene, the more complex the scene, the more crowded humans perceive the street to be, and the lower their perceptual emotion. "Uneventful," "calm," and "monotonous" showed strong associations with most visual features; for example, sky_semantic and building_semantic both had positive effects, whereas car_object negatively affected these soundscape metrics.

To explore the relationships between the different soundscape metrics in the streetscape images, we performed a correlation analysis, as shown in Fig. 12 (a minimal computational sketch follows the list below). We categorized the soundscape metrics into four groups: sound intensity (I), sound quality (Q), sound source (S), and perception (P). Based on the correlation patterns, the indicators cluster into three main groups:

  1. Urban noise cluster: This group shows strong positive correlations among sound intensity (I), traffic noise (S), chaotic (P), mechanical noise (S), annoying (P), and eventful (P) metrics. Key findings include that sound intensity strongly correlates with traffic noise (r = 0.71), chaotic perception (r = 0.69), and mechanical noise (r = 0.68). These metrics generally indicate urban noise pollution and negative sound perceptions. This means that an increase in sound intensity is accompanied by an increase in mechanical noise.

  2. Human activity sound cluster: This group exhibits positive correlations among the human sounds (S), musical noise (S), exciting (P), and pleasant (P) metrics. Notably, human sounds correlate positively with musical noise (r = 0.44) and pleasant perception (r = 0.54), indicating that these sounds often occur together.

  3. Natural quality cluster: This group shows positive correlations among the sound quality (Q), nature sound (S), and calm (P) metrics: sound quality correlates positively with natural sounds (r = 0.51) and calmness (r = 0.58). These metrics represent high-quality, natural, and tranquil soundscapes, suggesting that scenes with better sound quality are often accompanied by natural sounds. Importantly, we observed moderate negative correlations between the urban noise cluster and the natural quality cluster: sound intensity correlates negatively with sound quality (r = –0.42) and calmness (r = –0.58), and traffic noise correlates negatively with nature sounds (r = –0.34) and calmness (r = –0.52). This means that as urban noise increases, natural sounds tend to diminish and the environment is perceived as less calm.
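The correlation matrix underlying these clusters can be computed directly from the predicted metrics; a minimal sketch, assuming the metrics are held in a pandas DataFrame with illustrative column names and placeholder values:

```python
# Sketch of the cross-correlation analysis behind Fig. 12: pairwise Pearson
# correlations between predicted soundscape metrics; data are placeholders.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
metrics = pd.DataFrame(
    rng.random((24636, 4)),  # one row per predicted sampling point
    columns=["sound_intensity", "traffic_noise", "sound_quality", "calm"],
)
corr = metrics.corr(method="pearson")  # pairwise Pearson r between indicators
print(corr.round(2))
```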

FIG. 12. (Color online) Cross correlation between the soundscape indicators.

In summary, sound intensity is correlated with factors such as noise type, sound quality, and the calmness of the environment. These correlation results can inform environmental noise management and acoustic design: (1) reducing urban noise sources (e.g., traffic, mechanical) could significantly improve perceived sound quality and calmness, (2) incorporating natural sounds and human-oriented acoustic elements may enhance the pleasantness of urban soundscapes, and (3) balancing the presence of exciting, musical elements with overall sound intensity could create more appealing urban acoustic environments. This structured analysis of soundscape correlations offers a foundation for developing targeted strategies in urban planning, noise control, and the creation of more comfortable and attractive sonic landscapes (Salem et al., 2018).

SVI data provide significant advantages for assessing urban street soundscapes as a data source with wide coverage and easy access. (1) The large number of data samples covering a wide range of urban areas enables large-scale assessment. This enabled us to conduct large-scale soundscape assessments at the city level and obtain more comprehensive and accurate results. (2) The high-resolution visual information captures the subtle visual elements of a landscape. These elements may be associated with soundscape indicators; by analyzing landscape features in streetscape images, we can better understand the mechanisms of soundscape formation. (3) Pre-existing SVI data can be used to save time and cost by avoiding field surveys or manual data collection, making soundscape assessment and prediction more efficient and feasible. (4) The close correlation between visual and auditory perception can be exploited to predict sound (Salem et al., 2018); thus, it is feasible to use visual data to assess soundscapes. (5) By combining predicted soundscape metrics with the GIS, high-resolution maps of the distribution of soundscape metrics can be generated. These visualization results can provide urban planners, environmental protection agencies, and the public with decision support regarding soundscape quality, thereby promoting the improvement of the urban environment and people's quality of life. Therefore, using streetscape imagery to predict soundscapes has several advantages, including large-scale assessment, high-resolution information, time and cost effectiveness, and visualization and decision support, providing a powerful tool and methodology for research and practice.

This study had some limitations that may be addressed in future studies. The first was the impact of environmental factors. The soundscape of a street is influenced by environmental factors, such as weather and time; for example, streets are noisier during peak commuting hours. Images can only provide static information, so these factors may not be accurately predicted through the SVI. The second was the diversity of sounds. Urban soundscapes span not only roads and highways but also parks, residential areas, and urban squares, and relying solely on the SVI may not fully capture them. While our 45 sample points provide a diverse representation of Fuzhou's urban soundscape, we acknowledge that a larger dataset may provide more comprehensive insights; collecting large amounts of audio data in urban environments poses significant time and resource challenges. Future work on predicting soundscapes from street-view images should therefore expand and diversify the dataset: the present study took the street-view images of Fuzhou as an example, but future work should broaden the research scope and collect street-view image data from other cities, covering different urban spaces and geographical and cultural backgrounds. This would make the prediction model more universal and adaptable, allowing it to be applied to a wider range of urban environments. In addition, SVI prediction methods based on machine learning algorithms are effective in predicting soundscapes and landscapes; however, there is still room for improvement. These algorithms can be further optimized to improve the accuracy and stability of the models; for example, more complex deep-learning models, such as convolutional neural networks and recurrent neural networks, could be used to improve the performance of the predictive models.

This study used CV methods to extract landscape visual feature indicators from large-scale SVI. The 15 soundscape indicators were then correlated with landscape visual indicators to construct a prediction model, which was applied to 98 000 SVI in Fuzhou for empirical analysis. The results indicated that SVI can be used to predict street soundscapes, thereby verifying the effectiveness of machine learning algorithm–based street-view image prediction methods for predicting soundscapes and landscapes. The contributions of street-view visual features to different soundscape indicators varied. Taking sound intensity as an example, vehicle_semantic, car_object, and bus_object exhibited significant positive correlations. However, the lightness_mean and lightness_std were the most strongly and negatively correlated visual features. This study provides an alternative method to traditional noise detection for the fine-grained resolution prediction of large-scale sound scenes. The contributions of this study are as follows:

  1. Streetscape images can be used as powerful tools for assessing soundscape quality. By analyzing elements, such as buildings, greenery, and traffic in streetscape images, we can obtain visual features of the urban environment. These features are related to the propagation and reflection of sound; therefore, they can be used as important indicators for assessing the quality of soundscapes.

  2. There is a correlation between the visual features of the urban environment and soundscape quality. We found a certain correlation between green area, building height, traffic density, and other factors in the streetscape image and indicators, such as sound clarity and noise level. This suggests that by analyzing streetscape images, we can initially predict the quality of the soundscape.

  3. The method of assessing soundscape quality using the SVI can provide a reference for urban planning and environmental improvement. Using streetscape images to assess soundscape quality, we can obtain a more comprehensive understanding of the distribution and influencing factors of sound in urban environments. This will help urban planners consider soundscape quality when designing urban environments and provide more comfortable and livable spaces.

This study demonstrated that the soundscape of a large urban area can be effectively predicted using machine learning algorithms and streetscape imagery. This approach bypasses cumbersome ground-based measurements, can be deployed at large scale with fine spatial resolution, and supports comparative analysis across multiple cities. It provides strong support for the prediction and planning of urban soundscapes, helping to create higher-quality urban soundscape environments that play a key role in the health and well-being of citizens.

We would like to express our gratitude to the editors and anonymous reviewers for their invaluable comments on this manuscript.

The authors have no conflicts to disclose.

The authors do not have permission to share data.

1. Appleyard, D., and Lintell, M. (1972). "The environmental quality of city streets: The residents' viewpoint," J. Am. Instit. Plan. 38, 84–101.
2. Axelsson, Ö., Nilsson, M. E., and Berglund, B. (2010). "A principal components model of soundscape perception," J. Acoust. Soc. Am. 128, 2836–2846.
3. Axelsson, Ö., Nilsson, M. E., Hellström, B., and Lundén, P. (2014). "A field experiment on the impact of sounds from a jet-and-basin fountain on soundscape quality in an urban park," Landscape Urban Plann. 123, 49–60.
4. Barber, J. R., Burdett, C. L., Reed, S. E., Warner, K. A., Formichella, C., Crooks, K. R., Theobald, D. M., and Fristrup, K. M. (2011). "Anthropogenic noise exposure in protected natural areas: Estimating the scale of ecological consequences," Landscape Ecol. 26, 1281–1295.
5. Biljecki, F., and Ito, K. (2021). "Street view imagery in urban analytics and GIS: A review," Landscape Urban Plann. 215, 104217.
6. Branas, C. C., South, E., Kondo, M. C., Hohl, B. C., Bourgois, P., Wiebe, D. J., and MacDonald, J. M. (2018). "Citywide cluster randomized trial to restore blighted vacant land and its effects on violence, crime, and fear," Proc. Natl. Acad. Sci. U.S.A. 115, 2946–2951.
7. Brooks, B. (2016). "The soundscape standard," in Inter-Noise and Noise-Con Congress and Conference Proceedings (Institute of Noise Control Engineering, Wakefield, MA), Vol. 253, pp. 2188–2192.
8. Carles, J. L., Barrio, I. L., and De Lucio, J. V. (1999). "Sound influence on landscape values," Landscape Urban Plann. 43, 191–200.
9. Chesnokova, O., and Purves, R. S. (2018). "From image descriptions to perceived sounds and sources in landscape: Analyzing aural experience through text," Appl. Geogr. 93, 103–111.
10. Daiber, A., Kröller-Schön, S., Frenis, K., Oelze, M., Kalinovic, S., Vujacic-Mirski, K., Kuntic, M., Bayo Jimenez, M. T., Helmstädter, J., and Steven, S. (2019). "Environmental noise induces the release of stress hormones and inflammatory signaling molecules leading to oxidative stress and vascular dysfunction—Signatures of the internal exposome," Biofactors 45, 495–506.
11. Dubey, A., Naik, N., Parikh, D., Raskar, R., and Hidalgo, C. A. (2016). "Deep learning the city: Quantifying urban perception at a global scale," in Computer Vision—ECCV 2016: 14th European Conference, October 11–14, 2016, Amsterdam, The Netherlands (Springer, Cham, Switzerland), pp. 196–212.
12. Einhäuser, W., Da Silva, L. F., and Bendixen, A. (2020). "Intraindividual consistency between auditory and visual multistability," Perception 49, 119–138.
13. Ewing, R., and Cervero, R. (2010). "Travel and the built environment: A meta-analysis," J. Am. Plann. Assoc. 76, 265–294.
14. Fan, Z., Zhang, F., Loo, B. P., and Ratti, C. (2023). "Urban visual intelligence: Uncovering hidden city profiles with street view images," Proc. Natl. Acad. Sci. U.S.A. 120, e2220417120.
15. Gasco, L., Asensio, C., and De Arcas, G. (2017). "Towards the assessment of community response to noise through social media," in INTER-NOISE and NOISE-CON Congress and Conference Proceedings (Institute of Noise Control Engineering, Wakefield, MA), Vol. 255, pp. 2209–2217.
16. Gasco, L., Clavel, C., Asensio, C., and De Arcas, G. (2019). "Beyond sound level monitoring: Exploitation of social media to gather citizens subjective response to noise," Sci. Total Environ. 658, 69–79.
17. Gasco, L., Schifanella, R., Aiello, L. M., Quercia, D., Asensio, C., and de Arcas, G. (2020). "Social media and open data to quantify the effects of noise on health," Front. Sustain. Cities 2, 41.
18. Goines, L., and Hagler, L. (2007). "Noise pollution: A modern plague," South. Med. J. 100, 287–294.
19. Guan, F., Fang, Z., Wang, L., Zhang, X., Zhong, H., and Huang, H. (2022). "Modelling people's perceived scene complexity of real-world environments using street-view panoramas and open geodata," ISPRS J. Photogramm. Remote Sens. 186, 315–331.
20. Guo, Z., and Loo, B. P. (2013). "Pedestrian environment and route choice: Evidence from New York City and Hong Kong," J. Transp. Geogr. 28, 124–136.
21. Hara, K., Le, V., and Froehlich, J. (2013). "Combining crowdsourcing and Google Street View to identify street-level accessibility problems," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '13) (Association for Computing Machinery, New York), pp. 631–640.
22. Hasegawa, Y., and Lau, S. (2022). "Comprehensive audio-visual environmental effects on residential soundscapes and satisfaction: Partial least square structural equation modeling approach," Landscape Urban Plann. 220, 104351.
23. Hassen, N., and Kaufman, P. (2016). "Examining the role of urban street design in enhancing community engagement: A literature review," Health Place 41, 119–132.
24. Hawes, J. K., Gounaridis, D., and Newell, J. P. (2022). "Does urban agriculture lead to gentrification?," Landscape Urban Plann. 225, 104447.
25. Herranz-Pascual, K., Aspuru, I., and García, I. (2010). "Proposed conceptual model of environmental experience as framework to study the soundscape," in 39th International Congress on Noise Control Engineering 2010, INTER-NOISE 2010, pp. 2904–2912.
26. Ignatius, M., Xu, R., Hou, Y., Liang, X., Zhao, T., Chen, S., Wong, N. H., and Biljecki, F. (2022). "Local climate zones: Lessons from Singapore and potential improvement with street view imagery," ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci. X-4/W2-2022, 121–128.
27. ISO (2014). ISO 12913-1:2014, "Acoustics—Soundscape" (International Organization for Standardization, Geneva, Switzerland).
28. Jenkins, L., Tarnopolsky, A., and Hand, D. (1981). "Psychiatric admissions and aircraft noise from London Airport: Four-year, three-hospitals' study," Psychol. Med. 11, 765–782.
29. Jeon, J. Y., Lee, P. J., You, J., and Kang, J. (2010). "Perceptual assessment of quality of urban soundscapes with combined noise sources and water sounds," J. Acoust. Soc. Am. 127, 1357–1366.
30. Kelly, C. M., Wilson, J. S., Baker, E. A., Miller, D. K., and Schootman, M. (2013). "Using Google Street View to audit the built environment: Inter-rater reliability results," Ann. Behav. Med. 45, S108–S112.
31. Keralis, J. M., Javanmardi, M., Khanna, S., Dwivedi, P., Huang, D., Tasdizen, T., and Nguyen, Q. C. (2020). "Health and the built environment in United States cities: Measuring associations using Google Street View-derived indicators of the built environment," BMC Public Health 20, 215.
32. Kruse, J., Kang, Y., Liu, Y., Zhang, F., and Gao, S. (2021). "Places for play: Understanding human perception of playability in cities using street view images and deep learning," Comput. Environ. Urban Syst. 90, 101693.
33. Lionello, M., Aletta, F., and Kang, J. (2020). "A systematic review of prediction models for the experience of urban soundscapes," Appl. Acoust. 170, 107479.
34. Liu, J., Kang, J., Luo, T., and Behm, H. (2013b). "Landscape effects on soundscape experience in city parks," Sci. Total Environ. 454-455, 474–481.
35. Liu, J., Kang, J., Luo, T., Behm, H., and Coppack, T. (2013a). "Spatiotemporal variability of soundscapes in a multiple functional urban area," Landscape Urban Plann. 115, 1–9.
36. Liu, J., Wang, Y., Zimmer, C., Kang, J., and Yu, T. (2019). "Factors associated with soundscape experiences in urban green spaces: A case study in Rostock, Germany," Urban For. Urban Green. 37, 135–146.
37. Long, Y., and Liu, L. (2017). "How green are the streets? An analysis for central areas of Chinese cities using Tencent Street View," PLoS One 12, e171110.
38. Lu, Y., Tan, J., Hasegawa, Y., and Lau, S. K. (2023). "The interactive effects of traffic sound and window views on indoor soundscape perceptions in the residential area," J. Acoust. Soc. Am. 153, 972–989.
39. McKee, P., Erickson, D. J., Toomey, T., Nelson, T., Less, E. L., Joshi, S., and Jones-Webb, R. (2017). "The impact of single-container malt liquor sales restrictions on urban crime," J. Urban Health 94, 289–300.
40. Meecham, W. C., and Smith, H. G. (1977). "Effects of jet aircraft noise on mental hospital admissions," Br. J. Audiol. 11, 81–85.
41. Meng, Q., and Kang, J. (2015). "The influence of crowd density on the sound environment of commercial pedestrian streets," Sci. Total Environ. 511, 249–258.
42. Meng, Q., Sun, Y., and Kang, J. (2017). "Effect of temporary open-air markets on the sound environment and acoustic perception based on the crowd density characteristics," Sci. Total Environ. 601-602, 1488–1495.
43. Monazzam, M. R., Karimi, E., Nassiri, P., and Taghavi, L. (2015). "School-reopening impact on traffic-induced noise level at different land uses: A case study," Int. J. Environ. Sci. Technol. 12, 3089–3094.
44. Nguyen, Q. C., Khanna, S., Dwivedi, P., Huang, D., Huang, Y., Tasdizen, T., Brunisholz, K. D., Li, F., Gorman, W., and Nguyen, T. T. (2019). "Using Google Street View to examine associations between built environment characteristics and US health outcomes," Prev. Med. Rep. 14, 100859.
45. Nilsson, M. E., and Berglund, B. (2006). "Soundscape quality in suburban green areas and city parks," Acta Acust. united Ac. 92, 903–911.
46. Ning, H., Li, Z., Wang, C., Hodgson, M. E., Huang, X., and Li, X. (2022). "Converting street view images to land cover maps for metric mapping: A case study on sidewalk network extraction for the wheelchair users," Comput. Environ. Urban Syst. 95, 101808.
47. Paes, D., Irizarry, J., and Pujoni, D. (2021). "An evidence of cognitive benefits from immersive design review: Comparing three-dimensional perception and presence between immersive and non-immersive virtual environments," Autom. Constr. 130, 103849.
48. Perkins, D. D., Meeks, J. W., and Taylor, R. B. (1992). "The physical environment of street blocks and resident perceptions of crime and disorder: Implications for theory and measurement," J. Environ. Psychol. 12, 21–34.
49. Ryu, H., Ki, K. S., Yoo, J., Chang, S. I., and Kim, B. (2018). "Sound grade classification with sound mapping of national park trails in South Korea," J. Acoust. Soc. Am. 144, 1931.
50. Salem, T., Zhai, M., Workman, S., and Jacobs, N. (2018). "A multimodal approach to mapping soundscapes," in CVPR Workshops (IEEE Computer Society, Washington, DC), pp. 2524–2527.
51. Schafer, R. M. (1993). The Soundscape: Our Sonic Environment and the Tuning of the World (Destiny Books, Rochester, VT).
52. Schroeder, H. W., and Anderson, L. M. (1984). "Perception of personal safety in urban recreation sites," J. Leis. Res. 16, 178–194.
53. Skånberg, A., and Öhrström, E. (2002). "Adverse health effects in relation to urban residential soundscapes," J. Sound Vib. 250, 151–155.
54. Song, G., Liu, L., He, S., Cai, L., and Xu, C. (2020). "Safety perceptions among African migrants in Guangzhou and Foshan, China," Cities 99, 102624.
55. Sun, K., De Coensel, B., Filipan, K., Aletta, F., Van Renterghem, T., De Pessemier, T., Joseph, W., and Botteldooren, D. (2019). "Classification of soundscapes of urban public open spaces," Landscape Urban Plann. 189, 139–155.
56. Van Renterghem, T. (2019). "Towards explaining the positive effect of vegetation on the perception of environmental noise," Urban For. Urban Green. 40, 133–144.
57. Verma, D., Jana, A., and Ramamritham, K. (2019). "Artificial intelligence and human senses for the evaluation of urban surroundings," in Intelligent Human Systems Integration 2019: Proceedings of the 2nd International Conference on Intelligent Human Systems Integration (IHSI 2019), February 7–10, 2019, San Diego, CA, edited by W. Karwowski and T. Ahram (Springer, Cham, Switzerland), Vol. 903, pp. 852–857.
58. Verma, D., Jana, A., and Ramamritham, K. (2020). "Predicting human perception of the urban environment in a spatiotemporal urban setting using locally acquired street view images and audio clips," Build. Environ. 186, 107340.
59. Wang, M., Chen, Z., Rong, H. H., Mu, L., Zhu, P., and Shi, Z. (2022). "Ridesharing accessibility from the human eye: Spatial modeling of built environment with street-level images," Comput. Environ. Urban Syst. 97, 101858.
60. Wang, Y., Liu, D., and Luo, J. (2022). "Identification and improvement of hazard scenarios in non-motorized transportation using multiple deep learning and street view images," Int. J. Environ. Res. Public Health 19, 14054.
61. Wu, D., Gong, J., Liang, J., Sun, J., and Zhang, G. (2020). "Analyzing the influence of urban street greening and street buildings on summertime air pollution based on street view image data," ISPRS Int. J. Geo-Inf. 9, 500.
62. Zhanjun, H. E., Wang, Z., Xie, Z., Wu, L., and Chen, Z. (2022). "Multiscale analysis of the influence of street built environment on crime occurrence using street-view images," Comput. Environ. Urban Syst. 97, 101865.
63. Zhao, T., Liang, X., Tu, W., Huang, Z., and Biljecki, F. (2023). "Sensing urban soundscapes from street view imagery," Comput. Environ. Urban Syst. 99, 101915.