This study presents a dataset of audio-visual soundscape recordings made at 62 different locations in Singapore, initially captured as full-length recordings spanning 9–38 min each. For consistency and to reduce listener fatigue in future subjective studies, one-minute excerpts were cropped from the full-length recordings. An automated method, using pre-trained models for Pleasantness and Eventfulness (according to ISO 12913) within a modified partitioning around medoids algorithm, was employed to generate the set of excerpts by balancing coverage of the perceptual space with uniformity of distribution. A validation study confirmed that the method adhered to the intended design.

Soundscape research, at least under the paradigm of the international standards on soundscapes (ISO 12913), entails the use of soundscape recordings, in audio (and possibly also visual) format, in the context of both objective and subjective evaluations, due to “soundscapes” being defined as “acoustic environments as perceived or experienced and/or understood by a person or people, in context” (International Organization for Standardization, 2014). Such recordings, when collected in large quantities and analyzed in aggregate, can yield useful information about the acoustic nature (perceived or measured) of different locations (Schulte-Fortkamp and Jordan, 2016) and simultaneously serve to preserve their unique sonic heritage (Kang, 2016).

In the Singaporean context, the SoundscapeSG initiative (National Archives of Singapore, 2023) crowdsources audio recordings from citizens with a focus on preserving the sounds of Singapore. While significant when considered from the point of view of soundscape preservation, the crowdsourced nature of the dataset means that recording devices and conditions cannot be controlled or reasonably expected to be known. This may complicate a faithful reproduction for the purpose of a laboratory-based soundscape study, since key parameters such as the recording microphone sensitivity and the in situ sound pressure levels for the recordings are often unknown.

On the other hand, dedicated teams in Singapore have also collected publicly-available datasets of audio recordings in Singapore for which recording devices are known and/or consistent. For instance, Tan (2022) made 300 binaural recordings, each 3 min long, at randomly selected locations in Singapore for the purpose of deriving a hierarchical taxonomy of isolated sound sources in an urban environment and corresponding qualities in a subsequent listening experiment. Furthermore, the SINGA:PURA dataset (Ooi, 2021) contains audio recordings collected from static recording units in a wireless acoustic sensor network in Singapore for the purpose of urban sound tagging. The recordings comprise 18 h of labeled and 201 h of unlabeled data, with the labels providing information about the start and end times of sound classes related to the SONYC taxonomy (Cartwright, 2019).

However, recordings at random or static locations may not necessarily capture a range of soundscapes covering the Pleasantness–Eventfulness circumplex model of soundscape perception described in ISO/TS 12913-3:2019 (International Organization for Standardization, 2019). In addition, comprehensive analyses of large sets of long recordings, especially on perceptual quantities requiring human participants to respond, can be unfeasible due to the effect of listener fatigue and reduced attention from prolonged listening to auditory stimuli (Hume and Ahtamad, 2013).

These concerns are usually mitigated by cropping out a set of excerpts of shorter duration from the original set of recordings and generalizing from the set of shorter excerpts, under the assumption that the set of shorter excerpts maintains a similar perceptual diversity as the original. For example, a study investigating principal components of soundscape perception extracted 50 soundscape excerpts from a larger collection of urban outdoor soundscape recordings for listening tests with human participants (Axelsson, 2010). The selection of excerpts was performed by a consensus vote with the expert opinion of three research team members aiming to achieve a broad diversity of sound pressure levels and source types. Ultimately, the selected excerpts satisfactorily covered the principal component space investigated in the study, but this was only (and could only have been) determined upon post hoc analysis of the listening test results. Moreover, a hybrid approach was used by Thorogood (2015), where the authors combined their expert opinion with a listening test involving 31 human participants to excerpt 30 samples, each 4 s long, from the World Soundscape Project Tape Library containing over 223 h of sound recordings.

However, manual listening and pilot studies are typically labor-intensive, so automated methods to achieve similar results would enhance the reliability and replicability of any set of excerpts from a set of longer recordings. Hence, in this study, we provide a mathematical formulation for the idea of “perceptual diversity” in the context of the construct of the perceptual space generated by the Pleasantness–Eventfulness axes, and aim to make the following major contributions:

  1. The “Lion City Soundscapes” (LCS) dataset, which contains over 24 h of general-purpose soundscape recordings at 62 different locations in Singapore, recorded according to the standards specified in ISO 12913-2:2018 (International Organization for Standardization, 2018). This continues the study previously conducted by Ooi (2022), which identified and classified (but did not record) the 62 locations.

  2. A “perceptually diverse” set of 62 one-minute excerpts of the original full-length recordings in the LCS dataset for use as stimuli in future subjective studies involving human participants. The excerption methodology is automated and utilizes a loss function and selection algorithm in conjunction with an ensemble of pre-trained attention-based models for the prediction of Pleasantness and Eventfulness according to ISO/TS 12913-3:2019.

The organization of the manuscript is as follows: Section 2 describes the methodology used to record the full-length recordings in the LCS dataset. Section 3 describes the overall excerption methodology, loss function, and selection algorithm. Section 4 implements and validates the method in Sec. 3 on the full-length recordings in the LCS dataset, and Sec. 5 presents the results of the implementation and the validation experiment. Finally, Sec. 6 concludes the study and suggests potential avenues for future work.

The LCS dataset was recorded with reference to the Singapore Soundscape Site Selection Survey (S5), which previously identified 62 locations as characteristic Singaporean soundscapes spanning the quadrants generated by the Pleasantness and Eventfulness axes of the ISO/TS 12913-3:2019 circumplex model of soundscape perception (Ooi, 2022). For notational convenience, we henceforth refer to the Pleasantness and Eventfulness axes as the x-axis and y-axis, respectively. The S5 study labeled the quadrants “full of life and exciting (F&E)” for x > 0 and y > 0, “chaotic and restless (C&R)” for x < 0 and y > 0, “calm and tranquil (C&T)” for x > 0 and y < 0, and “boring and lifeless (B&L)” for x < 0 and y < 0. In particular, 15, 14, 15, and 18 locations were categorized as F&E, C&R, C&T, and B&L, respectively. The Global Positioning System (GPS) coordinates of the 62 locations also covered the entirety of mainland Singapore, so they comprised a variety of acoustic environments and geographic regions in Singapore of interest in soundscape preservation. Considering the alignment of the aim of the S5 study to identify characteristic soundscapes of Singapore under the ISO 12913 paradigm and the aim of this study to create a perceptually diverse soundscape dataset in the Singaporean context, the locations identified by the S5 study were deemed suitable for a preliminary reference.

However, an initial scouting of the exact GPS coordinates of the locations revealed that not all were physically or publicly accessible. For such locations, a different but nearby site that was feasible for recording was chosen instead. For example, the GPS coordinates identified by the S5 study for Upper Seletar Reservoir (1.404356, 103.803620) were located on water, so a nearby bank of the same reservoir (at 1.397272, 103.802971) was chosen as the recording site for the LCS dataset. A map of the final recording sites corresponding to the 62 locations is shown in Fig. 1.

Fig. 1.

Map of Singapore with recording sites for the Lion City Soundscapes dataset marked in orange, red, green, and black for locations respectively identified as “full of life and exciting,” “chaotic and restless,” “calm and tranquil,” and “boring and lifeless” by the S5 study. The map was generated using the OneMap API (Singapore Land Authority, 2022).


After identifying the exact coordinates of the recording sites for the LCS dataset, the recordings were performed with a setup similar to that used for the Urban Soundscapes of the World (USotW) database (De Coensel, 2017) and the Soundscape Indices Protocol (Mitchell, 2020) used for the International Soundscape Database. Specifically, the recording setup consisted of the following equipment mounted on a tripod:

  • IEC 61094-4 WS2F designated and IEC 61672-1 Class 1 compliant sound pressure acquisition system (1.0 m above the ground): GRAS 146AE Free-field Microphone connected to a HEAD Acoustics SQobold Data Acquisition System.

  • Binaural microphone (1.5 m above the ground): B&K Type 4101-B Binaural Microphone placed on a Neumann KU100 Dummy Head and connected to the same SQobold Data Acquisition System.

  • 360-degree video camera (1.8 m above the ground): Insta360 One R Twin Edition.

  • Third-order ambisonic microphone (2.1 m above the ground): Zylia ZM-1 Ambisonic Microphone (19 channels) connected to a Zylia ZR-1 Portable Recorder.

Windshields were also used for all microphones, except for the built-in microphone of the 360-degree video camera. Figure 2 shows the setup at one of the recording sites and the corresponding view from the 360-degree video camera.

Fig. 2.

Photo of (a) recording setup and (b) equirectangular projection of screenshot taken by 360-degree video camera at Sungei Buloh Wetland Reserve (GPS coordinates: 1.447397, 103.730346).


The full-length soundscape recordings were made between September 2022 and January 2023 at various times of the day (earliest 0825 h, latest 1906 h) for durations ranging from 9 to 38 min (mean 24.1 min, standard deviation 5.3 min, median 23.6 min). Due to the lack of waterproofing in the equipment used, all recordings were done in dry weather or under shelter in rainy weather. All audio was recorded at 24-bit depth with a sampling frequency of 48 kHz, and the 360-degree videos were recorded in spherical format with a resolution of 4096 × 2048 pixels.

With the full-length recordings in Sec. 2, we proceeded to extract excerpts of identical length that together maintain a perceptual diversity similar to that of the original set of full-length recordings. The excerption was motivated by practical considerations regarding the total experimental duration for further subjective evaluation in a laboratory context, since the total length of the entire set of full-length recordings (with a mean duration of 24.1 min per recording) would be excessive. In this section, we formulate a mathematical definition of the excerption task and the idea of “perceptual diversity” based on ISO/TS 12913-3:2019 to allow for a more rigorous choice of suitable excerpts from the full-length recordings. For brevity, a closed set of integers is denoted as $[[a,b]] := \{z \in \mathbb{Z} \mid a \le z \le b\}$.

Consider a set of K full-length soundscape recordings $\mathbb{S} := \{S_1, S_2, \ldots, S_K\}$, where each full-length recording $S_k := \{s_{k,0}, s_{k,1}, \ldots, s_{k,n_k-1}\}$ contains $n_k$ possible excerpts for each $k \in [[1,K]]$. Each excerpt is represented as a vector $s_{k,j}$ in an N-dimensional subset of $\mathbb{R}^N$ serving as a perceptual space, where $j \in [[0,n_k-1]]$ and $k \in [[1,K]]$. For instance, under the definition of the perceptual circumplex model in ISO/TS 12913-3:2019 (International Organization for Standardization, 2019), we have N = 2, with the orthogonal axes symbolizing Pleasantness and Eventfulness values, both in the closed interval $[-1,1]$, such that $s_{k,j} \in [-1,1]^2$.

Therefore, any set of excerpts (with exactly one excerpt taken from each full-length recording) can be considered as a set of vectors $R = \{r_1, r_2, \ldots, r_K\}$, where $r_k \in S_k$ for each $k \in [[1,K]]$. We would like to find the set $\hat{R} = \{\hat{r}_1, \hat{r}_2, \ldots, \hat{r}_K\}$, where $\hat{r}_k \in S_k$ for each $k \in [[1,K]]$, such that $\hat{R}$ is the most “perceptually diverse” set of excerpts of the full-length recordings $S_1, S_2, \ldots, S_K$.

To that end, we translate the idea of “perceptual diversity” into a loss function amenable to an optimization algorithm for the selection of vectors in $\hat{R}$. By “perceptually diverse,” the following criteria are desirable:

  1. The set of excerpts $\hat{R}$ should cover as much of the perceptual space as possible. To measure this criterion, we use the volume $V(\hat{R})$ of the convex hull of $\hat{R}$, which should be maximized.

  2. The set of excerpts $\hat{R}$ should be as uniformly distributed across the perceptual space as possible. To measure this criterion, we use the test statistic $D(\hat{R})$ for a generalized Kolmogorov-Smirnov (KS) test for equality of $\hat{R}$ to a uniform distribution, as described in Peacock (1983). The statistic compares the absolute differences between the observed and expected proportions of samples lying in a region more extreme than a given observed sample, and should be minimized. In the case of the 2-dimensional Pleasantness–Eventfulness model in ISO/TS 12913-3:2019, the “more extreme” regions are the quadrants to the top right, top left, bottom right, and bottom left of the observed sample when its coordinates are taken as the origin.

Heuristically, criterion (1) is necessary to allow for as varied a set of excerpts as possible covering the extremes of the perceptual space. Coupled with criterion (2), this prevents over-representation of subsets of points at the extremes of the perceptual space by the excerpts in $\hat{R}$, as illustrated visually in Fig. 3. Suppose the true distribution of a set of excerpts (represented as points in the 2-dimensional Pleasantness–Eventfulness space) is a uniform distribution over a circle of radius 0.5, with a sample shown in Fig. 3(a). Then, an observed set of excerpts similar to that in Fig. 3(b) is undesirable, because there are more “Pleasant” and “Eventful” examples of excerpts that exist and could have been chosen. On the other hand, an observed set of excerpts similar to that in Fig. 3(c) is also undesirable, because only extremely “Pleasant” and/or “Eventful” examples of excerpts are chosen and more neutral examples are ignored. The sample of points in Fig. 3(c) has a larger convex hull area of 0.63 than that in Fig. 3(b) of 0.24, but also a larger generalized KS test statistic of 0.34 than that in Fig. 3(b) of 0.20, thereby showing that both criteria (1) and (2) need to be considered in tandem in the selection of $\hat{R}$.
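The two criteria can be computed directly from a candidate set of points. As a rough illustration only (the function names, the Monte Carlo approximation of the reference distribution, and the choice of SciPy are our own assumptions, not the released implementation), the following Python sketch computes the convex hull volume V and a simplified version of the generalized KS statistic D against a uniform distribution over a disc:

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)

def hull_volume(points):
    """Volume of the convex hull of a point set (area in 2-D)."""
    return ConvexHull(points).volume

def ks_statistic_2d(points, radius=0.5, n_ref=20000):
    """Simplified sketch of a generalized (Peacock-style) 2-D KS statistic
    against a uniform distribution over a disc of the given radius.
    For each observed point taken as the origin, compare observed vs.
    expected (Monte Carlo) proportions in each of the four quadrants."""
    # Monte Carlo sample from the uniform disc (reference distribution).
    theta = rng.uniform(0.0, 2.0 * np.pi, n_ref)
    r = radius * np.sqrt(rng.uniform(0.0, 1.0, n_ref))
    ref = np.column_stack([r * np.cos(theta), r * np.sin(theta)])

    pts = np.asarray(points, dtype=float)
    d_max = 0.0
    for origin in pts:
        for sx in (-1, 1):       # quadrant orientation in x
            for sy in (-1, 1):   # quadrant orientation in y
                obs = np.mean((sx * (pts[:, 0] - origin[0]) >= 0) &
                              (sy * (pts[:, 1] - origin[1]) >= 0))
                exp = np.mean((sx * (ref[:, 0] - origin[0]) >= 0) &
                              (sy * (ref[:, 1] - origin[1]) >= 0))
                d_max = max(d_max, abs(obs - exp))
    return d_max
```

A set that hugs the boundary of the disc would score a large V but also a large D, while a tight central cluster would score small values of both, matching the intuition of Fig. 3.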

Fig. 3.

Samples of 100 points (in black) randomly drawn from uniform distributions centered on the origin (a) over a circle of radius 0.5, (b) over a circle of radius 0.3, and (c) on the perimeter of a circle of radius 0.45. Their convex hulls are bounded by dotted lines and shaded in orange.

Consequently, a suitable loss function to assess the “perceptual diversity” of an arbitrary set of excerpts R is
$L(R) = -\alpha V(R) + \beta D(R)$,
(1)
for some hyperparameters $\alpha, \beta \ge 0$. Let $\mathcal{R}$ be the set of all sets R of excerpts with exactly one excerpt taken from each full-length recording; then we must have
$\hat{R} = \operatorname*{argmin}_{R \in \mathcal{R}} L(R) = \operatorname*{argmin}_{R \in \mathcal{R}} \left( -\alpha V(R) + \beta D(R) \right)$,
(2)
where $\hat{R}$ minimizes the loss function in Eq. (1).

By the definition of $\hat{R}$ in Eq. (2), the objective of excerption is akin to finding a set of cluster centers for the clusters of points given by $S_1, S_2, \ldots, S_K$, with the cluster centers being actual points in $S_1, S_2, \ldots, S_K$. This is similar to the classic k-medoids problem, but using the loss function in Eq. (1) instead of a distance metric. Since Eq. (1) is not a distance metric, Lloyd's algorithm (Hastie, 2009) cannot be used for the optimization in Eq. (2). The partitioning around medoids (PAM) algorithm (Kaufman and Rousseeuw, 2005) could alternatively be considered, but it updates the clusters themselves in each iteration, which could cause points to move between clusters over the course of the algorithm. This property is undesirable in the context of this study, because the clusters $S_1, S_2, \ldots, S_K$ representing the full-length soundscape recordings contain the possible excerpts that can be extracted, and it is illogical for an excerpt of one full-length soundscape recording to be reclassified as belonging to a different full-length soundscape recording. Hence, we modify the standard PAM algorithm to treat the clusters as immutable, leaving cluster memberships unchanged in each iteration, as detailed in Algorithm 1.

Algorithm 1:

Modified PAM.

1: INPUT
2:  Set of full-length soundscape recordings $\mathbb{S} = \{S_1, S_2, \ldots, S_K\}$, where $S_k = \{s_{k,0}, s_{k,1}, \ldots, s_{k,n_k-1}\}$ for each $k \in [[1,K]]$; hyperparameters $\alpha, \beta$
3: INITIALIZE
4:  for $k = 1, \ldots, K$ // Random initialization
5:   select randomly $\hat{r}_k \in S_k$
6:  end for
7:  $\hat{R} \leftarrow \{\hat{r}_1, \hat{r}_2, \ldots, \hat{r}_K\}$
8:  $(L_{\text{old}}, L_{\text{new}}) \leftarrow (\infty, L(\hat{R}))$ // In practice, $\infty$ can be a large number, and $L(\hat{R}) = -\alpha V(\hat{R}) + \beta D(\hat{R})$
9: PROCEDURE
10:  while $L_{\text{new}} < L_{\text{old}}$ // Optimization
11:   $L_{\text{old}} \leftarrow L_{\text{new}}$
12:   $(Q, P) \leftarrow \operatorname*{argmin}_{k \in [[1,K]],\, j \in [[0,n_k-1]]} L(\hat{R} \setminus \{\hat{r}_k\} \cup \{s_{k,j}\})$
13:   $\hat{r}_Q \leftarrow s_{Q,P}$ // Swap the pair of points giving the greatest reduction in loss
14:   $\hat{R} \leftarrow \{\hat{r}_1, \hat{r}_2, \ldots, \hat{r}_K\}$ // Update optimal set with the swapped point
15:   $L_{\text{new}} \leftarrow L(\hat{R})$
16:  end while
17: OUTPUT: $\hat{R}$
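Since the clusters are immutable, an implementation of Algorithm 1 only ever swaps a representative with another candidate from the same cluster. The following sketch (our own naming, not the released code; the loss function is passed in as a black box mapping a (K, N) array of representatives to a scalar) mirrors the listing above:

```python
import numpy as np

def modified_pam(clusters, loss_fn, seed=0):
    """Modified PAM with immutable clusters (Algorithm 1 sketch).
    `clusters` is a list of (n_k, N) arrays, one per full-length recording;
    exactly one representative (excerpt) is kept per cluster."""
    rng = np.random.default_rng(seed)
    K = len(clusters)
    # Random initialization: one excerpt per recording.
    reps = np.array([c[rng.integers(len(c))] for c in clusters])
    loss_old, loss_new = np.inf, loss_fn(reps)
    while loss_new < loss_old:
        loss_old = loss_new
        best = (loss_old, None, None)
        # Consider swapping each representative with every candidate in
        # the *same* cluster only (clusters are never reassigned).
        for k in range(K):
            for j in range(len(clusters[k])):
                trial = reps.copy()
                trial[k] = clusters[k][j]
                trial_loss = loss_fn(trial)
                if trial_loss < best[0]:
                    best = (trial_loss, k, j)
        if best[1] is not None:  # apply the single best swap found
            loss_new, k, j = best
            reps[k] = clusters[k][j]
    return reps, loss_new
```

Restricting the swap step to the representative's own cluster is the only change relative to standard PAM's swap step; if no within-cluster swap reduces the loss, the loop terminates.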

For the implementation of Algorithm 1 on the K = 62 full-length recordings of the LCS dataset, a constant duration of 1 min per excerpt was chosen, in line with the duration of excerpts provided in the USotW database (De Coensel, 2017). To obtain the Pleasantness–Eventfulness coordinates of the possible excerpts, prediction models for Pleasantness and Eventfulness were trained separately on the ARAUS dataset (Ooi, 2024). The prediction models were attention-based deep neural networks performing soundscape augmentation in the abstract feature domain, as previously described by Watcharasupat (2022). Five models were trained for each attribute, by leaving out each of the folds in the fivefold cross-validation set of the ARAUS dataset in turn, and combined to form ensemble models for Pleasantness and Eventfulness by taking the mean of their predictions. The mean squared errors in prediction for the ensemble models for Pleasantness and Eventfulness were 0.1231 and 0.1217, respectively.

However, the architecture by Watcharasupat (2022) was originally designed for 30-s recordings, so we first applied the trained ensemble models to the K = 62 full-length recordings in the LCS dataset (after calibration to their in situ $L_{A,\text{eq}}$ levels) in two consecutive 30-s windows at a time. This gave pairs of Pleasantness and Eventfulness values representing $s_{k,j}$ for each possible 1-min excerpt. The windows were applied with a hop length of 1 s, such that the k-th full-length recording had at most $n_k = t_k - 59$ possible excerpts, where $t_k$ is the full-length recording duration in seconds. For compatibility with the S5 study results, we then discarded all possible excerpts where either vector in a pair fell outside the quadrant that the S5 study identified for the corresponding location, except when this would discard all possible excerpts for a given location.
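The windowing arithmetic can be sketched with a small helper (hypothetical, for illustration only; the defaults encode the 30-s windows and 1-s hop described above): each candidate 1-min excerpt corresponds to two consecutive 30-s windows, so a $t_k$-second recording yields $n_k = t_k - 59$ candidates.

```python
def candidate_excerpts(duration_s, win_s=30, hop_s=1):
    """Start times (s) of the first of the two consecutive `win_s`-second
    windows making up each candidate 1-min excerpt, hopped by `hop_s`.
    With the defaults, a t_k-second recording yields t_k - 59 candidates."""
    excerpt_s = 2 * win_s                      # one excerpt = two 30-s windows
    n = int(duration_s) - excerpt_s + hop_s    # n_k = t_k - 59 with defaults
    return [(t, t + win_s) for t in range(0, max(n, 0), hop_s)]
```

For example, a 24-min (1440-s) recording, near the mean duration in the LCS dataset, would yield 1381 candidate excerpts before the quadrant-based filtering.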

Subsequently, Algorithm 1 was run with hyperparameter values $\alpha = \beta = 1$, and with $D(\hat{R})$ obtained based on a comparison with a uniform distribution over a circle centered on the origin with radius 0.5. This is approximately the maximum distance from the origin of any point in $S_1 \cup S_2 \cup \cdots \cup S_{62}$ for the full-length recordings in the LCS dataset, as observable from Fig. 4(a). Since Algorithm 1 is sensitive to its random initialization, we ran it with 20 different random seeds and took the result of the seed giving the lowest value of $L(\hat{R})$ as the final set of one-minute excerpts from the LCS dataset.

Fig. 4.

KDE plots of $\mathbb{S}$ for the full-length recordings in the LCS dataset, overlaid with the most “perceptually diverse” set of excerpts $\hat{R}$ obtained after termination of Algorithm 1 for the best initial seed for parameter values (a) $\alpha = \beta = 1$ (seed 12); (b) $\alpha = 0, \beta = 1$ (seed 2); (c) $\alpha = 1, \beta = 0$ (seed 17). Darker colors in the KDE plots denote higher densities, with the threshold for visualization set at 0.02.


In addition, to verify that both the convex hull volume $V(R)$ and the generalized KS test statistic $D(R)$ were necessary terms in Eq. (1) to capture the idea of “perceptual diversity,” we also ran Algorithm 1 over the full-length recordings in the LCS dataset with two more hyperparameter settings: (a) $\alpha = 0, \beta = 1$ (only minimizing the generalized KS test statistic); and (b) $\alpha = 1, \beta = 0$ (only maximizing the convex hull volume). This served purely as a further validation experiment, and no actual excerption was performed with these two settings. Nonetheless, the same models and setup with 20 different random seeds as described earlier in this section were used for a fair comparison of results.

Table 1 summarizes the key results of the validation experiment described in Sec. 4, whereas Fig. 4 visually illustrates these results with the set of “perceptually diverse” excerpts $\hat{R}$ that would have been obtained after termination of Algorithm 1 on the best seeds [i.e., those having the lowest value of $L(\hat{R})$] for each hyperparameter setting. As explained in Sec. 4, the final set of 1-min excerpts cropped from the full-length recordings in the LCS dataset is that corresponding to the illustration in Fig. 4(a) for the hyperparameter values $\alpha = \beta = 1$.

Table 1.

Results of the validation experiment for the Lion City Soundscapes dataset by loss function hyperparameters used in Algorithm 1, across 20 differently-seeded runs. The symbols α, β, L, V, and D are as defined in Eq. (1).

α  β |            Mean ± standard deviation                            | Best (i.e., lowest L)
     | Epochs        L                V               D               | Seed   L        V      D
1  0 | 15.0 ± 1.3    −0.459 ± 0.000   0.459 ± 0.000   0.324 ± 0.015   | 17     −0.459   0.459  0.292
0  1 | 69.9 ± 11.2    0.140 ± 0.005   0.367 ± 0.014   0.140 ± 0.005   | 2       0.130   0.368  0.130
1  1 | 61.7 ± 9.6    −0.302 ± 0.007   0.456 ± 0.001   0.154 ± 0.007   | 12     −0.316   0.456  0.140

Notably, Algorithm 1 converged for all 20 seeds regardless of the hyperparameter setting, thereby confirming the validity of the proposed modifications to the PAM algorithm. On average, the convergence was fastest (in 15.0 epochs) for the setting α = 1 , β = 0 due to only a small subset of points being required to form a maximal convex hull, and was slowest (in 69.9 epochs) for the setting α = 0 , β = 1.

The setting $\alpha = 1, \beta = 0$ gave the largest mean convex hull volume V across the three settings (0.459) but also the highest mean generalized KS test statistic D (0.324). In contrast, the setting $\alpha = 0, \beta = 1$ gave the smallest mean convex hull volume V (0.367) but also the lowest mean generalized KS test statistic D (0.140). The setting $\alpha = \beta = 1$ gave values of V and D between those of the other two settings, so $\alpha = \beta = 1$ indeed balances the competing terms in the loss function in Eq. (1) to achieve both criteria (1) and (2), as expected. For the setting $\alpha = 1, \beta = 0$, Algorithm 1 also returned near-identical results regardless of the seed, with standard deviations below 0.001 across all 20 seeds, suggesting that, with suitable hyperparameter settings, the modified PAM algorithm could also serve to identify the largest possible convex hull in a near-global fashion.

Moreover, the sets of points in Figs. 4(c) and 4(a) have similar convex hull volumes (0.459 and 0.456, respectively). However, there is a relatively large cluster of points in the bottom right quadrant and a relative lack of points in the top right quadrant in Fig. 4(c), which could lead to an undesirable overemphasis on C&T excerpts at the expense of F&E excerpts if $\alpha = 1, \beta = 0$ were used. This is further supported by the generalized KS test statistic for the set of points in Fig. 4(c) being 0.292, as opposed to 0.140 for Fig. 4(a). The setting $\alpha = 0, \beta = 1$ also gave a set of excerpts with a noticeably smaller convex hull (volume 0.368) in Fig. 4(b) than in Fig. 4(a), although their distributions were similarly uniform, with generalized KS test statistic values of 0.130 for Fig. 4(b) and 0.140 for Fig. 4(a).

Incidentally, from the kernel density estimate (KDE) plots in Fig. 4, we can observe that all four quadrants generated by the Pleasantness–Eventfulness axes are represented by the set of full-length recordings in the LCS dataset, which serves to validate that the S5 study managed to identify locations that were F&E, C&R, C&T, and B&L. However, there is a slight bias towards soundscapes considered C&T (bottom right quadrant), and a relative lack of soundscapes considered B&L (bottom left quadrant), based on the predictions made by the ARAUS dataset prediction models. These characteristics can also be observed in a similar Valence–Arousal space for the audio stimuli in the IADS-2 and IADS-E datasets [Yang, 2018, Figs. 1(a) and 4(a)], which are completely independent of the stimuli in both the LCS and ARAUS datasets, so they could possibly be attributed more to the general nature of auditory perception than to the specific excerption methodology used.

Nonetheless, the automated excerption methodology in Sec. 3 and its implementation in Sec. 4 are not without limitations. In particular, when there are multiple extreme outliers in the data, the criterion of maximizing the convex hull could cause Algorithm 1 to select a disproportionate number of outliers, making the selected excerpts no longer “perceptually diverse.” This could be mitigated by adjusting α and β to suitable values to reduce the emphasis of Eq. (1) on maximizing the convex hull, but may require an additional hyperparameter tuning step to find suitable values for an arbitrary dataset. The use of models trained on the ARAUS dataset, which in turn utilizes responses to soundscapes from the USotW database, also implicitly assumes similarity in the underlying distribution over the Pleasantness–Eventfulness space. While this was not a problem with the LCS dataset, due to the concurrence of its aims with those of the ARAUS and USotW datasets in covering as broad a range of soundscapes as possible in the Pleasantness–Eventfulness space, caution should be exercised if the same method is applied to a restricted perceptual space, or to a different perceptual space altogether.

In conclusion, we presented the Lion City Soundscapes dataset, which consists of a collection of “full-length” soundscape recordings at 62 different locations in Singapore, and corresponding 1-min excerpts from the full-length recordings made at each location. The excerption methodology was automated, based on the application of pre-trained Pleasantness and Eventfulness models trained on the ARAUS dataset, and on a modified PAM algorithm with a loss function involving the volume of the convex hull and a generalized KS test statistic for equality of distributions.

Since the definitions of the proposed loss function in Eq. (1) and the modified PAM algorithm in Algorithm 1 did not initially assume any inherent meaning in the space of interest, the excerption methodology could theoretically be extended to any set of audio recordings or objects, with the perceptual space replaced by any latent embedding space of said recordings or objects. Hence, future work could compare the relative performance of the loss function and selection algorithm on different spaces or datasets comprising different modalities. Subjective validation of the categorical accuracy of the 1-min LCS excerpts with respect to the F&E, C&R, C&T, and B&L quadrant labels for the corresponding locations given by the S5 study could also be performed. Last, the LCS dataset itself could be used as an additional set of base urban soundscapes to obtain further responses for subjective studies such as those involving the ARAUS dataset.

We would like to express our heartfelt gratitude to the management teams at Changi Airport; National Parks Board, Bukit Timah Nature Reserve; National Parks Board, Jurong Lake Gardens; National Parks Board, Singapore Botanic Gardens; National Parks Board, Sungei Buloh Wetland Reserve; Raffles Marina; and the Singapore Zoo for their assistance with this study. This research was supported in part by the Singapore Ministry of National Development and in part by the National Research Foundation, Prime Minister's Office, under the Cities of Tomorrow Research Programme under Grant No. COT-V4-2020-1. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not reflect the views of the National Research Foundation, Singapore, or the Ministry of National Development, Singapore. Part of this work was done while K.N.W. was supported by the International Fellowship from the American Association of University Women (AAUW) and the IEEE Signal Processing Society Scholarship Program.

The authors have no conflicts of interest to report.

The data that support the findings of this study are openly available in the institutional repository of Nanyang Technological University (DR-NTU) at https://doi.org/10.21979/N9/AVHSBX. These comprise the metadata for the individual locations shown in Fig. 1, the full-length recordings in the LCS dataset, and the actual 1-min excerpts extracted based on the results in Fig. 4(a). The replication code for the results of this study is available at https://github.com/ntudsp/lion-city-soundscapes, and the code used to generate the visualization in Fig. 1 is available at https://github.com/ntudsp/lion-city-soundscapes-visualisation.

1. Axelsson, O., Nilsson, M. E., and Berglund, B. (2010). "A principal components model of soundscape perception," J. Acoust. Soc. Am. 128(5), 2836–2846.
2. Cartwright, M., Mendez Mendez, A. E., Cramer, J., Lostanlen, V., Dove, G., Wu, H.-H., Salamon, J., Nov, O., and Bello, J. P. (2019). "SONYC Urban Sound Tagging (SONYC-UST): A multilabel dataset from an urban acoustic sensor network," in Proceedings of the DCASE 2019 Workshop, October 25–26, New York.
3. De Coensel, B., Sun, K., and Botteldooren, D. (2017). "Urban Soundscapes of the World: Selection and reproduction of urban acoustic environments with soundscape in mind," in Proceedings of Inter-Noise 2017, August 27–30, Hong Kong.
4. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. (Springer Science+Business Media, LLC, New York).
5. Hume, K., and Ahtamad, M. (2013). "Physiological responses to and subjective estimates of soundscape elements," Appl. Acoust. 74(2), 275–281.
6. International Organization for Standardization (2014). ISO 12913-1:2014, "Acoustics—Soundscape—Part 1: Definition and conceptual framework" (International Organization for Standardization, Geneva, Switzerland).
7. International Organization for Standardization (2018). ISO 12913-2:2018, "Acoustics—Soundscape—Part 2: Data collection and reporting requirements" (International Organization for Standardization, Geneva, Switzerland).
8. International Organization for Standardization (2019). ISO 12913-3:2019, "Acoustics—Soundscape—Part 3: Data analysis" (International Organization for Standardization, Geneva, Switzerland).
9. Kang, J., Aletta, F., Gjestland, T. T., Brown, L. A., Botteldooren, D., Schulte-Fortkamp, B., Lercher, P., van Kamp, I., Genuit, K., Fiebig, A., Bento Coelho, J. L., Maffei, L., and Lavia, L. (2016). "Ten questions on the soundscapes of the built environment," Build. Environ. 108, 284–294.
10. Kaufman, L., and Rousseeuw, P. J. (2005). "Partitioning around medoids (program PAM)," in Finding Groups in Data: An Introduction to Cluster Analysis (John Wiley & Sons, Inc., New York), Chap. 2, pp. 68–125.
11. Mitchell, A., Oberman, T., Aletta, F., Erfanian, M., Kachlicka, M., Lionello, M., and Kang, J. (2020). "The soundscape indices (SSID) protocol: A method for urban soundscape surveys—Questionnaires with acoustical and contextual information," Appl. Sci. 10, 2397.
12. National Archives of Singapore (2023). "SoundscapeSG," https://www.nas.gov.sg/citizenarchivist/SoundScape/describe (Last viewed September 25, 2023).
13. Ooi, K., Lam, B., Hong, J. Y., Watcharasupat, K. N., Ong, Z.-T., and Gan, W.-S. (2022). "Singapore soundscape site selection survey (S5): Identification of characteristic soundscapes of Singapore via weighted k-means clustering," Sustainability 14, 7485.
14. Ooi, K., Ong, Z.-T., Watcharasupat, K. N., Lam, B., Hong, J. Y., and Gan, W.-S. (2024). "ARAUS: A large-scale dataset and baseline models of affective responses to augmented urban soundscapes," IEEE Trans. Affective Comput. 15, 105–120.
15. Ooi, K., Watcharasupat, K. N., Peksi, S., Karnapi, F. A., Ong, Z.-T., Chua, D., Leow, H.-W., Kwok, L.-L., Ng, X.-L., Loh, Z.-A., and Gan, W.-S. (2021). "A strongly-labelled polyphonic dataset of urban sounds with spatiotemporal context," in Proceedings of the 13th APSIPA ASC, December 14–17, Tokyo, Japan.
16. Peacock, J. A. (1983). "Two-dimensional goodness-of-fit testing in astronomy," Mon. Not. R. Astron. Soc. 202(3), 615–627.
17. Schulte-Fortkamp, B., and Jordan, P. (2016). "When soundscape meets architecture," Noise Mapp. 3(1), 216–231.
18. Singapore Land Authority (2022). "OneMap API Docs," https://www.onemap.gov.sg/docs/ (Last viewed February 19, 2024).
19. Tan, J. K. A., Hasegawa, Y., and Lau, S. K. (2022). "A comprehensive environmental sound categorization scheme of an urban city," Appl. Acoust. 199, 109018.
20. Thorogood, M., Fan, J., and Pasquier, P. (2015). "BF-Classifier: Background/foreground classification and segmentation of soundscape recordings," in Proceedings of the 10th Audio Mostly Conference (AM 2015), October 7–9, Thessaloniki, Greece.
21. Watcharasupat, K. N., Ooi, K., Lam, B., Wong, T., Ong, Z.-T., and Gan, W.-S. (2022). "Autonomous in-situ soundscape augmentation via joint selection of masker and gain," IEEE Signal Process. Lett. 29, 1749–1753.
22. Yang, W., Makita, K., Nakao, T., Kanayama, N., Machizawa, M. G., Sasaoka, T., Sugata, A., Kobayashi, R., Hiramoto, R., Yamawaki, S., Iwanaga, M., and Miyatani, M. (2018). "Affective auditory stimulus database: An expanded version of the International Affective Digitized Sounds (IADS-E)," Behav. Res. Methods 50(4), 1415–1429.