This study presents a dataset of audiovisual soundscape recordings at 62 different locations in Singapore, initially made as full-length recordings over spans of 9–38 min. For consistency and to reduce listener fatigue in future subjective studies, one-minute excerpts were cropped from the full-length recordings. An automated method using pre-trained models for Pleasantness and Eventfulness (according to ISO 12913) in a modified partitioning around medoids algorithm was employed to generate the set of excerpts by balancing the need to encompass the perceptual space with uniformity in distribution. A validation study on the method confirmed its adherence to the intended design.
1. Introduction
Soundscape research, at least under the paradigm of the international standards on soundscapes (ISO 12913), entails the use of soundscape recordings, in audio (and possibly also visual) format, in the context of both objective and subjective evaluations, due to “soundscapes” being defined as “acoustic environments as perceived or experienced and/or understood by a person or people, in context” (International Organization for Standardization, 2014). Such recordings, when collected in large quantities and analyzed in aggregate, can yield useful information about the acoustic nature (perceived or measured) of different locations (Schulte-Fortkamp and Jordan, 2016) and simultaneously serve to preserve their unique sonic heritage (Kang et al., 2016).
In the Singaporean context, the SoundscapeSG initiative (National Archives of Singapore, 2023) crowdsources audio recordings from citizens with a focus on preserving the sounds of Singapore. While significant from the point of view of soundscape preservation, the crowdsourced nature of the dataset means that recording devices and conditions cannot be controlled or reasonably expected to be known. This may complicate a faithful reproduction for the purpose of a laboratory-based soundscape study, since key parameters such as the recording microphone sensitivity and the in situ sound pressure levels for the recordings are often unknown.
On the other hand, dedicated teams in Singapore have also collected publicly available datasets of audio recordings in Singapore for which recording devices are known and/or consistent. For instance, Tan (2022) made 300 binaural recordings, each 3 min long, at randomly selected locations in Singapore for the purpose of deriving a hierarchical taxonomy of isolated sound sources in an urban environment and corresponding qualities in a subsequent listening experiment. Furthermore, the SINGA:PURA dataset (Ooi et al., 2021) contains audio recordings collected from static recording units in a wireless acoustic sensor network in Singapore for the purpose of urban sound tagging. The recordings comprise 18 h of labeled and 201 h of unlabeled data, with the labels providing information about the start and end times of sound classes related to the SONYC taxonomy (Cartwright et al., 2019).
However, recordings at random or static locations may not necessarily capture a range of soundscapes covering the Pleasantness–Eventfulness circumplex model of soundscape perception described in ISO/TS 12913-3:2019 (International Organization for Standardization, 2019). In addition, comprehensive analyses of large sets of long recordings, especially of perceptual quantities requiring human participants to respond, can be infeasible due to the effects of listener fatigue and reduced attention from prolonged listening to auditory stimuli (Hume and Ahtamad, 2013).
These concerns are usually mitigated by cropping a set of shorter excerpts from the original set of recordings and generalizing from the set of shorter excerpts, under the assumption that the shorter excerpts maintain a perceptual diversity similar to that of the original. For example, a study investigating principal components of soundscape perception extracted 50 soundscape excerpts from a larger collection of urban outdoor soundscape recordings for listening tests with human participants (Axelsson et al., 2010). The selection of excerpts was performed by a consensus vote among three research team members aiming, in their expert opinion, to achieve a broad diversity of sound pressure levels and source types. Ultimately, the selected excerpts satisfactorily covered the principal component space investigated in the study, but this was only (and could only have been) determined upon post hoc analysis of the listening test results. Moreover, a hybrid approach was used by Thorogood et al. (2015), where the authors combined their expert opinion with a listening test involving 31 human participants to excerpt 30 samples, each 4 s long, from the World Soundscape Project Tape Library containing over 223 h of sound recordings.
However, manual listening and pilot studies are typically labor-intensive, so automated methods achieving similar results would enhance the reliability and replicability of any set of excerpts drawn from a set of longer recordings. Hence, in this study, we provide a mathematical formulation for the idea of “perceptual diversity” in the context of the perceptual space generated by the Pleasantness–Eventfulness axes, and aim to make the following major contributions:

The “Lion City Soundscapes” (LCS) dataset, which contains over 24 hours of general-purpose soundscape recordings at 62 different locations in Singapore, recorded according to the standards specified in ISO 12913-2:2018 (International Organization for Standardization, 2018). This continues the study previously conducted by Ooi et al. (2022), who identified and classified (but did not record) the 62 locations.

A “perceptually diverse” set of 62 one-minute excerpts of the original full-length recordings in the LCS dataset for use as stimuli in future subjective studies involving human participants. The excerption methodology is automated and utilizes a loss function and selection algorithm in conjunction with an ensemble of pre-trained attention-based models for the prediction of Pleasantness and Eventfulness according to ISO/TS 12913-3:2019.
The organization of the manuscript is as follows: Section 2 describes the methodology used to record the full-length recordings in the LCS dataset. Section 3 describes the overall excerption methodology, loss function, and selection algorithm. Section 4 implements and validates the method in Sec. 3 on the full-length recordings in the LCS dataset, and Sec. 5 presents the results of the implementation and the validation experiment. Finally, Sec. 6 concludes the study and suggests potential avenues for future work.
2. Recording methodology
2.1 Site identification
The LCS dataset was recorded with reference to the Singapore Soundscape Site Selection Survey (S5), which previously identified 62 locations as characteristic Singaporean soundscapes spanning the quadrants generated by the Pleasantness and Eventfulness axes of the ISO/TS 12913-3:2019 circumplex model of soundscape perception (Ooi et al., 2022). For notational convenience, we henceforth refer to the Pleasantness and Eventfulness axes as the x-axis and y-axis, respectively. The S5 study labeled the quadrants “full of life and exciting (F&E)” for $x > 0$ and $y > 0$, “chaotic and restless (C&R)” for $x < 0$ and $y > 0$, “calm and tranquil (C&T)” for $x > 0$ and $y < 0$, and “boring and lifeless (B&L)” for $x < 0$ and $y < 0$. In particular, 15, 14, 15, and 18 locations were categorized as F&E, C&R, C&T, and B&L, respectively. The Global Positioning System (GPS) coordinates of the 62 locations also covered the entirety of mainland Singapore, so they comprised a variety of acoustic environments and geographic regions of interest for soundscape preservation. Given the alignment between the S5 study's aim of identifying characteristic soundscapes of Singapore under the ISO 12913 paradigm and this study's aim of creating a perceptually diverse soundscape dataset in the Singaporean context, the locations identified by the S5 study were deemed a suitable preliminary reference.
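For illustration, the quadrant labeling above can be expressed as a small helper function. This is a sketch for clarity only (the function name is ours; it is not part of the dataset tooling):

```python
def quadrant(x, y):
    """Map circumplex coordinates (x = Pleasantness, y = Eventfulness)
    to the S5 quadrant labels described above."""
    if x > 0 and y > 0:
        return "F&E"  # full of life and exciting
    if x < 0 and y > 0:
        return "C&R"  # chaotic and restless
    if x > 0 and y < 0:
        return "C&T"  # calm and tranquil
    if x < 0 and y < 0:
        return "B&L"  # boring and lifeless
    return None  # on an axis: not assigned to any quadrant
```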
However, an initial scouting of the exact GPS coordinates of the locations revealed that not all were physically or publicly accessible. For such locations, a different but nearby site that was feasible for recording was chosen instead. For example, the GPS coordinates identified by the S5 study for Upper Seletar Reservoir (1.404356, 103.803620) were located on water, so a nearby bank of the same reservoir (at 1.397272, 103.802971) was chosen as the recording site for the LCS dataset. A map of the final recording sites corresponding to the 62 locations is shown in Fig. 1.
2.2 Recording setup and conditions
After identifying the exact coordinates of the recording sites for the LCS dataset, the recordings were performed with a setup similar to those used for the Urban Soundscapes of the World (USotW) database (De Coensel et al., 2017) and the Soundscape Indices Protocol (Mitchell et al., 2020) of the International Soundscape Database. Specifically, the recording setup consisted of the following equipment mounted on a tripod:

IEC 61094-4 WS2F-designated and IEC 61672-1 Class 1-compliant sound pressure acquisition system (1.0 m above the ground): GRAS 146AE Free-field Microphone connected to a HEAD Acoustics SQobold Data Acquisition System.

Binaural microphone (1.5 m above the ground): B&K Type 4101B Binaural Microphone placed on a Neumann KU100 Dummy Head and connected to the same SQobold Data Acquisition System.

360-degree video camera (1.8 m above the ground): Insta360 One R Twin Edition.

Third-order ambisonic microphone (2.1 m above the ground): Zylia ZM-1 Ambisonic Microphone (19 channels) connected to a Zylia ZR-1 Portable Recorder.
Windshields were also used for all microphones, except for the built-in microphone of the 360-degree video camera. Figure 2 shows the setup at one of the recording sites and the corresponding view from the 360-degree video camera.
The full-length soundscape recordings were made between September 2022 and January 2023 at various times of the day (earliest 0825 h, latest 1906 h) for durations ranging from 9 to 38 min (mean 24.1 min, standard deviation 5.3 min, median 23.6 min). Due to the lack of waterproofing in the equipment used, all recordings were made in dry weather or under shelter in rainy weather. All audio was recorded at 24-bit depth with a sampling frequency of 48 kHz, and the 360-degree videos were recorded in spherical format with a resolution of 4096 × 2048 pixels.
3. Excerption methodology
With the full-length recordings in Sec. 2, we proceeded to extract excerpts of identical length that together maintain a perceptual diversity similar to that of the original set of full-length recordings. The excerption was motivated by practical considerations regarding the total experimental duration for further subjective evaluation in a laboratory context, since the total length of the entire set of full-length recordings (with a mean duration of 24.1 min) would be prohibitive. In this section, we formulate a mathematical definition of the excerption task and the idea of “perceptual diversity” based on ISO/TS 12913-3:2019 to allow for a more rigorous choice of suitable excerpts from the full-length recordings. For brevity, a closed set of integers is denoted as $[[a, b]] := \{z \in \mathbb{Z} \mid a \leq z \leq b\}$.
3.1 Task definition
Consider a set of $K$ full-length soundscape recordings $s := \{S_1, S_2, \ldots, S_K\}$, where each full-length recording $S_k := \{s_{k,0}, s_{k,1}, \ldots, s_{k,n_k-1}\}$ contains $n_k$ possible excerpts for each $k \in [[1, K]]$. Each excerpt is represented as a vector $s_{k,j}$ in an $N$-dimensional subset of $\mathbb{R}^N$ serving as a perceptual space, where $j \in [[0, n_k-1]]$ and $k \in [[1, K]]$. For instance, under the definition of the perceptual circumplex model in ISO/TS 12913-3:2019 (International Organization for Standardization, 2019), we have $N = 2$, with the orthogonal axes symbolizing Pleasantness and Eventfulness values, both in the closed interval $[-1, 1]$, such that $s_{k,j} \in [-1, 1]^2$.
Therefore, any set of excerpts (with exactly one excerpt taken from each full-length recording) can be considered as a set of vectors $R = \{r_1, r_2, \ldots, r_K\}$, where $r_k \in S_k$ for each $k \in [[1, K]]$. We would like to find the set $\hat{R} = \{\hat{r}_1, \hat{r}_2, \ldots, \hat{r}_K\}$, where $\hat{r}_k \in S_k$ for each $k \in [[1, K]]$, such that $\hat{R}$ is the most “perceptually diverse” set of excerpts of the full-length recordings $S_1, S_2, \ldots, S_K$.
3.2 “Perceptually diverse” loss function
To that end, we translate the idea of “perceptual diversity” into a loss function amenable to an optimization algorithm for the selection of vectors in $\hat{R}$. By “perceptually diverse,” the following criteria are desirable:

(1) The set of excerpts $\hat{R}$ should cover as much of the perceptual space as possible. To measure this criterion, we use the volume $V(\hat{R})$ of the convex hull of $\hat{R}$, which should be maximized.

(2) The set of excerpts $\hat{R}$ should be as uniformly distributed across the perceptual space as possible. To measure this criterion, we use the test statistic $D(\hat{R})$ for a generalized Kolmogorov–Smirnov (KS) test of the equality of $\hat{R}$ to a uniform distribution, as described in Peacock (1983). The statistic compares the absolute differences between the observed and expected proportions of samples lying in a region more extreme than a given observed sample, and should be minimized. In the case of the 2-dimensional Pleasantness–Eventfulness model in ISO/TS 12913-3:2019, the “extreme regions” are the quadrants to the top right, top left, bottom right, and bottom left of the observed sample when its coordinates are taken as the origin.
Heuristically, criterion (1) is necessary to allow for as varied a set of excerpts as possible, covering the extremes of the perceptual space. Coupled with criterion (2), this prevents overrepresentation of subsets of points at the extremes of the perceptual space by the excerpts in $\hat{R}$, as illustrated visually in Fig. 3. Suppose the true distribution of a set of excerpts (represented as points in the 2-dimensional Pleasantness–Eventfulness space) is a uniform distribution over a circle of radius 0.5, with a sample shown in Fig. 3(a). Then, an observed set of excerpts similar to that in Fig. 3(b) is undesirable, because more “Pleasant” and “Eventful” examples of excerpts exist and could have been chosen. On the other hand, an observed set of excerpts similar to that in Fig. 3(c) is also undesirable, because only extremely “Pleasant” and/or “Eventful” examples of excerpts are chosen and more neutral examples are ignored. The sample of points in Fig. 3(c) has a larger convex hull area of 0.63 than that in Fig. 3(b) of 0.24, but also a larger generalized KS test statistic of 0.34 than that in Fig. 3(b) of 0.20, thereby showing that both criteria (1) and (2) need to be considered in tandem in the selection of $\hat{R}$. Accordingly, the two criteria are combined into the loss function $L(\hat{R}) = -\alpha V(\hat{R}) + \beta D(\hat{R})$ in Eq. (1), where the hyperparameters $\alpha$ and $\beta$ weight the two competing terms.
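To make the two criteria concrete, the following sketch (our illustration, not the authors' released code) computes the convex-hull area and a quadrant-based two-sample approximation of the generalized KS statistic against a Monte Carlo reference sample drawn uniformly from a disc of radius 0.5, and combines them as $-\alpha V + \beta D$. The strict-inequality quadrant convention and the reference sample size are our assumptions:

```python
import numpy as np
from scipy.spatial import ConvexHull

RADIUS = 0.5
rng = np.random.default_rng(0)

def uniform_disc(n, radius=RADIUS, rng=rng):
    """Sample n points uniformly over a disc (sqrt trick for the radius)."""
    r = radius * np.sqrt(rng.uniform(size=n))
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

def ks_2d(sample, reference):
    """Largest quadrant-wise discrepancy between two 2-D samples.

    For each observed point taken as the origin, compare the fraction of
    each sample falling strictly inside the four quadrants around it
    (one convention; Peacock's test also considers closed quadrants).
    """
    d = 0.0
    for (x, y) in sample:
        for sx, sy in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
            f_s = np.mean((sx * (sample[:, 0] - x) > 0) & (sy * (sample[:, 1] - y) > 0))
            f_r = np.mean((sx * (reference[:, 0] - x) > 0) & (sy * (reference[:, 1] - y) > 0))
            d = max(d, abs(f_s - f_r))
    return d

def loss(R, alpha=1.0, beta=1.0, reference=None):
    """-alpha * V(R) + beta * D(R) over a set of 2-D points R."""
    R = np.asarray(R, dtype=float)
    if reference is None:
        reference = uniform_disc(2000)
    V = ConvexHull(R).volume  # for 2-D points, .volume is the enclosed area
    D = ks_2d(R, reference)
    return -alpha * V + beta * D
```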
3.3 Selection algorithm
By the definition of $\hat{R}$ in Eq. (2), the objective of excerption is akin to finding a set of cluster centers for the clusters of points given by $S_1, S_2, \ldots, S_K$, with the cluster centers being actual points in $S_1, S_2, \ldots, S_K$. This is similar to the classic k-medoids problem but using the loss function in Eq. (1) instead of a distance metric. Since Eq. (1) is not a distance metric, Lloyd's algorithm (Hastie et al., 2009) cannot be used for the optimization in Eq. (2). The partitioning around medoids (PAM) algorithm (Kaufman and Rousseeuw, 2005) could alternatively be considered, but it updates the clusters themselves in each iteration, which could cause points in each cluster to move to different clusters over the course of the algorithm. This property is undesirable in the context of this study, because the clusters $S_1, S_2, \ldots, S_K$ representing the full-length soundscape recordings contain the possible excerpts that can be extracted, and it would be illogical for the excerpts of a given full-length soundscape recording to be reclassified as belonging to a different full-length soundscape recording. Hence, we modify the standard PAM algorithm to treat the clusters as immutable and not update them in each iteration, as detailed in Algorithm 1.
1: INPUT:
2: Set of full-length soundscape recordings $s = \{S_1, S_2, \ldots, S_K\}$, where $S_k = \{s_{k,0}, s_{k,1}, \ldots, s_{k,n_k-1}\}$ for each $k \in [[1, K]]$; hyperparameters $\alpha, \beta$.
3: INITIALIZE:
4: for $k = 1, \ldots, K$ // Random initialization
5:  select randomly $\hat{r}_k \in S_k$
6: end for
7: $\hat{R} \leftarrow \{\hat{r}_1, \hat{r}_2, \ldots, \hat{r}_K\}$
8: $(L_{\mathrm{old}}, L_{\mathrm{new}}) \leftarrow (\infty, L(\hat{R}))$ // In practice, $\infty$ can be a large number, and $L(\hat{R}) = -\alpha V(\hat{R}) + \beta D(\hat{R})$
9: PROCEDURE:
10: while $L_{\mathrm{new}} < L_{\mathrm{old}}$ // Optimization
11:  $L_{\mathrm{old}} \leftarrow L_{\mathrm{new}}$
12:  $(Q, P) \leftarrow \operatorname{argmin}_{k \in [[1, K]],\, j \in [[0, n_k-1]]} L\big((\hat{R} \setminus \{\hat{r}_k\}) \cup \{s_{k,j}\}\big)$
13:  $\hat{r}_Q \leftarrow s_{Q,P}$ // Swap in the point yielding the greatest reduction in the loss function
14:  $\hat{R} \leftarrow \{\hat{r}_1, \hat{r}_2, \ldots, \hat{r}_K\}$ // Update optimal set with the swapped point
15:  $L_{\mathrm{new}} \leftarrow L(\hat{R})$
16: end while
17: OUTPUT: $\hat{R}$
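Algorithm 1 can be sketched in code as follows. This is a minimal illustration under our own assumptions (clusters given as arrays of 2-D points, and a generic `loss` callable standing in for Eq. (1)); it is not the authors' released implementation. The key departure from standard PAM is that the $k$th medoid may only ever be swapped with points from the $k$th cluster, so cluster membership stays immutable:

```python
import numpy as np

def modified_pam(clusters, loss, seed=None):
    """Modified PAM (Algorithm 1): one medoid per cluster, clusters immutable.

    clusters: list of K arrays, each of shape (n_k, 2), holding the candidate
              excerpts of one full-length recording in the perceptual space.
    loss:     callable mapping a (K, 2) array of selected points to a scalar.
    Returns the selected indices and the corresponding (K, 2) array.
    """
    rng = np.random.default_rng(seed)
    # INITIALIZE: pick one random candidate from each cluster.
    idx = [int(rng.integers(len(c))) for c in clusters]
    R = np.array([c[j] for c, j in zip(clusters, idx)], dtype=float)
    L_old, L_new = np.inf, loss(R)
    # PROCEDURE: greedily apply the single best within-cluster swap.
    while L_new < L_old:
        L_old = L_new
        best_l, Q, P = L_old, None, None
        for k, cluster in enumerate(clusters):
            for j, point in enumerate(cluster):
                cand = R.copy()
                cand[k] = point  # swap only within cluster k
                l = loss(cand)
                if l < best_l:
                    best_l, Q, P = l, k, j
        if Q is None:  # no swap strictly reduces the loss: converged
            break
        idx[Q], R[Q], L_new = P, clusters[Q][P], best_l
    return idx, R
```

As in the paper, the loop terminates once no single swap strictly reduces the loss, so convergence is guaranteed because the loss decreases monotonically over a finite search space.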
4. Implementation and validation experiment setup for modified PAM algorithm
For the implementation of Algorithm 1 on the $K = 62$ full-length recordings of the LCS dataset, a constant duration of 1 min per excerpt was chosen, in line with the duration of the excerpts provided in the USotW database (De Coensel et al., 2017). To obtain the Pleasantness–Eventfulness coordinates of the possible excerpts, prediction models for Pleasantness and Eventfulness were trained separately on the ARAUS dataset (Ooi et al., 2024). The prediction models were attention-based deep neural networks performing soundscape augmentation in the abstract feature domain, as previously described by Watcharasupat et al. (2022). Five models were trained for each attribute by leaving out each of the folds in the five-fold cross-validation set of the ARAUS dataset in turn, and they were combined into ensemble models for Pleasantness and Eventfulness by taking the mean of their predictions. The mean squared errors in prediction for the ensemble models for Pleasantness and Eventfulness were 0.1231 and 0.1217, respectively.
However, the architecture by Watcharasupat et al. (2022) was originally designed for 30-s recordings, so we first applied the trained ensemble models to the $K = 62$ full-length recordings in the LCS dataset (after calibration to their in situ $L_{A,\mathrm{eq}}$ levels) in two consecutive 30-s windows at a time. This gave pairs of Pleasantness and Eventfulness values representing $s_{k,j}$ for each possible 1-min excerpt. The windows were applied with a hop length of 1 s, such that the $k$th full-length recording had at most $n_k = \lfloor t_k \rfloor - 59$ possible excerpts, where $t_k$ is the full-length recording duration in seconds. For compatibility with the S5 study results, we then discarded all possible excerpts where either vector in a pair fell outside the quadrant that the S5 study identified for the corresponding location, except when this would have discarded all possible excerpts for a given location.
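The enumeration of candidate excerpts described above can be sketched as follows. This is an illustration only: `predict_30s` is a hypothetical stand-in for the pre-trained ensemble models, and the audio is assumed to be a mono sample array:

```python
import numpy as np

SR = 48_000  # sampling rate of the LCS recordings, in Hz

def candidate_excerpts(audio, predict_30s, sr=SR):
    """Yield (start_second, first_half_vector, second_half_vector) for
    every possible 1-min excerpt, using a 1-s hop.

    Each *_vector is the model's (Pleasantness, Eventfulness) prediction
    for one of the two consecutive 30-s windows covering the excerpt.
    """
    t = len(audio) // sr  # floor of the recording duration in seconds
    n_k = t - 59          # number of possible excerpts, floor(t_k) - 59
    for start in range(max(n_k, 0)):
        first = audio[start * sr:(start + 30) * sr]
        second = audio[(start + 30) * sr:(start + 60) * sr]
        yield start, predict_30s(first), predict_30s(second)
```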
Subsequently, Algorithm 1 was run with hyperparameter values $\alpha = \beta = 1$, and with $D(\hat{R})$ obtained based on a comparison with a uniform distribution over a circle centered on the origin with radius 0.5. This is approximately the maximum distance from the origin of any point in $S_1 \cup S_2 \cup \cdots \cup S_{62}$ for the full-length recordings in the LCS dataset, as observable from Fig. 4(a). Since Algorithm 1 is sensitive to its random initialization, we ran it with 20 different random seeds and took the result of the seed giving the lowest value of $L(\hat{R})$ as the final set of one-minute excerpts from the LCS dataset.
In addition, to verify that both the convex hull volume $V(R)$ and the generalized KS test statistic $D(R)$ were necessary terms in Eq. (1) to capture the idea of “perceptual diversity,” we also ran Algorithm 1 over the full-length recordings in the LCS dataset with two more hyperparameter settings: (a) $\alpha = 0, \beta = 1$ (only minimizing the generalized KS test statistic); and (b) $\alpha = 1, \beta = 0$ (only maximizing the convex hull volume). This served purely as a further validation experiment, and no actual excerption was performed with these two settings. Nonetheless, the same models and setup with 20 different random seeds as described earlier in this section were used for a fair comparison of results.
5. Results and discussion
Table 1 summarizes the key results of the validation experiment described in Sec. 4, whereas Fig. 4 visually illustrates these results with the set of “perceptually diverse” excerpts $\hat{R}$ that would have been obtained after termination of Algorithm 1 on the best seeds [i.e., those having the lowest value of $L(\hat{R})$] for each hyperparameter setting. As explained in Sec. 4, the final set of 1-min excerpts cropped from the full-length recordings in the LCS dataset is the one corresponding to the illustration in Fig. 4(a) for the hyperparameter values $\alpha = \beta = 1$.
$\alpha$ | $\beta$ | Epochs (mean ± SD) | $L$ (mean ± SD) | $V$ (mean ± SD) | $D$ (mean ± SD) | Best seed | Best $L$ | Best $V$ | Best $D$
1 | 0 | 15.0 ± 1.3 | −0.459 ± 0.000 | 0.459 ± 0.000 | 0.324 ± 0.015 | 17 | −0.459 | 0.459 | 0.292
0 | 1 | 69.9 ± 11.2 | 0.140 ± 0.005 | 0.367 ± 0.014 | 0.140 ± 0.005 | 2 | 0.130 | 0.368 | 0.130
1 | 1 | 61.7 ± 9.6 | −0.302 ± 0.007 | 0.456 ± 0.001 | 0.154 ± 0.007 | 12 | −0.316 | 0.456 | 0.140
(“Best” columns correspond to the seed with the lowest $L$ out of the 20 seeds.)
Notably, Algorithm 1 converged for all 20 seeds regardless of the hyperparameter setting, thereby confirming the validity of the proposed modifications to the PAM algorithm. On average, the convergence was fastest (in 15.0 epochs) for the setting $ \alpha = 1 , \beta = 0$ due to only a small subset of points being required to form a maximal convex hull, and was slowest (in 69.9 epochs) for the setting $ \alpha = 0 , \beta = 1$.
The setting $\alpha = 1, \beta = 0$ gave the largest mean convex hull volume $V$ across the three settings (0.459) but also the highest mean generalized KS test statistic $D$ (0.324). In contrast, the setting $\alpha = 0, \beta = 1$ gave the smallest mean convex hull volume $V$ across the three settings (0.367) but also the lowest mean generalized KS test statistic $D$ (0.140). The setting $\alpha = \beta = 1$ gave values of $V$ and $D$ in between those of the other two settings, so $\alpha = \beta = 1$ indeed balances the competing terms of the loss function in Eq. (1) to achieve both criteria (1) and (2), as expected. For the setting $\alpha = 1, \beta = 0$, Algorithm 1 also consistently returned similar results regardless of the seed, with standard deviations of less than 0.001 in $L$ and $V$ across all 20 seeds, so with suitable hyperparameter settings the modified PAM algorithm could potentially also serve to identify the largest possible convex hull in a near-global fashion.
Moreover, the sets of points in Figs. 4(c) and 4(a) have similar convex hull volumes (0.459 and 0.456, respectively). However, there is a relatively large cluster of points in the bottom right quadrant and a relative lack of points in the top right quadrant in Fig. 4(c), which could lead to an undesirable overemphasis on C&T excerpts at the expense of F&E excerpts if $\alpha = 1, \beta = 0$. This is further supported by the generalized KS test statistic for the set of points in Fig. 4(c) being 0.292, as opposed to 0.140 in Fig. 4(a). The $\alpha = 0, \beta = 1$ setting also gave a set of excerpts with a noticeably smaller convex hull (volume 0.368) in Fig. 4(b) than in Fig. 4(a), although their distributions were similarly uniform, with generalized KS test statistic values of 0.130 and 0.140 for Figs. 4(b) and 4(a), respectively.
Incidentally, from the kernel density estimate (KDE) plots in Fig. 4, we can observe that all four quadrants generated by the Pleasantness–Eventfulness axes are represented by the set of full-length recordings in the LCS dataset, which serves to validate that the S5 study managed to identify locations that were F&E, C&R, C&T, and B&L. However, there is a slight bias towards soundscapes considered C&T (bottom right quadrant) and a relative lack of soundscapes considered B&L (bottom left quadrant), based on the predictions made by the ARAUS dataset prediction models. These characteristics can also be observed in a similar Valence–Arousal space for the audio stimuli in the IADS-2 and IADS-E datasets [Yang et al., 2018, Figs. 1(a) and 4(a)], which are completely independent of the stimuli in both the LCS and ARAUS datasets, so they could possibly be attributed more to the general nature of auditory perception than to the specific excerption methodology used.
Nonetheless, the automated excerption methodology in Sec. 3 and its implementation in Sec. 4 are not without limitations. In particular, when there are multiple extreme outliers in the data, the criterion of maximizing the convex hull could cause Algorithm 1 to select a disproportionate number of outliers and make the selected excerpts no longer “perceptually diverse.” This could be mitigated by adjusting $\alpha$ and $\beta$ to reduce the emphasis of Eq. (1) on maximizing the convex hull, but doing so may require an additional hyperparameter tuning step to find suitable values for an arbitrary dataset. The use of models trained on the ARAUS dataset, which in turn utilizes responses to soundscapes from the USotW database, also implicitly assumes similarity in the underlying distribution over the Pleasantness–Eventfulness space. While this was not a problem for the LCS dataset, since the ARAUS and USotW datasets similarly aim to contain as broad a range of soundscapes as possible over the Pleasantness–Eventfulness space, caution should be exercised if the same method is applied to a restricted perceptual space, or to a different perceptual space altogether.
6. Conclusion
In conclusion, we presented the Lion City Soundscapes dataset, which consists of a collection of “full-length” soundscape recordings at 62 different locations in Singapore, together with a corresponding 1-min excerpt of the full-length recording made at each location. The excerption methodology was automated and based on the application of Pleasantness and Eventfulness models pre-trained on the ARAUS dataset, together with a modified PAM algorithm minimizing a loss function involving the area of the convex hull and a generalized KS test statistic for equality of distributions.
Since the definitions of the proposed loss function in Eq. (1) and the modified PAM algorithm in Algorithm 1 did not assume any inherent meaning in the space of interest, the excerption methodology could theoretically be extended to any set of audio recordings or objects, with the perceptual space replaced by any latent embedding space of said recordings or objects. Hence, future work could compare the relative performance of the loss function and selection algorithm on different spaces or datasets comprising different modalities. Subjective validation of the categorical accuracy of the 1-min LCS excerpts with respect to the F&E, C&R, C&T, and B&L quadrant labels given by the S5 study for the corresponding locations could also be performed. Lastly, the LCS dataset itself could be used as an additional set of base urban soundscapes to obtain further responses for subjective studies such as those involving the ARAUS dataset.
Acknowledgments
We would like to express our heartfelt gratitude to the management teams at Changi Airport; National Parks Board, Bukit Timah Nature Reserve; National Parks Board, Jurong Lake Gardens; National Parks Board, Singapore Botanic Gardens; National Parks Board, Sungei Buloh Wetland Reserve; Raffles Marina; and the Singapore Zoo for their assistance with this study. This research was supported in part by the Singapore Ministry of National Development and in part by the National Research Foundation, Prime Minister's Office, under the Cities of Tomorrow Research Programme (Grant No. COT-V4-2020-1). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not reflect the views of the National Research Foundation, Singapore, or the Ministry of National Development, Singapore. Part of this work was done while K.N.W. was supported by the International Fellowship from the American Association of University Women (AAUW) and the IEEE Signal Processing Society Scholarship Program.
Author Declarations
Conflict of Interest
The authors have no conflicts of interest to report.
Data Availability
The data that support the findings of this study are openly available in the institutional repository of Nanyang Technological University (DR-NTU) at https://doi.org/10.21979/N9/AVHSBX. These comprise the metadata for the individual locations shown in Fig. 1, the full-length recordings in the LCS dataset, and the actual 1-min excerpts extracted based on the results in Fig. 4(a). The replication code for the results of this study is available at https://github.com/ntudsp/lioncitysoundscapes, and the code used to generate the visualization in Fig. 1 is available at https://github.com/ntudsp/lioncitysoundscapesvisualisation.