Voices arguably occupy a privileged role in auditory processing. Specifically, studies have reported that singing voices are processed faster and more accurately and possess greater salience in musical scenes compared to instrumental sounds. However, the acoustic features underlying this superiority and the generality of these effects remain unclear. This study investigates the impact of frequency micro-modulations (FMM) and the influence of interfering sounds on sound recognition. Thirty young participants, half with musical training, engage in three sound recognition experiments featuring short vocal and instrumental sounds in a go/no-go task. Accuracy and reaction times are measured for sounds from recorded samples and excerpts of popular music. Each sound is presented in separate versions with and without FMM, in isolation or accompanied by a piano. Recognition varies across sound categories, but neither a general vocal superiority nor an effect of FMM emerges. When presented together with interfering sounds, all sounds exhibit degraded recognition. However, whereas /a/ sounds stand out by showing a distinct robustness to interference (i.e., less degradation of recognition), /u/ sounds lack this robustness. Acoustical analysis suggests that the recognition differences can be explained by spectral similarities. Together, these results challenge the notion of a general vocal superiority in auditory perception.
I. INTRODUCTION
The human auditory system possesses remarkable abilities to detect and distinguish sounds, even in complex acoustic scenes where various sounds occur simultaneously. This complexity is exemplified in musical scenes featuring multiple instruments and voices. Despite the simultaneity of sounds, the auditory system excels at identifying and selectively focusing on individual instruments and vocals within a musical scene. This ability is achieved through the process of auditory scene analysis (ASA) (Bregman, 1990), wherein sounds are separated and organized into mental representations of distinct auditory streams. Acoustic features of sounds play a crucial role in this process, providing cues to organize the auditory input into meaningful components. In the context of music, these cues encompass loudness, pitch, and timbre. Timbre, often simply described as “texture” or “tone color” (Helmholtz, 1877), is a multidimensional feature (Siedenburg et al., 2019a) that enables the discrimination of sound sources (e.g., sounds from a singing voice vs a cello), even when other acoustic features match.
Neurophysiological experiments have demonstrated enhanced cortical “voice-specific” responses when isolated vocal sounds are presented alongside non-vocal environmental sounds (Belin et al., 2000; Belin et al., 2002), as well as other musical instrument sounds (Levy et al., 2001; Gunji et al., 2003). Moreover, specific neural populations have been identified that respond selectively to music featuring singing voices but not to instrumental music mixtures (Norman-Haignere et al., 2022). This facilitated processing of vocal sounds extends to multi-instrumental musical scenes, where vocal sounds exhibit a unique salience, attracting listeners' attention in a way that other sounds do not (Bürgel et al., 2021). In a comparative analysis of vocal and instrumental melodies, vocal melodies were shown to be more accurately recognized than instrumental melodies (Weiss et al., 2012), even when the melodies are sung without lyrics (Weiss et al., 2021). We here refer to the faster and more precise recognition of vocal sounds as “vocal superiority.”
Agus and colleagues investigated the ability to recognize instrumental and vocal sounds in a multi-experiment study (Agus et al., 2012). The study used sound excerpts with a duration of 250 ms extracted from a database of isolated musical sounds. Sounds were controlled in level and pitch with the aim of isolating timbre as the distinctive factor. Participants were tasked with recognizing sounds of a target timbre in a sequence of diverse sounds, responding actively when detecting a target. As recognition was anticipated to be highly accurate, response times were measured to provide insights even when accuracy was at ceiling or did not differentiate between sounds. Participants showed near-perfect accuracy and fast reaction times (RTs) for all targets. Nevertheless, vocal superiority emerged via consistently faster RTs and higher accuracy compared to instrumental sounds. Subsequent experiments further underscored recognition advantages for vocal sounds, revealing that vocals are recognizable from shorter sound snippets than other musical instruments (Suied et al., 2014; Isnard et al., 2019).
The specific acoustical features responsible for triggering vocal superiority are still unknown. Agus et al. (2012) argued that the full spectro-temporal envelope of sounds must be involved in vocal recognition: that is, neither solely spectral nor solely temporal features suffice for robust vocal recognition. This argument appears to be inconsistent with other studies on the recognition of vocal sounds from sound snippets below 8 ms duration (Suied et al., 2014), which are so short that reliable temporal cues are likely to be eradicated. Spectral envelope cues of vocal sounds such as formants have also been shown to be highly informative of instrument identity and the natural basis of vowel recognition (Reuter et al., 2018). Using automatic instrument classification on a large set of sound samples, Siedenburg (2021) observed that spectral envelope features alone sufficed to accurately discriminate vocal sounds from other harmonic musical instrument sounds. Thus, whether spectral envelope features alone contribute substantially to perception tasks based on fast recognition of vocal and musical instrument sounds remains to be determined.
Our previous experiments on auditory attention in musical scenes highlighted yet another candidate feature: frequency micro-modulations (FMM) present in singing voices (Bürgel and Siedenburg, 2023). In the context of this study, FMM refers to non-stationary frequency changes in pitched sounds, usually smaller than one semitone (Larrouy-Maestri and Pfordresher, 2018). In singing, FMM arises from imperfect control of intonation caused by vocal-motor control adjustments (Hutchins et al., 2014) that persist even in highly trained singers (e.g., Sundberg et al., 1996; Hutchins and Campbell, 2009), but can also be utilized intentionally as a form of expressive intonation (Sundberg, 2013). Although pitch perception for vocals appears to be less precise than for musical instruments (“vocal generosity effect”) (Hutchins et al., 2012; Sundberg, 2013; Gao and Oxenham, 2022), the expressivity of FMM still provides perceptible additional musical information (Larrouy-Maestri and Pfordresher, 2018) and plays a role in enhancing the prominence of vowel sounds (McAdams, 1989; Marin and McAdams, 1991). Hence, this line of work raises the question of whether FMM, by enhancing the prominence of singing voices in musical scenes, also plays a role in the fast and precise recognition of vocal sounds.
The aim of this study is to critically revisit the phenomenon of vocal superiority in order to highlight various acoustical factors that affect the recognition of vocal and instrumental sounds. We investigate the recognition of vocals within simplified musical scenes, taking the specific vowel type into account, and explore the influence of FMM and other spectral features in a regression model. Furthermore, we investigate whether recognition depends on the audio material used or remains consistent across different stimulus sets and vocal sounds. Overall, this may help to further disentangle the roles of acoustical features and perceptual categorization processes in the perception of voice sounds.
The experimental design emulates that of Agus et al. (2012): in a go/no-go recognition task, participants are presented with both vocal and instrumental sounds and are instructed to respond to one type of sound while ignoring the other. Participants are instructed to respond as quickly as possible when hearing sounds of a target category while ignoring sounds of the non-target category (distractors). We measure response times and recognition accuracy. All sounds are aligned in duration and sound level and controlled in pitch. The targets consist of either instrumental sounds (wind or string instruments) or vocal sounds (sung vowels or singing voices). To investigate the effect of FMM on sound recognition, each sound is presented here in both an unmodified version and a version with FMM eliminated. Throughout the experiment, the target category alternates between blocks to gather category-dependent responses. Additionally, in one block for both vocal and instrument targets, sounds are accompanied by a spatially separated piano accompaniment forming a minor or major triad with the target. This design aims to assess recognition abilities when the target is embedded in a simplified musical scene.
We conduct three experiments, as illustrated in Fig. 1. In experiment 1A, sounds are extracted from a sample database, encompassing sung vowels /a/ and /u/ in both alto and soprano registers as vocal sounds, along with bassoon, trumpet, cello, and violin sounds as instrumental sounds. The results indicate frequent confusion between specific vocal and wind sounds and the absence of an effect of FMM. To further investigate this issue, two additional experiments are conducted. Experiment 1B replicates experiment 1A but excludes wind instruments. In experiment 2, sounds are extracted from a popular music database, featuring female and male singing voices as vocal sounds, alongside string and wind instruments as instrument sounds. This experiment aims to compare professionally manufactured samples of isolated sounds with relatively small FMM against excerpts of naturalistic popular music with relatively large FMM.
Expanding upon previous studies, we hypothesize a discernible effect of vocal superiority, anticipating that vocal sounds are recognized faster and more accurately than instrument sounds. Furthermore, we expect that the human auditory system is specifically sensitive to the FMM occurring in singing voices, yielding pronounced vocal superiority for singing voices with FMM. Notably, our previous experiments (Bürgel and Siedenburg, 2023) demonstrated a high correlation between the frequency range of FMM and sound salience. Therefore, we expect the influence of FMM to be even more pronounced in the experiment that utilizes pop music excerpts. Given the reported robust detection of vocal sounds in complex musical scenes, which suggests a partial immunity to interference from other sounds within such scenes, we speculate that introducing an accompanying piano interferer has a comparatively small impact on the recognition of vocals.
II. GENERAL METHODS
A. Participants
All participants were recruited via calls for participation on the online learning platform of the University of Oldenburg. Separate calls for subjects with and without musical training were posted to ensure a diverse range of musical abilities in our sample. The inclusion criterion for musically trained participants was a minimum of 4 years of musical training on at least one instrument. All participants were required to self-report normal hearing as a prerequisite for participation. Information on the participants' musical abilities was acquired using a subset of the Goldsmiths Musical Sophistication Index (Gold-MSI) (Müllensiefen et al., 2014), consisting of nine questions on music perception abilities and seven questions on musical training. The numbers of participants are listed in the respective experiment sections.
B. Stimuli and task
Stimuli were generated in matlab (MathWorks, Inc., Natick, MA) by using modified excerpts from two distinct databases. For experiment 1, a sample database of orchestral instruments [the Vienna Symphonic Library (VSL)] was utilized (VSL, 2024) (see Table X in the supplementary material). All sounds in the VSL database had a uniform duration but featured variations of attack and decay length (see Fig. S6 in the supplementary material). In experiment 2, excerpts from a popular music multitrack database, MedleyDB, were employed (Bittner et al., 2016). A schematic illustration of the extraction of stimuli is shown in Fig. 2(A).
For the instrument sounds, two wind instruments and two string instruments were selected as target categories: bassoon, trumpet, cello, and violin. For vocal sounds, the vowels /a/ and /u/ in the registers alto and soprano were selected. Each instrument or vowel was selected in the mezzo forte dynamic level and in a range of 12 semitones, ranging from A3 to G#4 (one octave), resulting in 96 sounds. An alternative version of each sound without FMM was created using the frequency modulation tool in the pitch and time correction software melodyne (melodyne Version 5; Celemony Software, Munich, Germany). This process involved analyzing the FMM in melodyne using the “pitch modulation” function and subsequently removing FMM by setting all modulations to zero. A quantification of the FMM reduction is included in the supplementary material (Fig. S8). Its perceptible impact on the sound ranged from being nearly inaudible without direct comparison to sounds with FMM, particularly for some sounds in the orchestral database, to being more noticeable for the pop music extracts. For transparency, we uploaded example sound files on our website (Bürgel and Siedenburg, 2024). All sounds were truncated to a duration of 250 ms, starting 5 ms before the sound level reached a threshold of –20 dB relative to its maximum level. The first 5 ms were used to create a smoothed onset, while the last 5 ms were used as a fading offset, utilizing a 5 ms logarithmic ramp for both. Signals were converted to mono by summing both channels, and sound level was normalized with respect to the root mean square (RMS). In total, 192 distinct target sounds were generated this way, comprising 12 sounds each with and without FMM for each of the four instruments and four vocal sounds. For conditions in which the sounds were accompanied by a piano interferer, additional piano samples (Bösendorfer grand piano; Bösendorfer, Vienna, Austria) were used, spanning the pitch range from A2 to A5, to encompass all possible pitch combinations required to create a triad with the target sound (cf. Siedenburg et al., 2020).
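For illustration, the following MATLAB sketch outlines our reading of these conditioning steps (onset detection at –20 dB relative to the maximum, truncation to 250 ms, 5 ms logarithmic ramps, RMS normalization); the variable names, the instantaneous-amplitude onset criterion, and the RMS reference value are assumptions for this sketch, not the authors' actual processing code.

```matlab
% Sketch of the stimulus conditioning described above (assumed implementation).
[x, fs] = audioread(sampleWav);                      % sampleWav: placeholder file name
x = sum(x, 2);                                       % convert to mono by summing channels
env = abs(x) / max(abs(x));                          % simplified instantaneous level
onset = find(env >= 10^(-20/20), 1);                 % first sample reaching -20 dB re maximum
startIdx = max(1, onset - round(0.005 * fs));        % start 5 ms before that point
x = x(startIdx : startIdx + round(0.250 * fs) - 1);  % truncate to 250 ms (assumes a long enough sample)
nRamp = round(0.005 * fs);                           % 5 ms logarithmic on/off ramps
ramp = logspace(-3, 0, nRamp)';                      % roughly -60 dB to 0 dB
x(1:nRamp) = x(1:nRamp) .* ramp;
x(end-nRamp+1:end) = x(end-nRamp+1:end) .* flipud(ramp);
x = x / sqrt(mean(x.^2)) * 0.05;                     % RMS normalization to an arbitrary reference
```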
C. Procedure
The experiments were approved by the ethics committee of the University of Oldenburg. All experiments shared identical designs but varied in duration and the chosen stimulus set. Figure 2(B) provides a schematic overview of the procedure. Experiment 1A utilized all 192 stimuli extracted from the VSL database. In experiment 1B, the wind instruments were replaced with string instruments, resulting in a balanced set of instrumental and vowel sounds and a total of 96 stimuli. Experiment 2 employed 96 stimuli from the popular music database.
Each experiment started with a briefing session, during which participants were instructed about the experiment's structure, consisting of distinct blocks featuring either isolated or accompanied sounds. Additionally, participants were informed that they would need to react to and attend exclusively to vocal sounds or sounds belonging to string or wind instruments. Subsequently, participants received specific instructions for the accompanied blocks. They were informed that an interfering piano sound would play in one ear, which they were instructed to ignore, while focusing their attention on either vocal or string and wind sounds presented in the other ear. This briefing was followed by a training section, during which participants freely listened to instrumental and vocal sounds, with or without a piano interferer. A description and icon for the presented sound were provided before and during the sound presentation. Participants had to listen to each sound category in randomized order, both with and without the piano interferer, before they could proceed to the main experiment. The main experiment comprised five blocks, each containing all stimuli in a randomized order. The first block always involved an “all-go” detection task, where participants were instructed to respond to all stimuli as fast as possible, irrespective of timbre. Subsequent blocks were “go/no-go” tasks, where instrumental or vocal sounds acted as targets and the other group had to be ignored. Sounds were either presented in isolation or accompanied by a piano dyad, forming a major or minor triad with the target sound. To attenuate effects of energetic masking, the piano signal was set to a level of –5 dB relative to the target and presented dichotically with respect to the target signal. The dichotic separation of piano and target (left or right channel) as well as the key of the triad (major or minor) and the tonal position of the target in the triad (root, third, or fifth) were randomly assigned but balanced across all stimuli in one block. The numbers of target and distractor stimuli were also balanced. All sound levels were normalized. Instructions, using the same icons as in the training phase, were presented on a touch screen before each block and remained visible during each block. Participants manually continued the experiment by pressing a button on the touch screen before a block started. Stimulus presentation started after a 2000 ms pause. Stimuli were presented in a continuous stream with a 2000 ms response window and a randomized 1000 to 2000 ms inter-trial interval to prevent a rhythmic presentation. All stimuli were presented diotically, except for the dichotic condition with a piano interferer. The experiment concluded with a questionnaire gathering demographic data and a subset of 16 questions from the Gold-MSI to assess participants' musical ability.
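As an illustration of how a single accompanied trial could be assembled under these constraints, the MATLAB sketch below mixes a target with a piano dyad at –5 dB relative to the target and assigns the two signals to opposite channels; the file names and the assumption of mono, equal-length, RMS-normalized inputs are placeholders rather than the actual experiment code.

```matlab
% Sketch of one accompanied trial (assumed implementation).
[t, fs] = audioread(targetWav);                            % 250 ms target sound (mono)
p = audioread(pianoWav1) + audioread(pianoWav2);           % piano dyad completing the triad
p = p / sqrt(mean(p.^2)) * sqrt(mean(t.^2)) * 10^(-5/20);  % piano at -5 dB re target
if rand < 0.5                                              % side assignment (balanced across a block)
    trial = [t, p];                                        % target left, piano right
else
    trial = [p, t];                                        % target right, piano left
end
sound(trial, fs);                                          % dichotic playback
```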
D. Apparatus
The experiment was conducted in a double-walled sound booth. Participants sat in a comfortable chair and interacted with the experiment using a touch screen attached to a movable arm in front of them. Stimuli were processed through an RME Fireface UCX soundcard (RME Audio, Haimhausen, Germany) at a 44.1 kHz sampling rate and presented on Sennheiser HD 650 headphones (Sennheiser, Wedemark, Germany). Participants' responses were captured using a custom-made response box. Pressing a button on the box triggered a short signal burst, which was recorded simultaneously with the stimulus presentation at the audio sampling rate, removing any potential time lag between stimulus presentation and response recording. Stimuli were presented at an average level of 70 dB sound pressure level (SPL, A-weighted), as measured with a Brüel & Kjaer type 2250 Light sound-level meter (Brüel & Kjaer, Virum, Denmark) and a Brüel & Kjaer type 4153 artificial ear to which the headphones were coupled.
E. Behavioral analysis
Linear mixed-effect (LME) models (West et al., 2014) were utilized for statistical analyses. All mixed-effect analyses were conducted in matlab using the fitlme function in the statistics and machine learning toolbox (statistics and machine learning toolbox Release 8.7; MathWorks, Inc., Natick, MA). The models incorporated random intercepts for each participant. IES and musical sophistication entered the models as numerical variables, whereas the presence of FMM and sound category (vocal or instrumental sound) entered as categorical predictors. All binary categorical predictors were sum-coded. To present main effects and interactions succinctly, results are displayed in the form of an analysis of variance (ANOVA) table, with fixed effects summarized by test statistics (F) and probabilities (p). These values were derived from the LME models using matlab's anova function. For a more detailed analysis, individual fixed-effect coefficients are also reported with test statistics (t) and probabilities (p). For a comprehensive display of the behavioral results, models, and statistical evaluations, please refer to the supplementary material.
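For concreteness, a model of the kind described here could be specified in MATLAB as sketched below; the exact formula, including the interaction term and the variable names Perception and Training, reflects our reading of the text rather than the authors' analysis script.

```matlab
% Illustrative LME with random intercepts per participant and sum-coded
% (effects-coded) categorical predictors; the formula is an assumption.
tbl.Category = categorical(tbl.Category);            % vocal vs instrumental
tbl.FMM = categorical(tbl.FMM);                       % with vs without FMM
tbl.Accompaniment = categorical(tbl.Accompaniment);   % isolated vs accompanied
mdl = fitlme(tbl, ...
    'IES ~ Category*Accompaniment + FMM + Perception + Training + (1|Participant)', ...
    'DummyVarCoding', 'effects');                     % sum coding of categorical predictors
anova(mdl)                                            % F-tests for the fixed effects
```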
III. EXPERIMENT 1A
The aim of experiment 1A was to investigate the influence of target category, FMM, and accompaniment on the recognition of sounds. Two vowel sound categories were presented, the vowels /a/ and /u/ in the alto and soprano registers, along with two instrumental sound categories, strings (cello and violin) and winds (bassoon and trumpet). Each category comprised 12 sounds with notes spanning the same one-octave range. Each sound was presented in two versions: one with naturally occurring FMM and another in which FMM was eliminated. Additionally, each sound was presented both in isolation and accompanied by a piano interferer.
A. Participants
A total of 32 participants took part in experiment 1A. Two participants were excluded from the analysis because they achieved notably lower accuracies in the isolated sound recognition (60% and 63%) than the average accuracy of the subjects (88%; minimum = 77%; maximum = 100%). Consequently, 30 participants [mean age = 24.1 years, standard deviation (std) = 2.6] were included in the analysis. This group comprised 15 self-described musically trained participants and 15 participants with no or less than 4 years of musical training. There were overlaps in the musical sophistication scores of both groups: Non-musicians had mean scores of 32.9 (minimum, 9; maximum, 45) on the nine questions regarding musical perception and 12.8 (minimum, 7; maximum, 34) on the seven questions regarding musical training. In contrast, musicians had mean scores of 45.7 (minimum, 38; maximum, 52) on questions related to musical perception and 32.1 (minimum, 24; maximum, 41) on questions related to musical training.
B. Results and discussion
1. Results
In the initial “all-go” block, participants performed a sound detection task in which they were instructed to respond whenever they heard a sound, regardless of its timbre. The task was accomplished with perfect detection accuracy, and the average detection time was 314 ms, resulting in an IES of the same value (314 ms), with no significant effect of sound category observed in the statistical evaluation (F = 0.632, p = 0.729). The distinctions between sound categories became more pronounced during the go/no-go recognition task, for which IESs are displayed in Fig. 3(A). Averages across sounds with and without FMM are shown because there were virtually no effects of FMM (see the statistics below).
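Assuming the standard definition of the inverse efficiency score, in which the mean correct response time is divided by the proportion of correct responses, perfect detection accuracy leaves the IES equal to the mean RT:

```latex
\mathrm{IES} = \frac{\overline{RT}_{\mathrm{correct}}}{1 - ER}
```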
Recognition was generally slower and less precise compared to the detection task. From a descriptive perspective, vocal sounds in the isolated presentation condition slightly outperformed instrumental sounds, yielding an IES of 667 ms compared to 692 ms for instruments. However, the categories clustered into two groups: one with relatively fast and precise recognition and therefore low IES, containing /a/, strings, and trumpet, and another with slower and less precise recognition, containing /u/ and bassoon. Among vowels, /a/ yielded a score of 582 ms, compared to 752 ms for /u/. Among instruments, recognition of strings and trumpet was closely similar, with scores of 537 ms and 553 ms, respectively, whereas the bassoon stood out with a score of 820 ms. When embedded in a musical scene, recognition of both vowels and instruments worsened, albeit to a different degree across categories. When comparing recognition between the isolated and accompanied presentations, voice sounds exhibited an increase of 353 ms. However, the differences seen in the isolated presentation were even more pronounced among voice sounds: /a/ demonstrated robustness to the presence of the accompaniment, displaying a score increase of only 64 ms, considerably smaller than that of /u/, which showed an increase in IES of 640 ms as a result of both slower RTs and an ER close to chance level. In contrast, instrument sounds exhibited an increase in IES of 413 ms. Consistent with the isolated presentation, the bassoon continued to stand out among instruments, differing from all other instruments and showing the largest score increase of 764 ms, with not only a larger increase in RT but also an ER close to chance level. The strings and trumpet showed a similar deterioration, with an increase in IES of 296 ms.
In terms of the statistical evaluation, the coefficient of determination (R2) for the correlation between ER and RT across all stimuli was 0.82. Contrary to our assumptions, the hypothesized vocal superiority was absent in the isolated presentation condition, as sound recognition showed no difference between vocal and instrumental sounds (F = 0.014, p = 0.91). Additionally, the presence of FMM had minimal impact, yielding deviations of IES not larger than ±22 ms (F = 0.060, p = 0.806). Notably, no clear trend emerged for either vowels or instruments. Musical sophistication also showed no impact on the recognition of sounds, neither for musical perception scores (F = 0.326, p = 0.568), nor for musical training scores (F = 1.487, p = 0.222), nor as a categorical factor differentiating between self-identified musicians and non-musicians (F = 0.55, p = 0.477). The presentation side of the target sound in the interfered presentation also showed no effect (F = 1.153, p = 0.288). However, the presence of accompaniment affected recognition accuracy (F = 9.311, p = 0.002). In the isolated presentation, the average IES was 680 ms, which increased in the presentation with accompaniment to 1120 ms. While no general difference between vocals and instruments emerged in the comparison of presentations with and without accompaniment (F = 0.040, p = 0.95), considerable effects were observable between /a/ sounds and all other categories (/u/, t = 47.450, p < 0.001; bassoon, t = 71.232, p < 0.001; strings and trumpet, t = 23.471, p = 0.021), highlighting a robustness to interference for /a/ sounds that was unmatched by the instruments or the vowel /u/.
2. Discussion
Experiment 1A reinforces the fundamental fact that timbre cues are sufficient to guide the recognition of sound sources. In line with Agus and colleagues and further studies (Agus et al., 2012; Moskowitz et al., 2020), no differences between sound categories emerged in the simple detection task, suggesting that sound detection is equally fast and accurate for all sounds with similar onsets. When participants were asked to recognize sound categories in a stream of sounds, both recognition times and ERs increased compared to the simple detection task. Another similarity with previous work is that recognition does not seem to be driven by a trade-off in which faster response times lead to lower accuracy. Rather, some sounds were recognized both quickly and accurately, whereas others were recognized slowly and inaccurately.
Contrary to previous work (Agus et al., 2012) and our hypothesis, a vocal superiority effect did not emerge for the isolated presentation of sounds. Instead, most of the instruments were in fact recognized slightly faster and more accurately, with lower IESs. Furthermore, differences between the vowels became apparent and revealed that the recognition of /a/ sounds was similar to the recognition of the instruments, whereas the recognition of /u/ yielded higher IESs as a result of being both slower and less accurate. Curiously, the bassoon showed a pattern of performance similar to that of the /u/ sounds. A possible explanation for this could lie in confusion between /u/ and the bassoon. This confusion seems to stem from spectral similarity, possibly resulting from shared formant frequencies between the sounds (Reuter et al., 2018). The confusion presents characteristics akin to an informational masking effect (Tanner, 1958; Eipert et al., 2019), wherein the shared acoustical characteristics interfere with the perceptual processing of both sounds, leading to confusion and difficulty in distinguishing between the two. Consequently, this uncertainty in the recognition of /u/ and bassoon sounds may have contributed to enhanced certainty for other sounds and facilitated better recognition.
Nevertheless, this assumption does not explain the lack of vocal superiority for /a/, which is known to be spectrally distinct from /u/. Among vowels, /u/ belongs to the group with the lowest statistical formant frequencies, as opposed to /a/, which is in the highest group (e.g., Maurer, 2016, p. 35). An argument could be made that the ambiguity of the bassoon led to a general uncertainty in the recognition of vowels, which manifested in both vowels but especially in /u/. Supporting this assumption is the consideration that, in the study by Agus and colleagues, neither /u/ sounds nor bassoon sounds were competing as target sounds, and it could be speculated that the omission of these sounds fostered a vocal superiority. Consistent with this perspective, the sounds in Agus et al.'s experiment were also recognized with much higher accuracy, achieving ERs of less than 2%. This leaves room for speculation as to whether there is a general RT advantage only when recognition accuracy is at ceiling level. However, it is important to note that multiple studies comparing the recognition of speech and instrumental sounds have also demonstrated an advantage for speech sounds (Murray et al., 2006; Moskowitz et al., 2020) and even recognition advantages for singing voices (Suied et al., 2014).
In parallel, participants' familiarity with the sound categories may have influenced recognition (Siedenburg and McAdams, 2017). Although a training phase was provided to familiarize participants with the sounds, this exposure may not have been extensive enough to achieve a sufficiently consolidated mental representation of the underlying sound categories, resulting in poor classification for certain categories.
Furthermore, no effect of FMM was observed. In our previous experiment, we utilized excerpts featuring tone transitions, which are known to exhibit particularly perceptible FMM (Saitou et al., 2005; Hutchins and Campbell, 2009; Larrouy-Maestri and Pfordresher, 2018). This implies that the degree of FMM here may have been too small to enrich the sounds with perceptible information, rendering sounds with and without FMM perceptually similar. Alternatively, a more straightforward explanation could be that FMM is not a crucial feature for recognition but rather exploited for pattern matching in complex musical scenes, where it particularly helps the vocals to stand out from other sounds.
Notably, a unique robustness to interference emerged for the /a/ sounds in the comparison between the presentations with and without accompaniment. Although recognition worsened for all sounds, the vowel /a/ stood out by showing a considerably smaller increase in IES under the condition with accompaniment compared to /u/ and all instrument sounds. This robustness was a result of a distinctly faster recognition of /a/ sounds, with almost no deterioration in accuracy between the isolated and accompanied presentations, and it was visible in both musically untrained and trained participants. However, it should be noted that, due to the high ER for /u/, a clear interpretation of the results for this sound category is difficult. If the /u/ sounds were excluded from the results, it could be assumed that the robustness to interference of the /a/ sounds demonstrates another facet of vocal superiority that shows parallels to the robust detection of voices in musical scenes (vocal salience) (Bürgel et al., 2021). However, whether this represents an actual vocal-specific recognition advantage, or whether it is a result of spectral similarities favorable to /a/ sounds and unfavorable to the /u/ sounds, remains unanswered in our experiment.
Taken together, these findings prompt the question of the extent to which confusion impacted the recognition of vowel sounds and whether distinctiveness is the primary factor driving recognition scores. To investigate this, experiment 1B was conducted with the aim of disentangling this confusion by reducing the stimulus set to vowel and string sounds.
IV. EXPERIMENT 1B
To address questions regarding potential confusion between vowels and the bassoon, experiment 1B used the same design as experiment 1A but omitted wind instruments from the stimulus set. To balance the number of vocal and instrumental sounds, the instrumental samples were presented twice. Additionally, each sample was presented both in isolation and accompanied by a piano dyad. All stimuli were presented in their versions with FMM.
A. Participants
A total of 30 participants (mean age = 24.1 years, std = 2.6) took part in experiment 1B. The experiment was carried out in the same session as experiment 2, and consequently, the same participants took part in both experiments. Participants were randomly assigned to start with either experiment 1B or experiment 2, with order counterbalanced across participants. No participants were excluded from the analysis. This resulted in 15 self-described musically trained participants and 15 participants with no or less than 4 years of musical training. Overlaps between both groups were observed for the musical sophistication scores, with non-musicians achieving mean scores of 38.8 (minimum, 23; maximum, 51) on the nine questions regarding musical perception and 12.4 (minimum, 7; maximum, 22) on the seven questions regarding musical training. In contrast, musicians had mean scores of 49.7 (minimum, 36; maximum, 58) on questions related to musical perception and 23.2 (minimum, 10; maximum, 42) on questions related to musical training.
B. Results and discussion
1. Results
The performance in the recognition task is depicted in Fig. 3(B). In the isolated presentation, the average IES was 598 ms. This increased in the accompanied presentation, resulting in an IES of 878 ms. In the isolated presentation, vowels and strings were closely aligned, with vowels exhibiting an IES of 603 ms and strings an IES of 588 ms. Both vowel sounds /a/ and /u/ were recognized to a similar degree, with /a/ yielding an IES of 594 ms and /u/ an IES of 611 ms. The inclusion of musical accompaniment had a detrimental impact on the recognition of /u/ and string sounds. Vowel /a/ showed the smallest degradation among sounds, with an increase of 43 ms. In contrast, vowel /u/ exhibited the largest increase of 599 ms, with an accuracy close to chance level. Strings exhibited an increase of 264 ms.
ER and RT shared substantial variance, with an R2 value of 0.62. Consistent with the findings in experiment 1A, no differences in recognition emerged between vocal and instrumental sounds (F = 1.505, p = 0.221). Recognition performance deteriorated in the presence of accompaniment (F = 11.528, p < 0.001). However, in line with the robustness to interference observed in experiment 1A, this effect was less pronounced for /a/ sounds, as underlined by contrasts between /a/ and /u/ (t = 6.842, p < 0.001), as well as between /a/ and strings (t = 2.745, p = 0.006). Musical sophistication did not appear to affect recognition in a substantial way, as no main effects for either musical perception (F = 0.443, p = 0.506) or musical training (F = 0.031, p = 0.859) were evident. Also in line with experiment 1A, the presentation side of the target sound in the interfered presentation showed no effect (F = 0.024, p = 0.876).
2. Discussion
Experiment 1B aimed to investigate vocal sound recognition in the absence of wind instruments, which were frequently confused with /u/ sounds in experiment 1A. Contrary to expectations, no recognition advantage emerged for vocals in the isolated presentation, even though only string and vocal sounds were presented, closely mirroring the conditions of the shared distractor experiment by Agus and colleagues (experiment 2, voice-processing advantage; Agus et al., 2012). Unlike experiment 1A, no differences were observed between the vowels /a/ and /u/ in the isolated presentation. This supports our assumption that the relatively poor recognition of /u/ in experiment 1A was due to confusion with the bassoon, indicating that vocal sound identity itself does not guarantee superior recognition and does not render sounds immune to confusion with non-vocal sounds. This assumption is further supported by the contrasting behavior of the vowels when embedded in a musical scene. As observed in experiment 1A, /a/ sounds exhibited a unique robustness to interference, with a smaller decrease in recognition performance distinct from /u/ and instrumental sounds. However, even without wind instruments, /u/ sounds yielded the worst recognition scores among all sounds.
Another interesting observation arises in the isolated presentation: an equalization of performance between strings and voices. In experiment 1A, the strings, especially the cello, exhibited faster recognition compared to vocal sounds. However, this difference is no longer evident. It remains unclear whether this change is due to the absence of the bassoon, the reduction of target categories, or a specific behavior of the cello. Nevertheless, it underscores the necessity of contextualizing results within the stimulus set and emphasizes the importance of testing with diverse stimulus sets.
Furthermore, recognition accuracy for /u/ dropped close to the chance level under the condition with accompaniment, suggesting that the piano dyad strongly interfered with the /u/ sounds. However, the stimuli were deliberately designed to hinder complete energetic masking by reducing the piano's sound level and spatially separating the target and interferer sounds in a dichotic presentation. An argument could be made that informational masking occurred, with both sounds remaining audible but listeners' attention shifting towards the masking sound (Pollack, 1975; Kidd et al., 2008), potentially due to the uncertainty associated with the random assignment of target and accompaniment to the left and right channels. However, this would not explain why certain sound categories such as the /a/ sounds are very robust to the presence of the accompaniment. The significantly better recognition of /u/ in the isolated presentation in experiment 1B, together with the poor recognition of /u/ in experiment 1A, suggests that a combination of confusion with the bassoon and interference by the piano, both of which can be attributed to informational masking, was most likely the source of the observed effect.
In summary, the results underline the findings from experiment 1A and support the assumption that vocal sounds in musical scenes do not inherently evoke enhanced recognition. Instead, the recognition of vocal sounds appears to be highly influenced by the vocal sound quality itself (here, the vowel type). Thus, it would be intriguing to explore whether the recognition of other vowels or vocal sounds would resemble that of the /a/ or /u/ sounds. To explore the generality of the observed effects and further assess the effect of FMM, we conducted experiment 2, in which stimuli were generated by extracting snippets of vocals or instruments from a pop music database.
V. EXPERIMENT 2
Sounds were extracted from songs in a multitrack popular music database used in previous work, which demonstrated a correlation between FMM range and vocal salience (Bürgel and Siedenburg, 2023). The excerpts were taken from the onset of notes and thus contained pitch transitions encompassing more pronounced FMM compared to experiment 1.
A. Participants
The same participants as in experiment 1B took part.
B. Stimuli and procedure
The popular multitrack music database comprises 127 songs across a variety of popular music genres, each with individual audio files for instrument and vocal tracks. Given the continuous nature of the tracks and the presence of overlaid audio effects, potential sound candidates had to be manually selected to resemble the clean and unmodified samples in the VSL database. In line with the sound categories selected for experiment 1, string and wind instruments were chosen as instrumental targets. For vocal sounds, rather than different voice registers or vowels, female and male vocal tracks were selected. To control pitch, the chosen tracks were analyzed in melodyne, and 12 different excerpts for each target sound were extracted, covering the same one-octave pitch range (A3 to G#4) as the VSL samples. This process resulted in vocalizations that could be categorized as nine /a/ sounds, four /u/ sounds, one /o/ sound, one /e/ sound, one hissed sound, and eight mixed sounds with multiple vowels. FMM manipulation, truncation, ramping, and normalization were performed in the same way as for the sounds from experiment 1. In total, 96 different stimuli were extracted this way, comprising 12 sounds with and without FMM for each of the two instrumental and two vocal categories. Additionally, each target sound was both presented in isolation and embedded in a musical scene with a piano dyad accompanying the target sound. The procedure was identical to experiments 1A and 1B.
C. Results and discussion
1. Results
In the detection task, all sounds were detected perfectly, with only minor differences visible for RT. Vocals yielded an IES of 365 ms, and instruments had an IES of 350 ms, with no significant effects present in the LME (F = 0.817, p = 0.44). Additionally, no difference was observed between participants who started with experiment 1B and those who started with experiment 2 (F = 0.603, p = 0.616). The performance for the recognition task in experiment 2 is displayed as averages across sounds with and without FMM in Fig. 4(A).
Overall, recognition was generally slower and less precise compared to the detection task. In the isolated presentation, the average IES was 621 ms. This increased in the presentation with accompaniment, resulting in an IES of 800 ms. The absence of FMM showed neither a positive nor a negative trend, leading to average deviations of IES not larger than ±24 ms. Vocal and instrumental sounds were closely aligned in the isolated presentation: female vocals exhibited an IES of 624 ms, male vocals an IES of 589 ms, strings an IES of 654 ms, and winds an IES of 594 ms. When presented with accompaniment, recognition worsened for both vocals and instruments. However, this effect was less pronounced for the vocal sounds, which exhibited an increase of 53 ms compared to an increase of 214 ms for instruments, demonstrating a robustness to interference, as seen for the vowel /a/ in both previous experiments. The increase differed slightly between female and male vocals, with no effects in the LME, amounting to 37 ms for female vocals and 69 ms for male vocals. In contrast, strings yielded an increase of 134 ms and winds an increase of 294 ms.
Discriminating between the vocalizations within the sounds revealed notable differences in recognition, as depicted in Fig. 4(B). The increase in IES between the isolated and accompanied presentations ranged from –18 ms to 64 ms, with the one /o/ sound standing out with an increase of 130 ms. Notably, the eight sounds with /a/ had an average IES of 531 ms in the isolated presentation, with an increase of 46 ms in the accompanied presentation. In contrast, the four /u/ sounds had an IES of 646 ms in isolation, with an increase of 33 ms. Thus, /u/ sounds yielded higher IESs, aligning with observations in experiment 1. Contrary to those observations, no pronounced differences in the deterioration were observed between /a/ and /u/ sounds. However, it is important to note that this analysis is not balanced, as the numbers of stimuli within each vocalization differed greatly and the observed differences may be an artifact of individual stimuli.
ER and RT shared substantial variance, with an R2 value of 0.86. Consistent with experiment 1, neither an effect of FMM (F = 0.776, p = 0.378), nor an effect of musical sophistication in terms of musical perception (F = 0.939, p = 0.332) or training (F = 0.012, p = 0.912), nor an effect of target presentation side (F = 0.169, p = 0.682) emerged. However, an overall recognition advantage for vocal sounds was present (F = 20.449, p < 0.001), as well as an effect of accompaniment (F = 15.813, p < 0.001) and an interaction between the two (F = 20.252, p < 0.001). Differences between vocal and instrumental sounds were negligible in the isolated presentation but pronounced in the accompanied presentation, with effects between vocals and strings (t = 3.784, p < 0.001), as well as vocals and winds (t = 9.313, p < 0.001), highlighting a specific robustness to interference for the vocals.
As a synopsis of the experiments, Fig. 5 displays the differences between the isolated and accompanied presentations of all three experiments.
2. Discussion
Experiment 2 investigated the recognition of vocal and instrument sounds extracted from a pop music database, focusing on the effect of FMM and assessing the generalizability of the recognition advantages found in the previous experiments. The results mirrored those of experiment 1. Notably, the anticipated superior recognition of vocal sounds was present only in the accompanied presentation, not in the isolated presentation. Furthermore, the presence of FMM showed no impact on sound recognition, supporting the conclusion that cues related to FMM are not exploited or do not affect the recognition of musical sounds.
Additionally, the previously observed robustness to interference for /a/ sounds persisted between the isolated and accompanied presentations for the vocal sounds of experiment 2. Moreover, an overall less pronounced deterioration of recognition in the accompanied presentation was observed for all sound categories, even though the sounds used within a category were less homogeneous than in experiment 1. This inhomogeneity stemmed from extracting instrumental and vocal sounds from various songs with different instruments or singers, lacking strict control for intonation dynamics and articulation. It is likely that the less controlled pop music excerpts could stand out more easily from the piano accompaniment than the orchestral samples, which were aligned with the piano in intonation dynamics and articulation. Yet, despite the inhomogeneity of sounds, a clear robustness to interference was evident in the experiment, indicating that diverse vocal sounds are capable of exhibiting robust recognition. Interestingly, despite the distinctions observed between /a/ and /u/ in experiment 1, this robustness was evident for most vocalizations, except for three sounds (/o/, /e/, and the hissed sound), which showed either no robustness or slightly worse performance in the isolated presentation. This inconsistency might be a consequence of the sounds' inhomogeneity, allowing them to provide more distinct cues compared to the orchestral samples.
VI. ACOUSTICAL ANALYSIS
To explore relations between acoustic features of the sounds and recognition performance, linear regression analysis was employed to predict human recognition scores using the spectral similarity between sounds and their FMM range. Spectral information of the sound signals was obtained through cepstral coefficients derived from the sounds' energy in filter bands with equivalent rectangular bandwidth (ERBCC). This representation served two purposes: first, to compare the spectral attributes of the competing target sounds that participants were asked to recognize and, second, to assess the similarity between the target sounds and the piano interferers. The ERBCC extraction was analogous to the computation of Mel-frequency cepstral coefficients (MFCC), which are known for their effectiveness in computational sound classification (Monir et al., 2022), but used an equivalent rectangular bandwidth (ERB) filter bank (instead of the Mel filter bank) to better align with data on human frequency selectivity (Glasberg and Moore, 1990). The extraction involved computing the long-term spectrum over the whole 250 ms duration, filtering the spectral energy of the target sounds into 64 ERB bands in the frequency range between 20 Hz and 16 000 Hz, taking the logarithm, and deriving the first 13 cepstral coefficients using a discrete cosine transformation. In a final step, the first coefficient was discarded, as it contains no information about the spectral shape but only a constant level offset, and the temporal dimension was collapsed by averaging each coefficient across time windows.
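The MATLAB sketch below illustrates one way to realize this ERBCC computation (long-term power spectrum, 64 triangular bands spaced on the ERB-rate scale between 20 Hz and 16 kHz, log compression, discrete cosine transform, first coefficient discarded); the triangular band shapes, the windowing, and the function name are assumptions of this sketch, not the authors' implementation.

```matlab
% Sketch of the ERBCC feature extraction (assumed implementation).
function c = erbccSketch(x, fs)
    nBands = 64; nCoeff = 13; fLo = 20; fHi = 16000;
    X = abs(fft(x(:) .* hann(numel(x)))).^2;          % long-term power spectrum
    f = (0:numel(X)-1)' * fs / numel(X);              % frequency axis in Hz
    X = X(f <= fs/2); f = f(f <= fs/2);               % keep positive frequencies only
    hz2erb = @(hz) 21.4 * log10(1 + 0.00437 * hz);    % ERB-rate scale (Glasberg and Moore, 1990)
    erb2hz = @(e) (10.^(e / 21.4) - 1) / 0.00437;
    edges = erb2hz(linspace(hz2erb(fLo), hz2erb(fHi), nBands + 2));
    E = zeros(nBands, 1);
    for b = 1:nBands                                  % triangular weighting per ERB band
        lo = edges(b); ce = edges(b+1); hi = edges(b+2);
        w = max(0, min((f - lo) / (ce - lo), (hi - f) / (hi - ce)));
        E(b) = w' * X;                                % band energy
    end
    c = dct(log(E + eps));                            % cepstral coefficients via DCT
    c = c(2:nCoeff);                                  % discard first (level-offset) coefficient
end
```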
In order to assess the similarity between vocal and instrumental sounds, a principal component analysis (PCA) was performed on the ERBCC data of all target signals of each target category, encompassing all 12 notes, together with the same 12 notes of the piano interferer (A3 to G#4). The first two principal components (PC1 and PC2), which explained the most variation in the dataset, were used to create a two-dimensional component space. The proximity of sounds in the space indicates their spectral similarity, with sounds closer together being more similar. A measure of spectral distinctiveness was derived by calculating the Euclidean distance between a target sound and the center of the space, with the rationale that more spectrally distinct sounds would occupy regions further separated from the center of the space. To represent the similarity to the piano interferer, the distance between the target and the spatial centroid of the piano sounds was computed. Additionally, to examine potential correlations between recognition accuracies and acoustic features, a confusion matrix was generated using the ERBCC. This involved determining the similarity between sounds through a sound-by-sound correlation analysis. The resulting confusion matrix can be found in the supplementary material (Fig. S7).
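Assuming a matrix C that holds one row of time-averaged ERBCCs per sound and a logical vector isPiano marking the piano interferer samples (both placeholders), the two distance metrics could be obtained as follows:

```matlab
% Sketch of the component-space distance metrics (assumed implementation).
[~, score] = pca(C);                                  % PCA on the ERBCC data
S = score(:, 1:2);                                    % keep PC1 and PC2
centerAll = mean(S, 1);                               % center of the component space
pianoCentroid = mean(S(isPiano, :), 1);               % centroid of the piano interferer sounds
distCenter = vecnorm(S - centerAll, 2, 2);            % spectral distinctiveness of each sound
distPiano = vecnorm(S - pianoCentroid, 2, 2);         % similarity to the piano interferer
```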
To gather information regarding the FMM intensity, the FMM range, defined as the difference between the highest and lowest fundamental frequency (f0) within each sound, was analyzed. To do so, the fundamental frequency was extracted using the matlab function pitch (audio toolbox Release 3.7) in 10 ms sliding time windows over the duration of the sound. Additional artifact suppression was implemented to counteract irregular fluctuations by applying a threshold on the tonality of each time window (harmonic ratio), as provided in the pitch function, excluding windows with a harmonic ratio below 75%. Additionally, a one-octave frequency threshold around the sound's median f0 was applied to each window to eliminate erroneous leaps and octave errors in the pitch estimation. As a final step, the f0 values were transformed to a scale with a resolution of one cent, and the distance between the largest and smallest f0 was computed.
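A possible MATLAB implementation of this FMM-range estimate is sketched below; the use of the Audio Toolbox functions pitch and harmonicRatio, the restricted pitch search range, and the interpretation of the one-octave threshold as ±half an octave around the median f0 are assumptions of this sketch.

```matlab
% Sketch of the FMM-range estimate in cents (assumed implementation).
win = round(0.010 * fs);                              % 10 ms analysis windows
f0 = pitch(x, fs, 'WindowLength', win, 'OverlapLength', 0, ...
    'Range', [200, 800]);                             % search range matched to the A3-G#4 stimuli
hr = harmonicRatio(x, fs, 'Window', hamming(win, 'periodic'), 'OverlapLength', 0);
n = min(numel(f0), numel(hr));                        % align frame counts
f0 = f0(1:n); hr = hr(1:n);
f0 = f0(hr >= 0.75);                                  % keep tonal frames only (harmonic ratio >= 75%)
f0 = f0(abs(log2(f0 / median(f0))) <= 0.5);           % one-octave window around the median f0
cents = 1200 * log2(f0 / min(f0));                    % convert f0 values to cents
fmmRange = max(cents) - min(cents);                   % FMM range in cents
```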
Both distance metrics and the FMM range were employed as independent variables in a multiple linear regression model to predict IES. To guard against spurious improvements from merely adding independent variables to the regression, a bootstrap hypothesis test was incorporated. In this procedure, linear regression models were generated using randomized independent variables to predict the IES values in a bootstrap procedure comprising 1000 iterations. The results of the model using the true (non-randomized) predictors were then compared with the distribution of R2 values from the bootstrap models, and a model was considered suitable when the adjusted R2 value of the analyzed data was greater than the 99th percentile of the bootstrap distribution.
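Assuming a predictor matrix X (distance to the space center, distance to the piano centroid, FMM range) and a response vector y holding the IES per stimulus, the bootstrap check could be implemented as follows; the permutation-based randomization is our interpretation of the "randomized independent variables."

```matlab
% Sketch of the bootstrap check for the regression models (assumed procedure).
nBoot = 1000;
mdl = fitlm(X, y);                                    % model with the true predictors
r2Boot = zeros(nBoot, 1);
for k = 1:nBoot
    Xr = X(randperm(size(X, 1)), :);                  % break the predictor-response pairing
    m = fitlm(Xr, y);
    r2Boot(k) = m.Rsquared.Adjusted;                  % adjusted R2 of the randomized model
end
% model considered suitable if its adjusted R2 exceeds the 99th percentile
isSuitable = mdl.Rsquared.Adjusted > prctile(r2Boot, 99);
```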
A. Experiment 1
For experiment 1, the first two principal components explained 91% of the variance, with PC1 accounting for 80% and PC2 explaining 11%. The two components are depicted as a component space in Fig. 6(A). Vowel /u/ and bassoon sounds are clustered together in the component space. Along PC1, these are adjacent to the /a/ sounds, whereas string and trumpet sounds create their own region on the opposite side of the space. Piano sounds appear in the center of the space and overlap at the edges mostly with /u/ and bassoon sounds. Interestingly, the clusters located on opposite sides of the component space, e.g., /a/ sounds, trumpet, and strings, also contained the sounds with the best recognition scores. Conversely, the merged cluster of vowel /u/ sounds and instrumental bassoon sounds in the center is consistent with the assumption of potential confusion between these categories. Further, this cluster supports our hypothesis that /u/ and bassoon sounds had pronounced spectral similarities, which could explain their relatively poor recognition. Furthermore, the relatively small distance between those sounds and the piano sounds implies that the piano interferer could have had a much greater influence on recognition for those sounds than for the other, more distant sounds.
The analysis of the FMM range is presented in Fig. 6(B). Vocal sounds had a larger range than instrument sounds, with /a/ showing a range of 49 cents and /u/ a range of 45 cents. However, this range was smaller than in our previous detection experiments, where vocal sounds exhibited a range of 84 cents. The elimination of FMM reduced the range to 6 cents for both vowels. Instruments had an overall smaller FMM range than vocal sounds to begin with, with ranges of 28 cents for strings and 16 cents for winds, which were further reduced to 4 cents for strings and 5 cents for winds.
Results of the multiple linear regression are illustrated in Fig. 6(C). Similar to the LME analysis, a linear regression operating on the FMM range showed no considerable correlation, with R2 values smaller than 0.02 in both the isolated presentation and the accompanied presentation. Utilizing the distance between the target sounds in the component space yielded moderate correlations, with R2 values of 0.30 for the isolated presentation and 0.38 for the presentation with the piano interferer. When considering the distance in the component space between the target and piano sounds, the linear regression for the presentation with the interferer yielded an R2 value of 0.18. Operating on both distances improved the model to an R2 of 0.54. The addition of the FMM range did not enhance the model. To examine whether the presence of FMM specifically impacts the recognition of vowels, a distinct linear regression was performed, focusing solely on vowel targets. However, even in this analysis, the FMM range demonstrated no significant influence, with R2 values remaining below 0.05 in both presentation conditions. All results, except for the one operating only on the FMM range, passed the bootstrap hypothesis test, with R2 values exceeding the 99th percentile of the bootstrap distribution. In summary, the model supports our assumption that the spectral distinctiveness of the target sounds and the similarities between the target and the piano interferer guide sound source recognition. The consistently stronger correlations in the accompanied presentation for the similarities between the stimuli suggest that these similarities are particularly impactful when the musical scene is more demanding.
B. Experiment 2
The component space for sounds utilized in experiment 2 is shown in Fig. 6(D). In comparison to experiment 1, sound categories were more intermingled in space. The first two principal components explained 92% of the variance, with PC1 accounting for 85% and PC2 accounting for 7%. Vocal and string sounds were mixed, while wind instruments stood out and were mostly found in a distinct quadrant. Additionally, a densely packed cluster of piano sounds was visible. The lack of separate instrument and vocal clusters can be understood as a result of the less homogeneous excerpts utilized in this experiment, further emphasized by the pronounced cluster of piano sounds originating from the VSL database. Interestingly, despite the less homogeneous sound excerpts, better recognition was observed in experiment 2 than for the more distinct sounds in experiment 1.
The analysis of FMM range is depicted in Fig. 6(E). Vocal sounds in experiment 2 not only obtained a larger FMM range compared to experiment 1 but also showed a close resemblance to the ranges found in our previous detection experiment. Female vocal sounds showed a range of 78 cents, and male sounds showed a range of 96 cents, both being close to the salient vocal signals in our previous experiment with a range of 82 cents. After the FMM reduction, the ranges decreased to 15 cents for female vocals and to 29 cents for male vocals. Instrumental sounds carried smaller ranges than vocals, which for strings were reduced from 24 cents to 4 cents and for winds reduced from 21 cents to 9 cents.
As observed in the LME, a linear regression based on the FMM range showed no substantial correlations, with R2 values of approximately 0.05 in both the isolated and the accompanied presentations. Utilizing the distance between target sounds in the component space yielded weak correlations, with R2 values of 0.24 for the isolated presentation and 0.28 for the presentation with the piano interferer. When based on the distance in the component space between the target and piano sounds, the linear regression for the presentation with the interferer yielded an R2 value of 0.11, which did not surpass the 99th percentile threshold. Utilizing both the FMM range and the distance of target sounds slightly improved the model to an R2 of 0.26 for the isolated presentation and 0.29 for the accompanied presentation. Operating on both distance metrics yielded an R2 of 0.29. Incorporating all predictors resulted in an R2 value of 0.32. Only models that operated on the similarities within the target sounds passed the bootstrap hypothesis test. To investigate whether the FMM might influence only vocal sounds, a separate linear regression was performed focusing exclusively on female and male target sounds. However, even in this particular analysis, the FMM range showed no considerable correlation, with R2 values remaining below 0.05 under both presentation conditions. Taken together, these results imply that spectral similarities between the targets impacted recognition, albeit to a somewhat smaller degree than in experiment 1, as other distinct features of the inhomogeneous sounds may have been more dominant compared to spectral similarities.
The diminished impact of the interferer sounds appears to result from their spectral distinctiveness. The piano tones occupied a distinct area within the component space, without overlap with other sounds. This dissimilarity seems to have surpassed a critical threshold, such that the recognition of the target sounds was no longer influenced by their similarity to the interferer. Therefore, the relatively smaller deterioration in the accompanied presentation in experiment 2 could have resulted from a combination of multiple differences between target and interferer, including intonation, articulation, and spectral dissimilarity.
Despite the average FMM range of vocal sounds being comparable to that of our previous detection experiment, correlations between FMM range and recognition were negligible. This finding refutes our assumption from experiment 1 that the absence of an FMM effect was due to an insufficient FMM range. Additionally, the omission of FMM did not show a consistent trend; instead, it worsened or improved recognition unsystematically and only marginally. These findings reinforce our prior conclusion that cues related to FMM do not significantly affect the recognition of musical sounds.
VII. GENERAL DISCUSSION
In this study, we investigated recognition of vocal and instrumental sounds in three experiments. We tested the influence of FMM and accompaniment on the recognition of vocal sounds. Sounds from multiple databases were utilized to examine the generality of effects across diverse audio material. Participants were tasked with classifying short vocal or instrumental sounds in a go/no-go task. Sounds were controlled in level and pitch, and each sound was presented in versions with naturalistic FMM and with reduced FMM. Additionally, sounds were either presented in isolation or formed a harmonic triad with an accompanying but spatially separated piano interferer. The audio material of the sounds varied between experiments. To assess whether human recognition could be explained by acoustic features, a multiple linear regression employing spectral features of the sounds was utilized.
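For illustration, the following sketch outlines the dichotic layout described above, with a 250-ms target in one channel and a piano-like dyad in the other, all components equalized in RMS level; the synthetic tones, interval choices, and sample rate are placeholder assumptions rather than the actual stimuli.

```python
# Schematic sketch (not the authors' stimulus code) of the dichotic scene:
# target in one ear, a dyad standing in for the piano interferer in the other.
import numpy as np

fs = 44100
t = np.arange(int(0.25 * fs)) / fs                # 250-ms excerpt

def tone(f, harmonics=5):
    """Crude harmonic complex standing in for a recorded sample."""
    return sum(np.sin(2 * np.pi * f * k * t) / k for k in range(1, harmonics + 1))

def rms_normalize(x, target_rms=0.05):
    return x * target_rms / np.sqrt(np.mean(x ** 2))

f0 = 261.6                                        # assumed C4 target pitch
target = rms_normalize(tone(f0))                  # target sound (vocal/instrument stand-in)
dyad = rms_normalize(tone(f0 * 5 / 4) + tone(f0 * 3 / 2))  # dyad: major third + fifth

stereo = np.stack([target, dyad], axis=1)         # left: target, right: interferer
# Across the two ears, the three tones together form a major triad.
print("stereo stimulus shape:", stereo.shape)
```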
Contrary to our hypotheses and previous findings, we did not observe vocal superiority (faster and more accurate recognition) in any of our three experiments. Instead, notable differences between sung vowels became apparent, with /a/ sounds outperforming /u/ sounds. Suspecting that the absence of this effect was due to spectral similarities between vocal and bassoon sounds (Reuter et al., 2018), as indicated by our acoustical model and observed in the behavioral data, we repeated the experiment and removed the wind instruments or used different audio material. Although recognition improved, the differences between the vowels persisted and the reported vocal superiority effect remained absent. An argument could be made that language differences between singer and listener influenced the recognition of the vowels to some degree and thus contributed to the observed confusion. However, it is essential to note that the vowels used in classical singing may differ from those in everyday speech. The lack of vocal superiority is only sparsely reported in the literature (Bigand et al., 2011; Ogg et al., 2017). However, a direct comparison between our study and the aforementioned studies is questionable. Ogg et al. argued that the absence of vocal superiority for speech signals found in their study may not apply to the recognition of singing voices, as the investigated scenarios were too disparate. Bigand and colleagues pointed out that the absence of superiority in their study was likely caused by a peak level normalization. Furthermore, they reported that when utilizing an RMS sound level normalization, as conducted in many sound recognition studies (e.g., Agus et al., 2012; Suied et al., 2014; Moskowitz et al., 2020), superior recognition of vocals was observed. They argued that the RMS normalization might have emphasized frequencies that facilitate a superior recognition of speech sounds. However, this suggestion is not supported by our results, as we applied an RMS level normalization but still observed an absence of vocal superiority, which distinguishes our findings from these earlier reports.
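The distinction between the two normalization schemes can be illustrated with a short example: equalizing a sustained tone and a transient-heavy signal by their peaks leaves them with different RMS levels, and vice versa, which is why the choice of normalization can shift the relative loudness of sound categories. The signals below are toy examples, not the study's stimuli.

```python
# Small illustration (assumed example signals) of why peak and RMS level
# normalization diverge for signals with different crest factors.
import numpy as np

rng = np.random.default_rng(3)
fs = 44100
t = np.arange(fs) / fs

dense = np.sin(2 * np.pi * 220 * t)               # steady tone, low crest factor
peaky = rng.normal(0, 0.2, fs) * np.exp(-40 * t)  # decaying transient, high crest factor

def peak_norm(x): return x / np.max(np.abs(x))
def rms_norm(x): return x / np.sqrt(np.mean(x ** 2))

for name, x in [("dense", dense), ("peaky", peaky)]:
    p, r = peak_norm(x), rms_norm(x)
    print(f"{name}: RMS after peak norm = {np.sqrt(np.mean(p**2)):.3f}, "
          f"peak after RMS norm = {np.max(np.abs(r)):.1f}")
```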
Contrary to our assumptions, the removal of FMM showed no influence on vocal recognition. It is assumed that FMM enrich vocal sounds with additional information about pitch continuity (Weiss and Peretz, 2019) and enhance the prominence of sung vowels compared to vowels without FMM (McAdams, 1989). Our previous study on the influence of FMM on the detection of vocals in musical scenes underscored the importance of FMM, as the reduction of FMM led to reduced vocal salience (Bürgel and Siedenburg, 2023). Consequently, our intention was to explore the effect of FMM on sound recognition. However, in the present study no such effects were found. To investigate whether this absence was due to the use of an orchestral database with stationary tones and generally smaller FMM ranges, we conducted an additional experiment that used sounds extracted from continuous excerpts of the same pop music database used in our previous detection experiment. Although the FMM ranges were similar to those in the detection experiment, the effect of FMM on recognition remained absent. An obvious disparity between the two studies is the stimulus duration: 2 s in the previous experiment and 250 ms in the current experiment. An argument could be made that such short stimuli do not provide sufficient exposure for a perceivable effect of FMM to unfold. However, this reasoning seems rather unlikely, considering that established perceptual thresholds for identifying the direction of pitch modulations are as low as 20 ms (Gordon and Poeppel, 2002). Moreover, FMM perceptibility has been demonstrated for tones of considerably shorter duration, such as 80 ms (d'Alessandro and Castellengo, 1991), as well as for durations comparable to ours, such as 220 ms (Larrouy-Maestri and Pfordresher, 2018). On the other hand, this finding aligns with studies indicating that the auditory system utilizes multiple processes with different temporal resolutions (Poeppel, 2003; Santoro et al., 2014; Giroud et al., 2020). These processes encompass mechanisms specialized in extracting information such as pitch and spectral shape within short time intervals (∼30 ms), while other processes analyze longer time intervals (∼200 ms) to detect changes over time. Even though both of these time windows are shorter than the stimulus duration we employed, this observation could indicate that our stimulus duration was too brief for time-variant features to exert a significant effect on recognition. Moreover, in the previous detection experiment, we employed musical scenes featuring a variety of instruments and vocal sounds that overlapped in time and spatial location within the scene. This stands in contrast to the relatively simple dichotic scene utilized in the current recognition experiment, where the piano and target sound were clearly separated. This less saturated scene with strict peripheral separation might have offered a simplicity that rendered FMM cues redundant for recognition. Taken together, it seems likely that the presence of FMM has no major impact on the recognition of musical sounds when other timbre cues are available.
The presence of a musical accompaniment considerably deteriorated recognition for all sounds. Unexpectedly, however, this negative effect was consistently smaller for the recognition of /a/ sounds and of vocal excerpts from the pop music database. This distinct robustness to interference manifested primarily as faster recognition compared to other sounds, with almost no deterioration in accuracy. Importantly, this effect was present across all three experiments despite variations in the excerpts, highlighting its consistency. The observed robustness of vocal recognition may be indicative of a specialized processing mechanism for acoustic features of vocal signals during the segregation of auditory objects in a musical scene. This suggests that, when the auditory system segregates sound into mental representations of distinct streams, vocal features could trigger prioritized, voice-specific processing (Belin et al., 2000; Levy et al., 2001; Gunji et al., 2003; Belin et al., 2004) that in turn might facilitate better identification of vocal sounds within the complex auditory scene, contributing to accelerated recognition. However, distinctions between the tested vowels were apparent, with /u/ sounds lacking the robustness seen for /a/ sounds, suggesting that vocal sounds do not inherently trigger facilitated recognition. When also considering the susceptibility of /u/ sounds to confusion with instruments, our results suggest that while vocal recognition in musical scenes can be uniquely robust, it does not possess properties that make it impervious to confusion with spectrally similar sounds.
Musical sophistication showed no significant effects on sound recognition in our experiments. Musicians are often reported to have advantages in the discrimination of pitch (e.g., Tervaniemi et al., 2005; Micheyl et al., 2006) or timbre (e.g., Chartrand and Belin, 2006; Kannyo and DeLong, 2011), improved resistance against informational masking (Oxenham et al., 2003), and the ability to hear out partials in tone complexes (Zendel and Alain, 2009), in chords (Fine and Moore, 1993), or even melodies in complex musical mixtures (Siedenburg et al., 2020). Furthermore, familiarity with instrumental sounds (typical of more musically experienced individuals) is known to enhance recognition (Siedenburg and McAdams, 2017). However, no effect of musical sophistication was observed in our study. It should also be noted that a separate analysis of accuracy and speed revealed no trade-off between them (Chartrand and Belin, 2006). Instead, most participants showed a pattern in which sounds recognized with higher accuracy also tended to be detected faster. Partially in agreement with this absence of effects, studies specifically investigating the recognition and discrimination of very brief sounds have reported conflicting effects of musical sophistication, with influences either being absent (Bigoni and Dahl, 2018) or present (Akça et al., 2023).
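The kind of check referred to above can be sketched as a simple correlation between error rates and reaction times across sound categories; the data and variable names below are toy placeholders, not the study's measurements.

```python
# Minimal sketch of a speed-accuracy trade-off check on toy data.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(7)
n_categories = 8
error_rate = rng.uniform(0.02, 0.4, n_categories)                  # per-category error rates
rt = 0.45 + 0.3 * error_rate + rng.normal(0, 0.02, n_categories)   # per-category RTs in seconds

rho, p = spearmanr(error_rate, rt)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
# A positive correlation (less accurate categories are also slower) argues
# against a speed-accuracy trade-off, which would predict the opposite sign.
```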
A. Limitations
While our study has provided valuable insights into sound recognition, it is important to acknowledge inherent limitations that may have influenced the interpretation of our findings. One caveat of the study is that certain sounds, primarily /u/ and the bassoon, exhibited a high ER. While RT measurement has proven to be a useful tool, especially when assessing supra-threshold recognition, it may not be suitable for all signals tested in this study. This is particularly critical given the limited number of stimuli (12) used for each condition. An alternative interpretation is that the response times measured in these cases may reflect a measure of confidence rather than recognition speed. This view is further supported by the correlation between errors and RTs, indicating that a low ER was associated with faster RTs. Nonetheless, we deliberately included these sounds because we believe they provide valuable insight into the influence of acoustic similarity on sound recognition. To address the issue of high ERs, one option would be to extend the training phase. Previous research by Agus et al. (2010) demonstrated that even the recognition of seemingly meaningless noise signals can be improved above chance level with training. One approach could be to design a training session in which a certain number of stimuli must be correctly recognized for each sound category before proceeding to the main experiment. Another approach could be to investigate whether increased repetition of stimuli leads to a reduction in errors and how this in turn affects RTs. In the same vein, it would also be intriguing to explore the recognition of /u/ or of other vowel and instrumental sounds across different or more diverse stimulus pools containing spectrally similar or dissimilar sounds. This could shed light on how such variations influence the results, providing further insights into the hypothesized vocal superiority and the intricate role of spectral similarity.
Another potential source of undesired variability in our results could stem from the use of stimuli that are intended to be more ecologically valid and are therefore less controlled. While we applied a standardized method to extract stimuli based on a sound-dependent level threshold to maintain consistency, this approach may in principle have led to varying degrees of transient truncation across signals. However, based on our own close listening to the stimuli and as highlighted by Fig. S6 in the supplementary material, we do not think that the procedure severely truncated onset portions, so that onset cues should have remained largely intact (Siedenburg et al., 2019b). To further mitigate this potential issue, we included diverse sound categories (e.g., different vocal registers such as alto and soprano) and utilized multiple source databases. While the consistency observed in our results suggests the absence of a significant bias caused by our extraction procedure, further investigation into the generalizability of these findings across different databases could provide valuable insights.
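A schematic version of such a level-threshold extraction is given below; the frame length, hop size, and relative threshold are placeholder values, not the parameters used in the study.

```python
# Illustrative sketch (not the published extraction script) of a
# level-threshold excerpt extraction: locate the first short-time frame whose
# RMS exceeds a threshold defined relative to the maximum frame RMS of the
# sound, then cut a 250-ms snippet from that point.
import numpy as np

def extract_excerpt(x, fs, dur=0.25, frame=0.01, rel_thresh_db=-20.0):
    hop = int(frame * fs)
    n_frames = len(x) // hop
    rms = np.array([np.sqrt(np.mean(x[i*hop:(i+1)*hop] ** 2)) for i in range(n_frames)])
    thresh = rms.max() * 10 ** (rel_thresh_db / 20)
    start = np.argmax(rms >= thresh) * hop        # first frame above threshold
    return x[start:start + int(dur * fs)]

# Toy usage: a tone with a slow onset ramp
fs = 44100
t = np.arange(fs) / fs
x = np.minimum(t / 0.1, 1.0) * np.sin(2 * np.pi * 330 * t)
snippet = extract_excerpt(x, fs)
print(f"excerpt length: {len(snippet) / fs * 1000:.0f} ms")
```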
The methods used to investigate the effects of FMM also offer opportunities for expansion. It would be intriguing to explore whether presenting stimuli with and without FMM in separate blocks would yield different results. This approach could potentially affect the already challenging recognition task by allowing participants to adopt a strategy of extracting additional FMM information in blocks with FMM. It could be argued that our methodology did not facilitate this, as the alternation of signals with and without FMM within a presentation block may not have provided a reliable strategy for exploiting FMM cues. Moreover, increasing the number of stimulus repetitions could further increase certainty regarding the influence of FMM. Additionally, selecting stimuli with a high modulation depth could maximize the contrast between signals with and without FMM, potentially affecting recognition outcomes. Relatedly, stimulus selection could take into account that FMM are notably pronounced during note transitions, so it would be especially intriguing to employ excerpts featuring such transitions. Taken together, these revised methods could provide a clearer picture of the influence of FMM on the recognition of short sounds.
VIII. CONCLUSION
In contrast to previous studies, our work did not demonstrate a general recognition advantage for vocal sounds under the isolated presentation condition, nor did FMM influence recognition. Notably, recognition of the vowels /a/ and /u/ differed considerably, which was linked to similarities with instrumental sounds. When the sounds were accompanied by a piano dyad, recognition accuracy and speed deteriorated. However, a distinctive robustness to interference was observed for the recognition of /a/ sounds, whereas /u/ sounds lacked this robustness. An acoustical model highlighted the role of spectral envelope cues in sound recognition. In summary, these findings demonstrate that vocal recognition is not necessarily more efficient than instrumental sound recognition. This calls for a revised concept of vocal processing, emphasizing the need for a comprehensive understanding of the various acoustic factors influencing both vocal and instrumental sound recognition.
SUPPLEMENTARY MATERIAL
See supplementary material at SuppPub1.docx for an extended description of our methods and results.
ACKNOWLEDGMENTS
This work was funded by the Deutsche Forschungsgemeinschaft (DFG) (German Research Foundation)—Project No. 352015383—SFB 1330 A6. This research was also supported by a Freigeist Fellowship of the Volkswagen Foundation to K.S. M.B. and K.S. designed the study. M.B. collected and analyzed the data. M.B. wrote the first draft of the manuscript. K.S. revised the manuscript. Both authors contributed to the article and approved the submitted version.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Ethics Approval
The experiments were approved by the ethics committee of the University of Oldenburg and adhere to the ethical principles of the Acoustical Society of America (Acoustical Society of America, 2019). The participants provided their written informed consent to participate in this study. Participation was compensated monetarily.
DATA AVAILABILITY
The data that support the findings of this study are openly available in GitHub at https://github.com/MichelBuergel/Data/vocalRecognition.