The Acoustic Interactions for Robot Audition (AIRA) corpus is introduced for research on sound source localization and separation, as well as on multi-user speech recognition. Its aim is to support the evaluation and training of Robot Audition techniques, and of Auditory Scene Analysis in general. It was recorded in six real-life environments with varying noise levels and reverberation times, using two array configurations: an equilateral triangle, and a three-dimensional 16-microphone array set over a hollow plastic body. It includes clean speech data for static sources and tracking information for mobile sources. It is freely available at https://aira.iimas.unam.mx/.
1. Introduction
Speech source localization and separation are essential parts of Robot Audition and of Auditory Scene Analysis in general. It is of great interest to evaluate and train these techniques in noisy environments or in acoustic scenarios with a large amount of speech interference (such as a restaurant or a department store), since that is where they are expected to be deployed.
In terms of evaluation, it is important to carry it out in a repeatable manner so that improvements in performance can be quantified. In terms of training, there has been a recent surge of interest in data-driven models for source localization and separation (Rascon and Meza, 2017). To this end, it is also important to use a training data set that was acquired in real-life environments.
Several corpora have been collected for localizing and tracking sound sources, such as RWCP-SSD (Nakamura et al., 2000), AV16.3 (Lathoud et al., 2005), CAVA (Arnaud et al., 2008), AVASM (Deleforge et al., 2014), CAMIL (Deleforge and Horaud, 2011), and LOCATA (Löllmann et al., 2018). These corpora provide, in one form or another, important features: varying numbers of microphones, different array configurations, recordings in real environments, the use of a robotic head, etc. It is of great interest to have a corpus that incorporates all of these features.
This paper presents the Acoustic Interactions for Robot Audition (AIRA) corpus, which has the following characteristics:
It was recorded in six different real-life environments with considerable variation in noise levels and reverberation times.
Up to four static speech sources were emulated by high-end flat-response studio monitors reproducing recordings from a cleanly recorded corpus in Mexican Spanish: the DIMEx100 corpus (Pineda et al., 2010). All clean speech data from these static speech sources are included, along with their directions of arrival relative to the microphone array.
Mobile speech sources were enacted by human volunteers; their positions over time are either provided by a laser-based tracking system or estimated from their start and end positions.
The transcripts of all speech sources are included.
Two array configurations were used: a triangular array and a 16-microphone three-dimensional (3D) array.
It is important to note that a preliminary version of this corpus was successfully used to evaluate an algorithm that tracks multiple sound sources with a small number of microphones (Rascon et al., 2015). In that work, the decrease in the evaluated system's performance as the number of sources increased was more notable in a real setting than in an anechoic chamber, which prompted future work on its robustness against reverberation.
2. Corpus summary
The AIRA corpus is composed of audio recordings of up to four simultaneous speech sources surrounding a microphone array in different acoustic environments. Each recording has a length of 30 s and is stored as a raw audio WAV file sampled at 48 kHz with 16-bit floating-point precision. If the sources are static, the recording is accompanied by the clean audio signal of each speech source. The direction of arrival and the transcript of every source are also included.
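As an illustration, the following is a minimal sketch of loading one such recording in Python with the soundfile library; the file name is hypothetical, and the actual naming scheme and directory layout are given in the corpus documentation.

```python
# Minimal sketch: loading a multi-channel AIRA recording.
# The file name is hypothetical; see the corpus documentation
# for the actual naming scheme and directory layout.
import soundfile as sf

audio, fs = sf.read("recording_001.wav")    # audio: (samples, channels)

assert fs == 48000                          # corpus sample rate: 48 kHz
num_samples, num_channels = audio.shape     # 3 (triangular) or 16 (3D)
print(f"{num_channels} channels, {num_samples / fs:.1f} s")  # ~30.0 s
```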
For some recording environments, the static sources were studio monitors reproducing clean speech. Their positions were chosen to cover different angular separations between sources, ranging from 30° to 120°, as well as different heights relative to the array center, ranging from −0.17 to 0.22 m. For each source position configuration, up to ten repetitions were carried out, each with a different user. For some of these environments, a recording of a 50 Hz to 4 kHz sine sweep is also included for room characterization.
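A comparable sweep can be regenerated roughly as sketched below; note that the sweep type (logarithmic) and the 10 s duration are assumptions for illustration, not parameters stated in this letter.

```python
# Sketch: regenerating a 50 Hz - 4 kHz sine sweep for comparison with
# the one included in the corpus. The logarithmic sweep type and the
# 10 s duration are assumptions made for this illustration.
import numpy as np
from scipy.signal import chirp

fs = 48000                                   # corpus sample rate
duration = 10.0                              # assumed sweep length (s)
t = np.arange(int(fs * duration)) / fs
sweep = chirp(t, f0=50.0, t1=duration, f1=4000.0, method="logarithmic")
```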
For other recording environments, several human volunteers acted as mobile speech sources following diverse movement paths around the microphone array. For each source movement configuration, up to ten repetitions were carried out, each with a different transcript. These recordings are intended for tracking, so no clean channels are provided.
Considering all recordings across all source position configurations, the AIRA corpus includes close to 145 h of audio data. Its full documentation is available as supplementary material to this work.1
3. Microphone array configurations
It is important to mention that different hardware was used for the two array configurations. The recordings for the triangular array were captured first, with the equipment available at the time, which could capture 3-channel audio data. The 3D array recordings were captured afterwards with more up-to-date equipment able to capture 16-channel audio data. However, the differences between the two setups are minor.
3.1 Triangular array
This configuration is aimed at algorithms that require only a small number of microphones, as well as at circumstances where there are more sources than microphones. A schematic of the array is shown in Fig. 1.
The sources were positioned 1 m away from the center of the array, and their direction was registered for every recording. In terms of height, they were positioned at the same height as the imaginary plane formed by the triangular array; thus, no height information was registered.
The distance between microphones was either 0.18 or 0.21 m, so as to approximate the width of a human head, such that any pair of microphones can be used with free-field binaural algorithms. It is important to clarify that the inter-microphone distance changed between recording environments because of the limited space in some of them; however, these changes are all registered as part of the corpus.
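To make the free-field binaural use case concrete, the sketch below estimates a direction of arrival from one microphone pair using the textbook far-field model θ = arcsin(cτ/d); this is a generic illustration, not the algorithm of any work cited here.

```python
# Sketch of a textbook free-field binaural DOA estimate for one of the
# microphone pairs: the TDOA is found via plain cross-correlation and
# mapped to an angle with the far-field model theta = arcsin(c*tdoa/d).
import numpy as np

def doa_from_pair(x_left, x_right, fs=48000, d=0.18, c=343.0):
    # Peak of the full cross-correlation gives the lag (in samples)
    # of x_left relative to x_right.
    corr = np.correlate(x_left, x_right, mode="full")
    lag = np.argmax(corr) - (len(x_right) - 1)
    # Clip to the physically valid range before taking the arcsine.
    sin_theta = np.clip((lag / fs) * c / d, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))
```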
3.2 3D array
This configuration is aimed at algorithms that require a large number of microphones, as well as at circumstances that break the inter-microphone free-field assumption. A schematic of the array is shown in Fig. 2.
As can be seen, the center of the array is positioned at its top center, so that the top six microphones can still be used as a free-field array. Additionally, microphone 3 is not positioned at the top center; it is positioned such that microphones 1, 2, and 3 form a close-to-equilateral triangle, to test whether algorithms are robust against inaccurate microphone positioning and to compare performance against the triangular array. The array as a whole can be used for algorithms that are robust against having a body between microphones, which breaks the free-field assumption. In fact, pairs of microphones can be chosen (such as microphones 9 and 15) for use with non-free-field binaural algorithms. Plastic was used as the material for the body because of its ease of molding, reproduction, and transportation, and because it is sturdy enough to hold the 16 microphones.
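For such non-free-field pairs, TDOA estimators that are robust to reverberation are commonly preferred over plain cross-correlation; the sketch below shows the standard GCC-PHAT estimator applied to one channel pair. It is a generic technique, not one prescribed by the corpus, and the zero-indexed channel numbers are assumptions.

```python
# Sketch of GCC-PHAT, a TDOA estimator commonly preferred over plain
# cross-correlation in reverberant, non-free-field conditions, applied
# here to the pair formed by microphones 9 and 15 (assumed to be
# channels 8 and 14 of a zero-indexed 16-channel recording).
import numpy as np

def gcc_phat(x, y, fs=48000, max_tau=None):
    n = len(x) + len(y)                       # zero-padded FFT length
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    R = X * np.conj(Y)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n)   # phase transform
    shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-shift:], cc[:shift + 1]))
    return (np.argmax(np.abs(cc)) - shift) / fs     # TDOA in seconds

# Example usage, with "audio" as loaded earlier:
# tdoa = gcc_phat(audio[:, 8], audio[:, 14])
```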
4. Recording environments
The noise level of each recording environment was measured with a sound pressure level (SPL) meter and is reported in dB SPL. The reverberation time (τ60) was measured using a reverberation estimator that can only provide measurements greater than 0.01 s.
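The specific estimator is not detailed here; as a reference point, the sketch below shows the classic Schroeder backward-integration estimate of τ60 from a room impulse response (which can itself be derived from the included sine sweeps).

```python
# Sketch of the classic Schroeder backward-integration estimate of
# tau_60 from a room impulse response. Which estimator was actually
# used for the corpus is not stated here; this is a reference method.
import numpy as np

def rt60_schroeder(ir, fs=48000):
    # Schroeder decay curve: backward-integrated squared IR, in dB.
    energy = np.cumsum(ir[::-1] ** 2)[::-1]
    decay_db = 10 * np.log10(energy / energy[0] + 1e-12)
    # Fit the -5 dB to -25 dB segment (a T20 fit), then extrapolate
    # the decay rate to the full 60 dB drop.
    i5 = np.argmax(decay_db <= -5.0)
    i25 = np.argmax(decay_db <= -25.0)
    slope = (decay_db[i25] - decay_db[i5]) / ((i25 - i5) / fs)  # dB/s
    return -60.0 / slope
```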
The sources were placed 1 m from the center of the microphone array at different azimuth angles, with separations between sources ranging from 30° to 150°. In addition, the sources were placed at different elevations relative to the center of the array, ranging from −0.17 to 0.22 m.
4.1 Anechoic chamber
This environment is a full-anechoic chamber, described in Boullosa and Lopez (1999). It has a size of 5.3 m × 3.7 m × 2.8 m and a very low noise level (≈0.13 dB SPL), with τ60 < 0.01 s. Both array configurations were used in this environment. No other noise sources were present in this setting. Recordings have a signal-to-noise ratio (SNR) of ≈43 dB.
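The per-environment SNR figures quoted in this section can be reproduced along the lines of the sketch below, assuming that a speech segment and a noise-only segment can be isolated from a recording; the segmentation itself is hypothetical and not part of the corpus specification.

```python
# Sketch of how the SNR figures quoted in this section can be
# computed, assuming a speech segment and a noise-only segment
# have been isolated from a recording (a hypothetical step).
import numpy as np

def snr_db(speech_segment, noise_segment):
    p_speech = np.mean(np.square(speech_segment))  # mean speech power
    p_noise = np.mean(np.square(noise_segment))    # mean noise power
    return 10 * np.log10(p_speech / p_noise)
```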
4.2 Cafeteria
This environment is a moderate-size student cafeteria during a 5 h period of high customer presence. It has an approximate size of 20.7 m × 9.6 m × 3.63 m. It has a high noise level (71 dB SPL) with an average τ60 = 0.27 s. Its ceiling and floor are made of concrete, and its walls are made of a mixture of concrete and glass. In this environment, only the 3D array configuration was used. Noise sources around the array included people talking, babies crying, tableware clanking, some furniture movement, and the cooling fans of stoves. Recordings have an SNR of ≈16 dB.
4.3 Department store
This environment is a sizable department store (comparable to a Walmart or Tesco) during a 5 h period of moderate customer presence (the most the store administration would allow). It has an approximate size of 91 m × 62 m × 6 m. It has a high noise level (63 dB SPL), with an average τ60 = 0.16 s. It has a high ceiling made of plastic and aluminum, its walls are made of brick, and its floor is made of concrete. As can be seen, this environment is quieter and less reverberant than the Cafeteria environment. In this environment, only the 3D array configuration was used. Noise sources around the array included people talking, people walking by with trolleys, and general announcements. Recordings have an SNR of ≈17 dB.
4.4 Hall
This environment is an office hallway with a low noise level (48 dB SPL) and an average τ60 = 0.21 s. It has an average width of approximately 1.5 m, a length of approximately 28 m, and a height of 2.1 m. The ceiling is made of plaster, and the walls and the floor are made of concrete. In this environment, only the triangular array configuration was used. Noise sources around the array included inter-office chatter and robotic motor ego-noise. Recordings have an SNR of ≈10 dB.
During the recordings, human volunteers acted as speech sources and were asked to read pre-specified, randomly chosen sentences from the DIMEx100 corpus while moving along a pre-specified path. Each path was defined only by its start and stop positions; thus, the positions in the ground-truth position files are estimates that assume the volunteers moved at a uniform speed.
In addition, the microphone array itself was moving: it was placed on a service robot called Golem-II (Pineda et al., 2015), borrowed from the Golem Group. The robot was programmed to move in a straight line through the hallway at a speed of 0.13 m/s.
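The uniform-motion estimate described above amounts to linear interpolation between the path's endpoints, roughly as follows; the function name and the planar-coordinate representation are illustrative only.

```python
# Sketch of the uniform-motion assumption described above: the
# ground-truth position at time t is a linear interpolation between
# the path's start and stop positions over the 30 s recording.
import numpy as np

def interpolate_path(start_xy, stop_xy, t, duration=30.0):
    # t can be a scalar or an array of timestamps in [0, duration].
    alpha = np.clip(np.asarray(t, dtype=float) / duration, 0.0, 1.0)
    alpha = alpha[..., None]                  # broadcast over (x, y)
    return (1 - alpha) * np.asarray(start_xy) + alpha * np.asarray(stop_xy)
```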
4.5 Office A
This environment is an open-cubicle computer laboratory with an approximate size of 5.7 m × 6.6 m × 2.1 m and a somewhat low noise level (52 dB SPL), with an average τ60 = 0.20 s. The ceiling is made of plaster, the walls of a mix of concrete and glass, and the floor of concrete. Both array configurations were used in this environment. Noise sources around the array included inter-cubicle chatter and computer cooling fans. Recordings have an SNR of ≈21 dB.
4.6 Office B
This environment is another open-cubicle computer laboratory. It has an approximate size of 10.5 m × 4.9 m × 2.1 m, and is divided into three 3.5-m-wide spaces that are acoustically connected. It has a low noise level (42 dB SPL) with an average τ60 = 0.42 s. The ceiling is made of plaster, the walls of a mix of concrete and glass, and the floor of concrete. As can be seen, it is a bit quieter but more reverberant than Office A. In this environment, only the triangular array configuration was used. Noise sources around the array included inter-cubicle chatter and computer cooling fans. Recordings have an SNR of ≈20 dB.
During the recordings, human volunteers acted as speech sources in a similar way as in the Hall recording environment, with the only difference being that the microphone array did not move in this environment.
4.7 Office C
This environment is the same as Office A; however, in this case the sources were mobile and were automatically tracked. The tracking system used the laser on the bottom part of the Golem-II robot, and the human volunteers were asked to wear an accessory on their legs that facilitated tracking. Thus, their ground-truth positions over time are known and included as part of the corpus. In this environment, only the triangular array configuration was used. Recordings have an SNR of ≈14 dB.
5. Conclusion
The AIRA corpus can be used to train or to evaluate techniques for sound source localization and separation, as well as for multi-user speech recognition. It uses two microphone array configurations; it was recorded in six highly varied acoustic environments; it includes clean speech data for static sources and tracking information (both measured and estimated) for mobile sources; it provides the transcript of all speech sources; and it is freely available from https://aira.iimas.unam.mx/.
This version of AIRA was captured over a span of 7 years. However, there are plans to continue extending it with recordings in different environments. Additionally, it is of interest to provide different SNRs per recording environment by employing several loudspeaker pre-amplification levels. It is also of interest to use different bodies on which to mount the microphone arrays, so as to explore different materials as well as different array configurations. Finally, it is of interest to expand the AIRA corpus into other application scenarios, such as autonomous drones. In fact, an important push in this regard has already been made (Ruiz-Espitia et al., 2018), which the official AIRA website links to and which readers are encouraged to follow.
Acknowledgments
The authors thank the support of CONACYT through Project Nos. 81965, 178673, and 251319, PAPIIT-UNAM through Project No. IN107513, and ICYTDF through Project No. PICCO12-024. In addition, the authors would like to thank Dr. Luis Pineda, who provided the DIMEx100 corpus and loaned the Golem-II+ robot for some of the recordings. Furthermore, the authors would like to acknowledge the help of Oscar Aguilar, Rodolfo Petrearce, Varinia Estrada, and Alfonso Vega during the capture process. Finally, a special recognition is given to Dr. Santiago Jesus Perez Ruíz, of the Laboratorio de Acústica y Vibraciones of the Instituto de Ciencias Aplicadas y Tecnología (formerly known as the Centro de Ciencias Aplicadas y Desarrollo Tecnológico) of the Universidad Nacional Autónoma de México, for his invaluable support in the collection of the AIRA corpus.
See supplementary material at https://doi.org/10.1121/1.5078769E-JASMAN-144-510811 for the full documentation of the AIRA corpus.