Virtual reality environments offer new possibilities in perceptual research, such as the presentation of physically impossible but ecologically valid stimuli in contrived scenarios. To facilitate perceptual research in such environments, this study presents a publicly available database of anechoic audio speech samples with matching stereoscopic and 360° video. These materials and the accompanying software tool allow researchers to create simulations with up to five talkers positioned at arbitrary azimuthal locations, at multiple depth planes, in any 360° or stereoscopic environment. This study describes the recording conditions and techniques, the contents of the corpus, and how to use the materials within a virtual reality environment.
Studies investigating speech and communication in naturalistic, ecologically valid environments need a research tool that allows the creation of acoustically and visually complex environments while retaining the ability to parametrically manipulate the audio and video of multiple talkers and their environment (Cappelloni et al., 2019; Stecker et al., 2018). The advent of virtual reality (VR) technology has greatly enhanced our ability to explore questions surrounding speech intelligibility (Gonzalez-Franco et al., 2017), spatial audio (Poirier-Quinot and Katz, 2018), auditory localization (Ahrens et al., 2019), room acoustics (Garí et al., 2019; Rungta et al., 2018; Stecker et al., 2018), hearing aid evaluation (Hendrikse, 2019), and many others in complex environments (Stecker, 2019; Hendrikse et al., 2019). However, until now, there has not been a speech corpus specifically designed to leverage the unique capabilities this technology offers the research community. This corpus was created to facilitate auditory and multisensory research in immersive environments and was designed to allow experimental control in multidimensional tasks.
II. 3D AUDIO-VISUAL SPEECH CORPUS
This audio-visual speech corpus includes recorded sentences from five talkers, two male and three female. The talkers were recorded speaking material from two different speech corpora: the coordinate response measure (CRM) (Bolia et al., 2000) and the Harvard IEEE corpus word list (Rothauser, 1969). The CRM corpus was chosen for its use in measuring speech intelligibility in multitalker environments, as well as for offering the ability to “gain measures of sensitivity (d') and response bias” via a signal detection style task (Bolia et al., 2000). Additionally, it is a widely used corpus across many fields, including automatic speech recognition (ASR) technology (Cooke et al., 2006) and audiology (Best et al., 2012). The IEEE corpus was chosen because of its wide use in speech intelligibility testing in multitalker scenarios (Hawley et al., 2004; Qin and Oxenham, 2003; Bernstein and Grant, 2009) and because its sentences are phonetically balanced and low in context. The CRM recordings include seven callsigns (“Laker,” “Baron,” “Charlie,” “Ringo,” “Eagle,” “Hopper,” and “Tiger”), four colors (red, green, white, and blue), and the numbers one through eight. All five talkers completed the full factorial combination of the callsigns, colors, and numbers, giving a total of 224 sentences per talker. Each sentence was recorded twice, at two planes of depth: a near plane of 81 cm and a far plane of 183 cm. The multiple planes of depth provide realistic depth cues of the talkers for near and far environmental contexts. The IEEE recordings include 50 list sentences per talker; each talker recorded a unique set of list sentences and recorded the same 50 sentences at both planes of depth. The talkers and their corresponding list numbers are given on the hosting site described below.
This audio-visual speech corpus includes both a 360° video and a 180° stereoscopic video of each recorded sentence. Each 360° and 180° video is provided in three versions: a black background, a “black” transparent background, and a greenscreen version. The black background version consists of the talkers cropped out of the greenscreen and placed on an empty black background. The greenscreen version allows the user to crop the talkers and place them into any 180° or 360° filmed background environment, and the “black” transparent version allows the already cropped talkers to be placed into any intended background and composited into videos with other talkers. This third, “black” transparent version looks like the black background version when viewed alone; however, it is read as transparent by video composition tools. With all five talkers (see Fig. 1), both speech corpora, two planes of depth, both 360° and 180°, and all three backgrounds (greenscreen, black, and black transparent), a total of 16 440 videos are included, as well as color calibration videos. The raw audio files for every recording are also included, for a total of 2740 individual audio files.
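The totals quoted above follow directly from the corpus design. The short sketch below reproduces that arithmetic; the variable names are illustrative only, and the figures are nominal counts that do not account for any individual sentences later excluded from the corpus.

```python
# Nominal corpus size, computed from the design described in the text.
# All names here are illustrative; only the numbers come from the paper.
CALLSIGNS = ["Laker", "Baron", "Charlie", "Ringo", "Eagle", "Hopper", "Tiger"]
COLORS = ["red", "green", "white", "blue"]
NUMBERS = list(range(1, 9))  # one through eight

TALKERS = 5
DEPTH_PLANES = 2        # near (81 cm) and far (183 cm)
IEEE_PER_TALKER = 50    # list sentences per talker
PROJECTIONS = 2         # 360 degree and 180 degree stereoscopic
BACKGROUNDS = 3         # greenscreen, black, "black" transparent

# Full factorial CRM set spoken by each talker
crm_per_talker = len(CALLSIGNS) * len(COLORS) * len(NUMBERS)

# One audio recording per sentence, depth plane, and talker
recordings = (crm_per_talker + IEEE_PER_TALKER) * DEPTH_PLANES * TALKERS

# Each recording exists as a video in every projection and background
videos = recordings * PROJECTIONS * BACKGROUNDS

print(crm_per_talker)  # 224
print(recordings)      # 2740
print(videos)          # 16440
```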
All video and audio were recorded in an anechoic chamber. The anechoic chamber was a fully anechoic room manufactured by Eckel Industries with an interior clear dimension of 15 ft wide × 25 ft long × 15 ft high. The chamber was built and tested to meet a background noise criterion of 10 dB below the human hearing threshold as defined in ISO 7029 from 100 Hz to 20 kHz. Free field (anechoic) conditions were verified using the room qualification procedure specified in Annex A of ISO 3745 from 100 Hz to 20 kHz. All talkers were filmed against a greenscreen approximately 9 ft wide × 10 ft tall with a length of 20 ft. The talkers were lit with six lights in individual 24 in. × 24 in. softboxes with 85 W 5500 K CFL bulbs to minimize shadow effects for the post-processing of greenscreen cropping. All talkers were chosen for an American English accent to eliminate talker accent as a potential confounding factor for those using the corpus for speech intelligibility. Five total talkers were recorded, two males and three females.
Talkers were filmed by two simultaneously recording Vuze XR Dual VR Cameras: one recording in 3D 180 × 180 × 2 stereoscopic mode (half sphere) and the other in 360 × 180 mode (full sphere). All video was captured at 5.7K video resolution and 30 frames per second. The 360° full sphere video was recorded at a height of 162 cm and the 3D 180° half sphere video was recorded at a height of 165 cm. Talkers were filmed at two planes of depth from the cameras: near and far.
The main mic was a Sennheiser 416 shotgun mic attached to a boom stand and located approximately 1 m away and out of the camera view. The second microphone was a Shure CVL-B/C lapel mic clipped to the talker's collar, which was attached to a Shure BLX1 H10 transmitter placed in the talker's back pocket. This signal was received by a Shure BLX4 H10 receiver. Both microphones were hard wired to a Zoom F8 field recorder operating at a 48 kHz sampling rate and 24-bit resolution.
C. Recording process
Prior to recording, all talkers went through a training phase for both the CRM sentences and the IEEE sentences. The training phase consisted of listening to an example sentence for timing, tone, and emphasis. Talkers were then required to repeat the training phrases with correct timing and pronunciation prior to the start of the recording sessions. During the recording sessions, one experimenter listened to and maintained the timing of the sentences from the talker, and another listened to and maintained the pronunciation and quality of the sentences. Both experimenters had to accept the sentence spoken by the talker before moving on to the next. Periodically throughout the recording sessions, the talkers would listen to the training sentences again for timing.
The recording process for the CRM sentences consisted of the talker being given a call sign and a color and being asked to follow the CRM carrier phrase, starting at the number one and ending at the number eight, before being prompted by an experimenter for the next color. This was done for all seven call signs and four colors. During the CRM sentence recording, we asked the talkers to maintain eye contact with the space between the two cameras.
For the IEEE sentences, the talker read all the sentences prior to the recording session to become familiar with them and to resolve pronunciation questions. During the recording of the IEEE sentences, each sentence was displayed directly at the top of the cameras for the talker to read while recording. As a result, in the IEEE recordings the talker may appear to be reading and looking slightly above the camera. Talkers were prompted to take breaks every twenty minutes.
During the recording, talkers were asked to stand on their distance mark on the greenscreen paper and face the camera, maintaining eye contact with the space between the two Vuze Cameras while speaking the CRM sentences, and directly above the top camera for the IEEE sentences. Five seconds of silence before and after each sentence were recorded.
Before each twenty-minute recording session, three steps were carried out: the talker was recorded holding an XRite color calibration palette to facilitate color correction in post-production, the talker's position in each camera's view was checked for greenscreen clearance, and an experimenter produced a loud clap for audio-video alignment in post-processing.
After the entire corpus had been collected, the audio and video components underwent several post-processing steps.
The audio from the lavalier mic was discarded, and the audio from the boom mic was used for all audio purposes. The audio was edited using Adobe Audition. The raw audio from the boom mic was volume matched using Audition's “Match Clip Loudness” effect to a target loudness of −23 LUFS, with a tolerance of 0.1 LU (EBU R 128-2014). The audio was then cleaned for noise, and talker-generated noises (such as the movement of clothes or stomach sounds) were removed manually using Adobe Audition's “Noise Reduction” effect at 50%, with a reduction of 16 dB. Five seconds of matched “room tone” was ensured at either end of the talker audio by taking a five-second “room tone” clip, consisting of a static recording of the room free of talker noise, and aligning it over the audio tracks. The final audio track edits were saved as .wav files in 48 kHz, mono, 32-bit format. The audio files were then imported into Adobe Premiere Pro 2019 for alignment with the video, as described in the section below. The audio .wav files, without video, are also provided in this corpus.
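For readers who want to loudness-match their own recordings to the corpus outside of Audition, the target and tolerance above translate into a simple static gain. The following is a minimal sketch, not the procedure Audition uses internally: it assumes the integrated loudness of a clip has already been measured with some loudness meter, and the function names are hypothetical.

```python
# Sketch: static gain needed to move a clip to the corpus loudness target.
# Assumes `measured_lufs` was obtained elsewhere from a loudness meter;
# the function names are hypothetical.
TARGET_LUFS = -23.0    # target loudness stated in the text
TOLERANCE_LU = 0.1     # tolerance stated in the text


def gain_to_target(measured_lufs, target_lufs=TARGET_LUFS):
    """Return (gain in dB, linear amplitude factor) to reach the target."""
    gain_db = target_lufs - measured_lufs
    linear = 10 ** (gain_db / 20)  # amplitude ratio corresponding to gain_db
    return gain_db, linear


def within_tolerance(measured_lufs, target_lufs=TARGET_LUFS,
                     tolerance_lu=TOLERANCE_LU):
    """True if a clip already sits within the tolerance of the target."""
    return abs(measured_lufs - target_lufs) <= tolerance_lu


gain_db, linear = gain_to_target(-29.0)
print(gain_db)                   # 6.0
print(within_tolerance(-23.05))  # True
print(within_tolerance(-22.8))   # False
```

Multiplying every sample by the linear factor shifts the measured loudness by approximately the same number of loudness units, so a clip measured at −29 LUFS would need about 6 dB of gain.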
The video from both the 360° camera and the stereoscopic 180° camera was processed in Adobe Premiere Pro 2019. To facilitate the generation of complementary video material, we include the project settings used in Adobe Premiere, which were as follows: the frame size was set to 5760 horizontal by 2880 vertical, with a frame rate of 29.97 frames/s and a square pixel aspect ratio. Scan mode was set to progressive. The audio settings for the videos used a sample rate of 48 000 samples/s. The VR settings for projection were equirectangular with a monoscopic layout and a captured view of 360° horizontal × 180° vertical.
The 360° video general settings also used a custom editing mode with a timebase of 29.97 fps. The video settings for frame size, frame rate, pixel aspect ratio, and fields were the same as above, and the audio settings were again 48 000 samples/s. The VR settings for projection, layout, and captured view were also the same as above.
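The frame size and captured view quoted above are mutually consistent: an equirectangular frame that spans 360° × 180° has a 2:1 aspect ratio, and 5760 × 2880 gives the same angular resolution in both directions. The sketch below checks this and maps an azimuth to a pixel column. The convention that 0° azimuth falls at the horizontal center of the frame is an assumption made for illustration and is not stated in the paper; the function names are hypothetical.

```python
# Sketch: geometry of the equirectangular project settings.
# Frame size and captured view come from the text; the azimuth convention
# (0 degrees at frame center, positive to the right) is an assumption.
FRAME_WIDTH = 5760
FRAME_HEIGHT = 2880
H_FOV_DEG = 360.0
V_FOV_DEG = 180.0


def degrees_per_pixel():
    """Angular size of one pixel, horizontally and vertically."""
    return H_FOV_DEG / FRAME_WIDTH, V_FOV_DEG / FRAME_HEIGHT


def azimuth_to_column(azimuth_deg):
    """Map an azimuth in degrees to a pixel column, wrapping at +/-180."""
    wrapped = (azimuth_deg + 180.0) % 360.0
    return int(wrapped / H_FOV_DEG * FRAME_WIDTH) % FRAME_WIDTH


print(degrees_per_pixel())      # (0.0625, 0.0625)
print(azimuth_to_column(0.0))   # 2880
print(azimuth_to_column(90.0))  # 4320
```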
The audio and the video clips were aligned in Adobe Audition using sound markers, and the tracks were then imported into Premiere. The greenscreen was then cropped and the videos were exported in three background types: transparent “black” background, black background, and greenscreen. A three-second silent sequence was ensured both before and after every talker phrase or sentence. The video format used was the QuickTime format (.mov) with Apple ProRes 4444 + Alpha as the codec, applied to both sets of exported videos: the 180° and 360° “black” transparent and the 180° and 360° greenscreen videos.
Each video and audio file was screened for mispronounced words, extraneous noises, speed, and so forth. The few known issues with the video and/or audio of the corpus are listed below.
E. Special notes
Here we list the known issues with talker 5, the exceptions to the CRM sentences, and the limits on the IEEE sentences. A logo on talker 5's black t-shirt was removed using a matte effect. The matte was filled with the color #161619 at the close distance and #20232A at the far distance so that it would blend with the black t-shirt. The following CRM sentence combinations are not included in the corpus due to various errors such as noise, talker eye position, and talker pronunciation. They are designated in the following format: talker, followed by the callsign/color/number/distance of the sentence. Talker 1: Laker/blue/one/close, Laker/blue/two/far, Charlie/white/one/far, Eagle/green/one/far, Eagle/green/eight/far; Talker 2: Laker/red/one/far, Laker/green/one/close; Talker 4: Baron/green/seven/close; Talker 5: no sentences with number eight at the far depth plane are included, Laker/blue/three/both, Baron/white/one/close, Baron/blue/one/close, Charlie/blue/four/far, Ringo/green/seven/far, Eagle/green/seven/far, Eagle/blue/seven/far, Tiger/red/five/close.
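When building stimulus lists, it can help to encode these exclusions so that missing sentences are never requested. The sketch below is one possible representation, not part of the distributed corpus tools. It interprets “both” as both distances and the talker 5 note as excluding every callsign and color combination with the number eight at the far plane; the tuple layout and function names are hypothetical.

```python
# Sketch: filter the full factorial CRM set by the exclusions listed above.
# Tuple layout: (talker, callsign, color, number, distance). Hypothetical names.
from itertools import product

CALLSIGNS = ["Laker", "Baron", "Charlie", "Ringo", "Eagle", "Hopper", "Tiger"]
COLORS = ["red", "green", "white", "blue"]
NUMBERS = ["one", "two", "three", "four", "five", "six", "seven", "eight"]
DISTANCES = ["close", "far"]
TALKERS = range(1, 6)

LISTED = [
    (1, "Laker", "blue", "one", "close"),
    (1, "Laker", "blue", "two", "far"),
    (1, "Charlie", "white", "one", "far"),
    (1, "Eagle", "green", "one", "far"),
    (1, "Eagle", "green", "eight", "far"),
    (2, "Laker", "red", "one", "far"),
    (2, "Laker", "green", "one", "close"),
    (4, "Baron", "green", "seven", "close"),
    (5, "Laker", "blue", "three", "close"),  # "both" read as close and far
    (5, "Laker", "blue", "three", "far"),
    (5, "Baron", "white", "one", "close"),
    (5, "Baron", "blue", "one", "close"),
    (5, "Charlie", "blue", "four", "far"),
    (5, "Ringo", "green", "seven", "far"),
    (5, "Eagle", "green", "seven", "far"),
    (5, "Eagle", "blue", "seven", "far"),
    (5, "Tiger", "red", "five", "close"),
]

# Talker 5: every number-eight sentence at the far plane (assumed reading)
TALKER5_EIGHT_FAR = {
    (5, callsign, color, "eight", "far")
    for callsign, color in product(CALLSIGNS, COLORS)
}

EXCLUDED = set(LISTED) | TALKER5_EIGHT_FAR


def is_available(talker, callsign, color, number, distance):
    """True if the CRM sentence is not on the exclusion list."""
    return (talker, callsign, color, number, distance) not in EXCLUDED


def available_crm():
    """All CRM sentence keys remaining after exclusions."""
    full = product(TALKERS, CALLSIGNS, COLORS, NUMBERS, DISTANCES)
    return [key for key in full if key not in EXCLUDED]


print(is_available(5, "Hopper", "red", "eight", "far"))    # False
print(is_available(5, "Hopper", "red", "eight", "close"))  # True
```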
The corpus data is available from an open linked SharePoint site maintained by Facebook,¹ or by email request to the corresponding author. The corpus, in its entirety, is available for free under the Creative Commons Attribution Noncommercial 4.0 International public license agreement (CC-BY-4.0)² and is hosted publicly. Facebook will maintain the SharePoint site and any necessary subsequent hosting for the corpus in perpetuity. Users will have access to a repository of audio and video recordings made at FRL Redmond with employee participants. This includes the 180° and 360° “black” transparent videos, the 180° and 360° greenscreen videos, and the 180° and 360° black background videos, as well as the edited audio .wav files. In total, a complete download of the corpus would include 17 322 videos (including the XRite color bar recordings) and 3200 audio files. Additionally, for future lighting or color adjustments, each talker has a short video sequence holding an XRite color calibration palette for each distance position. A JSON file is also included with a description of the file naming convention, talker notes, and instructions for viewing across different VR platforms, compiling the videos, and spatializing the audio for the compiled videos. Finally, there is software in the form of a Unity (Unity Technologies, San Francisco, CA) package, described below, to load and display the audio and video recordings.
A. Use in VR environments
In addition to the above, we are making available a simple Unity tool that allows experimenters to quickly compile multiple talkers (up to three talkers at a time) within any background, together with their relevant audio files. This player also allows experimenters to designate the spatial locations of the audio files and provides a basic interface for behavioral data collection and data exporting. Importantly, this tool allows users to view the corpus and compiled videos in consumer VR headsets. The complete details of this Unity tool will also be hosted at the site above for download.
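To illustrate what designating a spatial location involves, the sketch below converts an azimuth and one of the two recorded depth planes into Cartesian coordinates for an audio source. It is not the Unity tool's code. It assumes the listener sits at the origin, with x to the right, y up, and z forward (the axis convention Unity uses), and that positive azimuth points to the listener's right; the function name is hypothetical.

```python
# Sketch: place a talker's audio source from azimuth and depth plane.
# Assumptions: listener at origin, x right, y up, z forward, positive
# azimuth to the right. Depth-plane distances come from the text.
import math

NEAR_PLANE_M = 0.81  # near plane, 81 cm
FAR_PLANE_M = 1.83   # far plane, 183 cm


def talker_position(azimuth_deg, distance_m, height_m=0.0):
    """Return (x, y, z) in meters for a source at the given azimuth."""
    azimuth = math.radians(azimuth_deg)
    x = distance_m * math.sin(azimuth)  # left/right offset
    z = distance_m * math.cos(azimuth)  # forward offset
    return (x, height_m, z)


# Three talkers: straight ahead (near) and 45 degrees to either side (far)
for azimuth, plane in [(0, NEAR_PLANE_M), (-45, FAR_PLANE_M), (45, FAR_PLANE_M)]:
    print(azimuth, talker_position(azimuth, plane))
```

Keeping the source distance equal to the depth plane at which the video was filmed keeps the visual and acoustic distance cues consistent with each other.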
The authors would like to thank the research assistants for their incredible work to help record and quality check this corpus, with a special thank you to Alex Gustafson and Stephanie Cunningham, as well as to our research team for useful feedback on prototypes and early production.
SharePoint Site maintained by Facebook: https://fb.sharepoint.com/:f:/s/FRLAudioResearch/EpU2AeUdvDBBvF589aYDOEEBeREku01AkIQWlD8V4H3m8g?e=Dq4FWR.