The bi-modal speech recognition system requires two language samples, one for training and one for testing algorithms, which precisely reflect natural English speech. For the purposes of the audio-visual recordings, a training database of 264 sentences (1730 words without repetitions; 5685 sounds) has been created. The semantically and syntactically congruent sentences reflect the vowel and consonant frequencies of natural speech, mirroring both the lexical word frequencies and the casual-speech sound frequencies of the BNC corpus of approx. 100 million words. The absolute deviation from the source sound frequencies is 0.09%, and individual vowel deviations are reduced to a level between 0.0006% (min.) and 0.009% (max.). The absolute consonant deviation is 0.006% and oscillates between 0.00002% (min.) and 0.012% (max.). Similar convergence is achieved in the language sample for testing algorithms (29 sentences; 599 sounds). The post-recording analysis examines particular articulatory settings which aid visual recognition, as well as co-articulatory processes which may affect the acoustic characteristics of individual sounds. Results of bi-modal speech element recognition employing this language material are included in the paper.
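The deviation measure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the phoneme labels and reference frequencies are hypothetical stand-ins, not actual BNC values, and the deviation is computed as the sum of absolute differences between relative frequencies in the sample and in the reference corpus.

```python
from collections import Counter

def phoneme_frequencies(phonemes):
    """Relative frequency of each phoneme in a transcribed sample."""
    counts = Counter(phonemes)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

def absolute_deviation(sample_freq, reference_freq):
    """Sum of |sample - reference| over all phonemes seen in either set."""
    all_phonemes = set(sample_freq) | set(reference_freq)
    return sum(abs(sample_freq.get(p, 0.0) - reference_freq.get(p, 0.0))
               for p in all_phonemes)

# Hypothetical reference frequencies (stand-ins, not BNC figures).
reference = {"i": 0.30, "e": 0.25, "a": 0.45}

# Toy transcription of a candidate sentence set.
sample = phoneme_frequencies(["i", "e", "a", "a", "i", "a"])

dev = absolute_deviation(sample, reference)
```

In sentence-set construction, a measure of this kind lets candidate sentence lists be compared and iteratively adjusted until the sample frequencies converge on the corpus frequencies, which is the kind of convergence the deviation figures in the abstract report.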
2 December 2013
166th Meeting of the Acoustical Society of America
2–6 December 2013
San Francisco, California
Session 2pSCa: Speech Communication
January 25 2014
Language material for English audiovisual speech recognition system development
Andrzej Czyzewski
Multimedia Systems, Gdansk University of Technology, Narutowicza, Gdansk, pomorskie 80-233 Poland
Bozena Kostek
Audio Acoustics Lab., Gdansk University of Technology, Narutowicza, Gdansk, pomorskie 80-233 Poland
Tomasz Ciszewski
Philology Faculty, University of Gdansk, Wita Stwosza, Gdansk, 80-952 Poland
Dorota Majewicz
Philology Faculty, University of Gdansk, Wita Stwosza, Gdansk, 80-952 Poland
Proc. Mtgs. Acoust. 20, 060002 (2013)
Article history: Received November 18 2013; Accepted January 22 2014
Citation
Andrzej Czyzewski, Bozena Kostek, Tomasz Ciszewski, Dorota Majewicz; Language material for English audiovisual speech recognition system development. Proc. Mtgs. Acoust. 2 December 2013; 20 (1): 060002. https://doi.org/10.1121/1.4864363