Medical masks have become necessary of late because of the COVID-19 outbreak; however, they attenuate the energy of speech signals and degrade speech quality. This study therefore proposes an optical-based microphone approach that acquires speech signals from the speaker's medical mask. Experimental results showed that the optical-based microphone approach achieved better performance (85.61%) than the two baseline approaches, namely, omnidirectional (24.17%) and directional (31.65%) microphones, in the case of long-distance speech and background noise. The results suggest that the optical-based microphone method is a promising approach for acquiring speech from a medical mask.

The classical method for obtaining speech information from users involves air-conducted microphones, which can capture high-quality speech in quiet environments and at short distances. However, there is still room for improvement in their performance under challenging conditions, such as distant speech and ambient noise.1–4 Specifically, when sound waves propagate in air, the energy of the speech signal decreases as the distance between the sound source and the sensor increases.1 The signals can also be easily contaminated by environmental noise in the vicinity of the sensor.2,3 Medical masks have become necessary of late because of the COVID-19 outbreak. However, a medical mask can have a deleterious effect on the transmission of sound waves: it attenuates the energy and distorts the high-frequency components of speech signals, thereby degrading speech quality.5,6 In addition, speech intelligibility suffers when speakers wear medical masks because of the reduction in visual information available to listeners.5,7

Several approaches have been proposed to alleviate the effects of distant speech and ambient noise, such as microphone array-based (directional microphone) speech enhancement using the minimum variance distortionless response (MVDR) beamforming algorithm.8 However, the speech signal distortion and energy decay caused by the medical mask remain problematic for microphone array-based technology. It can be observed that when people speak, the energy of the sound waves causes the medical mask to vibrate. We therefore assumed that the mask vibration is rich in speech information, which can be used to acquire the user's speech. Moreover, because the medical mask is the vibrating object surface closest to the sound source, namely, medical personnel or patients, the vibration caused by speech is transmitted directly to the mask. Consequently, the signal-to-noise ratio (SNR) remains high in noisy conditions, even when the sound waves of background noise reach the medical mask. To verify our assumption that the vibration signal of a medical mask provides useful speech information, an optical-based sensor, a laser Doppler vibrometer (LDV),9–13 is used in this study to acquire the mask vibration signals.

The LDV sensor provides an alternative method for distant speech acquisition based on the Doppler effect. More specifically, LDV sensors can capture distant target speech by irradiating an object surface near the sound source, such as a glass window, metal plate, or plastic box,9–11 with a laser beam and measuring vibrations at the micrometer or even nanometer scale. Based on this concept, previous studies (e.g., Sun et al.12 and Xie et al.13) combined LDV and conventional acoustic speech features to train a deep neural network (DNN) acoustic model, and thus improved automatic speech recognition (ASR) performance under both quiet and noisy conditions. Building on these earlier studies of LDV sensors in speech processing tasks, the primary purpose of this work is to investigate the feasibility of obtaining speech signals from medical masks using LDV sensors. Because the surface of a medical mask is a non-smooth reflective surface composed of fibers, it may hinder optical reflection. Furthermore, we investigate whether LDV sensors offer advantages for ASR systems under challenging conditions (e.g., distant speech and ambient noise) in comparison with a classical microphone array device.

Figure 1 illustrates the system proposed in this study, in which the LDV sensor is used as an optical-based microphone to record speech from a target sound source. The LDV sensor is a non-contact measurement device based on the principle of interferometry; measurements are made at the point where the laser beam strikes the object surface vibrated by the target sound source. The LDV equipment we used (LDV Vector-Series, Optomet, Darmstadt, Germany) is classified as class 2, which is considered safe for normal operation. Because a class 2 laser beam is typically below 1 mW, it does not damage human skin or mask materials. In our setup, the laser beam is directed at the mask worn by KEMAR (45BB-4 KEMAR head, GRAS, Holte, Denmark), which replaces a real speaker in this study to provide fair testing conditions. When KEMAR's artificial mouth plays audio, the vibration is transmitted directly to the medical mask and is easily captured by the LDV sensor.

Fig. 1.

Proposed optical-based speech acquisition system. BS, beam splitters; FM, frequency modulation.

The major components of an LDV sensor include a laser, a Bragg cell, a photodetector, an optical lens, an FM demodulator, and several beam splitters (BS). First, a laser beam with frequency f_0 is divided into a reference beam and an object beam by BS1. The object beam then passes through BS2 and is directed at the vibrating object (a medical mask) by an optical lens. After being reflected by the medical mask, the backscattered beam, carrying a Doppler shift f_d, travels back to BS2 and mixes with the reference beam at BS3. The Doppler shift depends on the instantaneous vibrational velocity v(t), as shown in Eq. (1), where t represents the time-domain parameter, α represents the angle between the object beam and the velocity vector, and λ represents the laser beam wavelength,
f_d(t) = 2 v(t) cos(α) / λ.  (1)
Meanwhile, the reference beam arrives at BS3 after passing through the Bragg cell, which introduces a frequency shift f_b. The backscattered beam and the reference beam mix and interfere at BS3, generating a signal with frequency shift f_d + f_b. Finally, the optical signal is converted to a voltage signal by the photodetector and passed to the FM demodulator, whose output signal z(t) recovers the vibration, with f_b serving as the carrier frequency and f_d as the modulating frequency. The expression for z(t) is given by Eq. (2), where A_v represents the amplitude and f_v represents the vibration frequency of the medical mask. Additional LDV sensor details can be found in Refs. 1, 9, 11, and 12,
z(t) = A_v sin(2π f_v t).  (2)
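As an illustration of Eqs. (1) and (2), the sketch below (our own, not the authors' code) synthesizes a single-tone mask vibration, computes the instantaneous Doppler shift, and builds the frequency-modulated heterodyne carrier. The wavelength, Bragg shift, and simulation rate are assumed illustrative values, not parameters from the study.

```python
import numpy as np

# Assumed illustrative constants (not from the paper)
LAMBDA = 1550e-9   # laser wavelength (m); depends on the actual LDV model
F_B = 40e6         # Bragg cell frequency shift (Hz), a typical value
FS = 200e6         # simulation sample rate (Hz)

def doppler_shift(v, alpha=0.0, lam=LAMBDA):
    """Eq. (1): instantaneous Doppler shift f_d(t) = 2 v(t) cos(alpha) / lambda."""
    return 2.0 * v * np.cos(alpha) / lam

def fm_carrier(v, fs=FS, f_b=F_B, lam=LAMBDA, alpha=0.0):
    """Heterodyne signal at the photodetector: a carrier at f_b whose
    instantaneous frequency is modulated by the Doppler shift f_d(t)."""
    f_inst = f_b + doppler_shift(v, alpha, lam)    # instantaneous frequency
    phase = 2.0 * np.pi * np.cumsum(f_inst) / fs   # integrate frequency to phase
    return np.cos(phase)

# Mask vibrating at f_v = 1 kHz with 1 mm/s peak velocity, as in Eq. (2)
t = np.arange(0, 1e-3, 1.0 / FS)
v = 1e-3 * np.sin(2 * np.pi * 1000 * t)
sig = fm_carrier(v)
print(doppler_shift(1e-3))  # peak Doppler shift in Hz for this velocity
```

An FM demodulator then recovers v(t), and hence the speech-band vibration A_v sin(2π f_v t), from this carrier.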

The purpose of this study is to investigate the benefits of an optical-based microphone under distant-speech and ambient-noise conditions when the speaker is wearing a medical mask. First, we asked a speaker to record his voice according to the Taiwan Mandarin hearing in noise test (TMHINT) corpus14 and normalized the energy of each recorded sentence. We then used KEMAR (45BB-4 KEMAR head, GRAS) fitted with a medical mask to simulate the application scenario, playing the speech from KEMAR's artificial mouth. Note that the masks we used are standard medical masks made of polypropylene, which in recent years have been recommended for the public to help prevent the spread of COVID-19. Using KEMAR avoids confounding factors present in real speakers, such as involuntary movements of the head, articulators, and airstream, and thus provides fair testing conditions for the different approaches. Moreover, because the energy of each sentence was adjusted to be the same, the acoustic characteristics of each test remain consistent. For recording, we used three microphone systems: an omnidirectional microphone (CDCU-04IP VX-1, IPEVO, Sunnyvale, CA), a directional microphone (CDCU-04IP VX-1, IPEVO), and the proposed system (LDV Vector-Series, Optomet), with the sentences played at a 65-dB sound pressure level (SPL). In the experimental stage, we recorded the speech with all three approaches at three distances (1.5, 2, and 2.5 m) under quiet and noisy conditions. To simulate conversational noise, Institute of Electrical and Electronics Engineers (IEEE) speech-shaped noise (SSN) was presented as background noise at 65 dB SPL. The entire experiment was conducted inside an audiometric room to keep the environmental variables consistent.
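The per-sentence energy normalization described above can be implemented as a simple RMS scaling. The sketch below is our own illustration; the target level is an assumed value, not one reported in the study.

```python
import numpy as np

def rms_normalize(x, target_rms=0.05):
    """Scale a speech waveform so every sentence has the same RMS energy.
    target_rms is an assumed illustrative level, not a value from the study."""
    rms = np.sqrt(np.mean(np.square(x)))
    if rms == 0:
        return x  # silent input: nothing to scale
    return x * (target_rms / rms)

# Two sentences recorded at different loudness end up with identical energy
a = rms_normalize(0.3 * np.random.randn(16000))
b = rms_normalize(0.01 * np.random.randn(16000))
print(np.sqrt(np.mean(a**2)), np.sqrt(np.mean(b**2)))  # both ~0.05
```

Equalizing the sentences this way ensures that accuracy differences between microphones are not driven by playback level.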
After the recording, we conducted two types of objective evaluation to compare the microphone approaches, namely spectrogram analysis and Google ASR accuracy, to verify the benefits of the optical-based microphone. In the spectrogram analysis, we compare the similarity between the corpus recorded by each microphone and the original corpus. In the Google ASR test, recognition accuracy is used to assess the speech quality delivered by each microphone approach at different distances.
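For Mandarin test sentences, ASR accuracy is typically computed at the character level from the edit distance between the hypothesis and the reference transcript. The paper does not give its exact scoring procedure, so the following is a generic sketch of that metric.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between reference and hypothesis character sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def char_accuracy(ref, hyp):
    """Accuracy (%) = (N - edit distance) / N * 100 over reference characters."""
    if not ref:
        return 0.0
    return max(0.0, (len(ref) - edit_distance(ref, hyp)) / len(ref) * 100.0)

print(char_accuracy("今天天氣很好", "今天天氣好"))  # one deleted character → ~83.3%
```

A per-condition accuracy is then the average of this score over all test sentences.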

Figure 2 depicts the spectrograms of the original speech signal and the corpora recorded by the three microphone configurations at a 2-m distance in noisy conditions. A broad noise band is visible in the omnidirectional and directional microphone recordings, whereas the speech derived from the optical-based microphone is clear and similar to the original speech. Moreover, in the high-frequency region around 2 kHz indicated by the red arrows, the high-frequency components of the speech recorded by the LDV are clearer and more intact than those recorded by the omnidirectional and directional microphones. In other words, the optical-based microphone provides clearer low- to mid-frequency information (e.g., below 2 kHz), the range that holds the richest speech information, whereas this range is blurred in the recordings from the air-conducted microphones. Based on these results, we conclude that the speech acquired by the optical-based microphone exhibits higher quality than that from the air-conducted microphones. Note that the harmonics running from bottom to top at 3 s in Fig. 2(d) represent a phenomenon called speckle noise, which arises from randomly distributed light interference caused by the irregularly arranged polypropylene fibers in the mask.15 It is audible as a crackling sound in the recorded audio and produces a harmonic band across a wide range of frequencies in the spectrogram, which is absent from the target speech in Fig. 2(a). Moreover, certain high-frequency components are lost in the optical-based corpus, whereas the low-frequency components remain intact.
To address this issue, frequency bandwidth extension,16,17 voice conversion,18,19 and speech enhancement20 technologies, which can synthesize the high-band frequency components (i.e., 4–8 kHz) from the low-band frequencies (i.e., 0–4 kHz), could be applied in future investigations.
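The spectrogram comparison above can be reproduced with a standard short-time Fourier transform. This is our own sketch; the 32-ms window and 8-ms hop are assumed analysis parameters, not values stated in the paper.

```python
import numpy as np

def log_spectrogram(x, fs=16000, win_ms=32, hop_ms=8):
    """Log-magnitude STFT; window and hop sizes are assumed analysis values."""
    n = int(fs * win_ms / 1000)          # 512-sample Hann window at 16 kHz
    hop = int(fs * hop_ms / 1000)        # 128-sample hop
    win = np.hanning(n)
    frames = np.array([x[i:i + n] * win
                       for i in range(0, len(x) - n + 1, hop)])
    spec = np.abs(np.fft.rfft(frames, axis=1)).T   # (freq bins, time frames)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, 20 * np.log10(spec + 1e-10)

# Sanity check: a pure 1 kHz tone concentrates its energy near the 1 kHz bin
fs = 16000
x = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)
freqs, s = log_spectrogram(x, fs)
print(freqs[np.argmax(s.mean(axis=1))])  # expected near 1000.0 Hz
```

Plotting such log-magnitude matrices for the target speech and each microphone recording yields panels of the kind shown in Fig. 2, in which noise bands and missing high-frequency content are visible directly.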

Fig. 2.

Comparison of spectrograms recorded by different devices: (a) target speech, (b) omnidirectional microphone, (c) directional microphone, and (d) optical-based sensors.


Figure 3 shows the Google ASR accuracy of the aforementioned microphone approaches under quiet conditions; the speech level is 65 dB SPL throughout this study. The x axis represents the distance between the microphones and the sound source, and the y axis represents the recognition accuracy of Google ASR. Evidently, a performance gap exists between the optical-based microphone and the baseline microphone systems. More specifically, the accuracy of the baseline microphone systems decreased by approximately 16.5% when the distance increased from 1.5 to 2.5 m, whereas the accuracy of the optical-based microphone system decreased by only 7% over the same range. We attribute this gap to two factors. First, the medical mask attenuates the high-frequency components and absorbs energy, thus blurring the speech and lowering the volume. Second, the energy radiated by the KEMAR is attenuated as the distance between the microphones and the sound source increases. Neither effect degrades the optical-based microphone, because the LDV sensor acquires the sound signal optically from the mask vibration rather than from air-conducted sound waves. Therefore, the accuracy of the two baseline systems (omnidirectional and directional microphones) dropped below that of the optical-based microphone approach.

Fig. 3.

Google ASR performance in quiet conditions with different distances at 65 dB SPL speech level.


Figure 4 illustrates the Google ASR accuracy of the three microphone types under noisy conditions. Notably, compared with the quiet conditions shown in Fig. 3, the baseline system accuracies were significantly degraded by the noise, particularly at 2.5 m. In contrast, the LDV sensor maintained stable accuracy across the 1.5, 2.0, and 2.5 m testing conditions. Because the medical mask measured by the LDV sensor is the object surface closest to the sound source, the vibration is transmitted directly to the mask; as a result, the SNR remained high even under 65 dB SPL background noise. In conclusion, the results indicate that the optical-based approach surpasses the baseline approaches under adverse conditions such as distant speech and background noise.

Fig. 4.

Google ASR performance in noisy conditions with different distances.


Distant speech and ambient noise are bottlenecks for the performance of conventional air-conducted microphones. A commonly used medical mask further distorts the voice and degrades speech quality, thereby reducing the accuracy of speech recognition systems. To alleviate these issues, we proposed an optical-based microphone approach that acquires speech signals from a speaker wearing a medical mask for robust speech recognition. The classical omnidirectional and directional microphones were used as baseline systems for comparison. The experimental results indicated that the proposed optical-based microphone system achieved better performance than the classical microphone array-based system. These results imply that the optical-based microphone is a promising approach for enhancing the performance of voice-control systems under challenging conditions when the speaker is wearing a medical mask. In addition, this study demonstrates that the vibration signals on speakers' medical masks contain useful information for speech recognition applications.

This study was supported by the Ministry of Science and Technology of Taiwan under the 110–2218-E-A49A-501 project. The authors would like to thank IEA Electro-Acoustic Technology Co., Ltd. for providing the experimental devices.

1. Y. Deng, "Long range standoff speaker identification using laser Doppler vibrometer," in Proceedings of the 2016 IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS), Niagara Falls, NY (September 6–9, 2016).
2. L. Armani, M. Matassoni, M. Omologo, and P. Svaizer, "Use of a CSP-based voice activity detector for distant-talking ASR," in Proceedings of the Eighth European Conference on Speech Communication and Technology, Geneva, Switzerland (September 1–4, 2003).
3. H. K. Kim and R. C. Rose, "Cepstrum-domain acoustic feature compensation based on decomposition of speech and noise for ASR in noisy environments," IEEE Trans. Speech Audio Process. 11(5), 435–446 (2003).
4. C. Ris and S. Dupont, "Assessing local noise level estimation methods: Application to noise robust ASR," Speech Commun. 34(1), 141–158 (2001).
5. S. R. Atcherson, L. L. Mendel, W. J. Baltimore, C. Patro, S. Lee, M. Pousson, and M. J. Spann, "The effect of conventional and transparent surgical masks on speech understanding in individuals with and without hearing loss," J. Am. Acad. Audiol. 28(1), 58–67 (2017).
6. L. L. Mendel, J. A. Gardino, and S. R. Atcherson, "Speech understanding using surgical masks: A problem in health care?," J. Am. Acad. Audiol. 19(9), 686–695 (2008).
7. C. Llamas, P. Harrison, D. Donnelly, and D. Watt, "Effects of different types of face coverings on speech acoustics and intelligibility," York Pap. Linguist. 2(9), 80–104 (2009).
8. S. Araki, N. Ono, K. Kinoshita, and M. Delcroix, "Meeting recognition with asynchronous distributed microphone array using block-wise refinement of mask-based MVDR beamformer," in Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada (April 15–20, 2018), pp. 5694–5698.
9. Y. Avargel and I. Cohen, "Speech measurements using a laser Doppler vibrometer sensor: Application to speech enhancement," in Proceedings of the 2011 Joint Workshop on Hands-free Speech Communication and Microphone Arrays, Edinburgh, UK (May 30–June 1, 2011), pp. 109–114.
10. R. Li, T. Wang, Z. Zhu, and W. Xiao, "Vibration characteristics of various surfaces using an LDV for long-range voice acquisition," IEEE Sens. J. 11(6), 1415–1422 (2011).
11. W. Li, M. Liu, Z. Zhu, and T. S. Huang, "LDV remote voice acquisition and enhancement," in Proceedings of the 18th International Conference on Pattern Recognition (ICPR'06), Hong Kong (August 20–24, 2006), pp. 262–265.
12. L. Sun, J. Du, Z. Xie, and Y. Xu, "Auxiliary features from laser-Doppler vibrometer sensor for deep neural network based robust speech recognition," J. Signal Process. Syst. 90(7), 975–983 (2018).
13. Z. Xie, J. Du, I. McLoughlin, Y. Xu, F. Ma, and H. Wang, "Deep neural network for robust speech recognition with auxiliary features from laser-Doppler vibrometer sensor," in Proceedings of the 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), Tianjin, China (October 17–20, 2016).
14. L. L. Wong, S. D. Soli, S. Liu, N. Han, and M.-W. Huang, "Development of the Mandarin hearing in noise test (MHINT)," Ear Hear. 28(2), 70S–74S (2007).
15. Y. Wang, W. Zhang, X. Kong, Y. Wang, and H. Zhang, "Two-sided LPC-based speckle noise removal for laser speech detection systems," IEICE Trans. Inf. Syst. E104.D(6), 850–862 (2021).
16. P. Jax and P. Vary, "On artificial bandwidth extension of telephone speech," Signal Process. 83(8), 1707–1719 (2003).
17. S. Chennoukh, A. Gerrits, G. Miet, and R. Sluijter, "Speech enhancement via frequency bandwidth extension using line spectral frequencies," in Proceedings of the 2001 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Salt Lake City, UT (May 7–10, 2001), pp. 665–668.
18. W. Z. Zheng, J. Y. Han, C. K. Lee, Y. Y. Lin, S. H. Chang, and Y. H. Lai, "Phonetic posteriorgram-based voice conversion system to improve speech intelligibility of dysarthric patients," Comput. Methods Programs Biomed. 215, 106602 (2022).
19. C.-Y. Chen, W.-Z. Zhen, S.-S. Wang, Y. Tsao, P.-C. Li, and Y.-H. Lai, "Enhancing intelligibility of dysarthric speech using gated convolutional-based voice conversion system," in Proc. Interspeech (2020).
20. X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder," in Proc. Interspeech (2013), pp. 436–440.