Speech recognition using air-conduction microphones is less accurate under high-noise conditions and when the speaker's voice is relatively quiet. In this study, the effect of the mounting location of throat microphones (which are less susceptible to ambient noise) on recognition accuracy was experimentally investigated. The results confirmed that mounting position and speaker gender affected recognition accuracy, regardless of the speech recognition system used. In addition, relatively higher recognition accuracy was observed in the upper part of the neck near the mandibular angle for both males and females.
1. Introduction
According to a World Health Organization fact sheet,1 at least 2.2 × 10^9 people worldwide are visually impaired, and population growth and aging may put more people at risk of developing visual impairment. With the widespread use of smartphones, many visually impaired people now use them for phone calls, e-mail, internet browsing, and other purposes.2,3 Smartphones have also been reported to be used instead of, or in combination with, conventional visual aids, such as magnifiers.4 Visually impaired users can operate a smartphone in various ways, including touch, Braille displays, gesture recognition, and voice recognition.5,6 However, a visually impaired person using a white cane or accompanied by a guide dog risks falling if they use both hands to hold and operate a smartphone. The need to use applications without relying on vision is therefore high in the daily lives of the visually impaired. Gesture recognition and voice recognition, which can be performed without vision and without occupying both hands, can contribute to safer smartphone use for the visually impaired. Because speech recognition is already integrated into mobile platforms, such as iOS (Cupertino, CA) and Android (Mountain View, CA) devices, this study focused on speech recognition. However, protecting the privacy of spoken conversations is also an important issue, which has led to the development of guidelines for sound and vibration design in healthcare facilities,7 research on predictors of speech security in offices,8 and research on parameters used in the design of closed-office and open-plan facilities,9,10 as well as the standardization of methods for measuring and evaluating speech privacy by the American Society for Testing and Materials (ASTM).11 Accordingly, it is necessary to consider the protection of privacy during voice recognition.
Therefore, if voice recognition in noisy environments can be performed hands-free and in a quiet voice, the visually impaired can be expected to operate their smartphones in a safe and privacy-conscious manner.
While this study focuses on the visually impaired, voice operation is becoming more popular in general with the proliferation of smart speakers and other devices. The Smart Audio Report from National Public Media12 reported in its 2022 survey results that 62% of Americans used voice-operated assistants on smartphones and other devices. On the other hand, a 2020 survey by PERFICIENT13 reported that nearly 50% of people feel annoyed when listening to other people's voice commands. Therefore, the application of voice recognition using a throat microphone for whispered speech may benefit not only the visually impaired, but also a wide range of people.
There are several means of voice input: air-conduction microphones, bone-conduction microphones placed on the head or face, non-audible murmur (NAM) microphones, and throat microphones. The accuracy of speech recognition with air-conduction microphones decreases when the signal-to-noise ratio is low, that is, when the background noise level is high or the user's voice level is low. In contrast, bone-conduction, NAM, and throat microphones are used in close contact with the skin; their acoustic characteristics differ from those of air-conduction microphones, and they have the advantage of being less affected by background noise.14–16 Bone-conduction microphones may require a helmet to hold them in place on the head. Throat microphones are less likely to need such equipment and are easier to use. Therefore, this study focused on the applicability of throat microphones to voice recognition.
A throat microphone is a type of contact microphone that is attached to the neck and picks up vibrations of the skin. Although it is less susceptible to background noise, its acoustic characteristics differ from those of air-conduction microphones, which reduces recognition accuracy in speech recognition systems trained on air-conduction speech. For this reason, research has been conducted on improving the accuracy of speech recognition in noisy environments by combining air-conduction and throat microphones,16–18 and on improving speech quality and speech recognition accuracy through speech processing.19–22 In addition, research on the mounting position has been conducted as a preliminary step to these studies.23–25 However, most studies on the position of the throat microphone have varied it only in the vertical direction of the neck. In addition, recognition accuracy may differ between men and women because of differences in the development of the laryngeal prominence, but the effect of gender has not been examined. In this study, speech recognition accuracy was investigated for male and female subjects while the throat microphone was moved in both the anterior–posterior and vertical directions. The aim was to contribute, as a first step, to establishing mounting conditions for throat microphones that enable hands-free, quiet-voice speech recognition under noisy conditions. The results of experiments using generally available speech recognition systems are discussed.
2. Methods
2.1 Mounting positions and instruction of subjects
Eight locations for throat microphones, shown in Figs. 1(a) and 1(b), were established: two levels in the vertical direction and four locations in the anterior–posterior direction. Figure 1(c) shows a schematic of the pharyngeal structure. The two vertical levels are the upper part of the thyroid cartilage (P1, P3, P5, and P7) and the cricothyroid ligament (P2, P4, P6, and P8). The four anterior–posterior locations are the frontal locations (P1 and P2), the locations adjacent to the protrusion of the cartilage (P3 and P4), the mandibular angle (P7 and P8), and the locations between P3 and P7 (P5) and between P4 and P8 (P6). The content of Figs. 1(a) and 1(b) was described orally to the subjects, who were asked to attach the throat microphone themselves to simulate its use in a practical environment. During the explanation, the upper cricothyroid cartilage was paraphrased as the laryngeal ligament. For the cricothyroid ligament, the cricothyroid cartilage was described as the cartilage protruding in the middle of the neck and then rephrased as an indentation located above the cricothyroid cartilage. In this experiment, measurements were taken only on the right side of the neck.
Fig. 1. (a) Mounting positions of throat microphones on the surface of a subject's neck. (b) Schematic of these positions. (c) Schematic diagram of the throat structure.
2.2 Recording of voice data
The subjects were five males and five females in their 20s [mean, 22.7 years; standard deviation (SD), 0.6 years]. The recording equipment comprised a condenser microphone (Sony, ECM-SP10, Tokyo, Japan), a throat microphone (Retevis, C9038A, Shenzhen, China), and a smartphone (Sony, Xperia Z2 SO-03F). Monaural recordings were made at a sampling frequency of 44 100 Hz with 16-bit quantization. The condenser microphone was attached to the subject's chest, and the throat microphone was attached to one of the designated positions. Recordings were made with both the throat and condenser microphones at each position. Each subject was asked to speak three sentences, five times each, at each position. The sentences were taken from the parallel100 subset of the Japanese versatile speech (JVS) corpus.26 The equivalent continuous A-weighted sound pressure level during utterance was kept within 61.5–63.5 dB to avoid large variation in voice level among subjects. The sound pressure level was measured using a precision sound level meter (Rion, NL-52A, Kokubunji, Tokyo, Japan) at a reception point 50 cm from each subject's mouth.
Speech data were collected twice for each subject over a period of at least 1 week. For one male and one female subject only, data were collected four times.
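As a rough illustration of the level criterion in this subsection, the equivalent continuous sound pressure level of a signal can be written as L_eq = 10 log10(mean(p^2) / p0^2), with p0 = 20 µPa. The sketch below assumes that the pressure samples are already A-weighted and calibrated in pascals. The function names and the synthetic test signal are hypothetical, and the study itself relied on a precision sound level meter rather than on code like this.

```python
import math

P_REF = 20e-6  # reference sound pressure in pascals (20 µPa)


def equivalent_level_db(pressures_pa):
    """Equivalent continuous level (dB) of calibrated pressure samples.

    Assumes the samples are already A-weighted; no weighting filter is applied.
    """
    if not pressures_pa:
        raise ValueError("at least one pressure sample is required")
    mean_square = sum(p * p for p in pressures_pa) / len(pressures_pa)
    if mean_square == 0:
        raise ValueError("level is undefined for an all-zero signal")
    return 10.0 * math.log10(mean_square / (P_REF * P_REF))


def within_target(level_db, low=61.5, high=63.5):
    """True if the level lies in the 61.5-63.5 dB range used in the experiment."""
    return low <= level_db <= high


if __name__ == "__main__":
    # Hypothetical signal: 0.1 s of a 200 Hz sine with an RMS pressure of 0.025 Pa.
    amplitude = 0.025 * math.sqrt(2)
    samples = [amplitude * math.sin(2 * math.pi * 200 * i / 44100) for i in range(4410)]
    level = equivalent_level_db(samples)
    print(f"L_eq = {level:.1f} dB, within target: {within_target(level)}")
```

A check of this kind could be used to flag utterances that fall outside the target range before they are passed to a recognizer.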
2.3 Calculation of recognition accuracy
Three automatic speech recognition (ASR) systems (hereafter referred to as System I, System II, and System III) were used in this experiment. These are generic, off-the-shelf ASR systems that are not designed specifically for throat-microphone speech. The recorded speech data were input into each speech recognition system, and the character error rate (CER) was calculated from the output results.
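The exact CER formula and any text normalization are not given here, so the following is only a minimal sketch that assumes the usual definition: the character-level edit distance (substitutions, deletions, and insertions) between the reference sentence and the ASR output, divided by the number of reference characters and expressed as a percentage. The function names are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character edits needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference character
                            curr[j - 1] + 1,      # insertion of a hypothesis character
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER in percent, normalized by the reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return 100.0 * levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical reference sentence and recognizer output.
    print(character_error_rate("throat microphone", "throat microphones"))
```

Because insertions are counted, a CER computed this way can exceed 100% when the recognizer output is much longer than the reference.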
2.4 Statistical analysis
To investigate whether the mounting position and gender produced significant differences, a three-way analysis of variance (ANOVA) was performed on the throat-microphone results using JMP28 (JMP Statistical Discovery LLC), with the speech recognition system, mounting position, and gender as factors and a significance level of 5%. In addition, intraclass correlation coefficients (ICCs) were calculated to investigate within-subject reproducibility. This calculation was performed in R using the irr package (version 0.84.1).
3. Results and discussion
3.1 Character error rates (CERs) for each microphone and each ASR system
The CERs obtained in the experiment with the condenser microphone and the throat microphone are shown in Fig. 2. The figure shows that the CER for the condenser microphone is lower than that for the throat microphone for every speech recognition system. For the throat microphone, the CER is lower at P5–P8 than at P1–P4 for every speech recognition system. In addition, the CERs were generally lower at odd-numbered positions than at even-numbered positions, that is, lower on the upper thyroid cartilage than on the cricothyroid ligament.
Fig. 2. (a) CERs obtained by the condenser microphone. (b) CERs obtained by the throat microphone.
3.2 Influence of mounting conditions
A three-way ANOVA was performed with the voice recognition system, mounting position, and gender as factors at a 5% level of significance. The results of this statistical analysis performed with JMP are shown in Table 1. The table shows that the second-order (three-factor) interaction was not significant, whereas all first-order (two-factor) interactions were significant. The main effects of all three factors were also significant.
Table 1. Results of ANOVA. The notation “A-B” expresses the interaction of the factors.
Factor | Sum of squares | df | Mean square | F-value | P-value |
---|---|---|---|---|---|
Gender | 31 411 | 1 | 31 411 | 6.6 | <0.05^a |
Mounting position | 274 072 | 7 | 39 153 | 98.6 | <0.001^b |
ASR system | 50 605 | 2 | 25 303 | 63.7 | <0.001^b |
ASR system–Gender | 4924 | 2 | 2462 | 6.2 | <0.05^a |
Mounting position–Gender | 26 858 | 7 | 3837 | 9.7 | <0.001^b |
ASR system–Mounting position | 13 616 | 14 | 973 | 2.4 | <0.05^a |
ASR system–Gender–Mounting position | 4774 | 14 | 341 | 0.9 | 0.604 (NS) |
^a p < 0.05.
^b p < 0.001.
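For readers less familiar with ANOVA tables, the columns of Table 1 are linked by two standard relations: each mean square is the sum of squares divided by its degrees of freedom, and each F-value is the effect mean square divided by the residual (error) mean square. The sketch below recomputes the mean squares from the sums of squares and degrees of freedom listed in the table. The residual mean square is not reported in the table, so the `ms_error` argument of `f_ratio` is a placeholder, and the function names are hypothetical.

```python
# (factor, sum of squares, degrees of freedom) as listed in Table 1
TABLE_1_ROWS = [
    ("Gender", 31411, 1),
    ("Mounting position", 274072, 7),
    ("ASR system", 50605, 2),
    ("ASR system-Gender", 4924, 2),
    ("Mounting position-Gender", 26858, 7),
    ("ASR system-Mounting position", 13616, 14),
    ("ASR system-Gender-Mounting position", 4774, 14),
]


def mean_square(sum_of_squares, df):
    """Mean square = sum of squares / degrees of freedom."""
    if df <= 0:
        raise ValueError("degrees of freedom must be positive")
    return sum_of_squares / df


def f_ratio(ms_effect, ms_error):
    """F-value = effect mean square / residual mean square (ms_error is not given in Table 1)."""
    if ms_error <= 0:
        raise ValueError("residual mean square must be positive")
    return ms_effect / ms_error


if __name__ == "__main__":
    for factor, ss, df in TABLE_1_ROWS:
        print(f"{factor}: mean square = {mean_square(ss, df):.1f}")
```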
The first-order interactions are shown in Figs. 3(a)–3(c). Note that in Figs. 3(a) and 3(b), CERs are sorted in descending order of the values for males for the sake of clarity. Simple main effect tests showed that the effect of mounting position was significant for all speech recognition systems (ps < 0.001) and followed a similar trend across systems [Fig. 3(a)]. The effect of mounting position was also significant for both men and women (ps < 0.001); men were more strongly affected than women, although the overall trends were similar [Fig. 3(b)]. Likewise, the effect of gender was significant for all speech recognition systems (ps < 0.001), with a similar trend across systems [Fig. 3(c)]. With respect to mounting position, the effect of gender was significant at P1 and P5–P8 but not at P2–P4 (P2: p = 0.553; P3: p = 0.054; P4: p = 0.093; other positions: ps < 0.001).
Fig. 3. Interaction (a) between mounting position and ASR system, (b) between mounting position and gender, and (c) between ASR system and gender. Multiple comparison of (d) mounting position in ASR system I, (e) mounting position in ASR system II, and (f) mounting position in ASR system III. Multiple comparison of mounting position in (g) male subjects and (h) female subjects.
In addition, multiple comparisons were made using the Dunnett method with P7, which gave the lowest CER under all conditions, as the reference mounting position. The results are shown in Figs. 3(d)–3(h). Figures 3(d)–3(f) compare the results for each recognition system, and Figs. 3(g) and 3(h) compare the results for males and females. For males, only P5 showed no significant difference from P7; for females and for each recognition system, neither P5 nor P8 differed significantly from P7. Therefore, the throat microphone is considered able to achieve a low CER in the upper part of the neck near the mandibular angle, regardless of gender or recognition system, although the CER values themselves differ depending on gender and recognition system.
At P5–P8, the throat microphone is in close contact with the skin because of the muscle and fat beneath it, whereas at P1–P4 the contact area is small because of the cartilage ridges. The contact of the throat microphone at P1–P4 may therefore have become unstable because of movement during speech, which may be one reason why the CER is lower at P5–P8 than at P1–P4. In addition, several large-amplitude signals at P1–P4 were clipped, so clipping noise may also have had an effect. As for the higher CER in the lower neck compared with the upper neck, the lower positions lie off the thyroid cartilage, to which the vocal cords are connected, so the acoustic characteristics may have changed during vibration transmission.
Munger and Thomason25 investigated the relationship between mounting position and gender using accelerometers. In their study, 14 males and 10 females were asked to speak several phonemes after accelerometers were attached to 15 locations on their heads and necks. Their speech was then recorded by the accelerometers and by air-conduction microphones. The power spectral density was calculated from the voice data of both devices and used as an index to evaluate and compare the quality of speech at each mounting position. On the neck, three sound-receiving points were measured and evaluated. Although the mounting positions were not described in detail, a comparison of the figure showing the mounting positions in the paper by Munger and Thomason with the positions in this study indicated that they were almost identical to P2, P7, and P8 in this study. Therefore, the study by Munger and Thomason25 provides results useful for comparison with this study.
Table 2(a) compares the ranking of the mounting positions by the quality obtained, and Table 2(b) compares the gender that obtained better quality at each mounting position. For males, the first- and third-ranked positions are interchanged between the study by Munger and Thomason25 and this study. In addition, the gender that obtained better quality at each mounting position differs between the two studies. However, the two studies agree on the ranking for females: P7, P8, and P2, in that order.
Table 2. (a) Comparison of the study by Munger and Thomason (Ref. 25) and this study regarding the ranking of mounting position. (b) Gender that obtained better quality at each mounting position.
(a)
Study | Gender | First position | Second position | Third position |
---|---|---|---|---|
Munger and Thomason | Male | P2 | P8 | P7 |
This study | Male | P7 | P8 | P2 |
Munger and Thomason | Female | P7 | P8 | P2 |
This study | Female | P7 | P8 | P2 |
(b)
Study | P2 | P7 | P8 |
---|---|---|---|
Munger and Thomason | Male | Female | Female |
This study | - | Male | Male |
3.3 Reproducibility within each subject
Because some subjects showed a large difference in CER between the first and second experiments, ICC(1, 1) was calculated in R using the irr package to check the within-subject reproducibility of the experiment. The results are shown in Fig. 4. Most male subjects showed reproducibility, whereas three of the five female subjects did not. This difference may be due to variation in the vertical position of the throat microphone for female subjects, which may be caused by the difficulty of locating the laryngeal prominence.
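For reference, ICC(1, 1) is the one-way random-effects, single-measurement intraclass correlation, ICC(1, 1) = (MSB - MSW) / (MSB + (k - 1) MSW), where MSB and MSW are the between-target and within-target mean squares and k is the number of repeated measurements per target. The sketch below is a plain-Python version of that formula, not the irr implementation used in the study. How the CER data were arranged is not stated here, so the layout (one row per measured condition, one column per session), the function name, and the example values are assumptions.

```python
def icc_1_1(ratings):
    """ICC(1, 1) for a table with one row per target and k repeated measurements per row."""
    n = len(ratings)
    if n < 2:
        raise ValueError("at least two targets (rows) are required")
    k = len(ratings[0])
    if k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("every row needs the same number (>= 2) of measurements")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # One-way ANOVA decomposition: between-target and within-target mean squares
    ss_between = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_within = sum((x - m) ** 2 for row, m in zip(ratings, row_means) for x in row)
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ICC is undefined when all measurements are identical")
    return (ms_between - ms_within) / denominator


if __name__ == "__main__":
    # Hypothetical CERs (%) for four conditions measured in two sessions.
    example = [[12.0, 14.0], [30.0, 27.0], [55.0, 60.0], [8.0, 9.0]]
    print(f"ICC(1, 1) = {icc_1_1(example):.3f}")
```

Values near 1 indicate that repeated sessions gave nearly the same CER for each condition; values near or below 0 indicate that session-to-session variation is as large as the variation between conditions.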
4. Limitations
There are three limitations to this study. First, only three factors that affect recognition accuracy were analyzed: recognition system, gender, and mounting position. Uncontrolled factors may therefore have influenced the results. Second, the exact coordinates of the throat microphone placement were not specified. Finally, the ages of the subjects were biased: all subjects in this experiment were very close in age. Building on the results obtained in this study, future studies should use samples with more age variation.
5. Conclusions
This study examined whether the position of the throat microphone and gender affected the accuracy of speech recognition, as well as the reproducibility within each subject. The results of the experiment using eight throat microphone positions showed that both the position of the microphone and gender affected the accuracy of speech recognition. Although the CER values differed depending on gender and recognition system, it was confirmed that a lower CER could be obtained under all conditions when the microphone was placed in the upper part of the neck, close to the mandibular angle. The results also indicated that particular attention may be needed when placing throat microphones on female users.