While speaking, hand postures, such as holding a hand in front of the mouth or cupping the hands around the mouth, influence human voice directivity. This study presents and analyzes spherical voice directivity datasets of an articulated [a] with and without hand postures. The datasets were determined from measurements with 13 subjects in a surrounding spherical microphone array with 32 microphones and then upsampled to a higher spatial resolution. The results show that hand postures strongly impact voice directivity and affect the directivity index by up to 6 dB, which is more than variances caused by phoneme-dependent differences.
1. Introduction
The analysis of directional properties of human voice sound radiation has been a subject of research for more than 200 years. While early scientific studies (Henry, 1857; Saunders, 1790; Wyatt, 1813) analyzed the directional radiation of speech in general, a first study by Trendelenburg (1929) compared the directivity patterns for several vowels and fricatives in the horizontal plane. Some years later, Dunn and Farnsworth (1939) carried out a comprehensive study analyzing the spherical sound radiation for a spoken sentence in third-octave bands from 63 Hz up to 12 kHz. Since then, a large number of follow-up studies investigated a variety of specific aspects of human voice radiation, such as phoneme dependencies (Katz and D'Alessandro, 2007; Kocon and Monson, 2018; Monson et al., 2012; Pörschmann and Arend, 2021), the influence of voice level (Marshall and Meyer, 1985), differences between speaking and singing (Monson et al., 2012), gender differences (Monson et al., 2012), or recently the influence of face masks (Pörschmann et al., 2020).
Although it is intuitively clear that hand postures affect voice radiation, to the best of our knowledge, their influence has not yet been systematically investigated. This paper examines two widely used postures: (a) holding a hand in front of the mouth and (b) cupping the hands around the mouth while speaking. Investigating these two example hand postures extends earlier research, in which we examined statistically significant effects of various phonemes influencing human voice directivity (Pörschmann and Arend, 2021). Hand postures influence the radiated sound field through both diffraction and reflections and become relevant if the hands' dimensions are on the order of magnitude of the wavelength. Given the dimensions of the hands, the shadowing effects thus have only a minor influence on directivity at low frequencies. We show for the articulation of an [a] that the postures affect the directivity index by up to 6 dB and that the differences from the articulation without hand posture are statistically significant, especially towards higher frequencies. The datasets determined in this study make it possible to account adequately for hand postures in auralizations of voice radiation of avatars in virtual reality (VR) and augmented reality (AR). In addition, a better understanding of the effects of hand postures on voice directivity can provide a basis for further research on human communication in casual situations.
2. Materials
2.1 Measurements
We performed the measurements in the anechoic chamber of TH Köln (width: 4.5 m, depth: 11.7 m, height: 2.3 m), which has a lower cut-off frequency of about 200 Hz. The measurements took place in a surrounding spherical microphone array (Arend et al., 2019; Arend et al., 2017) with a diameter of 2 m and the shape of a pentakis dodecahedron, with 32 cardioid microphones (Rode NT5) located at the vertices. This sampling scheme allows resolving the directivity up to a spatial order of N = 4 (Pollow, 2015). We positioned an additional microphone of the same type at the frontal reference position (azimuth 0°, elevation 0°). We used four RME Octamic II devices as preamplifiers and AD converters for the 32 microphone signals of the array. Two RME Fireface UFX served as audio interfaces, one of which was used as preamplifier and AD converter for the reference microphone. We placed the subjects' heads precisely in the center of the array by adapting the height of the seat, continuously monitored the position during the measurements using a cross-line laser, and gave instructions for readjustment if required. Figure 1 shows the surrounding spherical microphone array with a person sitting inside and holding the hand in front of the mouth (left) or cupping the hands around the mouth (right).
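The stated order limit can be cross-checked with a simple counting argument: a spherical harmonics expansion of order N has (N + 1)² coefficients, so at least that many sampling points are needed. The following minimal sketch illustrates this; the function names are hypothetical, and the bound is only a necessary condition, since the order that can actually be resolved also depends on the geometry of the sampling scheme.

```python
import math


def num_sh_coefficients(order: int) -> int:
    """Number of spherical harmonics coefficients up to and including `order`."""
    return (order + 1) ** 2


def max_sh_order(num_points: int) -> int:
    """Largest order N with (N + 1)**2 <= num_points.

    This is a counting bound only; it does not guarantee that a given
    sampling scheme is well conditioned at that order.
    """
    return math.isqrt(num_points) - 1


if __name__ == "__main__":
    # 32 microphones as in the array described above
    print("coefficients at N = 4:", num_sh_coefficients(4))
    print("counting bound for 32 points:", max_sh_order(32))
```

With 32 microphones, the 25 coefficients of an order-4 expansion fit, whereas the 36 coefficients of order 5 do not.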
We applied the glissando method as proposed by Kob and Jers (1999) or Brandner (2020), that is, the subjects sang an [a] with an increasing pitch over at least one octave. We determined the directivity patterns for the following conditions:
- REF: Reference measurement, i.e., normal articulation of an [a].
- HFM: Holding a hand in front of the mouth while articulating an [a].
- CAM: Cupping the hands around the mouth while articulating an [a].
The datasets presented in this study have been determined in the same test series as those in Pörschmann and Arend (2021), with each posture and the reference condition measured twice for 13 subjects (2 female and 11 male), aged between 25 and 64 years. As the head radius is required to calculate a subject-specific spherical head model applied in the spatial upsampling process, we determined for each subject the head width (M = 15.1 cm; SD = 0.76 cm), the head height (M = 12.6 cm; SD = 1.35 cm), and the head length (M = 19.5 cm; SD = 0.86 cm).
2.2 Postprocessing
The postprocessing is only briefly summarized here, as it was carried out in exactly the same way as in Pörschmann and Arend (2020, 2021). Similar to Brandner (2020), we first calculated for each recorded condition the impulse response between each microphone signal of the surrounding microphone array and the reference microphone at azimuth 0° and elevation 0°, resulting in datasets of 32 impulse responses, each representing the sound radiation in one direction. By calculating the impulse responses, the datasets were normalized to the frontal direction and thus do not reflect the impact of the hand postures on absolute radiated sound power. As the datasets are affected at low frequencies by reflections and room modes of the anechoic chamber, which become prominent below 200 Hz, the original low-frequency component was substituted in the frequency domain by an adequately matched component of an analytic low-frequency model, applying a low-frequency extension according to Xie (2009). A further postprocessing step was required to compensate for inaccuracies in positioning the subjects in the center of the microphone array. As small deviations of a few centimeters already lead to strong impairments in the spatial upsampling process (Pörschmann and Arend, 2019), we applied a method for distance error compensation that reduces the impairments caused by distance errors for voice directivity patterns (Pörschmann and Arend, 2020). Finally, the sparse datasets were spatially upsampled to a dense grid with 2702 sampling points on a Lebedev sampling scheme by applying the spatial upsampling by directional equalization (SUpDEq) method,¹ which we described in detail in Pörschmann and Arend (2020) and Pörschmann (2019) and which we only briefly summarize here. First, the sparse dataset is equalized by spectral division with corresponding directional rigid sphere transfer functions.
The rigid sphere transfer functions represent a simplified voice directivity carrying no information on the specific shape of the mouth opening or the form of, for example, the cheekbones, but only featuring the basic shape of a spherical head. The equalization results in a time-aligned and spectrally matched dataset with a significantly reduced spatial complexity. Accordingly, it is better suited for interpolation to a dense grid. For interpolation, the equalized dataset is transformed to the spherical harmonics (SH) domain (Williams, 1999), interpolated to a dense grid by an inverse SH transform, and subsequently de-equalized by spectral multiplication with corresponding rigid sphere transfer functions. The evaluation of the method showed that the approach can be applied for the spatial upsampling of voice directivity patterns measured with the employed surrounding spherical microphone array (Pörschmann and Arend, 2020).
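The first postprocessing step, the impulse response between an array microphone and the reference microphone, can be illustrated by a spectral division. The sketch below is a generic regularized deconvolution and not necessarily the procedure used for the datasets; the function name `relative_impulse_response` and the regularization constant `reg` are assumptions made for illustration.

```python
import numpy as np


def relative_impulse_response(mic, ref, reg=1e-8):
    """Estimate the impulse response that maps `ref` onto `mic`.

    Uses a regularized spectral division H = M * conj(R) / (|R|^2 + reg).
    `reg` is a hypothetical constant that avoids division by very small values.
    """
    n = len(ref)
    mic_spec = np.fft.rfft(mic, n)
    ref_spec = np.fft.rfft(ref, n)
    transfer = mic_spec * np.conj(ref_spec) / (np.abs(ref_spec) ** 2 + reg)
    return np.fft.irfft(transfer, n)


# Synthetic check: the "array microphone" signal is the reference signal
# filtered (circularly) by a known two-tap impulse response.
rng = np.random.default_rng(0)
reference = rng.standard_normal(1024)
true_ir = np.zeros(1024)
true_ir[0] = 1.0
true_ir[5] = 0.5
microphone = np.fft.irfft(np.fft.rfft(reference) * np.fft.rfft(true_ir), 1024)

estimate = relative_impulse_response(microphone, reference)
print(np.round(estimate[:8], 3))
```

Because every impulse response is taken relative to the frontal reference microphone, the result describes only the direction-dependent differences, which is why the datasets carry no information on absolute radiated sound power.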
To visualize and further analyze the voice directivity patterns, we interpolated the upsampled dataset to 360 directions in steps of 1° along the horizontal plane (the full circle in azimuth, where positive angles point to the left; elevation 0°) and along the vertical plane (the full circle in elevation, where 0° points to the front, 90° to the top, and 180° to the back). Technically, this interpolation was carried out by transforming the upsampled dataset to the SH domain at an SH order N = 35 and then resampling the datasets using the inverse SH transform. Finally, we averaged the datasets for each subject over both repetitions.
3. Results
3.1 Polar diagram analysis
Figure 2 depicts the directivity patterns in the horizontal and vertical plane in octave bands from 250 Hz to 8 kHz, averaged over all subjects. To illustrate interindividual variations, the plots show the standard deviations. For clarity of presentation, we restrict this plot to octave bands.² We refrained from plotting directivity patterns for frequencies below the 250 Hz octave band because the influence of the hand postures diminishes at low frequencies and because the low-frequency extension further eliminates all differences between the datasets in this frequency range. We only show plots up to the 8 kHz octave band for two reasons. First, most of the energy of human voice articulation is far below 8 kHz, and second, as shown in Pörschmann and Arend (2020), the errors in the spatial upsampling of the sparse sampling grid increase strongly for frequencies above 8 kHz. From the plots, differences between the two hand postures (HFM, CAM) and the reference (REF) can easily be observed. As expected, the radiation is directed more strongly to the front for CAM and, due to reflections at the hands, more to the sides for HFM. The influence of the hand postures becomes prominent above 1 kHz and increases with frequency. Moreover, the standard deviations are larger for the hand postures, especially towards higher frequencies, reaching values of more than 6 dB in the 8 kHz band for both postures.
As the polar plots depicted in Fig. 2 do not resolve the fine structure of the directivity patterns over frequency, we analyze the radiation patterns of the postures in more detail in Fig. 3, showing the directivity pattern over frequency in the horizontal and vertical planes normalized to the frontal direction. Generally, below 1 kHz, the differences between the postures and the reference diminish due to diffraction. Above 3 kHz, for HFM, an increased sound radiation to the sides and backwards can be observed, which is caused by reflections at the hand. It is worth noting that above 3 kHz, the pattern shows some ripples that might be caused by resonances in the sound propagation between mouth and hand. The resonance frequencies depend on the distance between mouth and hand and, assuming hard-walled planes, occur where this distance equals a multiple of λ/2, with λ the wavelength. These resonances lead to peaks, which vary to some extent between the datasets for the different subjects. From the datasets we determined that the lowest resonance occurs at about 3 kHz. This corresponds to a distance of 5.7 cm between hand and mouth and matches our observations during the measurements quite well. For CAM, Fig. 3 shows a slightly reduced sound radiation to the sides and upwards. Downwards, the measurements are almost unaffected by the posture, which is probably caused by the way the subjects cupped their hands around their mouths. As shown in Fig. 1 for one subject, the hands were not completely closed downwards.
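The relation between the lowest resonance frequency and the distance between mouth and hand can be checked with a short calculation. The sketch below assumes a speed of sound of 343 m/s (air at roughly 20 °C), which is not stated in the text, and uses a hypothetical function name.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees Celsius


def half_wavelength(frequency_hz: float, c: float = SPEED_OF_SOUND) -> float:
    """Distance in meters that equals half a wavelength at `frequency_hz`."""
    return c / (2.0 * frequency_hz)


if __name__ == "__main__":
    distance = half_wavelength(3000.0)
    print(f"half wavelength at 3 kHz: {distance * 100:.1f} cm")
```

Under this assumption, a lowest resonance at about 3 kHz is consistent with the reported distance of 5.7 cm.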
3.2 Directivity index
We determined the directivity index (DI) from the upsampled datasets for the postures and the reference condition. Figure 4 shows the mean values and the standard deviations of the DIs, illustrating the differences between the hand postures and the reference condition for frequencies above 1.5 kHz. Whereas the graphs show for CAM a strong increase and a maximal value of all averaged DIs of 8 dB at 3.1 kHz, for HFM the DI decreases strongly from 5.5 dB at 1.3 kHz to nearly 0 dB at 4 kHz. Furthermore, the standard deviations are larger for the hand postures than for REF, indicating higher individual variances for the postures, especially for HFM and frequencies above 1.5 kHz. Figure 4 indicates further that the influence of CAM on the DI is maximal at about 3 kHz and decreases both towards lower and higher frequencies. Comparing HFM and CAM to REF further supports our assumption that the influence of HFM on the directivity is greater than that of CAM, especially above 3 kHz.
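The text does not spell out how the DI was computed. As an illustration, the sketch below uses the common definition of the DI as the ratio of the power radiated in the frontal direction to the power averaged over the full sphere, expressed in dB. The function name, the quadrature grid (Gauss-Legendre nodes in the sine of the elevation, uniform in azimuth), and the two analytic test patterns are assumptions, not the authors' implementation.

```python
import numpy as np


def directivity_index(magnitude, n_el=64, n_az=128):
    """DI in dB of a pattern `magnitude(azimuth, elevation)` (angles in radians).

    The frontal direction is azimuth 0, elevation 0. The spherical mean of
    the squared magnitude is approximated by numerical quadrature.
    """
    nodes, weights = np.polynomial.legendre.leggauss(n_el)  # nodes = sin(elevation)
    elevation = np.arcsin(nodes)
    azimuth = np.linspace(-np.pi, np.pi, n_az, endpoint=False)
    az_grid, el_grid = np.meshgrid(azimuth, elevation)      # shape (n_el, n_az)
    power = np.abs(magnitude(az_grid, el_grid)) ** 2
    # The Gauss-Legendre weights sum to 2, hence the normalization by 2 * n_az.
    mean_power = np.sum(weights[:, None] * power) / (2.0 * n_az)
    front_power = np.abs(magnitude(0.0, 0.0)) ** 2
    return 10.0 * np.log10(front_power / mean_power)


def omnidirectional(az, el):
    """Same magnitude in all directions."""
    return np.ones(np.shape(az))


def cardioid(az, el):
    """Cardioid pattern pointing to the front."""
    return 0.5 * (1.0 + np.cos(el) * np.cos(az))


if __name__ == "__main__":
    print(f"omnidirectional: {abs(directivity_index(omnidirectional)):.2f} dB")
    print(f"cardioid:        {directivity_index(cardioid):.2f} dB")
```

Because the datasets are normalized to the frontal direction, a DI computed this way depends only on the shape of the radiation pattern and not on the absolute level.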
Using parametric statistical analysis, we examined these assumptions in further detail. As expected, studying all DIs using a Greenhouse-Geisser (GG) corrected (Greenhouse and Geisser, 1959) two-way repeated measures analysis of variance (ANOVA) with the within-subject factors frequency (19 third-octave frequency bands from 125 Hz to 8 kHz) and condition (REF, HFM, and CAM) showed a significant main effect of frequency [F(18,216) = 36.39, pGG < 0.001, ηp² = 0.75, ϵ = 0.20] and condition [F(2,24) = 78.17, pGG < 0.001, ηp² = 0.87, ϵ = 0.79] as well as a significant frequency × condition interaction [F(36,432) = 29.14, pGG < 0.001, ηp² = 0.71, ϵ = 0.18]. To disentangle the effect of HFM and CAM on the DIs with respect to REF, we performed two pairwise comparisons between REF and HFM as well as REF and CAM using nested GG-corrected two-way repeated measures ANOVAs with the within-subject factors frequency and condition. Consistent with the above results, both nested ANOVAs yielded significant main effects for frequency and condition as well as a significant frequency × condition interaction (all effects pGG < 0.001; for the sake of conciseness, the statistical results of nested ANOVAs are not reported in greater detail throughout the paper). Interestingly, however, the main effect of condition in the REF vs HFM comparison showed a considerably higher effect size [F(1,12) = 53.33, pGG < 0.001, ηp² = 0.82, ϵ = 1] than the main effect of condition in the REF vs CAM comparison [F(1,12) = 20.47, pGG < 0.001, ηp² = 0.63, ϵ = 1]. This indicates that, in relation to the reference condition, HFM affects the DIs more than CAM does.
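The effect sizes reported above are consistent with partial eta squared, which can be recovered from an F value and its uncorrected degrees of freedom as F·df_effect / (F·df_effect + df_error). The following sketch recomputes the five reported values; the function name and the dictionary labels are hypothetical.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared from an F statistic and its degrees of freedom."""
    return f_value * df_effect / (f_value * df_effect + df_error)


# (F, df_effect, df_error) as reported in the text
reported = {
    "frequency": (36.39, 18, 216),
    "condition": (78.17, 2, 24),
    "frequency x condition": (29.14, 36, 432),
    "condition, REF vs HFM": (53.33, 1, 12),
    "condition, REF vs CAM": (20.47, 1, 12),
}

effect_sizes = {name: partial_eta_squared(*stats) for name, stats in reported.items()}

if __name__ == "__main__":
    for name, value in effect_sizes.items():
        print(f"{name}: {value:.2f}")
```

The Greenhouse-Geisser correction scales both degrees of freedom by the same factor ϵ, so it changes the p value but leaves this effect size unchanged.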
To analyze in which frequency bands HFM and CAM differ significantly from REF, and thus to determine the frequency range in which each hand posture deviates most from the reference, we performed multiple nested GG-corrected one-way repeated measures ANOVAs with the within-subject factor condition (either the two levels REF and HFM or REF and CAM) for each of the 19 levels of frequency. The initial significance level of 0.05 was further corrected according to Hochberg (1988) to prevent alpha-error accumulation. For the comparison REF vs HFM, the ANOVAs revealed significant differences in DIs for the third-octave bands centered at 250 Hz, 315 Hz, 400 Hz, 500 Hz, 2.5 kHz, 3.15 kHz, 4 kHz, and 5 kHz (all pGG < 0.001). In contrast, the ANOVAs comparing REF with CAM yielded significantly different DIs only in the three third-octave bands centered at 1.6 kHz, 2 kHz, and 3.15 kHz (all pGG < 0.001). These findings indicate that, compared to the reference, HFM leads to distinct differences in DIs over a wide frequency range, whereas CAM results in much smaller differences in a narrower frequency range.
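The Hochberg (1988) correction is a step-up procedure: the p values are sorted in ascending order, the largest rank k with p(k) ≤ α / (m − k + 1) is determined, and all hypotheses up to that rank are rejected. The sketch below is a generic implementation with a hypothetical function name and made-up example p values; it does not use the p values of this study.

```python
def hochberg_reject(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Implements the Hochberg step-up procedure for m tests.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, ascending p
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= alpha / (m - rank + 1):
            largest_rank = rank
    reject = [False] * m
    for rank, index in enumerate(order, start=1):
        reject[index] = rank <= largest_rank
    return reject


if __name__ == "__main__":
    example = [0.001, 0.02, 0.03, 0.2]  # made-up p values for illustration
    print(hochberg_reject(example))
```

In the analysis described above, m would be 19 for each pairwise comparison, one test per third-octave band.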
4. Discussion
In our measurements, we determined voice directivity patterns of an articulated [a] for two different hand postures and compared them to a reference condition without a posture. As described in Sec. 2.2, the datasets were normalized to the frontal direction and thus do not reflect the influence of the hand postures on absolute radiated sound power. Accordingly, the results do not show that, for example, in the CAM condition, the total radiated sound power is probably increased due to an optimized matching of the radiation impedance to the sound field's acoustic impedance (Blauert and Xiang, 2009). To study such effects, the complete measurement procedure would have to be changed and it would, for example, have to be assured that the subjects repeat the same articulations numerous times with exactly the same articulation strength. However, this is almost impossible to control for the untrained, especially when using the glissando method. Alternatively, the articulation strength needs to be measured during vocalization, for example with a bone conduction microphone.
Generally, the frequency dependencies of the postures can be described in a simple model by placing a point source on a rigid sphere that is shadowed by a nearby surface representing the hand. Accordingly, due to diffraction, hand postures have only a weak impact on directivity at low frequencies, and the shadowing effects become strong towards high frequencies. Assuming an extension of the hand of d = 0.15 m and that occlusion becomes relevant for dimensions above λ/2 yields f = c/(2d) ≈ 1.1 kHz, with c the speed of sound. This estimate is consistent with the observation that the deviations of the REF from the postures increase above 1 kHz.
As expected, the two hand postures affect human voice directivity in a completely different manner. While CAM leads to a stronger frontal sound radiation comparable to the influence of a horn on directivity, HFM results in a softer directivity, as it shadows frontal sound radiation. The analysis of the radiation patterns showed that the influence of HFM is stronger than that of CAM. Furthermore, as indicated by the increased standard deviations, the interindividual differences tend to be larger for the hand postures than for REF because the postures cause further individual variances, for instance, due to the exact form of the hands or the distance and position at which the hands were held. For HFM the analysis of the spectral differences showed for frequencies above 3 kHz some ripples in the spectral fine structure of the directivity pattern. This could be caused by resonances that vary individually depending on the distance between the hand and the mouth. For CAM, a slightly reduced sound radiation to the sides and upwards, but nearly no decrease downwards was observed. This seems plausible as most subjects did not close their hands downwards when cupping the hands around the mouth.
Analyzing the frequency-dependent deviations of voice directivity based on DIs showed that the differences in the DIs are very small towards low frequencies. While they are still significant for HFM at 500 Hz, we found no significant impact of CAM below 1.6 kHz. Furthermore, the postures show deviations from the reference articulation of more than 4 dB at 2 kHz (CAM) and 6 dB at 4 kHz (HFM). This exceeds any of the differences between different vowels or fricatives, which in Pörschmann and Arend (2021) were always below 3 dB. Thus, hand postures, especially HFM, have a stronger impact on voice radiation than the differences between the various phonemes. In another recent study (Pörschmann et al., 2020), we analyzed the influence of wearing face masks, showing that for most of the masks the DI is only slightly affected, by a maximum of 2 dB for frequencies up to 8 kHz, which is far less than the influence of the hand postures. Only for facepiece respirator masks did we find variations in DI of up to 7 dB in some frequency bands above 3 kHz, which is comparable in magnitude to the investigated hand postures, especially to HFM. Thus, hand postures have a stronger impact on voice directivity than most face masks.
When frontally facing a human speaker, the DI is directly related to the direct-to-reverberant ratio (DRR) in a room, for which Frank and Brandner (2019) determined a just-noticeable-difference (JND) of 1.8 dB for a DRR of 0 dB. Similar results were obtained by Larsen (2008), who found JND values of about 2–3 dB in rooms with a DRR of about 0 dB or +10 dB and of about 6–8 dB in rooms with a DRR of −10 dB or +20 dB. These JNDs are in the same range or are smaller than the variations of the DI caused by the hand postures, at least for 2 kHz. Accordingly, the impact of hand postures on voice directivity is supposed to be perceptible, for example, in auralizations in virtual acoustic environments. Furthermore, in Frank and Brandner (2019), only simplified frequency-independent directivity patterns were assumed. Accordingly, the spectral composition is not considered, which is strongly affected by the hand postures, especially for HFM, and thus probably has additional perceptual influence.
The datasets determined in this study can be used for further investigations on perceptual influences of hand postures on voice directivity for auralizations in VR and AR applications. Furthermore, our results can serve as a basis for future research on the impact of hand postures on human communication in casual situations.
Acknowledgments
Anonymized datasets of the measured and the upsampled directivity patterns of all human speakers and hand postures are available in the SOFA format under a Creative Commons CC BY-SA 4.0 license and can be downloaded as supplementary material at https://doi.org/10.5281/zenodo.5995215. The authors would like to thank Raphaël Gillioz for supporting the measurements. The results presented in this paper have been carried out in the research project NarDasS, which was funded by the Federal Ministry of Education and Research in Germany, support code: BMBF 03FH014IX5-NarDasS.
¹ A Matlab-based implementation of the SUpDEq method can be accessed on https://github.com/AudioGroupCologne/SUpDEq
² The corresponding plots in third-octave bands for the reference and the postures can be found in the supplementary material at https://doi.org/10.5281/zenodo.5995215