Vocal effort is a major source of variability in speech processing. The present study examines its spectral effects using calibrated data recorded in 1977. The 97 talkers were instructed to vary their vocal effort across five degrees. Each sequence was represented by its sound level and its 1/3 octave long-term-average spectrum. After normalization to a common arbitrary level, each spectrum was compared to the others; the comparison demonstrated that the original sound level could be recovered within a 5 dB error margin. A principal component analysis brought out several spectral features involved in the quantitative relationship between spectral shape and sound level.
1. Introduction
Any speech production, even of low intensity, is the result of a muscular effort exerted to produce subglottal pressure and move the speech articulators. This vocal effort (VE) is difficult to measure or model, because the vocal apparatus is complex and its parts interact constantly. While adjusting muscular settings to produce a louder or softer voice, the talker implicitly modifies other spectro-temporal properties of the speech signal. As a result, VE manifests itself in the signal not only through the intended sound level change, but also through the modification of acoustic features more or less related to it, impacting all aspects of speech communication.
Thus, understanding VE effects and correlates appears to be a priority in speech and voice sciences, and a number of studies have moved in that direction. For instance, VE has been related to the distance to the addressee (Liénard and Di Benedetto, 1999; Traunmüller and Eriksson, 2000), to ambient noise level (Junqua, 1993), and to expressive intention (Rilliard et al., 2018). Whether deliberate or unconscious, VE adjustment is always accompanied by substantial variations of the signal, causing great difficulties in speech processing (Zhang and Hansen, 2007). Even human perception can be fooled when a listener has to recognize a speaker across different VE modes (Brungart et al., 2001).
Studying VE effects requires a set of objective properties marking its various states. VE is commonly described by a small number of “vocal modes,” rarely exceeding five, ranging from “whispered” to “shouted.” However, while the extreme modes may be associated with quite distinct phonatory and acoustic structures, the intermediate states shade into one another progressively, and no boundary between adjacent categories can be defined with certainty. This calls into question the ability of vocal modes to reliably quantify VE, so another property reflecting VE degree may be needed.
The total acoustic power radiated by the talker's head is a good candidate for this function. In practice this quantity cannot be measured directly, but it may be represented by the sound pressure level (SPL) (in dB), measured in free field (anechoic room), in the mouth axis, at a given distance from the mouth. Taking 1 m as the reference distance, as for any sound source, seems reasonable. Sound propagation around the head is assumed to be spherical, resulting in a 6 dB decrease of the SPL for each doubling of the distance; this assumption is valid for the main speech spectral frequencies. A close measurement (a few cm) is prone to errors due to slight movements of the talker's head; conversely, a distant measurement (a few meters) may be biased by acoustic reflections and background noise if the free-field conditions are not fulfilled. Choosing a distance d (in meters) differing from the 1 m reference may be justified by recording quality, but if the SPL measurements are to be compared to others, a distance correction of 20 log10(d) decibels must be added to the measured value. In any case, the recording should be carefully calibrated, as explained by Svec and Granqvist (2018). If the measurement is correctly calibrated and corrected, its value represents the total acoustic power delivered by the talker, expressed in dB above the reference auditory threshold. This quantity, specific to the talker's emission, is called voice strength (VS) in the present paper, in order to avoid any confusion with the distance-dependent SPL received by the interlocutor or the microphone. In the VE literature, although some studies mention the recording distance used, the general picture is that SPL figures and conditions vary from one study to the next, which makes inter-study comparisons difficult. It also overlooks the possibility that VS may be a key factor of speech and voice variability. If this idea proves accurate, VS should be identified and taken into account in most phonetic and vocal studies.
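As a minimal illustration of this distance correction (a hypothetical Python sketch, not part of the original study; the function name is invented for this purpose):

```python
import math

def vs_from_spl(spl_measured_db: float, distance_m: float) -> float:
    """Refer a free-field SPL measured at distance_m (in meters) back to
    the 1 m reference, assuming spherical spreading (6 dB per doubling)."""
    return spl_measured_db + 20.0 * math.log10(distance_m)

# Example: 66 dB SPL measured at 2 m corresponds to about 72 dB VS at 1 m.
print(round(vs_from_spl(66.0, 2.0), 1))  # 72.0
```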
The purpose of the present study is twofold: first, to evaluate to what extent VS quantifies VE more reliably than vocal mode labeling; second, to determine whether VS may be deduced from the average spectrum of an uncalibrated speech sequence.
2. The data of Pearsons et al.
The proposed approach requires calibrated data comprising VE variations produced by a large number of talkers. Few contemporary public databases meet those requirements. However, such a database was recorded four decades ago by Pearsons et al. (1977) in the context of speech intelligibility standards in noisy environments. The sound recordings were lost thereafter, but the spectral measurement sheets were recently collected and restored by Nash (2014), who made them publicly available.
The speech material consisted of the phonetically balanced, meaningless utterance “Joe took father's shoe bench out; she was waiting at my lawn.” The 97 nonprofessional speakers (48 adult men, 37 adult women, and 12 children under 13) were asked to pronounce the sentence repeatedly for at least 10 s, according to four VE instructions: “normal,” “raised,” “loud,” and “shout.” An informal conversation with an interlocutor located 1 m away was added to the previous recordings and received the VE label “casual.” The only information reported about the subjects was their sex and age. Sound recordings were made in an anechoic room at 1 m from the talker and calibrated with a professional sound level meter. The sounds were later processed by a spectrum analyzer (24-channel 1/3 octave filterbank, center frequencies ranging from 50 Hz to 10 kHz), resulting in a long-term average spectrum (LTAS) for each speech sequence. The LTAS band levels, as well as the total level (VS), are expressed as averaged acoustic power (total energy divided by the utterance duration in seconds) in equivalent continuous level (Leq, in dB).
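For concreteness, the two quantities can be sketched as follows (hypothetical Python, not the original 1977 analysis chain; the band levels are assumed to be incoherent power levels, so the total is their power sum):

```python
import numpy as np

P_REF = 20e-6  # reference pressure: 20 micropascals

def leq_db(pressure_pa: np.ndarray) -> float:
    """Equivalent continuous level: mean squared pressure over the whole
    utterance, expressed in dB re 20 uPa."""
    return 10.0 * np.log10(np.mean(pressure_pa ** 2) / P_REF ** 2)

def total_level_db(band_levels_db: np.ndarray) -> float:
    """Total level (VS) of a 1/3 octave LTAS: power sum of the band levels."""
    return 10.0 * np.log10(np.sum(10.0 ** (band_levels_db / 10.0)))
```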
This dataset was well suited to testing VS as a quantitative VE property, because the recording had been performed by experts under the reference conditions. The sequence duration (10 to 20 s) could be considered too short for perfect spectral stabilization, but Löfqvist and Mandersson (1987) showed that LTAS-derived properties of a single-speaker signal, fragmented into shorter segments, remained similar for segments as short as about 10 s. This suggests that 10 to 20 s could have been sufficient for Pearsons' subjects to provide a stabilized LTAS.
In what follows, the term “token” refers to a single utterance (the continuously repeated sentence) produced by a given speaker. A data token consisted of 25 Leq values: one for VS and 24 for the LTAS channels. Three measurement sheets were missing, and two were unusable due to a transcription error, so the total size of the dataset was 480 tokens instead of 485.
Figure 1 displays the LTAS of two subjects, representative of the whole dataset, in response to the five VE instructions. The male speaker used the full dynamic range he was able to produce, whereas the female speaker barely expressed the requested shades. Pearsons et al. reported large differences in the way subjects realized the VE instructions, which resulted in a large overlap of the modal regions and a lack of precision in the averaged LTAS. However, the subjects demonstrated remarkable individual consistency in their sound level progression from low to high. This figure illustrates the limited relevance of the requested vocal modes, as well as the subjects' ability to produce close but distinct LTAS and VS values.
A preliminary study of the tokens (LTAS centers of gravity, dominant frequencies in the F0 zone, and other measurements) revealed two distinct distributions of the LTAS according to the age and sex of the speakers. It appeared that the upper age chosen by Pearsons et al. to qualify the young subjects as children (12 years of age) did not reflect the acoustic reality of their voices: many boys under 16 were classified as male adults despite their high-pitched voices. Consequently, the study was performed not only on the whole dataset (denoted “all,” 480 tokens), but also on two disjoint subsets denoted “men” (males over 15, 202 tokens) and “f + c” (adult females plus all children under 16, 278 tokens). A finer partitioning of the dataset could have been done (i.e., adult males, adult females, girls, boys), but some of the subsets would have been too small for the results to be significant. Besides, a visual examination of the LTAS graphic representations revealed that, in some recordings, the first two channels took values close to 40 dB, obviously due to non-speech background noise. At the other end of the spectrum, channels 23 and 24 provided little consistent information. Consequently, only channels 3–22 were taken into account in the study.
3. Analysis method
The goal was to check to what extent the VS value of any token was implicitly coded in its LTAS shape, regardless of the actual recorded level. To this end, the direct VS information was removed from the spectral data by shifting each LTAS to the same arbitrary total value (e.g., 50 dB) and applying the same threshold (0 dB). After this, each token was represented by its measured VS and its 20 normalized LTAS values in the common interval, denoted here [0:50]. The purpose of the analysis was to determine a relation between the 20 normalized LTAS values (explanatory variables) and the VS value (response, or dependent variable) and to check its validity.
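A minimal sketch of this normalization (hypothetical Python; the function name is invented, and the 0 dB threshold is read here as a floor applied after the shift):

```python
import numpy as np

def normalize_ltas(band_levels_db: np.ndarray, target_db: float = 50.0) -> np.ndarray:
    """Shift an LTAS so that its power-summed total level equals target_db,
    then apply the 0 dB floor, yielding the [0:target_db] representation."""
    total = 10.0 * np.log10(np.sum(10.0 ** (band_levels_db / 10.0)))
    shifted = band_levels_db + (target_db - total)
    return np.maximum(shifted, 0.0)
```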
Insofar as all the variables to be processed were continuous, the most obvious method seemed to be multiple linear regression analysis (MLRA). However, a previous study (Liénard and Barras, 2013) on similar data (intensity classification of French isolated vowels), using linear discriminant analysis (intensity was categorized into a limited number of classes), had highlighted a large non-linearity of the data representation in the discriminant space. As a consequence, MLRA was not selected for the present study. The k-nearest-neighbors (kNN) method was preferred because it is based on the notion of vicinity and makes no hypothesis about the linearity of the relation between explanatory variables and response.
Each normalized LTAS from a given speaker (let us call this vector the “candidate”) was compared (by Euclidean distance) to all of the normalized LTAS from the other speakers (to prevent any same-speaker comparison), resulting in a large distance table. The k nearest neighbors of each candidate were selected, and the predicted voice strength (pVS) was computed as the inverse-rank weighted average of the neighbors' measured VS. This process was repeated for every token of the corpus. The results, plotted as a distribution in the (VS, pVS) coordinates, should ideally align along the diagonal. The Pearson correlation coefficient between VS and pVS values was computed, as well as the standard deviation of the (pVS − VS) distribution, hereafter called the “error margin.” In a second experiment, a principal component analysis was performed on each dataset to determine which spectral features were most involved in the VS prediction. All processing was done with the Praat software (Boersma and Weenink, 2019).
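The prediction step can be sketched as follows (a hypothetical Python re-implementation, not the original Praat scripts; inverse-rank weights 1, 1/2, ..., 1/k are assumed):

```python
import numpy as np

def predict_vs(ltas: np.ndarray, vs: np.ndarray, speaker: np.ndarray, k: int = 3) -> np.ndarray:
    """kNN prediction of voice strength from normalized LTAS vectors.
    ltas: (n_tokens, n_channels); vs: (n_tokens,); speaker: (n_tokens,) ids.
    Same-speaker tokens are excluded from each candidate's neighbor pool;
    neighbors are weighted by inverse rank (1, 1/2, 1/3, ...)."""
    n = len(vs)
    pvs = np.empty(n)
    weights = 1.0 / np.arange(1, k + 1)           # inverse-rank weights
    for i in range(n):
        others = np.flatnonzero(speaker != speaker[i])
        dists = np.linalg.norm(ltas[others] - ltas[i], axis=1)
        nearest = others[np.argsort(dists)[:k]]   # k nearest neighbors
        pvs[i] = np.average(vs[nearest], weights=weights)
    return pvs

# Error margin and correlation as reported in the paper:
# error_margin = np.std(pvs - vs); r = np.corrcoef(vs, pvs)[0, 1]
```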
4. Results
4.1 Results in terms of error margin
Figure 2 illustrates the results obtained for the three datasets “all,” “men,” and “f + c” in the (VS, pVS) coordinates. The normalization interval was [0:50] dB, and k = 3 nearest neighbors were used in the weighted average. The choice of k = 3 was a trade-off between two drawbacks. Choosing k = 1 made the results too sensitive to any local sparsity of the data. Conversely, a larger value such as k = 8 or more produced a slight flattening of the set of dots at its ends, which can be explained as follows. A high-level token, logically located near the high end of the distribution, has more lower-level neighbors than higher-level ones, yielding a downward bias of their average value; a reciprocal consideration applies to the tokens at the low end, which are slightly overestimated. This centrality effect was hardly noticeable for k = 3.
The error margin varied with the dataset: 4.7 dB for “all,” 4.5 dB for “men,” and 3.8 dB for “f + c.” The correlation coefficients, 0.915, 0.931, and 0.935, respectively, confirmed the statistical validity of the results. The “all” display exhibited some outliers, located in the high part of the VS scale. They were due to the spectral similarity between some strong male voices and some female voices of comparable F0 and lower VS. The particular status of shouted voice was confirmed by another result: discarding from the “all” dataset the tokens with VS greater than 80 dB produced an appreciable improvement of the error margin (4.0 dB instead of 4.7 dB).
Several results obtained with the same data and k = 3 are reported in Table 1, in order to assess the effect of the normalization interval and the use of the error margin as a quality criterion. Case (a) used the raw data, without any normalization; in other words, the VS information was implicitly present in the data. Thus, the predicted pVS were maximally close to the measured VS. This case indicated the ultimate precision that could be expected from the dataset. Case (b) implemented a [0:100] normalization: all LTAS were shifted upward without any alteration of their shapes. As expected, the loss of performance with respect to case (a) was noticeable, but remained moderate (between 1.8 and 2.9 dB). Cases (c) and (d) showed the effects of more severe normalizations, [0:50] and [0:30], reducing the LTAS to their most intense parts. In case (e) the spectrally computed distances were replaced by random values drawn uniformly from the [0:50] interval (mean of 10 trials).
Table 1. Error margins (dB) obtained with k = 3 under the conditions described in the text.

Error margin (dB) | “all” | “men” | “f + c”
---|---|---|---
(a) Raw data | 1.5 | 1.5 | 1.8
(b) Normalization [0:100] | 3.3 | 4.4 | 3.7
(c) Normalization [0:50] | 4.7 | 4.5 | 3.8
(d) Normalization [0:30] | 4.7 | 4.5 | 4.4
(e) Random distances [0:50] | 13.6 | 14.4 | 12.9
It may therefore be concluded that the predicted values pVS remained consistently close to the measured values VS, even when the normalization interval was as small as [0:30] dB. The error margins, always smaller than 5 dB, should be appreciated against the two extremes of 1.5 dB (raw data) and 13.6 dB (random distances) obtained in the “all” case.
4.2 Results of principal component analysis
In each dataset, the first principal component accounted for approximately 66% of the total explained variance. But the “all” dataset required five components to reach a cumulative value of 90%, whereas two (for “men”) or three (for “f + c”) were sufficient to reach the same level, which confirms the relevance of the distinction made between the subsets. This difference can be understood by looking at the main eigenvectors, which reveal which parts of the normalized LTAS contributed most to the VS prediction.
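Such an analysis can be sketched with a plain eigendecomposition of the LTAS covariance matrix (hypothetical Python, not the original Praat procedure):

```python
import numpy as np

def pca(ltas: np.ndarray):
    """PCA of the normalized LTAS matrix (n_tokens, n_channels).
    Returns eigenvalues in descending order, the matching eigenvectors
    (one spectral profile per column), and the variance proportions."""
    centered = ltas - ltas.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()           # variance proportions
    return eigvals, eigvecs, explained
```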
Figure 3 displays the profiles of the first two eigenvectors, regardless of their eigenvalues. The “men” display is the simplest to interpret. Its first eigenvector exhibits a gradual increase in the male F0 zone, representing the F0 increase with VS, and a plateau in the medium-to-high frequencies, representing the spectral richness. The second eigenvector is dominated by a sharp peak at 100 Hz, followed by a gradual decrease in the F0 range. Its role is to classify the male voices between low-pitched and high-pitched ones. Performing the prediction computation from the first two principal components only yielded an error margin of 4.9 dB, instead of 4.5 dB for the whole set of components. Those first two principal components taken together represented 90% of the total explained variance.
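Restricting the prediction to the first two components amounts to projecting each normalized LTAS onto those eigenvectors before the kNN step; reusing the hypothetical helpers sketched above:

```python
# ltas, vs, speaker: dataset arrays as assumed in the kNN sketch above.
eigvals, eigvecs, explained = pca(ltas)
scores = (ltas - ltas.mean(axis=0)) @ eigvecs[:, :2]  # first two PCs
pvs_2d = predict_vs(scores, vs, speaker, k=3)          # reduced-space kNN
```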
The first two principal components of “f + c” follow a similar pattern, but the profiles are shifted upward by about one octave in their low-frequency part. Taking them together in the prediction computation accounted for 87% of the total explained variance and produced an error margin of 3.9 dB, instead of 3.8 dB with the whole set of components.
The “all” dataset behaves differently. The first principal component looks like an average of the second components of the “men” and “f + c” subsets. It selects a broad zone of fundamental frequencies observed in normal male and female voices, from 100 to 300 Hz. The second principal component exhibits a large contrast in the 100 to 200 Hz interval, providing a first, incomplete contribution to separating male from female voices. Those two components, taken together, represented only 77% of the total explained variance and produced a 7.2 dB error margin, instead of 4.7 dB with 20 channels. Filling this large gap requires the contribution of higher-order principal components, not displayed here for the sake of clarity.
Globally, the results strengthen the proposition that VS can be recovered to a large extent from the LTAS shape, provided that it is not reduced to a single feature, however relevant, as in Sundberg and Nordenberg (2006). At least two independent dimensions are necessary to provide a satisfactory relationship between spectral shape and voice strength; the subject's age and gender seem to be another VS factor, whose acoustic effects may be grouped and considered as a third independent dimension.
5. Discussion and perspectives
The data of Pearsons et al. were used here for a purpose that had not been envisioned by their authors. This presented some advantages: the rigorous level calibration, as well as the large number and variety of speakers, conferred robustness on the results; the logarithmic 1/3 octave frequency scale emphasized the role of the low-frequency components. Besides, as error-prone speech parameters such as the fundamental and formants were unavailable, VS predictions had to be made from the raw LTAS data, which made them all the more reliable.
The data, however, had some imperfections with regard to their new use: the phonetic content of the “casual” speech material differed from that of the other VE categories; the softest voices (under 50 dB) were not represented; and the duration of the utterances, between 10 and 20 s, may have been too short to guarantee LTAS stability. The main problem was the absence of the speech signal itself, which restricted the study to spectral descriptions, whereas the effects of VE should be investigated in both the time and frequency dimensions.
In spite of those experimental limitations, it appears that recovering VS from the signal might help to separate the variations due to VE from those due to other causes. In phonetic-acoustic analysis it might help to focus a study on only the items of interest. In automatic speech processing it might reduce the size of the training datasets. In speech transmission, it might be used to compensate for the level adjustments made during recording.
The clear speech issue might also benefit from this new perspective. Summarizing a number of studies on the subject, Smiljanic and Bradlow (2009) enumerated nine acoustic-phonetic features which contribute to differentiating clear speech from ordinary, conversational speech: (a) decreased speaking rate, (b) wider dynamic pitch range, (c) greater SPLs, (d) more salient stop releases, (e) greater rms intensity of the non-silent portions of obstruent consonants, (f) increased energy in the LTAS 1–3 kHz range, (g) higher voice intensity, (h) vowel space expansion, and (i) increased modulation depth of low-frequency changes in the intensity envelope. The Pearsons LTAS and VS dataset, lacking signal and intelligibility scoring, could not be used to test those features. However, one can hypothesize that VS, if correctly evaluated, is directly or indirectly involved in most of the clear speech features.
First, although features (c) (SPLs) and (g) (intensity) refer directly to SPL, the description of how and where the intensity measurements were made is often missing. This indicates that the VS notion was not considered a meaningful indicator in the reported experiments. Feature (f) is the mid-to-high frequency increase in energy that has been consistently verified since the observation of Licklider et al. (1955); its relation to VS has been the very object of the present study.
Second, there are features for which the relation to VS has not been demonstrated, but which could be an indirect consequence of it. For instance, feature (b) (pitch range) may partly reflect the automatic increase of F0 when VS increases, which has been known for a long time. This reflex can be partly controlled by the talker in order to obey certain intonational constraints; its evaluation may also be influenced by the choice of the fundamental frequency unit: the same shift of a given F0 range may appear large if expressed in Hz, or moderate if expressed in semitones. Features (d) (stop release), (e) (consonant intensity), and (i) (depth of intensity modulation) may be due to better articulation, but may also be influenced by VS. Moreover, their values may be corrupted in low voice, because of the presence of background noise.
The remaining features, (a) speaking rate and (h) vowel space expansion, may be only marginally related to VS. The question arises, however, as to whether raising VS might in some cases degrade intelligibility instead of improving it. This point was investigated by Pickett (1956), who found that word intelligibility was not affected within the VS range 55–78 dB, but declined rapidly outside this range, in very low or shouted voice.
In the future, the study should be extended to other calibrated datasets, so as to investigate VS effects in the spectro-temporal domain, which was not possible with the Pearsons data. Some points deserve particular attention. One is the duration of the speech segments to consider: long sequence, sentence, breath group, or syllable. Another is the ability of VS features to apply to speech material coming from different sources, or distorted during recording. A third relates to low voices, which have taken on great importance in the contemporary use of audio-visual devices.
In conclusion, the present study joins many others in asserting that VE is a major source of acoustic and phonetic variability in speech signals. Although VE is not a measurable quantity, it seems to be closely represented by VS, which in turn can be estimated from the LTAS shape. This view may introduce a new way of considering the speech signal, in which VE variations would be identified by means of VS estimation and fully taken into account instead of appearing as an arbitrary factor of variability.
Acknowledgments
The author is grateful to Anthony Nash, from Charles M. Salter Associates, San Francisco, who kindly provided the data, as well as to Albert Rilliard from LIMSI, for fruitful discussions about the role and nature of vocal effort.