Vowel space area (VSA) is an attractive metric for the study of speech production deficits and reductions in intelligibility, in addition to the traditional study of vowel distinctiveness. However, traditional VSA estimates are currently not sensitive enough to map to production deficits. The present report describes an automated algorithm that uses healthy, connected speech rather than single syllables and estimates the entire vowel working space rather than only the corner vowels. Analyses reveal a strong correlation between the traditional VSA and automated estimates. When the two methods diverge, the automated method seems to provide a more accurate area since it accounts for all vowels.
I. Introduction
Vowel space area (VSA) refers to the two-dimensional area bounded by lines connecting first and second formant frequency coordinates (F1/F2) of vowels.1 Estimation of VSA has a long history in the study of vowel identity, speaker characteristics, speech development, speaking style and sociolinguistic factors that influence vowel production.2–10 Traditional VSA computation methodology is shown in Fig. 1(a). A typical computation involves making static measurements of the F1/F2 values for each of the four corner vowels (or three point vowels, /a, i, u/ for triangle) at 50% vowel duration, for several productions of each vowel. The mean F1/F2 value for each of the four corner vowels is then used to compute the area of the quadrilateral formed by the corner vowels. Since frequencies of the first and second formants roughly relate to the size and shape of the cavities created by jaw opening (F1) and tongue position (F2), the VSA is an acoustic proxy for the kinematic displacements of the articulators.11 In general, studies have shown that VSA is larger in speech that is clearer and more intelligible than speech associated with smaller VSAs.12 This is interpreted as corresponding to greater articulatory excursions and more distinct acoustic-articulatory vowel targets. Thus, the VSA and other derived vowel metrics related to distinctiveness have been quite successful in the study of speaking style, dialects, and languages.6,7,9
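The traditional computation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in any cited study: the token values are made-up placeholders, the function names `mean_formants` and `polygon_area` are ours, and the area is obtained with the standard shoelace formula for a polygon with ordered vertices.

```python
def mean_formants(tokens):
    """Average the (F1, F2) measurements of several productions of one vowel."""
    n = len(tokens)
    return (sum(t[0] for t in tokens) / n, sum(t[1] for t in tokens) / n)


def polygon_area(vertices):
    """Area of a simple polygon from its ordered vertices (shoelace formula)."""
    total = 0.0
    for i in range(len(vertices)):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % len(vertices)]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


# Hypothetical midpoint measurements in Hz, stored as (F1, F2) pairs.
# These numbers are placeholders for illustration, not measured data.
tokens = {
    "i": [(290, 2310), (310, 2290)],
    "ae": [(690, 1910), (710, 1890)],
    "a": [(740, 1110), (760, 1090)],
    "u": [(340, 910), (360, 890)],
}

# Walk the corners in order around the quadrilateral: /i/ -> /ae/ -> /a/ -> /u/.
corners = [mean_formants(tokens[v]) for v in ("i", "ae", "a", "u")]
vsa = polygon_area(corners)  # area in Hz^2
```

The vertices must be listed in order around the quadrilateral; listing them in an arbitrary order can produce a self-intersecting polygon and an incorrect area.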
(Color online) Block diagrams for (a) the typical steps taken in the manual computation of the vowel space area. Speech samples are phonetically segmented, formants for the corner vowels are estimated, the mean value of each corner vowel is computed, and finally the area bounded by the mean of corner vowels is computed. (b) The proposed method.
Because abnormal vowel formant reduction (centralization) is a common feature of speech production deficits, there has been a longstanding interest in using VSA estimations for characterizing speech motor control, including speech development,10,13 speech disorders,14–17 and speech interventions.18 Despite the intuitive appeal of using VSA as an index of speech motor control and intelligibility, its success has been limited and modest.19 For instance, VSA was minimally predictive of overall intelligibility for individuals with dysarthria, secondary to Parkinson's disease and multiple sclerosis (between 6 and 13 %).20,21 More optimistic relationships (over 40%) were reported when examining the same relationship for speakers with dysarthria, secondary to amyotrophic lateral sclerosis (ALS).22,23 The most promising predictive relationship of VSA and intelligibility was demonstrated by Higgins and Hodge,24 in an assessment of a heterogeneous sample of children with dysarthria. Attempts to modify the VSA estimate to more sensitively account for differences in the front-back and high-low dimensions have offered some benefit.25 Such modifications may be preferable for mapping VSA to perceptual measures and speaker classification.14,26–28 However, it is likely that more extensive modifications are required to obtain VSA estimates that hold clinical utility for speech production deficits and the resulting decrements in speech intelligibility. Such information, particularly if fully automated and robust to speech sample, would provide an important objective assessment to augment and support clinical practice.
There are several significant limitations associated with existing VSA estimates in the context of speech production disorders. The first two limitations are that VSA calculations are based only on point (triangle) or corner (quadrilateral) vowels, rather than all vowels, and that these vowels are produced in isolation (typically hVd). This methodology was borrowed from the study of vowel production in healthy speech to examine vowel distinctiveness as described above.8 This makes good sense from the standpoint of defining the most disparate regions of the vowel space (and, by extension, the maximal articulatory excursions) in a way that is free of extraneous coarticulatory influences. However, previous research using VSA estimations on disordered speech has not shown the ability to robustly elicit and/or capture speech production deficits and intelligibility in a clinically meaningful way. There is every reason to believe that when the VSA is globally reduced, as in speech production disorders, more sensitive methodology is required. One possibility is to sample the entire articulatory working space, and characterize its shape, to fully account for the extent of articulatory displacements and their acoustic consequences. It also may be useful to extract vowel formant information from productions in connected speech rather than single word productions to magnify the impact of the underlying movement disorder. Finally, the third, and perhaps most important limitation from an applied standpoint, is that the traditional VSA estimation process is cumbersome, requiring phonetic segmentation of input speech.
In an effort to overcome these limitations and move closer to a clinical tool, the present report describes a novel alternative for VSA estimation that (1) is fully automated, (2) can be collected from any length or variety of speech material that contains a range of vowels, and (3) considers all vowels produced rather than estimating the shape of the VSA with a triangle or quadrilateral. The algorithm relies on a series of automated tools for extracting all formants from voiced sections of speech, thereby removing the need for hand segmentation. This is followed by a clustering and area calculation algorithm based on the convex hull of the cluster centers to estimate the final VSA. The proposed algorithm is applied to healthy speech and then compared against an estimate of the vowel space quadrilateral area formed from hand-segmented speech of the same sample.29 Results show that the automated estimate exhibits a strong correlation with the hand-segmented estimate, and often yields a more accurate estimate of the VSA.
II. Methods
Figure 1(b) shows a block diagram of the proposed method for the automated estimation of the VSA. The algorithm can operate on any incoming speech signal that contains a range of vowels. The signal is analyzed on a frame-by-frame basis and, for each voiced frame, the first and second formants are estimated. Next, outliers are removed and the remaining points are clustered. The convex hull of the cluster centers is determined and its area is calculated. The following sections discuss the details of each step.
A. Formant extraction
A Praat script30 is used to automatically extract all F1/F2 pairs corresponding to voiced frames. The script assesses voicing on a frame-by-frame basis by estimating periodicity using an autocorrelation-based method. Only the first two formants are considered in this study; however, following the recommended Praat settings, five formants were extracted per frame below a ceiling value (5000 Hz for male speakers, 5500 Hz for female speakers). Other settings were as follows: 1 ms frame advance; 50 ms analysis window; pre-emphasis starting from 50 Hz. Internally, Praat estimates the formants by resampling to twice the ceiling of the formant search range, applying a pre-emphasis filter, windowing the speech in the time domain using a Gaussian window, and estimating the LPC coefficients using the algorithm by Burg.31,32 Processing all input speech results in an N × 2 matrix, Fp, that stores all F1/F2 pairs for a particular speaker, where N is the number of formant observations for that speaker.
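To make the idea of an autocorrelation-based voicing decision concrete, the following Python sketch labels a single analysis frame as voiced when its normalized autocorrelation has a strong peak at a lag within a plausible pitch range. It is a deliberately simplified stand-in, not Praat's pitch tracker, which is considerably more elaborate. The function name `is_voiced`, the 75 to 600 Hz pitch range, and the 0.45 threshold are illustrative assumptions.

```python
import math
import random


def is_voiced(frame, sample_rate, f0_min=75.0, f0_max=600.0, threshold=0.45):
    """Return True if the frame looks periodic within the given pitch range.

    Simplified illustration only; all default values are assumptions.
    """
    energy = sum(s * s for s in frame)
    if energy == 0.0:
        return False  # silence is treated as unvoiced
    min_lag = int(sample_rate / f0_max)
    max_lag = min(int(sample_rate / f0_min), len(frame) - 1)
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        # Autocorrelation at this lag, normalized by the zero-lag energy.
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag)) / energy
        best = max(best, r)
    return best >= threshold


sample_rate = 16000
n_samples = int(0.050 * sample_rate)  # one 50 ms analysis window

# A 100 Hz sinusoid stands in for a voiced frame.
tone = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(n_samples)]

# Seeded white noise stands in for an unvoiced frame.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

tone_is_voiced = is_voiced(tone, sample_rate)
noise_is_voiced = is_voiced(noise, sample_rate)
```

In a full pipeline, only frames labeled as voiced would be passed on to formant estimation.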
B. Filtering
Automated formant estimation algorithms can produce outliers. To identify these extreme values, the probability distribution of each speaker's formants, Fp, is modeled using a Gaussian mixture model (GMM), and low-likelihood points are identified and removed. The use of GMMs is common in speech processing applications.33 The weight, mean, and (full) covariance matrix for each of the four component densities in the Gaussian mixture are learned using the expectation maximization (EM) algorithm. For each F1/F2 pair in Fp, the log-likelihood is calculated, and pairs whose likelihood falls below a threshold defined relative to the mean likelihood of all observations in Fp are identified as outliers and removed from downstream processing. The remaining points form the filtered parameter set. The outlier filtering rejected approximately 15% of the total number of formant observations for a particular speaker.
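The likelihood-based filtering step can be sketched as follows. This Python example assumes the mixture parameters have already been fitted (for instance by EM) and only illustrates scoring and thresholding. The mixture parameters and data points are made-up placeholders, the function names are ours, and using the mean log-likelihood itself as the default threshold is an assumption, since the exact threshold is not recoverable from the text above. The demonstration uses two components for brevity; the same code works for four.

```python
import math


def gaussian_logpdf_2d(x, mean, cov):
    """Log density of a 2-D Gaussian with a full 2x2 covariance matrix."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Mahalanobis distance using the closed-form inverse of a 2x2 matrix.
    maha = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return -0.5 * (maha + math.log(det)) - math.log(2.0 * math.pi)


def gmm_loglik(x, weights, means, covs):
    """Log-likelihood of one point under a Gaussian mixture (log-sum-exp)."""
    terms = [math.log(w) + gaussian_logpdf_2d(x, m, c)
             for w, m, c in zip(weights, means, covs)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def filter_outliers(points, weights, means, covs, threshold=None):
    """Keep points whose log-likelihood is at least the threshold.

    If no threshold is given, the mean log-likelihood is used (an assumption).
    """
    ll = [gmm_loglik(p, weights, means, covs) for p in points]
    if threshold is None:
        threshold = sum(ll) / len(ll)
    return [p for p, value in zip(points, ll) if value >= threshold]


# Hypothetical fitted mixture (placeholder values, in Hz and Hz^2).
weights = [0.5, 0.5]
means = [(500.0, 1500.0), (300.0, 2300.0)]
covs = [[[10000.0, 0.0], [0.0, 40000.0]], [[10000.0, 0.0], [0.0, 40000.0]]]

# Three plausible (F1, F2) pairs and one obvious tracking error.
points = [(500.0, 1500.0), (520.0, 1480.0), (480.0, 1550.0), (2500.0, 4000.0)]
kept = filter_outliers(points, weights, means, covs)
```

Passing an explicit `threshold` makes the rejection rate tunable, which matters because a threshold at the mean log-likelihood can reject a large share of points when the likelihoods are not heavily skewed.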
C. Clustering
Following outlier rejection, the remaining points are clustered using the k-means algorithm.34 Twelve cluster centers (one for each of the 12 English vowels) were initialized using the mean F1/F2 values reported by Hillenbrand5 at 50% vowel duration. The cluster centers were initialized separately for adult males and females, using the respective reported values, and the returned cluster centers are denoted by Kp.
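A minimal Python sketch of k-means with user-supplied initial centers is given below. It implements the standard Lloyd iteration and is not the authors' implementation. The initial centers and data points are placeholders (three centers rather than twelve, and not the published reference values), and the function name `kmeans` is ours.

```python
def kmeans(points, init_centers, n_iter=100):
    """Lloyd's algorithm in 2-D, starting from the supplied centers."""
    centers = [tuple(c) for c in init_centers]
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda j: (p[0] - centers[j][0]) ** 2 + (p[1] - centers[j][1]) ** 2,
            )
            groups[nearest].append(p)
        # Update step: move each center to the mean of its points.
        new_centers = []
        for center, group in zip(centers, groups):
            if group:
                new_centers.append((
                    sum(p[0] for p in group) / len(group),
                    sum(p[1] for p in group) / len(group),
                ))
            else:
                new_centers.append(center)  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers


# Hypothetical initial (F1, F2) centers in Hz; placeholders, not published values.
init_centers = [(320, 2200), (650, 1700), (400, 1000)]

# Hypothetical filtered formant pairs in Hz.
points = [(290, 2250), (310, 2350), (680, 1750), (720, 1850), (340, 950), (360, 850)]

cluster_centers = kmeans(points, init_centers)
```

Initializing from reference vowel positions, rather than at random, keeps each returned center loosely tied to one vowel category and makes the result deterministic.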
D. Convex hull/area calculation
Using the Quickhull35 algorithm in MATLAB,36 the convex hull of the set of points in Kp is found, yielding the clockwise-ordered vertices of the resulting convex polygon (beginning and ending with the same point). The area of the polygon with m corners is then computed from these ordered vertices using the determinant form of the polygon area formula, written with a slight abuse of determinant notation.
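The hull and area computation can be illustrated with the short Python sketch below. Note the substitution: it uses Andrew's monotone chain algorithm instead of Quickhull, because it is compact and returns the same hull for a 2-D point set. The area uses the shoelace formula, A = ½ |Σ (x_i y_{i+1} − x_{i+1} y_i)|, with indices wrapping around the polygon; taking the absolute value makes the result independent of whether the vertices are ordered clockwise or counterclockwise. The function names and sample points are ours.

```python
def convex_hull(points):
    """Convex hull of 2-D points via Andrew's monotone chain.

    Returns the hull vertices in counterclockwise order, without repeating
    the first vertex at the end.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive for a left turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula; the absolute value removes the orientation sign."""
    total = 0.0
    for i in range(len(vertices)):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % len(vertices)]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


# Hypothetical cluster centers: four corners of a square plus one interior point.
centers = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
hull = convex_hull(centers)
area = polygon_area(hull)
```

Interior cluster centers, such as the point (1, 1) here, do not affect the area; only the centers on the boundary of the vowel space determine the estimate.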
E. Stimuli
Speech samples were drawn from the TIMIT corpus commissioned by DARPA.30 The TIMIT corpus consists of 6300 sentences: 10 sentences spoken by each of 630 speakers, each from 1 of 8 major dialect regions37 of the United States. The TIMIT corpus includes hand-verified, time-aligned orthographic, phonetic, and word transcriptions as well as 16-bit, 16 kHz speech waveform files for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI), and Texas Instruments, Inc. (TI). The speech material consists of phonetically diverse sentences intended to expose dialectal variants of the speech.
III. Results and discussion
The output of the automated metric is compared to a traditional VSA metric computed from hand-segmented speech, using Pearson correlation coefficients computed by speaker and by dialect region.
A. Performance analysis
To assess the performance of the proposed method, several comparisons were made between the proposed method and a control method. The control method uses the traditional VSA computation paradigm by utilizing the metadata provided with the TIMIT corpus. More specifically, the mean formant frequencies are calculated over all occurrences of each corner vowel, and the area of the resulting quadrilateral is computed. An estimate of the VSA was computed for each of the 630 speakers, utilizing all ten sentences per speaker (xi[n], i = 1,…,10), for both the proposed and control methods. When divided by sex, male and female speakers yield correlation coefficients of ρ = 0.79098 and 0.74681, respectively. The proposed method yields a correlation coefficient of ρ = 0.77553 when computed over all 630 speakers. A scatter plot of the data is shown in Fig. 2 and the results are summarized in Table I.
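For reference, the Pearson correlation coefficient between two sets of per-speaker estimates can be computed as in the Python sketch below. The VSA values are made-up placeholders rather than results from this study, and the function name `pearson` is ours.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-speaker VSA estimates in Hz^2 (placeholders, not study data).
proposed = [4.1e5, 5.0e5, 3.6e5, 6.2e5, 4.7e5]
control = [2.9e5, 3.8e5, 2.7e5, 4.1e5, 3.9e5]

rho = pearson(proposed, control)
```

Because the coefficient is invariant to linear rescaling, a high value is compatible with one method producing systematically larger areas than the other.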
(Color online) A scatter plot showing the estimated VSA obtained using the proposed and control methods for each of the 630 speakers in the TIMIT corpus for (a) male speakers, (b) female speakers. Male and female speakers yield correlation coefficients of ρ = 0.79098 and 0.74681, respectively. The proposed method yields a correlation coefficient of ρ = 0.77553 over all speakers.
Table I. Correlation between the proposed and control methods.
Case | By speaker | By dialect region |
---|---|---|
Male | 0.79098 | 0.50937 |
Female | 0.74681 | 0.52836 |
All | 0.77553 | 0.60118 |
Similar analyses were performed comparing estimates of the VSA corresponding to an entire dialect region. When estimating the VSA for the eight dialect regions by sex, the estimates yield correlation coefficients of ρ = 0.50937 and 0.52836 for male and female speakers, respectively. The proposed method yields a correlation coefficient of ρ = 0.60118 when estimating the VSA for a dialect region using both male and female speakers. Again, the results are summarized in Table I.
Overall, the proposed method correlates strongly with the control method. However, the proposed method may actually yield a more accurate result than the conventional method, because the conventional method restricts the vowel space area to the region enclosed by only four of the twelve English vowels (the corner vowels). In reality, many F1/F2 pairs fall outside this region and contribute to the overall shape of the vowel space. This is readily seen in Fig. 3, by comparing the VSA as bounded using the proposed and control methods. The proposed metric results in consistently larger VSA estimates, but also more accurately accounts for the actual shape of the VSA. This may provide a more complete assessment of the contribution of VSA to intelligibility and subsequent decrements.
(Color online) The VSA for three speakers as bounded using the proposed (dashed line) and control (dash-dot line) methods overlaid on the filtered points (small gray dots). The mean corner vowels Kc (large squares) and the cluster centers Kp (large dots) are also shown. The proposed method better accounts for the actual shape of the VSA. The axes have been chosen so that the plots have the same orientation as the standard IPA vowel trapezium.
It is important to note that a key requirement of the algorithm is that the vowel space is adequately sampled. This means that the analyzed content must be phonetically balanced or consistent across individuals for comparison. By design, the TIMIT corpus satisfies this requirement. For clinical applications of this work, clinicians will have the option of specifying the spoken text, ensuring that the incoming speech stream is balanced.
IV. Conclusion
The assessment of speech intelligibility is the cornerstone of clinical practice in speech-language pathology, as it indexes a patient's communicative handicap. There has been a desire to develop efficient, objective, and reliable measures that can be added to the clinical repertoire. Given the relationship of VSA and intelligibility decrements20–22,24,38 it is critical to have a sensitive and efficient assessment of VSA; this includes the exploration of a more complete assessment of the vowel space, by including the complete range of vowels in spoken language. In the current investigation, an automated assessment of the VSA demonstrated a strong relationship with the traditional methods of VSA derivation.
Moreover, the proposed method is fully automated and was demonstrated to capture a more complete assessment of the VSA by allowing for arbitrary VSA shapes, rather than only triangle- or quadrilateral-shaped VSAs. Moving forward, the proposed calculation of VSA will be related to intelligibility ratings to understand its relationship with intelligibility decrements. The success with which the automated procedure estimated the VSA, along with the ease of computation, makes the proposed method an attractive metric for characterizing speech motor control.
Acknowledgments
This research was supported in part by National Institutes of Health, National Institute on Deafness and Other Communication Disorders Grants Nos. 2R01DC006859 (J.M.L.) and 1R21DC012558 (J.M.L. and V.B.).