Using a known speaker-intrinsic normalization procedure, formant data are scaled by the reciprocal of the geometric mean of the first three formant frequencies. This reduces the influence of the talker but results in a distorted vowel space. The proposed speaker-extrinsic procedure re-scales the normalized values by the mean formant values of vowels. When tested on the vowel formant data published by Peterson and Barney, the combined approach yields well-separated clusters by reducing the spread due to talkers. With vowel classification accuracy as the objective measure, the proposed procedure performs better than two top-ranked normalization procedures.

## 1. Introduction

Formant frequencies measured over the mid-part of a vowel of American English, spoken in the same context (/hVd/) by talkers of different age and gender and unanimously labelled by native listeners, show a considerable spread in the *F*_{2} versus *F*_{1} space.^{1} This has motivated researchers to look for a suitable transformation or normalization of the measured raw formant data to bring out the underlying invariance of vowels. The normalization is expected to reduce the spread in the formant data arising from the talker's gender and age, while preserving the relative mean positions of the vowels as in the original formant space.^{2,3}

The literature on vowel normalization is vast and spans more than six decades, which precludes a critical review in this short paper. We cite some secondary sources. Adank^{4} gives a review of the literature up to 2003. The effectiveness of selected vowel normalization methods has been compared based on certain objective criteria.^{5–7} Carpenter and Govindarajan^{8} give a brief description as well as an evaluation of 32 intrinsic and 128 extrinsic procedures for the vowel classification task. Normalization in the context of sociolinguistics has also been reported.^{7,9}

Some important milestones of research in this area are briefly covered. On average, the vocal tract length (VTL) of an adult female (or child) is shorter than that of an adult male. In theory, this implies that all formant frequencies scale inversely with VTL. However, the ratio of the mean formant frequency of adult female speakers to that of male speakers is both vowel and formant dependent,^{10} varying over a wide range of 1.03 to 1.30 for the data published by Peterson and Barney^{1} (abbreviated as P&B). This wide range, combined with the fact that the mean formant frequency of adult female speakers has been reported to be lower than that of adult male speakers for a specific Swedish vowel,^{10} has led researchers to speculate that factors other than VTL, such as possible gender-based differences in articulation, may also contribute to the noted differences in the formant ratios.^{11} *F*_{0} has also been considered as an additional parameter for disambiguating vowels. For normalization, researchers have proposed differences such as $(F_1 - F_0),\ (F_2 - F_1),\ (F_3 - F_2)$ and ratios such as $(F_2/F_1),\ (F_3/F_2)$, etc., in various frequency scales such as Koenig, log, mel, or Bark.^{12,13}

The best-performing normalization procedure for automatic vowel classification yields only about 80% accuracy even with the controlled context of the P&B data.^{8} Despite the availability of a large number of procedures, a fully satisfactory solution for normalization is yet to emerge.^{6} This has motivated us to propose an intrinsic-cum-extrinsic normalization procedure, resulting in what we refer to as de-normalized formants. The effectiveness of the combined procedure in reducing the influence of the talker's age and gender is illustrated using the P&B data. Vowel classification using the pooled de-normalized formant values of all speakers (adult male, adult female, and child) is shown to give a very high accuracy (95%). The performance of the proposed procedure compares well with, or is better than, that of two top-ranked normalization procedures.^{4,5}

## 2. Proposed method

### 2.1 Intrinsic normalization

The geometric mean of the first three formant frequencies^{14,15} of a speaker's vowel sample is given by

$$GM123 = [F(1)\,F(2)\,F(3)]^{1/3}, \tag{1}$$

where *F*(*i*) corresponds to the *i*th raw formant frequency in Hz. Let *AM*(*i*) and *AF*(*i*), *i* = 1, 2, 3, denote the mean values of the first three formant frequencies of adult males and females, respectively. Assuming $AF(i)=\alpha\,AM(i)$, the ratio of geometric means, GM123(female)/GM123(male), is equal to *α*. Hence GM123 may be expected to normalize any uniform scaling of the formant frequencies arising due to gender and age. The normalized formant frequency^{14,15} of a given vowel sample is given by the ratio

$$NF(i) = \frac{F(i)}{GM123}, \quad i = 1, 2, 3, \tag{2}$$

where the ratio *NF*(*i*) is a dimensionless quantity. Equation (2) makes use of speaker-specific data of the first three formants of only the given vowel sample. Strictly speaking, the procedure should therefore be called “speaker-intrinsic, formant-extrinsic, and vowel-intrinsic” normalization.^{5} For the sake of brevity, we refer to it as intrinsic normalization.
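The intrinsic normalization described above can be illustrated with a minimal Python sketch. The function names `gm123` and `normalize_intrinsic`, the sample formant values, and the scale factor `alpha` are assumptions introduced only for illustration; they are not taken from the P&B data. The sketch also checks the property used in the argument: a uniform scaling of all three formants by *α* cancels in the ratio.

```python
import math


def gm123(f1, f2, f3):
    """Geometric mean of the first three formant frequencies (Hz)."""
    return (f1 * f2 * f3) ** (1.0 / 3.0)


def normalize_intrinsic(formants):
    """Return the dimensionless ratios NF(i) = F(i) / GM123 for one vowel sample."""
    g = gm123(*formants)
    return tuple(f / g for f in formants)


# Hypothetical vowel sample (F1, F2, F3) in Hz, chosen only for illustration.
sample = (300.0, 2300.0, 3000.0)

# Hypothetical uniform scale factor, e.g. a shorter vocal tract.
alpha = 1.15
scaled = tuple(alpha * f for f in sample)

nf = normalize_intrinsic(sample)
nf_scaled = normalize_intrinsic(scaled)

# The geometric mean scales by alpha, so the normalized values coincide.
print(nf)
print(nf_scaled)
```

By construction, the product of the three normalized values is 1, so only two of them are independent.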

GM123 has a wide range of about 644 Hz (vowel /u/ of an adult male speaker) to 1400 Hz (vowel /æ/ of the same speaker) for the P&B data, i.e., a factor of more than 2. However, for a given speaker, VTL varies only by about 10% for different vowels. The over-correction in intra-speaker normalization results in a distortion of the vowel space. Due to the very low value of GM123 for back rounded vowels, in the *NF*_{2} versus *NF*_{1} space, these vowels lie above vowel /ɑ/ along the /ɑ/-/i/ direction instead of lying below /ɑ/ in the /ɑ/-/u/ direction as in the raw formant space. In order to restore the original relative vowel positions, we propose an extrinsic de-normalization procedure.

### 2.2 Proposed extrinsic de-normalization procedure

Assumptions: In a normalization procedure, it is incorrect to assume the vowel identity of a sample to be known. Hence, the statistics of the formant data across all vowels, instead of vowel specific statistics, are used in the existing extrinsic procedures.^{4–6,11} However, we make use of vowel specific statistics, the mean *μ*(*i*, *j*), and the standard deviation *σ*(*i*, *j*) of vowel *j*. During the process of the proposed extrinsic normalization, the identity of the vowel sample is also determined. Since *μ*(*i*,*j*) and *σ*(*i*, *j*) depend solely on a specific formant *i* of a specific vowel *j*, the procedure is “formant-intrinsic” and “vowel-intrinsic.”^{5} Since the statistics represent the average across speakers, it is “speaker-extrinsic.” For the sake of brevity, we use the term “extrinsic.”

Development of the proposed procedure: We define the geometric mean of the average formant frequencies for a given vowel *j* as

$$GMA123 = [\mu(1,j)\,\mu(2,j)\,\mu(3,j)]^{1/3}. \tag{3}$$

Initially, we explored using the ratio GMA123/GM123 as the normalization factor in Eq. (2) instead of the reciprocal of GM123. The rationale is that while GM123 is expected to normalize for the inter-speaker differences, the factor GMA123 would restore the relative vowel positions. Further, the normalized values would then have the unit of Hz, with a range comparable to that of the raw formant data. However, both GM123 and GMA123 are common scale factors for all three formants of a given vowel *j*, whereas, as noted in Sec. 1, formant ratios are both formant and vowel dependent. Hence we propose *μ*(*i*, *j*) itself as a scaling factor since it is both formant (*i*) and vowel (*j*) dependent.

Proposed extrinsic de-normalization: The intrinsically normalized formant values *NF*(*i*) of a vowel sample are transformed to what we refer to as the de-normalized values. Since the vowel identity of a test sample is unknown, we use a “hypothesize-test” paradigm. Let *V* be the number of vowels in the database. We hypothesize the index *J*, one at a time, of the unknown vowel and for each hypothesis *J*, the de-normalized formant value is determined as

$$DF(i,J) = NF(i)\,\mu(i,J), \quad i = 1, 2, 3. \tag{4}$$

In our study, we find that the mapping from the dimensionless NF to DF with the unit in Hz does not affect the results.^{16} Each vowel sample *NF*(*i*) maps to *V* de-normalized values, *DF*(*i*, *J*), for hypotheses *J* = 1, …, *V*, of which only one hypothesis has to be selected. Denoting by $\bar{\mu}_i(J)$ the mean of the de-normalized *i*th formant data of vowel *J*, we test each hypothesis by computing the distance between the de-normalized first two formants and the mean values of the corresponding de-normalized formant data of the hypothesized vowel as

$$D(J) = \mathit{Distance}\langle [DF(1,J),\, DF(2,J)],\ [\bar{\mu}_1(J),\, \bar{\mu}_2(J)] \rangle, \tag{5}$$

where *Distance*$\langle\,\rangle$ denotes an appropriate distance measure (see Sec. 3.2). The third formant frequency has an indirect influence via *NF*(*i*). Let $\bar{J}$ be the index for which *D*(*J*) is the minimum. The vowel index is postulated as $\bar{J}$. Only $DF(i,\bar{J})$ is taken as the de-normalized value. That is, *NF*(*i*) maps to $DF(i,\bar{J})$ in the de-normalized space. This procedure at once achieves vowel de-normalization as well as vowel classification.
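The hypothesize-test steps described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it assumes that de-normalization multiplies *NF*(*i*) by the vowel mean *μ*(*i*, *J*), and that the distance is a Euclidean distance weighted by the inverse standard deviation of the de-normalized training data. The names `fit` and `classify`, the two-vowel inventory, and all formant values are hypothetical.

```python
import math
import statistics


def normalize_intrinsic(formants):
    """NF(i) = F(i) / GM123 for one sample given as (F1, F2, F3) in Hz."""
    gm = (formants[0] * formants[1] * formants[2]) ** (1.0 / 3.0)
    return [f / gm for f in formants]


def fit(labeled):
    """labeled: dict vowel -> list of (F1, F2, F3). Returns per-vowel statistics."""
    model = {}
    for vowel, samples in labeled.items():
        # Mean raw formants mu(i, j) of this vowel across speakers.
        mu = [statistics.mean([s[i] for s in samples]) for i in range(3)]
        # De-normalize each training sample with the mean of its own (known) vowel.
        df = [[n * m for n, m in zip(normalize_intrinsic(s), mu)] for s in samples]
        df_mean = [statistics.mean([d[i] for d in df]) for i in range(2)]
        df_std = [statistics.pstdev([d[i] for d in df]) for i in range(2)]
        model[vowel] = (mu, df_mean, df_std)
    return model


def classify(formants, model):
    """Try every vowel hypothesis J; keep the one with the smallest distance D(J)."""
    nf = normalize_intrinsic(formants)
    best = None
    for vowel, (mu, df_mean, df_std) in model.items():
        df = [n * m for n, m in zip(nf, mu)]  # DF(i, J) under hypothesis J
        # Assumed distance: inverse-std weighted Euclidean over the first two formants.
        d = math.sqrt(sum(((df[i] - df_mean[i]) / df_std[i]) ** 2 for i in range(2)))
        if best is None or d < best[0]:
            best = (d, vowel, df)
    return best[1], best[2]


# Hypothetical training data (Hz) for two vowels, three talkers each.
training = {
    "i": [(270, 2290, 3010), (310, 2790, 3310), (370, 3200, 3730)],
    "a": [(730, 1090, 2440), (850, 1220, 2810), (1030, 1370, 3170)],
}
model = fit(training)

label, df_selected = classify((290, 2500, 3100), model)
print(label, df_selected)
```

The function returns both the postulated vowel label and the de-normalized formants for the selected hypothesis, mirroring the observation that de-normalization and classification are obtained in one pass.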

A parallel to perceptual studies: Utilizing the mean and standard deviation values implies having *a priori* knowledge of the vowel space of a given language. Listeners' identification performance is known to degrade if anomalous information is given about the speaker's gender (male/female)^{17} or the language (American English/Canadian English).^{18} This suggests that a listener's perceptual identification of vowels improves with *a priori* knowledge of (or familiarity with) the talker's identity, gender, or language. It is speculated that listeners use a “cognitive frame of reference” of the talker.^{11} With this background, the use of *a priori* knowledge of the mean and standard deviation values of vowel formant data appears justified.

### 2.3 Experimental results and discussion

We have used the P&B data^{19,20} for illustrating the procedure. There are 66, 56, and 30 samples for “men,” “women,” and “children” categories, respectively. We have considered all the (nine) vowels excluding the retroflex vowel /ɝ/. In the illustrations to follow, a vowel triangle^{5,6,22} based on the mean values of the three corner vowels is also shown for the adult male and female speakers. Its relevance is discussed in Sec. 3.1. We have followed the convention used by P&B in selecting the orientation of the plot with vowel /u/ near the bottom-left of the graph. In all the figures, the same notation as given in Fig. 1 is followed.

A plot of raw formant data, *F*_{2} versus *F*_{1}, is shown in Fig. 1. For the front vowels, the data show a wide spread across gender and age. Also, a considerable spread is seen within each vowel. The front vowels are not well separated and some back vowels (/ʊ/ and /u/, /ɑ/ and /ɔ/) heavily overlap. Also see Fig. 8 of Peterson and Barney^{1} and Fig. 3 of Miller.^{13}

In the de-normalized formant space, both the inter- and intra-speaker spread is reduced considerably (*DF*_{2} versus *DF*_{1} plot of Fig. 2). The relative positions of vowels are preserved as in the raw formant data space. Tense/lax and high/low front vowels form distinct clusters. The separation amongst back vowels is surprisingly good. Clusters for the vowel pairs (/ʊ/ and /u/) and (/ɑ/ and /ɔ/) are also reasonably well separated.

## 3. Comparison with other methods

### 3.1 Formant plots and vowel triangles

The plots of formant values normalized using the z-score (*Z*_{2} versus *Z*_{1}) and S-centroid (*S*_{2} versus *S*_{1}) procedures are shown in Figs. 3 and 4, respectively. The spread in the data points arising from gender and age differences is reduced for both procedures. However, the vowel samples are widely scattered. In the case of the S-centroid procedure, clustering is very good only for vowel /i/ as it acts as a reference corner. It is difficult to infer the number of vowels from the plots shown for z-score and S-centroid. In the de-normalized formant space, one distinct cluster per vowel is seen (Fig. 2).
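For readers unfamiliar with the z-score baseline, a minimal sketch is given here. It assumes that the z-score procedure is the usual speaker-wise standardization, in which each formant of a talker is centred on that talker's mean over all vowels and divided by that talker's standard deviation. The function name `zscore_normalize`, the use of the sample standard deviation, and the token values are assumptions made for illustration.

```python
import statistics


def zscore_normalize(speaker_tokens):
    """Standardize (F1, F2) tokens of ONE speaker across all of that speaker's vowels.

    speaker_tokens: list of (F1, F2) pairs in Hz.
    Returns a list of (Z1, Z2) pairs, dimensionless.
    """
    columns = []
    for i in range(2):
        values = [token[i] for token in speaker_tokens]
        mean = statistics.mean(values)
        std = statistics.stdev(values)  # sample standard deviation (an assumption)
        columns.append([(v - mean) / std for v in values])
    return list(zip(columns[0], columns[1]))


# Hypothetical tokens (Hz) of a single talker, one per vowel.
tokens = [(300.0, 2300.0), (450.0, 1900.0), (700.0, 1200.0), (350.0, 900.0)]
z = zscore_normalize(tokens)
print(z)
```

Unlike the intrinsic normalization of Sec. 2.1, this computation needs tokens of several vowels from the same talker before any single token can be normalized.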

One of the ways to study the effectiveness of a normalization procedure is to compare the overlap of vowel triangles for male (VTM) and female (VTF) speakers.^{5,6,22} We give only a qualitative comparison. For the raw data (Fig. 1), VTF is much bigger than VTM and is significantly displaced upwards and to the right. For the proposed procedure (Fig. 2), VTF and VTM almost overlap except for a slight mismatch in the /i/-/ɑ/ direction. For the z-score normalization (Fig. 3), VTF is smaller than VTM with a slight mismatch in the /i/-/u/ direction. For the S-centroid method (Fig. 4), it is difficult to discern the two vowel triangles as the overlap is almost complete. A vowel triangle is determined by only three normalized parameters, *F*_{1} and *F*_{2} of /i/ and *F*_{1} of /ɑ/ and hence it does not reflect the spread of data. We propose to use the accuracy of vowel classification as an objective measure for a comparison of different normalization procedures.

### 3.2 Vowel classification accuracy as an objective measure

We assume a labeled database of formants of a given language to be available. The set of formant frequencies (*F*_{1}, *F*_{2}) in mel is used as the feature vector. The mean values $\bar{\mu}_1$ and $\bar{\mu}_2$ represent the vowel space. Given the test formant data, its nearest vowel in the vowel space is declared as the identity of the test vowel and compared with the known label. The overall accuracy for all the samples is determined. A similar procedure is applied on the normalized formant values of z-score and S-centroid procedures. Vowel classification is a part of the proposed procedure, as already noted in Sec. 2.2. We have used a weighted Euclidean distance (WED) measure given by
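The nearest-mean classification of raw formants described above can be sketched as follows. Two choices in the sketch are assumptions rather than details taken from the text: the Hz-to-mel conversion uses one common formula, and the WED weights each dimension by the inverse of the vowel's standard deviation. The names `hz_to_mel`, `wed`, and `classify_raw`, and the statistics in `stats`, are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """One common Hz-to-mel formula (an assumption; other variants exist)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def wed(x, mean, std):
    """Weighted Euclidean distance, assuming inverse-standard-deviation weights."""
    return math.sqrt(sum(((xi - mi) / si) ** 2 for xi, mi, si in zip(x, mean, std)))


def classify_raw(f1_hz, f2_hz, stats):
    """Return the vowel whose mean (in mel) is nearest to the test token."""
    x = (hz_to_mel(f1_hz), hz_to_mel(f2_hz))
    return min(stats, key=lambda v: wed(x, stats[v][0], stats[v][1]))


# Hypothetical vowel statistics: (mean in mel, standard deviation in mel).
stats = {
    "i": ((hz_to_mel(300.0), hz_to_mel(2500.0)), (40.0, 80.0)),
    "a": ((hz_to_mel(800.0), hz_to_mel(1200.0)), (60.0, 70.0)),
}

print(classify_raw(320.0, 2400.0, stats))
print(classify_raw(750.0, 1250.0, stats))
```

Overall accuracy then follows by comparing the returned label with the known label of each token and averaging over all tokens.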

Selection of test samples: For the P&B database, for the gender-independent (MW) case, formant data of “men” and “women” categories and for the gender-age-independent (MWC) case, formant data of all the three categories are pooled together and used. Improvement in vowel classification accuracy, computed with the pooled normalized formant values over the accuracy obtained with the pooled raw formant data, is considered as a measure of the effectiveness of the normalization procedure. The vowel dependent statistics ($\bar{\mu},\ \bar{\sigma}$) are computed on the raw and normalized (or de-normalized) pooled formant data using the known labels. For automatic vowel classification, the statistics are to be computed from a training set.

Results: The classification accuracies for the raw data, S-centroid, z-score, and the proposed procedures are [82.9%, 85.0%, 85.7%, 95.2%] for the MW case and [77.2%, 84.5%, 84.4%, 94.9%] for the MWC case, respectively. The proposed procedure gives the highest accuracy of about 95%, roughly 10 percentage points higher than the S-centroid and z-score normalization procedures and about 12 (18) percentage points higher than the raw data for the MW (MWC) case.
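The gains quoted above are differences in percentage points and can be checked directly from the reported accuracies. The short sketch below does only this arithmetic; the dictionary layout and the function name `gains` are illustrative choices.

```python
# Reported classification accuracies (%) for the two pooled cases.
accuracy = {
    "MW": {"raw": 82.9, "S-centroid": 85.0, "z-score": 85.7, "proposed": 95.2},
    "MWC": {"raw": 77.2, "S-centroid": 84.5, "z-score": 84.4, "proposed": 94.9},
}


def gains(case):
    """Percentage-point gain of the proposed procedure over each other method."""
    proposed = accuracy[case]["proposed"]
    return {
        method: round(proposed - value, 1)
        for method, value in accuracy[case].items()
        if method != "proposed"
    }


for case in ("MW", "MWC"):
    print(case, gains(case))
```

The gain over the two normalization baselines lies between 9.5 and 10.5 percentage points, and the gain over the raw data is 12.3 (MW) and 17.7 (MWC) percentage points.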

## 4. Conclusion

We have used vowel dependent statistics and proposed an intrinsic-cum-extrinsic procedure along with a “hypothesize-and-test” paradigm. For the given P&B database, the large spread observed in the acoustic space for different vowels and talkers has been effectively reduced. Clear clusters have emerged in the de-normalized formant space. With vowel classification accuracy as the objective measure, the proposed procedure performs better than two top-ranked procedures in removing the influence of gender and age. For future work, comparison with other normalization procedures using rigorous objective measures may be undertaken, and the applicability of the proposed procedure to a larger database and to areas such as sociolinguistics, language change, and the influence of accent may be explored. The proposed procedure can also be applied to normalized data obtained with other procedures.