This paper proposes an efficient method to improve speaker recognition performance by dynamically controlling the ratio of phoneme class information. It exploits the fact that each phoneme carries a different amount of speaker discriminative information, which can be measured by mutual information. After classifying phonemes into five classes, the optimal ratio of each class in both the training and testing processes is adjusted using a non-linear optimization technique, the Nelder–Mead method. Speaker identification results verify that the proposed method achieves an 18% relative improvement in error rate over a baseline system.

## I. Introduction

Automatic speaker recognition, the task of verifying a speaker's identity from his/her voice, has advanced greatly in recent years.^{1} It typically uses spectral features such as Mel-frequency cepstral coefficients (MFCCs), but the characteristics of spectral features vary with the phonetic content of the input speech. Since speaker recognition systems are usually designed to be text-independent for flexibility, the spectral variation in the input signal is very high. Due to limitations on the number of input features, however, it is not easy to capture all the characteristics of a speaker in a stochastic model such as a Gaussian mixture model (GMM).^{2,3} In other words, to improve speaker recognition performance, it is very important to build a GMM that represents speaker characteristics under realistic conditions. Various approaches have been proposed to overcome this limitation by utilizing phoneme information.^{4–7}

In this paper, we also propose a method that utilizes phoneme information to improve speaker recognition performance. While previous studies focus on building separate models for each phoneme and combining their scores,^{4,6,7} we focus on finding an optimal phoneme class ratio, i.e., the portion of each phoneme class in an utterance, that maximizes speaker recognition performance based on mutual information. In speaker recognition, several researchers have used mutual information to quantify or improve recognition accuracy.^{8,9} In this paper, we experimentally re-evaluate the speaker discriminative power of each phoneme class using mutual information and then find the optimal phoneme class ratio. We adopt the Nelder–Mead method, which is widely used for nonlinear optimization of multidimensional problems.^{10} Experimental results show that the optimal phoneme class ratio differs somewhat from that of normal speech: the portion of consonants is increased, while that of vowels is reduced. The approach can be applied to both training and testing, but the improvement is more significant in testing. Speaker identification results show that the proposed system using the optimal phoneme class ratio performs around 18% better than a conventional system.

The rest of this paper is organized as follows. First, we review the concept of mutual information, which measures the speaker discriminative power of given speech signals, and suggest how to find the optimal phoneme class ratio based on aspects of information theory in Sec. II. Section III shows the experimental setup and results, which verify the usefulness of the proposed algorithm. The conclusions and future work are given in Sec. IV.

## II. Optimization of Phoneme Class Ratio

### A. Problem formulation

Mutual information represents the amount of information shared by two given random variables. For the speaker class variable $C$ and the feature set $X$, it is defined as

$$I(C;X) = H(C) - H(C|X). \quad (1)$$

In the equation, $H(C)$ denotes the entropy of a specific speaker presence, and $H(C|X)$ denotes the entropy of a specific speaker presence when the feature set $X$ is given. Eriksson *et al.*^{8} showed that the error rate of a speaker recognition system decreases as mutual information increases. In Eq. (1), $H(C)$ can be simplified as follows, assuming that the distribution of the speaker presence probability is uniform:

$$H(C) = -\sum_{s=1}^{S} P(s) \log P(s) = \log S, \quad (2)$$

where $S$ denotes the number of speakers and $P(s) = 1/S$ is the probability of each speaker's presence. Since $H(C)$ depends only on the number of speakers, $H(C|X)$ is the only term that affects $I(C;X)$, and minimizing $H(C|X)$ therefore maximizes $I(C;X)$. Thus, our goal is to minimize $H(C|X)$, defined as

$$H(C|X) = -\int p(x) \sum_{s=1}^{S} P(s|x) \log P(s|x) \, dx. \quad (3)$$
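As a quick numerical check of Eq. (2), with an illustrative speaker count, a uniform prior collapses the entropy to $\log S$:

```python
import numpy as np

S = 8                                  # illustrative number of speakers
P = np.full(S, 1.0 / S)                # uniform speaker prior P(s) = 1/S
H_C = -np.sum(P * np.log(P))           # H(C) = -sum_s P(s) log P(s)

# For a uniform prior this collapses to log S, so H(C) depends only on S.
assert np.isclose(H_C, np.log(S))
```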

Equation (3) can be approximated by the law of large numbers:

$$H(C|X) \approx -\frac{1}{N} \sum_{n=1}^{N} \sum_{s=1}^{S} P(s|x_n) \log P(s|x_n), \quad (4)$$

where $x_n$ is the $n$th feature and $N$ is the total number of features. We can classify $x_n$ into $K$ classes using pre-defined class information:

$$H(C|X) \approx -\frac{1}{N} \sum_{k=1}^{K} \sum_{n_k=1}^{N_k} \sum_{s=1}^{S} P(s|x_{n_k,k}) \log P(s|x_{n_k,k}), \quad (5)$$

where $N_k$ is the number of features in the $k$th class and $x_{n_k,k}$ is the $n_k$th feature in the $k$th class. Trivially, $\sum_{k=1}^{K} N_k = N$. The portion of each class's entropy can be represented by

$$H_k(p) = -\frac{1}{N_k} \sum_{n_k=1}^{N_k} \sum_{s=1}^{S} P(s|x_{n_k,k}) \log P(s|x_{n_k,k}), \quad (6)$$

where $p$ is the vector whose elements contain the ratio of each class to all features, i.e., $p_k = N_k / N$. Thus, it also satisfies the following constraint:

$$\sum_{k=1}^{K} p_k = 1. \quad (7)$$

$P(s|x_{n_k,k})$ is defined as follows:

$$P(s|x_{n_k,k}) = \frac{p(x_{n_k,k}|\lambda_{s,p})}{\sum_{s'=1}^{S} p(x_{n_k,k}|\lambda_{s',p})}, \quad (8)$$

where $\lambda_{s,p}$ is the GMM of speaker $s$ trained on features with the class ratio $p$, and $p(x_{n_k,k}|\lambda_{s,p})$ is the likelihood of feature $x_{n_k,k}$ given $\lambda_{s,p}$. Thus, we can rewrite Eq. (5) using $H_k(p)$:

$$H(C|X) = \sum_{k=1}^{K} p_k H_k(p). \quad (9)$$
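The class-weighted decomposition in Eq. (9) can be checked numerically. The sketch below uses synthetic per-frame posteriors as a stand-in for the GMM posteriors of Eq. (8); all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
S, K, N = 8, 5, 1000                       # speakers, phoneme classes, frames

# Hypothetical per-frame speaker posteriors P(s|x_n) (rows sum to 1), e.g.
# as would be obtained from Eq. (8) with per-speaker GMM likelihoods.
post = rng.dirichlet(np.ones(S), size=N)
cls = rng.integers(0, K, size=N)           # phoneme class of each frame

# Eq. (4)/(5): empirical conditional entropy over all frames.
H_CX = -np.mean(np.sum(post * np.log(post), axis=1))

# Eq. (6): per-class entropy H_k, combined with class ratios p_k = N_k / N.
H_CX_decomp = 0.0
for k in range(K):
    mask = cls == k
    p_k = mask.mean()
    H_k = -np.mean(np.sum(post[mask] * np.log(post[mask]), axis=1))
    H_CX_decomp += p_k * H_k

# Eq. (9): the class-weighted sum recovers the overall conditional entropy.
assert np.isclose(H_CX, H_CX_decomp)
```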

Similarly, if we define $I_k(p)$ to represent the portion of class $k$ in $I(C;X)$, then $I(C;X)$ can be rewritten as

$$I(C;X) = \sum_{k=1}^{K} I_k(p), \quad (10)$$

where $I_k(p)$ is defined as follows:

$$I_k(p) = p_k \left( \log S - H_k(p) \right). \quad (11)$$

Now we have to be concerned about the redundancy of each class. The redundancy of a class $k$ usually increases as the class ratio $p_k$ increases, and increasing redundancy caused by including unnecessary data actually degrades speaker recognition performance. For example, in our preliminary experiments on varying the ratio of vowels and consonants, we find that speaker identification performance is better when the ratio of vowels is 80% than when it is 90%, even though vowels are known to carry more speaker discriminative information than consonants. Thus, we also need to account for the redundancy of the data while maximizing mutual information by controlling the phoneme class ratio. One simple solution is removing the $p_k$ term from each summand of Eq. (10), because it directly relates to redundancy. Equation (12) shows the modified objective:

$$I(p) = \frac{1}{K} \sum_{k=1}^{K} \left( \log S - H_k(p) \right). \quad (12)$$

The equation denotes the average of $I_k(p)/p_k$ over all classes. Therefore, the objective of the proposed algorithm is to find the optimal ratio of phoneme classes that maximizes $I(p)$:

$$\hat{p} = \arg\max_{p} I(p). \quad (13)$$
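Once the per-class entropies $H_k(p)$ are available, the objective of Eq. (12) is cheap to evaluate. A minimal sketch with hypothetical entropy values (the function name and numbers are illustrative):

```python
import numpy as np

def objective_I(H_k, S):
    """Eq. (12): redundancy-adjusted objective, the average per-class
    mutual-information term with the p_k weight removed."""
    H_k = np.asarray(H_k, dtype=float)
    return float(np.mean(np.log(S) - H_k))

# Hypothetical per-class conditional entropies H_k(p) for S = 8 speakers.
H_k = [1.2, 1.5, 0.9, 1.1, 0.7]
I_p = objective_I(H_k, 8)
print(I_p)
```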

There is one more issue concerning the relation between mutual information and speaker recognition accuracy. According to Eriksson *et al.*,^{8} this relation becomes exact as the speaker recognition accuracy increases. If the portion of a class is very small, we cannot conclude that the class carries a large amount of speaker discriminative information even if its mutual information is large; in this case, we may regard the $I_k(p)$ of that class as meaningless. Thus, we force $I_k(p)$ to zero when $p_k$ is smaller than a certain threshold $\theta$ and $I_k(p)$ is larger than the minimum $I_k(p_{\min})$. Equation (14) shows the modification rule:

$$I_k(p) = \begin{cases} 0, & p_k < \theta \ \text{and} \ I_k(p) > I_k(p_{\min}), \\ I_k(p), & \text{otherwise}. \end{cases} \quad (14)$$

In the equation above, $\theta$ denotes the threshold on $p_k$: if $p_k$ is larger than $\theta$, we use $I_k(p)$; if $p_k$ is smaller than $\theta$, we discard $I_k(p)$ because the value is unlikely to be meaningful. $I_k(p_{\min})$ denotes the minimum of $I_k(p)$ and $p_{k,\min}$ the corresponding $p_k$. In this paper, we find $p_{k,\min}$ and $I_k(p_{\min})$ experimentally during the optimization process, as it is hard to determine the global minimum theoretically.
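The thresholding of Eq. (14) is a simple gate on the class ratio. A minimal sketch (omitting the $I_k(p_{\min})$ condition and using illustrative values; `gated_I_k` is a hypothetical helper, not from the paper):

```python
import numpy as np

def gated_I_k(I_k, p, theta):
    """Zero out I_k(p) for classes whose ratio p_k falls below the
    threshold theta, a simplified version of Eq. (14)."""
    I_k = np.asarray(I_k, dtype=float).copy()
    I_k[np.asarray(p) < theta] = 0.0
    return I_k

# Three toy classes; the first has a ratio below theta = 0.08, so its
# mutual-information term is discarded.
out = gated_I_k([3.7, 5.9, 6.5], [0.05, 0.13, 0.52], theta=0.08)
print(out)
```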

### B. Optimization method to find the optimal ratio

Since the speaker model $\lambda_{s,p}$, which appears in Eq. (8), must be retrained by the expectation-maximization algorithm whenever $p$ varies, we cannot directly find a $p$ that maximizes Eq. (12) in closed form. In this case, the Nelder–Mead method, popularly used for nonlinear optimization of multidimensional problems, is suitable.^{10} The Nelder–Mead method is in general an unconstrained optimization method, but our application has two constraints. One is given in Eq. (7), and the other is

$$0 \leq p_k \leq 1, \quad k = 1, \ldots, K. \quad (15)$$

Inclusion of these constraints does not affect the applicability of the method. The first constraint, Eq. (7), remains satisfied if we choose the initial vertices to satisfy it, since reflections and contractions of points whose elements sum to one stay on that hyperplane. To satisfy the second constraint, we adjust the coefficients used to compute the reflection and the new vertex so that the vertices remain within the feasible range.
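An alternative way to honor both constraints, shown here as a sketch, is to reparametrize $p$ through a softmax so that every candidate lies on the probability simplex by construction (the paper instead adjusts the reflection coefficients). A toy quadratic objective stands in for $-I(p)$, since the real objective requires retraining the speaker GMMs for each candidate ratio:

```python
import numpy as np
from scipy.optimize import minimize

def softmax(z):
    """Map unconstrained variables onto the probability simplex, so that
    Eq. (7) (ratios sum to one) and non-negativity hold by construction."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Toy stand-in for -I(p): distance to a hypothetical optimal ratio.
target = np.array([0.07, 0.18, 0.13, 0.08, 0.54])
def neg_I(z):
    p = softmax(z)
    return np.sum((p - target) ** 2)

res = minimize(neg_I, x0=np.zeros(5), method="Nelder-Mead",
               options={"maxiter": 2000, "xatol": 1e-9, "fatol": 1e-12})
p_opt = softmax(res.x)
assert np.isclose(p_opt.sum(), 1.0) and np.all(p_opt >= 0)
print(np.round(p_opt, 3))
```

Because the simplex vertices live in the unconstrained space, no modification of the Nelder–Mead update rules is needed with this parametrization.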

## III. Experiments and Results

### A. Experimental setup

We perform experiments to verify the feasibility of the proposed algorithm on the TIMIT corpus, which contains phoneme transcriptions for all sentences.^{11} Each phoneme in the TIMIT corpus is classified into one of seven classes: *stops, affricates, fricatives, nasals, glides* and *semivowels, vowels*, and *others*. Among these classes, affricates make up a very small portion of the TIMIT corpus and are absent for some speakers, and *others* comprises labels that are not speech segments. Thus, we use only five classes, disregarding affricates and others. MFCCs up to order 20 are extracted from a $20\,ms$ windowed speech signal, and the analysis frame is shifted every $10\,ms$ in the baseline system. The boundary regions of each phoneme are omitted to remove the effect of transition regions. To train the speaker model on the extracted MFCCs, we construct 16-mixture GMMs using five sentences for each speaker. After training GMMs for all speakers, we evaluate $I(p)$ using two sentences that are not used in training. When we evaluate the performance of the system, we use the remaining three sentences.

In the experiments, we must extract features adjusted to the given phoneme class ratio in every iteration of the algorithm, because the speaker model $\lambda_{s,p}$ is retrained according to the variation in the phoneme class ratio $p$. To adjust the phoneme class ratio, we analyze the speech signal every $0.25\,ms$ and sample features from each class to satisfy the given class ratio $p_k$. Therefore, the number of features $N_k$ that belong to class $k$ becomes

$$N_k = p_k \cdot N, \quad (16)$$

and we set $N$ to the number of features obtained when the analysis interval is $10\,ms$. The same adjustment is used in the testing procedure. In the Nelder–Mead method, we initialize the vertices to suitable values around the conventional phoneme class ratios and run the algorithm for 100 iterations to let the vertices converge.
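The per-class subsampling described above can be sketched as follows. The features and class labels are toy data, and `sample_by_ratio` is an illustrative helper, not from the paper; it draws $N_k = p_k N$ frames of each class from the dense frame supply:

```python
import numpy as np

def sample_by_ratio(features, classes, p, N, rng=None):
    """Subsample frames so that class k contributes N_k = round(p_k * N)
    frames (sketch; assumes the dense frame supply gives every class
    enough frames to draw from)."""
    if rng is None:
        rng = np.random.default_rng(0)
    picked = []
    for k, p_k in enumerate(p):
        idx = np.flatnonzero(classes == k)
        N_k = int(round(p_k * N))
        picked.append(rng.choice(idx, size=N_k, replace=False))
    return features[np.concatenate(picked)]

# Toy frames: 10000 frames of 20-dim "MFCCs" with 5 phoneme classes.
rng = np.random.default_rng(1)
feats = rng.normal(size=(10000, 20))
cls = rng.integers(0, 5, size=10000)
p = [0.0734, 0.1840, 0.1340, 0.0840, 0.5246]   # proposed ratios from Table I
out = sample_by_ratio(feats, cls, p, N=1000)
print(out.shape)
```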

### B. Results and analysis

Before performing the experiments, we evaluate the phoneme class ratio $p$, the modified mutual information $I(p)$, and the $I_k(p)$ of each class on the TIMIT corpus. Table I shows the results. As the table shows, the $I_k(p)$ of vowels is larger than that of any other class, which means that vowels carry the most speaker discriminative information. Moreover, the $I_k(p)$ of nasals is quite large even though the nasal ratio is just 5.78%, which confirms the results presented by Eatock and Mason.^{5}

Table I. Baseline and proposed phoneme class ratios and the $I_k(p)$ of each class.

| Class | Baseline ratio (%) | Proposed ratio (%) | Difference | $I_k(p)$ |
|---|---|---|---|---|
| Stops | 6.94 | 7.34 | $+0.40$ | 3.66 |
| Fricatives | 18.01 | 18.40 | $+0.39$ | 3.20 |
| Nasals | 5.78 | 13.40 | $+7.62$ | 5.89 |
| Semivowels | 12.83 | 8.40 | $-4.43$ | 5.83 |
| Vowels | 56.44 | 52.46 | $-3.98$ | 6.52 |
| All | 100 | 100 | … | 5.59 |


Next, we estimate the optimal class ratio using the proposed method and compare the speaker recognition error rate of the proposed system with that of the baseline system. The optimal threshold $\theta$, which is the boundary for the compensation of $I_k(p)$, is found experimentally by varying it from 0.0 to 0.5. Tables I and II show the proposed class ratio and the speaker identification results of the proposed system when the threshold is $\theta = 0.08$. Table I shows the proposed class ratio alongside the baseline class ratio for comparison. As the table shows, the ratio of consonants is increased and that of vowels is decreased: the ratios of stops and fricatives increase by about 0.4%, that of nasals increases by more than 7%, while the ratios of semivowels and vowels decrease by 4.43% and 3.98%, respectively. From this result, we can see that nasals are important for improving speaker recognition performance, and that semivowels and vowels carry more redundant information than the other classes even though they contribute greatly to the performance of the speaker recognition system. Table II shows the results of the speaker identification tests of the proposed and baseline systems, listing the average error rate, the 95% confidence interval, and the minimum and maximum of that interval. The error rate of the proposed system is 18.33% lower than that of the baseline system on average. From these results, we confirm that the proposed algorithm outperforms the conventional one.

Table II. Speaker identification error rates (%) of the baseline and proposed systems.

| | Baseline | Proposed |
|---|---|---|
| Average | 4.571 | 3.734 |
| 95% conf. (±) | 0.118 | 0.104 |
| Min | 4.463 | 3.629 |
| Max | 4.690 | 3.838 |
| Improvement (%) | … | 18.33 |


We next apply the proposed method of adjusting the class ratio to the training and testing procedures to investigate the effect of each procedure. Figure 1 shows the result. The $y$-axis denotes the speaker recognition error rate with 95% confidence interval, and the $x$-axis denotes the type of system. *Baseline* and *proposed* denote the systems that use the phoneme class ratio of the baseline or the proposed system for both training and testing. $P+B$ means that the proposed class ratio is used for the training procedure and the baseline class ratio is used for the testing procedure. $B+P$ is the opposite of $P+B$. As the figure shows, the class ratio in the testing procedure influences the result more than that in the training procedure. In other words, the selection of test segments is more important than more accurate modeling of the distribution. In addition, we can say that speaker recognition performance can be improved by adjusting the phoneme class ratio of test data even if the speaker model has already been trained by conventional methods. Of course, the best performance can be achieved when both training and test methods adopt the proposed class ratios.

## IV. Conclusions and Future Work

In this paper, we proposed a method for finding an optimal phoneme class ratio based on mutual information to improve speaker recognition performance. First, we defined $I_k(p)$, the portion of class $k$ in the mutual information $I(C;X)$, and proposed finding an optimal phoneme class ratio by maximizing the average of $I_k(p)$. Speaker identification tests using the optimal phoneme class ratio verified that the proposed system improves performance by about 18% compared to a conventional system. We also found that the proposed phoneme class ratio remains applicable in the testing process even if the speaker model has been trained on data with the conventional phoneme class ratio.

Future work will focus on finding an optimal method of dividing phonemes into classes. In this paper, we used the phoneme class labels of TIMIT to simplify the problem; however, phonemes differ in their characteristics even within the same class. Therefore, we need to examine optimal phoneme classification methods for better performance. In addition, the accuracy of such classification methods needs to be considered in practical applications.