This paper proposes an efficient method to improve speaker recognition performance by dynamically controlling the ratio of phoneme class information. It exploits the fact that each phoneme contains a different amount of speaker discriminative information, which can be measured by mutual information. After classifying phonemes into five classes, the optimal ratio of each class in both the training and testing processes is adjusted using a non-linear optimization technique, i.e., the Nelder–Mead method. Speaker identification results verify that the proposed method achieves an 18% relative reduction in error rate compared to a baseline system.
I. Introduction
Automatic speaker recognition, the task of verifying a speaker's identity using his/her voice, has advanced considerably in recent years.1 Such systems typically rely on spectral features such as Mel-frequency cepstral coefficients (MFCCs), but the characteristics of spectral features vary with the phonetic content of the input speech. Since speaker recognition systems are usually designed to be text-independent for flexibility, the spectral variation of the input signal is very high. Due to limitations on the number of input features, however, it is not easy to capture all the characteristics of a speaker in a stochastic model such as a Gaussian mixture model (GMM).2,3 In other words, to improve speaker recognition performance, it is very important to build a GMM that represents speaker characteristics well under realistic conditions. Various approaches have been proposed to overcome this limitation by utilizing phoneme information.4–7
In this paper, we also propose a method that utilizes phoneme information to improve speaker recognition performance. While previous studies focus on using separate models for each phoneme and combining their scores,4,6,7 we focus on finding an optimal phoneme class ratio, i.e., the portion of each phoneme class in an utterance, that maximizes speaker recognition performance based on mutual information. Mutual information has previously been used to analyze or improve speaker recognition accuracy.8,9 In this paper, we experimentally re-evaluate the speaker discriminative power of each phoneme class using mutual information and then find the optimal phoneme class ratio. We adopt the Nelder–Mead method, which is widely used for nonlinear optimization of multi-dimensional data.10 Experimental results show that the optimal phoneme class ratio we find differs somewhat from that of normal speech: the portion of consonants is increased, while that of vowels is reduced. This approach can be applied to both training and testing, but the improvement is more significant when it is used for testing. Speaker identification results show that the proposed system, which uses the optimal phoneme class ratio, performs around 18% better than a conventional system.
The rest of this paper is organized as follows. First, we review the concept of mutual information, which measures the speaker discriminative power of given speech signals, and suggest how to find the optimal phoneme class ratio based on aspects of information theory in Sec. II. Section III shows the experimental setup and results, which verify the usefulness of the proposed algorithm. The conclusions and future work are given in Sec. IV.
II. Optimization of Phoneme Class Ratio
A. Problem formulation
Mutual information represents the amount of information shared by two given random variables. Let $S$ denote the presence of a specific speaker and $X$ the acoustic feature; Eq. (1) gives the definition of their mutual information:

$$I(S;X) = H(S) - H(S|X). \qquad (1)$$

In the equation, $H(S)$ denotes the entropy of the speaker presence, and $H(S|X)$ denotes the entropy of the speaker presence when the feature set is given. Eriksson et al.8 showed that the error rate of a speaker recognition system decreases as mutual information increases. In Eq. (1), $H(S)$ can be simplified as follows, assuming that the distribution of the speaker presence probability is uniform:

$$H(S) = -\sum_{s=1}^{N} P(s)\log P(s) = \log N, \qquad (2)$$

where $N$ denotes the number of speakers and $P(s) = 1/N$ is the probability of each speaker's presence. Since $H(S)$ depends only on the number of speakers, $H(S|X)$ is the only term that affects $I(S;X)$, and it is evident that minimizing $H(S|X)$ maximizes $I(S;X)$. Thus, our goal is to minimize $H(S|X)$, defined as

$$H(S|X) = -\int p(x)\sum_{s=1}^{N} P(s|x)\log P(s|x)\,dx. \qquad (3)$$
Equation (3) can be approximated by the law of large numbers as

$$H(S|X) \approx -\frac{1}{T}\sum_{t=1}^{T}\sum_{s=1}^{N} P(s|x_t)\log P(s|x_t), \qquad (4)$$

where $x_t$ is the $t$-th feature and $T$ is the total number of features. We can divide the features into $C$ classes using pre-defined class information:

$$H(S|X) \approx -\frac{1}{T}\sum_{c=1}^{C}\sum_{t=1}^{T_c}\sum_{s=1}^{N} P(s|x_t^{(c)})\log P(s|x_t^{(c)}), \qquad (5)$$

where $T_c$ is the number of features in the $c$-th class and $x_t^{(c)}$ is the $t$-th feature in the $c$-th class. Trivially, $\sum_{c=1}^{C} T_c = T$. The portion of each class's entropy can be represented by

$$H(S|X) \approx \sum_{c=1}^{C} r_c H_c(S|X), \quad H_c(S|X) = -\frac{1}{T_c}\sum_{t=1}^{T_c}\sum_{s=1}^{N} P(s|x_t^{(c)})\log P(s|x_t^{(c)}), \qquad (6)$$

where $\mathbf{r} = [r_1,\ldots,r_C]$ is the vector whose $c$-th element contains the ratio of each class to all features, i.e., $r_c = T_c/T$. Thus, it also satisfies the following constraint:

$$\sum_{c=1}^{C} r_c = 1. \qquad (7)$$
The speaker posterior $P(s|x_t)$ is defined as follows:

$$P(s|x_t) = \frac{p(x_t|\lambda_s^{\mathbf{r}})}{\sum_{s'=1}^{N} p(x_t|\lambda_{s'}^{\mathbf{r}})}, \qquad (8)$$

where $\lambda_s^{\mathbf{r}}$ is the GMM of speaker $s$ trained by features with the class ratio of $\mathbf{r}$, and $p(x_t|\lambda_s^{\mathbf{r}})$ is the likelihood of feature $x_t$ given $\lambda_s^{\mathbf{r}}$. Thus, we can rewrite Eq. (5) using $\mathbf{r}$:

$$H(S|X;\mathbf{r}) \approx \sum_{c=1}^{C} r_c H_c(S|X;\mathbf{r}). \qquad (9)$$
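To make the estimation concrete, the following Python sketch (an illustrative implementation under our own assumptions, not the authors' code) computes the speaker posteriors of Eq. (8) from per-speaker GMM log-likelihoods and the per-class conditional entropies used in Eq. (9). It assumes the speaker models are available as scikit-learn GaussianMixture objects and the features as NumPy arrays grouped by phoneme class.

```python
import numpy as np
from scipy.special import logsumexp

def class_conditional_entropy(feats_c, speaker_gmms):
    """Estimate H_c(S|X) for one phoneme class via Eqs. (4) and (8).

    feats_c      : (T_c, D) array of features belonging to class c.
    speaker_gmms : list of N fitted sklearn GaussianMixture models,
                   one per speaker, trained with the current ratio r.
    """
    # log p(x_t | lambda_s) for every frame and speaker: shape (T_c, N)
    loglik = np.stack([gmm.score_samples(feats_c) for gmm in speaker_gmms], axis=1)
    # Posterior P(s | x_t) with uniform speaker priors, Eq. (8)
    log_post = loglik - logsumexp(loglik, axis=1, keepdims=True)
    post = np.exp(log_post)
    # Law-of-large-numbers estimate of the class conditional entropy (in nats)
    return float(-np.mean(np.sum(post * log_post, axis=1)))

def conditional_entropy(feats_by_class, speaker_gmms):
    """Combine the per-class entropies with the class ratios r_c, Eq. (9)."""
    counts = np.array([len(f) for f in feats_by_class], dtype=float)
    r = counts / counts.sum()                       # r_c = T_c / T
    H_c = np.array([class_conditional_entropy(f, speaker_gmms)
                    for f in feats_by_class])
    return float(np.dot(r, H_c)), H_c, r
```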
Similarly, if we define $I_c(S;X;\mathbf{r})$ to represent the portion of class $c$ in $I(S;X)$, then $I(S;X)$ can be rewritten as follows:

$$I(S;X;\mathbf{r}) = \sum_{c=1}^{C} r_c I_c(S;X;\mathbf{r}), \qquad (10)$$

where $I_c(S;X;\mathbf{r})$ is defined as follows:

$$I_c(S;X;\mathbf{r}) = H(S) - H_c(S|X;\mathbf{r}). \qquad (11)$$
Now we have to be concerned about the redundancy of each class. The redundancy of a class usually increases as its ratio increases, and the increased redundancy caused by including unnecessary data actually degrades speaker recognition performance. For example, in our preliminary experiments on varying the ratio of vowels and consonants, we found that speaker identification performance is better when the ratio of vowels is 80% than when it is 90%, even though vowels are known to carry more speaker discriminative information than consonants. Thus, we also need to consider the redundancy of the data while maximizing mutual information by controlling the phoneme class ratio. One simple solution is removing the weighting term $r_c$ from Eq. (10), because it directly relates to redundancy. Equation (12) shows the modified measure:

$$\bar{I}(S;X;\mathbf{r}) = \frac{1}{C}\sum_{c=1}^{C} I_c(S;X;\mathbf{r}). \qquad (12)$$

The equation denotes the average of $I_c(S;X;\mathbf{r})$ over all classes. Therefore, the objective of the proposed algorithm is to find the optimal ratio of phoneme classes $\mathbf{r}^{\ast}$ that maximizes $\bar{I}(S;X;\mathbf{r})$:

$$\mathbf{r}^{\ast} = \arg\max_{\mathbf{r}} \bar{I}(S;X;\mathbf{r}). \qquad (13)$$
There is one more issue concerning the relation between mutual information and speaker recognition accuracy. According to Eriksson et al.,8 this relation becomes exact as the speaker recognition accuracy increases. If the portion of a class is very small, we cannot claim that the class carries a large amount of speaker discriminative information even though the measured mutual information of that class is large. In this case, we may assume that the $I_c(S;X;\mathbf{r})$ of the class is meaningless. Thus, we force $I_c(S;X;\mathbf{r})$ to zero when $r_c$ is smaller than a certain threshold $\theta$. Equation (14) shows the modification rule:

$$I_c(S;X;\mathbf{r}) \leftarrow \begin{cases} I_c(S;X;\mathbf{r}), & \text{if } r_c \ge \theta, \\ 0, & \text{if } r_c < \theta. \end{cases} \qquad (14)$$

In the equation above, $\theta$ denotes the threshold of $r_c$. If $r_c$ is larger than $\theta$, we can use $I_c(S;X;\mathbf{r})$; if $r_c$ is smaller than $\theta$, we do not use it because the value is unlikely to be meaningful. In this paper, we find $\theta$ experimentally during the optimization process, as it is hard to determine an optimal value theoretically.
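Continuing the same sketch, and under the same assumptions, the modified objective of Eqs. (12)–(14) can be written as follows; conditional_entropy is the helper defined above, and the default value of theta is only a placeholder, not the threshold used in the paper.

```python
import numpy as np

def modified_mutual_information(feats_by_class, speaker_gmms, n_speakers, theta=0.05):
    """Average per-class mutual information with thresholding, Eqs. (12)-(14)."""
    _, H_c, r = conditional_entropy(feats_by_class, speaker_gmms)
    H_S = np.log(n_speakers)               # H(S) = log N for a uniform prior, Eq. (2)
    I_c = H_S - H_c                        # per-class mutual information, Eq. (11)
    I_c = np.where(r >= theta, I_c, 0.0)   # force unreliable terms to zero, Eq. (14)
    return float(np.mean(I_c))             # average over the C classes, Eq. (12)
```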
B. Optimization method to find the optimal ratio
Since the speaker model $\lambda_s^{\mathbf{r}}$, which appears in Eq. (8), must be retrained by the expectation-maximization algorithm whenever $\mathbf{r}$ varies, we cannot find the $\mathbf{r}$ that maximizes Eq. (12) in closed form. In this case, the Nelder–Mead method, popularly used for nonlinear optimization of multi-dimensional data, is suitable.10 Generally, the Nelder–Mead method is an unconstrained optimization method, but our application has two constraints. One is given in Eq. (7), and the other is as follows:

$$0 \le r_c \le 1, \quad c = 1,\ldots,C. \qquad (15)$$

Inclusion of these constraints does not affect the applicability of the method. The first constraint, Eq. (7), is preserved if we choose the initial vertices to satisfy it. To satisfy the second constraint, Eq. (15), we adjust the coefficients used to compute the reflection and the new vertex so that the vertices remain in the feasible range.
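The following sketch shows one way such a constrained search could be wired up with SciPy's Nelder–Mead implementation. Instead of modifying the reflection coefficients as described above, it enforces Eqs. (7) and (15) by a softmax reparameterization of the ratio vector, which is a substitute for the authors' scheme, not a reproduction of it. The callable evaluate_ratio is a hypothetical hook that retrains the speaker GMMs with the candidate ratio, scores held-out data, and returns the objective of Eq. (12).

```python
import numpy as np
from scipy.optimize import minimize

def optimize_class_ratio(evaluate_ratio, r0, max_iter=100):
    """Search for the class ratio that maximizes the averaged mutual information.

    evaluate_ratio : callable r -> float implementing Eq. (12) on held-out data
                     (retraining the speaker GMMs for each candidate r).
    r0             : initial ratio vector, e.g., the baseline ratio of Table I.
    """
    def neg_objective(z):
        # Softmax maps unconstrained z to a ratio with r_c > 0 and sum(r_c) = 1,
        # so the constraints of Eqs. (7) and (15) hold by construction.
        r = np.exp(z - z.max())
        r /= r.sum()
        return -evaluate_ratio(r)

    result = minimize(neg_objective, np.log(np.asarray(r0)), method="Nelder-Mead",
                      options={"maxiter": max_iter, "xatol": 1e-3, "fatol": 1e-3})
    r_opt = np.exp(result.x - result.x.max())
    return r_opt / r_opt.sum()

# Example (with a hypothetical evaluate_ratio), starting from the baseline ratio of Table I:
# r_star = optimize_class_ratio(evaluate_ratio,
#                               [0.0694, 0.1801, 0.0578, 0.1283, 0.5644])
```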
III. Experiments and Results
A. Experimental setup
We perform experiments to verify the feasibility of the proposed algorithm on the TIMIT corpus, which contains phoneme transcriptions for all sentences.11 In the TIMIT corpus, each phoneme is classified into one of seven classes: stops, affricates, fricatives, nasals, glides and semivowels, vowels, and others. Among these classes, affricates take up a very small portion of the corpus and some speakers do not produce this class at all, while the others class comprises labels that are not speech segments. Thus, we use only the remaining five classes, disregarding affricates and others. In the baseline system, MFCCs up to order 20 are extracted from windowed speech frames with a fixed frame shift. The boundary regions of each phoneme are omitted to remove the effect of transition regions. In training the speaker models with the extracted MFCCs, we construct 16-mixture GMMs using five sentences per speaker. After training the GMMs for all speakers, we evaluate the objective in Eq. (12) using two sentences per speaker that are not used in training. When we evaluate the identification performance of the system, we use the remaining three sentences.
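For reference, a minimal sketch of a baseline front end and speaker modeling of this kind is given below. It assumes librosa and scikit-learn; the frame length and shift are placeholders, since the exact analysis parameters are not restated here, and TIMIT audio may need conversion from the NIST SPHERE format before loading.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(wav_path, n_mfcc=20, frame_len=0.02, frame_shift=0.01):
    """20th-order MFCCs from one utterance (frame parameters are placeholders)."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(frame_len * sr),
                                hop_length=int(frame_shift * sr))
    return mfcc.T                                    # shape: (num_frames, n_mfcc)

def train_speaker_model(train_wavs, n_mixtures=16):
    """Train one 16-mixture GMM per speaker from its training sentences."""
    feats = np.vstack([extract_mfcc(w) for w in train_wavs])
    gmm = GaussianMixture(n_components=n_mixtures, covariance_type="diag",
                          max_iter=100, random_state=0)
    return gmm.fit(feats)
```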
In the experiments, we have to extract features that match the given phoneme class ratio in every iteration of the algorithm, because the speaker models are retrained according to the variation in the phoneme class ratio $\mathbf{r}$. To adjust the phoneme class ratio, we analyze the speech signal at a fixed frame interval and sample features from each class so as to satisfy the given class ratio $\mathbf{r}$. Therefore, the number of features that belong to class $c$ becomes

$$T_c = r_c\,T, \qquad (16)$$

and we set $T$, the total number of features, to the number of features obtained with the baseline analysis interval. This adjustment is also applied in the testing procedure. In the Nelder–Mead method, we initialize the vertices at suitable values around the conventional phoneme class ratio and run the algorithm for 100 iterations so that the vertices converge.
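A possible implementation of this per-class subsampling, written against the same NumPy data layout assumed in the earlier sketches, is shown below; the rounding of Eq. (16) to an integer and the random frame selection are our own choices.

```python
import numpy as np

def resample_to_ratio(feats_by_class, r, total=None, seed=0):
    """Subsample frames so that class c contributes about r_c * T frames, Eq. (16).

    feats_by_class : list of (T_c, D) arrays, one per phoneme class.
    r              : target class ratio vector (non-negative, sums to 1).
    total          : total number of frames T to keep; defaults to the number
                     of frames available under the original class ratio.
    """
    rng = np.random.default_rng(seed)
    if total is None:
        total = sum(len(f) for f in feats_by_class)
    out = []
    for f, rc in zip(feats_by_class, r):
        n = min(len(f), int(round(rc * total)))      # cannot keep more frames than exist
        idx = rng.choice(len(f), size=n, replace=False)
        out.append(f[np.sort(idx)])                  # preserve temporal order within the class
    return out
```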
B. Results and analysis
Before performing the optimization experiments, we evaluate the phoneme class ratio $r_c$, the mutual information $I_c(S;X)$ of each class, and the overall mutual information on the TIMIT corpus. Table I shows the results. As the table shows, $I_c$ of vowels is larger than that of any other class, which means that vowels carry the most speaker discriminative information. Moreover, $I_c$ of nasals is quite large even though the ratio of nasals is only 5.78%, which confirms the results presented by Eatock and Mason.5
Table I. Phoneme class ratios of the baseline (TIMIT) and proposed systems, and the per-class mutual information $I_c(S;X)$.

| Class | Baseline ratio (%) | Proposed ratio (%) | Difference (%) | $I_c(S;X)$ |
|---|---|---|---|---|
| Stops | 6.94 | 7.34 | +0.40 | 3.66 |
| Fricatives | 18.01 | 18.40 | +0.39 | 3.20 |
| Nasals | 5.78 | 13.40 | +7.62 | 5.89 |
| Semivowels | 12.83 | 8.40 | −4.43 | 5.83 |
| Vowels | 56.44 | 52.46 | −3.98 | 6.52 |
| All | 100 | 100 | … | 5.59 |
Next, we estimate the optimal class ratio using the proposed method and compare the speaker recognition error rate of the proposed system with that of the baseline system. The optimal threshold $\theta$, which controls the compensation of $I_c(S;X;\mathbf{r})$ in Eq. (14), is found experimentally by varying it from 0.0 to 0.5. Tables I and II show the proposed class ratio and the speaker identification results of the proposed system with this optimal threshold. Table I lists the proposed class ratio together with the baseline class ratio for comparison. As the table shows, the ratio of consonants is increased and that of vowels is decreased: the ratios of stops and fricatives are increased by about 0.4%, and that of nasals is increased by more than 7%, whereas the ratios of semivowels and vowels are decreased by 4.43% and 3.98%, respectively. From this result, we can see that nasals are important for improving speaker recognition performance, and that semivowels and vowels carry more redundant information than the other classes even though they contribute greatly to the performance of the speaker recognition system. Table II shows the results of the speaker identification tests of the proposed and baseline systems: the average error rate, the 95% confidence interval, and the minimum and maximum of that interval. The error rate of the proposed system is 18.33% lower than that of the baseline system on average. From these results, we can confirm that the proposed algorithm outperforms the conventional one.
Table II. Speaker identification error rates of the baseline and proposed systems.

| | Baseline error rate (%) | Proposed error rate (%) |
|---|---|---|
| Average | 4.571 | 3.734 |
| 95% confidence interval | 0.118 | 0.104 |
| Min | 4.463 | 3.629 |
| Max | 4.690 | 3.838 |
| Relative improvement (%) | … | 18.33 |
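As a quick consistency check on the last row of Table II, the relative improvement follows directly from the average error rates: $(4.571 - 3.734)/4.571 \approx 0.183$, i.e., about 18.3%.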
We next apply the proposed method of adjusting the class ratio to the training and testing procedures separately to investigate the effect of each. Figure 1 shows the result: the vertical axis denotes the speaker recognition error rate with its 95% confidence interval, and the horizontal axis denotes the type of system. The baseline and proposed systems use the baseline or the proposed phoneme class ratio, respectively, for both training and testing; the two intermediate systems use the proposed class ratio only in training (with the baseline ratio in testing) or only in testing (with the baseline ratio in training). As the figure shows, the class ratio used in the testing procedure influences the result more than that used in the training procedure. In other words, the selection of test segments is more important than more accurate modeling of the distribution. We can therefore say that speaker recognition performance can be improved by adjusting the phoneme class ratio of the test data even if the speaker model has already been trained by conventional methods. Of course, the best performance is achieved when both the training and testing procedures adopt the proposed class ratio.
IV. Conclusions and Future Work
In this paper, we proposed a method for finding an optimal phoneme class ratio that utilizes mutual information to improve speaker recognition performance. First, we defined $I_c(S;X;\mathbf{r})$, the portion of class $c$ in the mutual information $I(S;X)$, and proposed a method for finding an optimal phoneme class ratio by maximizing the average of $I_c(S;X;\mathbf{r})$ over classes. From the results of speaker identification tests using the optimal phoneme class ratio, we verified that the proposed system improves speaker identification performance by about 18% compared to a conventional system. We also found that the proposed phoneme class ratio remains applicable in the testing process even if the speaker model has been trained on data with the conventional phoneme class ratio.
Future work will focus on finding an optimal method of dividing phonemes into classes. In this paper, we used the phoneme class labels of TIMIT to simplify the problem. However, phonemes can have different characteristics even when they belong to the same class. Therefore, we need to examine better phoneme classification schemes to achieve further improvement. In addition, the accuracy of such classification methods needs to be considered in practical applications.