This paper proposes an efficient method to improve speaker recognition performance by dynamically controlling the ratio of phoneme class information. It exploits the fact that each phoneme carries a different amount of speaker-discriminative information, which can be measured by mutual information. After classifying phonemes into five classes, the optimal ratio of each class in both the training and testing processes is found using a nonlinear optimization technique, the Nelder–Mead method. Speaker identification results verify that the proposed method achieves an 18% relative error-rate reduction compared to a baseline system.

Automatic speaker recognition, the task of verifying a speaker's identity from his or her voice, has advanced considerably in recent years.1 Such systems typically use spectral features such as Mel-frequency cepstral coefficients (MFCCs), but the characteristics of spectral features vary with the phonetic content of the input speech. Since speaker recognition systems are usually designed for text-independent operation for flexibility, the spectral variation in the input signal is very high. Due to limitations on the number of input features, however, it is difficult to capture all the characteristics of a speaker in a stochastic model such as a Gaussian mixture model (GMM).2,3 In other words, to improve speaker recognition performance, it is important to build a GMM that represents speaker characteristics under realistic conditions. Various approaches have been proposed to overcome this limitation by utilizing phoneme information.4–7

In this paper, we also propose a method that utilizes phoneme information to improve speaker recognition performance. While previous studies focus on building separate models for each phoneme and combining their scores,4,6,7 we focus on finding an optimal phoneme class ratio, the portion of each phoneme class in an utterance, that maximizes speaker recognition performance based on mutual information. Mutual information has previously been used to measure or improve speaker recognition accuracy.8,9 Here, we experimentally re-evaluate the speaker-discriminative power of each phoneme class using mutual information and then find the optimal phoneme class ratio. We adopt the Nelder–Mead method, which is widely used for nonlinear optimization of multi-dimensional data.10 Experimental results show that the optimal phoneme class ratio differs somewhat from that occurring in normal speech: the portion of consonants is increased, while that of vowels is reduced. This approach can be applied to both training and testing, but the improvement is more significant when it is used for testing. Speaker identification results show that the proposed system using the optimal phoneme class ratio performs around 18% better than a conventional system.

The rest of this paper is organized as follows. First, we review the concept of mutual information, which measures the speaker discriminative power of given speech signals, and suggest how to find the optimal phoneme class ratio based on aspects of information theory in Sec. II. Section III shows the experimental setup and results, which verify the usefulness of the proposed algorithm. The conclusions and future work are given in Sec. IV.

Mutual information represents the amount of information shared by two given random variables. Equation (1) represents the definition of mutual information.

I(C;X) = H(C) − H(C|X).    (1)

In this equation, H(C) denotes the entropy of the speaker presence, and H(C|X) denotes the entropy of the speaker presence when the feature set X is given. Eriksson et al.8 showed that the error rate of a speaker recognition system decreases as mutual information increases. In Eq. (1), H(C) can be simplified as follows, assuming that the speaker presence probability is uniformly distributed:

H(C) = −∑_{s=1}^{S} P(s) log P(s) = log S,    (2)

where S denotes the number of speakers and P(s) is the presence probability of each speaker. Since H(C) depends only on the number of speakers, H(C|X) is the only term that affects I(C;X), and minimizing H(C|X) evidently maximizes I(C;X). Thus, our goal is to minimize H(C|X), defined as follows:

H(C|X) = −∫ p(x) ∑_{s=1}^{S} P(s|x) log P(s|x) dx.    (3)

Equation (3) can be approximated by the law of large numbers

H(C|X) ≈ −(1/N) ∑_{n=1}^{N} ∑_{s=1}^{S} P(s|x_n) log P(s|x_n),    (4)

where x_n is the nth feature and N is the total number of features. We can classify x_n into K classes using pre-defined class information.

H(C|X) ≈ −(1/N) ∑_{k=1}^{K} ∑_{n_k=1}^{N_k} ∑_{s=1}^{S} P(s|x_{n_k,k}) log P(s|x_{n_k,k}),    (5)

where N_k is the number of features in the kth class and x_{n_k,k} is the n_k-th feature in the kth class. Trivially, ∑_{k=1}^{K} N_k = N. The portion of each class's entropy can be represented by

H_k(p) = −(1/N_k) ∑_{n_k=1}^{N_k} ∑_{s=1}^{S} P(s|x_{n_k,k}) log P(s|x_{n_k,k}),    (6)

where p is the vector whose elements contain the ratio of each class to all features, i.e., p_k = N_k/N. Thus, it also satisfies the following constraint:

∑_{k=1}^{K} p_k = 1.    (7)

P(s|x_{n_k,k}) is defined as follows:

P(s|x_{n_k,k}) = p(x_{n_k,k}|λ_{s,p}) / ∑_{s′=1}^{S} p(x_{n_k,k}|λ_{s′,p}),    (8)

where λ_{s,p} is the GMM of speaker s trained with features having the class ratio p, and p(x_{n_k,k}|λ_{s,p}) is the likelihood of feature x_{n_k,k} given λ_{s,p}. Thus, we can rewrite Eq. (5) using H_k(p):

H(C|X) = ∑_{k=1}^{K} p_k H_k(p).    (9)

Similarly, if we define I_k(p) to represent the portion of class k in I(C;X), then I(C;X) can be rewritten as follows:

I(C;X) = ∑_{k=1}^{K} p_k I_k(p),    (10)

where I_k(p) is defined as follows:

I_k(p) = log S − H_k(p).    (11)
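The chain from Eq. (4) to Eq. (11) can be checked numerically. Below is a minimal sketch with invented per-speaker likelihood values and hypothetical class names: it forms posteriors under a uniform speaker prior as in Eq. (8), computes per-class entropies H_k and ratios p_k, and verifies that the class-weighted sum of Eq. (9) reproduces the pooled estimate of Eq. (4), with I_k = log S − H_k as in Eq. (11).

```python
import math

def posteriors(likelihoods):
    """Posterior P(s|x) from per-speaker likelihoods p(x|lambda_s),
    assuming a uniform speaker prior (Eq. (8))."""
    total = sum(likelihoods)
    return [lh / total for lh in likelihoods]

def entropy(post):
    """Entropy -sum_s P(s|x) log P(s|x) of one posterior vector."""
    return -sum(p * math.log(p) for p in post if p > 0)

# Toy setup: S = 3 speakers, two phoneme classes with invented likelihoods.
frames = {
    "vowel":     [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]],   # one row per frame
    "consonant": [[0.4, 0.3, 0.3]],
}
N = sum(len(rows) for rows in frames.values())
S = 3

# Per-class conditional entropy H_k (Eq. (6)) and class ratio p_k.
H_k, p_k = {}, {}
for k, rows in frames.items():
    H_k[k] = sum(entropy(posteriors(r)) for r in rows) / len(rows)
    p_k[k] = len(rows) / N

# Pooled estimate of H(C|X) (Eq. (4)) equals the class-weighted sum (Eq. (9)).
H_pooled = sum(entropy(posteriors(r)) for rows in frames.values() for r in rows) / N
H_weighted = sum(p_k[k] * H_k[k] for k in frames)

# Per-class mutual information I_k = log S - H_k (Eq. (11)).
I_k = {k: math.log(S) - H_k[k] for k in frames}
```

With these invented likelihoods, the sharper "vowel" posteriors yield a lower class entropy and hence a larger I_k, mirroring the paper's observation that some classes are more speaker-discriminative than others.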

Now we must consider the redundancy of each class. The redundancy of a class k usually increases as the class ratio p_k increases, and increased redundancy caused by including unnecessary data actually degrades speaker recognition performance. For example, in preliminary experiments varying the ratio of vowels and consonants, we found that speaker identification performance was better with a vowel ratio of 80% than with 90%, even though vowels are known to carry more speaker-discriminative information than consonants. Thus, we also need to account for data redundancy while maximizing mutual information by controlling the phoneme class ratio. One simple solution is to remove the p_k weighting from Eq. (10), because it directly relates to redundancy. Equation (12) shows the modified equation:

I(p) = (1/K) ∑_{k=1}^{K} I_k(p).    (12)

This equation denotes the average I_k(p) over all classes. Therefore, the objective of the proposed algorithm is to find the phoneme class ratio that maximizes I(p):

p̂ = arg max_p I(p).    (13)

There is one more issue regarding the relation between mutual information and speaker recognition accuracy. According to Eriksson et al.,8 this relation becomes exact as the speaker recognition accuracy increases. If the portion of a class is very small, we cannot say that the class carries a large amount of speaker-discriminative information even if the measured mutual information of that class is large; in that case, the I_k(p) of the class is likely meaningless. Thus, we force I_k(p) to zero when p_k is smaller than a certain threshold θ and smaller than the minimum probability p_k,min while I_k(p) is larger than the minimum value I_k(p_min). Equation (14) shows the modification rule:

(14)

In the equation above, θ denotes the threshold on p_k: if p_k is larger than θ, we use I_k(p); if p_k is smaller than θ, we discard I_k(p) because the value is unlikely to be meaningful. I_k(p_min) and p_k,min denote the minimum I_k(p) and its corresponding p_k, respectively. In this paper, we find p_k,min and I_k(p_min) experimentally during the optimization process, as it is hard to determine a global minimum theoretically.
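The θ-gating described above can be sketched as follows. This is a minimal reading of the rule stated in the text ("force I_k(p) to zero when p_k falls below θ"); the I_k values echo Table I, while the ratio vectors and function name are invented for illustration.

```python
def gated_objective(I_k, p, theta):
    """Average per-class mutual information (Eq. (12)) with theta-gating:
    a class whose ratio p_k falls below theta contributes zero, since its
    I_k estimate is considered unreliable."""
    assert abs(sum(p) - 1.0) < 1e-9          # Eq. (7): ratios sum to one
    K = len(I_k)
    return sum(ik if pk >= theta else 0.0 for ik, pk in zip(I_k, p)) / K

# Per-class I_k values in the style of Table I (stops ... vowels) and two
# hypothetical candidate ratio vectors.
I_vals = [3.66, 3.20, 5.89, 5.83, 6.52]
p_a = [0.07, 0.18, 0.13, 0.08, 0.54]    # every class ratio above theta
p_b = [0.02, 0.23, 0.13, 0.08, 0.54]    # stops ratio gated out
obj_a = gated_objective(I_vals, p_a, theta=0.05)
obj_b = gated_objective(I_vals, p_b, theta=0.05)
```

Starving a class below θ removes its contribution entirely, so the optimizer is discouraged from shrinking any class ratio to a level where its I_k estimate would be meaningless.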

Since the speaker model λ_{s,p} in Eq. (8) must be retrained by the expectation-maximization algorithm whenever p varies, we cannot directly find a p that maximizes Eq. (12). In this case, the Nelder–Mead method, popularly used for nonlinear optimization of multi-dimensional data, is suitable.10 The Nelder–Mead method is generally an unconstrained optimization method, but our application has two constraints: one is given in Eq. (7), and the other is as follows:

0 ≤ p_k ≤ 1,  k = 1, …, K.    (15)

Including these constraints does not affect the applicability of the method. The first constraint, Eq. (7), is preserved automatically if the initial vertices are chosen to satisfy it. To satisfy the second constraint, we adjust the coefficients used to compute the reflection and the new vertex so that the vertices remain in the feasible range.
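A minimal sketch of a simplex-constrained Nelder–Mead search follows. Note one swapped-in simplification: instead of adjusting the reflection coefficients as the paper does, candidate vertices are simply projected back onto the probability simplex; the quadratic toy objective stands in for −I(p), whose real evaluation would require retraining the GMMs. All names and the target vector are illustrative.

```python
def project(p, eps=1e-6):
    """Clip to [eps, 1] and renormalize, so every vertex satisfies the
    ratio constraints of Eqs. (7) and (15)."""
    q = [min(1.0, max(eps, x)) for x in p]
    s = sum(q)
    return [x / s for x in q]

def nelder_mead_on_simplex(f, p0, steps=300, alpha=1.0, gamma=2.0,
                           rho=0.5, sigma=0.5):
    """Minimal Nelder-Mead minimizer restricted to the probability simplex.
    K vertices span the (K-1)-dimensional feasible set; every candidate
    point is projected back onto the simplex."""
    K = len(p0)
    verts = [project(p0)]
    for i in range(K - 1):
        v = list(p0)
        v[i] += 0.1                      # perturb one coordinate per vertex
        verts.append(project(v))
    m = len(verts) - 1
    for _ in range(steps):
        verts.sort(key=f)
        best, worst = verts[0], verts[-1]
        cen = [sum(v[j] for v in verts[:-1]) / m for j in range(K)]
        refl = project([cen[j] + alpha * (cen[j] - worst[j]) for j in range(K)])
        if f(refl) < f(best):            # try expanding past the reflection
            exp = project([cen[j] + gamma * (refl[j] - cen[j]) for j in range(K)])
            verts[-1] = exp if f(exp) < f(refl) else refl
        elif f(refl) < f(verts[-2]):
            verts[-1] = refl
        else:                            # contract toward the centroid
            con = project([cen[j] + rho * (worst[j] - cen[j]) for j in range(K)])
            if f(con) < f(worst):
                verts[-1] = con
            else:                        # shrink all vertices toward the best
                verts = [best] + [project([best[j] + sigma * (v[j] - best[j])
                                           for j in range(K)]) for v in verts[1:]]
    return min(verts, key=f)

# Quadratic toy objective standing in for -I(p); its minimum is a known ratio.
target = [0.07, 0.18, 0.13, 0.08, 0.54]
loss = lambda p: sum((pj - tj) ** 2 for pj, tj in zip(p, target))
p_opt = nelder_mead_on_simplex(loss, [0.2, 0.2, 0.2, 0.2, 0.2])
```

Because centroids, reflections, expansions, and contractions are affine combinations of simplex points, they already preserve the sum-to-one constraint of Eq. (7); the projection mainly guards against components drifting outside [0, 1].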

We perform experiments to verify the feasibility of the proposed algorithm on the TIMIT corpus, which contains phoneme annotations for all sentences.11 Each phoneme in the TIMIT corpus is classified into one of seven classes: stops, affricates, fricatives, nasals, glides and semivowels, vowels, and others. Among these, affricates occupy a very small portion of the corpus and are absent for some speakers, while the others class comprises labels that are not speech segments. Thus, we use only five classes, disregarding affricates and others. In the baseline system, MFCCs up to order 20 are extracted from 20 ms windowed speech, with the analysis frame shifted every 10 ms. The boundary regions of each phoneme are omitted to remove the effect of transition regions. Using the extracted MFCCs, we train 16-mixture GMMs with five sentences per speaker. After training the GMMs for all speakers, we evaluate I(p) using two sentences not used in training, and assess system performance on the remaining three sentences.

In the experiments, features must be re-extracted with the given phoneme class ratio at every iteration of the algorithm, because the speaker model λ_{s,p} is retrained whenever the phoneme class ratio p varies. To adjust the phoneme class ratio, we analyze the speech signal every 0.25 ms and sample features from each class to satisfy the given class ratio p_k. The number of features N_k belonging to class k therefore becomes

N_k = p_k N,    (16)

where N is set to the number of features obtained with a 10 ms analysis interval. The same adjustment is used in the testing procedure. In the Nelder–Mead method, we initialize the vertices to suitable values around conventional phoneme class ratios and run the algorithm for 100 iterations to let the vertices converge.
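The per-class resampling of Eq. (16) can be sketched as follows, with invented feature pools standing in for the finely analyzed MFCC frames; the function and variable names are illustrative.

```python
import random

def resample_by_ratio(features_by_class, p, N, seed=0):
    """Draw N_k = round(p_k * N) features from each phoneme class k
    (Eq. (16)), assuming each class has been analyzed finely enough to
    supply the requested count."""
    rng = random.Random(seed)
    out = {}
    for k, ratio_k in p.items():
        n_k = round(ratio_k * N)
        out[k] = rng.sample(features_by_class[k], n_k)  # without replacement
    return out

# Invented oversampled feature pools (placeholders for MFCC frames).
pools = {"nasals": list(range(300)), "vowels": list(range(700))}
ratio = {"nasals": 0.134, "vowels": 0.866}   # a proposed-style class ratio
sel = resample_by_ratio(pools, ratio, N=200)
```

Analyzing the signal on a much finer grid than the baseline 10 ms shift is what makes it possible to *increase* a class's share above its natural ratio, since the oversampled pool holds more frames per class than the target count N_k.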

Before performing the optimization experiments, we evaluate the phoneme class ratio p, the modified mutual information I(p), and the I_k(p) of each class on the TIMIT corpus. Table I shows the results. As the table shows, the I_k(p) of vowels is larger than that of any other class, meaning that vowels carry the most speaker-discriminative information. Moreover, the I_k(p) of nasals is quite large even though the nasal ratio is only 5.78%, which confirms the results presented by Eatock and Mason.5

TABLE I. Class ratio of the baseline system and the proposed system when θ = 0.08, and I_k(p) of each class in the TIMIT corpus.

Class        |          Ratio (%)               | I_k(p)
             | Baseline | Proposed | Difference |
Stops        |    6.94  |    7.34  |   +0.40    |  3.66
Fricatives   |   18.01  |   18.40  |   +0.39    |  3.20
Nasals       |    5.78  |   13.40  |   +7.62    |  5.89
Semivowels   |   12.83  |    8.40  |   −4.43    |  5.83
Vowels       |   56.44  |   52.46  |   −3.98    |  6.52
All          |  100     |  100     |    …       |  5.59

Next, we estimate the optimal class ratio using the proposed method and compare the speaker recognition error rate of the proposed system with that of the baseline system. The optimal threshold θ, which bounds the compensation of I_k(p), is found experimentally by varying it from 0.0 to 0.5. Tables I and II show the proposed class ratio and the speaker identification results when θ = 0.08. Table I lists the proposed class ratio alongside the baseline class ratio for comparison. As the table shows, the ratio of consonants is increased and that of vowels is decreased: the ratios of stops and fricatives increase by about 0.4%, that of nasals by more than 7%, while the ratios of semivowels and vowels decrease by 4.43% and 3.98%, respectively. From this result, we can see that nasals are important for improving speaker recognition performance, and that semivowels and vowels carry more redundant information than the other classes even though they contribute greatly to system performance. Table II shows the speaker identification results of the proposed and baseline systems: the average error rate, the 95% confidence interval, and the minimum and maximum of that interval. The error rate of the proposed system is 18.33% lower than that of the baseline system on average. These results confirm that the proposed algorithm outperforms the conventional one.

TABLE II. Speaker identification error rate of the baseline system and the proposed system (θ = 0.08).

             | Error rate (%)
             | Baseline | Proposed
Average      |   4.571  |   3.734
95% conf.    |  ±0.118  |  ±0.104
Min          |   4.463  |   3.629
Max          |   4.690  |   3.838
Improvement  |    …     |  18.33

We next apply the proposed class-ratio adjustment to the training and testing procedures separately to investigate the effect of each. Figure 1 shows the results: the y-axis denotes the speaker recognition error rate with 95% confidence intervals, and the x-axis denotes the system type. "Baseline" and "Proposed" denote systems that use the baseline or the proposed phoneme class ratio for both training and testing; P+B uses the proposed class ratio for training and the baseline ratio for testing, and B+P is the opposite. As the figure shows, the class ratio used in testing influences the result more than that used in training; in other words, the selection of test segments matters more than more accurate modeling of the distribution. Moreover, speaker recognition performance can be improved by adjusting the phoneme class ratio of the test data even if the speaker model has already been trained by conventional methods. The best performance, of course, is achieved when both training and testing adopt the proposed class ratio.

FIG. 1.

Speaker identification error rate of the proposed system. P+B: training-proposed/testing-baseline; B+P: training-baseline/testing-proposed.


In this paper, we proposed a method for finding an optimal phoneme class ratio using mutual information to improve speaker recognition performance. First, we defined I_k(p), the portion of class k in the mutual information I(C;X), and proposed finding the optimal phoneme class ratio by maximizing the average I_k(p). Speaker identification tests using the optimal phoneme class ratio verified that the proposed system improves performance by about 18% compared to a conventional system. We also found that the proposed phoneme class ratio remains applicable at test time even if the speaker model has been trained on data with the conventional phoneme class ratio.

Future work includes finding an optimal method of dividing phonemes into classes. In this paper, we used the phoneme class labels of TIMIT to simplify the problem; however, phonemes within the same class still have different characteristics. Therefore, better-suited phoneme classification methods should be examined for further performance gains, and the accuracy of such classification methods must be considered in practical applications.

1. S. Furui, "Fifty years of progress in speech and speaker recognition," J. Acoust. Soc. Am. 116, 2497–2498 (2004).
2. B. S. Atal, "Text-independent speaker recognition," J. Acoust. Soc. Am. 52, 181 (1972).
3. D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech Audio Process. 3, 72–83 (1995).
4. R. Auckenthaler, E. Parris, and M. Carey, "Improving a GMM speaker verification system by phonetic weighting," in IEEE International Conference on Acoustics, Speech, and Signal Processing (1999), Vol. 1, pp. 313–316.
5. J. Eatock and J. Mason, "A quantitative assessment of the relative speaker discriminating properties of phonemes," in IEEE International Conference on Acoustics, Speech, and Signal Processing (1994), Vol. 1, pp. 133–136.
6. D. Gutman and Y. Bistritz, "Speaker verification using phoneme-adapted Gaussian mixture models," in EUSIPCO-2002, the XI European Signal Processing Conference (2002), Vol. 3, pp. 85–88.
7. L. Rodriguez-Linares and C. Garcia-Mateo, "Phonetically trained models for speaker recognition," J. Acoust. Soc. Am. 109, 385–389 (2001).
8. T. Eriksson, S. Kim, H.-G. Kang, and C. Lee, "An information-theoretic perspective on feature selection in speaker recognition," IEEE Signal Process. Lett. 12, 500–503 (2005).
9. M. K. Omar, J. Navratil, and G. N. Ramaswamy, "Maximum conditional mutual information modeling for speaker verification," in EUROSPEECH (2005).
10. J. A. Nelder and R. Mead, "A simplex method for function minimization," Comput. J. 7, 308–313 (1965).
11. J. S. Garofalo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, "The DARPA TIMIT acoustic-phonetic continuous speech corpus CDROM," Linguistic Data Consortium (1993).