In statistical parametric speech synthesis, a mixture density network is employed to address the limitations of a linear output layer, such as pre-computed fixed variances and the unimodal assumption. However, it also has a drawback: it cannot deploy the static-dynamic constraint needed in the training phase for high-quality speech synthesis. To cope with this problem, this paper proposes a training algorithm based on the minimum trajectory error for a mixture density network. A modulation spectrum-constrained loss function is also proposed to alleviate the over-smoothing effect. The experimental results confirm meaningful improvements in both objective and subjective performance measures.

## 1. Introduction

The speech quality of statistical parametric speech synthesis (SPSS) has been improved noticeably through the use of deep neural networks (DNNs),^{1} which show better performance in representing the complex, nonlinear, and high-dimensional relationships between linguistic and acoustic features as compared to conventional hidden Markov models.^{2} Over the last few years, several end-to-end speech synthesis frameworks generating human-like synthetic speech, including WaveNet, Tacotron, and Deepvoice, have been proposed.^{3–5} However, they perform poorly under small-corpus conditions and also incur high computation costs. Hence, SPSS approaches are still more useful in real environments.

In SPSS, several studies with various DNN-based architectures have been reported.^{6–10} Usually they include a linear output layer and are trained with a mean squared error (MSE) loss function, generating acoustic features with a maximum likelihood parameter generation (MLPG) algorithm.^{11} However, two non-negligible error sources still exist: the frame-wise independence assumption of the MSE criterion and the unimodal assumption of the linear output layer. Although the temporal information of speech is crucial for high-quality speech synthesis,^{12} the MSE criterion fragments the relationship between the static and dynamic features in the training phase. Wu and King proposed the minimum trajectory error (MTE) criterion to moderate this drawback by adding a static-dynamic constraint to the MSE criterion.^{13} MTE training can generate a more natural trajectory, but it is still over-smoothed due to the linear output layer.^{14} A mixture density network (MDN) can overcome this problem.^{14–17} Multiple Gaussian mixtures in an MDN can represent the multimodality of speech and predict the variances, whereas the linear output layer uses pre-computed fixed variances. Nevertheless, like the MSE criterion, the MDN also generates an unnatural trajectory because it cannot deploy temporal information during the training phase, i.e., the relationships between adjacent frames cannot be used. Another way of alleviating the over-smoothing effect is to utilize analytical features such as the global variance (GV) and the modulation spectrum (MS), known as perceptual cues.^{18,19} Training/synthesis algorithms constrained on the GV and MS show improved clarity of synthetic speech.^{20–23} However, these methods focus only on making the trajectory variation similar to the natural one and do not consider the multimodality of speech.

In this paper, a novel MTE criterion-based training algorithm for MDNs is proposed to address both over-smoothing and unnatural trajectory problems. To introduce the MTE criterion, we reformulated the conventional iterative MLPG algorithm for MDNs (Ref. 11) into a closed-form solution utilizing only the most probable mixture (MPM) component. The proposed algorithm covers both the static-dynamic constraint and the multimodality of speech and thus can generate more natural and clear synthetic speech. Furthermore, we introduce an MS constraint into the MTE loss function and mitigate the over-smoothing effect. Our experimental results confirm that the proposed algorithm improves the synthetic speech quality meaningfully both in objective and subjective evaluations.

## 2. Trajectory error training for an MDN constrained on a MS

### 2.1 Conventional training algorithm for a linear output layer

For a given frame-level linguistic feature sequence $X=[x_1^\top,\ldots,x_t^\top,\ldots,x_T^\top]^\top$ and an observed acoustic sequence $O=[o_1^\top,\ldots,o_t^\top,\ldots,o_T^\top]^\top$, the linear output layer in a DNN predicts the acoustic features directly^{6–10} as $\hat{O}=F_l(X;\Phi_l)$, where $F_l(\cdot)$ is a DNN-based mapping function with a linear output layer and $\Phi_l$ denotes the parameters of $F_l(\cdot)$. Note that the acoustic feature sequence consists of the static and dynamic features as $o_t=[c_t^\top,\Delta c_t^\top,\Delta^2 c_t^\top]^\top$, and $O$ can be obtained as $O=WC$, where $W$ is a weight matrix for calculating dynamic features from static features.^{11} To obtain $\hat{C}$ from the observed acoustic features $\hat{O}$ with the maximum likelihood criterion, an MLPG algorithm is used. It can be written as

$\hat{C} = \left(W^\top \Sigma^{-1} W\right)^{-1} W^\top \Sigma^{-1} \hat{O},$  (1)

where $\Sigma=\mathrm{diag}[\Sigma_1,\ldots,\Sigma_t,\ldots,\Sigma_T]$ is a covariance matrix sequence.^{11} Note that $F_l(\cdot)$ predicts only the mean of the acoustic features. For this reason, $\Sigma$ is pre-computed over the training corpus. The MSE loss is designed to minimize the L2 loss of the observed acoustic features $\hat{O}$, whereas the trajectory loss is defined as the L2 loss of the static acoustic features $\hat{C}$ in Eq. (1). Now, the MSE and MTE loss functions can be written as follows:

$L_{\mathrm{MSE}}(O,\hat{O}) = \|\hat{O} - O\|^2,$  (2)

$L_{\mathrm{MTE}}(C,\hat{C}) = \|\hat{C} - C\|^2.$  (3)

The MTE training algorithm has a static-dynamic constraint based on the MLPG algorithm in the training phase and can generate a more natural synthetic speech.^{13} However, this method still suffers from over-smoothing due to the unimodal assumption of the linear output layer.
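As a concrete illustration of the MLPG step discussed in this subsection, the following numpy sketch solves the closed-form static-dynamic system for a single feature dimension. The delta-window coefficients and the block layout of $W$ (static, delta, delta-delta stacked as blocks rather than interleaved per frame) are illustrative assumptions, not the exact configuration of the paper.

```python
import numpy as np

def build_w(T):
    """Weight matrix mapping static features c (T,) to [c; dc; ddc] (3T,).

    Illustrative delta windows: dc_t = 0.5*(c_{t+1} - c_{t-1}) and
    ddc_t = c_{t-1} - 2*c_t + c_{t+1}, with edges clamped.
    """
    I = np.eye(T)
    D1 = np.zeros((T, T))
    D2 = np.zeros((T, T))
    for t in range(T):
        lo, hi = max(t - 1, 0), min(t + 1, T - 1)
        D1[t, hi] += 0.5
        D1[t, lo] -= 0.5
        D2[t, lo] += 1.0
        D2[t, t] -= 2.0
        D2[t, hi] += 1.0
    return np.vstack([I, D1, D2])

def mlpg(o_hat, var):
    """Closed-form MLPG, C = (W' S^-1 W)^-1 W' S^-1 O, for one dimension.

    o_hat: predicted [static; delta; delta-delta] sequence, shape (3T,)
    var:   pre-computed (diagonal) variances, shape (3T,)
    """
    T = o_hat.shape[0] // 3
    W = build_w(T)
    P = np.diag(1.0 / var)          # Sigma^{-1} for diagonal covariances
    A = W.T @ P @ W                 # banded, positive definite
    b = W.T @ P @ o_hat
    return np.linalg.solve(A, b)    # static trajectory C-hat, shape (T,)
```

When the predicted features are exactly consistent ($\hat{O}=WC$), the solution recovers the static trajectory exactly; in practice the DNN outputs are inconsistent and MLPG returns the maximum-likelihood compromise.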

### 2.2 Conventional training algorithm for an MDN

An MDN, which utilizes the multimodal feature distribution using a Gaussian mixture model (GMM), can overcome the above problem of a linear output layer. The MDN maps $X$ to a GMM parameter sequence as $\Lambda=[\lambda_1^\top,\ldots,\lambda_t^\top,\ldots,\lambda_T^\top]^\top=F_m(X;\Phi_m)$,^{14} where $\lambda_t$ and $F_m(\cdot)$ denote the GMM parameters at the $t$th frame and the DNN-based mapping function with the MDN, respectively. $F_m(\cdot)$ can be trained by maximizing the GMM likelihood. The maximum likelihood (ML) loss function then becomes

$L_{\mathrm{ML}}(O,\Lambda) = -\sum_{t=1}^{T} \log \sum_{m=1}^{M} w_{m,t}\,\mathcal{N}\!\left(o_t;\,\mu_{m,t},\,\sigma_{m,t}^2\right),$  (4)

where $M$ is the number of mixtures and $w_{m,t}$, $\mu_{m,t}$, and $\sigma_{m,t}$ correspond to the mixture weight, the mean, and the variance of the $m$th Gaussian component at the $t$th frame. From the GMM parameters $\Lambda$, $\hat{C}$ can also be obtained using an MLPG algorithm.^{11} The MLPG algorithm for the MDN can be written as

$\hat{C} = \left(W^\top \overline{\Sigma^{-1}} W\right)^{-1} W^\top \overline{\Sigma^{-1}M},$  (5a)

$\overline{\Sigma^{-1}} = \mathrm{diag}\!\left[\sum_{m=1}^{M}\gamma_1(m)\Sigma_{m,1}^{-1},\ldots,\sum_{m=1}^{M}\gamma_T(m)\Sigma_{m,T}^{-1}\right],$  (5b)

$\gamma_t(m) = P(m \mid \lambda_t, \hat{o}_t),$  (5c)

$\overline{\Sigma^{-1}M} = \left[\sum_{m=1}^{M}\gamma_1(m)\Sigma_{m,1}^{-1}\mu_{m,1},\ldots,\sum_{m=1}^{M}\gamma_T(m)\Sigma_{m,T}^{-1}\mu_{m,T}\right]^\top,$  (5d)

where $\gamma_t(m)$ is the occupancy probability, which can be solved with an iterative expectation-maximization (EM) algorithm. Due to the iterative-form solution, the trajectory loss of the MDN cannot be derived, as opposed to Eq. (3).
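The ML criterion above is simply the negative log-likelihood of a per-frame GMM. A minimal numpy sketch, assuming diagonal covariances and illustrative array shapes of our own choosing, could look like this:

```python
import numpy as np

def mdn_nll(o, w, mu, sigma):
    """Eq.-(4)-style ML loss: negative log-likelihood of per-frame GMMs.

    o:     (T, D) target acoustic features
    w:     (T, M) mixture weights, each row summing to 1
    mu:    (T, M, D) component means
    sigma: (T, M, D) component standard deviations (diagonal covariance)
    """
    diff = o[:, None, :] - mu                                   # (T, M, D)
    # log N(o; mu, sigma^2) per component, summed over dimensions
    log_comp = -0.5 * np.sum(diff**2 / sigma**2
                             + np.log(2 * np.pi * sigma**2), axis=-1)
    log_mix = np.log(w) + log_comp                              # (T, M)
    # log-sum-exp over mixtures for numerical stability
    m = log_mix.max(axis=1, keepdims=True)
    ll = m.squeeze(1) + np.log(np.sum(np.exp(log_mix - m), axis=1))
    return -np.sum(ll)
```

In an actual training loop this quantity would be produced by the framework's autodiff machinery (e.g., as a loss over the MDN output layer), but the arithmetic is the same.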

### 2.3 Proposed trajectory error training algorithm for an MDN

To utilize the aforementioned advantages of both the MTE criterion and an MDN, we propose a new MTE training algorithm for an MDN. To derive the MTE loss function for an MDN, we reformulated Eq. (5a) into a closed-form solution using only the MPM component, which can be considered a suboptimal solution. In this case, $\gamma_t(m)$ is 1 at every time step, as with $M=1$. Accordingly, Eqs. (5b) and (5d) can be rewritten as $\overline{\Sigma^{-1}}=\Sigma_{\mathrm{MPM}}^{-1}=\mathrm{diag}[\Sigma_{m_{\mathrm{MPM}}(1),1}^{-1},\ldots,\Sigma_{m_{\mathrm{MPM}}(T),T}^{-1}]$ and $\overline{\Sigma^{-1}M}=\Sigma_{\mathrm{MPM}}^{-1}M_{\mathrm{MPM}}=[\Sigma_{m_{\mathrm{MPM}}(1),1}^{-1}\mu_{m_{\mathrm{MPM}}(1),1},\ldots,\Sigma_{m_{\mathrm{MPM}}(T),T}^{-1}\mu_{m_{\mathrm{MPM}}(T),T}]^\top$, respectively, where $m_{\mathrm{MPM}}(t)$ denotes the MPM at the $t$th frame. Finally, Eq. (5a) can be reformulated as follows:

$\hat{C} = \left(W^\top \Sigma_{\mathrm{MPM}}^{-1} W\right)^{-1} W^\top \Sigma_{\mathrm{MPM}}^{-1} M_{\mathrm{MPM}}.$  (6)

Using this closed-form solution, the MTE loss function for the MDN can be defined as follows:

$L_{\mathrm{MTE\,MDN}}(C,\hat{C}) = \|\hat{C} - C\|^2,$  (7)

where $\hat{C}$ is obtained from Eq. (6).

Note that $\Sigma$ is also trainable, in contrast to Eq. (3). Two ways to determine the MPM can be expressed as follows:

$m_{\mathrm{MPM}}(t) = \arg\max_{m}\, w_{m,t},$  (8)

$m_{\mathrm{MPM}}(t) = \arg\max_{m}\, w_{m,t}\,\mathcal{N}\!\left(o_t;\,\mu_{m,t},\,\sigma_{m,t}^2\right).$  (9)

In the synthesis phase, the MPM is determined by Eq. (8) because $o_t$ in Eq. (9) is unobservable. On the other hand, in order to train the mixtures corresponding to the training data, the MPM in Eq. (9) should be used in the training phase; it can then be considered a teacher-forcing method.^{24} If the MPM in Eq. (8) is adopted in the training phase, only one mixture is chosen as the MPM for a given linguistic feature, and the output of the DNN can be smoothed similarly to that of a linear output layer. In addition, when the MDN is trained with only the MTE loss function in Eq. (7), the parameters of the mixtures other than those of the MPM would be updated erroneously because they are affected by the shared layers. In order to train the MDN stably, we propose a new loss function to optimize Eqs. (4) and (7) jointly, as follows:

$L(O,\Lambda,C,\hat{C}) = L_{\mathrm{ML}}(O,\Lambda) + L_{\mathrm{MTE\,MDN}}(C,\hat{C}).$  (10)
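The two MPM selection rules described in this subsection, a mixture-weight-only rule for synthesis and a likelihood-weighted (teacher-forcing) rule for training, can be sketched as follows; numpy, diagonal covariances, and the array shapes are illustrative assumptions:

```python
import numpy as np

def mpm_prior(w):
    """Synthesis-time MPM rule: argmax over mixture weights only.

    w: (T, M) mixture weights. Returns the MPM index per frame, shape (T,).
    """
    return np.argmax(w, axis=1)

def mpm_posterior(o, w, mu, sigma):
    """Training-time (teacher-forcing) MPM rule.

    Picks argmax_m of w_{m,t} * N(o_t; mu_{m,t}, sigma_{m,t}^2), i.e., the
    mixture most responsible for the observed frame o_t.
    """
    diff = o[:, None, :] - mu                                   # (T, M, D)
    log_comp = -0.5 * np.sum(diff**2 / sigma**2
                             + np.log(2 * np.pi * sigma**2), axis=-1)
    return np.argmax(np.log(w) + log_comp, axis=1)
```

The difference matters exactly as the text argues: a frame lying near a low-weight component is assigned to that component by the posterior rule, whereas the prior rule would always pick the dominant mixture.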

### 2.4 Proposed MS-constrained training algorithm

Although the training algorithm in Sec. 2.3 can generate a more natural trajectory owing to the static-dynamic constraint, the generated features can still be smoothed by outliers because their distribution is not multimodal but irregular. To address this problem, we propose an MS-constrained MTE loss function. It improves the synthetic speech quality by making the trajectory variation similar to the natural one. We define the MS as the log power spectrum of the acoustic feature sequence^{21} $S(C)=[S_1,\ldots,S_D]$, $S_d=[S_d^0,\ldots,S_d^N]$, where $D$ and $N$ correspondingly denote the acoustic feature dimension and half of the fast Fourier transform (FFT) size. The segment-level MS, not the utterance-level MS, is used due to its normalization effect; it can be computed regardless of the utterance length without zero-padding.^{21} The MS loss function can be written as

$L_{\mathrm{MS}}(C,\hat{C}) = \frac{1}{K}\sum_{k=1}^{K} \left\|S(\hat{C}_k) - S(C_k)\right\|^2,$  (11)

where $K$, $S(C)$, and $S(\hat{C})$ denote the number of MS segments and the MS of the natural and generated acoustic features, respectively. From Eqs. (10) and (11), the MS-constrained MTE loss function for the MDN is defined as

$L(O,\Lambda,C,\hat{C}) = L_{\mathrm{ML}}(O,\Lambda) + L_{\mathrm{MTE\,MDN}}(C,\hat{C}) + L_{\mathrm{MS}}(C,\hat{C}).$  (12)

In contrast to Eq. (10), each term in Eq. (12) has a different perspective on improving the speech quality; $L_{\mathrm{MTE\,MDN}}(C,\hat{C})$ and $L_{\mathrm{MS}}(C,\hat{C})$ govern the generation loss and the variation of speech, respectively. If the optimization is biased toward $L_{\mathrm{MS}}(C,\hat{C})$, the trained model focuses only on the high-frequency components of the features and not on the generation loss. This problem can be solved by adding an emphasis coefficient $\alpha$, as used in an MS-based postfilter.^{21} The final loss function can be written as follows:

$L(O,\Lambda,C,\hat{C}) = L_{\mathrm{ML}}(O,\Lambda) + (1-\alpha)\,L_{\mathrm{MTE\,MDN}}(C,\hat{C}) + \alpha\,L_{\mathrm{MS}}(C,\hat{C}).$  (13)

Hence, a larger $\alpha$ yields a more natural variation in speech, while a smaller $\alpha$ yields a smaller generation loss in training.
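As an illustration of the segment-level MS computation and the MS loss, here is a numpy sketch using the window and FFT settings reported later in the experiments (Bartlett window of 25 frames, 12-frame shift, 64-point FFT); the small log floor is an assumption added for numerical safety, not part of the original formulation:

```python
import numpy as np

def modulation_spectrum(c, seg_len=25, shift=12, n_fft=64):
    """Segment-level log power spectrum of one feature trajectory c (T,).

    Returns an array of shape (K, n_fft // 2 + 1), one row per segment.
    """
    win = np.bartlett(seg_len)
    segs = []
    for start in range(0, len(c) - seg_len + 1, shift):
        frame = c[start:start + seg_len] * win
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        segs.append(np.log(power + 1e-10))      # floor avoids log(0)
    return np.stack(segs)

def ms_loss(c_nat, c_gen):
    """Eq.-(11)-style MS loss: mean squared log-spectral distance."""
    S_nat = modulation_spectrum(c_nat)
    S_gen = modulation_spectrum(c_gen)
    return np.mean((S_gen - S_nat) ** 2)
```

A smoothed trajectory loses high modulation-frequency energy, so its MS falls below the natural one and the loss grows, which is exactly the behavior the constraint penalizes.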

## 3. Experiments and results

### 3.1 Experiment configuration

The Blizzard Challenge 2013 Nancy corpus was used as the training data. It consists of 12 095 English utterances (∼18 h) sampled at 16 kHz.^{25} After excluding 100 evaluation utterances, 90% and 10% of the corpus were used as the training and validation data, respectively. The WORLD vocoder was utilized for the analysis/synthesis of the acoustic features.^{26} The 25th-order Mel-cepstral coefficients (MCs), coded aperiodicity (AP), continuous *F*0 (Ref. 27), and voiced/unvoiced (V/UV) measures were extracted for each frame with a 5 ms shift. In addition, 372 phoneme-based linguistic features, e.g., stress, inflection, and the quinphone identity, were extracted.^{28}

To investigate the performance of the proposed algorithm, experimental systems were constructed by adopting a bi-directional long short-term memory (bLSTM) recurrent neural network for the duration model (DM) and the acoustic model (AM). Three bLSTM layers with 64 and 256 memory cells were used for the DM and the AM, respectively. A linear output layer trained with the MSE criterion of Eq. (2) was used in the DM. The detailed AM configurations are given in Table 1. The number of mixtures for each acoustic feature was determined empirically: four for the MC, two for the AP, two for *F*0, and one for V/UV. The MS is computed from the acoustic feature sequence without V/UV. For the segment-level MS computation, a Bartlett window with a length of 25 frames and a shift of 12 frames was used, and the FFT size was 64 frames. The emphasis coefficient *α* of Proposed-2 was determined empirically through listening tests. We set *α* to 0.2 because the MTE loss increases rapidly when *α* ≥ 0.3, and we also found unstable fluctuations in some samples when *α* = 0.3. All systems were trained using a mini-batch stochastic gradient descent-based backpropagation algorithm with the Adam optimizer.^{29} The early stopping method was adopted to determine the number of training epochs.^{30} All systems were implemented in PyTorch.^{31}

Table 1. Configurations of the experimental systems.

| Notation | Output layer | Training phase | Synthesis phase |
|---|---|---|---|
| $MTE$ | Linear output layer | Update $\Phi_l$ with Eq. (3) | Generate $\hat{C}$ with Eq. (1) |
| $ML$ | MDN output layer | Update $\Phi_m$ with Eq. (4) | Generate $\hat{C}$ with Eq. (5a) |
| $Proposed-1$ | MDN output layer | Update $\Phi_m$ with Eqs. (9) and (10) | Generate $\hat{C}$ with Eqs. (6) and (8) |
| $Proposed-2$ | MDN output layer | Update $\Phi_m$ with Eqs. (9) and (13) | Generate $\hat{C}$ with Eqs. (6) and (8) |


For a subjective evaluation, preference tests on four cases were conducted: 20 listeners assessed 20 synthetic speech pairs for each case shown in Table 1. The Mel-cepstral distortion (MCD) and the root-mean-square error (RMSE) of *F*0 were computed for an objective evaluation. The speech samples were generated from the DM and the AM for the subjective test, whereas they were generated from only the AMs for the objective test. All experimental results were obtained on the 100 evaluation utterances.

### 3.2 Results and discussion

Figure 1 summarizes our subjective and objective test results, and Fig. 2 shows the generated features plotted with their natural counterparts. $ML$ and the proposed methods are preferred over $MTE$, as shown in Fig. 1(a). Although $MTE$ predicted *F*0 fairly well by deploying temporal information, Fig. 1(b) shows that it is limited in modeling the multimodality of the MC, as described in earlier work.^{15} Figure 2(b) also confirms that the MC sequence generated by $MTE$ is smoother than the others.

$Proposed-1$ outperforms $ML$ on both the objective and subjective measures. Specifically, the *F*0 generated by $ML$ is degraded by abrupt fluctuations in time, as illustrated in Figs. 1(c) and 2(a). This phenomenon also arose in earlier work^{17} and is caused by the absence of dynamic statistics in the training procedure. By introducing the MTE criterion, $Proposed-1$ moderates this drawback of the ML criterion. However, the features generated by $Proposed-1$ are still smoothed by outliers.

By introducing an MS constraint, $Proposed-2$ mitigates this smoothing problem, yielding MS and GV values closer to the natural ones, as shown in Figs. 2(c) and 2(d). Note that lower MS and GV values indicate a more smoothed and degraded trajectory. A significant improvement in the *F*0 estimation was achieved with only minor degradation in the MC estimation, as depicted in Figs. 1(b) and 1(c). We can therefore conclude that a certain amount of natural fluctuation of *F*0 and the MC in the time domain improves the naturalness and clarity of synthetic speech despite the higher generation loss. Our experimental results also support this idea: whereas the MC values generated by the other algorithms were narrowly distributed, those generated by $Proposed-2$ were widely distributed like those of the natural samples.

## 4. Conclusion

In this paper, a novel training algorithm for an MDN based on the MTE criterion was proposed in an attempt to overcome the quality degradation of synthetic speech caused by the frame-wise independence and unimodal assumptions of conventional DNN-based SPSS algorithms.

Our proposed algorithm improved the naturalness and clarity of synthetic speech, achieving more than a 60% preference score against the conventional algorithms in the subjective evaluation, as well as more precise *F*0 and less-smoothed MC in the objective evaluation. We also investigated an MS constraint, which reduces the over-smoothing caused by outliers, and obtained a meaningful improvement. Furthermore, the proposed closed-form parameter generation algorithm saves the computational cost of the iterative MLPG algorithm. Considering all these results, we conclude that the proposed methods achieve a meaningful improvement in synthetic speech quality.

## Acknowledgments

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Science, ICT, and Future Planning (Grant No. 2017R1A2B4011357).