Speech glottal flow has been predominantly described in the time-domain in past decades, the Liljencrants–Fant (LF) model being the most widely used in speech analysis and synthesis, despite its computational complexity. The causal/anti-causal linear model (LFCALM) was later introduced as a digital filter implementation of LF, a mixed-phase spectral model including both anti-causal and causal filters to model the vocal-fold open and closed phases, respectively. To further simplify computation, a causal linear model (LFLM) describes the glottal flow with a fully causal set of filters. After expressing these three models under a single analytic formulation, we assessed here their perceptual consistency, when driven by a single parameter Rd related to voice quality. All possible paired combinations of signals generated using six Rd levels for each model were presented to subjects who were asked whether the two signals in each pair differed. Model pairs LFLM–LFCALM were judged similar when sharing the same Rd value, and LF was considered the same as LFLM and LFCALM given a consistent shift in Rd. Overall, the similarity between these models encourages the use of the simpler and more computationally efficient models LFCALM and LFLM in speech synthesis applications.
I. INTRODUCTION
The acoustic theory of speech production formalised by Fant (1960) assumes independence and linearity between the airflow modulated in the glottis by the vibration of the vocal folds, called glottal flow, and the resonance effect of the vocal tract that shapes the glottal flow into a speech signal. The linear acoustic theory offers a somewhat simplified view of the physics of speech production, but it is still a very effective and widely used representation of voice signals for speech processing applications (e.g., speech coding, synthesis, parameterization) and acoustic phonetics analyses. In this theory, vocal tract resonances introduce spectral formants and anti-formants (maxima and minima of the spectral envelope) that characterise speech sounds. Vocal tract formants are themselves often associated with linear filters: series or parallel branches of second order resonant sections in formant synthesisers; auto-regressive filter models in linear prediction. In early applications, the voice source component was also considered as a low-pass filter, the so-called glottal formant. The transmission line analog proposed by Fant (1960) used a four-pole model, subsequently simplified to a two-pole model in linear prediction of speech by Markel and Gray (1982). Note that this glottal formant is not related to a physical resonance but describes the spectrum of the glottal pulse, modelled as the impulse response of the low-pass filter. However, glottal filter impulse responses poorly match glottal flow waveforms obtained by inverse filtering or by indirect measurements like electroglottography. This has led to the proposal of many glottal flow models (GFMs) defined in the time-domain by analytic and parametric formulations of the glottal flow waveform and its derivative: Rosenberg (1971) (Rosenberg model); Hedelin (1984), Fujisaki and Ljungqvist (1986), and Klatt and Klatt (1990) (KLGLOTT88 model); Fant et al. (1985) [Liljencrants–Fant (LF) model]; Veldhuis (1998) (R++ model).
These widely used models adopt various mathematical functions to describe the glottal flow oscillation, yet Doval et al. (2006) showed that the Rosenberg, KLGLOTT88, LF, and R++ models can be grouped under one general expression that is parameterized by a common set of five parameters. Variations of these parameters are closely related to voice quality perception (e.g., breathiness, tenseness, vocal force), which strongly motivates the use of GFMs in research on expressive speech. This includes analysis of emotion in speech [Gobl and Ní Chasaide (2003), Patel et al. (2011), and Ní Chasaide et al. (2013): LF model; Burkhardt and Sendlmeier (2000): KLSYNTH88 model]; analysis-resynthesis schemes for voice modification [Childers (1995), Cabral et al. (2014), and Degottex et al. (2013): LF model]; or expressive text-to-speech synthesis [Raitio et al. (2013), Airaksinen et al. (2016), and Juvela et al. (2019): LF model]. This list is not exhaustive; however, LF has been the most widely adopted model for analysis and synthesis of speech signals.
The main limitation of the LF model is its computational complexity. It requires solving implicit equations, which can only be done numerically. This model is therefore not suitable for applications where computational complexity is a constraint, such as real-time speech or singing synthesis. Also, spectral glottal flow models are desirable because voice quality is often described in spectral terms (e.g., voice spectral tilt, brightness, tenseness): spectral parameters are closer to perception than time-domain parameters. It is therefore interesting to investigate the apparent discrepancy between GFMs like LF and filter impulse-response models. Along this line, Doval et al. (2006) highlighted that LF and the other time-domain models under study have a simple magnitude representation in the frequency-domain that can be modelled with a third order filter, as also noted by Childers and Lee (1991). This has led to the proposal of new models: the causal/anti-causal linear model (LFCALM) by Doval et al. (2003), followed by an all-causal linear model (LFLM) used in the Cantor Digitalis singing synthesiser (Feugère et al., 2017). Both gradually simplify the computation of the glottal flow by using digital filters instead of analytic functions, thus enabling a precision-complexity trade-off, LF being the most precise and LFLM the simplest. While we will show in Sec. II that the simplifications made in LFCALM and LFLM can substantially modify the glottal flow waveform, it is not clear whether this affects their auditory perception. The aim of this paper is threefold. Section II studies the three models LF, LFCALM, and LFLM in terms of linear filters. Formulations for impulse responses are derived, and differences between the models are investigated. After this objective and analytic comparison, subjective experiments are conducted in Sec. III to assess the perceptual equivalence of the three models. Armed with analytic formulations and perceptual analyses, the discussion in Sec. IV summarises the results obtained: linear-filter formulations equivalent to the LF model are able to account for both the observed glottal formant and glottal flow waveforms.
II. LINEAR-FILTER FORMULATION OF GLOTTAL FLOW MODELS
A. Glottal flow model parameters: and Rd
All GFMs attempt to describe a vocal-fold vibration period in the time-domain (see Fig. 1). Three phases are considered: the opening phase (lung pressure forces the vocal folds to spread, and an increasing air flow passes through the glottis); the closing phase (the elasticity of the vocal folds takes over, closing the air passage); the closed phase (the airflow is blocked). Then the lung pressure increases again, and a new opening phase follows. This cycle can be represented by five parameters (Doval et al., 2006): the cycle period T0 or fundamental frequency F0 = 1/T0; the cycle amplitude, generally represented by E, the maximum of the absolute value of the glottal flow derivative (GFD) (i.e., the negative peak at the glottal closure instant has amplitude −E); the open quotient Oq, the ratio of the open phase duration Te over the period T0; the asymmetry coefficient αm, the ratio of the opening phase duration Tp over the open phase duration; and Ta, the closing time duration (Fig. 1). Period T0 and amplitude E change the time and amplitude scales of the glottal flow. The three other parameters change the shape of the glottal flow and account for the voice timbre or quality. Empirically, Fant et al. (1994) established that the perceptual effect of the shape parameters Oq, αm, and Ta can be gathered into a single high-level parameter, initially called Re and later Rd (Fant, 1995) (see Appendix A). Typical values of Rd range from 0.4 (short open phase, strong asymmetry of the glottal flow leading to a tense voice) to 2.7 (long open phase, symmetry of the glottal flow providing a relaxed voice). Note that the time-domain NAQ-coefficient proposed later (Alku et al., 2002) is proportional to Rd. Rd will be used as a control parameter below.
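As an illustration of how a single Rd value can drive the three shape parameters, the sketch below uses the regression formulas commonly attributed to Fant (1995) for the intermediate quantities Ra, Rk, and Rg. These formulas and the function name `rd_to_shape` are assumptions made for this example; the mapping actually used in this paper is the one given in Appendix A and may differ.

```python
def rd_to_shape(rd, f0):
    """Map Rd to (Oq, alpha_m, Ta), assuming Fant's (1995) regressions.

    rd: voice quality parameter (typical range 0.4 to 2.7)
    f0: fundamental frequency in Hz
    """
    t0 = 1.0 / f0                         # cycle period T0
    ra = (-1.0 + 4.8 * rd) / 100.0        # Ra = Ta / T0
    rk = (22.4 + 11.8 * rd) / 100.0       # Rk = (Te - Tp) / Tp
    # Rg = T0 / (2 Tp), obtained by inverting the definition of Rd
    rg = rk / (4.0 * (0.11 * rd / (0.5 + 1.2 * rk) - ra))
    oq = (1.0 + rk) / (2.0 * rg)          # open quotient Te / T0
    alpha_m = 1.0 / (1.0 + rk)            # asymmetry coefficient Tp / Te
    ta = ra * t0                          # closing time duration in seconds
    return oq, alpha_m, ta


if __name__ == "__main__":
    for rd in (0.4, 1.0, 2.7):
        oq, alpha_m, ta = rd_to_shape(rd, f0=100.0)
        print(f"Rd={rd:.1f}  Oq={oq:.3f}  alpha_m={alpha_m:.3f}  Ta={ta * 1e3:.3f} ms")
```

Under these assumed regressions, Oq grows and αm moves toward 0.5 as Rd increases, which matches the tense-to-relaxed trend described above.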
B. Glottal formant, LFCALM, and LFLM
Radiated sound pressure outside the vocal tract can be approximated by the derivative of the speech flow measured at the lips. For this reason, the glottal flow derivative is often preferred to the glottal flow for analysis purposes. The spectrum of the glottal flow derivative shows a marked spectral peak, the glottal formant. Figure 1 displays on the right the magnitude spectrum of the glottal flow derivative computed with the LF model, superimposed on a third order filter approximation. Two poles form the glottal formant, a low-frequency resonance with centre frequency Fg that is directly related to the oscillation of the open phase of the vocal folds. The remaining pole is an extra attenuation with cut-off frequency FST, called spectral tilt, that is responsible for the smoothness of the closed phase of the vocal folds. Phase analysis has shown that this third order approximation is a mixed-phase model (Gardner and Rao, 1997), allowing it to represent the open phase of the LF model as a second order filter response (damped sinusoid) that evolves toward negative time, while the closed phase resembles the response of a first order filter (decreasing exponential) that evolves toward positive time (bottom-left of Fig. 1). Following this analysis, the causal/anti-causal linear model of the glottal flow (LFCALM) was proposed by Doval et al. (2003) to generate a glottal derivative waveform by filtering a pulse train with the mixed-phase third order filter. LFCALM is a simple formulation reproducing the dual relations between time-domain parameters and spectral shape (Gobl et al., 2018; Henrich et al., 2001). A real-time implementation of the LFCALM model called RT-CALM was then derived by d'Alessandro et al. (2006). The mixed-phase characteristic of the glottal flow has been exploited for the estimation of the glottal flow from speech signals (Bozkurt et al., 2005; Drugman et al., 2011; Hézard et al., 2013).
The glottal formant can also be represented by causal filters, following Klatt (1980) and Holmes (1983), but at the expense of some distortion in the phase spectrum compared to the LF model. A formulation of this causal linear voice source model (LFLM) has been proposed and used for real-time voice synthesis and voice source analysis (Feugère et al., 2017; McLoughlin et al., 2020; Perrotin and McLoughlin, 2019, 2020). The perceptual effect of this phase difference is studied in Sec. III.
To summarise, the LF model, which is widely accepted as a precise time-domain GFM, has been simplified by a frequency-domain representation that uses a mixed-phase third order filter called LFCALM. To further reduce computational complexity, an all-causal linear model (LFLM) has recently been formulated.
All three GFMs are defined in terms of their open and closed phases, described separately in Secs. II C and II D. For this reason, we define glottal opening instants (GOIs), which mark the beginning of each open phase and are spaced by a duration of T0, and glottal closure instants (GCIs), which mark the beginning of each closed phase. Each GOI and the following GCI are spaced by the open phase duration Te = OqT0.
C. Modeling the open phase
1. General formulation of the open phase
Let us define the impulse response of a truncated second order filter, whose generic formulation is
If hT is anti-causal, T < 0 is the instant of truncation, , and . Its causal counterpart is defined for T > 0, , and . It can be shown that the open phase definitions of the three GFMs under study can be formulated with respect to Eq. (1) by appropriately setting the Gn, an, bn, , and T parameters (index n is subsequently replaced by the name of the model under consideration: LF, LFCALM, or LFLM). In their original formulations, LF is defined as a continuous time-domain function, while LFCALM and LFLM are defined as digital filters (Z-domain). For the sake of generalisation, all expressions are given below as equivalent continuous representations (time and Laplace domains), and details of the derivation from the original papers' formulations are given in Appendixes B, C, and D.
a. LF.
The LF model (Fant et al., 1985) is defined by an analytic function in the time-domain relative to the GOI and can be interpreted as an unstable, divergent, and truncated causal filter. However, re-parameterization with Oq and αm and setting the time origin at the GCI (see Appendix B) allow us to express LF as an anti-causal filter truncated at the GOI, matching Eq. (1). The equations below give the resulting waveform analytic expression and its Laplace transform:
One can now identify from the top equation the values of the parameters of Eq. (1), which are summarised in Table I. The open phase damping coefficient is set so that the airflow of a period is zero, and it results from an implicit equation (see Appendix B).
Parameter | LF | LFCALM | LFLM |
---|---|---|---|
bn | | | |
an | > 0 | | |
Gn | | | |
T | | | |
Open phase | | | |
Formulation | Analytic | Filter | Filter |
Causality | Anti-causal | Anti-causal | Causal |
Truncation | At the GOI | At the GOI | No truncation |
Closed phase | | | |
Formulation | Analytic | Filter | Filter |
Causality | Causal | Causal | Causal |
b. LFCALM.
The causal/anti-causal linear model uses a second order anti-causal and truncated bandpass filter to model the open phase of the glottis (Doval et al., 2003), whose equation and parameters are derived in Appendix C. The time-domain response of LFCALM, truncated at the GOI, and the frequency-domain response are given by computing the inverse Z-transform and Laplace transform of the filter, respectively,
c. LFLM.
The LFLM model (Feugère et al., 2017) is the causal version of LFCALM, with the difference that the filter is not truncated, since it converges (see Appendix D). The time and frequency responses of LFLM, whose parameters are given in Table I, are again given by computing the inverse Z-transform and Laplace transform of the filter,
2. Comparison between the GFM open phases
Figure 2 displays the open phases of LF (blue), LFCALM (orange), and LFLM (green) for the glottal flows (top-left), GFDs (bottom-left), and spectrum of the GFD (right), computed with E = 0.2. The top-right of Fig. 2 displays similarities between the three models. First, all open phases derive from second order filters, as their respective Laplace transforms all show a similar denominator with a pair of complex conjugate poles. This results in ±20 dB/decade asymptotes. In particular, all Laplace transforms simplify to E/s at high frequencies, resulting in similar asymptotes for the three GFMs. At low frequencies, the asymptotes are shifted between models, but only by a few dB.
LF and LFCALM display two more similarities. First, their anti-causality causes the GFD phase to increase (bottom-right of Fig. 2); second, they are both truncated at the GOI. The thin dashed curves in the left panels show what would be non-truncated versions of LF and LFCALM. A direct effect of the truncation is that their Laplace transforms are computed on a finite interval, which introduces an additional truncation term in both transfer functions. This causes the ripples observed in the LF and LFCALM spectra. The main difference between LF and LFCALM is that the former is parameterized to be of class C1, i.e., with a continuous GFD at the GOI. This parameterization results in a generic second order filter that is neither low-pass nor bandpass, as shown by the numerator of its Laplace transform. A consequence is the large lobe around the resonance frequency of the GFD magnitude spectrum. Conversely, LFCALM is parameterized to be a bandpass filter, which allows a reduction of the resonance's lobe width but cannot suppress it completely because of the effect of truncation. The consequence of the bandpass parameterization is a discontinuous GFD at the glottal opening instant.
Two differences between LFCALM and LFLM are also highlighted. The difference of causality appears clearly as a vertical symmetry in the time-domain and a horizontal symmetry of the phase spectrum. Also, because LFLM converges, it is not truncated. This implies a leak of each period into the next one but also greatly simplifies its implementation. As a result, its spectrum is the exact frequency response of a bandpass filter, with no ripples and no lobe around the resonance centre frequency. Note that the vertical symmetry of the GFD between LFCALM and LFLM implies a sign inversion of the glottal flow, but one that the ear is not sensitive to.
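The link between time reversal and phase symmetry can be checked numerically. The sketch below is not the LFCALM or LFLM filter itself: it uses a generic damped sinusoid with hypothetical values for the damping `a` and pulsation `b`, reverses it in time (circularly, so that the discrete Fourier transform relation is exact), and compares the two spectra.

```python
import numpy as np

fs = 16000.0                       # sampling rate in Hz (arbitrary for this sketch)
n = 1024                           # number of samples
t = np.arange(n) / fs

# Hypothetical causal second order response: damped sinusoid
a = 300.0                          # damping in 1/s (assumed value)
b = 2.0 * np.pi * 100.0            # pulsation in rad/s (assumed value)
causal = np.exp(-a * t) * np.sin(b * t)

# Circular time reversal x[(-n) mod N]: the anti-causal counterpart
anticausal = np.roll(causal[::-1], 1)

spec_causal = np.fft.fft(causal)
spec_anticausal = np.fft.fft(anticausal)

# Same magnitude spectrum, opposite phase (complex conjugate spectra)
same_magnitude = np.allclose(np.abs(spec_causal), np.abs(spec_anticausal))
conjugate_phase = np.allclose(spec_anticausal, np.conj(spec_causal))
print(same_magnitude, conjugate_phase)
```

For a real signal, reversing time conjugates the spectrum, so the magnitude is unchanged and the phase changes sign, which is the symmetry visible between the causal and anti-causal open phases.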
D. Modeling the closed phase
1. Formulation of the closed phases
Definitions of the GFM closed phase fall within two categories (Doval et al., 2006): it is either described in the time-domain by an analytic formulation, as for LF, or defined in the frequency-domain with a first order filter, as for LFCALM or LFLM.
a. LF: Analytic expression.
The closed phase of the LF model, after shifting the glottal closing instant to t = 0, is expressed as
where ϵ is the closed phase coefficient. It satisfies the continuity of the open and closed phase expressions at the GCI and is obtained from an implicit equation (see Appendix B). Note that because a is computed from ϵ, the shape of the open phase depends on the closed phase, although both phases are defined by distinct analytical expressions.
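For reference, the return phase of the classical LF model (Fant et al., 1985) can be written as below once the GCI is placed at t = 0. This is a sketch of the textbook form, under the assumption that the closure completes at the end of the period; the notation of the expression used in this paper may differ.

```latex
% Classical LF return phase with the GCI at t = 0 (assumed form).
% E: GFD amplitude at the GCI, T_a: closing time duration,
% T_e: open phase duration, T_0: period, \epsilon: closed phase coefficient.
\begin{align}
  g'_{\mathrm{LF}}(t) &= -\frac{E}{\epsilon T_a}
      \left( e^{-\epsilon t} - e^{-\epsilon (T_0 - T_e)} \right),
      \qquad 0 \le t \le T_0 - T_e, \\
  \epsilon T_a &= 1 - e^{-\epsilon (T_0 - T_e)} .
\end{align}
```

The second line is the implicit equation in ϵ: it guarantees that the expression equals −E at t = 0, so that the open and closed phases connect continuously at the GCI.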
b. LFCALM and LFLM: Filtering.
With LFCALM and LFLM, the closed phase is modelled by a first order low-pass filter that attenuates high frequencies above its cut-off frequency and is called spectral tilt (Doval et al., 2003; Feugère et al., 2017). The filter formulation is given in Appendix C. In these cases, the spectral tilt filter is applied to the full signal and therefore changes the open phase shape.
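To make the idea of a first order spectral tilt filter concrete, here is a minimal sketch of a generic one-pole low-pass with unit gain at 0 Hz. The coefficient design (pole at exp(−2πFc/Fs)) and the function name `one_pole_lowpass` are assumptions for illustration; the filter actually used by LFCALM and LFLM is the one specified in Appendix C.

```python
import math


def one_pole_lowpass(x, fc, fs):
    """Generic causal one-pole low-pass: y[n] = (1 - p) x[n] + p y[n-1].

    x: input samples, fc: cut-off frequency in Hz, fs: sampling rate in Hz.
    The pole p = exp(-2 pi fc / fs) is an assumed design, not the paper's.
    """
    p = math.exp(-2.0 * math.pi * fc / fs)
    y = []
    previous = 0.0
    for sample in x:
        previous = (1.0 - p) * sample + p * previous
        y.append(previous)
    return y


if __name__ == "__main__":
    impulse = [1.0] + [0.0] * 4999
    response = one_pole_lowpass(impulse, fc=1000.0, fs=96000.0)
    # The response is a decreasing exponential that never reaches zero exactly,
    # which is why an untruncated filter can leak into the next period.
    print(response[:3])
```

Because the filter is applied to the whole signal, it smooths the open phase as well as shaping the closed phase.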
2. Comparison between the GFM closed phases
Figure 3 displays the three GFM full waveforms, obtained by adding to the open phases of Fig. 2 their respective closed phase contributions while keeping the same parameter values. Note that this process changes the open phases. The top-right panel shows high similarity between the three GFMs' spectrum magnitudes. The closed phase adds a supplementary −20 dB/decade attenuation to all open phase spectra, resulting in a −40 dB/decade attenuation at high frequencies. We can also observe an increase in gain at low frequencies for the LF model. This is directly linked to the change of the open phase damping coefficient. A consequence is that the glottal flow and glottal flow derivative have the largest amplitudes for LF.
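The two slopes can be recovered with a short asymptotic argument. The sketch below assumes a generic unit-gain first order low-pass with cut-off pulsation ω_ST = 2πF_ST for the spectral tilt; the exact filter expressions are those of the appendixes.

```latex
% High-frequency asymptotes of the GFD spectrum (sketch with assumed filter forms).
\begin{align}
  |H_{\mathrm{open}}(j\omega)| &\approx \frac{E}{\omega}
    &&\Rightarrow\quad 20\log_{10}|H_{\mathrm{open}}| \approx 20\log_{10}E - 20\log_{10}\omega
    \quad (-20~\text{dB/decade}), \\
  \left|\frac{1}{1 + j\omega/\omega_{ST}}\right| &\approx \frac{\omega_{ST}}{\omega}
    \quad (\omega \gg \omega_{ST})
    &&\Rightarrow\quad |H_{\mathrm{full}}(j\omega)| \approx \frac{E\,\omega_{ST}}{\omega^{2}}
    \quad (-40~\text{dB/decade}).
\end{align}
```

Each factor of 1/ω lowers the magnitude by 20 dB when the frequency is multiplied by ten, so the open phase alone falls at −20 dB/decade and the full waveform at −40 dB/decade.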
Looking at the phase spectrum (bottom-right panel), LF and LFCALM almost overlap, showing a similar effect of the closed phases on their respective phase spectra: the closed phase adds a supplementary negative phase offset at high frequencies to all phase spectra of the open phases. The spectral tilt filter is displayed in black. This offset introduces an asymmetry between LF and LFCALM on one side and LFLM on the other: at high frequencies it reduces the phase of LF and LFCALM to π, and it also reduces the phase of LFLM. This asymmetry is reflected in the shapes of the glottal flow derivatives (bottom-left panel). One can see that LFCALM and LFLM are not symmetrical anymore and that the filtering does not attenuate the GFD peak near the glottal closure instant equally for the two models. Finally, it is important to mention that the spectral tilt filter is not truncated for LFCALM and LFLM, and its application results in an infinite response that may overlap with the next period. This appears for high values of Rd, as shown in Sec. II F.
E. Assessment of computational costs
To evaluate the computational efficiency of each GFM, we measured the average time necessary to compute one period of a 1-s stationary signal for each model. The ratio of computation time over the period duration gives the real-time factor. A real-time factor below 1 means that the signal is faster to compute than to play back, so we can listen to the signal while it is generated. Inversely, a real-time factor higher than 1 indicates that the signal takes longer to compute than to play back. This experiment was run under fine-grain control of the GFM: parameters are calculated for each period. To assess the dependency of the real-time factor on F0 and Rd, we generated 564 stationary signals using a combination of the six Rd values described in Sec. III and 94 F0 values, from 70 to 1000 Hz in steps of 10 Hz. All signals were generated on an iMac Intel Core i9, with a 3.6 GHz processor. Figure 4 displays the real-time factors for the three GFMs depending on F0. For each model and F0 value, we computed the mean and standard deviation of the real-time factor across the six Rd values. The means for each model are represented by the thick coloured lines, and the shading around each mean value highlights the range of one standard deviation on either side of the mean. LFCALM and LFLM are 10 to 100 times faster than LF. This is a direct consequence of the resolution of the implicit equation for the LF model, which is costly. Also, the efficiency of LF decreases with higher F0 because the resolution of the implicit equation requires a constant duration. Therefore, when the period duration decreases, the real-time factor increases, and this dependency between computational efficiency and input parameter is not desirable. Finally, Rd has no effect on the computation time for any of the three GFMs.
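The real-time factor described above can be measured along the following lines. This is a minimal sketch: `compute_period` stands for any hypothetical routine that synthesises one glottal period, and the timing loop is an assumption about how such a measurement could be set up, not the authors' benchmark code.

```python
import time


def real_time_factor(compute_period, f0, n_periods=100):
    """Mean computation time of one period divided by the period duration 1/f0."""
    start = time.perf_counter()
    for _ in range(n_periods):
        compute_period()
    mean_compute_time = (time.perf_counter() - start) / n_periods
    return mean_compute_time * f0   # same as mean_compute_time / (1 / f0)


# Grid of fundamental frequencies: 70 to 1000 Hz in steps of 10 Hz
f0_grid = list(range(70, 1001, 10))
n_rd_values = 6
n_signals = len(f0_grid) * n_rd_values   # number of stationary signals per model

if __name__ == "__main__":
    def dummy_period():
        # Placeholder workload standing in for one period of a GFM
        return sum(i * i for i in range(1000))

    for f0 in (70, 500, 1000):
        print(f0, real_time_factor(dummy_period, float(f0)))
```

If the computation time per period is constant, the factor grows linearly with F0, which is the behaviour reported for LF.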
F. Summary of the model implementation and effect of Rd
Table I summarises the implementations of the three GFMs under study. To conclude this section, Fig. 5 shows the effect of Rd on the GFD (top row) and the respective spectra computed on a single period (second and third rows) for the three models [LF (blue), LFCALM (orange), and LFLM (green)]. In the top row, the dashed vertical lines represent the GOIs, while the dotted lines show the GCIs. In the second and third rows, the vertical line indicates the cut-off frequency of the spectral tilt filter. Globally, Rd has a similar effect on the three GFMs. Looking at the spectrum magnitude, low values of Rd lead to a higher centre frequency and bandwidth of the glottal formant and a higher spectral tilt cut-off frequency. These combined effects favour the presence of numerous harmonics that give a sharp GFD closure, close to the shape of an impulse. This is typical of tense and loud voice, when the vocal folds open and close abruptly. Inversely, high values of Rd lower the centre frequency and bandwidth of the glottal formant as well as the spectral tilt cut-off frequency. This emphasises the first and second harmonics more, leading to a more sine-like GFD shape. This corresponds to lax or soft voice, when the vocal folds oscillate more symmetrically.
In the first column of Fig. 5, the three GFMs appear very similar for two reasons. First, a low value of Rd leads to a high attenuation coefficient an that allows LF and LFCALM to have almost horizontal tangents at the GOI. The truncation thus does not introduce an abrupt change of slope on the GFD, which results in a reduction of ripples on the LF and LFCALM spectra. Second, the effect of spectral tilt that introduces an asymmetry between LFCALM and LFLM is small (high cut-off frequency), leading to almost symmetrical LFCALM and LFLM GFDs. Inversely, the three GFM shapes diverge with increasing values of Rd. Truncation has stronger effects on LF and LFCALM, increasing ripples in their spectra, and the spectral tilt, whose cut-off frequency is close to the glottal formant position, has a strong effect on the GFD shapes. In particular, one can note that the minimum values of LFCALM and LFLM diverge from −E when Rd increases. Moreover, the last column illustrates well the effect of the absence of truncation of the spectral tilt filter on LFCALM and LFLM. The GFD computed for one period overlaps with the next one, leading to a negative (respectively, positive) value of the GFD at the GOI for LFCALM (respectively, LFLM).
We have shown that the differences in construction between the three GFMs (formulation, causality, truncation) lead to clearly visible differences in the GFD waveforms and spectra. However, their effect on auditory perception is unclear and is assessed in Sec. III.
III. PERCEPTUAL COMPARISON OF VOICE SOURCE MODELS
A. Experiment
1. Protocol and task
The aim of the experiment was to assess any perceptual difference between the three GFMs for different values of the Rd parameter. We used for this purpose a two-alternative forced-choice (2AFC) protocol (Kingdom and Prins, 2016), where each subject's task was to listen to paired sounds and to say if they were the same or different, with respect to any distinctive features, whatever their nature (e.g., timbre, level, pitch, etc.). The experiment was divided into three blocks. The first block used synthesised sounds from the GFMs only. The second and third blocks used additional /a/ and /i/ vocal tract models convolved with the GFMs. These two vowels were chosen for their lowest (/i/) and highest (/a/) first formant frequency in order to test a more natural vocal sound than the GFM alone.
For each GFM and following Degottex et al. (2013), six values of Rd were chosen equally spaced on a logarithmic scale, leading to three GFMs × 6 Rd = 18 stimuli per block (those displayed in Fig. 5). Then for each block, every ordered pair of different GFMs was tested (LF × LFCALM; LFCALM × LF; LF × LFLM; LFLM × LF; LFCALM × LFLM; LFLM × LFCALM). Finally, 3 vowel conditions × 6 pair combinations × 6 Rd values for the first element of the pair × 6 Rd values for the second element of the pair led to a total of 648 pairs of stimuli to compare.
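The count of stimulus pairs can be reproduced by enumerating the design, as sketched below. The Rd endpoints 0.4 and 2.7 are an assumption taken from the typical range quoted in Sec. II A (they give a third value close to 0.86, which is mentioned in the results); the exact six values are those shown in Fig. 5.

```python
import itertools

models = ["LF", "LFCALM", "LFLM"]
vowel_conditions = ["source only", "/a/", "/i/"]

# Six Rd values equally spaced on a logarithmic scale (assumed endpoints)
rd_min, rd_max, n_rd = 0.4, 2.7, 6
rd_values = [rd_min * (rd_max / rd_min) ** (i / (n_rd - 1)) for i in range(n_rd)]

# Ordered pairs of different models (both presentation orders)
model_pairs = list(itertools.permutations(models, 2))

# One trial = vowel condition, first model and Rd, second model and Rd
trials = [
    (vowel, first, rd_first, second, rd_second)
    for vowel in vowel_conditions
    for (first, second) in model_pairs
    for rd_first in rd_values
    for rd_second in rd_values
]

stimuli_per_block = len(models) * len(rd_values)

if __name__ == "__main__":
    print([round(rd, 2) for rd in rd_values])
    print(stimuli_per_block, len(model_pairs), len(trials))
```

Pairs of the same model are excluded by construction, so each block contains 6 × 36 = 216 comparisons.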
A computer interface was specially designed for this experiment and programmed in max 6.1 The protocol was identical for all the paired stimuli. To proceed, the subject clicked a button, which launched the playback of two sounds, A and B, separated by 500 ms. The test sounds were ordered randomly and played for each subject only once to keep sessions as short as possible and identical among subjects. The subject had to choose whether the two sounds were identical or different, without any other choice. Each block lasted approximately 10 min, and subjects were especially encouraged to stop and rest between the three blocks with a message displayed automatically. The entire experiment took place in an acoustically insulated and treated room designed for perceptual experiments. Sound was played using a Focusrite (High Wycombe, UK) Scarlett 2i2 audio interface on a Mac OSX and AKG (Los Angeles, CA) K271 headphones. Before the experiment, subjects were trained with a subset of the sound-pair list (GFM convolved to /a/ vocal tract or without vocal tract, three Rd values spread over the full range of possible values).
A group of 18 subjects took part in the experiment (median age of 28 years, from 21 to 54 years old). Among them, 12 subjects worked in the field of sound technologies, and six others had a regular musical practice. An audiogram test was performed for each of the subjects, and none of them reported any known auditory impairment except one who was single-side deaf, but stereo listening was not needed to perform the task. Fourteen subjects were members of the laboratory and participated in the experiment on a voluntary basis without being paid. The four remaining subjects were paid for the experiment.
2. Stimuli specification
Stimuli were synthesised at a sampling rate of Fs = 96 kHz. A constant fundamental frequency F0 and a peak amplitude E = 0.2 were chosen. The LF GFDs were generated by using the analytic formulations of Eqs. (2) and (5) and by solving the implicit Eqs. (B3) and (B4). The LFCALM and LFLM GFDs were generated by filtering a pulse train with their respective open and closed phase filters (Appendix E). All signals lasted 0.3 s, a duration longer than a standard spoken syllable but short enough to facilitate recall of the two stimuli for comparison. Fade-in and fade-out were applied using half Hanning windows. Vowels were invariant in time and were applied by filtering the GFM with a bank of five parallel resonant filters corresponding to vowels /i/ and /a/, whose transfer functions are given in Feugère et al. (2017). Finally, all stimuli were normalised in dBA.2
B. Results
Results report the proportion of pairs that were judged similar depending on the factors under consideration. In particular, we factorised the six different model pairs into two factors: the Model factor (three levels: LF × LFCALM; LF × LFLM; LFCALM × LFLM) and the Order factor that codes the order of presentation of each pair (two levels). The additional factors are Vowel (three levels: source only; /a/; /i/) and Rd (36 levels for all combinations of the six selected values). In the following, we used a single generalised linear model following a binomial distribution to assess the significance of each factor and their interactions for the perception results. The obtained model was subsequently simplified by iteratively removing non-significant interactions between factors provided that, at each simplification step, the current and the simplified models do not significantly differ (p > 0.05) (Crawley, 2013). Post hoc Pearson's chi-squared tests were run to assess whether proportions obtained for single conditions significantly differ from chance.
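A post hoc test of this kind can be sketched as a Pearson chi-squared goodness-of-fit test with one degree of freedom. The sketch assumes that chance corresponds to 50% of "similar" answers and applies no continuity correction; these choices and the function name `chi2_vs_chance` are assumptions, since the exact test settings are not detailed here.

```python
import math


def chi2_vs_chance(n_similar, n_total):
    """Pearson chi-squared test of a similar/different count against 50% chance.

    Returns the statistic (1 degree of freedom) and its p-value.
    """
    expected = n_total / 2.0
    n_different = n_total - n_similar
    chi2 = ((n_similar - expected) ** 2 + (n_different - expected) ** 2) / expected
    # Survival function of the chi-squared law with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    for n_similar in (9, 13, 18):
        chi2, p_value = chi2_vs_chance(n_similar, 18)
        print(n_similar, round(chi2, 2), p_value)
```

A proportion close to one half gives a p-value above 0.05 and is therefore not distinguishable from chance, while answers that are almost unanimous give a very small p-value.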
Figure 6 shows perceptual experiment responses for all factors and interactions except Order (results for both presentation orders of each pair are merged). The top-left panel shows results relative to the Rd factor only. Each square corresponds to the proportion of pairs judged similar for a given couple of Rd values, all levels of Model and Vowel combined. Pairs in black and white were judged similar by 100% and 0% of the subjects, respectively. Scores that fall within the red rectangle on the colour bar do not significantly differ from chance according to the post hoc Pearson's chi-squared tests (p > 0.05). On the left-hand side of the figure, the top row (columns 2–4) shows the Model × Rd interaction, with all levels of Vowel and Order combined; the left column (rows 2–4) shows the Vowel × Rd interaction, with all levels of Model and Order combined; the remaining panels show the Vowel × Model × Rd interaction for each level of Vowel and Model, indicated in the top and left margins of the figure. Panels with yellow and green contours are replicated on the right side, with Rd(LF) put on the abscissa. On top (respectively, bottom), for each Rd(LF) value (each column), the distribution of perceived similar Rd(LFCALM) [respectively, Rd(LFLM)] values was obtained and superimposed on the figure, the circles being the medians and the error bars corresponding to 90% of the values around the median. Smaller circles indicate scores below the level of significance (Pearson's chi-squared test). The shaded area links all error bars and represents the space of perceptual equivalence between Rd(LF) and Rd(LFCALM) [respectively, Rd(LFLM)].
1. Effect of Rd and order
The Rd factor has the strongest effect on results [Rd: degrees of freedom (df) = 35, p < 0.001]. The top-left panel of Fig. 6 clearly shows that, over all other factors, pairs with similar values of Rd are strongly perceived as similar, and vice versa. This confirms that Rd has a strong perceptual effect on the synthesis of glottal flow. Presentation order had no influence on similarity judgment (Order: df = 1, p = 0.90). Therefore, all results displayed in Fig. 6 and detailed below combine the scores of both presentation orders.
2. Effect of model
The Model factor alone has a small and marginally significant effect on subjects' scores (Model: df = 2, p = 0.012), which suggests that the three models are perceptually close to each other. LFCALM and LFLM are judged the most similar models, and LF and LFLM are judged the least similar when all answers are averaged. The subjects' perception seems to reflect the differences between the models' constructions that are summarised in Table I. LFCALM and LFLM derive from the same filtering process, with the only differences being the causality of the open phase and the truncation of LFCALM. Inversely, LF and LFLM differ at almost every point of Table I. While these results average all possible Rd pairs, results depending on Rd follow the significant two-way interaction between Model and Rd (Model × Rd: df = 70, p < 0.001). Corresponding results are shown in the top row of the left side of Fig. 6 (columns 2–4). The first observation is that stimuli with similar values of Rd are judged extremely similar (close to 100% similarity), while stimuli with different values of Rd are judged different (0% similarity). One can then note a diagonal asymmetry in the LF × LFCALM and LF × LFLM panels for Rd values higher than 0.86, i.e., when the models start to differ the most (Fig. 5). In particular, subjects judged LF and LFCALM similar mostly when Rd(LFCALM) was greater than or equal to Rd(LF). Similarly, LF and LFLM were mostly judged similar when Rd(LFLM) was greater than or equal to Rd(LF). Conversely, LFCALM and LFLM were judged the most similar when they shared the same Rd value, giving more symmetric results (top-right panel of the left side of Fig. 6).
The right side of Fig. 6 summarises this asymmetry between LF and the other models. Recall that these panels are replicates of the ones with yellow and green contours from the left-hand side, but with LF put in the abscissa for both plots. For each Rd(LF), medians of corresponding distributions of perceived similar Rd(LFCALM) [respectively, Rd(LFLM)] are all on or above the diagonal. Also, the spread of each distribution represented by the error bars (90% of the values around the median) and emphasised by the shaded areas clearly displays asymmetrical spaces of perceptual equivalence between Rd(LF) and Rd(LFCALM) [respectively, Rd(LFLM)] that are again above the diagonal, with Rd(LFCALM) [respectively, Rd(LFLM)] mostly equal to or greater than the corresponding Rd(LF).
3. Effect of vowel
The effect of vowels (Vowel: , df = 2, p < 0.001) indicates that GFDs presented alone were judged significantly less similar than when they were passed through a vowel, the vowel /i/ giving the highest similarity results. Therefore, the introduction of resonances in the signal mitigates the perception of the glottal source timbre. Moreover, the glottal formant Fg evolves within the range [64, 121] Hz for the chosen values of Rd for all models. The vowel /i/, having its first formant resonance the closest to Fg, could mask the effect of Rd variation, leading to sources being judged more similar with the vowel /i/ than with the vowel /a/.
Also, a significant two-way interaction with Rd is present (Vowel × Rd: , df = 70, p < 0.001) as shown in the left column of Fig. 6, rows 2–4. Stimuli presented with the source only show similarity concentrated around the diagonal. When presented with the vowel /i/, the similarity spreads across adjacent Rd values for high Rd. This corresponds to Fg and FST values that are around 100 Hz, close to the first formant frequency of vowel /i/ (215 Hz). Conversely, for the /a/ vowel, it seems that stimuli with high Rd value were neither clearly perceived as similar nor dissimilar. In this case, the first formant frequency (700 Hz) is far above the Fg and FST ranges. A possibility is that subjects focused on either the low- or the high-frequency part of the signal, the former group hearing the source differences and the latter focusing on the /a/ resonance.
4. Remaining interactions
No significant three-way interaction between Vowel, model, and Rd was detected. It can be seen in Fig. 6 that the trend previously observed in the top row and left column (two-way interactions) applies to the remaining plots. Statistical analysis did not reveal a significant Vowel × model interaction, showing that the perception of differences between models is relatively independent from the addition of a vocal tract. Although it would be necessary to cover a larger number of vocal tract configurations, this finding encourages the hypothesis that the choice of the glottal flow model can be made independently from the behavior of the vocal tract. Finally, two-way interactions Order × Rd and Order × model result from the asymmetry of the model levels (, df = 35, p < 0.001; , df = 2, p = 0.022, respectively). The top row of Fig. 6 showed an asymmetry between LF and LFCALM and between LF and LFLM. When considering the order of presentation as a factor, e.g., distinguishing LF × LFCALM vs LFCALM × LF, the asymmetry of the LF × LFCALM results is reversed compared to LFCALM × LF, hence the two-way interaction.
IV. DISCUSSION AND CONCLUSION
In this study, the LF model is reformulated in terms of linear filters. This formulation reconciles the apparent discrepancy between time-domain GFM and spectral voice source models. It allows for quantitative spectral interpretation of the model parameters because the correspondence between time-domain and spectral parameters can be analytically computed. This unifies Fant's views on the voice source: the key point is the interpretation of the GFM [in Fant et al. (1985)] as a mixed phase system and not as a simple resonant filter [as in Fant (1960)]. The joint variation of the waveform and glottal formant as a function of Rd can be computed for voice quality analysis and synthesis. As a rule of thumb, increasing Rd corresponds to lowering the glottal formant centre frequency (often referred to as the “voicing bar” in wideband spectrogram reading) and increasing the spectral tilt toward lower frequencies (the right-hand “skirt” of the glottal formant).
Following the proposal of glottal flow models that attempt to reduce the computational complexity of LF, namely LFCALM and LFLM, we sought to assess the perceptual consistency of these models. We first showed that even though LF is defined from an analytic expression and LFCALM and LFLM from digital filters, they can all be expressed by the same analytic function, each with its own set of parameters. In terms of construction, LF and LFCALM have anti-causal and truncated open phases, while LFLM has a causal and non-truncated open phase. The three GFM closed phases are causal.
Perceptual pairwise comparison of these models, parameterized with various levels of Rd and using a same-different forced-choice paradigm on short stationary signals, shows that all models are perceived similarly, in that they share the same Rd parameterization with a possible offset. In particular, LFCALM and LFLM are perceived similarly with the same Rd, while LF is perceived as similar to LFCALM and LFLM when LF has a smaller Rd value. Our results suggest that this shift in perception relates more to the truncation of the glottal flow open phase than to a difference of causality. Nevertheless, this needs to be confirmed in further experiments. Finally, we showed that the addition of a vocal tract effect with low vocalic formants increases the perceived similarity of waveforms when Rd varies slightly between them. While the high dissimilarity between waveforms (Fig. 3) has favoured the use of LF for precise analysis of the glottal flow (i.e., time-domain analyses), the perceptual consistency between models encourages the use of LFCALM and LFLM as simpler models than LF for speech synthesis applications and for spectral analyses of the voice source and voice quality.
ACKNOWLEDGMENTS
Part of this work has been done in the framework of the Agence Nationale de la Recherche, through the ChaNTeR and GEPETO Projects (ANR-13-CORD-0011, 2014–2017, ANR-19-CE28-0018, 2019–2023) and “Investissements d'avenir” programs ANR-15-IDEX-02 and ANR-11-LABX-0025-01. The authors are indebted to Professor Boris Doval for his help in the development of the model calculations.
APPENDIX A: HIGH- TO LOW-LEVEL GLOTTAL PARAMETERS
Fant (1995) derived a unique high-level parameter Rd to control all low-level parameters Oq, αm, and Ta. He first defined intermediate parameters Ra, Rk, and Rg from which are derived the low-level parameters,
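As an illustration of this mapping, the following minimal sketch uses the regressions commonly quoted from Fant (1995) for Ra, Rk, and Rg, together with the usual relations Oq = (1 + Rk)/(2Rg), αm = 1/(1 + Rk), and Ta = Ra·T0. The function name `rd_to_low_level` and its argument names are hypothetical, and the coefficients are given as they are generally cited rather than copied from this paper's equations, so they should be checked against the original reference before use.

```python
def rd_to_low_level(rd, t0):
    """Map the high-level parameter Rd to (Oq, alpha_m, Ta).

    Assumption: regression coefficients as commonly quoted from Fant (1995);
    they are usually considered valid for Rd roughly between 0.3 and 2.7.
    rd : voice quality parameter Rd (dimensionless)
    t0 : fundamental period in seconds
    """
    # Intermediate R-parameters (dimensionless)
    ra = (-1.0 + 4.8 * rd) / 100.0
    rk = (22.4 + 11.8 * rd) / 100.0
    # Rg follows from inverting Rd = (1/0.11) (0.5 + 1.2 Rk) (Rk / (4 Rg) + Ra)
    rg = rk / (4.0 * ((0.11 * rd) / (0.5 + 1.2 * rk) - ra))

    # Low-level parameters
    oq = (1.0 + rk) / (2.0 * rg)   # open quotient
    alpha_m = 1.0 / (1.0 + rk)     # asymmetry coefficient
    ta = ra * t0                   # return phase time constant (s)
    return oq, alpha_m, ta


# Usage: a 100 Hz voice (T0 = 10 ms) with Rd = 1.0
oq, alpha_m, ta = rd_to_low_level(1.0, 0.01)
```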
APPENDIX B: DERIVATION OF LF
LF is defined in the time-domain by an analytic function (Fant et al., 1985). After re-parameterization with Oq and αm, Doval et al. (2006) expressed the open phase of the glottal flow derivative as
Setting the time origin at the glottal closure instant allows us to express as an anti-causal filter truncated at . This is simply done by defining ,
Also, if we note the Laplace transform of the original formulation given by Eq. (B1), then the time shift operated between and is translated as . This linear phase shift does not have any effect on the timbre of the source and is ignored in this paper.
is the open phase damping coefficient. It is set so that the airflow of a period is zero and thus also depends on the closed phase coefficient ϵ [Eq. (5)]. The latter satisfies the continuity of the open and closed phase expressions at the GCI from the implicit equation
Given the expression of the closed phase, is calculated so that the integral of the glottal flow derivative is null on a period, leading to the implicit equation
Both implicit equations are resolved numerically.
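As an example of such a numerical resolution, the sketch below solves the classic LF return-phase condition ε·Ta = 1 − exp(−ε·(1 − Oq)·T0) by bisection. This form is the one usually given for the LF model (Fant et al., 1985) and is assumed here; its notation may differ from the implicit equation of this appendix. The function name `solve_epsilon` and its arguments are hypothetical.

```python
import math


def solve_epsilon(ta, oq, t0, iterations=200):
    """Solve eps * Ta = 1 - exp(-eps * (1 - Oq) * T0) for eps > 0 by bisection.

    Assumption: classic LF return-phase condition, with the closed phase
    lasting (1 - Oq) * T0. A non-trivial root requires Ta < (1 - Oq) * T0.
    """
    d = (1.0 - oq) * t0  # duration available to the return phase (s)
    if not 0.0 < ta < d:
        raise ValueError("need 0 < Ta < (1 - Oq) * T0")

    def f(eps):
        # expm1 keeps precision when eps * d is small
        return eps * ta + math.expm1(-eps * d)

    lo = 1e-6 / ta  # f(lo) < 0 just right of the trivial root eps = 0
    hi = 1.0 / ta   # f(hi) = exp(-d / Ta) > 0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Usage: Ta = 0.38 ms, Oq = 0.65, T0 = 10 ms
eps = solve_epsilon(0.00038, 0.65, 0.01)
```

Bisection is chosen here only for robustness; a fixed-point or Newton iteration started near 1/Ta would work equally well. The open phase damping coefficient can be obtained with the same approach once its implicit equation is written as a root-finding problem.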
APPENDIX C: DERIVATION OF LFCALM
The open phase anti-causal filter is defined in the Z-domain by Doval et al. (2003) as
The associated filter coefficients are those of a second order resonant biquad filter,
where Fs is the sampling frequency and Fg, Bg, and Ag are the centre frequency, bandwidth, and amplitude of the resonance (glottal formant) and are defined as
By setting and , the time-domain impulse response of , truncated at , is given by computing the inverse Z-transform,
The closed phase causal filter is defined in the Z-domain as
and its filter coefficients are computed from the cut-off frequency ,
APPENDIX D: DERIVATION OF LFLM
LFLM is the causal version of LFCALM (Feugère et al., 2017). Therefore, the glottal formant, also defined in the Z-domain, has the following transfer function:
whose coefficients are given by Eqs. (C2) and (C3). To have a convergent filter, it is necessary that . Therefore, and . Finally, the time-domain impulse response of is
APPENDIX E: SYNTHESIS WITH LFCALM AND LFLM
The LFCALM open phase uses the anti-causal filter [Eq. (C1)]. We define a pulse train δgci whose impulses are placed on the GCIs. The pulse train is then filtered by , leading to the recursion equation
For each period, the impulse response is truncated at the previous GOI. Then the full signal is filtered by the causal spectral tilt filter HST [Eq. (C5)], leading to the recursion equation
In the case of LFLM, both glottal formant and spectral tilt filters are applied in their causal form. We define a pulse train δgoi whose impulses are placed on the GOIs. The pulse train is then filtered successively by the causal version of the glottal formant filter [Eq. (D1)], leading to the recursion equation
and the spectral tilt filter HST [Eq. (C5)], leading to the recursion equation
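The fully causal chain (pulse train, glottal formant resonator, spectral tilt low-pass) can be sketched as follows. This is only an illustration under stated assumptions: the resonator uses the textbook two-pole parameterization by centre frequency and bandwidth with unit numerator, and the spectral tilt is a generic one-pole low-pass, neither of which is guaranteed to match the coefficients of Eqs. (C2), (C3), and (C5). All function names and the numerical values in the usage line are hypothetical.

```python
import math


def resonator_coeffs(fg, bg, fs):
    """Denominator coefficients (a1, a2) of a textbook two-pole resonator.

    Assumption: poles at r * exp(+/- j * theta), with r set by the bandwidth bg
    and theta by the centre frequency fg (both in Hz).
    """
    r = math.exp(-math.pi * bg / fs)
    theta = 2.0 * math.pi * fg / fs
    return -2.0 * r * math.cos(theta), r * r


def lowpass_coeff(fc, fs):
    """Feedback coefficient of a generic one-pole low-pass with cut-off fc (Hz)."""
    return math.exp(-2.0 * math.pi * fc / fs)


def synth_causal(n_samples, f0, fs, fg, bg, fc):
    """Filter a pulse train (one impulse per period) by both causal filters."""
    period = int(round(fs / f0))
    pulses = [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

    # Glottal formant: y[n] = x[n] - a1 * y[n-1] - a2 * y[n-2]
    a1, a2 = resonator_coeffs(fg, bg, fs)
    y = [0.0] * n_samples
    for n in range(n_samples):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y[n] = pulses[n] - a1 * y1 - a2 * y2

    # Spectral tilt: z[n] = (1 - a) * y[n] + a * z[n-1]
    a = lowpass_coeff(fc, fs)
    z = [0.0] * n_samples
    for n in range(n_samples):
        z1 = z[n - 1] if n >= 1 else 0.0
        z[n] = (1.0 - a) * y[n] + a * z1
    return z


# Usage: 50 ms of signal at 16 kHz with illustrative parameter values
signal = synth_causal(800, f0=110.0, fs=16000.0, fg=100.0, bg=80.0, fc=2000.0)
```

The anti-causal variant can be obtained along the same lines by running the resonator recursion backward in time from each GCI and truncating the response at the previous GOI, as described above.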
http://cycling74.com (Last viewed 8/16/2021).
See the supplementary material https://www.scitation.org/doi/suppl/10.1121/10.0005879 for all stimuli.