Speech glottal flow has been predominantly described in the time-domain in past decades, the Liljencrants–Fant (LF) model being the most widely used in speech analysis and synthesis, despite its computational complexity. The causal/anti-causal linear model (LF_{CALM}) was later introduced as a digital filter implementation of LF, a mixed-phase spectral model including both anti-causal and causal filters to model the vocal-fold open and closed phases, respectively. To further simplify computation, a causal linear model (LF_{LM}) describes the glottal flow with a fully causal set of filters. After expressing these three models under a single analytic formulation, we assessed here their perceptual consistency, when driven by a single parameter *R*_{d} related to voice quality. All possible paired combinations of signals generated using six *R*_{d} levels for each model were presented to subjects who were asked whether the two signals in each pair differed. Model pairs LF_{LM}–LF_{CALM} were judged similar when sharing the same *R*_{d} value, and LF was considered the same as LF_{LM} and LF_{CALM} given a consistent shift in *R*_{d}. Overall, the similarity between these models encourages the use of the simpler and more computationally efficient models LF_{CALM} and LF_{LM} in speech synthesis applications.

## I. INTRODUCTION

The acoustic theory of speech production formalised by Fant (1960) assumes independence and linearity between the airflow modulated in the glottis by the vibration of the vocal folds, called glottal flow, and the resonance effect of the vocal tract that shapes the glottal flow into a speech signal. The linear acoustic theory offers a somewhat simplified view of the physics of speech production, but it is still a very effective and widely used representation of voice signals for speech processing applications (e.g., speech coding, synthesis, parameterization) and acoustic phonetics analyses. In this theory, vocal tract resonances introduce spectral formants and anti-formants (maxima and minima of the spectral envelope) that characterise speech sounds. Vocal tract formants are themselves often associated with linear filters: series or parallel branches of second order resonant sections in formant synthesisers; auto-regressive filter models in linear prediction. In early applications, the voice source component was also considered as a low-pass filter, the so-called glottal formant. The transmission line analog proposed by Fant (1960) used a four-pole model subsequently simplified in a two-pole model in linear prediction of speech by Markel and Gray (1982). Note that this glottal formant is not related to a physical resonance but describes the spectrum of the glottal pulse, modelled as the impulse response of the low-pass filter. However, glottal filter impulse responses poorly match glottal flow waveforms obtained by inverse filtering or by indirect measurements like electroglottography. 
This has led to the proposition of a multiplicity of glottal flow models (GFMs) defined in the time-domain by analytic and parametric formulations of the glottal flow waveform and its derivative: Rosenberg (1971) (Rosenberg model); Hedelin (1984), Fujisaki and Ljungqvist (1986), and Klatt and Klatt (1990) (KLGLOTT88 model); Fant *et al.* (1985) [Liljencrants–Fant (LF) model]; Veldhuis (1998) (R++ model). These widely used models adopt various mathematical functions to describe the glottal flow oscillation, yet Doval *et al.* (2006) showed that the Rosenberg, KLGLOTT88, LF, and R++ models can be grouped under one general expression that is parameterized by a common set of five parameters. Variations of these parameters are closely related to voice quality perception (e.g., breathiness, tenseness, vocal force), that strongly motivates the use of GFM in expressive speech related research. This includes analysis of emotion in speech [Gobl and Ní Chasaide (2003), Patel *et al.* (2011), and Ní Chasaide *et al.* (2013): LF model; Burkhardt and Sendlmeier (2000): KLSYNTH88 model]; analysis-resynthesis schemes for voice modification [Childers (1995), Cabral *et al.* (2014), and Degottex *et al.* (2013): LF model]; or expressive text-to-speech synthesis [Raitio *et al.* (2013), Airaksinen *et al.* (2016), and Juvela *et al.* (2019): LF model]. This list is not exhaustive; however, LF has been the most widely adopted model for analysis and synthesis of speech signals.

The main limitation of the $LF$ model is its computational complexity. It requires solving implicit equations that can only be performed with numerical approaches. This model is not suitable for applications where computational complexity is a constraint, such as real-time speech or singing synthesis. Also, spectral glottal flow models are desirable because voice quality is often described in spectral terms (e.g., voice spectral tilt, brightness, tenseness): spectral parameters are closer to perception than time-domain parameters. It is therefore interesting to investigate the apparent discrepancy between GFM like $LF$ and filter impulse-response models. Along this line, Doval *et al.* (2006) highlighted that $LF$ and the other time-domain models under study have a simple magnitude representation in the frequency-domain that can be modelled with a third order filter, as also noted by Childers and Lee (1991). This has led to the proposal of new models: the causal/anti-causal linear model ($LF$_{$CALM$}) by Doval *et al.* (2003), followed by an all-causal linear model ($LF$_{$LM$}) used in the Cantor Digitalis singing synthesiser (Feugère *et al.*, 2017), which both gradually simplify the computation of the glottal flow by using digital filters instead of analytic functions, thus enabling a precision-complexity trade-off, $LF$ being the most precise and $LF$_{$LM$} the simplest. While we will show in Sec. II that the simplification operating on $LF$_{$CALM$} and $LF$_{$LM$} can substantially modify the glottal flow waveform, it is not clear if this affects their auditory perception. The aim of this paper is threefold. Section II studies the three models $LF$, $LF$_{$CALM$}, and $LF$_{$LM$} in terms of linear filters. Formulations for impulse responses are derived, and differences between the models are investigated. After this objective and analytic comparison, subjective experiments are conducted in Sec. 
III for assessing the perceptual equivalence of the three models. Armed with analytic formulations and perceptual analyses, the discussion in Sec. IV summarises the results obtained: linear-filter formulations equivalent to the $LF$ model are able to account for both the observed glottal formant and glottal flow waveforms.

## II. LINEAR-FILTER FORMULATION OF GLOTTAL FLOW MODELS

### A. Glottal flow model parameters: $LF$ and *R*_{d}


All GFMs attempt to describe a vocal-fold vibration period in time-domain (see Fig. 1). Three phases are considered: the opening phase (lung pressure forces the vocal folds to spread, and an increasing air flow passes through the glottis); the closing phase (the elasticity of the vocal folds takes over, closing the air passage); the closed phase (the airflow is blocked). Then the lung pressure increases again, and a new opening phase follows. This cycle can be represented by five parameters (Doval *et al.*, 2006): the cycle period *T*_{0} or fundamental frequency $F_0=1/T_0$; the cycle amplitude, generally represented by *E*, the maximum of the absolute value of the glottal flow derivative (GFD) (i.e., the negative peak at the glottal closure instant has amplitude −*E*); the open quotient *O*_{q}, the ratio of the open phase duration *T*_{e} over the period *T*_{0}; the asymmetry coefficient *α*_{m}, the ratio of the opening phase duration *T*_{p} over the open phase duration; and *T*_{a}, the closing time duration (Fig. 1). Period *T*_{0} and amplitude *E* change the time and amplitude scales of the glottal flow. The three other parameters change the shape of the glottal flow and account for the voice timbre or quality. Empirically, Fant *et al.* (1994) established that the perceptual effect of the shape parameters *O*_{q}, *α*_{m}, and *T*_{a} can be gathered into a unique high-level parameter called *R*_{e} initially, *R*_{d} afterward (Fant, 1995) (see Appendix A). Typical values of *R*_{d} range from 0.4 (short open phase, strong asymmetry of the glottal flow leading to a tense voice) to 2.7 (long open phase, symmetry of the glottal flow providing a relaxed voice). Note that the time-domain NAQ-coefficient proposed later (Alku *et al.*, 2002) is proportional to *R*_{d}. *R*_{d} will be used as a control parameter below.

### B. Glottal formant, LF_{CALM}, and LF_{LM}

Radiated sound pressure outside the vocal tract can be approximated by the derivative of the speech flow measured at the lips. For this reason, the glottal flow derivative is often preferred to the glottal flow for analysis purposes. The spectrum of the glottal flow derivative shows a marked spectral peak, the glottal formant. Figure 1 displays on the right the magnitude spectrum of the glottal flow derivative computed with the $LF$ model, superimposed on a third order filter approximation. Two poles form the glottal formant, a low-frequency resonance with centre frequency *F*_{g} that is directly related to the oscillation of the open phase of the vocal folds. The remaining pole is an extra attenuation with cut-off frequency *F*_{ST}, called spectral tilt, that is responsible for the smoothness of the closed phase of the vocal folds. Phase analysis has shown that this third order approximation is a mixed-phase model (Gardner and Rao, 1997), allowing it to represent the open phase of the $LF$ model as a second order filter response (damped sinusoid) that evolves toward negative time, while the closed phase resembles the response of a first order filter (decreasing exponential) that evolves toward positive time (bottom-left of Fig. 1). Following this analysis, the causal/anti-causal linear model of the glottal flow ($LF$_{$CALM$}) has been proposed by Doval *et al.* (2003) to generate a glottal derivative waveform by filtering a pulse train with the mixed-phase third order filter. The $LF$_{$CALM$} is a simple formulation reproducing the dual relations between time-domain parameters and spectral shape (Gobl *et al.*, 2018; Henrich *et al.*, 2001). A real-time implementation of the model called RT-CALM was then derived by d'Alessandro *et al.* (2006). The mixed-phase characteristic of the glottal flow has been exploited for the estimation of the glottal flow from speech signals (Bozkurt *et al.*, 2005; Drugman *et al.*, 2011; Hézard *et al.*, 2013). The glottal formant can also be represented by causal filters, following Klatt (1980) and Holmes (1983), but at the expense of some distortion in the phase spectrum compared to the $LF$ model. A formulation of this causal linear voice source model $LF$_{$LM$} has been proposed and used for real-time voice synthesis and voice source analysis (Feugère *et al.*, 2017; McLoughlin *et al.*, 2020; Perrotin and McLoughlin, 2019, 2020). The perceptual effect of this phase difference is studied in Sec. III.

To summarise, the $LF$ model that is widely accepted as a precise time-domain GFM has been simplified by a frequency-domain representation that uses a mixed-phase third order filter called $LF$_{$CALM$}. To go further in reducing computation complexity, an all-causal linear model ($LF$_{$LM$}) has been recently formulated.

All three GFMs are defined in terms of their open and closed phases, described separately in Secs. II C and II D. For this reason, we define glottal opening instants (GOIs), which mark the beginning of each open phase and are spaced by a duration of *T*_{0}, and glottal closure instants (GCIs), which mark the beginning of each closed phase. Each GCI follows the preceding GOI by a duration of $O_q T_0$.
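The timing of these instants can be sketched for a stationary signal as follows. This is a minimal illustration, not the authors' implementation: the function name `glottal_instants` and the value `oq=0.6` are assumptions chosen for the example, while `f0=110.0` is the fundamental frequency used for the stimuli in Sec. III.

```python
def glottal_instants(f0, oq, n_periods):
    """Return (GOIs, GCIs) in seconds for a stationary voice source.

    GOIs are spaced by T0 = 1/f0; each GCI follows its GOI by Oq*T0.
    """
    t0 = 1.0 / f0
    gois = [k * t0 for k in range(n_periods)]
    gcis = [goi + oq * t0 for goi in gois]
    return gois, gcis


# Illustrative values only: Oq = 0.6 is an assumed open quotient.
gois, gcis = glottal_instants(f0=110.0, oq=0.6, n_periods=3)
```

With the time origin placed at a GCI, as done in the following sections, the corresponding GOI sits at $t=-O_q T_0$.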

### C. Modeling the open phase

#### 1. General formulation of the open phase

Let us define the impulse response of a truncated second order filter, whose generic formulation is

If *h*_{T} is anti-causal, *T* < 0 is the instant of truncation, $D=[T,0]$, and $a_n>0$. Its causal counterpart is defined for *T* > 0, $D=[0,T]$, and $a_n<0$. It can be shown that the open phase definitions of the three GFMs under study can be formulated with respect to Eq. (1) by setting appropriately the *G*_{n}, *a*_{n}, *b*_{n}, $\varphi_n$, and *T* parameters (index *n* is subsequently replaced by the name of the model in consideration: $LF$, $CALM$, or $LM$). In their original formulations, $LF$ is defined as a continuous time-domain function, while $LF$_{$CALM$} and $LF$_{$LM$} are defined as digital filters (*Z*-domain). For the sake of generalisation, all expressions are given below as equivalent continuous representations (time and Laplace domains), and derivation details from the original papers' formulations are given in Appendixes B, C, and D.
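For reference, a truncated second order impulse response that uses exactly the parameters named above (gain *G*_{n}, damping *a*_{n}, angular frequency *b*_{n}, phase $\varphi_n$, and support $D$) can be written as below. This is offered as an assumed form consistent with the parameter list and with the roles the parameters play in Table I, not as a quotation of Eq. (1).

```latex
h_T(t) =
\begin{cases}
G_n \, e^{a_n t} \sin\!\left(b_n t + \varphi_n\right), & t \in D,\\
0, & \text{otherwise.}
\end{cases}
```

Under this form, the response at $t=0$ evaluates to $G_n \sin(\varphi_n)$, and the sign of $a_n$ decides whether the envelope decays toward negative time (anti-causal case) or toward positive time (causal case).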

##### a. LF.

The $LF$ model (Fant *et al.*, 1985) is defined by an analytic function in the time-domain relative to the GOI and can be interpreted as an unstable, divergent, and truncated causal filter. However, re-parameterization with *O*_{q} and *α*_{m} and setting the time origin at the GCI (see Appendix B) allow us to express $LF$ as an anti-causal filter truncated at $T_{LF}=-O_q T_0$, matching Eq. (1). The equations below give the resulting waveform analytic expression and its Laplace transform:

One can now identify from the top equation the values of parameters $G_{LF},\,a_{LF},\,b_{LF},\,\varphi_{LF}$, and $T_{LF}$ that are summarised in Table I. $a_{LF}$ is the open phase damping coefficient. It is set so that the airflow of a period is zero and results from an implicit equation (see Appendix B).

| | $LF$ | $LF$_{$CALM$} | $LF$_{$LM$} |
|---|---|---|---|
| *b*_{n} | $\frac{\pi}{\alpha_m O_q T_0}$ | $\frac{\pi}{O_q T_0}$ | $\frac{\pi}{O_q T_0}$ |
| *a*_{n} | > 0 | $\frac{\pi}{O_q T_0\,\tan(\pi(1-\alpha_m))}$ | $-\frac{\pi}{O_q T_0\,\tan(\pi(1-\alpha_m))}$ |
| $\varphi_n$ | $\frac{\pi}{\alpha_m}$ | $\pi(1-\alpha_m)$ | $-\pi(1-\alpha_m)$ |
| *G*_{n} | $-\frac{E}{\sin(\varphi_{LF})}$ | $-\frac{E}{\sin(\varphi_{CALM})}$ | $-\frac{E}{\sin(\varphi_{LM})}$ |
| *T* | $-O_q T_0$ | $-O_q T_0$ | $\infty$ |
| Open phase | | | |
| Formulation | Analytic | Filter | Filter |
| Causality | Anti-causal | Anti-causal | Causal |
| Truncation | At $-O_q T_0$ | At $-O_q T_0$ | No truncation |
| Closed phase | | | |
| Formulation | Analytic | Filter | Filter |
| Causality | Causal | Causal | Causal |
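The filter-based columns of Table I can be evaluated numerically, as in the sketch below. It assumes the fractional readings $b_n=\pi/(O_q T_0)$, $a_n=\pm b_n/\tan(\pi(1-\alpha_m))$, $\varphi_n=\pm\pi(1-\alpha_m)$, and $G_n=-E/\sin(\varphi_n)$, together with an untruncated response of the form $G_n e^{a_n t}\sin(b_n t+\varphi_n)$. The function names and the values `oq=0.6` and `alpha_m=0.8` are hypothetical choices made for illustration; the $LF$ column is left out because $a_{LF}$ requires solving an implicit equation.

```python
import math


def open_phase_params(model, E, oq, alpha_m, t0):
    """Open-phase parameters (G, a, b, phi) for 'CALM' or 'LM'.

    Assumes b = pi/(Oq*T0), a = b/tan(phi), phi = pi*(1 - alpha_m) and
    G = -E/sin(phi), with the signs of a and phi flipped for the causal
    LM model.
    """
    b = math.pi / (oq * t0)
    phi = math.pi * (1.0 - alpha_m)
    a = b / math.tan(phi)
    if model == "LM":
        a, phi = -a, -phi
    elif model != "CALM":
        raise ValueError("model must be 'CALM' or 'LM'")
    G = -E / math.sin(phi)
    return G, a, b, phi


def open_phase(t, G, a, b, phi):
    """Untruncated second order response G * exp(a*t) * sin(b*t + phi)."""
    return G * math.exp(a * t) * math.sin(b * t + phi)


# Illustrative shape parameters (assumed values, not taken from the paper).
calm = open_phase_params("CALM", E=0.2, oq=0.6, alpha_m=0.8, t0=1.0 / 110.0)
lm = open_phase_params("LM", E=0.2, oq=0.6, alpha_m=0.8, t0=1.0 / 110.0)
```

With these assumptions, both responses equal −*E* at $t=0$, and the causal response is the time-reversed copy of the anti-causal one, that is, `open_phase(t, *lm)` matches `open_phase(-t, *calm)`.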

##### b. $LF$_{$CALM$}.

The causal/anti-causal linear model uses a second order anti-causal and truncated bandpass filter to model the open phase of the glottis (Doval *et al.*, 2003), whose equation and parameters are derived in Appendix C. The time-domain response of $LF$_{$CALM$}, truncated at $T_{CALM}=-O_q T_0$, and the frequency-domain response are given by computing the inverse *Z*-transform and Laplace transform of the filter, respectively,

##### c. $LF$_{$LM$}.

The $LF$_{$LM$} model (Feugère *et al.*, 2017) is the causal version of $LF$_{$CALM$} with the difference that the filter is not truncated, since it converges (see Appendix D). The time and frequency responses of $LF$_{$LM$}, whose parameters are given in Table I, are again given by computing the inverse *Z*-transform and Laplace transform of the filter,

#### 2. Comparison between the GFM open phases

Figure 2 displays the open phases of $LF$ (blue), $LF$_{$CALM$} (orange), and $LF$_{$LM$} (green) for the glottal flows (top-left), GFDs (bottom-left), and spectrum of the GFD (right), computed with $R_d=1.84$ and *E* = 0.2. The top-right of Fig. 2 displays similarities between the three models. First, all open phases derive from second order filters, as their respective Laplace transforms $H_{LF}^{open}$, $H_{CALM}^{open}$, and $H_{LM}^{open}$ all show a similar denominator with a pair of complex conjugate poles. This results in ±20 dB/decade asymptotes. In particular, all Laplace transforms simplify to *E*/*s* at high frequencies, resulting in similar asymptotes for the three GFMs. At low frequencies, the asymptotes are shifted between models, but only by a few dB.

$LF$ and $LF$_{$CALM$} display two more similarities. First, their anti-causality causes the GFD phase to increase (bottom-right of Fig. 2); second, they are both truncated at $t=-O_q T_0$. The thin dashed curves in the left panels show what would be non-truncated versions of $LF$ and $LF$_{$CALM$}. A direct effect of the truncation is the computation of their Laplace transform on the interval $[-O_q T_0,0]$, which results in the appearance of the term $e^{-sT}$ in $H_{LF}^{open}$ and $H_{CALM}^{open}$. This causes the ripples observed in the $LF$ and $LF$_{$CALM$} spectra. The main difference between $LF$ and $LF$_{$CALM$} is that the former is parameterized to be of class *C*^{1}, i.e., with a continuous GFD at the GOI ($-O_q T_0$). This parameterization results in a generic second order filter that is neither low-pass nor bandpass, as shown by the numerator of $H_{LF}^{open}$. A consequence is the large lobe around the resonance frequency of the GFD magnitude spectrum. Conversely, $LF$_{$CALM$} is parameterized to be a bandpass filter, which allows a reduction of the resonance's lobe width but cannot suppress it completely because of the effect of truncation. The consequence of the bandpass parameterization is a discontinuous GFD at the glottal opening instant.

Two differences between $LF$_{$CALM$} and $LF$_{$LM$} are also highlighted. The difference of causality is well-displayed by a vertical symmetry in the time-domain and a horizontal symmetry of the phase spectrum. Also, because $LF$_{$LM$} converges, it is not truncated at $O_q T_0$. This implies that each period leaks into the next one, but it also greatly simplifies the implementation. As a result, its spectrum is the exact frequency response of a bandpass filter, with no ripples and no lobe around the resonance centre frequency. Note that the vertical symmetry of the GFD between $LF$_{$CALM$} and $LF$_{$LM$} implies a sign inversion of the glottal flow, but one to which the ear is not sensitive.

### D. Modeling the closed phase

#### 1. Formulation of the closed phases

Definitions of the GFM closed phase fall within two categories (Doval *et al.*, 2006): it is either described in the time-domain by an analytic formulation, as $LF$, or defined in the frequency-domain with a first order filter, as $LF$_{$CALM$} or $LF$_{$LM$}.

##### a. LF: Analytic expression.

The closed phase of the $LF$ model, after shifting the glottal closing instant to *t* = 0, is expressed as

where *ϵ* is the closed phase coefficient. It satisfies the continuity of the open and closed phase expressions at the GCI and is obtained from an implicit equation (see Appendix B). Note that because *a*_{$LF$} is computed from *ϵ*, the shape of the open phase depends on the closed phase, although both phases are defined by distinct analytical expressions.

##### b. $LF$_{$CALM$} and $LF$_{$LM$}: Filtering.

With $LF$_{$CALM$} and $LF$_{$LM$}, the closed phase is modelled by a first order low-pass filter, called spectral tilt, that attenuates high frequencies above its cut-off frequency $F_a=1/(2\pi T_a)$ (Doval *et al.*, 2003; Feugère *et al.*, 2017). The filter formulation is given in Appendix C. In these cases, the spectral tilt filter is applied to the full signal and therefore changes the open phase shape.
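A first order low-pass of this kind can be sketched as a one-pole recursive filter. The code below is an illustration under stated assumptions rather than the Appendix C formulation: the pole $e^{-2\pi F_a/F_s}$ is an assumed impulse-invariant mapping of the cut-off frequency, the gain is normalised to unity at DC, and the name `spectral_tilt` and the value `ta=1e-4` are hypothetical.

```python
import math


def spectral_tilt(x, ta, fs):
    """Filter x with a one-pole low-pass of cut-off Fa = 1/(2*pi*Ta).

    The pole exp(-2*pi*Fa/fs) = exp(-1/(Ta*fs)) is an assumed
    impulse-invariant mapping, normalised to unity gain at DC.
    """
    fa = 1.0 / (2.0 * math.pi * ta)
    pole = math.exp(-2.0 * math.pi * fa / fs)
    y, state = [], 0.0
    for sample in x:
        state = (1.0 - pole) * sample + pole * state
        y.append(state)
    return y


# A unit impulse yields a decaying exponential that is never truncated.
response = spectral_tilt([1.0] + [0.0] * 9, ta=1e-4, fs=96000.0)
```

Because the recursion runs over the whole input, the exponential tail of one period extends into the next, which is the overlap discussed for high *R*_{d} values.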

#### 2. Comparison between the GFM closed phases

Figure 3 displays the three GFM full waveforms, obtained by adding to the open phases of Fig. 2 their respective closed phase contributions while keeping $R_d=1.84$. Note that this process changes the open phases. The top-right panel shows high similarity between the three GFMs' spectrum magnitudes. The closed phase adds a supplementary −20 dB/decade attenuation to all open phase spectra, resulting in a −40 dB/decade attenuation at high frequencies. We can also observe an increase in gain at low frequencies for the $LF$ model. This is directly linked to the change of the $a_{LF}$ parameter. As a consequence, the glottal flow and glottal flow derivative have their largest amplitudes for $LF$.

Looking at the phase spectrum (bottom-right panel), $LF$ and $LF$_{$CALM$} almost overlap, showing a similar effect of the closed phases on their respective phase spectra: it adds a supplementary $-\pi/2$ offset at high frequencies to all phase spectra of the open phases. The spectral tilt filter is displayed in black. This offset introduces an asymmetry between $LF$ and $LF$_{$CALM$} on one side and $LF$_{$LM$} on the other. The addition of $-\pi/2$ at high frequencies for all models reduces the phase of $LF$ and $LF$_{$CALM$} from $3\pi/2$ to *π* but also reduces the phase of $LF$_{$LM$} from $-3\pi/2$ to $-2\pi$. This asymmetry is reflected in the shapes of the glottal flow derivatives (bottom-left panel). One can see that $LF$_{$LM$} and $LF$_{$CALM$} are not symmetrical anymore and that the filtering attenuates the GFD peak near the glottal closure instant more for $LF$_{$LM$} than for $LF$_{$CALM$}. Finally, it is important to mention that the spectral tilt filter is not truncated for $LF$_{$CALM$} and $LF$_{$LM$}, and its application results in an infinite response that may overlap with the next period. This appears for high values of *R*_{d}, as shown in Sec. II F.

### E. Assessment of computational costs

To evaluate the computational efficiency of each GFM, we measured the average time necessary to compute one period of a 1-s stationary signal for each model. The ratio of computation time over the period duration gives the real-time factor. A real-time factor below 1 means that the signal is faster to compute than to play back, so we can listen to the signal while it is generated. Inversely, a real-time factor higher than 1 indicates that the signal takes longer to compute than to play back. This experiment was made in the condition of a fine-grain control of the GFM: parameters are calculated for each period. To assess the dependency of the real-time factor on *F*_{0} and *R*_{d}, we generated 564 stationary signals using a combination of the six *R*_{d} values described in Sec. III and 94 *F*_{0} values, from 70 to 1000 Hz with steps of 10 Hz. All signals were generated on an iMac Intel Core i9, with a 3.6 GHz processor. Figure 4 displays the real-time factors for the three GFMs depending on *F*_{0}. For each model and *F*_{0} value, we computed the mean and standard deviation of the real-time factor across the six *R*_{d} values. The means for each model are represented by the thick coloured lines, and the shading around each mean value highlights the ± standard deviation range around the mean. $LF$_{$CALM$} and $LF$_{$LM$} are more than 10–100 times faster than $LF$. This is a direct consequence of the resolution of the implicit equation for the $LF$ model, which is costly. Also, the efficiency of $LF$ decreases with higher *F*_{0} because the resolution of the implicit equation requires a constant duration. Therefore, when the period duration decreases, the real-time factor increases, and this dependency between computation efficiency and input parameter is not desirable. Finally, *R*_{d} has no effect on the computation time for any of the three GFMs.

### F. Summary of the model implementation and effect of *R*_{d}

Table I summarises the implementations of the three GFMs under study. To conclude this section, Fig. 5 shows the effect of *R*_{d} on the GFD (top row) and the respective spectra computed on a single period (second and third rows) for the three models [$LF$ (blue), $LF$_{$CALM$} (orange), and $LF$_{$LM$} (green)]. In the top row, the dashed vertical lines represent the GOIs, while the dotted lines show the GCIs. In the second and third rows, the vertical line indicates the cut-off frequency of the spectral tilt filter. Globally, *R*_{d} has a similar effect on the three GFMs. Looking at the spectrum magnitude, low values of *R*_{d} lead to higher centre frequency and bandwidth of the glottal formant and a higher spectral tilt cut-off frequency. These combined effects favour the presence of numerous harmonics that give a sharp GFD closure, close to the shape of an impulse. This is typical of a tense and loud voice, when the vocal folds open and close abruptly. Inversely, high values of *R*_{d} lower the centre frequency and bandwidth of the glottal formant as well as the spectral tilt cut-off frequency. This emphasises the first and second harmonics, leading to a more sine-like GFD shape. This corresponds to a lax/soft voice, when the vocal folds oscillate more symmetrically.

In the first column of Fig. 5, the three GFMs appear very similar for two reasons. First, a low value of *R*_{d} leads to a high attenuation coefficient *a*_{n} that allows $LF$ and $LF$_{$CALM$} to have almost horizontal tangents at GOI. The truncation thus does not introduce an abrupt change of slope on the GFD, which results in a reduction of ripples on the $LF$ and $LF$_{$CALM$} spectra. Second, the effect of spectral tilt that introduces an asymmetry between $LF$_{$CALM$} and $LF$_{$LM$} is small (high cut-off frequency), leading to almost symmetrical $LF$_{$CALM$} and $LF$_{$LM$} GFDs. Inversely, the three GFM shapes diverge with increasing values of *R*_{d}. Truncation has stronger effects on $LF$ and $LF$_{$CALM$}, increasing ripples in their spectrum, and the spectral tilt, whose cut-off frequency is close to the glottal formant position, has a strong effect on the GFD shapes. In particular, one can note that the minimum values of $LF$_{$CALM$} and $LF$_{$LM$} diverge from −*E* when *R*_{d} increases. Moreover, the last column illustrates well the effect of the absence of truncation of the spectral tilt filter on $LF$_{$CALM$} and $LF$_{$LM$}. The GFD computed for one period overlaps with the next one, leading to a negative (respectively, positive) value of the GFD at the GOI for $LF$_{$CALM$} (respectively, $LF$_{$LM$}).

We have shown that the difference of construction between the three GFMs (formulation, causality, truncation) leads to clear visible differences in the GFD waveforms and spectra. However, their effect on auditory perception is unclear and is assessed in Sec. III.

## III. PERCEPTUAL COMPARISON OF VOICE SOURCE MODELS

### A. Experiment

#### 1. Protocol and task

The aim of the experiment was to assess any perceptual difference between the three GFMs for different values of the *R*_{d} parameter. We used for this purpose a two-alternative forced-choice (2AFC) protocol (Kingdom and Prins, 2016), where each subject's task was to listen to paired sounds and to say if they were the same or different, with respect to any distinctive features, whatever their nature (e.g., timbre, level, pitch). The experiment was divided into three blocks. The first block used synthesised sounds from the GFMs only. The second and third blocks used additional /a/ and /i/ vocal tract models convolved with the GFMs. These two vowels were chosen for their lowest (/i/) and highest (/a/) first formant frequency in order to test a more natural vocal sound than the GFM alone.

For each GFM and following Degottex *et al.* (2013), six values of *R*_{d} were chosen equally spaced on a logarithmic scale, leading to three GFMs × 6 *R*_{d} = 18 stimuli per block (those displayed in Fig. 5). Then for each block, every combination of pairs of different GFMs was tested ($LF$_{$LM$} × $LF$_{$CALM$}; $LF$_{$LM$} × $LF$; $LF$_{$CALM$} × $LF$; $LF$_{$CALM$} × $LF$_{$LM$}; $LF$ × $LF$_{$LM$}; $LF$ × $LF$_{$CALM$}). Finally, 3 vowels × 6 pair combinations × 6 *R*_{d} values for the first element of the pair × 6 *R*_{d} values for the second element of the pair led to a total of 648 pairs of stimuli to compare.

A computer interface was specially designed for this experiment and programmed in max 6.^{1} The protocol was identical for all the paired stimuli. To proceed, the subject clicked a button, which launched the playback of two sounds, A and B, separated by 500 ms. The test sounds were ordered randomly and played for each subject only once to keep sessions as short as possible and identical among subjects. The subject had to choose whether the two sounds were identical or different, without any other choice. Each block lasted approximately 10 min, and subjects were especially encouraged to stop and rest between the three blocks with a message displayed automatically. The entire experiment took place in an acoustically insulated and treated room designed for perceptual experiments. Sound was played using a Focusrite (High Wycombe, UK) Scarlett 2i2 audio interface on a Mac OSX and AKG (Los Angeles, CA) K271 headphones. Before the experiment, subjects were trained with a subset of the sound-pair list (GFM convolved with the /a/ vocal tract or without vocal tract, three *R*_{d} values spread over the full range of possible values).
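The stimulus design described above (six log-spaced *R*_{d} values, ordered pairs of different models, three blocks) can be enumerated as a quick consistency check. The endpoints 0.4 and 2.7 are an assumption borrowed from the typical *R*_{d} range quoted in Sec. II A; the experiment's exact values are not restated here, and the variable names are illustrative.

```python
import math
from itertools import permutations, product

# Assumed endpoints: the typical Rd range 0.4 to 2.7 quoted in Sec. II A.
RD_MIN, RD_MAX, N_RD = 0.4, 2.7, 6
ratio = (RD_MAX / RD_MIN) ** (1.0 / (N_RD - 1))
rd_values = [RD_MIN * ratio**k for k in range(N_RD)]

models = ["LF", "LF_CALM", "LF_LM"]
blocks = ["source only", "/a/", "/i/"]

# Ordered pairs of different models: 3 * 2 = 6 combinations.
model_pairs = list(permutations(models, 2))

# One trial per block, ordered model pair, and (Rd first, Rd second) couple.
trials = list(product(blocks, model_pairs, rd_values, rd_values))
print(len(trials))
```

The enumeration yields 3 × 6 × 6 × 6 = 648 trials, matching the total reported in the protocol. Under the assumed endpoints, the fifth value is close to the $R_d=1.84$ used for Figs. 2 and 3, which is consistent with, but does not confirm, this choice of range.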

A group of 18 subjects took part in the experiment (median age of 28 years, from 21 to 54 years old). Among them, 12 subjects worked in the field of sound technologies, and six others had a regular musical practice. An audiogram test was performed for each of the subjects, and none of them reported any known auditory impairment except one who was single-side deaf, but stereo listening was not needed to perform the task. Fourteen subjects were members of the laboratory and participated in the experiment on a voluntary basis without being paid. The four remaining subjects were paid for the experiment.

#### 2. Stimuli specification

Stimuli were synthesised at a sampling rate of *F*_{s} = 96 kHz. A constant fundamental frequency of $F_0=110$ Hz and a peak amplitude *E* = 0.2 were chosen. The $LF$ GFDs were generated by using the analytic formulations of Eqs. (2) and (5) and by solving the implicit Eqs. (B3) and (B4). The $LF$_{$CALM$} and $LF$_{$LM$} GFDs were generated by filtering a pulse train with their respective open and closed phase filters (Appendix E). All signals lasted 0.3 s, a duration longer than a standard spoken syllable but short enough to facilitate recall of the two stimuli for comparison. Fade-in and fade-out amplitudes were applied using half Hanning windows of length $10T_0=0.09\,\mathrm{s}$. Vowels were invariant in time and were applied by filtering the GFM with a bank of five parallel resonant filters corresponding to vowels /i/ and /a/, whose transfer functions are given in Feugère *et al.* (2017). Finally, all stimuli were normalised in dBA.^{2}
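The fade-in and fade-out step can be sketched as an amplitude envelope built from the two halves of a Hann window. This is an assumed implementation for illustration only: the function name `fade_envelope` is hypothetical, and the fade length is taken as ten periods at 110 Hz (about 0.09 s), rounded to the nearest sample.

```python
import math

FS = 96000          # sampling rate (Hz)
F0 = 110.0          # fundamental frequency (Hz)
DURATION = 0.3      # stimulus duration (s)
FADE = 10.0 / F0    # fade length: ten periods, about 0.09 s


def fade_envelope(n_samples, n_fade):
    """Amplitude envelope with half-Hann fade-in and fade-out."""
    env = [1.0] * n_samples
    for i in range(n_fade):
        # Rising half of a Hann window, from 0 towards 1.
        w = 0.5 * (1.0 - math.cos(math.pi * i / n_fade))
        env[i] = w
        env[n_samples - 1 - i] = w
    return env


n_samples = round(DURATION * FS)
n_fade = round(FADE * FS)
envelope = fade_envelope(n_samples, n_fade)
```

Multiplying a synthesised GFD sample by sample with `envelope` leaves the central part of the stimulus untouched and brings both ends smoothly to zero.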

### B. Results

Results report the proportion of pairs that were judged similar depending on the factors in consideration. In particular, we factorised the six different model pairs into two factors: the *Model* factor (three levels: $LF$ × $LF$_{$CALM$}; $LF$_{$LM$} × $LF$; $LF$_{$CALM$} × $LF$_{$LM$}) and the *Order* factor that codes the order of presentation of each pair (two levels). The additional factors are *Vowel* (three levels: source only; /a/; /i/) and *R _{d}* (36 levels for all combinations of the six selected values). In the following, we used a single generalised linear model following a binomial distribution to assess the significance of each factor and their interactions for the perception results. The obtained model was subsequently simplified by iteratively removing non-significant interactions between factors provided that, at each simplification step, the current and the simplified models do not significantly differ (*p* > 0.05) (Crawley, 2013). *Post hoc* Pearson's chi-squared tests were run to assess whether proportions obtained for single conditions significantly differ from chance.
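To make the *post hoc* test concrete, here is a minimal sketch of a Pearson chi-squared test of one proportion against chance. The 50% chance level, the function name `chi2_vs_chance`, and the response counts are all assumptions chosen for illustration; they are not taken from the experiment.

```python
import math

def chi2_vs_chance(n_similar, n_total, p_chance=0.5):
    """Pearson chi-squared statistic (1 df) and p-value against a fixed chance level."""
    expected_similar = n_total * p_chance
    expected_different = n_total * (1.0 - p_chance)
    n_different = n_total - n_similar
    chi2 = ((n_similar - expected_similar) ** 2 / expected_similar
            + (n_different - expected_different) ** 2 / expected_different)
    # With one degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical condition: 30 "similar" answers out of 36 responses.
chi2, p = chi2_vs_chance(30, 36)
```

A condition whose `p` stays above 0.05 would fall in the range described as not significantly different from chance.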

Figure 6 shows perceptual experiment responses for all factors and interactions except *Order* (results for both presentation orders of each pair are merged). The top-left panel shows results relative to the *R _{d}* factor only. Each square corresponds to the proportion of pairs judged similar for a given couple of *R _{d}* values, all *models* and *Vowels* combined. Pairs in black and white were judged similar by 100% and 0% of the subjects, respectively. Scores that fall within the red rectangle on the colour bar do not significantly differ from chance according to the *post hoc* Pearson's chi-squared tests (*p* > 0.05). On the left-hand side of the figure, the top row (columns 2–4) shows the *model* × *R _{d}* interaction, with all levels of *Vowel* and *Order* combined; the left column (rows 2–4) shows the *Vowel* × *R _{d}* interaction, with all levels of *model* and *Order* combined; the remaining panels show the *Vowel* × *model* × *R _{d}* interaction for each level of *Vowel* and *model*, indicated in the top and left margins of the figure. Panels with yellow and green contours are replicated on the right side, with $LF$ put in the abscissa. On top (respectively, bottom) for each *R _{d}*($LF$) value (each column), the distribution of perceived similar *R _{d}*($LF$_{$CALM$}) [respectively, *R _{d}*($LF$_{$LM$})] values was obtained and superimposed on the figure, the circles being the medians and the error bars corresponding to 90% of the values around the median. Smaller circles indicate scores below the level of significance (Pearson's chi-squared test). The shaded area links all error bars and represents the space of perceptual equivalence between *R _{d}*($LF$) and *R _{d}*($LF$_{$CALM$}) [respectively, *R _{d}*($LF$_{$LM$})].

#### 1. Effect of R_{d} and order

The *R _{d}* factor has the strongest effect on results [*R _{d}*: $\chi^2 = 3620$, degrees of freedom (df) = 35, *p* < 0.001]. The top-left panel of Fig. 6 clearly shows that, over all other factors, pairs with similar values of *R _{d}* are strongly perceived as similar, and vice versa. This confirms that *R _{d}* has a strong perceptual effect on the synthesis of glottal flow. Presentation order had no influence on similarity judgment (*Order*: $\chi^2 = 0$, df = 1, *p* = 0.90). Therefore, all results displayed in Fig. 6 and detailed below combine the scores of both presentation orders.

#### 2. Effect of model

The *model* factor alone has a small and marginally significant effect on subjects' scores (*model*: $\chi^2 = 8.8$, df = 2, *p* = 0.012), which suggests that the three models are perceptually close to each other. $LF$_{$CALM$} and $LF$_{$LM$} are judged the most similar models, and $LF$ and $LF$_{$LM$} are judged the least similar when all answers are averaged. The subjects' perception seems to reflect the differences between models' construction that are summarised in Table I. $LF$_{$CALM$} and $LF$_{$LM$} derive from the same filtering process, with the only difference being the causality of the open phase and the truncation of $LF$_{$CALM$}. Conversely, $LF$ and $LF$_{$LM$} differ at almost every point of Table I. While these results average all possible *R _{d}* pairs, results depending on *R _{d}* follow the significant two-way interaction between *model* and *R _{d}* (*model* × *R _{d}*, $\chi^2 = 486$, df = 70, *p* < 0.001). Corresponding results are shown in the top row of the left side of Fig. 6 (columns 2–4). The first observation is that stimuli with similar values of *R _{d}* are judged extremely similar (close to 100% similarity), while stimuli with different values of *R _{d}* are judged different (0% similarity). One can then note a diagonal asymmetry in the $LF$ × $LF$_{$CALM$} and $LF$_{$LM$} × $LF$ panels for *R _{d}* values higher than 0.86, i.e., when the models start to differ the most (Fig. 5). In particular, subjects judged $LF$ and $LF$_{$CALM$} similar mostly when *R _{d}*($LF$_{$CALM$}) was greater than or equal to *R _{d}*($LF$). Similarly, $LF$ and $LF$_{$LM$} were mostly judged similar when *R _{d}*($LF$_{$LM$}) was greater than or equal to *R _{d}*($LF$). Conversely, $LF$_{$CALM$} and $LF$_{$LM$} were judged the most similar when they shared the same *R _{d}* value, picturing more symmetric results (top-right panel of the left side of Fig. 6).

The right side of Fig. 6 summarises this asymmetry between $LF$ and the other models. Recall that these panels are replicates of the ones with yellow and green contours from the left-hand side, but with $LF$ put in the abscissa for both plots. For each *R _{d}*($LF$), medians of corresponding distributions of perceived similar *R _{d}*($LF$_{$CALM$}) [respectively, *R _{d}*($LF$_{$LM$})] are all on or above the diagonal. Also, the spread of each distribution represented by the error bars (90% of the values around the median) and emphasised by the shaded areas clearly displays asymmetrical spaces of perceptual equivalence between *R _{d}*($LF$) and *R _{d}*($LF$_{$CALM$}) [respectively, *R _{d}*($LF$_{$LM$})] that are again above the diagonal, with *R _{d}*($LF$_{$CALM$}) [respectively, *R _{d}*($LF$_{$LM$})] mostly equal to or greater than the corresponding *R _{d}*($LF$).

#### 3. Effect of vowel

The effect of vowels (*Vowel*: $\chi^2 = 17.5$, df = 2, *p* < 0.001) supports that GFDs presented alone were significantly judged less similar than when they were passed through a vowel, the vowel /i/ giving the highest similarity results. Therefore, the introduction of resonances in the signal mitigates the perception of the glottal source timbre. Moreover, the glottal formant *F _{g}* evolves within the range [64, 121] Hz for the chosen values of *R _{d}* for all models. The vowel /i/, having its first formant resonance the closest to *F _{g}*, could mask the effect of *R _{d}* variation, leading to sources judged more similar with /i/ rather than vowel /a/.

Also, a significant two-way interaction with *R _{d}* is present (*Vowel* × *R _{d}*: $\chi^2 = 302$, df = 70, *p* < 0.001) as shown in the left column of Fig. 6, rows 2–4. Stimuli presented with the source only show similarity concentrated around the diagonal. When presented with the vowel /i/, the similarity spreads across adjacent *R _{d}* values for high *R _{d}*. This corresponds to *F _{g}* and *F _{ST}* values that are around 100 Hz, close to the first formant frequency of vowel /i/ (215 Hz). Conversely, for the /a/ vowel, it seems that stimuli with high *R _{d}* value were neither clearly perceived as similar nor dissimilar. In this case, the first formant frequency (700 Hz) is far above the *F _{g}* and *F _{ST}* ranges. A possibility is that subjects either focused on the low or high frequency parts of the signal, the former hearing the source differences and the latter focusing on the /a/ resonance.

#### 4. Remaining interactions

No significant three-way interaction between *Vowel*, *model*, and *R _{d}* was detected. It can be seen in Fig. 6 that the trend previously observed in the top row and left column (two-way interactions) applies to the remaining plots. Statistical analysis did not reveal a significant *Vowel* × *model* interaction, showing that the perception of differences between models is relatively independent from the addition of a vocal tract. Although it would be necessary to cover a larger number of vocal tract configurations, this finding encourages the hypothesis that the choice of the glottal flow can be made independently from the behavior of the vocal tract. Finally, two-way interactions *Order* × *R _{d}* and *Order* × *model* result from the asymmetry of the *model* levels ($\chi^2 = 98$, df = 35, *p* < 0.001; $\chi^2 = 7.6$, df = 2, *p* = 0.022, respectively). The top row of Fig. 6 showed an asymmetry between $LF$ and $LF$_{$CALM$} and between $LF$_{$LM$} and $LF$. When considering the order of presentation as a factor, e.g., distinguishing $LF$ × $LF$_{$CALM$} vs $LF$_{$CALM$} × $LF$, the asymmetry of $LF$_{$CALM$} × $LF$ results is reversed compared to $LF$ × $LF$_{$CALM$}, hence the two-way interaction.

## IV. DISCUSSION AND CONCLUSION

In this study, the $LF$ model is reformulated in terms of linear filters. This formulation reconciles the apparent discrepancy between time-domain GFM and spectral voice source models. It allows for quantitative spectral interpretation of the $LF$ model parameters because the correspondence between time-domain and spectral parameters can be analytically computed. This unifies Fant's views on the voice source: the key point is the interpretation of the $LF$ GFM [in Fant *et al.* (1985)] as a mixed phase system and not as a simple resonant filter [as in Fant (1960)]. The joint variation of the waveform and glottal formant as a function of *R _{d}* can be computed for voice quality analysis and synthesis. As a rule of thumb, increasing *R _{d}* corresponds to lowering the glottal formant centre frequency (often referred to as the “voicing bar” in wideband spectrogram reading) and increasing the spectral tilt toward lower frequencies (the right-hand “skirt” of the glottal formant).

Following the proposal of glottal flow models that attempt to reduce the computational complexity of $LF$, namely $LF$_{$CALM$} and $LF$_{$LM$}, we sought to assess the perceptual consistency of these models. We first showed that even though $LF$ is defined from an analytic expression and $LF$_{$CALM$} and $LF$_{$LM$} from digital filters, they can all be expressed by the same analytic function, with their own set of parameters. In terms of construction, $LF$ and $LF$_{$CALM$} have anti-causal and truncated open phases, while $LF$_{$LM$} has a causal and non-truncated open phase. The three GFM closed phases are causal.

Perceptual pairwise-comparison of these models parameterized with various levels of *R _{d}* using a same-different forced-choice paradigm on short stationary signals shows that all models are perceived similarly, in that they share the same *R _{d}* parameterization with a possible offset. In particular, $LF$_{$LM$} and $LF$_{$CALM$} are perceived similarly with the same *R _{d}*, while $LF$ is perceived similarly as $LF$_{$CALM$} and $LF$_{$LM$} when $LF$ has a smaller *R _{d}* value. Investigation seems to show that this shift in perception relates more to the truncation of the glottal flow open phase than to a difference of causality. Nevertheless, this needs to be confirmed in further experiments. Finally, we showed that the addition of vocal tract effect with low vocalic formants increases the perception of similar waveforms when *R _{d}* varies slightly between two waveforms. If the high dissimilarity between waveforms (Fig. 3) has favoured the use of $LF$ for precise analysis of the glottal flow (i.e., time-domain analyses), the perceptual consistency between models encourages the use of $LF$_{$CALM$} and $LF$_{$LM$} as simpler models than $LF$ for speech synthesis applications and for spectral analyses of the voice source and voice quality.

## ACKNOWLEDGMENTS

Part of this work has been done in the framework of the Agence Nationale de la Recherche, through the ChaNTeR and GEPETO Projects (ANR-13-CORD-0011, 2014–2017, ANR-19-CE28-0018, 2019–2023) and “Investissements d'avenir” programs ANR-15-IDEX-02 and ANR-11-LABX-0025-01. The authors are indebted to Professor Boris Doval for his help in the development of the model calculations.

### APPENDIX A: HIGH- TO LOW-LEVEL GLOTTAL PARAMETERS

Fant (1995) derived a unique high-level parameter *R _{d}* to control all low-level parameters *O _{q}*, *α _{m}*, and *T _{a}*. He first defined intermediate parameters *R _{a}*, *R _{k}*, and *R _{g}* from which are derived the low-level parameters,
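A minimal sketch of this two-step mapping is given below. It assumes the regression formulas usually attributed to Fant (1995) for *R _{a}*, *R _{k}*, and *R _{g}*, together with the standard LF timing definitions for *O _{q}*, *α _{m}*, and *T _{a}*; the appendix's own equations are not reproduced in this text and may be written differently. The function name `rd_to_low_level` is introduced here for illustration only.

```python
def rd_to_low_level(rd, t0):
    """Map R_d and the period T0 (s) to (O_q, alpha_m, T_a).

    Assumed formulas (commonly cited from Fant, 1995); they are usually
    considered valid only over a limited R_d range of roughly 0.3 to 2.7.
    """
    # Intermediate parameters, predicted from R_d.
    ra = (-1.0 + 4.8 * rd) / 100.0
    rk = (22.4 + 11.8 * rd) / 100.0
    rg = rk / (4.0 * (0.11 * rd / (0.5 + 1.2 * rk) - ra))
    # Low-level parameters from the LF timing definitions.
    oq = (1.0 + rk) / (2.0 * rg)      # open quotient
    alpha_m = 1.0 / (1.0 + rk)        # asymmetry coefficient
    ta = ra * t0                      # return-phase time constant (s)
    return oq, alpha_m, ta

# Example: a modal-like setting at F0 = 110 Hz.
oq, alpha_m, ta = rd_to_low_level(1.0, 1.0 / 110.0)
```

Under these assumptions, *R _{d}* = 1 gives an open quotient close to 0.65 and an asymmetry coefficient close to 0.75.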

### APPENDIX B: DERIVATION OF LF

$LF$ is defined in the time-domain by an analytic function (Fant *et al.*, 1985). After re-parameterization with *O _{q}* and *α _{m}*, Doval *et al.* (2006) expressed the open phase of the glottal flow derivative as

Setting the time origin at the glottal closure instant allows us to express $LF$ as an anti-causal filter truncated at $T_{LF} = -O_q T_0$. This is simply done by defining $h_{LF}^{open}(t) = x_{LF}^{open}(t + O_q T_0)$,

Also, if $X_{LF}^{open}$ denotes the Laplace transform of the original formulation given by Eq. (B1), then the time shift between $h_{LF}^{open}$ and $x_{LF}^{open}$ translates to $X_{LF}^{open}(s) = H_{LF}^{open}(s)\,e^{-sO_qT_0}$. This linear phase shift does not have any effect on the timbre of the source and is ignored in this paper.
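The absence of any effect on timbre can be verified in one step. The short derivation below assumes the bilateral Laplace transform and writes $\tau = O_q T_0$, a shorthand introduced here for readability:

$$
H_{LF}^{open}(s) = \int_{-\infty}^{\infty} x_{LF}^{open}(t + \tau)\, e^{-st}\, dt = e^{s\tau} X_{LF}^{open}(s)
\quad\Longrightarrow\quad
X_{LF}^{open}(s) = H_{LF}^{open}(s)\, e^{-s\tau}.
$$

Evaluated on the frequency axis $s = j\omega$, the factor $e^{-j\omega\tau}$ has unit modulus, so that

$$
\left| X_{LF}^{open}(j\omega) \right| = \left| H_{LF}^{open}(j\omega) \right|,
\qquad
\arg X_{LF}^{open}(j\omega) = \arg H_{LF}^{open}(j\omega) - \omega\tau .
$$

The magnitude spectrum is therefore unchanged, and the only difference is a phase term linear in $\omega$, i.e., a pure delay of $\tau$.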

$a_{LF}$ is the open phase damping coefficient. It is set so that the airflow of a period is zero and thus also depends on the closed phase coefficient *ϵ* [Eq. (5)]. The latter satisfies the continuity of the open and closed phase expressions at the GCI from the implicit equation

Given the expression of the closed phase, $a_{LF}$ is calculated so that the integral of the glottal flow derivative is null on a period, leading to the implicit equation

Both implicit equations are solved numerically.
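As an illustration of such a numerical solution, the sketch below finds the closed phase coefficient by bisection. It assumes the classic LF return-phase condition of Fant *et al.* (1985), $\epsilon T_a = 1 - e^{-\epsilon T_b}$ with $T_b$ the duration of the return phase, which may differ in form from Eq. (B3); the names `solve_epsilon` and `tb` are introduced here, and bisection is only one of several suitable root finders.

```python
import math

def solve_epsilon(ta, tb, tol=1e-12, max_iter=200):
    """Solve eps * ta = 1 - exp(-eps * tb) for the non-trivial root eps > 0.

    ta: return-phase time constant (s); tb: return-phase duration (s).
    A positive root exists only when 0 < ta < tb.
    """
    if not 0.0 < ta < tb:
        raise ValueError("need 0 < ta < tb")

    def f(eps):
        return eps * ta - 1.0 + math.exp(-eps * tb)

    # Bracket the root: f(lo) < 0 follows from exp(-x) <= 1 - x + x^2 / 2,
    # and f(hi) = exp(-tb / ta) > 0.
    lo = (tb - ta) / (tb * tb)
    hi = 1.0 / ta
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
        if hi - lo <= tol * hi:
            break
    return 0.5 * (lo + hi)

# Hypothetical values: T0 = 1/110 s, T_a = 0.038 T0, return phase of 0.35 T0.
ta = 0.038 / 110.0
tb = 0.35 / 110.0
eps = solve_epsilon(ta, tb)
```

The same bracketing approach could be applied to the second implicit equation once $\epsilon$ is known, provided a sign change of its residual can be located.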

### APPENDIX C: DERIVATION OF LF_{CALM}

The $LF$_{$CALM$} open phase anti-causal filter is defined in the *Z*-domain by Doval *et al.* (2003) as

The associated filter coefficients are those of a second order resonant biquad filter,

where *F _{s}* is the sampling frequency and *F _{g}*, *B _{g}*, and *A _{g}* are the centre frequency, bandwidth, and amplitude of the resonance (glottal formant) and are defined as

By setting $a_{CALM} = \pi B_g$ and $b_{CALM} = 2\pi F_g$, the time-domain impulse response of $LF$_{$CALM$}, truncated at $T_{CA} = -O_q T_0$, is given by computing the inverse *Z*-transform,

The $LF$_{$CALM$} closed phase causal filter is defined in the *Z*-domain as

and its filter coefficients are computed from the cut-off frequency $F_a = 1/(2\pi T_a)$,

### APPENDIX D: DERIVATION OF LF_{LM}

$LF$_{$LM$} is the causal version of $LF$_{$CALM$} (Feugère *et al.*, 2017). Therefore, the glottal formant, also defined in the *Z*-domain, has the following transfer function:

whose coefficients are given by Eqs. (C2) and (C3). To have a convergent filter, it is necessary that $a_{LM} < 0$. Therefore, $a_{LM} = -\pi B_g$ and $b_{LM} = 2\pi F_g$. Finally, the time-domain impulse response of $LF$_{$LM$} is

### APPENDIX E: SYNTHESIS WITH LF_{CALM} and LF_{LM}

$LF$_{$CALM$} open phase uses the anti-causal filter $H_{CALM}^{open}$ [Eq. (C1)]. We define a pulse train *δ _{gci}* whose impulses are placed on the GCIs. The pulse train is then filtered by $H_{CALM}^{open}$, leading to the recursion equation

For each period, the impulse response is truncated at the previous GOI. Then the full signal is filtered by the causal spectral tilt filter *H _{ST}* [Eq. (C5)], leading to the recursion equation

In the case of $LF$_{$LM$}, both glottal formant and spectral tilt filters are applied in their causal form. We define a pulse train *δ _{goi}* whose impulses are placed on the GOIs. The pulse train is then filtered successively by the causal version of the glottal formant filter $H_{LM}^{open}$ [Eq. (D1)], leading to the recursion equation

and the spectral tilt filter *H _{ST}* [Eq. (C5)], leading to the recursion equation
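The difference between the causal and anti-causal recursions can be sketched as follows. The example uses a generic impulse-invariant two-pole resonator as a stand-in for the glottal formant filter; its coefficients are an assumption and are not those of Eq. (C2). The anti-causal filter is obtained by running the causal recursion on the time-reversed signal. The per-period truncation at the GOI and the spectral tilt filter are omitted, and the values of `FG` and `BG` are hypothetical.

```python
import math

def resonator_coeffs(fg, bg, fs):
    """Two-pole resonator (assumed form): b1 z^-1 / (1 + a1 z^-1 + a2 z^-2)."""
    r = math.exp(-math.pi * bg / fs)        # pole radius set by the bandwidth
    theta = 2.0 * math.pi * fg / fs         # pole angle set by the centre frequency
    return r * math.sin(theta), -2.0 * r * math.cos(theta), r * r

def causal_filter(x, b1, a1, a2):
    """Recursion y[n] = b1 x[n-1] - a1 y[n-1] - a2 y[n-2]."""
    y = [0.0] * len(x)
    for n in range(len(x)):
        x1 = x[n - 1] if n >= 1 else 0.0
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y[n] = b1 * x1 - a1 * y1 - a2 * y2
    return y

def anticausal_filter(x, b1, a1, a2):
    """Same recursion run backward in time: reverse, filter, reverse."""
    return causal_filter(x[::-1], b1, a1, a2)[::-1]

def pulse_train(n_samples, period, first):
    """Unit impulses every `period` samples, starting at index `first`."""
    x = [0.0] * n_samples
    for k in range(first, n_samples, period):
        x[k] = 1.0
    return x

FS, F0 = 96_000.0, 110.0
FG, BG = 100.0, 50.0                        # hypothetical glottal formant values
b1, a1, a2 = resonator_coeffs(FG, BG, FS)
period = round(FS / F0)

# Impulses on the closure instants feed the anti-causal filter, so each
# response precedes its impulse; impulses on the opening instants feed the
# causal filter, so each response follows its impulse.
gci_train = pulse_train(3 * period, period, period - 1)
goi_train = pulse_train(3 * period, period, 0)
open_phase_anticausal = anticausal_filter(gci_train, b1, a1, a2)
open_phase_causal = causal_filter(goi_train, b1, a1, a2)
```

Because the responses are not truncated here, successive periods overlap; a closer approximation of the synthesis described above would zero each anti-causal response before the preceding GOI.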

^{1} http://cycling74.com (Last viewed 8/16/2021).

^{2} See the supplementary material https://www.scitation.org/doi/suppl/10.1121/10.0005879 for all stimuli.