This study investigated how the bandwidths of resonances simulated by transmission-line models of the vocal tract compare to bandwidths measured from physical three-dimensional printed vowel resonators. Three types of physical resonators were examined: models with realistic vocal tract shapes based on Magnetic Resonance Imaging (MRI) data, straight axisymmetric tubes with varying cross-sectional areas, and two-tube approximations of the vocal tract with notched lips. All physical models had hard walls and a closed glottis, so the main loss mechanisms contributing to the bandwidths were sound radiation, viscosity, and heat conduction. These losses were accordingly included in the simulations, in two variants: a coarse approximation of the losses with frequency-independent lumped elements, and a detailed, theoretically more precise loss model. Across the examined frequency range from 0 to 5 kHz, the resonance bandwidths increased systematically from the simulations with the coarse loss model to the simulations with the detailed loss model, to the tube-shaped physical resonators, and to the MRI-based resonators. This indicates that the simulated losses, especially the commonly used approximations, underestimate the real losses in physical resonators. Hence, more realistic acoustic simulations of the vocal tract require improved models for viscous and radiation losses.

Acoustic simulations of the vocal tract are widely used in speech research and for articulatory speech synthesis. There are one-dimensional (1D), two-dimensional (2D), and three-dimensional (3D) simulation methods. Although the 3D methods can be very accurate up to high frequencies, they are too computationally expensive for many applications (Arnela et al., 2019; Blandin et al., 2022; Speed et al., 2013; Takemoto et al., 2010; Vampola et al., 2008). Therefore, when fast simulations are needed, 1D methods based on the assumption of plane wave propagation in the vocal tract are the preferred choice. The plane wave assumption holds up to frequencies of about 5 kHz, which is sufficient for most applications. A widely used 1D simulation method is based on the transmission-line (TL) circuit model (Birkholz and Drechsel, 2021; Elie and Laprie, 2016; Flanagan et al., 1975; Maeda, 1982), which will also be used in this study. Its main advantage compared to other 1D simulation models like the Kelly-Lochbaum model (Kelly and Lochbaum, 1962; Liljencrants, 1985) is that it poses no restrictions with regard to the lengths of the tube sections by which the vocal tract is represented.

For an acoustic simulation to be as realistic as possible, it is important that the power losses are adequately modeled because they determine the bandwidths of the acoustic resonances. The main losses in the vocal tract arise from viscous friction at the tube wall, heat conduction at the tube wall, sound radiation from the mouth and nostrils, sound absorption by the soft vocal tract walls, and imperfect closure at the glottis. The equations that are commonly used to model these losses in TL models of the vocal tract (see Sec. II C) are physically well-founded (Flanagan, 1965). Based on these equations, the contributions of the different loss mechanisms to the bandwidths of the resonances can be determined, as illustrated in Fig. 1. However, the equations to model the losses rely on certain simplifying assumptions. For example, the heat conduction loss and the viscous loss were derived for a tube with smooth and hard walls, which does not apply to the vocal tract. Furthermore, the radiation loss is commonly modeled in terms of the radiation impedance of a piston (mouth opening) in a sphere (the head), which approaches that of a piston in an infinite, plane baffle, when the radius of the piston becomes small compared with that of the sphere (Birkholz, 2005; Flanagan, 1965). Again, given the horn-like shape of the human lips, this is also a significant simplification.

FIG. 1.

Variation of formant bandwidth with formant frequency as simulated by a transmission-line model of the vocal tract according to Flanagan (1975). The diagram shows the relative contributions of different loss mechanisms to the total formant bandwidth.

Another difficulty is the frequency-dependence of the equations for the losses due to heat conduction, viscosity, and radiation. In time-domain simulations of the TL model, the frequency-dependence cannot be modeled by the lumped circuit elements, so further approximations are required. Here, a common strategy is to model the viscous loss using the (frequency-independent) Hagen–Poiseuille equation (Birkholz, 2005; Elie and Laprie, 2016; Maeda, 1982). However, this equation only holds for a stationary laminar flow, and significantly underestimates the viscous losses for sound waves. For the radiation loss, Flanagan (1965) proposed an approximation of the radiation impedance of a piston in an infinite plane baffle in terms of a parallel circuit of two frequency-independent circuit elements. However, as will be shown in Sec. II C, this approximation also slightly underestimates the loss.

The present study investigated the extent to which simulated losses based on the aforementioned assumptions and approximations match the losses of physical models of the vocal tract with respect to the bandwidths of the resonances. The physical models were made of hard plastic, and their resonances were measured with a closed glottis. Thus, the effective loss mechanisms in these models were viscous, heat-conduction, and radiation losses. The same types of losses were included in TL simulations of the vocal tract, both using the detailed, frequency-dependent equations (“detailed loss model”), and using the approximations with frequency-independent circuit elements (“coarse loss model”).

The TL simulations with both loss models were performed for 16 different vowels. These were compared with two types of physical models with differently realistic geometries: a set of 16 axisymmetric tube models, and a set of 2 × 16 realistically shaped models based on Magnetic Resonance Imaging (MRI) data. The axisymmetric tubes closely match the situation modeled with the TL circuit model. Hence, any bandwidth differences between the TL simulations and the tube-shaped physical models should be mainly related to simplifying assumptions in the models for the viscous and heat-conduction losses. On the other hand, any bandwidth differences between the MRI-based physical models and the tube-shaped physical models should be related to the more complex geometry of the MRI-based models, especially in the region of the lips. To study the effect of lip geometry on the bandwidths of the resonances in more detail, a third set of physical resonators was used, which consisted of concatenated cylindrical tubes with notches of different depths to model the “lip horn.” Hence, the study was designed to identify potential shortcomings of the loss models used in TL simulations of the vocal tract and to distinguish the effects of viscous and heat-conduction losses from radiation losses.

Three types of physical resonators were used. The first type [Fig. 2(a)] represents the vocal tract as a straight axisymmetric tube with circular cross-sections, where the cross-sectional area varies along the tube axis. This type of resonator is the most direct physical equivalent of the TL simulation method described in Sec. II C. Of this type, 16 resonators were made, namely, for the eight tense German vowels /aː, eː, iː, oː, uː, ɛː, øː, yː/ and the eight lax German vowels /ɛ, ɪ, ɔ, ʊ, ʏ, œ, ə, ɐ/. The area functions for these resonators were obtained from the vocal tract shapes defined in the articulatory speech synthesizer VocalTractLab 2.3 (VTL) (see www.vocaltractlab.de). The epilaryngeal tube of the physical models was slightly widened compared to the models defined in VTL. In future studies, this will allow them to be more easily excited with silicone vocal fold models, similar to Birkholz (2019), but this was not done in the present study. For the widening, the radius of the epilaryngeal tube was changed linearly from 8.5 mm at the glottal end to 3.5 mm at a position 1.8 cm above the “glottis.” All models were designed with a wall thickness of 3 mm and supplemented with flanges of 57 mm diameter at both ends. All 16 models were 3D-printed with an Ultimaker 3 printer (Dynamism, Chicago, IL) using the hard plastic material polylactic acid (PLA) with an infill ratio of 100%. Nine of these resonators were previously used by Birkholz (2022).

FIG. 2.

(Color online) Examples of the 3D-printed vowel resonators used in this study. (a) From left to right: tube resonators for the vowels /aː, iː, uː, ə/. (b) MRI-based resonators for /i:/ (left) and /a:/ (right). (c) Two-tube models for /e/ (front row) and /a/ (back row) including teeth with lip notch depths of 0 (left), 2 (middle), and 4 cm (right).

The second type of resonators were models with realistic 3D vocal tract shapes that were obtained from volumetric MRI data of two subjects [Fig. 2(b)]. These models were previously published as the Dresden Vocal Tract Dataset in terms of 3D-printable geometry files (Birkholz et al., 2020). Here, we used the models of the eight tense German vowels /aː, eː, iː, oː, uː, ɛː, øː, yː/ and the eight lax German vowels /a, ɛ, ɪ, ɔ, ʊ, ʏ, œ, ə/ of both subjects (32 resonators in total). All resonators were 3D-printed with PLA (100% infill ratio) on an Ultimaker 3 printer.

The third type of resonators are two-tube approximations of the vocal tract with notched lips, as shown in Fig. 2(c). They consist of two cylindrical, cascaded tubes of different cross-sectional areas. Their dimensions represent the four vowels /a, ae, e, ə/, inspired by Flanagan (1965), and are detailed in Fig. 3. Each of these vowels was made in five variants, with lip notch depths of 0 (no notch), 1, 2, 3, and 4 cm. This allows the specific effect of the lip notch on the bandwidths of the resonances to be studied. All 20 models were designed with a wall thickness of 3 mm and 3D-printed on an Ultimaker 3 printer with PLA (100% infill ratio). In addition, all models were equipped with a row of “teeth,” as shown in Fig. 2(c). The teeth reduce the radiating area of the mouth to a more realistic level. More details about these resonators are given in Birkholz and Venus (2018).

FIG. 3.

Geometries of the four two-tube resonators with notched lips. The notch depths were 0 (no notch), 1, 2, 3, and 4 cm.

For all physical resonators, the volume velocity transfer function (VVTF), i.e., the complex frequency-dependent ratio of the volume velocity at the lips to the volume velocity through the glottis, was measured with the method described in Fleischer (2018) and Birkholz (2020). This method is based on the principle of reciprocity and has the advantage that it does not require a broadband volume velocity source to excite the resonators at the glottis. Instead, a standard loudspeaker is used, placed about 30 cm in front of the resonator to be measured and aimed at its mouth opening. In a first step, the loudspeaker emits a broadband sine sweep signal, while the sweep response $P_1(\omega)$ is measured with a microphone inside the resonator at the closed glottal end (closed-glottis case). In a second step, a reference measurement is performed, where the mouth opening of the resonator is closed (using a 3 mm thick circular plate for the flanged resonators, and a layer of modeling clay for the others). For this measurement, the loudspeaker emits the same excitation signal as before, while a microphone measures the sweep response $P_2(\omega)$ immediately in front of the closed mouth. According to the theory outlined in Fleischer (2018), the VVTF can then be calculated as $H(\omega) = P_1(\omega) / P_2(\omega)$. The video https://youtu.be/9AoRS9X2BNY illustrates this procedure.
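The final step of this procedure is a bin-by-bin complex division of the two sweep responses. The following is a minimal sketch of that step only; the function names and the three-bin spectra are hypothetical and serve purely as an illustration, and the sweep deconvolution that yields the spectra is not shown.

```python
import math

def vvtf(p1, p2):
    """Return H = P1 / P2 for two complex spectra on the same frequency grid."""
    if len(p1) != len(p2):
        raise ValueError("spectra must have the same number of bins")
    return [a / b for a, b in zip(p1, p2)]

def magnitude_db(h):
    """Magnitude of a complex transfer function in dB."""
    return [20.0 * math.log10(abs(x)) for x in h]

# Hypothetical three-bin spectra (illustration only, not measured data)
p1 = [2 + 0j, 0 + 4j, 1 + 1j]   # response inside the resonator at the glottal end
p2 = [1 + 0j, 0 + 2j, 1 + 1j]   # reference response in front of the closed mouth
h = vvtf(p1, p2)
h_db = magnitude_db(h)
```

Because the same excitation signal is used for both measurements, the loudspeaker response cancels in the ratio.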

All resonators were measured in the large anechoic chamber of the TU Dresden at a temperature of 20.5 °C. For the measurement of $P_1(\omega)$, a 1/4-in. measurement microphone (MK301E capsule with MV301 preamplifier by Microtech Gefell, Berlin, Germany) was used. This microphone was inserted through a hole in a glottal adaptor plate. For the measurement of $P_2(\omega)$, the probe microphone G.R.A.S. 40SC was used (apart from the MRI-based resonators, where the G.R.A.S. 46BL was used). Both microphones were connected to the audio interface Terratec Aureon XFire 8.0 HD (Terratec, Alsdorf, Germany), which in turn was connected to a laptop computer (MSI GT72-2QE, MSI, New Taipei City, Taiwan) with the operating system Windows 8.1, 64 Bit.

The measurements were made with the open-source software MeasureTransferFunction (Birkholz, 2019), which implements the method by Farina (2000) and allows one to obtain the linear frequency response of the resonators despite potential harmonic distortions generated by the loudspeaker. The transfer functions were measured from 100 to 10 000 Hz with a spectral resolution of 0.96 Hz. Figure 4 shows examples of obtained transfer functions for the two-tube resonators for the vowel /ae/ without a lip notch (gray curve) and with a lip notch of 3 cm depth (black). Here, the lip notch causes a shift of the resonances towards higher frequencies but also an increase in their bandwidths (clearly visible for the fourth and fifth resonances). The complete set of 3D-printable resonators and their transfer functions is contained in the supplemental material (also at www.vocaltractlab.de/index.php?page=birkholz-supplements).

FIG. 4.

Measured transfer functions for the two-tube resonators for the vowel /ae/ without a lip notch (gray curve) and with a lip notch of 3 cm depth (black).

The acoustic simulations of the vocal tract were performed with the TL circuit model shown in Fig. 5. The subglottal system and side cavities, such as the sinus piriformis or the nasal cavity, were not considered here. The TL model assumes the propagation of plane waves through the vocal tract, which in turn is approximated as a sequence of $N$ abutting short cylindrical tube sections with the lengths $l_i$ and the cross-sectional areas $A_i$ ($1 \le i \le N$). In the circuit, each tube section $i$ is represented as a T-type network with the series elements $R_i$ and $L_i$, and the shunt elements $G_i$ and $C_i$. At the left end, the network is driven by an ideal glottal flow source $U_1$ (no source-filter interaction), and at the right end it is terminated by a radiation impedance $Z_\mathrm{rad}$. For each tube section $i$,
$$L_i = \frac{\varrho\, l_i}{2 A_i} \tag{1}$$
is the acoustic inertance (for one half of the tube section), and
$$C_i = \frac{A_i\, l_i}{\varrho\, c^2} \tag{2}$$
is the acoustic compliance, where $\varrho = 1.14 \cdot 10^{-3}$ g/cm$^3$ is the ambient density and $c = 3.5 \cdot 10^{4}$ cm/s is the sound velocity for moist air at body temperature (Flanagan, 1965). These elements are purely reactive and do not dissipate energy.
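As a plausibility check of the reactive elements, the sketch below assumes the standard lumped-element forms $L_i = \varrho l_i / (2 A_i)$ for each half-section and $C_i = A_i l_i / (\varrho c^2)$ for the section, and verifies that together they reproduce the characteristic impedance $\varrho c / A_i$ and the propagation delay $l_i / c$ of the tube section. The section dimensions are hypothetical.

```python
import math

RHO = 1.14e-3     # g/cm^3, ambient density (value quoted in the text)
C_SOUND = 3.5e4   # cm/s, sound velocity (value quoted in the text)

def half_inertance(length_cm, area_cm2):
    """Acoustic inertance of one half of a tube section (assumed form of Eq. 1)."""
    return RHO * length_cm / (2.0 * area_cm2)

def compliance(length_cm, area_cm2):
    """Acoustic compliance of a whole tube section (assumed form of Eq. 2)."""
    return area_cm2 * length_cm / (RHO * C_SOUND ** 2)

# Hypothetical tube section: 0.4 cm long, 3 cm^2 cross-sectional area
length, area = 0.4, 3.0
l_half = half_inertance(length, area)
c_sec = compliance(length, area)

z0 = math.sqrt(2.0 * l_half / c_sec)     # characteristic impedance of the section
delay = math.sqrt(2.0 * l_half * c_sec)  # time for a wave to traverse the section
```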
FIG. 5.

Acoustic network of the vocal tract (without side branches) driven by a glottal volume velocity source $U_1$. The network includes losses due to viscosity ($R_i$, $i = 1, \ldots, N$), heat conduction ($G_i$), and radiation ($Z_\mathrm{rad}$).

Losses within the tube sections are modeled with Ri and Gi, where the resistance Ri models dissipation due to viscous friction at the tube wall (for one-half of the section), and the conductance Gi models a power loss that arises from heat conduction at the tube wall. Flanagan (1965) derived equations for Ri and Gi based on theoretical considerations. For the circuit in Fig. 5, they are
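$$R_i = \frac{l_i S_i}{2 A_i^2} \sqrt{\frac{\omega \varrho \mu}{2}} \tag{3}$$
and
$$G_i = l_i S_i\, \frac{\eta - 1}{\varrho c^2} \sqrt{\frac{\lambda \omega}{2 c_p \varrho}}, \tag{4}$$
where $S_i = 2\sqrt{A_i \pi}$ is the tube circumference, $\mu = 1.86 \cdot 10^{-4}$ dyne s/cm$^2$ is the dynamic viscosity, $\omega$ is the angular frequency, $\eta = 1.4$ is the adiabatic constant of air, and $\lambda / c_p = 0.229 \cdot 10^{-3}$ g cm$^{-1}$ s$^{-1}$ is the quotient of the coefficient of heat conduction and the specific heat of air. In principle, the circuit sections can also be extended with a branch to model losses due to soft walls (Birkholz et al., 2022; Flanagan et al., 1975). However, such branches were omitted here to make the circuit model comparable to the hard-walled physical resonators used in this study.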
The radiation impedance Z rad at the mouth opening is roughly comparable to that of a vibrating piston (corresponding to the mouth area) set in a spherical baffle (corresponding to the head). However, the analytical expression for this impedance is involved and cannot be expressed in closed form (Flanagan, 1965). A more tractable radiation impedance is that for a vibrating piston in an infinite wall, which is
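$$Z_\mathrm{rad} = \frac{\varrho c}{A_\mathrm{rad}} \left[ 1 - \frac{J_1(2ka)}{ka} + j\, \frac{S_1(2ka)}{ka} \right], \tag{5}$$
where $A_\mathrm{rad} = A_N$ is the radiating area, $a = \sqrt{A_\mathrm{rad} / \pi}$ is the radius of the piston, $k = \omega / c$ is the wave number, $J_1$ is the first-order Bessel function, $S_1$ is the first-order Struve function, and $j = \sqrt{-1}$.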
Equations (3), (4), and (5) represent theoretically well-motivated models for viscous losses, heat conduction losses, and radiation losses, and are frequently used for the calculation of vocal tract transfer functions (Badin and Fant, 1984; Wakita and Fant, 1978; Zhang and Espy-Wilson, 2004). However, all three equations, i.e., circuit elements, are frequency-dependent and therefore not usable for time-domain simulations, which are more suitable for the synthesis of continuous speech (Birkholz and Drechsel, 2021; Elie and Laprie, 2016; Maeda, 1982). Heat conduction losses are usually completely omitted in time-domain simulations because frequency-independent conductivities Gi would cause a constant “leakage” for the DC portion of the volume velocity through the vocal tract. Viscous losses are often approximated by the (frequency-independent) Hagen–Poiseuille equation for a stationary laminar flow (e.g., Birkholz, 2005; Elie and Laprie, 2016; Maeda, 1982), i.e.,
$$\hat{R}_i = \frac{4 \pi \mu\, l_i}{A_i^2}. \tag{6}$$
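With regard to the radiation impedance, Flanagan (1965) proposed a way to approximate $Z_\mathrm{rad}$ in Eq. (5) by a parallel ($\parallel$) circuit of a (frequency-independent) resistor and inductor as follows:
$$\hat{Z}_\mathrm{rad} = R_\mathrm{rad} \parallel j \omega L_\mathrm{rad} = \frac{j \omega L_\mathrm{rad} R_\mathrm{rad}}{R_\mathrm{rad} + j \omega L_\mathrm{rad}}, \tag{7}$$
$$R_\mathrm{rad} = \frac{128\, \varrho c}{9 \pi^2 A_\mathrm{rad}}, \tag{8}$$
$$L_\mathrm{rad} = \frac{8\, \varrho\, a}{3 \pi A_\mathrm{rad}}. \tag{9}$$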
Figure 6 shows that for small values of $k \cdot a$, the real and imaginary parts of $\hat{Z}_\mathrm{rad}$ (dashed lines) fit $Z_\mathrm{rad}$ according to Eq. (5) (solid lines) very well. For higher frequencies or greater mouth openings, the real part of $\hat{Z}_\mathrm{rad}$ (black lines), which causes the radiation losses, tends to underestimate that of $Z_\mathrm{rad}$.
FIG. 6.

Comparison of the (precise) radiation impedance of a piston in an infinite baffle (solid curves) and its approximation by a parallel R–L circuit according to Eq. (7). The black curves are the real parts of the impedance (resistance), and the gray curves are the imaginary parts (reactance). The values are normalized, i.e., multiplied by $A_\mathrm{rad} / (\varrho c)$. The upper scale shows the impedance values as a function of $ka$, where $k = \omega / c$ is the wave number and $a$ is the piston radius. The bottom scale shows the frequency for the case of a large piston area with $a = 2$ cm ($A_\mathrm{rad} = 12.6$ cm$^2$).
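The comparison in Fig. 6 can be reproduced numerically with a short sketch. It assumes the normalized piston impedance $1 - J_1(2ka)/(ka) + j\,S_1(2ka)/(ka)$ and the normalized parallel R–L approximation with $R = 128/(9\pi^2)$ and $\omega L = 8ka/(3\pi)$. For self-containment, the Bessel and Struve functions are evaluated here by their power series, which is a stand-in for (not a reproduction of) the approximation used in the study; the series is adequate for the moderate arguments relevant here.

```python
import math

def bessel_j1(x, terms=40):
    """First-order Bessel function J1 via its power series."""
    return sum((-1) ** m / (math.factorial(m) * math.factorial(m + 1))
               * (x / 2.0) ** (2 * m + 1) for m in range(terms))

def struve_h1(x, terms=40):
    """First-order Struve function via its power series."""
    return sum((-1) ** m / (math.gamma(m + 1.5) * math.gamma(m + 2.5))
               * (x / 2.0) ** (2 * m + 2) for m in range(terms))

def z_piston(ka):
    """Normalized radiation impedance of a piston in an infinite baffle."""
    return complex(1.0 - bessel_j1(2.0 * ka) / ka, struve_h1(2.0 * ka) / ka)

def z_lumped(ka):
    """Normalized parallel R-L approximation of the piston impedance."""
    r = 128.0 / (9.0 * math.pi ** 2)
    x = 1j * 8.0 * ka / (3.0 * math.pi)
    return r * x / (r + x)

# Ratio of approximate to exact radiation resistance at a small and a large ka
ratio_small = z_lumped(0.1).real / z_piston(0.1).real
ratio_large = z_lumped(1.8).real / z_piston(1.8).real
```

With $a = 2$ cm, $ka = 1.8$ corresponds to roughly 5 kHz, i.e., the upper end of the frequency range examined here.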

Given the previously noted considerations, we studied two variants of the transmission-line model:

  • one variant with a “detailed loss model” that applies the theoretically precise losses for $R_i$, $G_i$, and $Z_\mathrm{rad}$ according to Eqs. (3), (4), and (5).

  • one variant with a “coarse loss model,” which is frequently used for time-domain simulations, and which omits the heat conduction losses ($G_i = 0$) and uses the approximations $\hat{R}_i$ and $\hat{Z}_\mathrm{rad}$ according to Eqs. (6) and (7).

Since the Struve function $S_1(\cdot)$ in the equation for $Z_\mathrm{rad}$ is not readily available in MATLAB and many other programs and programming languages, we implemented the approximation provided by Aarts and Janssen (2003).

To determine the effect of the loss model variants on the bandwidths, the transmission-line circuit in Fig. 5 was used to calculate the volume velocity transfer functions from the glottis to the lips, i.e., $H(\omega) = U_{N+1}(\omega) / U_1(\omega)$, for 16 tube shapes. We included the tube shapes for the eight tense German vowels /aː, eː, iː, oː, uː, ɛː, øː, yː/ and for the eight lax German vowels /ɛ, ɪ, ɔ, ʊ, ʏ, œ, ə, ɐ/, which were exported as discrete area functions with $N = 40$ tube sections from the articulatory speech synthesizer VocalTractLab 2.3. Therefore, the tube shapes used here largely correspond to the 3D-printed tube-shaped models.

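To determine the transfer function for a given tube shape, each tube section $i$ was represented in terms of its chain matrix (Birkholz, 2005; Wakita and Fant, 1978)
$$\mathbf{K}_i = \begin{pmatrix} 1 + Z_i Y_i & -Z_i (2 + Z_i Y_i) \\ -Y_i & 1 + Z_i Y_i \end{pmatrix}, \tag{10}$$
where $Z_i = R_i + j \omega L_i$ is the series impedance of each half of the T-section and $Y_i = G_i + j \omega C_i$ is its shunt admittance. With this chain matrix, the relation between the input and output of the tube section $i$ can be written as
$$\begin{pmatrix} P_{i+1} \\ U_{i+1} \end{pmatrix} = \mathbf{K}_i \begin{pmatrix} P_i \\ U_i \end{pmatrix}, \tag{11}$$
where $P_i$ and $U_i$, and $P_{i+1}$ and $U_{i+1}$ represent the input and output sound pressures and volume velocities. The matrix $\mathbf{K}_\mathrm{tot}$ for the entire vocal tract is then the product of the matrices for all tube sections, i.e.,
$$\mathbf{K}_\mathrm{tot} = \mathbf{K}_N \mathbf{K}_{N-1} \cdots \mathbf{K}_1 = \begin{pmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{pmatrix}. \tag{12}$$
Hence, the relation between the acoustic quantities at the input and output of the whole vocal tract is
$$\begin{pmatrix} P_{N+1} \\ U_{N+1} \end{pmatrix} = \mathbf{K}_\mathrm{tot} \begin{pmatrix} P_1 \\ U_1 \end{pmatrix}. \tag{13}$$
Considering that $P_{N+1} = U_{N+1} Z_\mathrm{rad}$ and that $\det \mathbf{K}_\mathrm{tot} = 1$, the volume velocity transfer function can be derived as
$$H(\omega) = \frac{U_{N+1}(\omega)}{U_1(\omega)} = \frac{1}{K_{11} - K_{21} Z_\mathrm{rad}}. \tag{14}$$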

The transfer functions for all tube shapes have been calculated with a spectral resolution of 1 Hz, which is similar to the spectral resolution of the transfer functions measured for the physical models, and allows a precise determination of the frequencies and bandwidths of the resonances.
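The chain-matrix calculation can be sketched in a few lines. The sketch makes several assumptions that should be checked against the original equations: each section matrix maps input quantities to output quantities, so that the total matrix is the product $\mathbf{K}_N \cdots \mathbf{K}_1$ and $H = 1 / (K_{11} - K_{21} Z_\mathrm{rad})$; the coarse loss model is used (Hagen–Poiseuille resistance $4 \pi \mu l_i / A_i^2$ per half-section, $G_i = 0$, parallel R–L radiation load); and the uniform tube of 17.5 cm length and 3 cm$^2$ area is a hypothetical test case, not one of the vowel shapes of the study.

```python
import math

# CGS constants quoted in the text
RHO, C_SOUND, MU = 1.14e-3, 3.5e4, 1.86e-4

def section_matrix(omega, length, area):
    """Chain matrix of one T-section with coarse losses (G_i = 0)."""
    # Series impedance of each half-section: viscous resistance + inertance
    z = (4.0 * math.pi * MU * length / area ** 2
         + 1j * omega * RHO * length / (2.0 * area))
    # Shunt admittance of the section: compliance only
    y = 1j * omega * area * length / (RHO * C_SOUND ** 2)
    diag = 1.0 + z * y
    return ((diag, -z * (2.0 + z * y)), (-y, diag))

def matmul(m, n):
    """Product of two 2x2 matrices given as nested tuples."""
    return ((m[0][0] * n[0][0] + m[0][1] * n[1][0],
             m[0][0] * n[0][1] + m[0][1] * n[1][1]),
            (m[1][0] * n[0][0] + m[1][1] * n[1][0],
             m[1][0] * n[0][1] + m[1][1] * n[1][1]))

def radiation_lumped(omega, area):
    """Parallel R-L approximation of the radiation impedance."""
    r = 128.0 * RHO * C_SOUND / (9.0 * math.pi ** 2 * area)
    x = 1j * omega * 8.0 * RHO / (3.0 * math.pi * math.sqrt(math.pi * area))
    return r * x / (r + x)

def transfer_function(freq_hz, lengths, areas):
    """Volume velocity transfer function H = U_{N+1} / U_1 at one frequency."""
    omega = 2.0 * math.pi * freq_hz
    k = ((1.0, 0.0), (0.0, 1.0))
    for length, area in zip(lengths, areas):      # from glottis to lips
        k = matmul(section_matrix(omega, length, area), k)
    return 1.0 / (k[0][0] - k[1][0] * radiation_lumped(omega, areas[-1]))

# Hypothetical uniform tube discretized into N = 40 sections
N = 40
lengths = [17.5 / N] * N
areas = [3.0] * N
freqs = list(range(100, 1000))                    # 1 Hz resolution
mags = [abs(transfer_function(f, lengths, areas)) for f in freqs]
f_r1 = freqs[mags.index(max(mags))]               # first resonance frequency
```

For this tube, the first resonance lies somewhat below the quarter-wavelength frequency $c / (4 \cdot 17.5\ \mathrm{cm}) = 500$ Hz because the radiation reactance acts like a small length extension at the open end.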

The transfer functions $H(\omega)$ of the physical and simulated resonators were used to obtain the frequencies and bandwidths of the resonances in the frequency range from 0 to 5 kHz. For each resonance, the frequency $f_R$ was determined as that of the corresponding local maximum in the magnitude spectrum $|H(\omega)|$. Its bandwidth was calculated as the difference between the frequencies left and right of $f_R$, where the magnitude was 3 dB below the maximum. Due to the spectral resolution of about 1 Hz of the transfer functions, the error of the estimation was ±0.5 Hz (no interpolation was performed between the spectral points).
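The two-sided bandwidth estimate described above can be sketched as follows. The function name is hypothetical, the test spectrum is a synthetic single resonance with a known bandwidth of 50 Hz rather than measured data, and the choice of the first sample at or below the −3 dB level on each flank is one possible reading of "no interpolation."

```python
import math

def resonance_bandwidth(freqs, mags_db, peak_index):
    """Return (f_R, B_R) for the local maximum at peak_index.

    Walks outward on both flanks to the first spectral point that is at
    least 3 dB below the peak; no interpolation between points.
    """
    peak = mags_db[peak_index]
    lo = hi = peak_index
    while lo > 0 and mags_db[lo] > peak - 3.0:
        lo -= 1
    while hi < len(mags_db) - 1 and mags_db[hi] > peak - 3.0:
        hi += 1
    if mags_db[lo] > peak - 3.0 or mags_db[hi] > peak - 3.0:
        raise ValueError("a flank never drops 3 dB below the peak")
    return freqs[peak_index], freqs[hi] - freqs[lo]

# Synthetic resonance at 500 Hz with a 50 Hz bandwidth, sampled at 1 Hz
freqs = list(range(400, 601))
mags_db = [-10.0 * math.log10(1.0 + ((f - 500) / 25.0) ** 2) for f in freqs]
peak_index = max(range(len(mags_db)), key=mags_db.__getitem__)
f_r, b_r = resonance_bandwidth(freqs, mags_db, peak_index)
```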

While the previously noted procedure was used for most resonances, there were two problem cases that received special consideration. The first problem occurred when the resonance of interest was so close to another resonance that the amplitude on the flank to the adjacent resonance only decreased by less than 3 dB before increasing again. This situation is shown in Fig. 7(b) for the fourth and fifth resonances. In this case, the bandwidth was estimated only on the basis of the –3 dB frequency $f_{3\,\mathrm{dB}}$ on the flank opposite from the near resonance as $B_R = 2 \cdot |f_R - f_{3\,\mathrm{dB}}|$.
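The one-sided estimate $B_R = 2 \cdot |f_R - f_{3\,\mathrm{dB}}|$ can be sketched in the same way. The function name and the `side` parameter are hypothetical, and the synthetic spectrum below merely imitates a neighboring resonance by preventing the left flank from dropping more than 2 dB.

```python
import math

def one_sided_bandwidth(freqs, mags_db, peak_index, side):
    """Estimate B_R from one flank only (side=+1: right flank, side=-1: left)."""
    peak = mags_db[peak_index]
    i = peak_index
    while 0 < i < len(mags_db) - 1 and mags_db[i] > peak - 3.0:
        i += side
    if mags_db[i] > peak - 3.0:
        raise ValueError("flank never drops 3 dB below the peak")
    return 2.0 * abs(freqs[peak_index] - freqs[i])

def lorentz_db(f, f0=500.0, bw=50.0):
    """Magnitude in dB of a single synthetic resonance with bandwidth bw."""
    return -10.0 * math.log10(1.0 + ((f - f0) / (bw / 2.0)) ** 2)

freqs = list(range(400, 601))
# Left flank is clipped at -2 dB to imitate a close neighboring resonance
mags_db = [lorentz_db(f) if f >= 500 else max(lorentz_db(f), -2.0) for f in freqs]
b_r = one_sided_bandwidth(freqs, mags_db, freqs.index(500), side=+1)
```

The estimate implicitly assumes that the resonance peak is symmetric around $f_R$.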

FIG. 7.

Transfer functions of (a) the vowel /aː/ of speaker s1, and (b) the vowel /oe/ of speaker s2 of the Dresden Vocal Tract Dataset (Birkholz et al., 2020). The two vertical lines around each resonance indicate its 3 dB bandwidth. Resonances marked with an asterisk were excluded from the bandwidth analysis because they are located next to an antiresonance.

The second problem was when there was a spectral zero (antiresonance) next to the resonance of interest. Zeros were mainly found in the transfer functions of the MRI-based physical models, where they are caused by side cavities, such as the piriform fossae, and by transverse modes. A nearby zero may strongly affect the steepness of both flanks of a resonance and hence the frequencies of the –3 dB points on these flanks. Therefore, the strategy of estimating the bandwidth only from the opposite flank does not work reliably here. Hence, resonances with a neighboring zero were completely excluded from the analysis. Figure 7 shows two examples of measured transfer functions, where the omitted resonances/peaks are marked with an asterisk.

The energy losses discussed so far are all linear acoustic losses. However, sharp edges such as those at the “teeth” of the physical resonators could cause turbulence and hence nonlinear losses (Buick et al., 2011). In contrast to linear losses, the contribution of nonlinear losses to the bandwidths of the resonances would depend on the amplitude of the acoustic waves. To estimate the extent of nonlinear losses, the transfer functions of two physical resonators were measured using different sound pressure levels (SPLs) for the excitation signal (Sec. II B). The two models were the axisymmetric tube for /a/ and the two-tube model for /a/ with a 2 cm notch. For both models, the VVTF was obtained with excitation signals of 74, 80, 86, and 92 dB SPL (measured during the sweep at about 30 cm from the loudspeaker using the acoustic signal analyser Acoustilyzer AL1 by NTi Audio, Schaan, Liechtenstein), five times for each SPL. From each of the 20 transfer functions per model, the bandwidths $B_{R1}, \ldots, B_{R4}$ were determined according to Sec. II E. The results are shown in Fig. 8, where each error bar represents ±1 standard deviation estimated from the five repetitions. The results indicate that the estimated bandwidths are largely independent of the SPL. The strongest (but still small) effect of the SPL on the bandwidth was observed for $B_{R4}$ of the notched tube model, which increased by 7 Hz between the SPLs of 74 and 92 dB. Since 7 Hz is small compared to the average bandwidth of 236 Hz (caused mainly by linear losses), we conclude that nonlinear losses can be neglected in our analysis.

FIG. 8.

Variation of the measured bandwidths $B_{R1}$ to $B_{R4}$ of two physical resonators using different SPLs of the sweep signal generated by the external sound source. The error bars show the $\pm 1\sigma$ range of the bandwidths from five measurements per model and SPL.

Figure 9 plots the bandwidths of all considered resonances of all analyzed vowels as a function of resonance frequency. The left plot shows the data from the simulations with the coarse loss model (black squares) and the detailed loss model (white circles), and the right plot shows the data for the physical tube-shaped resonators (black squares) and the MRI-based resonators (white circles). For all four model types, the bandwidths generally increase with frequency. This differs from human bandwidth data, where (closed-glottis) bandwidths decrease up to about 500 Hz, before increasing towards higher frequencies (Fujimura and Lindqvist, 1971). This difference is due to the soft walls of the real vocal tract, which were not present in either the simulations or the physical models. The data in Fig. 9 also show an increasing scattering of the bandwidths with increasing frequency. This indicates that towards higher resonance frequencies, the bandwidths become increasingly dependent on the vocal tract shape.

FIG. 9.

Left, bandwidths over frequencies of the resonances of the simulated vocal tract transfer functions with a coarse loss model (squares) and a detailed loss model (circles). Right, bandwidths over frequencies of the resonances of the measured transfer functions for the physical tube-shaped resonators (squares) and the MRI-based resonators (circles).

Due to this scattering, the data cannot be accurately represented in terms of regression lines beyond a frequency of 2 or 3 kHz. Instead, for a quantitative comparison between the four models, the data points were split into five 1 kHz-wide frequency bands. Figure 10 shows the bandwidth distributions of all four model types in each of the five frequency bands. The mean values and standard deviations of these distributions have been summarized in Table I. In all frequency bands, the average bandwidths increase from the simulations with the coarse loss model, to the simulations with the detailed loss model, to the physical tube-shaped resonators, and to the physical MRI-based resonators. Significant bandwidth differences between the model types in the individual frequency bands, as determined by two-sample t-tests, are indicated by the asterisks in Fig. 10. The significance level for these tests was α = 0.05, and Bonferroni correction was applied to account for the six comparisons in each frequency band. Note that the bandwidth differences in the 4–5 kHz band, due to the strong scattering, did not reach a level of significance. Furthermore, there are fewer data points for the MRI-based models in the 4–5 kHz band due to the relatively high number of antiresonances in this band.
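The band-wise summary and the Bonferroni adjustment can be illustrated with a short sketch. The data points below are hypothetical, the use of the sample standard deviation is an assumption, and the t-tests themselves are not reproduced here.

```python
import math
import statistics
from collections import defaultdict

def band_statistics(resonances, band_width_hz=1000.0):
    """Group (frequency, bandwidth) pairs into bands.

    Returns {band_index: (mean, standard deviation)}, where band 0 covers
    0 to 1 kHz, band 1 covers 1 to 2 kHz, and so on.
    """
    bands = defaultdict(list)
    for freq, bandwidth in resonances:
        bands[int(freq // band_width_hz)].append(bandwidth)
    return {k: (statistics.mean(v),
                statistics.stdev(v) if len(v) > 1 else 0.0)
            for k, v in sorted(bands.items())}

# Hypothetical (frequency in Hz, bandwidth in Hz) pairs, illustration only
stats = band_statistics([(500, 10), (700, 14), (1500, 40)])

# Four model types give six pairwise comparisons per frequency band
n_comparisons = math.comb(4, 2)
alpha_adjusted = 0.05 / n_comparisons   # Bonferroni-corrected level per test
```

With six comparisons per band, each individual test must therefore reach a level of about 0.0083 to count as significant at an overall level of 0.05.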

FIG. 10.

Distributions of the bandwidths of the resonances in five 1 kHz bands for four types of resonators: simulated resonators with a coarse loss model, simulated resonators with a detailed loss model, physical tube-shaped resonators, and physical MRI-based resonators. The black diamonds indicate mean values. The asterisks indicate pairs of resonator types for which the means differ significantly with α = 0.05 (considering Bonferroni correction).

TABLE I.

Average bandwidths and standard deviations (both in Hz, standard deviations in brackets) of the resonances in different frequency bands for the four types of examined resonators, and for humans (as reference).

Resonator type | 0–1 kHz | 1–2 kHz | 2–3 kHz | 3–4 kHz | 4–5 kHz
Simulated resonators with coarse loss model | 4 (2) | 16 (8) | 13 (13) | 12 (11) | 62 (74)
Simulated resonators with detailed loss model | 14 (4) | 34 (9) | 37 (15) | 52 (12) | 94 (84)
Physical tube-shaped resonators | 15 (4) | 40 (10) | 59 (18) | 66 (33) | 111 (98)
Physical MRI-based resonators | 25 (9) | 49 (21) | 66 (38) | 92 (52) | 131 (56)
Human data (Fant, 1972) | 52 (15) | 54 (12) | 79 (39) | – | –
Resonator type 0–1 kHz 1–2 kHz 2–3 kHz 3–4 kHz 4–5 kHz
Simulated resonators with coarse loss model  4 (2)  16 (8)  13 (13)  12 (11)  62 (74) 
Simulated resonators with detailed loss model  14 (4)  34 (9)  37 (15)  52 (12)  94 (84) 
Physical tube-shaped resonators  15 (4)  40 (10)  59 (18)  66 (33)  111 (98) 
Physical MRI-based resonators  25 (9)  49 (21)  66 (38)  92 (52)  131 (56) 
Human data (Fant, 1972 52 (15)  54 (12)  79 (39)     

Looking at the two types of simulations, the lower bandwidths with the coarse loss model are due to both the approximations for the viscous and radiation losses and the neglect of the heat-conduction loss. As Fig. 6 shows, the resistive component of the radiation impedance is somewhat underestimated by the lumped-element approximation, but mainly at higher frequencies and never by more than 20%. For the viscous loss, the difference between Eqs. (3) and (6) is more significant. For example, for a tube with a circular cross section of 1 cm² and at a frequency of 1000 Hz, the resistance according to Eq. (3) is 19.6 times that of the frequency-independent approximation in Eq. (6).
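The factor of 19.6 can be checked with a short calculation. The sketch below rests on assumptions, because Eqs. (3) and (6) are not reproduced in this part of the text: Eq. (3) is taken to be the boundary-layer resistance per unit length, (S/A²)·sqrt(ωρμ/2), with S the circumference, and Eq. (6) is taken to be the Hagen-Poiseuille flow resistance per unit length, 8πμ/A². The air density and viscosity are likewise assumed values for moist air at body temperature.

```python
import math

RHO = 1.14     # air density in kg/m^3 (assumed value)
MU = 1.86e-5   # dynamic viscosity of air in Pa*s (assumed value)


def boundary_layer_resistance(area_m2, freq_hz):
    """Frequency-dependent viscous resistance per unit length (assumed form of Eq. (3))."""
    circumference = 2.0 * math.sqrt(math.pi * area_m2)  # circular cross section
    omega = 2.0 * math.pi * freq_hz
    return circumference / area_m2**2 * math.sqrt(omega * RHO * MU / 2.0)


def poiseuille_resistance(area_m2):
    """Frequency-independent flow resistance per unit length (assumed form of Eq. (6))."""
    return 8.0 * math.pi * MU / area_m2**2


area = 1e-4  # 1 cm^2 expressed in m^2
ratio = boundary_layer_resistance(area, 1000.0) / poiseuille_resistance(area)
print(round(ratio, 1))  # 19.6
```

Under these assumptions the ratio comes out at about 19.6, in agreement with the value quoted above. The ratio does not depend on the tube length, since both resistances are per unit length.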

The bandwidth differences between the simulations with the detailed loss model and the physical tube-shaped resonators show that the detailed loss model underestimates the real losses, although the tube shapes were virtually identical for the simulated and physical models. The biggest difference is in the 2–3 kHz band, where the bandwidths of the physical models are 59% higher than in the simulations. The differences are likely caused by the approximations inherent to the equations for the visco-thermal losses. While Eqs. (3) and (4) both assume smooth walls, the walls of the 3D-printed resonators have a slightly rough surface, which results from the layer-wise printing process of the 3D-printer. For our models, the layer thickness was 0.1 mm. Hence, there are small grooves with a spacing of 0.1 mm on the surface of all 3D-printed models. This is illustrated with a high-resolution microscopic image of a section of the surface in the supplemental material.

The comparison between the two types of physical resonators shows that the realistically shaped MRI-based vocal tract models cause higher losses than the axisymmetric tubes. The average bandwidth increases by between 12% (in the 2–3 kHz band) and 67% (in the 0–1 kHz band). The differences can have two causes. On the one hand, due to their more complex geometries, the MRI-based models have larger inner surfaces than the axisymmetric tubes. This likely increases the viscous and heat-conduction losses. On the other hand, the horn-shaped lips of the MRI-based models likely increase the radiation losses compared to the tube-shaped models with their circular radiating apertures. The contributions of these two differences to the wider bandwidths of the MRI-based models are difficult to assess. However, the results for the notched resonators can provide an estimate of the individual effect of the lip geometry on the bandwidths.
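The percentages quoted in this and the preceding paragraph follow directly from the mean bandwidths in Table I. As a minimal sketch (the dictionary keys and the function name are illustrative choices, and only the rounded table means are used), the relative increases can be recomputed as follows:

```python
BANDS = ["0-1 kHz", "1-2 kHz", "2-3 kHz", "3-4 kHz", "4-5 kHz"]

# Mean bandwidths in Hz per band, copied from Table I.
MEAN_BW_HZ = {
    "sim_detailed": [14, 34, 37, 52, 94],
    "tube": [15, 40, 59, 66, 111],
    "mri": [25, 49, 66, 92, 131],
}


def relative_increase_percent(reference, other):
    """Percentage by which `other` exceeds `reference`, band by band."""
    return [100.0 * (o - r) / r for r, o in zip(reference, other)]


tube_vs_sim = relative_increase_percent(MEAN_BW_HZ["sim_detailed"], MEAN_BW_HZ["tube"])
mri_vs_tube = relative_increase_percent(MEAN_BW_HZ["tube"], MEAN_BW_HZ["mri"])

for band, a, b in zip(BANDS, tube_vs_sim, mri_vs_tube):
    print(f"{band}: tube vs. detailed simulation {a:5.1f} %, MRI vs. tube {b:5.1f} %")
```

The largest difference between the tube-shaped resonators and the detailed simulations appears in the 2–3 kHz band (about 59%), while the difference between MRI-based and tube-shaped resonators ranges from about 12% (2–3 kHz) to about 67% (0–1 kHz).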

The resonance frequencies and bandwidths of the two-tube resonators with the notched lips are shown in Fig. 11. For each of the four vowels, the results for the five notch depths (0 cm, 1 cm, 2 cm, 3 cm, 4 cm) are displayed in the same plot. This shows that an increasing notch depth (corresponding to increased lip spreading) generally increases both the resonance frequencies and bandwidths. The increase in resonance frequencies is caused by an effective reduction of the vocal tract length by the notches and has been discussed before (Birkholz and Venus, 2018; Lindblom et al., 2007). The increase in bandwidths is probably related to the increased radiating area, which takes the form of a curved surface in notched tube openings.

FIG. 11.

Bandwidths of the resonances of the four notched two-tube resonators as a function of the resonance frequency for different notch depths.

An interesting observation is that changing the notch depth affects the bandwidths differently depending on the resonator shape and resonance index. For example, the bandwidths of the first two resonances of /e/ are hardly affected by notch depth, while those of higher-order resonances increase somewhat uniformly with increasing notch depth. In contrast, for /a/ the 1st, 2nd, 3rd, and 5th resonances are hardly affected by the notch depth, while the bandwidth of the 4th resonance increases strongly with increasing notch depth. The likely reason is that the 4th resonance is effectively the 3/4-λ resonance of the anterior (oral) part of the tube model, which is hence most strongly affected by the lip shape. Since the anterior tube section has a length of 8 cm, the 3/4-λ resonance has a wavelength of 10.7 cm and thus a frequency of about 3270 Hz, which corresponds to the affected resonance.
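The 3/4-λ estimate can be reproduced with the standard formula for a tube that is closed at one end and open at the other. The sketch below assumes a speed of sound of 350 m/s (not stated in this section) and ignores any end correction at the lips; the function name is an illustrative choice.

```python
SPEED_OF_SOUND = 350.0  # m/s, assumed value for warm, humid air


def quarter_wave_resonance(length_m, n, c=SPEED_OF_SOUND):
    """Wavelength and frequency of the n-th resonance (n = 1, 2, ...) of a
    tube closed at one end and open at the other: length = (2n - 1) * lambda / 4."""
    wavelength = 4.0 * length_m / (2 * n - 1)
    return wavelength, c / wavelength


# n = 2 is the 3/4-lambda resonance of the 8 cm anterior tube section.
wavelength, frequency = quarter_wave_resonance(0.08, n=2)
print(f"{wavelength * 100:.1f} cm, {frequency:.0f} Hz")  # 10.7 cm, 3281 Hz
```

With the unrounded wavelength the frequency is about 3280 Hz; using the rounded wavelength of 10.7 cm gives about 3270 Hz, as stated above.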

In general, the two-tube approximations of the vocal tract with the notched lips seem to overestimate the bandwidths somewhat, as there are several cases where they exceed 300 Hz, which is rare in the axisymmetric and MRI-based models (Fig. 9). This is probably due to the relatively big cross-sections of the anterior tube segments, which result in a correspondingly high radiation loss.

Since the MRI-based models have the most realistic shape of all three types of physical resonators, their bandwidths should most closely match those of humans. To verify this, their bandwidths (the same as in Fig. 9) were plotted along with human bandwidth data by Fant (1972) in Fig. 12. Table I shows their mean values and standard deviations in 1 kHz frequency bands. For resonances with frequencies above 800 Hz, the human data points (black triangles and squares) are all within the dashed bounding trapezoid around the data points of the MRI-based models (white circles) and look similarly distributed. Below 800 Hz, the human bandwidths increase towards lower frequencies. This phenomenon, which was also found for human bandwidth data by Fujimura and Lindqvist (1971), can be attributed to the soft walls of the vocal tract (Stevens, 1998). The effect of the soft walls was also recently demonstrated for physical models with hard and soft walls (Birkholz et al., 2022).

FIG. 12.

Scatter plot of the measured resonance bandwidths over resonance frequencies of the physical MRI-based resonators (white circles) compared to measurements from male (black squares) and female (black triangles) subjects. The human data are those provided by Fant (1972). The dashed lines form the bounding trapezoid around the data points of the MRI-based models. The gray curve is the formant frequency-bandwidth relation proposed by Hawks and Miller (1995).

Finally, the gray curve in Fig. 12 is an approximation of the frequency-bandwidth relation proposed by Hawks and Miller (1995). This relation is based on a regression analysis using the human bandwidth data by Fujimura and Lindqvist (1971) and Fant (1972). Obviously, this function cannot account for the increased scattering of bandwidth values towards higher frequencies and seems to overestimate the bandwidths of the MRI-based vocal tract models for frequencies above 3 kHz.

We can conclude that (a) the MRI-based models of the vocal tract with hard walls represent well the losses of the human vocal tract, except for the losses due to soft walls, and (b) the bandwidth is not a simple function of the resonance frequency but also of the vowel (especially for higher frequencies).

The results show that the usual equations for transmission-line simulations of the vocal tract underestimate the viscous, heat-conduction, and radiation losses with both the detailed and the coarse loss models. This underestimation was already suspected by Flanagan (1965), who derived Eqs. (3) and (4) for the viscous and heat-conduction losses based on the simplifying assumptions of a smooth and hard-walled tube. However, previous direct comparisons between losses (in terms of bandwidths) from simulations and measurements are rare, even though multiple studies provide human bandwidth data as a reference (Dunn, 1961; Fant, 1962, 1972; Fujimura and Lindqvist, 1971; Kent and Vorperian, 2018). One exception is the study by Hanna et al. (2016), who found that the simulated resonances of a rigid cylindrical tube had significantly lower bandwidths than those measured on human subjects for the vowel /ɜ:/. They reported that the attenuation coefficient of the complex wave number in the simulations (which accounts for visco-thermal losses) had to be increased by a factor of 5 to achieve the measured bandwidths.

To our knowledge, the present study is the first to make comprehensive measurements of the bandwidths of physical tube models and compare them to simulations. Physical models have the advantage that their shape and boundary conditions (state of the glottis, material of the walls) can be precisely controlled and measurements can be performed very accurately. Therefore, the bandwidth data presented here can serve as a benchmark for future simulation models and are available in the supplementary material.¹

How could the results of this study help to improve TL simulations of the vocal tract? With regard to viscous and heat-conduction losses, we saw that the detailed loss model only slightly underestimated the bandwidths compared to the equivalent physical tube-shaped resonators. Hence, if the TL model is simulated in the frequency domain, a correction factor on the right-hand side of Eqs. (3) and (4) might compensate for the bandwidth differences. In additional simulations we found that a correction factor of 1.51 applied to both R_i and G_i (i.e., when both quantities are increased by 51%) minimizes the average bandwidth differences between the simulations and the tube-shaped physical models in the 1 kHz bands of Fig. 10. However, the coarse loss model for time-domain simulations underestimates the bandwidths much more, because the viscous loss has been replaced by the (frequency-independent) flow resistance of Eq. (6), and the heat-conduction loss has been omitted completely. Here, a correction factor for Eq. (6) could certainly reduce the overall bandwidth differences, but the frequency-independence of this equation would then likely overestimate the losses at low frequencies, and underestimate the losses at higher frequencies. An alternative (with the same problem) would be to model the viscous resistance with Eq. (3) using a fixed frequency, e.g., 1000 Hz, as proposed by Titze et al. (2014) and Wakita and Fant (1978). Another approach could be to model the frequency-dependence of the viscous resistance based on a discrete representation of the velocity profile in the boundary layer of the tube sections, as proposed by Birkholz and Jackèl (2004).
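The drawback of a frequency-independent viscous resistance can be made concrete with a small sketch. It assumes that the resistance of Eq. (3) grows with the square root of frequency (the equation itself is not reproduced in this section), so that a resistance frozen at a reference frequency differs from the frequency-dependent value by a factor of sqrt(f_ref / f). The function name and the example frequencies are illustrative.

```python
import math


def constant_to_true_ratio(freq_hz, reference_freq_hz=1000.0):
    """Ratio of a viscous resistance frozen at reference_freq_hz to a
    resistance that grows with sqrt(f) (assumed behavior of Eq. (3))."""
    return math.sqrt(reference_freq_hz / freq_hz)


for f in (250.0, 1000.0, 4000.0):
    print(f, round(constant_to_true_ratio(f), 2))
# 250.0 2.0
# 1000.0 1.0
# 4000.0 0.5
```

Under this assumption, a resistance matched at 1000 Hz would be twice too large at 250 Hz and half the frequency-dependent value at 4000 Hz, which illustrates the overestimation at low frequencies and underestimation at high frequencies mentioned above.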

With regard to the radiation loss, the results indicate that the radiation model based on a piston in a plane baffle underestimates the loss compared to physical models with more realistic lip shapes. To compensate for these differences in the simulation, a “correction” could be applied to the radiating area A_rad in Eqs. (8) and (9). However, the exact nature of the correction is a topic for future study.

Finally, future work should further investigate the impact of bandwidths on the perception of simulated speech. Early studies on this topic indicated that bandwidths affect the perception of vowels relatively little. For example, Carlson et al. (1979) found that variations as large as 40% in B_R1 and B_R2 yield little change in subjects' subjective judgments on a psychophysical scale. However, the effect of bandwidths on the perception of continuous speech remains to be explored.

This research was partly supported by grant BI 1639/7-1 by the German Research Foundation (DFG).

¹ See supplemental material at https://doi.org/10.1121/10.0019682 for the 3D-printable resonator models, their transfer functions, and MATLAB scripts for the simulations.

1. Aarts, R. M., and Janssen, A. J. (2003). “Approximation of the Struve function H1 occurring in impedance calculations,” J. Acoust. Soc. Am. 113(5), 2635–2637.
2. Arnela, M., Dabbaghchian, S., Guasch, O., and Engwall, O. (2019). “MRI-based vocal tract representations for the three-dimensional finite element synthesis of diphthongs,” IEEE/ACM Trans. Audio Speech Lang. Process. 27(12), 2173–2182.
3. Badin, P., and Fant, G. (1984). “Notes on vocal tract computation,” STL-QPSR 25(2–3), 53–108.
4. Birkholz, P. (2005). 3D-Artikulatorische Sprachsynthese (Logos Verlag, Berlin).
5. Birkholz, P. (2019). “MeasureTransferFunction [software],” https://www.vocaltractlab.de/index.php?page=measuretransferfunction-download (Last viewed June 6, 2023).
6. Birkholz, P., and Drechsel, S. (2021). “Effects of the piriform fossae, transvelar acoustic coupling, and laryngeal wall vibration on the naturalness of articulatory speech synthesis,” Speech Commun. 132, 96–105.
7. Birkholz, P., Gabriel, F., Kürbis, S., and Echternach, M. (2019). “How the peak glottal area affects linear predictive coding-based formant estimates of vowels,” J. Acoust. Soc. Am. 146(1), 223–232.
8. Birkholz, P., Häsner, P., and Kürbis, S. (2022). “Acoustic comparison of physical vocal tract models with hard and soft walls,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2022), May 22–27, Singapore, pp. 8242–8246.
9. Birkholz, P., and Jackèl, D. (2004). “Boundary-layer resistance in time-domain simulations of the vocal tract system,” in Proceedings of the 12th European Signal Processing Conference (EUSIPCO-2004), September 6–10, Vienna, Austria, pp. 999–1002.
10. Birkholz, P., Kürbis, S., Stone, S., Häsner, P., Blandin, R., and Fleischer, M. (2020). “Printable 3D vocal tract shapes from MRI data and their acoustic and aerodynamic properties,” Sci. Data 7(1), 1–16.
11. Birkholz, P., and Venus, E. (2018). “Considering lip geometry in one-dimensional tube models of the vocal tract,” in Proceedings of Studies on Speech Production: 11th International Seminar on Speech Production (ISSP 2017), October 16–19, Tianjin, China, pp. 78–86.
12. Blandin, R., Arnela, M., Félix, S., Doc, J.-B., and Birkholz, P. (2022). “Efficient 3D acoustic simulation of the vocal tract by combining the multimodal method and finite elements,” IEEE Access 10, 69922–69938.
13. Buick, J. M., Atig, M., Skulina, D., Campbell, D., Dalmont, J., and Gilbert, J. (2011). “Investigation of non-linear acoustic losses at the open end of a tube,” J. Acoust. Soc. Am. 129(3), 1261–1272.
14. Carlson, R., Granström, B., and Klatt, D. (1979). “Vowel perception: The relative perceptual salience of selected acoustic manipulations,” Speech Transmission Laboratories (Stockholm) Quarterly Progress Report 34, pp. 77–83.
15. Dunn, H. (1961). “Methods of measuring vowel formant bandwidths,” J. Acoust. Soc. Am. 33(12), 1737–1746.
16. Elie, B., and Laprie, Y. (2016). “Extension of the single-matrix formulation of the vocal tract: Consideration of bilateral channels and connection of self-oscillating models of the vocal folds with a glottal chink,” Speech Commun. 82, 85–96.
17. Fant, G. (1962). “Formant bandwidth data,” STL-QPSR 3, 1–3.
18. Fant, G. (1972). “Vocal tract wall effects, losses, and resonance bandwidths,” STL-QPSR 3(1), 28–52.
19. Farina, A. (2000). “Simultaneous measurement of impulse response and distortion with a swept-sine technique,” in Proceedings of the 108th Convention of the Audio Engineering Society, February 19–22, Paris, France.
20. Flanagan, J. L. (1965). Speech Analysis, Synthesis and Perception (Springer-Verlag, Berlin).
21. Flanagan, J. L., Ishizaka, K., and Shipley, K. L. (1975). “Synthesis of speech from a dynamic model of the vocal cords and vocal tract,” Bell Syst. Tech. J. 54(3), 485–506.
22. Fleischer, M., Mainka, A., Kürbis, S., and Birkholz, P. (2018). “How to precisely measure the volume velocity transfer function of physical vocal tract models by external excitation,” PLoS One 13(3), e0193708.
23. Fujimura, O., and Lindqvist, J. (1971). “Sweep-tone measurements of vocal-tract characteristics,” J. Acoust. Soc. Am. 49(2), 541–557.
24. Hanna, N., Smith, J., and Wolfe, J. (2016). “Frequencies, bandwidths and magnitudes of vocal tract and surrounding tissue resonances, measured through the lips during phonation,” J. Acoust. Soc. Am. 139(5), 2924–2936.
25. Hawks, J. W., and Miller, J. D. (1995). “A formant bandwidth estimation procedure for vowel synthesis [43.72.Ja],” J. Acoust. Soc. Am. 97(2), 1343–1344.
26. Kelly, J. L., and Lochbaum, C. C. (1962). “Speech synthesis,” in Proceedings of the Fourth International Congress on Acoustics, August 21–28, Copenhagen, Denmark, pp. 1–4.
27. Kent, R. D., and Vorperian, H. K. (2018). “Static measurements of vowel formant frequencies and bandwidths: A review,” J. Commun. Disorders 74, 74–97.
28. Liljencrants, J. (1985). “Speech synthesis with a reflection-type line analog,” Ph.D. thesis, Royal Institute of Technology, Stockholm, Sweden.
29. Lindblom, B., Sundberg, J., Branderud, P., and Djamshidpey, H. (2007). “On the acoustics of spread lips,” Proc. Fonetik TMH-QPSR 50(1), 13–16.
30. Maeda, S. (1982). “A digital simulation method of the vocal-tract system,” Speech Commun. 1, 199–229.
31. Speed, M., Murphy, D., and Howard, D. (2013). “Modeling the vocal tract transfer function using a 3D digital waveguide mesh,” IEEE/ACM Trans. Audio Speech Lang. Process. 22(2), 453–464.
32. Stevens, K. N. (1998). Acoustic Phonetics (The MIT Press, Cambridge, MA).
33. Takemoto, H., Mokhtari, P., and Kitamura, T. (2010). “Acoustic analysis of the vocal tract during vowel production by finite-difference time-domain method,” J. Acoust. Soc. Am. 128(6), 3724–3738.
34. Titze, I. R., Palaparthi, A., and Smith, S. L. (2014). “Benchmarks for time-domain simulation of sound propagation in soft-walled airways: Steady configurations,” J. Acoust. Soc. Am. 136(6), 3249–3261.
35. Vampola, T., Horáček, J., and Švec, J. G. (2008). “FE modeling of human vocal tract acoustics. Part I: Production of Czech vowels,” Acta Acust. united Ac. 94(3), 433–447.
36. Wakita, H., and Fant, G. (1978). “Toward a better vocal tract model,” STL-QPSR 1, 9–29.
37. Zhang, Z., and Espy-Wilson, C. Y. (2004). “A vocal-tract model of American English /l/,” J. Acoust. Soc. Am. 115(3), 1274–1280.
Published open access through an agreement with TU Dresden
