This article describes modal analysis of acoustic waves in the human vocal tract while the subject is pronouncing $[\xf8\u02d0]$. The model used is the wave equation in three dimensions, together with physically relevant boundary conditions. The geometry is reconstructed from anatomical MRI data obtained by other researchers. The computations are carried out using the finite element method. The model is validated by comparing the computed modes with measured data.

## I. Introduction

The purpose of this article is to study vowel production by the wave equation with boundary conditions as specified by Eq. (2). This model constitutes the input part of a (*scattering*) *conservative linear dynamical system* as presented by, e.g., Malinen *et al.* (2006); Malinen and Staffans (2006, 2007). A preliminary version of the present work was presented at the Phonetics Symposium 2006 (Hannukainen *et al.*, 2006).

In the past, the vocal tract (VT) acoustics has been modeled in a number of different ways. Electrical transmission lines have been used for a long time (see, e.g., Dunn, 1950). The celebrated Kelly–Lochbaum model makes use of reflection coefficients obtained from a variable diameter tube (Kelly and Lochbaum, 1962). Such reflection coefficients appear in, e.g., models from geophysics and in interpolation theory (see Foias and Frazho, 1990). We remark that the Kelly–Lochbaum model is closely related to the horn model described by the Webster equation (see Chiba and Kajiyama, 1958; Fant, 1970). All these models have produced very accurate simulation results with a relatively light computational load, and they have applications, e.g., in mobile phones. More advanced two- and three-dimensional descendants of the Kelly–Lochbaum model are the transmission line networks that have been developed by El Masri *et al.* (1996, 1998); Mullen *et al.* (2006). For a recent review and further references, see Palo (2006).

Equation (2) in an anatomically realistic geometry has a more direct basis in physics than any of the approaches discussed in the previous paragraph. This is particularly useful in some applications, for example, in modeling the effects of anatomical abnormalities and maxillofacial surgery on speech (Dedouch *et al.*, 2002a; Nishimoto *et al.*, 2004; Švancara and Horáček, 2006). As solving Eq. (2) analytically is possible only in a radically simplified geometry (see Sondhi, 1986), we solve the problem numerically by the *finite element method* (FEM). This is the approach used by, e.g., Lu *et al.* (1993), Suzuki *et al.* (1993), Kawanishi *et al.* (1996), Niikawa *et al.* (2002), Dedouch *et al.* (2002b), Sasaki *et al.* (2003), and Švancara *et al.* (2004), too. Unfortunately, heavy computations are involved in this method.

We present a modal analysis of an anatomical configuration of $[\xf8\u02d0]$ as produced by a native Swedish speaker. We obtain resonance frequencies computationally, which correspond to formants. Unlike the scattering transfer function estimation used by Nishimoto *et al.* (2004) and Sasaki *et al.* (2003), our method does not necessarily require taking into account the radiation impedance at the mouth. Our approach is more closely related to Dedouch *et al.* (2002b) but instead of Neumann boundary condition on the glottis, we use a reflection-free boundary condition slightly above the glottis [see the last lines of Eqs. (2) and (4)]. Using reflection-free boundary conditions Eq. (3), our Eq. (2) can be coupled to a glottis model in a physically realistic manner. Our results indicate that the computationally obtained formants identify the vowel $[\xf8\u02d0]$ correctly in a larger set of measured data.

For numerical computations, a detailed geometric description of the VT is necessary. Nowadays, accurate anatomical data can be obtained using magnetic resonance imaging (MRI). We are indebted to Dr. Olov Engwall (KTH) for kindly providing us with the required data.

## II. Acoustic model

Deriving the wave equation for sound pressure starts by assuming that the total pressure $p=p(r,t)$ and the density $\rho =\rho (r,t)$ can be expressed as

respectively, where $p0$ and $\rho 0$ are independent of time $t$ and space variable $r$. For linearization of the equations, it is assumed that $p\u2032=p\u2032(r,t)\u2aa1p0$ and $\rho \u2032=\rho \u2032(r,t)\u2aa1\rho 0$ are small perturbations at point $r=(x,y,z)\u220a\Omega $ at time $t$. Here $\Omega \u2282R3$ denotes the interior of the VT with boundary $\u2202\Omega =\Gamma 1\u222a\Gamma 2\u222a\Gamma 3$, where $\Gamma 1$ is the mouth opening, $\Gamma 2$ denotes the walls of the VT, and $\Gamma 3$ is a virtual boundary control surface a small distance above the glottis.

By $v=v(r,t)$ denote the velocity field of the flow described by $p$ and $\rho $. A velocity potential $\Phi =\Phi (r,t)$ is any function that satisfies $v=\u2212\u2207\Phi $. With this notation, our acoustic model is given by

To derive Eq. (2) from “first principles,” one needs to assume that an isentropic thermodynamic equation of state for pressure $p=p(s,\rho )$ holds where $s$, $\rho $ are the entropy and density, respectively. Then we define the sound speed $c$ by linearizing the equation of state $p\u2032=p(s,\rho 0+\rho \u2032)\u2212p(s,\rho 0)\u2248c2\rho \u2032$ where $p0=p(s,\rho 0)$ and $c2=(\u2202p\u2215\u2202\rho )(s,\rho 0)$. In this approximation, the entropy $s$ is kept constant since the associated thermodynamic process is assumed to be reversible. In the case of monatomic ideal gas, we have $p\u2215\rho \gamma =p0\u2215\rho 0\gamma $ and $c2=\gamma p0\u2215\rho 0$ where $\gamma =5\u22153$ is the adiabatic constant.

Now the wave equation $\Phi tt=c2\Delta \Phi $ can be derived by a linearization argument involving the continuity equation, Euler equation and linearized equation of state $p\u2032=c2\rho \u2032$. Having computed $\Phi $, we obtain the perturbation pressure from $p\u2032=\rho 0\Phi t$. All this can be found, e.g., in Fetter and Walecka (1980, Chap. 9).

Equation (2) is sophisticated enough to capture many relevant properties of wave propagation in three-dimensional geometry (e.g., to detect cross modes). It can also be used as the theoretical starting point in deriving the Webster equation mentioned earlier. However, it does not take into account turbulence, shock formation, or losses due to viscosity, heat conduction, or boundary dissipation on $\Gamma 2$.

We also need to take into account the walls and both ends of the VT. The last line in Eq. (2) specifies the required boundary conditions. We regard the mouth as an open end of an acoustic tube, and this is described by the Dirichlet condition $\Phi (r,t)=0$. More complicated models for the mouth opening or the surrounding acoustic space have been considered by Kawanishi *et al.* (1996) (an impedance model involving Bessel functions), Nishimoto *et al.* (2004) (an impedance model consisting of a small reflecting hemisphere), and Švancara *et al.* (2004) (an exterior model of two concentric spheres with an absorbing outer boundary).

On the walls of the VT, we use the same Neumann condition $(\u2202\Phi \u2215\u2202\nu )(r,t)=0$ as one would use at the closed end of a resonating tube. These two boundary conditions are discussed by Fetter and Walecka (1980, pp. 306, 307).

At the glottis end, we use a scattering boundary condition that specifies the ingoing sound energy wave. For motivation, we define the ingoing wave $u(r,t)$ and the outgoing wave $y(r,t)$ for $r\u220a\Gamma 3$ by

The first of these equations coincides with the third boundary condition in Eq. (2). The net power absorbed by the interior domain $\Omega $ through the control/observation boundary at time $t$ satisfies

where $je=\u2212\rho 0\Phi t\u2207\Phi =p\u2032v$ is the energy-flux vector as introduced in Fetter and Walecka (1980, 307 pp.).

Instead of solving Eq. (2), we solve an easier—yet relevant—problem related to Eq. (2). More precisely, we determine the resonance frequencies corresponding to a particular vowel articulation position. By Malinen and Staffans (2006, Theorem 2.3), the resonances of Eq. (2) can be solved by finding the discrete, complex frequencies $\lambda $ and the corresponding nonzero eigenfunctions $\Phi \lambda (r)$ such that

## III. Finite element modeling

The variational formulation of Eq. (4) (with $p\lambda $ in place of $\Phi \lambda $) is

where $\varphi $ is an arbitrary test function in Sobolev space $H\Gamma 11(\Omega )={f\u220aH1(\Omega ):f(r)=0forr\u220a\Gamma 1}$. The FEM can be used to approximately solve Eq. (5); see, e.g., Johnson (1987) for an elementary treatment. We use piecewise linear shape functions and a tetrahedral mesh of $n=64254$ elements which gives sufficiently accurate results. We obtain three $n\xd7n$ matrices, namely the stiffness matrix $K$, the mass matrix $M$, and $P$ representing the glottis boundary condition in Eq. (4).

When treating Eq. (5) we proceed to solve the following linear algebra problem: Find all complex numbers $\lambda $ and corresponding nonzero vectors $x(\lambda )$ such that

where $A=[\u2212cPI\u2212c2M0]$, $B=[K00I]$, and $y(\lambda )=[\lambda x(\lambda )x(\lambda )]$ (Saad, 1992). The numbers $\lambda $ are good approximations of the $\lambda $’s appearing in Eq. (4), provided that the number $n$ of elements is high enough. The lowest formants $F1,F2,\u2026$, correspond to the numbers $\lambda $ in the order of increasing imaginary part.

## IV. Data

Figure 1 in Hannukainen *et al.* (2006) shows a sliced representation of the VT geometry that we have used as the basis of our analysis. There are 29 slices, each consisting of 51 points, and they define the VT from glottis to mouth. For faster computation, the slices were downsampled by taking into account only every fourth point.

The raw MRI data was collected from a native male speaker of Swedish while he pronounced a prolonged vowel $[\xf8\u02d0]$ in supine position. Engwall and Badin (1999) describe the MR imaging procedure and image postprocessing. Corresponding formant measurement data are also available in the same article. The formants were estimated from speech recorded on a different occasion but with the same subject in a similar supine condition.

## V. Results and conclusions

The latter form of Eq. (6) was solved in MATLAB environment, and the formants F1 to F4 that we obtained are shown in Table I. These computed formants are roughly $312$ semitones too high compared to the measured values, and we will discuss the physical background of this discrepancy in the following. The bottom row in Table I shows the computed formants multiplied by 0.817, which corresponds to a difference of $312$ semitones.

Computed, measured, and scaled formants for $[\xf8\u02d0]$ in kilohertz. . | ||||
---|---|---|---|---|

. | F1 . | F2 . | F3 . | F4 . |

Computed | 0.68 | 1.35 | 2.71 | 3.79 |

Measured | 0.50 | 1.06 | 2.48 | 3.24 |

Scaled | 0.56 | 1.11 | 2.22 | 3.10 |

Computed, measured, and scaled formants for $[\xf8\u02d0]$ in kilohertz. . | ||||
---|---|---|---|---|

. | F1 . | F2 . | F3 . | F4 . |

Computed | 0.68 | 1.35 | 2.71 | 3.79 |

Measured | 0.50 | 1.06 | 2.48 | 3.24 |

Scaled | 0.56 | 1.11 | 2.22 | 3.10 |

We also obtained the resonance modes $p\lambda $—see Eq. (4)—corresponding to the formants F1–F4. They are computed as linear combinations of the element basis functions, using the components of $x(\lambda )$ as weights. We note that the perturbation pressures $p\lambda $ are not given here in any physically relevant scale but they have been normalized so that the maximum deviation from the static pressure $p0$ is either 1 or $\u22121$. Figure 1 shows isobars for the fourth mode. Figure 2 shows the pressure distributions of the modes. Figures 1 and 2 are plotted along a cross-sagittal midline cut (see Fig. 1, Hannukainen *et al.*, 2006). We remark that Fig. 1 supports the hypothesis that a weak cross-mode resonance related to F4 should appear in the oral cavity.

The vowels from Engwall and Badin (1999: Table 4), together with the scaled and computed $[\xf8\u02d0]s,c$ from Table I, are plotted in the (F2, F1) plane in Fig. 3. Clearly, $[\xf8\u02d0]s,c$ is closer to measured $[\xf8\u02d0]$ than to any other measured vowel, *except* possibly $[\u0251\u02d0]$. To further clarify the situation, let us consider the formants F1 to F4 for $[\xf8\u02d0]s,c$, $[\xf8\u02d0]$, and $[\u0251\u02d0]$ as vectors: $[\xf8\u02d0]s,c=(0.56,1.11,2.22,3.10)$, $[\xf8\u02d0]=(0.5,1.06,2.48,3.24)$, and $[\u0251\u02d0]=(0.56,0.94,2.74,3.24)$. Then the Euclidean distance between $[\xf8\u02d0]s,c$ and $[\xf8\u02d0]$ is 0.31, but the distance between $[\xf8\u02d0]s,c$ and $[\u0251\u02d0]$ is significantly larger, equaling 0.57. This difference is explained by F3, since the fourth formants are almost the same. We conclude that the *first two* formants classify the scaled, computed vowel $[\xf8\u02d0]s,c$ almost correctly. Moreover, if we look at *all four* available formants, even the remaining ambiguity disappears.

As we pointed out earlier, the computed formants F1–F4 differ from the corresponding measured formants by $312$ semitones. Having said that, the *ratios* between the computed formants and the measured formants match each other very well. There is a simple physical explanation why such a discrepancy is to be expected. In Eq. (2), we use the Dirichlet boundary condition on the lip opening. This results in a vibrational node at the opening. In reality, such a node would appear further away outside the mouth. In that sense, the real life VT is effectively longer than the one described by Eq. (2), resulting in lower formants. To get rid of this artifact, we should also model the surrounding acoustic space.

Surrounding acoustic space has been modeled by a lumped impedance for a transmission line (Laine, 1982), by using a “small space” model with impedance termination on the outer shell (Nishimoto *et al.*, 2004), and by using a “large space” model with an absorbing outer boundary (Švancara *et al.*, 2004). The first two of these approaches include a tuning parameter to be determined experimentally so that the measured and computed formants coincide. We remark that impedance termination for the wave equation is inherently more difficult than for the transmission line, since the termination must be of boundary control instead just of point control type.

## Acknowledgments

We would like to thank Olov Engwall from KTH, Stockholm, for providing the articulation geometry for this study. We would also like to thank an anonymous reviewer for valuable comments. A. H. has been supported by the Academy of Finland.

## REFERENCES AND LINKS

**EA96-12**, pp.

**EA93-8**, pp.