A data-driven approach using artificial neural networks is proposed to address the classic inverse area function problem, i.e., to determine the vocal tract geometry (modelled as a tube of nonuniform cylindrical cross-sections) from the vocal tract acoustic impedance spectrum. The predicted cylindrical radii and the actual radii were found to have high correlation in the three- and four-cylinder model (Pearson coefficient (ρ) and Lin concordance coefficient (ρc) exceeded 95%); however, for the six-cylinder model, the correlation was low (ρ around 75% and ρc around 69%). Upon standardizing the impedance value, the correlation improved significantly for all cases (ρ and ρc exceeded 90%).
1. Introduction
Understanding how the shape of the vocal tract (its articulation) determines the speech sound radiated from the lips, given an acoustic current produced at the glottis, is fundamental to vocal tract acoustics and voice science. While the acoustic modelling of the vocal tract from a known cross-sectional area function to determine its input acoustic impedance is now well established, the reverse problem of elucidating geometric information from acoustic data remains challenging. Here, we explore a machine learning approach to address this challenge, motivated by recent results using neural networks to estimate other vocal tract properties (Gómez et al., 2018; Ibarra et al., 2021; Zhang, 2020).
The vocal tract can be modelled as a “time-varying” tube of nonuniform cross section (Rabiner and Schafer, 1978). The direct methods to determine the shape of the vocal tract include X-ray radiography, computed tomography, and magnetic resonance imaging (MRI) (Baer et al., 1991; Kim et al., 2013; Soquet et al., 2002). Fant (1970) and Perkell (1969) presented direct estimates of vocal tract cross-sectional areas obtained with X-ray techniques. These direct techniques are unsatisfactory because they are highly invasive (endoscopic study: requires anaesthetics, and the probe has to go past the velum), expensive and possibly harmful (X-ray fluoroscopy), poorly resolved or difficult to interpret (ultrasound imaging), or unnatural (MRI: the subject lies supine, measurements are often long, and the scanner is very noisy). Sondhi and Gopinath (1971) described an approach to infer the vocal tract shape from acoustic measurements by exciting the vocal tract with an external source. Wakita (1973) presented a method for estimating the vocal tract area function by inverse filtering the acoustic speech waveform. Indirect measurement methods for estimating the vocal tract area function involve transforming mel-frequency cepstrum coefficients (MFCC) (Dusan and Deng, 1998), linear predictive coding (LPC) (Rossiter et al., 1994; Wankhede and Shah, 2013), and methods such as the particle swarm optimization (PSO) algorithm (Ismail, 2008). Most studies have concentrated on using the audio output (voice) to elucidate the vocal tract geometry. However, another approach is to exploit the vocal tract's acoustic impedance measured directly at the lips (Garnier et al., 2010; Hanna et al., 2016; Henrich et al., 2007; Joliveau et al., 2004), which yields highly resolved frequency information in a non-invasive manner. More recently, Rodriguez et al. (2018) approximated the vocal tract area function from the acoustic impedance measured at the lips by representing the tract using one to ten cylindrical segments and iteratively minimizing magnitude and phase errors. Where the area function has a gentle slope, they reported moderately good agreement with the target area function on the scale of about a centimetre.
Artificial neural networks (ANNs) are known to unravel non-linear relationships between the factors of a system using a data-driven method (Basheer and Hajmeer, 2000; Hassoun et al., 1995). This technique does not require any a priori knowledge about the possible relationship between the system's parameters, thus enabling researchers to solve ill-posed inverse problems (Adler and Öktem, 2017; McCann et al., 2017). Further, ANNs follow a non-parametric approach that is free of any assumptions and can adapt to new data. These capabilities have helped solve problems in health care (Dybowski and Gant, 2001; Lisboa, 2002; Prato and Zanni, 2008), security (Teo et al., 2021; Wang et al., 2010), fluid mechanics (Lakkam et al., 2019), and other fields. Zhang (2020) applied neural networks to simulation data to estimate vocal fold properties (geometry, stiffness, and position) and the subglottal pressure from the produced acoustics; Gómez et al. (2018) reported a method to estimate the subglottal pressure using a recurrent neural network; Ibarra et al. (2021) described a framework using a physiologically relevant model of voice production and machine learning tools to determine subglottal pressure, vocal fold collision pressure, and laryngeal muscle activation. Motivated by these capabilities, a data-driven approach is proposed here to solve the inverse area function problem.
2. Background
2.1 Modelling the vocal tract
For a finite pipe of length L extending from x = 0 to x = L and terminated with a load impedance ZL, ZL is the complex ratio of pressure to flow at x = L (Fletcher and Rossing, 2012), given by

$$Z_L = \frac{p(L,t)}{U(L,t)}, \tag{1}$$

where p(L, t) is the pressure in the pipe resulting from the superposition of two waves with amplitudes A and B moving to the right and left, respectively. At any point x, the pressure is therefore given by

$$p(x,t) = \left[A e^{-jkx} + B e^{jkx}\right] e^{j\omega t}. \tag{2}$$

U(L, t) is the acoustic volume flow, i.e., the acoustic particle velocity (again the superposition of the two aforementioned waves) multiplied by the pipe cross section S. At any point x, it is given by

$$U(x,t) = \frac{S}{\rho c}\left[A e^{-jkx} - B e^{jkx}\right] e^{j\omega t}. \tag{3}$$

Starting with a load impedance ZL, the impedance (Z) at the opposite end of the finite cylindrical pipe section can be calculated from Eqs. (1)–(3), which simplify to (Fletcher and Rossing, 2012)

$$Z = Z_0 \left[\frac{Z_L \cos(kL) + j Z_0 \sin(kL)}{j Z_L \sin(kL) + Z_0 \cos(kL)}\right], \tag{4}$$

where Z0 is the characteristic impedance of the pipe given by $Z_0 = \rho c / S$, ρ is the density of air, c is the speed of sound, and S is the cross-sectional area of the pipe. k is the complex wave number and is given by

$$k = \frac{\omega}{c} - j\alpha, \tag{5}$$

where $\omega = 2\pi f$ is the angular frequency and α is the attenuation coefficient that accounts for losses at the tube walls due to viscous drag and thermal conduction.
Equation (4) can be applied to a theoretical pipe 170 mm long with a 13-mm radius, approximating the vocal tract configuration with a closed glottis phonating the neutral vowel ə (ə as in the word “herd”), as shown in Fig. 1. This condition produces resonances at about 500, 1500, 2500, and 3500 Hz, presented as maxima in the plot. In the corresponding phase plots, these resonance frequencies are indicated by zero crossings and sharp negative slopes.
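As a rough numerical check of the resonances quoted above, the following Python sketch evaluates the input impedance of a single, ideally open pipe (ZL = 0). It assumes the standard transmission-line form of Eq. (4) from Fletcher and Rossing (2012) and a complex wave number k = ω/c − jα. The air density (1.2 kg/m³), the wall-loss approximation α ≈ 3×10⁻⁵ √f / a, and all function names are illustrative assumptions rather than values or code taken from this study.

```python
import cmath
import math

RHO = 1.2    # air density in kg/m^3 (assumed value)
C = 340.0    # speed of sound in m/s

def attenuation(freq_hz, radius_m):
    """Assumed wall-loss approximation: alpha ~ 3e-5 * sqrt(f) / a, in 1/m."""
    return 3e-5 * math.sqrt(freq_hz) / radius_m

def pipe_input_impedance(z_load, freq_hz, length_m, radius_m):
    """Impedance at the far end of a cylinder terminated by z_load (assumed Eq. 4)."""
    z0 = RHO * C / (math.pi * radius_m ** 2)          # characteristic impedance
    k = 2 * math.pi * freq_hz / C - 1j * attenuation(freq_hz, radius_m)
    kl = k * length_m
    num = z_load * cmath.cos(kl) + 1j * z0 * cmath.sin(kl)
    den = 1j * z_load * cmath.sin(kl) + z0 * cmath.cos(kl)
    return z0 * num / den

# Frequency grid: 1 Hz up to 3996 Hz in 5-Hz steps.
freqs = list(range(1, 4001, 5))
mags = [abs(pipe_input_impedance(0.0, f, 0.170, 0.013)) for f in freqs]

# Local maxima of |Z| mark the resonances seen from the closed (glottis) end.
peaks = [freqs[i] for i in range(1, len(mags) - 1)
         if mags[i - 1] < mags[i] > mags[i + 1]]
print(peaks)
```

Under these assumptions the maxima fall on the grid points nearest to odd multiples of c/(4L) = 500 Hz, which is the quarter-wavelength behaviour expected of a pipe closed at one end and open at the other.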
3. Vocal tract modelling
Four different combinations of cylindrical pipes were considered in modelling the vocal tract. In this investigation, we refer to the first 100 mm of the vocal tract starting from the lips as the oral region, and to the remaining 70 mm as the pharyngeal region. We started with a simple approximation using three cylindrical pipes and increased the complexity by adding more cylindrical pipe sections. The radii combinations for the various cylindrical pipe sections were chosen following Story et al. (1996) and Rodriguez et al. (2018).
3.1 Three-cylinder model
Here, the overall vocal tract is divided into three sections: two sections for the oral region (towards the lips) and one section for the pharyngeal region (towards the glottis). The first section of the oral region is 40 mm long, and the second section is 60 mm long. The pharyngeal region is considered as a single section having a length of 70 mm. The radii of the first two sections (i.e., a1 and b1) vary from 2 to 25 mm and from 2 to 20 mm, respectively, in steps of 1 mm. The radius of the final section (i.e., c1) varies from 5 to 12 mm, also in steps of 1 mm. These sections are referred to as S1, S2, and S3 in Fig. 2(a) (see Table 1 for a summary).
Table 1. Radius ranges and incremental steps used for each cylinder model.

| Simulation | Radius symbol | Range of radius value (mm) | Incremental step (mm) |
|---|---|---|---|
| Three cylinders | a1 | 2–25 | 1 |
| | b1 | 2–20 | 1 |
| | c1 | 5–12 | 1 |
| Four cylinders | a2 | 2–25 | 1 |
| | | 2–25 | 1 |
| | b2 | 2–20 | 1 |
| | c2 | 5–12 | 1 |
| Five cylinders | a3 | 2–24.5 | 1.5 |
| | | 2–24.5 | 1.5 |
| | b3 | 2–20 | 1.5 |
| | | 2–20 | 1.5 |
| | c3 | 5–11 | 1.5 |
| Six cylinders | a4 | 2–24 | 2 |
| | | 2–24 | 2 |
| | b4 | 2–20 | 2 |
| | | 2–20 | 2 |
| | c4 | 5–11 | 2 |
| | | 5–11 | 2 |
3.2 Four-cylinder model
The oral region of the vocal tract is divided into three sections, and the pharyngeal region is a single section. The first two sections of the oral region are each 20 mm long, whereas the last section is 60 mm long. The pharyngeal region was kept the same as in the three-cylinder approximation [see Fig. 2(b)]. The radii of these sections (a2 and the radius of the second 20-mm oral section, b2, and c2) are varied in the same way as a1, b1, and c1 (see Table 1).
3.3 Five-cylinder model
The vocal tract was further subdivided in the five-cylinder approximation, with four sections in the oral region and one section in the pharyngeal region. The first two sections of the oral region are each 20 mm long, and the last two are each 30 mm long. The pharyngeal region was treated as one section having a length of 70 mm [see Fig. 2(c)]. The radii of these sections are varied across the same nominal range as in the earlier three- and four-cylinder approximations, but with a slightly lower resolution (i.e., with a larger step size of 1.5 mm compared to 1 mm). This adjustment was made to reduce the number of simulated cases (keeping the original resolution resulted in a large feature matrix, making it harder to process).
3.4 Six-cylinder model
This approximation is similar to that of five cylinders; however, the pharyngeal region was divided into two sections of lengths 30 and 40 mm. The radii used for this approximation varied across the same nominal range as for five cylinders, but the incremental step was further increased from 1.5 to 2 mm, reducing the resolution again.
It is worth noting that in Rodriguez et al. (2018), although increasing the number of cylinders (from one to ten) generally reduces error, only marginal gain is seen above six, and we chose to limit our investigation to six cylinders.
4. Experimental methodology
4.1 Acoustic impedance feature
For simplicity, when introducing open and closed pipes, Fletcher and Rossing (2012) considered an ideally open pipe with ZL = 0 [Eq. (8.25)]. While not exactly physically realizable, this simple condition nevertheless offers a good approximation of how vocal tract geometry determines vocal tract impedance when the vocal tract is considered as a series of simple cylinders of varying radii. (In practice, at the open lips, ZL = ZRadiation; however, ZRadiation depends further on geometric factors associated with the face acting as a baffle and on the effective radius of the open lips, which would complicate our exploratory model unnecessarily and detract from understanding the effect of vocal tract geometry directly.)
Starting with ZL at the open lips for the first cylinder, we apply Eq. (4) sequentially for each subsequent (“upstream”) cylinder of varying radii (representing various segments of the vocal tract) to determine the final input impedance seen at the glottis. For this investigation, the speed of sound is assumed to be 340 m/s, the density of air is taken from Fletcher and Rossing (2012), and the impedance is evaluated at frequencies from 1 to 4000 Hz with 5-Hz resolution.
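The sequential use of Eq. (4) described above can be sketched as follows. This is a minimal illustration, not the authors' simulation code: it assumes the standard transmission-line form of Eq. (4), an air density of 1.2 kg/m³, a wall-loss term α ≈ 3×10⁻⁵ √f / a, and hypothetical function names and example radii.

```python
import cmath
import math

RHO = 1.2    # air density in kg/m^3 (assumed value)
C = 340.0    # speed of sound in m/s, as stated in the text

def section_impedance(z_load, freq_hz, length_m, radius_m):
    """Impedance seen through one cylinder terminated by z_load (assumed Eq. 4)."""
    z0 = RHO * C / (math.pi * radius_m ** 2)
    alpha = 3e-5 * math.sqrt(freq_hz) / radius_m      # assumed wall-loss model
    kl = (2 * math.pi * freq_hz / C - 1j * alpha) * length_m
    num = z_load * cmath.cos(kl) + 1j * z0 * cmath.sin(kl)
    den = 1j * z_load * cmath.sin(kl) + z0 * cmath.cos(kl)
    return z0 * num / den

def glottis_impedance(freq_hz, radii_m, lengths_m, z_lips=0.0):
    """Chain the sections from the lips (first entry) to the glottis (last entry)."""
    z = z_lips
    for radius, length in zip(radii_m, lengths_m):
        z = section_impedance(z, freq_hz, length, radius)
    return z

def impedance_spectrum(radii_m, lengths_m):
    """Complex input impedance on the 1 Hz to 3996 Hz grid (5-Hz steps)."""
    return [glottis_impedance(f, radii_m, lengths_m) for f in range(1, 4001, 5)]

# Three-cylinder layout: 40 mm and 60 mm oral sections, 70 mm pharyngeal section.
lengths = [0.040, 0.060, 0.070]
radii = [0.010, 0.015, 0.008]     # hypothetical radii inside the stated ranges
spectrum = impedance_spectrum(radii, lengths)
print(len(spectrum))
```

A useful sanity check of such a chain is that three sections of equal radius reproduce the impedance of one uniform 170-mm pipe, since the output of each section simply becomes the load of the next.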
Figures 3(a) and 3(b) show an example vocal tract modelling using a three-cylinder model and the corresponding acoustic impedance spectrum [note that impedance values (dB) are normalized to a maximum value of 1], respectively. The radii chosen for cylindrical sections are , , and mm, respectively. Another cylindrical approximation example (i.e., using six cylinders) and its corresponding acoustic impedance spectrum is shown in Figs. 3(c) and 3(d), respectively. These maximum-normalized impedance vectors along with the radii values are used to train the neural network to solve this inverse problem.
4.2 Neural network architecture
To estimate the radii of the vocal tract sections, with the tract approximated as a combination of cylindrical pipes, a deep neural network was implemented using the Keras machine learning library (Chollet et al., 2015; Lakkam et al., 2019). A six-layer deep neural network was used. Each hidden layer has 100 neurons, except the last one, which has 16 neurons. The number of neurons in the output layer changes with the cylinder model (from 3 to 6), and this final layer outputs the estimated (regressed) radii. Each hidden layer uses a rectified linear unit as its activation function, and the output layer uses a linear activation. The objective of the model is to minimize the squared loss function (the number of epochs chosen was 200). The “Adam” optimization method, which calculates individual adaptive learning rates for different parameters based on estimates of the first and second moments of the gradients during the backpropagation stage of learning, was used to stochastically optimize the weights of the network. These hyperparameter choices, when compared with other choices, yielded the best result when validated against a subset of the various input combinations.
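To make the layer layout concrete, the sketch below builds an untrained fully connected stack in plain Python and runs one forward pass. It is not the Keras implementation used in the study: the split of the six layers into five hidden layers (100, 100, 100, 100, 16) plus one output layer is an assumption, as are the 800-point input (one value per frequency on the 5-Hz grid), the random initial weights, and all function names. Training with Adam for 200 epochs is not reproduced here.

```python
import random

def build_layers(n_inputs, n_outputs, hidden=(100, 100, 100, 100, 16), seed=0):
    """Random weights for a dense stack; the hidden sizes are an assumed layout."""
    rng = random.Random(seed)
    sizes = [n_inputs, *hidden, n_outputs]
    layers = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.gauss(0.0, fan_in ** -0.5) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
    return layers

def forward(layers, features):
    """ReLU on every hidden layer, linear activation on the output layer."""
    activations = list(features)
    for index, (weights, biases) in enumerate(layers):
        pre = [sum(w * a for w, a in zip(row, activations)) + b
               for row, b in zip(weights, biases)]
        is_output = index == len(layers) - 1
        activations = pre if is_output else [max(0.0, v) for v in pre]
    return activations

def squared_loss(predicted, target):
    """Mean squared error between predicted and actual radii."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

n_freqs = len(range(1, 4001, 5))                    # 800 impedance samples
layers = build_layers(n_freqs, n_outputs=3)         # three-cylinder model
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
radii_estimate = forward(layers, [0.5] * n_freqs)   # untrained, values are meaningless
print(len(radii_estimate), n_params)
```

With this assumed layout the first layer dominates the parameter count, because it connects all 800 frequency samples to 100 neurons.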
4.3 Test–train split-up
The number of radii combinations (see Table 1) varies with the cylinder model. For example, in the three-cylinder model, a1 can take 24 distinct values, b1 can take 19 values, and c1 can take 8 values. Altogether, this creates 24 × 19 × 8 = 3648 combinations of radii and their corresponding impedance vectors. Similarly, for the four-cylinder model, there are more than 80 000 combinations in total. For the five- and six-cylinder approximations, even though the number of possible combinations exceeds 200 000, we chose to work on 100 000 randomly selected combinations and their impedances. The network was trained on 70% of the combinations generated and tested on the remaining 30%. The impedance corresponding to a particular combination of radii was in either the training set or the test set but not both.
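The combination counts and the 70/30 split described above can be reproduced with a short sketch. The radius ranges come from Table 1; the helper names, the random seed, and the shuffling strategy are assumptions made for illustration.

```python
import itertools
import random

def radius_grid(start, stop, step):
    """Inclusive grid of radii in mm, e.g. 2, 3, ..., 25."""
    count = int(round((stop - start) / step)) + 1
    return [start + i * step for i in range(count)]

a_values = radius_grid(2, 25, 1)    # 24 values
b_values = radius_grid(2, 20, 1)    # 19 values
c_values = radius_grid(5, 12, 1)    # 8 values

# Three-cylinder model: every (a1, b1, c1) combination.
three_cyl = list(itertools.product(a_values, b_values, c_values))

# Four-cylinder model: one extra oral radius sharing the 2-25 mm range.
four_cyl_count = len(a_values) ** 2 * len(b_values) * len(c_values)

def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle once, then cut, so no combination lands in both sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(three_cyl)
print(len(three_cyl), four_cyl_count, len(train), len(test))
```

Splitting the list of radii combinations, rather than individual impedance samples, is what keeps each impedance vector in exactly one of the two sets.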
4.4 Model accuracy metric
The prediction accuracy of the model was assessed using Pearson's correlation coefficient (ρ) and Lin's concordance coefficient (ρc). Both ρ and ρc take values between −1 and +1. ρ measures how closely the predictions follow the actual values in a linear fashion, where the correlation line (y = mx + c) can be situated anywhere [any slope (m) or intercept (c)], whereas ρc is a measure of reliability, and perfect reliability is reflected as ρc = 1. This implies that when actual values are plotted against predicted values, the latter follow the former and all the plotted data points lie on the line of unit slope and zero intercept (y = x) (Jinyuan et al., 2016; Lawrence and Lin, 1989). This “lying on a line” characteristic is particularly important for this investigation, since the geometric values chosen for the cylinder radii are quite small and analysing ρ alone could overstate model accuracy when the predictions do not match the actual values.
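The difference between the two metrics can be illustrated with a small sketch. It uses the usual definitions, with Lin's coefficient computed as 2·cov / (var_actual + var_predicted + (mean_actual − mean_predicted)²) from population (1/n) moments; the function name and the toy data are hypothetical.

```python
import math

def pearson_and_ccc(actual, predicted):
    """Return (Pearson rho, Lin concordance rho_c) using population moments."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    var_a = sum((a - mean_a) ** 2 for a in actual) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum((a - mean_a) * (p - mean_p)
              for a, p in zip(actual, predicted)) / n
    pearson = cov / math.sqrt(var_a * var_p)
    ccc = 2 * cov / (var_a + var_p + (mean_a - mean_p) ** 2)
    return pearson, ccc

actual = [2.0, 3.0, 4.0, 5.0, 6.0]          # toy radii in mm
scaled = [2 * a + 1 for a in actual]        # perfectly linear, but not y = x

print(pearson_and_ccc(actual, actual))      # predictions equal to the targets
print(pearson_and_ccc(actual, scaled))      # same rho, much lower rho_c
```

For the scaled predictions, ρ stays at 1 because the points still lie on a straight line, while ρc drops to about 0.23 because that line is not y = x.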
5. Results
The predicted radii for the various cylinder combinations were compared against the actual radii, and the resulting ρ and ρc are shown in Table 2. First, ρc was never higher than ρ, which is expected (Estrada et al., 2015). The ρ and ρc were found to be high for all of the selected radii combinations in the three- and four-cylinder approximations; however, for the five- and six-cylinder approximations, the correlation decreased. This drop in ρ and ρc is most pronounced for the a radii (the radii corresponding to the first section in the oral region, closest to the lips), and this could be attributed to the finer resolution in geometry, especially the shorter length of these oral sections. As mentioned in Sec. 4.4, because we are dealing with incremental values of radii, it is important to verify whether the predicted values clearly follow the actual values. The resulting graphs relating actual and predicted values, shown as bin scatterplots of a1, b2, c3, and a4, are presented in Figs. 4(a), 4(b), 4(c), and 4(d), respectively.
Table 2. Pearson (ρ) and Lin concordance (ρc) coefficients between actual and predicted radii, without standardization.

| Simulation | Radius symbol | ρ (%) | ρc (%) |
|---|---|---|---|
| Three cylinders | a1 | 98.2 | 96.6 |
| | b1 | 99.2 | 99.2 |
| | c1 | 97.9 | 97.4 |
| Four cylinders | a2 | 95.1 | 94.9 |
| | | 99.5 | 99.4 |
| | b2 | 99.8 | 98.6 |
| | c2 | 99.3 | 99.3 |
| Five cylinders | a3 | 87.7 | 86.6 |
| | | 92.1 | 91.7 |
| | b3 | 98.7 | 98.6 |
| | | 99.8 | 99.8 |
| | c3 | 99.2 | 97.6 |
| Six cylinders | a4 | 73.1 | 69.4 |
| | | 87.0 | 86.2 |
| | b4 | 98.1 | 98.1 |
| | | 99.9 | 99.9 |
| | c4 | 99.5 | 99.5 |
| | | 98.6 | 97.6 |
Looking at the correlation plots (Fig. 4), it is clear that the variance in prediction is high when a higher number of cylinders is used to approximate the vocal tract [see the bin counts for a4 shown in Fig. 4(d)]. The high spread in a4 could be attributed to the coarser increment (2 mm) used when defining the possible values of the a4 radius, although this claim requires further investigation.
5.1 Impact of standardization of impedance on prediction accuracy
Another experiment was undertaken to investigate whether whitening (Z-score scaling) of the input features improves the prediction accuracy of the segment radii. This is achieved by transforming the max-normalized impedance values to have zero mean and unit standard deviation. The mean and standard deviation of the training data were estimated, and these statistics were used to scale both the training and test data. The deep neural network was retrained using this new dataset, and the results are shown in Table 3. They suggest that the standardization procedure has a positive impact on the prediction accuracy, and this is most evident in the six-cylinder approximation, for which the prediction accuracy of a4 improved by almost 25% (ρc rose from 69.4% to 92.7%). The resulting bin scatterplots relating actual and predicted values of a1, b2, c3, and a4 with standardization in place are shown in Figs. 5(a), 5(b), 5(c), and 5(d), respectively. The variance in the predicted radii is reduced drastically, especially for c3 (most predicted values overlap); for a4 some variation remains, but less than in the case with no standardization [see bin counts in Fig. 5(d)].
Table 3. Pearson (ρ) and Lin concordance (ρc) coefficients between actual and predicted radii, with standardized impedance.

| Simulation | Radius symbol | ρ (%) | ρc (%) |
|---|---|---|---|
| Three cylinders | a1 | 99.8 | 99.8 |
| | b1 | 99.9 | 99.9 |
| | c1 | 99.5 | 99.4 |
| Four cylinders | a2 | 98.6 | 98.5 |
| | | 99.9 | 99.9 |
| | b2 | 99.9 | 99.9 |
| | c2 | 99.9 | 99.9 |
| Five cylinders | a3 | 96.9 | 96.7 |
| | | 99.0 | 98.9 |
| | b3 | 99.9 | 99.9 |
| | | 99.9 | 99.9 |
| | c3 | 99.9 | 99.9 |
| Six cylinders | a4 | 92.8 | 92.7 |
| | | 97.3 | 97.1 |
| | b4 | 99.7 | 99.7 |
| | | 99.9 | 99.9 |
| | c4 | 99.9 | 99.9 |
| | | 99.9 | 99.9 |
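The standardization step in Sec. 5.1 can be sketched as follows. The sketch assumes that the mean and standard deviation are computed separately for each frequency bin (the text does not say whether the statistics are per feature or global), and the function names and toy feature values are hypothetical.

```python
import math

def fit_standardizer(train_rows):
    """Per-feature mean and standard deviation, estimated on training data only."""
    n = len(train_rows)
    n_features = len(train_rows[0])
    means = [sum(row[j] for row in train_rows) / n for j in range(n_features)]
    stds = []
    for j in range(n_features):
        var = sum((row[j] - means[j]) ** 2 for row in train_rows) / n
        stds.append(math.sqrt(var) or 1.0)   # avoid dividing by zero for constant bins
    return means, stds

def standardize(rows, means, stds):
    """Apply the training statistics to any set of feature vectors."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Toy max-normalized impedance features: three training spectra, one test spectrum.
train = [[0.2, 0.9], [0.4, 0.7], [0.6, 0.5]]
test = [[0.5, 0.6]]

means, stds = fit_standardizer(train)
train_z = standardize(train, means, stds)
test_z = standardize(test, means, stds)     # scaled with the training statistics
print(train_z, test_z)
```

Reusing the training statistics for the test data, as the text describes, keeps information about the test set out of the preprocessing.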
Finally, compared with the predictions of Rodriguez et al. (2018), whose fitting method predicted vocal tract shape to a resolution of about a centimetre, our data-driven approach, with or without standardization in place, gave similar or better predictions for most of the simulated cases; however, the prediction accuracy of the segment radii is slightly compromised when more segments are included (for example, a4).
6. Conclusion and discussion
A data-driven approach using ANNs to solve the inverse area function problem was proposed to derive the non-linear relationship between the vocal tract impedance and the corresponding vocal tract geometry. A deep neural network was trained using acoustic impedance spectra, and the predicted radii, associated with the vocal tract geometry approximated using cylindrical tubes, were found to be highly correlated with the actual radii, showing reasonable agreement.
Although not strictly physiological, our proof-of-concept approach now complements earlier studies using similar neural network techniques to resolve inverse problems associated with voice mechanics, such as vocal fold properties (Zhang, 2020), subglottal pressure, and other physiological control (Gómez et al., 2018; Ibarra et al., 2021). Such a systematic approach, when integrated in the future, has the potential to resolve questions of vocal tract geometry and voice mechanics associated with different voice conditions during speech and singing and thereby offer a potential diagnostic tool for applications in speech pathology, voice therapy, and language training in a natural, ecological context.