A data-driven approach using artificial neural networks is proposed to address the classic inverse area function problem, i.e., to determine the vocal tract geometry (modelled as a tube of nonuniform cylindrical cross-sections) from the vocal tract acoustic impedance spectrum. The predicted cylindrical radii and the actual radii were found to have high correlation in the three- and four-cylinder model (Pearson coefficient (ρ) and Lin concordance coefficient (ρc) exceeded 95%); however, for the six-cylinder model, the correlation was low (ρ around 75% and ρc around 69%). Upon standardizing the impedance value, the correlation improved significantly for all cases (ρ and ρc exceeded 90%).

Understanding how vocal tract shape (its articulation) determines the output speech sound (from the lips) originating from an acoustic current (produced at the glottis) is fundamental to vocal tract acoustics and voice science. While the acoustic modelling of the vocal tract from a known cross-sectional area function to determine its input acoustic impedance is now well established, the reverse, elucidating geometric information from acoustic data, remains challenging. Here, we explore a machine learning approach to address this challenge, motivated by recent results using neural networks to estimate other vocal tract properties (Gómez et al., 2018; Ibarra et al., 2021; Zhang, 2020).

The vocal tract can be modelled as a “time-varying” tube of nonuniform cross section (Rabiner and Schafer, 1978). The direct methods to determine the shape of the vocal tract include X-ray radiography, computed tomography, and magnetic resonance imaging (MRI) (Baer et al., 1991; Kim et al., 2013; Soquet et al., 2002). Fant (1970) and Perkell (1969) presented direct estimates of vocal tract cross-sectional areas obtained with X-ray techniques. These direct techniques are unsatisfactory because they are either highly invasive (endoscopic study: requires anaesthetics, and the probe has to go past the velum), expensive and possibly harmful (X-ray fluoroscopy), poorly resolved or difficult to interpret (ultrasound imaging), or unnatural (MRI: the subject lies supine, measurements are often long, and the scanner is very noisy). Sondhi and Gopinath (1971) described an approach to infer the vocal tract shape from acoustic measurements by exciting the vocal tract with an external source. Wakita (1973) presented a method for estimating the vocal tract area function by inverse filtering the acoustic speech waveform. Indirect methods for estimating the vocal tract area function involve transforming mel-frequency cepstrum coefficients (MFCC) (Dusan and Deng, 1998), linear predictive coding (LPC) (Rossiter et al., 1994; Wankhede and Shah, 2013), and optimization methods such as the particle swarm optimization (PSO) algorithm (Ismail, 2008). Most studies have concentrated on using the audio output (voice) to elucidate the vocal tract geometry. However, another approach is to exploit the vocal tract's acoustic impedance measured directly at the lips (Garnier et al., 2010; Hanna et al., 2016; Henrich et al., 2007; Joliveau et al., 2004), which yields highly resolved frequency information in a non-invasive manner. More recently, Rodriguez et al. (2018) approximated the vocal tract area function from the acoustic impedance measured at the lips by representing the tract using one to ten cylindrical segments and iteratively minimizing magnitude and phase errors. Where the area function has a gentle slope, they reported moderately good agreement with the target area function on the scale of about a centimetre.

Artificial neural networks (ANNs) are known to unravel non-linear relationships between the factors of a system using a data-driven method (Basheer and Hajmeer, 2000; Hassoun, 1995). This technique does not require any a priori knowledge about the possible relationship between the system's parameters, thus enabling researchers to solve ill-posed inverse problems (Adler and Öktem, 2017; McCann et al., 2017). Further, ANNs follow a non-parametric approach that makes few assumptions about the underlying relationship and can adapt to new data. These capabilities have led to solutions to problems in health care (Dybowski and Gant, 2001; Lisboa, 2002; Prato and Zanni, 2008), security (Teo et al., 2021; Wang et al., 2010), and fluid mechanics (Lakkam et al., 2019), among other fields. Zhang (2020) applied neural networks to simulation data to estimate vocal fold properties such as geometry, stiffness, and position, as well as the subglottal pressure, from the produced acoustics; Gómez et al. (2018) reported a method to estimate the subglottal pressure using a recurrent neural network; Ibarra et al. (2021) described a framework using a physiologically relevant model of voice production and machine learning tools to determine subglottal pressure, vocal fold collision pressure, and laryngeal muscle activation. Motivated by these capabilities, a data-driven approach is proposed here to solve the inverse area function problem.

For a finite pipe of length L extending from x = 0 to x = L and terminated with a load impedance ZL, ZL is the complex ratio of pressure to flow at x = L (Fletcher and Rossing, 2012), given by

$$Z_L = \frac{p(L,t)}{U(L,t)}, \tag{1}$$

where p(L, t) is the pressure in the pipe resulting from the superposition of two waves with amplitudes A and B moving to the right and left, respectively. At any point x, the pressure is therefore given by

$$p(x,t) = \left[A e^{-jkx} + B e^{jkx}\right] e^{j\omega t}. \tag{2}$$

U(L, t) is the acoustic volume flow, i.e., the acoustic particle velocity (again a superposition of the two aforementioned waves) multiplied by the pipe cross-sectional area S; at any point x it is given by

$$U(x,t) = \frac{S}{\rho c}\left[A e^{-jkx} - B e^{jkx}\right] e^{j\omega t}. \tag{3}$$

Starting with a load impedance ZL, the impedance (Z) at the opposite end of the finite cylindrical pipe section can be calculated by combining Eqs. (1)–(3) and simplifying (Fletcher and Rossing, 2012):

$$Z = Z_0\left[\frac{Z_L\cos(kL) + jZ_0\sin(kL)}{jZ_L\sin(kL) + Z_0\cos(kL)}\right], \tag{4}$$

where Z0 is the characteristic impedance of the pipe given by ρc/S, ρ is the density of air, c is the speed of sound, and S is the cross-sectional area of the pipe. k is the complex wave number, given by

$$k = \frac{\omega}{c} - j\alpha, \tag{5}$$

where ω = 2πf is the angular frequency and α is the attenuation coefficient that accounts for losses at the tube walls due to viscous drag and thermal conduction.

Equation (4) can be applied to a theoretical pipe 170 mm long with a 13-mm radius, approximating the vocal tract configuration with a closed glottis phonating the neutral vowel ə (ə as in the word “herd”), as shown in Fig. 1. This condition produces resonances at about 500, 1500, 2500, and 3500 Hz, presented as maxima in the |ZL| plot. In the corresponding phase plots, these resonance frequencies are indicated by zero crossings and sharp negative slopes.
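As a rough numerical check of this example, the sketch below evaluates the input impedance of a single 170-mm pipe of 13-mm radius with ZL = 0 and lists the frequencies at which the magnitude peaks. It assumes that Eq. (4) takes the standard transmission-line form Z = Z0 (ZL cos kL + j Z0 sin kL) / (j ZL sin kL + Z0 cos kL) with k = ω/c − jα. The wall-loss expression in `wall_loss` is a common textbook approximation and is an assumption here, since the text does not state the value of α that was used; the function names are illustrative only.

```python
import cmath
import math

RHO = 1.2    # density of air in kg/m^3 (value stated in the text)
C = 340.0    # speed of sound in m/s (value stated in the text)


def wall_loss(freq_hz, radius_m):
    """ASSUMPTION: textbook estimate alpha ~ 3e-5 * sqrt(f) / a, in 1/m."""
    return 3e-5 * math.sqrt(freq_hz) / radius_m


def input_impedance(z_load, freq_hz, length_m, radius_m):
    """Impedance at the far end of one cylinder terminated by z_load (Eq. 4)."""
    z0 = RHO * C / (math.pi * radius_m ** 2)       # characteristic impedance
    k = 2 * math.pi * freq_hz / C - 1j * wall_loss(freq_hz, radius_m)
    kl = k * length_m
    return z0 * (z_load * cmath.cos(kl) + 1j * z0 * cmath.sin(kl)) / (
        1j * z_load * cmath.sin(kl) + z0 * cmath.cos(kl))


# Frequency grid from 1 Hz to just below 4000 Hz in 5-Hz steps
freqs = list(range(1, 4001, 5))
mags = [abs(input_impedance(0.0, f, 0.170, 0.013)) for f in freqs]

# Local maxima of |Z| mark the resonances of the closed-open pipe
peaks = [freqs[i] for i in range(1, len(freqs) - 1)
         if mags[i - 1] < mags[i] > mags[i + 1]]
print(peaks)
```

Under these assumptions the script reports four maxima, each within one grid step of 500, 1500, 2500, and 3500 Hz, i.e., the odd multiples of c/(4L).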

Fig. 1.

Acoustic impedance spectrum corresponding to a cylindrical pipe 170 mm long and 13 mm in radius, approximating the vocal tract configuration with a closed glottis phonating the neutral vowel ə.


Four different combinations of cylindrical pipes were considered in modelling the vocal tract. In this investigation, we refer to the first 100 mm of the vocal tract, starting from the lips, as the oral region, and to the remaining 70 mm as the pharyngeal region. We started with a simple approximation using three cylindrical pipes and increased the complexity by adding more cylindrical pipe sections. The radius combinations for the various cylindrical pipe sections were chosen following Story et al. (1996) and Rodriguez et al. (2018).

In the three-cylinder approximation, the overall vocal tract is divided into three sections: two sections for the oral region (towards the lips) and one section for the pharyngeal region (towards the glottis). The first section of the oral region is 40 mm long, and the second section is 60 mm long. The pharyngeal region is treated as a single section with a length of 70 mm. The radii of the first two sections (i.e., a1 and b1) vary from 2 to 25 mm and from 2 to 20 mm, respectively, in steps of 1 mm. The radius of the final section (i.e., c1) varies from 5 to 12 mm, also in steps of 1 mm. These sections are referred to as S1, S2, and S3 in Fig. 2(a) (see Table 1 for a summary).

Fig. 2.

Vocal tract modelled using cylinders. (a) Three-cylinder model, (b) four-cylinder model, (c) five-cylinder model, and (d) six-cylinder model.

Table 1. Specifications for various cylinder models.

| Simulation | Radius symbol | Range of radius value (mm) | Incremental step (mm) |
|---|---|---|---|
| Three cylinders | a1 | 2–25 | 1 |
| | b1 | 2–20 | 1 |
| | c1 | 5–12 | 1 |
| Four cylinders | a2 | 2–25 | 1 |
| | a2 | 2–25 | 1 |
| | b2 | 2–20 | 1 |
| | c2 | 5–12 | 1 |
| Five cylinders | a3 | 2–24.5 | 1.5 |
| | a3 | 2–24.5 | 1.5 |
| | b3 | 2–20 | 1.5 |
| | b3 | 2–20 | 1.5 |
| | c3 | 5–11 | 1.5 |
| Six cylinders | a4 | 2–24 | 2 |
| | a4 | 2–24 | 2 |
| | b4 | 2–20 | 2 |
| | b4 | 2–20 | 2 |
| | c4 | 5–11 | 2 |
| | c4 | 5–11 | 2 |

In the four-cylinder approximation, the oral region of the vocal tract is divided into three sections, and the pharyngeal region is a single section. The first two sections of the oral region are 20 mm long, whereas the last section is 60 mm long. The pharyngeal region was kept the same as in the three-cylinder approximation [see S1, S2, and S3 in Fig. 2(b)]. The radii of these sections (i.e., a2, a2, b2, and c2) are varied in the same way as a1, b1, and c1 (see Table 1).

The vocal tract was further subdivided in the five-cylinder approximation, with four sections in the oral region and one section in the pharyngeal region. The first two sections of the oral region have a length of 20 mm, and the last two are 30 mm long. The pharyngeal region was treated as one section having a length of 70 mm [see S1, S2, and S3 in Fig. 2(c)]. The radii of these sections are varied across the same nominal range as that of the earlier three- and four-cylinder approximations, but with a slightly lower resolution (i.e., with a slightly larger step size of 1.5 mm compared to 1 mm). This adjustment was made to reduce the number of simulated cases (keeping the original resolution resulted in a large feature matrix, making it harder to process).

The six-cylinder approximation is similar to that of five cylinders; however, the pharyngeal region was divided into two sections of lengths 30 and 40 mm. The radii used for this approximation varied across the same nominal range as that of five cylinders, but the resolution was reduced further by increasing the incremental step from 1.5 to 2 mm.

It is worth noting that in Rodriguez et al. (2018), although increasing the number of cylinders (from one to ten) generally reduces error, only marginal gain is seen above six, and we chose to limit our investigation to six cylinders.

For simplicity, when introducing open and closed pipes, Fletcher and Rossing (2012) considered an ideally open pipe with ZL = 0 [Eq. (8.25)]. While not exactly physically realizable, this simple condition nevertheless offers a good approximation of how vocal tract geometry determines vocal tract impedance when the vocal tract is considered as a series of simple cylinders of varying radii. (In practice, at the open lips, ZL = ZRadiation; however, ZRadiation depends further on geometric factors associated with the face acting as a baffle and on the effective radius of the open lips, which would complicate our exploratory model unnecessarily and detract from understanding the effect of vocal tract geometry directly.)

Starting with ZL at the open lips for the first cylinder, we apply Eq. (4) sequentially for each subsequent (“upstream”) cylinder of varying radius (representing the various segments of the vocal tract) to determine the final input impedance seen at the glottis. For this investigation, the speed of sound is assumed to be c = 340 m/s and the density of air ρ = 1.2 kg/m3 (Fletcher and Rossing, 2012), and we consider frequencies from 1 to 4000 Hz with 5-Hz resolution.

Figures 3(a) and 3(b) show an example vocal tract modelling using a three-cylinder model and the corresponding acoustic impedance spectrum [note that impedance values (dB) are normalized to a maximum value of 1], respectively. The radii chosen for cylindrical sections are a1=4, b1=3, and c1=6 mm, respectively. Another cylindrical approximation example (i.e., using six cylinders) and its corresponding acoustic impedance spectrum is shown in Figs. 3(c) and 3(d), respectively. These maximum-normalized impedance vectors along with the radii values are used to train the neural network to solve this inverse problem.
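The sequential use of Eq. (4) can be sketched as follows for the three-cylinder example (a1 = 4, b1 = 3, and c1 = 6 mm, with section lengths of 40, 60, and 70 mm). This is a minimal sketch rather than the authors' implementation: it assumes the standard transmission-line form of Eq. (4), an assumed textbook wall-loss term, ZL = 0 at the lips, and normalization of the linear magnitude (the text does not say whether the normalization was applied before or after conversion to dB). The function names are hypothetical.

```python
import cmath
import math

RHO, C = 1.2, 340.0   # air density (kg/m^3) and speed of sound (m/s) from the text


def cylinder_impedance(z_load, freq_hz, length_m, radius_m):
    """Eq. (4) for one cylinder; the wall-loss formula is an ASSUMPTION."""
    z0 = RHO * C / (math.pi * radius_m ** 2)
    alpha = 3e-5 * math.sqrt(freq_hz) / radius_m   # assumed loss model, not from the text
    kl = (2 * math.pi * freq_hz / C - 1j * alpha) * length_m
    return z0 * (z_load * cmath.cos(kl) + 1j * z0 * cmath.sin(kl)) / (
        1j * z_load * cmath.sin(kl) + z0 * cmath.cos(kl))


def tract_spectrum(sections, freqs):
    """sections: (length_m, radius_m) pairs ordered from the lips to the glottis."""
    mags = []
    for f in freqs:
        z = 0.0                          # ideally open lips, Z_L = 0
        for length_m, radius_m in sections:
            # The output of one cylinder becomes the load of the next one upstream
            z = cylinder_impedance(z, f, length_m, radius_m)
        mags.append(abs(z))
    peak = max(mags)
    return [m / peak for m in mags]      # max-normalized magnitude spectrum


freqs = list(range(1, 4001, 5))
three_cyl = [(0.040, 0.004), (0.060, 0.003), (0.070, 0.006)]
spectrum = tract_spectrum(three_cyl, freqs)
print(len(spectrum), max(spectrum))
```

A convenient sanity check on such a chain is that splitting a uniform pipe into several sections of equal radius leaves the spectrum unchanged, because Eq. (4) composes exactly when Z0 and k are the same in every section.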

Fig. 3.

Vocal tract modelling and the corresponding impedance plot. (a) and (c) show an example of three- and six-cylinder approximation. (b) and (d) show the corresponding impedance magnitude spectra.


To estimate the radii of vocal tract sections approximated as a combination of cylindrical pipes, a deep neural network was implemented using the Keras machine learning library (Chollet et al., 2015; Lakkam et al., 2019). A six-layer deep neural network was used. Each hidden layer has 100 neurons, except the last one, which has 16 neurons. The number of neurons in the output layer matches the number of cylinders in the model (from three to six), and this final layer outputs the estimated (regressed) radii. Each hidden layer uses a rectified linear unit as its activation function, and the output layer uses a linear activation. The objective of the model is to minimize the squared loss function over 200 epochs. The “Adam” optimization method, which calculates individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients during the backpropagation stage of learning, was used to stochastically optimize the weights of the neurons in the hidden layers. These hyperparameter choices, when compared with other choices, yielded the best result when validated against a subset of the various input combinations.
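One possible reading of this architecture is sketched below with the Keras API. Several details are assumptions rather than facts from the text: “six-layer” is interpreted as five hidden layers plus the output layer, the input is taken to be the 800-point magnitude spectrum implied by a 1 to 4000 Hz grid with 5-Hz spacing, and Keras is imported through TensorFlow. The helper name `build_radius_regressor` is hypothetical.

```python
from tensorflow import keras


def build_radius_regressor(n_freq_bins, n_cylinders):
    """Fully connected regressor: impedance spectrum in, cylinder radii out.

    ASSUMPTION: "six-layer" is read as five hidden layers plus one output layer.
    """
    model = keras.Sequential([
        keras.Input(shape=(n_freq_bins,)),
        keras.layers.Dense(100, activation="relu"),
        keras.layers.Dense(100, activation="relu"),
        keras.layers.Dense(100, activation="relu"),
        keras.layers.Dense(100, activation="relu"),
        keras.layers.Dense(16, activation="relu"),             # last hidden layer
        keras.layers.Dense(n_cylinders, activation="linear"),  # regressed radii
    ])
    # Squared-error loss minimized with the Adam optimizer, as described in the text
    model.compile(optimizer="adam", loss="mse")
    return model


model = build_radius_regressor(n_freq_bins=800, n_cylinders=3)
# Training would then look like this (x_train and y_train are placeholders):
# model.fit(x_train, y_train, epochs=200)
```

The output width is the only part that changes between the three- to six-cylinder models; the hidden layers stay the same.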

The number of radius combinations (see Table 1) varies with the cylinder model. For example, in the three-cylinder model, a1 can take 24 distinct values, b1 can take 19 values, and c1 can take 8 values. Altogether, this gives 24 × 19 × 8 = 3648 combinations of radii and their corresponding impedance vectors. Similarly, for the four-cylinder model, there are more than 80 000 combinations. For the five- and six-cylinder approximations, even though the number of possible combinations exceeds 200 000, we chose to work on 100 000 randomly selected combinations and their impedances. The network was trained on 70% of the combinations generated and tested on the remaining 30%. The impedance corresponding to a particular combination of radii was in either the training set or the test set but not both.
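The combination counts quoted above follow directly from the radius ranges and step sizes. The sketch below recomputes them and illustrates a disjoint 70/30 split over combination indices. The step sizes (1, 1, 1.5, and 2 mm) are taken from the descriptions of the four models; the random seed and the use of index shuffling are illustrative assumptions, not the authors' procedure.

```python
import random


def n_values(lo, hi, step):
    """Number of grid values from lo to hi inclusive."""
    return int(round((hi - lo) / step)) + 1


# (low, high, step) in mm for every section of each model
grids = {
    3: [(2, 25, 1), (2, 20, 1), (5, 12, 1)],
    4: [(2, 25, 1)] * 2 + [(2, 20, 1), (5, 12, 1)],
    5: [(2, 24.5, 1.5)] * 2 + [(2, 20, 1.5)] * 2 + [(5, 11, 1.5)],
    6: [(2, 24, 2)] * 2 + [(2, 20, 2)] * 2 + [(5, 11, 2)] * 2,
}

counts = {}
for n_cyl, sections in grids.items():
    total = 1
    for lo, hi, step in sections:
        total *= n_values(lo, hi, step)
    counts[n_cyl] = total
print(counts)   # {3: 3648, 4: 87552, 5: 216320, 6: 230400}

# Disjoint 70/30 split: each combination index lands in exactly one set
rng = random.Random(0)            # fixed seed chosen only for repeatability
indices = list(range(counts[3]))
rng.shuffle(indices)
cut = int(0.7 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]
```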

The prediction accuracy of the model was assessed using Pearson's correlation coefficient (ρ) and Lin's concordance coefficient (ρc). Both ρ and ρc take values between −1 and +1. ρ measures how closely the predictions follow the actual values along any straight line y = mx + c, whatever its slope (m) or intercept (c), whereas ρc is a measure of reliability, with perfect reliability reflected as ρc = 1. In that case, when actual values are plotted against predicted values, all the plotted data points lie on the line of unit slope and zero intercept (y = x) (Jinyuan et al., 2016; Lawrence and Lin, 1989). This “lying on a line” characteristic is particularly important for this investigation, since the values chosen for the cylinder radii are quite small, and analysing ρ alone could misleadingly suggest high model accuracy even when the predicted values do not agree with the actual values.
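The difference between the two coefficients can be illustrated with a small example. The sketch below uses the usual definitions, ρ = s_xy / (s_x s_y) and ρc = 2 s_xy / (s_x² + s_y² + (x̄ − ȳ)²), with population (1/n) moments; the toy radii and the function name are made up for illustration and are not data from the study.

```python
from statistics import fmean


def pearson_and_ccc(actual, predicted):
    """Return (Pearson rho, Lin's concordance rho_c) for two equal-length lists."""
    mx, my = fmean(actual), fmean(predicted)
    sxx = fmean([(x - mx) ** 2 for x in actual])
    syy = fmean([(y - my) ** 2 for y in predicted])
    sxy = fmean([(x - mx) * (y - my) for x, y in zip(actual, predicted)])
    rho = sxy / (sxx * syy) ** 0.5
    # The (mx - my)**2 term penalizes any offset from the y = x line
    rho_c = 2 * sxy / (sxx + syy + (mx - my) ** 2)
    return rho, rho_c


actual = [2.0, 3.0, 4.0, 5.0, 6.0]       # toy radii in mm
shifted = [r + 2.0 for r in actual]      # perfectly linear but biased by 2 mm
print(pearson_and_ccc(actual, actual))   # (1.0, 1.0)
print(pearson_and_ccc(actual, shifted))  # (1.0, 0.5)
```

A constant 2-mm bias leaves ρ at 1 but halves ρc in this example, which is why both coefficients are reported.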

The predicted radii for various cylinder combinations were compared against the actual radii, and the resulting ρ and ρc are shown in Table 2. First, ρc never exceeded ρ, which is expected (Estrada et al., 2015). Both ρ and ρc were high for all of the radii in the three- and four-cylinder approximations; however, for the five- and six-cylinder approximations, the correlation decreased. This drop in ρ and ρc is most pronounced for a (the radius of the first section in the oral region, closest to the lips), and it could be attributed to the finer resolution in geometry, especially the shorter length of these oral sections. As mentioned in Sec. 4.4, because we are dealing with incremental values of radii, it is important to verify whether the predicted values clearly follow the actual values. Bin scatterplots relating the actual and predicted values of a1, b2, c3, and a4 are presented in Figs. 4(a), 4(b), 4(c), and 4(d), respectively.

Table 2. Resulting accuracy without standardization.

| Simulation | Radius symbol | ρ (%) | ρc (%) |
|---|---|---|---|
| Three cylinders | a1 | 98.2 | 96.6 |
| | b1 | 99.2 | 99.2 |
| | c1 | 97.9 | 97.4 |
| Four cylinders | a2 | 95.1 | 94.9 |
| | a2 | 99.5 | 99.4 |
| | b2 | 99.8 | 98.6 |
| | c2 | 99.3 | 99.3 |
| Five cylinders | a3 | 87.7 | 86.6 |
| | a3 | 92.1 | 91.7 |
| | b3 | 98.7 | 98.6 |
| | b3 | 99.8 | 99.8 |
| | c3 | 99.2 | 97.6 |
| Six cylinders | a4 | 73.1 | 69.4 |
| | a4 | 87.0 | 86.2 |
| | b4 | 98.1 | 98.1 |
| | b4 | 99.9 | 99.9 |
| | c4 | 99.5 | 99.5 |
| | c4 | 98.6 | 97.6 |

Fig. 4.

Bin scatter plots relating actual and predicted values of radii with no standardization: (a) a1, (b) b2, (c) c3, and (d) a4.


Looking at the correlation plots (Fig. 4), it is clear that the variance in prediction is high when a larger number of cylinders is used to approximate the vocal tract [see the bin counts for a4 shown in Fig. 4(d)]. The high spread in a4 could be attributed to the lower-resolution (coarser) increment used when defining the possible values of the a4 radius; this claim requires further investigation.

Another experiment was undertaken to investigate whether standardization (Z-score scaling) of the input features improves the prediction accuracy of the segment radii. This is achieved by rescaling the max-normalized impedance values to have zero mean and unit standard deviation. The mean and standard deviation were estimated from the training data, and these values were used to scale both the training and test data. The deep neural network was retrained using this new dataset, and the results are shown in Table 3. They suggest that the standardization procedure has a positive impact on the prediction accuracy. This is most evident in the six-cylinder approximation, for which the prediction accuracy of a4 improved by roughly 20 percentage points (ρ from 73.1% to 92.8% and ρc from 69.4% to 92.7%). The resulting bin scatterplots relating actual and predicted values of a1, b2, c3, and a4 with standardization in place are shown in Figs. 5(a), 5(b), 5(c), and 5(d), respectively. The variance in the predicted radii is reduced drastically, especially for c3 (most predicted values overlap); for a4 some variation remains, but less than in the case with no standardization [see bin counts in Fig. 5(d)].
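The key point of this procedure, namely that the scaling statistics come from the training data only and are then reused for the test data, can be sketched as follows. Standardizing each frequency bin separately is an assumption (the text does not say whether the statistics were per feature or global), and the tiny arrays and function names are purely illustrative.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_rows):
    """Per-feature mean and standard deviation estimated on the training rows only."""
    columns = list(zip(*train_rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]   # guard against constant features
    return means, stds


def standardize(rows, means, stds):
    """Apply previously fitted statistics to any set of rows."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


train = [[0.10, 1.00], [0.30, 0.80], [0.50, 0.60]]   # toy max-normalized spectra
test = [[0.20, 0.90]]

means, stds = fit_standardizer(train)        # statistics from training data only
train_z = standardize(train, means, stds)
test_z = standardize(test, means, stds)      # test data reuse the training statistics
```

Reusing the training statistics keeps information about the test set out of the preprocessing step.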

Table 3. Resulting accuracy with standardization in place.

| Simulation | Radius symbol | ρ (%) | ρc (%) |
|---|---|---|---|
| Three cylinders | a1 | 99.8 | 99.8 |
| | b1 | 99.9 | 99.9 |
| | c1 | 99.5 | 99.4 |
| Four cylinders | a2 | 98.6 | 98.5 |
| | a2 | 99.9 | 99.9 |
| | b2 | 99.9 | 99.9 |
| | c2 | 99.9 | 99.9 |
| Five cylinders | a3 | 96.9 | 96.7 |
| | a3 | 99.0 | 98.9 |
| | b3 | 99.9 | 99.9 |
| | b3 | 99.9 | 99.9 |
| | c3 | 99.9 | 99.9 |
| Six cylinders | a4 | 92.8 | 92.7 |
| | a4 | 97.3 | 97.1 |
| | b4 | 99.7 | 99.7 |
| | b4 | 99.9 | 99.9 |
| | c4 | 99.9 | 99.9 |
| | c4 | 99.9 | 99.9 |

Fig. 5.

Bin scatter plots relating actual and predicted values of radii with standardization: (a) a1, (b) b2, (c) c3, and (d) a4.


Finally, we compared our predictions with those of Rodriguez et al. (2018), whose fitting method predicted vocal tract shape to a resolution of about a centimetre. The data-driven approach, with or without standardization in place, gave similar or better predictions for most of the simulated cases; however, the prediction accuracy of the segment radii is slightly compromised when more segments are included (for example, a4).

A data-driven approach using ANNs to solve the inverse area function problem was proposed to derive the non-linear relationship between the vocal tract impedance and the corresponding vocal tract geometry. A deep neural network was trained using acoustic impedance spectra, and the predicted radii, associated with the vocal tract geometry approximated using cylindrical tubes, were found to be highly correlated with the actual radii, showing reasonable agreement.

Although not strictly physiological, our proof-of-concept approach now complements earlier studies using similar neural network techniques to resolve inverse problems associated with voice mechanics, such as vocal fold properties (Zhang, 2020), subglottal pressure, and other physiological control (Gómez et al., 2018; Ibarra et al., 2021). Such a systematic approach, when integrated in the future, has the potential to resolve questions of vocal tract geometry and voice mechanics associated with different voice conditions during speech and singing and thereby offer a potential diagnostic tool for applications in speech pathology, voice therapy, and language training in a natural, ecological context.

1. Adler, J., and Öktem, O. (2017). “Solving ill-posed inverse problems using iterative deep neural networks,” Inverse Probl. 33(12), 124007.
2. Baer, T., Gore, J. C., Gracco, L. C., and Nye, P. W. (1991). “Analysis of vocal tract shape and dimensions using magnetic resonance imaging: Vowels,” J. Acoust. Soc. Am. 90(2), 799–828.
3. Basheer, I. A., and Hajmeer, M. (2000). “Artificial neural networks: Fundamentals, computing, design, and application,” J. Microbiol. Methods 43(1), 3–31.
4. Chollet, F. (2015). “Keras,” https://keras.io (Last viewed 23/10/2021).
5. Dusan, S., and Deng, L. (1998). “Recovering vocal tract shapes from MFCC parameters,” in Proceedings of the Fifth International Conference on Spoken Language Processing.
6. Dybowski, R., and Gant, V. (2001). Clinical Applications of Artificial Neural Networks (Cambridge University Press, Cambridge).
7. Estrada, L., Torres, A., Sarlabous, L., and Jané, R. (2015). “Improvement in neural respiratory drive estimation from diaphragm electromyographic signals using fixed sample entropy,” IEEE J. Biomed. Health. Inform. 20(2), 476–485.
8. Fant, G. (1970). Acoustic Theory of Speech Production (Walter de Gruyter, Berlin).
9. Fletcher, N. H., and Rossing, T. D. (2012). The Physics of Musical Instruments (Springer Science & Business Media, Berlin).
10. Garnier, M., Henrich, N., Smith, J., and Wolfe, J. (2010). “Vocal tract adjustments in the high soprano range,” J. Acoust. Soc. Am. 127(6), 3771–3780.
11. Gómez, P., Schützenberger, A., Semmler, M., and Döllinger, M. (2018). “Laryngeal pressure estimation with a recurrent neural network,” IEEE J. Transl. Eng. Health Med. 7, 1–11.
12. Hanna, N., Smith, J., and Wolfe, J. (2016). “Frequencies, bandwidths and magnitudes of vocal tract and surrounding tissue resonances, measured through the lips during phonation,” J. Acoust. Soc. Am. 139(5), 2924–2936.
13. Hassoun, M. H. (1995). Fundamentals of Artificial Neural Networks (MIT Press, Cambridge, MA).
14. Henrich, N., Kiek, M., Smith, J., and Wolfe, J. (2007). “Resonance strategies used in Bulgarian women's singing style: A pilot study,” Logoped. Phoniatr. Vocol. 32(4), 171–177.
15. Ibarra, E. J., Parra, J. A., Alzamendi, G. A., Cortés, J. P., Espinoza, V. M., Mehta, D. D., Hillman, R. E., and Zañartu, M. (2021). “Estimation of subglottal pressure, vocal fold collision pressure, and intrinsic laryngeal muscle activation from neck-surface vibration using a neural network framework and a voice production model,” Front. Physiol. 12, 1419.
16. Ismail, M. A. (2008). “Vocal tract area function estimation using particle swarm,” J. Comput. 3(6), 32–38.
17. Jinyuan, L., Wan, T., Guanqin, C., Yin, L., Changyong, F., and Xin, M. (2016). “Correlation and agreement: Overview and clarification of competing concepts and measures,” Shanghai Arch. Psychiatry 28(2), 115–120.
18. Joliveau, E., Smith, J., and Wolfe, J. (2004). “Tuning of vocal tract resonance by sopranos,” Nature 427(6970), 116.
19. Kim, Y.-C., Kim, J., Proctor, M., Toutios, A., Nayak, K., Lee, S., and Narayanan, S. S. (2013). “Toward automatic vocal tract area function estimation from accelerated three-dimensional magnetic resonance imaging,” in Proceedings of Speech Production in Automatic Speech Recognition, pp. 40–43.
20. Lakkam, S., Balamurali, B. T., and Bouffanais, R. (2019). “Hydrodynamic object identification with artificial neural models,” Sci. Rep. 9(1), 11242.
21. Lawrence, I., and Lin, K. (1989). “A concordance correlation coefficient to evaluate reproducibility,” Biometrics 45, 255–268.
22. Lisboa, P. J. (2002). “A review of evidence of health benefit from artificial neural networks in medical intervention,” Neural Netw. 15(1), 11–39.
23. McCann, M. T., Jin, K. H., and Unser, M. (2017). “Convolutional neural networks for inverse problems in imaging: a review,” IEEE Signal Process. Mag. 34(6), 85–95.
24. Perkell, J. S. (1969). Physiology of Speech Production: Results and Implications of a Quantitative Cineradiographic Study (MIT Press, Cambridge, MA).
25. Prato, M., and Zanni, L. (2008). “Inverse problems in machine learning: An application to brain activity interpretation,” J. Phys. Conf. Ser. 135, 012085.
26. Rabiner, L. R., and Schafer, R. (1978). Digital Processing of Speech Signals (Pearson, New York).
27. Rodriguez, A., Hanna, N., Almeida, A., Smith, J., and Wolfe, J. (2018). “Estimation of vocal tract and trachea area functions from impedance spectra measured through the lips,” in Proceedings of Australasian International Conference on Speech Science and Technology, pp. 77–80.
28. Rossiter, D., Howard, D. M., and Downes, M. (1994). “A real-time LPC-based vocal tract area display for voice development,” J. Voice 8(4), 314–319.
29. Sondhi, M. M., and Gopinath, B. (1971). “Determination of vocal-tract shape from impulse response at the lips,” J. Acoust. Soc. Am. 49(6B), 1867–1873.
30. Soquet, A., Lecuit, V., Metens, T., and Demolin, D. (2002). “Mid-sagittal cut to area function transformations: direct measurements of mid-sagittal distance and area with MRI,” Speech Commun. 36(3–4), 169–180.
31. Story, B. H., Titze, I. R., and Hoffman, E. A. (1996). “Vocal tract area functions from magnetic resonance imaging,” J. Acoust. Soc. Am. 100(1), 537–554.
32. Teo, K. R., BT, B., Zhou, J., and Chen, J.-M. (2021). “Categorizing touch-input locations from touchscreen device interfaces via on-board mechano-acoustic transducers,” Appl. Sci. 11(11), 4834.
33. Wakita, H. (1973). “Direct estimation of the vocal tract shape by inverse filtering of acoustic speech waveforms,” IEEE Trans. Audio Electroacoust. 21(5), 417–427.
34. Wang, G., Hao, J., Ma, J., and Huang, L. (2010). “A new approach to intrusion detection using artificial neural networks and fuzzy clustering,” Expert Syst. Appl. 37(9), 6225–6232.
35. Wankhede, N. S., and Shah, M. S. (2013). “Investigation on optimum parameters for LPC based vocal tract shape estimation,” in Proceedings of 2013 International Conference on Emerging Trends in Communication, Control, Signal Processing and Computing Applications (C2SPCA), pp. 1–6.
36. Zhang, Z. (2020). “Estimation of vocal fold physiology from voice acoustics using machine learning,” J. Acoust. Soc. Am. 147(3), EL264–EL270.