For most of his illustrious career, Ken Stevens focused on examining and documenting the rich detail about vocal tract changes available to listeners in the acoustic signal of speech. Current approaches to speech inversion take advantage of this rich detail to recover information about articulatory movement. Our previous speech inversion work focused on movements of the tongue and lips, for which “ground truth” is readily available. In this study, we describe acquisition and validation of ground-truth articulatory data about velopharyngeal port constriction, using both the well-established technique of nasometry and a novel technique, high-speed nasopharyngoscopy. Nasometry measures the acoustic output of the nasal and oral cavities to derive the measure nasalance. High-speed nasopharyngoscopy captures images of the nasopharyngeal region and can resolve velar motion during speech. By comparing simultaneously collected data from both acquisition modalities, we show that nasalance is a sufficiently sensitive measure to use as ground truth for our speech inversion system. Further, a speech inversion system trained on nasalance can recover known patterns of velopharyngeal port constriction shown by American English speakers. Our findings match well with Stevens's own studies of the acoustics of nasal consonants.

1. Articulation vs acoustics

In speech research, there is a persistent tension between two types of analysis: articulation involves continuous movement, while perception at the phoneme level requires categorical judgment of acoustic cues, or distinctive features. Theoretically, most of the information produced by the interaction of overlapping articulator movements is recoverable from the speech signal. Historically, however, much research on speech has focused on how acoustic features reflect expectations at the phoneme level: in essence, speech understanding considered as the task of detecting a particular combination of characteristics in the acoustic signal in order to declare that a series of particular phonemes has been uttered. This tension between the continuous nature of articulation and the categorical approach to speech was a major focus of Ken Stevens's work, starting from his participation in one of the early foundational x-ray studies of the vocal tract during speech (Stevens and Öhman, 1963; Perkell, 1969), and was the creative impetus for much of his work on acoustics, distinctive feature theory, quantal theory, and landmark-based speech recognition (Stevens, 1989, 2000, 2002). In particular, much of Stevens's work was dedicated to painstaking documentation of the rich information about the vocal tract contained in the speech signal itself (Stevens, 2000). This rich information source, in turn, powers current applications of machine learning to recovering articulatory information, known as speech inversion (Siriwardena and Espy-Wilson, 2023).

However, speech is not produced in a categorical fashion. Instead, it is produced by continuous, coordinated movement of articulators, such as the tongue, velum, and lips, that shape the sound emitted by the larynx, forming alternations of vocal tract constriction (for consonants) and opening (for vowels). The fact that the movements themselves have nonlinear spatial consequences (e.g., partial constriction resulting in frication vs complete closure resulting in silence), and also overlap in time, means that their effects on the acoustic signal can be temporally scattered, compressed, and sometimes obscured by co-occurring events. Consequently, many patterns that may appear chaotically variable when considered as cues in the acoustic signal turn out to be organized and rule-governed when the underlying articulation is considered (Gafos, 1999; Cho and Keating, 2009; Katsika et al., 2014; Krivokapić, 2014).

At the same time, the ability to weave the articulatory movements of different articulators together over time in coherent, consistent patterns is a key element in speech communication. For instance, given the same sequence of phonemes to utter, speakers of different languages are known to use different patterns of articulation, as in the different control of aspiration between English [pʰV] and Spanish [pV] (Abramson and Lisker, 1973). A better scientific description of how these patterns differ across languages can inform our understanding of how speakers control variability in production and tolerate it in perception. More practically, a better understanding of these patterns can help us identify problems with speech intelligibility as well as speech and language disorders.

2. Recovering articulation from acoustics

The best way to track and identify articulatory patterns is by observation of the articulators in motion; however, their direct observation requires expensive specialized equipment and can be invasive. Of the principal techniques available, x-ray imaging is limited by radiation concerns; real-time magnetic resonance imaging (rtMRI) is expensive and acoustically noisy; electropalatography (EPG) requires speaker-specific palates and is limited to tongue contact; ultrasound provides only partial imaging of the tongue and is biased by probe position unless corrected for jaw displacement; and electromagnetic articulography (EMA) is invasive and can perturb normal articulation.

An alternative approach to identifying articulatory patterns is to estimate them from the acoustic signal using models trained on a database of “ground-truth” correspondence between speech and co-recorded articulation. Such a “speech inversion” system, if shown to be valid, has immense potential for expanding scientific and clinical understanding of speech articulation by reducing the cost and effort of articulatory data collection. However, acoustic-to-articulatory speech inversion is a highly non-linear and non-unique problem, further complicated by speaker variability (Qin and Carreira-Perpiñán, 2007).

Approaches to addressing this problem can be broadly grouped into three categories: (1) codebook-based approaches (Ouni and Laprie, 2005), where a constructed codebook of acoustics with corresponding articulatory trajectories is used; (2) analytical approaches (Krstulović, 2001) making use of carefully designed articulatory models; and (3) parametric and non-parametric statistical modeling. The statistical modeling approach has the overall advantage of being less bound by the assumptions of articulatory working models, i.e., better generalization to new speakers. It has been explored using Gaussian mixture models (GMMs) (Toda et al., 2004), hidden Markov models (HMMs) (Hiroya and Honda, 2004), and, more recently, with more powerful and flexible neural network architectures (Siriwardena and Espy-Wilson, 2023). The recent advances in deep learning have driven our adoption of neural network–based approaches for speech inversion, primarily due to their ability to learn complex non-linear mappings from acoustics to articulatory data. These neural networks also excel at generalizing across speakers, an area where classical rule-based approaches have previously fallen short. While not as interpretable as approaches utilizing acoustic features directly, hidden layers of a trained network can nonetheless provide highly correlated analogs to such features, e.g., Bartelds et al. (2022).

Espy-Wilson and colleagues have developed speech inversion systems that recover continuous articulatory movement trajectories from the acoustic signal using neural-net methods (Mitra et al., 2017; Espy-Wilson et al., 2018; Sivaraman et al., 2019). Because the output acoustics depend less on the movement of individual articulators than on how their combined movements affect the shape of the vocal tract, the system recovers synergistic timings of vocal tract constriction degree and location (tract variables), rather than positions of specific articulators. As with any computational system based on deep learning, speech inversion requires a significant amount of training data for the system to recognize patterns (Asterios et al., 2008; Wu et al., 2014; Wu et al., 2023). The Espy-Wilson speech inversion system was originally trained on the University of Wisconsin's x-ray microbeam dataset (Westbury, 1994). It has been tested against “ground-truth” data from direct tracking of lip aperture and fixed points on the tongue corresponding to alveolar and velar places of articulation, using 46 speakers drawn from the x-ray microbeam database, as well as an eight-speaker EMA database (Tiede et al., 2017). Results have shown that the speech inversion system recovers both timing and extent of movement with good spatial and temporal accuracy in comparison with ground-truth data from lip and tongue articulators.

3. Recovering velopharyngeal (VP) movement and nasality from acoustics

Nasal sounds are among the most common in English (Fry, 2004), but to date, nasality has not been incorporated into speech inversion systems. This is because the degree of nasality in the acoustic signal is a function of the degree to which the VP port is open or closed, controlling coupling of the nasal tract side branch, and direct observation of the VP is difficult and invasive. Previous work has shown that nasality can be estimated as a time-varying signal from speech acoustics using statistical learning methods to condense a range of spectral characteristics [Mel-frequency cepstral coefficients (MFCCs) and phonetically grounded parameters] (Carignan, 2021; Carignan et al., 2023). Results provide good agreement with co-collected nasometer data, but lack ground-truth comparison with direct observation of VP function. Overall, the results of these and other studies (Stevens, 2000; Beddor, 2007) support the notion that the acoustic signal contains detailed information sufficient to retrieve articulatory movement information controlling nasality.

Although it is easy to obtain acoustic recordings of speech that include nasal and non-nasal sounds, obtaining direct measures of VP status has required methods that are either invasive (e.g., pressure/flow measurements requiring catheters), or expensive [such as magnetic resonance imaging (MRI)] (Kochetov, 2020). Consequently, it has been difficult to obtain ground-truth data that (1) tracks VP opening directly, (2) provides timing resolution similar to that available for tongue and lip data, and (3) allows collection of data for a range of speakers and speech materials large enough to train a deep learning system.

Data from direct methods show how patterns of articulatory movement can explain variability in the location of acoustic cues for nasality. A number of studies have shown that American English speakers use different patterns of velum raising and lowering (a measure of VP constriction) according to syllabic organization (Krakow, 1989, 1999; Byrd et al., 2009), and there are accordingly more cues in the acoustic signal for nasalization in tautosyllabic vowel+nasal sequences than in heterosyllabic vowel+nasal sequences (Beddor, 2007). Figure 1 shows an example of this pattern for “home E” vs “hoe me,” adapted from Krakow (1989); the velum moves earlier and the VP port stays open longer when the /m/ is in the rime (home) than when the /m/ is in the onset of the following word (me).

FIG. 1.

Vertical position of the velum and lower lip aligned at the onset of bilabial contact for /m/ in “hoe me” vs “home E.” Triangles show velum lowering onset and co-occurring event in lower lip. Adapted with permission from Krakow (1989). Doctoral dissertation, Yale University, New Haven, CT, 1989.


Currently, the most comprehensive and accessible data on nasality are obtained from nasometry. With this method, the degree of opening between the oral and nasal cavities is monitored by tracking the relative contributions of acoustic energy emanating from the nose vs the mouth [nasalance (NAS)]. This method has the advantage of being non-invasive and inexpensive, but its evidence is relatively indirect. It is known, for instance, that VP opening can be quite extensive before the resulting effects are observed in the acoustic signal (Manuel and Krakow, 1995). Additionally, both oral and nasal sources of acoustic energy are subject to confounding glottal source effects (Kochetov, 2020). Most nasometric data that have been collected on both clinical and non-clinical populations used commercial nasometer systems. For technical reasons that will be detailed later, these systems bypass the issue of whether nasometry can track VP opening and closing movements by instead reporting average nasometry scores over multi-segment stretches of speech (Dalston et al., 1991). Accordingly, although large amounts of nasometric data have been collected (Alfwaress et al., 2022; Scarmagnani et al., 2023), it is unclear whether the method is sensitive enough to quantify degree of VP opening with sufficiently accurate timing resolution to support speech inversion of nasality. Note that because nasometry is an indirect measure of nasality, the measure itself is referred to as “nasalance” (Oren, 2024).

4. Speech inversion for VP movement/nasality

In this paper, we describe and test a newly developed expansion of the speech inversion system developed by Espy-Wilson and colleagues (Sivaraman et al., 2019; Siriwardena et al., 2023a), based on validated NAS as “ground truth.” To validate NAS as a sensitive indicator of VP movement, we describe a preliminary study of the correspondence between co-recorded NAS and direct articulatory observation of VP opening using a high-speed research version of a commonly employed clinical measure of VP opening degree, known here as high-speed nasopharyngoscopy (HSN). We further test whether VP opening and closing movements correspond with nasometry at sufficient resolution to discover articulatory timing patterns. Ultimately, the aim is to develop a speech inversion system that uses the acoustic speech signal to track movement patterns of the VP port. Note that this paper extends results presented in Siriwardena et al. (2023b) on a smaller set of speakers.

In order to compare with what is known about acoustic cues to nasality in the speech signal, and their distribution as a function of articulatory patterns across languages and dialects, our results will be evaluated in several ways. First, we look at the overall correlation between the NAS and HSN movement trajectories. Second, we look at a subset of data incorporating nasal targets to determine whether onsets and offsets of movement in the VP match across modalities and, if so, by what margin. Third, we seek to determine whether NAS trajectories in our data replicate patterns of movement consistently reported in the literature. Finally, we show how these articulatory measures correspond with results from our speech inversion system.

As part of a larger study comparing cross-linguistic speech patterns, we collected nasometry and acoustic recording data from 20 healthy adult speakers of English. Out of the 20 speakers, 18 are native English speakers and two are French–English speakers. For six of these subjects, we also collected simultaneous HSN data. One speaker produced speech in two sessions: nasometry and acoustic recording only vs nasometry, acoustic recording, and HSN. For both types of sessions, electroglottography (EGG) data were collected concurrently using electrodes placed on the neck at the level of the thyroid protuberance. The speech inversion results reported here are based on acoustic and nasometer data from the full set of 20 speakers. Comparisons between nasometer and HSN results are based on data from the five native speakers of American English who provided both data types.

Speech materials included portions of commonly used sentences and paragraphs drawn from other studies. These included portions of the well-known “Grandfather Passage” (Darley et al., 1975) and the Harvard sentences (IEEE, 1969), as well as portions of the speech materials described in Krakow (1999) and Westbury (1994). The same set of speech materials was used for both the nasometry-alone and nasometry-plus-HSN sessions. Subject speaking time covered approximately 15–20 min, although with interruptions, the total time required for the session was approximately 30 min. Because of memory limitations in the system for recording the HSN data, these sessions typically terminated after subjects had recorded approximately two thirds of the full set of speech materials. The subjects were asked to produce the speech presented in text at eye level, while the data were recorded continuously.

Nasopharyngoscopy is a commonly employed clinical procedure in which a camera is inserted into the nasal cavity to examine the anatomy and function of the VP mechanism. After application of a decongestant and topical anesthesia, a flexible endoscope (outer diameter: 3.6 mm) was inserted through a nostril into the nasal cavity to a point providing a view of the VP port opening, as shown in Fig. 2. The flexible endoscope was connected to a high-speed video camera (MIRO model 310, Vision Research, Inc., Wayne, NJ), and images of the VP port were captured at 1000 frames/s. Note that this sampling rate is significantly faster than a standard nasopharyngoscopy procedure, where images are typically captured at rates of only 30–60 frames/s. Data were taken in 27 s intervals to allow for downloading the images from the limited memory on the high speed video camera to an external hard drive. Insertion was performed manually by an experienced otolaryngologist. Because changes in the location of the flexible endoscope in the nasal cavity may affect the estimation of VP opening extent, the position of the scope was monitored for consistency by visual inspection. Data collection took place in a large sound-treated booth in the Ear, Nose and Throat clinic of Cincinnati Children's Hospital.

FIG. 2.

Illustration of the placement of flexible endoscope inserted through the nostril into the nasal cavity to a point providing a view of the VP port opening. The flexible endoscope and the high-speed video camera are also shown.


The 1000 Hz sampling rate was sufficient to resolve the temporal motion of the velum during speech. The video frames were used to quantify the relative velar displacement by tracking change in the total light intensity (i.e., brightness) from each image, calculated by summing the light intensity values from all the pixels in that image. Relating the change in image intensity to velar motion relies on the fact that the change in brightness in an image frame corresponds to the excursion level of the velum. As such, the total intensity value from an image of an open VP port is lower than the total intensity from an image of a closed VP. A trajectory for velar displacement can be generated by plotting the total intensity in each image over time. More detail on this procedure can be found in a previous article (Oren et al., 2020).
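
For readers who wish to reproduce this intensity-based tracking on their own recordings, a minimal sketch follows; it assumes the HSN frames have already been loaded as a grayscale NumPy array (the file name is a placeholder), and simply sums pixel values per frame:

```python
import numpy as np

def hsn_intensity_trajectory(frames):
    """Sum pixel intensities per HSN video frame to track relative velar displacement.

    `frames`: grayscale images of shape (n_frames, height, width) sampled at
    1000 frames/s; lower total intensity corresponds to a more open VP port.
    """
    return frames.reshape(frames.shape[0], -1).sum(axis=1)

# Example usage with a hypothetical pre-extracted frame array:
# frames = np.load("hsn_frames.npy")          # placeholder file name
# intensity = hsn_intensity_trajectory(frames)
# t = np.arange(intensity.size) / 1000.0      # time axis in seconds
```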

Figure 3 shows the time course of image intensity changes for the sequence “It's see more, Sid.” Illustrative video frames for the time points of VP opening for the /m/ in “more” and the VP closure for the /s/ in “Sid” are shown. Note that the fully closed VP for /s/ shows bright reflections that fill the entire frame, while the open VP for /m/ shows less reflection plus a dark section at the bottom. The dark section indicates the lowering of the velum and the loss of intensity as the light source is passed into the oral cavity.

FIG. 3.

(Color online) Time course of HSN intensity (blue) across the utterance “It is see more Sid.” The upper panel shows the summed pixel intensity across the entire frame for all HSN frames. The lower panel shows pixel intensity across a kymograph profile stripe for all frames (blue) with superimposed VP port aperture (red). The inset HSN frames illustrate (1) partial VP opening, (2) full opening associated with the nasal /m/, and (3) complete closure associated with /s/. The position of the kymograph stripe is shown in yellow on each frame. The profile inset (4) shows pixel intensity along the stripe, and the two green lines show detected VP aperture.


As noted above, nasometry involves comparison between the acoustic energy transduced by separate oral and nasal microphones. We used a custom nasometry setup similar to that previously reported in Oren et al. (2020). The microphones were mounted to the top and bottom of a separator plate acting as an acoustic barrier between the speaker's upper lip and nose. The microphones were attached medially on both sides of the plate. The acoustic data from the microphones (type 4958, Brüel & Kjaer, Duluth, GA) were captured at 51.2 kHz using a data acquisition system (National Instruments Model 9234, Austin, TX) for digitization, and custom LabVIEW code was used to convert the data to “.wav” format audio files. It is important to note that our setup is very similar to a more commonly used commercially available device [Nasometer II system (PENTAX Medical, Montvale, NJ)], with the exception of upgraded microphone sensitivity, a higher sampling rate, and some differences in processing steps for comparing the two microphone outputs. These steps are described below.

1. NAS parameter

NAS was computed from the recorded oral and nasal microphone signals. As a first step, both the nasal and oral microphone signals were high pass filtered with a cutoff of 20 Hz to remove any ambient low frequency background noise. Acoustic energy was computed as root mean square (RMS): signals were first squared, then smoothed with a moving-average filter applied forwards and backwards to avoid phase delay, and the square root taken of the result. After experimenting with different filter window types, we concluded that a rectangular window with a length of 25 ms provided the best smoothing without noticeable edge effects while remaining responsive to rapid changes in VP status. The resulting oral and nasal acoustic energy signals (AEoral and AEnasal) were then used to compute the NAS parameter following Eq. (1),

NAS = AEnasal / (AEoral + AEnasal). (1)

The NAS parameter was then downsampled to the HSN frame rate. Thresholded nasalance (NASTHR) excludes low-energy (non-phonating) regions where (AEoral + AEnasal) falls below a threshold of 10% of its maximum value.
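
To make the computation concrete, here is a minimal Python sketch of Eq. (1); the filter order, the zero-phase implementation via filtfilt, and the exact thresholding details are assumptions rather than the authors' exact code:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, filtfilt

def rms_energy(x, fs, win_ms=25.0):
    """Zero-phase smoothed RMS energy with a 25 ms rectangular window."""
    sos = butter(4, 20.0, btype="highpass", fs=fs, output="sos")  # 20 Hz high-pass
    x = sosfiltfilt(sos, x)
    win = np.ones(int(fs * win_ms / 1000.0))
    win /= win.size
    smoothed = filtfilt(win, [1.0], x ** 2)   # moving average, forwards and backwards
    return np.sqrt(np.clip(smoothed, 0.0, None))

def nasalance(oral, nasal, fs, thr=0.10):
    """NAS = AEnasal / (AEoral + AEnasal); NASTHR masks low-energy regions."""
    ae_oral, ae_nasal = rms_energy(oral, fs), rms_energy(nasal, fs)
    total = ae_oral + ae_nasal
    nas = ae_nasal / np.maximum(total, 1e-12)
    nas_thr = np.where(total >= thr * total.max(), nas, np.nan)  # exclude non-phonating regions
    return nas, nas_thr
```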

A key contribution of this work is the development of a speaker-independent (SI) system to estimate the proposed NAS parameter from the input acoustic signal. The SI system is an extension of the work in Siriwardena et al. (2023b), where the temporal convolution network (TCN) is replaced by a newly proposed bi-directional gated recurrent neural network (BiGRNN), which takes in self-supervised learning (SSL) based speech embeddings as an acoustic representation. Below, we describe the details of the modeling of the latest SI system.

1. Data pre-processing and preparation for SI system

The audio recorded by the oral and nasal microphones was mixed together to create a combined audio signal. The starting and ending silences of the combined signal were trimmed wherever the signal energy fell more than 20 dB below its peak, using the librosa Python library from the Music and Audio Research Lab (MARL), Brooklyn, NY (McFee et al., 2015). The processed combined signal was downsampled to 16 kHz and segmented into 2 s–long segments. Any remaining segment shorter than 2 s was zero padded. The segmentation was done mainly to increase the number of audio samples needed to train the deep neural network (DNN)–based SI system and to provide input acoustic representations of fixed dimensionality to the input layer of the DNN models.
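
A rough sketch of this preprocessing pipeline (assuming the two channels are simply summed and that librosa's default trimming behavior matches the 20 dB criterion) might look like this:

```python
import numpy as np
import librosa

def prepare_segments(oral, nasal, sr_in=51200, sr_out=16000, seg_s=2.0, top_db=20):
    """Mix the oral and nasal channels, trim edge silence, downsample, and cut 2 s segments."""
    combined = oral + nasal                                        # combined audio signal
    combined, _ = librosa.effects.trim(combined, top_db=top_db)    # trim leading/trailing silence
    audio = librosa.resample(combined, orig_sr=sr_in, target_sr=sr_out)
    seg_len = int(seg_s * sr_out)
    n_segs = int(np.ceil(audio.size / seg_len))
    padded = np.zeros(n_segs * seg_len, dtype=audio.dtype)         # zero-pad the final short segment
    padded[:audio.size] = audio
    return padded.reshape(n_segs, seg_len)
```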

The NAS parameter discussed earlier is used as one of the ground-truth targets to train the SI system. For the purpose of training the SI system, the NAS parameter was normalized to a (−1, 1) range. Hence, all the estimates from the SI system for the NAS parameter are plotted on a (−1, 1) scale.

Siriwardena and Espy-Wilson (2023) showed that incorporating features that capture the glottal activity of speech as additional targets for the SI system can noticeably improve the accuracy of the NAS estimation. In the same work, a voicing parameter extracted from EGG data and three additional source features that include the estimated pitch, and the amount of aperiodic and periodic energy (Deshmukh and Espy-Wilson, 2005) were used. Since EGG was also recorded synchronously from all the subjects along with the nasometer setup, all the SI systems were trained to estimate both the NAS parameter and the four glottal source features (envelope of the EGG signal, plus aperiodicity, periodicity, and pitch waveforms).

The envelope of the EGG signal, sampled at 51.2 kHz, was extracted as a voicing feature. As with the NAS parameter, the EGG signal was first high pass filtered to remove baseline wander and ambient room noise. Then, the magnitude of the Hilbert transform was computed as the envelope of the EGG signal. The envelope was downsampled to 100 Hz and normalized the same way as the NAS parameter to generate the final voicing parameter.
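
The following sketch illustrates one way to implement this voicing feature; the high-pass cutoff and the min-max normalization to (−1, 1) are assumptions:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, resample_poly

def egg_voicing_feature(egg, fs=51200, fs_out=100):
    """EGG envelope via the Hilbert transform, downsampled to 100 Hz and scaled to (-1, 1)."""
    sos = butter(4, 20.0, btype="highpass", fs=fs, output="sos")  # remove baseline wander / room noise
    egg = sosfiltfilt(sos, egg)
    env = np.abs(hilbert(egg))                                    # magnitude of the analytic signal
    env = resample_poly(env, fs_out, fs)                          # 51.2 kHz -> 100 Hz
    return 2.0 * (env - env.min()) / (env.max() - env.min() + 1e-12) - 1.0
```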

2. Model training

DNN-based SI models were trained to estimate the NAS parameter from the acoustic input. The model is trained to minimize the mean absolute error (MAE) loss between the estimated and the ground-truth parameters. As in Siriwardena and Espy-Wilson (2023), four additional glottal source parameters were also used as targets along with the NAS parameter. Training, development, and test splits were designed so that none of the data from the speakers in the development and test splits were included in the training split; hence, all the models were trained in a “speaker-independent” fashion. The data from subject 10 were excluded since there was an operator error in collecting EGG data from that subject. Data from two subjects each were used for the development and test splits, respectively. Data from the remaining 16 subjects were used for training the SI systems. The splits also ensured that around 80% of the total number of utterances were present in training (∼2 h of speech), with 10% for development and 10% for testing. All the allocations were done in a completely random manner.
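
A speaker-level split of this kind can be implemented straightforwardly; the sketch below (with a hypothetical utterance dictionary carrying a 'speaker' key) assigns whole speakers to each split so that no speaker crosses splits:

```python
import random
from collections import defaultdict

def speaker_independent_split(utterances, n_dev=2, n_test=2, seed=0):
    """Assign whole speakers to train/dev/test so no speaker appears in two splits."""
    by_speaker = defaultdict(list)
    for utt in utterances:                    # each utterance carries a 'speaker' key (assumed)
        by_speaker[utt["speaker"]].append(utt)
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    dev, test, train = (speakers[:n_dev],
                        speakers[n_dev:n_dev + n_test],
                        speakers[n_dev + n_test:])
    gather = lambda group: [utt for spk in group for utt in by_speaker[spk]]
    return gather(train), gather(dev), gather(test)
```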

3. BiGRNN SI system with HuBERT embeddings

SSL-based speech representations, when used in the SI task with EMA data, have been shown to outperform conventional acoustic features, such as Mel-spectrograms and MFCCs (Cho et al., 2023). In this work, the SSL representations were fine-tuned for the downstream task of speech inversion and can be expected to generalize better even with a limited amount of ground-truth articulatory data. Based on the previous work in Cho et al. (2023), we explored the idea of using HuBERT SSL features (Hsu et al., 2021) as the input acoustic representation to train the proposed SI architecture. We used the HuBERT-large model pre-trained with the Libri-light dataset (60 000 h) to extract the HuBERT speech embeddings from the 2 s–long segments using the SpeechBrain open-source artificial intelligence (AI) toolkit (Ravanelli et al., 2021). The HuBERT embeddings are sampled at 50 Hz and have a dimensionality of 1024.
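
The paper extracts the embeddings with the SpeechBrain toolkit; as an illustrative stand-in, the sketch below obtains comparable HuBERT-large (Libri-light) frame-level features through a Hugging Face transformers checkpoint (the model name and exact pipeline are assumptions, not the authors' code):

```python
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

# HuBERT-large pretrained on the 60 000 h Libri-light corpus (stand-in for the
# SpeechBrain pipeline used in the paper)
NAME = "facebook/hubert-large-ll60k"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(NAME)
hubert = HubertModel.from_pretrained(NAME).eval()

def hubert_embeddings(segment_16k):
    """Return frame-level embeddings (~50 frames/s, 1024-dim) for a 16 kHz segment."""
    inputs = extractor(segment_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = hubert(inputs.input_values)
    return out.last_hidden_state.squeeze(0)   # shape: (n_frames, 1024)
```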

4. BiGRNN SI system architecture

In this work, inspired by Siriwardena et al. (2023a), we developed a novel BiGRNN-based model architecture for the SI system. Figure 4 shows the proposed model architecture, which was implemented in the TensorFlow Keras (Google LLC, Mountain View, CA) machine learning framework. The first two bi-directional layers consist of gated recurrent units (GRUs), each followed by a dropout of 0.3. Each bi-directional layer has 128 GRUs, and the first layer takes in the self-supervised HuBERT speech embeddings (Hsu et al., 2021) as the input speech representation.

FIG. 4.

Proposed BiGRNN model architecture for the speech inversion system. The input is an audio (.wav) file containing speech, and the output is an array of five output features (NAS parameter, EGG envelope, periodicity, aperiodicity, and pitch).


A custom-developed self-attention layer is used after the second bi-directional layer to extract better contextual information. Self-attention functions by converting the output embedding from the second bi-directional layer into three vectors: query, key, and value. These vectors are derived by linearly transforming the input. The attention mechanism computes a weighted combination of the values, considering the resemblance between the query and key vectors. The outcome, a weighted sum, together with the initial input, undergoes processing via a non-linear transformation to generate the output. This procedure enables the model to concentrate on pertinent details and grasp long-range dependencies.

The output from the attention layer is passed through a time-distributed fully connected layer with 128 hidden units, followed by an upsampling layer that matches the time dimension of the processed outputs to the desired sampling rate of the NAS parameter (and source features). Batch normalization is then carried out, followed by another dropout layer. The output from the dropout layer is passed through the final time-distributed fully connected layer, which has five hidden units to match the target outputs (NAS, voicing from EGG, aperiodicity, periodicity, and pitch waveforms).
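
The sketch below assembles a Keras model with this general shape; the attention layer (a stock MultiHeadAttention stand-in for the custom layer described above), the upsampling factor, and the later dropout rates are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_bigrnn(n_frames=100, feat_dim=1024, n_targets=5):
    """Sketch of the BiGRNN SI model: two BiGRU layers, self-attention, upsampling, five targets."""
    inp = layers.Input(shape=(n_frames, feat_dim))                  # HuBERT embeddings (2 s at 50 Hz)
    x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(inp)
    x = layers.Dropout(0.3)(x)
    x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(x)
    x = layers.Dropout(0.3)(x)
    # stand-in for the custom self-attention layer described in the text
    attn = layers.MultiHeadAttention(num_heads=1, key_dim=128)(x, x)
    x = layers.Add()([x, attn])
    x = layers.TimeDistributed(layers.Dense(128, activation="relu"))(x)
    x = layers.UpSampling1D(size=2)(x)                              # 50 Hz -> 100 Hz target rate (assumed)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.3)(x)
    out = layers.TimeDistributed(layers.Dense(n_targets))(x)        # NAS, EGG envelope, periodicity, aperiodicity, pitch
    return models.Model(inp, out)

model = build_bigrnn()
model.compile(optimizer=tf.keras.optimizers.Adam(3e-4), loss="mae")  # MAE loss as described in the text
```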

5. Hyper-parameter tuning

An ADAptive Moment Estimation (ADAM) optimizer with a starting learning rate of 3e-4 and an exponential learning rate scheduler was used. To choose the best starting learning rate, we did a grid search over [1e-3, 3e-4, 1e-4]; the candidate values were chosen empirically, considering the size of the model and the amount of training data used. A similar grid search over [8, 16, 32, 64, 128] was done to choose the training batch size. An early stopping criterion with a patience of five epochs, monitoring the validation loss on the development set, was used to avoid model overfitting. Based on the validation loss, 3e-4 and 8 were chosen as the starting learning rate and batch size, respectively. The best performing model has around 1 × 10⁶ trainable parameters and takes around 30 (±5) min to converge.
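
A minimal version of this tuning loop, reusing build_bigrnn from the previous sketch and substituting placeholder arrays for the real training data (the decay-schedule constants are assumptions), could be written as:

```python
import numpy as np
import tensorflow as tf

# Placeholder arrays shaped like the real data: HuBERT frames in, five target tracks out.
x_train = np.zeros((64, 100, 1024), "float32"); y_train = np.zeros((64, 200, 5), "float32")
x_dev = np.zeros((16, 100, 1024), "float32");   y_dev = np.zeros((16, 200, 5), "float32")

best = None
for lr in [1e-3, 3e-4, 1e-4]:                       # candidate starting learning rates
    for batch in [8, 16, 32, 64, 128]:              # candidate batch sizes
        schedule = tf.keras.optimizers.schedules.ExponentialDecay(
            initial_learning_rate=lr, decay_steps=1000, decay_rate=0.96)
        model = build_bigrnn()                      # from the previous sketch
        model.compile(optimizer=tf.keras.optimizers.Adam(schedule), loss="mae")
        stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                                restore_best_weights=True)
        history = model.fit(x_train, y_train, validation_data=(x_dev, y_dev),
                            batch_size=batch, epochs=100, callbacks=[stop], verbose=0)
        val = min(history.history["val_loss"])
        if best is None or val < best[0]:
            best = (val, lr, batch)

print("best (val_loss, learning rate, batch size):", best)
```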

1. Correlation analysis

To determine whether nasometry data have sufficient timing resolution to act as a proxy for the onset and offset of VP constriction movements, we compared NAS and HSN trajectories. Data consist of matched (oral, nasal) acoustic recordings at 51 200 Hz and HSN intensity at 1000 Hz obtained from five speakers. Correlations (Pearson's r) between HSN and NAS were calculated across all recordings from all speakers (note that HSN intensity and NAS are anti-correlated; that is, NAS is high and HSN intensity is low during periods of VP opening corresponding to nasal articulation). Table I(a) shows results for this comparison measured across complete recordings.
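
A minimal sketch of this comparison, assuming the two trajectories have already been temporally aligned, resamples NAS to the HSN rate and computes Pearson's r:

```python
import numpy as np
from scipy.signal import resample_poly
from scipy.stats import pearsonr

def hsn_nas_correlation(hsn_1khz, nas, nas_fs):
    """Pearson's r between HSN intensity (1000 Hz) and NAS resampled to the same rate.

    A negative r is expected: NAS rises while HSN intensity falls as the VP port opens.
    """
    nas_1khz = resample_poly(nas, 1000, int(nas_fs))   # bring NAS to the HSN rate
    n = min(len(hsn_1khz), len(nas_1khz))              # trim to a common length
    r, p = pearsonr(hsn_1khz[:n], nas_1khz[:n])
    return r, p
```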

TABLE I.

Pearson's r correlations for HSN ∼ NAS.

Correlation Number Maximum (worst) Minimum (best) Mean [Standard deviation (S.D.)] Median [interquartile range (IQR)]
(a) Full recordings 
HSN ∼ NAS  38 (1 excluded)  −0.040  −0.756  −0.581 (0.167)  −0.629 (0.103) 
HSN ∼ NASTHR (thresholded NAS)  38 (1 excluded)  −0.012  −0.833  −0.577 (0.202)  −0.616 (0.288) 
(b) All labeled utterances 
HSN ∼ NAS  435 (24 excluded)  −0.002  −0.898  −0.661 (0.160)  −0.701 (0.175) 
HSN ∼ NASTHR  407 (52 excluded)  −0.012  −0.974  −0.663 (0.234)  −0.733 (0.286) 
(c) Short sentences with nasal targets 
HSN ∼ NAS  178 (14 excluded)  −0.002  −0.941  −0.684 (0.171)  −0.715 (0.228) 
HSN ∼ NASTHR  179 (13 excluded)  −0.027  −0.980  −0.711 (0.219)  −0.768 (0.257) 

The Montreal forced aligner (McAuliffe et al., 2017) was used to label word and phone boundaries within each recording. A custom procedure was used to map the forced-alignment boundaries (which can include interpolated pauses) to the expected target glosses and to discard errors. These boundaries were then used to segment individual utterances, resulting in 459 usable labeled data tokens, each including 200 ms before and after the labeled utterance. Table I(b) shows results for these comparisons across labeled utterances. Of these, 398 have short (three- to four-word) carrier sentences (e.g., “It's see more, Sid”); 11 have repeated targets (e.g., “Say na na na na again”); and 50 are miscellaneous sentences. Of the short sentences, 192 have a target containing a nasal. Table I(c) shows Pearson's r correlations for this subset; for these calculations, correlations were evaluated only over the target nasal sound (with 200 ms pre/post).

TABLE II.

Average mutual information for HSN ∼ NAS.

Comparison Number Minimum (worst) Maximum (best) Mean (S.D.) Median (IQR)
(a) All labeled utterances 
HSN ∼ NAS  459  0.267  1.596  0.872 (0.241)  0.868 (0.334) 
(b) Short sentences with nasal targets 
HSN ∼ NAS  192  0.343  2.284  1.009 (0.323)  0.975 (0.458) 
TABLE III.

PPMC scores (mean and S.D. across eight trials) for the SI systems evaluated on the unseen test set data. Improved PPMC scores are bolded.

Model NAS parameter Voicing (EGG envelope) Periodicity Aperiodicity Pitch Mean
TCN (Siriwardena et al., 2023b)  0.7754 (0.02)  0.8124 (0.03)  0.8024 (0.01)  0.8106 (0.01)  0.8125 (0.03)  0.8027 (0.02) 
BiGRNN-hubert  0.8115 (0.01)  0.8330 (0.02)  0.8373 (0.01)  0.8542 (0.01)  0.8562 (0.02)  0.8385 (0.01) 

A useful extension to linear correlation is mutual information (MI) (Cover and Thomas, 2006), computed from how the joint distribution of signals A and B differs from the product of their marginal distributions, and representing the extent to which signal B can be predicted from signal A. Mutual information is quantified in bits on an open-ended scale, where higher values indicate greater similarity. Table II shows the average mutual information computed on all labeled utterances and on short sentences with nasal targets, as above.
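
The following sketch computes histogram-based mutual information in bits between two aligned trajectories; the bin count is an arbitrary choice, not a value reported in the paper:

```python
import numpy as np

def mutual_information_bits(a, b, bins=32):
    """Histogram-based mutual information (in bits) between two aligned signals."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    pxy = joint / joint.sum()                  # joint distribution
    px = pxy.sum(axis=1, keepdims=True)        # marginal of a
    py = pxy.sum(axis=0, keepdims=True)        # marginal of b
    nonzero = pxy > 0
    return float(np.sum(pxy[nonzero] * np.log2(pxy[nonzero] / (px @ py)[nonzero])))
```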

In summary, on both linear correlation and mutual information comparisons, the correspondence between HSN intensity and NAS is highly significant. We conclude from this that nasometry is a reliable indicator of change in the movement of the VP mechanism and is therefore a reasonable proxy for VP constriction degree.

1. Results on unseen test split data with the TCN baseline

To evaluate the performance of the proposed BiGRNN model, the TCN model in Siriwardena et al. (2023b) was used as the baseline. Both SI systems were trained and tested on the same data splits described in the model training section. The Pearson product moment correlation (PPMC) between the ground truth and estimated signals was used as the metric to evaluate the SI systems. The PPMC score quantifies the similarity between two continuous variables, in this case, the ground truth and the estimated signal from the SI system. A higher average PPMC score across samples from a given system indicates better performance in accurately estimating the respective signals. The individual SI systems were trained in eight trials, and Table III reports the resulting PPMC scores (mean and S.D.) on the unseen test split.

Figure 5 shows the ground-truth NAS parameter and the corresponding NAS parameters estimated by the two SI systems for the utterance “Say aah again, say mmm again” from the unseen test split. Overall, it can be seen that the NAS parameter estimated by the BiGRNN model agrees better with the ground truth compared to the baseline TCN model. Both models accurately match the NAS for the /n/ of “again.” However, the TCN model has better estimated the /m/ in “mmm again” compared to the BiGRNN model.

FIG. 5.

(Color online) NAS parameter estimates from the TCN (blue, dotted line) and BiGRNN (red, dashed line) SI systems compared to the ground-truth NAS parameter (black, solid line) for the utterance “Say aah again, say mmm again.” The panel on top shows the waveform and the middle panel shows the corresponding spectrogram.


2. Cross-corpus evaluation of the proposed BiGRNN SI system

Ultimately, we hope to use the speech inversion system to derive articulatory information when no ground-truth data are available. As a quick test of the independent capabilities of the system, the better performing SI system trained with the collected dataset (the BiGRNN system) was applied to the completely different Wisconsin x-ray microbeam (XRMB) speech dataset (Westbury, 1994). The unseen 16 kHz audio samples were given as input to the SI system to produce an estimate of the NAS time course. Figure 6 shows an example of the estimated NAS parameter for the sentence “When can we go home?” The NAS parameter estimated by the BiGRNN SI system correctly shows a peak at the location of the three nasal consonants in “when,” “can,” and “home.” In contrast, the TCN SI system did not generalize as well, detecting NAS only for the word “home.”

FIG. 6.

(Color online) Estimated NAS parameter for the sentence “When can we go home?” from the XRMB dataset by the TCN (blue, dotted line) and BiGRNN (red, dashed line) SI systems. The panel on top shows the waveform and the middle panel shows the corresponding spectrogram.


While the large dataset analysis described above shows that, overall, NAS and VP port movement are highly correlated, the degree to which NAS can be used as a proxy for VP constriction movement with appropriate timing resolution is still unknown. As noted above (see Fig. 1), a number of studies have shown that American English speakers use different patterns of velum raising and lowering (i.e., VP port constriction), according to syllabic organization. In particular, for syllable-initial nasal consonants, the movement onsets for lip closure and velum lowering occur at the same time, while the velum begins to lower much earlier for syllable-final consonants. Accordingly, we examined how our results match up with known patterns of syllabic organization. In addition, we were interested in whether the system might be able to distinguish the quantal perception of nasality from the continuous nature of movement. In other words, for the VP mechanism and nasality, we consider whether the SI system may be sensitive enough to identify the differences between continuous VP movement vs the point at which constriction degree is sufficiently wide to show acoustic coupling. Note that the patterns in question are defined by reference to the independent movement of the non-nasal articulators. For /m/, this means the independent movement of the lip aperture or, in the case of the Krakow (1989, 1999) data, lower lip movement. To provide an independent point of alignment, the lip aperture tract variable (LATV) was extracted using the articulatory speech inversion system (Espy-Wilson et al., 2018; Sivaraman et al., 2019; Siriwardena and Espy-Wilson, 2023).

Figures 7(a) and 7(b) show the NAS parameter vs the HSN parameter for “It's hoe me” and “It's home E” from one representative subject (subject 1) in the dataset. The aligned LATV trace is shown for comparison. Overall, for both Figs. 7(a) and 7(b), we see that the HSN and NAS signals correspond synchronously with the LATV minimum at peak opening and peak closing. Based on the patterns shown in Fig. 1 from the Krakow (1989, 1999) data, we expect that the two tracks showing nasality—NAS and HSN—will show movement into and out of opening, and the LATV track will show movement into and out of closure, at the same time for the word-initial /m/ of “hoe me.” Figure 7(a) shows this expected pattern. The arrow in Fig. 7(a) shows that for both NAS and HSN, the tracks move downward toward VP opening late in the vowel of “hoe,” and at roughly the same time as the LATV begins to move downward toward lip closure. All three tracks begin downward movement (toward opening) at roughly the same time, and hit the peak of downward movement at the same time. Figure 7(b) also shows the pattern we expect based on Fig. 1. For the word-final (coda) /m/ of “home,” the HSN and NAS parameters begin to move toward VP opening for the /m/ at a much earlier point than for the word-initial /m/ of “hoe me.” The two tracks start to move during the /h/ of “home,” and remain in this partially open state together during the vowel of “home,” resulting in partial nasalization during the vowel. This plateau pattern may be an instance of nasalization functioning as an “enhancing gesture,” as Stevens suggested in his work on features (Stevens and Keyser, 2010). Following the plateau of movement, both HSN and NAS tracks in Fig. 7(b) show an additional movement, from plateau toward full VP opening, during the /m/ of “home.” This point of maximum VP opening coincides with the point where lip closure (as shown by the LATV) occurs, and corresponds to the point of full nasalization during the nasal (stop) consonant. The arrows in Figs. 7(a) and 7(b) indicate the points of similarity between the patterns shown in these figures relative to Fig. 1.

FIG. 7.

(Color online) (a) (left) and (b) (right) Time course of two utterances contrasting “hoe me” (/ho#mi/) and “home E” (/hõm#i/) sequences. The upper panels show the oral (blue) and nasal (red) acoustic waveforms and combined spectrogram. LATV is the lip aperture estimated from the speech inversion system. The bottom panel shows HSN intensity (blue) combined with inverted NAS (red) and thresholded (>10% acoustic energy) NAS (green). The inset HSN movie frames show the contrast between non-nasalized /o/ with completely closed VP preceding a word boundary (left), and nasalized /õ/ with a partially open VP when the nasal occurs in the coda (right). Note that in these figures, the orientation for VP opening (i.e., velum lowering) is down while the orientation for VP closing (i.e., velum raising) is up, and NAS is inverted to track this behavior. Similarly, in the LATV track, the orientation for lip closure is down. Arrows show key points of difference between Figs. 7(a) and 7(b) during the vowel preceding the nasal consonant /m/.


In his seminal work Acoustic Phonetics (p. 488, Fig. 9.2), Stevens (2000) discusses the time course of the cross-sectional area of the oral constriction and the VP opening and speculates that the continuous VP movement will give rise to an abrupt change in the acoustic output. If this were the case, we would expect to see divergence between the tracks, with the HSN track showing smooth movement toward opening and the corresponding NAS track showing abrupt change. This expected pattern is not found in movement toward VP opening; rather, the data suggest consistency rather than divergence between HSN and NAS tracks. As a matter of interest, we note an unexpected pattern of divergence between the NAS and HSN tracks in offset from full VP opening. In this case, the HSN track shows continuous movement toward VP closure during the /i/ vowel following /m/, while the NAS track moves more slowly. As expected, the two tracks converge at full VP closure during the /s/ of “Sid.” It is unclear how to interpret this pattern of divergence during the vowel. It may be that the divergence we expected to see in HSN vs NAS for VP opening is more likely to occur in the course of VP closing movements, but there may also be methodological explanations. As noted above, for instance, slight movements of the subject's head will have minimal effect on the NAS measure but may slightly affect the placement of the HSN scope and the consequent estimation of VP opening. Overall, however, Figs. 7(a) and 7(b) show a high degree of correspondence in timing between HSN and NAS data for VP constriction.

Our previous work on speech inversion used ground-truth movement data from oral articulators showing continuous change in key constriction points along the vocal tract, as well as acoustic indicators of voicing. In this paper, we ask whether the speech inversion system can be extended to recover continuous movement to and from VP constriction. The results of the NAS ∼ HSN correlation analyses, together with confirmation that both HSN and NAS signals reflect known American English VP movement patterns, show that the NAS parameter is sufficiently sensitive to VP movement to be used as a training target for a speech inversion system. We show that our own speech inversion system trained on NAS can recover articulatory information about patterns of VP constriction timing and degree with reasonable resolution. Additional research is needed to refine our understanding of how VP constriction and NAS interact at a detailed level. At the time his Acoustic Phonetics was written, Dr. Stevens noted the difficulty of tracking VP movement directly, and so his account of the relationship between articulation and acoustics for nasality remained speculative. Our work confirms much of Stevens's work investigating and documenting the rich acoustic detail available in the speech signal, showing that recovering articulatory information about the time course of VP constriction from the speech signal is indeed possible.

This work was supported by NSF Grant No. BCS2141413, Collaborative Research: Estimating Articulatory Constriction Place and Timing from Speech Acoustics (C.Y.E.-W., M.K.T., and S.E.B., Principal Investigators).

The authors have no conflicts to report.

The research data reported in this paper were approved by the University of Cincinnati Institutional Review Board and conform to the ethical principles of the Acoustical Society of America. Informed consent was obtained from all participants.

Data collection in this study is still ongoing. De-identified data recordings may be provided to academic researchers upon request. At the conclusion of the study, the speech inversion system in the form of an executable pipeline with the pre-trained model will be made available to academic researchers upon reasonable request.

Alfwaress, F., Kummer, A. W., and Weinrich, B. (2022). "Nasalance scores for normal speakers of American English obtained by the Nasometer II using the MacKay-Kummer SNAP-R test," Cleft Palate Craniofacial J. 59, 765–773.

Asterios, T., Ouni, S., and Laprie, Y. (2008). "Protocol for a model-based evaluation of a dynamic acoustic-to-articulatory inversion method using electromagnetic articulography," in International Seminar on Speech Production 2008, pp. 317–320.

Bartelds, M., de Vries, W., Sanal, F., Richter, C., Liberman, M., and Wieling, M. (2022). "Neural representations for modeling variation in speech," J. Phon. 92, 101137.

Beddor, P. S. (2007). "Nasals and nasalization: The relation between segmental and coarticulatory timing," in Proceedings of the 16th International Congress of Phonetic Sciences, edited by J. Trouvain and W. J. Barry (International Phonetic Association, Saarbrücken, Germany), pp. 249–254.

Byrd, D., Tobin, S., Bresch, E., and Narayanan, S. (2009). "Timing effects of syllable structure and stress on nasals: A real-time MRI examination," J. Phon. 37, 97–110.

Carignan, C. (2021). "A practical method of estimating the time-varying degree of vowel nasalization from acoustic features," J. Acoust. Soc. Am. 149, 911–922.

Carignan, C., Chen, J., Harvey, M., Stockigt, C., Simpson, J., and Strangways, S. (2023). "An investigation of the dynamics of vowel nasalization in Arabana using machine learning of acoustic features," Lab. Phonol. 14(1), 1–31.

Cho, C. J., Wu, P., Mohamed, A., and Anumanchipalli, G. K. (2023). "Evidence of vocal tract articulation in self-supervised learning of speech," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, pp. 1–5.

Cho, T., and Keating, P. (2009). "Effects of initial position versus prominence in English," J. Phon. 37, 466–485.

Cover, T. M., and Thomas, J. A. (2006). Elements of Information Theory (John Wiley & Sons, Hoboken, NJ).

Dalston, R. M., Warren, D. W., and Dalston, E. T. (1991). "Use of nasometry as a diagnostic tool for identifying patients with velopharyngeal impairment," Cleft Palate Craniofac. J. 28, 184–188.

Darley, F. A., Aronson, A. E., and Brown, J. R. (1975). Motor Speech Disorders (Saunders, Philadelphia, PA).

Deshmukh, O., and Espy-Wilson, C. Y. (2005). "Speech enhancement using auditory phase opponency model," in 9th European Conference on Speech Communication and Technology (Eurospeech), Lisbon, Portugal, pp. 2117–2120.

Espy-Wilson, C. Y., Tiede, M., Mitra, V., Sivaraman, G., Saltzman, E., and Goldstein, L. (2018). "Speech inversion using naturally spoken data," in Rethinking Reduction, edited by F. Cangemi, M. Clayards, O. Niebuhr, B. Schuppler, and M. Zellers (De Gruyter Mouton, Berlin), pp. 243–276.

Fry, E. (2004). "Phonics: A large phoneme-grapheme frequency count revised," J. Lit. Res. 36, 85–98.

Gafos, A. I. (1999). "Articulatory investigation of coronal consonants," in The Articulatory Basis of Locality in Phonology (Garland Publishing, New York), pp. 131–174.

Hiroya, S., and Honda, M. (2004). "Estimation of articulatory movements from speech acoustics using an HMM-based speech production model," IEEE Trans. Speech Audio Process. 12, 175–185.

Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., and Mohamed, A. (2021). "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," IEEE/ACM Trans. Audio Speech Lang. Process. 29, 3451–3460.

IEEE (1969). "IEEE recommended practice for speech quality measurements," IEEE Trans. Audio Electroacoust. 17(3), 225–246, Appendix C: 1965 Revised List of Phonetically Balanced Sentences (Harvard sentences).

Katsika, A., Krivokapić, J., Mooshammer, C., Tiede, M., and Goldstein, L. (2014). "The coordination of boundary tones and its interaction with prominence," J. Phon. 44, 62–82.

Kochetov, A. (2020). "Research methods in articulatory phonetics II: Studying other gestures and recent trends," Lang. Linguist. Compass 14, e12371.

Krakow, R. A. (1989). "The articulatory organization of syllables: A kinematic analysis of labial and velar gestures," Doctoral dissertation, Yale University, New Haven, CT.

Krakow, R. A. (1999). "Physiological organization of syllables: A review," J. Phon. 27, 23–54.

Krivokapić, J. (2014). "Gestural coordination at prosodic boundaries and its role for prosodic structure and speech planning processes," Philos. Trans. R. Soc. B 369, 20130397.

Krstulović, S. (2001). Speech Analysis with Production Constraints (École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland).

Manuel, S. Y., and Krakow, R. A. (1995). "Correlating movement and acoustic measures of nasalization," J. Acoust. Soc. Am. 97, 3365.

McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., and Sonderegger, M. (2017). "Montreal Forced Aligner: Trainable text-speech alignment using Kaldi," in Proceedings of Interspeech 2017, pp. 498–502.

McFee, B., Raffel, C., Liang, D., Ellis, D. P. W., McVicar, M., Battenberg, E., and Nieto, O. (2015). "librosa: Audio and music signal analysis in Python," in Proceedings of the Python in Science Conference (SciPy).

Mitra, V., Sivaraman, G., Nam, H., Espy-Wilson, C., Saltzman, E., and Tiede, M. (2017). "Hybrid convolutional neural networks for articulatory and acoustic information based speech recognition," Speech Commun. 89, 103–112.

Oren, L. (2024). "How nasal airflow can affect nasalance magnitude," J. Acoust. Soc. Am. 155, A335.

Oren, L., Rollins, M., Padakanti, S., Kummer, A., Gutmark, E., and Boyce, S. (2020). "Using high-speed nasopharyngoscopy to quantify the bubbling above the velopharyngeal valve in cases of nasal rustle," Cleft Palate Craniofac. J. 57, 637–645.

Ouni, S., and Laprie, Y. (2005). "Modeling the articulatory space using a hypercube codebook for acoustic-to-articulatory inversion," J. Acoust. Soc. Am. 118, 444–460.

Perkell, J. S. (1969). Physiology of Speech Production: Results and Implications of a Quantitative Cineradiographic Study (MIT Press, Cambridge, MA).

Qin, C., and Carreira-Perpiñán, M. Á. (2007). "An empirical investigation of the nonuniqueness in the acoustic-to-articulatory mapping," in Proceedings of Interspeech, Antwerp, Belgium, pp. 74–77.

Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L., Subakan, C., Dawalatabad, N., Heba, A., Zhong, J., Chou, J.-C., Yeh, S.-L., Fu, S.-W., Liao, C.-F., Rastorgueva, E., Grondin, F., Aris, W., Na, H., Gao, Y., De Mori, R., and Bengio, Y. (2021). "SpeechBrain: A general-purpose speech toolkit," arXiv preprint.

Scarmagnani, R. H., Lohmander, A., Salgado, M. H., Fukushiro, A. P., Trindade, I. E. K., and Yamashita, R. P. (2023). "Models for predicting velopharyngeal competence based on speech and resonance errors and velopharyngeal area estimation," Cleft Palate Craniofacial J. 61, 965–975.

Siriwardena, Y. M., Attia, A. A., Sivaraman, G., and Espy-Wilson, C. Y. (2023a). "Audio data augmentation for acoustic-to-articulatory speech inversion," in Proceedings of the 31st European Signal Processing Conference (EUSIPCO), Helsinki, Finland, pp. 301–305.

Siriwardena, Y. M., and Espy-Wilson, C. Y. (2023). "The secret source: Incorporating source features to improve acoustic-to-articulatory speech inversion," in ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes, Greece, pp. 1–5.

Siriwardena, Y. M., Espy-Wilson, C. Y., Boyce, S., Tiede, M., and Oren, L. (2023b). "Speaker-independent speech inversion for estimation of nasalance," in Proceedings of Interspeech 2023, Dublin, Ireland, pp. 4743–4747.

Sivaraman, G., Mitra, V., Nam, H., Tiede, M., and Espy-Wilson, C. (2019). "Unsupervised speaker adaptation for speaker independent acoustic to articulatory speech inversion," J. Acoust. Soc. Am. 146, 316–329.

Stevens, K. N. (1989). "On the quantal nature of speech," J. Phon. 17, 3–45.

Stevens, K. N. (2000). Acoustic Phonetics (MIT Press, Cambridge, MA).

Stevens, K. N. (2002). "Toward a model for lexical access based on acoustic landmarks and distinctive features," J. Acoust. Soc. Am. 111, 1872–1891.

Stevens, K. N., and Keyser, S. J. (2010). "Quantal theory, enhancement and overlap," J. Phon. 38, 10–19.

Stevens, K. N., and Öhman, S. E. G. (1963). "Cineradiographic studies of speech," in Speech Transmission Laboratory Quarterly Progress and Status Report (Royal Institute of Technology, Stockholm, Sweden), pp. 9–11.

Tiede, M. K., Espy-Wilson, C. Y., Goldenberg, D., Mitra, V., Nam, H., and Sivaraman, G. (2017). "Quantifying kinematic aspects of reduction in a contrasting rate production task," J. Acoust. Soc. Am. 141, 3580.

Toda, T., Black, A. W., and Tokuda, K. (2004). "Acoustic-to-articulatory inversion mapping with Gaussian mixture model," in Proceedings of the International Conference on Spoken Language Processing (ICSLP), Jeju Island, Korea, pp. 1129–1132.

Westbury, J. R. (1994). X-ray Microbeam Speech Production Database User's Handbook (University of Wisconsin, Madison, WI).

Wu, P., Chen, L.-W., Cho, C. J., Watanabe, S., Goldstein, L., Black, A. W., and Anumanchipalli, G. K. (2023). "Speaker-independent acoustic-to-articulatory speech inversion," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Rhodes, Greece, pp. 1–5.

Wu, Z., Zhao, K., Wu, X., Lan, X., and Meng, H. (2014). "Acoustic to articulatory mapping with deep neural network," Multimed. Tools Appl. 74, 9889–9907.