Reduced vowel space area (VSA) is a known effect of neurodegenerative diseases such as Parkinson's disease (PD). Using large publicly available corpuses, two experiments were conducted comparing the vowel space of speakers with and without Alzheimer's disease (AD) during spontaneous and read speech. First, a comparison of vowel distance found reduced distance in AD for English spontaneous speech, but not Spanish read speech. Findings were then verified using an unsupervised learning approach to quantify VSA through cluster center detection. These results corroborate observations for PD that VSA reduction is task-dependent, but further experiments are necessary to quantify the effect of language.
1. Introduction
Alzheimer's disease (AD) is an insidious neurodegenerative disease (ND) with loss of cognitive and bodily function. Motor skills are known to deteriorate as the disease progresses; for example, the arm movements of AD patients are inaccurate, slower, and discontinuous relative to those of controls (Ghilardi et al., 1999). Assessments of fine motor dexterity, such as the finger-tapping test, demonstrate that AD results in impairments to the timing and motor execution of finger movements (Roalf et al., 2018). Assessments of manual dexterity during a writing task show AD patients' movements were slower, less smooth, less coordinated, and less consistent than those of healthy controls (HCs) (Yan et al., 2008). Research demonstrates that AD results in measurable motor deficits during preclinical stages of the disease (Buchman and Bennett, 2011); therefore, assessments of speech motor skills via acoustic analysis serve as a non-invasive method for early diagnosis and monitoring of disease progression, with previous research identifying AD-related changes to prosody (Martínez-Sánchez et al., 2012), voice quality (Xiu et al., 2022), and segment timing (Baker et al., 2007; Cera et al., 2018). Reductions to the vowel space area (VSA) of a speaker are known manifestations of reduced fine motor skills for NDs such as Parkinson's disease (PD) and amyotrophic lateral sclerosis (ALS) (Skodda et al., 2011; Turner et al., 1995). Recently, Xiu et al. (2022) investigated the first three formants (F1–F3) of 17 AD speakers (9 male, 8 female) and 13 HC speakers (9 male, 4 female) as measured from nine individual words read in isolation. Formant values for all male and female participants were pooled and directly compared between disease state groups. No difference in F1 or F2 was observed, but a statistically significant difference was observed for F3. In other words, no difference in VSA was observed between AD and HC. 
Considering that fine motor skills are observably impaired in AD, the absence of any reduction to VSA from AD is unexpected. We note that the results of Xiu et al. were based on a relatively small sample of participants (17 AD, 13 HC) and made use of a small number of tokens read in isolation; thus, it remains unclear whether these findings extend to continuous speech measured across larger populations. Assessments of vowel space in PD patients have shown that vowel reduction effects are most salient for natural spontaneous speech, less salient for read passages, and least salient for isolated words or vowels (Exner, 2019; Rusz et al., 2013). Therefore, the VSA of AD patients may be reduced outside of isolated word contexts. We also note that pooling and directly comparing raw formant values of males and females, as done in Xiu et al., is not ideal due to inherent differences in the VSA of males and females. Ideally, some form of control for sex or formant normalization should be employed; controlling for sex is also valuable considering increased disease symptom severity in female AD patients (Laws et al., 2018).
The availability of large public corpuses containing speech from AD and HC speakers allows us to evaluate whether the findings of Xiu et al. for isolated words extend to continuous speech. For this purpose, we compare the VSA of AD and HC speakers across two large corpuses spanning hundreds of AD and HC speakers: (1) the English-language Pitt corpus (Becker et al., 1994), which contains spontaneous speech as elicited via a picture-description task, and (2) the Spanish-language Ivanova corpus (Ivanova and Meilán, 2022), which contains read speech. We note that these corpuses differ in both task and language, which complicates the matter of directly comparing them; however, they represent the largest publicly available datasets of continuous speech from AD and HC speakers.
With these datasets, we conduct two experiments to evaluate the effect of AD on VSA: (1) We calculate the Euclidean distance of vowel tokens in a quadrilateral or triangular vowel set from speaker-specific centroids, and (2) we validate our findings in the first experiment using an unsupervised learning paradigm for the automatic calculation of VSA based on cluster center detection (Sandoval et al., 2013).
As movement trajectories in AD are observed to be reduced, slower, and discontinuous for both gross and fine motor skills, we hypothesize that impairments to fine motor skills in AD should be reflected in reductions to the VSA of AD speakers during speech. This should manifest as shortened movement trajectories for individual vowel targets that generalize to a smaller working VSA, as observed for other NDs (e.g., PD, ALS). We note that while Xiu et al. did not observe VSA reduction in non-continuous speech, both datasets used in the present study employ some form of continuous speech. Evidence from PD suggests that natural spontaneous speech as contained in the English Pitt corpus would demonstrate more substantial reduction than read speech as contained in the Spanish Ivanova corpus, but we must consider that this comparison is made across datasets that also differ in language.
The vowel systems of English and Spanish differ in size and shape of the vowel space. English has a large vowel inventory compared to Spanish, and the vowel space is accordingly larger to accommodate the number of necessary acoustic contrasts (Bradlow, 1995). As a larger vowel space necessitates larger movements (Alfonso and Baer, 1982; Lee et al., 2016), we would expect English to exhibit greater reduction of VSA than Spanish. However, we note that for other NDs, such as PD, reduction of VSA is not influenced by language-specific properties, such as the size of the vowel space or the degree of relative contrast (Kim and Choi, 2017). In other words, there is good reason to expect that VSA reduction in AD should not be influenced by language-specific properties, but as this has never been empirically verified for AD, we must remain open to a possible effect of language.
2. Data
Table 1 provides a summary of speakers across both datasets, including the distribution of speakers by sex, age, and disease state across the Pitt and Ivanova corpuses, with each corpus described in detail within its relevant subsection.
Corpus | English Pitt | | | | Spanish Ivanova | | | |
---|---|---|---|---|---|---|---|---|
Group | AD | | HC | | AD | | HC | |
Gender | F | M | F | M | F | M | F | M |
No. of speakers | 114 | 64 | 43 | 36 | 32 | 15 | 105 | 49 |
Age range (years) | 49–88 | 50–83 | 46–78 | 49–80 | 59–95 | 72–87 | 58–93 | 55–96 |
Mean age in years (SD) | 72.9 (7.8) | 68.7 (8.2) | 63.3 (7.5) | 64.3 (7.5) | 78.8 (8.5) | 80.9 (4) | 76 (7.2) | 75.8 (9) |
2.1 English Pitt corpus
The English-language Pitt corpus contains speech recordings of 178 AD patients and 79 HCs, with up to five recordings obtained longitudinally over 20 years. We used audio files from all available longitudinal sessions for AD and HC participants performing the cookie-theft picture-description task, with each file providing approximately 1 min of audio. Control participants whose diagnosis changed during the study were omitted (N = 23). AD subjects within the Pitt corpus who were diagnosed with PD, Lewy body dementia, or mild cognitive impairment (MCI) were not included in the analysis, and the number of speakers for the AD condition reported in this paper refers only to speakers diagnosed with probable AD and no other co-morbid NDs throughout the study. All audio files within the corpus were in WAV format with a 44.1 kHz sampling rate and 16-bit depth encoding.
2.2 Spanish Ivanova corpus
The Spanish-language Ivanova corpus contains speech from 47 AD participants and 154 HCs. Each participant read the first paragraph of the classic novel Don Quixote, with an approximate reading time of 1 min. Note that within the Ivanova corpus, high quality WAV files are provided for only 47 AD speakers. An additional 28 speakers are provided in MP3 format but are not included in the present analysis, which is limited to AD participants for whom WAV files are available. Each WAV file was provided with a 44.1 kHz sampling rate and 16-bit depth encoding.
2.3 Data preparation and extraction
Files from both datasets were transcribed at the sentence level by a native speaker of the corresponding language before undergoing forced alignment for phonemic transcription using the Montreal Forced Aligner (McAuliffe et al., 2017) with regionally specified acoustic models and pronunciation dictionaries for each language. The forced-aligned TextGrid files were then manually verified by checking that the alignment boundaries of each phoneme corresponded to the acoustic signal, with boundaries adjusted where necessary. A script using the Burg algorithm in the acoustic analysis software Praat (version 6.1.4) (Boersma and Weenink, 2016) was used to calculate the first and second formants (F1 and F2) of target vowels in the forced-aligned TextGrids. For English, target vowels included stressed tokens from all nine English monophthongs while omitting all diphthongs. For Spanish, all five vowels were included (note that Spanish does not make use of contrastive stress or diphthongs). From each vowel token, the first and last third of the vowel were omitted, with F1 and F2 calculated from the middle third. After segregating the data by sex, language, and group (AD, HC), outliers were removed for each formant using the interquartile range technique. The total number of vowel tokens extracted from each sex, language, and disease state group is outlined in Table 2. Lobanov-normalized formant values (Lobanov, 1971) were calculated for each speaker from the target vowel set using the R vowels package (Kendall et al., 2018) to facilitate interspeaker comparisons. Lobanov normalization was employed for its effectiveness in reducing anatomical differences across speakers while retaining differences in articulation (Adank et al., 2004).
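The two numeric steps in this paragraph, interquartile-range outlier removal and Lobanov normalization, can be illustrated with a minimal Python sketch. The study itself used Praat and the R vowels package, so this is a stand-in rather than the authors' code. The function names are hypothetical, the 1.5 × IQR fence is an assumed convention (the multiplier is not stated in the text), and the use of the sample standard deviation is likewise an assumption.

```python
from statistics import mean, stdev, quantiles


def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR].

    k = 1.5 is the conventional fence and is an assumption here.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def lobanov(tokens):
    """Z-score F1 and F2 within a single speaker.

    tokens: list of (vowel, f1_hz, f2_hz) tuples for ONE speaker.
    Each formant is centered on the speaker's mean and divided by the
    speaker's (sample) standard deviation for that formant.
    """
    f1 = [t[1] for t in tokens]
    f2 = [t[2] for t in tokens]
    m1, s1 = mean(f1), stdev(f1)
    m2, s2 = mean(f2), stdev(f2)
    return [(v, (a - m1) / s1, (b - m2) / s2) for v, a, b in tokens]
```

Applied per speaker, `lobanov` yields unitless F1 and F2 values with mean 0 and standard deviation 1, which is what makes distances comparable across speakers with different vocal tract sizes.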
3. Experiment 1: Peripheral vowel Euclidean distance
We employ a well-established method of assessing vowel space reduction by calculating the Euclidean distance of three (for Spanish) or four (for English) peripheral vowels from a speaker-specific centroid, which is itself calculated as the mean formant values of the triangular or quadrilateral targets (Nycz and Hall-Lew, 2013). Using Lobanov-normalized formant values, we calculate a quadrilateral vowel space for English using the four corner peripheral monophthongs /i, æ, u, ɑ/ and a triangular vowel space for Spanish using the three corner monophthongs /i, a, u/. For each speaker, we calculate the centroid (C1, C2) by first calculating the speaker-specific mean F1 and F2 of each target vowel and then calculating the arithmetic mean of means for F1 (C1) and F2 (C2); for example, the formula to calculate the C1 of Spanish is as follows:
$$C_1 = \frac{\overline{F1}_{/i/} + \overline{F1}_{/a/} + \overline{F1}_{/u/}}{3},$$
where $\overline{F1}_{/v/}$ denotes the speaker-specific mean F1 of vowel /v/.
From this, the Euclidean distance from the centroid was calculated for individual vowel tokens. Note that the present method provides calculations of Euclidean distance for individual vowel tokens but does not provide a unified calculation of VSA for each speaker; this allows insight into the degree of variability of distance for each vowel that a unified metric of VSA cannot. Furthermore, by incorporating token-wise calculations of Euclidean distance, we allow a greater number of observations, which increases our statistical power as is necessary for complex linear modeling.
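The centroid and token-wise distance calculation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' script: the function names and the `(vowel, f1, f2)` tuple layout are hypothetical, and the input is assumed to be already Lobanov-normalized for a single speaker.

```python
from collections import defaultdict
from math import hypot
from statistics import mean


def centroid(tokens, corners):
    """Mean of per-vowel means over the corner vowels (C1, C2).

    tokens:  (vowel, f1, f2) tuples for one speaker, already normalized.
    corners: labels of the corner vowels, e.g. {"i", "a", "u"}.
    Averaging the vowel means (rather than all tokens) keeps frequent
    vowels from pulling the centroid toward themselves.
    """
    by_vowel = defaultdict(list)
    for vowel, f1, f2 in tokens:
        if vowel in corners:
            by_vowel[vowel].append((f1, f2))
    c1 = mean([mean([f1 for f1, _ in pts]) for pts in by_vowel.values()])
    c2 = mean([mean([f2 for _, f2 in pts]) for pts in by_vowel.values()])
    return c1, c2


def token_distances(tokens, corners):
    """Euclidean distance of every corner-vowel token from the centroid."""
    c1, c2 = centroid(tokens, corners)
    return [(vowel, hypot(f1 - c1, f2 - c2))
            for vowel, f1, f2 in tokens if vowel in corners]
```

Each returned pair is one observation for the regression in Sec. 3.1, which is why a speaker contributes as many rows as they have corner-vowel tokens rather than a single VSA value.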
3.1 Linear modeling
A linear mixed effects regression (LMER) model was fitted to the data in R, using the lme4 package and the optimx optimizer. Euclidean distance was set as the dependent variable. Disease state (AD, HC) was included as the first predictor term, with task and sex included as fixed effects interacting with disease state. These interactions allow insight into whether any observed reduction depends on the task (spontaneous speech vs read speech) or on sex-specific differences; note, however, that in our dataset and model, task is inherently confounded with language. Our model also incorporated random intercepts for vowel and speaker. Our model formula as used in lme4 is as follows:
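In schematic form, a model matching this description can be written as below. This is a sketch under stated assumptions rather than the exact lme4 call: the variable names are placeholders, and the full crossing of the three fixed effects is an assumption. That structure is the one under which dropping disease state removes four fixed-effect terms (group, group × task, group × sex, and group × task × sex), which would agree with the four degrees of freedom reported for the likelihood ratio test in Sec. 3.2.

```latex
\mathrm{distance} \sim \mathrm{group} \times \mathrm{task} \times \mathrm{sex}
  + (1 \mid \mathrm{vowel}) + (1 \mid \mathrm{speaker})
```

Here $(1 \mid \cdot)$ denotes a random intercept, following the notation used by lme4.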
We then assessed the contribution of disease state using a likelihood ratio test, comparing the full model to one omitting group as a predictor. Post hoc pairwise comparisons of the interactions were conducted on the model predictions with Bonferroni adjustment, using the emmeans package (Lenth, 2022) in R.
3.2 Results
Figure 1 provides a boxplot of Euclidean distance values sorted by vowel and disease state. Note that vowels between languages are treated as distinct vowels, even for vowels that overlap, e.g., /i/; therefore, English vowels are presented in ARPABET (IY, AE, AA, UW) and Spanish vowels in International Phonetic Alphabet (IPA) (i, a, u).
The results of the LMER model assessing the effect of disease state on Euclidean distance are provided in Fig. 2. The results of our likelihood ratio test (LRT) comparison to a model omitting the effect of condition suggest that condition has a significant effect on Euclidean distance [χ² = 22.35, degrees of freedom (df) = 4, p = 0.0002]. However, no clear trend can be observed between groups.
For English speakers, distance was reduced in AD for both males and females. For Spanish speakers, distance increased for male AD speakers and remained relatively constant for females. Table 3 presents the post hoc pairwise comparisons of interactions, which show a significant effect of condition for English males and females but not for Spanish males or females.
Group | SE | t-ratio | p |
---|---|---|---|
English female AD-HC | 0.023 | −3.70 | 0.0002 |
English male AD-HC | 0.028 | −2.51 | 0.01 |
Spanish female AD-HC | 0.037 | −0.07 | 0.94 |
Spanish male AD-HC | 0.053 | 1.68 | 0.09 |
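As an arithmetic check on the likelihood ratio test reported in this section (χ² = 22.35, df = 4, p = 0.0002), the upper-tail probability of a chi-square distribution has a closed form when the degrees of freedom are even. The Python sketch below uses that identity; the function name is hypothetical, and the snippet is only a consistency check on the reported numbers, not part of the original analysis.

```python
from math import exp, factorial


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable.

    Valid only for even, positive df, where
    P(X > x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires even, positive df")
    half = x / 2.0
    return exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))


# Reported test statistic and degrees of freedom from the text.
p_value = chi2_sf_even_df(22.35, 4)
print(f"{p_value:.4f}")  # prints 0.0002
```

The computed value, roughly 1.7 × 10⁻⁴, rounds to the reported p = 0.0002.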
3.3 Summary
The results of the LMER analysis indicate that distance was reduced for AD relative to HC for both males and females in the English dataset, but no such trend was observed for the Spanish data. Spanish female AD and HC estimates were nearly equal, and male AD distance was notably higher than HC, but this difference was not significant in the post hoc comparison. We note that model estimates accurately reflect what can be observed in raw-data comparison of individual vowels in Fig. 3; namely, clear reduction can be observed for English vowels (spontaneous speech), but not for Spanish vowels (read speech).
4. Experiment 2: Automatic assessment of VSA
4.1 K-means cluster detection
As validation of our findings in experiment 1, we employ the approach of Sandoval et al. (2013) for automatic VSA calculation, which uses an unsupervised learning algorithm (k-means) for cluster detection via vector quantization. Following outlier removal, mean VSAs were calculated independently for each combination of sex, language, and disease state, using the pooled Lobanov-normalized formant data for the entire vowel set from all participants within that group. The k-means algorithm was set to automatically detect nine cluster centers for English data and five cluster centers for Spanish data, with the total number of cluster centers corresponding to the number of distinct vowels for each language included within formant extraction as described in Sec. 2.3. We used ten random initializations of the cluster centers for each VSA and allowed up to 1000 iterations for the algorithm to converge. Subsequently, a convex hull calculation was performed across the area denoted by the cluster centers, using the convex hull function within base R. We plot the calculated cluster centers and convex hull area for each group as qualitative verification for our manual assessment of movement distance in experiment 1.
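The pipeline described above (k-means cluster centers, then the area of their convex hull) can be sketched in plain Python. The study used R, so this is a stand-in, not the authors' implementation. It assumes Lloyd's algorithm for k-means, Andrew's monotone chain for the hull, and the shoelace formula for the area; all function names and the `seed` parameter are hypothetical.

```python
import random
from math import dist


def kmeans(points, k, n_starts=10, max_iter=1000, seed=0):
    """Lloyd's algorithm with random restarts; returns the best centers."""
    rng = random.Random(seed)
    best_centers, best_cost = None, float("inf")
    for _ in range(n_starts):
        centers = rng.sample(points, k)  # initialize from the data points
        for _ in range(max_iter):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda j: dist(p, centers[j]))
                clusters[nearest].append(p)
            # An empty cluster keeps its previous center.
            new_centers = [
                (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                if c else centers[j]
                for j, c in enumerate(clusters)
            ]
            if new_centers == centers:
                break
            centers = new_centers
        cost = sum(min(dist(p, c) ** 2 for c in centers) for p in points)
        if cost < best_cost:
            best_centers, best_cost = centers, cost
    return best_centers


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula for a simple polygon given in vertex order."""
    n = len(vertices)
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def vsa(points, k, **kmeans_options):
    """Area of the convex hull spanned by the k cluster centers."""
    return polygon_area(convex_hull(kmeans(points, k, **kmeans_options)))
```

For a group's pooled, normalized data held as a list of `(F2, F1)` tuples, `vsa(points, k=9)` would correspond to the English setting and `vsa(points, k=5)` to the Spanish one. Because only hull vertices contribute to the area, cluster centers for interior vowels affect the result only indirectly, through how the clustering partitions the tokens.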
4.2 Results
Figure 3 displays the mean VSA as calculated via k-means, controlled for sex and group. In each plot, the colored points reflect the cluster centers as calculated for AD (orange) or HC (blue) via k-means, with the colored polygon reflecting the VSA as calculated by convex hull calculation of the cluster centers. The area as calculated for each group is denoted alongside the disease state label. For example, the male English AD group had a combined VSA of 5.4 vs 5.96 for HC. Qualitatively, we can observe reduction of the VSA for English groups, but not for Spanish, where instead we can observe that the AD condition is moderately larger for individual sexes. Consistent with our LMER observations in experiment 1, reduction was consistent for both English males and females.
5. Discussion
We reported two experiments comparing the effect of AD on VSA across two datasets containing continuous speech: the English-language Pitt corpus, which contained natural speech, and the Spanish-language Ivanova corpus, which contained read speech. In the first experiment, we assessed reduction of the vowel space by calculating the Euclidean distance of vowel tokens from speaker-specific centroids using a triangular (Spanish) or quadrilateral (English) subset of the vowel space periphery. LMER models were fitted to the data evaluating the effect of disease on Euclidean distance while also accounting for the interaction of disease with language and sex. We found that vowel distance was significantly reduced for both male and female AD groups in the Pitt corpus, but no such effect could be observed in the Ivanova corpus.
To verify our findings in experiment 1, we conducted a second experiment where we used an unsupervised learning approach to calculate aggregate VSA plots for groups sorted by language, sex, and disease state. Aggregate VSA plotting suggested a substantial reduction to English VSAs for AD relative to HC, but no such change was visible for Spanish. These results were largely consistent with the triangular and quadrilateral vowel distances calculated in experiment 1; reduction to VSA for AD speakers can be observed in the Pitt corpus, but not the Ivanova corpus. These results also serve as qualitative validation that the effects observed for the peripheral vowel space in experiment 1 extend to the full set of vowels included in experiment 2.
In sum, both experiments found a reduction of vowel space for AD speakers in the Pitt corpus but not the Ivanova corpus. The absence of VSA reduction in the Ivanova corpus is consistent with the results of Xiu et al. (2022), where no change could be observed in the VSA of AD speakers as measured from Mandarin words in isolation. However, our observations of reduced VSA in the Pitt corpus conflict with those reported by Xiu et al. Between the work conducted in the present analysis and Xiu et al., comparisons of VSA in AD have been conducted across three distinct tasks and three distinct languages; therefore, we cannot conclude with certainty whether differences between these datasets stem from task or language. We find it noteworthy that the only dataset to demonstrate reduction is that which employed natural speech, which corroborates observations from PD research that natural speech is more susceptible to VSA reduction than read speech. While language-specific effects cannot be conclusively ruled out, we note that if VSA reduction in AD is language-specific, then this form of VSA reduction would be unusual when compared with other NDs. To date, we know of no language reported as being exempt from reduction of VSA in PD or ALS.
Future work would benefit from data that allow independent comparisons of task or language, either by comparing consistent tasks across multiple languages or a single language across multiple tasks. We should also consider that, apart from language and task, there are differences between the Pitt and Ivanova corpuses in terms of size, methodological design, and subject selection that may also act as potential confounds. We note that the Xiu and Ivanova datasets are substantially smaller than the Pitt corpus; thus, it is possible that these discrepancies result from a lack of statistical power.
In conclusion, this study found reduced VSA in English spontaneous speech, which provides support for the prediction that impairments to fine motor skills in AD should manifest in reductions to the movement trajectories of vowel articulation. The absence of VSA reduction in Spanish read speech and Mandarin isolated words suggests that VSA reduction in AD may be influenced by task or language.
Acknowledgments
This work was funded by National Institutes of Health (NIH) Grant No. DC-002717 awarded to Haskins Laboratories.