This study evaluated the feasibility of differentiating conversational and clear speech produced by individuals with muscle tension dysphonia (MTD) using landmark-based analysis of speech (LMBAS). Thirty-four adult speakers with MTD recorded conversational and clear speech; 27 of them were able to produce clear speech. The recordings of these individuals were analyzed with SpeechMark®, an open-source LMBAS program (MATLAB Toolbox version 1.1.2). The results indicated that glottal landmarks, burst onset landmarks, and the duration between glottal landmarks differentiated conversational speech from clear speech. LMBAS shows potential as an approach for detecting the difference between conversational and clear speech in dysphonic individuals.
1. Introduction
Dysphonia (i.e., disordered voice) frequently reduces the speech intelligibility of affected individuals (Ishikawa et al., 2017a). Various voice and speech therapy techniques exist for improving intelligibility; clear speech is one such technique that has been incorporated into voice therapy programs (Gartner-Schmidt et al., 2016). Currently, the degree of success in implementing clear speech and its effectiveness in enhancing intelligibility are assessed perceptually by clinicians. The subjective nature of this auditory-perceptual assessment poses several challenges associated with human perception, including speaker familiarity, context familiarity, and variability across clinicians (Babel and Russell, 2015; McGowan, 2015; Kutlu et al., 2022; McLaughlin and Van Engen, 2022). An acoustic-based approach would provide a solution, as it is free of these biases; identifying a biomarker for the implementation of clear speech is the first step toward developing such an approach.
Speakers with a healthy voice can enhance intelligibility by intentionally changing how they produce speech. One of the most well-documented intelligibility enhancement techniques is clear speech, which is elicited by instructing speakers to hyperarticulate speech sounds (Ferguson and Kewley-Port, 2007; Moon and Lindblom, 1994; Bradlow et al., 2003; Smiljanić and Bradlow, 2009). Clinically, clear speech is an attractive therapy technique because it exploits the individual's prior experience enhancing intelligibility in challenging communication environments. The technique has been incorporated into therapy programs for individuals with dysarthria (Beukelman et al., 2002), hearing loss (Schum, 1997), and, more recently, voice disorders (Gartner-Schmidt et al., 2016). Acoustically, clear speech has been shown to elicit temporal and spectral changes in the speech signal, including a decrease in speech rate and increases in vowel duration (Picheny et al., 1986; Ferguson and Kewley-Port, 2002), overall intensity (Picheny et al., 1986), consonant-vowel ratio (Picheny et al., 1986), vowel space (Ferguson and Kewley-Port, 2007), and spectral energy between 1 and 3 kHz (Smiljanic and Gilbert, 2017).
Landmark-based analysis of speech (LMBAS) holds promise for providing a biomarker for clear speech. LMBAS is a knowledge-based approach rooted in the landmark (LM) theory of speech production and perception (Stevens, 2002). The theory proposes that articulatory movements elicit abrupt changes in the speech signal, called landmarks (LMs), and assumes that listeners are sensitive to the acoustic patterns around LMs, recognizing them as crucial cues for speech perception. This emphasis on the connection between speech articulation and perception may yield a biomarker that assesses a speaker's ability to modify speech for enhanced intelligibility more directly than approaches that rely on statistical patterns of the signal. For the evaluation of speech intelligibility, the theory also addresses the lack-of-invariance problem (i.e., the many-to-many mapping between speech acoustics and linguistic percepts) that plagues traditional feature-based approaches to speech analysis (Peterson and Barney, 1952; Liberman et al., 1967; Miller and Baer, 1983; Magnuson et al., 2020). LMs remain relatively consistent across speakers because they are grounded in the physical properties of articulatory gestures, which are constrained by the capacities of the human articulators (Stevens, 2002). As a result, the lack of invariance is less of an issue in the LMBAS model: listeners are assumed to focus on the relatively invariant LMs rather than on the variable aspects of the speech signal, which makes the model more agnostic to speaker identity.
The LMBAS algorithm first identifies the moments that qualify as LMs and then classifies them into specific markers based on their acoustic characteristics (Fig. 1). Table 1 displays the acoustic rules for each LM. Note that the term "burst" has a different meaning in the LM system than in traditional acoustic phonetics, where it pertains primarily to stop consonants. Burst LMs are designed to capture abrupt acoustic changes occurring in unvoiced regions, which can be associated with stops, affricates, and fricatives. Conversely, syllabic LMs capture similar abrupt acoustic changes in voiced regions, which may relate to sonorant consonants. Glottal, burst, and syllabic LMs are the most common LMs (Ishikawa et al., 2020), and burst and syllabic onset LMs occur more frequently than their offset counterparts because of the greater energy increase at consonant onsets (Ishikawa et al., 2017b). The feasibility of using LMBAS to detect differences between conversational and clear speech in healthy speakers is supported by a study showing that successfully produced clear speech generated a greater number of LMs than poorly produced clear speech (Boyce et al., 2013). LMBAS has also been applied to describe speech production disorders in children (Atkins et al., 2019; Liu, 2021; Kalita et al., 2018) and speech affected by depression (Huang et al., 2019).
Table 1. Acoustic rules for each type of LM. Note that the symbols and mnemonics are not intended to identify underlying articulatory or phonetic events, only to suggest examples: syllabic, voiced frication, etc.
| Symbol | Landmark type | Rule |
|---|---|---|
| +g | Glottal onset | The beginning of sustained vocal fold vibration, i.e., of periodicity or of power and spectral slope similar to that of a nearby segment of sustained periodicity |
| −g | Glottal offset | End of sustained vocal fold vibration |
| +b | Burst onset | At least 3 of 5 frequency bands show simultaneous power increases of at least 6 dB in both the finely smoothed and the coarsely smoothed contours, in an unvoiced segment (not between +g and the next −g) |
| −b | Burst offset | At least 3 of 5 frequency bands show simultaneous power decreases of at least 6 dB in both the finely smoothed and the coarsely smoothed contours, in an unvoiced segment |
| +s | Syllabic onset | At least 3 of 5 frequency bands show simultaneous power increases of at least 6 dB in both the finely smoothed and the coarsely smoothed contours, in a voiced segment (between +g and the next −g) |
| −s | Syllabic offset | At least 3 of 5 frequency bands show simultaneous power decreases of at least 6 dB in both the finely smoothed and the coarsely smoothed contours, in a voiced segment |
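To make the rules in Table 1 concrete, the sketch below implements the core of the burst onset rule in Python. It is a minimal illustration, not the SpeechMark implementation: the band edges, frame length, and smoothing spans are hypothetical choices, and a complete detector would also restrict candidates to unvoiced segments (outside [+g]/[−g] pairs) and handle offsets symmetrically.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def band_energies_db(x, fs, bands, frame_s=0.005):
    """Per-frame log energy (dB) in each frequency band."""
    hop = int(frame_s * fs)
    rows = []
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        y = sosfilt(sos, x)
        n = len(y) // hop
        energy = np.square(y[: n * hop]).reshape(n, hop).sum(axis=1)
        rows.append(10.0 * np.log10(energy + 1e-12))
    return np.array(rows)  # shape: (n_bands, n_frames)

def moving_avg(e_db, span):
    """Smooth each band's energy contour with a moving average."""
    kernel = np.ones(span) / span
    return np.array([np.convolve(row, kernel, mode="same") for row in e_db])

def burst_onset_candidates(e_db, fine=3, coarse=9, rise_db=6.0, min_bands=3):
    """Frames where >= min_bands bands rise >= rise_db in both contours."""
    rise_fine = np.diff(moving_avg(e_db, fine), axis=1) >= rise_db
    rise_coarse = np.diff(moving_avg(e_db, coarse), axis=1) >= rise_db
    both = rise_fine & rise_coarse
    return np.where(both.sum(axis=0) >= min_bands)[0]

# Hypothetical five-band division of the spectrum (Hz).
BANDS = [(100, 400), (400, 1200), (1200, 2000), (2000, 3500), (3500, 5000)]

fs = 16000
x = np.random.randn(fs)  # stand-in for one second of speech
frames = burst_onset_candidates(band_energies_db(x, fs, BANDS))
print(frames * 0.005)    # candidate [+b] times in seconds
```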
For the detection of clear speech, burst onset and syllabic onset LMs may be particularly promising. These LMs are generated when spectral power increases simultaneously in multiple frequency bands. Picheny et al. (1986) reported that the root-mean-square (RMS) intensity of obstruent sounds, particularly stop consonants, was as much as 10 dB greater in clear speech than in conversational speech. The burst onset LM is designed to capture such a change in unvoiced regions. The syllabic onset LM, in turn, would capture the power increase in voiced regions reported by Smiljanic and Gilbert (2017). Another promising LM-based feature is the duration between glottal onset and offset LMs, which typically occur in pairs (Ishikawa et al., 2017b; Ishikawa et al., 2020). This durational measure could capture the slower speech rate and longer vowel duration associated with clear speech (Picheny et al., 1986; Ferguson and Kewley-Port, 2002). Given the well-documented acoustic differences between normal and dysphonic speech signals (e.g., Eadie and Doyle, 2005), further investigation is warranted to better understand how these differences manifest in LM expression in clear speech from dysphonic speakers. Our previous study (Ishikawa et al., 2020) found that dysphonic speech had more glottal and burst LMs and fewer syllabic LMs than normal speech. The excess glottal and burst LMs are consistent with the greater aperiodicity of dysphonic speech, and the reduced number of syllabic LMs indicates a lack of harmonic content in the signal. These results, along with the previously reported acoustic features of clear speech, support the current study's focus on glottal [g], burst [b], and syllabic [s] LMs.
The overarching goal of this project is to develop an automatic tool for monitoring patient progress in speech production, ultimately leading to improved intelligibility. The LM-based system serves as a feature extraction method capable of identifying potential biomarkers that effectively evaluate the utilization of speech enhancement techniques. Biomarkers incorporating theories of speech production and perception are attractive, as they provide better explainability in a machine learning model than purely statistically derived features. In this study, we analyzed samples from individuals with muscle tension dysphonia (MTD), a prevalent voice disorder caused by excessive tension in the laryngeal muscles. Our research builds upon previous investigations that compared LM expression between conversational and clear speech in healthy speakers (Boyce et al., 2013), as well as between healthy and dysphonic speakers (Ishikawa et al., 2020). We tested the following hypothesis: clear speech will yield a greater number of burst and syllabic onset LMs and a longer duration between [g] LMs than conversational speech.
2. Methods
2.1 Participants
The participants of this study were 34 treatment-seeking adults with MTD who agreed to participate in a randomized clinical trial of voice therapy involving gargle phonation. The recordings used in this study were collected as a subtask prior to the treatment trials. Of the 34 participants, 26 were female and 8 were male, with an average age of 53.5 years (SD ± 18). Fifteen participants were diagnosed with primary MTD (i.e., excessive laryngeal muscle tension as the primary cause of dysphonia), and 19 participants had secondary MTD (i.e., excessive laryngeal tension occurring as a result of another underlying laryngeal pathology). The diagnoses were established through clinical voice evaluation by a laryngologist and a speech-language pathologist (SLP). All participants were native speakers of American English and had no history of other speech-language disorders, neurological voice disorders, or hearing loss.
2.2 Recording procedure
The participants were seated in a clinical room and fitted with a headset microphone (C555L, AKG, Los Angeles, CA) placed 5 cm from the corner of the mouth at 45° off-axis. All speech was recorded directly to a solid-state recorder (DR-40X, TASCAM, Santa Fe Springs, CA) and digitized at a sampling rate of 44.1 kHz. The SLPs instructed the participants to read the six Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) sentences, first in a conversational style and then again in a clear speech style. Conversational speech was elicited by asking the participants to "please read these sentences as if you are speaking with a very good friend or family member." Clear speech was elicited by asking the participants to "please read these sentences again but this time read them as if you are leaving a very important message on someone's phone. Please articulate clearly so they can fully understand what you are saying."
2.3 Perceptual analysis for determining dysphonia severity
The severity of dysphonia was rated with the CAPE-V by four SLPs who specialize in the treatment of voice disorders. The recorded samples were presented to the SLPs via an online survey created in Qualtrics. The SLPs were instructed to listen to the recordings with headphones in a quiet room and were allowed to take a break at any time.
2.4 Perceptual analysis for speech styles
A listening test was conducted to verify the perceptual validity of the clear speech productions. The recordings of conversational and clear speech from the same speaker were presented to two SLPs who specialize in the treatment of voice disorders, with the presentation order of the two styles randomized. The SLPs were asked to identify which recording was more clearly articulated, or whether both recordings had the same level of clarity. Sample pairs in which the SLPs correctly identified the clear speech recording were considered valid productions of conversational and clear speech. The recordings were presented via Qualtrics; the SLPs were instructed to listen with headphones in a quiet room and were allowed to replay the recordings as many times as they wished.
2.5 Acoustic analysis
The recordings were manually edited to extract the CAPE-V sentences in both speech styles. The starting and ending points of each sentence were determined visually on a spectrogram, using Praat (Boersma and Weenink, 2018) for the editing. The listeners in the perceptual analysis described above agreed that 28 of 32 samples had a perceivable difference between conversational and clear speech; only these samples were subjected to LM analysis using the SpeechMark® MATLAB Toolbox version 1.1.2 (SpeechMark, 2018). The analysis yielded a sequence of LMs, each with its time of occurrence, and these times were used to obtain the duration between [+g] and [−g] LMs.
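As an illustration of this last step, the sketch below pairs each [+g] with the next [−g] and computes the intervening durations. The (time, label) tuple format is an assumption made for the example; the actual output format of the toolbox may differ.

```python
def glottal_durations(landmarks):
    """landmarks: iterable of (time_sec, label) pairs sorted by time."""
    durations, onset = [], None
    for t, label in landmarks:
        if label == "+g":
            onset = t                               # most recent voicing onset
        elif label == "-g" and onset is not None:
            durations.append(round(t - onset, 3))   # one voiced stretch
            onset = None
    return durations

# Made-up LM sequence for one short utterance:
seq = [(0.10, "+g"), (0.12, "+s"), (0.38, "-g"), (0.45, "+b"), (0.52, "+g"), (0.81, "-g")]
print(glottal_durations(seq))  # [0.28, 0.29]
```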
2.6 Statistical analyses
2.6.1 Perceptual data on dysphonia severity ratings
The ratings were averaged across all SLPs. A two-way mixed-effects intraclass correlation (ICC) model was used to evaluate the inter- and intra-rater reliability of the SLPs, with twenty percent of the data used to evaluate intra-rater reliability.
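As one way to reproduce this analysis, the sketch below computes a two-way mixed-effects, single-rater ICC, i.e., ICC(3,1), with the pingouin library. The file name and column names are hypothetical; any ICC implementation that supports this model would serve.

```python
import pandas as pd
import pingouin as pg

# Long-format ratings: one row per (sample, rater) pair.
df = pd.read_csv("capev_ratings.csv")  # hypothetical columns: sample, rater, severity

icc = pg.intraclass_corr(data=df, targets="sample", raters="rater", ratings="severity")
print(icc.loc[icc["Type"] == "ICC3", ["Type", "ICC", "CI95%"]])  # two-way mixed, single rater
```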
2.6.2 Acoustic data
Pairwise Wilcoxon tests with Bonferroni correction were used to evaluate the effect of speech style on the LM counts and the duration between the glottal LMs. With six feature-wise comparisons, the Bonferroni-adjusted threshold of significance was p = 0.008.
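The comparison could be scripted along these lines; the paired arrays below are toy values standing in for per-speaker LM features in the two styles, not the study's data (note that with only six toy pairs, the exact signed-rank test cannot reach p < 0.008).

```python
from scipy.stats import wilcoxon

# Hypothetical paired per-speaker values: (conversational, clear).
features = {
    "g count":     ([30, 33, 29, 35, 31, 28], [37, 40, 36, 41, 38, 34]),
    "+b count":    ([24, 26, 23, 27, 25, 22], [30, 32, 29, 33, 31, 28]),
    "g-g dur (s)": ([0.28, 0.31, 0.25, 0.33, 0.30, 0.27],
                    [0.36, 0.40, 0.33, 0.42, 0.39, 0.35]),
}
alpha = 0.05 / 6  # Bonferroni over the six feature-wise comparisons
for name, (conv, clear) in features.items():
    stat, p = wilcoxon(conv, clear)
    print(f"{name}: W = {stat}, p = {p:.3f} -> {'significant' if p < alpha else 'n.s.'}")
```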
This study was approved by Mayo Clinic's institutional review board (IRB) (Mayo IRB No. 20-004267).
3. Results
3.1 Dysphonia severity rating
Table 2 describes the dysphonia severity ratings of the participants. All participants' voices were rated below 60 in every rating category, and 88.24% of participants had an overall severity of 40 or less (i.e., no more than moderate dysphonia).
Table 2. Descriptive statistics of CAPE-V ratings by SLPs (0 = normal, 100 = most severely deviant). The values indicate the percentage of participants who fell into each severity range.
| Rating category | 0–10 | 11–20 | 21–30 | 31–40 | 41–50 | 51–60 |
|---|---|---|---|---|---|---|
| Overall severity | 14.71 | 38.24 | 17.65 | 17.65 | 2.94 | 8.82 |
| Breathiness | 58.82 | 14.71 | 11.76 | 2.94 | 2.94 | 8.82 |
| Roughness | 50.00 | 20.59 | 20.59 | 5.88 | 2.94 | 0.00 |
| Strain | 52.94 | 23.53 | 11.76 | 5.88 | 2.94 | 2.94 |
| Pitch | 91.18 | 0.00 | 5.88 | 2.94 | 0.00 | 0.00 |
| Loudness | 88.24 | 11.76 | 0.00 | 0.00 | 0.00 | 0.00 |
Inter-rater reliability among the four SLPs is shown in Table 3. It was highest for overall severity [ICC(3,1) = 0.74], followed by breathiness [ICC(3,1) = 0.70], strain [ICC(3,1) = 0.57], pitch [ICC(3,1) = 0.56], roughness [ICC(3,1) = 0.45], and loudness [ICC(3,1) = 0.35]. All raters had high intra-rater reliability, with ICC(3,1) values ranging from 0.74 to 0.88.
3.2 Landmark analysis
On average, conversational and clear speech generated a total of 139.74 (SD ± 24.48) and 165.67 (SD ± 26.96) LMs per sentence, respectively. For LM subtypes, conversational speech generated an average of 31.93 [+g/−g] LMs (SD ± 8.32), 24.96 [+b] LMs (SD ± 6.03), 20.67 [−b] LMs (SD ± 6.92), 13.07 [+s] LMs (SD ± 4.79), and 14.41 [−s] LMs (SD ± 5.34), while clear speech generated an average of 38.33 [+g/−g] LMs (SD ± 9.33), 30.96 [+b] LMs (SD ± 7.08), 24.30 [−b] LMs (SD ± 5.96), 15.14 [+s] LMs (SD ± 6.27), and 16.00 [−s] LMs (SD ± 4.73). The average duration between [+g] and [−g] LMs was 0.3 s (SD ± 0.36) for conversational speech and 0.388 s (SD ± 0.49) for clear speech (Fig. 2).
Fig. 2. Bar plot showing the average duration between glottal onset and offset LMs and the count of LMs. Error bars indicate standard error. Conv, conversational speech; Clear, clear speech.
Results of the pairwise Wilcoxon tests indicated that LM counts for clear speech were significantly greater for [+g/−g] and [+b] LMs (p = 0.001 and 0.007, respectively), and that the duration between the glottal LMs was significantly longer for clear speech than for conversational speech (p = 0.005). The counts of [−b], [+s], and [−s] LMs did not differ significantly between conversational and clear speech (p = 0.041, 0.214, and 0.140, respectively; all above the adjusted threshold of 0.008).
4. Discussion
This study aimed to evaluate the utility of LM-based features for detecting the acoustic changes associated with clear speech produced by individuals with MTD. Our hypothesis was that the number of burst and syllabic onset LMs, as well as the duration between glottal LMs, would differentiate conversational speech from clear speech. The results partially supported this hypothesis: clear speech yielded a significantly greater number of burst onset LMs and a longer duration between glottal LMs than conversational speech, whereas the number of syllabic onset LMs did not differ significantly between the two styles.
Hyperarticulation in clear speech would yield more clearly defined acoustic boundaries between speech sounds than conversational speech. It is well documented that lenition, a reduction in the distinctiveness of speech sounds, frequently occurs in conversational speech. Lenition makes the boundaries between voiced and voiceless regions less distinct, and this reduced distinctiveness likely underlies the smaller numbers of glottal and burst onset LMs in conversational speech. The longer duration between glottal LMs in clear speech corroborates the slower speech rate reported in previous studies. In addition to better-defined acoustic boundaries, the extended time between voicing onsets and offsets further distinguishes speech sounds, which would enhance intelligibility.
The nonsignificant finding for syllabic LMs may reflect the effect of dysphonia, as shown in the previous study (Ishikawa et al., 2020). Voicing difficulty likely prevented our participants from making the acoustic change required to generate syllabic onset LMs, namely, increasing the spectral energy of the voice across multiple frequency bands. The greater number of burst onset LMs and the longer duration between glottal LMs in clear speech suggest that the speakers relied on hyperarticulation of consonants and a slower speech rate for clear speech production. The significant increase in glottal LMs may indicate that the speakers produced speech in a more canonical, dictionary-like form.
Several methodological limitations should be noted. The order of speech styles was not randomized during recording: all participants recorded conversational speech first and clear speech second, so a practice effect may have inflated the participants' apparent ability to produce clear speech. Additionally, the dysphonia severity of the participants was limited to mild and moderate. Whether individuals with severe dysphonia can produce clear speech that is noticeable to listeners, and whether the burst LM would continue to predict the change in speech production, need to be examined in future studies. Lastly, the speech material limits the generalizability of the results to real-world communication scenarios: the recordings were of sentence reading, and spontaneous speech likely has greater frequency and intensity ranges, which may affect the LMs.
A strength of LMBAS is that it is a theoretically driven, knowledge-based system that allows users to evaluate the link between speech production and perception. In future work, the performance of LMBAS can be compared with that of other acoustic features commonly extracted for voice quality analysis, for the study of neuropsychiatric disorders that affect voice, and for emotion recognition (Agurto et al., 2019; Agurto et al., 2020; Bone et al., 2017; Cummins et al., 2015; Deshpande and Schuller, 2020; Eyben et al., 2010; Harati et al., 2018; Huang et al., 2018; König et al., 2015; Low et al., 2020; Maor et al., 2020; Marmar et al., 2019; Norel et al., 2018; Orozco-Arroyave et al., 2016; Perez et al., 2018; Pinkas et al., 2020; Rusz et al., 2011; Sara et al., 2020). Such features include autocorrelation, zero-crossing rate, entropy and entropy ratios across targeted spectral ranges, energy/intensity, Mel- and Bark-frequency cepstral coefficients (MFCC), linear predictive coefficients (LPC), perceptual linear predictive coefficients (PLP), perceptual linear predictive cepstral coefficients (PLP-CC), spectral features, psychoacoustic sharpness, spectral harmonicity, F0, F0 harmonic ratios, jitter/shimmer, and a variety of statistical and mathematical summary measures of these frame-level values. These measures should also be applied to a control population of healthy voices to explore differences between individuals with MTD and individuals with healthy voices. As the research progresses, the dataset could be expanded to incorporate other voice disorders, as well as other disorders that affect vocal function, so that differential analysis is possible.
5. Conclusions
In conclusion, LM-based features detected the difference between conversational and clear speech in patients with mild to moderate MTD. Future studies should examine whether the current findings generalize to samples of spontaneous speech and to individuals with severe MTD, as well as other types of dysphonia.