Anxiety disorders (AD) and major depressive disorder (MDD) are growing in prevalence, yet many people suffering from these disorders remain undiagnosed due to known perceptual, attitudinal, and structural barriers. Methods, tools, and technologies that can overcome these barriers and improve screening rates are needed. Tools based on automated acoustic analysis of the voice could help bridge this gap. Comorbid AD/MDD presents additional challenges since some effects of AD and MDD oppose one another. Here, machine learning models that use acoustic and phonemic data from verbal fluency tests to discern the presence of comorbid AD/MDD are presented, with a best result of F1 = 0.83.
1. Introduction
Anxiety disorders (AD) and major depressive disorder (MDD) are prevalent mental health conditions, affecting 19.1% and 8.3% of U.S. adults, respectively,1,2 and are leading causes of disability, especially in individuals under 40.1,2 Despite their prevalence, treatment rates are low—36.9% for AD and 61.0% for MDD1,2—due to attitudinal, perceptual, and structural barriers like stigma, low self-perception of need, and limited access to care.3 Comorbidities exacerbate and mask these disorders, worsening health outcomes.2,4 The COVID-19 pandemic further increased AD and MDD rates to 33% and 30%, respectively.5
Traditional screening methods are often inefficient, clinician-dependent, and biased. Voice-based acoustic tests, operable across various platforms and devices, offer a promising solution that could address both challenges in current screening methods and known barriers to care. Recent studies explored relationships between acoustic speech characteristics and AD/MDD, with a particular focus on inference of the presence or severity of AD and MDD via machine learning/signal processing (MLSP) techniques.6–14 For example, MDD is associated with decreased speaking rate and reduced pitch variability. Conversely, AD increases speaking rate and pitch variability.15 Compared to speakers without either AD or MDD, speakers with MDD have lower fundamental frequency (F0) and lower spectral tilt. In contrast, speakers with AD tend to have higher F0.15,16 The literature is unclear about the relationship between spectral tilt and AD; however, speakers experiencing acute stress (frequently associated with AD17) tend to have higher spectral tilt.18 Jitter and shimmer values (indicating instability in frequency and amplitude, respectively), however, are elevated in speakers with either AD or MDD.15 With respect to silence in speech, AD is positively correlated with an increase in both percent pause time and the number of pauses,19,20 while increased MDD severity is associated with longer and more variable pause times and longer total pause time, but not more pauses.21 The acoustic markers for AD and MDD thus often oppose each other, raising questions about the voice characteristics of individuals with comorbid AD/MDD. Many prior depression studies neither screened participants for AD nor considered AD in the inclusion/exclusion criteria; the same is true of anxiety studies.6,8,9 The Netherlands Study of Depression and Anxiety (NESDA) showed that 67% of participants with a primary depression diagnosis had comorbid anxiety; similarly, 63% of participants with a primary anxiety diagnosis also had comorbid depression.22 Clearly understanding the comorbid AD/MDD speaker profile, in a context likely to highlight differences from healthy speakers, is therefore an important foundation for acoustic analysis of AD/MDD and for the development of clinical screening tools.
Verbal fluency tests (VFT), which assess word generation in a limited timeframe, have shown performance differences among healthy individuals, those with AD, and those with MDD. VFTs ask for the naming of as many words as possible within a category (semantic) or that start with a given sound (phonemic).23 In individuals with MDD, both semantic and phonemic verbal fluency were impaired.8 In children with AD, the number of clusters (groups of words that start with similar sounds) was associated with AD severity.24 Because they are timed, VFTs also place a small amount of stress (which is associated with the development of comorbid AD/MDD25,26) on the participant. Further, people with MDD have higher cortisol levels in response to stress and longer recovery times from psychological stressors.27 Given (1) the relationships among stress, AD, and MDD, and (2) the differences in verbal fluency performance among people with AD, people with MDD, and healthy people, verbal fluency test data were selected for exploration. This study investigates the research question: “How effectively can machine learning distinguish individuals with comorbid AD/MDD from healthy controls using acoustic and phonemic analysis of semantic VFT data?” Feature analysis was also conducted to identify key acoustic and phonemic differences, enhancing model explainability.
2. Dataset
A dataset was created for this study, including demographic information, health history, verbal fluency test data, language information, depression and anxiety screening results, and audio-video recordings. Participants were recruited via flyers, mass emails, and psychiatric outpatient clinician referrals, and provided IRB-approved informed consent either in person or online using HIPAA-compliant REDCap software.
After consent was obtained, demographic and health screening questionnaires were administered to determine eligibility. Participants were eligible if they were at least 18 years old and screened positive for co-occurring AD and MDD using the PHQ-8 (the Patient Health Questionnaire-9, or PHQ-9, depression screener with the suicide question removed) and the GAD-7 (Generalized Anxiety Disorder-7) screening instruments,28,29 defined as PHQ-8 ≥ 5 and GAD-7 ≥ 5, or were classified as controls, defined as PHQ-8 ≤ 4 and GAD-7 ≤ 4. Participants were selected to represent the range of severity across AD/MDD. Exclusion criteria included any substance use disorder (other than tobacco) not in remission; neurocognitive disorders; other serious psychiatric disorders (e.g., schizophrenia, personality disorders); serious heart disease; Huntington's disease; Parkinson's disease; amyotrophic lateral sclerosis (ALS); and other serious neurologic disease. All participants were fluent in English (with 10% identifying English as a second language) to maintain homogeneity in language patterns.
Interviews were conducted by a trained study team facilitator via a secure televideo platform (PHI Zoom), with separate audio recordings for the participant and facilitator. The complete set of survey instruments was administered via REDCap during testing, and the PHQ-8 and GAD-7 were purposely re-administered on the testing day to ensure accurate scores. Zoom and computer settings were configured on both channels to prevent or minimize audio/video disturbances; the configuration precluded the use of filters, echo cancellation, special effects, and background noise suppression, and mandated the use of headphones unless echo was undetectable by listening. Each session produced 32 kHz mono m4a participant-only, facilitator-only, and combined audio recordings and mp4 (1280 × 720, H.264, MPEG-R AAC) audio-video recordings. Participants could choose to test in a clinic or a quiet environment of their preference, with strict guidelines to minimize audio-visual noise: no extraneous audio or visual noise, no other people present, and a suitable recording device held stationary and square on a level surface. Participants were instructed to remove masks and hats, face the camera head-on, and frame their video to fill the screen top-to-bottom from head to elbow. Study facilitators checked for compliance and standardization of the setup.
The study included 41 female participants, aged 19–71, divided into two groups: 20 positive for AD and MDD, and 21 controls. Only female participants were included in this pilot study to focus on a single gender, given known gender differences in voice characteristics. The average and standard deviation of age were similar across groups (38.2 ± 16.7 years for controls, 38.4 ± 14.6 for AD/MDD). A composite AD/MDD score was calculated by summing the GAD-7 and PHQ-8 scores, defining composite severity levels based on the severity ranges in the GAD-7 and PHQ-9: (1) Controls (0–8), (2) Mild (9–18), (3) Moderate (19–28), and (4) Severe (29+). Since the suicide question was removed from the PHQ-9, the moderately severe and severe categories in the PHQ-8 were together classified as “severe.” Using these ranges, at testing time, approximately 50% of AD/MDD participants were in the mild category, 25% moderate, and 25% severe. The distributions of GAD-7 and PHQ-8 scores are summarized in Table 1. The small difference between the GAD-7 and PHQ-8 scores shows that severity levels of AD and MDD were similar within participants; for example, most participants with moderate depression also had moderate anxiety.
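For illustration, a minimal sketch of the composite scoring and binning described above; the function name and implementation are ours, but the thresholds follow the ranges given in the text.

```python
def composite_severity(phq8: int, gad7: int) -> str:
    """Bin the summed PHQ-8 + GAD-7 score into the composite levels defined above."""
    total = phq8 + gad7
    if total <= 8:
        return "Control"
    elif total <= 18:
        return "Mild"
    elif total <= 28:
        return "Moderate"
    return "Severe"

# Example: moderate depression (PHQ-8 = 12) with moderate anxiety (GAD-7 = 11).
print(composite_severity(12, 11))  # -> "Moderate"
```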
Table 1. Distribution (means and standard deviations) of PHQ-8 scores, GAD-7 scores, combined GAD-7 + PHQ-8 scores, and the absolute difference between GAD-7 and PHQ-8 scores, across all participants and within the AD/MDD and control groups.

| | Controls (μ/σ) | AD/MDD (μ/σ) | All (μ/σ) |
|---|---|---|---|
| PHQ-8 | 1.5/1.3 | 10.7/4.0 | 5.5/5.4 |
| GAD-7 | 1.7/1.5 | 9.4/4.4 | 5.0/4.9 |
| PHQ-8 + GAD-7 | 3.2/2.0 | 20.2/7.3 | 10.9/9.9 |
| \|PHQ-8 − GAD-7\| | 1.1/1.2 | 3.1/3.0 | 2.0/2.3 |
All participants completed the VFT at the end of the interview (with only results from the VFT portion reported here). The prompt for the VFT was: “Tell me the names of as many animals as you can. Name them as quickly as possible. Any animals will do; they can be from the farm, the jungle, the ocean, or house pets. For instance, you could begin with dog… Ready? Begin.”
Audio recordings were converted to 16-bit, 32 kHz WAV format prior to analysis. Data before the verbal fluency prompt and after the one-minute limit were trimmed, and audio volume was normalized to a standard level.
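The paper does not name the conversion tooling; the sketch below illustrates the stated preprocessing steps (format conversion, trimming, loudness normalization) with the pydub library. File names, the prompt-end offset, and the target loudness level are illustrative assumptions.

```python
from pydub import AudioSegment

# Load the participant-only m4a recording and convert to 16-bit, 32 kHz mono.
audio = AudioSegment.from_file("participant.m4a", format="m4a")
audio = audio.set_channels(1).set_frame_rate(32000).set_sample_width(2)

# Trim to the one-minute VFT window (the prompt-end offset here is illustrative).
prompt_end_ms = 5_000
audio = audio[prompt_end_ms:prompt_end_ms + 60_000]

# Normalize volume to a standard level (target dBFS is an assumption).
target_dbfs = -20.0
audio = audio.apply_gain(target_dbfs - audio.dBFS)

audio.export("participant_vft.wav", format="wav")
```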
3. Description of features
3.1 Acoustic features
For acoustic analysis and machine modeling in Experiments 1, 2, and 4 (see the following), acoustic features from the OpenSMILE30 ComParE-16 feature set were extracted. Frame-level, or “low-level descriptor” (LLD), features were extracted using a range of window sizes between 60 ms and 44 s with a 10-ms advance, while summary (SUM) features (functionals, often statistics, computed across sequences of frame-level LLDs) were extracted using 60-ms sliding windows with the same 10-ms advance, yielding up to 6373 acoustic features suitable for paralinguistic analysis, as demonstrated in the INTERSPEECH 2016 Computational Paralinguistics Challenge.31
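As a sketch of this extraction step, the opensmile Python package exposes both feature levels of the ComParE 2016 set (whether the study used this wrapper or the standalone toolkit is not stated; the file name is illustrative, and windowed SUM extraction would additionally pass start/end offsets to process_file):

```python
import opensmile

# Frame-level low-level descriptors (LLDs) from the ComParE 2016 feature set.
lld_extractor = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)
lld = lld_extractor.process_file("participant_vft.wav")  # one row per 10-ms frame

# Summary (SUM) features: functionals computed over the LLD contours.
sum_extractor = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)
summary = sum_extractor.process_file("participant_vft.wav")  # up to 6373 columns

print(lld.shape, summary.shape)
```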
3.2 Phonemic features
Phonemic analysis (see Experiments 3 and 5) used the small OpenAI Whisper pretrained model (whisper-small.en)32 to transcribe VFT recordings, with corrections made by a study team member. Structural and sequential phonemic similarity features were extracted, inspired by a recent study using verbal fluency data for the detection of the at-risk condition for schizophrenia.34 Structural data summarized phonemic similarity, complexity, and diversity, while sequence data detailed sound progression. Words were phonetically transcribed using an extended phonemic dictionary33 and mapped onto a phonemic similarity graph, where nodes represented words sharing initial sounds. The graph typically spanned up to three levels, representing phonemic characteristics. Figure 1 shows a sample phonemic graph.
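A minimal sketch of the transcription step with the openai-whisper package (the manual correction pass is not shown; the file name is illustrative):

```python
import whisper

# Load the pretrained small English-only model referenced above.
model = whisper.load_model("small.en")

# Transcribe the trimmed VFT recording; transcripts were then hand-corrected.
result = model.transcribe("participant_vft.wav")
print(result["text"])
```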
Fig. 1. This graph illustrates the phonemic similarity of the following animal names: “giraffe, gerbil, cow, chimp, chicken, llama, lions, liger, lemur.” The top-level nodes contain words starting with the same sound. The second-level nodes contain words starting with the same two sounds. In this example, no words begin with the same three sounds, resulting in a graph with only two levels.
Phonemic structure was quantified with a vector of 37 features, including the number of nodes, the graph depth, the number of nodes at each level (to a depth of 7), the number of words per node per level (μ and σ), and the number of sounds per word (per node per level).
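A simplified sketch of the prefix-graph construction and a few of the structural metrics, using toy phoneme codes for the Fig. 1 example (the study used an extended phonemic dictionary, and the exact node/level conventions here are our assumptions):

```python
from collections import defaultdict

# Toy phoneme codes for the nine example words in Fig. 1 (illustration only).
PHONES = {"giraffe": ["J", "I", "R"], "gerbil": ["J", "ER", "B"],
          "cow": ["K", "AW"], "chimp": ["CH", "I", "M"],
          "chicken": ["CH", "I", "K"], "llama": ["L", "AA", "M"],
          "lions": ["L", "AY", "N"], "liger": ["L", "AY", "G"],
          "lemur": ["L", "E", "M"]}

def prefix_graph(phones, max_levels=7):
    """Level-k nodes group words whose first k phonemes match."""
    graph = {}
    for level in range(1, max_levels + 1):
        nodes = defaultdict(list)
        for word, seq in phones.items():
            if len(seq) >= level:
                nodes[tuple(seq[:level])].append(word)
        graph[level] = dict(nodes)
    return graph

def structure_summary(graph):
    """An illustrative subset of the 37 structural metrics named above."""
    shared = [lvl for lvl, nodes in graph.items()
              if any(len(words) > 1 for words in nodes.values())]
    summary = {"graph_depth": max(shared, default=0)}
    for lvl, nodes in graph.items():
        sizes = [len(words) for words in nodes.values()]
        summary[f"level{lvl}_num_nodes"] = len(nodes)
        summary[f"level{lvl}_mean_words_per_node"] = (
            sum(sizes) / len(sizes) if sizes else 0.0)
    return summary

# Depth is 2 for the Fig. 1 example: no three-sound prefix is shared.
print(structure_summary(prefix_graph(PHONES, max_levels=3)))
```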
Phonemic sequence information was determined by word order. In the example given in Fig. 1, the nine words start with four distinct sounds. The first two words form a “cluster” of size two that begins with the same single sound. The word “cow” forms its own “cluster” of size one, followed by a cluster of size two starting with the same sound, followed by a final cluster of size four starting with a different sound. The sequence has four clusters with three “switches,” or changes, between them. The sequence analysis examined words sharing up to the same six starting sounds and was summarized by a vector of 30 features that included metrics on (1) the number of switches between clusters, where clusters are consecutive words starting with the same first 1–6 sounds, (2) the number of words per cluster (μ and σ), and (3) the size of words across clusters (μ and σ). Additional features included the number of unique animal names, the total number of names spoken, and the ratio of valid to total names, producing 71 phonemic features in total.
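The cluster/switch counting can be sketched directly from the word order; the helper below reproduces the Fig. 1 example (four clusters of sizes 2, 1, 2, and 4 with three switches), with toy first-sound codes standing in for dictionary phonemes:

```python
# Toy first-sound codes for the nine example words (illustration only).
FIRST_SOUND = {"giraffe": "J", "gerbil": "J", "cow": "K", "chimp": "CH",
               "chicken": "CH", "llama": "L", "lions": "L", "liger": "L",
               "lemur": "L"}

def clusters_and_switches(sequence, key):
    """Split an ordered word list into runs ("clusters") of consecutive words
    sharing the same key (e.g., the same first sound); the number of switches
    is the number of cluster boundaries."""
    clusters = [[sequence[0]]]
    for word in sequence[1:]:
        if key(word) == key(clusters[-1][-1]):
            clusters[-1].append(word)
        else:
            clusters.append([word])
    return clusters, len(clusters) - 1

order = ["giraffe", "gerbil", "cow", "chimp", "chicken",
         "llama", "lions", "liger", "lemur"]
clusters, switches = clusters_and_switches(order, key=lambda w: FIRST_SOUND[w])
print([len(c) for c in clusters], switches)  # [2, 1, 2, 4] 3
```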
4. Experiments
4.1 Classification experiments
Experiment 1: This experiment classified comorbid AD/MDD vs healthy controls using acoustic features extracted from VFT data. Random forest (RF), K-nearest neighbors (KNN), and bagging (BAG) models were trained using both OpenSMILE LLD and SUM features and were selected based on the small dataset, explainability needs, and non-linear data separability. Balanced training was achieved via random upsampling when needed, and models were evaluated with threefold nested cross-validation using F1 scores. Recursive feature elimination was utilized during model training to optimize the selection of informative features.
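A simplified sketch of this setup with scikit-learn is shown below for the RF case; the nested hyperparameter search and within-fold upsampling are omitted, and the placeholder matrix stands in for the ComParE-16 SUM features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Placeholder data: one row of SUM features per recording; 1 = AD/MDD, 0 = control.
rng = np.random.default_rng(0)
X = rng.normal(size=(41, 200))
y = np.array([1] * 20 + [0] * 21)

model = Pipeline([
    # Recursive feature elimination guided by a random forest.
    ("rfe", RFE(RandomForestClassifier(n_estimators=200, random_state=0),
                n_features_to_select=50, step=0.1)),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
])

# Threefold cross-validation scored with F1 (the study additionally nested
# hyperparameter tuning and upsampled the minority class in each training fold).
scores = cross_val_score(model, X, y, scoring="f1",
                         cv=StratifiedKFold(3, shuffle=True, random_state=0))
print(scores.mean(), scores.max())
```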
Experiment 2: Acoustic deep learning models were evaluated to classify comorbid AD/MDD vs healthy controls using VFT acoustics. Options were limited by the small data size. The long short-term memory (LSTM) model was chosen for its ability to learn complex, long-range patterns and was trained on OpenSMILE LLD features. Given observed differences in voice quality, pauses, vocal variation, and the use of filler utterances in our data, varying window sizes (1, 2, 3, 8, 16, 32, 36, 40, and 44 s) and learning rates (LRs) (0.001 and 0.0005) were explored and applied to the VFT data, with a 10-ms window advance. Balanced training data were achieved via random upsampling, and threefold nested cross-validation was used. F1 scores are reported.
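A minimal PyTorch sketch of an LSTM window classifier in this spirit follows; the layer sizes, LLD dimensionality, and training loop are illustrative assumptions, not the study's reported architecture:

```python
import torch
from torch import nn

class VFTLSTM(nn.Module):
    """Sequence classifier over a window of frame-level LLD features."""
    def __init__(self, n_features=65, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):             # x: (batch, frames, n_features)
        _, (h, _) = self.lstm(x)      # final hidden state summarizes the window
        return self.head(h[-1]).squeeze(-1)  # logit: AD/MDD vs control

# A 32-s window with a 10-ms frame advance gives roughly 3200 frames per example;
# 65 LLD channels is an assumption (including deltas would double it).
model = VFTLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # LR = 0.001
loss_fn = nn.BCEWithLogitsLoss()

x = torch.randn(8, 3200, 65)           # placeholder batch of LLD windows
y = torch.randint(0, 2, (8,)).float()  # placeholder labels
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
print(float(loss))
```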
Experiment 3: This experiment classified comorbid AD/MDD vs healthy controls using phonemic similarity features extracted from VFT data. RF, KNN, and BAG models were selected for the same reasons given in Experiment 1. As in the previous experiments, training data were balanced across conditions via random upsampling and threefold nested cross-validation was employed. Recursive feature elimination optimized feature selection.
4.2 Acoustic feature analysis
Experiment 4: To understand acoustic differences between AD/MDD voices and the voices of those without AD/MDD, the OpenSMILE acoustic features were ranked for their ability to separate conditions, using statistical methods [t-tests and analysis of variance (ANOVA) F-value scores]. The top ranked acoustic features are reported and discussed.
Experiment 5: To understand phonemic differences between AD/MDD voices and voices without AD/MDD, the phonemic similarity features were ranked, as in Experiment 4, for their ability to separate conditions using statistical methods (t-tests and ANOVA F-value scores). The top-ranked phonemic features are reported and discussed.
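A sketch of this ranking step on placeholder data (with two groups, the ANOVA F-score is the square of the t statistic, so the two rankings agree):

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.feature_selection import f_classif

# Placeholder feature matrix: rows = recordings; 1 = AD/MDD, 0 = control.
rng = np.random.default_rng(0)
X = rng.normal(size=(41, 100))
y = np.array([1] * 20 + [0] * 21)
feature_names = [f"feat_{i}" for i in range(X.shape[1])]

# ANOVA F-scores for each feature against the two conditions.
f_scores, _ = f_classif(X, y)

# Per-feature independent-samples t-tests for reference.
t_stats, p_vals = ttest_ind(X[y == 1], X[y == 0], axis=0)

for i in np.argsort(f_scores)[::-1][:10]:
    print(feature_names[i], round(f_scores[i], 2), round(p_vals[i], 3))
```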
5. Results
5.1 Classification experiment results
Experiment 1: The best model separating comorbid AD/MDD VFT speech from controls using classic machine learning models and OpenSMILE ComParE-16 acoustic features was the RF model trained with summary features, with best-fold performance of F1 = 0.76 and average-across-folds performance of F1 = 0.70. Models trained with SUM features outperformed those trained with LLD features in every case, and RF models outperformed the other classic models overall (see Fig. 2).
Fig. 2. F1 scores of RF, KNN, and BAG models trained to separate the AD/MDD and control conditions. Models were trained using OpenSMILE ComParE-16 LLD, OpenSMILE ComParE-16 SUM, and phonemic (PHO) features extracted from semantic VFT data. The mean F1 score across folds and the best single-fold F1 score are displayed for each model type.
Experiment 2: Results of the LSTM models are summarized in Fig. 3. The best LSTM model (32-s frames, LR = 0.001) separated AD/MDD VFT speech from controls at F1 = 0.83, with an average-across-folds performance also of F1 = 0.83, indicating a stable model. The faster learning rate of 0.001 resulted in better, more stable models than the slower learning rate of 0.0005, and a window size between 32 and 36 s produced the best, most stable models.
Fig. 3. F1 scores of LSTM models trained to separate the AD/MDD and control conditions. Models were trained using OpenSMILE ComParE-16 LLD features, with varying LRs and window sizes (1–44 s). The best LSTM model (F1 = 0.83) used 32-s frames with LR = 0.001.
Experiment 3: Results of RF, KNN, and BAG models trained with phonemic similarity features applied to VFT data are compared in Fig. 2. The best-performing model for this experiment was BAG (best F1 = 0.62; average F1 = 0.60). This experiment demonstrated the presence of a stable signal, but one that was weaker than the results obtained from the general acoustic analysis.
5.2 Acoustic feature analysis results
Experiments 4 and 5: Table 2 presents the top ten highest-performing features used to train the leading acoustic and phonemic models. The OpenSMILE30 SUM features capture general acoustic differences between voices with and without comorbid AD/MDD, while the phonemic features highlight acoustic differences attributable to linguistic content differences.
Table 2. The highest-ranked acoustic and phonemic features from the top-performing models. The strongest acoustic OpenSMILE30 SUM features are functionals applied across OpenSMILE LLD features, including mel-frequency cepstral coefficients (1), RASTA-filtered bands (2), (9), (10), the auditory spectrum vector sum (3), (5), (6), (7), (8), and spectral kurtosis (4). The strongest phonemic features relate to word size (1), (4), the mean/stddev number of valid words at different levels in the graph (6), (9), (10) or in clusters (5), the number of switches at different levels in the graph (2), (3), the number of starting sounds of words (7), and the graph depth (8). Features marked with “*” ranked equally.
| Rank | OpenSMILE SUM Features | Phonemic Features |
|---|---|---|
| 1 | mfcc_sma_de[11]_minPos | L0_sigma_word_size_unique |
| 2 | audSpec_Rfilt_sma[22]_upleveltime50 | num_switches_L2* |
| 3 | audSpec_lengthL1norm_sma_qregerrQ | num_switches_L3* |
| 4 | pcm_fftMag_spectralKurtosis_sma_de_upleveltime50 | L1_u_word_size_all_words_cluster* |
| 5 | audSpec_lengthL1norm_sma_upleveltime50 | L1_u_num_all_words_cluster |
| 6 | audSpec_lengthL1norm_sma_linregerrQ | valid_word_ratio |
| 7 | audSpecRasta_lengthL1norm_sma_peakRangeRel | L0_num_nodes |
| 8 | audspec_lengthL1norm_sma_upleveltime75 | max_depth |
| 9 | audspec_Rfilt_sma[6]_peakRangeRel | L1_sigma_num_unique_words |
| 10 | audSpec_Rfilt_sma[22]_upleveltime50 | L3_u_num_unique_words |
The strongest feature from the best acoustic model identifies where the minimum change in mel-frequency cepstral coefficient 11 occurs; this may be related to differences in silence patterns and/or voice quality in people with AD/MDD. Many of the other best-performing acoustic features were OpenSMILE functionals applied across the audSpec_lengthL1norm_sma (sum of auditory spectrum vectors) or audspec_Rfilt_sma[x] (RASTA-filtered auditory spectrum, band x) LLD features. The qregerrQ functional measures the error between the spoken contour and its quadratic regression fit, while linregerrQ captures the error between the contour and its linear regression fit, allowing for comparisons of how the AD/MDD and control groups track regression contours. The upleveltimeXX functional quantifies the time the signal remains above a specified loudness threshold (XX% * range + min), indicating differences in loudness variation between conditions. The peakRangeRel functional measures the range between maximum and minimum peak values in the signal, highlighting differences in overall loudness levels.
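As an illustration of how two of these functionals can be interpreted, the numpy sketch below follows the verbal descriptions above rather than the openSMILE source, so the exact normalizations are assumptions:

```python
import numpy as np

def upleveltime(contour, pct):
    """Fraction of frames the contour spends above (pct * range + min),
    per the description of upleveltimeXX above."""
    lo, hi = contour.min(), contour.max()
    return float(np.mean(contour > lo + pct * (hi - lo)))

def linreg_err_q(contour):
    """Mean squared error of a linear fit over time, our reading of linregerrQ."""
    t = np.arange(contour.size)
    slope, intercept = np.polyfit(t, contour, 1)
    return float(np.mean((contour - (slope * t + intercept)) ** 2))

rng = np.random.default_rng(0)
loudness = np.abs(np.sin(np.linspace(0, 20, 2000))) + 0.05 * rng.normal(size=2000)
print(upleveltime(loudness, 0.50), linreg_err_q(loudness))
```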
RASTA filtering (indicated by Rfilt or Rasta in the feature name) effectively suppresses spectral components that vary at rates differing from typical speech while accentuating sonic “edges,” which delineate changes in sound, akin to outlining objects in an image. The strongest phonemic feature, L0_sigma_word_size_unique, indicated reduced variability in phonemic complexity among individuals with AD/MDD, who tended to use shorter, simpler names. The second and third ranked features revealed notable differences in overall phonemic similarity and the frequency of switching between consecutive groups of words starting with the same two or three sounds; the AD/MDD group exhibited fewer switches.
The fourth and fifth features focused on the number and complexity of consecutive words beginning with the same two sounds, showing that the AD/MDD group had lower complexity and fewer words in these clusters. Additional observations included that the AD/MDD group had (1) more repeated or invalid animal names, (2) fewer starting sounds in their word lists, (3) shallower similarity graphs, (4) less variation in the number of words beginning with the same two sounds, and (5) differences in the number of words in groups starting with the same three sounds.
6. Discussion
The study experiments successfully detected and characterized comorbid AD/MDD using only one-minute semantic VFT data, achieving F1 = 0.70–0.83 with a limited dataset of 41 fluent English-speaking females, 10% of whom designated English as a second language. While results were encouraging, the small sample size and skew toward mild cases (50%) restricted the use of many advanced deep learning techniques, limited model generalizability, and increased the challenge of separating AD/MDD vs controls. In addition, the inclusion criteria enabled the isolation of AD/MDD characteristics by avoiding confounds from other disorders (e.g., cognitive limitations and voice/movement changes), but this further narrowed diversity and limited generalizability.
Our experiments specifically examined comorbid AD/MDD and used a more comprehensive feature set (OpenSMILE) than some of the prior work. Jitter and shimmer features did not perform as well as other features in our models. That said, several top-ranked features did point to differences in overall signal level variation, along with differences in the time the signal remained above a threshold; this may align with prior findings relating to differences in silence patterns, shimmer, and general expressivity. We observed a lower speaking rate in AD/MDD participants, whereas prior work found speaking rate reduced in MDD but increased in AD.
Finally, future studies should account for individual cognitive differences unrelated to AD/MDD, such as language deficits or education levels. For instance, groups could be matched by education or data normalized using vocabulary tests and education measurements. In our dataset, 40% of the control group had at least a Master's degree, compared to 30% in the AD/MDD group.
7. Conclusions
The experiments in this paper demonstrated F1 scores of approximately 0.70–0.83 in detecting comorbid AD/MDD using data from a one-minute semantic verbal fluency test. The study utilized a dataset of 41 participants, curated to exclude confounding conditions and allow for a focused examination of AD/MDD characteristics. Analysis of acoustic and phonemic models highlighted their roles in characterizing AD/MDD, with baseline models providing explanatory insights. Future work will expand the dataset for diversity, include additional speech types (e.g., interviews, read speech), integrate multimodal signals (e.g., video), explore data augmentation techniques to improve stability, and examine other disorders to support differential screening and model generalizability.
Acknowledgments
This work was supported by the Jump ARCHES program, a collaboration between OSF HealthCare, the University of Illinois Urbana-Champaign, and the University of Illinois College of Medicine Peoria. The authors thank Conner Driver, Cory Mahler, Anvesh Jalasutrum, Dustin Pilat, Abigail Sebald, and Benjamin Finkenbine for assistance with recruitment of participants and/or data acquisition.
Author Declarations
Conflict of Interest
The authors have no conflicts to disclose.
Ethics Approval
All methods and procedures in this work were approved by the IRB at the University of Illinois College of Medicine Peoria.
Data Availability
As the data contain potentially identifiable content, the data will be made available with the execution of written data sharing agreements.