The voice of COVID-19: Acoustic correlates of infection in sustained vowels

COVID-19 is a global health crisis that has been affecting our daily lives throughout the past year. The symptomatology of COVID-19 is heterogeneous with a severity continuum. Many symptoms are related to pathological changes in the vocal system, leading to the assumption that COVID-19 may also affect voice production. For the first time, the present study investigates voice acoustic correlates of a COVID-19 infection based on a comprehensive acoustic parameter set. We compare 88 acoustic features extracted from recordings of the vowels /i:/, /e:/, /u:/, /o:/, and /a:/ produced by 11 symptomatic COVID-19 positive and 11 COVID-19 negative German-speaking participants. We employ the Mann-Whitney U test and calculate effect sizes to identify features with prominent group differences. The mean voiced segment length and the number of voiced segments per second yield the most important differences across all vowels indicating discontinuities in the pulmonic airstream during phonation in COVID-19 positive participants. Group differences in front vowels are additionally reflected in fundamental frequency variation and the harmonics-to-noise ratio, group differences in back vowels in statistics of the Mel-frequency cepstral coefficients and the spectral slope. Our findings represent an important proof-of-concept contribution for a potential voice-based identification of individuals infected with COVID-19.


Introduction
In December 2019 and early January 2020, a cluster of pneumonia cases with unknown cause emerged in China's Hubei Province. The pneumonia was found to be caused by a novel coronavirus named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease spread quickly and the first known cases outside of China were identified in mid-January. On 11 February 2020, the World Health Organization (WHO) announced that the disease caused by SARS-CoV-2 would be named COVID-19. A month later, the WHO declared COVID-19 a pandemic. A year after the emergence of COVID-19, 71 919 725 confirmed cases including 1 623 064 deaths had been reported to the WHO.
The severity of COVID-19 is heterogeneous, ranging from asymptomatic infections or mild flu-like symptoms to severe illness and death. Chest CT [1] and post-mortem biopsies [2,3] found characteristic pathological changes in patients with COVID-19 and suggest that the lung is the organ that is primarily affected by the disease. Common symptoms of COVID-19 include fever, cough, shortness of breath, weakness, muscle pain, loss of taste and/or smell as well as the ear-nose-throat manifestations sore throat and headache [4]. Less common ear-nose-throat manifestations of COVID-19 are tonsil enlargement, pharyngeal erythema, nasal congestion, rhinorrhea, and upper respiratory tract infection [5]. Lechien and colleagues [6] reported dysphonia for 26.8 % of their investigated patients with mild-to-moderate COVID-19 symptoms. The authors further found a greater severity of COVID-19 symptoms in dysphonic patients compared to non-dysphonic patients. In general, a great proportion of the symptoms associated with COVID-19 affect anatomical correlates of speech production. Important components of the vocal system are the lungs and the lower airway producing the airflow, the vocal folds whose vibrations produce the voice sound, and the vocal and nasal tracts modifying the voice source to produce specific phones, cf. [7].
Voice changes have been repeatedly reported for a number of diseases related to pathological changes in components of the vocal system. For example, patients with asthma were found to differ from healthy controls in maximum phonation time (MPT), shimmer, harmonics-to-noise ratio (HNR), jitter, fundamental frequency (F0), first vowel formant (F1), and second vowel formant (F2) [8,9]. Singh Walia and Sharma [10] found the severity of asthma to be related to jitter [%]. Jitter values derived from recordings of the sustained phonation of the vowel /a:/ were 0.25 for healthy males, 0.41 for males with mild asthma, 0.9 for males with moderate asthma, and 1.83 for males with severe asthma [10]. Petrović-Lazić and colleagues [11] reported that jitter, shimmer, F0 variation, voice turbulence index (VTI), pitch perturbation quotient (PPQ), amplitude perturbation quotient (APQ), and HNR values differed between patients with vocal fold polyps and healthy controls. Type and size of vocal fold polyps were found to have effects on jitter and HNR [12]. Male and female patients with unilateral vocal fold paralysis were found to differ from healthy gender-matched controls in jitter, shimmer, HNR, standard deviation of F0, and standard deviation of the frequency of F2 [13]. In addition, the males with unilateral vocal fold paralysis differed significantly from the male controls in F1 and F2 frequency values as well as in the standard deviation of the frequency of F1 [13]. Segura-Hernández and colleagues [14] investigated voice characteristics in children with cleft lip and palate before and after speech and language pathology intervention and compared their findings with the voice characteristics of healthy controls. They found that jitter and shimmer were significantly higher in the patients with cleft lip and palate before intervention, whereas the two groups did not differ in these parameters after intervention. In contrast, intervention had no effect on hypernasality, a prominent voice characteristic of patients with cleft lip and palate.
These findings, demonstrating vocal atypicalities in a variety of diseases related to pathological changes in components of the vocal system, lead to the assumption that COVID-19 may be characterised through atypical voice parameters. Characteristic vocal patterns would constitute the starting point for an automatic quick-and-easy-to-apply COVID-19 detection, for example based on smartphone applications. To date, there is hardly any literature on voice parameters of patients with COVID-19. Recently, Asiaee and colleagues [15] compared voice samples of a sustained vowel /a:/ produced by Persian speakers with and without COVID-19. They extracted the following eight acoustic parameters: F0 and its variations (F0SD), jitter, shimmer, HNR, difference between the first two harmonic amplitudes (H1-H2), MPT, and cepstral peak prominence (CPP). Except for F0, all acoustic parameters were significantly different between the patients with COVID-19 and healthy controls. To the best of our knowledge, voice parameters have not yet been analysed for other vowels and there is no study focusing on the voice of German-speaking patients with COVID-19. The present study aims to provide a deeper insight into voice characteristics of patients with COVID-19 by extracting and comparing a comprehensive set of voice parameters from voice samples of the sustained vowels /i:/, /e:/, /o:/, /u:/, and /a:/ produced by German-speaking symptomatic patients with COVID-19 and healthy controls.
In a first audio pre-processing step, the recordings are converted into the uniform audio format 16 kHz/16 bit (single channel) PCM by means of FFmpeg. Then, we use Audacity to segment the recordings for all single vowels to be exported as separate audio files for the feature extraction step.
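The format conversion described above can be sketched as follows. This is a minimal illustration, not the exact command used in the study: the helper names build_ffmpeg_command and convert as well as the file names are hypothetical, and the sketch assumes that an ffmpeg executable is available on the system path. The options -ar, -ac, and -c:a pcm_s16le are standard FFmpeg options for sampling rate, channel count, and 16-bit PCM encoding.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ffmpeg_command(src: Path, dst: Path) -> List[str]:
    """Build an FFmpeg call that writes 16 kHz, 16-bit, single-channel PCM."""
    return [
        "ffmpeg",
        "-y",                 # overwrite an existing output file
        "-i", str(src),       # input recording in any format FFmpeg can read
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # downmix to a single channel
        "-c:a", "pcm_s16le",  # 16-bit signed little-endian PCM
        str(dst),
    ]


def convert(src: Path, dst: Path) -> None:
    """Run the conversion; raises CalledProcessError if FFmpeg fails."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)


# Hypothetical file names, for illustration only.
command = build_ffmpeg_command(Path("speaker01_raw.m4a"), Path("speaker01.wav"))
print(" ".join(command))
```

Building the command separately from running it keeps the chosen target format inspectable without requiring FFmpeg at import time.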
Acoustic feature extraction is done by means of the open-source toolkit openSMILE by audEERING™ GmbH [16,17] in its current release 3.0.
From each single vowel, we extract the features of the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS), representing a compact standard set of 88 acoustic higher-level signal descriptors launched in 2016 by Eyben and colleagues [18]. These higher-level descriptors include statistical functionals, such as arithmetic mean, coefficient of variation, percentiles, etc., computed for the trajectories of a range of acoustic time-, energy-, and/or spectral/cepstral-based low-level descriptors, such as F0, Mel-frequency cepstral coefficients (MFCCs), harmonics-to-noise ratio (HNR), jitter, or shimmer. While being a comparably small set among the available openSMILE standard sets, the features of the eGeMAPS were carefully selected by a consortium of engineers, linguists, phoneticians, and clinicians based on their theoretical and practical value for computational voice analysis tasks including clinical applications [18].
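To illustrate how a higher-level descriptor is derived from a low-level descriptor trajectory, the following sketch computes a few functionals of the kind named above for a short, made-up F0 contour. It is not the openSMILE implementation: the function names, the use of the population standard deviation, the linear percentile interpolation, and the 20th/50th/80th percentile choice are assumptions made for this example.

```python
import statistics


def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of pre-sorted data."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def functionals(trajectory):
    """Summarise one low-level descriptor trajectory with simple functionals.

    Assumes a non-empty trajectory with a non-zero mean.
    """
    mean = statistics.fmean(trajectory)
    sd = statistics.pstdev(trajectory)  # population SD (an assumption)
    ordered = sorted(trajectory)
    p20, p50, p80 = (percentile(ordered, q) for q in (20, 50, 80))
    return {
        "mean": mean,
        "sd_norm": sd / mean,           # coefficient of variation
        "pctl20": p20,
        "pctl50": p50,
        "pctl80": p80,
        "pctlrg_20_80": p80 - p20,      # percentile range
    }


# Made-up F0 values in semitones, one value per analysis frame.
f0_semitones = [30.0, 31.0, 29.0, 30.0, 30.0]
print(functionals(f0_semitones))
```

Each recording thus contributes one value per functional, which is what makes recordings of different durations comparable in the group statistics.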
We apply the Mann-Whitney U test (group- and vowel-wise feature values are not normally distributed) to analyse the distributions of the extracted acoustic features for differences between the pos.- and the neg.-group. On the one hand, this is done separately for each vowel. On the other hand, we analyse the combination of the front vowels /i:/ and /e:/, of the back vowels /u:/ and /o:/, as well as of all vowels. To identify the most important acoustic features in either constellation in order to distinguish between individuals from the pos.-group and individuals from the neg.-group, we finally rank the acoustic features according to the effect size r, taken as the absolute value of Cohen's correlation coefficient [19]. As null-hypothesis testing with p-values as decisive criterion has been repeatedly criticised (see the statement by the American Statistical Association [20]), we here report p-values as additional descriptive measures and do not employ them for accepting or rejecting a null-hypothesis.
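The test and ranking procedure can be sketched as below. The sketch rests on assumptions that the text does not specify: it uses the normal approximation of the U statistic with tie correction and without continuity correction, a two-sided p-value, and the common convention r = |z| / sqrt(N) with N as the total number of observations. All function names and the toy data are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_r(neg, pos):
    """Return (U, z, p, r) for two independent samples.

    Assumes that the pooled values are not all identical.
    """
    pooled = list(neg) + list(pos)
    n1, n2, n = len(neg), len(pos), len(pooled)
    ranks = average_ranks(pooled)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2  # U statistic of the first sample
    tie_counts = {}
    for value in pooled:
        tie_counts[value] = tie_counts.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    z = (u - n1 * n2 / 2) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided, normal approximation
    return u, z, p, abs(z) / math.sqrt(n)


def rank_features(samples, threshold=0.3):
    """Keep features with r > threshold, sorted by descending r."""
    scored = [(name, mann_whitney_r(neg, pos)[3])
              for name, (neg, pos) in samples.items()]
    kept = [item for item in scored if item[1] > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy data: feature name -> (neg.-group values, pos.-group values).
toy = {"feature_a": ([1, 2, 3], [4, 5, 6]), "feature_b": ([1, 4, 5], [2, 3, 6])}
print(rank_features(toy))
```

For group sizes as small as those in this study, an exact p-value would usually be preferable to the normal approximation; the approximation is used here only to keep the sketch short.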

Results
We display the respective top acoustic features, i. e., features differing between the pos.- and the neg.-group with an effect size r > .3, for the single-vowel scenarios as well as for the vowel combination scenarios in Tables 1 and 2. Additionally, we present boxplots for all features with an effect size r > .4 in the single-vowel examinations in Figure 1.
The mean voiced segment length as well as the number of voiced segments represent the features that differ most between the pos.- and the neg.-group in terms of effect size when combining recordings of all vowels. Further features with prominent group differences across vowels and vowel constellations are bandwidth statistics of the third vowel formant and local shimmer. Additionally, group differences in (a) front vowels are reflected in F0 related statistics and the coefficient of variation of the HNR, and in (b) back vowels in statistics related to the first two MFCCs and in the coefficient of variation of the spectral slope 500-1500 Hz in voiced regions.

Discussion
In this study, we acoustically analysed sustained vowels produced by participants with a COVID-19 infection and a group of healthy controls. We identified a number of acoustic features that differ moderately between the two groups. We found that the pos.-group produced a higher number of voiced segments per second at a shorter mean voiced segment length as compared to the neg.-group. As participants were instructed to produce sustained vowels, i. e., with a continuous phonation over a certain time, this finding may indicate discontinuities in the pulmonic airstream in COVID-19 infected participants leading to sporadic, unintended interruptions of phonation. Voiced segments per second and mean voiced segment length, as a sort of (overall) duration measure, are important in front vowels, both separately and taken together; they are not amongst the most important features in back vowels, and are again amongst the important ones in /a:/. They turn out to be most important across all vowels. With a caveat, this might be due to the overall effort that is lower for back vowels whose tongue position is closer to the [ə] (schwa). /a:/ is notoriously prone to 'laryngeal irritations'.
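The relation between interrupted phonation and the two duration-related features can be made concrete with a small sketch. It assumes a frame-wise voicing decision at a fixed frame step of 10 ms; the function name, the frame step, and the toy sequences are hypothetical, and the exact definitions used by openSMILE may differ.

```python
def voiced_segment_stats(voiced_flags, frame_step_s=0.01):
    """Return (voiced segments per second, mean voiced segment length in s).

    voiced_flags holds one boolean voicing decision per analysis frame.
    """
    if not voiced_flags:
        return 0.0, 0.0
    segments = []  # lengths of consecutive voiced runs, in frames
    run = 0
    for flag in voiced_flags:
        if flag:
            run += 1
        elif run:
            segments.append(run)
            run = 0
    if run:
        segments.append(run)
    total_duration_s = len(voiced_flags) * frame_step_s
    segments_per_second = len(segments) / total_duration_s
    mean_length_s = (sum(segments) * frame_step_s / len(segments)
                     if segments else 0.0)
    return segments_per_second, mean_length_s


# Two seconds of uninterrupted phonation vs. the same vowel with a 200 ms break.
continuous = [True] * 200
interrupted = [True] * 90 + [False] * 20 + [True] * 90
print(voiced_segment_stats(continuous))
print(voiced_segment_stats(interrupted))
```

In this toy case, a single unintended break doubles the number of voiced segments per second and more than halves the mean voiced segment length, which is the direction of the group difference reported above.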
Asiaee and colleagues [15] partly analysed the same acoustic features as used in our study. Among these, the F0 standard deviation, jitter, shimmer, and the HNR were found to be different between COVID-19 positive and COVID-19 negative participants when comparing voice samples of a sustained vowel /a:/. In our study, these features are not among the most important ones to differentiate the groups in the sustained vowel /a:/. However, the normalised F0 standard deviation and local jitter turned out to be relevant for group differentiation in the front vowel /e:/, the normalised HNR standard deviation in the front vowels /i:/ and /e:/, and local shimmer in the vowels /i:/ and /o:/. The diverging findings between our study and the study by Asiaee and colleagues [15] may result from the fact that the participants of the latter were Persian speakers, whereas participants in our study are German speakers. All things considered, the findings by Asiaee and colleagues [15] as well as our findings, pointing to voice acoustic correlates of a COVID-19 infection across the frequency (e. g., F0, formants, jitter), energy (e. g., shimmer, HNR), and in our study also spectral/cepstral (e. g., MFCCs, slope, harmonic difference) domains, suggest that a COVID-19 infection may not be characterisable by a single feature, but by a combination of selected candidate features tied to specific phonation tasks.
Acoustic analysis in this work is based on a compact standard feature set designed for a variety of computational voice analysis tasks also including tasks in clinical contexts. Starting from the gained knowledge about each feature's relevance for reflecting vocal differences between COVID-19 positive and COVID-19 negative speakers, future work should additionally focus on specific clinical speech parameters that allow for interpretations from a voice-physiological point of view, such as the glottal-to-noise excitation (GNE) ratio [21].
A limitation of our study is the relatively small sample size. Furthermore, the neg.-group consists of healthy speakers only, i. e., speakers without any symptoms of a cold, whereas the pos.-group only includes patients with mild-to-moderate flu-like symptoms. To evaluate whether there are voice parameters specific for a COVID-19 infection, future studies need to include a considerable number of patients with COVID-19 who do not show respiratory or ear-nose-throat symptoms and COVID-19 negative participants with cold-like symptoms. Moreover, as some of the acoustic features we show to be important for a differentiation between COVID-19 positive and COVID-19 negative participants were also reported to be relevant for differentiating between patients with asthma and healthy controls [8,9,10], it is highly important for future studies to include patients with asthma and other chronically ill patients in a control group.
Despite the limitations, our study can be regarded as a first step towards unravelling the complex acoustic fingerprint of COVID-19 and as an important proof-of-concept achievement for future voice-based viral infection identification applications. A re-validation of our findings based on a much larger and more heterogeneous sample is warranted.
r is rounded to two decimal places. p-values of the underlying Mann-Whitney U tests rounded to three decimal places are given as well. A3 = amplitude of third vowel formant, F0 = fundamental frequency, F1-3 = first to third vowel formant, H1 = relative amplitude of first harmonic, HNR = harmonics-to-noise ratio, MFCC1-2 = first and second Mel-frequency cepstral coefficient, SD norm = standard deviation normalised by the arithmetic mean (coefficient of variation), VR = voiced regions

Figure 1: Vowel-wise acoustic feature comparisons between COVID-19 negative (neg.) and COVID-19 positive (pos.) participants in form of boxplots for features with a differentiation effect r > .4, ordered from left to right according to decreasing r. The effect size r as well as the p-value of the Mann-Whitney U difference test are given above each boxplot. p is rounded to three decimal places. r is rounded to two decimal places. Outliers (marked with red plus symbols) are defined as values that are more than 1.5 times the interquartile range away from the bottom or top of the respective box. # = number of, F0 = fundamental frequency, F1 = first vowel formant, F3 = third vowel formant, len. = length, pctlrg = percentile range, RS = rising slope, seg. = segment, ST = semitone from 27.5 Hz, SD norm = standard deviation normalised by the arithmetic mean (coefficient of variation), slp = slope, VR = voiced regions

Table 1: Vowel-wise acoustic features with a differentiation effect r > .3 between COVID-19 negative and COVID-19 positive participants, ranked according to the effect size r. r is rounded to two decimal places. p-values of the underlying Mann-Whitney U tests rounded to three decimal places are given as well. A3 = amplitude of third vowel formant, F0 = fundamental frequency, F1-3 = first to third vowel formant, H1 = relative amplitude of first harmonic, HNR = harmonics-to-noise ratio, MFCC1-4 = first to fourth Mel-frequency cepstral coefficient, pctl = percentile, pctlrg = percentile range, SD norm = standard deviation normalised by the arithmetic mean (coefficient of variation), VR = voiced regions

Table 2: Acoustic features with a differentiation effect r > .3 between COVID-19 negative and COVID-19 positive participants, ranked according to the effect size r for the combination of the front vowels /e:/ and /i:/, the back vowels /u:/ and /o:/, and all vowels.