The fundamental frequency (fo) is pivotal for quantifying vocal-fold characteristics. However, the accuracy of fo estimation in hoarse voices is notably low, and no definitive algorithm for fo estimation has previously been established. In this study, we introduce an algorithm named “Spectral-based fo Estimator Emphasized by Domination and Sequence (SFEEDS),” which enhances the spectrum method, and compare it with conventional estimation methods. We analyzed 454 voice samples, calculating fo with both the conventional methods and SFEEDS. The fo ground truth was defined as the lowest frequency within the most dominant harmonic complex observed on the spectrogram. We then assessed the concordance between each fo-estimation method and the fo ground truth, and examined how the accuracy of each method varied when analyzing speech with hoarseness. Regardless of hoarseness, fo-estimation accuracy was significantly greater with SFEEDS than with the conventional methods. Moreover, whereas the accuracy of the conventional methods was impaired in samples with roughness, the SFEEDS algorithm remained robust and significantly reduced subharmonic errors. The SFEEDS fo-estimation algorithm accurately estimated the fo of both normal and hoarse voices.
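
As a rough illustration of the concordance assessment described above, the following Python sketch labels each fo estimate as correct, a subharmonic error, a harmonic error, or other, and reports the fraction of correct estimates. The function names, the 10% relative tolerance, and the maximum ratio of 4 are assumptions chosen for illustration; they are not taken from the study, and this is not the SFEEDS algorithm itself.

```python
def classify_fo_estimate(estimate_hz, truth_hz, rel_tol=0.1, max_ratio=4):
    """Label one fo estimate against a ground-truth fo value.

    rel_tol and max_ratio are illustrative assumptions, not study parameters.
    """
    if estimate_hz <= 0 or truth_hz <= 0:
        return "invalid"
    # Correct: the estimate lies within the relative tolerance of the truth.
    if abs(estimate_hz - truth_hz) <= rel_tol * truth_hz:
        return "correct"
    for n in range(2, max_ratio + 1):
        # Subharmonic error: the estimate sits near truth / n.
        if abs(estimate_hz - truth_hz / n) <= rel_tol * truth_hz / n:
            return "subharmonic"
        # Harmonic error: the estimate sits near n * truth.
        if abs(estimate_hz - truth_hz * n) <= rel_tol * truth_hz * n:
            return "harmonic"
    return "other"


def concordance_rate(estimates_hz, truths_hz, **kwargs):
    """Return (fraction of correct estimates, per-sample labels)."""
    labels = [
        classify_fo_estimate(e, t, **kwargs)
        for e, t in zip(estimates_hz, truths_hz)
    ]
    return labels.count("correct") / len(labels), labels


# Hypothetical values: four estimates against a 200 Hz ground truth.
rate, labels = concordance_rate([205.0, 100.0, 400.0, 150.0], [200.0] * 4)
print(rate, labels)
```

In this sketch an estimate of 100 Hz against a 200 Hz ground truth is counted as a subharmonic error, the type of error the abstract reports SFEEDS as reducing in rough voices.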

