An algorithm is presented for estimating the fundamental frequency (F0) of speech or musical sounds. It is based on the well-known autocorrelation method, with a number of modifications that combine to prevent errors. The algorithm has several desirable features. Error rates are about three times lower than those of the best competing methods, as evaluated over a database of speech recorded together with a laryngograph signal. There is no upper limit on the frequency search range, so the algorithm is suited to high-pitched voices and music. The algorithm is relatively simple, may be implemented efficiently and with low latency, and involves few parameters that must be tuned. It is based on a signal model (a periodic signal) that may be extended in several ways to handle the various forms of aperiodicity that occur in particular applications. Finally, interesting parallels may be drawn with models of auditory processing.
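
For orientation, the plain autocorrelation method on which the algorithm builds can be sketched as follows. This is only the unmodified baseline, not the algorithm presented here: the abstract does not spell out the modifications, so none are attempted. The function names `autocorrelation_f0` and `sine_frame`, the search-range parameters `f0_min` and `f0_max`, and the test signal are all assumptions made for illustration.

```python
import math


def autocorrelation_f0(signal, sample_rate, f0_min=150.0, f0_max=500.0):
    """Estimate the F0 (in Hz) of one frame with plain autocorrelation.

    Baseline sketch only: it picks the lag with the largest
    autocorrelation value inside the search range [f0_min, f0_max].
    The parameter names and defaults are illustrative assumptions.
    """
    lag_min = max(1, int(sample_rate / f0_max))
    # Keep at least half the frame available for the summation window.
    lag_max = min(int(sample_rate / f0_min), len(signal) // 2)
    if lag_max < lag_min:
        raise ValueError("frame too short for the requested search range")

    # Use the same number of products at every lag so values are comparable.
    window = len(signal) - lag_max
    best_lag, best_value = lag_min, -math.inf
    for lag in range(lag_min, lag_max + 1):
        value = sum(signal[i] * signal[i + lag] for i in range(window))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


def sine_frame(freq, sample_rate, n_samples):
    """Hypothetical test signal: n_samples of a pure sine at freq Hz."""
    return [math.sin(2.0 * math.pi * freq * i / sample_rate)
            for i in range(n_samples)]


if __name__ == "__main__":
    frame = sine_frame(200.0, 8000, 800)
    print(autocorrelation_f0(frame, 8000))
```

The search range in this sketch is deliberately narrow, so that only one period of the test tone falls inside it. With a wider range, a periodic signal produces autocorrelation peaks of nearly equal height at every multiple of the period, and a bare peak-picker such as this one can select the wrong one.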
