A system is proposed in which rhythmic representations are used to model the perception of tempo in music. The system can be understood as a five-layered model in which each layer transforms its input representations into higher-level abstractions. First, source separation is applied (Audio Level), onsets are detected (Onset Level), and interonset relationships are analyzed (Interonset Level). Then, several high-level representations of rhythm are computed (Rhythm Level). The periodicity of the music is modeled by the cepstroid vector, defined as the periodicity of an interonset interval (IOI) histogram. The pulse strength for plausible beat length candidates is defined by computing the magnitudes in different IOI histograms. The speed of the music is modeled as a continuous function, on the basis of the idea that such a function corresponds to the underlying perceptual phenomena; this also seems to reduce octave errors effectively. Finally, the tempo of the music is computed by combining the rhythmic representations in a logistic regression framework (Tempo Level). The results are the highest reported in a formal benchmarking test (2006–2013), with a P-Score of 0.857. Furthermore, the highest results so far are reported for two widely adopted test sets, with an Acc1 of 77.3% and 93.0% for the Songs and Ballroom datasets, respectively.
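To make the Interonset and Rhythm Levels more concrete, the following is a minimal sketch of how an IOI histogram might be built from onset times and how its periodicity might be measured. It is not the system's implementation: the function names, the 10 ms bin width, the 2 s maximum interval, and the use of a plain autocorrelation as a stand-in for the cepstroid computation are all assumptions made for illustration, since the abstract does not specify these details.

```python
def ioi_histogram(onsets, max_ioi=2.0, bin_width=0.01):
    """Count all pairwise interonset intervals shorter than max_ioi seconds.

    Bin i is centered on i * bin_width seconds (hypothetical layout).
    """
    n_bins = int(round(max_ioi / bin_width))
    hist = [0] * n_bins
    times = sorted(onsets)
    for i, t0 in enumerate(times):
        for t1 in times[i + 1:]:
            ioi = t1 - t0
            if ioi >= max_ioi:
                break  # times are sorted, so later intervals are longer still
            idx = int(round(ioi / bin_width))
            if idx < n_bins:
                hist[idx] += 1
    return hist


def histogram_period(hist, bin_width=0.01):
    """Return the lag (seconds) at which the histogram best matches a
    shifted copy of itself. Autocorrelation is an assumed stand-in for
    the periodicity analysis; it is not the paper's cepstroid."""
    n = len(hist)
    best_lag, best_score = 0, 0.0
    for lag in range(1, n // 2 + 1):
        score = sum(hist[i] * hist[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag * bin_width


# Synthetic example: one onset every 0.5 s for 8 s.
onsets = [i * 0.5 for i in range(17)]
hist = ioi_histogram(onsets)
period = histogram_period(hist)
print("dominant period (s):", period, "-> BPM:", 60.0 / period)
```

A real signal would yield a far noisier histogram with competing peaks at metrically related intervals, which is where the octave errors mentioned above come from and why the proposed system combines several rhythmic representations rather than relying on a single peak.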
