A method for automatic transcription of polyphonic music is proposed in this work that models the temporal evolution of musical tones. The model extends the shift-invariant probabilistic latent component analysis method by supporting the use of spectral templates that correspond to sound states such as attack, sustain, and decay. The order of these templates is controlled using hidden Markov model-based temporal constraints. In addition, the model can exploit multiple templates per pitch and instrument source. The shift-invariant aspect of the model makes it suitable for music signals that exhibit frequency modulations or tuning changes. Pitch-wise hidden Markov models are also utilized in a postprocessing step for note tracking. For training, sound state templates were extracted for various orchestral instruments using isolated note samples. The proposed transcription system was tested on multiple-instrument recordings from various datasets. Experimental results show that the proposed model is superior to a non-temporally constrained model and also outperforms various state-of-the-art transcription systems for the same experiment.
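The note-tracking postprocessing step mentioned above can be illustrated with a minimal sketch. It assumes one two-state (off/on) hidden Markov model per pitch, decoded with the Viterbi algorithm. The function name `track_notes`, the self-transition probability `p_stay`, and the use of the scaled activation itself as the "on" observation probability are all illustrative assumptions, not the parameters or observation model used in the paper.

```python
import math


def track_notes(activations, p_stay=0.9):
    """Viterbi-decode a two-state HMM (0 = off, 1 = on) for a single pitch.

    activations: per-frame pitch activations, assumed scaled to [0, 1].
    p_stay: hypothetical self-transition probability shared by both states.
    Returns one 0/1 state per frame.
    """
    if not activations:
        return []

    eps = 1e-9
    log_stay = math.log(p_stay)
    log_switch = math.log(1.0 - p_stay)
    # log_trans[prev][cur]: log-probability of moving from prev to cur.
    log_trans = [[log_stay, log_switch], [log_switch, log_stay]]

    def log_obs(a):
        # Assumed observation model: P(frame | on) = a, P(frame | off) = 1 - a.
        a = min(max(a, eps), 1.0 - eps)
        return [math.log(1.0 - a), math.log(a)]

    # Uniform prior over the initial state.
    score = [math.log(0.5) + lo for lo in log_obs(activations[0])]
    back = []  # back[t][s]: best predecessor of state s at frame t + 1

    for a in activations[1:]:
        obs = log_obs(a)
        new_score, pointers = [], []
        for cur in (0, 1):
            cands = [score[prev] + log_trans[prev][cur] for prev in (0, 1)]
            best_prev = 0 if cands[0] >= cands[1] else 1
            pointers.append(best_prev)
            new_score.append(cands[best_prev] + obs[cur])
        score = new_score
        back.append(pointers)

    # Backtrack from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With these assumed settings, a brief dip inside a sustained note is bridged and an isolated one-frame spike is suppressed, which is the smoothing behaviour that motivates HMM-based note tracking over plain thresholding. In a full system the decoder would run independently for each pitch, on that pitch's activation sequence.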
