This paper describes a system for modeling, recognizing, and classifying sound textures. The system adapts contemporary approaches from video texture analysis to audio, yielding a novel approach for audio and music. The signal is first decomposed into a set of intrinsic mode functions using the Empirical Mode Decomposition (EMD), a time/frequency analysis technique. The dynamics of these modes are then expressed as a linear dynamical system (LDS). Both linear and nonlinear techniques are used to learn the system dynamics, and the learned models successfully distinguish between classes of textures. The data set comprises five classes of sounds drawn from a variety of source recordings: crackling fire, typewriter action, rainstorms, carbonated beverages, and crowd applause. On this data set the system achieved a classification accuracy of 90%. This outperformed both an LDS-modeling approach from the literature based on Mel-Frequency Cepstral Coefficients (MFCCs) and an approach based on a standard Gaussian Mixture Model (GMM) classifier.
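The LDS stage can be illustrated with the closed-form fitting procedure commonly used for video dynamic textures. An SVD of the mean-removed observation matrix gives the observation matrix C and the state sequence, and least squares then gives the transition matrix A, so that y_t ≈ C x_t + mean and x_{t+1} ≈ A x_t. This is a minimal sketch under several assumptions. The observations are taken to be a (d × T) matrix of per-frame features derived from the EMD modes, and how the paper actually builds them is not reproduced here. Noise covariances are omitted. Only a linear fit is shown, not the nonlinear variant. The classifier is a hypothetical lowest one-step-prediction-error rule, which may differ from the paper's. `fit_lds`, `one_step_error`, `classify`, `make_texture`, and `MIX` are illustrative names, not taken from the source.

```python
import numpy as np

# Hypothetical fixed channel layout shared by the synthetic "textures" below.
MIX = np.array([[1.0, 0.2], [0.3, 1.0], [0.5, -0.4], [-0.7, 0.6]])


def make_texture(freq, phase, T=200, noise=0.01, seed=0):
    """Hypothetical stand-in for per-frame mode features: a rotating 2-D state
    (frequency `freq` in cycles per frame) mixed into 4 channels plus noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(T)
    basis = np.vstack([np.sin(2 * np.pi * freq * t + phase),
                       np.cos(2 * np.pi * freq * t + phase)])
    return MIX @ basis + noise * rng.standard_normal((MIX.shape[0], T))


def fit_lds(Y, n_states):
    """Closed-form (suboptimal) LDS fit: SVD for C and the states, least squares for A.

    Y has shape (d, T): d feature channels over T frames.
    Returns (A, C, mean) with y_t ~= C x_t + mean and x_{t+1} ~= A x_t.
    """
    Y = np.asarray(Y, dtype=float)
    mean = Y.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    C = U[:, :n_states]                          # observation matrix, orthonormal columns
    X = np.diag(s[:n_states]) @ Vt[:n_states]    # state sequence, shape (n_states, T)
    # Least-squares transition matrix: X[:, 1:] ~= A @ X[:, :-1]
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])
    return A, C, mean


def one_step_error(model, Y):
    """Mean squared one-step prediction error of a fitted LDS on new data Y."""
    A, C, mean = model
    Y = np.asarray(Y, dtype=float)
    X = C.T @ (Y - mean)                 # project frames onto the model's state space
    Y_pred = C @ (A @ X[:, :-1]) + mean  # predict frame t+1 from frame t
    return float(np.mean((Y[:, 1:] - Y_pred) ** 2))


def classify(models, Y):
    """Hypothetical rule: return the label whose model best predicts Y one step ahead."""
    return min(models, key=lambda label: one_step_error(models[label], Y))
```

For example, fitting `fit_lds(make_texture(0.01, 0.0), 2)` and `fit_lds(make_texture(0.05, 0.0), 2)` as two classes, a new slow sequence is assigned to the slow model. The two synthetic textures share a channel layout and differ only in their temporal dynamics. The eigenvalues of the slow model's A therefore have an angle close to 2π·0.01 rad per frame. Real recordings would need the paper's feature pipeline and, likely, a more careful model distance.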
