Severe hearing loss can be treated by providing the affected person with a surgically implanted electrical device called a cochlear implant (CI). CI users struggle to perceive complex audio signals such as music; however, previous studies show that CI recipients find music more enjoyable when the vocals are enhanced with respect to the background music. In this manuscript, source separation (SS) algorithms are used to remix pop songs by applying gain to the lead singing voice. Four approaches are evaluated, both objectively and subjectively, through two perceptual experiments involving normal-hearing subjects and CI recipients: deep convolutional auto-encoders, a deep recurrent neural network, a multilayer perceptron (MLP), and non-negative matrix factorization. The evaluation assesses the artifacts introduced by each SS algorithm together with its computation time, since the study aims to propose one of the algorithms for real-time implementation. Results show that the MLP performs robustly across the tested data while introducing levels of distortion and artifacts that are not perceived by CI users. The MLP is therefore proposed for real-time monaural audio SS to remix music for CI users.
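At its core, the remixing step described above amounts to applying gain to the separated vocal track and summing it back with the accompaniment. The following minimal Python sketch illustrates that step, assuming the SS stage has already produced vocal and accompaniment estimates; the +6 dB gain, the function name, and the peak-normalization guard are illustrative assumptions rather than values or code taken from the paper.

import numpy as np

def remix(vocals: np.ndarray, accompaniment: np.ndarray,
          gain_db: float = 6.0) -> np.ndarray:
    """Boost the separated vocals by gain_db and sum with the accompaniment."""
    gain = 10.0 ** (gain_db / 20.0)           # convert dB to linear amplitude
    mix = gain * vocals + accompaniment
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix  # simple guard against clipping

# Example with synthetic stems standing in for the separator's output.
rate = 16000
t = np.arange(rate) / rate
vocals = 0.3 * np.sin(2 * np.pi * 220 * t)    # placeholder "vocal" estimate
accomp = 0.3 * np.sin(2 * np.pi * 110 * t)    # placeholder accompaniment
remixed = remix(vocals, accomp, gain_db=6.0)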
