Temporal coherence and spectral regularity are critical cues in human auditory streaming and are exploited by many sound separation models. Examples include the Conv-tasnet model, which emphasizes temporal coherence by analyzing sound with very short convolution kernels, and the dual-path convolution recurrent network (DPCRN) model, which uses two recurrent neural networks to analyze general patterns along the temporal and spectral dimensions of a spectrogram. By expanding DPCRN with an additional inter-band RNN, a harmonic-aware tri-path convolution recurrent network model is proposed. Evaluation results on public datasets show that this addition further boosts the separation performance of DPCRN.

Singing voice separation is a traditional but challenging research topic in music information retrieval (MIR). It aims to separate an audio mixture into the vocal and the background music. The retrieved vocal can be used in other MIR applications, such as singer identification (Zhang et al., 2019), automatic lyrics transcription (Demirel et al., 2020), and pitch tracking (Fan et al., 2016). On the other hand, the separated background music can be an asset to the entertainment industry, for example, as karaoke backing tracks.

Before the era of neural networks (NNs), researchers proposed many singing voice separation methods, such as non-negative matrix factorization (NMF) based methods (Smaragdis et al., 2014), Bayesian methods (Ozerov et al., 2007), principal component analysis (PCA) based methods (Huang et al., 2012), and the extraction of repeating structures in music (Rafii and Pardo, 2012). In recent years, many NN-based models have been proposed for the separation problem. Early NN-based separation models worked in the time-frequency (T-F) domain and predicted a soft mask from the mixture spectrogram for each of the sources (Jansson et al., 2017; Ni and Ren, 2022; Stöter et al., 2019; Takahashi et al., 2018). Later on, models working directly on time-domain signals were proposed for sound separation (Défossez et al., 2019; Luo and Mesgarani, 2019; Stoller et al., 2018). Psychoacoustic studies reveal that the human brain groups sound streams based on temporal and spectral cues (Moore, 2012). The temporal cues include onsets/offsets for simultaneous grouping and amplitude/frequency modulation for sequential grouping. Spectral cues, such as spectral regularity, formant structures, and harmonic structures, are mostly used for simultaneous grouping. This grouping process of the human brain is referred to as auditory scene analysis (ASA) (Bregman, 1994), the mechanism for deciphering speech in a noisy environment such as a cocktail party. Research findings on ASA have been implicitly or explicitly adopted in NN-based sound separation models. For instance, the encoder in the Conv-tasnet model uses very short one-dimensional (1D) convolution kernels to extract features with very high temporal resolution, and the following separator essentially learns temporal coherence cues for separation (Luo and Mesgarani, 2019). Later, the dual-path recurrent neural network (DPRNN) was proposed to split the encoded feature sequence into chunks and to analyze short-term and long-term temporal cues using intra-chunk and inter-chunk recurrent neural networks (RNNs) (Li et al., 2021). Furthermore, by modifying the RNN module in the CRN model (Zhao et al., 2018), the dual-path convolution recurrent network (DPCRN) was proposed to capture both temporal and spectral features along the temporal and spectral dimensions of the spectrogram, one dimension per path (Le et al., 2021). As expected, the DPCRN demonstrated strong performance on speech/noise separation.

Although one path of the DPCRN analyzes the spectrogram along the frequency dimension, it does not specifically emphasize the harmonic structure, which is an important cue for simultaneous grouping and is very prominent in the spectrogram of the singing voice even above 7 kHz (as shown in Fig. 1). Therefore, we propose a harmonic-aware tri-path convolution recurrent network (HA-TPCRN) for singing voice separation by expanding DPCRN. In addition to the two paths of the DPCRN that capture general spectral and temporal features, we add a third path that specifically emphasizes the cross-band harmonic structure by splitting the spectrogram along the frequency axis into chunks and using an inter-band RNN to learn the structure. Simulation results show that the proposed HA-TPCRN model achieves better performance than the DPRNN and the DPCRN on singing voice separation. Note that the band-split RNN (Luo and Yu, 2023) also splits the frequency axis into subbands; however, its splitting is not specifically designed to capture the harmonic structure.

Fig. 1. Left, spectrogram of a sample utterance. Right, spectrogram of a sample singing voice.

The rest of the paper is organized as follows. In Sec. 2, we briefly introduce the DPRNN and the DPCRN. In Sec. 3, we present the architecture of the proposed HA-TPCRN model. We then evaluate HA-TPCRN on singing voice separation using public datasets in Sec. 4. Finally, Sec. 5 concludes the paper and outlines future work.

Conv-tasnet was proposed as an end-to-end time-domain model for speech separation (Luo and Mesgarani, 2019). It uses short 1D convolution kernels in the encoder and the decoder and a temporal convolution network (TCN) as the middle separator. Because of the short kernels, the encoder produces a long feature sequence with high temporal resolution for the separator. However, with its relatively narrow receptive field, the TCN cannot effectively model the long-term temporal dependency within such data. The DPRNN was therefore proposed, replacing the TCN with a combination of RNNs and a sequence segmentation strategy as the separator (Li et al., 2021). The encoder outputs of DPRNN are first segmented into several chunks along the temporal axis. The model then uses an intra-chunk RNN to analyze the short-term information within each chunk and an inter-chunk RNN to learn the long-term dependency among chunks. The combination of intra-chunk and inter-chunk RNNs is stacked several times to form the separator. Detailed descriptions of DPRNN can be found in Li et al. (2021).
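To make this segmentation strategy concrete, the following PyTorch-style sketch (our illustration of the description above, not the authors' implementation) shows one dual-path block with non-overlapping chunks; the published DPRNN additionally uses overlapping chunks and layer normalization, which are omitted here for brevity.

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    """Illustrative sketch of one DPRNN block: segment a long feature sequence
    into chunks, run an intra-chunk BLSTM within each chunk and an inter-chunk
    BLSTM across chunks."""
    def __init__(self, feat_dim=64, hidden=128, chunk_len=100):
        super().__init__()
        self.chunk_len = chunk_len
        self.intra_rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.intra_fc = nn.Linear(2 * hidden, feat_dim)
        self.inter_rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.inter_fc = nn.Linear(2 * hidden, feat_dim)

    def forward(self, x):                                # x: (batch, time, feat)
        B, T, C = x.shape
        K = self.chunk_len
        pad = (K - T % K) % K
        x = nn.functional.pad(x, (0, 0, 0, pad))         # zero-pad the time axis
        S = x.shape[1] // K
        x = x.view(B, S, K, C)                           # (B, chunks, chunk_len, feat)

        # Intra-chunk RNN: short-term dependency within each chunk.
        intra = x.reshape(B * S, K, C)
        intra = self.intra_fc(self.intra_rnn(intra)[0]).reshape(B, S, K, C)
        x = x + intra                                    # residual connection

        # Inter-chunk RNN: long-term dependency across chunks at the same position.
        inter = x.permute(0, 2, 1, 3).reshape(B * K, S, C)
        inter = self.inter_fc(self.inter_rnn(inter)[0]).reshape(B, K, S, C)
        x = x + inter.permute(0, 2, 1, 3)                # residual connection

        return x.reshape(B, S * K, C)[:, :T]             # drop the padding
```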

The DPCRN was originally proposed as a Fourier-spectrogram-domain model for speech enhancement. It is composed of a CNN-based encoder, a dual-path RNN separator/enhancer, and a CNN-based decoder. The encoder analyzes local T-F regions of the spectrogram using two-dimensional (2D) convolution layers, while the decoder synthesizes T-F regions using 2D transposed convolution layers. The two paths of the RNN separator sequentially and separately analyze the spectral and temporal dependencies of the encoder output. Specifically, the first path takes the frequency bins in a time frame as the input sequence to analyze the spectral dependency within that time frame. The second path takes the time frames in a frequency bin as the input sequence to analyze the temporal dependency within that frequency bin. Detailed descriptions of DPCRN can be found in Le et al. (2021).

Figure 1 shows the spectrograms of a sample utterance and a sample singing voice clip. Clearly, the harmonic structure of the singing voice (right panel) is much stronger than that of speech (left panel) and can be very prominent even above 7 kHz. Such differences might stem from the different manners in which the vocal tract is used in singing and speaking. This observation motivates us to add another dedicated path to the DPCRN for deeper analysis of the harmonics, a very important spectral regularity cue (Moore, 2012), for singing voice separation. Figure 2 shows the proposed model, which includes three blocks: the CNN-based encoder, the tri-path RNN separator, and the CNN-based decoder.

Fig. 2. Schematic diagram of the proposed HA-TPCRN model for singing voice separation. The model includes the CNN-based encoder and decoder and the harmonic-aware tri-path RNN (HA-TPRNN) separator. The HA-TPRNN separator is formed by adding the third path (inter-band RNN path) to the separator of DPCRN (Le et al., 2021).

The encoder and the decoder are similar to the ones in the DPCRN (Le et al., 2021). The tri-path RNN separator is constructed by expanding the dual-path RNN block in the DPCRN. Figure 3 shows the details of the tri-path RNN separator, which consists of three paths: the spectral RNN path, the temporal RNN path, and the inter-band RNN path.

Fig. 3. Details of the HA-TPRNN separator. One tri-path module is shown here; three modules are stacked to form HA-TPRNN. The spectral RNN path analyzes the spectral dependency within a time frame. The temporal RNN path analyzes the temporal dependency in a frequency bin. The inter-band RNN path splits the input into segments of length K on the frequency axis and analyzes the cross-band harmonic-structure dependency within a time frame.

There are three 2D convolution layers in the encoder and three 2D transposed convolution layers in the decoder. Skip connections are used between the outputs of the three encoder layers and the corresponding decoder layers, and each convolution layer is followed by a batch normalization layer and a PReLU activation function (He et al., 2015). Table 1 lists the detailed settings of the three convolution layers of the encoder and the three transposed convolution layers of the decoder; an illustrative layer-by-layer sketch follows the table. The stride along the time axis is set to 1 to keep the frame number unchanged, while the stride along the frequency axis is set to 2 to reduce the feature size.

Table 1. Detailed settings of the convolution/transposed convolution layers of the encoder/decoder.

               Conv1    Conv2    Conv3    Trans-conv1   Trans-conv2   Trans-conv3
Kernel number  32       64       128      64            32            2
Kernel size    5 × 3    3 × 3    1 × 1    1 × 1         3 × 3         5 × 3
Stride         2 × 1    2 × 1    1 × 1    1 × 1         2 × 1         2 × 1
Padding        [2, 1]   [1, 1]   [0, 0]   –             –             –
The first two paths of the proposed tri-path separator are the spectral RNN and the temporal RNN. They are also called the intra-chunk RNN and the inter-chunk RNN in the DPCRN literature (Le et al., 2021). Let $\mathbf{F}_b \in \mathbb{R}^{N \times C \times T}$ denote the input feature to the tri-path separator, where $N$ is the number of frequency bins and $C$ and $T$ are the numbers of channels and time frames, respectively. The spectral RNN is implemented by a bidirectional long short-term memory (BLSTM) network to learn the spectral pattern within a single time frame as
$$\mathbf{U}_b = \big[\,\mathrm{iLN}\big(G(SP_b(\mathbf{F}_b[:,:,i]))\big),\ i = 1,\dots,T\,\big], \tag{1}$$
where $\mathbf{U}_b \in \mathbb{R}^{N \times C \times T}$ is the output of the spectral RNN, $b$ is the stack index of the tri-path module, $\mathbf{F}_b[:,:,i]$ is an $N$-step sequence of $C$-dimensional features in time frame $i$, $SP_b$ is the one-layer BLSTM network in the spectral path, $G$ is a linear fully connected layer, and iLN denotes instant layer normalization (Westhausen and Meyer, 2020). A residual connection is used between the input of the spectral RNN path and the output of iLN as
$$\mathbf{T}_b = \mathbf{F}_b + \mathbf{U}_b. \tag{2}$$
Next, $\mathbf{T}_b$ serves as the input to the temporal RNN, which learns the temporal pattern in each frequency bin as
$$\mathbf{V}_b = \big[\,\mathrm{iLN}\big(G(TE_b(\mathbf{T}_b[i,:,:]))\big),\ i = 1,\dots,N\,\big], \tag{3}$$
where $\mathbf{V}_b \in \mathbb{R}^{N \times C \times T}$ is the output of the temporal RNN, $\mathbf{T}_b[i,:,:]$ is a $T$-step sequence of $C$-dimensional features in frequency bin $i$, and $TE_b$ is the one-layer BLSTM network in the temporal path. The BLSTM is also followed by a linear fully connected layer $G$ and the instant layer normalization iLN. A residual connection is again applied between the input of the temporal RNN path and the output of iLN as
$$\mathbf{W}_b = \mathbf{T}_b + \mathbf{V}_b, \tag{4}$$
where $\mathbf{W}_b \in \mathbb{R}^{N \times C \times T}$ serves as the input to the third path, the inter-band RNN.
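For readers who prefer code, the following PyTorch-style sketch illustrates how the spectral and temporal paths of Eqs. (1)-(4) can be realized. It is our illustration rather than the authors' implementation, and instant layer normalization is approximated here with a standard LayerNorm over the channel dimension.

```python
import torch
import torch.nn as nn

class SpectralTemporalPaths(nn.Module):
    """Sketch of the first two paths of one tri-path module [Eqs. (1)-(4)]."""
    def __init__(self, C=128, hidden=256):
        super().__init__()
        self.spec_rnn = nn.LSTM(C, hidden, batch_first=True, bidirectional=True)   # SP_b
        self.spec_fc = nn.Linear(2 * hidden, C)                                     # G
        self.spec_norm = nn.LayerNorm(C)                                            # iLN (approx.)
        self.temp_rnn = nn.LSTM(C, hidden, batch_first=True, bidirectional=True)    # TE_b
        self.temp_fc = nn.Linear(2 * hidden, C)
        self.temp_norm = nn.LayerNorm(C)

    def forward(self, F):                                    # F: (batch, N, C, T)
        B, N, C, T = F.shape

        # Spectral path: each time frame is an N-step sequence of C-dim features.
        seq = F.permute(0, 3, 1, 2).reshape(B * T, N, C)         # (B*T, N, C)
        U = self.spec_norm(self.spec_fc(self.spec_rnn(seq)[0]))  # Eq. (1)
        U = U.reshape(B, T, N, C).permute(0, 2, 3, 1)            # back to (B, N, C, T)
        Tb = F + U                                               # Eq. (2), residual

        # Temporal path: each frequency bin is a T-step sequence of C-dim features.
        seq = Tb.permute(0, 1, 3, 2).reshape(B * N, T, C)        # (B*N, T, C)
        V = self.temp_norm(self.temp_fc(self.temp_rnn(seq)[0]))  # Eq. (3)
        V = V.reshape(B, N, T, C).permute(0, 1, 3, 2)            # back to (B, N, C, T)
        return Tb + V                                            # Eq. (4): W_b
```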
There are three main stages in the inter-band RNN path: segmentation, the inter-band RNN, and overlap-add. Figure 4 illustrates the segmentation process. We use two sets of segments to capture the harmonic patterns along the frequency axis of $\mathbf{W}_b$. For the first set, we zero-pad the beginning of $\mathbf{W}_b$ to create $\mathbf{W}'_b \in \mathbb{R}^{(P+N) \times C \times T}$ and split it along the frequency axis into non-overlapping segments of length $K$, yielding $S_1$ segments $\mathbf{D}_{s,b} \in \mathbb{R}^{K \times C \times T}$, $s = 1,\dots,S_1$. We set $P$ to the smallest integer that makes $P+N$ divisible by $K$. For the second set, we zero-pad the end of $\mathbf{W}_b$ to create $\mathbf{W}''_b \in \mathbb{R}^{(N+P) \times C \times T}$ and split it evenly into $S_2$ segments $\mathbf{D}'_{s,b} \in \mathbb{R}^{K \times C \times T}$, $s = 1,\dots,S_2$. The two sets of segments are then concatenated into a four-dimensional (4D) tensor $\mathbf{X}_b = [\mathbf{D}_{1,b},\dots,\mathbf{D}_{S_1,b},\mathbf{D}'_{1,b},\dots,\mathbf{D}'_{S_2,b}] \in \mathbb{R}^{S \times K \times C \times T}$, where $S = S_1 + S_2$ and $S_1 = S_2$. Then, $\mathbf{X}_b$ is passed into the inter-band RNN as
$$\mathbf{Y}_b = \big[\,\mathrm{iLN}\big(G(IN_b(\mathbf{X}_b[:,i,:,j]))\big),\ (i,j) = (1,1),\dots,(K,1),(1,2),\dots,(K,T)\,\big], \tag{5}$$
where $\mathbf{Y}_b \in \mathbb{R}^{S \times K \times C \times T}$ is the output of the inter-band RNN, $\mathbf{X}_b[:,i,:,j]$ is an $S$-step sequence of $C$-dimensional features, and $IN_b$ is the one-layer BLSTM network in the inter-band path. The BLSTM is also followed by a linear fully connected layer $G$ and an instant layer normalization iLN. In the overlap-add stage, $\mathbf{Y}_b$ is transformed back to a 3D tensor $\mathbf{Q}_b \in \mathbb{R}^{N \times C \times T}$ by applying overlap-add to the two sets of segments. There is also a residual connection between the input and the output of the inter-band RNN path.
Fig. 4. Segmentation strategy in the inter-band RNN.
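The inter-band path can be summarized by the following PyTorch-style sketch, covering the segmentation, the cross-band BLSTM of Eq. (5), and one plausible reading of the overlap-add stage (averaging the two reconstructions); it is our illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class InterBandPath(nn.Module):
    """Sketch of the inter-band RNN path: segmentation, BLSTM across bands, overlap-add."""
    def __init__(self, C=128, K=3, hidden=256):
        super().__init__()
        self.K = K
        self.rnn = nn.LSTM(C, hidden, batch_first=True, bidirectional=True)  # IN_b
        self.fc = nn.Linear(2 * hidden, C)
        self.norm = nn.LayerNorm(C)

    def segment(self, W):                           # W: (batch, N, C, T)
        B, N, C, T = W.shape
        K = self.K
        P = (K - N % K) % K                         # smallest P with (N + P) % K == 0
        front = nn.functional.pad(W, (0, 0, 0, 0, P, 0)).view(B, -1, K, C, T)  # pad low-frequency end
        back = nn.functional.pad(W, (0, 0, 0, 0, 0, P)).view(B, -1, K, C, T)   # pad high-frequency end
        return torch.cat([front, back], dim=1), P   # X: (B, S, K, C, T), S = S1 + S2

    def overlap_add(self, Y, N, P):                 # undo both segmentations and average
        B, S, K, C, T = Y.shape
        front, back = Y[:, : S // 2], Y[:, S // 2 :]
        front = front.reshape(B, -1, C, T)[:, P : P + N]
        back = back.reshape(B, -1, C, T)[:, :N]
        return 0.5 * (front + back)

    def forward(self, W):
        B, N, C, T = W.shape
        X, P = self.segment(W)                                     # (B, S, K, C, T)
        S, K = X.shape[1], X.shape[2]
        seq = X.permute(0, 2, 4, 1, 3).reshape(B * K * T, S, C)    # S-step sequences across bands
        Y = self.norm(self.fc(self.rnn(seq)[0]))                   # Eq. (5)
        Y = Y.reshape(B, K, T, S, C).permute(0, 3, 1, 4, 2)        # back to (B, S, K, C, T)
        Q = self.overlap_add(Y, N, P)
        return W + Q                                               # residual connection
```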

All the one-layer BLSTM networks, $SP_b$, $TE_b$, and $IN_b$, have 256 hidden units. Three tri-path modules are stacked to form the separator to increase the complexity of the proposed model, i.e., $b = 1, 2, 3$. The output channel number of the decoder is set to two to generate two masks for predicting the vocal and music spectrograms. After masking, the inverse short-time Fourier transform (ISTFT) is applied with the mixture phase to generate the separated vocal and music time-domain signals.
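A minimal sketch of this reconstruction step is given below, assuming the first mask corresponds to the vocal and the second to the music; the STFT parameters follow Sec. 4 (64-ms window and 16-ms hop at 16 kHz), and the function name is ours.

```python
import torch

def reconstruct_sources(masks, mix_mag, mix_phase, n_fft=1024, hop=256, win=1024):
    """Apply the two predicted masks to the mixture magnitude and invert
    each masked spectrogram with the mixture phase (illustrative sketch)."""
    window = torch.hamming_window(win)
    sources = []
    for i in range(masks.shape[0]):                     # mask 0: vocal, mask 1: music (assumed order)
        est_mag = masks[i] * mix_mag                    # (freq, time)
        est_spec = est_mag * torch.exp(1j * mix_phase)  # reuse the mixture phase
        y = torch.istft(est_spec, n_fft=n_fft, hop_length=hop,
                        win_length=win, window=window)
        sources.append(y)
    return sources                                      # [vocal_waveform, music_waveform]
```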

The time-domain and the frequency-domain losses are defined as
$$\mathcal{L}_t = \tfrac{1}{2}\big(\lVert \hat{y}_v - y_v \rVert_1 + \lVert \hat{y}_m - y_m \rVert_1\big), \qquad \mathcal{L}_f = \tfrac{1}{2}\big(\lVert \hat{Y}_v - Y_v \rVert_2 + \lVert \hat{Y}_m - Y_m \rVert_2\big), \tag{6}$$
where $y_v$ and $y_m$ are the waveforms of the target vocal and background music; $\hat{y}_v$ and $\hat{y}_m$ are the waveforms of the predicted vocal and background music; $Y_v$ and $Y_m$ are the corresponding target magnitude spectrograms; and $\hat{Y}_v$ and $\hat{Y}_m$ are the predicted magnitude spectrograms. The two losses are combined for model training as
$$\mathcal{L}_{\mathrm{total}} = \alpha\,\mathcal{L}_t + (1-\alpha)\,\mathcal{L}_f, \tag{7}$$
where the weight parameter $\alpha$ is empirically set to 0.9.
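For concreteness, a minimal sketch of the combined objective in Eqs. (6) and (7) is shown below; the variable names are ours, and any per-element normalization of the norms is an implementation detail not specified in the text.

```python
import torch

def separation_loss(y_hat_v, y_v, y_hat_m, y_m,
                    Y_hat_v, Y_v, Y_hat_m, Y_m, alpha=0.9):
    """L1 loss on the vocal/music waveforms, L2 loss on their magnitude
    spectrograms, combined with weight alpha = 0.9 [Eqs. (6)-(7)]."""
    L_t = 0.5 * (torch.linalg.vector_norm(y_hat_v - y_v, ord=1) +
                 torch.linalg.vector_norm(y_hat_m - y_m, ord=1))
    L_f = 0.5 * (torch.linalg.vector_norm(Y_hat_v - Y_v, ord=2) +
                 torch.linalg.vector_norm(Y_hat_m - Y_m, ord=2))
    return alpha * L_t + (1 - alpha) * L_f
```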

We prepared the training data from the development set of the DSD100 database (Liutkus et al., 2017). The database contains 100 professionally mixed songs, each accompanied by four unmixed tracks: bass, drums, others, and vocals. For each song, we added the bass, drums, and others tracks together to form the background music track. We then removed the silent sections in both the vocal and music tracks of the 50 songs in the development set and randomly mixed these 50 vocal tracks and 50 music tracks with uniformly distributed random scales ranging from 0.7 to 1. We also added some remixed songs without removing silent sections to the training set. In the end, we created a 15-h remixed dataset for training. All sounds were re-sampled to 16 kHz to reduce the computational load and memory requirements.
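The data-preparation recipe can be sketched as follows; the silence-removal threshold, the one-second block length, and the per-call pairing of one vocal track with one music track are our simplifying assumptions (the paper randomly pairs the 50 vocal and 50 music tracks across songs).

```python
import numpy as np

def make_training_mixture(vocal, bass, drums, others, rng=None,
                          remove_silence=True, frame=16000, thresh=1e-3):
    """Sketch of the training-data recipe: sum the accompaniment stems into a
    music track, optionally drop silent one-second blocks, scale both tracks
    by a random gain in [0.7, 1], and mix."""
    rng = rng or np.random.default_rng()
    music = bass + drums + others                        # background music track

    if remove_silence:
        def drop_silence(x):
            blocks = [x[i:i + frame] for i in range(0, len(x) - frame + 1, frame)]
            kept = [b for b in blocks if np.sqrt(np.mean(b ** 2)) > thresh]
            return np.concatenate(kept) if kept else x
        vocal, music = drop_silence(vocal), drop_silence(music)

    n = min(len(vocal), len(music))
    vocal = vocal[:n] * rng.uniform(0.7, 1.0)
    music = music[:n] * rng.uniform(0.7, 1.0)
    return vocal + music, vocal, music                   # mixture and its targets
```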

For generating the spectrogram, we used a 64-ms Hamming window with a hop length of 16 ms. For each frame, a 1024-point fast Fourier transform (FFT) was performed to obtain a 513-dimensional spectrum. We used the Adam optimizer with default parameters (Kingma and Ba, 2014) for model training. The batch size was set to 8 and the learning rate to 0.0003. For evaluation, we used the test sets of the DSD100 and MUSDB18 (Rafii et al., 2017) datasets and calculated the source-to-distortion ratio (SDR) (Raffel et al., 2014), the scale-invariant source-to-distortion ratio (SI-SDR) (Le Roux et al., 2019), and the L1 distance between magnitude spectrograms for comparison.
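As a reference for the evaluation metrics, the following sketch computes SI-SDR as defined by Le Roux et al. (2019); it is a generic implementation rather than the authors' evaluation script (SDR is computed with mir_eval, Raffel et al., 2014, in the paper).

```python
import torch

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB for 1-D waveforms (Le Roux et al., 2019)."""
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the target to obtain the scaled reference.
    scale = torch.dot(estimate, target) / (torch.dot(target, target) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    return 10 * torch.log10(torch.dot(s_target, s_target) /
                            (torch.dot(e_noise, e_noise) + eps))
```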

For performance comparison, we trained the time-domain DPRNN and the spectrogram-domain DPCRN as baseline models. In DPRNN, the convolution/deconvolution layers were composed of 128 kernels with a kernel length of 16 and a stride of 8. The dual-path RNN module, composed of two one-layer 256-unit BLSTMs, was stacked four times to form the separator. In DPCRN, the encoder and the decoder were identical to those in the proposed HA-TPCRN, as shown in Table 1. To form the separator, the dual-path RNN module was stacked three, four, and five times, in contrast with the tri-path RNN module stacked three times in HA-TPCRN. In addition, we included Spleeter-2stem (Hennequin et al., 2020), a well-trained music separation model based on the deep U-Net (Jansson et al., 2017), for comparison.

Table 2 shows the separation results of the compared models on the test sets of the DSD100 and MUSDB18 databases in terms of the mean SDR, SI-SDR, and L1 distance on spectrograms. Higher SDR and SI-SDR scores and lower L1 distances indicate better separation. Scores from DPCRN with different stack numbers and from the proposed HA-TPCRN with different segment lengths (K) are listed separately. The numbers of parameters of the compared models are also given in the table. Note that the spectral RNN path, the temporal RNN path, and the inter-band RNN path have roughly the same number of parameters. The results show that the spectrogram-domain DPCRN and the proposed HA-TPCRN greatly outperform the time-domain DPRNN and the spectrogram-domain Spleeter-2stem in separating vocals and music. The HA-TPCRN (stack number = 3) achieves the best score in every category, in most cases with K = 3, and provides clear gains over DPCRN (stack number = 3). Moreover, the superior performance of HA-TPCRN (stack number = 3) with fewer parameters than DPCRN (stack number = 5) indicates that the inter-band analysis along the frequency axis provides helpful information for separating harmonic-rich singing sounds.

Table 2. Separation model evaluation on DSD100 and MUSDB18 in terms of SDR/SI-SDR/L1 distance.

                              DSD100                                          MUSDB18
                              SDR (dB)      SI-SDR (dB)    L1                 SDR (dB)      SI-SDR (dB)    L1
(stack #, para #)             Music  Vocal  Music  Vocal   Music  Vocal       Music  Vocal  Music  Vocal   Music  Vocal
Spleeter-2stem  (–, 19.6 M)   12.44  5.71   12.22  4.82    0.67   1.10        12.94  5.72   12.68  4.59    0.82   1.01
DPRNN           (4, 6.95 M)   13.15  6.33   12.81  5.94    0.72   1.10        13.12  4.77   12.71  4.25    0.94   1.08
DPCRN           (3, 5.19 M)   14.58  8.28   14.39  7.76    0.52   0.85        14.93  8.16   14.67  7.60    0.72   0.81
DPCRN           (4, 6.90 M)   14.76  8.37   14.53  7.94    0.51   0.82        15.12  8.29   14.82  7.84    0.71   0.78
DPCRN           (5, 8.62 M)   14.66  8.30   14.45  7.84    0.50   0.85        15.08  8.27   14.80  7.80    0.71   0.80
HA-TPCRN, K = 2 (3, 7.76 M)   14.85  8.45   14.60  8.05    0.52   0.81        15.13  8.26   14.81  7.87    0.74   0.77
HA-TPCRN, K = 3 (3, 7.76 M)   14.83  8.51   14.62  8.06    0.52   0.82        15.20  8.39   14.93  7.93    0.73   0.78
HA-TPCRN, K = 4 (3, 7.76 M)   14.78  8.44   14.59  7.99    0.49   0.82        15.12  8.28   14.86  7.85    0.70   0.79

In addition to the SDR, SI-SDR, and L1 objective measures, we conducted subjective listening tests to evaluate the proposed model. Fifty songs from the DSD100 test set were used for a preference test. For each song, a pair of 15-s segments of the vocals separated by the HA-TPCRN (K = 3) and DPCRN (stack number = 4) models was extracted. Thirteen subjects participated in the tests and were asked to choose the preferred vocal in each pair. Results from the 650 pairs show that the vocals separated by HA-TPCRN received a preference rating of 59.23%, versus 40.77% for DPCRN. These results demonstrate that HA-TPCRN outperforms DPCRN in terms of perceived quality.

To investigate whether the inter-band RNN path provides complementary information to the spectral RNN path, we constructed the HA-DPCRN model by replacing the spectral RNN with the inter-band RNN in the baseline DPCRN model. The results in Table 3 show that the inter-band RNN is not an equivalent substitute for the spectral RNN. Based on the results in Tables 2 and 3, we conclude that the spectral RNN is necessary for analyzing the sounds and that the inter-band RNN provides complementary information that further boosts the performance. Demo sounds of the compared models can be accessed at https://victoriatw.github.io/HA-TPCRN_singing_voice_separation/demo.html.

Table 3. Separation performance of the DPCRN and the HA-DPCRN (temporal + inter-band RNNs) models.

                              DSD100                                          MUSDB18
                              SDR (dB)      SI-SDR (dB)    L1                 SDR (dB)      SI-SDR (dB)    L1
(stack #, para #)             Music  Vocal  Music  Vocal   Music  Vocal       Music  Vocal  Music  Vocal   Music  Vocal
DPCRN           (4, 6.90 M)   14.76  8.37   14.53  7.94    0.51   0.82        15.12  8.29   14.82  7.84    0.71   0.78
HA-DPCRN, K = 2 (4, 6.90 M)   14.71  8.31   14.47  7.89    0.54   0.82        14.92  7.95   14.60  7.52    0.75   0.79

We propose an STFT-domain singing voice separation model, HA-TPCRN, which expands DPCRN with an inter-band RNN module to model the recurring harmonic patterns along the frequency axis. Simulation results show that the proposed HA-TPCRN clearly outperforms the baseline DPCRN model and that the frequency step size K of the inter-band RNN module slightly affects the model's capability to capture the harmonic structure of the singer. Because male and female singers differ in pitch, and pitch often varies considerably while singing, we will combine inter-band RNNs with different step sizes in future work to capture the varying harmonic structures.

This research is supported by the Ministry of Science and Technology, Taiwan under Grant No. MOST 110-2221-E-A49-115-MY3.

1. Bregman, A. S. (1994). Auditory Scene Analysis: The Perceptual Organization of Sound (MIT Press, Cambridge, MA).
2. Défossez, A., Usunier, N., Bottou, L., and Bach, F. (2019). "DEMUCS: Deep extractor for music sources with extra unlabeled data remixed," arXiv:1909.01174.
3. Demirel, E., Ahlbäck, S., and Dixon, S. (2020). "Automatic lyrics transcription using dilated convolutional neural networks with self-attention," in Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), July 19–24, Glasgow, Scotland, pp. 1–8.
4. Fan, Z.-C., Jang, J.-S. R., and Lu, C.-L. (2016). "Singing voice separation and pitch extraction from monaural polyphonic audio music via DNN and adaptive pitch tracking," in Proceedings of the 2016 IEEE Second International Conference on Multimedia Big Data (BigMM), April 20–22, Taipei, Taiwan, pp. 178–185.
5. He, K., Zhang, X., Ren, S., and Sun, J. (2015). "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, December 11–18, Washington, DC, pp. 1026–1034.
6. Hennequin, R., Khlif, A., Voituret, F., and Moussallam, M. (2020). "Spleeter: A fast and efficient music source separation tool with pre-trained models," J. Open Source Software 5(50), 2154.
7. Huang, P.-S., Chen, S. D., Smaragdis, P., and Hasegawa-Johnson, M. (2012). "Singing-voice separation from monaural recordings using robust principal component analysis," in Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 25–30, Kyoto, Japan, pp. 57–60.
8. Jansson, A., Humphrey, E., Montecchio, N., Bittner, R., Kumar, A., and Weyde, T. (2017). "Singing voice separation with deep U-Net convolutional networks," in Proceedings of ISMIR 2017, October 23–27, Suzhou, China.
9. Kingma, D. P., and Ba, J. (2014). "Adam: A method for stochastic optimization," arXiv:1412.6980.
10. Le, X., Chen, H., Chen, K., and Lu, J. (2021). "DPCRN: Dual-path convolution recurrent network for single channel speech enhancement," arXiv:2107.05429.
11. Le Roux, J., Wisdom, S., Erdogan, H., and Hershey, J. R. (2019). "SDR—Half-baked or well done?," in Proceedings of ICASSP 2019, May 12–17, Brighton, UK, pp. 626–630.
12. Li, C., Luo, Y., Han, C., Li, J., Yoshioka, T., Zhou, T., Delcroix, M., Kinoshita, K., Boeddeker, C., Qian, Y., Watanabe, S., and Chen, Z. (2021). "Dual-path RNN for long recording speech separation," in Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), January 19–22, Shenzhen, China, pp. 865–872.
13. Liutkus, A., Stöter, F.-R., Rafii, Z., Kitamura, D., Rivet, B., Ito, N., Ono, N., and Fontecave, J. (2017). "The 2016 signal separation evaluation campaign," in Latent Variable Analysis and Signal Separation—12th International Conference, LVA/ICA 2015, August 25–28, Liberec, Czech Republic, pp. 323–332.
14. Luo, Y., and Mesgarani, N. (2019). "Conv-tasnet: Surpassing ideal time-frequency magnitude masking for speech separation," IEEE/ACM Trans. Audio Speech Lang. Process. 27(8), 1256–1266.
15. Luo, Y., and Yu, J. (2023). "Music source separation with band-split RNN," IEEE/ACM Trans. Audio Speech Lang. Process. 31, 1893–1901.
16. Moore, B. C. (2012). An Introduction to the Psychology of Hearing (Brill, Leiden, the Netherlands).
17. Ni, X., and Ren, J. (2022). "FC-U2-Net: A novel deep neural network for singing voice separation," IEEE/ACM Trans. Audio Speech Lang. Process. 30, 489–494.
18. Ozerov, A., Philippe, P., Bimbot, F., and Gribonval, R. (2007). "Adaptation of Bayesian models for single-channel source separation and its application to voice/music separation in popular songs," IEEE Trans. Audio Speech Lang. Process. 15(5), 1564–1578.
19. Raffel, C., McFee, B., Humphrey, E. J., Salamon, J., Nieto, O., Liang, D., Ellis, D. P., and Raffel, C. C. (2014). "mir_eval: A transparent implementation of common MIR metrics," in Proceedings of ISMIR 2014, October 27–31, Taipei, Taiwan, pp. 367–372.
20. Rafii, Z., Liutkus, A., Stöter, F.-R., Mimilakis, S. I., and Bittner, R. (2017). "The MUSDB18 corpus for music separation," Zenodo, https://doi.org/10.5281/zenodo.1117372
21. Rafii, Z., and Pardo, B. (2012). "Repeating pattern extraction technique (REPET): A simple method for music/voice separation," IEEE Trans. Audio Speech Lang. Process. 21(1), 73–84.
22. Smaragdis, P., Fevotte, C., Mysore, G. J., Mohammadiha, N., and Hoffman, M. (2014). "Static and dynamic source separation using nonnegative factorizations: A unified view," IEEE Signal Process. Mag. 31(3), 66–75.
23. Stoller, D., Ewert, S., and Dixon, S. (2018). "Wave-U-Net: A multi-scale neural network for end-to-end audio source separation," arXiv:1806.03185.
24. Stöter, F.-R., Uhlich, S., Liutkus, A., and Mitsufuji, Y. (2019). "Open-Unmix—A reference implementation for music source separation," J. Open Source Software 4(41), 1667.
25. Takahashi, N., Goswami, N., and Mitsufuji, Y. (2018). "MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation," in Proceedings of the 16th International Workshop on Acoustic Signal Enhancement (IWAENC), September 17–20, Tokyo, Japan, pp. 106–110.
26. Westhausen, N. L., and Meyer, B. T. (2020). "Dual-signal transformation LSTM network for real-time noise suppression," arXiv:2005.07551.
27. Zhang, X., Jiang, Y., Deng, J., Li, J., Tian, M., and Li, W. (2019). "A novel singer identification method using GMM-UBM," in Proceedings of the 6th Conference on Sound and Music Technology (CSMT) (Springer, New York), pp. 3–14.
28. Zhao, H., Zarar, S., Tashev, I., and Lee, C.-H. (2018). "Convolutional-recurrent neural networks for speech enhancement," in Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 12–17, Brighton, UK, pp. 2401–2405.