A method is presented to retrieve the parameters used to create a multitrack mix using only the raw tracks and the stereo mixdown. The method can model linear time-invariant effects such as gain, pan, equalisation, delay, and reverb. Nonlinear effects, such as distortion and compression, are not considered in this work. Optimization is performed by stochastic gradient descent with the aid of differentiable digital signal processing modules. This approach yields a fully interpretable representation of the mixing signal chain by explicitly modelling the audio effects rather than using differentiable black-box modules. Two reverb module architectures are proposed, a “stereo reverb” model and an “individual reverb” model, and each is discussed. Objective feature measures are computed on the outputs of the two architectures when tasked with estimating a target mix and are compared against a stereo gain mix baseline. A listening study measures how closely the two architectures can perceptually match a reference mix relative to a stereo gain mix. Results show that the stereo reverb model performs best on the objective measures, and that there is no statistically significant difference between participants' perception of the stereo reverb model and the reference mixes.
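
The core idea can be illustrated with a minimal sketch, assuming a gain-and-pan-only signal chain optimized against a simple time-domain loss. The tensor names (`tracks`, `target_mix`), the constant-power pan mapping, the use of PyTorch, and the mean-squared-error loss are all illustrative assumptions rather than the authors' implementation, which additionally models equalisation, delay, and reverb.

```python
# Illustrative sketch (not the paper's code): estimate per-track gain and pan
# parameters by gradient descent so that a differentiable mix of the raw
# tracks approaches a given stereo target mix.
import torch

n_tracks, n_samples = 8, 44100
tracks = torch.randn(n_tracks, n_samples)       # placeholder raw tracks
target_mix = torch.randn(2, n_samples)          # placeholder reference stereo mix

# Trainable mixing parameters: one gain (in dB) and one pan position per track.
gain_db = torch.zeros(n_tracks, requires_grad=True)
pan = torch.zeros(n_tracks, requires_grad=True)  # negative = left, positive = right

optimizer = torch.optim.Adam([gain_db, pan], lr=1e-2)

def render(tracks, gain_db, pan):
    """Differentiable gain + constant-power pan, summed to a stereo bus."""
    gain = 10.0 ** (gain_db / 20.0)
    theta = (pan.tanh() + 1.0) * (torch.pi / 4.0)      # map to [0, pi/2]
    left = (gain * torch.cos(theta)).unsqueeze(1) * tracks
    right = (gain * torch.sin(theta)).unsqueeze(1) * tracks
    return torch.stack([left.sum(0), right.sum(0)])    # shape (2, n_samples)

for step in range(500):
    optimizer.zero_grad()
    estimate = render(tracks, gain_db, pan)
    # A time-domain MSE stands in for whatever loss the full method uses.
    loss = torch.nn.functional.mse_loss(estimate, target_mix)
    loss.backward()
    optimizer.step()
```

Because every module in the chain is an explicit, differentiable audio effect, the recovered values of `gain_db` and `pan` remain directly interpretable as mixing parameters, which is the property the abstract contrasts with black-box neural modules.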
