Microphone arrays have long been used to characterize and locate sound sources. However, existing algorithms for processing the signals are computationally expensive, and consequently, alternative methods need to be explored. Recently, the trained iterative soft thresholding algorithm (TISTA), a data-driven solver for inverse problems, was shown to improve on existing approaches. Here, a more in-depth analysis of its robustness and frequency dependence is provided using synthesized as well as real measurement data. It is demonstrated that TISTA yields favorable results in comparison to a covariance matrix fitting inverse method, especially for large numbers of sources.

To reduce noise emissions, determining the characteristics as well as the locations of sound sources is crucial, and microphone arrays (MAs) present a powerful tool for this purpose. One way to approach the processing of the microphone signals is by formulating an inverse problem in which the microphone signals are regarded as the known effect of an unknown cause, that is, the sound field, which is to be derived from the measurement (Adler and Öktem, 2017; Donoho, 2006; Merino-Martínez et al., 2019). To solve this inverse problem, a variety of algorithms have been devised, one of which is the iterative soft thresholding algorithm (ISTA), from which successor methods like the fast iterative soft thresholding algorithm (FISTA) have been derived (Beck and Teboulle, 2009; Gregor and LeCun, 2010; Lylloff et al., 2015). Since earlier iterative methods are computationally demanding and additional effort often arises from the need to determine appropriate hyper-parameters, more recent advances have dropped the iterative nature in favor of machine learning approaches (Borgerding et al., 2017), where parameters are automatically set to optimal values. One such method is the trained iterative soft thresholding algorithm (TISTA; Ito et al., 2019; Takabe et al., 2020), which was recently applied to MA data and significantly improved on existing methods in accuracy and computational load (Kayser et al., 2022). Using a state-of-the-art array geometry, this work investigates the performance of TISTA as a function of the number of iterations and the number of sources in the sound field. The covariance matrix fitting (CMF) method (Yardibi et al., 2008) is used for comparison.

An array of M microphones is used to measure a sound field that is discretized using S focus points, while other contributions are treated as perturbations and noise. The vector of microphone signals is denoted p ∈ ℂ^M, while q ∈ ℂ^S holds the contribution of each source at the array center, x_0 ∈ ℝ^3. Assuming an anechoic sound field and no noise, a matrix Ψ ∈ ℂ^{M×S} can be defined such that p = Ψq. The elements of this matrix are given by Ψ_{ms} = (r_{s,m}/r_{s,0}) e^{−jk(r_{s,m} − r_{s,0})}, where k is the wave number and r denotes the distance from the source at x_s ∈ ℝ^3 to the microphone at x_m ∈ ℝ^3 and to the reference position x_0, respectively. The cross-spectral matrix (CSM) of microphone signals, C ∈ ℂ^{M×M}, is then given by
C = E{p p^H} = Ψ E{q q^H} Ψ^H = Ψ Q Ψ^H,   (1)
where E denotes the expected value, ( · ) H denotes conjugate transposition, and Q is the CSM of the source signals. If the source signals are exclusively incoherent, Q is diagonal and Eq. (1) can be rewritten as
m = A s + n,   (2)
with the measurement vector m that contains the upper triangle elements of C, the signal vector s ∈ ℝ^S that contains the diagonal elements of Q, the sensing matrix A, and the noise vector n. Column s of the sensing matrix contains the elements of ψ_s ψ_s^H corresponding to the upper triangle elements of C, where ψ_s is the sth column of Ψ. In cases where M < S, the inverse problem is ill-posed. However, the unknown signal, s, can often be assumed to be sparse, with only a few nonzero elements, and thus an approximate solution can be obtained by linear regression with an ℓ₁-penalty, where ‖m − As‖₂² + γ‖s‖₁ is minimized. A well-known MA method based on this problem formulation is the CMF method (Yardibi et al., 2008). A crucial condition is that a reasonable value for γ ≥ 0 needs to be determined in advance to obtain meaningful source maps, which can be viewed as a hyper-parameter optimization problem. A particularly efficient regression method is the least angle regression-Lasso (LARS-Lasso) algorithm (Efron et al., 2004), whereby the best hyper-parameters in terms of the bias-variance trade-off can be determined via the Bayesian information criterion (BIC). In this work, the CMF method, employing LARS-Lasso and BIC, is used for comparison.
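As a concrete illustration of this model, the following NumPy sketch (a toy example with hypothetical dimensions, not the array described below) builds a random stand-in for the transfer matrix Ψ, forms the CSM of incoherent sources, and verifies that the upper-triangle measurements satisfy Eq. (2) with n = 0:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; the setup below uses M = 64 and S = 10201).
M, S = 8, 40
iu = np.triu_indices(M)                      # upper-triangle indices of C

# Random stand-in for the transfer matrix Psi (complex, M x S).
Psi = rng.standard_normal((M, S)) + 1j * rng.standard_normal((M, S))

# Sparse nonnegative source auto-powers s (the diagonal of Q).
s = np.zeros(S)
s[rng.choice(S, 3, replace=False)] = rng.rayleigh(1.0, 3)

# CSM of incoherent sources: C = Psi diag(s) Psi^H, cf. Eq. (1).
C = (Psi * s) @ Psi.conj().T

# Column j of A holds the upper-triangle elements of psi_j psi_j^H.
A = np.stack([np.outer(Psi[:, j], Psi[:, j].conj())[iu] for j in range(S)],
             axis=1)

m = C[iu]                                    # noiseless measurement vector
assert np.allclose(A @ s, m)                 # Eq. (2) with n = 0
```

The check confirms that, for incoherent sources, the upper triangle of the CSM is a linear function of the source auto-powers, which is what makes the sparse linear regression formulation possible.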
Another possible approach for finding an approximate solution is TISTA (Ito et al., 2019; Takabe et al., 2020), a data-driven method derived from the ISTA (Beck and Teboulle, 2009). Given an initial guess, s̃_1, the algorithm proceeds recursively as
s̃_{t+1} = η(s̃_t + β_t B (m − A s̃_t); λ_t),   (3)
where t is the iteration index, B = A^T (A A^T)^{−1} is the pseudo-inverse of A, and η is the soft shrinkage function, defined element-wise as
η(x; λ) = sign(x) max(|x| − λ, 0),   (4)
and after the final iteration, all of the negative values are set to zero because source strength is strictly nonnegative.
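The recursion of Eqs. (3) and (4) can be sketched as follows. Note that constant values for the step size and threshold are used here in place of the learned β_t and λ_t, and the toy data are real-valued; this is a minimal illustration of the iteration, not the trained TISTA:

```python
import numpy as np

def soft(x, lam):
    """Element-wise soft shrinkage, cf. Eq. (4)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def tista_like(A, m, T=50, beta=1.0, lam=0.01):
    """Recursion of Eq. (3) with fixed (untrained) step size and threshold.

    In TISTA proper, beta_t and lam_t are learned per iteration; constants
    are used here only to keep the sketch self-contained."""
    B = A.T @ np.linalg.inv(A @ A.T)        # pseudo-inverse of A (full row rank)
    s = np.zeros(A.shape[1])                # initial guess
    for _ in range(T):
        s = soft(s + beta * B @ (m - A @ s), lam)
    return np.maximum(s, 0.0)               # source strength is nonnegative

# Toy problem: 30 measurements, 60 unknowns, 3 active sources.
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 60)) / np.sqrt(30)
s_true = np.zeros(60)
s_true[[5, 17, 42]] = [1.0, 0.5, 1.5]
s_hat = tista_like(A, A @ s_true)           # noiseless m = A s_true
```

Because B A is a projection, the step size 1 satisfies the usual ISTA step-size condition, and the iteration recovers the dominant sources of the toy problem.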
In the original ISTA, constant values for the step size, β, and the shrinkage parameter, λ, need to be determined such that a solution with small error is found at a reasonable convergence rate. In contrast, TISTA uses individual values β_t and λ_t for each of a fixed number of iterations, T, which significantly speeds up convergence. Optimal estimates for β_t and λ_t can be found by supervised learning. For this purpose, the so-called loss function is defined as
L_MAP = ‖s̃ − s‖₂²,   (5)
reflecting the deviation of the reconstructed source map, s̃, from the true source distribution, s. In a preliminary training process, the parameters β_t and λ_t are adjusted so that the loss becomes minimal for a given data set. Since TISTA optimizes the source map while the CMF method optimizes the CSM, an additional evaluation metric accounting for the CSM reconstruction error is used and denoted
L_CSM = ‖C̃ − C‖_F / ‖C‖_F,   (6)
where C̃ is the reconstructed CSM, C is the true CSM, and ‖·‖_F denotes the Frobenius norm.
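Both metrics are straightforward to evaluate; a minimal sketch (the exact normalization of Eqs. (5) and (6) is assumed here) could read:

```python
import numpy as np

def l_map(s_hat, s_true):
    """Source map reconstruction error, cf. Eq. (5): squared 2-norm deviation."""
    return np.sum((s_hat - s_true) ** 2)

def l_csm(C_hat, C_true):
    """Relative CSM reconstruction error in the Frobenius norm, cf. Eq. (6)."""
    return np.linalg.norm(C_hat - C_true, "fro") / np.linalg.norm(C_true, "fro")
```

A reconstructed CSM for use in l_csm can be assembled from a source map via C̃ = Ψ diag(s̃) Ψ^H, following Eq. (1).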

The measurement setup used in this work is illustrated in Fig. 1. The planar MA geometry follows Vogel's spiral (H = 1.0, V = 5.0; Sarradj, 2016) with M = 64 channels and an aperture of d_a = 1 m. The reference position is set to the origin of the coordinate system. The array faces a focus area of size d_x × d_y with d_x = d_y = d_a at distance d_z = 0.5 d_a. The area is sampled by a rectangular focus grid with equidistant spacing of 0.01 d_a, resulting in 10,201 focus points.
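A layout of this kind can be sketched with the classic Vogel (sunflower) spiral; note that Sarradj (2016) generalizes the spiral with the parameters H and V, which this simplified sketch omits:

```python
import numpy as np

def vogel_spiral(n_mics, aperture):
    """Classic Vogel spiral layout; Sarradj (2016) generalizes this with the
    parameters H and V, which are omitted in this simplified sketch."""
    i = np.arange(1, n_mics + 1)
    golden_angle = np.pi * (3.0 - np.sqrt(5.0))
    r = 0.5 * aperture * np.sqrt(i / n_mics)   # radius grows with sqrt(i)
    phi = i * golden_angle                     # golden-angle rotation per mic
    return np.column_stack([r * np.cos(phi), r * np.sin(phi)])

d_a = 1.0                                      # aperture in m
mics = vogel_spiral(64, d_a)

# Rectangular focus grid: d_x = d_y = d_a, spacing 0.01 d_a -> 101 x 101 points
# at distance d_z = 0.5 d_a from the array plane.
x = np.linspace(-0.5 * d_a, 0.5 * d_a, 101)
gx, gy = np.meshgrid(x, x)
grid = np.column_stack([gx.ravel(), gy.ravel(), np.full(gx.size, 0.5 * d_a)])
```

The grid construction reproduces the 10,201 focus points quoted above.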

Fig. 1.

Microphone array (MA; black •), grid (gray +), and origin (black ×) are shown. For the sake of clarity, only every fourth grid point is displayed.

Training data are generated on demand during training following Borgerding et al. (2017), Ito et al. (2019), and Takabe et al. (2020), starting with a random source distribution, which is then multiplied by the sensing matrix to yield the corresponding microphone data. Source distributions are generated using the random vectors r_1 ∈ ℝ^S, drawn element-wise from a uniform distribution U(0, 1) for the source positions, and r_2 ∈ ℝ^S, drawn from a Rayleigh distribution R(1) for the source strengths. Using the step function
Θ(x) = 1 if x < q, and 0 otherwise,   (7)
where q = 0.001 reflects the number of sources relative to the number of grid points, a source distribution can be formalized as
s = Θ(r_1) ⊙ r_2,   (8)
where Θ is applied element-wise and ⊙ denotes the element-wise (Hadamard) product.
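The sampling scheme of Eqs. (7) and (8) can be sketched as follows (function and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_source_map(S, q=0.001):
    """Random sparse source distribution following Eqs. (7) and (8):
    a U(0, 1) position indicator gated by q, scaled by Rayleigh strengths."""
    r1 = rng.uniform(0.0, 1.0, S)              # source positions
    r2 = rng.rayleigh(1.0, S)                  # source strengths
    return np.where(r1 < q, 1.0, 0.0) * r2     # Eq. (8)

S = 10201
s = random_source_map(S)
# The corresponding noiseless microphone data for a given sensing matrix A
# would then follow from Eq. (2) as m = A @ s.
```

With q = 0.001 and S = 10,201, each sample contains about 10 active sources on average.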
For the evaluation data, to account for possible imprecisions in the microphone geometry, a set of 100 sensing matrices is generated, where microphone positions are drawn from a bivariate normal distribution with a standard deviation equal to 0.1% of the aperture. One of the matrices is randomly selected for each evaluation step. Data are generated for different frequencies f (Hz) as given by the Helmholtz number He = f d_a / c, where c denotes the speed of sound, using values of He = 4 and He = 16. Training of TISTA is conducted individually for each Helmholtz number as well as for various numbers of iterations, that is, 10-50 iterations in increments of 10. Models are evaluated with an increasing number of sources, specifically 10-100 in increments of 10. The CMF method and every instance of TISTA are evaluated using the same data, with each evaluation consisting of ten sample source distributions over which the final error is averaged. Additionally, the methods are evaluated on real measurement data, using the setup described in Kujawski and Sarradj (2022). Two different cases are investigated, each consisting of four loudspeakers, one with equal strengths and one with sound pressure level (SPL) differences of −3, −6, and −9 dB. Computational performances of TISTA and the CMF method are measured as the total number of floating point operations (FLOPs), using the performance application programming interface (PAPI; Terpstra et al., 2010), and as the median wall clock time on an Intel Xeon Silver 4214R central processing unit (CPU; Intel, Santa Clara, CA) at 2.40 GHz, using a single thread with double precision. The performance comparison considers 10, 50, and 100 sources, each represented by 10 distinct source cases. TISTA is set to T = 10 and T = 50 iterations, while the CMF method is kept running until its stopping criterion is reached. The CMF method is used as implemented in the Acoular framework (Sarradj and Herold, 2017), whereas TISTA relies on TensorFlow (Abadi et al., 2015). Both implementations are optimized for runtime.

Regarding the preliminary training, the trainable parameters are initialized with β_t = 1 and λ_t = 0 for 0 ≤ t ≤ T. The ADAM optimization algorithm (Kingma and Ba, 2015), with a learning rate of ρ = 8 × 10^{−4}, is used for parameter optimization. Each optimization step is performed for a batch of 200 samples. After every 200 optimization steps, the loss is calculated for one batch of separate validation data, and training is concluded once it does not improve for 10 consecutive evaluation steps.
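The evaluation schedule described above amounts to early stopping with a patience of 10; a generic sketch, with opt_step and val_loss as hypothetical stand-ins for the TensorFlow training and validation computations, could look like this:

```python
def train_with_early_stopping(opt_step, val_loss, patience=10,
                              steps_per_eval=200):
    """Run `steps_per_eval` optimization steps per evaluation and stop once
    the validation loss has not improved for `patience` consecutive
    evaluations. `opt_step` and `val_loss` are caller-supplied callables
    (hypothetical stand-ins for the actual training and validation code)."""
    best, stale = float("inf"), 0
    while stale < patience:
        for _ in range(steps_per_eval):
            opt_step()                  # one ADAM update on a training batch
        loss = val_loss()               # loss on one validation batch
        if loss < best:
            best, stale = loss, 0       # improvement: reset the counter
        else:
            stale += 1                  # no improvement: count toward patience
    return best
```

The returned value is the best validation loss observed before the stopping criterion fired.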

In Fig. 2, the reconstruction errors for TISTA and the CMF method are shown. While the CMF method, serving as a reference, is kept running until its stopping criterion is reached, TISTA is trained and evaluated for increasing numbers of iterations. As introduced in Eqs. (5) and (6), two different error metrics are considered because the two algorithms optimize different objectives. At He = 16, TISTA outperforms the CMF method in terms of source map reconstruction, L_MAP, as well as CSM reconstruction, L_CSM, using as few as 20 iterations, and retains preferable results for increasing numbers of sources, as can be seen in Figs. 2(A) and 2(C). At He = 4, however, as shown in Figs. 2(B) and 2(D), TISTA performs worse than the CMF method in terms of source map reconstruction regardless of how many iterations are used. This is most likely the result of the condition number of the sensing matrix being larger at lower Helmholtz numbers, which renders the pseudo-inversion less precise. Since the pseudo-inverse matrix is applied at each iteration, this imprecision may not be overcome by increasing the number of iterations. Furthermore, the imprecision directly affects the decay of the training loss, rendering it less steep. Longer training times may improve accuracy, but they also carry the risk of overfitting. However, with increasing numbers of sources, the CMF method deteriorates more distinctly than TISTA, which yields better results at 20 sources and above, as reflected in both L_MAP and L_CSM. Interestingly, as opposed to L_MAP, the CSM reconstruction error does improve noticeably when TISTA is run with more iterations, and at 50 iterations, it outperforms the CMF method in terms of L_CSM.

Fig. 2.

Reconstruction errors averaged over ten source cases as a function of the number of TISTA iterations [(A), (C)] and sources [(B), (D)]. (A) and (B) show the source map error [Eq. (5)], and (C) and (D) show the CSM reconstruction error [Eq. (6)]. In (B) and (D), TISTA is set to T = 50.


In Fig. 3, synthetic example source maps are shown for TISTA and the CMF method. At He = 16, both yield results that resemble the true source distribution reasonably well, with the CMF map being slightly more sparse. The differences between the two methods are more apparent at He = 4. TISTA displays many nonzero elements in the general area of the sound sources, making the inherent uncertainty of the method very apparent while still representing the true sound field reasonably well. The CMF method yields a sparse source map that appears very precise, although there are visible discrepancies from the true source distribution. Most notably, two sources close to each other tend to be displayed as a single source located between the two, whereas some other sources are clearly mislocated. Interestingly, though, the total number of nonzero elements closely matches the true number of sources. In Fig. 4, two different source cases from a real measurement are displayed, where the upper row represents He = 4 and the bottom row corresponds to He = 16. Estimated source levels are obtained via sector integration, depicted by a black circle. The loudspeaker dimension is indicated by a gray circle. Figure 4(A) consists of equally strong sources, whereas in Fig. 4(B), the source strengths differ. In Fig. 4(A), the CMF method and TISTA underestimate the sources by roughly 1.5 dB, which can be attributed to the loudspeaker directivity. In Fig. 4(B), TISTA appears to be more reliable with respect to the weakest source. In general, TISTA yields comparably sparse results at He = 16 but less sparse results at He = 4. Notably, these results demonstrate the robustness of TISTA with respect to the training data, as several key parameters differ in the measurement, particularly the speed of sound, which deviates by roughly 1.8 m/s, and the frequency of interest, which differs by approximately 70 Hz.

Fig. 3.

Source maps for TISTA (T = 50) and CMF at He = 4 (upper row) and He = 16 (lower row), along with the true source map. A total of 14 sources with different strengths are shown. The Rayleigh resolution limit, 1.22 d_z/He, is displayed in the lower right corner.

Fig. 4.

Experimental results with four loudspeakers (gray areas). The top and bottom rows correspond to He = 4 and He = 16, respectively. Black circles represent the circular integration sectors with a radius of 0.05 d_a. (A) considers equal SPLs, whereas (B) considers level differences of −3, −6, and −9 dB.


Even with the added noise and perturbations of a real measurement setup, TISTA exhibits qualities similar to the noiseless case presented in Kayser et al. (2022). Additionally, a larger array as well as a more highly resolved grid are used here, and thus, TISTA can be considered a reasonably robust method that is applicable under different circumstances.

Considering the total number of FLOPs and the median wall clock time in Table 1, TISTA outperforms the CMF method, particularly at large source numbers. This is in accordance with their respective computational complexities: while TISTA has complexity O(S²) (Ito et al., 2019), the CMF method has a greater complexity of O(M³) for S ≫ M (Efron et al., 2004). Hence, TISTA can be considered an exceptionally fast method, especially because it can be efficiently parallelized on a graphics processing unit and thereby further accelerated (Abadi et al., 2015). For TISTA, the measured FLOPs are in good agreement with the theoretical number of FLOPs, which can be obtained as FLOPs = T M (4S + 2). Note that the training time of TISTA is not considered in the performance comparison.
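For illustration, the formula can be evaluated numerically. Reading M in the formula as the length of the real-valued measurement vector, that is, the M(M + 1)/2 upper-triangle elements with real and imaginary parts stacked (an interpretation assumed here, not stated explicitly in the text), yields a value close to the 1.7 GFLOPs reported in Table 1 for T = 10:

```python
def tista_flops(T, n_meas, S):
    """Theoretical FLOP count, FLOPs = T * M * (4S + 2), with M read as the
    measurement-vector length n_meas (an assumption, see above)."""
    return T * n_meas * (4 * S + 2)

M_mic, S = 64, 10201
# M(M+1)/2 upper-triangle entries, with real and imaginary parts stacked:
n_meas = M_mic * (M_mic + 1)
print(tista_flops(10, n_meas, S))   # → 1697529600, i.e., about 1.7 GFLOPs
```

Under this reading, the count is independent of the number of sources, consistent with the constant TISTA rows of Table 1.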

Table 1.

Measured computational runtime performance, represented by the median computation time and the number of FLOPs, both measured on a single CPU thread. The results depend on the number of sources, J.

Method          | J = 10              | J = 50              | J = 100
TISTA (T = 10)  | 0.6 s (1.7 GFLOPs)  | 0.6 s (1.7 GFLOPs)  | 0.6 s (1.7 GFLOPs)
TISTA (T = 50)  | 2.5 s (8.4 GFLOPs)  | 2.5 s (8.4 GFLOPs)  | 2.5 s (8.4 GFLOPs)
CMF             | 4.1 s (4.2 GFLOPs)  | 7.5 s (26.0 GFLOPs) | 12.1 s (50.9 GFLOPs)

While the preliminary training process can take up to 6 h per frequency, this effort can be expected to decrease with each new frequency when transfer learning is employed, as demonstrated in Kujawski and Sarradj (2022), where the optimized parameters of one frequency are used as initial values for training the neighboring frequency. Although no detailed results are displayed here, training generally exhibits convergent behavior, and the evaluation error mostly coincides with the training error, which can be considered an indicator that no overfitting is taking place. On the other hand, as the sparsity of the signal is a central consideration in the design of TISTA, some overfitting with respect to the number of sources can be expected (Borgerding et al., 2017).

In this work, TISTA is applied to the processing of MA signals using an inverse problem formulation. Building on recent results, TISTA is further examined with respect to its frequency dependence, different numbers of sources, and the presence of noise. Using synthetic as well as real measurement data, the results presented here show that TISTA can recover source distributions from MA data with reasonable accuracy even in the presence of noise. Computation time is significantly decreased compared to the CMF method by shifting the computational expense to a preliminary training process, which is in line with other recent advances using machine learning approaches. While the performance of TISTA is generally worse at low Helmholtz numbers, it degrades less than that of the CMF method for increasing numbers of sources.

The authors thankfully acknowledge the support of this research by Deutsche Forschungsgemeinschaft through Grant No. SA 1502/13-1.

1. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). "TensorFlow: Large-scale machine learning on heterogeneous systems," software available at tensorflow.org (Last viewed February 26, 2023).
2. Adler, J., and Öktem, O. (2017). "Solving ill-posed inverse problems using iterative deep neural networks," Inverse Probl. 33(12), 124007.
3. Beck, A., and Teboulle, M. (2009). "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imaging Sci. 2(1), 183-202.
4. Borgerding, M., Schniter, P., and Rangan, S. (2017). "AMP-inspired deep networks for sparse linear inverse problems," IEEE Trans. Signal Process. 65(16), 4293-4308.
5. Donoho, D. L. (2006). "Compressed sensing," IEEE Trans. Inform. Theory 52(4), 1289-1306.
6. Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). "Least angle regression," Ann. Statist. 32(2), 407-499.
7. Gregor, K., and LeCun, Y. (2010). "Learning fast approximations of sparse coding," in Proceedings of the International Conference on Machine Learning, June 21-24, Haifa, Israel, pp. 399-406.
8. Ito, D., Takabe, S., and Wadayama, T. (2019). "Trainable ISTA for sparse signal recovery," IEEE Trans. Signal Process. 67(12), 3113-3125.
9. Kayser, C., Kujawski, A., and Sarradj, E. (2022). "A trainable iterative soft thresholding algorithm for microphone array source mapping," in Proceedings of the Berlin Beamforming Conference, June 8-9, Berlin, Germany.
10. Kingma, D. P., and Ba, J. L. (2015). "Adam: A method for stochastic optimization," in Proceedings of the International Conference on Learning Representations, May 7-9, San Diego, CA.
11. Kujawski, A., and Sarradj, E. (2022). "Fast grid-free strength mapping of multiple sound sources from microphone array data using a transformer architecture," J. Acoust. Soc. Am. 152(5), 2543-2556.
12. Lylloff, O. A., Fernández-Grande, E., Agerkvist, F., Hald, J., Roig, E. T., and Andersen, M. S. (2015). "Improving the efficiency of deconvolution algorithms for sound source localization," J. Acoust. Soc. Am. 138, 172-180.
13. Merino-Martínez, R., Sijtsma, P., Snellen, M., Ahlefeldt, T., Antoni, J., Bahr, C. J., Blacodon, D., Ernst, D., Finez, A., Funke, S., Geyer, T. F., Haxter, S., Herold, G., Huang, X., Humphreys, W. M., Leclère, Q., Malgoezar, A., Michel, U., Padois, T., Pereira, A., Picard, C., Sarradj, E., Siller, H., Simons, D. G., and Spehr, C. (2019). "A review of acoustic imaging methods using phased microphone arrays," CEAS Aeronaut. J. 10, 197-230.
14. Sarradj, E. (2016). "A generic approach to synthesize optimal array microphone arrangements," in Proceedings of the Berlin Beamforming Conference, February 29-March 1, Berlin, Germany, pp. 1-12.
15. Sarradj, E., and Herold, G. (2017). "A Python framework for microphone array data processing," Appl. Acoust. 116, 50-58.
16. Takabe, S., Wadayama, T., and Eldar, Y. C. (2020). "Complex trainable ISTA for linear and nonlinear inverse problems," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, May 4-8, Barcelona, Spain, pp. 5020-5024.
17. Terpstra, D., Jagode, H., You, H., and Dongarra, J. (2010). "Collecting performance data with PAPI-C," in Tools for High Performance Computing, edited by M. S. Müller, M. M. Resch, A. Schulz, and W. E. Nagel (Springer, Berlin), pp. 157-173.
18. Yardibi, T., Li, J., Stoica, P., and Cattafesta, L. (2008). "Sparsity constrained deconvolution approaches for acoustic source mapping," J. Acoust. Soc. Am. 123, 2631-2642.