Microphone arrays have long been used to characterize and locate sound sources. However, existing algorithms for processing the signals are computationally expensive and, consequently, different methods need to be explored. Recently, the trained iterative soft thresholding algorithm (TISTA), a data-driven solver for inverse problems, was shown to improve on existing approaches. Here, a more in-depth analysis of its robustness and frequency dependence is provided using synthesized as well as real measurement data. It is demonstrated that TISTA yields favorable results in comparison to a covariance matrix fitting inverse method, especially for large numbers of sources.
1. Introduction
To reduce noise emissions, determining the characteristics as well as the locations of sound sources is crucial, and microphone arrays (MAs) are a powerful tool for this purpose. One way to approach the processing of the microphone signals is to formulate an inverse problem in which the microphone signals are thought of as the known effect of an unknown cause, that is, the sound field, which is to be derived from the measurement (Adler and Öktem, 2017; Donoho, 2006; Merino-Martínez et al., 2019). To solve this inverse problem, a variety of algorithms have been devised, one of which is the iterative soft thresholding algorithm (ISTA), from which subsequent methods like the fast iterative soft thresholding algorithm (FISTA) have been derived (Beck and Teboulle, 2009; Gregor and LeCun, 2010; Lylloff et al., 2015). Since earlier iterative methods are computationally demanding and additional effort often arises from the need to determine appropriate hyper-parameters, more recent advances have dropped the iterative nature in favor of machine learning approaches (Borgerding et al., 2017), where parameters are set automatically through training. One such method is the trained iterative soft thresholding algorithm (TISTA; Ito et al., 2019; Takabe et al., 2020), which was recently applied to MA data and significantly improved on existing methods in accuracy and computational load (Kayser et al., 2022). Using a state-of-the-art array geometry, this work investigates the performance of TISTA as a function of the number of iterations and the number of sources in the sound field. The covariance matrix fitting (CMF) method (Yardibi et al., 2008) is used for comparison.
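As background for the algorithm family named above, plain ISTA for the real-valued problem of minimizing 0.5·‖Ax − y‖² + λ‖x‖₁ can be sketched as follows. This is a generic textbook sketch under simplifying assumptions (real-valued data, fixed step size), not the TISTA variant or the acoustic formulation used in this work; the names `A`, `y`, `lam`, and `n_iter` are illustrative and do not come from the source.

```python
import numpy as np


def soft_threshold(v, tau):
    """Shrink every entry of v toward zero by tau (proximal map of tau * L1 norm)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def ista(A, y, lam, n_iter=500):
    """Plain ISTA for min_x 0.5 * ||A x - y||^2 + lam * ||x||_1 (real-valued sketch)."""
    # Lipschitz constant of the gradient: squared largest singular value of A.
    lipschitz = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        gradient = A.T @ (A @ x - y)          # gradient of the data-fit term
        x = soft_threshold(x - gradient / lipschitz, lam / lipschitz)
    return x


if __name__ == "__main__":
    # With an identity sensing matrix, the solution is the soft-thresholded data.
    print(ista(np.eye(3), np.array([2.0, -0.5, 1.0]), lam=0.75, n_iter=5))
```

In this plain form, the step size and the threshold are fixed by hand, which is the kind of hyper-parameter choice that trained variants aim to replace.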
2. Theoretical framework
2.1 Inverse problem formulation
2.2 TISTA algorithm
3. Methods
The measurement setup used in this work is illustrated in Fig. 1. The planar MA geometry follows Vogel's spiral (H = 1.0, V = 5.0; Sarradj, 2016) with M = 64 channels and an aperture of . The reference position is set to the origin of the coordinate system. The array faces a focus area of size with at distance . The area is sampled by a rectangular focus grid with equidistant spacing of , resulting in 10 201 focus points.
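The geometry described above can be illustrated with a minimal sketch. It assumes the classic equal-area Vogel spiral as a stand-in, so the H-dependent radial weighting of Sarradj (2016) is not reproduced, and the angular rule with parameter V is an assumption about that parametrization. The aperture, grid extent, and grid spacing used here are hypothetical placeholder values; only M = 64, V = 5.0, and the count of 10 201 = 101 × 101 focus points are taken from the text.

```python
import numpy as np


def vogel_spiral(n_mics=64, v=5.0, aperture=1.0):
    """Planar Vogel-type spiral; `aperture` (diameter) is a hypothetical value."""
    m = np.arange(1, n_mics + 1)
    # Equal-area radial law (stand-in for the H-weighted law, which is not modeled).
    radius = 0.5 * aperture * np.sqrt(m / n_mics)
    # Assumed angular rule: V = 5 yields golden-angle increments.
    angle = 2.0 * np.pi * m * (1.0 + np.sqrt(v)) / 2.0
    return np.column_stack((radius * np.cos(angle), radius * np.sin(angle)))


def focus_grid(extent=1.0, spacing=0.01):
    """Square equidistant grid centered on the origin (hypothetical extent/spacing)."""
    n = int(round(extent / spacing)) + 1      # points per axis, both edges included
    axis = np.linspace(-extent / 2.0, extent / 2.0, n)
    gx, gy = np.meshgrid(axis, axis, indexing="ij")
    return np.column_stack((gx.ravel(), gy.ravel()))


if __name__ == "__main__":
    print(vogel_spiral().shape)   # (64, 2)
    print(focus_grid().shape)     # (10201, 2)
```

Any extent-to-spacing ratio of 100 gives 101 points per axis and therefore the 10 201 focus points stated above.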
Microphone array (MA; black •), grid (gray +), and origin (black ×) are shown. For the sake of clarity, only every fourth grid point is displayed.
Regarding the preliminary training, trainable parameters are initialized with and for . The ADAM optimization algorithm (Kingma and Ba, 2015), with a learning rate of , is used for parameter optimization. Each optimization step is performed on a batch of 200 samples. After every 200 optimization steps, the loss is calculated on one batch of separate validation data, and training is concluded once this loss has not improved for 10 consecutive evaluation steps.
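The stopping rule described above can be sketched as a small framework-independent loop. The callbacks `step` and `validate` are hypothetical placeholders for one ADAM update on a batch of 200 samples and one validation-loss evaluation, respectively; `max_steps` is an added safety limit that is not part of the original description.

```python
def train_with_early_stopping(step, validate, eval_every=200, patience=10,
                              max_steps=1_000_000):
    """Run `step` repeatedly; stop after `patience` evaluations without improvement.

    Returns the number of optimization steps performed and the best validation loss.
    """
    best_loss = float("inf")
    stale_evals = 0                       # consecutive evaluations without improvement
    for n in range(1, max_steps + 1):
        step()                            # one optimizer update (placeholder)
        if n % eval_every == 0:
            loss = validate()             # loss on one separate validation batch
            if loss < best_loss:
                best_loss = loss
                stale_evals = 0
            else:
                stale_evals += 1
                if stale_evals >= patience:
                    return n, best_loss
    return max_steps, best_loss


if __name__ == "__main__":
    # Toy validation curve: three improvements, then a plateau.
    losses = iter([5.0, 4.0, 3.0] + [3.0] * 20)
    print(train_with_early_stopping(lambda: None, lambda: next(losses)))
```

Whether the original implementation restores the best parameters after stopping is not stated, so the sketch only reports the best loss.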
4. Results and discussion
In Fig. 2, the reconstruction errors for TISTA and the CMF method are shown. While the CMF method, serving as a reference, is run until its stopping criterion is reached, TISTA is trained and evaluated for increasing numbers of iterations. As introduced in Eqs. (5) and (6), two different error metrics are considered because the two algorithms optimize different objectives. At , TISTA outperforms the CMF method in terms of source map reconstruction, LMAP, as well as CSM reconstruction, LCSM, using as few as 20 iterations and retains preferable results for increasing numbers of sources, as can be seen in Figs. 2(A) and 2(C). At , however, as shown in Figs. 2(B) and 2(D), TISTA performs worse than the CMF method in terms of source map reconstruction regardless of how many iterations are used. This is most likely the result of the condition number of the sensing matrix being larger at lower Helmholtz numbers, which in turn renders the pseudo-inversion less precise. Since the pseudo-inverse matrix is applied at each iteration, this imprecision may not be overcome by increasing the number of iterations. Furthermore, the imprecision directly affects the decay of the training loss, rendering it less steep. Longer training times may improve accuracy, but they also carry the risk of overfitting. However, with increasing numbers of sources, the CMF method deteriorates more distinctly than TISTA, which yields better results at 20 sources and above, as reflected in LMAP and LCSM. Interestingly, as opposed to LMAP, the CSM reconstruction error does improve noticeably when TISTA is run with more iterations, and at 50 iterations, TISTA outperforms the CMF method in terms of LCSM.
Reconstruction errors averaged over ten source cases as a function of the number of TISTA iterations [(A), (C)] and sources [(B), (D)]. (A) and (B) show the source map error [Eq. (5)], and (C) and (D) show the CSM reconstruction error [Eq. (6)]. In (B) and (D), TISTA is set to T = 50.
In Fig. 3, synthetic example source maps are shown for TISTA and the CMF method. At , both yield results that resemble the true source distribution reasonably well, with the CMF map being slightly sparser. Differences between the two methods are more apparent at . TISTA displays many nonzero elements in the general area of the sound sources, making the inherent uncertainty of the method very apparent while still reasonably representing the true sound field. The CMF method yields a sparse source map that appears very precise, although there are visible discrepancies from the true source distribution. Most notably, two sources close to each other tend to be displayed as one source located between the two, whereas some other sources are clearly mislocated. Interestingly though, the total number of nonzero elements closely matches the true number of sources.
In Fig. 4, two different source cases from a real measurement are displayed, where the upper row represents He = 4, while the bottom row corresponds to He = 16. Estimated source levels are obtained via sector integration, with each sector depicted by a black circle. The loudspeaker dimension is indicated by a gray circle. Figure 4(A) consists of equally strong sources, whereas in Fig. 4(B), source strengths are different. In Fig. 4(A), the CMF method and TISTA underestimate the sources by roughly 1.5 dB, which can be attributed to the loudspeaker directivity. In Fig. 4(B), TISTA appears to be more reliable with respect to the weakest source. In general, TISTA yields comparably sparse results at but less sparse results at . Notably, these results demonstrate the robustness of TISTA with respect to the training data, as several key parameters are different in the measurement, particularly the speed of sound, which deviates by roughly 1.8 m/s, and the frequency of interest, which differs by approximately 70 Hz.
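The sector integration mentioned above can be sketched as follows. The sketch assumes that each source map entry holds a squared sound pressure in Pa² and that levels are referenced to 20 µPa; both are assumptions, as are the function name and all argument names. The sector radius used in the measurement is not reproduced here, so the value in the usage example is hypothetical.

```python
import numpy as np

P_REF = 2e-5  # reference sound pressure in Pa (assumed level convention)


def sector_level(source_map, grid_xy, center, radius):
    """Sum the source map over a circular sector and return the level in dB.

    source_map: (N,) squared sound pressures in Pa^2 (assumed unit)
    grid_xy:    (N, 2) focus point coordinates
    """
    distance = np.linalg.norm(grid_xy - np.asarray(center), axis=1)
    power = source_map[distance <= radius].sum()   # integrate inside the circle
    return 10.0 * np.log10(power / P_REF**2)


if __name__ == "__main__":
    grid_xy = np.array([[0.0, 0.0], [0.05, 0.0], [1.0, 0.0]])
    source_map = np.array([1.0, 1.0, 4.0]) * P_REF**2
    # Only the first two grid points fall inside the hypothetical 0.1 sector.
    print(sector_level(source_map, grid_xy, center=(0.0, 0.0), radius=0.1))
```

A sector that contains no energy yields a level of negative infinity, so empty sectors would need separate handling in practice.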
Source maps for TISTA (T = 50) and CMF at (upper row) and (lower row), along with the true source map. A total of 14 sources with different strengths are shown. The Rayleigh resolution limit, , is displayed in the lower right corner.
Experimental results with four loudspeakers (gray areas). The top and bottom rows correspond to He = 4 and He = 16, respectively. Black circles represent the circular integration sectors with a radius of . (A) considers equal SPLs, whereas (B) considers level differences with −3, −6, and −9 dB.
Even with the added noise and perturbations of a real measurement setup, TISTA exhibits qualities similar to the noiseless case presented in Kayser et al. (2022). Additionally, a larger array and a more finely resolved grid are used here; thus, TISTA can be considered a reasonably robust method that is applicable in different circumstances.
Considering the total number of FLOPs and the median wall clock time in Table 1, TISTA outperforms the CMF method, particularly at large source numbers. This is in accordance with their respective computational complexity: While TISTA has complexity (Ito et al., 2019), the CMF method has a greater complexity of for (Efron et al., 2004). Hence, TISTA can be considered an exceptionally fast method, especially because it can be efficiently parallelized on a graphics processing unit and thereby further accelerated (Abadi et al., 2015). For TISTA, the measured FLOPs are in good agreement with the theoretical number of FLOPs, which can be obtained by . Note that the training time of TISTA is not considered in the performance comparison.
Measured computational runtime performance represented by the median computation time and the number of FLOPs, where both are measured on a single CPU thread. The results are dependent on the number of sources, J.
Method | Performance (J = 10) | Performance (J = 50) | Performance (J = 100) |
---|---|---|---|
TISTA (T = 10) | 0.6 s (1.7 GFLOPs) | 0.6 s (1.7 GFLOPs) | 0.6 s (1.7 GFLOPs) |
TISTA (T = 50) | 2.5 s (8.4 GFLOPs) | 2.5 s (8.4 GFLOPs) | 2.5 s (8.4 GFLOPs) |
CMF | 4.1 s (4.2 GFLOPs) | 7.5 s (26.0 GFLOPs) | 12.1 s (50.9 GFLOPs) |
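The scaling behavior in Table 1 can be checked with a few lines of arithmetic. The sketch below only re-uses the tabulated values; the variable names are illustrative. It computes the TISTA cost per iteration and the CMF-to-TISTA (T = 50) ratios for wall clock time and FLOPs.

```python
# Values transcribed from Table 1: {J: (median time in s, GFLOPs)}.
TABLE = {
    "TISTA (T = 10)": {10: (0.6, 1.7), 50: (0.6, 1.7), 100: (0.6, 1.7)},
    "TISTA (T = 50)": {10: (2.5, 8.4), 50: (2.5, 8.4), 100: (2.5, 8.4)},
    "CMF": {10: (4.1, 4.2), 50: (7.5, 26.0), 100: (12.1, 50.9)},
}

# GFLOPs per TISTA iteration; near-equal values indicate linear scaling in T.
per_iteration = {t: TABLE[f"TISTA (T = {t})"][10][1] / t for t in (10, 50)}

# Ratios CMF / TISTA (T = 50) for each number of sources J.
time_ratio = {j: TABLE["CMF"][j][0] / TABLE["TISTA (T = 50)"][j][0]
              for j in (10, 50, 100)}
flop_ratio = {j: TABLE["CMF"][j][1] / TABLE["TISTA (T = 50)"][j][1]
              for j in (10, 50, 100)}

if __name__ == "__main__":
    for t, value in per_iteration.items():
        print(f"T = {t}: {value:.3f} GFLOPs per iteration")
    for j in (10, 50, 100):
        print(f"J = {j}: time ratio {time_ratio[j]:.2f}, FLOP ratio {flop_ratio[j]:.2f}")
```

The per-iteration cost is about 0.17 GFLOPs for both values of T, consistent with a cost that grows linearly with the number of iterations and is independent of J. The ratios also show that at J = 10 the CMF method needs fewer FLOPs than TISTA with T = 50 although its median runtime is longer, whereas at J = 50 and J = 100, TISTA is ahead on both measures.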
While the preliminary training process can take up to 6 h per frequency, the effort can be expected to decrease with each new frequency when transfer learning is employed, as demonstrated in Kujawski and Sarradj (2022), where optimized parameters of one frequency are used as initial values for training the next neighboring frequency. Although no detailed results are displayed here, training generally exhibits convergent behavior, and the evaluation error mostly coincides with the training error, which can be considered an indicator that no overfitting is taking place. On the other hand, as the sparsity of the signal is a central consideration in the design of TISTA, it can be expected that some overfitting occurs with respect to the number of sources (Borgerding et al., 2017).
5. Conclusion
In this work, TISTA is applied to the processing of MA signals using an inverse problem formulation. Building on recent results, TISTA is further examined with respect to its frequency dependence, different numbers of sources, and the presence of noise. Using synthetic as well as real measurement data, the results presented here show that TISTA can recover source distributions from MAs with reasonable accuracy even in the presence of noise. Computation time is significantly decreased compared to the CMF method by shifting computational expense to a preliminary training process, which is in accordance with other recent advances using machine learning approaches. While the performance of TISTA is generally worse at low Helmholtz numbers, it degrades less than that of the CMF method for increasing numbers of sources.
Acknowledgments
The authors gratefully acknowledge the support of this research by Deutsche Forschungsgemeinschaft through Grant No. SA 1502/13-1.