A unified beamforming and source separation model for static and dynamic human-robot interaction

Abstract: This paper presents a unified model for combining beamforming and blind source separation (BSS). The validity of the model's assumptions is confirmed by accurately recovering target speech information in noise using Oracle information. Using real static human-robot interaction (HRI) data, the proposed combination of BSS with the minimum-variance distortionless response beamformer provides a greater signal-to-noise ratio (SNR) than previous parallel and cascade systems that combine BSS and beamforming. In the difficult-to-model dynamic HRI environment, where the parallel combination is infeasible, the system provides an SNR gain 2.8 dB greater than the results obtained with the cascade combination. © 2024 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution (CC BY) license.


Introduction
Human-robot interaction (HRI), primarily through spoken language, is becoming enormously relevant in all areas where collaborative and social human-robot integration occurs.

Beamforming and dynamic environments
Speech enhancement is a crucial part of speech communication and speech recognition systems.1 Unlike single-channel speech enhancement,2 multichannel speech enhancement can exploit the extra spatial information provided by additional microphones, allowing the separation of the target signal and interference if they come from different directions.3 The delay-and-sum (D&S) and minimum-variance distortionless response (MVDR) beamformers are two of the many techniques that have been proposed. D&S beamforming4 is accomplished by delaying the microphone signals according to the time difference of arrival (TDOA) to synchronize and add all the channels, enhancing the target source and improving the signal-to-noise ratio (SNR). In standard beamforming technology, the TDOA or steering vector is generally estimated from the audio signal, which can be inaccurate in indoor environments due to reverberation. However, social robots are usually equipped with cameras, and the audio sources can also be identified using computer vision.
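As a concrete illustration of the frequency-domain D&S operation described above, the following minimal sketch (in Python with numpy; the function name delay_and_sum and its argument layout are our own, not the paper's implementation) aligns the channels by phase rotation and averages them, assuming the TDOAs are known:

```python
import numpy as np

def delay_and_sum(S, taus, omegas):
    """Frequency-domain delay-and-sum beamformer for one time frame.

    S      : (C, F) complex spectra, one row per microphone
    taus   : (C,) steering delays in seconds toward the target source
    omegas : (F,) angular frequencies of the bins
    """
    C = S.shape[0]
    # Compensate each channel's propagation delay by a phase rotation,
    # then average the aligned channels (ordinary D&S: w_c = 1/C).
    steering = np.exp(1j * np.outer(taus, omegas))   # (C, F)
    return (steering * S).sum(axis=0) / C
```

A signal arriving from the steered direction adds coherently across channels, while interference from other directions is attenuated by the averaging.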

Blind source separation
The goal of blind source separation (BSS) is to estimate individual sources given mixtures of those sources.5 One technique for channel separation is independent component analysis (ICA),6 which considers the measured signals to be mixtures of different sources. In other words, the observed signals correspond to the sources multiplied by a so-called mixing matrix. ICA seeks to separate the sources by maximizing an objective function that depends on the temporal statistics of the observed mixtures. These techniques have been used to separate speech in noisy and reverberant environments.7 Nonnegative matrix factorization (NMF)8 operates on a similar principle and seeks to determine the mixing matrix and sources as a decomposition into matrices that are assumed to be nonnegative. These techniques have been used for voice separation in noisy environments with microphone arrays.9 Both ICA and NMF require estimating signal statistics that cannot be obtained reliably in dynamic environments.
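To make the NMF decomposition concrete, here is a minimal sketch using the classical Lee-Seung multiplicative updates; the function nmf and its parameters are illustrative, not tied to any particular toolkit used in the literature cited above:

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Basic NMF via Lee-Seung multiplicative updates: V ~ W @ H.

    V is a nonnegative matrix (e.g., a magnitude spectrogram); W and H
    are nonnegative factors whose product approximates V.
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H nonnegative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Because every update is a ratio of nonnegative quantities, the factors stay nonnegative throughout, which is the defining constraint of the method.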

Beamforming and BSS
Combining beamforming and source separation techniques has been discussed previously in the literature. In Ref. 10, an algorithm was proposed for localizing multiple blind sources using source separation and beamforming techniques. Although the method can locate more sources than the number of microphones in the array, it requires a specific microphone layout and has only been tested under simulated static conditions. Another approach, presented in Ref. 11, combines convolutional source separation with geometric beamforming to address ambiguities in source separation and the degrees of freedom provided by additional sensors. This method has also not been tested with moving sources or microphones. In Ref. 12, a method was proposed based on the parallel combination of BSS and beamforming. The target source and the corresponding direction of arrival (DOA) are first obtained using subband ICA-based BSS, where the average DOA is computed across all the frequency bins. Subsequently, the target source signal is estimated with a null beamforming scheme employing the average DOA obtained in the previous step. Finally, the target source signal is chosen as either the ICA-based BSS estimate or the null beamforming estimate, depending on the reliability of the frequency-bin-based estimation of the DOA. SNR and word error rate (WER) results in simulated static conditions using real room impulse responses (RIRs) are reported in Ref. 12. A cascade method that employs a beamforming scheme as a preprocessor for BSS is described in Ref. 13. By assuming that the sound source locations are known, the mixed signals are enhanced with the D&S beamforming scheme; the enhanced signals are then used as input for Infomax-based BSS. This method treats the beamforming and BSS techniques as independent stages that can be applied one after the other, which implies that the BSS method receives the beamformed outputs without any information on how they were obtained. SNR results in simulated static conditions with reverberation times (referred to as RT or RT60) between 0.3 and 0.7 s using real RIRs are reported in Ref. 13.
Historically, from the signal processing point of view, beamforming and BSS have been thought of as representing two different families of techniques. More recent approaches that combine them15-18 are usually treated as "black boxes" that require training. This paper describes a beamforming-based BSS scheme that can be easily understood and fully integrates the two types of techniques. The method does not require a specific microphone array configuration, applies to scenarios with moving microphones or sources, does not require any prior information about path attenuation or reverberation, and does not require training. In our effort to integrate beamforming and BSS techniques, we consider a system that forms multiple beams simultaneously, one for each source. If each such output is considered a mixture, estimating the acoustic sources can be posed as a BSS problem. We note that the estimation of the TDOA is more feasible in HRI scenarios with computer vision. This paper presents a solution to the BSS problem by assuming that the generated beamformed outputs can be considered mixtures, without the need for statistics such as those required by ICA or NMF. The resulting solution can be applied frame-by-frame, coping well with dynamic environments. In our approach, we assume that the directions of arrival of the sources and the physical configuration of the microphones are known, and we estimate transfer functions, incorporating microphone frequency response mismatches and reverberation effects, that represent a given source recorded in a reference channel given the same source recorded in a different channel. We validate the method experimentally using real static and dynamic HRI scenarios, comparing our methods to other approaches that integrate beamforming schemes with BSS.

Formulation of the classical beamforming method
In the short-time Fourier transform (STFT) domain, the weighted D&S beamformer can be expressed as

B_l(t, \omega) = \sum_{c=1}^{C} w_{l,c}(\omega) S_c(t, \omega) e^{j\omega\tau_{l,c}},   (1)

where S_c(t, ω) denotes the observed signal recorded by microphone c at frequency ω and frame t; C is the number of microphones; B_l(t, ω) is the beamformed output pointing to source l; and w_{l,c}(ω) and τ_{l,c} correspond to the weight and time delay, respectively, applied to microphone c for steering the beam toward source l. Specifically, the ordinary D&S beamformer is obtained when w_{l,c}(ω) = 1. The observed signal at microphone c, S_c(t, ω), corresponds to the summation of the signals received directly from all sources plus reverberation. If the latter is considered negligible, S_c(t, ω) can be expressed as

S_c(t, \omega) = \sum_{j=1}^{J} S_{j,c}(t, \omega),   (2)

where S_{j,c}(t, ω) denotes the signal from source j received by microphone c at frequency ω and frame t, and J is the number of sources. The beamformed output can then be written as

B_l(t, \omega) = \sum_{c=1}^{C} w_{l,c}(\omega) e^{j\omega\tau_{l,c}} \sum_{j=1}^{J} S_{j,c}(t, \omega).   (3)

In Eq. (3), the time delays τ_{l,c} are given by the angular positions of the sources, which are considered known; the estimation of the weights w_{l,c}(ω) and the resulting signal B_l(t, ω) depends on the beamforming algorithm considered. It is worth noting that in Eq. (3), given a beamforming method, the only unknown terms are those corresponding to the sources S_{j,c}(t, ω).

The ideal case
If we can assume that reverberation is negligible, that the microphone array-source distance is much greater than the inter-microphone separation, and that the microphones have identical responses, each source signal at microphone c, S_{j,c}, can be written in terms of the same source signal at an arbitrary reference microphone, S_{j,1}, as follows:

S_{j,c}(t, \omega) = S_{j,1}(t, \omega) e^{-j\omega\tau_{j,c}}.   (4)

Equation (4) states that source j at microphone c can be considered a delayed version of the source received at microphone 1, which is used as a reference. Then, Eq. (3) can be written as

B_l(t, \omega) = \sum_{c=1}^{C} w_{l,c}(\omega) \sum_{j=1}^{J} S_{j,1}(t, \omega) e^{j\omega(\tau_{l,c} - \tau_{j,c})}.   (5)

By steering the beamformer to each source, a system of linear equations is obtained for each time frame t and frequency bin ω, such that

\mathbf{B}(t, \omega) = \mathbf{A}_{\omega} \mathbf{S}(t, \omega),   (6)

where \mathbf{B}(t, \omega) = [B_1(t, \omega), ..., B_J(t, \omega)]^T, \mathbf{S}(t, \omega) = [S_{1,1}(t, \omega), ..., S_{J,1}(t, \omega)]^T, and the entries of \mathbf{A}_{\omega} are a_{lj} = \sum_{c=1}^{C} w_{l,c}(\omega) e^{j\omega(\tau_{l,c} - \tau_{j,c})}. The sources S_{j,1}(t, ω) at microphone 1 can be separated by applying

\hat{\mathbf{S}}(t, \omega) = \mathbf{A}_{\omega}^{-1} \mathbf{B}(t, \omega),   (7)

where \mathbf{A}_{\omega}^{-1} is set equal to the zero matrix if det(\mathbf{A}_{\omega}) is smaller than a given threshold. We note that the matrix \mathbf{A}_{\omega} in Eqs. (6) and (7) depends on the values of the weights w used by the beamforming formulation described in Eq. (1). Consequently, the source separation process and the beamforming algorithm are fully integrated. We also note that the number of speech or noise sources is arbitrary in Eqs. (6) and (7).
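The per-bin separation of Eqs. (6) and (7) can be sketched as follows, assuming the weights and steering delays are known; the function separate_ideal and its variable names are our own illustration of the ideal (anechoic, identical-microphone) case, including the determinant threshold mentioned above:

```python
import numpy as np

def separate_ideal(B, w, tau, omega, det_thresh=1e-6):
    """Recover the J sources at the reference microphone for one bin.

    B   : (J,) beamformed outputs, one beam steered at each source
    w   : (J, C) beamforming weights w_{l,c} at this frequency
    tau : (J, C) steering delays tau_{l,c} (source l, microphone c), in s
    """
    # a_lj = sum_c w_{l,c} exp(j*omega*(tau_{l,c} - tau_{j,c}))
    phase = np.exp(1j * omega * (tau[:, None, :] - tau[None, :, :]))  # (J,J,C)
    A = np.einsum('lc,ljc->lj', w.astype(complex), phase)
    if abs(np.linalg.det(A)) < det_thresh:
        # Ill-conditioned bin: output zeros rather than amplify noise.
        return np.zeros(len(B), dtype=complex)
    return np.linalg.solve(A, B)
```

In the anechoic model this inversion is exact: each beam is a known linear combination of the sources, so solving the J-by-J system undoes the mixing bin by bin.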

Including corrections for real environments
Equation (4) assumes identical microphones and no reverberation. These assumptions can be relaxed by generalizing our formulation as follows:

S_{j,c}(t, \omega) = H^{1}_{j,c}(t, \omega) S_{j,1}(t, \omega) e^{-j\omega\tau_{j,c}},   (8)

where H^1_{j,c}(t, ω) is a transfer function that represents the transformation from signal S_{j,1}(t, ω) to signal S_{j,c}(t, ω). The transfer function H^1_{j,c}(t, ω) presumably incorporates the effects of microphone mismatches, reverberation, and other attributes of the room acoustics needed to represent source j recorded at microphone c given the same source j recorded at microphone 1. The ideal scenario described above is obtained by assuming that H^1_{j,c}(t, ω) = 1 for all t and ω. The beamforming in real conditions can be written as

B_l(t, \omega) = \sum_{c=1}^{C} w_{l,c}(\omega) \sum_{j=1}^{J} H^{1}_{j,c}(t, \omega) S_{j,1}(t, \omega) e^{j\omega(\tau_{l,c} - \tau_{j,c})},   (9)

and the sources S_{j,1}(t, ω) at microphone 1 can now be separated using Eq. (7) by expressing a_{lj} in \mathbf{A}_{\omega} as

a_{lj} = \sum_{c=1}^{C} w_{l,c}(\omega) H^{1}_{j,c}(t, \omega) e^{j\omega(\tau_{l,c} - \tau_{j,c})}.   (10)

If the source positions are known, the problem of source separation is reduced to the estimation of H^1_{j,c}(t, ω) in Eqs. (9) and (10) such that

H^{1}_{j,c}(t, \omega) = \frac{S_{j,c}(t, \omega) e^{j\omega\tau_{j,c}}}{S_{j,1}(t, \omega)}.   (11)

Using Eq. (2), an estimate for the lth source recorded at microphone c, Ŝ_{l,c}(t, ω), can be given as

\hat{S}_{l,c}(t, \omega) = S_c(t, \omega) - \sum_{j \neq l} S_{j,c}(t, \omega).   (12)

The beamformed output B_j(t, ω) pointing to the jth source can be used as an estimate of source j at microphone 1, S_{j,1}(t, ω):

\hat{S}_{j,1}(t, \omega) = B_j(t, \omega).   (13)

The jth source at microphone c, S_{j,c}(t, ω), can then be approximated as

\hat{S}_{j,c}(t, \omega) = B_j(t, \omega) e^{-j\omega\tau_{j,c}}.   (14)

Then, an estimate for Ŝ_{l,c}(t, ω) can be obtained as

\hat{S}_{l,c}(t, \omega) = S_c(t, \omega) - \sum_{j \neq l} k_{j,c} B_j(t, \omega) e^{-j\omega\tau_{j,c}},   (15)

where k_{j,c} is a coefficient that needs to be tuned or estimated. For the two-source case, Eq. (15) can be written as

\hat{S}_{l,c}(t, \omega) = S_c(t, \omega) - k_{j,c} B_j(t, \omega) e^{-j\omega\tau_{j,c}},   (16)

where l = 1, j = 2 or l = 2, j = 1; the weight k_{j,c} was set equal to a constant for all l and j. For simplicity, B_j(t, ω) in Eq. (14) corresponds to the D&S beam.
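As an illustration of how the relative transfer functions might be approximated from the beamformed outputs, the following hypothetical sketch combines the beam-as-source approximation with the subtraction of Eq. (15); the function estimate_H, the regularizer, and the loop structure are our assumptions, not the authors' exact implementation:

```python
import numpy as np

def estimate_H(S_mics, B, tau, omega, k=0.5):
    """Estimate relative transfer functions H^1_{j,c} at one (t, omega) bin.

    S_mics : (C,) observed microphone spectra S_c
    B      : (J,) beamformed outputs; B_j stands in for S_{j,1}
    tau    : (J, C) steering delays in seconds
    k      : subtraction weight k_{j,c}, held constant here
    """
    J, C = tau.shape
    H = np.ones((J, C), dtype=complex)
    for j in range(J):
        for c in range(C):
            # Remove the other sources' delayed beams from S_c.
            others = sum(k * B[m] * np.exp(-1j * omega * tau[m, c])
                         for m in range(J) if m != j)
            S_jc_hat = S_mics[c] - others
            # Ratio of the compensated estimate to the reference-beam proxy;
            # the small constant guards against division by zero.
            H[j, c] = S_jc_hat * np.exp(1j * omega * tau[j, c]) / (B[j] + 1e-12)
    return H
```

As a sanity check, with a single anechoic source and an ordinary D&S beam the subtraction term vanishes and the estimate reduces to H = 1, matching the ideal case.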

Database
The experiments were performed using the 330 test utterances from the Aurora-4 database,19 which were re-recorded in Ref. 20 using the Personal Robot 2 (PR2) with a Microsoft Xbox 360 Kinect mounted on its head. The setup consists of a speech source 2 m from the microphone array and a noise source 2 m from the array and 45° to the left of the speech source [see Fig. 1(b)]. In the static condition, the microphone array points to the speech source (0°). In the dynamic scenario, which simulates relative robot-source movement, the robot head and microphone array rotate between 50° and −50° at a speed of 24.1°/s. Speech and noise were also recorded in separate sessions to generate a simulated version of the static condition. Figure 1(a) shows the PR2 robot in the experimental setup used to record the databases. The estimated RT60 of the room was 0.5 s.

Procedures
The proposed scheme was tested using D&S (Ref. 21) and MVDR (Ref. 22) beamforming. In both cases, the angle of the robot's head was employed to determine the TDOAs on a frame-by-frame basis, applied in the frequency domain. The implementation of MVDR considered a free-field model. The noise covariance matrices were computed using the nonspeech segments determined by a voice activity detector (VAD).23 The noise sources correspond to the mechanics of the robot and the ambient noise generated by the loudspeaker; both types of noise are present in the microphone signals used for speech/nonspeech detection. During each nonspeech frame, a noise covariance matrix was calculated. During speech intervals, the noise matrix was determined by interpolating between the previous and following nonspeech noise matrices. The breaks between spoken words and the pauses in speech are, on average, 253 ms and 3.291 s, respectively.
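The free-field MVDR weights referred to above can be sketched as follows; the function mvdr_weights and the diagonal-loading term are illustrative choices on our part, not details taken from the paper:

```python
import numpy as np

def mvdr_weights(R_noise, d, loading=1e-6):
    """MVDR beamforming weights: w = R^{-1} d / (d^H R^{-1} d).

    R_noise : (C, C) noise covariance matrix, e.g., averaged over the
              nonspeech frames selected by a VAD
    d       : (C,) free-field steering vector toward the target DOA
    """
    C = len(d)
    # Diagonal loading regularizes the inversion when R is ill-conditioned.
    R = R_noise + loading * np.trace(R_noise).real / C * np.eye(C)
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)
```

By construction the weights satisfy the distortionless constraint w^H d = 1, so the target direction is passed unchanged while the noise power is minimized.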
The proposed BSS-based enhancement procedure was performed in the frequency domain using Eq. (7). Each microphone time-domain signal was segmented into 25-ms time frames with a 15-ms overlap. A fast Fourier transform (FFT) was applied to compute the complex spectrum of each time frame. The sampling frequency was 16 kHz, and a 512-sample FFT was employed, resulting in 257 frequency bins. D&S and MVDR were used to generate B_l(t, ω) in Eq. (5). The coefficient k_{j,c} in Eq. (16) was set equal to 0.5. The resulting enhanced speech source in the time domain was obtained by applying the overlap-add procedure to Ŝ_{1,1}(t, ω) obtained using Eq. (7). The following metrics were employed to compare the methods tested here under simulated conditions (Table 1): output SNR, perceptual evaluation of speech quality (PESQ),24 and short-time objective intelligibility (STOI).25 However, only the SNR metric was employed with the real static and dynamic datasets because the reference signals needed to evaluate PESQ and STOI were unavailable. Table 1 also shows WER results obtained with a DNN-HMM automatic speech recognition (ASR) system run in Kaldi and trained as in Ref. 20 with the Aurora-4 database. The training audio was generated by convolving the speech and noise signals with RIRs measured in the same room where the tests were recorded. The speech and noise signals were then added at SNRs between 0 and 10 dB and processed using the same techniques as the test utterances, according to Table 1, to obtain matched training-testing conditions.
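The framing parameters described above (25-ms frames, 15-ms overlap, 512-point FFT at 16 kHz, 257 bins) can be sketched as an analysis/synthesis pair; stft and istft here are our own minimal implementations with window-power normalization in the overlap-add step, not the authors' code:

```python
import numpy as np

def stft(x, frame=400, hop=160, nfft=512):
    """25-ms frames with 15-ms overlap at 16 kHz (hop = 10 ms), 512-point FFT."""
    win = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    return np.stack([np.fft.rfft(win * x[t * hop:t * hop + frame], nfft)
                     for t in range(n_frames)])        # (T, 257)

def istft(X, frame=400, hop=160, length=None):
    """Overlap-add resynthesis with window-power normalization."""
    win = np.hanning(frame)
    T = X.shape[0]
    n = (T - 1) * hop + frame
    y = np.zeros(n)
    norm = np.zeros(n)
    for t in range(T):
        # Invert each frame, discard the FFT zero-padding, and overlap-add.
        seg = np.fft.irfft(X[t], 512)[:frame]
        y[t * hop:t * hop + frame] += win * seg
        norm[t * hop:t * hop + frame] += win ** 2
    y /= np.maximum(norm, 1e-8)                        # avoid divide-by-zero at edges
    return y[:length] if length else y
```

Dividing by the accumulated squared window makes the analysis/synthesis chain transparent in the interior of the signal, so any per-bin processing (such as the separation of Eq. (7)) is the only modification applied.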
Results obtained using three different types of models are compared: (1) a simulated "anechoic" environment in which it is assumed that H^1_{j,c}(t, ω) = 1; (2) an "Oracle H" environment in which the room transfer function H^1_{j,c}(t, ω) was calculated with Eq. (11) using speech samples captured in the absence of noise; and (3) the "Hest" environment in which the transfer function H^1_{j,c}(t, ω) is estimated under realistic conditions using Eq. (12). The results in Tables 1 and 2 obtained using these three models of the environment are designated D&S-BSS-1 and MVDR-BSS-1, D&S-BSS-Oracle and MVDR-BSS-Oracle, and D&S-BSS-Hest and MVDR-BSS-Hest, respectively.
We also provide performance comparisons with two standard strategies for combining beamforming techniques with BSS in parallel12 and cascade13 configurations, which are discussed in Sec. 1.3; the corresponding approaches are identified in Tables 1 and 2. The methods published in Refs. 12 and 13 were chosen for comparison because they represent two standard strategies for combining beamforming techniques and BSS, in parallel and cascade configurations, respectively. Additionally, they do not require a special microphone array, and they consider the number of sources to be known. Because the parallel scheme in Ref. 12 requires the number of sources to be equal to the number of microphones, two of the four Kinect channels were employed for BSS and beamforming. The two selected channels were those at the left and right ends of the microphone array. Also, in addition to the parallel combination of ICA and null beamforming proposed in Ref. 12, MVDR was combined with ICA, leading to two different configurations of the same strategy, i.e., ICA//Null BF and ICA//MVDR, respectively. The null beamforming method was replaced by MVDR because the latter provides a significantly larger SNR gain than the former. The source DOAs are considered known in both cases. The magnification parameter of the threshold defined in Ref. 12 was tuned by maximizing the average SNR of the resulting enhanced speech signals. It is worth observing that ICA//Null BF and ICA//MVDR require a single DOA per frequency bin; therefore, they do not apply to the dynamic condition in which the robot head rotates. Regarding the cascade method described in Ref. 13, two frequency-domain beamforming techniques, D&S and MVDR, were evaluated and applied using the four Kinect channels. Two beamformed signals were generated by pointing to both sources and input to the Infomax-based BSS system on a bin-by-bin basis. Permutation and scaling were then corrected as in Ref. 13. Two cascade configurations were evaluated, depending on the beamforming scheme employed before Infomax: D&S→INFOMAX and MVDR→INFOMAX, respectively.

Results and discussion
Table 1 shows results obtained using simulated static scenarios in which the speech and noise sources were recorded separately and combined offline. Unsurprisingly, the best performance in terms of SNR is obtained for clean speech, and the worst performance is observed for uncompensated speech. The D&S and MVDR beamforming methods provide improvement in all the metrics. With Oracle processing, {D&S,MVDR}-BSS-Oracle, the speech and noise waveforms were recorded separately, and the transfer functions were estimated directly using perfect knowledge of these waveforms.
With {D&S,MVDR}-BSS-1, the room was assumed to be anechoic and the microphones were considered identical, leading to diagonal spatial covariance matrices. The {D&S,MVDR}-BSS-Hest data represent real results without any unrealistic assumptions, based on Eqs. (12)-(16). We note, first, that the D&S-BSS-Oracle and MVDR-BSS-Oracle methods recovered the original SNR observed with clean speech, which validates our overall model of the environment and confirms the existence of an optimal H^1_{j,c}(t, ω) as defined in Eq. (8). In general, our proposed combination algorithms, D&S-BSS-Hest and MVDR-BSS-Hest, provide substantial improvements in SNR compared to beamforming without BSS. These results are confirmed by the spectrograms of Fig. 2, which depict clean speech, speech with broadband noise limited to 4 kHz, and results obtained using some of the compensation schemes described in this paper. It can be observed that MVDR-BSS-Hest provides the best noise elimination. Nevertheless, D&S-BSS-Hest provided only small improvements in the PESQ and STOI metrics, whereas no improvement at all was observed with MVDR-BSS-Hest compared to MVDR alone. These results may be a consequence of speech-distortion artifacts introduced by the {D&S,MVDR}-BSS-Hest processing. In any case, the impact of the speech-distortion artifacts on the ASR results appears to be small. Table 2 summarizes our results obtained using data recorded in real static and dynamic scenarios, as described in Sec. 3.1. As expected, the results consistently indicate that the dynamic scenarios are more challenging than the static ones. We also note that, for every possible comparison, the proposed method for combining BSS and beamforming techniques is more effective than either the BSS or beamforming techniques used in isolation or the traditional cascade and parallel methods to which we compared our data; this was true with both the MVDR and D&S beamformers. We observe that MVDR beamforming is more effective than D&S beamforming in every comparable condition considered. We also note that the parallel integration of ICA with beamforming schemes, as described in Ref. 12, requires defining a single DOA per frequency bin; thus, it cannot be applied to the dynamic HRI scenario in which the robot head rotates. Compared to D&S or MVDR beamforming alone, the parallel and cascade combination methods lead to little or no improvement in SNR in real HRI conditions. These results suggest that the real scenarios addressed here are challenging to model. It is also worth highlighting that no assumption about path attenuation or reverberation is needed to achieve the performance that we observed.

Conclusions
In this paper, we have presented a unified model that encompasses beamforming and BSS in experimental laboratory conditions that closely resemble natural acoustic environments. The scheme is based on describing each beamformed output as an explicit combination of the audio sources and representing the output of each microphone as a function of a reference microphone. Expressing each microphone output as a delayed version of the reference microphone signal multiplied by a relative transfer function allows us to apply the traditional matrix-inversion solution to undo the effects of the mixing. This transfer function models the interchannel mismatch produced by the array's microphones and reverberation. It can be approximated using the difference between the mixed sources in the channel and a linear combination of the delayed beamformed outputs pointing to the other sources. Oracle experiments show that the target speech sources can be accurately recovered, which validates our assumptions and model. Experiments with real static and dynamic HRI datasets recorded with a PR2 robot suggest that the proposed BSS scheme with MVDR can lead to a substantial increase in SNR compared to standard MVDR alone and can provide performance that exceeds that of previous parallel and cascade combinations of BSS and beamforming schemes. With real static HRI data, the proposed approach with MVDR provided an SNR 2.6 and 2.4 dB greater than that of the parallel and cascade systems with MVDR, respectively. Compared to the cascade architecture, an SNR gain of 2.8 dB was observed for the real dynamic HRI condition; the parallel integration scheme is not applicable to the dynamic scenario in which the robot head rotates. Moreover, the results of the ASR experiments suggest that the impact of the speech-distortion artifacts is small. Future research will aim to develop more precise estimates of the inter-microphone relative transfer functions.

Fig. 1. (a) Experimental setup for database recording with the PR2 robot and (b) diagram of the HRI recording setup showing the rotation of the robot head.

Table 2. Average SNR with real static and dynamic HRI conditions.