This letter presents a single-channel speech dereverberation approach based on a non-causal minimum variance distortionless response (MVDR) filter. The non-causal filter exploits the additional information about the desired signal that is contained in subsequent frames, and the MVDR criterion keeps the distortion of the desired signal output minimal. The proposed system further suppresses the late reverberation by employing a statistical reverberation model. Experimental results demonstrate the superiority of the proposed algorithm over conventional approaches.

## I. Introduction

Room reverberation is a major cause of speech degradation in areas as diverse as telecommunication, hands-free human-machine interaction, and audio information retrieval. One major effect of reverberation on speech signal processing is that delayed energy originating from previous phonemes propagates into the following phonemes, which degrades speech intelligibility through spectral overlap and masking.^{1}

A number of dereverberation algorithms that use a single microphone have been developed.^{2,3} Most single-channel reverberation suppression approaches that operate in the frequency domain try to minimize the influence of long-term reverberation using spectral subtraction.^{4–6} These algorithms estimate the early speech component by calculating the late reverberant spectral variance (LRSV). It is well known that single-channel speech enhancement algorithms face a trade-off between noise reduction and speech distortion.^{7,8} It therefore seems inevitable that spectral-subtraction-based dereverberation algorithms suffer from speech distortion. Recently, a single-channel noise reduction algorithm that improves the signal-to-noise ratio (SNR) without introducing speech distortion has been proposed.^{9} The algorithm builds on the frequency-domain minimum variance distortionless response (MVDR) filter, which takes into account the interframe correlation of the speech spectrum.^{10,11}

In this letter, we propose a non-causal MVDR filter based approach that improves speech intelligibility by suppressing the late reverberation without introducing speech distortion. Because the desired speech at the current frame is convolved with an acoustic impulse response (AIR) that spans a relatively long time interval in a reverberant environment, the reverberant signals in the following frames also contain the desired speech of the current frame. The reverberant signal in subsequent frames can therefore be regarded as additional information for estimating the desired signal at the current frame. To exploit this, we employ a non-causal filter, whose equation is derived from the MVDR criterion so that speech distortion is minimal. In summary, we extend the previously designed MVDR filter for noise reduction^{9} to a non-causal MVDR filter for dereverberation, which exploits the correlation between the speech spectrum at the current frame and the reverberant spectra in subsequent frames. The late reverberation is then suppressed using correlation parameters based on a statistical reverberation model. Experimental results show that the algorithm achieves substantial improvement in reverberant rooms compared to conventional algorithms.

The rest of this letter is organized as follows. Section II formulates the problem. The non-causal single-channel MVDR filter for dereverberation is derived in Sec. III. In Sec. IV, we describe the complete algorithm to suppress the late reverberation using the statistical reverberation model. Performance evaluation is presented in Sec. V. The conclusion follows in Sec. VI.

## II. Problem formulation

Using the short-time Fourier transform (STFT), we define a reverberant signal model in the time-frequency domain

where $Y(k,m)$ and $S(k,m)$ are the STFTs of the observed reverberant signal and the desired anechoic speech signal, respectively, and $H(k,m)$ is a time-invariant acoustic transfer function. The indices *k* and *m* denote the frequency bin and the time frame, respectively. $S(k,m)$ is assumed to be uncorrelated with itself at other frequency bins and frames.

In the classical dereverberation model,^{4,5} the desired signal $S(k,m)$ is estimated by applying a frequency-dependent gain $G(k,m)$ to the observed signal $Y(k,m)$. That is,

$$\hat{S}(k,m)=G(k,m)\,Y(k,m). \tag{2}$$

In a reverberant environment, the desired signal $S(k,m)$ is delayed and attenuated by the AIR and then spreads into the subsequent reverberant signals $Y(k,m+l)$, $l>0$. The reverberation terms in future frames, which are highly correlated with the desired signal of the current frame, should therefore be taken into account when deriving dereverberation algorithms. To this end, we employ a non-causal filter,

where the superscripts * and *H* denote complex conjugation and transpose-conjugation, respectively. *L* is the total number of consecutive subsequent time frames.

are vectors of length *L*, and the superscript *T* denotes transposition.
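As a rough illustration of the non-causal filtering step, the sketch below stacks the current frame and the following $L-1$ frames of one frequency bin and applies a weight vector as $w^H y$. The frame ordering $y=[Y(k,m),\ldots,Y(k,m+L-1)]^T$, the function names, and the zero-padding at the end of the signal are assumptions made for this example, not details taken from the letter.

```python
import numpy as np

def stack_future_frames(Y_k, m, L):
    """Return y = [Y(m), Y(m+1), ..., Y(m+L-1)] for one frequency bin.

    Frames beyond the end of the signal are zero-padded (an assumption).
    """
    y = np.zeros(L, dtype=complex)
    end = min(m + L, len(Y_k))
    y[: end - m] = Y_k[m:end]
    return y

def apply_noncausal_filter(w, y):
    """Estimate S_hat = w^H y; np.vdot conjugates its first argument."""
    return np.vdot(w, y)

# Toy STFT coefficients of a single bin over five frames.
Y_k = np.array([1 + 1j, 2, 3j, 4, 5])
y = stack_future_frames(Y_k, m=3, L=4)       # uses frames 3 and 4, then zeros
w = np.array([1, 0, 0, 0], dtype=complex)    # trivial filter: pass the current frame
s_hat = apply_noncausal_filter(w, y)
```

With the trivial weight vector used here the output equals the current frame; a useful filter would instead combine all $L$ entries.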

The observed signal $Y(k,m)$ is decomposed into two orthogonal parts: one that is correlated with the desired signal $S(k,m)$ and one that is uncorrelated with it. We regard the component that is uncorrelated with the desired signal as interference.

where $S'(k,m+l)$ represents the interference signal,^{10}

and

is the correlation coefficient between the desired signal $S(k,m)$ and the subsequent observed signal $Y(k,m+l)$.

Thus we can write the vector $y(k,m)$ as

where the normalized correlation vector $\gamma_s(k,m)$ is

$s_d(k,m)$ is the desired signal vector and

is the interference signal vector, which is uncorrelated with the desired signal. $s'(k,m)$ contains the undesired speech signals in subsequent frames, $S(k,m+l)$, $l>0$, and the reverberant signal caused by all undesired speech signals at earlier time frames. Recall that at the current frame *m* our desired signal is $S(k,m)$, so a speech signal from another frame that is contained in the observed signal vector $y(k,m)$, such as $S(k,m+l)$, $l>0$, is regarded as interference.

We can write the estimate $\hat{S}(k,m)$ in the following form:

where $S_{fd}(k,m)=S(k,m)\,w^H(k,m)\,\gamma_s^*(k,m)$ is the filtered desired signal and $S'_{ri}(k,m)=w^H(k,m)\,s'(k,m)$ is the residual interference.

## III. Non-causal single-channel MVDR filter

To derive the non-causal dereverberation MVDR filter, we first define the error signal between the estimated and desired signals as

where

is the signal distortion due to the complex non-causal filter and

represents the residual interferences.

The mean-square error (MSE) is then

where $\lambda_S(k,m)=E[|S(k,m)|^2]$ and $\Phi_{in}(k,m)$ denote the variance of the desired signal and the interference covariance matrix, respectively. We can derive the MVDR filter by minimizing the MSE of the residual interference, $E[|\varepsilon_r(k,m)|^2]$, with the constraint that the desired signal is not distorted.

for which the solution is

where $\Phi_y(k,m)=E[y(k,m)y^H(k,m)]$ is the correlation matrix of $y(k,m)$. In the next section, we present a method to estimate the correlation vector $\gamma_s(k,m)$, which is the main parameter that affects the performance of our dereverberation algorithm.
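To make the constrained minimization concrete, the following sketch computes weights with the standard closed-form MVDR solution, $w=\Phi_y^{-1}d/(d^H\Phi_y^{-1}d)$ with $d=\gamma_s^*$, which satisfies the distortionless constraint $w^H\gamma_s^*=1$. That this matches the exact form of the letter's solution is an assumption; the function name, the diagonal loading, and the toy matrices are hypothetical.

```python
import numpy as np

def mvdr_weights(Phi_y, d, diag_load=1e-8):
    """Return w = Phi_y^{-1} d / (d^H Phi_y^{-1} d), where d stands for gamma_s^*.

    A small diagonal loading (an assumption) keeps the solve well conditioned.
    """
    L = Phi_y.shape[0]
    Phi = Phi_y + diag_load * np.trace(Phi_y).real / L * np.eye(L)
    x = np.linalg.solve(Phi, d)
    return x / np.vdot(d, x)          # np.vdot(d, x) = d^H x

# Toy example: Phi_y = lambda_S * d d^H + Phi_in with random interference.
rng = np.random.default_rng(0)
L = 4
d = rng.standard_normal(L) + 1j * rng.standard_normal(L)
A = rng.standard_normal((L, L)) + 1j * rng.standard_normal((L, L))
Phi_in = A @ A.conj().T + np.eye(L)
Phi_y = 2.0 * np.outer(d, d.conj()) + Phi_in

w = mvdr_weights(Phi_y, d)
constraint = np.vdot(w, d)            # w^H d, expected to be 1

# Compare residual interference power with a naive distortionless filter.
w_naive = d / np.vdot(d, d)
p_mvdr = np.real(np.vdot(w, Phi_in @ w))
p_naive = np.real(np.vdot(w_naive, Phi_in @ w_naive))
```

Both filters pass the desired component unchanged, but the MVDR weights leave less residual interference power than the naive choice.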

## IV. Suppression of the late reverberation

In this section, we derive $\gamma_s(k,m)$ using a statistical reverberation model and complete the dereverberation algorithm that suppresses the late reverberation. The AIR can be decomposed into the early reflections and the late reverberation.^{5} The reverberation-only signal at the current frame is defined as

where $N_e$ determines the time from which the AIR is regarded as late reverberation. The time instant corresponding to $N_e$ usually ranges from $32$ to $64$ ms.^{5} In this letter, we empirically choose $N_e=12$ (i.e., $48$ ms), which is identical to the value used in Habets’ work,^{6} so that $R(k,m)$ in Eq. (18) consists of late reverberation only. Because the source is non-stationary and because of the statistical properties of the AIR, the early component $\tilde{S}(k,m)$ and the late reverberation component $R(k,m)$ can be assumed to be statistically uncorrelated, given that the reverberant signal is the convolution of the speech source and the AIR.^{4–6} Using Eqs. (1) and (18), a new desired signal is given by

$\tilde{S}(k,m)$ represents the speech signal colored by the early reflections of the AIR. Our goal of improving speech intelligibility by suppressing the late reverberation can be achieved by recovering $\tilde{S}(k,m)$.

From Eqs. (7) and (19), the estimated correlation coefficient is given by

due to $E[R(k,m)Y^*(k,m+l)]=E[R(k,m)R^*(k,m+l)]$. $\lambda_Y(k,m)=E[|Y(k,m)|^2]$ and $\lambda_R(k,m)=E[|R(k,m)|^2]$ represent the variances of the observed signal and the late reverberation, respectively.

The acoustic transfer function (ATF) $H(k,m)$ in the STFT domain can be statistically modeled as a zero-mean Gaussian random sequence multiplied by an exponentially decaying function.^{5,6} The reverberant component $R(k,m)$ in Eq. (18) can then be rewritten as

where $B_r(k)$ is a zero-mean Gaussian random variable, *N* denotes the discrete time shift, and $\alpha(k)=3\log_e(10)/\{T_{60}(k)f_s\}$ denotes the decay rate, which is determined by both the sampling frequency $f_s$ and the reverberation time $T_{60}$.
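As a quick numerical check of the decay-rate formula, the sketch below evaluates $\alpha=3\log_e(10)/(T_{60}f_s)$ for the reverberation times and sampling rate used later in the evaluation. It also verifies that, if the energy envelope is taken to decay as $e^{-2\alpha n}$ (the usual reading of an amplitude envelope $e^{-\alpha n}$, stated here as an assumption), the energy drops by 60 dB after $T_{60}f_s$ samples. The function and variable names are illustrative.

```python
import math

def decay_rate(t60_s, fs_hz):
    """Decay rate alpha = 3 ln(10) / (T60 * fs), in units of 1/sample."""
    return 3.0 * math.log(10.0) / (t60_s * fs_hz)

fs_hz = 8000
rates = {t60: decay_rate(t60, fs_hz) for t60 in (0.6, 0.7, 0.8)}

# Energy decay after T60 seconds, assuming an energy envelope exp(-2 * alpha * n).
t60 = 0.6
n_samples = t60 * fs_hz
decay_db = 10.0 * math.log10(math.exp(-2.0 * rates[t60] * n_samples))
```

A longer reverberation time gives a smaller $\alpha$, that is, a slower decay.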

Because the inter-frame correlation between adjacent speech signals is assumed to be negligible, Eq. (22) is meaningful only when $l''=l+l'$. Note that, under our assumption, $S(k,m)$ is not correlated with $S(k,m+l)$ but is correlated with $Y(k,m+l)$. Accordingly,

Note that the correlation of the reverberant component is the product of the variance of the late reverberation and a parameter that decays exponentially with *l*.

The estimated correlation coefficient $\tilde{\gamma}_S(k,m,l)$ is reformulated as

where

and

## V. Performance evaluation

In this section, we evaluate the performance of the proposed MVDR dereverberation filter in comparison with three conventional single-channel dereverberation algorithms: the traditional frequency-domain Wiener filter, Lebart’s method,^{4} and Habets’ approach.^{5} For the Wiener filter, we implement the system based on Eq. (2) with $G(k,m)=\lambda_{\tilde{S}}(k,m)/\lambda_Y(k,m)$. Lebart’s method is implemented by modifying amplitude spectral subtraction with *a priori* SNR smoothing and spectral flooring to improve the output speech quality. Habets’ approach uses the optimally modified log spectral amplitude (OM-LSA) spectral gain function, which exploits the hypothetical gains associated with speech presence uncertainty, to attenuate the reverberation more dynamically. For a fair comparison, we use the same spectral variance of the late reverberation $\lambda_R(k,m)$ for all algorithms.
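The Wiener baseline can be sketched as follows. Since the early component and the late reverberation are assumed uncorrelated in Sec. IV, the example takes the variance of the early component to be the observed variance minus the late reverberant variance; clipping negative differences to zero and the optional gain floor `g_min` are assumptions added for numerical safety, and the function name is hypothetical.

```python
import numpy as np

def wiener_gain(lambda_Y, lambda_R, g_min=0.0):
    """Gain G = lambda_S_tilde / lambda_Y with lambda_S_tilde = lambda_Y - lambda_R.

    Negative variance estimates are clipped to zero (an assumption).
    """
    lambda_Y = np.asarray(lambda_Y, dtype=float)
    lambda_R = np.asarray(lambda_R, dtype=float)
    lambda_S_tilde = np.maximum(lambda_Y - lambda_R, 0.0)
    gain = lambda_S_tilde / np.maximum(lambda_Y, 1e-12)
    return np.maximum(gain, g_min)

# Three toy time-frequency points: mild, overestimated, and no late reverberation.
gains = wiener_gain([1.0, 2.0, 0.5], [0.25, 2.5, 0.0])
```

The gain is applied per time-frequency point, so it attenuates the observed spectrum but cannot exploit the subsequent frames in the way the non-causal filter does.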

The clean speech signal is created by concatenating five different utterances, spoken by five different speakers, from the Aurora2 database. The signal is sampled at 8 kHz and is $15$ s long, and it is transformed into the STFT domain using $75\%$ overlap (i.e., $N=32$). A Kaiser window of $128$ samples is used.
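The framing parameters quoted in the text can be cross-checked with a few lines of arithmetic. The sketch below derives the frame shift from the window length and overlap, and converts $N_e=12$ frames and the 10 initialization frames to milliseconds; the frame count assumes no padding at the signal edges, which is an assumption of this example.

```python
fs_hz = 8000          # sampling frequency
window_len = 128      # Kaiser window length in samples
overlap = 0.75        # 75% overlap between consecutive frames
signal_s = 15         # signal duration in seconds

hop = round(window_len * (1 - overlap))       # frame shift N in samples
frame_shift_ms = 1000 * hop / fs_hz           # duration of one frame shift
late_start_ms = 12 * frame_shift_ms           # N_e = 12 frames
init_ms = 10 * frame_shift_ms                 # 10 initialization frames

# Number of full frames when no padding is applied (an assumption).
n_frames = (signal_s * fs_hz - window_len) // hop + 1
```

The results agree with the values stated in the text: a frame shift of 32 samples (4 ms), 48 ms for $N_e=12$, and 40 ms for the first 10 frames.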

The reverberant signals are generated by convolving the speech signal with different AIRs, which are synthesized for different environments using the image method.^{12} The source-microphone distance is $D=\{2,4.5\}$ m, the reverberation time is $T_{60}=\{600,700,800\}$ ms, and the room size is $6\times 8\times 5$ m (length $\times$ width $\times$ height).

The estimates of $\Phi_y(k,m)$ are recursively updated as in Benesty’s work.^{9} We use the first $10$ frames (i.e., $40$ ms) to compute the initial estimates of $\Phi_y(k,m)$. The remaining signal frames are then used for performance evaluation.
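One common way to realize such a recursive estimate is first-order exponential smoothing of the outer product $y\,y^H$, initialized with a sample average over the first frames. The sketch below illustrates that idea only; the exact recursion and the forgetting factor of the cited work are not given in this letter, so the value `0.9` and both function names are assumptions.

```python
import numpy as np

def init_covariance(frames):
    """Sample average of y y^H over the initial stacked vectors (rows of `frames`)."""
    frames = np.asarray(frames, dtype=complex)
    return frames.T @ frames.conj() / frames.shape[0]

def update_covariance(Phi_prev, y, forgetting=0.9):
    """Exponential smoothing: Phi = a * Phi_prev + (1 - a) * y y^H."""
    y = np.asarray(y, dtype=complex)
    return forgetting * Phi_prev + (1.0 - forgetting) * np.outer(y, y.conj())

# Toy data: 30 stacked vectors of length L = 4 for one frequency bin.
rng = np.random.default_rng(0)
vectors = rng.standard_normal((30, 4)) + 1j * rng.standard_normal((30, 4))

Phi_y = init_covariance(vectors[:10])   # initial estimate from the first 10 vectors
for y in vectors[10:]:
    Phi_y = update_covariance(Phi_y, y)
```

Each update keeps the estimate Hermitian and positive semidefinite, which the matrix inversion in the MVDR solution relies on.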

In the simulation, we assume that the reverberation time ($T_{60}$) is known; in practice, it can be estimated using blind estimation procedures.^{4,13} Preliminary experiments indicate that the proposed algorithm is robust to the estimation error of $T_{60}$, although further analysis remains as future work. The forgetting factor for the variance of the late reverberation is set to $\kappa=0.2$.

The performance was evaluated using the SNR in the frequency domain and the log spectral distance (LSD).^{5} SNR$(\hat{S})$ is the ratio between the variance of $S(k,m)$ and the variance of the error in $\hat{S}(k,m)$. LSD$(\hat{S})$ is the difference between the log spectra of $S(k,m)$ and $\hat{S}(k,m)$.

Tables I and II show the improvements in LSD and SNR, respectively, for different channel orders *L* in various reverberant environments. The LSD improvement is calculated as $\Delta \mathrm{LSD}=\mathrm{LSD}(Y)-\mathrm{LSD}(\hat{S})$. A large $\Delta \mathrm{LSD}$ value means that the output signal $\hat{S}(k,m)$ is much closer to the desired signal $S(k,m)$ than the observed signal $Y(k,m)$ is. The SNR improvement is defined as $\Delta \mathrm{SNR}=\mathrm{SNR}(\hat{S})-\mathrm{SNR}(Y)$. The direct-to-reverberation ratio (DRR) is also listed to characterize the simulation environment. The DRR is defined as the direct path energy divided by the total energy of the AIR.^{14}
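The two improvement measures can be sketched as below. The letter describes the SNR and LSD only in words, so the precise formulas here (a global energy ratio for the SNR, and a per-frame root-mean-square difference of dB spectra averaged over frames for the LSD) are one plausible reading and should be treated as assumptions, as are the function names, the magnitude floor, and the toy data.

```python
import numpy as np

def snr_db(S, S_hat):
    """Frequency-domain SNR: energy of S over energy of the error S - S_hat, in dB."""
    err = S - S_hat
    return 10.0 * np.log10(np.sum(np.abs(S) ** 2) / np.sum(np.abs(err) ** 2))

def lsd_db(S, S_hat, floor=1e-10):
    """Log spectral distance: RMS over bins of the dB difference, averaged over frames."""
    log_ref = 20.0 * np.log10(np.maximum(np.abs(S), floor))
    log_est = 20.0 * np.log10(np.maximum(np.abs(S_hat), floor))
    return np.mean(np.sqrt(np.mean((log_ref - log_est) ** 2, axis=0)))

# Toy spectra of shape (bins, frames); `noise` merely stands in for the degradation.
rng = np.random.default_rng(1)
K, M = 65, 40
S = (1.0 + rng.random((K, M))) * np.exp(2j * np.pi * rng.random((K, M)))
noise = rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))
Y = S + 0.5 * noise        # "observed" signal
S_hat = S + 0.1 * noise    # "enhanced" signal with a smaller error

delta_snr = snr_db(S, S_hat) - snr_db(S, Y)
delta_lsd = lsd_db(S, Y) - lsd_db(S, S_hat)
```

Positive values of both quantities indicate that the processed signal is closer to the reference than the observed signal is.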

TABLE I. Improvement of the LSD ($\Delta$LSD) for each algorithm; *L* is the channel order of the proposed filter.

| D (m) | $T_{60}$ (s) | DRR (dB) | Wiener | Proposed, L = 2 | Proposed, L = 4 | Proposed, L = 8 | Proposed, L = 16 | Lebart^{a} | Habets^{b} |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.6 | −0.994 | −0.776 | 0.157 | 0.266 | 0.466 | 0.818 | 0.431 | 0.039 |
| 2 | 0.7 | −2.590 | −0.638 | 0.336 | 0.510 | 0.774 | 1.059 | 0.720 | 0.338 |
| 2 | 0.8 | −3.959 | −0.452 | 0.510 | 0.959 | 1.194 | 1.509 | 1.049 | 0.580 |
| 4.5 | 0.6 | −6.020 | 0.197 | 0.504 | 0.763 | 0.972 | 1.111 | 0.664 | 0.691 |
| 4.5 | 0.7 | −7.623 | 0.546 | 0.735 | 1.093 | 1.395 | 1.616 | 0.996 | 1.228 |
| 4.5 | 0.8 | −8.977 | 0.909 | 0.948 | 1.401 | 1.790 | 2.081 | 1.358 | 1.689 |


TABLE II. Improvement of the SNR ($\Delta$SNR) for each algorithm; *L* is the channel order of the proposed filter.

| D (m) | $T_{60}$ (s) | DRR (dB) | Wiener | Proposed, L = 2 | Proposed, L = 4 | Proposed, L = 8 | Proposed, L = 16 | Lebart^{a} | Habets^{b} |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.6 | −0.994 | −0.004 | 0.011 | 0.017 | 0.089 | 0.108 | 0.065 | 0.075 |
| 2 | 0.7 | −2.590 | 0.003 | 0.021 | 0.097 | 0.144 | 0.275 | 0.120 | 0.148 |
| 2 | 0.8 | −3.959 | 0.036 | 0.080 | 0.184 | 0.233 | 0.412 | 0.172 | 0.214 |
| 4.5 | 0.6 | −6.020 | 0.261 | 0.202 | 0.264 | 0.444 | 0.645 | 0.168 | 0.132 |
| 4.5 | 0.7 | −7.623 | 0.315 | 0.273 | 0.418 | 0.710 | 1.034 | 0.340 | 0.276 |
| 4.5 | 0.8 | −8.977 | 0.376 | 0.404 | 0.613 | 0.984 | 1.350 | 0.571 | 0.429 |


As Table I shows, the LSD improvement of the proposed algorithm increases monotonically with *L* over the studied range. The proposed algorithm always outperforms the Wiener filter, and it outperforms Lebart’s and Habets’ methods when its channel order is $8$ or greater.

The simulation results in Table II confirm that the proposed algorithm (when $L\geq 8$) is superior to all studied conventional algorithms under the simulated environments in terms of SNR and LSD.

We also conducted informal Perceptual Evaluation of Speech Quality (PESQ) measurements. The results show that the proposed algorithm slightly outperforms all the reference approaches. However, we do not include the detailed scores here because it remains unclear whether the PESQ score is a suitable measure of quality in reverberant environments.

## VI. Conclusion

In this letter, a new single-channel dereverberation algorithm was introduced. A non-causal MVDR filter that reduces reverberation while minimizing speech distortion was derived by exploiting the correlation between the speech spectrum and the reverberant spectra in subsequent frames. The late reverberation was suppressed based on a statistical reverberation model. Experimental results demonstrated the superiority of the proposed algorithm over conventional approaches.

## Acknowledgments

This research was supported by the Ministry of Knowledge Economy, Korea, under the Information Technology Research Center support program supervised by the National IT Industry Promotion Agency (NIPA-2012-H0301-12-2006).