This letter presents a single-channel speech dereverberation approach using a non-causal minimum variance distortionless response (MVDR) filter. The non-causal filter is adopted to exploit the additional information about the desired signal that lies in subsequent frames. The MVDR criterion ensures that the desired signal appears at the output with minimal distortion. The proposed system further suppresses the late reverberation by employing a statistical reverberation model. Experimental results demonstrate the superiority of the proposed algorithm over conventional approaches.

Room reverberation is a major cause of speech degradation in areas as diverse as telecommunication, hands-free human-machine interaction, and audio information retrieval. One major impact of reverberation on speech signal processing is that delayed energy originating from preceding phonemes propagates into following phonemes, degrading speech intelligibility through spectral overlap and masking.1

A number of dereverberation algorithms utilizing one microphone have been developed.2,3 Most single-channel reverberation suppression approaches that operate in the frequency domain try to minimize the influence of long-term reverberation using spectral subtraction methods.4–6 These algorithms estimate the early speech component by calculating the late reverberant spectral variance (LRSV). It is well known that single-channel speech enhancement algorithms face a trade-off between noise reduction and speech distortion.7,8 It therefore seems inevitable that spectral-subtraction-based dereverberation algorithms suffer from speech distortion. Recently, a single-channel noise reduction algorithm that improves the signal-to-noise ratio (SNR) without paying the price of speech distortion has been proposed.9 The algorithm builds on the frequency-domain minimum variance distortionless response (MVDR) filter, which takes into account the interframe correlation of the speech spectrum.10,11

In this letter, we propose a non-causal MVDR filter based approach that improves speech intelligibility by suppressing the late reverberation without introducing speech distortion. Because the desired speech at the current frame is convolved with a relatively long acoustic impulse response (AIR) in a reverberant environment, the reverberant signals in the following frames also contain the desired speech of the current frame. The reverberant signal in subsequent frames can therefore be regarded as additional information for estimating the desired signal at the current frame. To exploit this, we employ a non-causal filter, whose equation is derived under the MVDR criterion so as to incur minimal speech distortion. In summary, we extend the previously designed MVDR filter for noise reduction9 to a non-causal MVDR filter for dereverberation, which exploits the correlation between the speech spectrum itself and the reverberant spectra in subsequent frames. The late reverberation is then suppressed via correlation parameters based on a statistical reverberation model. Experimental results show that the algorithm achieves substantial improvement in room reverberant environments compared to conventional algorithms.

The rest of this letter is organized as follows. Section II formulates the problem. The non-causal single-channel MVDR filter for dereverberation is derived in Sec. III. In Sec. IV, we describe the complete algorithm to suppress the late reverberation using the statistical reverberation model. Performance evaluation is presented in Sec. V. The conclusion follows in Sec. VI.

Using the short-time Fourier transform (STFT), we define a reverberant signal model in the time-frequency domain

Y(k,m) = \sum_{l=0}^{\infty} H(k,l) S(k,m-l),
(1)

where Y(k,m) and S(k,m) are the STFTs of the observed reverberant signal and the desired anechoic speech signal, respectively, and H(k,l) is a time-invariant acoustic transfer function. k and m denote the frequency-bin and time-frame indices, respectively. S(k,m) is assumed to be uncorrelated with itself at other frequency bins and time frames.

In the classical dereverberation model,4,5 the desired signal S(k,m) is estimated by applying a frequency dependent gain value G(k,m) to the observed signal Y(k,m). That is,

Ŝ(k,m)=G(k,m)Y(k,m).
(2)

In a reverberant environment, the desired signal S(k,m) is first delayed and attenuated by the AIR and then smeared into the subsequent reverberant signals Y(k,m+l), l>0. Therefore the reverberation terms in future frames, which are highly correlated with the desired signal of the current frame, should be taken into account when deriving dereverberation algorithms. To this end, we employ a non-causal filter,

\hat{S}(k,m) = \sum_{l=0}^{L-1} W_l^*(k,m) Y(k,m+l) = \mathbf{w}^H(k,m)\,\mathbf{y}(k,m),
(3)

where the superscripts * and H denote complex conjugation and transpose-conjugation, respectively. L is the total number of consecutive subsequent time frames.

\mathbf{w}(k,m) = [W_0(k,m) \cdots W_{L-1}(k,m)]^T, \quad \mathbf{y}(k,m) = [Y(k,m) \cdots Y(k,m+L-1)]^T,
(4)

are vectors of length L, and the superscript T denotes transposition.
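As a concrete illustration, the inner product of Eq. (3) takes only a few lines of NumPy. This is a minimal sketch for a single frequency bin; the filter w used here is an arbitrary placeholder, not the MVDR solution derived later in the letter.

```python
import numpy as np

# Sketch of the non-causal filter of Eq. (3) for one frequency bin k:
# the estimate at frame m combines the current and L-1 subsequent
# reverberant frames. Y is a (num_frames,) complex STFT track for bin k;
# w is a length-L filter (arbitrary here, for illustration only).
def noncausal_filter(Y, w, m):
    L = len(w)
    y = Y[m:m + L]            # y(k, m) = [Y(k,m) ... Y(k,m+L-1)]^T
    return np.vdot(w, y)      # w^H y  (vdot conjugates its first argument)

rng = np.random.default_rng(0)
Y = rng.standard_normal(16) + 1j * rng.standard_normal(16)
w = np.zeros(4, dtype=complex)
w[0] = 1.0                    # identity filter: passes Y(k,m) through
assert np.isclose(noncausal_filter(Y, w, 3), Y[3])
```

A causal filter would instead use frames m, m-1, ..., m-L+1; the forward-looking slice is what makes the filter non-causal.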

The observed signal Y(k,m) is decomposed into two orthogonal parts corresponding to one that is correlated and one that is uncorrelated with the desired signal S(k,m). We consider the component that does not have correlation with the desired signal as an interference.

Y(k,m+l) = \gamma_S^*(k,m,l) S(k,m) + S'(k,m+l),
(5)

where S'(k,m+l) represents the interference signal,10 

E[S(k,m) S'^*(k,m+l)] = 0,
(6)

and

\gamma_S(k,m,l) = E[S(k,m) Y^*(k,m+l)] / E[|S(k,m)|^2]
(7)

is the correlation coefficient between the desired signal S(k,m) and the subsequent observed signal Y(k,m+l).

Thus we can write the vector y(k,m) as

\mathbf{y}(k,m) = S(k,m)\,\boldsymbol{\gamma}_s^*(k,m) + \mathbf{s}'(k,m) = \mathbf{s}_d(k,m) + \mathbf{s}'(k,m),
(8)

where the normalized correlation vector γs(k,m) is

\boldsymbol{\gamma}_s(k,m) = [\gamma_S(k,m,0) \cdots \gamma_S(k,m,L-1)]^T.
(9)

sd(k,m) is the desired signal vector and

\mathbf{s}'(k,m) = [S'(k,m) \cdots S'(k,m+L-1)]^T
(10)

is the interference signal vector, which is uncorrelated with the desired signal. s'(k,m) contains the undesired speech signals in subsequent frames, S(k,m+l), l>0, and the reverberant components caused by all undesired speech signals at earlier time frames. Recall that at the current frame m, the desired signal is S(k,m); any speech signal at other frames contained in the observed signal vector y(k,m), such as S(k,m+l), l>0, is treated as interference.

We can write the estimate Ŝ(k,m) into the following form:

\hat{S}(k,m) = S_{fd}(k,m) + S'_{ri}(k,m),
(11)

where S_{fd}(k,m) = S(k,m)\,\mathbf{w}^H(k,m)\,\boldsymbol{\gamma}_s^*(k,m) is the filtered desired signal and S'_{ri}(k,m) = \mathbf{w}^H(k,m)\,\mathbf{s}'(k,m) is the residual interference.

To derive the non-causal dereverberation MVDR filter, we first define the error signal between the estimated and desired signals as

\varepsilon(k,m) = \hat{S}(k,m) - S(k,m) = \varepsilon_d(k,m) + \varepsilon_r(k,m),
(12)

where

\varepsilon_d(k,m) = S_{fd}(k,m) - S(k,m)
(13)

is the signal distortion due to the complex non-causal filter and

\varepsilon_r(k,m) = S'_{ri}(k,m)
(14)

represents the residual interferences.

The mean-square error (MSE) is then

E[|\varepsilon(k,m)|^2] = E[|\varepsilon_d(k,m)|^2] + E[|\varepsilon_r(k,m)|^2] = \lambda_S(k,m) |\mathbf{w}^H(k,m)\boldsymbol{\gamma}_s^*(k,m) - 1|^2 + \mathbf{w}^H(k,m)\Phi_{in}(k,m)\mathbf{w}(k,m),
(15)

where \lambda_S(k,m) = E[|S(k,m)|^2] denotes the variance of the desired signal and \Phi_{in}(k,m) the interference covariance matrix. We can derive the MVDR filter by minimizing the MSE of the residual interference, E[|\varepsilon_r(k,m)|^2], subject to the constraint that the desired signal is not distorted:

\min_{\mathbf{w}(k,m)} \mathbf{w}^H(k,m)\Phi_{in}(k,m)\mathbf{w}(k,m) \quad \text{subject to} \quad \mathbf{w}^H(k,m)\boldsymbol{\gamma}_s^*(k,m) = 1,
(16)

for which the solution is

\mathbf{w}_{\mathrm{MVDR}}(k,m) = \frac{\Phi_{in}^{-1}(k,m)\boldsymbol{\gamma}_s^*(k,m)}{\boldsymbol{\gamma}_s^T(k,m)\Phi_{in}^{-1}(k,m)\boldsymbol{\gamma}_s^*(k,m)} = \frac{\Phi_y^{-1}(k,m)\boldsymbol{\gamma}_s^*(k,m)}{\boldsymbol{\gamma}_s^T(k,m)\Phi_y^{-1}(k,m)\boldsymbol{\gamma}_s^*(k,m)},
(17)

where \Phi_y(k,m) = E[\mathbf{y}(k,m)\mathbf{y}^H(k,m)] is the correlation matrix of y(k,m). In the next section, we present a method to estimate the correlation vector γ_s(k,m), which is the main parameter that affects the performance of our dereverberation algorithm.
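As a numerical sketch of Eq. (17), the weights can be computed with a single linear solve, and the distortionless constraint of Eq. (16) verified directly. The covariance matrix Φ_y and correlation vector γ_s below are synthetic placeholders, not quantities estimated from speech.

```python
import numpy as np

# Illustrative computation of the non-causal MVDR weights of Eq. (17):
# w = Phi_y^{-1} gamma* / (gamma^T Phi_y^{-1} gamma*), for one (k, m) bin.
def mvdr_weights(Phi_y, gamma):
    g = np.conj(gamma)                 # gamma_s^*(k, m)
    num = np.linalg.solve(Phi_y, g)    # Phi_y^{-1} gamma*
    return num / (gamma @ num)         # normalize by gamma^T Phi_y^{-1} gamma*

L = 4
rng = np.random.default_rng(1)
A = rng.standard_normal((L, L)) + 1j * rng.standard_normal((L, L))
Phi_y = A @ A.conj().T + L * np.eye(L)          # Hermitian positive definite
gamma = np.exp((-0.3 + 0.1j) * np.arange(L))    # synthetic decaying correlation
w = mvdr_weights(Phi_y, gamma)
# Distortionless constraint of Eq. (16): w^H gamma^* = 1
assert np.isclose(np.vdot(w, np.conj(gamma)), 1.0)
```

Using a solve rather than an explicit matrix inverse is standard practice; both forms of Eq. (17) (with Φ_in or Φ_y) yield the same weights up to the normalization.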

In this section, we derive γ_s(k,m) using a statistical reverberation model and complete the dereverberation algorithm that suppresses the late reverberation. The AIR can be decomposed into the early reflections and the late reverberation.5 The reverberant-only signal at the current frame is defined as

R(k,m) = \sum_{l'=N_e}^{\infty} H(k,l') S(k,m-l'),
(18)

where N_e determines the start time of the AIR segment that we consider as reverberation. The time instant N_e usually corresponds to 32 to 64 ms.5 In this letter, we empirically choose N_e = 12 (i.e., 48 ms), identical to the value in Habets' work,6 so that R(k,m) in Eq. (18) consists only of late reverberation. Because the reverberant signal is the convolution of the speech source and the AIR, the non-stationarity of the source and the statistical properties of the AIR allow the early component S̃(k,m) and the late reverberation component R(k,m) to be assumed statistically uncorrelated.4–6 Using Eqs. (1) and (18), a new desired signal is given by

S̃(k,m)=Y(k,m)-R(k,m).
(19)

S̃(k,m) represents the speech signal colored by the early reflections of the AIR. Our goal to improve speech intelligibility by suppressing the late reverberation can be achieved by recovering S̃(k,m).

From Eqs. (7) and (19), the estimated correlation coefficient is given by

\tilde{\gamma}_S(k,m,l) = \frac{E[\tilde{S}(k,m) Y^*(k,m+l)]}{E[|\tilde{S}(k,m)|^2]} = \frac{E[\{Y(k,m)-R(k,m)\} Y^*(k,m+l)]}{E[|Y(k,m)-R(k,m)|^2]} = \frac{E[Y(k,m)Y^*(k,m+l)] - E[R(k,m)R^*(k,m+l)]}{\lambda_Y(k,m) - \lambda_R(k,m)},
(20)

since E[R(k,m)Y^*(k,m+l)] = E[R(k,m)R^*(k,m+l)]. Here, \lambda_Y(k,m) = E[|Y(k,m)|^2] and \lambda_R(k,m) = E[|R(k,m)|^2] represent the variances of the observed signal and the late reverberation, respectively.

The acoustic transfer function (ATF) H(k,m) in the STFT domain can be statistically modeled as a zero-mean Gaussian random sequence multiplied by an exponentially decaying function.5,6 Then the reverberant signal R(k,m) in Eq. (18) can be rewritten as

R(k,m) \approx \sum_{l'=N_e}^{\infty} B_r(k)\, e^{-\alpha l' N} S(k,m-l'),
(21)

where B_r(k) is a zero-mean Gaussian random variable, N denotes the discrete-time frame shift, and \alpha(k) = 3\log_e(10)/\{T_{60}(k) f_s\} is the decay rate, determined by the sampling frequency f_s and the reverberation time T_{60}.
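The decay rate can be checked numerically: with α = 3 log_e(10)/{T_60 f_s}, the modeled energy envelope e^{-2αtf_s} falls by exactly 60 dB after T_60 seconds. A small sketch, using the letter's f_s = 8 kHz and frame shift N = 32:

```python
import math

# Decay-rate model: alpha = 3 ln(10) / (T60 * fs), and the model
# correlation of Eq. (27), gamma_R(l) = exp(-alpha * l * N).
def decay_rate(T60, fs):
    return 3.0 * math.log(10.0) / (T60 * fs)

def gamma_R(l, T60, fs=8000, N=32):
    return math.exp(-decay_rate(T60, fs) * l * N)

alpha = decay_rate(0.6, 8000)          # T60 = 600 ms
# By construction, the energy decays by 60 dB after T60 seconds:
# exp(-2 * alpha * T60 * fs) == 10 ** (-6)
assert math.isclose(math.exp(-2 * alpha * 0.6 * 8000), 1e-6)
assert gamma_R(0, 0.6) == 1.0          # zero lag: full correlation
```

The factor 3 log_e(10) is simply ln(10^3), i.e., half of the 60 dB (factor 10^6 in energy) definition of T_60.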

Substituting Eq. (21) into Eq. (20), we can write the second term of the numerator as

E[R(k,m)R^*(k,m+l)] = \sum_{l'=N_e}^{\infty} \sum_{l''=N_e}^{\infty} \left\{ E[B_r(k) e^{-\alpha l' N} B_r^*(k) e^{-\alpha l'' N}] \times E[S(k,m-l') S^*(k,m+l-l'')] \right\}.
(22)

Because the inter-frame correlation between adjacent speech signals is assumed to be negligible, Eq. (22) is non-zero only when l'' = l + l'. Note that, under our assumptions, S(k,m) is uncorrelated with S(k,m+l) but correlated with Y(k,m+l). Accordingly,

E[R(k,m)R^*(k,m+l)] = \sum_{l'=N_e}^{\infty} E[B_r(k) e^{-\alpha l' N} B_r^*(k) e^{-\alpha (l+l') N}]\, |S(k,m-l')|^2 = e^{-\alpha l N} E[R(k,m)R^*(k,m)] = \lambda_R(k,m)\, e^{-\alpha l N}.
(23)

Note that the correlation of the reverberant component is the product of the variance of the late reverberation and a factor that decays exponentially with l.

The estimated correlation coefficient γ̃S(k,m,l) is reformulated as

\tilde{\gamma}_S(k,m,l) = [\lambda_Y(k,m)/\lambda_X(k,m)]\,\gamma_Y(k,m,l) - [\lambda_R(k,m)/\lambda_X(k,m)]\,\gamma_R(k,m,l),
(24)

where

\lambda_X(k,m) = \lambda_Y(k,m) - \lambda_R(k,m),
(25)
\gamma_Y(k,m,l) = E[Y(k,m)Y^*(k,m+l)] / E[|Y(k,m)|^2],
(26)

and

\gamma_R(k,m,l) = e^{-\alpha l N}.
(27)

The late reverberant spectral variance λ_R(k,m) can be obtained using Habets' method6 and Eq. (21).
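Under the assumptions above, Eq. (24) is straightforward to evaluate. The sketch below uses synthetic placeholder values for λ_Y, λ_R, and γ_Y; in practice these are recursively estimated from the observed signal.

```python
import numpy as np

# Sketch of Eq. (24): the estimated correlation vector combines the
# measured observed-signal correlation gamma_Y with the model reverberant
# correlation gamma_R(l) = exp(-alpha * l * N) of Eq. (27), weighted by
# the variances lam_Y and lam_R via lam_X = lam_Y - lam_R (Eq. (25)).
def gamma_tilde(lam_Y, lam_R, gamma_Y, alpha, N=32):
    L = len(gamma_Y)
    lam_X = lam_Y - lam_R                         # Eq. (25)
    gamma_R = np.exp(-alpha * np.arange(L) * N)   # Eq. (27)
    return (lam_Y * gamma_Y - lam_R * gamma_R) / lam_X

g = gamma_tilde(lam_Y=2.0, lam_R=0.5,
                gamma_Y=np.array([1.0, 0.4, 0.1]),
                alpha=3 * np.log(10) / (0.6 * 8000), N=32)
# At l = 0 both gamma_Y and gamma_R equal 1, so gamma_tilde(0) = 1.
assert np.isclose(g[0], 1.0)
```

The l = 0 value of 1 reflects that S̃(k,m) is, by construction, fully correlated with the non-reverberant part of Y(k,m).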

In this section, we evaluate the performance of the proposed MVDR dereverberation filter against three conventional single-channel dereverberation algorithms: the traditional frequency-domain Wiener filter, Lebart's method,4 and Habets' approach.5 For the Wiener filter, we implement the system based on Eq. (2) with G(k,m) = λ_S̃(k,m)/λ_Y(k,m). Lebart's method is implemented by modifying amplitude spectral subtraction with a priori SNR smoothing and spectral flooring to improve output speech quality. Habets' approach uses the optimally modified log spectral amplitude (OM-LSA) spectral gain function, which exploits hypothetical gains associated with speech presence uncertainty to attenuate the reverberation more dynamically. For a fair comparison, we use the same late reverberant spectral variance λ_R(k,m) for all algorithms.
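For reference, the Wiener baseline gain G(k,m) = λ_S̃(k,m)/λ_Y(k,m) can be sketched as below; the spectral floor is an assumption added for numerical safety, not a detail specified in the letter, and the variances are synthetic placeholders.

```python
import numpy as np

# Wiener baseline of the evaluation: Eq. (2) with gain equal to the
# ratio of early-speech to observed spectral variance, where the
# early-speech variance is lam_Y - lam_R (cf. Eq. (25)).
def wiener_gain(lam_Y, lam_R, floor=1e-3):
    lam_S = np.maximum(lam_Y - lam_R, 0.0)    # early-speech variance
    return np.maximum(lam_S / lam_Y, floor)   # spectral floor (assumption)

lam_Y = np.array([1.0, 2.0, 4.0])
lam_R = np.array([0.5, 1.0, 1.0])
G = wiener_gain(lam_Y, lam_R)
assert np.allclose(G, [0.5, 0.5, 0.75])
```

The gain is then applied per bin as Ŝ(k,m) = G(k,m) Y(k,m), exactly as in Eq. (2).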

The clean speech signal is created by concatenating five utterances, spoken by five different speakers, from the Aurora-2 database. The signal is sampled at 8 kHz and is 15 s long; it is transformed into the STFT domain using 75% overlap (i.e., N = 32) and a 128-sample Kaiser window.
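The analysis front end described above can be sketched as follows. The Kaiser window's β parameter is an assumption, as the letter does not specify it.

```python
import numpy as np

# STFT analysis matching the letter's setup: 8 kHz sampling, 128-sample
# Kaiser window, 75% overlap (frame shift N = 32 samples).
def stft(x, win_len=128, shift=32, beta=8.0):
    win = np.kaiser(win_len, beta)    # beta is an assumed value
    n_frames = 1 + (len(x) - win_len) // shift
    frames = np.stack([x[m * shift:m * shift + win_len] * win
                       for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1)   # shape: (time frames, freq bins)

x = np.random.default_rng(2).standard_normal(8000)   # 1 s at 8 kHz
Y = stft(x)
assert Y.shape == (1 + (8000 - 128) // 32, 128 // 2 + 1)
```

With a 32-sample shift at 8 kHz, each frame advance corresponds to 4 ms, which is why N_e = 12 frames equals 48 ms and 10 frames equal 40 ms in the text.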

The reverberant signals are generated by convolving the speech signal with different AIRs, synthesized under different environments using the image method.12 The source-microphone distances are D = {2, 4.5} m, T60 = {600, 700, 800} ms, and the room size is 6 × 8 × 5 m (length × width × height).

The estimates of Φ_y(k,m) are recursively updated as in Benesty's work.9 The first 10 frames (i.e., 40 ms) are used to compute the initial estimates of Φ_y(k,m); the remaining frames are used for performance evaluation.

In the simulation, we assume the reverberation time T60 is known; in practice, it can be estimated by blind estimation procedures.4,13 Preliminary experiments confirm that the proposed algorithm is robust to estimation errors in T60, although further analysis remains future work. The forgetting factor for the variance of the late reverberation is set to κ = 0.2.

Performance was evaluated using the frequency-domain SNR and the log spectral distance (LSD).5 SNR(Ŝ) is the ratio between the variance of S(k,m) and that of the estimation error of Ŝ(k,m). LSD(Ŝ) is the distance between the log spectra of S(k,m) and Ŝ(k,m).
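The two measures can be sketched as below. The exact averaging and any flooring used in Ref. 5 may differ; this is a plain-vanilla formulation for illustration, with S and S_hat as complex STFT matrices (frames × bins) of the desired and estimated signals.

```python
import numpy as np

# Frequency-domain SNR: ratio of desired-signal variance to the
# variance of the estimation error, in dB.
def snr_db(S, S_hat):
    err = S_hat - S
    return 10 * np.log10(np.sum(np.abs(S) ** 2) / np.sum(np.abs(err) ** 2))

# Log spectral distance: RMS difference between log-magnitude spectra,
# with a small floor (assumption) to avoid log of zero.
def lsd_db(S, S_hat, floor=1e-10):
    logS = 20 * np.log10(np.maximum(np.abs(S), floor))
    logSh = 20 * np.log10(np.maximum(np.abs(S_hat), floor))
    return np.sqrt(np.mean((logS - logSh) ** 2))

S = np.ones((4, 3), dtype=complex)
assert lsd_db(S, S) == 0.0                   # identical spectra: zero distance
assert np.isclose(snr_db(S, 1.1 * S), 20.0)  # error is -20 dB of the signal
```

The improvements ΔSNR and ΔLSD reported in Tables I and II are differences of these quantities between the processed and unprocessed signals.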

Tables I and II show the improvements in LSD and SNR, respectively, for varying filter orders in various reverberation environments. The LSD improvement is calculated as ΔLSD = LSD(Y) − LSD(Ŝ); a large ΔLSD means that the output signal Ŝ(k,m) is much closer to the desired signal S(k,m) than the observed signal Y(k,m) is. The SNR improvement is defined as ΔSNR = SNR(Ŝ) − SNR(Y). The direct-to-reverberation ratio (DRR), defined as the direct-path energy divided by the total energy of the AIR,14 is also listed to characterize the simulation environment.

TABLE I.

Improvement of LSD (ΔLSD).

D (m)  T60 (s)  DRR (dB)  Wiener   Proposed                          Lebart(a)  Habets(b)
                                   L = 2   L = 4   L = 8   L = 16
2      0.6      −0.994    −0.776   0.157   0.266   0.466   0.818     0.431      0.039
2      0.7      −2.590    −0.638   0.336   0.510   0.774   1.059     0.720      0.338
2      0.8      −3.959    −0.452   0.510   0.959   1.194   1.509     1.049      0.580
4.5    0.6      −6.020     0.197   0.504   0.763   0.972   1.111     0.664      0.691
4.5    0.7      −7.623     0.546   0.735   1.093   1.395   1.616     0.996      1.228
4.5    0.8      −8.977     0.909   0.948   1.401   1.790   2.081     1.358      1.689
(a) Reference 4.
(b) Reference 5.

TABLE II.

Improvement of SNR (ΔSNR).

D (m)  T60 (s)  DRR (dB)  Wiener   Proposed                          Lebart(a)  Habets(b)
                                   L = 2   L = 4   L = 8   L = 16
2      0.6      −0.994    −0.004   0.011   0.017   0.089   0.108     0.065      0.075
2      0.7      −2.590     0.003   0.021   0.097   0.144   0.275     0.120      0.148
2      0.8      −3.959     0.036   0.080   0.184   0.233   0.412     0.172      0.214
4.5    0.6      −6.020     0.261   0.202   0.264   0.444   0.645     0.168      0.132
4.5    0.7      −7.623     0.315   0.273   0.418   0.710   1.034     0.340      0.276
4.5    0.8      −8.977     0.376   0.404   0.613   0.984   1.350     0.571      0.429
(a) Reference 4.
(b) Reference 5.

As shown in Table I, the LSD improvements of the proposed algorithm tend to increase monotonically with L over the studied range. The proposed algorithm always outperforms the Wiener filter and surpasses Lebart's and Habets' methods when the filter order L is 8 or greater.

The simulation results in Table II confirm that the proposed algorithm (when L ≥ 8) is superior to all of the studied conventional algorithms in the simulated environments, in terms of both SNR and LSD.

We also conducted informal Perceptual Evaluation of Speech Quality (PESQ) measurements, which show that the proposed algorithm slightly outperforms all reference approaches. We do not include the detailed scores here, however, because it remains unclear whether the PESQ score is a suitable measure of quality in reverberant environments.

In this letter, a new single-channel dereverberation algorithm was introduced. A non-causal MVDR filter that reduces reverberation while minimizing speech distortion was derived by exploiting the correlation between the speech spectrum and the reverberant spectra in subsequent frames. The late reverberation was suppressed based on a statistical reverberation model. Experimental results demonstrated the superiority of the proposed algorithm.

This research was supported by the Ministry of Knowledge Economy, Korea, under the Information Technology Research Center support program supervised by the National IT Industry Promotion Agency (NIPA-2012-H0301-12-2006).

1. A. K. Nabelek, T. R. Letowski, and F. M. Tucker, “Reverberant overlap- and self-masking in consonant identification,” J. Acoust. Soc. Am. 86(4), 1259–1265 (1989).
2. B. W. Gillespie, H. S. Malvar, and D. A. F. Florencio, “Speech dereverberation via maximum-kurtosis subband adaptive filtering,” in IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2001), Vol. 6, pp. 3701–3704.
3. K. Kinoshita, T. Nakatani, and M. Miyoshi, “Spectral subtraction steered by multi-step forward linear prediction for single channel speech dereverberation,” in IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (March 14–19, 2001), pp. I-817.
4. K. Lebart, J. M. Boucher, and P. Denbigh, “A new method based on spectral subtraction for speech dereverberation,” Acta Acoust. 87(3), 359–366 (2001).
5. E. A. P. Habets, “Single- and multi-microphone speech dereverberation using spectral enhancement,” Ph.D. thesis, Techn. Univ. Eindhoven, Eindhoven, The Netherlands, 2007.
6. E. A. P. Habets, S. Gannot, and I. Cohen, “Late reverberant spectral variance estimation based on a statistical model,” IEEE Signal Process. Lett. 16(9), 770–773 (2009).
7. J. Chen, J. Benesty, Y. Huang, and S. Doclo, “New insights into the noise reduction Wiener filter,” IEEE Trans. Acoust. Speech Signal Process. 14, 1218–1234 (2006).
8. N. W. D. Evans, J. S. D. Mason, W. M. Liu, and B. Fauve, “An assessment on the fundamental limitations of spectral subtraction,” in Proceedings of the International Conference on Acoustics, Speech, Signal Processing (2006), pp. 145–148.
9. J. Benesty and Y. Huang, “A single-channel noise reduction MVDR filter,” in IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (May 22–27, 2011), pp. 273–276.
10. J. Benesty and J. Chen, Optimal Time-Domain Noise Reduction Filters: A Theoretical Study (Springer-Verlag, Berlin, 2011).
11. J. Benesty, J. Chen, and E. Habets, Speech Enhancement in the STFT Domain (Springer-Verlag, Berlin, 2011).
12. J. Allen and D. Berkley, “Image method for efficiently simulating small-room acoustics,” J. Acoust. Soc. Am. 65(4), 943–950 (1979).
13. M. R. Schroeder, “New method of measuring reverberation time,” J. Acoust. Soc. Am. 37(3), 409–412 (1965).
14. D. Griesinger, “The importance of the direct to reverberant ratio in the perception of distance, localization, clarity, and envelopment,” in Audio Engineering Society Convention (AES) (May 2009), p. 126.