This letter describes the perceptual relevance of adopting the temporal envelope to model the frequency band of (highband) in wideband speech signals. Based on theoretical work in psychoacoustics, we find that the temporal envelope can indeed serve as a perceptual cue for the high-band signal, i.e., a noiseless sound can be obtained if the temporal envelope is roughly preserved. Subjective listening tests verify that transparent quality can be obtained if the model is used for the band. The proposed model offers flexible scalability and reduces the quantization cost in coding applications.
Wideband speech, sampled at with a signal bandwidth of around , offers a more pleasant quality than narrow-band speech (sampled at ) used in conventional telephony systems.1 To reduce the number of bits required to encode wideband speech directly, bandwidth extension techniques are often introduced to speech communication systems. Bandwidth extension is a method to generate a virtual image of wideband sound from a narrow-band speech signal without utilizing additional information or with minimal information.
Speech coders using bandwidth extension methods for wideband signals generally adopt the source-filter model using linear prediction. While the source-filter model can improve coding efficiency, it cannot provide scalability at the bit-stream level. Scalability is the ability of a coder to support an ordered set of bit streams that can produce a reconstructed sequence; it is relevant to a growing range of applications, as reflected in the objectives of several standard coders. Because the synthesis process of the source-filter model is tightly coupled with the look-back memory consisting of the synthesized signal in previous frames, a scalable implementation cannot keep the synthesis in the encoding stage identical to that in the decoding stage, which significantly degrades perceptual quality. In addition, the source-filter model is not effective at higher frequency bands (higher than around ) where the spectral envelope is relatively flat compared to the spectral envelope at lower frequencies.
In contrast to the source-filter theory, which represents speech by multiplying the spectral envelope and the fine structure in the frequency domain, the speech signal can also be reconstructed by multiplying the temporal envelope with the fine structure in the temporal domain, which we call the temporal envelope model. In this model, the temporal envelope represents the temporal energy contour and the fine structure represents rapid fluctuation. The concept of the temporal envelope model is already utilized in several coders, though the detailed implementations differ. For example, temporal noise shaping2 used in the Advanced Audio Coding (AAC) family aims to include temporal envelope information, and ITU-T G.729.1 (Ref. 3) utilizes the temporal envelope information in bandwidth extension based on the source-filter model. Though the importance of the temporal envelope information has been empirically recognized, as these examples show, it remains unclear why and how the temporal envelope affects perceptual quality.
In this letter, we find psychoacoustically related evidence to verify the importance of the temporal envelope and the fact that perceptually transparent quality can be obtained if the temporal envelope information in the high band is reasonably maintained, which is confirmed through experiments. The temporal envelope model representation of the speech signal using just the temporal envelope guarantees bit-stream level scalability because its operation is composed of simple multiplications, as opposed to convolution in the source-filter model. Moreover, some perceptual characteristics of the temporal envelope specific to the speech signal suggest the possibility of applying this method for efficient coding.
II. Theoretical foundations of the temporal envelope model in the high-band signal
It is well known that the auditory system, like all sensory systems, has limited temporal resolution and cannot perceive temporal change if it occurs too rapidly.4 This characteristic gives us the inspiration to use a slowly varying temporal envelope as a perceptual cue of wideband speech in the high frequency band . Here we construct the temporal envelope model such that the band-limited signal, , is composed of multiplying the temporal envelope, , representing the temporal energy contour, with the fine structure, , representing rapid fluctuation as follows:
Based on the model, we first assume that the speech signal can be synthesized to be perceptually identical to the original speech without any information about the original fine structure. Since the fine structure in the high band varies more rapidly than a human listener can follow, aberrations in the phase of the fine structure cannot be detected. Considering this characteristic, we suggest that a synthesized signal consisting of a fine structure obtained from random noise signals and a temporal envelope taken from the original gives signal quality comparable to that of the original speech.
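The multiplicative model can be illustrated with a minimal sketch. The notation is an assumption introduced only for this illustration: the band-limited signal is written s[n] = e[n] f[n], with e[n] the temporal envelope and f[n] the fine structure. The fine structure is replaced here by a random sequence of +1 and -1, a deliberate simplification of a noise-derived fine structure, and `synthesize`, the sampling rate, and the 4 Hz envelope are hypothetical choices.

```python
import math
import random


def synthesize(envelope, seed):
    """Return s[n] = e[n] * f[n] with a random unit-magnitude fine structure.

    Assumption: f[n] is a random +/-1 sequence, a stand-in for a
    noise-derived fine structure; it is not the procedure of the letter.
    """
    rng = random.Random(seed)
    fine = [rng.choice((-1.0, 1.0)) for _ in envelope]
    return [e * f for e, f in zip(envelope, fine)]


# Hypothetical slowly varying, non-negative envelope (4 Hz, 1 s at 8 kHz).
fs = 8000
env = [0.5 * (1.0 + math.cos(2.0 * math.pi * 4.0 * n / fs)) for n in range(fs)]

# Two different fine structures under the same temporal envelope.
a = synthesize(env, seed=1)
b = synthesize(env, seed=2)
```

Both signals differ sample by sample, yet their magnitudes follow the same envelope exactly, which is the property the model relies on in the high band.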
A. Temporal masking and the temporal envelope
The temporal masking effect in the human auditory system, which includes premasking and postmasking,5 motivates the temporal envelope model. Premasking is the masking effect that occurs even before the masker is switched on.5 Conversely, postmasking results from the gradual retreat of the masker: masking does not stop immediately when the masker is switched off but persists while no masker is actually present, as shown in Fig. 1(a).5 Though the temporal masking effect has been shown under the assumption that both the masker and maskee exist, it can also be applied to peak points in the signal waveform itself without distinguishing a separate masker from the maskee.6 This is reasonable if we consider the peak point as a masker having a very short duration. Psychoacoustically, positive peak points in the waveform of an auditory-filtered (bandpassed) signal can also be taken to lead to a high rate of neural firing.7 In the auditory model, each neuron has a state that decays exponentially with a characteristic decay time and is reset when it fires. The reset level depends on the input level during the firing stage. The exponential decay is modeled after the postmasking effect, and therefore the temporal masking threshold can be obtained by letting it decay exponentially after each peak pulse that leads to neural firing, as shown in Fig. 1(b). The temporal masking threshold obtained in this way looks similar to the temporal envelope of the corresponding band signal in the high frequency band.6 Therefore, in the high frequency band, the variation of the fine structure under a fixed temporal envelope cannot be perceptually detected. The effect may not apply to a lower band signal, where the masking threshold decays faster than the band signal itself fluctuates.
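The peak-triggered, exponentially decaying threshold described above can be sketched as follows. This is an illustrative reading rather than the auditory model of Ref. 7: the reset rule (reset whenever the input exceeds the decayed state), the function name `masking_threshold`, and the time constant `tau` are assumptions.

```python
import math


def masking_threshold(x, fs, tau):
    """Peak-triggered threshold that decays exponentially after each peak.

    x   : samples of one bandpassed signal
    fs  : sampling rate in Hz
    tau : decay time constant in seconds (hypothetical, not from the letter)
    """
    decay = math.exp(-1.0 / (tau * fs))  # per-sample decay factor
    state = 0.0
    threshold = []
    for v in x:
        state *= decay      # postmasking-like exponential decay
        if v > state:       # a positive peak above the state "fires"
            state = v       # the reset level follows the input level
        threshold.append(state)
    return threshold


# A single unit pulse followed by silence decays as exp(-t / tau).
thr = masking_threshold([1.0] + [0.0] * 9, fs=1000, tau=0.005)
```

When the signal peaks recur faster than the threshold decays, as in a high-frequency band, the threshold stays close to the line joining the peaks, which is why it resembles the temporal envelope there.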
B. Just-noticeable amplitude variation and just-noticeable frequency variation
In psychoacoustics, just-noticeable amplitude variation (JNAV) and just-noticeable frequency variation (JNFV) can be examined to confirm the suitability of the temporal envelope model. Just-noticeable variations are useful for producing scales of sensations and are important as the basis on which sensations are built.5 JNAV is often measured by amplitude modulation. The detectability of amplitude modulation in the absence of spectral cues provides a quantitative description of temporal resolution for steady-state signals with relatively small amplitude changes. Modulation thresholds for sinusoidally amplitude-modulated wideband noise are measured as a function of modulation frequency, which yields the temporal modulation transfer function (TMTF).4 Detection experiments show that human sensitivity to the modulation can be modeled by a low-pass filter whose cutoff frequency is about . Moreover, Viemeister4 showed that the cutoff frequency of the TMTF increases as the center frequency increases. This means that variation in the temporal envelope is more easily detected in higher frequency bands. Meanwhile, JNFV is measured by using sinusoidal frequency modulation. The just-noticeable value is given as a function of carrier frequency.5 According to this function, at low carrier frequencies the just-noticeable frequency variation is approximately constant and has a value of about . Above about , it increases nearly in proportion to the carrier frequency, which is strongly correlated with the critical-band rate. Hence, for the same amount of frequency variation caused by quantization or estimation, the human ear detects the frequency variation of the fine structure less readily in a high frequency band than in a low frequency band. As inferred from the characteristics of JNAV and JNFV, it is reasonable to exclude the fine structure and to preserve the temporal envelope for a high-band signal.
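The two trends above can be written as simple functions. Both are hedged illustrations: the first-order shape of the low-pass filter is an assumption, as is the piecewise form of the JNFV curve, and the cutoff, floor, and corner values are left as parameters rather than taken from the text. The function names are hypothetical.

```python
import math


def tmtf_attenuation_db(fm, fc):
    """Attenuation in dB (<= 0) of a first-order low-pass at modulation rate fm.

    fc is the cutoff in Hz. The first-order shape is an assumption used
    only to illustrate a low-pass TMTF.
    """
    return -10.0 * math.log10(1.0 + (fm / fc) ** 2)


def jnfv_hz(carrier_hz, floor_hz, corner_hz):
    """Just-noticeable frequency variation under a piecewise assumption.

    Constant (floor_hz) below corner_hz, then proportional to the carrier
    frequency. floor_hz and corner_hz are caller-supplied parameters.
    """
    return floor_hz * max(1.0, carrier_hz / corner_hz)
```

For a first-order filter the attenuation at `fm == fc` is about -3 dB, and under the piecewise assumption doubling a carrier frequency that lies above the corner doubles the tolerated frequency variation.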
C. Temporal envelope of the speech signal
The temporal envelope has been studied in various speech processing applications because it is believed to be related to many perceptual attributes of speech, such as intelligibility and quality. For the purpose of speech quality measurement, Kim8 previously hypothesized that the human auditory system makes use of the modulation spectrum, which is related to temporal envelope information. In that paper, spectral components located in a certain modulation frequency region are said to be perceived as more annoying than those at other frequencies. This distortion-related frequency region is assumed to be around , from the observation that the speed of mechanical movement of the human articulatory system is limited to and that human modulation detection is bounded by the cutoff frequency of about .4 Under the assumptions Kim sets out, it can be hypothesized that a temporal envelope with limited bandwidth gives a perceptually sufficient representation of speech.
The band-limited characteristic of the temporal envelope is also shown with respect to distortion as well as speech identification. Ghitza,9 through various experiments, showed that roughly one-half of one critical band of the envelope information for a given auditory channel is sufficient to preserve speech quality without any audible distortions.
The narrow-band characteristic guarantees efficient quantization for coding of the temporal envelope. The premise that the temporal envelope information can by itself determine speech quality, and that most of its information lies in a narrow band of around , means that the temporal envelope is able to give a perceptually summarized representation of the speech signal.
III. Experiments and discussions
A. Generation of test materials
To verify the hypothesis described in Sec. II, we perform subjective listening tests with test samples synthesized using the temporal envelope model. Detailed steps to generate test samples are depicted in Fig. 2. Under the assumption of the temporal envelope model, an arbitrary high-band signal can be expressed by summing the multiplication of temporal envelopes and fine structures of subbands:
where is the input high-band signal and and are the temporal envelope and the fine structure of the th subband, respectively, and is the number of subbands.
The test materials for the listening tests are composed of the low-pass filtered temporal envelopes, from the original wideband speech signal and fine structures, from random noise signals. To obtain the temporal envelopes of the subbands, the subband signals are first extracted from the original wideband signal with 64th-order FIR bandpass filters. We divide the band into four subbands (whose bandwidths are 500, 700, 800, and , respectively), approximately following the critical-band rate. The temporal envelopes, , are obtained from the subband signals; the extraction methods are explained in Sec. III B. In addition, in order to confirm the band-limited characteristic of the temporal envelope, the temporal envelopes are low-pass filtered with cutoff frequencies of one-half of each critical-band bandwidth, which also indicates the efficiency attainable in coding. For the fine structures, , an arbitrary white noise signal is bandpass filtered by filters whose bandwidths are the same as those used in obtaining . The temporal envelopes, , of the subband signals of the white noise are obtained by the same temporal envelope extraction method, and the corresponding fine structures, , are calculated by dividing the subband signals of the white noise by the temporal envelopes, (“temporal envelope normalization” process in Fig. 2). The product of the low-pass filtered temporal envelopes, , and the fine structures, , gives the synthesized subband signals. The synthesized subband signals in the band and the original low-band signal, , are summed to give the synthesized wideband signal, that is, the test material.
Though we cannot regenerate the exact spectrum by using random noise for the fine structures as illustrated in the previous paragraph, the following experiments show that fine structure is not important perceptually, i.e., it is difficult to distinguish the synthesized signal from the original one.
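The generation steps of this subsection can be sketched in a few lines, assuming the subband signals have already been bandpass filtered. The rectify-and-average envelope is a simple stand-in for the extraction methods of Sec. III B, the moving average takes the place of the separate low-pass filtering of the envelope, and all function names, the window length `win`, and the floor `eps` are hypothetical.

```python
def smooth_envelope(x, win):
    """Rectify-and-average envelope (a simple stand-in extraction method)."""
    half = win // 2
    env = []
    for n in range(len(x)):
        seg = x[max(0, n - half): n + half + 1]
        env.append(sum(abs(v) for v in seg) / len(seg))
    return env


def synthesize_subband(speech_band, noise_band, win, eps=1e-12):
    """Impose the envelope of speech_band on the fine structure of noise_band."""
    e_speech = smooth_envelope(speech_band, win)
    e_noise = smooth_envelope(noise_band, win)
    # "Temporal envelope normalization": divide the noise by its own envelope.
    fine = [v / max(e, eps) for v, e in zip(noise_band, e_noise)]
    return [e * f for e, f in zip(e_speech, fine)]


def synthesize_wideband(low_band, synth_bands):
    """Sum the original low band and all synthesized subband signals."""
    return [sum(vals) for vals in zip(low_band, *synth_bands)]
```

A quick sanity check on the design: if the same signal is passed as both `speech_band` and `noise_band`, the normalization and the multiplication cancel and the input is returned, so any difference in the output comes only from the substituted fine structure.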
B. Temporal envelope extraction methods
In applications where the temporal envelope is needed, either a simple short-time energy contour or the Hilbert transform is commonly used. To obtain a more accurate envelope, we also simulate a method that picks the local peaks of each subband signal and interpolates them in a sample-by-sample manner, which we call the local maxima interpolation method. This method intuitively takes into account that temporal masking starts from the local peaks of the signal and that a synthesized signal should preserve the local peak values that constitute the overall temporal masking threshold. We tried three different approaches to verify whether the accuracy of the extraction process affects the quality of the temporal envelope model. Figure 3 shows examples of temporal envelopes extracted by the three methods (left part) and the corresponding fine structures (right part). In the fine structures generated by the Hilbert transform and the energy contour methods, fluctuation caused by amplitude variation remains. If arbitrary random noise is substituted for the original fine structure, the temporal envelope of the synthesized signal will therefore not be similar to the original one. In contrast, the local maxima interpolation method can ensure waveform similarity with the original signal, although this does not necessarily mean better perceptual quality, as the results of the listening tests confirm.
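Two of the three extraction methods can be sketched with the standard library alone; the Hilbert-transform envelope is omitted because it requires an FFT. The details are assumptions: peaks are taken as positive local maxima, interpolation between them is linear, the first and last peak values are held at the edges, and the energy contour is a centered sliding RMS. Function names and the window length `win` are hypothetical.

```python
import math


def local_maxima_envelope(x):
    """Linearly interpolate between positive local peaks, sample by sample."""
    peaks = [i for i in range(1, len(x) - 1)
             if x[i] > 0.0 and x[i - 1] < x[i] >= x[i + 1]]
    if not peaks:
        # No interior peak found: fall back to a flat envelope.
        return [max((abs(v) for v in x), default=0.0)] * len(x)
    env = [0.0] * len(x)
    for i in range(peaks[0]):             # hold the first peak value
        env[i] = x[peaks[0]]
    for a, b in zip(peaks, peaks[1:]):    # interpolate from peak to peak
        for i in range(a, b):
            w = (i - a) / (b - a)
            env[i] = (1.0 - w) * x[a] + w * x[b]
    for i in range(peaks[-1], len(x)):    # hold the last peak value
        env[i] = x[peaks[-1]]
    return env


def energy_envelope(x, win):
    """Short-time energy contour as a centered sliding RMS over win samples."""
    half = win // 2
    env = []
    for n in range(len(x)):
        seg = x[max(0, n - half): n + half + 1]
        env.append(math.sqrt(sum(v * v for v in seg) / len(seg)))
    return env
```

By construction the local maxima envelope passes exactly through every detected peak, whereas the RMS contour averages over the window and generally lies below the peaks.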
For subjective quality evaluation, the multistimulus test with hidden reference and anchors (MUSHRA)10 was adopted. In the MUSHRA test, the listener is presented with several speech stimuli. The first is the reference, which is the original wideband speech signal. The remainder are the test stimuli, to each of which the listener must give a score between 0 and 100 according to their opinion of the quality. The five-interval quality scale is: excellent (100-80), good (80-60), fair (60-40), poor (40-20), and bad (20-0). The test stimuli include two anchors: a hidden reference and the undistorted narrow-band speech. The listener must try to identify the hidden reference and score it as 100. The test stimuli are presented in randomized order so that the listener has no clues to their identity.
The reference speech database was taken from the TIMIT11 data set. Speech samples consisted of four sentences spoken by four different male and female speakers, and 20 experienced listeners were asked to make quality judgments. Listening was carried out using headphones, which give more consistent conditions than loudspeakers in listening rooms. The tests were carried out with Audio-Technica ATH-A700 headphones in a quiet room where equipment noise was not audible.
Figure 4 shows the results of the subjective quality measures: the mean values of the MUSHRA scores and the 95% confidence intervals. Figure 4(a) shows the results for the cases where the three temporal envelope extraction methods are applied. The three methods give almost the same performance, but the Hilbert transform method is slightly better. However, contrary to our expectations, none of the synthesized signals offers the same quality as the original. The theoretical reasons supporting the temporal envelope model are not validated when we replace subband signals starting from with artificially synthesized signals. The arguments we presented only confirm the trend that the temporal envelope becomes more important as frequency increases; they do not pinpoint the frequency from which the temporal envelope model can be applied. To analyze this quality degradation, we performed similar experiments by varying the start frequency for applying the temporal envelope model, the results of which are shown in Fig. 4(b). The synthesized signal cannot be distinguished from the original if the temporal envelope model starts around . From hypothesis tests, under significance level , we can also say that the qualities of the synthesized signals and the original one are equal.
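A mean score with its 95% confidence interval, as plotted in Fig. 4, could be computed along the following lines. The letter does not state how the intervals were obtained, so the use of the Student-t distribution is an assumption; `mean_and_ci` is a hypothetical helper, and the critical value is passed in by the caller (about 2.093 for 20 listeners, i.e., 19 degrees of freedom).

```python
import math
import statistics


def mean_and_ci(scores, t_crit):
    """Return (mean, half-width of the confidence interval) for listener scores.

    t_crit is the two-sided Student-t critical value for len(scores) - 1
    degrees of freedom (assumption: a t-based interval is appropriate).
    """
    n = len(scores)
    mean = statistics.mean(scores)
    half_width = t_crit * statistics.stdev(scores) / math.sqrt(n)
    return mean, half_width
```

Two conditions whose intervals overlap substantially are consistent with equal quality, which is the kind of comparison made between the synthesized signals and the original.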
As shown in Fig. 4, transparent quality can be obtained through the temporal envelope model even though it does not include fine structure information. In the model, the speech signal in each subband is reconstructed by multiplying the temporal envelope with the fine structure. Since the fine structure does not need to be quantized, quantizing only the temporal envelope information brings additional benefits in scalability, in contrast to the source-filter model, in which both the filter coefficients and the residual signal are needed and in which the synthesis of successive frames is coupled. Moreover, the temporal envelope is expected to be quantized efficiently, considering not only its band-limited characteristics but also the fact that it has pitch periodicity, which can be extracted by a narrow-band coder.
Based on psychoacoustic evidence, we showed that the temporal envelope of a subband with critical-band bandwidth provides a perceptual cue in the highband. Through experiments we also confirmed that the temporal envelope model works well in the band of . The temporal envelope model can be utilized in scalable wideband speech coders and in speech bandwidth extension, which virtually converts narrow-band speech into wideband speech. Progressive enhancement by quantizing the temporal envelope enables scalable implementation. Moreover, because the temporal envelope model is designed based on human perception, it could also be applied to audio signals if the target bands are higher than in our case. Effective quantization methods that use the band-limited characteristics and pitch periodicity require further study.