Wearable flexible sensors attached to the neck have been developed to measure the vibration of the vocal cords during speech. However, high-frequency attenuation caused by the frequency response of the flexible sensors and the absorption of high-frequency sound by the skin are obstacles to the practical application of these sensors in speech capture based on bone conduction. In this paper, speech enhancement techniques for enhancing the intelligibility of sensor signals are developed and compared. Four kinds of speech enhancement algorithms based on a fully connected neural network (FCNN), a long short-term memory (LSTM), a bidirectional long short-term memory (BLSTM), and a convolutional-recurrent neural network (CRNN) are adopted to enhance the sensor signals, and their performance after deployment on four kinds of edge and cloud platforms is also investigated. Experimental results show that the BLSTM performs best in improving speech quality, but is poorest with regard to hardware deployment. It improves short-time objective intelligibility (STOI) by 0.18 to nearly 0.80, which corresponds to a good intelligibility level, but it has a large model size and introduces latency. The CRNN, which improves STOI to about 0.75, ranks second among the four neural networks. It is also the only model that achieves real-time processing on all four hardware platforms, demonstrating its great potential for deployment on mobile platforms. To the best of our knowledge, this is one of the first trials to systematically and specifically develop processing techniques for bone-conduction speech signals captured by flexible sensors. The results demonstrate the possibility of realizing a wearable lightweight speech collection system based on flexible vibration sensors and real-time speech enhancement to compensate for high-frequency attenuation.
HIGHLIGHTS
Four kinds of neural-network-based speech enhancement algorithms are used to enhance the signal collected by a flexible vibrational throat sensor.
Enhancement qualities and deployment performances of different algorithms are compared.
Deployments of the four algorithms on both edge and cloud platforms are investigated.
The convolutional-recurrent neural network approach is shown to be most suitable for migration to mobile platforms.
I. INTRODUCTION
Voice is one of the most important biological signals involved in the process of communication between people as well as in human–computer interfaces.1,2 The human voice can be transmitted by air through acoustic waves and collected by microphones. However, signals collected through air conduction are subject to interference from background noise.3,4 Especially in noisy factories, subways, and city streets, measurement of clear and high-quality voice through air conduction is challenging. As a result, bone-conduction approaches represented by various types of throat microphones5,6 have been developed to accommodate the needs of voice capture in noisy environments. In addition, advanced throat microphones can pick up very gentle speech such as murmurs or whispered voices, which are almost inaudible even at a short distance.7,8 Therefore, throat microphones are particularly suitable for tasks such as covert military operations and assisting patients with laryngeal injury.5,7 Besides, many clinically relevant parameters such as phonation time, fundamental frequency, and sound pressure level can be derived from throat vibration signals, enabling noninvasive approaches for quantifying vocal states and for diagnosis and treatment of voice disorders.9–11 However, the majority of throat microphones are based on rigid components that require extra fixtures to ensure close contact with wearers’ necks.
Flexible electronics has offered a promising solution to replace rigid throat microphones with soft materials and skin patches. Flexible sensors have been used to capture vibration on the throat, based on piezoelectric,12–14 piezoresistive,15–18 piezocapacitive,19–21 and triboelectric22–24 effects. However, sensors based on piezoelectric and triboelectric effects may require direct contact of the sensing elements with other structural components, causing mechanical abrasion that may eventually lead to sensor damage. In a previous study, we demonstrated a flexible electromagnetic sensor that measured vibration-induced displacement of a soft magnetic membrane suspended within a toroidal magnetic field and enclosed within a flexible multilayer coil.25 Experimental results using this device have demonstrated that it offers a broad frequency response range from 1 Hz to 10 kHz and a high sensitivity of 0.59 mV/µm at an operating frequency of 1.7 kHz. In addition, because the vibrating suspension structure couples to the sensing elements purely through noncontact electromagnetic coupling, the chance of device damage due to repeated mechanical abrasion is reduced.
As with other throat microphones, voice signals transmitted through soft tissues such as skin and muscle are subject to a low-pass effect due to tissue damping, leading to high-frequency attenuation and loss of consonants. These problems become more significant when flexible vibration sensors are involved, since the soft vibrational structures in such sensors tend to have lower resonance frequencies. To be more specific, the cutoff frequencies of signals collected by throat microphones are around 2.5 kHz,26 which means that the frequency components of the signals above 2.5 kHz attenuate severely. However, the cutoff frequencies of most flexible vibration sensors, including the sensor presented here, are around 1 kHz.15,17,21,23,25 As a result, approaches for signal enhancement are urgently needed for practical applications of flexible vibration sensors. In addition, owing to the wearable nature of the devices, the capability to adapt to mobile platforms with low energy consumption and low latency is also highly desirable for the signal enhancement algorithms. To reconstruct the attenuated high-frequency components, several speech enhancement algorithms for bandwidth extension have been developed for the speech generated by rigid throat microphones. The earliest algorithms were based on a source–filter model for feature extraction,27 and these were followed by algorithms based on a Gaussian mixture model (GMM) or codebook for feature transformation. In the source–filter model, speech signals are modeled as an excitation source filtered by the vocal tract. Usually, the excitation source is assumed to be the same for the speech generated by the throat microphone and for its corresponding air-conduction microphone, and the enhancement task is then simplified to the transformation of the vocal filter features. However, in some cases, mutual independence of the source and the filter is not strictly guaranteed, and the dimension of the extracted features is relatively low, leading to poor speech enhancement performance. By contrast, the short-time Fourier transform (STFT) does not require additional assumptions.28 Its coefficients, which are high-dimensional features, carry more information about the speech signal.29 With the strong nonlinear fitting capability of a fully connected neural network (FCNN), the transformation of STFT magnitudes between the signal of a rigid throat microphone and that of its corresponding air-conduction microphone was accomplished in our previous study.30 Besides the FCNN, our research has also demonstrated that using a long short-term memory (LSTM) and a bidirectional long short-term memory (BLSTM) to enhance the signal from rigid throat microphones gave better results than methods based on a GMM or FCNN,30,31 since both the LSTM and BLSTM can learn contextual information from series of features. Although some algorithms to enhance speech captured by rigid throat microphones have been explored, reconstruction of speech captured by a flexible vibration sensor has rarely been performed. The unique mechanical properties and device design of a flexible vibration sensor may lead to varied performance of different algorithms, and an appropriate signal enhancement algorithm for flexible vibration sensors remains to be developed. Several speech denoising algorithms, which convert noisy speech signals into clean signals, can also be used for enhancement of flexible vibration sensor signals.
In this paper, the three kinds of neural networks mentioned above, namely, an FCNN, an LSTM, and a BLSTM, as well as a convolutional-recurrent neural network (CRNN) that has been proved to be effective for speech denoising,32 are adopted to process speech data obtained by flexible vibration sensors from six persons. Features from the speech data are first extracted by the STFT and then processed through the four neural networks. Two objective metrics, the log-spectral distance (LSD) and short-time objective intelligibility (STOI), are adopted to compare the performance of the different algorithms. Selected results show that the average LSD value of the signal decreases from 1.73 to 1.04, 1.03, 0.96, and 1.29, respectively, while the average STOI value of the signal improves from 0.62 to 0.79, 0.80, 0.87, and 0.80, respectively, after processing by the FCNN, LSTM, BLSTM, and CRNN, suggesting that significant improvements in signal quality can be achieved through use of the neural networks. The algorithms are also deployed on embedded systems and cloud servers. The number of model parameters and the time consumption on the different systems are computed, and their hardware deployment performance is evaluated. A minimum time consumption of only 0.2 s is needed for enhancing a 2.5 s period of signal on an embedded system, and the time consumption is less than 0.01 s when the signal is processed by a cloud server, indicating the possibility of realizing a wearable lightweight speech collection system based on flexible vibration sensors with real-time speech enhancement capability to compensate for high-frequency attenuation. This paper is one of the first studies to systematically investigate speech processing techniques specifically for flexible sensors. It may benefit the development of enhancement algorithms and miniaturized skin-like electronic systems to facilitate communication between people and human–computer interfaces.
II. EXPERIMENTAL WORK
A. Sensor and hardware prototype
Figure 1(a) shows a flexible vibration sensor attached to the neck of a volunteer and connected to a signal amplification circuit. The sensor contains a flexible coil, two flexible annular magnetic membranes, a vibrational magnet, and two elastic membranes [Fig. 1(b)], with overall dimensions of 25 mm × 25 mm × 0.8 mm and a mass of 1.5 g. The flexible coil contains 180 turns of coplanar copper (Cu) wires with a line width of 60 μm and a thickness of 1 μm. The flexible coil was fabricated through electroplating and photolithography of six layers of Cu on polyimide (PI) layers precoated with seed layers of Ti/Cu (2/100 nm). Adjacent layers of Cu were connected through plasma-etched vias in the PI. The vibrational magnet and the two flexible annular magnetic membranes are made of a composite of Nd2Fe14B microparticles and polydimethylsiloxane (PDMS), leading to high magnetism and flexibility after magnetization. The vibrational magnet was obtained by magnetizing a circular magnetic membrane (2.5 mm in radius and 400 µm in thickness) along the normal direction. The two annular magnetic membranes (5.5 mm in inner radius and 12.5 mm in outer radius) were fabricated by cutting off the central parts of two circular magnetic membranes, which were initially folded three times into a pie shape with a 45° vertex angle and magnetized along the radial direction. The two annular magnetic membranes guide and extend the coverage of the magnetic field within the region of the coil, leading to a higher output voltage from the sensor. As shown in Fig. 1(c), the flexible coil is sandwiched between the two flexible annular magnetic membranes, through whose central openings the vibrational magnet is suspended by two elastic membranes attached externally. When the sensor is subjected to external vibration, the relative displacement between the vibrational magnet and the annular magnetic membranes changes the magnetic field distribution within the coil and thus induces a voltage signal. Figure 1(d) shows a sensor in its bent state: the excellent flexibility of the sensor allows its application to curvilinear and deformable skin surfaces.
Schematics and images of the sensor. (a) Photograph of a flexible sensor attached on the neck of a volunteer. (b) Components and fabrication processes of the sensor. (c) Structure of the sensor. (d) A sensor in its bending state.
The miniaturized audio acquisition system contains a signal amplification circuit and a signal transmission circuit. The signal amplification circuit is based on a low-noise operational amplifier (ADA4691, Analog Devices, Inc.) with a gain of 10 000. The signal generated by the coil in our sensor may be subject to interference from environmental electromagnetic fields; to filter out such interference, a bandpass filter circuit can be used after signal amplification. The signal transmission circuit digitizes the amplified signals using a 16-bit analog-to-digital converter (ADS8320, Texas Instruments) at a sampling rate of 8 kHz. A microcontroller (nRF52832 SoC, Nordic Semiconductor) based on the Bluetooth 5 protocol establishes a wireless connection between the module and the signal processing systems that contain embedded processors for speech enhancement. To simultaneously record speech signals from the flexible vibration sensor and the air-conduction microphone, a sound card (UR22C, Steinberg) and a microphone (ECM8000, SHUYIN) were used to collect both the sensor vibration signal and the microphone speech signal for establishing the training set. The speech signal was sampled at 16 kHz and stored with 24-bit resolution; to further reduce the difficulty of the enhancement task, the collected speech signal was then down-sampled to 8 kHz.
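As a small illustration of the last step, a 16 kHz speech recording can be brought down to 8 kHz with a polyphase resampler. The snippet below is only a sketch; the file names and the use of SciPy and soundfile are illustrative assumptions, not part of the acquisition system described above.

```python
import soundfile as sf                    # assumed I/O library
from scipy.signal import resample_poly

# Load a simultaneously recorded speech signal (16 kHz, 24-bit in the original setup).
speech, fs = sf.read("recording_16k.wav")  # hypothetical file name
assert fs == 16000

# Down-sample by a factor of 2 to obtain the 8 kHz signal used for enhancement.
speech_8k = resample_poly(speech, up=1, down=2)
sf.write("recording_8k.wav", speech_8k, 8000)
```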
The signal enhancement algorithms can be deployed either on embedded systems or on cloud servers. Four kinds of devices are used to deploy the signal enhancement models: a Raspberry Pi 4 and a Jetson Nano for edge computing, and a cloud server based on an Intel Core i5-6500 CPU, with or without a GeForce RTX 2080 Ti GPU, for cloud computing. The Raspberry Pi 4 uses a Broadcom BCM2711 system on a chip (SoC) with a 1.5 GHz 64-bit quad-core ARM Cortex-A72 processor, while the Jetson Nano uses an SoC with a 1.42 GHz 64-bit quad-core ARM Cortex-A57 processor. The CPU in the Raspberry Pi 4 is one generation newer than that in the Jetson Nano; however, the Jetson Nano has a 128-core Maxwell GPU at 921 MHz. As a result, the Raspberry Pi 4 provides 9.69 GFLOPS, while the Jetson Nano provides 472 GFLOPS. The Intel Core i5-6500 is a quad-core 64-bit x86 mid-range desktop microprocessor with a base frequency of 3.2 GHz and 220.1 GFLOPS. The GeForce RTX 2080 Ti is a 4352-core Turing GPU at 1545 MHz with 9.69 TFLOPS.
B. Framework of the proposed algorithms
The enhancement of the sensor signal can be considered as a voice conversion task that converts the distorted signal st into its corresponding air-conduction microphone signal mt. Owing to limitations of algorithm capability, previous methods have assumed that the excitation signals of the two kinds of speech are the same, and they have focused on the conversion of low-dimensional spectral envelopes. Because low-dimensional features cannot describe the characteristics of vocal tracts well, an increasing number of methods are using the powerful nonlinear capabilities of neural networks to explore the conversion of high-dimensional features. Therefore, in our algorithms, the STFT has been used to decompose speech into amplitude and phase. Only amplitude conversion was conducted, on the premise that the phase has little effect on the human perception of speech quality.33 The overall process of signal enhancement is depicted in detail in Fig. 2. The process involves a training stage and an enhancement stage. In the training stage, pieces of simultaneously collected sensor signal st and microphone signal mt from the training set are framed and processed by the STFT to obtain the magnitude spectrum of each frame. Logarithmic transformation and Gaussian normalization are then performed on the spectrum to obtain the extracted features X and Y for training the neural networks. The two corresponding features X and Y are two-dimensional matrices of the same size. The mean and the variance of the full training set calculated in the Gaussian normalization are reserved for the enhancement stage. The mean of X, the mean of Y, the variance of X, and the variance of Y are denoted by μs, μm, σs, and σm, respectively. In the enhancement stage, sensor signals st from the test set are first processed with the STFT and logarithmic transformation in the same way as in the training stage and normalized using the reserved coefficients μs and σs to obtain the features X. Second, the trained neural network is used to map the features X to the enhanced features Ŷ. Third, the enhanced features Ŷ are denormalized using μm and σm, and this is followed by exponential transformation and the inverse STFT (iSTFT) to obtain the enhanced signal ŝt. The corresponding microphone signal mt is used for the evaluation of ŝt.
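The two stages can be summarized in the following sketch. The 256-point FFT (which yields the 129-dimensional magnitude used in Sec. III A), the use of librosa, and the reuse of the sensor phase for the iSTFT are assumptions made for illustration; they are not specified above.

```python
import numpy as np
import librosa

N_FFT, HOP, EPS = 256, 128, 1e-8       # assumed STFT setting: 129 frequency bins at 8 kHz

def extract_features(signal):
    """Frame the signal, take the STFT, and return the log-magnitude matrix (D x T) and phase."""
    spec = librosa.stft(signal, n_fft=N_FFT, hop_length=HOP)
    return np.log(np.abs(spec) + EPS), np.angle(spec)

def training_stats(sensor_feats, mic_feats):
    """Mean/standard deviation over the full training set, reserved for the enhancement stage
    (global scalars here; per-frequency-bin statistics would fit the description equally well)."""
    Xs, Ys = np.hstack(sensor_feats), np.hstack(mic_feats)
    return Xs.mean(), Xs.std(), Ys.mean(), Ys.std()

def enhance(sensor_signal, network, stats):
    """Enhancement stage: normalize, map X -> Y_hat with the trained network F,
    denormalize, undo the log, and reconstruct the waveform by iSTFT."""
    mu_s, sigma_s, mu_m, sigma_m = stats
    X, phase = extract_features(sensor_signal)
    Y_hat = network((X - mu_s) / sigma_s)          # trained mapping F (features in, features out)
    mag_hat = np.exp(Y_hat * sigma_m + mu_m)       # denormalization + exponential transformation
    # Reuse the sensor phase, in line with the premise that phase has little perceptual effect.
    return librosa.istft(mag_hat * np.exp(1j * phase), hop_length=HOP)
```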
Overall framework of the proposed algorithms based on neural networks.
Four kinds of neural networks have been introduced and compared to enhance the speech signal. These neural networks have different architectures, but their training processes are identical. The training set is composed of N pairs of features extracted from the sensor and microphone signals, {(Xn, Yn), n = 1, 2, …, N}.
The aim of the neural network is to find the mapping function F that projects X ∈ ℝ^{D×T} to Y ∈ ℝ^{D×T}, where D and T are the sizes of the frequency dimension and the time dimension, respectively. For this purpose, the mean-square error (MSE) loss function J between Ŷ and Y over a training batch is defined as

J = (1/B) Σ_{b=1}^{B} ‖F(Xb) − Yb‖²,

where B is the batch size. A backpropagation algorithm is used to optimize the neural networks. As J is continually minimized, the difference between the enhanced sensor features Ŷ ≡ F(X) and the corresponding microphone features Y in the training set continues to decrease globally.
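A minimal PyTorch-style sketch of this optimization step is given below; the framework, the (batch, time, frequency) tensor layout, and the helper names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, optimizer):
    """One pass over the training set, minimizing the MSE loss J by backpropagation."""
    criterion = nn.MSELoss()
    for X, Y in loader:                # X, Y: normalized log-magnitude features, shape (B, T, D)
        optimizer.zero_grad()
        Y_hat = model(X)               # Y_hat = F(X)
        loss = criterion(Y_hat, Y)     # loss J between enhanced and microphone features
        loss.backward()                # backpropagation
        optimizer.step()
```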
Different kinds of network architecture are shown in Fig. 3. As the high-frequency components of the sensor signal need to be deduced from a limited set of low-frequency components, the contextual information of the speech is essential for effective inference. Therefore, an LSTM, a BLSTM, and a CRNN have been chosen because of their ability to model contextual relationships. In addition, an FCNN, which can utilize context by concatenating the current feature and its preceding features into a single higher-dimensional input feature, has also been adopted.
Structures of the different neural networks: (a) FCNN model; (b) LSTM cell; (c) LSTM model; (d) BLSTM model; (e) CRNN model.
Figure 3(a) shows the structure of the FCNN. Let x⟨t⟩ ∈ ℝ^{D×1} be the tth time step of an input feature X. The input of the FCNN for the tth time step, i⟨t⟩, is the concatenation of x⟨t⟩ and its M − 1 preceding time steps:

i⟨t⟩ = [x⟨t−M+1⟩; x⟨t−M+2⟩; …; x⟨t⟩] ∈ ℝ^{MD×1}.
Each input feature i⟨t⟩ is processed by the FCNN, which contains three hidden linear layers and an output layer, to obtain the corresponding output feature ŷ⟨t⟩. In the nth layer, the feature is multiplied by a matrix Wn, and then a bias bn is added; for the three hidden layers, the result is also passed through an activation function E. This procedure can be expressed as follows:

hn = E(Wn hn−1 + bn), n = 1, 2, 3, with h0 = i⟨t⟩, and ŷ⟨t⟩ = W4 h3 + b4,

where hn ∈ ℝ^{H×1} is the output of the nth hidden layer and H is the number of hidden-layer units. The activation function E used in the FCNN is an exponential linear unit (ELU) and can be expressed as

E(x) = x for x > 0 and E(x) = α(e^x − 1) for x ≤ 0,

where α is a constant coefficient larger than 0.
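For concreteness, a PyTorch-style sketch of this FCNN is given below, using the sizes reported in Sec. III A (D = 129, M = 15, H = 1024); the framework choice is an assumption, and the concatenation of context frames into i⟨t⟩ is assumed to be done before the data reach the model.

```python
import torch
import torch.nn as nn

class FCNN(nn.Module):
    """Three ELU hidden layers and a linear output layer, as in Fig. 3(a).
    Sizes follow Sec. III A: D = 129, M = 15 context frames, H = 1024."""
    def __init__(self, dim=129, context=15, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * context, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, dim),            # output layer, no activation
        )

    def forward(self, i_t):                    # i_t: concatenated frames, shape (B, M*D)
        return self.net(i_t)
```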
Another neural network used for signal enhancement is the LSTM, which is a modified recurrent neural network (RNN) using an LSTM cell as the recurrent component. The output of an RNN layer depends both on the current input and on past information. The LSTM cell also helps to solve the vanishing-gradient problem of an RNN through its sophisticated gates, which control the flow of information. Three gates are used in the LSTM cell [Fig. 3(b)]:

i⟨t⟩ = σ(Wi[h⟨t−1⟩, x⟨t⟩] + bi),
f⟨t⟩ = σ(Wf[h⟨t−1⟩, x⟨t⟩] + bf),
o⟨t⟩ = σ(Wo[h⟨t−1⟩, x⟨t⟩] + bo),
c⟨t⟩ = f⟨t⟩ ⊗ c⟨t−1⟩ + i⟨t⟩ ⊗ tanh(Wc[h⟨t−1⟩, x⟨t⟩] + bc),
h⟨t⟩ = o⟨t⟩ ⊗ tanh(c⟨t⟩),

where ⊗ denotes the elementwise product, i⟨t⟩, f⟨t⟩, and o⟨t⟩ are the input gate, forget gate, and output gate of the LSTM cell, respectively, x⟨t⟩, h⟨t⟩, and c⟨t⟩ are the input feature, output feature, and cell state of the current time step, respectively, h⟨t−1⟩ and c⟨t−1⟩ are the output feature and the cell state of the last time step, and the W and b terms are learnable weight matrices and bias vectors. The activation function σ used for the gates in the LSTM cell is a sigmoid function, which can be expressed as σ(x) = 1/(1 + e^{−x}).
Figure 3(c) shows the structure of the LSTM. The sensor feature of each time step, x⟨t⟩ ∈ ℝ^D, is processed by the four layers of the LSTM in sequence. A linear network layer is then applied to obtain the output feature ŷ⟨t⟩ ∈ ℝ^D.
The BLSTM possesses recurrent connections in both the forward and backward directions [Fig. 3(d)]. The input feature x⟨t⟩ ∈ ℝ^D is separately processed by a forward LSTM cell and a backward LSTM cell, which receive cell states from the LSTM cells of the previous and the next time step, respectively. After processing by the four-layer BLSTM neural network, the outputs of the two LSTM cells are concatenated and processed by a linear network layer to obtain the output feature ŷ⟨t⟩ ∈ ℝ^D.
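A compact PyTorch-style sketch covering both recurrent models of Figs. 3(c) and 3(d) is shown below; stacking the four layers in a single nn.LSTM module is an implementation choice, and the sizes follow Sec. III A (D = 129, hidden size 512).

```python
import torch
import torch.nn as nn

class RecurrentEnhancer(nn.Module):
    """Four-layer (B)LSTM followed by a linear output layer, as in Figs. 3(c) and 3(d)."""
    def __init__(self, dim=129, hidden=512, bidirectional=False):
        super().__init__()
        self.rnn = nn.LSTM(dim, hidden, num_layers=4,
                           batch_first=True, bidirectional=bidirectional)
        out_dim = hidden * (2 if bidirectional else 1)   # forward/backward outputs are concatenated
        self.linear = nn.Linear(out_dim, dim)

    def forward(self, x):                 # x: feature sequence, shape (B, T, D)
        h, _ = self.rnn(x)
        return self.linear(h)

# lstm  = RecurrentEnhancer()                       # LSTM of Fig. 3(c)
# blstm = RecurrentEnhancer(bidirectional=True)     # BLSTM of Fig. 3(d)
```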
The CRNN is a combination of an RNN and a convolutional neural network (CNN) [Fig. 3(e)]. The CNN has been widely used in image processing algorithms, since it can process features along two different dimensions at the same time. As the spectrograms of speech signals can be seen as special types of images (Fig. 2), the CNN can exploit local structures in both the frequency and time domains. Therefore, the CRNN has the great advantage of modeling both contextual and time–frequency relationships. The CRNN is constructed with three components: a convolutional encoder, an LSTM network, and a convolutional decoder. The encoder extracts the high-level features, and the decoder rebuilds the signal from the high-level features enhanced by the LSTM network. Before calculation, a new depth dimension must be added to the input feature X ∈ ℝ^{D×T} to give X ∈ ℝ^{1×D×T} as the input of the first layer. The encoder contains four convolution layers. The operation in each layer is the convolution of the input tensor x with a collection of convolutional kernels z, followed by an activation function E, which is the ELU defined above:

h = E(x ∗ z),
where h is the output of one hidden layer and ∗ denotes the convolution operation. The size of the kernel is always less than that of the input tensor. The convolution operation takes the elementwise product between the kernel and the input tensor at each location and sums the result to obtain the value of the output tensor at the corresponding position. The distance between two successive kernel positions is called the stride. This process is conducted repeatedly for all kernels, which can be considered as different feature extractors in the kernel set. The number of kernels in the kernel set determines the depth of the output tensor. Here, the kernel size is 3 × 2 (frequency × time), and the stride is 2 along the frequency dimension and 1 along the time dimension. Zero padding with a size of 1 along the time dimension is applied in each layer, adding one row of zeros at the beginning and one at the end of the input tensor. With this parameter setting, the frequency dimension size D halves layer by layer, while the time dimension size T remains constant. The number of kernels for the encoder layers is gradually increased from 16 to 128. Before being fed to the LSTM, the output feature of the encoder is flattened along the frequency and depth dimensions to produce a two-dimensional feature sequence. The LSTM contains two LSTM network layers, and the output of the LSTM is then reshaped back to fit the input of the decoder. The decoder also contains four convolution layers, and the kernel size and the stride are set to the same values as in the encoder. The number of kernels for the decoder layers is gradually decreased and kept symmetric with the encoder. To ameliorate the vanishing-gradient problem, skip connections, which concatenate the input of each decoder layer and the output of its corresponding encoder layer along the depth dimension, are used:

h = E([x, h′] ∗ z),
where [x, h′] denotes the concatenation of the decoder layer input x and h′ along the depth dimension, and h′ is the output of the corresponding encoder layer. The size of each feature matrix is shown in Fig. 3(e).
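To make the encoder–LSTM–decoder structure and the skip connections concrete, a PyTorch-style sketch is given below. The exact padding scheme, the use of transposed convolutions in the decoder, and the linear projection that restores the flattened LSTM output to the encoder shape are assumptions introduced so that the tensor sizes are consistent; they are not specified in the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CRNN(nn.Module):
    """Convolutional encoder -> two-layer LSTM -> convolutional decoder with skip
    connections [Fig. 3(e)]: kernel 3x2 (frequency x time), stride 2 along frequency,
    encoder channels 16-128, input features of size D = 81."""

    def __init__(self, channels=(1, 16, 32, 64, 128), lstm_hidden=512, enc_freq=6):
        super().__init__()
        self.enc = nn.ModuleList(
            nn.Conv2d(channels[i], channels[i + 1], kernel_size=(3, 2), stride=(2, 1))
            for i in range(4))
        flat = channels[-1] * enc_freq                   # depth x frequency after the encoder (D = 81 -> 6)
        self.lstm = nn.LSTM(flat, lstm_hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(lstm_hidden, flat)         # assumed projection back to the encoder shape
        self.dec = nn.ModuleList(
            nn.ConvTranspose2d(2 * channels[i + 1], channels[i], kernel_size=(3, 2),
                               stride=(2, 1), padding=(1, 0))
            for i in reversed(range(4)))

    def forward(self, x):                                # x: (batch, 1, D, T) log-magnitude features
        skips = []
        for conv in self.enc:                            # encoder: D roughly halves, T is preserved
            x = F.pad(x, (1, 0, 1, 1))                   # zero padding along time and frequency
            x = F.elu(conv(x))
            skips.append(x)
        b, c, d, t = x.shape
        h = x.permute(0, 3, 1, 2).reshape(b, t, c * d)   # flatten depth and frequency per time step
        h, _ = self.lstm(h)
        x = self.proj(h).reshape(b, t, c, d).permute(0, 2, 3, 1)
        for i, (deconv, skip) in enumerate(zip(self.dec, reversed(skips))):
            x = torch.cat([x, skip], dim=1)              # skip connection along the depth dimension
            x = deconv(x)[..., 1:]                       # trim the extra time frame introduced by the kernel
            if i < 3:
                x = F.elu(x)                             # leaving the final layer linear is an assumption
        return x                                         # (batch, 1, D, T) enhanced features
```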
III. RESULTS AND DISCUSSION
A. Dataset and setup
Since the loss of information from the sensor speech signal is closely related to the characteristics of the speaker, the enhancement of sensor speech signals is actually a speaker-dependent task. Therefore, speaker-dependent models for individual speakers were constructed. All data used in the study were collected from six volunteers: three males and three females. The volunteers were asked to read short sentences in an acoustically isolated room, while the flexible sensor and the air-conduction microphone recorded simultaneously. The sentences were selected from the AISHELL-ASR0009-OS1 open-source Mandarin speech corpus, and the duration of signal collection for each volunteer was around 45 min. In the experiments, 95% of the signal sets were used for training and 5% for testing, and it was ensured that no sentences were shared between the training set and the test set. In the training stage, 10% of the signal sets in the training set were used for validation.
In the feature extraction stage, a 129-dimensional STFT magnitude was used for the FCNN, LSTM, and BLSTM, while an 81-dimensional STFT magnitude was used for the CRNN for computational efficiency. For the FCNN, the feature window M was set to 15, and the hidden size of the linear network layers was 1024. For the LSTM and BLSTM, the hidden sizes of the linear network layers and LSTM cells were 512. For the CRNN, the numbers of kernels in the four layers of the encoder and decoder were set to 16, 32, 64, and 128, respectively, and the hidden size of the LSTM cells was set to 512. All the networks were trained with the MSE loss function J and a batch size B of 128. Root mean square propagation (RMSProp) was chosen as the optimizer. The number of training epochs was 200. The initial learning rate was set to 0.01 and was halved whenever the loss J on the validation set did not decrease.
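These settings can be wired together as in the sketch below; ReduceLROnPlateau is an assumed implementation of the halving rule, train_epoch is the step sketched in Sec. II B, and model, the data loaders, and the validation helper are placeholders rather than parts of the original code.

```python
import torch

optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)   # RMSProp, initial lr 0.01
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.5)

for epoch in range(200):                                       # 200 training epochs
    train_epoch(model, train_loader, optimizer)                # one pass with batch size B = 128
    val_loss = validation_loss(model, val_loader)              # hypothetical validation helper
    scheduler.step(val_loss)                                   # halve lr when J on the validation set stalls
```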
The LSD and STOI were used to evaluate the speech quality of the enhanced signal: the LSD is the log-spectral distance between the enhanced signal and the target microphone signal, while the STOI quantifies speech intelligibility. The value of STOI ranges from 0 to 1, with a higher value indicating higher intelligibility. To evaluate the deployment performance, the model size and the time consumption were measured. The model size is measured by the number of parameters in the neural network: the smaller this number, the less memory is required for the model to run. The time consumption is evaluated by the ratio of the enhancement time to the length of the signal. For example, the signal in Fig. 4 is 2.5 s long; if 18.86 s is needed to enhance this signal with the BLSTM on the Raspberry Pi 4, the ratio is 18.86/2.5 = 7.54, whereas if 1.25 ms is needed with the FCNN on the GeForce RTX 2080 Ti, the ratio is 0.00125/2.5 = 0.0005. If the ratio is larger than 1 for a given approach, the time taken to enhance a period of speech is longer than the signal itself, and real-time enhancement cannot be realized with that approach; if the ratio is less than 1, real-time enhancement is achievable.
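The two deployment metrics can be computed as in the sketch below; the 8 kHz rate and 128-sample hop used to derive the number of frames are assumptions, and the dummy input layout matches the recurrent models (the FCNN and CRNN would need their own input shapes).

```python
import time
import torch

def deployment_metrics(model, signal_length_s=2.5, dim=129):
    """Parameter count and the ratio of enhancement time to signal length
    (a ratio below 1 means real-time enhancement is achievable)."""
    n_params = sum(p.numel() for p in model.parameters())
    n_frames = int(signal_length_s * 8000 / 128)       # assumed 8 kHz rate, 128-sample hop
    x = torch.randn(1, n_frames, dim)                  # dummy feature sequence, shape (B, T, D)
    start = time.perf_counter()
    with torch.no_grad():
        model(x)
    ratio = (time.perf_counter() - start) / signal_length_s
    return n_params, ratio
```

A ratio below 1 on a given device corresponds to real-time capability, which is the criterion applied in Fig. 5(d).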
(a) and (b) Waveforms of a signal captured from a male speaker by the sensor and microphone, respectively. (c) and (d) Spectrograms of the signals from the sensor and microphone, respectively. (e)–(h) Spectrograms of the speech enhanced by the FCNN, LSTM, BLSTM, and CRNN, respectively.
B. Results and analysis
The effectiveness of the FCNN, LSTM, BLSTM, and CRNN was first evaluated. The waveforms and spectrograms of the raw sensor signal and the air-conduction signal captured by the microphone, together with the spectrograms of the results enhanced by the four networks, are shown in Fig. 4. The time-domain signals captured by the sensor and the microphone show a significant difference in amplitude [Figs. 4(a) and 4(b)]. The low-frequency components from 0 to 1 kHz captured by the sensor and the microphone are similar to each other [Figs. 4(c) and 4(d)]. However, the high-frequency components measured by the flexible sensor are subject to severe loss, suggesting that the quality of speech restored directly from the sensor data may not be satisfactory. Weak harmonic patterns can be seen in the high-frequency components of the sensor signal, indicating that the sensor can occasionally capture high-frequency content with large attenuation. As shown in Figs. 4(e)–4(h), the sensor signal is improved significantly after being processed by the four models.
As can be seen from Fig. 5(a), the LSD values of the signals after enhancement by the FCNN, LSTM, and BLSTM are greatly reduced compared with those of the raw sensor signals: the average LSD values for the FCNN, LSTM, and BLSTM are reduced by 31%, 32%, and 35%, respectively. The CRNN performs better when processing the speech from the male speakers than that from the female speakers. This may be attributed to the concentration of male speech in the middle and low frequencies as compared with female speakers, whose speech contains rich high-frequency components that are more subject to attenuation by the sensor and the skin. However, from the perspective of STOI, the conclusion is slightly different. As can be seen from Fig. 5(b), except for male 3, the STOI values of the raw sensor signals from the different speakers are all ∼0.6, which means that the original intelligibility of the sensor signals is very low. We also evaluated the performance on our sensor signals of the GMM, which was one of the earliest algorithms used for signal enhancement; the STOI value was improved by only 0.04, which means that the GMM is not suitable for enhancing the sensor signals. The enhancement achieved by the BLSTM is the most pronounced among the four networks, and the STOI value of the signal enhanced by the BLSTM is more than 0.8, indicating relatively high intelligibility of the enhanced results. This also shows that the ability of the BLSTM to perform bidirectional inference helps to deduce the lost high-frequency components. Meanwhile, the STOI value of the signal enhanced by the CRNN is above 0.75, which is higher than those for the FCNN and LSTM except for female 3. This may result from the ability of the convolutional structure in the CRNN to extract local structures. From the blue dashed boxes in Fig. 4, it can be seen that the spectral structures generated by the CRNN are more similar to the target spectrum than those from the other networks. However, the CRNN spectrum has dark blue parts (where the energy is almost 0), and the weak harmonic structures in the red dashed boxes also disappear. In other words, the CRNN may not be able to learn the time–frequency structure in low-energy regions and is more inclined to set these parts to 0; the same applies to regions containing white noise. This may also be why the CRNN does not perform well in terms of the LSD values. In our previous study, the BLSTM was used to enhance the speech signal collected by a throat microphone, and the LSD value was reduced by 34% on average.34 Here, the LSD value is reduced by 35% when the BLSTM is used on our sensor, indicating that the BLSTM achieves a slightly better enhancement result on the flexible sensor than on the throat microphone.
Evaluation results for speech quality and deployment performance: (a) LSD; (b) STOI; (c) parameter amount; (d) time consumption ratio.
To evaluate the deployment performance of the four neural networks, the number of parameters and the time consumption ratio were analyzed. From Fig. 5(c), it can be seen that the BLSTM contains 21.66 million parameters, the largest number among the four algorithms. Although the BLSTM has the same number of layers as the LSTM, it has 2.8 times as many parameters. The FCNN and CRNN, which are relatively small, each contain around four million parameters; therefore, the FCNN and CRNN consume less memory than the BLSTM and LSTM. Figure 5(d) shows the time consumption ratios of the four neural networks on the four kinds of devices. In general, the computing power of the Raspberry Pi 4 is low, and so it needs more time for signal enhancement. The Jetson Nano, which can be accelerated by its GPU, has higher computing power and consumes only about 10% of the time taken by the Raspberry Pi 4 when running all four kinds of networks. The time consumption of the BLSTM is significantly higher than those of the other neural networks on all devices. In addition, the BLSTM has the largest theoretical delay for real-time enhancement, since its structure requires future information; therefore, the BLSTM is suitable only for applications in which real-time performance is not required. The time consumptions of the FCNN and CRNN are relatively low on all devices. Specifically, the CRNN performs better on the embedded devices, while the FCNN performs better on the cloud servers, which may be due to the advantage of servers in computing large matrix multiplications. It is worth noting that the CRNN is the only network with a time consumption ratio of less than 1 on all four devices, which means that only the CRNN can realize real-time enhancement on every device. Considering that the signal enhanced by the CRNN also has a higher STOI value than that enhanced by the FCNN, the CRNN is worthy of further investigation in the quest to develop a complete mobile and flexible speech processing system.
IV. CONCLUSIONS
We have studied the performance of four signal enhancement algorithms for speech signals captured by flexible vibration sensors. From the perspective of model size and time consumption, the FCNN and CRNN are better than the LSTM and BLSTM. From the perspective of enhancement quality, the BLSTM is the best of these networks. The experimental results indicate that the CRNN approach may have potential for migration to embedded systems to achieve fully flexible and mobile speech processing systems.
The signal enhancement algorithms for flexible sensors are still trained for individual speakers, because the relationship between the sensor signal and its corresponding air-conduction signal depends strongly on the speaker. Training a generalized model on data from a large number of speakers is a possible solution, but the cost of creating a sufficient dataset would be very high. Combining transfer learning and speaker adaptation techniques to make full use of large amounts of public data and improve the modeling performance for a specific speaker could be a promising way to overcome the problem of speaker dependence. The complexity of the current signal enhancement models is another issue. Future studies may consider accelerating and optimizing the algorithms through speech coding and neural network compression techniques.
ACKNOWLEDGMENTS
This work was supported in part by the Key Research and Development Program of Zhejiang Province, China (Grant No. 2021C05005), the National Natural Science Foundation of China (Grant No. 81771880), and the Tianjin Municipal Government of China (Grant No. 19JCQNJC12800).
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
S.G. and C.Z. contributed equally to this work.
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.
REFERENCES
Shenghan Gao is in the Laboratory of Biomedical Flexible Electronics, School of Precision Instruments and Optoelectronics Engineering, Tianjin University. He is studying for a Master’s degree. His research interest is speech enhancement algorithms.
Changyan Zheng is an Associate Professor at the High-Tech Institute, Qingzhou. Her research interests include speech processing, intelligent information processing, and deep learning.
Yicong Zhao, Ph.D., is in the Laboratory of Biomedical Flexible Electronics, School of Precision Instruments and Optoelectronics Engineering, Tianjin University. Her research interest is multifunctional sensing electronics.
Ziyue Wu is in the Laboratory of Biomedical Flexible Electronics, School of Precision Instruments and Optoelectronics Engineering, Tianjin University. He is studying for the Ph.D. degree. His research interest is biomedical epidermal sensing electronics.
Jiao Li is an Associate Professor in the Biomedical Photonic Imaging Laboratory, School of Precision Instruments and Optoelectronics Engineering, Tianjin University. Her research interests include biomedical photonic imaging and biomedical measurement systems.
Xian Huang is a Professor in the Laboratory of Biomedical Flexible Electronics, School of Precision Instruments and Optoelectronics Engineering, Tianjin University. His research interests include flexible electronic technology and biomedical flexible electronic systems.