Machine learning-enabled auscultation diagnosis can provide promising solutions, especially for prescreening purposes. The bottleneck for its potential success is that high-quality training datasets remain scarce. In this work, an open auscultation dataset consisting of samples and annotations from both patients and healthy individuals is established for machine learning studies of respiratory diagnosis, which is of both scientific importance and practical potential. A machine learning approach is examined to showcase the use of this new dataset for classifying lung sounds associated with different diseases. The dataset is openly available to the public online.

Machine learning-enabled/assisted medical diagnosis of respiratory diseases has recently become a popular topic in the acoustic science (Lynch and Church, 2023) and breath research (Xue et al., 2023; Zhang et al., 2023) communities in the aftermath of the COVID-19 pandemic. In particular, mandatory personal protective equipment prevents physicians from performing routine auscultation with a stethoscope to diagnose lung diseases. As a novel alternative, digital stethoscopes (Abella et al., 1992; Emokpae et al., 2022) can record and remotely transfer the sensed lung sounds of patients. A follow-up machine learning classifier can then help to identify potential pathological events in abnormal sounds (Kevat et al., 2020). The corresponding digital acquisition and machine learning techniques have undergone radical development in the past decade. In this Letter, we report our recent progress in acoustics-based breath diagnosis studies using up-to-date acquisition equipment and machine learning techniques.

As a prescreening tool, acoustics-based breath diagnosis mainly comes in two types. The first type utilizes straightforward sensing of breath and cough sounds. Bokov et al. (2016) studied a method based on 95 voice recordings from 57 patients in France. Shimon et al. (2021) applied a convolutional neural network (CNN) to classify voice samples from 57 patients in Tel Aviv, Israel. Xue et al. (2023) acquired breath samples from 65 patients and 57 healthy individuals in Shanghai, China, and developed a prescreening algorithm. Similarly, Vahedian-Azimi et al. (2021) examined six different artificial intelligence methods on a relatively large dataset with 203 COVID-19 patients and 171 healthy individuals in Tehran, Iran. However, most of those datasets are not available to the public.

The other type of prescreening research utilizes conventional auscultation that directly senses internal lung sounds, for which the best-known dataset is from the International Conference on Biomedical and Health Informatics (ICBHI) (Rocha et al., 2019); it contains lung sound samples from 126 patients, mainly in Portugal. Testing equipment such as the Welch Allyn stethoscope (model 5079-400), the 3M Littmann stethoscope (Classic II SE), and air-coupled electret microphones was used to prepare this ICBHI dataset. Later, Kevat et al. (2020) focused on pediatric cases and studied a machine learning method based on a dataset from 25 participants at Monash Children's Hospital, Australia; two digital stethoscopes, the Clinicloud DS and the 3M Littmann 3200, were used in that work. Most recently, Alqudah et al. (2022) from Jordan enlarged the training dataset by adding 70 patients and 35 healthy individuals to ICBHI and examined the effectiveness of machine learning based on a CNN-LSTM. Nevertheless, those datasets, except for ICBHI, are still not open to the public.

The above research shows that both types of diagnosis research can share similar machine learning techniques, which have progressed over the past decade from support vector machines (Icer and Gengec, 2014) to CNNs (Sfayyih et al., 2023) and, most recently, Transformer-CNNs (Bae et al., 2023). The current bottleneck is the scarcity of training data available to the research community. In this Letter, we show that the specificity and sensitivity of a machine learning-based diagnosis approach can be improved considerably by including only a small number of new but balanced samples. To address the data scarcity, we have prepared a new open dataset for machine learning studies and disclose the associated sampling/preparation procedure using recent commercial off-the-shelf testing equipment, which should help interested readers to utilize the dataset, prepare their own datasets, and conduct machine learning-based diagnosis studies.

As mentioned above, the ICBHI dataset was acquired from 126 patients using four different types of sensing equipment, along with two sets of annotations, where the first contains 6898 samples and the other contains 10 775 events (Rocha et al., 2019). Each sample is one respiratory cycle that has been annotated by respiratory physiotherapists and pulmonologists as a normal, crackle, or wheeze event. Figure 1 shows that the distribution of the ICBHI dataset is biased in terms of patients' ages and diseases. To address this issue, we acquired additional data from individuals of various ages and diseases and established the Peking University (PKU) respiratory dataset, which currently contains 11 968 respiratory cycles, almost 150% of the size of the ICBHI dataset.

Fig. 1.

The distribution of the ICBHI and PKU datasets in terms of age and disease.


For the preparation of the PKU dataset, we adopted the same common practice as in the former references (Kevat et al., 2020; Rocha et al., 2019), which consists of data collection, annotation, preprocessing, data augmentation, and feature extraction. We have recruited 40 individuals since 2020, of whom 20 were healthy and 20 were inpatients with pulmonary diseases. Each time, a Ph.D. student and an experienced doctor together recorded lung sounds using two digital stethoscopes, which can save audio files to a smartphone or computer. An Exagiga Elictric ETZ-1A (sampling rate fs = 44 kHz) was used for healthy individuals, while a 3M Littmann 3200 (fs = 4 kHz) was used for patients with diseases. We declare that all patients signed an informed consent form before their lung sounds were collected. As shown below, the classification performance of our machine learning approach is already considerably improved, although the number of new individuals in the PKU dataset is smaller than that of the ICBHI dataset. One possible explanation is the more balanced distribution of ages and diseases in the combination of the ICBHI and PKU datasets. As suggested by Fig. 1(a), the variance of the individuals' ages drops from 32.08 (ICBHI alone) to 30.74 (ICBHI + PKU).

The annotations were conducted by two medical doctors from the Department of Infectious Diseases at Peking University Third Hospital. As shown in Fig. 1(b), patients with diseases new to the dataset, such as COVID, pulmonary abscess, and acute hypoxic respiratory failure (AHRF), are included in the PKU dataset, in addition to the upper respiratory tract infection (URTI), lower respiratory tract infection (LRTI), and chronic obstructive pulmonary disease (COPD) cases already contained in the original ICBHI dataset. Clinically, doctors diagnose pulmonary diseases by recognizing specific respiratory sounds such as crackles and wheezes. Machine learning approaches can classify these respiratory sounds by learning and identifying the corresponding acoustic features (Pasterkamp et al., 1997). To this end, the two doctors in our group annotated lung sounds as normal, crackle, or wheeze, which enables the follow-up machine learning research. Meanwhile, the two doctors marked the beginning and ending times of the recorded respiratory cycles and flagged audio cycles of poor quality, which are excluded from the machine learning input.

It is worth noting that the audio files in our dataset were collected at two different sampling rates. Hence, the audio files are resampled to a common 10 kHz before being fed to the machine learning. Next, all samples pass through the following preprocessing pipeline: (1) a fourth-order IIR Butterworth bandpass filter with a passband between 150 and 2000 Hz admits lung sounds within their typical frequency range; (2) a first-order Butterworth high-pass filter with a 7.5 Hz cutoff removes the DC offset and low-frequency noise, mainly from the electric supply; and (3) an eighth-order Butterworth low-pass filter at 2.5 kHz removes high-frequency pollution. Finally, the amplitudes of all audio inputs, whether from the PKU dataset or the ICBHI dataset, are normalized to the same range.
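The resampling, three-stage filtering, and normalization steps above can be sketched as follows; the function name and the use of SciPy are our illustrative choices, while the filter orders and cutoffs follow the text.

```python
import numpy as np
from scipy import signal

TARGET_FS = 10_000  # Hz, the common rate after resampling


def preprocess(audio, fs):
    """Illustrative sketch of the Letter's preprocessing pipeline."""
    # Resample to the common 10 kHz rate.
    audio = signal.resample(audio, int(len(audio) * TARGET_FS / fs))

    nyq = TARGET_FS / 2
    # (1) 4th-order Butterworth bandpass, 150-2000 Hz.
    b, a = signal.butter(4, [150 / nyq, 2000 / nyq], btype="band")
    audio = signal.lfilter(b, a, audio)
    # (2) 1st-order high-pass at 7.5 Hz to remove DC and low-frequency noise.
    b, a = signal.butter(1, 7.5 / nyq, btype="high")
    audio = signal.lfilter(b, a, audio)
    # (3) 8th-order low-pass at 2.5 kHz to remove high-frequency pollution.
    b, a = signal.butter(8, 2500 / nyq, btype="low")
    audio = signal.lfilter(b, a, audio)

    # Normalize the amplitude to [-1, 1].
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio
```

With a 44 kHz input, one second of audio comes out as 10 000 normalized samples with the DC component removed.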

In addition, each of our audio files usually contains multiple respiratory cycles and is therefore cropped into separate samples, where the duration of each sample is empirically set to 6 s, chosen because it is close to one respiratory cycle. Conversely, we pad each respiratory-cycle audio whose duration is less than 6 s. Such a cut-pad operation produces input samples of the same size, as required for CNN training. More details of the different cut-pad strategies and comparisons of their effectiveness can be found in Nguyen and Pernkopf (2022).
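A minimal sketch of this cut-pad step follows; the helper name and the zero-padding choice are ours, and other strategies are compared in Nguyen and Pernkopf (2022).

```python
import numpy as np


def cut_pad(cycle, fs=10_000, duration=6.0):
    """Crop or zero-pad a respiratory-cycle recording to a fixed 6 s length."""
    target = int(fs * duration)
    if len(cycle) >= target:
        return cycle[:target]                      # cut: keep the first 6 s
    return np.pad(cycle, (0, target - len(cycle)))  # pad with trailing zeros
```

Every output then has exactly 60 000 samples at 10 kHz, the fixed input size the CNN expects.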

It is well known that training deep learning models requires a large and comprehensive dataset. In contrast, the distribution of individuals in the ICBHI dataset is unbalanced and the number of individuals in the PKU dataset is small. We adopt two approaches to address this issue. First, we apply classical data augmentation techniques to the currently available data to obtain sufficient training samples. For example, we use time shifting to increase the number of samples without affecting the inherent respiratory sound cycles. After this process, we finally obtained 11 968 samples from the current PKU dataset (cf. 6898 samples in the ICBHI dataset). Second, we are endeavoring to recruit more individuals to enrich the PKU dataset toward our goal of 100 individuals. This part, especially the associated annotation, is time-consuming and still ongoing. Nevertheless, as shown below, the performance of the machine learning-based diagnosis method is already improved considerably by including the current PKU dataset in the training, which underlines the importance of a balanced data distribution for machine learning tasks.
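Time-shift augmentation can be sketched as a circular shift of the fixed-length waveform; the function name, the ±1 s shift range, and the fixed seed are illustrative assumptions, not the Letter's exact settings.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility


def time_shift(sample, fs=10_000, max_shift_s=1.0):
    """Circularly shift a fixed-length sample in time.

    The respiratory events themselves are unchanged; only their position
    within the 6 s window moves, yielding an extra training sample.
    """
    max_shift = int(max_shift_s * fs)
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(sample, shift)
```

Because the shift is circular, the sample length and total energy are preserved exactly.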

Last but not least, past research (Sengupta et al., 2016) has reported that statistical features of mel-frequency cepstral coefficients (MFCC) perform better in machine learning than other features, such as Fourier- or wavelet-based features. The MFCC method has been widely used in human speech recognition, which is analogous to the annotation process performed by doctors. Hence, we applied MFCC to extract preliminary features of the audio samples by means of the Python package librosa.
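In librosa this is a single call, e.g., `librosa.feature.mfcc(y=audio, sr=10000)`. The computation behind it can be sketched in plain NumPy as a Hann-windowed power spectrum, a triangular mel filterbank, log compression, and a discrete cosine transform; the frame, band, and coefficient counts below are common illustrative defaults, not the settings used in this Letter.

```python
import numpy as np
from scipy.fftpack import dct


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mfcc(audio, fs=10_000, n_fft=512, hop=256, n_mels=26, n_mfcc=13):
    """Simplified MFCC: power spectrum -> mel filterbank -> log -> DCT."""
    # Frame the signal with a Hann window and take the power spectrum.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Triangular mel filterbank between 0 Hz and the Nyquist frequency.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log mel energies, then DCT gives the cepstral coefficients.
    mel_energy = np.log(power @ fbank.T + 1e-10)
    return dct(mel_energy, type=2, axis=1, norm="ortho")[:, :n_mfcc]
```

The result is a two-dimensional (frames × coefficients) feature map, which is the form of input fed to the CNN below.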

Figure 2 shows the CNN architecture used in this work for the classification of lung sounds. The network input is the extracted two-dimensional MFCC features. The number and type of convolutions are denoted at the bottom of each layer, where "conv" represents a convolutional layer, "fc" represents a fully connected layer, and a softmax layer is applied as the final output layer to perform the disease classification. The activation function used in the middle layers is the rectified linear unit (ReLU), along with max-pooling operations to prevent overfitting. The dropout ratio is set to 0.5 in the dense layers. The number of overall trainable parameters is around 96 × 10⁶. The loss function is the classical cross-entropy function. The stochastic gradient descent optimizer with an initial learning rate of 0.01 is adopted in the training, and the learning rate is further controlled by external callbacks (through the ReduceLROnPlateau function). The setup and choice of these hyperparameters are essentially the same as in a common machine learning-based classification task. The neural network is implemented and trained with Keras, a simplified high-level abstraction of the powerful but complicated TensorFlow. The trainings are conducted on a desktop computer with a 3.7 GHz Intel i7 CPU, 64 GB memory, and one Nvidia GeForce RTX 4070 GPU.
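The quoted parameter count can be sanity-checked layer by layer with the standard formulas: a convolution contributes kh·kw·cin·cout weights plus cout biases, and a dense layer nin·nout weights plus nout biases. The layer shapes below are hypothetical placeholders (Fig. 2 is not reproduced here); they only illustrate that counts on the order of 10⁷-10⁸ arise mainly from the first dense layer after flattening.

```python
def conv2d_params(kh, kw, c_in, c_out):
    """Trainable parameters of a 2D convolution: weights + one bias per filter."""
    return kh * kw * c_in * c_out + c_out


def dense_params(n_in, n_out):
    """Trainable parameters of a fully connected layer: weight matrix + biases."""
    return n_in * n_out + n_out


# Hypothetical stack (NOT the exact Fig. 2 architecture): three 3x3 conv
# layers, then two dense layers on a flattened 20x7x128 feature map.
total = (conv2d_params(3, 3, 1, 32)
         + conv2d_params(3, 3, 32, 64)
         + conv2d_params(3, 3, 64, 128)
         + dense_params(20 * 7 * 128, 4096)
         + dense_params(4096, 4096)
         + dense_params(4096, 3))  # 3 output classes: normal/crackle/wheeze
```

Even this modest hypothetical stack lands around 9 × 10⁷ parameters, the same order as the ~96 × 10⁶ reported for the actual network.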

Fig. 2.

The CNN architecture used in this work.


Moreover, the ICBHI and PKU datasets are randomly separated into training sets (70%-80% of the whole data) and validation sets (20%-30%). Figure 3 shows the corresponding training and validation accuracies for the classification with respect to training epochs. First, we use only the ICBHI dataset to train the CNN; the accuracy results are shown in Fig. 3(a). The validation accuracy remains much smaller than the training accuracy, which suggests the necessity of additional training samples. Next, we include the PKU dataset and train the CNN model again. The comparison clearly shows two distinctive performance differences between Figs. 3(a) and 3(b). First, the initial classification results all fall on the normal label and, as a result, the initial accuracy jumps from around 51% in Fig. 3(a) to 85% in Fig. 3(b), because the ICBHI+PKU dataset contains a greater proportion of samples from healthy individuals than the ICBHI dataset. Second, the training accuracy approaches 100% more closely in Fig. 3(b) than in Fig. 3(a), and the gap between the training and validation accuracies is smaller in Fig. 3(b) than in Fig. 3(a). This performance improvement is achieved as the training sample size increases from 6898 (ICBHI) to 18 866 (ICBHI+PKU). In summary, the comparison confirms the advantage of a large dataset with balanced samples for machine learning.
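A reproducible random split of this kind can be sketched as follows; the helper name, the single val_fraction parameter, and the fixed seed are our illustrative choices.

```python
import numpy as np


def split_dataset(n_samples, val_fraction=0.25, seed=0):
    """Shuffle sample indices and split them into training and validation sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_val = int(n_samples * val_fraction)  # 20%-30% held out for validation
    return idx[n_val:], idx[:n_val]        # train indices, validation indices


# e.g., the 18 866 ICBHI+PKU samples with a 75/25 split:
train_idx, val_idx = split_dataset(18_866, val_fraction=0.25)
```

A fixed seed keeps the split reproducible across training runs, so accuracy curves such as those in Fig. 3 are comparable between experiments.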

Fig. 3.

The training and the validation accuracies for (a) the ICBHI dataset and (b) the ICBHI+PKU datasets.

Figure 4 compares the confusion matrices of our CNN model predictions on the test set of the ICBHI dataset alone and on the test set of the ICBHI+PKU dataset. A confusion matrix shows the predicted results against the corresponding actual labels in a two-dimensional matrix. On the ICBHI dataset alone, the prediction performance of our CNN model is only of moderate quality: for example, 692 out of 883 (692 + 138 + 53) normal breath cycles are correctly classified. By comparison, 2941 out of 3070 (2941 + 99 + 30) normal breath cycles are correctly classified when the additional samples from the PKU dataset are included in the training. We further examine the performance in terms of the statistical specificity and sensitivity values, which are adopted in many previous references (Alqudah et al., 2022; Bae et al., 2023) and are defined as follows:
Specificity = True Negative / (True Negative + False Positive), (1)

Sensitivity = True Positive / (True Positive + False Negative), (2)
where crackle and wheeze are the positive labels and normal is the negative label. True Negative is then the number of negative-label cycles predicted correctly, False Positive is the number of negative-label cycles incorrectly predicted as positive, and so on. The calculated results are shown in Table 1. We first show two recent results with two different but representative neural network designs from Ma et al. (2020) and Li et al. (2021), where only the ICBHI dataset was used. Our result achieves a comparable statistical performance when using only the ICBHI dataset. By comparison, the specificity and sensitivity statistics improve by more than 17% and 10%, respectively, simply by including the PKU dataset. This comparison confirms the above finding that including additional balanced training data can be more favorable than merely tuning machine learning approaches. To further demonstrate the benefit of adding our dataset to ICBHI, we train and test the widely recognized ResNet50 neural network, as designed in He et al. (2016), on the ICBHI dataset; subsequently, we include the PKU dataset and train the network again. The test results, also presented in Table 1, show that the specificity and sensitivity improve by more than 20% and 10%, respectively, when the PKU dataset is added to the ICBHI dataset. Among the models trained only on the ICBHI dataset, our CNN model performs slightly better than the other three models. Most importantly, our model trained on the ICBHI + PKU dataset performs much better than the others, which demonstrates the benefit of importing an external dataset into ICBHI.
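Equations (1) and (2) map directly onto the confusion-matrix counts. Using the normal-row counts quoted above (692 of 883 and 2941 of 3070 normal cycles classified correctly), a short check reproduces the specificity entries of Table 1; the function names are our own.

```python
def specificity(tn, fp):
    """Eq. (1): fraction of actual-normal (negative) cycles predicted as normal."""
    return tn / (tn + fp)


def sensitivity(tp, fn):
    """Eq. (2): fraction of actual crackle/wheeze (positive) cycles detected."""
    return tp / (tp + fn)


# Normal-row counts from the confusion matrices in Fig. 4:
spec_icbhi = specificity(tn=692, fp=883 - 692)        # ICBHI alone
spec_combined = specificity(tn=2941, fp=3070 - 2941)  # ICBHI + PKU

print(f"{spec_icbhi:.2%}, {spec_combined:.2%}")  # → 78.37%, 95.80%, as in Table 1
```

The sensitivity entries follow the same recipe from the crackle and wheeze rows of the confusion matrices, which are not enumerated in the text.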
Fig. 4.

The confusion matrices of the predicted results.

Table 1.

Comparison of the specificity and sensitivity values from previous works, an existing neural network, and our work.

Model                          Dataset       Specificity  Sensitivity
LungAttn (Li et al., 2021)     ICBHI         71.44%       36.36%
LungRN+NL (Ma et al., 2020)    ICBHI         63.69%       41.32%
ResNet50 (He et al., 2016)     ICBHI         73.33%       38.57%
ResNet50 (He et al., 2016)     ICBHI + PKU   94.59%       51.89%
Our CNN                        ICBHI         78.37%       49.53%
Our CNN                        ICBHI + PKU   95.80%       60.55%

Auscultation is one of the most fundamental medical physical examination methods, yet its effectiveness depends heavily on the professional expertise of doctors and their subjective interpretations. A machine learning-based auscultation diagnosis tool is a promising candidate to address this issue, especially for prescreening purposes. However, as shown in this work, the performance of such a tool is significantly influenced by its training dataset. In the meantime, both digital stethoscope hardware and machine learning approaches have advanced rapidly since the pandemic. In this Letter, we summarize the status quo of the publicly available ICBHI dataset and disclose the common practices that should be adopted to establish an extended dataset with more training samples. A preliminary machine learning approach is examined at the end of this Letter to demonstrate the use of the new dataset for lung sound classification. Some important observations are summarized here: (1) a dataset balanced in terms of age and disease distributions has a favorable impact on the training performance; (2) a training dataset with a sufficient number of samples is vital for the final deployment of such a machine learning-based diagnosis tool; and (3) data augmentation techniques are quite effective in improving the machine learning performance even for the current dataset.

As the individuals in the PKU dataset are all East Asians, interested readers can follow the reported pipeline and construct new datasets by recruiting suitable individuals locally, thereby increasing the diversity of the demographic groups covered by the ICBHI and PKU datasets. Despite the benefits of extensive labeled samples, the required annotation can be prohibitive, which calls for further studies of emerging machine learning methods, such as contrastive learning (Bae et al., 2023), that substantially reduce the required annotation effort. Interested readers can also directly use this new dataset for their machine learning studies. Advanced machine learning techniques, such as semi-supervised or unsupervised learning, could be further explored to enhance the dataset's utility and reduce reliance on extensively labeled data.

Although this technology has great potential in clinical diagnosis, admittedly, more research is needed before it can be integrated into clinical workflows. In respiratory outpatient clinics, physicians could use it for prescreening. In routine clinical care, if fully utilized, it could greatly reduce the workload of nurses in regular monitoring. However, all of these potentials need to be tested and demonstrated in clinical practice. We should also fully understand the perspectives of clinicians and patients on machine learning-based diagnostic models. By putting the technology into clinical practice and conducting interviews in the future, questions such as how much help this technology provides to physicians and nurses and how much trust patients place in it may be clarified. It is virtually certain that as the diagnostic model improves, these barriers will become easier to overcome.

This research was partly supported by the National Science Foundation of China (Grant No. 12272007) and the Technical Field Fund of Foundation Strengthening Program (Grant No. 2021-JCJQ-JJ-0017). G.Z. and C.L. contributed equally to this work.

The authors have no conflicts to disclose.

The data that support the findings of this study are available from the corresponding author upon reasonable request.

1. Abella, M., Formolo, J., and Penny, D. G. (1992). "Comparison of the acoustic properties of six popular stethoscopes," J. Acoust. Soc. Am. 91, 2224–2228.
2. Alqudah, A. M., Qazan, S., and Obeidat, Y. M. (2022). "Deep learning models for detecting respiratory pathologies from raw lung auscultation sounds," Soft Comput. 26, 13405–13429.
3. Bae, S. M., Kim, J. W., Cho, W. Y., Baek, H., Son, S. Y., Lee, B., Ha, C. W., Tae, K., Kim, S. Y., and Yun, S. Y. (2023). "Patch-mix contrastive learning with audio spectrogram transformer on respiratory sound classification," arXiv:2305.14032.
4. Bokov, P., Mahut, B., Flaud, P., and Delclaux, C. (2016). "Wheezing recognition algorithm using recordings of respiratory sounds at the mouth in a pediatric population," Comput. Biol. Med. 70, 40–50.
5. Emokpae, L. E., Emokpae, R. N., Bowry, E., Saif, J. B., Mahmud, M., Lalouani, W., Younis, M., and Joyner, R. L. (2022). "A wearable multi-modal acoustic system for breathing analysis," J. Acoust. Soc. Am. 151(2), 1033–1038.
6. He, K., Zhang, X., Ren, S., and Sun, J. (2016). "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
7. Icer, S., and Gengec, S. (2014). "Classification and analysis of non-stationary characteristics of crackle and rhonchus lung adventitious sounds," Digital Sig. Proc. 28, 18–27.
8. Kevat, A., Kalirajah, A., and Roseby, R. (2020). "Artificial intelligence accuracy in detecting pathological breath sounds in children using digital stethoscopes," Respir. Res. 21, 253.
9. Li, J., Yuan, J., Wang, H., Liu, S., Guo, Q., Ma, Y., Li, Y., Zhao, L., and Wang, G. (2021). "LungAttn: Advanced lung sound classification using attention mechanism with dual TQWT and triple STFT spectrogram," Physiol. Meas. 42(10), 105006.
10. Lynch, J. F., and Church, C. C. (2023). "Introduction to the special issue on COVID-19," J. Acoust. Soc. Am. 153(1), 573–575.
11. Ma, Y., Xu, X. Z., and Li, Y. F. (2020). "LungRN+NL: An improved adventitious lung sound classification using non-local block ResNet neural network with mixup data augmentation," in Interspeech 2020, pp. 2902–2906.
12. Nguyen, T., and Pernkopf, F. (2022). "Lung sound classification using snapshot ensemble of convolutional neural networks," in 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society.
13. Pasterkamp, H., Kraman, S. S., and Wodicka, G. R. (1997). "Respiratory sounds: Advances beyond the stethoscope," Am. J. Respir. Crit. Care Med. 156(3), 974–987.
14. Rocha, B. M., Filos, D., Mendes, L., Serbes, G., Ulukaya, S., Kahya, Y. P., Jakovljevic, N., Turukalo, T. L., Vogiatzis, J. M., Perantoni, E., Kaimakamis, E., Natsiavas, P., Oliveira, A., Jácome, C., Marques, A., Maglaveras, N., Paiva, R. P., Chouvarda, I., and Carvalho, P. (2019). "An open access database for the evaluation of respiratory sound classification algorithms," Physiol. Meas. 40, 035001.
15. Sengupta, N., Sahidullah, M., and Saha, G. (2016). "Lung sound classification using cepstral-based statistical features," Comput. Biol. Med. 75, 118–129.
16. Sfayyih, A. H., Sabry, A. H., Jameel, S. M., Sulaiman, N., Raafat, S. M., Humaidi, A. J., and Kubaiaisi, Y. M. A. (2023). "Acoustic-based deep learning architectures for lung disease diagnosis: A comprehensive overview," Diagnostics 13, 1748.
17. Shimon, C., Shafat, G., Dangoor, I., and Ben-Shitrit, A. (2021). "Artificial intelligence enabled preliminary diagnosis for COVID-19 from voice cues and questionnaires," J. Acoust. Soc. Am. 149(2), 1120–1124.
18. Vahedian-Azimi, A., Keramatfar, A., Asiaee, M., Atashi, S. S., and Nourbakhsh, M. (2021). "Do you have COVID-19? An artificial intelligence-based screening tool for COVID-19 using acoustic parameters," J. Acoust. Soc. Am. 150(3), 1945–1953.
19. Xue, C. L., Xu, X. H., Liu, Z. X., Zhang, Y. N., Xu, Y. L., Niu, J. Q., Jin, H., Xiong, W. J., and Cui, D. X. (2023). "Intelligent COVID-19 screening platform based on breath analysis," J. Breath Res. 17, 016005.
20. Zhang, Q., Chen, B. Y., and Liu, G. H. (2023). "Artificial intelligence can dynamically adjust strategies for auxiliary diagnosing respiratory diseases and analyzing potential pathological relationships," J. Breath Res. 17, 046007.