A surface defect detection method for hot-rolled steel strips, based on multiscale feature perception and adaptive feature fusion, was proposed to address the challenges of small target defects, significant morphological differences, and unclear defect characteristics. First, based on the spatial distribution characteristics of the steel strip image, redundant background interference is removed using automatic gamma correction and Otsu thresholding. Second, based on the characteristics of surface defects in steel strips, this paper proposes TDB-YOLO, a YOLO-based model that incorporates a small target detection layer, a Bidirectional Feature Pyramid Network (BiFPN), and a Double Cross Stage Partial (CSP) Bottleneck with three convolutions (DC3). To detect small object defects, a small target detection layer with a smaller receptive field focuses on fine-grained features, reducing the model's probability of missed detection. In terms of feature extraction, DC3 enhances the interaction of feature information across different spatial scales, enabling the model to effectively handle features of varying scales. In terms of feature fusion, the BiFPN adaptively fuses deep-level and shallow-level feature information, enhancing the semantic richness of the features. Ultimately, the proposed model achieved an accuracy of 90.3% and a recall rate of 88.0% for surface defects on steel strips, with a mean average precision of 90.4% at 33 frames per second. Its detection performance outperformed that of other detection models, demonstrating that it can meet the real-time requirements of surface defect detection on steel strips in industrial scenarios.

The hot-rolled steel strip is an essential product in the modern steel industry, widely used in various fields. During the production process, various surface defects may occur on steel strips due to factors such as the quality of raw materials, manufacturing equipment, processing techniques, and production environment. The quality of the steel strip surface directly affects the performance of the steel strip product. Based on this, surface defect detection has become a necessary process in steel strip production.1 

A keyword co-occurrence analysis of 902 publications on surface defect detection of steel strips was conducted, and the results are shown in Fig. 1. To facilitate keyword statistics and analysis, synonymous keywords were consolidated, and overly generic terms such as "steel," "surface defect detection," and "system" were excluded. Ten colors represent different publication years, and the size of each circle represents the frequency of the keyword. Overall, non-destructive testing has long been a research hotspot in surface defect detection of steel strips, indicating that such technology is essential for ensuring production quality. Traditional non-destructive testing methods use physical principles to detect surface defects, including eddy current testing, magnetic flux leakage testing, and infrared detection. Qiu et al.2 used the signal polarity of a positively magnetized eddy current probe to detect cracks on the surface of steel plates based on the magnetic permeability perturbation mechanism. To overcome the skin effect and detect defects on the backside of steel plates, Wang et al.3 proposed a multi-frequency alternating current magnetic flux leakage detection method that combines the advantages of low-frequency and high-frequency excitation. Usamentiaga Fernández et al.4 used infrared thermography to detect inclusions in steel products. These physics-based non-destructive testing methods typically detect only specific defect types, offer a low level of automation, and are susceptible to external environmental influences.

FIG. 1.

Co-occurrence graph of key terms for surface defect detection in steel strips.


As shown in Fig. 1, starting from 2016, surface defect detection in steel strips gradually shifted toward machine vision, which has the advantages of being non-contact and automated. The primary machine vision pipeline includes image preprocessing, Region of Interest (ROI) extraction, image segmentation, feature extraction, and defect classification.5 Feature extraction is crucial in machine vision; accordingly, the keyword "feature extraction" appears with high frequency in Fig. 1. To extract texture features from the surface of steel strips, Liu et al.6 established a model based on the Haar–Weibull variance to represent the texture of each local patch in the image. This method can detect defects on surfaces with uniform texture, achieving an average detection rate of 96.2% for surface defects on steel strips. Surface image textures generally possess a low-rank structure, i.e., they exhibit a certain regularity and coherence, which rare defects disturb. Based on this phenomenon, Wang et al.7 proposed an intuitive feature for surface defect detection using an entity sparsity pursuit method to effectively identify defects on the surface of steel strips. Ashour et al.8 extracted multi-directional shearlet features from surface images of steel strips and then used a gray-level co-occurrence matrix to extract a set of statistical features to identify surface defects. These methods require manually designed features, which are subjective and lack robustness.

As shown in Fig. 1, since 2020, the method of feature extraction has gradually shifted toward using deep learning technology, which can adaptively extract surface defect features of steel strips without manual feature design. Detection techniques based on deep learning primarily utilize Convolutional Neural Networks (CNNs)9,10 to automatically extract features. Zhang et al.11 proposed a method that combines domain adaptation and adaptive CNNs. Their approach outperformed classical CNNs and other existing methods in terms of recognition accuracy for surface defect detection on steel strips. Zhao et al.12 proposed a steel strip surface defect detection model based on YOLOv5. This model adopts two feature pyramids to extract more comprehensive features and utilizes separate regression and classification heads to enhance model accuracy. In the steel strip surface defect detection task, the mean average precision (mAP) of the model reached 81.1%. Wang et al.13 proposed a multiscale feature extraction module and spatial attention mechanism to effectively extract multiscale and spatial information from steel strip surfaces. In the steel strip surface defect detection task, the mAP of the model reached 72%. To address the challenges of complex morphology and low detection efficiency in steel strip surface defect detection, Yeung and Lam14 proposed a fused attention network framework. This method applies attention mechanisms to individual feature maps, which improves the accuracy of the detection network while maintaining the detection speed. To address the issue of scale diversity on steel strip surfaces, Zhou et al.15 employed a dual pyramid network that combines a residual atrous spatial pyramid pooling module and a feature pyramid to fuse multiscale features. This approach further enriches the scale information in the feature maps and improves the mAP of steel strip surface defects to 80.93%.

The above-mentioned deep learning-based detection algorithms avoid the complex manual feature design process and further improve the detection accuracy of strip surface defects while maintaining detection speed. However, the images16 used in these studies were cropped and segmented into small slices, with each image containing only one type of defect. This practice overlooks the complex backgrounds of industrial environments and reduces the scale differences between defects. As a result, detection algorithms developed on such images generalize poorly and are difficult to transfer to strip production lines. For this reason, this paper takes the original strip steel images from the production line as the research object to enhance the practicality of the detection algorithm. In actual industrial sites, the original strip steel images are wide and of high resolution,17 with interference factors such as water stains and lighting. In addition, the original images contain many small target defects, large differences in defect morphology, and inconspicuous defect characteristics, which further exacerbate the difficulty of detecting defects on the strip surface. In response to these challenges, this paper takes YOLOv5 as the benchmark model and proposes an efficient method for detecting surface defects on steel strips in industrial settings.

To better reproduce the actual industrial field environment, Luo et al.18 created the first publicly available dataset of original steel strip surface defect images. The images in this dataset were directly captured from actual steel production lines. This paper takes the original images from the dataset as the research object, with an image resolution of 4096 × 1024. The non-steel strip areas occupy a significant portion of the original images, as shown in Fig. 2(a). These areas can interfere with surface defect detection. Therefore, it is necessary to extract the steel strip regions from the original images to improve the accuracy of the detection algorithm.

FIG. 2.

ROI extraction process for the steel strip image. (a) Original images. (b) Corrected images. (c) Binary images. (d) Segmented images.


The original images are affected by factors such as noise, water vapor, and lighting, making it difficult to segment the steel strip regions and remove redundant backgrounds using common methods such as adaptive thresholding, edge detection, and region growing. To address this issue, an adaptive ROI extraction method based on the gray-level distribution characteristics of the original steel strip image is proposed.

First, mean smoothing is applied to remove noise, and automatic gamma correction is performed to normalize the gray levels of the original image. The original image has a wide distribution range and significant dynamic variations in grayscale, which makes it unsuitable for threshold segmentation. Automatic gamma correction can bring the grayscale mean of an image closer to a specified target mean. The value of γ is calculated according to formula (1). After correction, the grayscale distribution of the strip region is uniform, as shown in Fig. 2(b),
$\frac{1}{w\times h}\sum_{i=1}^{w}\sum_{j=1}^{h}P_{ij}^{\gamma}=y,$
(1)
where w and h are the width and height of the image, respectively, Pij represents the pixel value at the corresponding position, and y is the desired mean value.
Then, in this paper, the Otsu threshold segmentation method is used to binarize the corrected image, and morphological processing is further employed to eliminate noise, as shown in Fig. 2(c). The Otsu method is a threshold segmentation method based on maximizing the between-class variance of gray levels, efficiently dividing image pixels into two classes, C1 and C2. The grayscale level k that maximizes the between-class variance σ2 in formula (2) is the Otsu threshold. The grayscale of surface defect images of steel strips varies mainly along the horizontal direction. Therefore, using the grayscale distribution of the column histogram instead of that of the full corrected image accelerates threshold segmentation while maintaining the segmentation quality. Finally, the strip region in the original image is extracted based on the column pixel distribution of the binary image, as shown in Fig. 2(d),
$\sigma^{2}=\left(\sum_{i=0}^{k}p_{i}\right)\left(\sum_{i=k+1}^{L-1}p_{i}\right)\left(\frac{1}{P_{1}}\sum_{i=0}^{k}ip_{i}-\frac{1}{P_{2}}\sum_{i=k+1}^{L-1}ip_{i}\right)^{2},$
(2)
where L is the number of grayscale levels, P1 and P2 represent the probabilities of pixels being classified into classes C1 and C2, respectively, and pi represents the probability of a pixel having grayscale level i.
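The ROI extraction pipeline can be summarized in a short script. The following is a minimal sketch, assuming a grayscale input in which the strip appears brighter than the background; the target mean of 128, the kernel sizes, and the bisection solve of Eq. (1) on normalized intensities are illustrative choices, and OpenCV's standard global Otsu stands in for the column-histogram variant described above.

```python
import cv2
import numpy as np

def solve_gamma(gray, target_mean, lo=0.1, hi=5.0, iters=40):
    # Bisection on Eq. (1): mean((P/255)**gamma) * 255 decreases
    # monotonically as gamma grows, so the root is easy to bracket.
    norm = gray.astype(np.float64) / 255.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (norm ** mid).mean() * 255.0 > target_mean:
            lo = mid  # image still too bright: increase gamma
        else:
            hi = mid  # image too dark: decrease gamma
    return 0.5 * (lo + hi)

def extract_strip_roi(gray, target_mean=128.0):
    smoothed = cv2.blur(gray, (5, 5))  # mean smoothing removes noise
    gamma = solve_gamma(smoothed, target_mean)
    corrected = (255.0 * (smoothed / 255.0) ** gamma).astype(np.uint8)
    # Otsu binarization (use THRESH_BINARY_INV if the strip is darker),
    # then morphological opening to eliminate residual speckle.
    _, binary = cv2.threshold(corrected, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN,
                              np.ones((9, 9), np.uint8))
    # Crop to the strip using the column-wise profile of the binary mask.
    cols = np.flatnonzero(binary.sum(axis=0) > 0)
    return gray[:, cols[0]:cols[-1] + 1]
```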

To address the detection challenges, such as the presence of multiple small target defects, large variations in defect morphology, and inconspicuous defect features in the surface defects of hot-rolled steel strips, this paper proposes a defect detection model for steel strip surface images based on YOLOv5,19 named TDB-YOLO. The overall network structure is shown in Fig. 3. Because steel strip surface defect images contain numerous small target defects, the probability of missed detection is high. Therefore, a small target detection layer P2 with a smaller receptive field is introduced into the network structure. The small target detection layer applies a smaller downsampling factor to the input image, paying more attention to details and shallow small-scale features, which yields better detection performance for small targets. To address the issue of unclear surface defect characteristics on steel strips, the feature extraction module CSP Bottleneck with three convolutions (C3) is replaced with the Double CSP Bottleneck with three convolutions (DC3), which facilitates more efficient interaction of information at different scales, enhancing the model's ability to extract and represent multiscale features. To capture more comprehensive features despite the significant variations in surface defect morphology, the Bidirectional Feature Pyramid Network (BiFPN) is introduced. As shown by the red connecting lines in Fig. 3, the BiFPN adds horizontal connections between features of the same size, effectively avoiding the loss of feature information. In addition, the BiFPN assigns different weights to features from different levels to enhance the efficiency of feature fusion.

FIG. 3.

TDB-YOLO network structure.


The size of the region on the original image to which each pixel on the feature map is mapped is called the receptive field. The size of the receptive field determines the level at which a network layer perceives the input image, i.e., whether the features obtained by each layer are global, high-level semantic features or local, detailed features. In general, the fewer convolution and pooling operations the input image undergoes, the larger the output feature map, which reflects finer-grained features and focuses on details. Conversely, the more convolution and pooling operations the input image undergoes, the smaller the output feature map, which reflects features focused on the overall, global context. Therefore, a small receptive field is more effective for detecting small objects.

Accurately calculating the size of the receptive field plays an important role in analyzing the detection performance of the network. YOLOv5 detects small objects on the prediction feature layer P3, which has 8× downsampling. The minimum receptive field of this feature layer is equal to the minimum receptive field of the feature layer with 8× downsampling in the backbone network. The formula for calculating the receptive field is as follows:
$l_{k}=l_{k-1}+(f_{k}-1)\times\prod_{i=1}^{k-1}s_{i},$
(3)
where k is greater than or equal to 2, lk represents the size of the receptive field in the k-th layer, fk represents the size of the convolution kernel or pooling window, and si represents the stride in the i-th layer. The steps to calculate the minimum receptive field of the prediction feature layer after 8× downsampling are as follows:
  • Step 1: DF = 2, Kernel Size = 6, Stride = 2, RF = 1 + (6 − 1) = 6

  • Step 2: DF = 4, Kernel Size = 3, Stride = 2, RF = 6 + (3 − 1) × 2 = 10

  • Step 3: DF = 8, Kernel Size = 3, Stride = 2, RF = 10 + (3 − 1) × 2 × 2 = 18

Here, DF represents the downsampling factor, and RF represents the size of the receptive field.
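The recursion in Eq. (3) and the three steps above are easy to verify programmatically. The following minimal sketch tracks the cumulative stride product and reproduces the receptive fields of 6, 10, and 18 for the (kernel, stride) sequence of YOLOv5's stem layers listed above.

```python
def receptive_field(layers):
    # Eq. (3): l_k = l_{k-1} + (f_k - 1) * prod(s_1 .. s_{k-1}),
    # starting from l_0 = 1 and an empty stride product of 1.
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # grow by the kernel's reach at this depth
        jump *= stride             # accumulate the stride product
    return rf

stem = [(6, 2), (3, 2), (3, 2)]   # 6x6/s2, 3x3/s2, 3x3/s2 as in Steps 1-3
print(receptive_field(stem[:1]))  # 6  (DF = 2)
print(receptive_field(stem[:2]))  # 10 (DF = 4, the P2 layer)
print(receptive_field(stem))      # 18 (DF = 8, the P3 layer)
```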

According to this calculation, the minimum receptive field of the target detection layer after 8× downsampling is 18 × 18. To enhance the detection of small targets, a target detection layer P2 with 4× downsampling is adopted, whose minimum receptive field is 10 × 10. Compared with the target detection layer P3, as shown in Fig. 4, the target detection layer P2 retains abundant detailed information, making it more suitable for detecting small targets. Table I compares the surface defect detection performance of YOLOv5 and YOLOv5-P2, which introduces the target detection layer P2. The small target detection layer P2 comprehensively improves the performance of YOLOv5 in surface defect detection of steel strips: the accuracy increases by 1%, the recall rate increases by 6.1%, significantly reducing the model's false negative rate, and the mean Average Precision (mAP) increases by 2%. YOLOv5-P2 shows some reduction in detection speed compared to YOLOv5, but industrial inspection places more emphasis on detection accuracy. Therefore, the small target detection layer is an effective improvement for surface defect detection of steel strips.

FIG. 4.

Visualization of output feature maps. (a) Visualization of P2 output feature maps. (b) Visualization of P3 output feature maps.

TABLE I.

Results of comparative experiments on small object detection.

Model       P (%)   R (%)   mAP (%)   FPS
YOLOv5      88.9    81.1    87.1      37
YOLOv5-P2   89.9    87.2    89.1      32

Feature extraction is both a key step and a challenge in object detection. Defect feature extraction refers to extracting a series of features that represent different defects from input images, eliminating the interference of redundant information and improving the feature representation capability of the model. YOLOv5 uses the C3 module to increase the depth and receptive field of the network and thereby enhance its feature extraction capability. YOLOv820 proposes the more advanced C2f module based on the C3 module; its working principle is shown in Fig. 5(a). The C2f module demonstrates the significant impact that enhanced interaction of feature information has on the network's feature extraction capability. Based on this observation, the DC3 module, which has a stronger information interaction capability, was designed on the basis of the C2f module; its working principle is shown in Fig. 5(b). Compared to C2f, DC3 has an additional branch for transferring feature information, enhancing the interaction of feature information across different spatial scales and enabling better extraction of multiscale features.
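Figure 5 defines the exact wiring of DC3. As a rough PyTorch sketch, the following shows one plausible layout: a C2f-style split with sequential bottlenecks plus an extra 1×1 branch that re-aggregates all intermediate outputs before the final fusion. The module structure here is a reconstruction chosen to be consistent with the parameter counts in Eqs. (4) and (5) below, not the authors' definitive implementation.

```python
import torch
import torch.nn as nn

class ConvBN(nn.Module):
    """Convolution + BatchNorm + SiLU, the basic unit in YOLOv5/YOLOv8."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Two 3x3 ConvBN layers with a residual connection (n' = 2)."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBN(c, c, 3)
        self.cv2 = ConvBN(c, c, 3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class DC3(nn.Module):
    """C2f-style block with an extra branch for cross-scale interaction."""
    def __init__(self, c_in, c_out, n=3):
        super().__init__()
        c = c_out // 2
        self.cv1 = ConvBN(c_in, c_out, 1)        # project, then split in two
        self.m = nn.ModuleList(Bottleneck(c) for _ in range(n))
        self.branch = ConvBN((n + 1) * c, c, 1)  # extra aggregation branch
        self.cv2 = ConvBN(3 * c, c_out, 1)       # fuse three c-wide maps

    def forward(self, x):
        y0, y1 = self.cv1(x).chunk(2, dim=1)
        ys = [y1]
        for m in self.m:
            ys.append(m(ys[-1]))                 # progressively deeper scales
        z = self.branch(torch.cat(ys, dim=1))    # interact across all scales
        return self.cv2(torch.cat([y0, ys[-1], z], dim=1))
```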

FIG. 5.

Diagram of the feature extraction module. (a) C2f module structure diagram. (b) DC3 module structure diagram.

The number of parameters in a feature extraction module is an essential indicator of its complexity: the fewer parameters a module has, the less computation and time it requires. According to the working principles of the C2f and DC3 modules, the number of parameters in each is as follows:
$P_{C2f}=c_{o}\left(k_{1}^{2}c_{i}+g\right)+nn'\left[\frac{c_{o}}{2}\left(k_{2}^{2}\frac{c_{o}}{2}+g\right)\right]+c_{o}\left[k_{1}^{2}\frac{c_{o}}{2}(n+2)+g\right],$
(4)
$P_{DC3}=c_{o}\left(k_{1}^{2}c_{i}+g\right)+nn'\left[\frac{c_{o}}{2}\left(k_{2}^{2}\frac{c_{o}}{2}+g\right)\right]+\frac{c_{o}}{2}\left[k_{1}^{2}\frac{c_{o}}{2}(n+1)+g\right]+c_{o}\left(k_{1}^{2}\frac{3c_{o}}{2}+g\right),$
(5)
where ci refers to the number of input channels, co refers to the number of output channels, k1 refers to the side length of the square 1 × 1 convolution kernel, k2 refers to the side length of the square 3 × 3 convolution kernel, n refers to the number of repetitions of the Bottleneck module, n′ refers to the number of 3 × 3 convolution kernels in each Bottleneck module, and g represents the number of parameters required by the BN layer per output channel.
Substituting k1 = 1, k2 = 3, n′ = 2, and g = 2 into Eqs. (4) and (5) and simplifying yields Eqs. (6) and (7), respectively,
$P_{C2f}=c_{o}(2n+4)+c_{i}c_{o}+c_{o}^{2}(5n+1),$
(6)
$P_{DC3}=c_{o}(2n+5)+c_{i}c_{o}+\frac{c_{o}^{2}}{4}(19n+7),$
(7)
$P_{C2f}-P_{DC3}=\frac{c_{o}}{4}\left(c_{o}(n-3)-4\right).$
(8)

Equation (8) gives the difference in parameter count between C2f and DC3. When n is greater than 3, DC3 has fewer parameters than C2f. Replacing the C3 module in YOLOv5 with the C2f and DC3 modules in turn, the latter has 438 144 fewer parameters than the former. This indicates that the DC3 module has lower complexity and is more lightweight, giving models that use it faster inference.
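As a quick sanity check, the following sketch evaluates Eqs. (4) and (5) directly with the substitutions k1 = 1, k2 = 3, n′ = 2, and g = 2 used above, and confirms that the difference matches Eq. (8); the channel and repetition values are arbitrary examples.

```python
def params_c2f(ci, co, n, k1=1, k2=3, n_p=2, g=2):
    # Eq. (4): 1x1 projection + n bottlenecks (n' 3x3 convs each) + 1x1 fuse.
    c = co // 2
    return (co * (k1**2 * ci + g)
            + n * n_p * c * (k2**2 * c + g)
            + co * (k1**2 * c * (n + 2) + g))

def params_dc3(ci, co, n, k1=1, k2=3, n_p=2, g=2):
    # Eq. (5): same stem, plus the extra branch and a wider final fusion.
    c = co // 2
    return (co * (k1**2 * ci + g)
            + n * n_p * c * (k2**2 * c + g)
            + c * (k1**2 * c * (n + 1) + g)
            + co * (k1**2 * 3 * c + g))

ci, co, n = 256, 256, 6  # example values; n > 3, so DC3 should be lighter
diff = params_c2f(ci, co, n) - params_dc3(ci, co, n)
assert diff == co // 4 * (co * (n - 3) - 4)  # Eq. (8)
print(diff)  # 48896 parameters saved by DC3 in this configuration
```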

To verify the performance of different feature extraction modules in surface defect detection of steel strips, comparative experiments were conducted; the results are shown in Table II. YOLOv5-DC3 performs best overall, with an accuracy of 86.9%, a recall rate of 85.1%, and an mAP of 87.9%. Although YOLOv5 achieves the highest accuracy and detection speed, its recall rate is too low, its missed detection problem is severe, and its mAP is lower than that of YOLOv5-DC3. From the perspective of industrial inspection, YOLOv5-DC3 therefore outperforms YOLOv5. YOLOv5-DC3 also performs markedly better than YOLOv5-C2f, with significantly higher accuracy and a higher mAP. Thus, YOLOv5-DC3 achieves the best detection results for surface defects on steel strips in this experiment, demonstrating the feasibility of the DC3 module.

TABLE II.

Results of comparative experiments on feature extraction.

Model        P (%)   R (%)   mAP (%)   FPS
YOLOv5       88.9    81.1    87.1      37
YOLOv5-C2f   81.0    85.6    86.1      32
YOLOv5-DC3   86.9    85.1    87.9      32

Feature fusion is an essential mechanism in YOLOv5, which combines features of different scales to obtain more complete feature information. Shallow-level features have higher resolution and contain more positional and detailed information; however, having undergone fewer convolutions, they carry weaker semantics and more noise. Deep-level features carry stronger semantic information but have lower resolution and a poor ability to perceive details. Efficiently fusing shallow-level and deep-level features is therefore critical to improving the detection performance of the model. YOLOv5 uses a Path Aggregation Network (PANet) for feature fusion; its working principle is illustrated in Fig. 6(a). PANet fuses feature information bidirectionally, top-down and bottom-up, transferring the deep semantic information of the head network to the shallow layers while transmitting shallow geometric and detailed information to the deep layers, which enhances the detection performance of the network for objects of different scales. However, PANet tends to lose the shallow-level information extracted by the backbone network as the network hierarchy deepens. To address this issue, Tan et al.21 proposed the BiFPN, which adds lateral connections between features at the same scale to alleviate the information loss caused by a deep hierarchy; its working principle is illustrated in Fig. 6(b).

FIG. 6.

Feature pyramid network architecture diagram. (a) PANet network architecture diagram. (b) BiFPN network architecture diagram.


Traditional feature fusion usually only adds or stacks feature maps without considering their different levels, treating all feature maps equally. However, feature maps from different levels have different resolutions, and they also have different impacts on the detection results. Simply adding or stacking feature maps is not the optimal operation. To address this issue, a simple yet efficient weighted feature fusion mechanism is adopted. Different weight coefficients are assigned to feature maps from different levels, thereby suppressing or enhancing features from different levels.

If a learnable weight coefficient is directly assigned to feature maps from different levels without constraining the range of the weight coefficient, it can lead to training instability and make it challenging to train an effective detection model. Therefore, a normalization method is employed to scale the weight coefficients to be between [0, 1], as shown in the following equation:
$w_{i}^{n}=\frac{w_{i}}{\varepsilon+\sum_{j}w_{j}},$
(9)
where wi ≥ 0 is a learnable weight coefficient, ε = 0.0001 prevents the denominator from vanishing, and win represents the normalized weight coefficient.
Taking the calculation of the output feature P4out at level 4 of the bottom-up pathway as an example, the feature fusion method is explained in detail. As shown in Eqs. (10) and (11), the original BiFPN fuses features by adding the feature maps together. However, this approach imposes strict requirements on the number of feature channels and can lose important feature information. To address this issue, this paper adopts feature map stacking for fusion, as shown in Eqs. (12) and (13),
$P_{4}^{td}=\mathrm{Conv}\left(\frac{w_{1}P_{4}^{in}}{w_{1}+w_{2}+\varepsilon}+\frac{w_{2}\,\mathrm{Resize}(P_{5}^{td})}{w_{1}+w_{2}+\varepsilon}\right),$
(10)
$P_{4}^{out}=\mathrm{Conv}\left(\frac{w_{1}P_{4}^{in}}{w_{1}+w_{2}+w_{3}+\varepsilon}+\frac{w_{2}P_{4}^{td}}{w_{1}+w_{2}+w_{3}+\varepsilon}+\frac{w_{3}\,\mathrm{Resize}(P_{3}^{out})}{w_{1}+w_{2}+w_{3}+\varepsilon}\right),$
(11)
$P_{4}^{td}=\mathrm{Concat}\left(\frac{w_{1}P_{4}^{in}}{w_{1}+w_{2}+\varepsilon},\ \frac{w_{2}\,\mathrm{Resize}(P_{5}^{td})}{w_{1}+w_{2}+\varepsilon}\right),$
(12)
$P_{4}^{out}=\mathrm{Concat}\left(\frac{w_{1}P_{4}^{in}}{w_{1}+w_{2}+w_{3}+\varepsilon},\ \frac{w_{2}P_{4}^{td}}{w_{1}+w_{2}+w_{3}+\varepsilon},\ \frac{w_{3}\,\mathrm{Resize}(P_{3}^{out})}{w_{1}+w_{2}+w_{3}+\varepsilon}\right),$
(13)
where P4in represents the level-4 output feature of the backbone network, P4td and P5td represent the intermediate features of the corresponding levels in the top-down pathway, and P3out represents the level-3 output feature of the bottom-up pathway. "Resize" denotes resampling a feature map to the resolution of the target level (upsampling for P5td and downsampling for P3out), "Conv" represents a convolution operation, and "Concat" denotes feature map stacking along the channel dimension.
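A minimal PyTorch sketch of this weighted concatenation fusion is shown below, combining the fast normalized weights of Eq. (9) with the stacking of Eq. (13). The use of max pooling to downsample P3out and the channel sizes are illustrative assumptions; in the full network, a convolution would typically follow the concatenation to restore the channel count.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedConcatFusion(nn.Module):
    """Scale each input map by a learned, normalized weight (Eq. (9)),
    then stack the scaled maps along the channel axis (Eqs. (12)-(13))."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, feats):
        w = F.relu(self.w)            # enforce w_i >= 0
        w = w / (w.sum() + self.eps)  # normalize into [0, 1]
        return torch.cat([wi * f for wi, f in zip(w, feats)], dim=1)

# Eq. (13) for one node: P4_out from P4_in, P4_td, and a resized P3_out.
fuse = WeightedConcatFusion(3)
p4_in = torch.randn(1, 64, 32, 32)
p4_td = torch.randn(1, 64, 32, 32)
p3_out = torch.randn(1, 64, 64, 64)
p3_resized = F.max_pool2d(p3_out, 2)       # bring level 3 down to level 4
p4_out = fuse([p4_in, p4_td, p3_resized])  # shape (1, 192, 32, 32)
```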

To verify the effectiveness of different feature fusion methods, comparative experiments were conducted in this paper, and the results are shown in Table III. Both feature fusion methods have shown significant improvements for YOLOv5, enhancing the detection accuracy of surface defects in steel strips. Compared to YOLOv5-Add, which utilizes feature map addition, YOLOv5-Concat, which adopts feature map concatenation, avoids the loss of feature information and can better fuse features from different levels. As a result, it achieves better detection results for surface defects in steel strips, with higher accuracy and mean average precision. Therefore, this paper adopts feature map concatenation to achieve adaptive weighted feature fusion.

TABLE III.

Results of comparative experiments on feature fusion.

Model           P (%)   R (%)   mAP (%)   FPS
YOLOv5          88.9    81.1    87.1      37
YOLOv5-Add      85.2    87.5    88.6      30
YOLOv5-Concat   85.9    85.1    89.2      30

To better reproduce the detection scenario of an industrial site, this paper focuses on the images from CSU STEEL18 for research purposes. The dataset did not crop or segment the original images into smaller local images but directly used the original images from the hot-rolled steel strip industrial production line in CSU STEEL. The image resolution is 4096 × 1024. The original images of surface defects on hot-rolled steel strips contain a significant amount of redundant regions. In this study, a threshold segmentation method based on column histograms was used to segment the original images, resulting in 764 images of surface defects on hot-rolled steel strips. These images were then annotated and corrected to create the dataset used in this paper.

As shown in Fig. 7, there are four categories of labels in this dataset, namely, roll marks, elastic deformation, waves, and oxide scales.

FIG. 7.

Annotated sample examples. (a) Roll marks. (b) Elastic deformation. (c) Waves. (d) Oxide scales.


To better assess the accuracy and real-time performance of surface defect detection in industrial scenarios for hot-rolled steel strips, this study uses four evaluation metrics, namely, precision rate (P), recall rate (R), mean Average Precision (mAP), and Frames Per Second (FPS), to analyze the performance of the object detection model.

The number of samples where the predicted result is positive and the actual result is positive is denoted as True Positive (TP). The number of samples where the predicted result is positive and the actual result is negative is denoted as False Positive (FP). The number of samples where the predicted result is negative and the actual result is positive is denoted as False Negative (FN). P represents the probability of correct predictions among the positive samples detected by the model, and its calculation formula is
$P=\frac{TP}{TP+FP}.$
(14)
R represents the probability of correct predictions of positive samples by the model, and its calculation formula is
$R=\frac{TP}{TP+FN}.$
(15)

AP stands for the area under the precision–recall curve, which is an overall measure of recall and precision and is used to assess the detection performance of a model for a specific class. mAP is the mean average precision for each class, which is used to evaluate the comprehensive detection performance of a model for all classes.
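A minimal sketch of the AP computation follows, assuming each prediction has already been matched to ground truth at a fixed IoU threshold. It builds the precision–recall curve from Eqs. (14) and (15) at every confidence cutoff and integrates its monotone envelope (all-point interpolation); the function name and matching convention are illustrative.

```python
import numpy as np

def average_precision(scores, matches, num_gt):
    # Sort predictions by descending confidence; matches[i] is 1 when
    # prediction i matched an unclaimed ground-truth box, else 0.
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(matches, dtype=float)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    recall = tp_cum / num_gt                # Eq. (15) at each cutoff
    precision = tp_cum / (tp_cum + fp_cum)  # Eq. (14) at each cutoff
    # All-point interpolation: take the monotone precision envelope,
    # then sum the area under the precision-recall curve.
    mrec = np.concatenate(([0.0], recall))
    mpre = np.concatenate(([1.0], precision))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    return float(np.sum((mrec[1:] - mrec[:-1]) * mpre[1:]))

# mAP is the mean of the per-class AP values.
```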

FPS represents the number of images processed by a model per second and is used to measure the detection speed of the model.

Facing the challenges of surface defect detection in steel strips, this paper proposes TDB-YOLO based on YOLOv5, which employs multiple improvement strategies to enhance the accuracy of surface defect detection in steel strips. To tackle small defects on the surface of steel strips, a small object detection layer is employed to make the receptive field of the small object prediction layer more suitable for detecting small-scale defects. In terms of feature extraction, DC3 is utilized to enhance the model’s information interaction capability and extract more comprehensive multiscale features. In terms of feature fusion, the BiFPN is employed to avoid losing defect feature information and effectively fuse features from different levels through weighted fusion.

To verify the performance improvement of TDB-YOLO in detecting surface defects on hot-rolled steel strips, ablation experiments were designed to analyze and compare the improvement strategies. Model A represents YOLOv5; model B introduces a small object detection layer based on model A; model C utilizes the BiFPN for feature fusion based on model B; and model D represents TDB-YOLO, which incorporates the DC3 module for feature extraction based on model C. The trends of the regression box loss, confidence loss, and classification loss during training are shown in Fig. 8. As shown in Fig. 8(a), from model A to model D, the regression box loss decreases gradually, indicating improved accuracy of defect localization. As shown in Fig. 8(b), the small object detection layer in model B significantly reduces the confidence loss, improving the adaptability of the model to different scenarios; models C and D reduce it further. As shown in Fig. 8(c), model D has the lowest classification loss and thus the highest accuracy in target recognition. Therefore, the improvement strategies proposed in this paper are effective, and TDB-YOLO efficiently enhances the detection accuracy of surface defects in steel strips.

FIG. 8.

Trend chart of training loss. (a) Trend chart of regression box loss. (b) Trend chart of confidence loss. (c) Trend chart of classification loss.


To prevent overfitting, performance comparisons of ablative experiments were conducted on the test set, as shown in Table IV. Model B introduces the small object detection layer P2, which improves the model’s ability to detect small objects and enriches the feature information. This effectively solves the issue of missed detection in surface defects of steel strips, increasing the recall rate to 87.2%. Model C builds upon model B by incorporating the BiFPN to enhance the efficiency of feature information propagation, reducing the potential loss of feature information. This improves the model’s ability to fuse and express features, allowing the network to focus more on the feature information that is helpful for target defect detection. As a result, the model’s accuracy and overall detection precision are improved. Model D builds upon model C by incorporating the DC3 module to enhance the model’s ability to extract multiscale features. This improves the detection efficiency without compromising the detection speed, increasing the recall rate to 88.0% and the mAP to 90.4%.

TABLE IV.

Results of ablation experiments.

Model   P2   BiFPN   DC3   P (%)   R (%)   mAP (%)   FPS
A       ×    ×       ×     88.9    81.1    87.1      37
B       ✓    ×       ×     89.9    87.2    89.1      32
C       ✓    ✓       ×     90.6    84.2    89.4      33
D       ✓    ✓       ✓     90.3    88.0    90.4      33

Through comparative analysis of ablation experiments, it is found that model D with three proposed improvement strategies, namely, TDB-YOLO, shows the most significant improvement. TDB-YOLO effectively addresses the issues of missed detection of small targets, complex defect feature shapes, and unclear defect features. Compared to the original YOLOv5, it improves the overall detection performance by increasing the accuracy by 1.4% and the recall rate by 6.9%. This validates the effectiveness of TDB-YOLO for surface defect detection in steel strips.

To validate the performance of the proposed algorithm on surface defects of hot-rolled steel strips, TDB-YOLO was compared with mainstream object detection models. The trend of each model's mAP during training is shown in Fig. 9. TDB-YOLO performed best over 500 training epochs. YOLOv7 performed worst, with the slowest convergence, while Faster R-CNN converged fastest but reached an mAP below 80%. After epoch 300, TDB-YOLO's mAP remained consistently higher than those of the other models. In summary, the proposed TDB-YOLO algorithm offers higher accuracy and stability in surface defect detection of hot-rolled steel strips and outperforms other mainstream object detection models.

FIG. 9.

Different model mAP trend charts.


To visually validate the performance of different detection models, the detection results were visualized, as shown in Fig. 10. In the detection of roll mark defects, all models except TDB-YOLO and Faster R-CNN produce redundant detection boxes, wasting computational resources. In the detection of elastic deformation defects, Faster R-CNN fails to detect the defect and suffers from severe missed detections. In the detection of wave defects, the prediction boxes of Faster R-CNN are too large, which hinders the subsequent processing of detected defects. In the detection of oxide scale defects, YOLOv5 produces excessively large prediction boxes, while YOLOv7 detects only part of the defect. Such results are not conducive to precise defect analysis and quality control.

FIG. 10.

Detection results of different models.


To comprehensively evaluate the performance of different detection models, performance metrics for each model were calculated on the test set. The specific results are shown in Table V. TDB-YOLO achieved the highest values in recall rate and mAP, which are 88.0% and 90.4%, respectively. Compared with other detection models, a high recall rate means that TDB-YOLO has the lowest missed detection rate in surface defect detection of steel strips, and a high mAP means that TDB-YOLO overall performs best in terms of detection accuracy in surface defect detection of steel strips. In terms of precision, TDB-YOLO is only 0.5% lower than YOLOv8, but TDB-YOLO’s recall rate is 7.6% higher than that of YOLOv8, its mAP is 2.9% higher than that of YOLOv8, and its FPS is 5 higher than that of YOLOv8. Overall, TDB-YOLO performs better than YOLOv8 in surface defect detection of steel strips. In terms of detection speed, TDB-YOLO is slightly slower than YOLOv5 and YOLOv7. However, TDB-YOLO has better precision, recall rate, and mean average precision than these two detection models. Surface defect detection in steel strips places a higher emphasis on detection accuracy. Therefore, in terms of overall performance, TDB-YOLO performs better than YOLOv5 and YOLOv7 in surface defect detection of steel strips.

TABLE V.

Results of comparative experiments between different models.

Model          P (%)   R (%)   mAP (%)   FPS
YOLOv5         88.9    81.1    87.1      37
YOLOv7         82.1    79.3    78.4      39
YOLOv8         90.8    80.4    87.5      28
Faster R-CNN   67.4    85.4    86.0      16
TDB-YOLO       90.3    88.0    90.4      33

In conclusion, compared to other mainstream detection models, TDB-YOLO exhibits superior overall performance in surface defect detection of steel strips.

To better detect surface defects on steel strips, this paper takes the original surface images of steel strips from the production line as the research object, closely reproducing the on-site inspection conditions of the industrial setting. A defect detection method based on TDB-YOLO is proposed to detect four types of surface defects: elastic deformation, roll marks, waves, and oxide scales. In terms of image preprocessing, the method uses automatic gamma correction and Otsu thresholding to effectively remove redundant parts of the original image, improving detection accuracy. For small target defects, TDB-YOLO uses a small object detection layer with higher resolution and a focus on details, enriching the feature information and significantly improving the recall rate. In terms of feature extraction, TDB-YOLO uses DC3 to enhance the interaction between feature information, improving the model's ability to extract multiscale features and the quality of the feature information. In terms of feature fusion, TDB-YOLO uses the BiFPN to fuse deep-level and shallow-level feature information, improving the model's attention to useful feature information and the efficiency of feature information propagation.

Experimental results indicate that the proposed TDB-YOLO model achieves an accuracy of 90.3% and a recall rate of 88.0% in surface defect detection for steel strips, with a mean average precision (mAP) of 90.4% at 33 frames per second (FPS). The model can rapidly and accurately detect small, variably shaped, and low-contrast surface defects on steel strips. Compared to other detection models, TDB-YOLO has a significant advantage in overall performance, meeting the speed and accuracy requirements of surface defect detection in practical steel strip production. The experimental results also verify the model's adaptability to small, multiscale, and blurred targets, suggesting that it could accurately detect similar targets in other fields. For example, the morphology of rail defects resembles that of steel strip defects, so the method proposed in this paper may further improve the current detection accuracy of rail defects.

The main limitations of the proposed method are its precision deficit relative to YOLOv8, its reduced detection speed, and the limited validation of its generalizability. First, TDB-YOLO's precision on steel strip surface defects is lower than that of YOLOv8. In future research, we will attempt to improve precision further while maintaining the recall rate by using a decoupled-head strategy. Second, TDB-YOLO has higher network complexity and requires more computational resources than YOLOv5, which results in slower training and detection. We will attempt to use strategies such as network pruning and knowledge distillation to remove redundant parameters and compress the network structure while maintaining detection accuracy. Finally, the experiments in this paper cover few detection categories and a single detection scenario. To better validate the universality of the proposed method, we will expand the detection categories and verify the algorithm's performance in other application fields.

This research was sponsored by the Natural Science Foundation of China (Grant No. 61901068), the Chongqing Natural Science Foundation (Grant Nos. CSTB2023NSCQ-MSX0911 and cstc2021jcyj-msxmX0525), and the Chongqing Postgraduate Research and Innovation Project (Grant No. CYS23707).

The authors have no conflicts to disclose.

Zengzhen Mi: Data curation (equal); Formal analysis (equal); Project administration (equal); Visualization (equal); Writing – review & editing (equal). Yan Gao: Data curation (equal); Formal analysis (equal); Project administration (equal); Visualization (equal); Writing – original draft (equal). Xingyuan Xu: Data curation (equal). Jing Tang: Visualization (equal).

The data that support the findings of this study are available from the corresponding author upon reasonable request.

1. R. Liu, M. Huang, and P. Cao, "An end-to-end steel strip surface defects detection framework: Considering complex background interference," in 2021 33rd Chinese Control and Decision Conference (CCDC) (IEEE, 2021), pp. 317–322.
2. G. Qiu, Y. Kang, J. Tang et al., "Normal magnetizing-based eddy current testing method for surface crack and internal delamination of steel plate," in 2023 IEEE International Instrumentation and Measurement Technology Conference (I2MTC) (IEEE, 2023), pp. 1–6.
3. G. Wang, Q. Xiao, Z. Gao et al., "Multifrequency AC magnetic flux leakage testing for the detection of surface and backside defects in thick steel plates," IEEE Magn. Lett. 13, 8102105 (2022).
4. R. Usamentiaga Fernández, S. Sfarra, J. Fleuret et al., "Rail inspection using active thermography to detect rolled-in material," in QIRT 2018 Proceedings (2018).
5. B. Tang, L. Chen, W. Sun et al., "Review of surface defect detection of steel products based on machine vision," IET Image Process. 17(2), 303–322 (2023).
6. K. Liu, H. Wang, H. Chen et al., "Steel surface defect detection using a new Haar–Weibull-variance model in unsupervised manner," IEEE Trans. Instrum. Meas. 66(10), 2585–2596 (2017).
7. J. Wang, Q. Li, J. Gan et al., "Surface defect detection via entity sparsity pursuit with intrinsic priors," IEEE Trans. Ind. Inf. 16(1), 141–150 (2019).
8. M. W. Ashour, F. Khalid, A. Abdul Halin et al., "Surface defects classification of hot-rolled steel strips using multi-directional shearlet features," Arabian J. Sci. Eng. 44, 2925–2932 (2019).
9. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556.
10. A. R. Siyal, Z. Bhutto, S. Muhammad et al., "Still image-based human activity recognition with deep representations and residual learning," Int. J. Adv. Comput. Sci. Appl. 11(5), 471 (2020).
11. S. Zhang, Q. Zhang, J. Gu et al., "Visual inspection of steel surface defects based on domain adaptation and adaptive convolutional neural network," Mech. Syst. Signal Process. 153, 107541 (2021).
12. C. Zhao, X. Shu, X. Yan et al., "RDD-YOLO: A modified YOLO for detection of steel surface defects," Measurement 214, 112776 (2023).
13. L. Wang, X. Liu, J. Ma et al., "Real-time steel surface defect detection with improved multi-scale YOLO-v5," Processes 11(5), 1357 (2023).
14. C. C. Yeung and K. M. Lam, "Efficient fused-attention model for steel surface defect detection," IEEE Trans. Instrum. Meas. 71, 2510011 (2022).
15. X. Zhou, M. Wei, Q. Li et al., "Surface defect detection of steel strip with double pyramid network," Appl. Sci. 13(2), 1054 (2023).
16. Y. He, K. Song, Q. Meng, and Y. Yan, "An end-to-end steel surface defect detection approach via fusing multiple hierarchical features," IEEE Trans. Instrum. Meas. 69(4), 1493–1504 (2019).
17. Q. Luo and Y. He, "A cost-effective and automatic surface defect inspection system for hot-rolled flat steel," Robot. Comput.-Integr. Manuf. 38, 16–30 (2016).
18. Q. Luo, W. Jiang, J. Su et al., "Smoothing complete feature pyramid networks for roll mark detection of steel strips," Sensors 21(21), 7264 (2021).
19. W. Wu, H. Liu, L. Li et al., "Application of local fully convolutional neural network combined with YOLO v5 algorithm in small target detection of remote sensing image," PLoS One 16(10), e0259283 (2021).
20. H. Lou, X. Duan, J. Guo et al., "DC-YOLOv8: Small-size object detection algorithm based on camera sensor," Electronics 12(10), 2323 (2023).
21. M. Tan, R. Pang, and Q. V. Le, "EfficientDet: Scalable and efficient object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (IEEE, 2020), pp. 10781–10790.