Building Blocks for Robust and Effective Semi-Supervised Real-World Object Detection
By Moussa Kassem Sbeyti and Nadja Klein, posted on March 27, 2025
What is the paper about?
Object detection models require vast amounts of labeled data, but acquiring such datasets is expensive and labor-intensive. Semi-supervised object detection (SSOD) addresses this issue by leveraging both labeled and unlabeled data. However, real-world applications suffer from three major challenges:
(1) class imbalance leading to poor generalization,
(2) noisy or incorrect ground truth labels, and
(3) missing or inaccurate pseudo-labels that propagate errors.
We investigate these challenges and propose four novel techniques (Fig. 1) to enhance the robustness and effectiveness of SSOD frameworks:
Rare Class Collage (RCC): A data augmentation method that increases rare class representation by creating collages of rare objects.
Rare Class Focus (RCF): A batch sampling strategy that ensures rare classes are consistently represented in training.
Ground Truth Label Correction (GLC): A refinement method that identifies and corrects false, missing, and noisy ground truth labels using teacher model consistency.
Pseudo-Label Selection (PLS): A filtering approach that removes low-quality pseudo-labeled images based on missing detection rates while considering class rarity.
Figure 1: Our building blocks integrated into an exemplary semi-supervised object detection (SSOD) framework. The teacher model \(M_T\), trained on labeled data, generates pseudo-labels for unlabeled data, which are then filtered by a confidence threshold \(\delta_S\). To address class imbalance, Rare Class Collage (RCC) crops instances of rare classes and combines them into collages, increasing their representation. Rare Class Focus (RCF) ensures each training batch contains common and rare classes, with augmented rare class images to boost their impact. Ground Truth Label Correction (GLC) corrects false, missing, and noisy labels by utilizing teacher prediction consistency across augmentations. Pseudo-Label Selection (PLS) removes pseudo-labeled images with many missing detections, estimated using our metric \(D_i(\delta_S, \beta)\), which incorporates detection confidence and class rarity. Together, our methods enhance the ability of the student model \(M_S(\delta_S)\) to learn effectively from both labeled and pseudo-labeled data, minimizing the propagation of errors from the teacher model.
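To make the pipeline in Figure 1 concrete, below is a minimal, framework-agnostic sketch of the pseudo-labeling step, assuming a hypothetical `teacher_predict` function that returns boxes, scores, and class labels. It illustrates the general SSOD flow, not the paper's implementation.

```python
# Minimal sketch of the pseudo-labeling step in Fig. 1 (illustrative only).
# `teacher_predict` is a hypothetical stand-in for detector-specific inference.

def generate_pseudo_labels(teacher_predict, unlabeled_images, delta_s=0.9):
    """Run the teacher on unlabeled images and keep detections scoring >= delta_s."""
    pseudo_labeled = []
    for image in unlabeled_images:
        boxes, scores, classes = teacher_predict(image)
        keep = [i for i, s in enumerate(scores) if s >= delta_s]
        pseudo_labeled.append({
            "image": image,
            "boxes": [boxes[i] for i in keep],
            "classes": [classes[i] for i in keep],
        })
    return pseudo_labeled
```

The student is then trained on the union of labeled images and these filtered pseudo-labeled images; RCC, RCF, GLC, and PLS plug in before and after this step.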
Motivation
While pseudo-labeling is a powerful tool in SSOD, its effectiveness is limited by the quality of labeled and pseudo-labeled data. Many real-world datasets exhibit class imbalance (Fig. 2), where common objects dominate training data, leaving rare classes poorly represented. Furthermore, label noise (Fig. 3), including incorrect or missing ground truth labels, compromises model learning. Finally, pseudo-labels generated by the teacher model are often incomplete or incorrect (Fig. 4), particularly for underrepresented classes, leading to error propagation.
Addressing these challenges is critical for making SSOD frameworks applicable in practical settings such as autonomous driving and industrial inspection.
Figure 2: KITTI (left, 10% labeled) and BDD (right, 1% labeled). Class frequency for each class in both datasets.
Figure 3: Examples of noisy box labels in KITTI. Human annotators often fail to accurately label occluded and truncated objects, especially in crowded scenes.
Figure 4: KITTI (left), BDD (right). For each dataset, the left image shows the highest \(S_i\), while the right image shows the lowest \(S_i\) (poor detection quality). Predictions (in green) are filtered at \(\delta_S = 0.9\). \(M_D\) stands for missing detections.
Theory
Enhancing Class Representation
Rare Class Collage (RCC)
Instead of re-sampling entire images, RCC crops rare class objects and composes them into collages, ensuring rare classes are sufficiently represented during training.
This prevents overfitting to common classes and improves generalization.
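As a rough illustration of the idea (not the paper's exact augmentation), the sketch below pastes rare-class crops into a grid collage; crop extraction, resizing, and label bookkeeping are simplified.

```python
import random
import numpy as np

def rare_class_collage(crops, canvas_size=(512, 512), grid=(2, 2)):
    """Assemble rare-class crops into a grid collage (simplified sketch).

    `crops`: list of (H, W, 3) uint8 arrays cut out from rare-class boxes.
    Returns the collage image and the boxes of the pasted crops.
    """
    canvas = np.zeros((*canvas_size, 3), dtype=np.uint8)
    cell_h, cell_w = canvas_size[0] // grid[0], canvas_size[1] // grid[1]
    boxes = []
    for row in range(grid[0]):
        for col in range(grid[1]):
            crop = random.choice(crops)
            h, w = min(crop.shape[0], cell_h), min(crop.shape[1], cell_w)
            y0, x0 = row * cell_h, col * cell_w
            canvas[y0:y0 + h, x0:x0 + w] = crop[:h, :w]   # paste into the grid cell
            boxes.append((x0, y0, x0 + w, y0 + h))
    return canvas, boxes
```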
Rare Class Focus (RCF)
RCF modifies batch composition so that every training batch contains at least one rare class instance.
Compared to standard random batch sampling, this guarantees that rare objects are consistently included during training.
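A minimal sketch of such a sampler is shown below, assuming a hypothetical `rare_image_ids` list marking images that contain at least one rare-class instance; the exact batch composition differs from the paper.

```python
import random

def rare_class_focus_batches(image_ids, rare_image_ids, batch_size, num_batches):
    """Yield batches that each contain at least one rare-class image (sketch)."""
    for _ in range(num_batches):
        anchor = random.choice(rare_image_ids)                   # guaranteed rare-class image
        pool = [i for i in image_ids if i != anchor]
        batch = [anchor] + random.sample(pool, batch_size - 1)   # fill up the batch
        random.shuffle(batch)
        yield batch
```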
Improving Label Quality
Ground Truth Label Correction (GLC)
GLC refines labels by leveraging inference-time augmentation and teacher model consistency.
This improves the quality of ground truth labels, which are often false, missing, or noisy.
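The sketch below illustrates one simplified way to realize this: keep teacher boxes that reappear across augmented views and add those that no ground-truth box covers. The IoU thresholds, voting rule, and the handling of spurious or noisy existing labels are simplifying assumptions, not the paper's exact procedure.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def consistent_teacher_boxes(predictions_per_view, min_views=2, iou_thr=0.5):
    """Keep boxes from the first view that reappear in at least `min_views`
    augmented views; a simplified consistency criterion."""
    consistent = []
    for box in predictions_per_view[0]:
        votes = 1 + sum(any(iou(box, other) >= iou_thr for other in view)
                        for view in predictions_per_view[1:])
        if votes >= min_views:
            consistent.append(box)
    return consistent

def correct_ground_truth(gt_boxes, consistent_boxes, iou_thr=0.5):
    """Add consistent teacher boxes that no ground-truth box covers (missing labels);
    refining or removing existing labels is omitted for brevity."""
    corrected = list(gt_boxes)
    for box in consistent_boxes:
        if not any(iou(box, gt) >= iou_thr for gt in gt_boxes):
            corrected.append(box)
    return corrected
```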
Pseudo-Label Selection (PLS)
PLS introduces a class-aware missing detection rate metric, ensuring only pseudo-labeled images with valuable learning signals are retained.
Compared to traditional confidence-based pseudo-label filtering, this accounts for class rarity and missing objects.
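As a hypothetical instantiation (the paper's metric \(D_i(\delta_S, \beta)\) differs in detail), the sketch below scores an image by the rarity-weighted fraction of teacher detections that fall below the student threshold \(\delta_S\) but above a lower confidence bound \(\beta\), and drops images whose score is too high.

```python
def missing_detection_score(scores, classes, rarity_weight, delta_s=0.9, beta=0.5):
    """Rarity-weighted missing-detection score for one pseudo-labeled image (sketch).

    Detections the teacher finds with confidence in [beta, delta_s) are treated as
    likely missing from the pseudo-labels; rare classes are weighted more heavily.
    """
    kept = sum(1 for s in scores if s >= delta_s)
    missing = sum(rarity_weight[c] for s, c in zip(scores, classes)
                  if beta <= s < delta_s)
    return missing / (kept + missing + 1e-9)

def select_pseudo_labeled_images(image_scores, max_score=0.3):
    """Keep pseudo-labeled images whose missing-detection score is low enough.

    `image_scores`: list of (image, score) pairs computed with the function above.
    """
    return [img for img, score in image_scores if score <= max_score]
```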
Experiments
We conducted experiments on autonomous driving datasets (KITTI and BDD100K) using an EfficientDet-D0-based SSOD framework. Our key results are as follows.
RCC and RCF significantly improve rare class detection without negatively impacting common class performance.
GLC effectively corrects false and missing labels, boosting accuracy in challenging datasets.
PLS reduces error propagation by filtering out low-quality pseudo-labels, enhancing student model performance.
Combining all four building blocks yields up to a 6% improvement in SSOD performance compared to baseline models.
Conclusion
SSOD is a promising approach for reducing reliance on extensive labeled datasets. However, real-world deployments require methods that address class imbalance, improve label quality, and refine pseudo-label selection. Our paper investigates the effects of these challenges on SSOD and proposes four modules, RCC, RCF, GLC, and PLS, that provide effective solutions with minimal computational overhead. These techniques can be seamlessly integrated into existing SSOD pipelines, making them applicable across diverse object detection tasks.
Our work demonstrates that data-centric enhancements are crucial for making SSOD viable for real-world applications.
The code is available here.
References
Kihyuk Sohn et al. A simple semi-supervised learning framework for object detection. arXiv preprint arXiv:2005.04757, 2020.
Mingxing Tan, Ruoming Pang, and Quoc V. Le. EfficientDet: Scalable and efficient object detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10778–10787, 2020.
Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving dataset for heterogeneous multitask learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2636–2645, 2020.
Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361, 2012.
For questions, comments or other matters related to this blog post, please contact us via kleinlab@scc.kit.edu.
If you find our work useful, please cite our paper:
@article{KasKleNowSivAlb2025,
title={Building Blocks for Robust and Effective Semi-Supervised Real-World Object Detection},
author={Moussa Kassem~Sbeyti and Nadja Klein and Azarm Nowzad and Fikret Sivrikaya and Sahin Albayrak},
journal={Transactions on Machine Learning Research},
year={2025}
}