A lightweight mechanism for vision-transformer-based object detection
| Main Authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer, 2025-05-01 |
| Series: | Complex & Intelligent Systems |
| Subjects: | |
| Online Access: | https://doi.org/10.1007/s40747-025-01904-x |
| Summary: | DETR (DEtection TRansformer) is a computer-vision model for object detection that replaces traditional, complex detection pipelines with a Transformer architecture and has achieved significant improvements over previous methods. However, the attention-mechanism-based detection framework of DETR exhibits limitations in small and medium-sized object detection: it struggles to extract fine-grained details of small and medium-sized objects from low-resolution features, and its computational complexity increases significantly with the input scale, constraining real-time detection efficiency. To address these limitations, we introduce the Cross Feature Attention (XFA) mechanism and propose XFCOS (XFA-based with FCOS), a novel object detection model built upon it. XFA simplifies the attention computation and reduces its complexity through L2 normalization and two one-dimensional convolutions applied along different directions. This design reduces the computational complexity from quadratic to linear while preserving spatial context awareness. XFCOS enhances the original TSP-FCOS (Transformer-based Set Prediction with FCOS) model by integrating XFA into the transformer encoder, creating a CNN-ViT hybrid architecture that significantly reduces computational cost without sacrificing accuracy. Extensive experiments demonstrate that XFCOS achieves state-of-the-art performance while addressing DETR's convergence and efficiency limitations. On Pascal VOC 2007, XFCOS attains 54.7 AP and 60.7 AP$_{75}$, surpassing DETR by 4.5 AP and 7.9 AP$_{75}$ respectively and establishing new benchmarks among ResNet-50-based detectors. The model shows particular strength in small and medium-sized object detection, achieving 24.0 AP$_{\mathrm{S}}$ and 43.9 AP$_{\mathrm{M}}$ on COCO 2017, improvements of 3.3 AP$_{\mathrm{S}}$ and 3.8 AP$_{\mathrm{M}}$ over DETR. Through computational optimization, XFCOS reduces encoder FLOPs to 13.5G, a 17.2% decrease from TSP-FCOS's 16.3G, while cutting activation memory from 285.78M to 264.64M, a 7.4% reduction, significantly enhancing computational efficiency. |
| ISSN: | 2199-4536, 2198-6053 |
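
The abstract names the ingredients of XFA (L2 normalization and two one-dimensional convolutions in different directions) but not its exact formulation, which is only given in the full paper. Purely as an illustration of how those ingredients yield a linear-complexity attention block, the PyTorch sketch below combines L2-normalized queries and keys, the associativity trick that avoids the N x N attention matrix, and two directional depthwise 1D convolutions over the feature grid. All names (`XFASketch`, `conv_h`, `conv_w`), the kernel size, the depthwise grouping, and the way the convolution outputs are merged are assumptions made for this sketch, not the paper's actual implementation.

```python
# Minimal sketch of a linear-complexity cross-feature-attention block.
# Hypothetical names and design choices; not the paper's XFA implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class XFASketch(nn.Module):
    """Linear attention with L2-normalized Q/K plus two directional 1D convolutions."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Depthwise 1D convolutions along the two spatial directions,
        # re-injecting local spatial context after the global mixing step.
        self.conv_h = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.conv_w = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) flattened feature map with N = h * w tokens.
        b, n, c = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            return t.reshape(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = map(split_heads, (q, k, v))   # each (B, H, N, d)
        q = F.normalize(q, dim=-1)              # L2 normalization in place of softmax scaling
        k = F.normalize(k, dim=-1)

        # Associativity trick: (Q K^T) V -> Q (K^T V), cost O(N d^2) instead of O(N^2 d).
        context = k.transpose(-2, -1) @ v       # (B, H, d, d)
        out = q @ context                       # (B, H, N, d)
        out = out.transpose(1, 2).reshape(b, n, c)

        # Two directional 1D convolutions over the h x w grid.
        grid = out.transpose(1, 2).reshape(b, c, h, w)
        rows = grid.permute(0, 2, 1, 3).reshape(b * h, c, w)        # conv along width
        rows = self.conv_w(rows).reshape(b, h, c, w).permute(0, 2, 1, 3)
        cols = grid.permute(0, 3, 1, 2).reshape(b * w, c, h)        # conv along height
        cols = self.conv_h(cols).reshape(b, w, c, h).permute(0, 2, 3, 1)

        out = (rows + cols).reshape(b, c, n).transpose(1, 2)        # back to (B, N, C)
        return self.proj(out)


# Example usage on a 32x32 feature map with 256 channels (hypothetical sizes).
x = torch.randn(2, 32 * 32, 256)
block = XFASketch(dim=256, num_heads=8)
y = block(x, h=32, w=32)
print(y.shape)  # torch.Size([2, 1024, 256])
```

The step `q @ (k.T @ v)` is what removes the N x N token-to-token matrix, giving the quadratic-to-linear reduction the abstract describes, while the two directional convolutions are one plausible way to keep spatial context awareness after that global mixing.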