A lightweight mechanism for vision-transformer-based object detection

Bibliographic Details
Main Authors:	Yanming Ye, Qiang Sun, Kailong Cheng, Xingfa Shen, Dongjing Wang
Format:	Article
Language:	English
Published:	Springer 2025-05-01
Series:	Complex & Intelligent Systems
Subjects:	Object detection; DETR; XFA; XFCOS; CNN-ViT
Online Access:	https://doi.org/10.1007/s40747-025-01904-x

Description
Summary:	DETR (DEtection TRansformer) is a CV model for object detection that replaces traditional complex methods with a Transformer architecture, and has achieved significant improvement over previous methods, particularly in handling small and medium-sized objects. However, the attention mechanism-based detection framework of DETR exhibits limitations in small and medium-sized object detection. It struggles to extract fine-grained details of small and medium-sized objects from low-resolution features, and its computational complexity increases significantly with the input scale, thereby constraining real-time detection efficiency. To address these limitations, we introduce the Cross Feature Attention (XFA) mechanism and propose XFCOS (XFA-based with FCOS), a novel object detection model built upon it. XFA simplifies the attention mechanism’s computational process and reduces complexity through L2 normalization and two one-dimensional convolutions applied in different directions. This design reduces the computational complexity from quadratic to linear while preserving spatial context awareness. XFCOS enhances the original TSP-FCOS (Transformer-based Set Prediction with FCOS) model by integrating XFA into the transformer encoder, creating a CNN-ViT hybrid architecture that significantly reduces computational costs without sacrificing accuracy. Extensive experiments demonstrate that XFCOS achieves state-of-the-art performance while addressing DETR’s convergence and efficiency limitations. On Pascal VOC 2007, XFCOS attains 54.7 AP and 60.7 AP$_{75}$, surpassing DETR by 4.5 AP and 7.9 AP$_{75}$ respectively and establishing new benchmarks among ResNet-50-based detectors. The model shows particular strength in small object detection, achieving 24.0 AP$_{S}$ and 43.9 AP$_{M}$ on COCO 2017, representing improvements of 3.3 AP$_{S}$ and 3.8 AP$_{M}$ over DETR. Through computational optimization, XFCOS reduces encoder FLOPs to 13.5G, a 17.2% decrease compared to TSP-FCOS’s 16.3G, while cutting activation memory from 285.78M to 264.64M, a reduction of 7.4%. This significantly enhances computational efficiency.
ISSN:	2199-4536 2198-6053
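
Note on the quadratic-to-linear claim in the summary: the record does not give XFA's actual formulation, so the following is only a minimal, generic sketch of one common way attention cost can drop from quadratic to linear in the number of tokens. It assumes a softmax-free attention with L2-normalized queries and keys, and it omits the two one-dimensional convolutions entirely. All function names here are hypothetical and are not taken from the paper. Without the softmax, (Q Kᵀ) V and Q (Kᵀ V) are mathematically identical, but the second ordering never builds the N × N score matrix.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def l2_normalize_rows(m, eps=1e-12):
    """Scale each row to (approximately) unit Euclidean length."""
    out = []
    for row in m:
        norm = math.sqrt(sum(x * x for x in row)) + eps
        out.append([x / norm for x in row])
    return out


def quadratic_attention(q, k, v):
    """(Q K^T) V: materializes an N x N score matrix."""
    return matmul(matmul(q, transpose(k)), v)


def linear_attention(q, k, v):
    """Q (K^T V): materializes only a d x d matrix."""
    return matmul(q, matmul(transpose(k), v))


def multiply_adds(n_tokens, dim):
    """Multiply-add counts for the (quadratic, linear) orderings."""
    return 2 * n_tokens * n_tokens * dim, 2 * n_tokens * dim * dim


# Illustrative sizes only (assumption): 16 tokens, 4 channels.
random.seed(0)
N, D = 16, 4
q = l2_normalize_rows([[random.gauss(0, 1) for _ in range(D)] for _ in range(N)])
k = l2_normalize_rows([[random.gauss(0, 1) for _ in range(D)] for _ in range(N)])
v = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]

out_quadratic = quadratic_attention(q, k, v)
out_linear = linear_attention(q, k, v)

# Largest element-wise gap between the two orderings (float rounding only).
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(out_quadratic, out_linear)
    for a, b in zip(row_a, row_b)
)
```

In this sketch, L2-normalizing the rows bounds every query-key dot product to [-1, 1], which is one reason a softmax-free form can be workable. Calling `multiply_adds(1000, 64)` gives 128,000,000 multiply-adds for the quadratic ordering against 8,192,000 for the linear one, and doubling the token count quadruples the former while only doubling the latter.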

Similar Items