FoT: an efficient transformer framework for real-time small object detection in football videos

Abstract Football videos are playing an increasingly important role in event analysis and tactical evaluation within computer vision. Traditional object detection methods, relying on region proposals and anchor generation, struggle to balance real-time performance and accuracy in complex scenarios s...

Bibliographic Details
Main Authors: Wentao Zhang, Yaocong Yang
Format: Article
Language: English
Published: Nature Portfolio 2025-08-01
Series: Scientific Reports
Subjects: Football video analysis; Object detection; Transformer; Real-time detection
Online Access: https://doi.org/10.1038/s41598-025-16795-8
author Wentao Zhang
Yaocong Yang
collection DOAJ
description Abstract Football videos play an increasingly important role in event analysis and tactical evaluation within computer vision. Traditional object detection methods, which rely on region proposals and anchor generation, struggle to balance real-time performance and accuracy in complex scenarios involving multiple views, motion blur, and small objects. Meanwhile, Transformer-based methods have difficulty capturing fine-grained target information because of their high computational cost and slow training convergence. To address these problems, we propose a novel end-to-end detection framework, Football Transformer (FoT). By introducing the Local Interaction Aggregation Unit (LIAU) and the Multi-Scale Feature Interaction Module (MFIM), FoT balances global semantic expression and local detail capture efficiently. Specifically, LIAU reduces the computational complexity of self-attention from $O(N^2)$ to $O(N)$ through feature aggregation within local windows and a window offset mechanism. MFIM integrates low-level details with high-level semantics through multi-scale feature alignment and progressive fusion, significantly improving small object detection performance. Experimental results show that FoT achieves a 3.0% mAP improvement over the best baseline on the Soccer-Det dataset and a 1.3% gain on the FIFA-Vid dataset while maintaining real-time inference speed. These results validate the effectiveness and robustness of the proposed method in complex football video scenarios.
format Article
id doaj-art-03e532a19db34b71a8f7c656ebff052d
institution Kabale University
issn 2045-2322
language English
publishDate 2025-08-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-03e532a19db34b71a8f7c656ebff052d; 2025-08-24T11:22:42Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2025-08-01; 151114; 10.1038/s41598-025-16795-8; FoT: an efficient transformer framework for real-time small object detection in football videos; Wentao Zhang [0]: School of Physical Education, Shandong University; Yaocong Yang [1]: School of Physical Education, Shandong University; https://doi.org/10.1038/s41598-025-16795-8; Football video analysis; Object detection; Transformer; Real-time detection
title FoT: an efficient transformer framework for real-time small object detection in football videos
topic Football video analysis
Object detection
Transformer
Real-time detection
url https://doi.org/10.1038/s41598-025-16795-8
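
The abstract states that LIAU brings self-attention cost down from $O(N^2)$ to $O(N)$ by aggregating features inside local windows and applying a window offset. This record does not describe LIAU's actual design, so the following is only a minimal sketch of generic shifted-window attention on a 1-D token list, not the authors' implementation. The function name `window_attention`, its parameters, and the use of the raw tokens as queries, keys, and values (no learned projections, no masking of wrapped-around windows) are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def window_attention(tokens, window, offset=0):
    """Self-attention restricted to non-overlapping windows (illustrative only).

    tokens: list of N feature vectors (lists of floats, equal length).
    window: window size w; N must be divisible by w.
    offset: cyclic shift applied before windowing and undone afterwards,
            so that windows straddle the borders of unshifted windows.
    Returns (outputs, score_entries), where score_entries counts the
    query-key scores that were computed.
    """
    n = len(tokens)
    if n % window != 0:
        raise ValueError("len(tokens) must be divisible by window")
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    offset %= n

    # Cyclic shift: shifted[i] == tokens[(i + offset) % n].
    shifted = tokens[offset:] + tokens[:offset]

    out = []
    score_entries = 0
    for start in range(0, n, window):
        block = shifted[start:start + window]
        for q in block:
            # Assumption: queries, keys and values are the tokens themselves.
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in block]
            weights = softmax(scores)
            score_entries += len(scores)
            out.append([
                sum(w * v[d] for w, v in zip(weights, block))
                for d in range(dim)
            ])

    # Undo the cyclic shift so outputs line up with the input order.
    restored = out[-offset:] + out[:-offset] if offset else out
    return restored, score_entries


if __name__ == "__main__":
    tokens = [[float(i), 1.0] for i in range(16)]
    _, local_cost = window_attention(tokens, window=4)
    _, full_cost = window_attention(tokens, window=16)
    print(local_cost, full_cost)
```

With a fixed window size w, the number of scores is N·w, which grows linearly in N, whereas a single window covering all tokens computes N² scores. In the usage example, 16 tokens with w = 4 need 64 scores against 256 for full attention. Alternating `offset=0` and a non-zero offset across layers is one common way to let information cross window borders.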