EMSFormer: Efficient Multi-Scale Transformer for Real-Time Semantic Segmentation

Transformer-based models have achieved impressive performance in semantic segmentation in recent years. However, the multi-head self-attention mechanism in Transformers incurs significant computational overhead and becomes impractical for real-time applications due to its high complexity and large latency. Numerous attention variants have been proposed to address this issue, yet their overall performance and inference speed remain limited. In this paper, we propose an efficient multi-scale Transformer (EMSFormer) that employs learnable keys and values based on the single-head attention mechanism and a dual-resolution structure for real-time semantic segmentation. Specifically, we propose multi-scale single-head attention (MS-SHA) to effectively learn multi-scale attention and improve feature representation capability. In addition, we introduce cross-resolution single-head attention (CR-SHA) to efficiently fuse the global context-rich features from the low-resolution branch into the features of the high-resolution branch. Experimental results show that the proposed method achieves state-of-the-art performance with real-time inference speed on the ADE20K, Cityscapes, and CamVid datasets.
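
The core idea named in the abstract, single-head attention computed against a small set of learnable keys and values rather than keys and values projected from the input, can be illustrated with a short sketch. The code below is a reader's minimal PyTorch-style illustration, not the authors' implementation, and it does not reproduce the paper's MS-SHA or CR-SHA modules; the module name, the number of learnable key/value tokens, and the projection layers are assumptions made for this example.

import torch
import torch.nn as nn

class LearnableKVSingleHeadAttention(nn.Module):
    # Hypothetical module: queries come from the input features, while the
    # keys and values are learnable parameters of fixed size, so attention
    # cost grows linearly with the number of input tokens.
    def __init__(self, dim: int, num_kv_tokens: int = 64):
        super().__init__()
        self.scale = dim ** -0.5
        self.q_proj = nn.Linear(dim, dim)                                  # query projection of the input
        self.keys = nn.Parameter(torch.randn(num_kv_tokens, dim) * 0.02)   # learnable keys
        self.values = nn.Parameter(torch.randn(num_kv_tokens, dim) * 0.02) # learnable values
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim), e.g. flattened H*W spatial features
        q = self.q_proj(x)                                                  # (B, N, dim)
        attn = torch.softmax(q @ self.keys.t() * self.scale, dim=-1)        # (B, N, M)
        out = attn @ self.values                                            # (B, N, dim)
        return self.out_proj(out)

# Example usage with dummy feature maps flattened to (batch, H*W, channels).
feats = torch.randn(2, 128 * 64, 96)
block = LearnableKVSingleHeadAttention(dim=96)
print(block(feats).shape)  # torch.Size([2, 8192, 96])

Because the learnable key/value set has a fixed size that is much smaller than the number of pixels, the attention cost in this sketch grows linearly with spatial resolution rather than quadratically, which is consistent with the real-time goal stated in the abstract.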

Bibliographic Details
Main Authors: Zhengyu Xia, Joohee Kim
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access
Subjects: Cross-resolution attention; Multi-scale attention; Real-time semantic segmentation; Transformers
Online Access: https://ieeexplore.ieee.org/document/10852306/
ORCID: Zhengyu Xia: https://orcid.org/0000-0001-5225-5580; Joohee Kim: https://orcid.org/0000-0001-8833-0319
Author Affiliations: Illinois Institute of Technology, Chicago, IL, USA (both authors)
ISSN: 2169-3536
Citation: IEEE Access, vol. 13, pp. 18239-18252, 2025
DOI: 10.1109/ACCESS.2025.3534117
IEEE Xplore Document ID: 10852306