EMSFomer: Efficient Multi-Scale Transformer for Real-Time Semantic Segmentation
Transformer-based models have achieved impressive performance in semantic segmentation in recent years. However, the multi-head self-attention mechanism in Transformers incurs significant computational overhead and becomes impractical for real-time applications due to its high complexity and large latency. Numerous attention variants have been proposed to address this issue, yet overall performance and inference speed remain limited. In this paper, we propose an efficient multi-scale Transformer (EMSFormer) that employs learnable keys and values based on the single-head attention mechanism and a dual-resolution structure for real-time semantic segmentation. Specifically, we propose multi-scale single-head attention (MS-SHA) to effectively learn multi-scale attention and improve feature representation capability. In addition, we introduce cross-resolution single-head attention (CR-SHA) to efficiently fuse the global context-rich features from the low-resolution branch into the features of the high-resolution branch. Experimental results show that our proposed method achieves state-of-the-art performance with real-time inference speed on the ADE20K, Cityscapes, and CamVid datasets.
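The two mechanisms named in the abstract lend themselves to a short sketch. Below is a minimal PyTorch illustration of (a) single-head attention with learnable key/value tokens, which cuts the attention cost from O(N²) to O(N·T) for T learned tokens, and (b) cross-resolution attention in which high-resolution queries attend to low-resolution, context-rich keys and values. All class names, token counts, shapes, and initialization choices here are assumptions made for illustration; this is not the authors' EMSFormer implementation.

```python
# Hedged sketch of the two attention ideas described in the abstract.
# Everything below (module names, num_tokens, init scale) is assumed,
# not taken from the paper.
import torch
import torch.nn as nn

class LearnableKVAttention(nn.Module):
    """Single-head attention whose keys/values are learnable tokens
    rather than projections of the input, so cost is O(N*T), not O(N^2)."""
    def __init__(self, dim: int, num_tokens: int = 64):
        super().__init__()
        self.scale = dim ** -0.5
        self.q = nn.Linear(dim, dim)
        # Learnable key/value tokens shared by all spatial positions.
        self.k = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.v = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) flattened spatial tokens.
        q = self.q(x)                             # (B, N, C)
        attn = (q @ self.k.t()) * self.scale      # (B, N, T)
        attn = attn.softmax(dim=-1)
        return self.proj(attn @ self.v)           # (B, N, C)

class CrossResolutionAttention(nn.Module):
    """Queries from the high-resolution branch attend to keys/values
    from the low-resolution branch (a guess at CR-SHA's general shape)."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)

    def forward(self, hi: torch.Tensor, lo: torch.Tensor) -> torch.Tensor:
        # hi: (B, N_hi, C), lo: (B, N_lo, C) with N_lo << N_hi.
        q = self.q(hi)
        k, v = self.kv(lo).chunk(2, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, N_hi, N_lo)
        return attn.softmax(dim=-1) @ v                # (B, N_hi, C)

# Usage: fuse low-resolution global context into the high-resolution branch.
x_hi = torch.randn(1, 128 * 128, 64)   # high-resolution tokens
x_lo = torch.randn(1, 32 * 32, 64)     # low-resolution tokens
y = LearnableKVAttention(64)(x_hi)             # linear in N, not quadratic
z = CrossResolutionAttention(64)(x_hi, x_lo)   # hi-res queries, lo-res context
```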
Main Authors: | Zhengyu Xia, Joohee Kim |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
Subjects: | Cross-resolution attention; multi-scale attention; real-time semantic segmentation; transformers |
Online Access: | https://ieeexplore.ieee.org/document/10852306/ |
author | Zhengyu Xia (ORCID: 0000-0001-5225-5580); Joohee Kim (ORCID: 0000-0001-8833-0319) |
---|---|
author_affiliation | Illinois Institute of Technology, Chicago, IL, USA |
collection | DOAJ |
doi | 10.1109/ACCESS.2025.3534117 |
format | Article |
id | doaj-art-cca8af057e314e33bbd356854cc63b4e |
institution | Kabale University |
issn | 2169-3536 |
language | English |
pages | 18239-18252 |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
title | EMSFomer: Efficient Multi-Scale Transformer for Real-Time Semantic Segmentation |
topic | Cross-resolution attention; multi-scale attention; real-time semantic segmentation; transformers |
url | https://ieeexplore.ieee.org/document/10852306/ |
volume | 13 |