EMSFomer: Efficient Multi-Scale Transformer for Real-Time Semantic Segmentation
Transformer-based models have achieved impressive performance in semantic segmentation in recent years. However, the multi-head self-attention mechanism in Transformers incurs significant computational overhead and becomes impractical for real-time applications due to its high complexity and large latency. Numerous attention variants have been proposed to address this issue, yet overall performance and inference speed remain limited. In this paper, we propose an efficient multi-scale Transformer (EMSFormer) that employs learnable keys and values based on the single-head attention mechanism and a dual-resolution structure for real-time semantic segmentation. Specifically, we propose multi-scale single-head attention (MS-SHA) to effectively learn multi-scale attention and improve feature representation capability. In addition, we introduce cross-resolution single-head attention (CR-SHA) to efficiently fuse the global context-rich features from the low-resolution branch into the features of the high-resolution branch. Experimental results show that our proposed method achieves state-of-the-art performance with real-time inference speed on the ADE20K, Cityscapes, and CamVid datasets.
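The two mechanisms named in the abstract lend themselves to a short sketch. Below is a minimal PyTorch illustration of (a) single-head attention with learnable key/value tokens, which cuts the attention cost from O(N²) to O(N·T) for T learned tokens, and (b) cross-resolution attention in which high-resolution queries attend to low-resolution, context-rich keys and values. All class names, token counts, shapes, and initialization choices here are assumptions made for illustration; this is not the authors' EMSFormer implementation.

```python
# Hedged sketch of the two attention ideas described in the abstract.
# Everything below (module names, num_tokens, init scale) is assumed,
# not taken from the paper.
import torch
import torch.nn as nn

class LearnableKVAttention(nn.Module):
    """Single-head attention whose keys/values are learnable tokens
    rather than projections of the input, so cost is O(N*T), not O(N^2)."""
    def __init__(self, dim: int, num_tokens: int = 64):
        super().__init__()
        self.scale = dim ** -0.5
        self.q = nn.Linear(dim, dim)
        # Learnable key/value tokens shared by all spatial positions.
        self.k = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.v = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) flattened spatial tokens.
        q = self.q(x)                             # (B, N, C)
        attn = (q @ self.k.t()) * self.scale      # (B, N, T)
        attn = attn.softmax(dim=-1)
        return self.proj(attn @ self.v)           # (B, N, C)

class CrossResolutionAttention(nn.Module):
    """Queries from the high-resolution branch attend to keys/values
    from the low-resolution branch (a guess at CR-SHA's general shape)."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)

    def forward(self, hi: torch.Tensor, lo: torch.Tensor) -> torch.Tensor:
        # hi: (B, N_hi, C), lo: (B, N_lo, C) with N_lo << N_hi.
        q = self.q(hi)
        k, v = self.kv(lo).chunk(2, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, N_hi, N_lo)
        return attn.softmax(dim=-1) @ v                # (B, N_hi, C)

# Usage: fuse low-resolution global context into the high-resolution branch.
x_hi = torch.randn(1, 128 * 128, 64)   # high-resolution tokens
x_lo = torch.randn(1, 32 * 32, 64)     # low-resolution tokens
y = LearnableKVAttention(64)(x_hi)             # linear in N, not quadratic
z = CrossResolutionAttention(64)(x_hi, x_lo)   # hi-res queries, lo-res context
```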
Main Authors: | Zhengyu Xia, Joohee Kim |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
Subjects: | Cross-resolution attention; multi-scale attention; real-time semantic segmentation; transformers |
Online Access: | https://ieeexplore.ieee.org/document/10852306/ |
author | Zhengyu Xia (ORCID: 0000-0001-5225-5580); Joohee Kim (ORCID: 0000-0001-8833-0319) |
---|---|
author_affiliation | Illinois Institute of Technology, Chicago, IL, USA |
collection | DOAJ |
doi | 10.1109/ACCESS.2025.3534117 |
format | Article |
id | doaj-art-cca8af057e314e33bbd356854cc63b4e |
institution | Kabale University |
issn | 2169-3536 |
language | English |
pages | 18239-18252 |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
title | EMSFomer: Efficient Multi-Scale Transformer for Real-Time Semantic Segmentation |
topic | Cross-resolution attention; multi-scale attention; real-time semantic segmentation; transformers |
url | https://ieeexplore.ieee.org/document/10852306/ |
volume | 13 |