Mix-layers semantic extraction and multi-scale aggregation transformer for semantic segmentation

Abstract Recently, a number of vision transformer models for semantic segmentation have been proposed, with the majority of these achieving impressive results. However, they lack the ability to exploit the intrinsic position and channel features of the image and are less capable of multi-scale feature fusion.

Full description

Saved in:
Bibliographic Details
Main Authors: Tianping Li, Xiaolong Yang, Zhenyi Zhang, Zhaotong Cui, Zhou Maoxia
Format: Article
Language:English
Published: Springer 2024-11-01
Series:Complex & Intelligent Systems
Subjects:
Online Access:https://doi.org/10.1007/s40747-024-01650-6
author Tianping Li
Xiaolong Yang
Zhenyi Zhang
Zhaotong Cui
Zhou Maoxia
collection DOAJ
description Abstract Recently, a number of vision transformer models for semantic segmentation have been proposed, with the majority achieving impressive results. However, they lack the ability to exploit the intrinsic position and channel features of the image and are less capable of multi-scale feature fusion. This paper presents a semantic segmentation method that combines attention with multi-scale representation, enhancing both performance and efficiency. A mix-layers semantic extraction and multi-scale aggregation transformer decoder (MEMAFormer) is proposed, consisting of two components: a mix-layers dual-channel semantic extraction module (MDCE) and a semantic aggregation pyramid pooling module (SAPPM). The MDCE incorporates a multi-layer cross-attention module (MCAM) and an efficient channel attention module (ECAM). In MCAM, horizontal connections between encoder and decoder stages serve as feature queries for the attention module, while the hierarchical feature maps derived from different encoder and decoder stages are integrated into the keys and values. To address long-term dependencies, ECAM selectively emphasizes interdependent channel feature maps by integrating relevant features across all channels. Pyramid pooling reduces the adaptability of the feature maps, cutting computation without compromising performance. SAPPM comprises several distinct pooling kernels that extract context with a deeper flow of information, forming a multi-scale feature by integrating various feature sizes. The MEMAFormer-B0 model demonstrates superior performance compared to SegFormer-B0, exhibiting gains of 4.8%, 4.0% and 3.5% on the ADE20K, Cityscapes and COCO-stuff datasets, respectively.
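The abstract describes MCAM as cross-attention in which queries come from encoder-decoder horizontal connections while keys and values come from hierarchical feature maps. As an illustrative sketch only (not the authors' implementation; all names, shapes, and toy inputs here are hypothetical, and real models operate on GPU tensors with learned projections), single-head scaled dot-product cross-attention can be written in pure Python as:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: list of d-dim vectors (e.g. tokens from a horizontal
             encoder-decoder connection)
    keys, values: lists of vectors (e.g. tokens from hierarchical
             feature maps); returns one attended vector per query.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        attended = [sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))]
        out.append(attended)
    return out

# Toy example: 2 query tokens, 3 key/value tokens, dimension 4.
q = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
k = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = cross_attention(q, k, v)
```

Each query ends up with a convex combination of the value vectors, weighted toward the keys it most resembles; this is the mechanism by which skip-connection queries select relevant features from the hierarchical maps.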
format Article
id doaj-art-cbd44f27c47d4139b42105b1d087189c
institution Kabale University
issn 2199-4536
2198-6053
language English
publishDate 2024-11-01
publisher Springer
record_format Article
series Complex & Intelligent Systems
affiliation School of Physics and Electronics, Shandong Normal University (all five authors)
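The abstract describes SAPPM as pooling the same features with several distinct kernel sizes and fusing the results into a multi-scale representation. A minimal pure-Python sketch of that general pyramid-pooling idea (hypothetical shapes and names; SAPPM additionally applies learned aggregation layers that are not reproduced here) could look like:

```python
def avg_pool(grid, k):
    """Average-pool a 2D grid (list of lists) with a k x k window and
    stride k. Assumes both grid dimensions are divisible by k."""
    h, w = len(grid), len(grid[0])
    pooled = []
    for i in range(0, h, k):
        row = []
        for j in range(0, w, k):
            window = [grid[i + di][j + dj]
                      for di in range(k) for dj in range(k)]
            row.append(sum(window) / (k * k))
        pooled.append(row)
    return pooled

def pyramid_pool(grid, kernel_sizes=(1, 2, 4)):
    """Pool the same feature map at several scales and concatenate the
    flattened results into one multi-scale descriptor."""
    descriptor = []
    for k in kernel_sizes:
        pooled = avg_pool(grid, k)
        descriptor.extend(v for row in pooled for v in row)
    return descriptor

# Toy 4x4 feature map: values 0..15.
fmap = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
desc = pyramid_pool(fmap)  # 16 + 4 + 1 = 21 values
```

The coarsest kernel contributes a single global average (broad context), the finest preserves local detail, and concatenating the scales yields the kind of multi-scale feature the module fuses.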
title Mix-layers semantic extraction and multi-scale aggregation transformer for semantic segmentation
topic Semantic segmentation
MEMAFormer
MDCE
SAPPM
url https://doi.org/10.1007/s40747-024-01650-6