IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer

Transformers are becoming the dominant deep learning backbones for both computer vision and natural language processing. While extensive experiments demonstrate their outstanding ability at large model sizes, small transformers cannot match convolutional neural networks on various downstream tasks because they lack the inductive biases that benefit image understanding. Hierarchical vision transformers form a large family of designs that improve efficiency in computer vision, but obtaining global dependencies often makes their architecture complex. This paper proposes a non-hierarchical transformer network that, like hierarchical transformers, can capture both long-range and short-range dependencies, while preserving performance at small model sizes. First, we discard the framework of progressively reduced feature maps and instead design two separate stages, i.e., a multi-scale feature preparatory stage and a multi-scale feature perception stage: the first stage uses a lightweight multi-branch structure to extract multi-scale features, and the second stage leverages a non-hierarchical network to learn semantic information for downstream tasks. Second, we design a multi-receptive attention and interaction mechanism that perceives global and local correlations of the image in every transformer block, enabling effective feature learning in small-sized networks. Extensive experiments show that the proposed lightweight IMViT-B outperforms DeiT III: IMViT-B (300 epochs) achieves a top-1 accuracy of 82.8% on ImageNet-1K with only 26M parameters, surpassing DeiT III-S (800 epochs) by 1.4% with a similar number of parameters and computation cost. Code is available at https://github.com/LQchen1/IMViT.

Bibliographic Details
Main Authors: Qihao Chen, Yunfeng Yan, Xianbo Wang, Jishen Peng
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10849548/
_version_ 1832576769812594688
author Qihao Chen
Yunfeng Yan
Xianbo Wang
Jishen Peng
author_facet Qihao Chen
Yunfeng Yan
Xianbo Wang
Jishen Peng
author_sort Qihao Chen
collection DOAJ
description Transformers are becoming the dominant deep learning backbones for both computer vision and natural language processing. While extensive experiments demonstrate their outstanding ability at large model sizes, small transformers cannot match convolutional neural networks on various downstream tasks because they lack the inductive biases that benefit image understanding. Hierarchical vision transformers form a large family of designs that improve efficiency in computer vision, but obtaining global dependencies often makes their architecture complex. This paper proposes a non-hierarchical transformer network that, like hierarchical transformers, can capture both long-range and short-range dependencies, while preserving performance at small model sizes. First, we discard the framework of progressively reduced feature maps and instead design two separate stages, i.e., a multi-scale feature preparatory stage and a multi-scale feature perception stage: the first stage uses a lightweight multi-branch structure to extract multi-scale features, and the second stage leverages a non-hierarchical network to learn semantic information for downstream tasks. Second, we design a multi-receptive attention and interaction mechanism that perceives global and local correlations of the image in every transformer block, enabling effective feature learning in small-sized networks. Extensive experiments show that the proposed lightweight IMViT-B outperforms DeiT III: IMViT-B (300 epochs) achieves a top-1 accuracy of <inline-formula> <tex-math notation="LaTeX">$82.8~\%$ </tex-math></inline-formula> on ImageNet-1K with only 26M parameters, surpassing DeiT III-S (800 epochs) by 1.4% with a similar number of parameters and computation cost. Code is available at <uri>https://github.com/LQchen1/IMViT</uri>.
format Article
id doaj-art-8b6807b5098842d698938b8007ee61d4
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-8b6807b5098842d698938b8007ee61d4
2025-01-31T00:02:07Z | eng | IEEE | IEEE Access | 2169-3536 | 2025-01-01 | Vol. 13, pp. 18535-18545 | DOI 10.1109/ACCESS.2025.3532603 | 10849548
IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer
Qihao Chen (https://orcid.org/0009-0002-4644-147X), Electrical and Control Engineering, Liaoning Technical University, Huludao, Liaoning, China
Yunfeng Yan (https://orcid.org/0000-0002-0939-5526), College of Electrical Engineering, Zhejiang University, Hangzhou, Zhejiang, China
Xianbo Wang (https://orcid.org/0000-0002-0463-2983), Hainan Institute of Zhejiang University, Sanya, Hainan, China
Jishen Peng, Electrical and Control Engineering, Liaoning Technical University, Huludao, Liaoning, China
Transformers are becoming the dominant deep learning backbones for both computer vision and natural language processing. While extensive experiments demonstrate their outstanding ability at large model sizes, small transformers cannot match convolutional neural networks on various downstream tasks because they lack the inductive biases that benefit image understanding. Hierarchical vision transformers form a large family of designs that improve efficiency in computer vision, but obtaining global dependencies often makes their architecture complex. This paper proposes a non-hierarchical transformer network that, like hierarchical transformers, can capture both long-range and short-range dependencies, while preserving performance at small model sizes. First, we discard the framework of progressively reduced feature maps and instead design two separate stages, i.e., a multi-scale feature preparatory stage and a multi-scale feature perception stage: the first stage uses a lightweight multi-branch structure to extract multi-scale features, and the second stage leverages a non-hierarchical network to learn semantic information for downstream tasks. Second, we design a multi-receptive attention and interaction mechanism that perceives global and local correlations of the image in every transformer block, enabling effective feature learning in small-sized networks. Extensive experiments show that the proposed lightweight IMViT-B outperforms DeiT III: IMViT-B (300 epochs) achieves a top-1 accuracy of <inline-formula> <tex-math notation="LaTeX">$82.8~\%$ </tex-math></inline-formula> on ImageNet-1K with only 26M parameters, surpassing DeiT III-S (800 epochs) by 1.4% with a similar number of parameters and computation cost. Code is available at <uri>https://github.com/LQchen1/IMViT</uri>.
https://ieeexplore.ieee.org/document/10849548/
Image classification
non-hierarchical vision transformer
mask self-attention
spellingShingle Qihao Chen
Yunfeng Yan
Xianbo Wang
Jishen Peng
IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer
IEEE Access
Image classification
non-hierarchical vision transformer
mask self-attention
title IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer
title_full IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer
title_fullStr IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer
title_full_unstemmed IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer
title_short IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer
title_sort imvit adjacency matrix based lightweight plain vision transformer
topic Image classification
non-hierarchical vision transformer
mask self-attention
url https://ieeexplore.ieee.org/document/10849548/
work_keys_str_mv AT qihaochen imvitadjacencymatrixbasedlightweightplainvisiontransformer
AT yunfengyan imvitadjacencymatrixbasedlightweightplainvisiontransformer
AT xianbowang imvitadjacencymatrixbasedlightweightplainvisiontransformer
AT jishenpeng imvitadjacencymatrixbasedlightweightplainvisiontransformer