IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer

Transformers are becoming the dominant deep learning backbones for both computer vision and natural language processing. While extensive experiments demonstrate their outstanding ability at large model sizes, small transformers cannot match convolutional neural networks on various downstream tasks because they lack the inductive biases that benefit image understanding. Hierarchical vision transformers form a large family of designs that improve efficiency in computer vision, but obtaining global dependencies often makes their architecture complex. This paper proposes a non-hierarchical transformer network that, like hierarchical transformers, can capture both long-range and short-range dependencies, while preserving performance at small model sizes. First, we discard the framework of progressively reduced feature maps and instead design two separate stages, i.e., a multi-scale feature preparatory stage and a multi-scale feature perception stage: the first stage uses a lightweight multi-branch structure to extract multi-scale features, and the second stage leverages a non-hierarchical network to learn semantic information for downstream tasks. Second, we design a multi-receptive attention and interaction mechanism that perceives global and local correlations of the image in every transformer block, enabling effective feature learning in small-sized networks. Extensive experiments show that the proposed lightweight IMViT-B outperforms DeiT III: IMViT-B (300 epochs) achieves a top-1 accuracy of 82.8% on ImageNet-1K with only 26M parameters, surpassing DeiT III-S (800 epochs) by 1.4% with a similar number of parameters and computation cost. Code is available at https://github.com/LQchen1/IMViT.

Bibliographic Details
Main Authors: Qihao Chen, Yunfeng Yan, Xianbo Wang, Jishen Peng
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10849548/
_version_ 1832576769812594688
author Qihao Chen
Yunfeng Yan
Xianbo Wang
Jishen Peng
author_facet Qihao Chen
Yunfeng Yan
Xianbo Wang
Jishen Peng
author_sort Qihao Chen
collection DOAJ
description Transformers are becoming the dominant deep learning backbones for both computer vision and natural language processing. While extensive experiments demonstrate their outstanding ability at large model sizes, small transformers cannot match convolutional neural networks on various downstream tasks because they lack the inductive biases that benefit image understanding. Hierarchical vision transformers form a large family of designs that improve efficiency in computer vision, but obtaining global dependencies often makes their architecture complex. This paper proposes a non-hierarchical transformer network that, like hierarchical transformers, can capture both long-range and short-range dependencies, while preserving performance at small model sizes. First, we discard the framework of progressively reduced feature maps and instead design two separate stages, i.e., a multi-scale feature preparatory stage and a multi-scale feature perception stage: the first stage uses a lightweight multi-branch structure to extract multi-scale features, and the second stage leverages a non-hierarchical network to learn semantic information for downstream tasks. Second, we design a multi-receptive attention and interaction mechanism that perceives global and local correlations of the image in every transformer block, enabling effective feature learning in small-sized networks. Extensive experiments show that the proposed lightweight IMViT-B outperforms DeiT III: IMViT-B (300 epochs) achieves a top-1 accuracy of <inline-formula> <tex-math notation="LaTeX">$82.8~\%$ </tex-math></inline-formula> on ImageNet-1K with only 26M parameters, surpassing DeiT III-S (800 epochs) by 1.4% with a similar number of parameters and computation cost. Code is available at <uri>https://github.com/LQchen1/IMViT</uri>.
format Article
id doaj-art-8b6807b5098842d698938b8007ee61d4
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-8b6807b5098842d698938b8007ee61d4
2025-01-31T00:02:07Z | eng | IEEE | IEEE Access | 2169-3536 | 2025-01-01 | Vol. 13, pp. 18535-18545 | DOI 10.1109/ACCESS.2025.3532603 | 10849548
IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer
Qihao Chen (https://orcid.org/0009-0002-4644-147X), Electrical and Control Engineering, Liaoning Technical University, Huludao, Liaoning, China
Yunfeng Yan (https://orcid.org/0000-0002-0939-5526), College of Electrical Engineering, Zhejiang University, Hangzhou, Zhejiang, China
Xianbo Wang (https://orcid.org/0000-0002-0463-2983), Hainan Institute of Zhejiang University, Sanya, Hainan, China
Jishen Peng, Electrical and Control Engineering, Liaoning Technical University, Huludao, Liaoning, China
Transformers are becoming the dominant deep learning backbones for both computer vision and natural language processing. While extensive experiments demonstrate their outstanding ability at large model sizes, small transformers cannot match convolutional neural networks on various downstream tasks because they lack the inductive biases that benefit image understanding. Hierarchical vision transformers form a large family of designs that improve efficiency in computer vision, but obtaining global dependencies often makes their architecture complex. This paper proposes a non-hierarchical transformer network that, like hierarchical transformers, can capture both long-range and short-range dependencies, while preserving performance at small model sizes. First, we discard the framework of progressively reduced feature maps and instead design two separate stages, i.e., a multi-scale feature preparatory stage and a multi-scale feature perception stage: the first stage uses a lightweight multi-branch structure to extract multi-scale features, and the second stage leverages a non-hierarchical network to learn semantic information for downstream tasks. Second, we design a multi-receptive attention and interaction mechanism that perceives global and local correlations of the image in every transformer block, enabling effective feature learning in small-sized networks. Extensive experiments show that the proposed lightweight IMViT-B outperforms DeiT III: IMViT-B (300 epochs) achieves a top-1 accuracy of <inline-formula> <tex-math notation="LaTeX">$82.8~\%$ </tex-math></inline-formula> on ImageNet-1K with only 26M parameters, surpassing DeiT III-S (800 epochs) by 1.4% with a similar number of parameters and computation cost. Code is available at <uri>https://github.com/LQchen1/IMViT</uri>.
https://ieeexplore.ieee.org/document/10849548/
Image classification
non-hierarchical vision transformer
mask self-attention
spellingShingle Qihao Chen
Yunfeng Yan
Xianbo Wang
Jishen Peng
IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer
IEEE Access
Image classification
non-hierarchical vision transformer
mask self-attention
title IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer
title_full IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer
title_fullStr IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer
title_full_unstemmed IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer
title_short IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer
title_sort imvit adjacency matrix based lightweight plain vision transformer
topic Image classification
non-hierarchical vision transformer
mask self-attention
url https://ieeexplore.ieee.org/document/10849548/
work_keys_str_mv AT qihaochen imvitadjacencymatrixbasedlightweightplainvisiontransformer
AT yunfengyan imvitadjacencymatrixbasedlightweightplainvisiontransformer
AT xianbowang imvitadjacencymatrixbasedlightweightplainvisiontransformer
AT jishenpeng imvitadjacencymatrixbasedlightweightplainvisiontransformer