IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer
Transformers have become dominant deep learning backbones for both computer vision and natural language processing. While extensive experiments prove their outstanding ability at large model sizes, small transformers are not competitive with convolutional neural networks on various downstream tasks...
Saved in:
Main Authors: | Qihao Chen, Yunfeng Yan, Xianbo Wang, Jishen Peng |
---|---|
Format: | Article |
Language: | English |
Published: |
IEEE
2025-01-01
|
Series: | IEEE Access |
Subjects: | Image classification; non-hierarchical vision transformer; mask self-attention |
Online Access: | https://ieeexplore.ieee.org/document/10849548/ |
Tags: |
|
_version_ | 1832576769812594688 |
---|---|
author | Qihao Chen; Yunfeng Yan; Xianbo Wang; Jishen Peng |
author_facet | Qihao Chen; Yunfeng Yan; Xianbo Wang; Jishen Peng |
author_sort | Qihao Chen |
collection | DOAJ |
description | Transformers have become dominant deep learning backbones for both computer vision and natural language processing. While extensive experiments prove their outstanding ability at large model sizes, small transformers are not competitive with convolutional neural networks on various downstream tasks because they lack the inductive biases that benefit image understanding. Hierarchical vision transformers form a large family aimed at better efficiency in computer vision, but their designs are often complex in order to capture global dependencies. This paper proposes a non-hierarchical transformer network that captures both long-range and short-range dependencies, as hierarchical transformers do, while retaining strong performance at small model sizes. First, we discard the framework of progressively reduced feature maps and instead design two separate stages, i.e., a multi-scale feature preparatory stage and a multi-scale feature perception stage: the first uses a lightweight multi-branch structure to extract multi-scale features, and the second leverages a non-hierarchical network to learn semantic information for downstream tasks. Second, we design a multi-receptive attention and interaction mechanism that perceives global and local correlations of the image in every transformer block, enabling effective feature learning in small networks. Extensive experiments show that the proposed lightweight IMViT-B outperforms DeiT III: IMViT-B (300 epochs) achieves a top-1 accuracy of <inline-formula> <tex-math notation="LaTeX">$82.8~\%$ </tex-math></inline-formula> on ImageNet-1K with only 26M parameters, surpassing DeiT III-S (800 epochs) by 1.4% with a similar number of parameters and computation cost. Codes are available at <uri>https://github.com/LQchen1/IMViT</uri>. |
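The record's title and subject terms indicate that the paper's attention mechanism masks self-attention with an adjacency matrix. The record itself does not describe the exact mechanism, so the following is only an illustrative NumPy sketch of the general technique: single-head self-attention in which a binary adjacency matrix restricts which token pairs may attend to each other. All names here (`masked_self_attention`, `adj`) are hypothetical, not from the paper.

```python
import numpy as np

def masked_self_attention(x, w_q, w_k, w_v, adj):
    """Single-head self-attention where a binary adjacency matrix
    restricts which token pairs may attend to each other."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    scores = (q @ k.T) / np.sqrt(d)            # (N, N) raw attention logits
    scores = np.where(adj > 0, scores, -1e9)   # mask out non-adjacent pairs
    # numerically stable softmax over each row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy example: 4 tokens, adjacency = self plus immediate neighbors
# (a banded matrix, i.e., a purely local receptive field).
N, d = 4, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((N, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
adj = np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1)
out = masked_self_attention(x, w_q, w_k, w_v, adj)
print(out.shape)  # (4, 8)
```

Changing the pattern of ones in `adj` trades off local and global context: a full matrix of ones recovers ordinary global self-attention, while sparser patterns confine each token to a chosen neighborhood.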
format | Article |
id | doaj-art-8b6807b5098842d698938b8007ee61d4 |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-8b6807b5098842d698938b8007ee61d4 | 2025-01-31T00:02:07Z | eng | IEEE | IEEE Access | 2169-3536 | 2025-01-01 | vol. 13, pp. 18535-18545 | doi:10.1109/ACCESS.2025.3532603 | 10849548 | IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer | Qihao Chen (https://orcid.org/0009-0002-4644-147X), Yunfeng Yan (https://orcid.org/0000-0002-0939-5526), Xianbo Wang (https://orcid.org/0000-0002-0463-2983), Jishen Peng | Electrical and Control Engineering, Liaoning Technical University, Huludao, Liaoning, China; College of Electrical Engineering, Zhejiang University, Hangzhou, Zhejiang, China; Hainan Institute of Zhejiang University, Sanya, Hainan, China; Electrical and Control Engineering, Liaoning Technical University, Huludao, Liaoning, China | https://ieeexplore.ieee.org/document/10849548/ | Image classification; non-hierarchical vision transformer; mask self-attention |
spellingShingle | Qihao Chen Yunfeng Yan Xianbo Wang Jishen Peng IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer IEEE Access Image classification non-hierarchical vision transformer mask self-attention |
title | IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer |
title_full | IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer |
title_fullStr | IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer |
title_full_unstemmed | IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer |
title_short | IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer |
title_sort | imvit adjacency matrix based lightweight plain vision transformer |
topic | Image classification non-hierarchical vision transformer mask self-attention |
url | https://ieeexplore.ieee.org/document/10849548/ |
work_keys_str_mv | AT qihaochen imvitadjacencymatrixbasedlightweightplainvisiontransformer AT yunfengyan imvitadjacencymatrixbasedlightweightplainvisiontransformer AT xianbowang imvitadjacencymatrixbasedlightweightplainvisiontransformer AT jishenpeng imvitadjacencymatrixbasedlightweightplainvisiontransformer |