Vision Transformers for Image Classification: A Comparative Survey
Transformers were initially introduced for natural language processing, leveraging the self-attention mechanism. They require minimal inductive biases in their design and can function effectively as set-based architectures. Additionally, transformers excel at capturing long-range dependencies and enabling parallel processing, which allows them to outperform traditional models, such as long short-term memory (LSTM) networks, on sequence-based tasks. In recent years, transformers have been widely adopted in computer vision, driving remarkable advancements in the field. Previous surveys have provided overviews of transformer applications across various computer vision tasks, such as object detection, activity recognition, and image enhancement. In this survey, we focus specifically on image classification. We begin with an introduction to the fundamental concepts of transformers and highlight the first successful Vision Transformer (ViT). Building on the ViT, we review subsequent improvements and optimizations introduced for image classification tasks. We then compare the strengths and limitations of these transformer-based models against classic convolutional neural networks (CNNs) through experiments. Finally, we explore key challenges and potential future directions for image classification transformers.
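The abstract centers on the self-attention mechanism underlying the surveyed Vision Transformers. As a quick illustration (not taken from the paper), the sketch below computes scaled dot-product self-attention over a sequence of ViT-style patch embeddings in plain NumPy; the random projection matrices, the patch count (196 tokens for a 224×224 image with 16×16 patches), and the embedding size of 64 are illustrative assumptions, whereas a real model would use learned weights, multiple heads, and positional embeddings.

```python
# Minimal sketch (illustrative, not the paper's implementation):
# scaled dot-product self-attention over ViT-style patch embeddings.
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """x: (num_patches, dim) patch embeddings; returns attended features."""
    dim = x.shape[-1]
    rng = np.random.default_rng(0)
    # Learned projections in a real model; random matrices here for illustration.
    w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every patch attends to every other patch: long-range dependencies,
    # computed as one matrix product rather than step-by-step recurrence.
    scores = q @ k.T / np.sqrt(dim)                      # (N, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ v                                   # (N, dim)

# A 224x224 image split into 16x16 patches yields 14 * 14 = 196 tokens.
patches = np.random.default_rng(1).standard_normal((196, 64))
print(self_attention(patches).shape)  # (196, 64)
```

Because the attention weights come from a single matrix product over all patches, every token can attend to every other token in parallel, which is the long-range, parallelizable behaviour the abstract contrasts with LSTM-style recurrence.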
Main Authors: | Yaoli Wang, Yaojun Deng, Yuanjin Zheng, Pratik Chattopadhyay, Lipo Wang |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2025-01-01 |
Series: | Technologies |
Subjects: | computer vision; pattern recognition; artificial intelligence; machine learning |
Online Access: | https://www.mdpi.com/2227-7080/13/1/32 |
_version_ | 1832587419739750400 |
---|---|
author | Yaoli Wang; Yaojun Deng; Yuanjin Zheng; Pratik Chattopadhyay; Lipo Wang |
author_sort | Yaoli Wang |
collection | DOAJ |
description | Transformers were initially introduced for natural language processing, leveraging the self-attention mechanism. They require minimal inductive biases in their design and can function effectively as set-based architectures. Additionally, transformers excel at capturing long-range dependencies and enabling parallel processing, which allows them to outperform traditional models, such as long short-term memory (LSTM) networks, on sequence-based tasks. In recent years, transformers have been widely adopted in computer vision, driving remarkable advancements in the field. Previous surveys have provided overviews of transformer applications across various computer vision tasks, such as object detection, activity recognition, and image enhancement. In this survey, we focus specifically on image classification. We begin with an introduction to the fundamental concepts of transformers and highlight the first successful Vision Transformer (ViT). Building on the ViT, we review subsequent improvements and optimizations introduced for image classification tasks. We then compare the strengths and limitations of these transformer-based models against classic convolutional neural networks (CNNs) through experiments. Finally, we explore key challenges and potential future directions for image classification transformers. |
format | Article |
id | doaj-art-3f3bfc97abbc4f649feedcd2599f41a4 |
institution | Kabale University |
issn | 2227-7080 |
language | English |
publishDate | 2025-01-01 |
publisher | MDPI AG |
record_format | Article |
series | Technologies |
spelling | doaj-art-3f3bfc97abbc4f649feedcd2599f41a4; record timestamp 2025-01-24T13:50:48Z; Technologies 13(1), 32 (2025-01-01); DOI 10.3390/technologies13010032. Author affiliations: Yaoli Wang, College of Electronic Information Engineering, Taiyuan University of Technology, Taiyuan 030600, China; Yaojun Deng, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore; Yuanjin Zheng, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore; Pratik Chattopadhyay, Department of CSE, Indian Institute of Technology (BHU), Varanasi 221005, India; Lipo Wang, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore |
title | Vision Transformers for Image Classification: A Comparative Survey |
topic | computer vision; pattern recognition; artificial intelligence; machine learning |
url | https://www.mdpi.com/2227-7080/13/1/32 |