Vision Transformers for Image Classification: A Comparative Survey

Transformers were initially introduced for natural language processing, leveraging the self-attention mechanism. They require minimal inductive biases in their design and can function effectively as set-based architectures. Additionally, transformers excel at capturing long-range dependencies and enabling parallel processing, which allows them to outperform traditional models, such as long short-term memory (LSTM) networks, on sequence-based tasks. In recent years, transformers have been widely adopted in computer vision, driving remarkable advancements in the field. Previous surveys have provided overviews of transformer applications across various computer vision tasks, such as object detection, activity recognition, and image enhancement. In this survey, we focus specifically on image classification. We begin with an introduction to the fundamental concepts of transformers and highlight the first successful Vision Transformer (ViT). Building on the ViT, we review subsequent improvements and optimizations introduced for image classification tasks. We then compare the strengths and limitations of these transformer-based models against classic convolutional neural networks (CNNs) through experiments. Finally, we explore key challenges and potential future directions for image classification transformers.


Bibliographic Details
Main Authors: Yaoli Wang, Yaojun Deng, Yuanjin Zheng, Pratik Chattopadhyay, Lipo Wang
Author Affiliations: Yaoli Wang (College of Electronic Information Engineering, Taiyuan University of Technology, Taiyuan 030600, China); Yaojun Deng, Yuanjin Zheng, and Lipo Wang (School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore); Pratik Chattopadhyay (Department of CSE, Indian Institute of Technology (BHU), Varanasi 221005, India)
Format: Article
Language: English
Published: MDPI AG, 2025-01-01
Series: Technologies
ISSN: 2227-7080
DOI: 10.3390/technologies13010032
Subjects: computer vision; pattern recognition; artificial intelligence; machine learning
Online Access: https://www.mdpi.com/2227-7080/13/1/32
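
For readers new to the topic, the self-attention-over-patches idea that the abstract refers to can be illustrated with a minimal sketch in Python/NumPy. This is not drawn from the surveyed paper; the patch size, embedding dimension, and function names below are assumptions chosen purely for illustration.

import numpy as np

def patchify(image, patch=16):
    # Split an (H, W, C) image into flattened non-overlapping patches: (num_patches, patch*patch*C).
    H, W, C = image.shape
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)

def self_attention(x, Wq, Wk, Wv):
    # x: (N, d) patch embeddings; Wq, Wk, Wv: (d, d) projection matrices (assumed shapes for the example).
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])            # pairwise patch-to-patch similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax over all patches
    return weights @ v                                 # each patch aggregates information from every other patch

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))                           # stand-in for an input image
tokens = patchify(img) @ rng.standard_normal((16 * 16 * 3, 64))    # linear patch embedding, d = 64
Wq, Wk, Wv = (rng.standard_normal((64, 64)) for _ in range(3))
print(self_attention(tokens, Wq, Wk, Wv).shape)                    # (196, 64): one updated embedding per patch

In a full ViT-style classifier this block would be extended with multi-head projections, residual connections, layer normalization, and a learned classification token, but the all-to-all patch interaction shown here is the core mechanism that gives transformers their long-range modeling capacity.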