Data augmentation for Arabic text classification: a review of current methods, challenges and prospective directions

Bibliographic Details
Main Authors: Samia F. Abdhood, Nazlia Omar, Sabrina Tiun
Format: Article
Language: English
Published: PeerJ Inc. 2025-03-01
Series: PeerJ Computer Science
Subjects:
Online Access: https://peerj.com/articles/cs-2685.pdf
collection DOAJ
description Data augmentation techniques, i.e., methods for artificially creating new data, have proven effective in many domains, from images to text. They were developed to address the scarcity of training data and class imbalance, and thereby improve classifier performance. This review article investigates data augmentation techniques for Arabic text, specifically for text classification. A thorough review was conducted to give a concise yet comprehensive understanding of these approaches in the context of Arabic classification. The article focuses on studies of data augmentation for Arabic text classification published from 2019 to 2024. Inclusion and exclusion criteria were applied to ensure a comprehensive view of these techniques in Arabic natural language processing (ANLP). Data augmentation research for Arabic text classification was found to be dominated by sentiment analysis and propaganda detection, with initial studies emerging in 2019; very few studies have investigated other tasks such as sarcasm detection or text categorization. We also observed a lack of benchmark datasets for these tasks. Most studies have focused on short texts, such as Twitter data or reviews, while long texts remain largely unexplored. In particular, data augmentation methods still need to be evaluated on long texts to determine whether techniques that are effective for short texts also apply to longer ones. Because of the unique characteristics of the Arabic language, a rigorous investigation and comparison of the most effective strategies is required. Such a comparison would give a better understanding of the processes involved in Arabic text classification and help in selecting the most suitable data augmentation methods for specific tasks. This review contributes valuable insights into Arabic NLP and enriches the existing body of knowledge.
id doaj-art-b6e28c68d6c24a02b3e75d26f213c65c
institution DOAJ
issn 2376-5992
spelling doaj-art-b6e28c68d6c24a02b3e75d26f213c65c | 2025-08-20T02:52:49Z | eng | PeerJ Inc. | PeerJ Computer Science | 2376-5992 | 2025-03-01 | 11 | e2685 | 10.7717/peerj-cs.2685 | Data augmentation for Arabic text classification: a review of current methods, challenges and prospective directions | Samia F. Abdhood, Nazlia Omar, Sabrina Tiun | Center for Artificial Intelligence Technology, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia | https://peerj.com/articles/cs-2685.pdf
topic Arabic language
Data augmentation
Data generation
Natural language processing
Text classification
Class imbalance