An Automatic Approach for the Identification of Offensive Language in Perso-Arabic Urdu Language: Dataset Creation and Evaluation

Bibliographic Details
Main Authors: Salah Ud Din, Shah Khusro, Farman Ali Khan, Munir Ahmad, Oualid Ali, Taher M. Ghazal
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
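Subjects: Offensive language identification; Urdu language dataset; OLID taxonomy; machine learning classifiers; cyberbullying; hate speech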
Online Access: https://ieeexplore.ieee.org/document/10854428/
_version_ 1832575561784885248
author Salah Ud Din
Shah Khusro
Farman Ali Khan
Munir Ahmad
Oualid Ali
Taher M. Ghazal
collection DOAJ
description Offensive language is unacceptable, impolite language directed at individuals, specific community groups, or society as a whole. With the advent of social media platforms, offensive language has been widely reported, creating a toxic online environment that poses real-life dangers to society. A prompt response to offensive content is therefore needed to foster a culture of respect and acceptance. However, identifying offensive language remains challenging, particularly in low-resource languages such as Urdu. Urdu text is difficult because of its unique features, complex script, and rich morphology, so methods that work in other languages are hard to apply directly. The task also requires exploring new linguistic features and computational techniques on a relatively large dataset so that the results generalize. Unfortunately, Urdu has received very limited attention from the research community owing to the scarcity of language resources and the lack of high-quality datasets and models. This study addresses these challenges in three steps. First, a dataset of 12,020 Urdu tweets was collected and annotated using the OLID taxonomy as a benchmark. Second, character-level and word-level features were extracted based on bag-of-words, n-gram, and TF-IDF representations. Finally, an extensive series of experiments was conducted on the extracted features using seven machine learning classifiers to identify the most effective features and classifiers. The experimental findings indicate that word unigrams, character trigrams, and word TF-IDF are the most prominent features. Among the classifiers, logistic regression and support vector machine attained the highest accuracy of 86% and F1-score of 75%.
format Article
id doaj-art-021350a1d1d1427b9750b56df45bfabe
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-021350a1d1d1427b9750b56df45bfabe
2025-01-31T23:05:24Z
eng
IEEE
IEEE Access
2169-3536
2025-01-01
13
19755
19769
10.1109/ACCESS.2025.3534662
10854428
An Automatic Approach for the Identification of Offensive Language in Perso-Arabic Urdu Language: Dataset Creation and Evaluation
Salah Ud Din 0 https://orcid.org/0009-0008-3803-9839
Shah Khusro 1 https://orcid.org/0000-0002-7734-7243
Farman Ali Khan 2 https://orcid.org/0000-0002-1748-3016
Munir Ahmad 3
Oualid Ali 4
Taher M. Ghazal 5 https://orcid.org/0000-0003-0672-7924
Department of Computer Science, University of Peshawar, Peshawar, Pakistan
Department of Computer Science, University of Peshawar, Peshawar, Pakistan
Department of Computer Science, COMSATS University Islamabad, Attock Campus, Attock, Pakistan
Department of Computer Sciences, National College of Business Administration and Economics, Lahore, Pakistan
College of Arts and Science, Applied Science University, Manama, Kingdom of Bahrain
Department of Networks and Cybersecurity, Hourani Center for Applied Scientific Research, Al-Ahliyya Amman University, Amman, Jordan
https://ieeexplore.ieee.org/document/10854428/
Offensive language identification
Urdu language dataset
OLID taxonomy
machine learning classifiers
cyberbullying
hate speech
title An Automatic Approach for the Identification of Offensive Language in Perso-Arabic Urdu Language: Dataset Creation and Evaluation
topic Offensive language identification
Urdu language dataset
OLID taxonomy
machine learning classifiers
cyberbullying
hate speech
url https://ieeexplore.ieee.org/document/10854428/
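
The description field names word-level and character-level n-grams and a TF-IDF representation as the extracted features. The sketch below shows what those three feature types look like in plain Python. It is an illustration only, not the authors' implementation: the function names (word_ngrams, char_ngrams, tfidf), the whitespace tokenization, and the specific TF-IDF variant (relative term frequency times log(N / df)) are all assumptions, and the sample strings are placeholders rather than tweets from the dataset.

```python
import math
from collections import Counter


def word_ngrams(text, n=1):
    """Return word n-grams from whitespace-tokenized text (assumed tokenizer)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n=3):
    """Return character n-grams over the raw string, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf(docs_terms):
    """Compute TF-IDF weights for a list of term lists.

    Assumed variant: tf = count / document length, idf = log(N / df).
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs_terms)
    df = Counter()
    for terms in docs_terms:
        df.update(set(terms))  # count each term once per document
    weights = []
    for terms in docs_terms:
        counts = Counter(terms)
        total = len(terms)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Placeholder documents, not samples from the Urdu dataset.
docs = ["alpha beta gamma", "alpha beta delta", "alpha epsilon"]
unigram_docs = [word_ngrams(d, 1) for d in docs]
weights = tfidf(unigram_docs)
```

Because Python strings are sequences of Unicode code points, the same functions accept Perso-Arabic text without modification, although real Urdu preprocessing would likely need normalization and a proper tokenizer. A term that occurs in every document (here "alpha") receives weight zero under this variant, which is why TF-IDF tends to downweight very common words.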