An Automatic Approach for the Identification of Offensive Language in Perso-Arabic Urdu Language: Dataset Creation and Evaluation
Offensive language is a type of unacceptable language that is impolite amongst individuals, specific community groups, and society as well. With the advent of various social media platforms, offensive language usage has been widely reported, thus developing a toxic online environment that has real-l...
Main Authors: | Salah Ud Din, Shah Khusro, Farman Ali Khan, Munir Ahmad, Oualid Ali, Taher M. Ghazal |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
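Subjects: | Offensive language identification; Urdu language dataset; OLID taxonomy; machine learning classifiers; cyberbullying; hate speech |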
Online Access: | https://ieeexplore.ieee.org/document/10854428/ |
_version_ | 1832575561784885248 |
---|---|
author | Salah Ud Din; Shah Khusro; Farman Ali Khan; Munir Ahmad; Oualid Ali; Taher M. Ghazal |
author_facet | Salah Ud Din; Shah Khusro; Farman Ali Khan; Munir Ahmad; Oualid Ali; Taher M. Ghazal |
author_sort | Salah Ud Din |
collection | DOAJ |
description | Offensive language is unacceptable language that is impolite toward individuals, specific community groups, and society as a whole. With the advent of social media platforms, offensive language has been widely reported, creating a toxic online environment that poses real-life dangers to society. Therefore, to foster a culture of respect and acceptance, a prompt response is needed to combat offensive content. However, identifying offensive language remains a challenging task, particularly in low-resource languages such as Urdu. Urdu text poses challenges because of its unique features, complex script, and rich morphology, so methods that work in other languages are difficult to apply directly. It also requires exploring new linguistic features and computational techniques on a relatively large dataset to ensure that the results generalize. Unfortunately, Urdu has received very limited attention from the research community because of the scarcity of language resources and the lack of high-quality datasets and models. This study addresses those challenges in three steps. First, a dataset of 12,020 Urdu tweets was collected and annotated using the OLID taxonomy as a benchmark. Second, character-level and word-level features were extracted based on bag-of-words, n-gram, and TF-IDF representations. Finally, an extensive series of experiments was conducted on the extracted features using seven machine learning classifiers to identify the most effective features and classifiers. The experimental findings indicate that word unigrams, character trigrams, and word TF-IDF are the most effective features. Among the classifiers, logistic regression and support vector machine attained the highest accuracy of 86% and F1-score of 75%. |
format | Article |
id | doaj-art-021350a1d1d1427b9750b56df45bfabe |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-021350a1d1d1427b9750b56df45bfabe; 2025-01-31T23:05:24Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13, pp. 19755-19769; DOI 10.1109/ACCESS.2025.3534662; document no. 10854428; An Automatic Approach for the Identification of Offensive Language in Perso-Arabic Urdu Language: Dataset Creation and Evaluation; Salah Ud Din (https://orcid.org/0009-0008-3803-9839), Department of Computer Science, University of Peshawar, Peshawar, Pakistan; Shah Khusro (https://orcid.org/0000-0002-7734-7243), Department of Computer Science, University of Peshawar, Peshawar, Pakistan; Farman Ali Khan (https://orcid.org/0000-0002-1748-3016), Department of Computer Science, COMSATS University Islamabad, Attock Campus, Attock, Pakistan; Munir Ahmad, Department of Computer Sciences, National College of Business Administration and Economics, Lahore, Pakistan; Oualid Ali, College of Arts and Science, Applied Science University, Manama, Kingdom of Bahrain; Taher M. Ghazal (https://orcid.org/0000-0003-0672-7924), Department of Networks and Cybersecurity, Hourani Center for Applied Scientific Research, Al-Ahliyya Amman University, Amman, Jordan; https://ieeexplore.ieee.org/document/10854428/; Offensive language identification; Urdu language dataset; OLID taxonomy; machine learning classifiers; cyberbullying; hate speech |
title | An Automatic Approach for the Identification of Offensive Language in Perso-Arabic Urdu Language: Dataset Creation and Evaluation |
topic | Offensive language identification; Urdu language dataset; OLID taxonomy; machine learning classifiers; cyberbullying; hate speech |
url | https://ieeexplore.ieee.org/document/10854428/ |
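
The description above mentions word-level and character-level features built from bag-of-words, n-gram, and TF-IDF representations. The following is a minimal sketch of what such feature extraction can look like, not the authors' implementation. The function names, the whitespace tokenizer, the particular TF-IDF variant (tf = count / document length, idf = log(N / df)), and the toy Latin-script placeholder documents are all assumptions made for illustration; the record does not specify any of them, and the classifiers are not shown.

```python
import math
from collections import Counter


def word_ngrams(text, n=1):
    """Return word n-grams from whitespace-tokenized text (assumed tokenizer)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n=3):
    """Return overlapping character n-grams, with spaces kept as characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bag_of_words(features):
    """Return raw feature counts for one document."""
    return Counter(features)


def tfidf(docs_features):
    """Weight each document's features by TF-IDF.

    Assumed variant: tf = count / document length, idf = log(N / df).
    Other variants (smoothing, normalization) are equally common.
    """
    n_docs = len(docs_features)
    doc_freq = Counter()
    for feats in docs_features:
        doc_freq.update(set(feats))  # count each feature once per document
    weighted = []
    for feats in docs_features:
        counts = Counter(feats)
        total = len(feats)
        weighted.append(
            {f: (c / total) * math.log(n_docs / doc_freq[f])
             for f, c in counts.items()}
        )
    return weighted


# Hypothetical placeholder documents standing in for annotated tweets.
docs = ["good day friend", "bad day", "good friend"]
unigrams = [word_ngrams(d, 1) for d in docs]   # word unigram features
trigrams = [char_ngrams(d, 3) for d in docs]   # character trigram features
counts = [bag_of_words(u) for u in unigrams]   # bag-of-words counts
weights = tfidf(unigrams)                      # word TF-IDF weights
```

Because Python strings are sequences of Unicode code points, the same functions apply unchanged to Perso-Arabic script, although real Urdu text would likely need normalization and a better tokenizer than `str.split`.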