UEF-HOCUrdu: Unified Embeddings Ensemble Framework for Hate and Offensive Text Classification in Urdu
Hate speech and other forms of hostile communication on social media have several harmful consequences, such as fostering violence, promoting social division, and causing negative psychological effects. Since such toxic language is becoming increasingly common, it is imperative to have a reliable way of identifying it...
Main Authors: | Kifayat Ullah, Muhammad Aslam, Muhammad Usman Ghani Khan, Faten S. Alamri, Amjad Rehman Khan |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
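Subjects: | Urdu hate speech detection; Urdu multi-class classification; machine learning; deep learning; transfer learning; ensemble learning model |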
Online Access: | https://ieeexplore.ieee.org/document/10849516/ |
_version_ | 1832088092126740480 |
---|---|
author | Kifayat Ullah; Muhammad Aslam; Muhammad Usman Ghani Khan; Faten S. Alamri; Amjad Rehman Khan |
author_sort | Kifayat Ullah |
collection | DOAJ |
description | Hate speech and other forms of hostile communication on social media have several harmful consequences, such as fostering violence, promoting social division, and causing negative psychological effects. Since such toxic language is becoming increasingly common, it is imperative to have a reliable way of identifying it, especially in low-resource languages like Urdu. To meet this challenge, this research proposes a new ensemble-based multi-class classification model and introduces a new dataset of 36,000 Urdu tweets categorized as ‘Hate’, ‘Offensive’ and ‘Neither’. The study sought to create a model that not only achieves high classification accuracy but also overcomes key challenges inherent in natural language processing, namely high dimensionality, sparsity, overfitting, out-of-vocabulary (OOV) words, and dialectal variations. For this purpose, an extensive comparison of different learning algorithms was conducted. As a result, the most efficient models, namely FastText, XLM-RoBERTa, ULMFiT, and XGBoost, were incorporated into the proposed ensemble approach to achieve the best results in both classification and mitigation of these NLP issues. To further increase confidence in the proposed model, stratified 5-fold cross-validation was used. The ensemble model performed best, achieving a macro F1 score of 0.94, and is complemented by a comprehensive labeled dataset focusing on hate and offensive speech in Urdu. By addressing key research gaps, this research provides a valuable foundation for future work and benchmarking in Urdu hate speech multi-class classification tasks. |
format | Article |
id | doaj-art-d433739ab92f4f459f31146ba3450a61 |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-d433739ab92f4f459f31146ba3450a61; 2025-02-06T00:00:15Z; eng; IEEE; IEEE Access; 2169-3536; 2025-01-01; 13; 21853; 21869; 10.1109/ACCESS.2025.3532611; 10849516; UEF-HOCUrdu: Unified Embeddings Ensemble Framework for Hate and Offensive Text Classification in Urdu; Kifayat Ullah (0), https://orcid.org/0009-0007-0801-9442; Muhammad Aslam (1), https://orcid.org/0000-0002-8977-9457; Muhammad Usman Ghani Khan (2), https://orcid.org/0000-0001-6733-2569; Faten S. Alamri (3), https://orcid.org/0000-0003-0312-8731; Amjad Rehman Khan (4), https://orcid.org/0000-0002-3817-2655; Department of Computer Science, University of Engineering and Technology Lahore, Lahore, Pakistan; Department of Computer Science, University of Engineering and Technology Lahore, Lahore, Pakistan; National Center of Artificial Intelligence, Al-Khawarizmi Institute of Computer Science, UET Lahore, Lahore, Pakistan; Department of Mathematical Sciences, College of Science, Princess Nourah Bint Abdulrahman University, Riyadh, Saudi Arabia; Artificial Intelligence & Data Analytics Laboratory (AIDA), CCIS, Prince Sultan University, Riyadh, Saudi Arabia; https://ieeexplore.ieee.org/document/10849516/; Urdu hate speech detection; Urdu multi-class classification; machine learning; deep learning; transfer learning; ensemble learning model |
title | UEF-HOCUrdu: Unified Embeddings Ensemble Framework for Hate and Offensive Text Classification in Urdu |
topic | Urdu hate speech detection; Urdu multi-class classification; machine learning; deep learning; transfer learning; ensemble learning model |
url | https://ieeexplore.ieee.org/document/10849516/ |