UEF-HOCUrdu: Unified Embeddings Ensemble Framework for Hate and Offensive Text Classification in Urdu

Hate speech and other forms of hostile communication on social media have several implications, such as fostering violence, promoting social division, and causing negative psychological effects. Since such toxic language is becoming more and more common, it is imperative to have a proper way of identifying it...

Bibliographic Details
Main Authors: Kifayat Ullah, Muhammad Aslam, Muhammad Usman Ghani Khan, Faten S. Alamri, Amjad Rehman Khan
Format: Article
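Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects: Urdu hate speech detection; Urdu multi-class classification; machine learning; deep learning; transfer learning; ensemble learning model
Online Access: https://ieeexplore.ieee.org/document/10849516/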
collection DOAJ
description Hate speech and other forms of hostile communication on social media have several implications, such as fostering violence, promoting social division, and causing negative psychological effects. Since such toxic language is becoming more and more common, it is imperative to have a proper way of identifying it, especially in low-resource languages like Urdu. To meet this challenge, this research proposed a new ensemble-based multi-class classification model and generated a new dataset of 36,000 Urdu tweets categorized as ‘Hate’, ‘Offensive’ and ‘Neither’. This study sought to create a model that not only achieves high classification accuracy but also overcomes key challenges inherent in natural language processing, namely high dimensionality, sparsity, overfitting, out-of-vocabulary (OOV) words, and dialectal variations. For this purpose, an extensive comparison of different learning algorithms was conducted. As a result, the most efficient models, namely FastText, XLM-RoBERTa, ULMFiT, and XGBoost, were incorporated into the proposed ensemble approach to achieve the best results in both classification and mitigation of these NLP issues. To further increase confidence in the proposed model, stratified 5-fold cross-validation was used. The ensemble model performed best, achieving a macro F1 score of 0.94, and is complemented by a comprehensive labeled dataset focusing on hate and offensive speech in Urdu. By addressing key research gaps, this research provides a valuable foundation for future work and benchmarking in Urdu hate speech multi-class classification tasks.
format Article
id doaj-art-d433739ab92f4f459f31146ba3450a61
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-d433739ab92f4f459f31146ba3450a61; 2025-02-06T00:00:15Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13, pp. 21853-21869; DOI 10.1109/ACCESS.2025.3532611; document 10849516
Kifayat Ullah (https://orcid.org/0009-0007-0801-9442), Department of Computer Science, University of Engineering and Technology Lahore, Lahore, Pakistan
Muhammad Aslam (https://orcid.org/0000-0002-8977-9457), Department of Computer Science, University of Engineering and Technology Lahore, Lahore, Pakistan
Muhammad Usman Ghani Khan (https://orcid.org/0000-0001-6733-2569), National Center of Artificial Intelligence, Al-Khawarizmi Institute of Computer Science, UET Lahore, Lahore, Pakistan
Faten S. Alamri (https://orcid.org/0000-0003-0312-8731), Department of Mathematical Sciences, College of Science, Princess Nourah Bint Abdulrahman University, Riyadh, Saudi Arabia
Amjad Rehman Khan (https://orcid.org/0000-0002-3817-2655), Artificial Intelligence & Data Analytics Laboratory (AIDA), CCIS, Prince Sultan University, Riyadh, Saudi Arabia
title UEF-HOCUrdu: Unified Embeddings Ensemble Framework for Hate and Offensive Text Classification in Urdu
topic Urdu hate speech detection
Urdu multi-class classification
machine learning
deep learning
transfer learning
ensemble learning model
url https://ieeexplore.ieee.org/document/10849516/
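
The description above reports a macro F1 score of 0.94 under stratified 5-fold cross-validation. As a rough illustration of what those two terms mean, and not the authors' code, the following Python sketch computes macro F1 over the three classes and deals sample indices into class-balanced folds. The function names `macro_f1` and `stratified_folds`, the seed, and the round-robin fold assignment are assumptions made for illustration; this record does not describe how the evaluation was actually implemented.

```python
import random
from collections import defaultdict

# Class names as given in the description; everything else here is illustrative.
LABELS = ("Hate", "Offensive", "Neither")


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def stratified_folds(y, k=5, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class round-robin across folds
    return folds
```

In use, each fold would serve once as the held-out test set while the remaining four are used for training, and `macro_f1` would be computed on each held-out fold. Because macro F1 averages the per-class scores without weighting, a rare class counts as much as a frequent one.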