UEF-HOCUrdu: Unified Embeddings Ensemble Framework for Hate and Offensive Text Classification in Urdu

Bibliographic Details
Main Authors: Kifayat Ullah, Muhammad Aslam, Muhammad Usman Ghani Khan, Faten S. Alamri, Amjad Rehman Khan
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10849516/
Description
Summary: Hate speech and other forms of hostile communication on social media have several harmful consequences, such as fostering violence, deepening social division, and causing negative psychological effects. As such toxic language becomes more common, a reliable way of identifying it is imperative, especially in low-resource languages like Urdu. To meet this challenge, this research proposes a new ensemble-based multi-class classification model and introduces a new dataset of 36,000 Urdu tweets labeled as ‘Hate’, ‘Offensive’, or ‘Neither’. The study sought to create a model that not only achieves high classification accuracy but also overcomes key challenges inherent in natural language processing, namely high dimensionality, sparsity, overfitting, out-of-vocabulary (OOV) words, and dialectal variation. For this purpose, an extensive comparison of different learning algorithms was conducted. As a result, the most effective models, namely FastText, XLM-RoBERTa, ULMFiT, and XGBoost, were incorporated into the proposed ensemble to achieve the best results in both classification and mitigation of these NLP issues. To further strengthen confidence in the proposed model, stratified 5-fold cross-validation was used. The ensemble model performed best, achieving a macro F1 score of 0.94, and is complemented by a comprehensive labeled dataset focusing on hate and offensive speech in Urdu. By addressing key research gaps, this research provides a valuable foundation for future work and benchmarking in Urdu hate speech multi-class classification.
ISSN: 2169-3536
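
The summary reports a macro F1 score obtained under stratified 5-fold cross-validation. As a rough illustration of what those two terms mean, and not the authors' implementation, the following minimal Python sketch computes macro F1 over the three labels and builds class-balanced folds. The function names `macro_f1` and `stratified_folds` are hypothetical, and the round-robin fold assignment (without shuffling) is a simplifying assumption.

```python
from collections import Counter

# The three labels named in the summary.
LABELS = ("Hate", "Offensive", "Neither")


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores (every class counts equally)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


def stratified_folds(y, k=5):
    """Assign indices to k folds round-robin within each class.

    Assumption: no shuffling, so the result is deterministic. Each fold
    receives roughly the same share of every class.
    """
    folds = [[] for _ in range(k)]
    seen = Counter()
    for index, label in enumerate(y):
        folds[seen[label] % k].append(index)
        seen[label] += 1
    return folds
```

In use, each fold in turn would serve as the held-out test set while a model is trained on the remaining four, and the per-fold macro F1 scores would be averaged. Because macro F1 weights all classes equally, it does not let a frequent class such as ‘Neither’ dominate the score.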