Urdu Toxic Comment Classification With PURUTT Corpus Development

Bibliographic Details
Main Authors: Hafiz Hassaan Saeed, Tahir Khalil, Faisal Kamiran
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10856102/
Description
Summary: This study addresses a critical gap in toxic comment classification for Urdu, a widely spoken language that lacks high-quality standard datasets. To address this gap, we took an existing labeled Roman Urdu (RU) corpus, originally developed for Roman Urdu toxic comment classification, and supplemented it with Urdu-script transliterations of its comments. The motivation for this extension is twofold: first, to provide a large, comprehensive dataset for classifying toxic comments in Urdu; second, to facilitate bidirectional transliteration between Urdu and RU. Transliteration itself is outside the scope of this study and is left as a future research direction. We introduce the extended corpus as PURUTT (Parallel Urdu and Roman Urdu Corpus for Toxic Comments and Transliteration), which contains 72,771 labeled comments, each provided in parallel in both Urdu and Roman Urdu scripts. For Urdu toxic comment classification, our methodology begins by retraining the classification models that were previously trained on the original Roman Urdu corpus. We use pre-trained Word2Vec and FastText Urdu word embeddings to evaluate model performance under transfer learning. Furthermore, we fine-tune five multilingual large language models, capitalizing on their inherent multilingual capabilities. To further improve classification performance, we propose an ensemble approach that combines the strengths of multiple base models. Extensive empirical validation demonstrates the superiority of the ensemble model, which achieves a state-of-the-art F1-score of 91.65% on PURUTT and sets a benchmark for Urdu toxic comment classification on this corpus.
ISSN: 2169-3536