Urdu Toxic Comment Classification With PURUTT Corpus Development

Bibliographic Details
Main Authors: Hafiz Hassaan Saeed, Tahir Khalil, Faisal Kamiran
Format: Article
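Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects: Urdu; Urdu parallel corpus; Urdu toxic comments; Urdu toxic comment classification; toxic comment classification; transfer learning
Online Access: https://ieeexplore.ieee.org/document/10856102/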
author Hafiz Hassaan Saeed
Tahir Khalil
Faisal Kamiran
collection DOAJ
description This study addresses a critical gap in toxic comment classification for Urdu, a widely spoken language that lacks high-quality standard datasets. To fill this gap, we take an existing labeled Roman Urdu (RU) corpus, originally developed for Roman Urdu toxic comment classification, and supplement it with equivalent Urdu-script transliterations. The motivation for this extension is twofold: first, to provide a large, comprehensive dataset for classifying toxic comments in Urdu; second, to facilitate bidirectional transliteration between Urdu and RU. Transliteration itself is outside the scope of this study and is left as a future research direction. We introduce the extended corpus as PURUTT (Parallel Urdu and Roman Urdu Corpus for Toxic Comments and Transliteration), which contains 72,771 labeled comments, each given in parallel in both Urdu and Roman Urdu scripts. For Urdu toxic comment classification, our methodology begins by training the same classification models that were previously trained on the original Roman Urdu corpus. We use pre-trained Word2Vec and FastText Urdu word embeddings to evaluate model performance through transfer learning. We also fine-tune five multilingual large language models, capitalizing on their inherent multilingual capabilities. To further improve classification performance, we propose an ensemble approach that aggregates the strengths of multiple base models. Extensive empirical validation shows that the ensemble model performs best, achieving a state-of-the-art F1-score of 91.65% on PURUTT and setting a benchmark for Urdu toxic comment classification on this corpus.
format Article
id doaj-art-cbc79e13630546c2b116354eac55990a
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-cbc79e13630546c2b116354eac55990a
Record updated: 2025-02-06T00:00:30Z
Language: eng
Publisher: IEEE
Series: IEEE Access
ISSN: 2169-3536
Published: 2025-01-01
Volume: 13
Pages: 21635-21651
DOI: 10.1109/ACCESS.2025.3535862
Article number: 10856102
Title: Urdu Toxic Comment Classification With PURUTT Corpus Development
Authors:
Hafiz Hassaan Saeed (https://orcid.org/0000-0001-5026-0765), Department of Computer Science, Information Technology University, Lahore, Pakistan
Tahir Khalil (https://orcid.org/0009-0006-3366-1712), Department of Computer Science, Information Technology University, Lahore, Pakistan
Faisal Kamiran (https://orcid.org/0000-0002-1168-9451), Department of Computer Science, Information Technology University, Lahore, Pakistan
Online Access: https://ieeexplore.ieee.org/document/10856102/
Keywords: Urdu; Urdu parallel corpus; Urdu toxic comments; Urdu toxic comment classification; toxic comment classification; transfer learning
title Urdu Toxic Comment Classification With PURUTT Corpus Development
topic Urdu
Urdu parallel corpus
Urdu toxic comments
Urdu toxic comment classification
toxic comment classification
transfer learning
url https://ieeexplore.ieee.org/document/10856102/