Cross-lingual hate speech detection using domain-specific word embeddings.

Bibliographic Details
Main Authors: Ayme Arango Monnar, Jorge Perez Rojas, Barbara Polete Labra
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2024-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0306521
collection DOAJ
description THIS ARTICLE USES WORDS OR LANGUAGE THAT IS CONSIDERED PROFANE, VULGAR, OR OFFENSIVE BY SOME READERS. Hate speech detection in online social networks is a multidimensional problem that depends on language and cultural factors. Most supervised learning resources for this task, such as labeled datasets and Natural Language Processing (NLP) tools, have been tailored specifically to English. However, a large portion of web users around the world speak other languages, creating an important need for efficient multilingual hate speech detection approaches. In particular, such approaches should be able to leverage the limited cross-lingual resources that currently exist. Cross-lingual transfer for this task has proven difficult to achieve. We therefore propose a simple yet effective method to approach this problem. To our knowledge, ours is the first attempt to create a multilingual embedding model specific to this problem. We validate our approach through an extensive comparative evaluation against several well-known general-purpose language models that, unlike ours, have been trained on massive amounts of data. We focus on a zero-shot cross-lingual evaluation scenario, in which we classify hate speech in one language without access to any labeled data in that language. Despite their simplicity, our embeddings outperform more complex models in most of the experimental settings we tested. In addition, we provide further evidence of the effectiveness of our approach through an ad hoc qualitative exploratory analysis, which captures how hate speech is expressed in different languages. This analysis allows us to find new cross-lingual relations between words in the hate-speech domain. Overall, our findings indicate common patterns in how hate speech is expressed across languages and show that our proposed model can capture such relationships to a significant degree.
id doaj-art-e67a07c8fb72425ab3bbeccb78841450
institution OA Journals
issn 1932-6203