Analytical Comparison of Stop Word Recognition Methods in Persian Texts
Stop words are primarily non-significant words used to connect other words in sentence construction. Since these words do not contain specific information about the text, they are typically removed during text processing. Therefore, identifying stop words is an essential operation in text processing...
Main Authors: | Mohammad Samie, Erta Bahmani, Niloofar Mozafari |
---|---|
Format: | Article |
Language: | English |
Published: | Regional Information Center for Science and Technology (RICeST), 2025-01-01 |
Series: | International Journal of Information Science and Management |
Subjects: | stop words; content words; persian language processing; pos tagging; word2vec; fasttext |
Online Access: | https://ijism.isc.ac/article_719446_1322bb8af0b283fd4d22f5dd5810090a.pdf |
_version_ | 1832595704094130176 |
---|---|
author | Mohammad Samie; Erta Bahmani; Niloofar Mozafari |
author_facet | Mohammad Samie; Erta Bahmani; Niloofar Mozafari |
author_sort | Mohammad Samie |
collection | DOAJ |
description | Stop words are primarily non-significant words that connect other words in sentence construction. Because they carry no specific information about the text, they are typically removed during text processing, which makes identifying them an essential operation. A challenge arises because words that are usually insignificant can become significant depending on context, while words that are typically important can sometimes be classified as stop words. This problem is particularly pronounced in Persian due to the complexities inherent in the language. Recognizing the importance of identifying stop words in Persian, we analyzed and reviewed four approaches (dictionary-based, POS tagging-based, Word2Vec-based, and FastText-based) using a corpus of 50,000 Persian sentences from the Hamshahri dataset. Our findings indicate that the FastText-based approach outperformed the others with a detection accuracy of 96.98, suggesting that this method can lead to the development of an automatic, reliable, and efficient system. |
format | Article |
id | doaj-art-ab753a0640754ce2b1c5d161b1066eed |
institution | Kabale University |
issn | 2008-8302 2008-8310 |
language | English |
publishDate | 2025-01-01 |
publisher | Regional Information Center for Science and Technology (RICeST) |
record_format | Article |
series | International Journal of Information Science and Management |
spelling | Record: doaj-art-ab753a0640754ce2b1c5d161b1066eed; indexed 2025-01-18T06:18:23Z; language: eng. Publisher: Regional Information Center for Science and Technology (RICeST). Series: International Journal of Information Science and Management; ISSN 2008-8302, 2008-8310. Published 2025-01-01; volume 23, issue 1, pages 91-107. DOI: 10.22034/ijism.2025.2017335.1322; article ID 719446. Title: Analytical Comparison of Stop Word Recognition Methods in Persian Texts. Authors: Mohammad Samie (Department of Computer Engineering and IT, Jahrom University, Jahrom, Iran); Erta Bahmani (Department of Computer Engineering and IT, Jahrom University, Jahrom, Iran); Niloofar Mozafari (Islamic World Science and Technology Monitoring and Citation Institute (ISC), Shiraz, Iran). URL: https://ijism.isc.ac/article_719446_1322bb8af0b283fd4d22f5dd5810090a.pdf. Keywords: stop words; content words; persian language processing; pos tagging; word2vec; fasttext |
title | Analytical Comparison of Stop Word Recognition Methods in Persian Texts |
topic | stop words; content words; persian language processing; pos tagging; word2vec; fasttext |
url | https://ijism.isc.ac/article_719446_1322bb8af0b283fd4d22f5dd5810090a.pdf |
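
The description above names a dictionary-based approach as one of the four methods compared. As a rough illustration only, and not the authors' implementation, a dictionary-based filter can be sketched in Python as follows. The stop list `PERSIAN_STOP_WORDS`, the function `remove_stop_words`, and the whitespace tokenization are all assumptions made for this sketch; the record does not say which lexicon or tokenizer the article used.

```python
# Minimal sketch of dictionary-based stop word removal.
# Assumption: a tiny hand-picked list of common Persian function words;
# the article's actual stop word dictionary is not given in this record.
PERSIAN_STOP_WORDS = {"و", "در", "به", "از", "را", "با"}


def remove_stop_words(sentence, stop_words=PERSIAN_STOP_WORDS):
    """Return the tokens of `sentence` that are not in `stop_words`.

    Assumption: tokens are separated by whitespace. Real Persian text
    usually needs normalization (for example, half-space handling) first.
    """
    return [token for token in sentence.split() if token not in stop_words]


if __name__ == "__main__":
    # "I got the book from the library"
    print(remove_stop_words("کتاب را از کتابخانه گرفتم"))
```

A fixed list like this cannot handle the case the description highlights, where a normally insignificant word becomes significant in context, which is the motivation the record gives for also evaluating the POS tagging, Word2Vec, and FastText approaches.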