Improving Sentiment Analysis Performance on Imbalanced Moroccan Dialect Datasets Using Resample and Feature Extraction Techniques

Sentiment analysis is a crucial component of text mining and natural language processing (NLP), involving the evaluation and classification of text data based on its emotional tone, typically categorized as positive, negative, or neutral. While significant research has focused on structured language...

Bibliographic Details
Main Authors:	Zineb Nassr, Faouzia Benabbou, Nawal Sael, Touria Hamim
Format:	Article
Language:	English
Published:	MDPI AG 2025-01-01
Series:	Information
Subjects:	text mining, NLP, sentiment analysis, Moroccan dialect, preprocessing, stop words
Online Access:	https://www.mdpi.com/2078-2489/16/1/39

author	Zineb Nassr, Faouzia Benabbou, Nawal Sael, Touria Hamim
collection	DOAJ
description	Sentiment analysis is a crucial component of text mining and natural language processing (NLP), involving the evaluation and classification of text data based on its emotional tone, typically categorized as positive, negative, or neutral. While significant research has focused on structured languages like English, unstructured languages, such as the Moroccan Dialect (MD), face substantial resource limitations and linguistic challenges, making effective sentiment analysis difficult. This study addresses this gap by exploring the integration of data-balancing techniques with machine learning (ML) methods, specifically investigating the impact of resampling techniques and feature extraction methods, including Term Frequency–Inverse Document Frequency (TF-IDF), Bag of Words (BOW), and N-grams. Through rigorous experimentation, we evaluate the effectiveness of these approaches in enhancing sentiment analysis accuracy for the Moroccan dialect. Our findings demonstrate that strategic resampling, combined with the TF-IDF method, significantly improves classification accuracy and robustness. We also explore the interaction between resampling strategies and feature extraction methods, revealing varying levels of effectiveness across different combinations. Notably, the Support Vector Machine (SVM) classifier, when paired with TF-IDF representation, achieves superior performance, with an accuracy of 90.24% and a precision of 90.34%. These results highlight the importance of tailored resampling techniques, appropriate feature extraction methods, and machine learning optimization in advancing sentiment analysis for under-resourced and dialect-heavy languages like the Moroccan dialect, providing a practical framework for future research and development in NLP for unstructured languages.
format	Article
id	doaj-art-2e240de97fc14d1c8a6da4dbce448dd0
institution	Kabale University
issn	2078-2489
language	English
publishDate	2025-01-01
publisher	MDPI AG
record_format	Article
series	Information
spelling	doaj-art-2e240de97fc14d1c8a6da4dbce448dd0; 2025-01-24T13:35:14Z; eng; MDPI AG; Information; ISSN 2078-2489; 2025-01-01; vol. 16, no. 1, art. 39; DOI 10.3390/info16010039; Improving Sentiment Analysis Performance on Imbalanced Moroccan Dialect Datasets Using Resample and Feature Extraction Techniques; Zineb Nassr, Faouzia Benabbou, Nawal Sael, Touria Hamim (all: Laboratory of Modelling and Information Technology, Faculty of Sciences Ben M’SIK, University Hassan II, Casablanca 20000, Morocco); https://www.mdpi.com/2078-2489/16/1/39; text mining, NLP, sentiment analysis, Moroccan dialect, preprocessing, stop words
title	Improving Sentiment Analysis Performance on Imbalanced Moroccan Dialect Datasets Using Resample and Feature Extraction Techniques
topic	text mining, NLP, sentiment analysis, Moroccan dialect, preprocessing, stop words
url	https://www.mdpi.com/2078-2489/16/1/39
