Improving Sentiment Analysis Performance on Imbalanced Moroccan Dialect Datasets Using Resample and Feature Extraction Techniques

Sentiment analysis is a crucial component of text mining and natural language processing (NLP), involving the evaluation and classification of text data based on its emotional tone, typically categorized as positive, negative, or neutral. While significant research has focused on structured language...

Bibliographic Details
Main Authors: Zineb Nassr, Faouzia Benabbou, Nawal Sael, Touria Hamim
Format: Article
Language: English
Published: MDPI AG 2025-01-01
Series: Information
Subjects: text mining; NLP; sentiment analysis; Moroccan dialect; preprocessing; stop words
Online Access: https://www.mdpi.com/2078-2489/16/1/39
author Zineb Nassr
Faouzia Benabbou
Nawal Sael
Touria Hamim
collection DOAJ
description Sentiment analysis is a crucial component of text mining and natural language processing (NLP), involving the evaluation and classification of text data based on its emotional tone, typically categorized as positive, negative, or neutral. While significant research has focused on structured languages like English, unstructured languages, such as the Moroccan Dialect (MD), face substantial resource limitations and linguistic challenges, making effective sentiment analysis difficult. This study addresses this gap by exploring the integration of data-balancing techniques with machine learning (ML) methods, specifically investigating the impact of resampling techniques and feature extraction methods, including Term Frequency–Inverse Document Frequency (TF-IDF), Bag of Words (BOW), and N-grams. Through rigorous experimentation, we evaluate the effectiveness of these approaches in enhancing sentiment analysis accuracy for the Moroccan dialect. Our findings demonstrate that strategic resampling, combined with the TF-IDF method, significantly improves classification accuracy and robustness. We also explore the interaction between resampling strategies and feature extraction methods, revealing varying levels of effectiveness across different combinations. Notably, the Support Vector Machine (SVM) classifier, when paired with TF-IDF representation, achieves superior performance, with an accuracy of 90.24% and a precision of 90.34%. These results highlight the importance of tailored resampling techniques, appropriate feature extraction methods, and machine learning optimization in advancing sentiment analysis for under-resourced and dialect-heavy languages like the Moroccan dialect, providing a practical framework for future research and development in NLP for unstructured languages.
format Article
id doaj-art-2e240de97fc14d1c8a6da4dbce448dd0
institution Kabale University
issn 2078-2489
language English
publishDate 2025-01-01
publisher MDPI AG
record_format Article
series Information
spelling doaj-art-2e240de97fc14d1c8a6da4dbce448dd0; 2025-01-24T13:35:14Z; eng; MDPI AG; Information; 2078-2489; 2025-01-01; 16; 1; 39; 10.3390/info16010039; Improving Sentiment Analysis Performance on Imbalanced Moroccan Dialect Datasets Using Resample and Feature Extraction Techniques; Zineb Nassr, Faouzia Benabbou, Nawal Sael, Touria Hamim (all: Laboratory of Modelling and Information Technology, Faculty of Sciences Ben M’SIK, University Hassan II, Casablanca 20000, Morocco); https://www.mdpi.com/2078-2489/16/1/39; text mining; NLP; sentiment analysis; Moroccan dialect; preprocessing; stop words
title Improving Sentiment Analysis Performance on Imbalanced Moroccan Dialect Datasets Using Resample and Feature Extraction Techniques
topic text mining
NLP
sentiment analysis
Moroccan dialect
preprocessing
stop words
url https://www.mdpi.com/2078-2489/16/1/39
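The description in this record pairs resampling of an imbalanced dataset with TF-IDF features before classification. A minimal sketch of those two steps follows, using only the Python standard library. It rests on several assumptions that are not taken from the article: random oversampling stands in for whichever resampling methods the authors used, the TF-IDF variant is the plain tf * log(N / df) form with whitespace tokenization, the function names `random_oversample` and `tfidf_vectors` are hypothetical, and the toy English sentences are placeholders rather than Moroccan dialect data. The classifier step (an SVM in the article) is left out.

```python
import math
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until every
    class has as many samples as the largest one (illustrative choice,
    not necessarily the resampling method used in the article)."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    target = max(len(group) for group in by_label.values())
    out_texts, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for text in group + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def tfidf_vectors(texts):
    """Return (vocabulary, vectors) with weights tf * log(N / df).

    tf is the term count divided by the document length; df is the
    number of documents containing the term. Tokenization is a plain
    lowercase whitespace split (an assumption for this sketch)."""
    docs = [text.lower().split() for text in texts]
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid division by zero on empty documents
        vectors.append(
            [(counts[term] / length) * math.log(n_docs / df[term]) for term in vocab]
        )
    return vocab, vectors


# Toy, hypothetical data: three positive samples and one negative sample.
texts = ["good service", "very good food", "good price", "bad service"]
labels = ["pos", "pos", "pos", "neg"]

balanced_texts, balanced_labels = random_oversample(texts, labels)
vocab, vectors = tfidf_vectors(balanced_texts)
print(Counter(balanced_labels))
print(vocab)
```

After oversampling, both classes in the toy data contain three samples, and each document becomes one weight vector over the shared vocabulary. In a real experiment the resampling would normally be applied to the training split only, so that duplicated samples cannot leak into the evaluation data.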