Improving Sentiment Analysis Performance on Imbalanced Moroccan Dialect Datasets Using Resample and Feature Extraction Techniques
Sentiment analysis is a crucial component of text mining and natural language processing (NLP), involving the evaluation and classification of text data based on its emotional tone, typically categorized as positive, negative, or neutral. While significant research has focused on structured language...
Main Authors: | Zineb Nassr, Faouzia Benabbou, Nawal Sael, Touria Hamim |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2025-01-01 |
Series: | Information |
Subjects: | text mining; NLP; sentiment analysis; Moroccan dialect; preprocessing; stop words |
Online Access: | https://www.mdpi.com/2078-2489/16/1/39 |
_version_ | 1832588332147671040 |
---|---|
author | Zineb Nassr, Faouzia Benabbou, Nawal Sael, Touria Hamim |
author_sort | Zineb Nassr |
collection | DOAJ |
description | Sentiment analysis is a crucial component of text mining and natural language processing (NLP), involving the evaluation and classification of text data based on its emotional tone, typically categorized as positive, negative, or neutral. While significant research has focused on structured languages like English, unstructured languages, such as the Moroccan Dialect (MD), face substantial resource limitations and linguistic challenges, making effective sentiment analysis difficult. This study addresses this gap by exploring the integration of data-balancing techniques with machine learning (ML) methods, specifically investigating the impact of resampling techniques and feature extraction methods, including Term Frequency–Inverse Document Frequency (TF-IDF), Bag of Words (BOW), and N-grams. Through rigorous experimentation, we evaluate the effectiveness of these approaches in enhancing sentiment analysis accuracy for the Moroccan dialect. Our findings demonstrate that strategic resampling, combined with the TF-IDF method, significantly improves classification accuracy and robustness. We also explore the interaction between resampling strategies and feature extraction methods, revealing varying levels of effectiveness across different combinations. Notably, the Support Vector Machine (SVM) classifier, when paired with TF-IDF representation, achieves superior performance, with an accuracy of 90.24% and a precision of 90.34%. These results highlight the importance of tailored resampling techniques, appropriate feature extraction methods, and machine learning optimization in advancing sentiment analysis for under-resourced and dialect-heavy languages like the Moroccan dialect, providing a practical framework for future research and development in NLP for unstructured languages. |
format | Article |
id | doaj-art-2e240de97fc14d1c8a6da4dbce448dd0 |
institution | Kabale University |
issn | 2078-2489 |
language | English |
publishDate | 2025-01-01 |
publisher | MDPI AG |
record_format | Article |
series | Information |
spelling | Record doaj-art-2e240de97fc14d1c8a6da4dbce448dd0 (2025-01-24T13:35:14Z); language: eng; publisher: MDPI AG; journal: Information, ISSN 2078-2489; published 2025-01-01; vol. 16, no. 1, art. 39; DOI: 10.3390/info16010039; authors: Zineb Nassr, Faouzia Benabbou, Nawal Sael, Touria Hamim (all affiliated with the Laboratory of Modelling and Information Technology, Faculty of Sciences Ben M’SIK, University Hassan II, Casablanca 20000, Morocco); URL: https://www.mdpi.com/2078-2489/16/1/39; keywords: text mining, NLP, sentiment analysis, Moroccan dialect, preprocessing, stop words |
title | Improving Sentiment Analysis Performance on Imbalanced Moroccan Dialect Datasets Using Resample and Feature Extraction Techniques |
topic | text mining; NLP; sentiment analysis; Moroccan dialect; preprocessing; stop words |
url | https://www.mdpi.com/2078-2489/16/1/39 |
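
The description above combines resampling of an imbalanced dataset with TF-IDF feature extraction before classification. The following is a minimal sketch of those two steps using only the Python standard library. It is not the authors' pipeline: the record does not say which resampling technique was used, so random oversampling is an assumption, the TF-IDF weight `tf * log(N / df)` is one common variant among several, the function names `random_oversample` and `tfidf` are hypothetical, and the toy English reviews are placeholders rather than Moroccan dialect data. The SVM classifier stage is omitted.

```python
import math
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class has as many samples as the largest class (assumed technique)."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    target = max(len(group) for group in by_label.values())
    out_texts, out_labels = [], []
    for label, group in by_label.items():
        # Draw extra samples with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for text in group + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def tfidf(texts):
    """Return one {term: weight} dict per document, with
    weight = (term count / document length) * log(N / document frequency)."""
    docs = [text.split() for text in texts]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Placeholder toy data: three positive reviews, one negative.
texts = ["good service", "good food", "very good", "bad service"]
labels = ["pos", "pos", "pos", "neg"]

balanced_texts, balanced_labels = random_oversample(texts, labels)
vectors = tfidf(texts)
```

In this toy corpus the term "good" appears in three of four documents, so it receives a lower weight than the rarer "service" within the same document. In practice, resampling is usually applied to the training split only, so that duplicated samples do not leak into the evaluation data.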