How well can LLMs grade essays in Arabic?
This research assesses the effectiveness of state-of-the-art large language models (LLMs), including ChatGPT, Llama, Aya, Jais, and ACEGPT, in the task of Arabic automated essay scoring (AES) using the AR-AES dataset. It explores various evaluation methodologies, including zero-shot, few-shot in con...
| Main Authors: | Rayed Ghazawi, Edwin Simpson |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-12-01 |
| Series: | Computers and Education: Artificial Intelligence |
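| Subjects: | Automatic essay scoring (AES); Natural language processing (NLP); Large language models (LLMs); Arabic language |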
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2666920X2500089X |
| _version_ | 1849403939764568064 |
|---|---|
| author | Rayed Ghazawi Edwin Simpson |
| collection | DOAJ |
| description | This research assesses the effectiveness of state-of-the-art large language models (LLMs), including ChatGPT, Llama, Aya, Jais, and ACEGPT, in the task of Arabic automated essay scoring (AES) using the AR-AES dataset. It explores various evaluation methodologies, including zero-shot, few-shot in-context learning, and fine-tuning, and examines the influence of instruction-following capabilities through the inclusion of marking guidelines within the prompts. A mixed-language prompting strategy, integrating English prompts with Arabic content, was implemented to improve model comprehension and performance. Among the models tested, ACEGPT demonstrated the strongest performance across the dataset, achieving a Quadratic Weighted Kappa (QWK) of 0.67, but was outperformed by a smaller BERT-based model with a QWK of 0.88. The study identifies challenges faced by LLMs in processing Arabic, including tokenization complexities and higher computational demands. Performance variation across different courses underscores the need for adaptive models capable of handling diverse assessment formats and highlights the positive impact of effective prompt engineering on improving LLM outputs. To the best of our knowledge, this study is the first to empirically evaluate the performance of multiple generative Large Language Models (LLMs) on Arabic essays using authentic student data. |
| format | Article |
| id | doaj-art-e0864f70c9964b2cbb4e744922c170bc |
| institution | Kabale University |
| issn | 2666-920X |
| language | English |
| publishDate | 2025-12-01 |
| publisher | Elsevier |
| record_format | Article |
| series | Computers and Education: Artificial Intelligence |
| spelling | doaj-art-e0864f70c9964b2cbb4e744922c170bc; timestamp: 2025-08-20T03:37:08Z; language: eng; publisher: Elsevier; journal: Computers and Education: Artificial Intelligence; ISSN: 2666-920X; published: 2025-12-01; volume 9, article 100449; DOI: 10.1016/j.caeai.2025.100449; title: How well can LLMs grade essays in Arabic?; authors: Rayed Ghazawi (Intelligent Systems Labs, University of Bristol, Merchant Venturers Building, Woodland Road, Bristol, BS8 1UB, UK, and Data Science Department, Umm Al-Qura University, Makkah, Saudi Arabia; corresponding author, at the University of Bristol address) and Edwin Simpson (Intelligent Systems Labs, University of Bristol, Merchant Venturers Building, Woodland Road, Bristol, BS8 1UB, UK); URL: http://www.sciencedirect.com/science/article/pii/S2666920X2500089X; subjects: Automatic essay scoring (AES), Natural language processing (NLP), Large language models (LLMs), Arabic language |
| title | How well can LLMs grade essays in Arabic? |
| topic | Automatic essay scoring (AES) Natural language processing (NLP) Large language models (LLMs) Arabic language |
| url | http://www.sciencedirect.com/science/article/pii/S2666920X2500089X |