How well can LLMs grade essays in Arabic?

This research assesses the effectiveness of state-of-the-art large language models (LLMs), including ChatGPT, Llama, Aya, Jais, and ACEGPT, in the task of Arabic automated essay scoring (AES) using the AR-AES dataset. It explores various evaluation methodologies, including zero-shot, few-shot in con...

Bibliographic Details
Main Authors: Rayed Ghazawi, Edwin Simpson
Format: Article
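Language: English
Published: Elsevier 2025-12-01
Series: Computers and Education: Artificial Intelligence
Subjects: Automatic essay scoring (AES); Natural language processing (NLP); Large language models (LLMs); Arabic language
Online Access: http://www.sciencedirect.com/science/article/pii/S2666920X2500089X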
_version_ 1849403939764568064
author Rayed Ghazawi
Edwin Simpson
collection DOAJ
description This research assesses the effectiveness of state-of-the-art large language models (LLMs), including ChatGPT, Llama, Aya, Jais, and ACEGPT, on Arabic automated essay scoring (AES) using the AR-AES dataset. It explores several evaluation settings, including zero-shot prompting, few-shot in-context learning, and fine-tuning, and examines the influence of instruction-following capabilities by including marking guidelines in the prompts. A mixed-language prompting strategy, combining English prompts with Arabic content, was implemented to improve model comprehension and performance. Among the LLMs tested, ACEGPT performed best across the dataset, achieving a Quadratic Weighted Kappa (QWK) of 0.67, but was outperformed by a smaller BERT-based model with a QWK of 0.88. The study identifies challenges that LLMs face in processing Arabic, including tokenization complexities and higher computational demands. Performance variation across courses underscores the need for adaptive models that can handle diverse assessment formats, and highlights the positive impact of effective prompt engineering on LLM outputs. To the best of our knowledge, this study is the first to empirically evaluate the performance of multiple generative LLMs on Arabic essays using authentic student data.
format Article
id doaj-art-e0864f70c9964b2cbb4e744922c170bc
institution Kabale University
issn 2666-920X
language English
publishDate 2025-12-01
publisher Elsevier
record_format Article
series Computers and Education: Artificial Intelligence
spelling doaj-art-e0864f70c9964b2cbb4e744922c170bc
2025-08-20T03:37:08Z
eng
Elsevier
Computers and Education: Artificial Intelligence, ISSN 2666-920X
2025-12-01, Volume 9, Article 100449
doi 10.1016/j.caeai.2025.100449
How well can LLMs grade essays in Arabic?
Rayed Ghazawi (Intelligent Systems Labs, University of Bristol, Merchant Venturers Building, Woodland Road, Bristol, BS8 1UB, UK; Data Science Department, Umm Al-Qura University, Makkah, Saudi Arabia; corresponding author at the University of Bristol address)
Edwin Simpson (Intelligent Systems Labs, University of Bristol, Merchant Venturers Building, Woodland Road, Bristol, BS8 1UB, UK)
title How well can LLMs grade essays in Arabic?
topic Automatic essay scoring (AES)
Natural language processing (NLP)
Large language models (LLMs)
Arabic language
url http://www.sciencedirect.com/science/article/pii/S2666920X2500089X
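
The description above reports agreement between model and human scores as Quadratic Weighted Kappa (QWK). For readers unfamiliar with the metric, the following is a minimal sketch of the standard QWK definition in plain Python. It is not the authors' evaluation code: the function name, its parameters, and the sample scores are illustrative assumptions and are not taken from the article or the AR-AES dataset.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Standard QWK between two equal-length lists of integer scores.

    Hypothetical helper for illustration; not the article's implementation.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")
    n_categories = max_score - min_score + 1
    if n_categories < 2:
        raise ValueError("need at least two distinct score levels")

    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic penalty: grows with the squared score difference.
            weight = (i - j) ** 2 / (n_categories - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / n
            # Expected disagreement if the two raters were independent.
            weighted_expected += weight * hist_a[i] * hist_b[j] / (n * n)

    if weighted_expected == 0.0:
        # Both raters gave the same single score to every essay.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Made-up scores on a 0-3 scale, for illustration only.
human_scores = [0, 1, 2, 3, 2, 1]
model_scores = [0, 1, 2, 3, 1, 1]
print(quadratic_weighted_kappa(human_scores, model_scores, 0, 3))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. Because the penalty is quadratic, a prediction that is off by two score levels costs four times as much as one that is off by one.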