Evaluating LLMs for Automated Scoring in Formative Assessments

The increasing complexity and scale of modern education have revealed the shortcomings of traditional grading methods in providing consistent and scalable assessments. Advancements in artificial intelligence have positioned Large Language Models (LLMs) as robust solutions for automating grading task...

Full description

Saved in:
Bibliographic Details
Main Authors: Pedro C. Mendonça, Filipe Quintal, Fábio Mendonça
Format: Article
Language:English
Published: MDPI AG 2025-03-01
Series:Applied Sciences
Subjects:
Online Access:https://www.mdpi.com/2076-3417/15/5/2787
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:The increasing complexity and scale of modern education have revealed the shortcomings of traditional grading methods in providing consistent and scalable assessments. Advancements in artificial intelligence have positioned Large Language Models (LLMs) as robust solutions for automating grading tasks. This study systematically compared the grading performance of an open-source LLM (LLaMA 3.2) and a premium LLM (OpenAI GPT-4o) against human evaluators across diverse question types in the context of a computer programming subject. Using detailed rubrics, the study assessed the alignment between LLM-generated and human-assigned grades. Results revealed that while both LLMs align closely with human grading, equivalence testing demonstrated that the premium LLM achieves statistically and practically similar grading patterns, particularly for code-based questions, suggesting its potential as a reliable tool for educational assessments. These findings underscore the ability of LLMs to enhance grading consistency, reduce educator workload, and address scalability challenges in programming-focused assessments.
ISSN:2076-3417