Can large language models meet the challenge of generating school-level questions?

Bibliographic Details
Main Authors: Subhankar Maity, Aniket Deroy, Sudeshna Sarkar
Format: Article
Language: English
Published: Elsevier, 2025-06-01
Series: Computers and Education: Artificial Intelligence
Online Access: http://www.sciencedirect.com/science/article/pii/S2666920X25000104
Description
Summary: In the realm of education, crafting appropriate questions for examinations is a meticulous and time-consuming task that is crucial for assessing students' understanding of the subject matter. This paper explores the potential of leveraging large language models (LLMs) to automate question generation in the educational domain. Specifically, we focus on generating educational questions from contexts extracted from school-level textbooks. Our study aims to prompt LLMs such as GPT-4 Turbo, GPT-3.5 Turbo, Llama-2-70B, Llama-3.1-405B, and Gemini Pro to generate a complete set of questions for each context, potentially streamlining the question generation process for educators. We performed a human evaluation of the generated questions, assessing their coverage, grammaticality, usefulness, answerability, and relevance. Additionally, we prompted LLMs to generate questions based on Bloom's revised taxonomy, categorizing and evaluating these questions according to their cognitive complexity and learning objectives. We applied both zero-shot and eight-shot prompting techniques. These efforts provide insight into the efficacy of LLMs in automated question generation and their potential for assessing students' cognitive abilities across various school-level subjects. The results show that employing an eight-shot technique improves scores on the human evaluation metrics for the generated complete set of questions and helps generate questions that are better aligned with Bloom's revised taxonomy.
ISSN: 2666-920X
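
Below is a minimal, illustrative Python sketch of how the zero-shot and eight-shot prompting setups described in the summary could be assembled for a textbook passage. The prompt wording, the exemplar format, and the call_llm stand-in are assumptions made for illustration; they are not taken from the paper.

from typing import Callable, List, Optional, Tuple

ZERO_SHOT_TEMPLATE = (
    "Read the following school textbook passage and generate a complete set "
    "of questions covering it, one question per line.\n\n"
    "Passage:\n{context}\n\nQuestions:"
)

def build_few_shot_prompt(context: str,
                          exemplars: List[Tuple[str, List[str]]]) -> str:
    # Prepend worked examples (passage -> questions) before the target
    # passage; eight exemplars would correspond to the eight-shot setting.
    parts = []
    for ex_context, ex_questions in exemplars:
        parts.append("Passage:\n" + ex_context + "\nQuestions:\n" + "\n".join(ex_questions))
    parts.append("Passage:\n" + context + "\nQuestions:")
    return "\n\n".join(parts)

def generate_questions(context: str,
                       call_llm: Callable[[str], str],
                       exemplars: Optional[List[Tuple[str, List[str]]]] = None) -> List[str]:
    # call_llm is a hypothetical stand-in for any text-completion or chat
    # client (e.g. GPT-4 Turbo, Llama, Gemini): it takes a prompt string and
    # returns the model's text output.
    if exemplars:
        prompt = build_few_shot_prompt(context, exemplars)
    else:
        prompt = ZERO_SHOT_TEMPLATE.format(context=context)
    # One question per non-empty output line.
    return [q.strip() for q in call_llm(prompt).splitlines() if q.strip()]

In use, the same generate_questions call covers both conditions: pass no exemplars for zero-shot prompting, or pass eight (passage, questions) pairs for the eight-shot setting.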