A pilot study of the performance of Chat GPT and other large language models on a written final year periodontology exam

Bibliographic Details
Main Authors: Shaun Ramlogan, Vidya Raman, Shayn Ramlogan
Format: Article
Language:English
Published: BMC 2025-05-01
Series:BMC Medical Education
Subjects:
Online Access:https://doi.org/10.1186/s12909-025-07195-7
Description
Summary:Abstract Large Language Models (LLMs) such as Chat GPT are increasingly used by students in education, with reportedly adequate academic responses. Chat GPT is expected to learn and improve over time. Thus, the aim was to longitudinally compare the performance of the current versions of Chat GPT-4/GPT-4o with that of final-year DDS students on a written periodontology exam. Other current non-subscription LLMs were also compared to the students. Chat GPT-4, guided by the exam parameters, generated answers as ‘Run 1’ and, 6 months later, as ‘Run 2’. Chat GPT-4o generated answers as ‘Run 3’ 15 months later. All LLM and student scripts were marked independently by two periodontology lecturers (Cohen’s kappa 0.71). ‘Run 1’ and ‘Run 3’ achieved statistically significantly (p < 0.001) higher mean scores of 78% and 77%, respectively, compared to the students (60%). The mean scores of Chat GPT-4 and GPT-4o were also similar to that of the best student. ‘Run 2’ performed at the level of the students but underperformed relative to ‘Run 1’ and ‘Run 3’, with generalizations, more inaccuracies and incomplete answers. This variability in ‘Run 2’ may be due to outdated data sources, hallucinations and inherent LLM limitations such as online traffic and the availability of datasets and resources. Other non-subscription LLMs such as Claude, DeepSeek, Gemini and Le Chat also produced statistically significantly (p < 0.001) higher scores than the students. Claude was the best-performing LLM, with more comprehensive answers. LLMs such as Chat GPT may provide summaries and model answers in clinical undergraduate periodontology education. However, the results must be interpreted with caution regarding academic accuracy and credibility, especially in a health care profession.
ISSN:1472-6920