Performance assessment of ChatGPT 4, ChatGPT 3.5, Gemini Advanced Pro 1.5 and Bard 2.0 to problem solving in pathology in French language
Main Authors: | , |
---|---|
Format: | Article |
Language: | English |
Published: | SAGE Publishing, 2025-01-01 |
Series: | Digital Health |
Online Access: | https://doi.org/10.1177/20552076241310630 |
Summary: | Digital teaching diversifies the ways of assessing knowledge, as natural language processing (NLP) makes it possible to answer questions posed by students and teachers. Objective: This study evaluated the performance of ChatGPT, Bard and Gemini on second-year medical (DFGSM2) Pathology exams from the Health Sciences Center of Dijon (France) in 2018–2022. Methods: Exam scores, discriminating powers and discordance rates from 2018 to 2022 were retrieved. Seventy questions (25 first-order single-response questions and 45 second-order multiple-response questions) were submitted in May 2023 to ChatGPT 3.5 and Bard 2.0, and in September 2024 to Gemini 1.5 and ChatGPT 4. The average scores of chatbots and students were compared, as were the discriminating powers of the questions answered by chatbots. The percentage of identical student–chatbot answers was retrieved, and linear regression analysis correlated chatbot scores with student discordance rates. Chatbot reliability was assessed by submitting the questions in four successive rounds and comparing score variability using Fleiss' kappa and Cohen's kappa. Results: Newer chatbots outperformed both students and older chatbots on overall scores and on multiple-response questions. All chatbots outperformed students on less discriminating questions. Conversely, students outperformed all chatbots on questions with high discriminating power. Chatbot scores were correlated with student discordance rates. ChatGPT 4 and Gemini 1.5 provided variable answers, owing to effects linked to prompt engineering. Conclusion: In line with the literature, our study confirms the moderate performance of chatbots on questions requiring complex reasoning, with ChatGPT outperforming Google chatbots. The use of NLP software based on distributional semantics remains a challenge for the generation of questions in French. Drawbacks of using NLP software to generate questions include hallucinations and erroneous medical knowledge, which have to be taken into account when using NLP software in medical education. |
---|---|
ISSN: | 2055-2076 |