Performance assessment of ChatGPT 4, ChatGPT 3.5, Gemini Advanced Pro 1.5 and Bard 2.0 to problem solving in pathology in French language

Bibliographic Details
Main Authors: Georges Tarris, Laurent Martin (University of Burgundy Health Sciences Center, Dijon, France)
Format: Article
Language: English
Published: SAGE Publishing, 2025-01-01
Series: Digital Health
ISSN: 2055-2076
Collection: DOAJ
Online Access: https://doi.org/10.1177/20552076241310630
Description:
Digital teaching diversifies the ways in which knowledge is assessed, as natural language processing (NLP) offers the possibility of answering questions posed by students and teachers. Objective: This study evaluated the performance of ChatGPT, Bard and Gemini on second-year medical studies (DFGSM2) pathology exams from the Health Sciences Center of Dijon (France) held in 2018–2022. Methods: Exam scores, discriminating powers and discordance rates were retrieved for the 2018–2022 sessions. Seventy questions (25 first-order single-response questions and 45 second-order multiple-response questions) were submitted in May 2023 to ChatGPT 3.5 and Bard 2.0, and in September 2024 to Gemini 1.5 and ChatGPT-4. Average chatbot and student scores were compared, as were the discriminating powers of the questions answered by the chatbots. The percentage of identical student–chatbot answers was retrieved, and linear regression analysis correlated chatbot scores with student discordance rates. Chatbot reliability was assessed by submitting the questions in four successive rounds and comparing score variability using Fleiss' kappa and Cohen's kappa. Results: Newer chatbots outperformed both students and older chatbots on overall scores and on multiple-response questions. All chatbots outperformed students on the less discriminating questions; conversely, all chatbots were outperformed by students on questions with a high discriminating power. Chatbot scores were correlated with student discordance rates. ChatGPT 4 and Gemini 1.5 provided variable answers, owing to effects linked to prompt engineering. Conclusion: In line with the literature, our study confirms the moderate performance of chatbots on questions requiring complex reasoning, with ChatGPT outperforming the Google chatbots. The use of NLP software based on distributional semantics remains a challenge for the generation of questions in French. Drawbacks of using NLP software to generate questions include hallucinations and erroneous medical knowledge, which must be taken into account when using NLP software in medical education.
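The reliability and correlation analyses named in the abstract (Fleiss' kappa, Cohen's kappa, linear regression against discordance rates) can be illustrated with a small sketch. The snippet below is a hypothetical illustration, not the authors' code: the toy data, variable names, and the choice of statsmodels, scikit-learn and SciPy are assumptions made for demonstration only.

```python
# Hypothetical sketch of the statistics described in the abstract (assumed tooling,
# toy data): answer stability across four submission rounds via Fleiss'/Cohen's kappa,
# and a linear regression of chatbot scores on student discordance rates.
import numpy as np
from scipy.stats import linregress
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
# Toy data: the option chosen by one chatbot for each of 70 questions, over 4 rounds.
rounds = rng.integers(0, 4, size=(70, 4))

# Fleiss' kappa treats the four rounds as four "raters" of each question.
table, _ = aggregate_raters(rounds)
print("Fleiss' kappa across rounds:", fleiss_kappa(table))

# Cohen's kappa for agreement between two specific rounds.
print("Cohen's kappa, round 1 vs round 2:",
      cohen_kappa_score(rounds[:, 0], rounds[:, 1]))

# Linear regression of per-question chatbot score on student discordance rate.
discordance_rate = rng.uniform(0.0, 1.0, size=70)  # placeholder values
chatbot_score = rng.uniform(0.0, 1.0, size=70)     # placeholder values
fit = linregress(discordance_rate, chatbot_score)
print(f"slope={fit.slope:.3f}, r={fit.rvalue:.3f}, p={fit.pvalue:.3g}")
```

With real data, `rounds` would hold the coded answers actually returned by a chatbot in each round, and `chatbot_score`/`discordance_rate` would be the per-question values reported in the study rather than random placeholders.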