Factors Associated With the Accuracy of Large Language Models in Basic Medical Science Examinations: Cross-Sectional Study
Abstract Background: Artificial intelligence (AI) has become widely applied across many fields, including medical education. The validity of the content and answers these models produce depends on their training datasets and on the optimization of each model. The accuracy of large language models (LLMs) in basic...
Main Authors: | Naritsaret Kaewboonlert, Jiraphon Poontananggul, Natthipong Pongsuwan, Gun Bhakdisongkhram |
---|---|
Format: | Article |
Language: | English |
Published: | JMIR Publications, 2025-01-01 |
Series: | JMIR Medical Education |
Online Access: | https://mededu.jmir.org/2025/1/e58898 |
_version_ | 1832593392678207488 |
---|---|
author | Naritsaret Kaewboonlert Jiraphon Poontananggul Natthipong Pongsuwan Gun Bhakdisongkhram |
author_facet | Naritsaret Kaewboonlert Jiraphon Poontananggul Natthipong Pongsuwan Gun Bhakdisongkhram |
author_sort | Naritsaret Kaewboonlert |
collection | DOAJ |
description |
Abstract
Background: Artificial intelligence (AI) has become widely applied across many fields, including medical education. The validity of the content and answers these models produce depends on their training datasets and on the optimization of each model. The accuracy of large language models (LLMs) in basic medical examinations, and the factors related to their accuracy, have also been explored.
Objective: We evaluated factors associated with the accuracy of LLMs (GPT-3.5, GPT-4, Google Bard, and Microsoft Bing) in answering multiple-choice questions from basic medical science examinations.
Methods: We used questions that were closely aligned with the content and topic distribution of Thailand’s Step 1 National Medical Licensing Examination. Variables such as the difficulty index, discrimination index, and question characteristics were collected. These questions were then simultaneously input into ChatGPT (with GPT-3.5 and GPT-4), Microsoft Bing, and Google Bard, and their responses were recorded. The accuracy of these LLMs and the associated factors were analyzed using multivariable logistic regression. This analysis aimed to assess the effect of various factors on model accuracy, with results reported as odds ratios (ORs).
Results: The study revealed that GPT-4 was the top-performing model, with an overall accuracy of 89.07% (95% CI 84.76%‐92.41%), significantly outperforming the others (P
Conclusions: The GPT-4 and Microsoft Bing models demonstrated comparable accuracy to each other, and both were superior to GPT-3.5 and Google Bard in the domain of basic medical science. The accuracy of these models was significantly influenced by the item’s difficulty index, indicating that the LLMs are more accurate when answering easier questions. This suggests that the more accurate models, such as GPT-4 and Bing, can be valuable tools for understanding and learning basic medical science concepts. |
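The two item-level quantities the abstract relies on, the difficulty index (the proportion of examinees answering an item correctly) and odds ratios relating accuracy to item characteristics, can be illustrated in miniature. The functions and counts below are hypothetical, not taken from the study; the actual analysis fitted a multivariable logistic regression rather than the single 2×2 table shown here.

```python
# Illustrative sketch only: toy data, not the study's results.

def difficulty_index(responses):
    """Proportion of examinees who answered the item correctly
    (1 = correct, 0 = incorrect). Higher values mean easier items."""
    return sum(responses) / len(responses)

def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table:
                 correct  incorrect
    easy item       a         b
    hard item       c         d
    """
    return (a * d) / (b * c)

# An item answered correctly by 18 of 24 examinees -> an "easy" item
p = difficulty_index([1] * 18 + [0] * 6)

# Toy counts: an LLM correct on 80/100 easy items vs 40/100 hard items
or_easy_vs_hard = odds_ratio(80, 20, 40, 60)

print(round(p, 2))        # 0.75
print(or_easy_vs_hard)    # 6.0
```

An OR above 1 here means the model's odds of answering correctly are higher on easy items, which is the direction of effect the conclusions describe; a multivariable model would additionally adjust for the discrimination index and other question characteristics.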
format | Article |
id | doaj-art-08a523418567447baae9f6e9bd6cfc30 |
institution | Kabale University |
issn | 2369-3762 |
language | English |
publishDate | 2025-01-01 |
publisher | JMIR Publications |
record_format | Article |
series | JMIR Medical Education |
spelling | doaj-art-08a523418567447baae9f6e9bd6cfc30 2025-01-20T16:15:54Z eng JMIR Publications JMIR Medical Education 2369-3762 2025-01-01 11 e58898 10.2196/58898 Factors Associated With the Accuracy of Large Language Models in Basic Medical Science Examinations: Cross-Sectional Study Naritsaret Kaewboonlert http://orcid.org/0009-0004-2035-5631 Jiraphon Poontananggul http://orcid.org/0009-0000-9566-7737 Natthipong Pongsuwan http://orcid.org/0009-0002-0555-7767 Gun Bhakdisongkhram http://orcid.org/0000-0001-7434-9262 https://mededu.jmir.org/2025/1/e58898 |
spellingShingle | Naritsaret Kaewboonlert Jiraphon Poontananggul Natthipong Pongsuwan Gun Bhakdisongkhram Factors Associated With the Accuracy of Large Language Models in Basic Medical Science Examinations: Cross-Sectional Study JMIR Medical Education |
title | Factors Associated With the Accuracy of Large Language Models in Basic Medical Science Examinations: Cross-Sectional Study |
title_full | Factors Associated With the Accuracy of Large Language Models in Basic Medical Science Examinations: Cross-Sectional Study |
title_fullStr | Factors Associated With the Accuracy of Large Language Models in Basic Medical Science Examinations: Cross-Sectional Study |
title_full_unstemmed | Factors Associated With the Accuracy of Large Language Models in Basic Medical Science Examinations: Cross-Sectional Study |
title_short | Factors Associated With the Accuracy of Large Language Models in Basic Medical Science Examinations: Cross-Sectional Study |
title_sort | factors associated with the accuracy of large language models in basic medical science examinations cross sectional study |
url | https://mededu.jmir.org/2025/1/e58898 |
work_keys_str_mv | AT naritsaretkaewboonlert factorsassociatedwiththeaccuracyoflargelanguagemodelsinbasicmedicalscienceexaminationscrosssectionalstudy AT jiraphonpoontananggul factorsassociatedwiththeaccuracyoflargelanguagemodelsinbasicmedicalscienceexaminationscrosssectionalstudy AT natthipongpongsuwan factorsassociatedwiththeaccuracyoflargelanguagemodelsinbasicmedicalscienceexaminationscrosssectionalstudy AT gunbhakdisongkhram factorsassociatedwiththeaccuracyoflargelanguagemodelsinbasicmedicalscienceexaminationscrosssectionalstudy |