Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis
Main Author: | Boxiong Wei |
---|---|
Format: | Article |
Language: | English |
Published: | JMIR Publications, 2025-01-01 |
Series: | JMIR Medical Education |
Online Access: | https://mededu.jmir.org/2025/1/e64284 |
_version_ | 1832585158280085504 |
---|---|
author | Boxiong Wei |
author_facet | Boxiong Wei |
author_sort | Boxiong Wei |
collection | DOAJ |
description |
Abstract
Background: Artificial intelligence advancements have enabled large language models to significantly impact radiology education and diagnostic accuracy.
Objective: This study evaluates the performance of mainstream large language models, including GPT-4, Claude, Bard, Tongyi Qianwen, and Gemini Pro, in radiology board exams.
Methods: A comparative analysis of 150 multiple-choice questions from radiology board exams without images was conducted. Models were assessed on their accuracy for text-based questions and were categorized by cognitive levels and medical specialties using χ2 tests (see the sketch below).
Results: GPT-4 achieved the highest accuracy (83.3%, 125/150), significantly outperforming all other models. Specifically, Claude achieved an accuracy of 62% (93/150).
Conclusions: GPT-4 and Tongyi Qianwen show promise in medical education and training. The study emphasizes the need for domain-specific training datasets to enhance large language models’ effectiveness in specialized fields like radiology. |
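The Methods describe χ2 comparisons of model accuracy on the same 150 text-only questions. Below is a minimal, hypothetical Python sketch of one such pairwise comparison using SciPy; the 2×2 contingency layout, variable names, and the scipy.stats.chi2_contingency call are illustrative assumptions, not the study’s actual analysis code. The counts plugged in are the reported GPT-4 (125/150) and Claude (93/150) scores.

```python
# Illustrative sketch only -- not the study's analysis code.
# Compare two models' accuracy on the same 150 text-only questions
# with a chi-square test on a 2x2 contingency table.
from scipy.stats import chi2_contingency

TOTAL_QUESTIONS = 150            # questions answered by each model
gpt4_correct = 125               # reported: 83.3% (125/150)
claude_correct = 93              # reported: 62% (93/150)

# Rows = model, columns = (correct, incorrect).
table = [
    [gpt4_correct, TOTAL_QUESTIONS - gpt4_correct],
    [claude_correct, TOTAL_QUESTIONS - claude_correct],
]

# chi2_contingency applies Yates' continuity correction by default for
# 2x2 tables; the result unpacks into the statistic, P value, degrees
# of freedom, and expected frequencies.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, P = {p_value:.4f}")
```

The same table-building step could be repeated for each model pair (for example, GPT-4 vs Bard or GPT-4 vs Gemini Pro), which is how the abstract’s pairwise significance comparisons would typically be produced.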
format | Article |
id | doaj-art-e09f6dc558884493898f9c22a1983de9 |
institution | Kabale University |
issn | 2369-3762 |
language | English |
publishDate | 2025-01-01 |
publisher | JMIR Publications |
record_format | Article |
series | JMIR Medical Education |
spelling | doaj-art-e09f6dc558884493898f9c22a1983de92025-01-27T02:52:12ZengJMIR PublicationsJMIR Medical Education2369-37622025-01-0111e64284e6428410.2196/64284Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative AnalysisBoxiong Weihttp://orcid.org/0000-0001-6206-9319 https://mededu.jmir.org/2025/1/e64284 |
spellingShingle | Boxiong Wei Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis JMIR Medical Education |
title | Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis |
title_full | Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis |
title_fullStr | Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis |
title_full_unstemmed | Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis |
title_short | Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis |
title_sort | performance evaluation and implications of large language models in radiology board exams prospective comparative analysis |
url | https://mededu.jmir.org/2025/1/e64284 |
work_keys_str_mv | AT boxiongwei performanceevaluationandimplicationsoflargelanguagemodelsinradiologyboardexamsprospectivecomparativeanalysis |