Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis
Main Author: | Boxiong Wei |
---|---|
Format: | Article |
Language: | English |
Published: | JMIR Publications, 2025-01-01 |
Series: | JMIR Medical Education |
Online Access: | https://mededu.jmir.org/2025/1/e64284 |
_version_ | 1832585158280085504 |
---|---|
author | Boxiong Wei |
author_facet | Boxiong Wei |
author_sort | Boxiong Wei |
collection | DOAJ |
description |
Abstract
Background: Artificial intelligence advancements have enabled large language models to significantly impact radiology education and diagnostic accuracy.
Objective: This study evaluates the performance of mainstream large language models, including GPT-4, Claude, Bard, Tongyi Qianwen, and Gemini Pro, in radiology board exams.
Methods: A comparative analysis of 150 multiple-choice questions from radiology board exams without images was conducted. Models were assessed on their accuracy for text-based questions and were categorized by cognitive levels and medical specialties using χ2 tests (see the sketch below).
Results: GPT-4 achieved the highest accuracy (83.3%, 125/150), significantly outperforming all other models. Specifically, Claude achieved an accuracy of 62% (93/150).
Conclusions: GPT-4 and Tongyi Qianwen show promise in medical education and training. The study emphasizes the need for domain-specific training datasets to enhance large language models’ effectiveness in specialized fields like radiology. |
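The Methods describe χ2 comparisons of model accuracy on the same 150 text-only questions. Below is a minimal, hypothetical Python sketch of one such pairwise comparison using SciPy; the 2×2 contingency layout, variable names, and the scipy.stats.chi2_contingency call are illustrative assumptions, not the study’s actual analysis code. The counts plugged in are the reported GPT-4 (125/150) and Claude (93/150) scores.

```python
# Illustrative sketch only -- not the study's analysis code.
# Compare two models' accuracy on the same 150 text-only questions
# with a chi-square test on a 2x2 contingency table.
from scipy.stats import chi2_contingency

TOTAL_QUESTIONS = 150            # questions answered by each model
gpt4_correct = 125               # reported: 83.3% (125/150)
claude_correct = 93              # reported: 62% (93/150)

# Rows = model, columns = (correct, incorrect).
table = [
    [gpt4_correct, TOTAL_QUESTIONS - gpt4_correct],
    [claude_correct, TOTAL_QUESTIONS - claude_correct],
]

# chi2_contingency applies Yates' continuity correction by default for
# 2x2 tables; the result unpacks into the statistic, P value, degrees
# of freedom, and expected frequencies.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, P = {p_value:.4f}")
```

The same table-building step could be repeated for each model pair (for example, GPT-4 vs Bard or GPT-4 vs Gemini Pro), which is how the abstract’s pairwise significance comparisons would typically be produced.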
format | Article |
id | doaj-art-e09f6dc558884493898f9c22a1983de9 |
institution | Kabale University |
issn | 2369-3762 |
language | English |
publishDate | 2025-01-01 |
publisher | JMIR Publications |
record_format | Article |
series | JMIR Medical Education |
spelling | doaj-art-e09f6dc558884493898f9c22a1983de92025-01-27T02:52:12ZengJMIR PublicationsJMIR Medical Education2369-37622025-01-0111e64284e6428410.2196/64284Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative AnalysisBoxiong Weihttp://orcid.org/0000-0001-6206-9319 https://mededu.jmir.org/2025/1/e64284 |
spellingShingle | Boxiong Wei Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis JMIR Medical Education |
title | Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis |
title_full | Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis |
title_fullStr | Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis |
title_full_unstemmed | Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis |
title_short | Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis |
title_sort | performance evaluation and implications of large language models in radiology board exams prospective comparative analysis |
url | https://mededu.jmir.org/2025/1/e64284 |
work_keys_str_mv | AT boxiongwei performanceevaluationandimplicationsoflargelanguagemodelsinradiologyboardexamsprospectivecomparativeanalysis |