A multi-model longitudinal assessment of ChatGPT performance on medical residency examinations

Introduction: ChatGPT, a generative artificial intelligence, has potential applications in numerous fields, including medical education. This potential can be assessed through its performance on medical exams. Medical residency exams, critical for entering medical specialties, serve as a valuable benc...

Bibliographic Details
Main Authors:	Maria Eduarda Varela Cavalcanti Souto, Alexandre Chaves Fernandes, Ana Beatriz Santana Silva, Louise Helena de Freitas Ribeiro, Thales Allyrio Araújo de Medeiros Fernandes
Format:	Article
Language:	English
Published:	Frontiers Media S.A. 2025-08-01
Series:	Frontiers in Artificial Intelligence
Subjects:	generative artificial intelligence; medical residency examinations; medical education; artificial intelligence; chain-of-thought reasoning; large language model
Online Access:	https://www.frontiersin.org/articles/10.3389/frai.2025.1614874/full

collection	DOAJ
description	Introduction: ChatGPT, a generative artificial intelligence, has potential applications in numerous fields, including medical education. This potential can be assessed through its performance on medical exams. Medical residency exams, critical for entering medical specialties, serve as a valuable benchmark.
Materials and methods: This study aimed to assess the accuracy of ChatGPT-4 and GPT-4o in responding to 1,041 medical residency questions from Brazil, examining overall accuracy and performance across different medical areas, based on evaluations conducted in 2023 and 2024. The questions were classified into higher and lower cognitive levels according to Bloom’s taxonomy. Additionally, questions answered incorrectly by both models were tested using recent GPT models that use chain-of-thought reasoning (e.g., o1-preview, o3, o4-mini-high), with evaluations carried out in both 2024 and 2025.
Results: GPT-4 achieved 81.27% accuracy (95% CI: 78.89–83.64%), while GPT-4o reached 85.88% (95% CI: 83.76–88.00%), significantly outperforming GPT-4 (p < 0.05). Both models showed reduced accuracy on higher-order thinking questions. On questions that both models failed, GPT o1-preview achieved 53.26% accuracy (95% CI: 42.87–63.65%), GPT o3 47.83% (95% CI: 37.42–58.23%), and o4-mini-high 35.87% (95% CI: 25.88–45.86%), with all three models performing better on higher-order questions.
Conclusion: Artificial intelligence could be a beneficial tool in medical education, enhancing residency exam preparation, helping to understand complex topics, and improving teaching strategies. However, careful use of artificial intelligence is essential due to ethical concerns and potential limitations in both educational and clinical practice.
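The record does not say how the reported confidence intervals were computed. As a rough illustration only, the sketch below applies a normal-approximation (Wald) interval to the GPT-4o accuracy, assuming all 1,041 questions were scored for that model; the function name `wald_ci` and the choice of z = 1.96 are assumptions for this example, not the authors' stated method.

```python
import math

def wald_ci(p_hat, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion.

    p_hat: observed proportion (here, accuracy as a fraction)
    n:     number of scored questions (assumed, see the note above)
    z:     critical value; 1.96 corresponds to a 95% interval
    """
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# GPT-4o accuracy from the abstract: 85.88% on 1,041 questions.
lo, hi = wald_ci(0.8588, 1041)
print(f"{lo:.2%} to {hi:.2%}")
```

Under these assumptions the result lands close to the interval quoted for GPT-4o. The intervals for the smaller set of doubly failed questions may have been computed differently (for example, with a t-based critical value), so this sketch should not be read as reproducing every figure in the abstract.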
id	doaj-art-e37dc193c6b147f49b60b46a282976ec
institution	Kabale University
issn	2624-8212
spelling	doaj-art-e37dc193c6b147f49b60b46a282976ec; 2025-08-22T05:26:44Z; eng; Frontiers Media S.A.; Frontiers in Artificial Intelligence; 2624-8212; 2025-08-01; 8; 10.3389/frai.2025.1614874; 1614874
Author affiliations:
0	Maria Eduarda Varela Cavalcanti Souto: Department of Biomedical Sciences, School of Health Sciences, State University of Rio Grande do Norte, Mossoró, Brazil
1	Alexandre Chaves Fernandes: Institute of Mathematics and Computer Sciences, University of São Paulo, São Paulo, Brazil
2	Ana Beatriz Santana Silva: Department of Biomedical Sciences, School of Health Sciences, State University of Rio Grande do Norte, Mossoró, Brazil
3	Louise Helena de Freitas Ribeiro: Department of Biomedical Sciences, School of Health Sciences, State University of Rio Grande do Norte, Mossoró, Brazil
4	Thales Allyrio Araújo de Medeiros Fernandes: Department of Biomedical Sciences, School of Health Sciences, State University of Rio Grande do Norte, Mossoró, Brazil
title	A multi-model longitudinal assessment of ChatGPT performance on medical residency examinations


Similar Items