Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study


Bibliographic Details
Main Authors: Mahmud Omar, Reem Agbareia, Benjamin S Glicksberg, Girish N Nadkarni, Eyal Klang
Format: Article
Language: English
Published: JMIR Publications 2025-05-01
Series: JMIR Medical Informatics
Online Access: https://medinform.jmir.org/2025/1/e66917
Description
Summary: Abstract
Background: The capabilities of large language models (LLMs) to self-assess their own confidence in answering questions in the biomedical domain remain underexplored.
Objective: This study evaluates the confidence levels of 12 LLMs across 5 medical specialties to assess the models' ability to accurately judge their own responses.
Methods: We used 1965 multiple-choice questions assessing clinical knowledge in internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide an answer to each question and to report their confidence that the answer was correct (score range: 0%‐100%). We calculated the correlation between each model's mean confidence score for correct answers and its overall accuracy across all questions. Confidence scores for correct and incorrect answers were also compared to determine the mean difference in confidence, using 2-sample, 2-tailed t tests.
Results: The correlation between the mean confidence scores for correct answers and model accuracy was inverse and statistically significant (r = …; P = …).
Conclusions: Better-performing LLMs show better-aligned overall confidence levels. However, even the most accurate models still show minimal variation in confidence between right and wrong answers, which may limit their safe use in clinical settings. Addressing overconfidence could involve refining calibration methods, performing domain-specific fine-tuning, and involving human oversight when decisions carry high risk. Further research is needed to improve these strategies before broader clinical adoption of LLMs.
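As an illustration of the type of analysis the Methods describes (not the authors' code), the following is a minimal Python sketch. It assumes a hypothetical per-question results table with columns model, correct, and confidence, and uses a Pearson correlation for the confidence-accuracy relationship, one plausible choice where the abstract does not name the coefficient.

# Minimal sketch (not from the paper): confidence calibration analysis,
# assuming a hypothetical DataFrame with one row per (model, question) and columns:
#   model      - model name
#   correct    - bool, whether the answer was correct
#   confidence - self-reported confidence for the answer (0-100)
import pandas as pd
from scipy import stats

def calibration_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-model accuracy, mean confidence for correct vs. incorrect answers,
    and a 2-sample, 2-tailed t test on the difference in confidence."""
    rows = []
    for model, g in df.groupby("model"):
        conf_correct = g.loc[g["correct"], "confidence"]
        conf_incorrect = g.loc[~g["correct"], "confidence"]
        # scipy's independent-samples t test is 2-tailed by default
        t, p = stats.ttest_ind(conf_correct, conf_incorrect)
        rows.append({
            "model": model,
            "accuracy": g["correct"].mean(),
            "mean_conf_correct": conf_correct.mean(),
            "mean_conf_incorrect": conf_incorrect.mean(),
            "t_stat": t,
            "p_value": p,
        })
    return pd.DataFrame(rows)

def confidence_accuracy_correlation(summary: pd.DataFrame) -> tuple[float, float]:
    """Correlation across models between mean confidence on correct answers
    and overall accuracy (one data point per model); Pearson r is assumed."""
    r, p = stats.pearsonr(summary["mean_conf_correct"], summary["accuracy"])
    return float(r), float(p)

In this sketch, an inverse (negative) correlation would mirror the abstract's finding that more accurate models report overall confidence levels that are better aligned with their performance.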
ISSN:2291-9694