Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study
| Main Authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | JMIR Publications, 2025-05-01 |
| Series: | JMIR Medical Informatics |
| Online Access: | https://medinform.jmir.org/2025/1/e66917 |
| Summary: | Abstract
Background: The capabilities of large language models (LLMs) to self-assess their own confidence in answering questions within the biomedical realm remain underexplored.
Objective: This study evaluates the confidence levels of 12 LLMs across 5 medical specialties to assess LLMs’ ability to accurately judge their own responses.
Methods: We used 1965 multiple-choice questions assessing clinical knowledge in the following areas: internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide answers and to report their confidence that those answers were correct (score range 0%‐100%). We calculated the correlation between each model’s mean confidence score for correct answers and the overall accuracy of each model across all questions. Confidence scores for correct and incorrect answers were also compared to determine the mean difference in confidence, using 2-sample, 2-tailed t tests (a minimal sketch of this analysis appears below the record).
Results: The correlation between the mean confidence scores for correct answers and model accuracy was inverse and statistically significant (r = …; P = …).
Conclusions: Better-performing LLMs show overall confidence levels that are better aligned with their actual accuracy. However, even the most accurate models still show minimal variation in confidence between right and wrong answers, which may limit their safe use in clinical settings. Addressing overconfidence could involve refining calibration methods, performing domain-specific fine-tuning, and involving human oversight when decisions carry high risks. Further research is needed to improve these strategies before broader clinical adoption of LLMs. |
| ISSN: | 2291-9694 |
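The Methods above describe a small statistical workflow: score each model's accuracy, average its self-reported confidence on correct answers, correlate those two quantities across the 12 models, and compare confidence on correct versus incorrect answers with 2-sample, 2-tailed t tests. The sketch below illustrates that workflow under stated assumptions: the `Response` record layout and model identifiers are hypothetical, and Welch's t test is used because the abstract does not say whether equal variances were assumed; this is not the authors' actual code.

```python
# Illustrative sketch of the confidence-calibration analysis described above.
# Data layout, field names, and the Welch t-test choice are assumptions,
# not the study's actual pipeline.
from dataclasses import dataclass
from statistics import mean
from scipy.stats import pearsonr, ttest_ind


@dataclass
class Response:
    model: str         # hypothetical model identifier
    correct: bool      # did the chosen option match the answer key?
    confidence: float  # self-reported confidence, 0-100


def summarize(responses: list[Response]) -> None:
    models = sorted({r.model for r in responses})
    accuracies, mean_conf_correct = [], []

    for m in models:
        rs = [r for r in responses if r.model == m]
        conf_right = [r.confidence for r in rs if r.correct]
        conf_wrong = [r.confidence for r in rs if not r.correct]

        accuracies.append(sum(r.correct for r in rs) / len(rs))
        mean_conf_correct.append(mean(conf_right))

        # 2-sample, 2-tailed t test: confidence on correct vs. incorrect answers
        t_stat, p_val = ttest_ind(conf_right, conf_wrong, equal_var=False)
        print(f"{m}: mean confidence gap = {mean(conf_right) - mean(conf_wrong):.1f} "
              f"points (t = {t_stat:.2f}, P = {p_val:.3f})")

    # Correlation between mean confidence on correct answers and overall accuracy
    r, p = pearsonr(mean_conf_correct, accuracies)
    print(f"Across {len(models)} models: Pearson r = {r:.2f}, P = {p:.3f}")
```

A well-calibrated model would show both a large confidence gap between correct and incorrect answers and a mean confidence close to its accuracy; the abstract's finding of an inverse correlation and minimal gaps is what motivates the calibration concerns raised in the Conclusions.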