Large language models improve the identification of emergency department visits for symptomatic kidney stones

Abstract Recent advancements of large language models (LLMs) like generative pre-trained transformer 4 (GPT-4) have generated significant interest among the scientific community. Yet, the potential of these models to be utilized in clinical settings remains largely unexplored. In this study, we inve...

Bibliographic Details
Main Authors: Cosmin A. Bejan, Amy M. Reed, Matthew Mikula, Siwei Zhang, Yaomin Xu, Daniel Fabbri, Peter J. Embí, Ryan S. Hsi
Format: Article
Language: English
Published: Nature Portfolio 2025-01-01
Series: Scientific Reports
Subjects: Large language models; LLMs; GPT-3.5; GPT-4; Llama-2; Kidney stones
Online Access: https://doi.org/10.1038/s41598-025-86632-5
_version_ 1832571775577227264
author Cosmin A. Bejan
Amy M. Reed
Matthew Mikula
Siwei Zhang
Yaomin Xu
Daniel Fabbri
Peter J. Embí
Ryan S. Hsi
author_sort Cosmin A. Bejan
collection DOAJ
description Abstract Recent advancements of large language models (LLMs) like generative pre-trained transformer 4 (GPT-4) have generated significant interest among the scientific community. Yet, the potential of these models to be utilized in clinical settings remains largely unexplored. In this study, we investigated the abilities of multiple LLMs and traditional machine learning models to analyze emergency department (ED) reports and determine if the corresponding visits were due to symptomatic kidney stones. Leveraging a dataset of manually annotated ED reports, we developed strategies to enhance LLMs including prompt optimization, zero- and few-shot prompting, fine-tuning, and prompt augmentation. Further, we implemented fairness assessment and bias mitigation methods to investigate the potential disparities by LLMs with respect to race and gender. A clinical expert manually assessed the explanations generated by GPT-4 for its predictions to determine if they were sound, factually correct, unrelated to the input prompt, or potentially harmful. The best results were achieved by GPT-4 (macro-F1 = 0.833, 95% confidence interval [CI] 0.826–0.841) and GPT-3.5 (macro-F1 = 0.796, 95% CI 0.796–0.796). Ablation studies revealed that the initial pre-trained GPT-3.5 model benefits from fine-tuning. Adding demographic information and prior disease history to the prompts allows LLMs to make better decisions. Bias assessment found that GPT-4 exhibited no racial or gender disparities, in contrast to GPT-3.5, which failed to effectively model racial diversity.
format Article
id doaj-art-21260fb45f9d4686b62bfd831343eb9f
institution Kabale University
issn 2045-2322
language English
publishDate 2025-01-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-21260fb45f9d4686b62bfd831343eb9f
2025-02-02T12:18:49Z
eng
Nature Portfolio
Scientific Reports
2045-2322
2025-01-01
15 1 1 10
10.1038/s41598-025-86632-5
Large language models improve the identification of emergency department visits for symptomatic kidney stones
Cosmin A. Bejan (0): Department of Biomedical Informatics, Vanderbilt University School of Medicine, Vanderbilt University Medical Center
Amy M. Reed (1): Department of Urology, Vanderbilt University Medical Center
Matthew Mikula (2): Department of Urology, Vanderbilt University Medical Center
Siwei Zhang (3): Department of Biostatistics, Vanderbilt University Medical Center
Yaomin Xu (4): Department of Biomedical Informatics, Vanderbilt University School of Medicine, Vanderbilt University Medical Center
Daniel Fabbri (5): Department of Biomedical Informatics, Vanderbilt University School of Medicine, Vanderbilt University Medical Center
Peter J. Embí (6): Department of Biomedical Informatics, Vanderbilt University School of Medicine, Vanderbilt University Medical Center
Ryan S. Hsi (7): Department of Urology, Vanderbilt University Medical Center
https://doi.org/10.1038/s41598-025-86632-5
Large language models
LLMs
GPT-3.5
GPT-4
Llama-2
Kidney stones
title Large language models improve the identification of emergency department visits for symptomatic kidney stones
title_sort large language models improve the identification of emergency department visits for symptomatic kidney stones
topic Large language models
LLMs
GPT-3.5
GPT-4
Llama-2
Kidney stones
url https://doi.org/10.1038/s41598-025-86632-5