Large language models improve the identification of emergency department visits for symptomatic kidney stones

Abstract Recent advancements of large language models (LLMs) like generative pre-trained transformer 4 (GPT-4) have generated significant interest among the scientific community. Yet, the potential of these models to be utilized in clinical settings remains largely unexplored. In this study, we inve...

Bibliographic Details
Main Authors:	Cosmin A. Bejan, Amy M. Reed, Matthew Mikula, Siwei Zhang, Yaomin Xu, Daniel Fabbri, Peter J. Embí, Ryan S. Hsi
Format:	Article
Language:	English
Published:	Nature Portfolio 2025-01-01
Series:	Scientific Reports
Subjects:	Large language models LLMs GPT-3.5 GPT-4 Llama-2 Kidney stones
Online Access:	https://doi.org/10.1038/s41598-025-86632-5

collection	DOAJ
description	Abstract Recent advancements of large language models (LLMs) like generative pre-trained transformer 4 (GPT-4) have generated significant interest among the scientific community. Yet, the potential of these models to be utilized in clinical settings remains largely unexplored. In this study, we investigated the abilities of multiple LLMs and traditional machine learning models to analyze emergency department (ED) reports and determine if the corresponding visits were due to symptomatic kidney stones. Leveraging a dataset of manually annotated ED reports, we developed strategies to enhance LLMs including prompt optimization, zero- and few-shot prompting, fine-tuning, and prompt augmentation. Further, we implemented fairness assessment and bias mitigation methods to investigate the potential disparities by LLMs with respect to race and gender. A clinical expert manually assessed the explanations generated by GPT-4 for its predictions to determine if they were sound, factually correct, unrelated to the input prompt, or potentially harmful. The best results were achieved by GPT-4 (macro-F1 = 0.833, 95% confidence interval [CI] 0.826–0.841) and GPT-3.5 (macro-F1 = 0.796, 95% CI 0.796–0.796). Ablation studies revealed that the initial pre-trained GPT-3.5 model benefits from fine-tuning. Adding demographic information and prior disease history to the prompts allows LLMs to make better decisions. Bias assessment found that GPT-4 exhibited no racial or gender disparities, in contrast to GPT-3.5, which failed to effectively model racial diversity.
id	doaj-art-21260fb45f9d4686b62bfd831343eb9f
institution	Kabale University
issn	2045-2322
spelling	doaj-art-21260fb45f9d4686b62bfd831343eb9f; 2025-02-02T12:18:49Z; eng; Nature Portfolio; Scientific Reports; ISSN 2045-2322; 2025-01-01; volume 15, issue 1, pages 1–10; DOI 10.1038/s41598-025-86632-5
affiliations	Cosmin A. Bejan: Department of Biomedical Informatics, Vanderbilt University School of Medicine, Vanderbilt University Medical Center
	Amy M. Reed: Department of Urology, Vanderbilt University Medical Center
	Matthew Mikula: Department of Urology, Vanderbilt University Medical Center
	Siwei Zhang: Department of Biostatistics, Vanderbilt University Medical Center
	Yaomin Xu: Department of Biomedical Informatics, Vanderbilt University School of Medicine, Vanderbilt University Medical Center
	Daniel Fabbri: Department of Biomedical Informatics, Vanderbilt University School of Medicine, Vanderbilt University Medical Center
	Peter J. Embí: Department of Biomedical Informatics, Vanderbilt University School of Medicine, Vanderbilt University Medical Center
	Ryan S. Hsi: Department of Urology, Vanderbilt University Medical Center


Similar Items