Large language models improve the identification of emergency department visits for symptomatic kidney stones
Abstract Recent advancements of large language models (LLMs) like generative pre-trained transformer 4 (GPT-4) have generated significant interest among the scientific community. Yet, the potential of these models to be utilized in clinical settings remains largely unexplored. In this study, we inve...
Main Authors: | Cosmin A. Bejan, Amy M. Reed, Matthew Mikula, Siwei Zhang, Yaomin Xu, Daniel Fabbri, Peter J. Embí, Ryan S. Hsi |
---|---|
Format: | Article |
Language: | English |
Published: | Nature Portfolio, 2025-01-01 |
Series: | Scientific Reports |
Subjects: | Large language models; LLMs; GPT-3.5; GPT-4; Llama-2; Kidney stones |
Online Access: | https://doi.org/10.1038/s41598-025-86632-5 |
_version_ | 1832571775577227264 |
---|---|
author | Cosmin A. Bejan Amy M. Reed Matthew Mikula Siwei Zhang Yaomin Xu Daniel Fabbri Peter J. Embí Ryan S. Hsi |
author_sort | Cosmin A. Bejan |
collection | DOAJ |
description | Abstract Recent advancements of large language models (LLMs) like generative pre-trained transformer 4 (GPT-4) have generated significant interest among the scientific community. Yet, the potential of these models to be utilized in clinical settings remains largely unexplored. In this study, we investigated the abilities of multiple LLMs and traditional machine learning models to analyze emergency department (ED) reports and determine if the corresponding visits were due to symptomatic kidney stones. Leveraging a dataset of manually annotated ED reports, we developed strategies to enhance LLMs including prompt optimization, zero- and few-shot prompting, fine-tuning, and prompt augmentation. Further, we implemented fairness assessment and bias mitigation methods to investigate the potential disparities by LLMs with respect to race and gender. A clinical expert manually assessed the explanations generated by GPT-4 for its predictions to determine if they were sound, factually correct, unrelated to the input prompt, or potentially harmful. The best results were achieved by GPT-4 (macro-F1 = 0.833, 95% confidence interval [CI] 0.826–0.841) and GPT-3.5 (macro-F1 = 0.796, 95% CI 0.796–0.796). Ablation studies revealed that the initial pre-trained GPT-3.5 model benefits from fine-tuning. Adding demographic information and prior disease history to the prompts allows LLMs to make better decisions. Bias assessment found that GPT-4 exhibited no racial or gender disparities, in contrast to GPT-3.5, which failed to effectively model racial diversity. |
format | Article |
id | doaj-art-21260fb45f9d4686b62bfd831343eb9f |
institution | Kabale University |
issn | 2045-2322 |
language | English |
publishDate | 2025-01-01 |
publisher | Nature Portfolio |
record_format | Article |
series | Scientific Reports |
spelling | doaj-art-21260fb45f9d4686b62bfd831343eb9f; indexed 2025-02-02T12:18:49Z; eng; Nature Portfolio; Scientific Reports; ISSN 2045-2322; 2025-01-01; vol. 15, no. 1, pp. 1-10; doi:10.1038/s41598-025-86632-5. Author affiliations: Cosmin A. Bejan (Department of Biomedical Informatics, Vanderbilt University School of Medicine, Vanderbilt University Medical Center); Amy M. Reed (Department of Urology, Vanderbilt University Medical Center); Matthew Mikula (Department of Urology, Vanderbilt University Medical Center); Siwei Zhang (Department of Biostatistics, Vanderbilt University Medical Center); Yaomin Xu (Department of Biomedical Informatics, Vanderbilt University School of Medicine, Vanderbilt University Medical Center); Daniel Fabbri (Department of Biomedical Informatics, Vanderbilt University School of Medicine, Vanderbilt University Medical Center); Peter J. Embí (Department of Biomedical Informatics, Vanderbilt University School of Medicine, Vanderbilt University Medical Center); Ryan S. Hsi (Department of Urology, Vanderbilt University Medical Center). |
title | Large language models improve the identification of emergency department visits for symptomatic kidney stones |
topic | Large language models LLMs GPT-3.5 GPT-4 Llama-2 Kidney stones |
url | https://doi.org/10.1038/s41598-025-86632-5 |
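The abstract reports model performance as macro-F1, which is the unweighted mean of the per-class F1 scores. The sketch below shows how that metric could be computed for a binary task such as "visit due to symptomatic kidney stones" versus "other". It is an illustration only: the function names and the toy labels are hypothetical, and nothing here reproduces the authors' evaluation code, data, or confidence-interval procedure.

```python
from typing import Hashable, Sequence


def f1_for_class(y_true: Sequence[Hashable], y_pred: Sequence[Hashable], cls: Hashable) -> float:
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never occurs.
    return 0.0 if denom == 0 else 2 * tp / denom


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in either sequence."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one example is required")
    classes = set(y_true) | set(y_pred)
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


if __name__ == "__main__":
    # Hypothetical toy labels: 1 = symptomatic kidney stone visit, 0 = other.
    gold = [1, 1, 1, 0, 0, 0, 0, 0]
    pred = [1, 1, 0, 0, 0, 0, 0, 1]
    print(f"macro-F1 = {macro_f1(gold, pred):.3f}")
```

Because each class contributes equally regardless of its frequency, macro-F1 penalizes a model that performs well only on the majority class, which is one reason it is commonly chosen for imbalanced clinical classification tasks.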