Scalable information extraction from free text electronic health records using large language models

Bibliographic Details
Main Authors: Bowen Gu, Vivian Shao, Ziqian Liao, Valentina Carducci, Santiago Romero Brufau, Jie Yang, Rishi J. Desai
Format: Article
Language: English
Published: BMC 2025-01-01
Series: BMC Medical Research Methodology
Subjects: Social determinants of health (SDoH); Electronic health records (EHR); Natural Language Processing (NLP); Large language models (LLMs); Clinical information extraction
Online Access: https://doi.org/10.1186/s12874-025-02470-z
author Bowen Gu
Vivian Shao
Ziqian Liao
Valentina Carducci
Santiago Romero Brufau
Jie Yang
Rishi J. Desai
collection DOAJ
description Abstract Background A vast amount of potentially useful information, such as descriptions of patient symptoms, family history, and social history, is recorded as free-text notes in electronic health records (EHRs) but is difficult to reliably extract at scale, limiting its utility in research. This study aims to assess whether an “out of the box” implementation of open-source large language models (LLMs), without any fine-tuning, can accurately extract social determinants of health (SDoH) data from free-text clinical notes. Methods We conducted a cross-sectional study using EHR data from the Mass General Brigham (MGB) system, analyzing free-text notes for SDoH information. We selected a random sample of 200 patients and manually labeled nine SDoH aspects. Eight advanced open-source LLMs were evaluated against a baseline pattern-matching model. Two human reviewers provided the manual labels, achieving 93% inter-annotator agreement. LLM performance was assessed using accuracy metrics for overall, mentioned, and non-mentioned SDoH, and macro F1 scores. Results LLMs outperformed the baseline pattern-matching approach, particularly for explicitly mentioned SDoH, achieving up to 40% higher accuracy on mentioned SDoH (Accuracy_mentioned). openchat_3.5 was the best-performing model, surpassing the baseline in overall accuracy across all nine SDoH aspects. The refined pipeline with prompt engineering reduced hallucinations and improved accuracy. Conclusions Open-source LLMs are effective and scalable tools for extracting SDoH from unstructured EHRs, surpassing traditional pattern-matching methods. Further refinement and domain-specific training could enhance their utility in clinical research and predictive analytics, improving healthcare outcomes and addressing health disparities.
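The abstract reports accuracy computed separately for overall, mentioned, and non-mentioned SDoH, plus macro F1. The following minimal Python sketch (not taken from the paper; the label names and example data are hypothetical) illustrates one way such metrics could be computed for a single SDoH aspect from gold annotations and LLM outputs:

```python
# Hedged sketch, assuming each note receives one categorical label per SDoH aspect,
# with "not_mentioned" marking notes where the aspect is absent from the gold labels.
from sklearn.metrics import f1_score

# Hypothetical gold labels and LLM predictions for one SDoH aspect (e.g., smoking status).
gold = ["current_smoker", "not_mentioned", "former_smoker", "not_mentioned", "never_smoker"]
pred = ["current_smoker", "not_mentioned", "never_smoker", "former_smoker", "never_smoker"]

def accuracy(pairs):
    """Fraction of (gold, predicted) pairs that match exactly."""
    pairs = list(pairs)
    return sum(g == p for g, p in pairs) / len(pairs) if pairs else float("nan")

acc_overall = accuracy(zip(gold, pred))
# Accuracy on notes where the gold label records an explicit SDoH value.
acc_mentioned = accuracy((g, p) for g, p in zip(gold, pred) if g != "not_mentioned")
# Accuracy on notes where the gold label is "not_mentioned".
acc_non_mentioned = accuracy((g, p) for g, p in zip(gold, pred) if g == "not_mentioned")
# Macro F1 averages per-class F1 scores, weighting each class equally.
macro_f1 = f1_score(gold, pred, average="macro")

print(acc_overall, acc_mentioned, acc_non_mentioned, macro_f1)
```

Separating mentioned from non-mentioned accuracy matters because a model that always answers "not mentioned" can score well overall while extracting nothing; the study's reported gains for explicitly mentioned SDoH target exactly that failure mode.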
format Article
id doaj-art-fb5e9c775e814bd6a03de0e12639ad4d
institution Kabale University
issn 1471-2288
language English
publishDate 2025-01-01
publisher BMC
record_format Article
series BMC Medical Research Methodology
spelling doaj-art-fb5e9c775e814bd6a03de0e12639ad4d (indexed 2025-02-02T12:30:19Z)
BMC Medical Research Methodology, BMC, ISSN 1471-2288, 2025-01-01, 25(1):1-9, doi:10.1186/s12874-025-02470-z
Scalable information extraction from free text electronic health records using large language models
Bowen Gu: Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School
Vivian Shao: Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School
Ziqian Liao: Department of Biostatistics, Harvard T.H. Chan School of Public Health, Harvard University
Valentina Carducci: Department of Otorhinolaryngology - Head & Neck Surgery, Mayo Clinic
Santiago Romero Brufau: Department of Otorhinolaryngology - Head & Neck Surgery, Mayo Clinic
Jie Yang: Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School
Rishi J. Desai: Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School
title Scalable information extraction from free text electronic health records using large language models
topic Social determinants of health (SDoH)
Electronic health records (EHR)
Natural Language Processing (NLP)
Large language models (LLMs)
Clinical information extraction
url https://doi.org/10.1186/s12874-025-02470-z