Clinical entity augmented retrieval for clinical information extraction

Abstract Large language models (LLMs) with retrieval-augmented generation (RAG) have improved information extraction over previous methods, yet their reliance on embeddings often leads to inefficient retrieval. We introduce CLinical Entity Augmented Retrieval (CLEAR), a RAG pipeline that retrieves information using entities. We compared CLEAR to embedding RAG and full-note approaches for extracting 18 variables using six LLMs across 20,000 clinical notes. Average F1 scores were 0.90, 0.86, and 0.79; inference times were 4.95, 17.41, and 20.08 s per note; average model queries were 1.68, 4.94, and 4.18 per note; and average input tokens were 1.1k, 3.8k, and 6.1k per note for CLEAR, embedding RAG, and full-note approaches, respectively. In conclusion, CLEAR utilizes clinical entities for information retrieval and achieves >70% reduction in token usage and inference time with improved performance compared to modern methods.

Bibliographic Details
Main Authors: Ivan Lopez, Akshay Swaminathan, Karthik Vedula, Sanjana Narayanan, Fateme Nateghi Haredasht, Stephen P. Ma, April S. Liang, Steven Tate, Manoj Maddali, Robert Joseph Gallo, Nigam H. Shah, Jonathan H. Chen
Format: Article
Language: English
Published: Nature Portfolio, 2025-01-01
Series: npj Digital Medicine
ISSN: 2398-6352
Online Access: https://doi.org/10.1038/s41746-024-01377-1

Author Affiliations:
Ivan Lopez: Stanford University School of Medicine
Akshay Swaminathan: Stanford University School of Medicine
Karthik Vedula: Poolesville High School
Sanjana Narayanan: Stanford Center for Biomedical Informatics Research
Fateme Nateghi Haredasht: Stanford Center for Biomedical Informatics Research
Stephen P. Ma: Division of Hospital Medicine, Stanford University School of Medicine
April S. Liang: Division of Clinical Informatics, Stanford University School of Medicine
Steven Tate: Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine
Manoj Maddali: Department of Biomedical Data Science
Robert Joseph Gallo: Center for Innovation to Implementation, VA Palo Alto Healthcare System
Nigam H. Shah: Stanford Center for Biomedical Informatics Research
Jonathan H. Chen: Department of Biomedical Data Science
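
The abstract contrasts entity-based retrieval with embedding-based RAG: rather than ranking note chunks by vector similarity, CLEAR retrieves only the snippets that mention clinical entities relevant to the target variable, which is what drives the reported reductions in input tokens, model queries, and inference time. The Python sketch below illustrates that retrieval idea under simplifying assumptions; the sentence splitter, entity list, regex matching, and windowing rule are illustrative stand-ins, not the authors' implementation.

```python
import re

def split_sentences(note: str) -> list[str]:
    """Naive sentence splitter; a real clinical pipeline would use a proper tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+|\n+", note) if s.strip()]

def retrieve_by_entities(note: str, entities: list[str], window: int = 1) -> str:
    """Keep only sentences mentioning a target entity, plus `window` neighbors for context."""
    sentences = split_sentences(note)
    patterns = [re.compile(rf"\b{re.escape(e)}\b", re.IGNORECASE) for e in entities]
    keep: set[int] = set()
    for i, sentence in enumerate(sentences):
        if any(p.search(sentence) for p in patterns):
            keep.update(range(max(0, i - window), min(len(sentences), i + window + 1)))
    return " ".join(sentences[i] for i in sorted(keep))

# Hypothetical usage for a "smoking status" variable: only entity-bearing
# snippets reach the LLM, rather than the full note.
note = ("Pt is a 64yo M with COPD. Denies chest pain. "
        "Former smoker, quit 2010, 30 pack-years. Lungs clear today.")
print(retrieve_by_entities(note, ["smoker", "tobacco", "pack-years"]))
# -> Denies chest pain. Former smoker, quit 2010, 30 pack-years. Lungs clear today.
```

In a full pipeline, the retrieved snippet, not the whole note, would be placed in the LLM prompt for extraction, which is consistent with the abstract's figure of roughly 1.1k input tokens per note for CLEAR versus 6.1k for full notes.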