Towards evaluating and building versatile large language models for medicine

Abstract In this study, we present MedS-Bench, a comprehensive benchmark spanning 11 high-level clinical tasks for evaluating large language models (LLMs) in clinical contexts. We evaluated nine leading LLMs, including MEDITRON, Llama 3, Mistral, GPT-4, and Claude-3.5, and found that most models struggle with these complex tasks. To address these limitations, we developed MedS-Ins, a large-scale instruction-tuning dataset for medicine. MedS-Ins comprises 58 medically oriented language corpora, totaling 5M instances with 19K instructions across 122 tasks. To demonstrate the dataset’s utility, we conducted a proof-of-concept experiment, performing instruction tuning on a lightweight, open-source medical language model. The resulting model, MMedIns-Llama 3, significantly outperformed existing models on various clinical tasks. To promote further advancements, we have made MedS-Ins fully accessible and invite the research community to contribute to its expansion. Additionally, we have launched a dynamic leaderboard for MedS-Bench to track the development progress of medical LLMs.


Bibliographic Details
Main Authors: Chaoyi Wu, Pengcheng Qiu, Jinxin Liu, Hongfei Gu, Na Li, Ya Zhang, Yanfeng Wang, Weidi Xie
Author Affiliations: Chaoyi Wu, Pengcheng Qiu, Ya Zhang, Yanfeng Wang, Weidi Xie (Shanghai Jiao Tong University); Jinxin Liu, Na Li (China Mobile Communications Group Co., Ltd.); Hongfei Gu (China Mobile Communications Group Shanghai Co., Ltd.)
Format: Article
Language: English
Published: Nature Portfolio, 2025-01-01
Series: npj Digital Medicine
ISSN: 2398-6352
Online Access: https://doi.org/10.1038/s41746-024-01390-4