Towards evaluating and building versatile large language models for medicine
Abstract: In this study, we present MedS-Bench, a comprehensive benchmark for evaluating large language models (LLMs) in clinical contexts, spanning 11 high-level clinical tasks. We evaluated nine leading LLMs, including MEDITRON, Llama 3, Mistral, GPT-4, and Claude-3.5, and found that most models struggle with these complex tasks. To address these limitations, we developed MedS-Ins, a large-scale instruction-tuning dataset for medicine. MedS-Ins comprises 58 medically oriented language corpora, totaling 5M instances with 19K instructions across 122 tasks. To demonstrate the dataset’s utility, we conducted a proof-of-concept experiment by performing instruction tuning on a lightweight, open-source medical language model. The resulting model, MMedIns-Llama 3, significantly outperformed existing models on various clinical tasks. To promote further advances, we have made MedS-Ins fully accessible and invite the research community to contribute to its expansion. Additionally, we have launched a dynamic leaderboard for MedS-Bench to track the development of medical LLMs.
Main Authors: | Chaoyi Wu, Pengcheng Qiu, Jinxin Liu, Hongfei Gu, Na Li, Ya Zhang, Yanfeng Wang, Weidi Xie |
Format: | Article |
Language: | English |
Published: | Nature Portfolio, 2025-01-01 |
Series: | npj Digital Medicine |
Online Access: | https://doi.org/10.1038/s41746-024-01390-4 |
collection | DOAJ |
id | doaj-art-675bdf34aea649419ec20b97480b6fd1 |
institution | Kabale University |
issn | 2398-6352 |
affiliations | Chaoyi Wu, Pengcheng Qiu, Ya Zhang, Yanfeng Wang, Weidi Xie (Shanghai Jiao Tong University); Jinxin Liu, Na Li (China Mobile Communications Group Co., Ltd.); Hongfei Gu (China Mobile Communications Group Shanghai Co., Ltd.) |
citation | npj Digital Medicine 8(1), 1–13 (2025); doi:10.1038/s41746-024-01390-4 |