Towards evaluating and building versatile large language models for medicine

Abstract: In this study, we present MedS-Bench, a comprehensive benchmark for evaluating large language models (LLMs) in clinical contexts, spanning 11 high-level clinical tasks. We evaluated nine leading LLMs, e.g., MEDITRON, Llama 3, Mistral, GPT-4, and Claude-3.5, and found that most mode...

Bibliographic Details
Main Authors:	Chaoyi Wu, Pengcheng Qiu, Jinxin Liu, Hongfei Gu, Na Li, Ya Zhang, Yanfeng Wang, Weidi Xie
Format:	Article
Language:	English
Published:	Nature Portfolio 2025-01-01
Series:	npj Digital Medicine
Online Access:	https://doi.org/10.1038/s41746-024-01390-4

Similar Items