A dataset for evaluating clinical research claims in large language models
Abstract Large language models (LLMs) have the potential to enhance the verification of health claims. However, issues with hallucination and comprehension of logical statements require these models to be closely scrutinized in healthcare applications. We introduce CliniFact, a scientific claim dataset created from hypothesis testing results in clinical research, covering 992 unique interventions for 22 disease categories. The dataset used study arms and interventions, primary outcome measures, and results from clinical trials to derive and label clinical research claims. These claims were then linked to supporting information describing clinical trial results in scientific publications. CliniFact contains 1,970 instances from 992 unique clinical trials related to 1,540 unique publications. When evaluating LLMs against CliniFact, discriminative models, such as BioBERT with an accuracy of 80.2%, outperformed generative counterparts, such as Llama3-70B, which reached 53.6% accuracy (p-value < 0.001). Our results demonstrate the potential of CliniFact as a benchmark for evaluating LLM performance in clinical research claim verification.
Main Authors: | Boya Zhang, Alban Bornet, Anthony Yazdani, Philipp Khlebnikov, Marija Milutinovic, Hossein Rouhizadeh, Poorya Amini, Douglas Teodoro |
---|---|
Format: | Article |
Language: | English |
Published: | Nature Portfolio, 2025-01-01 |
Series: | Scientific Data |
Online Access: | https://doi.org/10.1038/s41597-025-04417-x |
author | Boya Zhang, Alban Bornet, Anthony Yazdani, Philipp Khlebnikov, Marija Milutinovic, Hossein Rouhizadeh, Poorya Amini, Douglas Teodoro |
collection | DOAJ |
description | Abstract Large language models (LLMs) have the potential to enhance the verification of health claims. However, issues with hallucination and comprehension of logical statements require these models to be closely scrutinized in healthcare applications. We introduce CliniFact, a scientific claim dataset created from hypothesis testing results in clinical research, covering 992 unique interventions for 22 disease categories. The dataset used study arms and interventions, primary outcome measures, and results from clinical trials to derive and label clinical research claims. These claims were then linked to supporting information describing clinical trial results in scientific publications. CliniFact contains 1,970 instances from 992 unique clinical trials related to 1,540 unique publications. When evaluating LLMs against CliniFact, discriminative models, such as BioBERT with an accuracy of 80.2%, outperformed generative counterparts, such as Llama3-70B, which reached 53.6% accuracy (p-value < 0.001). Our results demonstrate the potential of CliniFact as a benchmark for evaluating LLM performance in clinical research claim verification. |
format | Article |
id | doaj-art-d57497af237645fa8757951221b00b0e |
institution | Kabale University |
issn | 2052-4463 |
language | English |
publishDate | 2025-01-01 |
publisher | Nature Portfolio |
record_format | Article |
series | Scientific Data |
spelling | doaj-art-d57497af237645fa8757951221b00b0e (indexed 2025-01-19T12:10:07Z), eng. Nature Portfolio, Scientific Data, ISSN 2052-4463, vol. 12 (2025-01-01), doi:10.1038/s41597-025-04417-x. A dataset for evaluating clinical research claims in large language models. Boya Zhang, Alban Bornet, Anthony Yazdani, Marija Milutinovic, Hossein Rouhizadeh, Douglas Teodoro (Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva); Philipp Khlebnikov, Poorya Amini (Risklick AG). Abstract as in the description field above. https://doi.org/10.1038/s41597-025-04417-x |
title | A dataset for evaluating clinical research claims in large language models |
url | https://doi.org/10.1038/s41597-025-04417-x |
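The abstract compares BioBERT (80.2% accuracy) with Llama3-70B (53.6%) on 1,970 CliniFact instances and reports p < 0.001. The record does not state which statistical test the authors applied, so the following is only a minimal illustrative check, assuming a pooled two-proportion z-test and that both models were scored on all 1,970 instances:

```python
from statistics import NormalDist

# Figures reported in the abstract above. The exact evaluation split and the
# statistical test used by the authors are not given in this record, so treating
# both accuracies as proportions over all 1,970 instances is an assumption made
# purely for illustration.
n = 1970
acc_biobert = 0.802  # BioBERT accuracy
acc_llama = 0.536    # Llama3-70B accuracy

# Pooled two-proportion z-test on the accuracy difference (equal n in both groups).
p_pool = (acc_biobert + acc_llama) / 2
se = (p_pool * (1 - p_pool) * 2 / n) ** 0.5   # standard error of the difference
z = (acc_biobert - acc_llama) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Prints z of roughly 17.7; p underflows to 0, i.e. far below 0.001.
print(f"z = {z:.1f}, p = {p_value:.3g}")
```

Under these assumptions the z statistic is roughly 18, so an accuracy gap of this size on about two thousand instances comfortably clears the reported p < 0.001 threshold, whichever standard test the authors actually used.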