A dataset for evaluating clinical research claims in large language models

Abstract: Large language models (LLMs) have the potential to enhance the verification of health claims. However, issues with hallucination and with the comprehension of logical statements mean that these models must be closely scrutinized in healthcare applications. We introduce CliniFact, a scientific claim dataset created from hypothesis-testing results in clinical research, covering 992 unique interventions across 22 disease categories. Claims were derived and labeled using the study arms and interventions, primary outcome measures, and results of clinical trials, and then linked to supporting information describing the trial results in scientific publications. CliniFact contains 1,970 instances from 992 unique clinical trials related to 1,540 unique publications. When evaluated on CliniFact, discriminative models, such as BioBERT with an accuracy of 80.2%, outperformed generative counterparts, such as Llama3-70B, which reached 53.6% accuracy (p-value < 0.001). Our results demonstrate the potential of CliniFact as a benchmark for evaluating LLM performance in clinical research claim verification.
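
To make the claim-verification task concrete, here is a minimal sketch of a CliniFact-style evaluation. Everything in it is an assumption for illustration: the ClaimInstance fields and label encoding are not the dataset's actual schema, and because the abstract does not say which significance test produced the reported p-value, a two-proportion z-test stands in as one standard choice.

```python
# Minimal, hypothetical sketch of a CliniFact-style evaluation; the
# schema and the choice of significance test are assumptions, not the
# dataset's documented format or the authors' actual protocol.
from dataclasses import dataclass
from statistics import NormalDist


@dataclass
class ClaimInstance:
    claim: str     # claim derived from study arms, interventions, and primary outcomes
    evidence: str  # supporting passage from the linked publication
    label: int     # hypothetical encoding: 1 = supported, 0 = not supported


def accuracy(predictions: list[int], instances: list[ClaimInstance]) -> float:
    """Fraction of instances whose predicted label matches the gold label."""
    correct = sum(p == inst.label for p, inst in zip(predictions, instances))
    return correct / len(instances)


def two_proportion_z_test(acc_a: float, acc_b: float, n: int) -> float:
    """Two-sided p-value for the difference between two accuracies on n items.

    Uses the pooled standard error under the null hypothesis of equal
    accuracy. For two models scored on the same test set, a paired test
    such as McNemar's would also be a reasonable choice.
    """
    pooled = (acc_a + acc_b) / 2
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    z = (acc_a - acc_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Toy instances, made up purely for illustration:
toy = [
    ClaimInstance(
        claim="Drug X improved the primary outcome versus placebo.",
        evidence="Mean change from baseline significantly favored drug X.",
        label=1,
    ),
    ClaimInstance(
        claim="Drug X reduced all-cause mortality versus placebo.",
        evidence="No significant difference in mortality was observed.",
        label=0,
    ),
]
print(accuracy([1, 0], toy))  # 1.0 for these toy predictions

# With accuracies like those reported (80.2% vs. 53.6%) over 1,970
# instances, the difference is overwhelmingly significant; the p-value
# underflows to 0.0 in double precision, far below the 0.001 threshold.
print(two_proportion_z_test(0.802, 0.536, 1970))
```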

Bibliographic Details
Main Authors: Boya Zhang, Alban Bornet, Anthony Yazdani, Philipp Khlebnikov, Marija Milutinovic, Hossein Rouhizadeh, Poorya Amini, Douglas Teodoro
Format: Article
Language: English
Published: Nature Portfolio, 2025-01-01
Series: Scientific Data
ISSN: 2052-4463
Online Access: https://doi.org/10.1038/s41597-025-04417-x
Collection: DOAJ
Institution: Kabale University

Author Affiliations:
Boya Zhang, Alban Bornet, Anthony Yazdani, Marija Milutinovic, Hossein Rouhizadeh, Douglas Teodoro: Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva
Philipp Khlebnikov, Poorya Amini: Risklick AG