Benchmarking Multiple Large Language Models for Automated Clinical Trial Data Extraction in Aging Research

Bibliographic Details
Main Authors: Richard J. Young, Alice M. Matthews, Brach Poston
Format: Article
Language: English
Published: MDPI AG 2025-05-01
Series: Algorithms
Subjects:
Online Access: https://www.mdpi.com/1999-4893/18/5/296
_version_ 1849711007616729088
author Richard J. Young
Alice M. Matthews
Brach Poston
author_facet Richard J. Young
Alice M. Matthews
Brach Poston
author_sort Richard J. Young
collection DOAJ
description Large language models (LLMs) show promise for automating evidence synthesis, yet head-to-head evaluations remain scarce. We benchmarked five state-of-the-art LLMs—openai/o1-mini, x-ai/grok-2-1212, meta-llama/Llama-3.3-70B-Instruct, google/Gemini-Flash-1.5-8B, and deepseek/DeepSeek-R1-70B-Distill—on extracting protocol details from transcranial direct-current stimulation (tDCS) trials enrolling older adults. A multi-LLM ensemble pipeline ingested ClinicalTrials.gov records, applied a structured JSON schema, and generated comparable outputs from unstructured text. The pipeline retrieved 83 aging-related tDCS trials—roughly double the yield of a conventional keyword search. Across models, agreement was almost perfect for the binary field "brain stimulation used" (Fleiss κ ≈ 0.92) and substantial for the categorical "primary target" (κ ≈ 0.71). Numeric parameters such as stimulation intensity and session duration showed excellent consistency when explicitly reported (ICC 0.95–0.96); secondary targets and free-text duration phrases remained challenging (κ ≈ 0.61; ICC ≈ 0.35). An ensemble consensus (majority vote or averaging) resolved most disagreements and delivered near-perfect reliability on core stimulation attributes (κ = 0.94). These results demonstrate that multi-LLM ensembles can markedly expand trial coverage and reach expert-level accuracy on well-defined fields while still requiring human oversight for nuanced or sparsely reported details. The benchmark and open-source workflow set a solid baseline for future advances in prompt engineering, model specialization, and ensemble strategies aimed at fully automated evidence synthesis in neurostimulation research involving aging populations. Overall, the five-model multi-LLM ensemble doubled the number of eligible aging-related tDCS trials retrieved versus keyword searching and achieved near-perfect agreement on core stimulation parameters (κ ≈ 0.94), demonstrating expert-level extraction accuracy.
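The abstract's two key mechanisms — majority-vote consensus across model outputs and Fleiss' κ for inter-model agreement on categorical fields — can be sketched in a few lines. This is an illustrative reimplementation, not the paper's released workflow; the function names `majority_vote` and `fleiss_kappa` are hypothetical.

```python
from collections import Counter

def majority_vote(labels):
    """Ensemble consensus for one categorical field: most common model output."""
    return Counter(labels).most_common(1)[0][0]

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item label lists.

    Each inner list holds the labels assigned by the raters (here, the
    LLMs) to one item; every item must have the same number of raters.
    """
    n = len(ratings[0])                       # raters (models) per item
    N = len(ratings)                          # number of items (trials)
    categories = sorted({lab for item in ratings for lab in item})
    # counts[i][j] = number of raters assigning item i to category j
    counts = [[item.count(c) for c in categories] for item in ratings]
    # mean observed agreement across items
    p_bar = sum(sum(c * c for c in row) - n for row in counts) / (N * n * (n - 1))
    # chance agreement from the marginal category proportions
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

For example, four models answering "yes" and one "no" on a binary field still yields a "yes" consensus, while `fleiss_kappa` quantifies how far the panel's agreement exceeds chance (1.0 = perfect agreement, 0 = chance level).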
format Article
id doaj-art-a8b4975cc39f4bdfa38cd1fcfc8d2df6
institution DOAJ
issn 1999-4893
language English
publishDate 2025-05-01
publisher MDPI AG
record_format Article
series Algorithms
spelling doaj-art-a8b4975cc39f4bdfa38cd1fcfc8d2df6
2025-08-20T03:14:45Z
eng
MDPI AG
Algorithms
1999-4893
2025-05-01
Volume 18, Issue 5, Article 296
doi: 10.3390/a18050296
Benchmarking Multiple Large Language Models for Automated Clinical Trial Data Extraction in Aging Research
Richard J. Young (Interdisciplinary Ph.D. Program in Neuroscience, University of Nevada, Las Vegas, NV 89154, USA)
Alice M. Matthews (Interdisciplinary Ph.D. Program in Neuroscience, University of Nevada, Las Vegas, NV 89154, USA)
Brach Poston (Interdisciplinary Ph.D. Program in Neuroscience, University of Nevada, Las Vegas, NV 89154, USA)
https://www.mdpi.com/1999-4893/18/5/296
spellingShingle Richard J. Young
Alice M. Matthews
Brach Poston
Benchmarking Multiple Large Language Models for Automated Clinical Trial Data Extraction in Aging Research
Algorithms
clinical trial data extraction
large language models (LLMs)
API integration
multi-agent systems
systematic review methodology
transcranial direct current stimulation (tDCS)
title Benchmarking Multiple Large Language Models for Automated Clinical Trial Data Extraction in Aging Research
title_full Benchmarking Multiple Large Language Models for Automated Clinical Trial Data Extraction in Aging Research
title_fullStr Benchmarking Multiple Large Language Models for Automated Clinical Trial Data Extraction in Aging Research
title_full_unstemmed Benchmarking Multiple Large Language Models for Automated Clinical Trial Data Extraction in Aging Research
title_short Benchmarking Multiple Large Language Models for Automated Clinical Trial Data Extraction in Aging Research
title_sort benchmarking multiple large language models for automated clinical trial data extraction in aging research
topic clinical trial data extraction
large language models (LLMs)
API integration
multi-agent systems
systematic review methodology
transcranial direct current stimulation (tDCS)
url https://www.mdpi.com/1999-4893/18/5/296
work_keys_str_mv AT richardjyoung benchmarkingmultiplelargelanguagemodelsforautomatedclinicaltrialdataextractioninagingresearch
AT alicemmatthews benchmarkingmultiplelargelanguagemodelsforautomatedclinicaltrialdataextractioninagingresearch
AT brachposton benchmarkingmultiplelargelanguagemodelsforautomatedclinicaltrialdataextractioninagingresearch