Benchmarking Multiple Large Language Models for Automated Clinical Trial Data Extraction in Aging Research
Large language models (LLMs) show promise for automating evidence synthesis, yet head-to-head evaluations remain scarce. We benchmarked five state-of-the-art LLMs—openai/o1-mini, x-ai/grok-2-1212, meta-llama/Llama-3.3-70B-Instruct, google/Gemini-Flash-1.5-8B, and deepseek/DeepSeek-R1-70B-Distill—on...
| Main Authors: | Richard J. Young, Alice M. Matthews, Brach Poston |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-05-01 |
| Series: | Algorithms |
| Subjects: | clinical trial data extraction; large language models (LLMs); API integration; multi-agent systems; systematic review methodology; transcranial direct current stimulation (tDCS) |
| Online Access: | https://www.mdpi.com/1999-4893/18/5/296 |
| Field | Value |
|---|---|
| author | Richard J. Young; Alice M. Matthews; Brach Poston |
| collection | DOAJ |
| description | Large language models (LLMs) show promise for automating evidence synthesis, yet head-to-head evaluations remain scarce. We benchmarked five state-of-the-art LLMs—openai/o1-mini, x-ai/grok-2-1212, meta-llama/Llama-3.3-70B-Instruct, google/Gemini-Flash-1.5-8B, and deepseek/DeepSeek-R1-70B-Distill—on extracting protocol details from transcranial direct-current stimulation (tDCS) trials enrolling older adults. A multi-LLM ensemble pipeline ingested ClinicalTrials.gov records, applied a structured JSON schema, and generated comparable outputs from unstructured text. The pipeline retrieved 83 aging-related tDCS trials—roughly double the yield of a conventional keyword search. Across models, agreement was almost perfect for the binary field *brain stimulation used* (Fleiss κ ≈ 0.92) and substantial for the categorical field *primary target* (κ ≈ 0.71). Numeric parameters such as stimulation intensity and session duration showed excellent consistency when explicitly reported (ICC 0.95–0.96); the *secondary target* field and free-text duration phrases remained challenging (κ ≈ 0.61; ICC ≈ 0.35). An ensemble consensus (majority vote or averaging) resolved most disagreements and delivered near-perfect reliability on core stimulation attributes (κ = 0.94). These results demonstrate that multi-LLM ensembles can markedly expand trial coverage and reach expert-level accuracy on well-defined fields, while human oversight is still required for nuanced or sparsely reported details. The benchmark and open-source workflow set a solid baseline for future advances in prompt engineering, model specialization, and ensemble strategies aimed at fully automated evidence synthesis in neurostimulation research involving aging populations. Overall, the five-model ensemble doubled retrieval of eligible aging-related tDCS trials relative to keyword searching and reached near-perfect agreement on core stimulation parameters (κ ≈ 0.94). (Illustrative sketches of the retrieval call, extraction schema, consensus step, and agreement statistic appear after this record.) |
| format | Article |
| id | doaj-art-a8b4975cc39f4bdfa38cd1fcfc8d2df6 |
| institution | DOAJ |
| issn | 1999-4893 |
| language | English |
| publishDate | 2025-05-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Algorithms |
| doi | 10.3390/a18050296 |
| volume/issue/article | 18 (5), 296 |
| author affiliations | Interdisciplinary Ph.D. Program in Neuroscience, University of Nevada, Las Vegas, NV 89154, USA (all three authors) |
| title | Benchmarking Multiple Large Language Models for Automated Clinical Trial Data Extraction in Aging Research |
| topic | clinical trial data extraction; large language models (LLMs); API integration; multi-agent systems; systematic review methodology; transcranial direct current stimulation (tDCS) |
| url | https://www.mdpi.com/1999-4893/18/5/296 |
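The description states that the pipeline ingested ClinicalTrials.gov records. The paper's retrieval code is not part of this record; the following is a minimal sketch of how such ingestion might look against the public ClinicalTrials.gov v2 REST API. The search term and page size are illustrative, not the study's actual query.

```python
import requests

BASE = "https://clinicaltrials.gov/api/v2/studies"  # public v2 REST API

def fetch_trials(term: str = "transcranial direct current stimulation",
                 page_size: int = 100):
    """Page through ClinicalTrials.gov studies matching a free-text term."""
    params = {"query.term": term, "pageSize": page_size, "format": "json"}
    while True:
        payload = requests.get(BASE, params=params, timeout=30).json()
        yield from payload.get("studies", [])       # one study record at a time
        token = payload.get("nextPageToken")        # absent on the last page
        if not token:
            return
        params["pageToken"] = token
```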
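The pipeline reportedly applied a structured JSON schema so that all five models emitted comparable output. The actual schema is not reproduced in this record; the sketch below infers plausible field names (`brain_stimulation_used`, `primary_target`, `stimulation_intensity_ma`, `session_duration_min`) from the fields named in the abstract.

```python
import json

# Hypothetical extraction schema; the paper's actual schema may differ.
TDCS_SCHEMA = {
    "type": "object",
    "properties": {
        "nct_id": {"type": "string"},                       # ClinicalTrials.gov ID
        "brain_stimulation_used": {"type": "boolean"},      # binary field (κ ≈ 0.92)
        "primary_target": {"type": ["string", "null"]},     # categorical (κ ≈ 0.71)
        "secondary_target": {"type": ["string", "null"]},   # harder field (κ ≈ 0.61)
        "stimulation_intensity_ma": {"type": ["number", "null"]},  # e.g. 2.0 mA
        "session_duration_min": {"type": ["number", "null"]},      # e.g. 20 min
    },
    "required": ["nct_id", "brain_stimulation_used"],
}

def extraction_prompt(record_text: str) -> str:
    """Build a prompt asking an LLM to emit JSON conforming to the schema."""
    return (
        "Extract the tDCS protocol details from the trial record below.\n"
        "Respond with a single JSON object matching this schema:\n"
        f"{json.dumps(TDCS_SCHEMA, indent=2)}\n\n"
        f"Trial record:\n{record_text}"
    )
```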
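The ensemble consensus is described only as "majority vote or averaging". A straightforward reading, sketched below, is a majority vote for categorical or boolean fields and a mean for numeric ones; flagging non-majority cases for human review is an assumption, not a documented behavior of the pipeline.

```python
from collections import Counter
from statistics import mean

def consensus(values):
    """Combine one field's answers from several models:
    mean for numeric values, majority vote otherwise."""
    answers = [v for v in values if v is not None]   # ignore abstentions
    if not answers:
        return None
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in answers):
        return mean(answers)                          # numeric field: average
    value, count = Counter(answers).most_common(1)[0]
    return value if count > len(answers) / 2 else None  # no majority: flag for review

# e.g. five models on stimulation intensity and on the primary target
print(consensus([2.0, 2.0, 2.0, 1.5, 2.0]))          # 1.9
print(consensus(["M1", "M1", "DLPFC", "M1", None]))  # 'M1'
```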
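Agreement across the five models is reported as Fleiss κ. For reference, here is a compact NumPy implementation of that statistic over a subjects-by-categories count matrix; the toy data are invented, and the paper presumably used a standard statistics package rather than this sketch.

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """Fleiss' kappa, where ratings[i, j] = number of raters
    assigning subject i to category j (constant raters per subject)."""
    n = ratings.sum(axis=1)[0]                        # raters per subject
    p_j = ratings.sum(axis=0) / ratings.sum()         # overall category proportions
    P_i = (np.square(ratings).sum(axis=1) - n) / (n * (n - 1))  # per-subject agreement
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()     # observed vs. chance agreement
    return (P_bar - P_e) / (1 - P_e)

# toy example: 5 models rating 4 trials on 'brain stimulation used' (yes/no)
counts = np.array([[5, 0], [5, 0], [4, 1], [0, 5]])
print(round(fleiss_kappa(counts), 2))                 # ≈ 0.76
```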