Leveraging large language models for data analysis automation.

Data analysis is constrained by a shortage of skilled experts, particularly in biology, where detailed data analysis and subsequent interpretation are vital for understanding complex biological processes and developing new treatments and diagnostics. One possible solution to this shortage of experts...

Bibliographic Details
Main Authors:	Jacqueline A Jansen, Artür Manukyan, Nour Al Khoury, Altuna Akalin
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2025-01-01
Series:	PLoS ONE
Online Access:	https://doi.org/10.1371/journal.pone.0317084

_version_	1849311663436595200
author	Jacqueline A Jansen Artür Manukyan Nour Al Khoury Altuna Akalin
author_facet	Jacqueline A Jansen Artür Manukyan Nour Al Khoury Altuna Akalin
author_sort	Jacqueline A Jansen
collection	DOAJ
description	Data analysis is constrained by a shortage of skilled experts, particularly in biology, where detailed data analysis and subsequent interpretation are vital for understanding complex biological processes and developing new treatments and diagnostics. One possible solution to this shortage of experts would be to use Large Language Models (LLMs) to generate data analysis pipelines. However, although LLMs have shown great potential in code generation tasks, questions about their accuracy when prompted with domain-expert questions, such as omics-related data analysis questions, remain unanswered. To address this, we developed mergen, an R package that leverages LLMs for data analysis code generation and execution. We evaluated the performance of this data analysis system using various data analysis tasks for genomics. Our primary goal is to enable researchers to conduct data analysis by simply describing their objectives and the desired analyses for specific datasets through clear text. Our approach improves code generation via specialized prompt engineering and error feedback mechanisms. In addition, our system can execute the data analysis workflows prescribed by the LLM, providing the results of the workflow for human review. Our evaluation of this system reveals that while LLMs effectively generate code for some data analysis tasks, challenges remain in generating executable code, especially for complex data analysis tasks. The best performance was seen with the self-correction mechanism, which increased the percentage of executable code relative to the simple strategy by 22.5% for tasks of complexity 2. For tasks of complexity 3, 4 and 5, this increase was 52.5%, 27.5% and 15%, respectively. A chi-squared test showed significant differences between the prompting strategies. Our study contributes to a better understanding of LLM capabilities and limitations, providing software infrastructure and practical insights for their effective integration into data analysis workflows.
format	Article
id	doaj-art-f80151b7ecbf4a62bca272ab843a6e8c
institution	Kabale University
issn	1932-6203
language	English
publishDate	2025-01-01
publisher	Public Library of Science (PLoS)
record_format	Article
series	PLoS ONE
spelling	doaj-art-f80151b7ecbf4a62bca272ab843a6e8c | 2025-08-20T03:53:21Z | eng | Public Library of Science (PLoS) | PLoS ONE | 1932-6203 | 2025-01-01 | 20 | 2 | e0317084 | 10.1371/journal.pone.0317084 | Leveraging large language models for data analysis automation. | Jacqueline A Jansen | Artür Manukyan | Nour Al Khoury | Altuna Akalin | Data analysis is constrained by a shortage of skilled experts, particularly in biology, where detailed data analysis and subsequent interpretation are vital for understanding complex biological processes and developing new treatments and diagnostics. One possible solution to this shortage of experts would be to use Large Language Models (LLMs) to generate data analysis pipelines. However, although LLMs have shown great potential in code generation tasks, questions about their accuracy when prompted with domain-expert questions, such as omics-related data analysis questions, remain unanswered. To address this, we developed mergen, an R package that leverages LLMs for data analysis code generation and execution. We evaluated the performance of this data analysis system using various data analysis tasks for genomics. Our primary goal is to enable researchers to conduct data analysis by simply describing their objectives and the desired analyses for specific datasets through clear text. Our approach improves code generation via specialized prompt engineering and error feedback mechanisms. In addition, our system can execute the data analysis workflows prescribed by the LLM, providing the results of the workflow for human review. Our evaluation of this system reveals that while LLMs effectively generate code for some data analysis tasks, challenges remain in generating executable code, especially for complex data analysis tasks. The best performance was seen with the self-correction mechanism, which increased the percentage of executable code relative to the simple strategy by 22.5% for tasks of complexity 2. For tasks of complexity 3, 4 and 5, this increase was 52.5%, 27.5% and 15%, respectively. A chi-squared test showed significant differences between the prompting strategies. Our study contributes to a better understanding of LLM capabilities and limitations, providing software infrastructure and practical insights for their effective integration into data analysis workflows. | https://doi.org/10.1371/journal.pone.0317084
spellingShingle	Jacqueline A Jansen Artür Manukyan Nour Al Khoury Altuna Akalin Leveraging large language models for data analysis automation. PLoS ONE
title	Leveraging large language models for data analysis automation.
title_full	Leveraging large language models for data analysis automation.
title_fullStr	Leveraging large language models for data analysis automation.
title_full_unstemmed	Leveraging large language models for data analysis automation.
title_short	Leveraging large language models for data analysis automation.
title_sort	leveraging large language models for data analysis automation
url	https://doi.org/10.1371/journal.pone.0317084
work_keys_str_mv	AT jacquelineajansen leveraginglargelanguagemodelsfordataanalysisautomation AT arturmanukyan leveraginglargelanguagemodelsfordataanalysisautomation AT nouralkhoury leveraginglargelanguagemodelsfordataanalysisautomation AT altunaakalin leveraginglargelanguagemodelsfordataanalysisautomation

Similar Items