Computational Study Protocol: Leveraging Synthetic Data to Validate a Benchmark Study for Differential Abundance Tests for 16S Microbiome Sequencing Data [version 2; peer review: 2 approved]

Bibliographic Details
Main Authors:	Clemens Kreutz, Eva Kohnert
Format:	Article
Language:	English
Published:	F1000 Research Ltd 2025-01-01
Series:	F1000Research
Subjects:	16S microbiome differential abundance simulation synthetic data benchmarking
Online Access:	https://f1000research.com/articles/13-1180/v2

_version_	1832590359854579712
author	Clemens Kreutz Eva Kohnert
collection	DOAJ
description	Background: The utility of synthetic data in benchmark studies depends on its ability to closely mimic real-world conditions and to reproduce results obtained from experimental data. Building on the study by Nearing et al. (1), which assessed 14 differential abundance tests using 38 experimental 16S rRNA datasets in a case-control design, we are generating synthetic datasets that mimic the experimental data to verify their findings. We will employ statistical tests to rigorously assess the similarity between synthetic and experimental data and to validate the conclusions drawn by Nearing et al. (1) on the performance of the differential abundance tests. This protocol adheres to the SPIRIT guidelines, demonstrating how established reporting frameworks can support robust, transparent, and unbiased study planning. Methods: We replicate the methodology of Nearing et al. (1), incorporating synthetic data simulated using two distinct tools, mirroring the 38 experimental datasets. Equivalence tests will be conducted on a non-redundant subset of 46 data characteristics comparing synthetic and experimental data, complemented by principal component analysis for overall similarity assessment. The 14 differential abundance tests will be applied to synthetic and experimental datasets, evaluating the consistency of significant feature identification and the number of significant features per tool. Correlation analysis and multiple regression will explore how differences between synthetic and experimental data characteristics may affect the results. Conclusions: Synthetic data enables the validation of findings through controlled experiments. We assess how well synthetic data replicates experimental data, attempt to validate previous findings with the most recent versions of the differential abundance (DA) methods, and delineate the strengths and limitations of synthetic data in benchmark studies. Moreover, to our knowledge, this is the first computational benchmark study to systematically incorporate synthetic data for validating differential abundance methods while strictly adhering to a pre-specified study protocol following SPIRIT guidelines, contributing to transparency, reproducibility, and unbiased research.
format	Article
id	doaj-art-d394e461c542419b9be1a11f549f7c5f
institution	Kabale University
issn	2046-1402
language	English
publishDate	2025-01-01
publisher	F1000 Research Ltd
record_format	Article
series	F1000Research
spelling	doaj-art-d394e461c542419b9be1a11f549f7c5f | 2025-01-24T01:00:01Z | eng | F1000 Research Ltd | F1000Research | 2046-1402 | 2025-01-01 | 13176118 | Computational Study Protocol: Leveraging Synthetic Data to Validate a Benchmark Study for Differential Abundance Tests for 16S Microbiome Sequencing Data [version 2; peer review: 2 approved] | Clemens Kreutz (0) | Eva Kohnert (1) | https://orcid.org/0009-0007-9976-2441 | Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Baden-Württemberg, Germany | Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Baden-Württemberg, Germany | https://f1000research.com/articles/13-1180/v2 | 16S microbiome differential abundance simulation synthetic data benchmarking | eng
title	Computational Study Protocol: Leveraging Synthetic Data to Validate a Benchmark Study for Differential Abundance Tests for 16S Microbiome Sequencing Data [version 2; peer review: 2 approved]
topic	16S microbiome differential abundance simulation synthetic data benchmarking
url	https://f1000research.com/articles/13-1180/v2
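
The equivalence testing mentioned in the description can be illustrated with a minimal sketch. It assumes a paired design (one value of a data characteristic per dataset, once for the experimental and once for the synthetic version) and uses the two one-sided tests (TOST) procedure with a normal approximation. The function name `tost_paired`, the equivalence margin, and the example values are hypothetical and not taken from the protocol, which may define its tests and margins differently.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def tost_paired(synthetic, experimental, margin, alpha=0.05):
    """Two one-sided tests (TOST) on paired differences, normal approximation.

    `margin` is a hypothetical equivalence bound chosen by the analyst;
    it is not a value from the protocol. Returns (p_value, equivalent).
    """
    diffs = [s - e for s, e in zip(synthetic, experimental)]
    n = len(diffs)
    d_bar = mean(diffs)
    se = stdev(diffs) / sqrt(n)  # standard error of the mean difference
    z = NormalDist()
    # H0a: mean difference <= -margin (rejected when d_bar is well above -margin)
    p_lower = 1.0 - z.cdf((d_bar + margin) / se)
    # H0b: mean difference >= +margin (rejected when d_bar is well below +margin)
    p_upper = z.cdf((d_bar - margin) / se)
    # Equivalence is claimed only if both one-sided nulls are rejected.
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


# Made-up illustrative values of one data characteristic (e.g. sparsity)
# for six datasets; these are NOT results from the study.
experimental = [0.80, 0.75, 0.90, 0.85, 0.70, 0.88]
synthetic = [0.81, 0.74, 0.91, 0.86, 0.69, 0.89]

p_value, equivalent = tost_paired(synthetic, experimental, margin=0.05)
print(f"TOST p-value: {p_value:.4g}, equivalent: {equivalent}")
```

With few datasets, a t-distribution would be the more conservative choice than the normal approximation used here; the sketch keeps to the Python standard library for self-containment.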

Similar Items