Computational Study Protocol: Leveraging Synthetic Data to Validate a Benchmark Study for Differential Abundance Tests for 16S Microbiome Sequencing Data [version 2; peer review: 2 approved]

Bibliographic Details
Main Authors: Clemens Kreutz, Eva Kohnert
Format: Article
Language: English
Published: F1000 Research Ltd 2025-01-01
Series: F1000Research
Online Access: https://f1000research.com/articles/13-1180/v2
Collection: DOAJ
Description:

Background: The utility of synthetic data in benchmark studies depends on its ability to closely mimic real-world conditions and to reproduce results obtained from experimental data. Building on the study by Nearing et al. (1), who assessed 14 differential abundance tests using 38 experimental 16S rRNA datasets in a case-control design, we are generating synthetic datasets that mimic the experimental data in order to verify their findings. We will employ statistical tests to rigorously assess the similarity between synthetic and experimental data and to validate the conclusions on the performance of these tests drawn by Nearing et al. (1). This protocol adheres to the SPIRIT guidelines, demonstrating how established reporting frameworks can support robust, transparent, and unbiased study planning.

Methods: We replicate the methodology of Nearing et al. (1), incorporating synthetic data simulated with two distinct tools so as to mirror the 38 experimental datasets. Equivalence tests will be conducted on a non-redundant subset of 46 data characteristics comparing synthetic and experimental data, complemented by principal component analysis for an overall similarity assessment. The 14 differential abundance tests will be applied to both synthetic and experimental datasets, evaluating the consistency of significant feature identification and the number of significant features per tool. Correlation analysis and multiple regression will explore how differences between synthetic and experimental data characteristics may affect the results.

Conclusions: Synthetic data enables the validation of findings through controlled experiments. We assess how well synthetic data replicates experimental data, attempt to validate previous findings with the most recent versions of the differential abundance (DA) methods, and delineate the strengths and limitations of synthetic data in benchmark studies. Moreover, to our knowledge, this is the first computational benchmark study to systematically incorporate synthetic data for validating differential abundance methods while strictly adhering to a pre-specified study protocol following SPIRIT guidelines, contributing to transparency, reproducibility, and unbiased research.
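The equivalence testing mentioned in the Methods can be illustrated with a small sketch. The record does not state which equivalence procedure, margin, or software the protocol uses, so everything below is an assumption made for illustration: a two one-sided tests (TOST) procedure on paired differences with a normal approximation, a hypothetical function name `tost_paired`, a hypothetical margin of 0.05, and made-up toy values for a single data characteristic measured on six dataset pairs.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def tost_paired(experimental, synthetic, margin):
    """TOST on paired differences (normal approximation; illustrative only).

    Returns the TOST p-value. A small value supports the claim that the
    mean difference lies within +/- margin, i.e. equivalence.
    """
    diffs = [s - e for e, s in zip(experimental, synthetic)]
    d = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))
    z = NormalDist()
    # H0a: mean difference <= -margin (rejected when d is clearly above -margin)
    p_lower = 1 - z.cdf((d + margin) / se)
    # H0b: mean difference >= +margin (rejected when d is clearly below +margin)
    p_upper = z.cdf((d - margin) / se)
    # Equivalence is claimed only if both one-sided nulls are rejected
    return max(p_lower, p_upper)


# Hypothetical toy values of one data characteristic (not from the study)
experimental = [0.80, 0.75, 0.90, 0.85, 0.70, 0.88]
synthetic = [0.81, 0.74, 0.91, 0.84, 0.71, 0.87]

p_value = tost_paired(experimental, synthetic, margin=0.05)
print(f"TOST p-value: {p_value:.4g}")
```

Unlike a conventional difference test, the null hypothesis here is non-equivalence, so a small p-value is evidence that synthetic and experimental values agree within the chosen margin. With only a handful of datasets, a t-based version would be more appropriate than the normal approximation used in this sketch.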
ISSN: 2046-1402
Affiliation (both authors): Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Baden-Württemberg, Germany
ORCID (listed in the record): https://orcid.org/0009-0007-9976-2441
Keywords: 16S, microbiome, differential abundance, simulation, synthetic data, benchmarking