Computational Study Protocol: Leveraging Synthetic Data to Validate a Benchmark Study for Differential Abundance Tests for 16S Microbiome Sequencing Data [version 2; peer review: 2 approved]
Main Authors: | Clemens Kreutz, Eva Kohnert |
---|---|
Format: | Article |
Language: | English |
Published: | F1000 Research Ltd, 2025-01-01 |
Series: | F1000Research |
Subjects: | |
Online Access: | https://f1000research.com/articles/13-1180/v2 |
_version_ | 1832590359854579712 |
---|---|
author | Clemens Kreutz; Eva Kohnert |
author_facet | Clemens Kreutz; Eva Kohnert |
author_sort | Clemens Kreutz |
collection | DOAJ |
description | Background: The utility of synthetic data in benchmark studies depends on its ability to closely mimic real-world conditions and to reproduce results obtained from experimental data. Building on the study by Nearing et al. (1), who assessed 14 differential abundance tests using 38 experimental 16S rRNA datasets in a case-control design, we are generating synthetic datasets that mimic the experimental data to verify their findings. We will employ statistical tests to rigorously assess the similarity between synthetic and experimental data and to validate the conclusions on the performance of these tests drawn by Nearing et al. (1). This protocol adheres to the SPIRIT guidelines, demonstrating how established reporting frameworks can support robust, transparent, and unbiased study planning. Methods: We replicate the methodology of Nearing et al. (1), incorporating synthetic data, simulated using two distinct tools, that mirror the 38 experimental datasets. Equivalence tests will be conducted on a non-redundant subset of 46 data characteristics comparing synthetic and experimental data, complemented by principal component analysis for overall similarity assessment. The 14 differential abundance tests will be applied to synthetic and experimental datasets, evaluating the consistency of significant feature identification and the number of significant features per tool. Correlation analysis and multiple regression will explore how differences between synthetic and experimental data characteristics may affect the results. Conclusions: Synthetic data enables the validation of findings through controlled experiments. We assess how well synthetic data replicates experimental data, attempt to validate previous findings with the most recent versions of the differential abundance (DA) methods, and delineate the strengths and limitations of synthetic data in benchmark studies. Moreover, to our knowledge, this is the first computational benchmark study to systematically incorporate synthetic data for validating differential abundance methods while strictly adhering to a pre-specified study protocol following SPIRIT guidelines, contributing to transparency, reproducibility, and unbiased research. |
format | Article |
id | doaj-art-d394e461c542419b9be1a11f549f7c5f |
institution | Kabale University |
issn | 2046-1402 |
language | English |
publishDate | 2025-01-01 |
publisher | F1000 Research Ltd |
record_format | Article |
series | F1000Research |
spelling | doaj-art-d394e461c542419b9be1a11f549f7c5f; 2025-01-24T01:00:01Z; eng; F1000 Research Ltd; F1000Research; 2046-1402; 2025-01-01; 13176118; Computational Study Protocol: Leveraging Synthetic Data to Validate a Benchmark Study for Differential Abundance Tests for 16S Microbiome Sequencing Data [version 2; peer review: 2 approved]; Clemens Kreutz (0); Eva Kohnert (1); https://orcid.org/0009-0007-9976-2441; Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Baden-Württemberg, Germany; Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Baden-Württemberg, Germany; https://f1000research.com/articles/13-1180/v2; 16S microbiome differential abundance simulation synthetic data benchmarking; eng |
title | Computational Study Protocol: Leveraging Synthetic Data to Validate a Benchmark Study for Differential Abundance Tests for 16S Microbiome Sequencing Data [version 2; peer review: 2 approved] |
topic | 16S microbiome differential abundance simulation synthetic data benchmarking eng |
url | https://f1000research.com/articles/13-1180/v2 |
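
The description above mentions equivalence tests comparing data characteristics of synthetic and experimental datasets. The record does not say which equivalence procedure the protocol uses, so the following is only a minimal sketch of one common form, two one-sided tests (TOST), applied to paired differences with a normal approximation. The helper name `tost_paired`, the equivalence margin, and all numeric values are hypothetical and chosen purely for illustration; they are not taken from the study.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def tost_paired(synthetic, experimental, margin, alpha=0.05):
    """Two one-sided z-tests (TOST) on paired differences.

    Hypothetical helper, not the authors' implementation. Uses a normal
    approximation; a t-distribution would be more appropriate for small n.
    Returns (p_value, equivalent).
    """
    diffs = [s - e for s, e in zip(synthetic, experimental)]
    d = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))
    if se == 0:
        # Degenerate case: all differences identical, no sampling noise.
        return (0.0, True) if abs(d) < margin else (1.0, False)
    nd = NormalDist()
    # H0 (lower): true difference <= -margin; reject for large z.
    p_lower = 1 - nd.cdf((d + margin) / se)
    # H0 (upper): true difference >= +margin; reject for small z.
    p_upper = nd.cdf((d - margin) / se)
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


# Made-up values of one data characteristic (e.g. a sparsity-like
# proportion) for eight dataset pairs; illustration only.
experimental = [0.70, 0.65, 0.80, 0.75, 0.60, 0.72, 0.68, 0.77]
synthetic = [0.71, 0.66, 0.78, 0.76, 0.61, 0.70, 0.69, 0.78]
shifted = [s + 0.2 for s in synthetic]  # deliberately dissimilar data

p_value, equivalent = tost_paired(synthetic, experimental, margin=0.05)
print(f"p = {p_value:.3g}, equivalent = {equivalent}")
print(tost_paired(shifted, experimental, margin=0.05))
```

In a TOST, equivalence is declared only when both one-sided null hypotheses are rejected, which is why the larger of the two p-values is reported. In a setting like the one described, such a test would be repeated per data characteristic, and the choice of margin would need to be justified in advance.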