ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition.
Motivation: Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels...
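| Main Authors: | David Koslicki, Saikat Chatterjee, Damon Shahrivar, Alan W Walker, Suzanna C Francis, Louise J Fraser, Mikko Vehkaperä, Yueheng Lan, Jukka Corander |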
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Public Library of Science (PLoS), 2015-01-01 |
| Series: | PLoS ONE |
| Online Access: | https://doi.org/10.1371/journal.pone.0140644 |
| _version_ | 1850124975703326720 |
|---|---|
| author | David Koslicki; Saikat Chatterjee; Damon Shahrivar; Alan W Walker; Suzanna C Francis; Louise J Fraser; Mikko Vehkaperä; Yueheng Lan; Jukka Corander |
| author_sort | David Koslicki |
| collection | DOAJ |
| description | Motivation: Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging. Results: There has been a recent surge of interest in using compressed-sensing-inspired and convex-optimization-based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order k-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing step in which a standard K-means clustering algorithm partitions a large set of reads into subsets at reasonable computational cost, providing several vectors of first-order statistics instead of only a single statistical summary in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via a mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity. Availability: An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware. |
| format | Article |
| id | doaj-art-d0e8caa4c0b744dfbd0a80f1b1ed8ade |
| institution | OA Journals |
| issn | 1932-6203 |
| language | English |
| publishDate | 2015-01-01 |
| publisher | Public Library of Science (PLoS) |
| record_format | Article |
| series | PLoS ONE |
| spelling | doaj-art-d0e8caa4c0b744dfbd0a80f1b1ed8ade; 2025-08-20T02:34:12Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2015-01-01; vol. 10, no. 10, e0140644; 10.1371/journal.pone.0140644; https://doi.org/10.1371/journal.pone.0140644 |
| title | ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition. |
| url | https://doi.org/10.1371/journal.pone.0140644 |
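
The description in this record outlines a pipeline: summarize reads by k-mer frequencies, partition the reads with K-means, estimate composition per cluster, and combine the per-cluster results. The following Python sketch illustrates that flow under stated assumptions and is not the authors' Julia or Matlab implementation. The function names (`kmer_frequencies`, `kmeans`, `aggregate`), the choice of k = 2, and the `estimate` callback are hypothetical. Weighting each cluster's estimate by its share of reads is one plausible reading of the mixture formulation mentioned in the abstract, not a detail confirmed by this record.

```python
from itertools import product
import random


def kmer_frequencies(read, k=2):
    """Normalized k-mer frequency vector of one read over the alphabet ACGT."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0.0] * len(kmers)
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])
        if j is not None:  # skip k-mers with ambiguous bases such as N
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def kmeans(points, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, n_clusters)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(n_clusters),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def aggregate(reads, n_clusters, estimate, k=2):
    """Cluster reads, estimate per cluster, combine weighted by cluster size.

    `estimate` is a hypothetical callback that maps a mean k-mer frequency
    vector to a composition vector; it stands in for whichever k-mer-based
    estimation method is being wrapped.
    """
    vectors = [kmer_frequencies(r, k) for r in reads]
    labels = kmeans(vectors, n_clusters)
    final = None
    for c in range(n_clusters):
        members = [v for v, lab in zip(vectors, labels) if lab == c]
        if not members:
            continue  # an empty cluster contributes nothing
        weight = len(members) / len(vectors)  # assumed mixture weight
        mean = [sum(col) / len(members) for col in zip(*members)]
        cluster_estimate = estimate(mean)
        if final is None:
            final = [0.0] * len(cluster_estimate)
        final = [f + weight * e for f, e in zip(final, cluster_estimate)]
    return final
```

With an identity callback (`estimate=lambda v: v`) the combined vector reduces to the mean k-mer profile of all reads, which is a convenient sanity check before plugging in a real estimator.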