Taming large-scale genomic analyses via sparsified genomics

Abstract Searching for similar genomic sequences is an essential and fundamental step in biomedical research. State-of-the-art computational methods performing such comparisons fail to cope with the exponential growth of genomic sequencing data. We introduce the concept of sparsified genomics where...

Bibliographic Details
Main Authors: Mohammed Alser, Julien Eudine, Onur Mutlu
Format: Article
Language: English
Published: Nature Portfolio 2025-01-01
Series: Nature Communications
Online Access: https://doi.org/10.1038/s41467-024-55762-1
Description: Searching for similar genomic sequences is an essential and fundamental step in biomedical research. State-of-the-art computational methods performing such comparisons fail to cope with the exponential growth of genomic sequencing data. We introduce the concept of sparsified genomics where we systematically exclude a large number of bases from genomic sequences and enable faster and memory-efficient processing of the sparsified, shorter genomic sequences, while providing comparable accuracy to processing non-sparsified sequences. Sparsified genomics provides benefits to many genomic analyses and has broad applicability. Sparsifying genomic sequences accelerates the state-of-the-art read mapper (minimap2) by 2.57-5.38x, 1.13-2.78x, and 3.52-6.28x using real Illumina, HiFi, and ONT reads, respectively, while providing comparable memory footprint, 2x smaller index size, and more correctly detected variations compared to minimap2. Sparsifying genomic sequences makes containment search through very large genomes and large databases 72.7-75.88x (1.62-1.9x when indexing is preprocessed) faster and 723.3x more storage-efficient than searching through non-sparsified genomic sequences (with CMash and KMC3). Sparsifying genomic sequences enables robust microbiome discovery by providing 54.15-61.88x (1.58-1.71x when indexing is preprocessed) faster and 720x more storage-efficient taxonomic profiling of metagenomic samples over the state-of-the-art tool (Metalign).
ISSN: 2041-1723
Author Affiliation: Department of Information Technology and Electrical Engineering, ETH Zürich (all three authors)