A novel lossless encoding algorithm for data compression–genomics data as an exemplar

Data compression is a challenging and increasingly important problem. As the amount of data generated daily continues to increase, efficient transmission and storage have never been more critical. In this study, a novel encoding algorithm is proposed, motivated by the compression of DNA data and ass...

Bibliographic Details
Main Authors: Anas Al-okaily, Abdelghani Tbakhi
Format: Article
Language: English
Published: Frontiers Media S.A. 2025-01-01
Series: Frontiers in Bioinformatics
Subjects: compression, Huffman encoding, LZ, genomics, BWT
Online Access: https://www.frontiersin.org/articles/10.3389/fbinf.2024.1489704/full
author Anas Al-okaily
Abdelghani Tbakhi
collection DOAJ
description Data compression is a challenging and increasingly important problem. As the amount of data generated daily continues to grow, efficient transmission and storage have never been more critical. In this study, a novel encoding algorithm is proposed, motivated by the compression of DNA data and its associated characteristics. The proposed algorithm follows a divide-and-conquer approach: it scans the whole genome, classifies subsequences based on similarities in their content, and bins similar subsequences together. The data in each bin is then compressed independently. This approach differs from the currently known approaches, namely entropy, dictionary, predictive, and transform-based methods. Proof-of-concept performance was evaluated using a benchmark dataset of seventeen genomes ranging in size from kilobytes to gigabytes. The results showed a considerable improvement in the compression of each genome, saving several megabytes compared to state-of-the-art tools. Moreover, the algorithm can be applied to the compression of other data types, mainly text, numbers, images, audio, and video, which are being generated daily in unprecedented, massive volumes.
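The scan, classify, bin, and compress-per-bin idea in the description can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the record does not say how subsequences are delimited or how similarity is measured, so the fixed block length `BLOCK`, the "most frequent symbol" criterion in `label`, and the use of `zlib` as the per-bin compressor are all assumptions made purely for illustration.

```python
import zlib
from collections import Counter, defaultdict

BLOCK = 16  # hypothetical block length; the record does not specify one


def label(block: str) -> str:
    """Hypothetical similarity criterion: the most frequent symbol in the block."""
    counts = Counter(block)
    # Ties are broken alphabetically so that the label is deterministic.
    return min(counts, key=lambda s: (-counts[s], s))


def encode(seq: str):
    """Split seq into blocks, bin them by label, and compress each bin separately."""
    bins = defaultdict(list)
    order = []  # label of every block in original order (needed for decoding)
    for i in range(0, len(seq), BLOCK):
        block = seq[i:i + BLOCK]
        lab = label(block)
        bins[lab].append(block)
        order.append(lab)
    # zlib stands in for whatever per-bin encoder a real system would use.
    packed = {lab: zlib.compress("".join(blocks).encode("ascii"), 9)
              for lab, blocks in bins.items()}
    return order, packed, len(seq)


def decode(order, packed, total_len):
    """Invert encode(): decompress each bin and re-interleave the blocks."""
    data = {lab: zlib.decompress(blob).decode("ascii")
            for lab, blob in packed.items()}
    pos = defaultdict(int)  # read offset inside each decompressed bin
    out = []
    remaining = total_len
    for lab in order:
        size = min(BLOCK, remaining)  # only the final block can be short
        out.append(data[lab][pos[lab]:pos[lab] + size])
        pos[lab] += size
        remaining -= size
    return "".join(out)
```

Calling `decode(*encode(seq))` returns the original `seq`, which is what makes the scheme lossless. The list `order` is the side information that records which bin each block went to; a practical encoder would have to store it compactly, and whether binning yields any net gain depends on the data and on the real classification rule.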
format Article
id doaj-art-3c58369ff8fc4259bdd5d3a4439859ee
institution Kabale University
issn 2673-7647
language English
publishDate 2025-01-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Bioinformatics
spelling doaj-art-3c58369ff8fc4259bdd5d3a4439859ee
2025-01-23T06:56:26Z
eng
Frontiers Media S.A.
Frontiers in Bioinformatics
2673-7647
2025-01-01
4
10.3389/fbinf.2024.1489704
1489704
A novel lossless encoding algorithm for data compression–genomics data as an exemplar
Anas Al-okaily
Abdelghani Tbakhi
Department of Cell Therapy and Applied Genomics, King Hussein Cancer Center, Amman, Jordan
Department of Pathology and Molecular Medicine, McMaster University, Hamilton, ON, Canada
https://www.frontiersin.org/articles/10.3389/fbinf.2024.1489704/full
compression
Huffman encoding
LZ
genomics
BWT
title A novel lossless encoding algorithm for data compression–genomics data as an exemplar
topic compression
Huffman encoding
LZ
genomics
BWT
url https://www.frontiersin.org/articles/10.3389/fbinf.2024.1489704/full