Training data diversity enhances the basecalling of novel RNA modification-induced nanopore sequencing readouts

Abstract Accurately basecalling sequence backbones in the presence of nucleotide modifications remains a substantial challenge in nanopore sequencing bioinformatics. It has been extensively demonstrated that state-of-the-art basecallers are less compatible with modification-induced sequencing signal...

Bibliographic Details
Main Authors: Ziyuan Wang, Ziyang Liu, Yinshan Fang, Hao Helen Zhang, Xiaoxiao Sun, Ning Hao, Jianwen Que, Hongxu Ding
Format: Article
Language: English
Published: Nature Portfolio 2025-01-01
Series: Nature Communications
Online Access: https://doi.org/10.1038/s41467-025-55974-z
collection DOAJ
description Abstract Accurately basecalling sequence backbones in the presence of nucleotide modifications remains a substantial challenge in nanopore sequencing bioinformatics. It has been extensively demonstrated that state-of-the-art basecallers are less compatible with modification-induced sequencing signals. Precise basecalling, on the other hand, is a prerequisite for virtually all downstream analyses. Here, we report that basecallers exposed to diverse training modifications gain the generalizability to analyze novel modifications. With synthesized oligos as the model system, we precisely basecall various out-of-sample RNA modifications. From the representation learning perspective, we attribute this generalizability to the basecaller representation space being expanded by diverse training modifications. Taken together, we conclude that increasing training data diversity is a paradigm for building modification-tolerant nanopore sequencing basecallers.
id doaj-art-36d8cf368ae24c69958ecbc8695a7803
institution Kabale University
issn 2041-1723
spelling doaj-art-36d8cf368ae24c69958ecbc8695a7803 | 2025-01-19T12:32:02Z | eng | Nature Portfolio | Nature Communications | 2041-1723 | 2025-01-01 | 16 | 1 | 1 | 19 | 10.1038/s41467-025-55974-z
Author affiliations:
Ziyuan Wang: Department of Pharmacy Practice and Science, University of Arizona
Ziyang Liu: Department of Pharmacy Practice and Science, University of Arizona
Yinshan Fang: Columbia Center for Human Development, Department of Medicine, Columbia University Medical Center
Hao Helen Zhang: Statistics and Data Science GIDP, University of Arizona
Xiaoxiao Sun: Statistics and Data Science GIDP, University of Arizona
Ning Hao: Statistics and Data Science GIDP, University of Arizona
Jianwen Que: Columbia Center for Human Development, Department of Medicine, Columbia University Medical Center
Hongxu Ding: Department of Pharmacy Practice and Science, University of Arizona