Analysis of unmapped regions associated with long deletions in Korean whole genome sequences based on short read data

Bibliographic Details
Main Authors: Yuna Lee, Kiejung Park, Insong Koh
Format: Article
Language: English
Published: BioMed Central 2019-12-01
Series: Genomics & Informatics
Subjects: deletion, Korean, structural variation, unmapped region, whole genome sequencing
Online Access: http://genominfo.org/upload/pdf/gi-2019-17-4-e40.pdf
author Yuna Lee
Kiejung Park
Insong Koh
collection DOAJ
description While indels and single nucleotide polymorphisms in human genomic sequences have been studied actively, long insertions/deletions are harder to detect. Over the last 10 years, long read data for human genomes from the PacBio and Nanopore platforms have become more widely available, which makes long insertions/deletions easier to detect. However, because long read sequencing is still relatively costly, most next generation sequencing data are produced by short read sequencing machines. Here, we constructed programs to detect so-called unmapped regions (UMRs, regions of the reference genome to which no reads are mapped), scanned 40 Korean genomes to select UMRs as long deletion candidates, and compared the candidates with the long deletion break points in the genomes available from the 1000 Genomes Project (1KGP). An average of about 36,000 UMRs were found in the 40 Korean genomes tested, 284 UMRs were common across the 40 genomes, and a total of 37,943 UMRs were found. Of these, 30,698 UMRs overlapped with the 74,045 break points provided by the 1KGP. As the number of compared samples increased from 1 to 40, the number of UMRs that overlapped with the break points also increased, eventually reaching 80.9% of the total UMRs found in this study. As the number of overlapping UMRs could probably grow to encompass the 74,045 break points with the inclusion of more Korean genomes, this approach could be practically useful for studies of long deletions that use short read data.
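The workflow summarized in the description (find zero-coverage regions, pool them across samples, count how many coincide with known break points) can be illustrated with a minimal Python sketch. This is not the authors' program: the function names, the half-open interval convention, the `min_len` threshold, the merging of overlapping UMRs across samples, and the rule that a UMR "overlaps" when a break point falls inside it are all assumptions made for illustration, and the depth arrays are toy data.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), 0-based (assumed convention)


def find_umrs(depth: List[int], min_len: int = 1) -> List[Interval]:
    """Return maximal runs of zero read depth that are at least min_len long."""
    umrs: List[Interval] = []
    start = None
    for pos, d in enumerate(depth):
        if d == 0:
            if start is None:
                start = pos  # a zero-coverage run begins here
        elif start is not None:
            if pos - start >= min_len:
                umrs.append((start, pos))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(depth) - start >= min_len:
        umrs.append((start, len(depth)))
    return umrs


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Pool UMRs from several samples, merging any that overlap or touch."""
    merged: List[Interval] = []
    for s, e in sorted(intervals):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def overlap_fraction(umrs: List[Interval], breakpoints: List[int]) -> float:
    """Fraction of UMRs that contain at least one break point (assumed rule)."""
    if not umrs:
        return 0.0
    hits = sum(any(s <= bp < e for bp in breakpoints) for s, e in umrs)
    return hits / len(umrs)


# Toy per-base depth for two hypothetical samples on one short contig.
sample_a = [3, 2, 0, 0, 0, 1, 4, 0, 0, 5]
sample_b = [1, 0, 0, 0, 2, 2, 2, 2, 2, 2]

pooled = merge_intervals(find_umrs(sample_a, 2) + find_umrs(sample_b, 2))
toy_fraction = overlap_fraction(pooled, breakpoints=[4, 20])

# The reported percentage follows from the counts in the description.
reported_pct = round(30698 / 37943 * 100, 1)
```

On the toy data the pooled UMRs are `(1, 5)` and `(7, 9)`, and only the first contains a break point, so `toy_fraction` is 0.5. The last line reproduces the 80.9% figure from the reported counts (30,698 overlapping UMRs out of 37,943).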
format Article
id doaj-art-211f8e10162e46c3ade3e743905a77cf
institution Kabale University
issn 2234-0742
language English
publishDate 2019-12-01
publisher BioMed Central
record_format Article
series Genomics & Informatics
spelling doaj-art-211f8e10162e46c3ade3e743905a77cf 2025-02-02T17:01:55Z eng BioMed Central Genomics & Informatics 2234-0742 2019-12-01 17 4 10.5808/GI.2019.17.4.e40 586
Yuna Lee (0), Kiejung Park (1), Insong Koh (2)
0: Department of Biomedical Informatics, Hanyang University, Seoul 04763, Korea
1: Cheonan Industry-Academic Collaboration Foundation, Sangmyung University, Cheonan 31066, Korea
2: Department of Biomedical Informatics, Hanyang University, Seoul 04763, Korea
title Analysis of unmapped regions associated with long deletions in Korean whole genome sequences based on short read data
topic deletion
korean
structural variation
unmapped region
whole genome sequencing
url http://genominfo.org/upload/pdf/gi-2019-17-4-e40.pdf