A scalable tool for analyzing genomic variants of humans using knowledge graphs and graph machine learning
Advances in high-throughput genome sequencing have enabled large-scale genome sequencing in clinical practice and research studies. By analyzing genomic variants of humans, scientists can gain better understanding of the risk factors of complex diseases such as cancer and COVID-19. To model and anal...
Main Authors: | Shivika Prasanna, Ajay Kumar, Deepthi Rao, Eduardo J. Simoes, Praveen Rao |
---|---|
Format: | Article |
Language: | English |
Published: | Frontiers Media S.A., 2025-01-01 |
Series: | Frontiers in Big Data |
Subjects: | knowledge graphs; human genomic variants; graph machine learning; scalability; inference |
Online Access: | https://www.frontiersin.org/articles/10.3389/fdata.2024.1466391/full |
_version_ | 1832592404763377664 |
---|---|
author | Shivika Prasanna; Ajay Kumar; Deepthi Rao; Eduardo J. Simoes; Praveen Rao |
author_facet | Shivika Prasanna; Ajay Kumar; Deepthi Rao; Eduardo J. Simoes; Praveen Rao |
author_sort | Shivika Prasanna |
collection | DOAJ |
description | Advances in high-throughput genome sequencing have enabled large-scale genome sequencing in clinical practice and research studies. By analyzing genomic variants of humans, scientists can gain better understanding of the risk factors of complex diseases such as cancer and COVID-19. To model and analyze the rich genomic data, knowledge graphs (KGs) and graph machine learning (GML) can be regarded as enabling technologies. In this article, we present a scalable tool called VariantKG for analyzing genomic variants of humans modeled using KGs and GML. Specifically, we used publicly available genome sequencing data from patients with COVID-19. VariantKG extracts variant-level genetic information output by a variant calling pipeline, annotates the variant data with additional metadata, and converts the annotated variant information into a KG represented using the Resource Description Framework (RDF). The resulting KG is further enhanced with patient metadata and stored in a scalable graph database that enables efficient RDF indexing and query processing. VariantKG employs the Deep Graph Library (DGL) to perform GML tasks such as node classification. A user can extract a subset of the KG and perform inference tasks using DGL. The user can monitor the training and testing performance and hardware utilization. We tested VariantKG for KG construction by using 1,508 genome sequences, leading to 4 billion RDF statements. We evaluated GML tasks using VariantKG by selecting a subset of 500 sequences from the KG and performing node classification using well-known GML techniques such as GraphSAGE, Graph Convolutional Network (GCN) and Graph Transformer. VariantKG has intuitive user interfaces and features enabling a low barrier to entry for KG construction, model inference, and model interpretation on genomic variants of humans. |
format | Article |
id | doaj-art-36a188f6b8ed452e8fe24a724da359ea |
institution | Kabale University |
issn | 2624-909X |
language | English |
publishDate | 2025-01-01 |
publisher | Frontiers Media S.A. |
record_format | Article |
series | Frontiers in Big Data |
spelling | doaj-art-36a188f6b8ed452e8fe24a724da359ea; 2025-01-21T08:36:43Z; eng; Frontiers Media S.A.; Frontiers in Big Data; 2624-909X; 2025-01-01; 7; 10.3389/fdata.2024.1466391; 1466391; A scalable tool for analyzing genomic variants of humans using knowledge graphs and graph machine learning; Shivika Prasanna (0); Ajay Kumar (1); Deepthi Rao (2); Eduardo J. Simoes (3); Praveen Rao (4); Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States; Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States; Department of Pathology and Anatomical Sciences, University of Missouri, Columbia, MO, United States; Department of Biomedical Informatics, Biostatistics and Medical Epidemiology, University of Missouri, Columbia, MO, United States; Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States; Advances in high-throughput genome sequencing have enabled large-scale genome sequencing in clinical practice and research studies. By analyzing genomic variants of humans, scientists can gain better understanding of the risk factors of complex diseases such as cancer and COVID-19. To model and analyze the rich genomic data, knowledge graphs (KGs) and graph machine learning (GML) can be regarded as enabling technologies. In this article, we present a scalable tool called VariantKG for analyzing genomic variants of humans modeled using KGs and GML. Specifically, we used publicly available genome sequencing data from patients with COVID-19. VariantKG extracts variant-level genetic information output by a variant calling pipeline, annotates the variant data with additional metadata, and converts the annotated variant information into a KG represented using the Resource Description Framework (RDF). The resulting KG is further enhanced with patient metadata and stored in a scalable graph database that enables efficient RDF indexing and query processing. VariantKG employs the Deep Graph Library (DGL) to perform GML tasks such as node classification. A user can extract a subset of the KG and perform inference tasks using DGL. The user can monitor the training and testing performance and hardware utilization. We tested VariantKG for KG construction by using 1,508 genome sequences, leading to 4 billion RDF statements. We evaluated GML tasks using VariantKG by selecting a subset of 500 sequences from the KG and performing node classification using well-known GML techniques such as GraphSAGE, Graph Convolutional Network (GCN) and Graph Transformer. VariantKG has intuitive user interfaces and features enabling a low barrier to entry for KG construction, model inference, and model interpretation on genomic variants of humans.; https://www.frontiersin.org/articles/10.3389/fdata.2024.1466391/full; knowledge graphs; human genomic variants; graph machine learning; scalability; inference |
spellingShingle | Shivika Prasanna; Ajay Kumar; Deepthi Rao; Eduardo J. Simoes; Praveen Rao; A scalable tool for analyzing genomic variants of humans using knowledge graphs and graph machine learning; Frontiers in Big Data; knowledge graphs; human genomic variants; graph machine learning; scalability; inference |
title | A scalable tool for analyzing genomic variants of humans using knowledge graphs and graph machine learning |
title_full | A scalable tool for analyzing genomic variants of humans using knowledge graphs and graph machine learning |
title_fullStr | A scalable tool for analyzing genomic variants of humans using knowledge graphs and graph machine learning |
title_full_unstemmed | A scalable tool for analyzing genomic variants of humans using knowledge graphs and graph machine learning |
title_short | A scalable tool for analyzing genomic variants of humans using knowledge graphs and graph machine learning |
title_sort | scalable tool for analyzing genomic variants of humans using knowledge graphs and graph machine learning |
topic | knowledge graphs; human genomic variants; graph machine learning; scalability; inference |
url | https://www.frontiersin.org/articles/10.3389/fdata.2024.1466391/full |
work_keys_str_mv | AT shivikaprasanna ascalabletoolforanalyzinggenomicvariantsofhumansusingknowledgegraphsandgraphmachinelearning AT ajaykumar ascalabletoolforanalyzinggenomicvariantsofhumansusingknowledgegraphsandgraphmachinelearning AT deepthirao ascalabletoolforanalyzinggenomicvariantsofhumansusingknowledgegraphsandgraphmachinelearning AT eduardojsimoes ascalabletoolforanalyzinggenomicvariantsofhumansusingknowledgegraphsandgraphmachinelearning AT praveenrao ascalabletoolforanalyzinggenomicvariantsofhumansusingknowledgegraphsandgraphmachinelearning AT shivikaprasanna scalabletoolforanalyzinggenomicvariantsofhumansusingknowledgegraphsandgraphmachinelearning AT ajaykumar scalabletoolforanalyzinggenomicvariantsofhumansusingknowledgegraphsandgraphmachinelearning AT deepthirao scalabletoolforanalyzinggenomicvariantsofhumansusingknowledgegraphsandgraphmachinelearning AT eduardojsimoes scalabletoolforanalyzinggenomicvariantsofhumansusingknowledgegraphsandgraphmachinelearning AT praveenrao scalabletoolforanalyzinggenomicvariantsofhumansusingknowledgegraphsandgraphmachinelearning |
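
The description field above says that VariantKG converts annotated variant information into a knowledge graph represented in RDF and enhances it with patient metadata. As a rough illustration of that conversion step only, the sketch below turns one variant record into N-Triples lines. It is not the tool's actual implementation: the `http://example.org/variantkg/` namespace, the `hasVariant` predicate, the field names, and the sample record are all hypothetical, since the record does not specify VariantKG's ontology or URI scheme.

```python
from typing import Dict, List
from urllib.parse import quote

# Hypothetical namespace, used only for illustration; the record does not
# state which ontology or URI scheme VariantKG uses.
BASE = "http://example.org/variantkg/"


def escape_literal(value: str) -> str:
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def variant_to_ntriples(sample_id: str, variant: Dict[str, str]) -> List[str]:
    """Turn one variant record into N-Triples lines linked to a sample node."""
    # Build a node identifier from the fields that locate the variant.
    raw_id = f"{variant['chrom']}_{variant['pos']}_{variant['ref']}_{variant['alt']}"
    # Percent-encode so characters such as '<' or '>' cannot break the IRI.
    variant_uri = f"<{BASE}variant/{quote(raw_id, safe='')}>"
    sample_uri = f"<{BASE}sample/{quote(sample_id, safe='')}>"

    # One edge from the sample (patient) node to the variant node.
    triples = [f"{sample_uri} <{BASE}hasVariant> {variant_uri} ."]
    # One literal-valued statement per field of the variant record.
    for field, value in variant.items():
        literal = escape_literal(str(value))
        triples.append(f'{variant_uri} <{BASE}{field}> "{literal}" .')
    return triples


if __name__ == "__main__":
    # Made-up record for demonstration; not taken from the article's data.
    record = {"chrom": "1", "pos": "12345", "ref": "A", "alt": "G", "gene": "GENE1"}
    for line in variant_to_ntriples("sample001", record):
        print(line)
```

Each record yields one sample-to-variant edge plus one statement per field, which is one simple way a large number of genome sequences could expand into a very large number of RDF statements. A production pipeline would more likely use typed literals and a shared vocabulary than plain strings.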