A scalable tool for analyzing genomic variants of humans using knowledge graphs and graph machine learning

Advances in high-throughput genome sequencing have enabled large-scale genome sequencing in clinical practice and research studies. By analyzing genomic variants of humans, scientists can gain better understanding of the risk factors of complex diseases such as cancer and COVID-19. To model and anal...

Bibliographic Details
Main Authors: Shivika Prasanna, Ajay Kumar, Deepthi Rao, Eduardo J. Simoes, Praveen Rao
Format: Article
Language: English
Published: Frontiers Media S.A. 2025-01-01
Series: Frontiers in Big Data
Subjects: knowledge graphs; human genomic variants; graph machine learning; scalability; inference
Online Access: https://www.frontiersin.org/articles/10.3389/fdata.2024.1466391/full
author Shivika Prasanna
Ajay Kumar
Deepthi Rao
Eduardo J. Simoes
Praveen Rao
collection DOAJ
description Advances in high-throughput genome sequencing have enabled large-scale genome sequencing in clinical practice and research studies. By analyzing genomic variants of humans, scientists can gain better understanding of the risk factors of complex diseases such as cancer and COVID-19. To model and analyze the rich genomic data, knowledge graphs (KGs) and graph machine learning (GML) can be regarded as enabling technologies. In this article, we present a scalable tool called VariantKG for analyzing genomic variants of humans modeled using KGs and GML. Specifically, we used publicly available genome sequencing data from patients with COVID-19. VariantKG extracts variant-level genetic information output by a variant calling pipeline, annotates the variant data with additional metadata, and converts the annotated variant information into a KG represented using the Resource Description Framework (RDF). The resulting KG is further enhanced with patient metadata and stored in a scalable graph database that enables efficient RDF indexing and query processing. VariantKG employs the Deep Graph Library (DGL) to perform GML tasks such as node classification. A user can extract a subset of the KG and perform inference tasks using DGL. The user can monitor the training and testing performance and hardware utilization. We tested VariantKG for KG construction by using 1,508 genome sequences, leading to 4 billion RDF statements. We evaluated GML tasks using VariantKG by selecting a subset of 500 sequences from the KG and performing node classification using well-known GML techniques such as GraphSAGE, Graph Convolutional Network (GCN) and Graph Transformer. VariantKG has intuitive user interfaces and features enabling a low barrier to entry for KG construction, model inference, and model interpretation on genomic variants of humans.
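As an illustration of the variant-to-RDF conversion step described in the abstract, the following is a minimal Python sketch and not VariantKG's actual code. The namespace `http://example.org/variant/`, the predicate names, and the function `vcf_line_to_triples` are hypothetical, since this record does not specify the ontology the authors use. The sketch assumes simple single-nucleotide alleles, so the REF and ALT values are safe to embed in an IRI.

```python
# Hypothetical namespace; the ontology used by VariantKG is not given in this record.
BASE = "http://example.org/variant/"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def literal(value):
    """Quote a string as an N-Triples literal, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def vcf_line_to_triples(line, sample_id):
    """Convert one tab-separated VCF data line into a list of N-Triples strings.

    Only the first five VCF columns (CHROM, POS, ID, REF, ALT) are used.
    Assumes simple alleles such as A, C, G, T (no symbolic or multi-allelic ALT).
    """
    chrom, pos, _variant_id, ref, alt = line.rstrip("\n").split("\t")[:5]
    # One subject IRI per (sample, variant) pair.
    subject = f"<{BASE}{sample_id}/{chrom}-{pos}-{ref}-{alt}>"
    return [
        f"{subject} <{BASE}chromosome> {literal(chrom)} .",
        f"{subject} <{BASE}position> {literal(pos)}^^<{XSD_INTEGER}> .",
        f"{subject} <{BASE}reference> {literal(ref)} .",
        f"{subject} <{BASE}alternate> {literal(alt)} .",
        # Link the variant to its (hypothetical) sample node for patient metadata.
        f"{subject} <{BASE}fromSample> <{BASE}sample/{sample_id}> .",
    ]


if __name__ == "__main__":
    example = "1\t12345\t.\tA\tG\t50\tPASS\t."
    for triple in vcf_line_to_triples(example, "S1"):
        print(triple)
```

Each variant becomes a small star of triples around one subject node, which is the shape an RDF store can index and a graph learning library can later load as nodes and edges.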
format Article
id doaj-art-36a188f6b8ed452e8fe24a724da359ea
institution Kabale University
issn 2624-909X
language English
publishDate 2025-01-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Big Data
spelling doaj-art-36a188f6b8ed452e8fe24a724da359ea
Record timestamp: 2025-01-21T08:36:43Z
Language code: eng
Publisher: Frontiers Media S.A.
Journal: Frontiers in Big Data
ISSN: 2624-909X
Publication date: 2025-01-01
Volume: 7
DOI: 10.3389/fdata.2024.1466391
Article number: 1466391
Shivika Prasanna: Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
Ajay Kumar: Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
Deepthi Rao: Department of Pathology and Anatomical Sciences, University of Missouri, Columbia, MO, United States
Eduardo J. Simoes: Department of Biomedical Informatics, Biostatistics and Medical Epidemiology, University of Missouri, Columbia, MO, United States
Praveen Rao: Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
title A scalable tool for analyzing genomic variants of humans using knowledge graphs and graph machine learning
topic knowledge graphs
human genomic variants
graph machine learning
scalability
inference
url https://www.frontiersin.org/articles/10.3389/fdata.2024.1466391/full