Continuous and Discrete Similarity Coefficient for Identifying Essential Proteins Using Gene Expression Data

Essential proteins play a vital role in biological processes, and the combination of gene expression profiles with Protein-Protein Interaction (PPI) networks can improve the identification of essential proteins. However, gene expression data are prone to significant fluctuations due to noise interfe...

Bibliographic Details
Main Authors: Jiancheng Zhong, Zuohang Qu, Ying Zhong, Chao Tang, Yi Pan
Format: Article
Language: English
Published: Tsinghua University Press 2023-06-01
Series: Big Data Mining and Analytics
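Subjects: protein-protein interaction (PPI) network; continuous and discrete similarity coefficient; essential proteins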
Online Access: https://www.sciopen.com/article/10.26599/BDMA.2022.9020019
collection DOAJ
description Essential proteins play a vital role in biological processes, and combining gene expression profiles with Protein-Protein Interaction (PPI) networks can improve their identification. However, gene expression data are prone to significant fluctuations due to noise interference in topological networks. In this work, we discretized the gene expression data and used the discrete similarities of the gene expression spectrum to eliminate noise fluctuation. We then proposed the Pearson Jaccard coefficient (PJC), which consists of continuous and discrete similarities in the gene expression data. Using graph theory as the basis, we fused the newly proposed similarity coefficient with existing network topology prediction algorithms at each protein node to recognize essential proteins. This strategy exhibited a high recognition rate and good specificity. We validated PJC on the Krogan, Gavin, and DIP yeast PPI datasets and evaluated the results by receiver operating characteristic analysis, jackknife analysis, top analysis, and accuracy analysis. Compared with node-based network topology centrality methods and centrality methods that fuse biological information, PJC significantly improved the prediction performance for essential proteins in DC, IC, eigenvector centrality, subgraph centrality, betweenness centrality, closeness centrality, NC, PeC, and WDC. We also compared PJC with other methods using the NF-PIN algorithm, which predicts essential proteins by constructing active PPI networks through dynamic gene expression. The experimental results indicated that PJC has clear advantages in predicting essential proteins.
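The abstract describes PJC as a combination of a continuous similarity (Pearson) and a discrete similarity (Jaccard) computed on gene expression profiles, but this record does not give the formula. The following is only a minimal sketch of that idea, not the authors' definition. The above-mean discretization rule, the clipping of negative correlations to zero, the product used to combine the two terms, and all function names are assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def discretize(profile):
    """ASSUMED rule: a time point is 'active' if it exceeds the profile mean."""
    mean = sum(profile) / len(profile)
    return {i for i, value in enumerate(profile) if value > mean}


def jaccard(a, b):
    """Jaccard similarity of two sets of active time points."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def pjc_sketch(x, y):
    """ASSUMED combination: clipped Pearson multiplied by Jaccard."""
    continuous = max(pearson(x, y), 0.0)
    discrete = jaccard(discretize(x), discretize(y))
    return continuous * discrete


if __name__ == "__main__":
    profile_a = [1.0, 2.0, 3.0, 4.0]
    profile_b = [2.0, 4.0, 6.0, 8.0]
    print(pjc_sketch(profile_a, profile_b))
```

In a pipeline of the kind the abstract outlines, a score like this would be computed for each pair of interacting proteins and used as an edge weight before applying a centrality measure; how the authors perform that fusion is not stated in this record.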
id doaj-art-9bd3d3d4c7e848e29ab8264c7b758e76
institution Kabale University
issn 2096-0654
spelling doaj-art-9bd3d3d4c7e848e29ab8264c7b758e76 2025-02-03T04:58:51Z eng
Tsinghua University Press. Big Data Mining and Analytics, ISSN 2096-0654, 2023-06-01, vol. 6, no. 2, pp. 185-200. DOI: 10.26599/BDMA.2022.9020019
Jiancheng Zhong (0), Zuohang Qu (1), Ying Zhong (2), Chao Tang (3): College of Information Science and Engineering, Hunan Normal University, Changsha 410081, China
Yi Pan (4): Faculty of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences Shenzhen, Guangzhou 518055, China
topic protein-protein interaction (PPI) network
continuous and discrete similarity coefficient
essential proteins