Handling Imbalance Classification Virtual Screening Big Data Using Machine Learning Algorithms


Bibliographic Details
Main Authors:	Sahar K. Hussin, Salah M. Abdelmageid, Adel Alkhalil, Yasser M. Omar, Mahmoud I. Marie, Rabie A. Ramadan
Format:	Article
Language:	English
Published:	Wiley 2021-01-01
Series:	Complexity
Online Access:	http://dx.doi.org/10.1155/2021/6675279

_version_	1832560093818781696
author	Sahar K. Hussin, Salah M. Abdelmageid, Adel Alkhalil, Yasser M. Omar, Mahmoud I. Marie, Rabie A. Ramadan
author_facet	Sahar K. Hussin, Salah M. Abdelmageid, Adel Alkhalil, Yasser M. Omar, Mahmoud I. Marie, Rabie A. Ramadan
author_sort	Sahar K. Hussin
collection	DOAJ
description	Virtual screening is the most critical process in drug discovery, and it relies on machine learning to facilitate the screening process. It enables the discovery of molecules that bind to a specific protein to form a drug. Despite its benefits, virtual screening generates enormous amounts of data and suffers from drawbacks such as high dimensionality and class imbalance. This paper tackles data imbalance and aims to improve virtual screening accuracy, especially for the minority class. When a dataset is classified without accounting for its imbalanced nature, most classification methods achieve high predictive accuracy for the majority class but significantly poorer accuracy for the minority class. The paper proposes a K-means algorithm coupled with the Synthetic Minority Oversampling Technique (SMOTE) to overcome the problem of imbalanced datasets. The proposed algorithm is named KSMOTE. Using KSMOTE, minority data can be identified with high accuracy and detected with high precision. A large set of experiments was implemented on Apache Spark using numeric PaDEL and fingerprint descriptors. The proposed solution was compared with both a no-sampling method and SMOTE on the same datasets. Experimental results showed that the proposed solution outperformed the other methods.
format	Article
id	doaj-art-28849350eb54460288cac53db9348760
institution	Kabale University
issn	1076-2787 1099-0526
language	English
publishDate	2021-01-01
publisher	Wiley
record_format	Article
series	Complexity
spelling	doaj-art-28849350eb54460288cac53db9348760; 2025-02-03T01:28:23Z; eng; Wiley; Complexity; 1076-2787; 1099-0526; 2021-01-01; 2021; 10.1155/2021/6675279; 6675279; Handling Imbalance Classification Virtual Screening Big Data Using Machine Learning Algorithms; Sahar K. Hussin [0], Salah M. Abdelmageid [1], Adel Alkhalil [2], Yasser M. Omar [3], Mahmoud I. Marie [4], Rabie A. Ramadan [5]; [0] Communication and Computers Engineering Department, Alshrouck Academy, Cairo, Egypt; [1] Computer Engineering Department, College of Comp. Science and Engineering, Taibah University, Medina, Saudi Arabia; [2] College of Computer Science and Engineering, University of Hai’l, Hai’l, Saudi Arabia; [3] Arab Academy for Science Technology and Maritime Transport, Cairo, Egypt; [4] Computer and System Engineering Department, Al-Azhar University, Cairo, Egypt; [5] College of Computer Science and Engineering, University of Hai’l, Hai’l, Saudi Arabia; http://dx.doi.org/10.1155/2021/6675279
title	Handling Imbalance Classification Virtual Screening Big Data Using Machine Learning Algorithms
title_sort	handling imbalance classification virtual screening big data using machine learning algorithms
url	http://dx.doi.org/10.1155/2021/6675279


Similar Items