Detection of COVID-19 Using Protein Sequence Data via Machine Learning Classification Approach

The emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in late 2019 resulted in the COVID-19 pandemic, necessitating rapid and accurate detection of pathogens through protein sequence data. This study is aimed at developing an efficient classification model for coronavirus pro...

Bibliographic Details
Main Authors: Siti Aminah, Gianinna Ardaneswari, Mufarrido Husnah, Ghani Deori, Handi Bagus Prasetyo
Format: Article
Language: English
Published: Wiley 2023-01-01
Series: Journal of Applied Mathematics
Online Access: http://dx.doi.org/10.1155/2023/9991095
author Siti Aminah
Gianinna Ardaneswari
Mufarrido Husnah
Ghani Deori
Handi Bagus Prasetyo
collection DOAJ
description The emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in late 2019 resulted in the COVID-19 pandemic, necessitating rapid and accurate detection of pathogens through protein sequence data. This study is aimed at developing an efficient classification model for coronavirus protein sequences using machine learning algorithms and feature selection techniques to aid in the early detection and prediction of novel viruses. We utilized a dataset comprising 2000 protein sequences, including 1000 SARS-CoV-2 sequences and 1000 non-SARS-CoV-2 sequences. Feature extraction provided 27 essential features representing the primary structural data, achieved through the Discere package. To optimize performance, we employed machine learning classification algorithms such as K-nearest neighbor (KNN), XGBoost, and Naïve Bayes, along with feature selection techniques like genetic algorithm (GA), LASSO, and support vector machine recursive feature elimination (SVM-RFE). The SVM-RFE+KNN model exhibited exceptional performance, achieving a classification accuracy of 99.30%, specificity of 99.52%, and sensitivity of 99.55%. These results demonstrate the model’s efficacy in accurately classifying coronavirus protein sequences. Our research successfully developed a robust classification model capable of early detection and prediction of protein sequences in SARS-CoV-2 and other coronaviruses. This advancement holds great promise in facilitating the development of targeted treatments and preventive strategies for combating future viral outbreaks.
format Article
id doaj-art-4f8435e234e948d7ba6efc6fb0d1b9c4
institution Kabale University
issn 1687-0042
language English
publishDate 2023-01-01
publisher Wiley
record_format Article
series Journal of Applied Mathematics
title Detection of COVID-19 Using Protein Sequence Data via Machine Learning Classification Approach
url http://dx.doi.org/10.1155/2023/9991095
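
The description in this record reports accuracy, specificity, and sensitivity for a binary classifier (SARS-CoV-2 versus non-SARS-CoV-2 sequences). As a reading aid, the sketch below shows how these three metrics are conventionally computed from true and predicted labels. It is not the authors' code: the function name `binary_metrics`, the label encoding (1 = SARS-CoV-2, 0 = non-SARS-CoV-2), and the toy labels are assumptions made for illustration only.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, sensitivity, and specificity for 0/1 labels.

    Assumed encoding (hypothetical, not taken from the paper):
    1 = SARS-CoV-2 sequence, 0 = non-SARS-CoV-2 sequence.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")

    return {
        "accuracy": (tp + tn) / len(y_true),   # share of all correct calls
        "sensitivity": tp / (tp + fn),         # true positive rate
        "specificity": tn / (tn + fp),         # true negative rate
    }


# Toy labels for illustration only; these are not data from the study.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 1]
toy_metrics = binary_metrics(toy_true, toy_pred)
```

On a balanced dataset such as the 1000/1000 split described above, accuracy equals the mean of sensitivity and specificity, which is a quick consistency check when comparing reported figures.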