Machine Learning Classifiers and Data Synthesis Techniques to Tackle with Highly Imbalanced COVID-19 Data

Bibliographic Details
Main Authors: Avaz Naghipour, Mohammad Reza Abbaszadeh Bavil Soflaei, Mostafa Ghader-Zefrehei
Format: Article
Language: English
Published: Ferdowsi University of Mashhad 2024-12-01
Series: Computer and Knowledge Engineering
Online Access: https://cke.um.ac.ir/article_45898_b3c8e1d9ecf92ea8a3734a1aab782226.pdf
Description
Summary: The COVID-19 pandemic has highlighted the urgent need for rapid and accurate diagnostic methods. In this study, we evaluate three machine learning models, Random Forest (RF), Logistic Regression (LR), and Decision Tree (DT), for detecting COVID-19. The models are trained on preprocessed, imbalanced data comprising 5086 negative and 558 positive cases. To address this class imbalance, we apply two data synthesis algorithms, Conditional Tabular Generative Adversarial Network (CTGAN) and Tabular Variational Autoencoder (TVAE). The classifiers were trained on both the original and the balanced datasets and evaluated for comparison. Our findings reveal that RF obtains the highest accuracy, 98.83%, on the CTGAN-balanced dataset. These results support the potential of coupling data synthesis with traditional machine learning for the diagnosis of COVID-19. We hope that this research will contribute to current artificial intelligence (AI) efforts in combating the pandemic.
ISSN: 2538-5453
2717-4123
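
To make the class imbalance quoted in the summary concrete, the following minimal Python sketch works through the arithmetic using only the two case counts given there (5086 negative, 558 positive). The function name `imbalance_summary`, the 1:1 balancing target, and the majority-class baseline are illustrative assumptions, not details taken from the article; the record does not state the exact balancing ratio or evaluation protocol the authors used.

```python
# Case counts come from the summary; everything else is an illustrative assumption.
NEGATIVE = 5086
POSITIVE = 558


def imbalance_summary(n_majority: int, n_minority: int) -> dict:
    """Return basic figures describing a two-class imbalance."""
    total = n_majority + n_minority
    return {
        "total": total,
        "minority_share": n_minority / total,
        "imbalance_ratio": n_majority / n_minority,
        # Synthetic minority rows needed to reach an assumed 1:1 class ratio.
        "synthetic_needed": n_majority - n_minority,
        # Accuracy of a trivial model that always predicts the majority class.
        "majority_baseline_accuracy": n_majority / total,
    }


summary = imbalance_summary(NEGATIVE, POSITIVE)
for name, value in summary.items():
    if isinstance(value, float):
        print(f"{name}: {value:.4f}")
    else:
        print(f"{name}: {value}")
```

Under these assumptions, a generator such as CTGAN or TVAE would need to supply roughly 4528 synthetic positive rows to equalize the classes. The baseline figure also shows why accuracy on the unbalanced data is best read alongside other metrics: a model that always predicts "negative" would already score about 90%.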