Missing Risk Factor Prediction in Cardiovascular Disease Using a Blended Dataset and Optimizing Classification With a Stacking Algorithm

ABSTRACT Machine learning is important in the treatment of heart disease because it is capable of analyzing large amounts of patient data, such as medical records, imaging tests, and genetic information, in order to identify patterns and predict the risk of developing heart disease. However, most ML...

Bibliographic Details
Main Authors: Jannatul Mauya, Saad Sahriar, Sanjida Akther, Ruhul Amin, Sabba Ruhi, Md. Shamim Reza
Format: Article
Language: English
Published: Wiley 2025-01-01
Series: Engineering Reports
Subjects: feature ensemble; heart disease; ML algorithms; risk factors; stack classifier
Online Access: https://doi.org/10.1002/eng2.13034
collection DOAJ
description ABSTRACT Machine learning (ML) is important in the treatment of heart disease because it can analyze large amounts of patient data, such as medical records, imaging tests, and genetic information, to identify patterns and predict the risk of developing heart disease. However, most ML algorithms require accurate data to build an accurate prediction model and do not tolerate missing values. Handling missing risk factors is critical during dataset preprocessing and becomes more difficult when a risk factor is completely missing. Removing a completely missing feature may result in the loss of critical information, but no imputation methods are readily available for this case, which presents a significant challenge. To overcome this difficulty, in this study we attempt to impute the missing feature with statistical multiple linear regression and Huber regression (HR) methods, using four blended datasets (Statlog, Cleveland, Hungarian, and Switzerland) sourced from the UCI ML repository. The entire dataset comprises 14 attributes, including one target variable; however, in the Switzerland dataset, one feature (“serum cholesterol”) is entirely missing. In the proposed imputation methods, the missing “serum cholesterol” is predicted from predisposing factors including “chest pain,” “supreme heartbeat rate,” “type of defect,” “exercise induced ST stress related to rest,” and “exercise generated angina.” We also propose applying the majority voting ensemble technique to the individual and integrated datasets using ML algorithms as part of the risk factor identification strategy. The results show that our proposed stacked algorithm for the combined dataset with the ensemble features improved accuracy to 93.47% and achieved an AUC score of 94.50%, demonstrating more accurate and earlier prediction than previous research while also providing model diversity, resilience, generalization, and adaptability to varied datasets.
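The imputation idea described in the abstract, fitting a regression on the datasets where serum cholesterol is observed and using it to predict the feature where it is entirely missing, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the data are synthetic, the two predictors (a heart-rate-like and an ST-like value) and all function names are hypothetical, and Huber regression is approximated here by iteratively reweighted least squares with a MAD-based scale, which is an assumption about how one might fit it by hand.

```python
import random
import statistics


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_wls(X, y, w):
    """Weighted least squares with an intercept: returns [b0, b1, ...]."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtwx = [[sum(wi * r[a] * r[b] for wi, r in zip(w, rows)) for b in range(p)]
            for a in range(p)]
    xtwy = [sum(wi * r[a] * yi for wi, r, yi in zip(w, rows, y)) for a in range(p)]
    return solve(xtwx, xtwy)


def fit_ols(X, y):
    """Ordinary multiple linear regression (all weights equal to 1)."""
    return fit_wls(X, y, [1.0] * len(y))


def predict(beta, X):
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in X]


def fit_huber(X, y, delta=1.35, iters=30):
    """Huber-type regression via iteratively reweighted least squares."""
    w = [1.0] * len(y)
    beta = fit_wls(X, y, w)
    for _ in range(iters):
        resid = [yi - pi for yi, pi in zip(y, predict(beta, X))]
        # Robust residual scale from the median absolute residual (assumption).
        scale = max(statistics.median(abs(r) for r in resid) / 0.6745, 1e-12)
        cut = delta * scale
        # Full weight inside the cut-off, down-weighted beyond it.
        w = [1.0 if abs(r) <= cut else cut / abs(r) for r in resid]
        beta = fit_wls(X, y, w)
    return beta


def impute_missing_feature(donor_X, donor_y, recipient_X, fit=fit_huber):
    """Fit on rows where the feature is observed, predict it where it is absent."""
    return predict(fit(donor_X, donor_y), recipient_X)


def mae(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Synthetic stand-in data, NOT the UCI records; coefficients are made up.
random.seed(0)


def make_rows(n):
    X = [[random.uniform(100, 200), random.uniform(0, 4)] for _ in range(n)]
    truth = [150 + 0.5 * hr + 10 * st for hr, st in X]
    return X, truth


donor_X, donor_truth = make_rows(200)       # datasets with the feature observed
donor_chol = [t + random.gauss(0, 5) for t in donor_truth]
for i in range(10):
    donor_chol[i] += 300                    # a few gross outliers
recipient_X, recipient_truth = make_rows(50)  # dataset with the feature missing

imputed_huber = impute_missing_feature(donor_X, donor_chol, recipient_X)
imputed_ols = impute_missing_feature(donor_X, donor_chol, recipient_X, fit=fit_ols)
print("MAE, Huber-type fit:", round(mae(imputed_huber, recipient_truth), 2))
print("MAE, least squares: ", round(mae(imputed_ols, recipient_truth), 2))
```

The hand-written solver keeps the sketch dependency-free; in practice a library estimator such as scikit-learn's `HuberRegressor` would be the more usual choice. The comparison with plain least squares only shows why a robust loss can help when the donor data contain outliers; it says nothing about the results reported in the article.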
id doaj-art-c64faea8a8a34990a368cfc627550ec2
institution Kabale University
issn 2577-8196
affiliation Jannatul Mauya, Saad Sahriar, Sanjida Akther, Ruhul Amin, Md. Shamim Reza: Deep Statistical Learning and Research Lab, Department of Statistics, Pabna University of Science & Technology, Pabna, Bangladesh
Sabba Ruhi: Department of Statistics, Pabna University of Science & Technology, Pabna, Bangladesh
volume 7
issue 1
topic feature ensemble
heart disease
ML algorithms
risk factors
stack classifier