Missing Risk Factor Prediction in Cardiovascular Disease Using a Blended Dataset and Optimizing Classification With a Stacking Algorithm

ABSTRACT Machine learning is important in the treatment of heart disease because it is capable of analyzing large amounts of patient data, such as medical records, imaging tests, and genetic information, in order to identify patterns and predict the risk of developing heart disease. However, most ML...

Bibliographic Details
Main Authors:	Jannatul Mauya, Saad Sahriar, Sanjida Akther, Ruhul Amin, Sabba Ruhi, Md. Shamim Reza
Format:	Article
Language:	English
Published:	Wiley 2025-01-01
Series:	Engineering Reports
Subjects:	feature ensemble heart disease ML algorithms risk factors stack classifier
Online Access:	https://doi.org/10.1002/eng2.13034

_version_	1832576645795414016
author	Jannatul Mauya Saad Sahriar Sanjida Akther Ruhul Amin Sabba Ruhi Md. Shamim Reza
collection	DOAJ
description	ABSTRACT Machine learning is important in the treatment of heart disease because it can analyze large amounts of patient data, such as medical records, imaging tests, and genetic information, to identify patterns and predict the risk of developing heart disease. However, most ML algorithms need accurate, complete data to build a reliable prediction model and do not tolerate missing values. Handling missing risk factors is critical during dataset preprocessing and becomes more difficult when a risk factor is completely missing. Removing a completely missing feature may discard critical information, yet no imputation method is readily available for this case, which presents a significant challenge. To overcome this difficulty, this study attempts to impute the feature with statistical multiple linear regression and Huber regression (HR), using four blended datasets (Statlog, Cleveland, Hungarian, and Switzerland) sourced from the UCI ML repository. The combined dataset comprises 14 attributes, including one target variable; however, in the Switzerland dataset, one feature (“serum cholesterol”) is entirely missing. In the proposed imputation methods, the missing “serum cholesterol” is recognized as a predisposing factor, along with “chest pain,” “maximum heart rate,” “type of defect,” “exercise-induced ST depression relative to rest,” and “exercise-induced angina.” We also propose applying a majority-voting ensemble technique with ML algorithms to the individual and integrated datasets as part of the risk factor identification strategy. The results show that the proposed stacked algorithm, applied to the combined dataset with the ensemble features, achieved an accuracy of 93.47% and an AUC score of 94.50%, giving more accurate and earlier prediction than previous research while also demonstrating the model's diversity, resilience, generalization, and adaptability to varied datasets.
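
The imputation and stacking steps named in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn, uses synthetic stand-in data rather than the UCI records, and the helper name `impute_missing_feature`, the choice of base learners, and all numeric settings are hypothetical. The majority-voting feature selection step is not shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import HuberRegressor, LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier


def impute_missing_feature(X_donor, y_donor, X_target, robust=True):
    """Fit a regression where the feature is observed, then predict it
    for a source in which the feature is entirely missing (hypothetical helper)."""
    model = HuberRegressor(max_iter=1000) if robust else LinearRegression()
    model.fit(X_donor, y_donor)
    return model.predict(X_target)


# Synthetic stand-in data (assumption: NOT the UCI heart-disease records).
rng = np.random.default_rng(0)
n_donor, n_target, n_pred = 300, 60, 5
X_donor = rng.normal(size=(n_donor, n_pred))    # sources with the feature observed
X_target = rng.normal(size=(n_target, n_pred))  # source with the feature missing
true_coef = np.array([4.0, -3.0, 2.0, 1.0, -1.5])
chol_donor = 240 + X_donor @ true_coef + rng.normal(scale=2.0, size=n_donor)
chol_donor[:10] += 150                          # a few outliers in the donor data

# Multiple linear regression and Huber regression imputations.
chol_mlr = impute_missing_feature(X_donor, chol_donor, X_target, robust=False)
chol_hr = impute_missing_feature(X_donor, chol_donor, X_target, robust=True)

# Blend the sources: append the (observed or imputed) column, then stack.
X_all = np.vstack([
    np.column_stack([X_donor, chol_donor]),
    np.column_stack([X_target, chol_hr]),
])
noise = rng.normal(scale=0.5, size=len(X_all))
y_all = (X_all[:, 0] + 0.5 * X_all[:, 1] + noise > 0).astype(int)  # synthetic labels

# Base learners and meta-learner are illustrative choices only.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_all, y_all)
```

Huber regression differs from ordinary least squares in that it applies a linear rather than quadratic penalty to large residuals, so outlying values in the donor datasets pull the fitted coefficients less.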
format	Article
id	doaj-art-c64faea8a8a34990a368cfc627550ec2
institution	Kabale University
issn	2577-8196
language	English
publishDate	2025-01-01
publisher	Wiley
record_format	Article
series	Engineering Reports
spelling	doaj-art-c64faea8a8a34990a368cfc627550ec2 | 2025-01-31T00:22:48Z | eng | Wiley | Engineering Reports | 2577-8196 | 2025-01-01 | 7 | 1 | n/a | n/a | 10.1002/eng2.13034 | Missing Risk Factor Prediction in Cardiovascular Disease Using a Blended Dataset and Optimizing Classification With a Stacking Algorithm | Jannatul Mauya (0), Saad Sahriar (1), Sanjida Akther (2), Ruhul Amin (3), Sabba Ruhi (4), Md. Shamim Reza (5) | (0)-(3), (5): Deep Statistical Learning and Research Lab, Department of Statistics, Pabna University of Science & Technology, Pabna, Bangladesh | (4): Department of Statistics, Pabna University of Science & Technology, Pabna, Bangladesh | https://doi.org/10.1002/eng2.13034 | feature ensemble; heart disease; ML algorithms; risk factors; stack classifier
title	Missing Risk Factor Prediction in Cardiovascular Disease Using a Blended Dataset and Optimizing Classification With a Stacking Algorithm
topic	feature ensemble heart disease ML algorithms risk factors stack classifier
url	https://doi.org/10.1002/eng2.13034

Similar Items