Missing Risk Factor Prediction in Cardiovascular Disease Using a Blended Dataset and Optimizing Classification With a Stacking Algorithm
ABSTRACT Machine learning is important in the treatment of heart disease because it is capable of analyzing large amounts of patient data, such as medical records, imaging tests, and genetic information, in order to identify patterns and predict the risk of developing heart disease. However, most ML...
Main Authors: | Jannatul Mauya, Saad Sahriar, Sanjida Akther, Ruhul Amin, Sabba Ruhi, Md. Shamim Reza |
---|---|
Format: | Article |
Language: | English |
Published: | Wiley, 2025-01-01 |
Series: | Engineering Reports |
Subjects: | feature ensemble; heart disease; ML algorithms; risk factors; stack classifier |
Online Access: | https://doi.org/10.1002/eng2.13034 |
_version_ | 1832576645795414016 |
---|---|
author | Jannatul Mauya; Saad Sahriar; Sanjida Akther; Ruhul Amin; Sabba Ruhi; Md. Shamim Reza |
author_sort | Jannatul Mauya |
collection | DOAJ |
description | ABSTRACT Machine learning is important in the treatment of heart disease because it can analyze large amounts of patient data, such as medical records, imaging tests, and genetic information, to identify patterns and predict the risk of developing heart disease. However, most ML algorithms require accurate data to build an accurate prediction model and do not tolerate missing values. Handling missing risk factors is critical during dataset preprocessing and becomes more difficult when a risk factor is completely missing. Removing a completely missing feature may result in the loss of critical information, but no imputation methods are readily available for this case, which presents a significant challenge. To overcome this difficulty, in this study we attempt to impute the missing feature using statistical multiple linear regression and Huber regression (HR) methods on a blend of four datasets (Statlog, Cleveland, Hungarian, and Switzerland) sourced from the UCI ML repository. The entire dataset comprises 14 attributes, including one target variable; however, in the Switzerland dataset, the values of one feature (“serum cholesterol”) are entirely missing. In the proposed imputation methods, the missing “serum cholesterol” is related to predisposing factors including “chest pain,” “supreme heartbeat rate,” “type of defect,” “exercise induced ST stress related to rest,” and “exercise generated angina.” We also propose applying the majority voting ensemble technique to the individual and integrated datasets using ML algorithms as part of the risk factor identification strategy. The results show that our proposed stacked algorithm for the combined dataset with the ensemble features achieved an accuracy of 93.47% and an AUC score of 94.50%, demonstrating more accurate and earlier prediction than previous research while also providing model diversity, resilience, generalization, and adaptability to varied datasets. |
format | Article |
id | doaj-art-c64faea8a8a34990a368cfc627550ec2 |
institution | Kabale University |
issn | 2577-8196 |
language | English |
publishDate | 2025-01-01 |
publisher | Wiley |
record_format | Article |
series | Engineering Reports |
spelling | doaj-art-c64faea8a8a34990a368cfc627550ec2; 2025-01-31T00:22:48Z; eng; Wiley; Engineering Reports; ISSN 2577-8196; 2025-01-01; vol. 7, no. 1, pp. n/a; 10.1002/eng2.13034; https://doi.org/10.1002/eng2.13034. Author affiliations: Jannatul Mauya, Saad Sahriar, Sanjida Akther, Ruhul Amin, and Md. Shamim Reza (Deep Statistical Learning and Research Lab, Department of Statistics, Pabna University of Science & Technology, Pabna, Bangladesh); Sabba Ruhi (Department of Statistics, Pabna University of Science & Technology, Pabna, Bangladesh). |
title | Missing Risk Factor Prediction in Cardiovascular Disease Using a Blended Dataset and Optimizing Classification With a Stacking Algorithm |
topic | feature ensemble; heart disease; ML algorithms; risk factors; stack classifier |
url | https://doi.org/10.1002/eng2.13034 |
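
The abstract describes imputing a feature that is entirely missing in one dataset by fitting a regression on the datasets where it is observed. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a single predictor instead of the several risk factors named in the abstract, uses a Huber-type fit computed by iteratively reweighted least squares as a stand-in for Huber regression, and runs on made-up toy numbers. All function and variable names are hypothetical.

```python
from statistics import median


def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b


def huber_line_fit(x, y, delta=1.345, n_iter=50):
    """Huber-type robust line fit via iteratively reweighted least squares."""
    a, b = weighted_line_fit(x, y, [1.0] * len(x))  # start from ordinary LS
    for _ in range(n_iter):
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale: median absolute residual / 0.6745 (tiny floor if zero).
        scale = median(abs(r) for r in resid) / 0.6745 or 1e-12
        cutoff = delta * scale
        # Huber weights: 1 inside the cutoff, cutoff/|r| outside it.
        w = [1.0 if abs(r) <= cutoff else cutoff / abs(r) for r in resid]
        a, b = weighted_line_fit(x, y, w)
    return a, b


def impute_missing_column(donor_x, donor_y, target_x):
    """Fit on rows where the feature is observed, predict where it is absent."""
    a, b = huber_line_fit(donor_x, donor_y)
    return [a + b * xi for xi in target_x]


# Toy numbers only (not taken from the UCI datasets). The donor rows stand in
# for datasets where the feature is observed; the last donor row is an outlier.
donor_x = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0, 160.0]
donor_y = [250.0, 245.0, 240.0, 235.0, 230.0, 225.0, 520.0]
# The target rows stand in for the dataset where the feature is fully missing.
target_x = [105.0, 125.0, 145.0]

ols_a, ols_b = weighted_line_fit(donor_x, donor_y, [1.0] * len(donor_x))
rob_a, rob_b = huber_line_fit(donor_x, donor_y)
imputed_y = impute_missing_column(donor_x, donor_y, target_x)

print("ordinary least-squares slope:", ols_b)
print("Huber-type slope:", rob_b)
print("imputed values:", imputed_y)
```

Comparing the two slopes shows why a robust loss can matter here: a single extreme donor value pulls the ordinary least-squares line much further than the Huber-type fit. With several predictors, a library implementation of multiple linear regression and Huber regression would be the more practical choice.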