Developing a Predictive Model for Stroke Disease Detection Using a Scalable Machine Learning Approach

Bibliographic Details
Main Authors: Assefa Senbato Genale, Tsion Ayalew Dessalegn
Format: Article
Language: English
Published: Wiley 2025-01-01
Series: Applied Computational Intelligence and Soft Computing
Online Access: http://dx.doi.org/10.1155/acis/7394597
Description
Summary: Stroke disease has been the leading cause of death globally for the last several decades. The death rate can therefore be decreased by early recognition of the disease and ongoing surveillance. However, the largest obstacle to performing advanced analytics with the conventional approach is the growth of massive amounts of data from various sources, including patient histories, wearable sensor devices, and medical data. A technology that could have a large impact on the healthcare sector, particularly in the early diagnosis of this disease, is the integration of machine learning with big data analytics (scalable machine learning). To address this issue, this work presents a scalable stroke disease prediction model for a multinode distributed environment. The model was developed by combining big data analytics concepts with machine learning to handle extensive healthcare datasets, an aspect not seen in the prior literature on stroke disease detection. We implemented four scalable algorithms (logistic regression, random forest, gradient-boosting tree, and decision tree) using a dataset collected from a Medical Quality Improvement Consortium database. Two worker nodes and one master node were used to analyze the dataset. The models’ performance was assessed using metrics including the area under the curve (AUC) and the confusion matrix. With an accuracy of 94.3% and an AUC score of 99%, the random forest performed best in the experiments. Diabetes was also shown to be the main risk factor for stroke disease, followed by hypertension. This study demonstrated the effectiveness of using Spark’s scalable machine learning techniques to forecast stroke disease and identify risk factors earlier. Physicians can use the findings of this study as clinical decision aids for more accurate identification of stroke disease.
ISSN: 1687-9732
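
Note on the evaluation metrics: the summary reports accuracy, a confusion matrix, and the area under the curve (AUC). As a rough illustration of how these quantities relate, the following is a minimal sketch in plain Python, not the authors' Spark pipeline. The labels and scores are invented toy values, the 0.5 decision threshold is an assumption, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tn, fp, fn, tp) for binary labels, where 1 = stroke and 0 = no stroke."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def accuracy(tn: int, fp: int, fn: int, tp: int) -> float:
    """Fraction of all cases that were classified correctly."""
    return (tp + tn) / (tn + fp + fn + tp)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive case outscores a random
    negative case; tied scores count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (invented for illustration; not from the study's dataset).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3]

# Assumed decision threshold of 0.5 to turn scores into class predictions.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
print("confusion matrix (tn, fp, fn, tp):", (tn, fp, fn, tp))
print("accuracy:", accuracy(tn, fp, fn, tp))
print("AUC:", roc_auc(y_true, scores))
```

Accuracy depends on the chosen threshold, whereas AUC is computed from the ranking of the scores alone, which is why a model can show a higher AUC than accuracy, as in the figures reported above.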