Polynomial-SHAP as a SMOTE alternative in conglomerate neural networks for realistic data augmentation in cardiovascular and breast cancer diagnosis

Bibliographic Details
Main Authors: Chukwuebuka Joseph Ejiyi, Dongsheng Cai, Francis Ofoma Eze, Makuachukwu Bennedith Ejiyi, Jennifer Ene Idoko, Sarpong Kwadwo Asere, Thomas Ugochukwu Ejiyi
Format: Article
Language: English
Published: SpringerOpen 2025-04-01
Series: Journal of Big Data
Online Access: https://doi.org/10.1186/s40537-025-01152-3
Description
Summary: Cardiovascular disease (CVD) and breast cancer (BC) are among the leading causes of mortality worldwide, necessitating accurate and interpretable machine learning (ML) models for early diagnosis. Existing approaches often rely on data augmentation techniques such as SMOTE (Synthetic Minority Over-sampling Technique) to address class imbalance, but these methods can introduce noise, distort feature distributions, and reduce model interpretability. To overcome these challenges, we propose two augmentation-free neural network models, Double Conglomerate (D-CongNet) and Triple Conglomerate (T-CongNet), which integrate Polynomial feature transformations and SHAP (Shapley Additive Explanations) for feature analysis, ensuring both high predictive performance and robust interpretability. We evaluate our models on two publicly available datasets: the UCI Heart Disease dataset for CVD prediction and the Wisconsin Diagnostic Breast Cancer (WDBC) dataset for BC classification. D-CongNet and T-CongNet achieve state-of-the-art performance without augmentation, with 86.96% accuracy, 88.79% sensitivity, and 84.42% specificity for CVD, and 97.37% accuracy, 97.67% sensitivity, and 97.18% specificity for BC. Our models also provide clinically meaningful explanations, identifying MaxHR-ST_Slope as a critical predictor for CVD and concave points_mean-area_worst for BC, aligning with established medical knowledge. By eliminating the need for augmentation, D-CongNet and T-CongNet offer a transparent and reliable alternative to traditional oversampling methods, ensuring robust decision-making in medical applications. Our results demonstrate that augmentation-free ML models can achieve both high accuracy and interpretability, making them valuable tools for healthcare professionals seeking explainable AI-driven diagnostics.
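The summary mentions polynomial feature transformations and reports sensitivity and specificity. The following is a minimal sketch of both ideas, not the authors' implementation. It assumes that a hyphenated name such as MaxHR-ST_Slope denotes the product of two input features; that reading, the function names `degree2_features` and `sensitivity_specificity`, and the sample values are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def degree2_features(names, row):
    """Expand one sample with squared terms and pairwise products.

    Assumption: an interaction term "A-B" is the product A * B.
    """
    out_names = list(names)
    out_vals = list(row)
    # Squared terms, one per original feature.
    for name, value in zip(names, row):
        out_names.append(f"{name}^2")
        out_vals.append(value * value)
    # Pairwise interaction terms over every unordered feature pair.
    for (na, va), (nb, vb) in combinations(zip(names, row), 2):
        out_names.append(f"{na}-{nb}")
        out_vals.append(va * vb)
    return out_names, out_vals


def sensitivity_specificity(y_true, y_pred):
    """Compute sensitivity (TP rate) and specificity (TN rate) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical sample: values are made up, not taken from either dataset.
names, values = degree2_features(["MaxHR", "ST_Slope"], [150.0, 2.0])
sens, spec = sensitivity_specificity([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

In a setup like this, an explanation method such as SHAP would be applied to the expanded feature set, so that an interaction term can be ranked alongside the original inputs.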
ISSN: 2196-1115