Efficient feature selection based on Gower distance for breast cancer diagnosis

Bibliographic Details
Main Authors: Salwa Shakir Baawi, Mustafa Noaman Kadhim, Dhiah Al-Shammary
Format: Article
Language: English
Published: KeAi Communications Co., Ltd. 2025-06-01
Series: Journal of Electronic Science and Technology
Online Access: http://www.sciencedirect.com/science/article/pii/S1674862X25000163
Description
Summary: This study presents an efficient feature selection method based on the Gower distance to enhance the accuracy and efficiency of standard classifiers on high-dimensional medical datasets. High-dimensional data poses significant challenges for traditional classifiers because many features are redundant or irrelevant. The proposed method addresses these challenges by partitioning the dataset into blocks, calculating the Gower distance within each block, and selecting features based on their average similarity. Technically, the Gower distance normalizes the absolute difference in each numerical feature by that feature's range, so that each feature contributes equally to the distance calculation and features with larger scales do not overshadow those with smaller scales. This process facilitates the identification of features that exhibit high harmony and are the most relevant for classification. The proposed feature selection strategy significantly reduces dimensionality, retains the most relevant features, and improves model performance. Experimental results show that the accuracy of the classifiers, including k-nearest neighbors (KNN), naive Bayes (NB), decision tree (DT), random forest (RF), support vector machine (SVM), and logistic regression (LR), increased by 4.38%–7.02%. In addition, the smaller feature set considerably reduces computational complexity and therefore speeds up diagnosis. Execution time was reduced on average by 77.82% for all samples and by 76.45% for a single sample. These results demonstrate that the proposed feature selection method improves both prediction accuracy and diagnostic speed, making it a promising tool for real-time clinical decision-making and for improving patient care outcomes.
ISSN: 2666-223X
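
Illustrative sketch: The Summary describes a range-normalized Gower distance and a block-wise selection step, but this record does not give the details. The Python below is a minimal sketch under stated assumptions, not the authors' implementation. `gower_numeric` follows the standard Gower definition for numeric features. `rank_features` is one possible reading of the selection step: the function name, consecutive-row blocks of `block_size`, the per-feature score (1 minus the mean normalized difference over within-block pairs), and the `top_k` cutoff are all hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def feature_ranges(rows: Sequence[Sequence[float]]) -> List[float]:
    """Per-feature range (max - min) over all rows."""
    return [max(col) - min(col) for col in zip(*rows)]


def gower_numeric(x: Sequence[float], y: Sequence[float],
                  ranges: Sequence[float]) -> float:
    """Gower distance between two all-numeric samples.

    Each feature contributes |x_k - y_k| / range_k, so every term lies in
    [0, 1] whatever the feature's original scale. Features with zero range
    contribute 0.
    """
    terms = [abs(a - b) / r if r > 0 else 0.0
             for a, b, r in zip(x, y, ranges)]
    return sum(terms) / len(terms)


def rank_features(rows: Sequence[Sequence[float]],
                  block_size: int, top_k: int) -> List[int]:
    """Hypothetical block-wise selection (assumption, not from the source).

    Splits rows into consecutive blocks, averages each feature's normalized
    difference over all within-block sample pairs, converts that to a
    similarity (1 - mean difference), and returns the indices of the top_k
    most similar features.
    """
    ranges = feature_ranges(rows)
    totals = [0.0] * len(ranges)
    n_pairs = 0
    for start in range(0, len(rows), block_size):
        block = rows[start:start + block_size]
        for i, j in combinations(range(len(block)), 2):
            for k, r in enumerate(ranges):
                if r > 0:
                    totals[k] += abs(block[i][k] - block[j][k]) / r
            n_pairs += 1
    if n_pairs == 0:
        raise ValueError("each block needs at least two rows to form a pair")
    similarity = [1.0 - t / n_pairs for t in totals]
    order = sorted(range(len(ranges)), key=lambda k: similarity[k], reverse=True)
    return order[:top_k]
```

Called as `rank_features(rows, block_size, top_k)` on a list of numeric rows, the sketch returns column indices that could then be used to subset the data before training any of the classifiers named in the Summary. Dividing by the range is what keeps a feature measured in the hundreds from dominating one measured in single digits.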