A problem-agnostic approach to feature selection and analysis using SHAP
Main Authors: John T. Hancock, Taghi M. Khoshgoftaar, Qianxin Liang
Format: Article
Language: English
Published: SpringerOpen, 2025-01-01
Series: Journal of Big Data
Online Access: https://doi.org/10.1186/s40537-024-01041-1
Collection: DOAJ
Abstract: Feature selection is an effective data reduction technique. SHapley Additive exPlanations (SHAP) can be used to provide a feature importance ranking for models built with labeled or unlabeled data. Thus, one may use the SHAP feature importance ranking in a feature selection technique by selecting the k highest-ranking features. Furthermore, this SHAP-based feature selection technique is applicable regardless of the availability of labels for the data. We use the Kaggle Credit Card Fraud Detection dataset to simulate three label availability scenarios. When no labeled data is available, unsupervised learners should be used; for this case, we explore feature selection for data reduction with Isolation Forest and SHAP. When data of only one class is available, a one-class classifier, such as a Gaussian Mixture Model (GMM), can be used in combination with SHAP to determine feature importance and to perform feature selection. Finally, if labeled data from both classes is available, a binary-class classifier can be used in conjunction with SHAP for data reduction. Our contribution is a comparative analysis of the features selected in the three label availability scenarios. Our primary conclusion is that feature sets may be reduced with SHAP without compromising performance. To the best of our knowledge, this is the first study to explore a feature analysis technique applicable in all three label availability scenarios.
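The selection procedure the abstract describes, ranking features by mean absolute SHAP value and keeping the top k, can be sketched without the SHAP library itself. The sketch below computes exact Shapley values by brute force for a toy scoring function; the function `f`, the zero baseline, and the choice of k are illustrative assumptions, not details from the paper, which instead pairs the SHAP library with Isolation Forest, a GMM, or a binary classifier.

```python
from itertools import combinations
from math import factorial

import numpy as np


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, relative to a baseline input.

    Brute force over all feature coalitions, so only feasible for a
    handful of features; SHAP libraries approximate this at scale.
    """
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                # Build the coalition input: features in S come from x,
                # everything else from the baseline.
                z_without = baseline.copy()
                z_without[list(S)] = x[list(S)]
                z_with = z_without.copy()
                z_with[i] = x[i]
                phi[i] += weight * (f(z_with) - f(z_without))
    return phi


def select_top_k(f, X, baseline, k):
    """Rank features by mean |Shapley value| over the rows of X; keep the top k."""
    importance = np.mean([np.abs(shapley_values(f, x, baseline)) for x in X], axis=0)
    return np.argsort(importance)[::-1][:k]


# Toy score: features 0 and 2 dominate, feature 3 is irrelevant.
f = lambda z: 3.0 * z[0] + 0.5 * z[1] - 2.0 * z[2]
X = np.array([[1.0, -2.0, 0.5, 1.5],
              [-1.0, 1.0, -1.5, 0.5],
              [0.5, 2.0, 1.0, -1.0]])
top2 = select_top_k(f, X, baseline=np.zeros(4), k=2)
print(top2)  # -> [0 2]
```

For a linear score the Shapley value of feature i reduces to its weight times its deviation from the baseline, so the two heavily weighted features are selected and the irrelevant one is dropped, which is the data reduction effect the study measures.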
ISSN: 2196-1115
Institution: Kabale University
Author Affiliation: College of Engineering and Computer Science, Florida Atlantic University (all three authors)
Subjects: Class imbalance; Feature selection; SHAP; Credit Card Fraud Detection; One-class classification; Binary-class classification