A problem-agnostic approach to feature selection and analysis using SHAP

Abstract: Feature selection is an effective data reduction technique. SHapley Additive exPlanations (SHAP) can be used to provide a feature importance ranking for models built with labeled or unlabeled data. Thus, one may use the SHAP feature importance ranking in a feature selection technique by selecting the k highest-ranking features. Furthermore, this SHAP-based feature selection technique is applicable regardless of the availability of labels for the data. We use the Kaggle Credit Card Fraud Detection dataset to simulate three label availability scenarios. When no labeled data is available, unsupervised learners should be used; we explore feature selection for data reduction with Isolation Forest and SHAP for this case. When data of only one class is available, a one-class classifier, such as a Gaussian Mixture Model (GMM), can be used in combination with SHAP to determine feature importance and perform feature selection. Finally, if labeled data from both classes is available, a binary-class classifier can be used in conjunction with SHAP for data reduction. Our contribution is a comparative analysis of the features selected in the three label availability scenarios. Our primary conclusion is that feature sets may be reduced with SHAP without compromising performance. To the best of our knowledge, this is the first study to explore a feature analysis technique applicable in all three label availability scenarios.
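The abstract describes ranking features by SHAP importance and keeping the k highest-ranking ones. As a dependency-light sketch of that idea in the unsupervised scenario, the snippet below estimates Shapley values by Monte Carlo sampling of feature orderings (rather than calling the SHAP library itself) and applies the ranking to an Isolation Forest; the function name, dataset, and parameter values are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

def shapley_importance(predict, X, background, n_rounds=20, seed=0):
    """Monte Carlo estimate of mean |Shapley value| per feature.

    For each sampled feature ordering, reveal features of X one at a
    time on top of a random background row and record the change in
    the model's output when feature j is revealed.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    phi = np.zeros((n, d))
    for _ in range(n_rounds):
        perm = rng.permutation(d)                              # random feature ordering
        z = background[rng.integers(len(background), size=n)]  # reference rows
        for pos, j in enumerate(perm):
            known = perm[:pos]                                 # features revealed before j
            x_minus = z.copy()
            x_minus[:, known] = X[:, known]
            x_plus = x_minus.copy()
            x_plus[:, j] = X[:, j]                             # additionally reveal feature j
            phi[:, j] += predict(x_plus) - predict(x_minus)
    phi /= n_rounds
    return np.abs(phi).mean(axis=0)                            # global importance per feature

# Unsupervised scenario: rank features for an Isolation Forest,
# then keep the k highest-ranking ones.
X, _ = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)
iso = IsolationForest(random_state=0).fit(X)
importance = shapley_importance(iso.decision_function, X[:40], X)
k = 4
top_k = np.argsort(importance)[::-1][:k]
X_reduced = X[:, top_k]
print("feature importance ranking:", np.argsort(importance)[::-1])
print("reduced shape:", X_reduced.shape)
```

The same top-k selection applies unchanged in the one-class and binary-class scenarios; only `predict` changes (e.g. a GMM's `score_samples` or a binary classifier's probability output).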

Bibliographic Details
Main Authors: John T. Hancock, Taghi M. Khoshgoftaar, Qianxin Liang (College of Engineering and Computer Science, Florida Atlantic University)
Format: Article
Language: English
Published: SpringerOpen, 2025-01-01
Series: Journal of Big Data
Subjects: Class imbalance; Feature selection; SHAP; Credit Card Fraud Detection; One-class classification; Binary-class classification
Online Access: https://doi.org/10.1186/s40537-024-01041-1
Collection: DOAJ
Institution: Kabale University
ISSN: 2196-1115
Record ID: doaj-art-5a1864bf65764f8890a33f64865a5a95