A problem-agnostic approach to feature selection and analysis using SHAP

Abstract: Feature selection is an effective data reduction technique. SHapley Additive exPlanations (SHAP) can be used to provide a feature importance ranking for models built with labeled or unlabeled data. Thus, one may use the SHAP feature importance ranking in a feature selection technique by selecting the k highest-ranking features. Furthermore, this SHAP-based feature selection technique is applicable regardless of the availability of labels for the data. We use the Kaggle Credit Card Fraud Detection dataset to simulate three label availability scenarios. When no labeled data is available, unsupervised learners should be used; we explore feature selection for data reduction with Isolation Forest and SHAP for this case. When data of only one class is available, a one-class classifier, such as a Gaussian Mixture Model (GMM), can be used in combination with SHAP to determine feature importance and perform feature selection. Finally, if labeled data from both classes is available, a binary-class classifier can be used in conjunction with SHAP for data reduction. Our contribution is a comparative analysis of the features selected in the three label availability scenarios. Our primary conclusion is that feature sets may be reduced with SHAP without compromising performance. To the best of our knowledge, this is the first study to explore a feature analysis technique applicable in all three label availability scenarios.
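The abstract describes ranking features by SHAP importance and keeping the k highest-ranking ones. As a dependency-light sketch of that idea in the unsupervised scenario, the snippet below estimates Shapley values by Monte Carlo sampling of feature orderings (rather than calling the SHAP library itself) and applies the ranking to an Isolation Forest; the function name, dataset, and parameter values are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

def shapley_importance(predict, X, background, n_rounds=20, seed=0):
    """Monte Carlo estimate of mean |Shapley value| per feature.

    For each sampled feature ordering, reveal features of X one at a
    time on top of a random background row and record the change in
    the model's output when feature j is revealed.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    phi = np.zeros((n, d))
    for _ in range(n_rounds):
        perm = rng.permutation(d)                              # random feature ordering
        z = background[rng.integers(len(background), size=n)]  # reference rows
        for pos, j in enumerate(perm):
            known = perm[:pos]                                 # features revealed before j
            x_minus = z.copy()
            x_minus[:, known] = X[:, known]
            x_plus = x_minus.copy()
            x_plus[:, j] = X[:, j]                             # additionally reveal feature j
            phi[:, j] += predict(x_plus) - predict(x_minus)
    phi /= n_rounds
    return np.abs(phi).mean(axis=0)                            # global importance per feature

# Unsupervised scenario: rank features for an Isolation Forest,
# then keep the k highest-ranking ones.
X, _ = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)
iso = IsolationForest(random_state=0).fit(X)
importance = shapley_importance(iso.decision_function, X[:40], X)
k = 4
top_k = np.argsort(importance)[::-1][:k]
X_reduced = X[:, top_k]
print("feature importance ranking:", np.argsort(importance)[::-1])
print("reduced shape:", X_reduced.shape)
```

The same top-k selection applies unchanged in the one-class and binary-class scenarios; only `predict` changes (e.g. a GMM's `score_samples` or a binary classifier's probability output).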

Bibliographic Details
Main Authors: John T. Hancock, Taghi M. Khoshgoftaar, Qianxin Liang (College of Engineering and Computer Science, Florida Atlantic University)
Format: Article
Language: English
Published: SpringerOpen, 2025-01-01
Series: Journal of Big Data
Subjects: Class imbalance; Feature selection; SHAP; Credit Card Fraud Detection; One-class classification; Binary-class classification
Online Access: https://doi.org/10.1186/s40537-024-01041-1
Collection: DOAJ
Institution: Kabale University
ISSN: 2196-1115
Record ID: doaj-art-5a1864bf65764f8890a33f64865a5a95