Feature Selection in Cancer Classification: Utilizing Explainable Artificial Intelligence to Uncover Influential Genes in Machine Learning Models

This study investigates the use of machine learning (ML) models combined with explainable artificial intelligence (XAI) techniques to identify the most influential genes in the classification of five recurrent cancer types in women: breast cancer (BRCA), lung adenocarcinoma (LUAD), thyroid cancer (T...

Bibliographic Details
Main Authors:	Matheus Dalmolin, Karolayne S. Azevedo, Luísa C. de Souza, Caroline B. de Farias, Martina Lichtenfels, Marcelo A. C. Fernandes
Format:	Article
Language:	English
Published:	MDPI AG 2024-12-01
Series:	AI
Subjects:	explainable AI, machine learning, feature selection, RNA-seq, cancer, SHAP
Online Access:	https://www.mdpi.com/2673-2688/6/1/2

_version_	1832589437380329472
author	Matheus Dalmolin, Karolayne S. Azevedo, Luísa C. de Souza, Caroline B. de Farias, Martina Lichtenfels, Marcelo A. C. Fernandes
author_facet	Matheus Dalmolin, Karolayne S. Azevedo, Luísa C. de Souza, Caroline B. de Farias, Martina Lichtenfels, Marcelo A. C. Fernandes
author_sort	Matheus Dalmolin
collection	DOAJ
description	This study investigates the use of machine learning (ML) models combined with explainable artificial intelligence (XAI) techniques to identify the most influential genes in the classification of five recurrent cancer types in women: breast cancer (BRCA), lung adenocarcinoma (LUAD), thyroid cancer (THCA), ovarian cancer (OV), and colon adenocarcinoma (COAD). Gene expression data from RNA-seq, extracted from The Cancer Genome Atlas (TCGA), were used to train ML models, including decision trees (DTs), random forest (RF), and XGBoost (XGB), which achieved accuracies of 98.69%, 99.82%, and 99.37%, respectively. Challenges in this analysis included the high dimensionality of the dataset and the lack of transparency in the ML models. To address these challenges, the SHAP (Shapley Additive Explanations) method was applied to generate a list of features, aiming to understand which characteristics influenced the models’ decision-making processes and, consequently, the prediction results for the five tumor types. The SHAP analysis identified 119, 80, and 10 genes for the RF, XGB, and DT models, respectively, for a total of 209 genes, of which 172 were unique. The new list, representing 0.8% of the original input features, is coherent and fully explainable, increasing confidence in the applied models. Additionally, the results suggest that the SHAP method can be used effectively as a feature selector for gene expression data. This approach not only enhances model transparency but also maintains high classification performance, highlighting its potential for identifying biologically relevant features that may serve as biomarkers for cancer diagnostics and treatment planning.
format	Article
id	doaj-art-e198b00f85d748838c7451a946da554a
institution	Kabale University
issn	2673-2688
language	English
publishDate	2024-12-01
publisher	MDPI AG
record_format	Article
series	AI
spelling	doaj-art-e198b00f85d748838c7451a946da554a; 2025-01-24T13:17:21Z; eng; MDPI AG; AI; 2673-2688; 2024-12-01; 6; 1; 2; 10.3390/ai6010002; Feature Selection in Cancer Classification: Utilizing Explainable Artificial Intelligence to Uncover Influential Genes in Machine Learning Models; Matheus Dalmolin (InovAI Lab, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil); Karolayne S. Azevedo (InovAI Lab, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil); Luísa C. de Souza (InovAI Lab, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil); Caroline B. de Farias (National Science and Technology Institute for Children’s Cancer Biology and Pediatric Oncology-INCT BioOncoPed, Porto Alegre 90620-110, RS, Brazil); Martina Lichtenfels (Ziel Biosciences, Porto Alegre 90650-001, RS, Brazil); Marcelo A. C. Fernandes (InovAI Lab, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil); https://www.mdpi.com/2673-2688/6/1/2; explainable AI, machine learning, feature selection, RNA-seq, cancer, SHAP
title	Feature Selection in Cancer Classification: Utilizing Explainable Artificial Intelligence to Uncover Influential Genes in Machine Learning Models
topic	explainable AI, machine learning, feature selection, RNA-seq, cancer, SHAP
url	https://www.mdpi.com/2673-2688/6/1/2
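
The description above reports per-model gene lists (119 for RF, 80 for XGB, 10 for DT) that are merged into 172 unique genes. The following is a minimal sketch of that kind of merge, not the authors' pipeline: it assumes features are ranked by mean absolute attribution and that a fixed top-k is kept per model, neither of which is stated in this record. The gene names, attribution values, and cut-offs are hypothetical toy data, and no SHAP library is called.

```python
from statistics import fmean


def rank_by_mean_abs(attributions):
    """Return (feature, score) pairs sorted by mean absolute attribution, descending."""
    scores = {
        gene: fmean(abs(v) for v in values)
        for gene, values in attributions.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def select_top(attributions, k):
    """Keep the k features with the largest mean absolute attribution."""
    return [gene for gene, _ in rank_by_mean_abs(attributions)[:k]]


# Hypothetical per-sample attribution values (one list entry per sample).
toy_attributions = {
    "rf": {
        "GENE_A": [0.4, -0.5, 0.3],
        "GENE_B": [0.1, 0.0, -0.1],
        "GENE_C": [0.0, 0.01, 0.0],
        "GENE_D": [-0.3, 0.2, 0.25],
    },
    "xgb": {
        "GENE_A": [0.6, -0.4, 0.5],
        "GENE_B": [0.0, 0.0, 0.0],
        "GENE_C": [0.2, -0.3, 0.1],
        "GENE_D": [0.05, 0.0, -0.05],
    },
    "dt": {
        "GENE_A": [0.0, 0.0, 0.0],
        "GENE_B": [0.0, 0.0, 0.0],
        "GENE_C": [0.0, 0.0, 0.0],
        "GENE_D": [0.9, -0.8, 0.7],
    },
}

# Hypothetical cut-offs per model; the record does not state the real criterion.
top_k = {"rf": 2, "xgb": 2, "dt": 1}

selected = {model: select_top(toy_attributions[model], top_k[model])
            for model in toy_attributions}

# Merge the per-model lists and drop genes that appear more than once.
unique_genes = sorted(set().union(*selected.values()))
total_listed = sum(len(genes) for genes in selected.values())

# Counts reported in the description, checked for internal consistency.
reported = {"RF": 119, "XGB": 80, "DT": 10}
reported_total = sum(reported.values())          # 209 listed genes
reported_unique = 172
repeated_entries = reported_total - reported_unique  # entries removed by the merge

print(selected)
print(unique_genes, total_listed)
print(reported_total, repeated_entries)
```

In the toy data, five listed genes collapse to three unique ones because GENE_A and GENE_D are each selected by two models. The same arithmetic applied to the reported counts gives 209 listed genes and 37 repeated entries removed by the merge.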

Similar Items