Feature Selection in Cancer Classification: Utilizing Explainable Artificial Intelligence to Uncover Influential Genes in Machine Learning Models

Bibliographic Details
Main Authors: Matheus Dalmolin, Karolayne S. Azevedo, Luísa C. de Souza, Caroline B. de Farias, Martina Lichtenfels, Marcelo A. C. Fernandes
Format: Article
Language: English
Published: MDPI AG 2024-12-01
Series: AI
Subjects: explainable AI; machine learning; feature selection; RNA-seq; cancer; SHAP
Online Access: https://www.mdpi.com/2673-2688/6/1/2
author Matheus Dalmolin
Karolayne S. Azevedo
Luísa C. de Souza
Caroline B. de Farias
Martina Lichtenfels
Marcelo A. C. Fernandes
author_sort Matheus Dalmolin
collection DOAJ
description This study investigates the use of machine learning (ML) models combined with explainable artificial intelligence (XAI) techniques to identify the most influential genes in the classification of five recurrent cancer types in women: breast cancer (BRCA), lung adenocarcinoma (LUAD), thyroid cancer (THCA), ovarian cancer (OV), and colon adenocarcinoma (COAD). Gene expression data from RNA-seq, extracted from The Cancer Genome Atlas (TCGA), were used to train ML models, including decision tree (DT), random forest (RF), and XGBoost (XGB) models, which achieved accuracies of 98.69%, 99.82%, and 99.37%, respectively. Two challenges in this analysis were the high dimensionality of the dataset and the lack of transparency in the ML models. To address them, the SHAP (SHapley Additive exPlanations) method was applied to generate a list of features showing which genes influenced the models’ decision-making processes and, consequently, the prediction results for the five tumor types. The SHAP analysis identified 119, 80, and 10 genes for the RF, XGB, and DT models, respectively: 209 genes in total, of which 172 are unique. The new list, representing 0.8% of the original input features, is coherent and fully explainable, increasing confidence in the applied models. The results also suggest that SHAP can be used effectively as a feature selector for gene expression data. This approach enhances model transparency while maintaining high classification performance, highlighting its potential for identifying biologically relevant features that may serve as biomarkers for cancer diagnostics and treatment planning.
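The description above treats Shapley-based attributions as a feature selector. The following is a minimal sketch of that idea under loud assumptions: it is not the authors' pipeline and does not use the SHAP library or TCGA data. It computes exact Shapley values by brute-force coalition enumeration for a hypothetical toy scoring function, ranks features by mean absolute attribution, and keeps the top ones. The gene names (`GENE_A`, `GENE_B`, `GENE_C`), the toy model, the all-zero baseline, and the top-2 cutoff are all invented for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample by enumerating feature coalitions.

    Features outside a coalition are replaced by their baseline value.
    Cost grows as 2**n_features, so this is only viable for toy inputs.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def rank_features(predict, samples, baseline, names):
    """Rank features by mean absolute Shapley value over the samples."""
    totals = [0.0] * len(names)
    for x in samples:
        for i, p in enumerate(shapley_values(predict, x, baseline)):
            totals[i] += abs(p)
    means = [t / len(samples) for t in totals]
    return sorted(zip(names, means), key=lambda pair: pair[1], reverse=True)


def toy_model(expr):
    """Hypothetical score: one strong gene plus a weak two-gene interaction."""
    return 3.0 * expr[0] + 0.5 * expr[1] * expr[2]


gene_names = ["GENE_A", "GENE_B", "GENE_C"]  # made-up identifiers
samples = [[1.0, 2.0, 0.0], [2.0, 1.0, 1.0], [0.5, 0.0, 4.0]]
baseline = [0.0, 0.0, 0.0]  # assumed reference expression profile

ranking = rank_features(toy_model, samples, baseline, gene_names)
# Keep the two highest-ranked features with non-zero attribution (arbitrary cutoff)
selected = [name for name, score in ranking if score > 0][:2]
```

In practice, exact enumeration is infeasible for thousands of genes, which is why approximate or model-specific explainers are normally used for tree ensembles. Merging the per-model selections into one list is then a set union, which is how 209 total entries can reduce to 172 unique genes.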
format Article
id doaj-art-e198b00f85d748838c7451a946da554a
institution Kabale University
issn 2673-2688
language English
publishDate 2024-12-01
publisher MDPI AG
record_format Article
series AI
spelling doaj-art-e198b00f85d748838c7451a946da554a
2025-01-24T13:17:21Z
eng
MDPI AG
AI
2673-2688
2024-12-01
Volume 6, Issue 1, Article 2
DOI: 10.3390/ai6010002
Feature Selection in Cancer Classification: Utilizing Explainable Artificial Intelligence to Uncover Influential Genes in Machine Learning Models
Matheus Dalmolin (InovAI Lab, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil)
Karolayne S. Azevedo (InovAI Lab, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil)
Luísa C. de Souza (InovAI Lab, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil)
Caroline B. de Farias (National Science and Technology Institute for Children’s Cancer Biology and Pediatric Oncology-INCT BioOncoPed, Porto Alegre 90620-110, RS, Brazil)
Martina Lichtenfels (Ziel Biosciences, Porto Alegre 90650-001, RS, Brazil)
Marcelo A. C. Fernandes (InovAI Lab, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil)
https://www.mdpi.com/2673-2688/6/1/2
explainable AI
machine learning
feature selection
RNA-seq
cancer
SHAP
title Feature Selection in Cancer Classification: Utilizing Explainable Artificial Intelligence to Uncover Influential Genes in Machine Learning Models
topic explainable AI
machine learning
feature selection
RNA-seq
cancer
SHAP
url https://www.mdpi.com/2673-2688/6/1/2