Lung Cancer Classification Using the Extreme Gradient Boosting (XGBoost) Algorithm and Mutual Information for Feature Selection

Lung cancer is one of the deadliest types of cancer worldwide and is often detected too late due to the absence of early symptoms. This study aims to evaluate the impact of feature selection using Mutual Information on the performance of lung cancer classification with the XGBoost algorithm. Mutual...

Bibliographic Details
Main Authors: Regitha Zizilia, Yulison Herry Chrisnanto, Gunawan Abdillah
Format: Article
Language: Indonesian
Published: Islamic University of Indragiri 2025-09-01
Series: Sistemasi: Jurnal Sistem Informasi
Subjects: classification; lung cancer; mutual information; xgboost; k-fold cross validation
Online Access: https://sistemasi.ftik.unisi.ac.id/index.php/stmsi/article/view/5345
author Regitha Zizilia
Yulison Herry Chrisnanto
Gunawan Abdillah
collection DOAJ
description Lung cancer is one of the deadliest cancers worldwide and is often detected late because early symptoms are absent. This study evaluates how feature selection with Mutual Information affects the performance of lung cancer classification with the XGBoost algorithm. Mutual Information is used to select relevant features, capturing both linear and non-linear relationships with the target variable, while XGBoost is chosen for its ability to handle large datasets and reduce overfitting. The study used a dataset of 30,000 entries, with data split scenarios of 90:10, 80:20, 70:30, and 60:40. Before Mutual Information was applied, testing accuracy ranged from 93.42% to 93.83% and K-Fold Cross-Validation accuracy ranged from 94.59% to 94.76%. After feature selection, testing accuracy remained stable for the 70:30 and 60:40 split scenarios, at 93.60% and 93.42% respectively, although K-Fold Cross-Validation accuracy decreased to 89.26% and 90.88%. For the 90:10 and 80:20 split scenarios, accuracy declined: testing accuracy dropped to 88.63% and 88.85%, and K-Fold Cross-Validation accuracy fell to 88.87% and 90.24%. Feature selection with Mutual Information improves computational efficiency by reducing the number of features, and in certain data scenarios it can simplify the feature set without significantly compromising model performance, depending on the characteristics of the dataset.
format Article
id doaj-art-87d2c2cb68fd4c3fbf6bbb809cd104df
institution DOAJ
issn 2302-8149
2540-9719
language Indonesian
publishDate 2025-09-01
publisher Islamic University of Indragiri
record_format Article
series Sistemasi: Jurnal Sistem Informasi
spelling doaj-art-87d2c2cb68fd4c3fbf6bbb809cd104df; 2025-08-20T03:16:18Z; ind; Islamic University of Indragiri; Sistemasi: Jurnal Sistem Informasi; ISSN 2302-8149, 2540-9719; 2025-09-01; vol. 14, no. 5, pp. 2198-2214; doi:10.32520/stmsi.v14i5.5345; Lung Cancer Classification Using the Extreme Gradient Boosting (XGBoost) Algorithm and Mutual Information for Feature Selection; Regitha Zizilia, Yulison Herry Chrisnanto, Gunawan Abdillah (all Universitas Jenderal Achmad Yani); https://sistemasi.ftik.unisi.ac.id/index.php/stmsi/article/view/5345; classification, lung cancer, mutual information, xgboost, k-fold cross validation
title Lung Cancer Classification Using the Extreme Gradient Boosting (XGBoost) Algorithm and Mutual Information for Feature Selection
topic classification
lung cancer
mutual information
xgboost
k-fold cross validation
url https://sistemasi.ftik.unisi.ac.id/index.php/stmsi/article/view/5345
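
Illustrative note on the method named in the description: the record does not say which Mutual Information estimator or feature count the authors used, so the following is only a minimal sketch of the general idea, assuming discrete-valued features. The function names (`mutual_information`, `select_top_k`) and the toy data are hypothetical and are not taken from the article. It shows a feature with zero linear correlation to the label still receiving a high score, which is the "non-linear relationship" property the description refers to.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Estimate I(X; Y) in bits from two equal-length sequences of discrete values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), expressed with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def select_top_k(features, target, k):
    """Rank features by MI with the target and return (top-k names, all scores)."""
    scores = {name: mutual_information(col, target) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy data (made up for illustration, not from the study's dataset).
# The label is 1 whenever "symmetric" is non-zero, so the dependence is
# non-linear: the Pearson correlation between the two is exactly zero.
features = {
    "symmetric": [-1, 0, 1] * 4,
    "noise": [0] * 6 + [1] * 6,  # statistically independent of the label
}
target = [1, 0, 1] * 4

selected, scores = select_top_k(features, target, k=1)
print(selected, scores)
```

In a fuller pipeline the selected columns would then be passed to the classifier and evaluated under each split scenario. For continuous features, library estimators such as scikit-learn's `mutual_info_classif` are the usual choice instead of the counting estimate above.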