Unbiased Feature Selection in Learning Random Forests for High-Dimensional Data

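Random forests (RFs) have been widely used as a powerful classification method. However, because both the bagging samples and the candidate features are randomized, the trees in the forest tend to select uninformative features for node splitting, which gives RFs poor accuracy on high-dimensional data. RFs are also biased in feature selection, favoring multivalued features. To debias feature selection in RFs, we propose a new RF algorithm, called xRF, that selects good features when learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and a subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets, and a feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach makes it possible to generate more accurate trees while reducing dimensionality and the amount of data needed for learning RFs. An extensive set of experiments was conducted on 47 high-dimensional real-world datasets, including image datasets. The experimental results show that RFs built with the proposed approach outperformed existing random forests in both accuracy and AUC.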
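
The three steps named in this record's description (drop uninformative features by p-value, split the surviving features into two subsets, then draw features from both subsets with weights) can be illustrated with a minimal Python sketch. This is not the authors' xRF implementation: the record does not say which test or statistical measures the paper uses, so the label-permutation test, the class-mean-difference score, the median split, and the names `relevance`, `permutation_p_value`, `select_and_partition`, `sample_features`, and `strong_weight` are all assumptions made for illustration. The sketch assumes binary labels coded 0 and 1 and numeric features.

```python
import random
from statistics import mean, median


def relevance(column, labels):
    """Assumed score: absolute difference between the two class means."""
    pos = [x for x, y in zip(column, labels) if y == 1]
    neg = [x for x, y in zip(column, labels) if y == 0]
    return abs(mean(pos) - mean(neg))


def permutation_p_value(column, labels, rounds, rng):
    """Share of label shuffles that score at least as high as the real labels."""
    observed = relevance(column, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(shuffled)
        if relevance(column, shuffled) >= observed:
            hits += 1
    # The +1 terms keep the estimate strictly above zero.
    return (hits + 1) / (rounds + 1)


def select_and_partition(columns, labels, alpha=0.05, rounds=200, seed=0):
    """Keep features with p <= alpha, then split them at the median score."""
    rng = random.Random(seed)
    kept = {}
    for name, column in columns.items():
        if permutation_p_value(column, labels, rounds, rng) <= alpha:
            kept[name] = relevance(column, labels)
    if not kept:
        return [], []
    cut = median(kept.values())  # hypothetical split rule
    strong = [name for name, score in kept.items() if score >= cut]
    weak = [name for name, score in kept.items() if score < cut]
    return strong, weak


def sample_features(strong, weak, mtry, strong_weight=3.0, rng=None):
    """Draw up to mtry distinct features, favoring the strong subset."""
    rng = rng or random.Random()
    pool = [(name, strong_weight) for name in strong]
    pool += [(name, 1.0) for name in weak]
    chosen = []
    while pool and len(chosen) < mtry:
        weights = [weight for _, weight in pool]
        pick = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(pick)[0])
    return chosen
```

In a forest built this way, `sample_features` would be called once per node (or per tree) in place of uniform feature sampling, so that features that failed the p-value filter are never offered as split candidates.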

Bibliographic Details
Main Authors:	Thanh-Tung Nguyen, Joshua Zhexue Huang, Thuy Thi Nguyen
Format:	Article
Language:	English
Published:	Wiley 2015-01-01
Series:	The Scientific World Journal
Online Access:	http://dx.doi.org/10.1155/2015/471371

author	Thanh-Tung Nguyen Joshua Zhexue Huang Thuy Thi Nguyen
collection	DOAJ
description	Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting. This makes RFs have poor accuracy when working with high-dimensional data. Besides that, RFs have bias in the feature selection process where multivalued features are favored. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features in learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees, while allowing one to reduce dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets including image datasets. The experimental results have shown that RFs with the proposed approach outperformed the existing random forests in increasing the accuracy and the AUC measures.
format	Article
id	doaj-art-6c1f68fbe87e41eaa8f6a242bfba2b6c
institution	Kabale University
issn	2356-6140 1537-744X
language	English
publishDate	2015-01-01
publisher	Wiley
record_format	Article
series	The Scientific World Journal
spelling	doaj-art-6c1f68fbe87e41eaa8f6a242bfba2b6c; 2025-02-03T05:51:30Z; eng; Wiley; The Scientific World Journal; 2356-6140; 1537-744X; 2015-01-01; 2015; 10.1155/2015/471371; 471371; Unbiased Feature Selection in Learning Random Forests for High-Dimensional Data; Thanh-Tung Nguyen [0]; Joshua Zhexue Huang [1]; Thuy Thi Nguyen [2]; [0] Shenzhen Key Laboratory of High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China; [1] Shenzhen Key Laboratory of High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China; [2] Faculty of Information Technology, Vietnam National University of Agriculture, Hanoi 10000, Vietnam; http://dx.doi.org/10.1155/2015/471371
title	Unbiased Feature Selection in Learning Random Forests for High-Dimensional Data
url	http://dx.doi.org/10.1155/2015/471371

Similar Items