HSDP: A Hybrid Sampling Method for Imbalanced Big Data Based on Data Partition

The classical classifiers are ineffective in dealing with the problem of imbalanced big dataset classification. Resampling the datasets and balancing samples distribution before training the classifier is one of the most popular approaches to resolve this problem. An effective and simple hybrid samp...


Bibliographic Details
Main Authors:	Liping Chen, Jiabao Jiang, Yong Zhang
Format:	Article
Language:	English
Published:	Wiley 2021-01-01
Series:	Complexity
Online Access:	http://dx.doi.org/10.1155/2021/6877284

_version_	1832548923368013824
author	Liping Chen Jiabao Jiang Yong Zhang
collection	DOAJ
description	Classical classifiers are ineffective on imbalanced big dataset classification problems. Resampling the dataset to balance the sample distribution before training the classifier is one of the most popular approaches to this problem. This paper proposes an effective and simple hybrid sampling method based on data partition (HSDP). First, all data samples are partitioned into different data regions. Then, the samples in the noise minority samples region are removed, and the samples in the boundary minority samples region are selected as oversampling seeds for generating synthetic samples. Finally, a weighted oversampling process generates synthetic samples within the same cluster as the oversampling seed. The weight of each selected minority class sample is the ratio of the proportion of majority class samples among its neighbors to the sum of these proportions over all selected samples. Generating synthetic samples in the same cluster as the oversampling seed ensures that the new samples lie inside the minority class area. Experiments on eight datasets show that HSDP is better than or comparable to typical sampling methods in terms of F-measure and G-mean.
format	Article
id	doaj-art-d9dc20e425214b1581d4e0be8541a910
institution	Kabale University
issn	1076-2787 1099-0526
language	English
publishDate	2021-01-01
publisher	Wiley
record_format	Article
series	Complexity
spelling	doaj-art-d9dc20e425214b1581d4e0be8541a910; 2025-02-03T06:12:50Z; eng; Wiley; Complexity; 1076-2787; 1099-0526; 2021-01-01; 2021; 10.1155/2021/6877284; 6877284; HSDP: A Hybrid Sampling Method for Imbalanced Big Data Based on Data Partition; Liping Chen (0), Jiabao Jiang (1), Yong Zhang (2); (0), (1), (2): School of Information Engineering, Chaohu University, Chaohu, China; http://dx.doi.org/10.1155/2021/6877284
title	HSDP: A Hybrid Sampling Method for Imbalanced Big Data Based on Data Partition
url	http://dx.doi.org/10.1155/2021/6877284
work_keys_str_mv	AT lipingchen hsdpahybridsamplingmethodforimbalancedbigdatabasedondatapartition AT jiabaojiang hsdpahybridsamplingmethodforimbalancedbigdatabasedondatapartition AT yongzhang hsdpahybridsamplingmethodforimbalancedbigdatabasedondatapartition
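
The weighting rule stated in the description (the proportion of majority class samples among a seed's neighbors, divided by the sum of these proportions over all seeds) can be illustrated with a minimal Python sketch. The record gives only the ratio itself; the neighborhood size k, the use of Euclidean distance, the fallback to uniform weights when no seed has a majority neighbor, and the function names are all assumptions made for illustration, not details taken from the article.

```python
import math


def majority_proportion(X, y, i, k, minority_label):
    """Fraction of the k nearest neighbors of sample i that are not minority.

    Assumption: neighbors are found by Euclidean distance, excluding i itself.
    """
    dists = sorted(
        (math.dist(X[i], X[j]), j) for j in range(len(X)) if j != i
    )
    neighbors = [j for _, j in dists[:k]]
    return sum(y[j] != minority_label for j in neighbors) / len(neighbors)


def seed_weights(X, y, seed_idx, k=3, minority_label=1):
    """Normalize each seed's majority proportion by the sum over all seeds."""
    props = [majority_proportion(X, y, i, k, minority_label) for i in seed_idx]
    total = sum(props)
    if total == 0:
        # Assumption: with no majority neighbors anywhere, weight seeds equally.
        return [1 / len(props)] * len(props)
    return [p / total for p in props]


# Toy one-dimensional data: three minority samples (label 1), four majority.
X = [(0.0,), (1.0,), (2.5,), (3.0,), (3.4,), (5.0,), (6.0,)]
y = [1, 1, 1, 0, 0, 0, 0]

# Treat samples 1 and 2 as the hypothetical oversampling seeds.
weights = seed_weights(X, y, seed_idx=[1, 2], k=3, minority_label=1)
print(weights)
```

In this toy data the seed at 2.5 has two majority samples among its three nearest neighbors and the seed at 1.0 has one, so the seed closer to the majority class receives the larger weight. How the weights are then turned into a number of synthetic samples per seed is not specified in this record.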

Similar Items