HSDP: A Hybrid Sampling Method for Imbalanced Big Data Based on Data Partition

Bibliographic Details
Main Authors: Liping Chen, Jiabao Jiang, Yong Zhang
Format: Article
Language: English
Published: Wiley 2021-01-01
Series: Complexity
Online Access: http://dx.doi.org/10.1155/2021/6877284
Description
Summary: Classical classifiers perform poorly on imbalanced big-data classification. Resampling the dataset to balance the class distribution before training the classifier is one of the most popular remedies. This paper proposes a simple and effective hybrid sampling method based on data partition (HSDP). First, all data samples are partitioned into different data regions. Next, the samples in the noise minority sample region are removed, and the samples in the boundary minority sample region are selected as oversampling seeds for generating synthetic samples. Finally, a weighted oversampling process generates synthetic samples within the same cluster as each oversampling seed. The weight of each selected minority class sample is the proportion of majority class samples among its neighbors, divided by the sum of these proportions over all selected samples. Generating synthetic samples within the seed's cluster ensures that the new samples lie inside the minority class area. Experiments on eight datasets show that HSDP is better than or comparable to typical sampling methods in terms of F-measure and G-mean.
ISSN: 1076-2787, 1099-0526
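
The seed-weighting rule in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `seed_weights`, the use of Euclidean k-nearest neighbors, the value of `k`, the toy data, and the uniform fallback when no seed has a majority neighbor are all assumptions made for illustration; the summary does not specify them.

```python
import math


def seed_weights(points, labels, seed_indices, k=2, majority_label=0):
    """Return one sampling weight per oversampling seed.

    Weight of a seed = (share of majority-class samples among its k
    nearest neighbors) / (sum of those shares over all seeds).
    Hypothetical helper: neighbor definition and k are assumptions.
    """
    shares = []
    for i in seed_indices:
        # All other samples, ordered by Euclidean distance to seed i.
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(points[i], points[j]))
        neighbors = others[:k]
        majority = sum(1 for j in neighbors if labels[j] == majority_label)
        shares.append(majority / len(neighbors))
    total = sum(shares)
    if total == 0:
        # Assumption: fall back to uniform weights if no seed has
        # any majority-class neighbor.
        return [1.0 / len(shares)] * len(shares)
    return [s / total for s in shares]


# Toy data on a line: label 1 = minority class, label 0 = majority class.
points = [(0.0, 0.0), (2.0, 0.0), (0.4, 0.0), (2.3, 0.0), (2.6, 0.0), (1.2, 0.0)]
labels = [1, 1, 1, 0, 0, 0]

# Treat samples 0 and 1 as the selected boundary minority seeds.
weights = seed_weights(points, labels, seed_indices=[0, 1], k=2)
print(weights)
```

In this toy case, seed 1 is surrounded by majority samples while seed 0 has one majority neighbor out of two, so seed 1 receives roughly twice the weight of seed 0. Under this reading of the rule, seeds closer to the class boundary would be chosen more often when generating synthetic samples.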