HSDP: A Hybrid Sampling Method for Imbalanced Big Data Based on Data Partition

Bibliographic Details
Main Authors: Liping Chen, Jiabao Jiang, Yong Zhang
Format: Article
Language: English
Published: Wiley 2021-01-01
Series: Complexity
Online Access: http://dx.doi.org/10.1155/2021/6877284
_version_ 1832548923368013824
author Liping Chen
Jiabao Jiang
Yong Zhang
author_facet Liping Chen
Jiabao Jiang
Yong Zhang
author_sort Liping Chen
collection DOAJ
description Classical classifiers are ineffective at classifying imbalanced big datasets. Resampling the dataset to balance the sample distribution before training the classifier is one of the most popular approaches to this problem. This paper proposes an effective and simple hybrid sampling method based on data partition (HSDP). First, all data samples are partitioned into different data regions. Then, the samples in the noise minority region are removed, and the samples in the boundary minority region are selected as oversampling seeds for generating synthetic samples. Finally, a weighted oversampling process is conducted in which synthetic samples are generated within the same cluster as the oversampling seed. The weight of each selected minority class sample is the ratio between the proportion of majority class samples among its neighbors and the sum of these proportions over all selected samples. Generating synthetic samples within the same cluster as the oversampling seed guarantees that the new synthetic samples are located inside the minority class area. Experiments conducted on eight datasets show that the proposed method, HSDP, performs better than or comparably to typical sampling methods in terms of F-measure and G-mean.
format Article
id doaj-art-d9dc20e425214b1581d4e0be8541a910
institution Kabale University
issn 1076-2787
1099-0526
language English
publishDate 2021-01-01
publisher Wiley
record_format Article
series Complexity
spelling doaj-art-d9dc20e425214b1581d4e0be8541a910
2025-02-03T06:12:50Z
eng
Wiley
Complexity
1076-2787
1099-0526
2021-01-01
2021
10.1155/2021/6877284
6877284
HSDP: A Hybrid Sampling Method for Imbalanced Big Data Based on Data Partition
Liping Chen, School of Information Engineering, Chaohu University, Chaohu, China
Jiabao Jiang, School of Information Engineering, Chaohu University, Chaohu, China
Yong Zhang, School of Information Engineering, Chaohu University, Chaohu, China
http://dx.doi.org/10.1155/2021/6877284
spellingShingle Liping Chen
Jiabao Jiang
Yong Zhang
HSDP: A Hybrid Sampling Method for Imbalanced Big Data Based on Data Partition
Complexity
title HSDP: A Hybrid Sampling Method for Imbalanced Big Data Based on Data Partition
title_sort hsdp a hybrid sampling method for imbalanced big data based on data partition
url http://dx.doi.org/10.1155/2021/6877284
work_keys_str_mv AT lipingchen hsdpahybridsamplingmethodforimbalancedbigdatabasedondatapartition
AT jiabaojiang hsdpahybridsamplingmethodforimbalancedbigdatabasedondatapartition
AT yongzhang hsdpahybridsamplingmethodforimbalancedbigdatabasedondatapartition
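
The weighting rule in the description above (each seed's weight is its majority-class neighbor proportion divided by the sum of those proportions over all seeds) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `seed_weights` and `oversample`, the brute-force Euclidean neighbor search, the uniform fallback when every proportion is zero, and the SMOTE-style linear interpolation between a seed and a minority sample of the same cluster are all assumptions made for illustration; the record does not specify how regions, clusters, or synthetic points are computed, so seed indices and cluster labels are taken as given inputs.

```python
import numpy as np


def seed_weights(X, y, seed_idx, k=5, minority=1):
    """Weight of each seed: majority share among its k nearest neighbours,
    divided by the sum of those shares over all seeds."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    shares = []
    for i in seed_idx:
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                      # exclude the seed itself
        nearest = np.argsort(dist)[:k]
        shares.append(np.mean(y[nearest] != minority))
    shares = np.asarray(shares, dtype=float)
    total = shares.sum()
    if total == 0:
        # Assumption: fall back to uniform weights if no seed has majority neighbours.
        return np.full(len(shares), 1.0 / len(shares))
    return shares / total


def oversample(X, y, seed_idx, clusters, n_new, k=5, minority=1, rng=None):
    """Draw seeds according to their weights and interpolate each one towards a
    minority sample of the same cluster (interpolation is an assumed, SMOTE-style step)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    clusters = np.asarray(clusters)
    seed_idx = np.asarray(seed_idx)
    weights = seed_weights(X, y, seed_idx, k=k, minority=minority)
    picks = rng.choice(seed_idx, size=n_new, p=weights)
    synthetic = []
    for i in picks:
        # Minority samples sharing the seed's cluster; includes the seed, so never empty.
        mates = np.flatnonzero((y == minority) & (clusters == clusters[i]))
        j = rng.choice(mates)
        synthetic.append(X[i] + rng.random() * (X[j] - X[i]))
    return np.array(synthetic)


if __name__ == "__main__":
    X = [[0, 0], [1, 0], [1.5, 0.5], [1.2, 0], [10, 10]]
    y = [1, 1, 0, 0, 0]                       # 1 = minority, 0 = majority
    clusters = [0, 0, 0, 0, 1]                # hypothetical cluster labels
    print(seed_weights(X, y, [0, 1], k=2))
    print(oversample(X, y, [0, 1], clusters, n_new=3, k=2, rng=0))
```

In this toy data the second seed has only majority neighbors and the first has one of each, so the second seed is drawn twice as often. Restricting the interpolation partner to the seed's own cluster is what keeps the synthetic points inside the minority area in this sketch.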