HSDP: A Hybrid Sampling Method for Imbalanced Big Data Based on Data Partition
Classical classifiers are ineffective at classifying imbalanced big datasets. Resampling the dataset to balance the sample distribution before training the classifier is one of the most popular approaches to this problem. An effective and simple hybrid samp...
Main Authors: | Liping Chen, Jiabao Jiang, Yong Zhang |
---|---|
Format: | Article |
Language: | English |
Published: | Wiley 2021-01-01 |
Series: | Complexity |
Online Access: | http://dx.doi.org/10.1155/2021/6877284 |
_version_ | 1832548923368013824 |
---|---|
author | Liping Chen Jiabao Jiang Yong Zhang |
author_sort | Liping Chen |
collection | DOAJ |
description | Classical classifiers are ineffective at classifying imbalanced big datasets. Resampling the dataset to balance the sample distribution before training the classifier is one of the most popular approaches to this problem. An effective and simple hybrid sampling method based on data partition (HSDP) is proposed in this paper. First, all data samples are partitioned into different data regions. Then, the samples in the noise minority samples region are removed, and the samples in the boundary minority samples region are selected as oversampling seeds for generating synthetic samples. Finally, a weighted oversampling process is conducted in which synthetic samples are generated in the same cluster as the oversampling seed. The weight of each selected minority class sample is the ratio of the proportion of majority class samples among its neighbors to the sum of these proportions over all selected samples. Generating synthetic samples in the same cluster as the oversampling seed guarantees that the new samples lie inside the minority class area. Experiments on eight datasets show that HSDP is better than or comparable to typical sampling methods in terms of F-measure and G-mean. |
format | Article |
id | doaj-art-d9dc20e425214b1581d4e0be8541a910 |
institution | Kabale University |
issn | 1076-2787 1099-0526 |
language | English |
publishDate | 2021-01-01 |
publisher | Wiley |
record_format | Article |
series | Complexity |
spelling | doaj-art-d9dc20e425214b1581d4e0be8541a910; 2025-02-03T06:12:50Z; eng; Wiley; Complexity; 1076-2787; 1099-0526; 2021-01-01; 2021; 10.1155/2021/6877284; 6877284; HSDP: A Hybrid Sampling Method for Imbalanced Big Data Based on Data Partition; Liping Chen (0), Jiabao Jiang (1), Yong Zhang (2); (0) School of Information Engineering, Chaohu University, Chaohu, China; (1) School of Information Engineering, Chaohu University, Chaohu, China; (2) School of Information Engineering, Chaohu University, Chaohu, China; Classical classifiers are ineffective at classifying imbalanced big datasets. Resampling the dataset to balance the sample distribution before training the classifier is one of the most popular approaches to this problem. An effective and simple hybrid sampling method based on data partition (HSDP) is proposed in this paper. First, all data samples are partitioned into different data regions. Then, the samples in the noise minority samples region are removed, and the samples in the boundary minority samples region are selected as oversampling seeds for generating synthetic samples. Finally, a weighted oversampling process is conducted in which synthetic samples are generated in the same cluster as the oversampling seed. The weight of each selected minority class sample is the ratio of the proportion of majority class samples among its neighbors to the sum of these proportions over all selected samples. Generating synthetic samples in the same cluster as the oversampling seed guarantees that the new samples lie inside the minority class area. Experiments on eight datasets show that HSDP is better than or comparable to typical sampling methods in terms of F-measure and G-mean.; http://dx.doi.org/10.1155/2021/6877284 |
title | HSDP: A Hybrid Sampling Method for Imbalanced Big Data Based on Data Partition |
title_sort | hsdp a hybrid sampling method for imbalanced big data based on data partition |
url | http://dx.doi.org/10.1155/2021/6877284 |
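
The weighting rule in the description (each seed's share of majority-class neighbors, divided by the sum of those shares over all seeds) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record does not state the distance metric, the neighborhood size, the clustering method, or how a synthetic point is placed, so Euclidean distance, the parameter `k`, precomputed `clusters` labels, SMOTE-style linear interpolation, and the function names `majority_proportion`, `seed_weights`, and `synthesize` are all assumptions made for illustration.

```python
import math
import random


def majority_proportion(seed, samples, labels, k, majority_label):
    """Fraction of the k nearest neighbours of `seed` in the majority class.

    Assumption: Euclidean distance over all other samples.
    """
    others = sorted(
        (i for i in range(len(samples)) if i != seed),
        key=lambda i: math.dist(samples[seed], samples[i]),
    )
    nearest = others[:k]
    return sum(labels[i] == majority_label for i in nearest) / len(nearest)


def seed_weights(seeds, samples, labels, k, majority_label):
    """Weight of each seed = its majority proportion / sum of all proportions."""
    props = [
        majority_proportion(s, samples, labels, k, majority_label) for s in seeds
    ]
    total = sum(props)
    if total == 0:
        # Assumption: fall back to uniform weights when no seed has
        # majority-class neighbours (the record does not cover this case).
        return [1.0 / len(seeds)] * len(seeds)
    return [p / total for p in props]


def synthesize(seed, samples, labels, clusters, minority_label, rng):
    """Interpolate between `seed` and a minority sample of the same cluster.

    Assumption: SMOTE-style linear interpolation; `clusters` holds one
    precomputed cluster id per sample. Returns None if no partner exists.
    """
    partners = [
        i
        for i in range(len(samples))
        if i != seed
        and labels[i] == minority_label
        and clusters[i] == clusters[seed]
    ]
    if not partners:
        return None
    partner = rng.choice(partners)
    gap = rng.random()
    return tuple(
        a + gap * (b - a) for a, b in zip(samples[seed], samples[partner])
    )


if __name__ == "__main__":
    # Toy 2-D data: label 0 is the minority class, label 1 the majority class.
    samples = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.2, 0.0), (1.4, 0.0), (2.0, 0.0)]
    labels = [0, 0, 0, 1, 1, 1]
    clusters = [0, 0, 0, 1, 1, 1]
    seeds = [1, 2]
    print(seed_weights(seeds, samples, labels, k=3, majority_label=1))
    print(synthesize(2, samples, labels, clusters, 0, random.Random(0)))
```

Under these assumptions, a seed surrounded by more majority-class neighbors receives a larger weight, so more synthetic samples would be generated near the class boundary, while restricting interpolation partners to the seed's own cluster keeps each new point between two minority samples of that cluster.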