Machine Learning Model for Imbalanced Cholera Dataset in Tanzania

Cholera epidemics have remained a public threat throughout history, affecting vulnerable populations living with unreliable water and substandard sanitary conditions. Various studies have observed that the occurrence of cholera is strongly linked to environmental factors such as climate change and geograp...

Bibliographic Details
Main Authors: Judith Leo, Edith Luhanga, Kisangiri Michael
Format: Article
Language: English
Published: Wiley 2019-01-01
Series: The Scientific World Journal
Online Access: http://dx.doi.org/10.1155/2019/9397578
collection DOAJ
description Cholera epidemics have remained a public threat throughout history, affecting vulnerable populations living with unreliable water and substandard sanitary conditions. Various studies have observed that the occurrence of cholera is strongly linked to environmental factors such as climate change and geographical location. Climate change has been strongly linked to the seasonal occurrence and wide spread of cholera through the creation of weather patterns that favor the disease's transmission, infection, and the growth of Vibrio cholerae, which causes the disease. Over the past decades, there have been great achievements in developing epidemic models for the proper prediction of cholera. However, the integration of weather variables and the use of machine learning techniques have not been explicitly deployed in modeling cholera epidemics in Tanzania, owing to challenges in the available datasets such as imbalanced data and missing information. This paper explores the use of machine learning techniques to model cholera epidemics with linkage to seasonal weather changes while overcoming the data imbalance problem. Adaptive Synthetic Sampling Approach (ADASYN) and Principal Component Analysis (PCA) were used to restore the sampling balance and reduce the dimensionality of the dataset. In addition, sensitivity, specificity, and balanced-accuracy metrics were used to evaluate the performance of the seven models. Based on the results of the Wilcoxon signed-rank test and features of the models, the XGBoost classifier was selected as the best model for the study. Overall, the results improved our understanding of the significant role of machine learning strategies in health-care data. However, the study could not be treated as a time series problem due to data collection bias. The study recommends a review of health-care systems in order to facilitate quality data collection and deployment of machine learning techniques.
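The description names sensitivity, specificity, and balanced accuracy as the evaluation metrics. The following is a minimal illustrative sketch of how those three quantities relate on an imbalanced label set; it is not the authors' code, and the function names, the label encoding (1 for a cholera case, 0 otherwise), and the toy labels are all assumptions made for the example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp


def sensitivity(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    tp, fn, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(y_true, y_pred):
    """Fraction of actual negatives that were predicted negative."""
    _, _, tn, fp = confusion_counts(y_true, y_pred)
    return tn / (tn + fp) if (tn + fp) else 0.0


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity."""
    return (sensitivity(y_true, y_pred) + specificity(y_true, y_pred)) / 2


# Hypothetical toy labels: 2 positives among 10 samples (imbalanced).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10
some_model = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]

print(balanced_accuracy(y_true, always_negative))
print(balanced_accuracy(y_true, some_model))
```

A classifier that always predicts the majority class is right on 8 of these 10 toy samples, yet its balanced accuracy is only 0.5 because its sensitivity is zero. This is why balanced accuracy is a more informative summary than plain accuracy when one class is rare.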
id doaj-art-6cad701ca7cf4ea689e4808e7f7a5d1d
institution Kabale University
issn 2356-6140
1537-744X
spelling doaj-art-6cad701ca7cf4ea689e4808e7f7a5d1d
2025-02-03T06:01:37Z
eng
Wiley
The Scientific World Journal
2356-6140
1537-744X
2019-01-01
2019
10.1155/2019/9397578
9397578
Machine Learning Model for Imbalanced Cholera Dataset in Tanzania
Judith Leo, Edith Luhanga, Kisangiri Michael (all authors: Nelson Mandela African Institution of Science and Technology (NM-AIST), School of Computation and Communication Science and Engineering (CoCSE), P.O. BOX 447, Arusha, Tanzania)
http://dx.doi.org/10.1155/2019/9397578