Identification of Potential Type II Diabetes in a Large-Scale Chinese Population Using a Systematic Machine Learning Framework

Background. An estimated 425 million people globally have diabetes, accounting for 12% of the world’s health expenditures, and the number continues to grow, placing a huge burden on the healthcare system, especially in those remote, underserved areas. Methods. A total of 584,168 adult subjects who h...

Bibliographic Details
Main Authors: Mingyue Xue, Yinxia Su, Chen Li, Shuxia Wang, Hua Yao
Format: Article
Language: English
Published: Wiley 2020-01-01
Series: Journal of Diabetes Research
Online Access: http://dx.doi.org/10.1155/2020/6873891
collection DOAJ
description Background. An estimated 425 million people globally have diabetes, and the disease accounts for 12% of the world’s health expenditures. The number continues to grow, placing a heavy burden on healthcare systems, especially in remote, underserved areas. Methods. A total of 584,168 adult subjects who had participated in the national physical examination were enrolled in this study. Risk factors for type II diabetes mellitus (T2DM) were identified by p values and odds ratios, using logistic regression (LR) on variables from physical measurements and a questionnaire. Using the risk factors selected by LR, we applied a decision tree, a random forest, AdaBoost with a decision tree (AdaBoost), and an extreme gradient boosting decision tree (XGBoost) to identify individuals with T2DM, compared the performance of the four machine learning classifiers, and used the best-performing classifier to output variable importance scores for T2DM. Results. XGBoost had the best performance (accuracy=0.906, precision=0.910, recall=0.902, F1=0.906, and AUC=0.968). The variable importance scores from XGBoost showed that BMI was the most important feature, followed by age, waist circumference, systolic pressure, ethnicity, smoking amount, fatty liver, hypertension, physical activity, drinking status, dietary ratio (meat to vegetables), drinking amount, smoking status, and dietary habit (preference for oily food). Conclusions. We proposed an LR-XGBoost classifier that uses fourteen easily obtained, noninvasive patient variables as predictors to identify potential cases of T2DM. The classifier can accurately screen for diabetes risk at an early phase, and the variable importance scores offer clues for preventing the onset of diabetes.
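The reported metrics can be cross-checked against each other, since F1 is the harmonic mean of precision and recall. The sketch below is illustrative only and is not the authors' code: the function names `f1_score` and `classification_metrics` are hypothetical, and the only figures taken from the record are the reported precision (0.910) and recall (0.902).

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives for the positive (T2DM) class.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# Consistency check using only the precision and recall stated in the abstract.
reported_f1 = f1_score(0.910, 0.902)
print(round(reported_f1, 3))  # 0.906, matching the reported F1
```

The same `classification_metrics` helper could be applied to the confusion matrix of any of the four classifiers; the record does not give those counts, so none are assumed here.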
id doaj-art-d10db7b640534c5db0d3b4b2160323d9
institution Kabale University
issn 2314-6745
2314-6753
Record timestamp: 2025-02-03T05:51:11Z
DOI: 10.1155/2020/6873891
Author affiliations:
Mingyue Xue: Hospital of Traditional Chinese Medicine Affiliated to the Fourth Clinical Medical College of Xinjiang Medical University, Urumqi, China
Yinxia Su: College of Public Health, Xinjiang Medical University, Urumqi, China
Chen Li: The First Affiliated Hospital of Xinjiang Medical University, Urumqi, China
Shuxia Wang: Center of Health Management, The First Affiliated Hospital, Xinjiang Medical University, Urumqi, China
Hua Yao: Center of Health Management, The First Affiliated Hospital, Xinjiang Medical University, Urumqi, China