Machine Learning-Based Alzheimer’s Disease Stage Diagnosis Utilizing Blood Gene Expression and Clinical Data: A Comparative Investigation

<b>Background/Objectives:</b> This study presents a comparative analysis of the multistage diagnosis of Alzheimer’s disease (AD), including mild cognitive impairment (MCI), utilizing two distinct types of biomarkers: blood gene expression and clinical biomarker samples. Both of these sam...

Bibliographic Details
Main Authors: Manash Sarma, Subarna Chatterjee
Format: Article
Language: English
Published: MDPI AG 2025-01-01
Series: Diagnostics
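Subjects: Alzheimer’s disease stage diagnosis; blood gene expression; data imbalance; multiclassification; F1-score; AD risk gene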
Online Access: https://www.mdpi.com/2075-4418/15/2/211
_version_ 1832588675134783488
author Manash Sarma
Subarna Chatterjee
author_facet Manash Sarma
Subarna Chatterjee
author_sort Manash Sarma
collection DOAJ
description <b>Background/Objectives:</b> This study presents a comparative analysis of the multistage diagnosis of Alzheimer’s disease (AD), including mild cognitive impairment (MCI), using two distinct types of biomarkers: blood gene expression and clinical biomarker samples. Both sample types, obtained from participants in the Alzheimer’s Disease Neuroimaging Initiative (ADNI), were analyzed independently using machine learning (ML)-based multiclass classifiers. The study applied novel ML-based data augmentation techniques to gene expression profile data, which are high-dimensional, low-sample-size (HDLSS) and inherently highly imbalanced. The investigation obtained the highest multiclassification performance to date in the multistage diagnosis of AD from the blood gene expression profiles of ADNI participants. Based on these performance results and on other factors such as early prediction capability, the study compares the efficacy of the two biomarker types for multistage diagnosis. This study presents the sole investigation in which multiclassification-based AD stage diagnosis was conducted using blood gene expression data. We obtained the best multiclassification results for both modalities of the ADNI data in terms of F1-score and identified new genetic biomarkers. <b>Methods:</b> XGBoost was combined with Sequential Floating Backward Selection (SFBS) to select features, yielding the 95 most effective gene probe sets out of 49,386. For the clinical study data, the eight most effective biomarkers were selected using SFBS. A deep learning (DL) classifier was used to identify the stages: cognitively normal (CN), mild cognitive impairment (MCI), and Alzheimer’s disease (AD)/dementia. DL, support vector machine (SVM), gradient boosting (GB), and random forest (RF) classifiers were used for AD stage detection from gene expression profile data. Because the genomic data are highly imbalanced, borderline oversampling/data augmentation was applied during model training, and original samples were used for validation. <b>Results:</b> Using clinical data, the highest ROC AUC scores attained were 0.989, 0.927, and 0.907 for identifying the CN, MCI, and dementia stages, respectively. The highest F1 scores achieved were 0.971, 0.939, and 0.886. Using gene expression data, we obtained ROC AUC scores of 0.763, 0.761, and 0.706 and F1 scores of 0.71, 0.77, and 0.53 for the CN, MCI, and dementia stages, respectively. <b>Conclusions:</b> This represents the best outcome to date for AD stage diagnosis from ADNI blood gene expression profile data using multiclassification techniques. The results indicate that our multiclassification model effectively handles imbalanced HDLSS data and identifies samples of the minority class. MAPK14, PLG, FZD2, FXYD6, and TEP1 are among the novel genes identified as being associated with AD risk.
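The abstract states that borderline oversampling was applied during model training while original samples were kept for validation. The following is a minimal sketch of that idea, not the authors' pipeline: it uses a simplified Borderline-SMOTE written with the Python standard library only, synthetic two-dimensional points in place of gene expression profiles, and a 1-nearest-neighbour classifier as a stand-in model. The function names (`nearest`, `borderline_oversample`, `f1_per_class`, `predict_1nn`), class centres, and sample counts are all illustrative assumptions.

```python
import math
import random
from collections import Counter


def nearest(X, i, candidates, k):
    """Return the k candidates closest to X[i] (Euclidean), excluding i itself."""
    ranked = sorted((math.dist(X[i], X[j]), j) for j in candidates if j != i)
    return [j for _, j in ranked[:k]]


def borderline_oversample(X, y, minority, k=5, seed=0):
    """Simplified Borderline-SMOTE for one minority class (illustrative only).

    New points are interpolated only from "danger" samples: minority points
    for which at least half, but not all, of the k nearest neighbours belong
    to other classes. Assumes at least two minority samples are present.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    n_needed = max(counts.values()) - counts[minority]
    everyone = range(len(X))
    min_idx = [i for i in everyone if y[i] == minority]
    danger = []
    for i in min_idx:
        n_other = sum(y[j] != minority for j in nearest(X, i, everyone, k))
        if k / 2 <= n_other < k:
            danger.append(i)
    if not danger:  # no borderline points found: interpolate from all minority points
        danger = min_idx
    X_out, y_out = [list(row) for row in X], list(y)
    for _ in range(n_needed):
        i = rng.choice(danger)
        j = rng.choice(nearest(X, i, min_idx, k))  # a nearby minority neighbour
        gap = rng.random()
        X_out.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y_out.append(minority)
    return X_out, y_out


def f1_per_class(y_true, y_pred):
    """Per-class F1 = 2*TP / (2*TP + FP + FN), one score per true label."""
    scores = {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def predict_1nn(X_train, y_train, x):
    """Stand-in classifier: label of the single nearest training point."""
    return min(zip(X_train, y_train), key=lambda pair: math.dist(pair[0], x))[1]


# Synthetic, imbalanced three-class data (hypothetical centres and sizes).
rng = random.Random(42)
X_train, y_train, X_val, y_val = [], [], [], []
for label, center, n in [("CN", (0.0, 0.0), 40), ("MCI", (2.0, 0.0), 40), ("AD", (1.0, 2.0), 12)]:
    pts = [[rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)] for _ in range(n)]
    cut = n * 3 // 4  # stratified split: first 75% of each class goes to training
    X_train += pts[:cut]
    y_train += [label] * cut
    X_val += pts[cut:]
    y_val += [label] * (n - cut)

# Oversample the training fold only; the validation fold keeps original samples.
X_bal, y_bal = borderline_oversample(X_train, y_train, "AD")
y_pred = [predict_1nn(X_bal, y_bal, x) for x in X_val]

print("train before:", Counter(y_train))
print("train after: ", Counter(y_bal))
print("per-class F1 on untouched validation data:", f1_per_class(y_val, y_pred))
```

Splitting before oversampling is the key design choice here: synthetic points are interpolated from training samples only, so no interpolated copy of a validation sample can leak into training and inflate the reported scores.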
format Article
id doaj-art-1bc45486d27d44ce9e3c0a018a2732ea
institution Kabale University
issn 2075-4418
language English
publishDate 2025-01-01
publisher MDPI AG
record_format Article
series Diagnostics
spelling doaj-art-1bc45486d27d44ce9e3c0a018a2732ea
2025-01-24T13:29:07Z
eng
MDPI AG
Diagnostics
2075-4418
2025-01-01
15
2
211
10.3390/diagnostics15020211
Machine Learning-Based Alzheimer’s Disease Stage Diagnosis Utilizing Blood Gene Expression and Clinical Data: A Comparative Investigation
Manash Sarma (0)
Subarna Chatterjee (1)
(0) Department of Computer Science and Engineering, Faculty of Engineering and Technology, Technology Campus (Peenya Campus), Ramaiah University of Applied Sciences, Bengaluru 560058, India
(1) Department of Computer Science and Engineering, Faculty of Engineering and Technology, Technology Campus (Peenya Campus), Ramaiah University of Applied Sciences, Bengaluru 560058, India
https://www.mdpi.com/2075-4418/15/2/211
Alzheimer’s disease stage diagnosis
blood gene expression
data imbalance
multiclassification
F1-score
AD risk gene
title Machine Learning-Based Alzheimer’s Disease Stage Diagnosis Utilizing Blood Gene Expression and Clinical Data: A Comparative Investigation
title_sort machine learning based alzheimer s disease stage diagnosis utilizing blood gene expression and clinical data a comparative investigation
topic Alzheimer’s disease stage diagnosis
blood gene expression
data imbalance
multiclassification
F1-score
AD risk gene
url https://www.mdpi.com/2075-4418/15/2/211
work_keys_str_mv AT manashsarma machinelearningbasedalzheimersdiseasestagediagnosisutilizingbloodgeneexpressionandclinicaldataacomparativeinvestigation
AT subarnachatterjee machinelearningbasedalzheimersdiseasestagediagnosisutilizingbloodgeneexpressionandclinicaldataacomparativeinvestigation