Identification of patient demographic, clinical, and SARS-CoV-2 genomic factors associated with severe COVID-19 using supervised machine learning: a retrospective multicenter study

Bibliographic Details
Main Authors: Kuganya Nirmalarajah, Patryk Aftanas, Shiva Barati, Emily Chien, Gloria Crowl, Amna Faheem, Lubna Farooqi, Alainna J. Jamal, Saman Khan, Jonathon D. Kotwa, Angel X. Li, Mohammad Mozafarihashjin, Jalees A. Nasir, Altynay Shigayeva, Winfield Yim, Lily Yip, Xi Zoe Zhong, Kevin Katz, Robert Kozak, Andrew G. McArthur, Nick Daneman, Finlay Maguire, Allison J. McGeer, Venkata R. Duvvuri, Samira Mubareka
Format: Article
Language: English
Published: BMC 2025-01-01
Series: BMC Infectious Diseases
Subjects: COVID-19, SARS-CoV-2, Machine learning, Viral genomics, Disease severity, Data integration
Online Access: https://doi.org/10.1186/s12879-025-10450-3
author Kuganya Nirmalarajah
Patryk Aftanas
Shiva Barati
Emily Chien
Gloria Crowl
Amna Faheem
Lubna Farooqi
Alainna J. Jamal
Saman Khan
Jonathon D. Kotwa
Angel X. Li
Mohammad Mozafarihashjin
Jalees A. Nasir
Altynay Shigayeva
Winfield Yim
Lily Yip
Xi Zoe Zhong
Kevin Katz
Robert Kozak
Andrew G. McArthur
Nick Daneman
Finlay Maguire
Allison J. McGeer
Venkata R. Duvvuri
Samira Mubareka
author_sort Kuganya Nirmalarajah
collection DOAJ
description Abstract
Background Drivers of COVID-19 severity are multifactorial, spanning potentially interacting viral determinants and host-related factors (i.e., demographics, pre-existing conditions and/or genetics), which complicates the prediction of clinical outcomes for different severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants. Although millions of SARS-CoV-2 genomes have been publicly shared in global databases, linkages with detailed clinical data are scarce. Therefore, we aimed to establish a COVID-19 patient dataset with linked clinical and viral genomic data and then examine associations between SARS-CoV-2 genomic signatures and clinical disease phenotypes.
Methods A cohort of adult patients with laboratory-confirmed SARS-CoV-2 was recruited from 11 participating healthcare institutions in the Greater Toronto Area (GTA) from March 2020 to April 2022. Supervised machine learning (ML) models were developed to predict hospitalization using SARS-CoV-2 lineage-specific genomic signatures, patient demographics, symptoms, and pre-existing comorbidities. The relative importance of these features was then evaluated.
Results Complete clinical data and viral whole-genome-level information were obtained from 617 patients, 50.4% of whom were hospitalized. Notably, inpatients were older, with a mean age of 66.67 years (SD ± 17.64 years), whereas outpatients had a mean age of 44.89 years (SD ± 16.00 years). SHapley Additive exPlanations (SHAP) analyses revealed that underlying vascular disease, underlying pulmonary disease, and fever were the most significant clinical features associated with hospitalization. In models built on the amino acid sequences of functional regions, including the spike, nucleocapsid, ORF3a, and ORF8 proteins, variants preceding the emergence of variants of concern (VOCs), or pre-VOC variants, were associated with hospitalization.
Conclusions Viral genomic features have limited utility in predicting hospitalization across SARS-CoV-2 diversity. Combining clinical and viral genomic datasets provides perspective on patient-specific and virus-related factors that impact COVID-19 disease severity. Overall, clinical features had greater discriminatory power than viral genomic features in predicting hospitalization.
format Article
id doaj-art-dcb3ff62504944578ffb3cb59964b105
institution Kabale University
issn 1471-2334
language English
publishDate 2025-01-01
publisher BMC
record_format Article
series BMC Infectious Diseases
spelling doaj-art-dcb3ff62504944578ffb3cb59964b105; 2025-02-02T12:10:27Z; eng; BMC; BMC Infectious Diseases; 1471-2334; 2025-01-01; volume 25, issue 1, pages 1-15; 10.1186/s12879-025-10450-3
Author affiliations:
Kuganya Nirmalarajah (Sunnybrook Research Institute)
Patryk Aftanas (Shared Hospital Laboratory)
Shiva Barati (Sinai Health System)
Emily Chien (Sunnybrook Research Institute)
Gloria Crowl (Sinai Health System)
Amna Faheem (Sinai Health System)
Lubna Farooqi (Sinai Health System)
Alainna J. Jamal (Sinai Health System)
Saman Khan (Sinai Health System)
Jonathon D. Kotwa (Sunnybrook Research Institute)
Angel X. Li (Sinai Health System)
Mohammad Mozafarihashjin (Sinai Health System)
Jalees A. Nasir (Michael G. DeGroote Institute for Infectious Disease Research, McMaster University)
Altynay Shigayeva (Sinai Health System)
Winfield Yim (Sunnybrook Research Institute)
Lily Yip (Sunnybrook Research Institute)
Xi Zoe Zhong (Sinai Health System)
Kevin Katz (Shared Hospital Laboratory)
Robert Kozak (Sunnybrook Research Institute)
Andrew G. McArthur (Michael G. DeGroote Institute for Infectious Disease Research, McMaster University)
Nick Daneman (Sunnybrook Research Institute)
Finlay Maguire (Sunnybrook Research Institute)
Allison J. McGeer (Sinai Health System)
Venkata R. Duvvuri (Public Health Ontario)
Samira Mubareka (Sunnybrook Research Institute)
title Identification of patient demographic, clinical, and SARS-CoV-2 genomic factors associated with severe COVID-19 using supervised machine learning: a retrospective multicenter study
topic COVID-19
SARS-CoV-2
Machine learning
Viral genomics
Disease severity
Data integration
url https://doi.org/10.1186/s12879-025-10450-3