Identification of patient demographic, clinical, and SARS-CoV-2 genomic factors associated with severe COVID-19 using supervised machine learning: a retrospective multicenter study
Abstract Background Drivers of COVID-19 severity are multifactorial and include multidimensional and potentially interacting factors encompassing viral determinants and host-related factors (i.e., demographics, pre-existing conditions and/or genetics), thus complicating the prediction of clinical ou...
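Main Authors: | Kuganya Nirmalarajah, Patryk Aftanas, Shiva Barati, Emily Chien, Gloria Crowl, Amna Faheem, Lubna Farooqi, Alainna J. Jamal, Saman Khan, Jonathon D. Kotwa, Angel X. Li, Mohammad Mozafarihashjin, Jalees A. Nasir, Altynay Shigayeva, Winfield Yim, Lily Yip, Xi Zoe Zhong, Kevin Katz, Robert Kozak, Andrew G. McArthur, Nick Daneman, Finlay Maguire, Allison J. McGeer, Venkata R. Duvvuri, Samira Mubareka |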
---|---|
Format: | Article |
Language: | English |
Published: | BMC 2025-01-01 |
Series: | BMC Infectious Diseases |
Subjects: | COVID-19, SARS-CoV-2, Machine learning, Viral genomics, Disease severity, Data integration |
Online Access: | https://doi.org/10.1186/s12879-025-10450-3 |
_version_ | 1832572035228762112 |
---|---|
author | Kuganya Nirmalarajah, Patryk Aftanas, Shiva Barati, Emily Chien, Gloria Crowl, Amna Faheem, Lubna Farooqi, Alainna J. Jamal, Saman Khan, Jonathon D. Kotwa, Angel X. Li, Mohammad Mozafarihashjin, Jalees A. Nasir, Altynay Shigayeva, Winfield Yim, Lily Yip, Xi Zoe Zhong, Kevin Katz, Robert Kozak, Andrew G. McArthur, Nick Daneman, Finlay Maguire, Allison J. McGeer, Venkata R. Duvvuri, Samira Mubareka |
author_sort | Kuganya Nirmalarajah |
collection | DOAJ |
description | Abstract. Background: Drivers of COVID-19 severity are multifactorial and include multidimensional and potentially interacting factors encompassing viral determinants and host-related factors (i.e., demographics, pre-existing conditions and/or genetics), thus complicating the prediction of clinical outcomes for different severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants. Although millions of SARS-CoV-2 genomes have been publicly shared in global databases, linkages with detailed clinical data are scarce. Therefore, we aimed to establish a COVID-19 patient dataset with linked clinical and viral genomic data to then examine associations between SARS-CoV-2 genomic signatures and clinical disease phenotypes. Methods: A cohort of adult patients with laboratory-confirmed SARS-CoV-2 from 11 participating healthcare institutions in the Greater Toronto Area (GTA) was recruited from March 2020 to April 2022. Supervised machine learning (ML) models were developed to predict hospitalization using SARS-CoV-2 lineage-specific genomic signatures, patient demographics, symptoms, and pre-existing comorbidities. The relative importance of these features was then evaluated. Results: Complete clinical data and viral whole-genome-level information were obtained from 617 patients, 50.4% of whom were hospitalized. Notably, inpatients were older, with a mean age of 66.67 years (SD ± 17.64 years), whereas outpatients had a mean age of 44.89 years (SD ± 16.00 years). SHapley Additive exPlanations (SHAP) analyses revealed that underlying vascular disease, underlying pulmonary disease, and fever were the most significant clinical features associated with hospitalization. In models built on the amino acid sequences of functional regions including spike, nucleocapsid, ORF3a, and ORF8 proteins, variants preceding the emergence of variants of concern (VOCs), or pre-VOC variants, were associated with hospitalization. Conclusions: Viral genomic features have limited utility in predicting hospitalization across SARS-CoV-2 diversity. Combining clinical and viral genomic datasets provides perspective on patient-specific and virus-related factors that impact COVID-19 disease severity. Overall, clinical features had greater discriminatory power than viral genomic features in predicting hospitalization. |
format | Article |
id | doaj-art-dcb3ff62504944578ffb3cb59964b105 |
institution | Kabale University |
issn | 1471-2334 |
language | English |
publishDate | 2025-01-01 |
publisher | BMC |
record_format | Article |
series | BMC Infectious Diseases |
spelling | doaj-art-dcb3ff62504944578ffb3cb59964b105; 2025-02-02T12:10:27Z; eng; BMC; BMC Infectious Diseases; 1471-2334; 2025-01-01; 25 1 1 15; 10.1186/s12879-025-10450-3; Identification of patient demographic, clinical, and SARS-CoV-2 genomic factors associated with severe COVID-19 using supervised machine learning: a retrospective multicenter study; Kuganya Nirmalarajah (Sunnybrook Research Institute); Patryk Aftanas (Shared Hospital Laboratory); Shiva Barati (Sinai Health System); Emily Chien (Sunnybrook Research Institute); Gloria Crowl (Sinai Health System); Amna Faheem (Sinai Health System); Lubna Farooqi (Sinai Health System); Alainna J. Jamal (Sinai Health System); Saman Khan (Sinai Health System); Jonathon D. Kotwa (Sunnybrook Research Institute); Angel X. Li (Sinai Health System); Mohammad Mozafarihashjin (Sinai Health System); Jalees A. Nasir (Michael G. DeGroote Institute for Infectious Disease Research, McMaster University); Altynay Shigayeva (Sinai Health System); Winfield Yim (Sunnybrook Research Institute); Lily Yip (Sunnybrook Research Institute); Xi Zoe Zhong (Sinai Health System); Kevin Katz (Shared Hospital Laboratory); Robert Kozak (Sunnybrook Research Institute); Andrew G. McArthur (Michael G. DeGroote Institute for Infectious Disease Research, McMaster University); Nick Daneman (Sunnybrook Research Institute); Finlay Maguire (Sunnybrook Research Institute); Allison J. McGeer (Sinai Health System); Venkata R. Duvvuri (Public Health Ontario); Samira Mubareka (Sunnybrook Research Institute); https://doi.org/10.1186/s12879-025-10450-3; COVID-19; SARS-CoV-2; Machine learning; Viral genomics; Disease severity; Data integration |
title | Identification of patient demographic, clinical, and SARS-CoV-2 genomic factors associated with severe COVID-19 using supervised machine learning: a retrospective multicenter study |
topic | COVID-19 SARS-CoV-2 Machine learning Viral genomics Disease severity Data integration |
url | https://doi.org/10.1186/s12879-025-10450-3 |