Decomposition of the pangenome matrix reveals a structure in gene distribution in the Escherichia coli species

ABSTRACT Thousands of complete genome sequences for strains of a species that are now available enable the advancement of pangenome analytics to a new level of sophistication. We collected 2,377 publicly available complete genomes of Escherichia coli for detailed pangenome analysis. The core genome...

Bibliographic Details
Main Authors: Siddharth M. Chauhan, Omid Ardalani, Jason C. Hyun, Jonathan M. Monk, Patrick V. Phaneuf, Bernhard O. Palsson
Format: Article
Language: English
Published: American Society for Microbiology 2025-01-01
Series: mSphere
Subjects: Shigella; Escherichia coli; genomics; typing; computational biology; genome analysis
Online Access: https://journals.asm.org/doi/10.1128/msphere.00532-24
author Siddharth M. Chauhan
Omid Ardalani
Jason C. Hyun
Jonathan M. Monk
Patrick V. Phaneuf
Bernhard O. Palsson
author_sort Siddharth M. Chauhan
collection DOAJ
description ABSTRACT Thousands of complete genome sequences for strains of a species that are now available enable the advancement of pangenome analytics to a new level of sophistication. We collected 2,377 publicly available complete genomes of Escherichia coli for detailed pangenome analysis. The core genome and accessory genomes consisted of 2,398 and 5,182 genes, respectively. We developed a machine learning approach to define the accessory genes characterizing the major phylogroups of E. coli plus Shigella: A, B1, B2, C, D, E, F, G, and Shigella. The analysis resulted in a detailed structure of the genetic basis of the phylogroups’ differential traits. This pangenome structure was largely consistent with a housekeeping-gene-based MLST distribution, sequence-based Mash distance, and the Clermont quadruplex classification. The rare genome (consisting of genes found in <6.8% of all strains) consisted of 163,619 genes, about 79% of which represented variations of 315 underlying transposon elements. This analysis generated a mathematical definition of the genetic basis for a species.
IMPORTANCE The comprehensive analysis of the pangenome of Escherichia coli presented in this study marks a significant advancement in understanding bacterial genetic diversity. By employing machine learning techniques to analyze 2,377 complete E. coli genomes, the study provides a detailed mapping of core, accessory, and rare genes. This approach reveals the genetic basis for differential traits across phylogroups, offering insights into pathogenicity, antibiotic resistance, and evolutionary adaptations. The findings enhance the potential for genome-based diagnostics and pave the way for future studies aimed at achieving a global genetic definition of bacterial phylogeny.
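The core, accessory, and rare partition named in the abstract can be illustrated with a minimal sketch that labels genes by how often they occur across strains. This is not the authors' machine learning method. The rare cutoff (<6.8% of strains) comes from the abstract; the core cutoff `CORE_MIN`, the function name `classify_genes`, and the toy matrix are assumptions made only for illustration.

```python
from typing import Dict, List

RARE_MAX = 0.068  # from the abstract: rare genes occur in <6.8% of strains
CORE_MIN = 0.99   # assumption: the abstract does not state the core cutoff


def classify_genes(matrix: Dict[str, List[int]],
                   rare_max: float = RARE_MAX,
                   core_min: float = CORE_MIN) -> Dict[str, str]:
    """Label each gene by the fraction of strains that carry it.

    `matrix` maps a gene name to a presence/absence row (1 or 0),
    one entry per strain.
    """
    labels = {}
    for gene, row in matrix.items():
        freq = sum(row) / len(row)  # fraction of strains carrying the gene
        if freq < rare_max:
            labels[gene] = "rare"
        elif freq >= core_min:
            labels[gene] = "core"
        else:
            labels[gene] = "accessory"
    return labels


# Hypothetical toy matrix with 20 strains (not data from the study).
toy = {
    "geneA": [1] * 20,            # present in every strain
    "geneB": [1] * 8 + [0] * 12,  # present in 40% of strains
    "geneC": [1] + [0] * 19,      # present in 5% of strains
}
labels = classify_genes(toy)
```

With these cutoffs, `geneA` is labeled core, `geneB` accessory, and `geneC` rare. A real analysis would read the presence/absence matrix from a pangenome tool's output rather than a hand-written dictionary.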
format Article
id doaj-art-21605b607a104dcd9e69b8bb498a7676
institution Kabale University
issn 2379-5042
language English
publishDate 2025-01-01
publisher American Society for Microbiology
record_format Article
series mSphere
spelling doaj-art-21605b607a104dcd9e69b8bb498a7676; 2025-01-28T14:00:56Z; eng; American Society for Microbiology; mSphere; 2379-5042; 2025-01-01; volume 10, issue 1; 10.1128/msphere.00532-24
Decomposition of the pangenome matrix reveals a structure in gene distribution in the Escherichia coli species
Siddharth M. Chauhan (Department of Bioengineering, University of California, San Diego, La Jolla, California, USA)
Omid Ardalani (Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kemitorvet, Kongens Lyngby, Denmark)
Jason C. Hyun (Department of Bioengineering, University of California, San Diego, La Jolla, California, USA)
Jonathan M. Monk (Department of Bioengineering, University of California, San Diego, La Jolla, California, USA)
Patrick V. Phaneuf (Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kemitorvet, Kongens Lyngby, Denmark)
Bernhard O. Palsson (Department of Bioengineering, University of California, San Diego, La Jolla, California, USA)
title Decomposition of the pangenome matrix reveals a structure in gene distribution in the Escherichia coli species
topic Shigella
Escherichia coli
genomics
typing
computational biology
genome analysis
url https://journals.asm.org/doi/10.1128/msphere.00532-24