Decomposition of the pangenome matrix reveals a structure in gene distribution in the Escherichia coli species
ABSTRACT Thousands of complete genome sequences for strains of a species that are now available enable the advancement of pangenome analytics to a new level of sophistication. We collected 2,377 publicly available complete genomes of Escherichia coli for detailed pangenome analysis. The core genome...
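Main Authors: | Siddharth M. Chauhan, Omid Ardalani, Jason C. Hyun, Jonathan M. Monk, Patrick V. Phaneuf, Bernhard O. Palsson |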
---|---|
Format: | Article |
Language: | English |
Published: | American Society for Microbiology 2025-01-01 |
Series: | mSphere |
Subjects: | Shigella; Escherichia coli; genomics; typing; computational biology; genome analysis |
Online Access: | https://journals.asm.org/doi/10.1128/msphere.00532-24 |
_version_ | 1832583393776238592 |
---|---|
author | Siddharth M. Chauhan Omid Ardalani Jason C. Hyun Jonathan M. Monk Patrick V. Phaneuf Bernhard O. Palsson |
author_sort | Siddharth M. Chauhan |
collection | DOAJ |
description | ABSTRACT: Thousands of complete genome sequences for strains of a species that are now available enable the advancement of pangenome analytics to a new level of sophistication. We collected 2,377 publicly available complete genomes of Escherichia coli for detailed pangenome analysis. The core genome and accessory genomes consisted of 2,398 and 5,182 genes, respectively. We developed a machine learning approach to define the accessory genes characterizing the major phylogroups of E. coli plus Shigella: A, B1, B2, C, D, E, F, G, and Shigella. The analysis resulted in a detailed structure of the genetic basis of the phylogroups’ differential traits. This pangenome structure was largely consistent with a housekeeping-gene-based MLST distribution, sequence-based Mash distance, and the Clermont quadruplex classification. The rare genome (consisting of genes found in <6.8% of all strains) consisted of 163,619 genes, about 79% of which represented variations of 315 underlying transposon elements. This analysis generated a mathematical definition of the genetic basis for a species. IMPORTANCE: The comprehensive analysis of the pangenome of Escherichia coli presented in this study marks a significant advancement in understanding bacterial genetic diversity. By employing machine learning techniques to analyze 2,377 complete E. coli genomes, the study provides a detailed mapping of core, accessory, and rare genes. This approach reveals the genetic basis for differential traits across phylogroups, offering insights into pathogenicity, antibiotic resistance, and evolutionary adaptations. The findings enhance the potential for genome-based diagnostics and pave the way for future studies aimed at achieving a global genetic definition of bacterial phylogeny. |
format | Article |
id | doaj-art-21605b607a104dcd9e69b8bb498a7676 |
institution | Kabale University |
issn | 2379-5042 |
language | English |
publishDate | 2025-01-01 |
publisher | American Society for Microbiology |
record_format | Article |
series | mSphere |
spelling | doaj-art-21605b607a104dcd9e69b8bb498a7676; 2025-01-28T14:00:56Z; eng; American Society for Microbiology; mSphere; 2379-5042; 2025-01-01; 10; 1; 10.1128/msphere.00532-24; Decomposition of the pangenome matrix reveals a structure in gene distribution in the Escherichia coli species; Siddharth M. Chauhan (Department of Bioengineering, University of California, San Diego, La Jolla, California, USA); Omid Ardalani (Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kemitorvet, Kongens, Lyngby, Denmark); Jason C. Hyun (Department of Bioengineering, University of California, San Diego, La Jolla, California, USA); Jonathan M. Monk (Department of Bioengineering, University of California, San Diego, La Jolla, California, USA); Patrick V. Phaneuf (Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kemitorvet, Kongens, Lyngby, Denmark); Bernhard O. Palsson (Department of Bioengineering, University of California, San Diego, La Jolla, California, USA); https://journals.asm.org/doi/10.1128/msphere.00532-24; Shigella; Escherichia coli; genomics; typing; computational biology; genome analysis |
title | Decomposition of the pangenome matrix reveals a structure in gene distribution in the Escherichia coli species |
topic | Shigella Escherichia coli genomics typing computational biology genome analysis |
url | https://journals.asm.org/doi/10.1128/msphere.00532-24 |
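The abstract defines the rare genome by a frequency cutoff (genes found in <6.8% of all strains). A minimal sketch of how such a frequency-based split of a gene presence/absence matrix might look is given below. It is an illustration only, not the authors' method: the function name `partition_genes`, the toy matrix, and the core cutoff of 99% are assumptions that do not come from this record. Only the 6.8% rare cutoff is taken from the abstract.

```python
from typing import Dict, List

RARE_CUTOFF = 0.068  # from the abstract: rare genes occur in <6.8% of strains
CORE_CUTOFF = 0.99   # ASSUMPTION: the record does not state the core cutoff


def partition_genes(
    presence: Dict[str, List[int]],
    rare_cutoff: float = RARE_CUTOFF,
    core_cutoff: float = CORE_CUTOFF,
) -> Dict[str, List[str]]:
    """Split genes into core, accessory, and rare by strain frequency.

    `presence` maps a gene name to a 0/1 list with one entry per strain.
    """
    groups: Dict[str, List[str]] = {"core": [], "accessory": [], "rare": []}
    for gene, row in presence.items():
        freq = sum(row) / len(row)  # fraction of strains carrying the gene
        if freq < rare_cutoff:
            groups["rare"].append(gene)
        elif freq >= core_cutoff:
            groups["core"].append(gene)
        else:
            groups["accessory"].append(gene)
    return groups


# Toy matrix with 20 hypothetical strains (not data from the study).
toy = {
    "geneA": [1] * 20,           # present in 100% of strains
    "geneB": [1] * 8 + [0] * 12, # present in 40% of strains
    "geneC": [1] + [0] * 19,     # present in 5% of strains
}
result = partition_genes(toy)
print(result)
```

With the toy input, `geneA` falls in the core group, `geneB` in the accessory group, and `geneC` in the rare group. Changing `core_cutoff` moves the core/accessory boundary without affecting the rare split.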