A probabilistic approach to visualize the effect of missing data on PCA in ancient human genomics

Bibliographic Details
Main Authors: Susanne Zabel, Samira Breitling, Cosimo Posth, Kay Nieselt
Format: Article
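Language: English
Published: BMC 2025-05-01
Series: BMC Genomics
Subjects: Ancient genomics; Missing data; Population genetics; Principal component analysis; SmartPCA; Uncertainty
Online Access: https://doi.org/10.1186/s12864-025-11728-1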
author Susanne Zabel
Samira Breitling
Cosimo Posth
Kay Nieselt
collection DOAJ
description Abstract
Background: Principal Component Analysis (PCA) is widely used in population genetics to visualize genetic relationships and population structure. In ancient genomics, genotype information may remain partly unresolved because ancient DNA is scarce and degraded. While methods like SmartPCA allow the projection of ancient samples despite missing data, they do not quantify projection uncertainty. The reliability of PCA projections for ancient genotype samples, which are often very sparse, is not well understood. Ignoring this uncertainty may lead to overconfident conclusions about the observed genetic relationships and population structure.
Results: This study systematically investigates the impact of missing loci on PCA projections using both simulated and real ancient human genotype data. Through extensive simulations with high-coverage ancient samples, we demonstrate that increasing levels of missing data can lead to less accurate SmartPCA projections, highlighting the importance of considering uncertainty when interpreting PCA results from ancient samples. To address this, we developed a probabilistic framework to quantify the uncertainty in PCA projections due to missing data. By applying our methodology to modern and ancient West Eurasian genotype samples from the Allen Ancient DNA Resource database, we show high concordance between our predicted projection and empirically derived distributions. Applying this framework to real-world data, we demonstrate its utility in predicting and visualizing embedding uncertainties for ancient samples of varying SNP coverage.
Conclusion: Our results emphasize the importance of accounting for projection uncertainty in ancient population studies. We therefore make our probabilistic model available through TrustPCA, a user-friendly web tool that provides researchers with uncertainty estimates alongside PCA projections, facilitating data exploration in ancient human genomic studies and enhancing transparency in data quality reporting.
format Article
id doaj-art-8a5200f858c84d3ea71917c6b151060f
institution DOAJ
issn 1471-2164
language English
publishDate 2025-05-01
publisher BMC
record_format Article
series BMC Genomics
spelling doaj-art-8a5200f858c84d3ea71917c6b151060f
2025-08-20T03:22:01Z
eng
BMC
BMC Genomics
1471-2164
2025-05-01
Volume 26, Issue 1, Pages 1-14
10.1186/s12864-025-11728-1
A probabilistic approach to visualize the effect of missing data on PCA in ancient human genomics
Susanne Zabel (Institute for Bioinformatics and Medical Informatics, University of Tübingen)
Samira Breitling (Institute for Bioinformatics and Medical Informatics, University of Tübingen)
Cosimo Posth (Archaeo- and Palaeogenetics, Institute for Archaeological Sciences, Department of Geosciences, University of Tübingen)
Kay Nieselt (Institute for Bioinformatics and Medical Informatics, University of Tübingen)
https://doi.org/10.1186/s12864-025-11728-1
Ancient genomics
Missing data
Population genetics
Principal component analysis
SmartPCA
Uncertainty
title A probabilistic approach to visualize the effect of missing data on PCA in ancient human genomics
topic Ancient genomics
Missing data
Population genetics
Principal component analysis
SmartPCA
Uncertainty
url https://doi.org/10.1186/s12864-025-11728-1