A National Synthetic Populations Dataset for the United States
Abstract: Geospatially explicit and statistically accurate person and household data allow researchers to study community- and neighborhood-level effects and design and test hypotheses that would otherwise not be possible without the generation of synthetic data. In this article, we demonstrate the wo...
Main Authors: | James Rineer, Nicholas Kruskamp, Caroline Kery, Kasey Jones, Rainer Hilscher, Georgiy Bobashev |
---|---|
Format: | Article |
Language: | English |
Published: | Nature Portfolio, 2025-01-01 |
Series: | Scientific Data |
Online Access: | https://doi.org/10.1038/s41597-025-04380-7 |
_version_ | 1832586097594466304 |
---|---|
author | James Rineer, Nicholas Kruskamp, Caroline Kery, Kasey Jones, Rainer Hilscher, Georgiy Bobashev |
author_facet | James Rineer, Nicholas Kruskamp, Caroline Kery, Kasey Jones, Rainer Hilscher, Georgiy Bobashev |
author_sort | James Rineer |
collection | DOAJ |
description | Abstract: Geospatially explicit and statistically accurate person and household data allow researchers to study community- and neighborhood-level effects and design and test hypotheses that would otherwise not be possible without the generation of synthetic data. In this article, we demonstrate the workflow for generating spatially explicit household- and individual-level synthetic populations for the United States representing the year 2019. We use publicly available U.S. Census American Community Survey (ACS) 5-year estimates from the 2015–2019 ACS. We use Iterative Proportional Fitting (IPF) to create our synthetic population and use the resulting joint counts to sample representative households and people directly from microdata. Our dataset contains records for 120,754,708 households and 303,128,287 individuals across the United States. We spatially allocate households using the Environmental Protection Agency (EPA) Integrated Climate and Land Use Scenarios (ICLUS) project household distribution estimates to create a spatially explicit dataset. Our validation shows strong correlation with original census variables, with many categories reporting a greater than 0.99 Pearson’s r correlation coefficient. |
format | Article |
id | doaj-art-ef24b02cff3e407fa21843d42e1e70e3 |
institution | Kabale University |
issn | 2052-4463 |
language | English |
publishDate | 2025-01-01 |
publisher | Nature Portfolio |
record_format | Article |
series | Scientific Data |
title | A National Synthetic Populations Dataset for the United States |
url | https://doi.org/10.1038/s41597-025-04380-7 |
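
The abstract in this record names Iterative Proportional Fitting (IPF) as the method used to produce joint counts from marginal totals. As a rough illustration only, the following is a minimal two-dimensional IPF sketch in Python with NumPy. It is not the authors' implementation: the function name `ipf`, the seed table, and the marginal targets are all hypothetical values chosen for the example, and the real workflow fits ACS tables with many more dimensions.

```python
import numpy as np


def ipf(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Fit a 2-D table to target row and column totals (hypothetical helper).

    Assumes every seed cell is strictly positive, so no row or column sum
    is zero, and that both sets of targets add up to the same grand total.
    """
    table = seed.astype(float).copy()
    for _ in range(max_iter):
        # Scale each row so that it sums to its target total.
        table *= row_targets[:, None] / table.sum(axis=1, keepdims=True)
        # Scale each column so that it sums to its target total.
        table *= col_targets[None, :] / table.sum(axis=0, keepdims=True)
        # Columns now match exactly; stop once the rows match as well.
        if np.allclose(table.sum(axis=1), row_targets, atol=tol):
            break
    return table


# Illustrative inputs only (not taken from the article): a 2x2 seed of
# joint counts and marginal totals that each sum to 100.
seed = np.array([[1.0, 2.0], [3.0, 4.0]])
row_targets = np.array([40.0, 60.0])
col_targets = np.array([30.0, 70.0])

fitted = ipf(seed, row_targets, col_targets)
print(fitted.sum(axis=1))  # row totals, close to row_targets
print(fitted.sum(axis=0))  # column totals, close to col_targets
```

The fitted cells keep the interaction pattern of the seed while matching both sets of marginals; per the abstract, joint counts of this kind are then used to sample households and people from microdata.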