A National Synthetic Populations Dataset for the United States

Abstract Geospatially explicit and statistically accurate person and household data allow researchers to study community- and neighborhood-level effects and design and test hypotheses that would otherwise not be possible without the generation of synthetic data. In this article, we demonstrate the wo...

Bibliographic Details
Main Authors: James Rineer, Nicholas Kruskamp, Caroline Kery, Kasey Jones, Rainer Hilscher, Georgiy Bobashev
Format: Article
Language: English
Published: Nature Portfolio 2025-01-01
Series: Scientific Data
Online Access: https://doi.org/10.1038/s41597-025-04380-7
author James Rineer
Nicholas Kruskamp
Caroline Kery
Kasey Jones
Rainer Hilscher
Georgiy Bobashev
collection DOAJ
description Abstract Geospatially explicit and statistically accurate person and household data allow researchers to study community- and neighborhood-level effects and to design and test hypotheses that would not be possible without synthetic data. In this article, we demonstrate the workflow for generating spatially explicit household- and individual-level synthetic populations for the United States representing the year 2019. We use publicly available U.S. Census American Community Survey (ACS) 5-year estimates for 2015–2019. We use Iterative Proportional Fitting (IPF) to create our synthetic population and use the resulting joint counts to sample representative households and people directly from microdata. Our dataset contains records for 120,754,708 households and 303,128,287 individuals across the United States. We spatially allocate households using the Environmental Protection Agency (EPA) Integrated Climate and Land Use Scenarios (ICLUS) project household distribution estimates to create a spatially explicit dataset. Our validation shows strong correlation with original census variables, with many categories reporting a Pearson’s r correlation coefficient greater than 0.99.
format Article
id doaj-art-ef24b02cff3e407fa21843d42e1e70e3
institution Kabale University
issn 2052-4463
language English
publishDate 2025-01-01
publisher Nature Portfolio
record_format Article
series Scientific Data
spelling doaj-art-ef24b02cff3e407fa21843d42e1e70e3 | 2025-01-26T12:14:41Z | eng | Nature Portfolio | Scientific Data | 2052-4463 | 2025-01-01 | 12 | 1 | 1 | 14 | 10.1038/s41597-025-04380-7 | A National Synthetic Populations Dataset for the United States | James Rineer (RTI International), Nicholas Kruskamp (RTI International), Caroline Kery (RTI International), Kasey Jones (RTI International), Rainer Hilscher (RTI International), Georgiy Bobashev (RTI International) | https://doi.org/10.1038/s41597-025-04380-7
title A National Synthetic Populations Dataset for the United States
url https://doi.org/10.1038/s41597-025-04380-7
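For orientation, the Iterative Proportional Fitting (IPF) step named in the abstract can be sketched as below. This is a minimal two-variable illustration and not the authors' pipeline: the function name `ipf`, the seed table, and the marginal totals are all made-up assumptions, and a real workflow would fit many more variables per geography. The sketch rescales a seed cross-tabulation (as might be tallied from microdata) until its row and column totals match target margins, then compares fitted and target margins with Pearson's r, the statistic the abstract uses for validation.

```python
import numpy as np


def ipf(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Scale a 2-D seed table until its margins match the target totals.

    Hypothetical helper for illustration; assumes both target vectors
    sum to the same grand total.
    """
    table = np.asarray(seed, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        # Row step: rescale each row so it sums to its target.
        row_sums = table.sum(axis=1, keepdims=True)
        table *= np.divide(row_targets[:, None], row_sums,
                           out=np.zeros_like(row_sums), where=row_sums > 0)
        # Column step: rescale each column so it sums to its target.
        col_sums = table.sum(axis=0, keepdims=True)
        table *= np.divide(col_targets[None, :], col_sums,
                           out=np.zeros_like(col_sums), where=col_sums > 0)
        # Columns now match exactly; stop once the rows match as well.
        if np.abs(table.sum(axis=1) - row_targets).max() < tol:
            break
    return table


# Made-up seed counts (e.g. household size by income band in a sample).
seed = np.array([[8, 4, 2],
                 [5, 9, 3],
                 [1, 6, 7]])
# Made-up target margins (e.g. published totals for one area).
row_targets = np.array([120.0, 200.0, 180.0])
col_targets = np.array([150.0, 210.0, 140.0])

fitted = ipf(seed, row_targets, col_targets)

# Validation in the spirit of the abstract: Pearson's r between the
# fitted margins and the target margins.
fitted_margins = np.concatenate([fitted.sum(axis=1), fitted.sum(axis=0)])
target_margins = np.concatenate([row_targets, col_targets])
pearson_r = np.corrcoef(fitted_margins, target_margins)[0, 1]
print(fitted.round(2))
print(pearson_r)
```

The fitted cells are fractional joint counts. Under the same assumptions, one would round or otherwise integerize them and then draw that many matching household records from the microdata for each cell.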