A National Synthetic Populations Dataset for the United States

Bibliographic Details
Main Authors: James Rineer, Nicholas Kruskamp, Caroline Kery, Kasey Jones, Rainer Hilscher, Georgiy Bobashev
Format: Article
Language: English
Published: Nature Portfolio 2025-01-01
Series: Scientific Data
Online Access: https://doi.org/10.1038/s41597-025-04380-7
Description
Summary: Geospatially explicit and statistically accurate person and household data allow researchers to study community- and neighborhood-level effects and to design and test hypotheses that would not be possible without synthetic data. In this article, we demonstrate the workflow for generating spatially explicit household- and individual-level synthetic populations for the United States representing the year 2019. We use publicly available U.S. Census American Community Survey (ACS) 5-year estimates from the 2015–2019 ACS. We use Iterative Proportional Fitting (IPF) to create our synthetic population and use the resulting joint counts to sample representative households and people directly from microdata. Our dataset contains records for 120,754,708 households and 303,128,287 individuals across the United States. We spatially allocate households using the Environmental Protection Agency (EPA) Integrated Climate and Land Use Scenarios (ICLUS) project household distribution estimates to create a spatially explicit dataset. Our validation shows strong correlation with the original census variables, with many categories reporting a Pearson's r correlation coefficient greater than 0.99.
ISSN: 2052-4463
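
Illustration: the summary names Iterative Proportional Fitting (IPF) as the step that produces joint counts from published marginal totals. The following is a minimal two-dimensional sketch of that general technique, not the authors' implementation. The function name `ipf`, the variable names, and all numbers are hypothetical: the seed stands in for a cross-tabulation taken from microdata, and the targets stand in for marginal totals for a single geography.

```python
import numpy as np


def ipf(seed, row_targets, col_targets, atol=1e-6, max_iter=1000):
    """Scale a seed table until its row and column sums match the targets."""
    table = np.asarray(seed, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        # Row step: rescale each row to its target total (rows summing to 0 stay 0).
        row_sums = table.sum(axis=1, keepdims=True)
        table *= np.divide(row_targets[:, None], row_sums,
                           out=np.zeros_like(row_sums), where=row_sums > 0)
        # Column step: rescale each column to its target total.
        col_sums = table.sum(axis=0, keepdims=True)
        table *= np.divide(col_targets[None, :], col_sums,
                           out=np.zeros_like(col_sums), where=col_sums > 0)
        # Columns match exactly after the column step, so only rows need checking.
        if np.allclose(table.sum(axis=1), row_targets, atol=atol):
            break
    return table


# Hypothetical seed: household size (rows) by income band (columns).
seed = np.array([[10.0, 20.0],
                 [30.0, 40.0]])
row_targets = np.array([40.0, 60.0])  # made-up marginal totals, sum = 100
col_targets = np.array([45.0, 55.0])  # made-up marginal totals, sum = 100

fitted = ipf(seed, row_targets, col_targets)
print(fitted.round(2))
```

The fitted cells keep the seed's interaction pattern while reproducing both sets of marginal totals. In a workflow like the one described, such joint counts would then determine how many microdata households to draw for each cell; that sampling step is not shown here.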