Missing Categorical Data in Sociological Surveys: An Experimental Evaluation of Imputation Techniques
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Taras Shevchenko National University of Kyiv, 2025-06-01 |
| Series: | Соціологічні студії |
| Subjects: | |
| Online Access: | https://sociostudios.vnu.edu.ua/index.php/socio/article/view/417 |
| Summary: | Missing categorical data presents a persistent challenge to data quality in quantitative sociological research, where simpler approaches can lead to biased estimates and incorrect conclusions. This article provides an empirically grounded evaluation of multiple imputation (MI) strategies for categorical survey data, focusing on the complex, multi-category nominal variable "party voted for" in European Social Survey data from Sweden and Norway. We developed a simulation framework that introduces missingness under three mechanisms: Missing Completely at Random (MCAR); Missing at Random (MAR), derived from patterns of item nonresponse on auxiliary variables; and Missing Not at Random (MNAR), linked to the undisclosed party choice itself. We systematically compared the performance of six imputation methods (Multinomial Logistic Regression, Random Forest, Classification and Regression Trees (CART), K-Nearest Neighbors (KNN), Hot Deck, and Mode) across four distinct predictor set sizes, evaluating them with Accuracy, Cohen’s Kappa, and Macro F1-score using m = 20 imputations. Results indicate that while imputing party choice is challenging, model-based MI techniques significantly outperform naive approaches. Multinomial Logistic Regression consistently emerged as the most robust and highest-performing method, often benefiting from larger predictor sets within the MI framework. KNN showed promise with smaller predictor sets, offering a computationally efficient alternative. The work emphasizes the importance of principled imputation and provides practical recommendations for sociologists regarding method selection, predictor set construction, and consideration of computational costs when addressing missing categorical data. |
|---|---|
| ISSN: | 2306-3971 2521-1056 |
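
The evaluation loop described in the summary (mask known values, impute them, score the imputations against the truth) can be illustrated with a minimal Python sketch. This is not the authors' code: it covers only the MCAR mechanism and the Mode baseline, the party labels are hypothetical placeholders, and the helper names (`mask_mcar`, `impute_mode`, and the three metric functions) are assumptions introduced here for illustration.

```python
import random
from collections import Counter


def accuracy(true, pred):
    """Share of imputed values that match the true category."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def cohens_kappa(true, pred):
    """Agreement between true and imputed labels, corrected for chance."""
    n = len(true)
    p_o = accuracy(true, pred)
    true_counts, pred_counts = Counter(true), Counter(pred)
    # Expected agreement if both label sets were independent.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 0.0


def macro_f1(true, pred):
    """Unweighted mean of per-category F1 over all categories seen."""
    scores = []
    for c in sorted(set(true) | set(pred)):
        tp = sum(t == c and p == c for t, p in zip(true, pred))
        fp = sum(t != c and p == c for t, p in zip(true, pred))
        fn = sum(t == c and p != c for t, p in zip(true, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mask_mcar(values, rate, rng):
    """Pick indices to set missing, completely at random (MCAR)."""
    k = round(rate * len(values))
    return sorted(rng.sample(range(len(values)), k))


def impute_mode(values, missing_idx):
    """Mode baseline: fill every gap with the most frequent observed category."""
    missing = set(missing_idx)
    observed = [v for i, v in enumerate(values) if i not in missing]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode for _ in missing_idx]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical party labels; real data would come from the survey file.
    votes = rng.choices(["Party A", "Party B", "Party C"], weights=[5, 3, 2], k=500)
    idx = mask_mcar(votes, rate=0.2, rng=rng)
    truth = [votes[i] for i in idx]
    imputed = impute_mode(votes, idx)
    print("accuracy:", accuracy(truth, imputed))
    print("kappa:   ", cohens_kappa(truth, imputed))
    print("macro F1:", macro_f1(truth, imputed))
```

Because the Mode baseline assigns one category to every missing case, its Cohen's Kappa is zero by construction even when its accuracy looks acceptable, which is one reason to report chance-corrected and per-category metrics alongside accuracy for a multi-category variable.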