How much missing data is too much to impute for longitudinal health indicators? A preliminary guideline for the choice of the extent of missing proportion to impute with multiple imputation by chained equations

Abstract Background The multiple imputation by chained equations (MICE) is a widely used approach for handling missing data. However, its robustness, especially for high missing proportions in health indicators, is under-researched. The study aimed to provide a preliminary guideline for the choice o...


Bibliographic Details
Main Authors: K. P. Junaid, Tanvi Kiran, Madhu Gupta, Kamal Kishore, Sujata Siwatch
Format: Article
Language: English
Published: BMC 2025-02-01
Series: Population Health Metrics
Subjects:
Online Access: https://doi.org/10.1186/s12963-025-00364-2
_version_ 1832571488860897280
author K. P. Junaid
Tanvi Kiran
Madhu Gupta
Kamal Kishore
Sujata Siwatch
author_facet K. P. Junaid
Tanvi Kiran
Madhu Gupta
Kamal Kishore
Sujata Siwatch
author_sort K. P. Junaid
collection DOAJ
description Abstract. Background: Multiple imputation by chained equations (MICE) is a widely used approach for handling missing data. However, its robustness, especially at high missing proportions in health indicators, is under-researched. The study aimed to provide a preliminary guideline on how large a missing proportion can be imputed in longitudinal health-related data using the MICE method. Methods: The study obtained complete data on five mortality-related health indicators for 100 countries (2015–2019) from the Global Health Observatory. Nine incomplete datasets with missing rates from 10% to 90% were generated and imputed using MICE. The robustness of MICE was assessed through three approaches: comparison of means using repeated-measures analysis of variance (RM-ANOVA), estimation of evaluation metrics (root mean square error, mean absolute deviation, bias, and proportionate variance), and visual inspection of box plots of imputed and non-imputed data. Results: The RM-ANOVA revealed significant differences between complete and imputed data, primarily in imputed data with over 50% missing proportions. Evaluation metrics exhibited ‘high performance’ for the dataset with a 50% missing proportion for various health indicators. However, with missing proportions exceeding 70%, the majority of indicators demonstrated a ‘low’ performance level on most evaluation metrics. Visual inspection of the box plots revealed severe variance shrinkage in imputed datasets with missing proportions beyond 70%, corroborating the findings from the evaluation metrics. Conclusion: MICE demonstrates high robustness up to 50% missing values, with marginal deviations from complete datasets. Caution is warranted for missing proportions between 50% and 70%, as moderate alterations are observed. Proportions beyond 70% lead to significant variance shrinkage and compromised data reliability, emphasizing the importance of acknowledging imputation limitations for practical decision-making.
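The evaluation workflow summarized in the abstract (remove values at increasing rates, impute, then compare the result against the complete data) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' code: the data are synthetic, the missingness is completely at random, `mean_impute` is a deliberately simple stand-in rather than MICE, and the metric formulas (in particular `variance_ratio` as a reading of "proportionate variance") are plausible definitions that the record itself does not spell out.

```python
import math
import random
import statistics


def ampute(values, missing_rate, rng):
    """Return a copy of `values` with a fixed share of entries set to None (MCAR)."""
    n_missing = round(len(values) * missing_rate)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def mean_impute(values):
    """Placeholder imputer (NOT MICE): fill every gap with the observed mean."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


def evaluate(complete, imputed):
    """Compare an imputed series with the complete series it was derived from.

    The formulas below are assumed definitions, not taken from the article.
    """
    errors = [i - c for c, i in zip(complete, imputed)]
    return {
        "rmse": math.sqrt(statistics.fmean(e * e for e in errors)),
        "mad": statistics.fmean(abs(e) for e in errors),
        "bias": statistics.fmean(errors),
        "variance_ratio": statistics.pvariance(imputed) / statistics.pvariance(complete),
    }


rng = random.Random(0)
complete = [rng.gauss(50.0, 10.0) for _ in range(500)]  # synthetic indicator values

results = {}
for pct in range(10, 100, 10):  # 10%, 20%, ..., 90% missing
    incomplete = ampute(complete, pct / 100, rng)
    results[pct] = evaluate(complete, mean_impute(incomplete))

for pct, metrics in results.items():
    print(pct, {name: round(value, 3) for name, value in metrics.items()})
```

With this stand-in imputer the variance ratio falls as the missing rate rises, which is the kind of variance shrinkage the abstract reports at high missing proportions. Swapping `mean_impute` for a real chained-equations imputer would leave the amputation loop and the metrics unchanged.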
format Article
id doaj-art-ff5448c783f3496798262ae48657d3b4
institution Kabale University
issn 1478-7954
language English
publishDate 2025-02-01
publisher BMC
record_format Article
series Population Health Metrics
spelling doaj-art-ff5448c783f3496798262ae48657d3b4; 2025-02-02T12:36:07Z; eng; BMC; Population Health Metrics; 1478-7954; 2025-02-01; 231116; 10.1186/s12963-025-00364-2; How much missing data is too much to impute for longitudinal health indicators? A preliminary guideline for the choice of the extent of missing proportion to impute with multiple imputation by chained equations; K. P. Junaid, Tanvi Kiran, Madhu Gupta, Kamal Kishore, Sujata Siwatch; affiliations (in author order): Department of Community Medicine and School of Public Health, Postgraduate Institute of Medical Education and Research (PGIMER); Department of Community Medicine and School of Public Health, Postgraduate Institute of Medical Education and Research (PGIMER); Department of Community Medicine and School of Public Health, Postgraduate Institute of Medical Education and Research (PGIMER); Department of Biostatistics, Postgraduate Institute of Medical Education and Research; Department of Obstetrics and Gynaecology, Postgraduate Institute of Medical Education and Research; Abstract. Background: Multiple imputation by chained equations (MICE) is a widely used approach for handling missing data. However, its robustness, especially at high missing proportions in health indicators, is under-researched. The study aimed to provide a preliminary guideline on how large a missing proportion can be imputed in longitudinal health-related data using the MICE method. Methods: The study obtained complete data on five mortality-related health indicators for 100 countries (2015–2019) from the Global Health Observatory. Nine incomplete datasets with missing rates from 10% to 90% were generated and imputed using MICE. The robustness of MICE was assessed through three approaches: comparison of means using repeated-measures analysis of variance (RM-ANOVA), estimation of evaluation metrics (root mean square error, mean absolute deviation, bias, and proportionate variance), and visual inspection of box plots of imputed and non-imputed data. Results: The RM-ANOVA revealed significant differences between complete and imputed data, primarily in imputed data with over 50% missing proportions. Evaluation metrics exhibited ‘high performance’ for the dataset with a 50% missing proportion for various health indicators. However, with missing proportions exceeding 70%, the majority of indicators demonstrated a ‘low’ performance level on most evaluation metrics. Visual inspection of the box plots revealed severe variance shrinkage in imputed datasets with missing proportions beyond 70%, corroborating the findings from the evaluation metrics. Conclusion: MICE demonstrates high robustness up to 50% missing values, with marginal deviations from complete datasets. Caution is warranted for missing proportions between 50% and 70%, as moderate alterations are observed. Proportions beyond 70% lead to significant variance shrinkage and compromised data reliability, emphasizing the importance of acknowledging imputation limitations for practical decision-making.; https://doi.org/10.1186/s12963-025-00364-2; Multiple imputation; MICE; Missing proportion; RM-ANOVA; Mortality; Longitudinal data
spellingShingle K. P. Junaid
Tanvi Kiran
Madhu Gupta
Kamal Kishore
Sujata Siwatch
How much missing data is too much to impute for longitudinal health indicators? A preliminary guideline for the choice of the extent of missing proportion to impute with multiple imputation by chained equations
Population Health Metrics
Multiple imputation
MICE
Missing proportion
RM-ANOVA
Mortality
Longitudinal data
title How much missing data is too much to impute for longitudinal health indicators? A preliminary guideline for the choice of the extent of missing proportion to impute with multiple imputation by chained equations
title_sort how much missing data is too much to impute for longitudinal health indicators a preliminary guideline for the choice of the extent of missing proportion to impute with multiple imputation by chained equations
topic Multiple imputation
MICE
Missing proportion
RM-ANOVA
Mortality
Longitudinal data
url https://doi.org/10.1186/s12963-025-00364-2