Preserving information while respecting privacy through an information theoretic framework for synthetic health data generation

Abstract Generating synthetic data from medical records is a complex task intensified by patient privacy concerns. In recent years, multiple approaches have been reported for the generation of synthetic data, however, limited attention was given to jointly evaluate the quality and the privacy of the...

Full description

Saved in:
Bibliographic Details
Main Authors: Nadir Sella, Florent Guinot, Nikita Lagrange, Laurent-Philippe Albou, Jonathan Desponds, Hervé Isambert
Format: Article
Language:English
Published: Nature Portfolio 2025-01-01
Series:npj Digital Medicine
Online Access:https://doi.org/10.1038/s41746-025-01431-6
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Abstract Generating synthetic data from medical records is a complex task intensified by patient privacy concerns. In recent years, multiple approaches have been reported for the generation of synthetic data, however, limited attention was given to jointly evaluate the quality and the privacy of the generated data. The quality and privacy of synthetic data stem from multivariate associations across variables, which cannot be assessed by comparing univariate distributions with the original data. Here, we introduce a novel algorithm (MIIC-SDG) for generating synthetic data from electronic records based on a multivariate information framework and Bayesian network theory. We also propose a new metric to quantitatively assess the trade-off between the Quality and Privacy Scores (QPS) of synthetic data generation methods. The performance of MIIC-SDG is demonstrated on different clinical datasets and favorably compares with state-of-the-art synthetic data generation methods, based on the QPS trade-off between several quality and privacy metrics.
ISSN:2398-6352