Comprehensive Review of Privacy, Utility, and Fairness Offered by Synthetic Data

Bibliographic Details
Main Authors: A. Kiran, P. Rubini, S. Saravana Kumar
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10847835/
Description
Summary: Automation is the core transformation strategy that every industry wants on its roadmap today. Artificial Intelligence (AI) and Machine Learning (ML) are the key components of automation, and they are increasingly used both in data analysis and in building predictive models from data. Growing privacy concerns, data confidentiality requirements, and disclosure risks have made it harder to access the right, meaningful data. Research has produced several privacy-preserving and disclosure-limiting techniques; one of them is synthetic data. Early research efforts have shown that synthetic data is an effective substitute for real data and can be used to train AI and ML models. However, this needs a comprehensive evaluation before a data user can be confident that it is indeed a good substitute for real data. In this paper, we look at three main parameters that together should provide a holistic assessment of the quality of synthetic data: first, how well it preserves privacy and controls disclosure; second, how good its utility is; and third, whether it gives fair, unbiased results when used in machine learning. We review the existing literature on disclosure-limiting methods, synthetic data generators, and validation methodologies and evaluation techniques. We examine how the privacy, utility, and fairness of synthetic data interact with each other and identify areas for future work.
ISSN: 2169-3536