Comprehensive Review of Privacy, Utility, and Fairness Offered by Synthetic Data

Automation is the core transformation strategy that every industry wants on its roadmap today. Artificial Intelligence (AI) and Machine Learning (ML) are the key components of automation and are increasingly used both for data analysis and for building predictive models from data. Growing priva...

Bibliographic Details
Main Authors: A. Kiran, P. Rubini, S. Saravana Kumar
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
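Subjects: Artificial intelligence; machine learning; synthetic data; statistical disclosure control; differential privacy; privacy enhancing technology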
Online Access: https://ieeexplore.ieee.org/document/10847835/
author A. Kiran
P. Rubini
S. Saravana Kumar
collection DOAJ
description Automation is the core transformation strategy that every industry wants on its roadmap today. Artificial Intelligence (AI) and Machine Learning (ML) are the key components of automation and are increasingly used both for data analysis and for building predictive models from data. Growing privacy concerns, data confidentiality requirements, and disclosure risks have made it difficult to access the right, meaningful data. Research has produced several privacy-preserving and disclosure-limiting techniques; one such disclosure-limiting technique is synthetic data. Early research efforts have shown that synthetic data is an effective substitute for real data and can be used to train AI and ML models. However, a comprehensive evaluation is needed before a data user can be confident that it is indeed a good substitute for real data. In this paper, we look at three main parameters that together should provide a holistic assessment of the quality of synthetic data: first and foremost, how well synthetic data preserves privacy and controls disclosure; second, how good its utility is; and third, whether it gives fair, unbiased results when used in machine learning. We review the existing literature on disclosure control methods, synthetic data generators, and the associated validation methodologies and evaluation techniques. We examine how the privacy, utility, and fairness of synthetic data interact with each other and identify areas for future work.
format Article
id doaj-art-2c5cadddbe3a43d79c2f45e5d5f5956c
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-2c5cadddbe3a43d79c2f45e5d5f5956c; 2025-01-29T00:00:51Z; eng; IEEE; IEEE Access; 2169-3536; 2025-01-01; vol. 13, pp. 15795-15811; doi:10.1109/ACCESS.2025.3532128; 10847835
Comprehensive Review of Privacy, Utility, and Fairness Offered by Synthetic Data
A. Kiran (https://orcid.org/0000-0002-4574-6688), Department of CSE, SOET, CMR University, Bengaluru, Karnataka, India
P. Rubini, Department of CSE, SOET, CMR University, Bengaluru, Karnataka, India
S. Saravana Kumar (https://orcid.org/0000-0001-5679-2367), Department of IT/PG School, SOET, CMR University, Bengaluru, Karnataka, India
https://ieeexplore.ieee.org/document/10847835/
title Comprehensive Review of Privacy, Utility, and Fairness Offered by Synthetic Data
topic Artificial intelligence
machine learning
synthetic data
statistical disclosure control
differential privacy
privacy enhancing technology
url https://ieeexplore.ieee.org/document/10847835/