A tabular data generation framework guided by downstream tasks optimization

Abstract Recently, generative models have been gradually emerging in the field of dataset extension, showcasing their advantages. However, when generating tabular data, these models often fail to satisfy the constraints of numerical columns, so they cannot generate high-quality datasets that...

Bibliographic Details
Main Authors: Fengwei Jia, Hongli Zhu, Fengyuan Jia, Xinyue Ren, Siqi Chen, Hongming Tan, Wai Kin Victor Chan
Format: Article
Language: English
Published: Nature Portfolio 2024-07-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-024-65777-9
author Fengwei Jia
Hongli Zhu
Fengyuan Jia
Xinyue Ren
Siqi Chen
Hongming Tan
Wai Kin Victor Chan
collection DOAJ
description Abstract Recently, generative models have been gradually emerging in the field of dataset extension, showcasing their advantages. However, when generating tabular data, these models often fail to satisfy the constraints of numerical columns, so they cannot generate high-quality datasets that accurately represent real-world data and suit the intended downstream applications. To address this challenge, we propose a tabular data generation framework guided by downstream task optimization (TDGGD). It incorporates three indicators into each time step of diffusion generation, using gradient optimization to align the generated fake data. Unlike the traditional strategy of separating the downstream task model from the upstream data synthesis model, TDGGD ensures that the generated data closely respects the column feasibility of the upstream real tabular data. For the downstream task, TDGGD prioritizes the utility of tabular data over solely pursuing statistical fidelity. Through extensive experiments on real-world tables with and without explicit column constraints, we demonstrate that TDGGD increases data volume while enhancing prediction accuracy. To the best of our knowledge, this is the first instance of deploying downstream information in a diffusion model framework.
id doaj-art-de8307375287441bb552987a10530cb0
institution Kabale University
issn 2045-2322
spelling doaj-art-de8307375287441bb552987a10530cb0
2025-01-26T12:35:16Z
eng
Nature Portfolio
Scientific Reports
2045-2322
2024-07-01
14
1
1
14
10.1038/s41598-024-65777-9
A tabular data generation framework guided by downstream tasks optimization
Fengwei Jia (0): Tsinghua Shenzhen International Graduate School, Tsinghua University
Hongli Zhu (1): Tsinghua Shenzhen International Graduate School, Tsinghua University
Fengyuan Jia (2): School of Mechanical Engineering, Anhui University of Technology
Xinyue Ren (3): Tsinghua Shenzhen International Graduate School, Tsinghua University
Siqi Chen (4): Tsinghua Shenzhen International Graduate School, Tsinghua University
Hongming Tan (5): Tsinghua Shenzhen International Graduate School, Tsinghua University
Wai Kin Victor Chan (6): Tsinghua Shenzhen International Graduate School, Tsinghua University
https://doi.org/10.1038/s41598-024-65777-9
title A tabular data generation framework guided by downstream tasks optimization