Semi-automatic construction of heterogeneous data schema based on structure and context-aware recommendation

Bibliographic Details
Main Authors:	Nan Yin, Junheng Liang, Xi Guo, Xue Jiang, Jie He, Xiaotong Zhang
Format:	Article
Language:	English
Published:	Nature Portfolio 2025-02-01
Series:	Scientific Data
Online Access:	https://doi.org/10.1038/s41597-024-04196-x

_version_	1832572030684233728
author	Nan Yin Junheng Liang Xi Guo Xue Jiang Jie He Xiaotong Zhang
collection	DOAJ
description	Customizing the structure and format of scientific data facilitates the publication of diverse and heterogeneous data. Many data publishing platforms let users create self-designed schemas, which leads to schema proliferation and makes schema creation more intricate. To address these challenges, we present a semi-automatic method and system for constructing heterogeneous material data schemas based on structure and context-aware recommendation. We propose a schema fragment tree structure to represent data schemas with hierarchical relationships, which turns recommendation into a subtree matching problem. Fragment index and semantic search techniques identify candidate fragments, and a tree edit distance algorithm calculates similarity scores. Evaluated on the Data Schema Construction System, the algorithm outperforms the TF-IDF and BM25 schema-matching baselines in precision, recall, and F1-score. Workload reduction is measured against the effort required to create schemas without recommendation. Our recommendation improves schema creation efficiency by 50.5% and reduces schema proliferation by 16.5%.
format	Article
id	doaj-art-c6f9c16615b040c9a69eafd0e832bb07
institution	Kabale University
issn	2052-4463
language	English
publishDate	2025-02-01
publisher	Nature Portfolio
record_format	Article
series	Scientific Data
spelling	doaj-art-c6f9c16615b040c9a69eafd0e832bb07; 2025-02-02T12:08:20Z; eng; Nature Portfolio; Scientific Data; 2052-4463; 2025-02-01; 12; 1; 1; 14; 10.1038/s41597-024-04196-x; Semi-automatic construction of heterogeneous data schema based on structure and context-aware recommendation; Nan Yin (0), Junheng Liang (1), Xi Guo (2), Xue Jiang (3), Jie He (4), Xiaotong Zhang (5); (0) School of Computer and Communication Engineering, University of Science and Technology Beijing; (1) School of Computer and Communication Engineering, University of Science and Technology Beijing; (2) School of Computer and Communication Engineering, University of Science and Technology Beijing; (3) Beijing Advanced Innovation Center for Materials Genome Engineering, Institute for Advanced Materials and Technology, University of Science and Technology Beijing; (4) School of Computer and Communication Engineering, University of Science and Technology Beijing; (5) School of Computer and Communication Engineering, University of Science and Technology Beijing
title	Semi-automatic construction of heterogeneous data schema based on structure and context-aware recommendation
url	https://doi.org/10.1038/s41597-024-04196-x
work_keys_str_mv	AT nanyin semiautomaticconstructionofheterogeneousdataschemabasedonstructureandcontextawarerecommendation AT junhengliang semiautomaticconstructionofheterogeneousdataschemabasedonstructureandcontextawarerecommendation AT xiguo semiautomaticconstructionofheterogeneousdataschemabasedonstructureandcontextawarerecommendation AT xuejiang semiautomaticconstructionofheterogeneousdataschemabasedonstructureandcontextawarerecommendation AT jiehe semiautomaticconstructionofheterogeneousdataschemabasedonstructureandcontextawarerecommendation AT xiaotongzhang semiautomaticconstructionofheterogeneousdataschemabasedonstructureandcontextawarerecommendation
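
The abstract describes representing schemas as fragment trees and scoring candidate fragments with a tree edit distance. The following is a minimal illustrative sketch of that idea, not the authors' implementation. It assumes a simplified top-down (Selkow-style) tree edit distance with unit costs, and the names `Node`, `tree_distance`, `similarity`, and `rank`, as well as the example field labels, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A schema fragment: a field label plus ordered child fragments."""
    label: str
    children: tuple = ()


def size(t: Node) -> int:
    """Number of nodes in the fragment tree."""
    return 1 + sum(size(c) for c in t.children)


def tree_distance(a: Node, b: Node) -> int:
    """Top-down tree edit distance with unit costs (assumed cost model).

    Roots are always mapped to each other (cost 1 if labels differ);
    child lists are aligned like strings, where inserting or deleting
    a child costs the size of its whole subtree.
    """
    relabel = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(a.children[i - 1]),   # delete subtree
                dp[i][j - 1] + size(b.children[j - 1]),   # insert subtree
                dp[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + dp[m][n]


def similarity(a: Node, b: Node) -> float:
    """Map distance to a score in (0, 1]; identical trees score 1.0."""
    return 1 - tree_distance(a, b) / (size(a) + size(b))


def rank(query: Node, candidates: list) -> list:
    """Order candidate fragments from most to least similar to the query."""
    return sorted(candidates, key=lambda c: similarity(query, c), reverse=True)


# Hypothetical fragments for illustration only.
query = Node("composition", (Node("element"), Node("fraction")))
cand_a = Node("composition", (Node("element"), Node("fraction"), Node("unit")))
cand_b = Node("process", (Node("temperature"),))

for cand in rank(query, [cand_b, cand_a]):
    print(cand.label, round(similarity(query, cand), 3))
```

In this sketch the candidate set is passed in directly; in the system described by the abstract, candidates would first be narrowed by the fragment index and semantic search before any tree comparison is made.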

Similar Items