Semi-automatic construction of heterogeneous data schema based on structure and context-aware recommendation

Abstract Customizing the structure and format of scientific data facilitates the publication of diverse and heterogeneous data. Many data publishing platforms empower users to create self-designed schemas, leading to schema proliferation and more intricate creation processes. To address these challe...

Bibliographic Details
Main Authors: Nan Yin, Junheng Liang, Xi Guo, Xue Jiang, Jie He, Xiaotong Zhang
Format: Article
Language: English
Published: Nature Portfolio 2025-02-01
Series: Scientific Data
Online Access: https://doi.org/10.1038/s41597-024-04196-x
author Nan Yin
Junheng Liang
Xi Guo
Xue Jiang
Jie He
Xiaotong Zhang
collection DOAJ
description Abstract Customizing the structure and format of scientific data facilitates the publication of diverse and heterogeneous data. Many data publishing platforms empower users to create self-designed schemas, leading to schema proliferation and more intricate creation processes. To address these challenges, we present a semi-automatic method and system for constructing heterogeneous material data schemas based on structure and context-aware recommendation. We propose a schema fragment tree structure to represent data schemas with hierarchical relationships, transforming the recommendation into subtree matching. Fragment index and semantic search techniques are introduced to identify candidate fragments, and a tree edit distance algorithm calculates similarity scores. Evaluated on the Data Schema Construction System, the algorithm outperforms the TF-IDF and BM25 schema-matching baselines in precision, recall, and F1-score. For workload reduction, the baseline is the effort required to create schemas without recommendation. Our recommendation improves schema creation efficiency by 50.5% and reduces schema proliferation by 16.5%.
format Article
id doaj-art-c6f9c16615b040c9a69eafd0e832bb07
institution Kabale University
issn 2052-4463
language English
publishDate 2025-02-01
publisher Nature Portfolio
record_format Article
series Scientific Data
spelling doaj-art-c6f9c16615b040c9a69eafd0e832bb07
2025-02-02T12:08:20Z
eng
Nature Portfolio
Scientific Data
2052-4463
2025-02-01
121114
10.1038/s41597-024-04196-x
Semi-automatic construction of heterogeneous data schema based on structure and context-aware recommendation
Nan Yin (School of Computer and Communication Engineering, University of Science and Technology Beijing)
Junheng Liang (School of Computer and Communication Engineering, University of Science and Technology Beijing)
Xi Guo (School of Computer and Communication Engineering, University of Science and Technology Beijing)
Xue Jiang (Beijing Advanced Innovation Center for Materials Genome Engineering, Institute for Advanced Materials and Technology, University of Science and Technology Beijing)
Jie He (School of Computer and Communication Engineering, University of Science and Technology Beijing)
Xiaotong Zhang (School of Computer and Communication Engineering, University of Science and Technology Beijing)
title Semi-automatic construction of heterogeneous data schema based on structure and context-aware recommendation
url https://doi.org/10.1038/s41597-024-04196-x
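
Illustration (not part of the record): the abstract describes representing schemas as fragment trees and scoring candidates with a tree edit distance. The sketch below shows one way such a score could be computed. It uses a simple Selkow-style top-down edit distance on ordered trees, which is an assumption and not necessarily the algorithm used in the article. The Node class, the field labels, and the normalization by combined tree size are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One field in a schema fragment tree (hypothetical structure)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def tree_distance(a: Node, b: Node) -> int:
    """Selkow-style top-down edit distance between two ordered trees.

    Relabeling a node costs 1; inserting or deleting a whole subtree
    costs the number of nodes in that subtree.
    """
    relabel = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)
    # d[i][j]: cost of turning the first i children of a into the first j of b.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),   # delete subtree
                d[i][j - 1] + size(b.children[j - 1]),   # insert subtree
                d[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + d[m][n]


def similarity(a: Node, b: Node) -> float:
    """Map the distance to a score in (0, 1]; 1.0 means identical trees."""
    return 1.0 - tree_distance(a, b) / (size(a) + size(b))


# Hypothetical partial schema and two candidate fragments.
query = Node("sample", [Node("composition"), Node("temperature")])
cand1 = Node("sample", [Node("composition"), Node("temperature"), Node("pressure")])
cand2 = Node("experiment", [Node("operator")])

# Rank candidates by similarity to the partial schema, best first.
ranked = sorted([cand1, cand2], key=lambda c: similarity(query, c), reverse=True)
```

In this toy case cand1 differs from the query by a single inserted field, so it ranks ahead of cand2. A real system would first narrow the candidate set (the abstract mentions a fragment index and semantic search) before computing any tree distance.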