On the Readiness of Scientific Data Papers for a Fair and Transparent Use in Machine Learning

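Abstract To ensure the fairness and trustworthiness of machine learning (ML) systems, recent legislative initiatives and relevant research in the ML community have pointed out the need to document the data used to train ML models. Besides, data-sharing practices in many scientific domains have evolved in recent years for reproducibility purposes. In this sense, academic institutions’ adoption of these practices has encouraged researchers to publish their data and technical documentation in peer-reviewed publications such as data papers. In this study, we analyze how this broader scientific data documentation meets the needs of the ML community and regulatory bodies for its use in ML technologies. We examine a sample of 4041 data papers of different domains, assessing their coverage and trends in the requested dimensions and comparing them to those from an ML-focused venue (NeurIPS D&B), which publishes papers describing datasets. As a result, we propose a set of recommendation guidelines for data creators and scientific data publishers to increase their data’s preparedness for its transparent and fairer use in ML technologies.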

Bibliographic Details
Main Authors:	Joan Giner-Miguelez, Abel Gómez, Jordi Cabot
Format:	Article
Language:	English
Published:	Nature Portfolio 2025-01-01
Series:	Scientific Data
Online Access:	https://doi.org/10.1038/s41597-025-04402-4

collection	DOAJ
id	doaj-art-d938c26327a34579bf5f2cdb571a9b2e
institution	Kabale University
issn	2052-4463
affiliations	Joan Giner-Miguelez and Abel Gómez: Internet Interdisciplinary Institute (IN3), Universitat Oberta de Catalunya (UOC); Jordi Cabot: Luxembourg Institute of Science and Technology
citation	Scientific Data, volume 12, issue 1, pages 1-16 (2025-01-01)
