BidCorpus: A multifaceted learning dataset for public procurement
Digital transformation has significantly impacted public procurement, improving operational efficiency, transparency, and competition. This transformation has allowed the automation of data analysis and oversight in public administration. Public procurement involves various stages and generates a mu...
Main Authors: | Weslley Lima, Victor Silva, Jasson Silva, Ricardo Lira, Anselmo Paiva |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2025-02-01 |
Series: | Data in Brief |
Subjects: | NLP; BERT; Weak supervision; Bidding notice |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2352340924011648 |
_version_ | 1832576518468927488 |
---|---|
author | Weslley Lima; Victor Silva; Jasson Silva; Ricardo Lira; Anselmo Paiva |
collection | DOAJ |
description | Digital transformation has significantly impacted public procurement, improving operational efficiency, transparency, and competition. This transformation has enabled the automation of data analysis and oversight in public administration. Public procurement involves various stages and generates a multitude of documents. However, experts analyze these unstructured textual documents manually, which is time-consuming and inefficient. To address this issue, we introduce BidCorpus, a novel and comprehensive dataset consisting of thousands of documents related to public procurement, specifically bidding notices from Brazilian public websites. The dataset was labeled using weak supervision techniques, manual labeling, and BERT-based language models. Models trained with these annotated data showed promising results, with metrics greater than 80% in various experiments. The models could also tolerate intentional changes made to bidding notices to evade fraud detection. All the resources from this work are publicly available, including the documents, pre-processing scripts, and the training and evaluation of the models. We expect the dataset and its labels to be of great value to researchers working on public procurement problems. |
format | Article |
id | doaj-art-ee2df7dd3dc1400482e902dc0332b823 |
institution | Kabale University |
issn | 2352-3409 |
language | English |
publishDate | 2025-02-01 |
publisher | Elsevier |
record_format | Article |
series | Data in Brief |
spelling | doaj-art-ee2df7dd3dc1400482e902dc0332b823; 2025-01-31T05:11:29Z; eng; Elsevier; Data in Brief; 2352-3409; 2025-02-01; 58; 111202; BidCorpus: A multifaceted learning dataset for public procurement; Weslley Lima (0), Victor Silva (1), Jasson Silva (2), Ricardo Lira (3), Anselmo Paiva (4); (0) Federal University of Piauí. Campus Universitário Ministro Petrônio Portella. Teresina, Piauí, Brazil; Corresponding author; (1) Federal University of Piauí. Campus Universitário Ministro Petrônio Portella. Teresina, Piauí, Brazil; (2) Federal University of Piauí. Campus Universitário Ministro Petrônio Portella. Teresina, Piauí, Brazil; (3) Federal University of Piauí. Campus Universitário Ministro Petrônio Portella. Teresina, Piauí, Brazil; (4) Federal University of Maranhão. Av. dos Portugueses, 1966 São Luís, Maranhão, Brazil; http://www.sciencedirect.com/science/article/pii/S2352340924011648; NLP; BERT; Weak supervision; Bidding notice |
title | BidCorpus: A multifaceted learning dataset for public procurement |
topic | NLP; BERT; Weak supervision; Bidding notice |
url | http://www.sciencedirect.com/science/article/pii/S2352340924011648 |