BidCorpus: A multifaceted learning dataset for public procurement

Digital transformation has significantly impacted public procurement, improving operational efficiency, transparency, and competition. This transformation has allowed the automation of data analysis and oversight in public administration. Public procurement involves various stages and generates a mu...

Bibliographic Details
Main Authors: Weslley Lima, Victor Silva, Jasson Silva, Ricardo Lira, Anselmo Paiva
Format: Article
Language: English
Published: Elsevier 2025-02-01
Series: Data in Brief
Subjects: NLP, BERT, Weak supervision, Bidding notice
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340924011648
_version_ 1832576518468927488
author Weslley Lima
Victor Silva
Jasson Silva
Ricardo Lira
Anselmo Paiva
collection DOAJ
description Digital transformation has significantly impacted public procurement, improving operational efficiency, transparency, and competition. This transformation has allowed the automation of data analysis and oversight in public administration. Public procurement involves various stages and generates a multitude of documents. However, experts still analyze these unstructured textual documents manually, which is time-consuming and inefficient. To address this issue, we introduce BidCorpus, a novel and comprehensive dataset consisting of thousands of documents related to public procurement, specifically bidding notices from Brazilian public websites. The dataset was labeled using weak supervision techniques, manual labeling, and BERT-based language models. Models trained with these annotated data showed promising results, with metrics greater than 80% in various experiments. The models could also tolerate intentional changes made to bidding notices to evade fraud detection. All the resources from this work are publicly available, including the documents, pre-processing scripts, and training and evaluation of the models. We expect the dataset and its labels to be of great value to researchers working on public procurement problems.
format Article
id doaj-art-ee2df7dd3dc1400482e902dc0332b823
institution Kabale University
issn 2352-3409
language English
publishDate 2025-02-01
publisher Elsevier
record_format Article
series Data in Brief
spelling doaj-art-ee2df7dd3dc1400482e902dc0332b823; 2025-01-31T05:11:29Z; eng; Elsevier; Data in Brief; 2352-3409; 2025-02-01; 58; 111202
BidCorpus: A multifaceted learning dataset for public procurement
Weslley Lima (0), Victor Silva (1), Jasson Silva (2), Ricardo Lira (3), Anselmo Paiva (4)
0: Federal University of Piauí. Campus Universitário Ministro Petrônio Portella. Teresina, Piauí, Brazil; Corresponding author.
1: Federal University of Piauí. Campus Universitário Ministro Petrônio Portella. Teresina, Piauí, Brazil
2: Federal University of Piauí. Campus Universitário Ministro Petrônio Portella. Teresina, Piauí, Brazil
3: Federal University of Piauí. Campus Universitário Ministro Petrônio Portella. Teresina, Piauí, Brazil
4: Federal University of Maranhão. Av. dos Portugueses, 1966 São Luís, Maranhão, Brazil
http://www.sciencedirect.com/science/article/pii/S2352340924011648
NLP; BERT; Weak supervision; Bidding notice
title BidCorpus: A multifaceted learning dataset for public procurement
topic NLP
BERT
Weak supervision
Bidding notice
url http://www.sciencedirect.com/science/article/pii/S2352340924011648