BidCorpus: A multifaceted learning dataset for public procurement

Bibliographic Details
Main Authors: Weslley Lima, Victor Silva, Jasson Silva, Ricardo Lira, Anselmo Paiva
Format: Article
Language: English
Published: Elsevier 2025-02-01
Series: Data in Brief
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340924011648
Description
Summary: Digital transformation has significantly impacted public procurement, improving operational efficiency, transparency, and competition. This transformation has enabled the automation of data analysis and oversight in public administration. Public procurement involves various stages and generates a multitude of documents. However, experts still analyze these unstructured textual documents manually, a process that is time-consuming and inefficient. To address this issue, we introduce BidCorpus, a novel and comprehensive dataset consisting of thousands of documents related to public procurement, specifically bidding notices from Brazilian public websites. The dataset was labeled using weak supervision techniques, manual labeling, and BERT-based language models. Models trained on these annotated data showed promising results, with metrics greater than 80% in various experiments. The models also tolerated intentional changes made to bidding notices to evade fraud detection. All the resources from this work are publicly available, including the documents, the pre-processing scripts, and the training and evaluation of the models. We expect the dataset and its labels to be of great value to researchers working on public procurement problems.
ISSN: 2352-3409