Building a large, realistic and labeled HTTP URI dataset for anomaly-based intrusion detection systems: Biblio-US17

Bibliographic Details
Main Authors: Jesús Díaz-Verdejo, Rafael Estepa, Antonio Estepa, Javier Muñoz-Calle, Germán Madinabeitia
Format: Article
Language: English
Published: SpringerOpen 2025-06-01
Series: Cybersecurity
Online Access: https://doi.org/10.1186/s42400-024-00336-3
Description
Summary: This paper introduces Biblio-US17, a labeled dataset collected over 6 months from the log files of a popular public website at the University of Seville. It contains 47 million records, each including the method, uniform resource identifier (URI), response code, and response size of a request received by the web server. Records have been classified as either normal or attack through a comprehensive semi-automated process involving signature-based detection, assisted inspection of the URI vocabulary, and substantial manual supervision by experts. Unlike comparable datasets, this one offers a genuine real-world view of the normal operation of an active website, along with an unbiased proportion of actual (i.e., non-synthetic) attacks. This makes it well suited to evaluating and comparing anomaly-based approaches in a realistic environment. Its size and duration also make it valuable for addressing challenges such as data shift and insufficient training. The paper describes the collection and labeling processes, the dataset structure, and its most relevant properties. It also includes an example application that assesses the performance of a simple anomaly detector. Biblio-US17, now available to the scientific community, can also be used to model the URIs used by current web servers.
ISSN: 2523-3246
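
Illustrative note: the summary mentions using the labeled records to assess a simple anomaly detector. The sketch below shows what such an evaluation could look like in Python. It is not the detector or the file format used in the paper: the `Record` field layout, the character-set detector, and every function name here are hypothetical and chosen only for illustration. The idea is to learn the characters that occur in normal URIs, flag any URI containing an unseen character, and compare the flags against the ground-truth labels.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable, NamedTuple


class Record(NamedTuple):
    """One request record. The field layout is an assumption, not the dataset's format."""
    method: str
    uri: str
    status: int
    size: int
    is_attack: bool  # ground-truth label: True = attack, False = normal


def train_char_model(normal_uris: Iterable[str]) -> set[str]:
    """Collect every character observed in URIs assumed to be normal."""
    seen: set[str] = set()
    for uri in normal_uris:
        seen.update(uri)
    return seen


def is_anomalous(uri: str, known_chars: set[str]) -> bool:
    """Flag a URI if it contains any character never seen during training."""
    return any(ch not in known_chars for ch in uri)


def evaluate(records: Iterable[Record], known_chars: set[str]) -> dict[str, float]:
    """Compare detector flags with labels; return detection and false positive rates."""
    counts: Counter[str] = Counter()
    for rec in records:
        flagged = is_anomalous(rec.uri, known_chars)
        if rec.is_attack:
            counts["tp" if flagged else "fn"] += 1
        else:
            counts["fp" if flagged else "tn"] += 1
    attacks = counts["tp"] + counts["fn"]
    normals = counts["fp"] + counts["tn"]
    return {
        "detection_rate": counts["tp"] / attacks if attacks else 0.0,
        "false_positive_rate": counts["fp"] / normals if normals else 0.0,
    }


# Toy data, invented for this sketch (not taken from Biblio-US17).
known = train_char_model(["/index.html", "/search?q=books"])
sample = [
    Record("GET", "/index.html", 200, 512, False),
    Record("GET", "/about", 200, 300, False),
    Record("GET", "/search?q=<script>", 200, 0, True),
    Record("GET", "/search?q=hack", 200, 0, True),
]
metrics = evaluate(sample, known)
```

A detector this crude misses attacks built from ordinary characters and flags normal URIs with unfamiliar ones, which is the kind of behavior a large labeled dataset allows one to measure.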