Mathematical Model and Algorithm for Accurate Main Content Extraction From News Websites

Irrelevant elements like ads, menus, and footers in web pages hinder data extraction and reduce the performance of Retrieval-Augmented Generation (RAG) systems in Large Language Models (LLMs). This paper tackles the challenge of accurately identifying and extracting the main content from web pages t...

Bibliographic Details
Main Authors: Hamza Salem, Hadi Salloum, Manuel Mazzara
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10819347/

Similar Items