Mathematical Model and Algorithm for Accurate Main Content Extraction From News Websites
Main Authors: | , , |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
Subjects: | |
Online Access: | https://ieeexplore.ieee.org/document/10819347/ |
Summary: | Irrelevant elements like ads, menus, and footers in web pages hinder data extraction and reduce the performance of Retrieval-Augmented Generation (RAG) systems in Large Language Models (LLMs). This paper tackles the challenge of accurately identifying and extracting the main content from web pages to enhance the efficiency of these systems. We present a novel mathematical model and algorithm that leverages the Document Object Model (DOM) structure, effectively isolating relevant content with high accuracy. Our approach is language-neutral and performs well across diverse languages, including those with complex tokenization, such as Arabic. To validate the model, we created a dataset from 500 websites, allowing for comprehensive evaluation and benchmarking. The algorithm’s practical application demonstrates a reduction in token usage for LLM tasks, contributing to cost-effectiveness. This work introduces a robust, open-source tool for the academic and commercial communities, fostering further innovation in web content extraction and information retrieval. |
ISSN: | 2169-3536 |