Mathematical Model and Algorithm for Accurate Main Content Extraction From News Websites

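Irrelevant elements like ads, menus, and footers in web pages hinder data extraction and reduce the performance of Retrieval-Augmented Generation (RAG) systems in Large Language Models (LLMs). This paper tackles the challenge of accurately identifying and extracting the main content from web pages to enhance the efficiency of these systems. We present a novel mathematical model and algorithm that leverages the Document Object Model (DOM) structure, effectively isolating relevant content with high accuracy. Our approach is language-neutral and performs well across diverse languages, including those with complex tokenization, such as Arabic. To validate the model, we created a dataset from 500 websites, allowing for comprehensive evaluation and benchmarking. The algorithm’s practical application demonstrates a reduction in token usage for LLM tasks, contributing to cost-effectiveness. This work introduces a robust, open-source tool for the academic and commercial communities, fostering further innovation in web content extraction and information retrieval.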

Bibliographic Details
Main Authors:	Hamza Salem, Hadi Salloum, Manuel Mazzara
Format:	Article
Language:	English
Published:	IEEE 2025-01-01
Series:	IEEE Access
Subjects:	Information extraction, document object model (DOM), retrieval-augmented generation (RAG), large language models (LLM), main content detection
Online Access:	https://ieeexplore.ieee.org/document/10819347/

_version_	1832583967609454592
author	Hamza Salem, Hadi Salloum, Manuel Mazzara
author_facet	Hamza Salem, Hadi Salloum, Manuel Mazzara
author_sort	Hamza Salem
collection	DOAJ
description	Irrelevant elements like ads, menus, and footers in web pages hinder data extraction and reduce the performance of Retrieval-Augmented Generation (RAG) systems in Large Language Models (LLMs). This paper tackles the challenge of accurately identifying and extracting the main content from web pages to enhance the efficiency of these systems. We present a novel mathematical model and algorithm that leverages the Document Object Model (DOM) structure, effectively isolating relevant content with high accuracy. Our approach is language-neutral and performs well across diverse languages, including those with complex tokenization, such as Arabic. To validate the model, we created a dataset from 500 websites, allowing for comprehensive evaluation and benchmarking. The algorithm’s practical application demonstrates a reduction in token usage for LLM tasks, contributing to cost-effectiveness. This work introduces a robust, open-source tool for the academic and commercial communities, fostering further innovation in web content extraction and information retrieval.
format	Article
id	doaj-art-2fcae42a508848bcb2fdaf27f70de105
institution	Kabale University
issn	2169-3536
language	English
publishDate	2025-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj-art-2fcae42a508848bcb2fdaf27f70de105 | 2025-01-28T00:01:25Z | eng | IEEE | IEEE Access | 2169-3536 | 2025-01-01 | 13 | 15694 | 15711 | 10.1109/ACCESS.2024.3524656 | 10819347 | Mathematical Model and Algorithm for Accurate Main Content Extraction From News Websites | Hamza Salem (https://orcid.org/0000-0002-9143-5231; Department of Computer Science and Engineering, Innopolis University, Innopolis, Russia) | Hadi Salloum (https://orcid.org/0009-0005-6068-0532; Department of Computer Science and Engineering, Innopolis University, Innopolis, Russia) | Manuel Mazzara (https://orcid.org/0000-0002-3860-4948; Department of Computer Science and Engineering, Innopolis University, Innopolis, Russia) | https://ieeexplore.ieee.org/document/10819347/ | Information extraction, document object model (DOM), retrieval-augmented generation (RAG), large language models (LLM), main content detection
title	Mathematical Model and Algorithm for Accurate Main Content Extraction From News Websites
topic	Information extraction, document object model (DOM), retrieval-augmented generation (RAG), large language models (LLM), main content detection
url	https://ieeexplore.ieee.org/document/10819347/

Similar Items