Document Information Extraction: An Analysis of Invoice Anatomy

In this paper, we present a new approach to document information extraction based on studying document anatomy: for each document component, we investigated the variants and forms it can take. To overcome the lack of publicly available document datasets, we used a generated invoice data...

Bibliographic Details
Main Authors:	Mouad Hamri, Maxime Devanne, Jonathan Weber, Michel Hassenforder
Format:	Article
Language:	English
Published:	Wiley 2024-01-01
Series:	Applied Computational Intelligence and Soft Computing
Online Access:	http://dx.doi.org/10.1155/2024/7599415

collection	DOAJ
description	In this paper, we present a new approach to document information extraction based on studying document anatomy: for each document component, we investigated the variants and forms it can take. To overcome the lack of publicly available document datasets, we used a generated invoice database for which we designed 9 different templates and generated 100 samples per template, with the documents annotated automatically during generation. We analysed the following invoice components: the dates block (invoice date and invoice due date), the address block, the amounts block (tax-free amount, tax amount, and total amount), and the lines block (lines table), investigating the impact of training our model on various block variants. We conducted several experiments comparing the results obtained when testing on templates that included variants not encountered during training with those obtained after introducing these variants into the training dataset. This allowed us to analyse the improvement in results after adding the previously unseen variants. The results show that the model generalises better when trained on a large variety of cases and achieves remarkable performance. We conducted experiments on various models to highlight the model-agnostic character of the proposed approach. This methodology yields strong performance even with models that have significantly fewer parameters than recently published models with millions of parameters.
id	doaj-art-ba76344e0f4c491c9a97728fb6e833ba
institution	Kabale University
issn	1687-9732
spelling	doaj-art-ba76344e0f4c491c9a97728fb6e833ba; 2025-02-03T07:23:45Z; eng; Wiley; Applied Computational Intelligence and Soft Computing; 1687-9732; 2024-01-01; 2024; 10.1155/2024/7599415; Document Information Extraction: An Analysis of Invoice Anatomy; Mouad Hamri (IRIMAS), Maxime Devanne (IRIMAS), Jonathan Weber (IRIMAS), Michel Hassenforder (IRIMAS); http://dx.doi.org/10.1155/2024/7599415

Similar Items