Document Information Extraction: An Analysis of Invoice Anatomy

Bibliographic Details
Main Authors: Mouad Hamri, Maxime Devanne, Jonathan Weber, Michel Hassenforder
Format: Article
Language: English
Published: Wiley 2024-01-01
Series: Applied Computational Intelligence and Soft Computing
Online Access: http://dx.doi.org/10.1155/2024/7599415
collection DOAJ
description This paper presents a new approach to document information extraction based on studying document anatomy: for each document component, we investigate the variants and forms it can take. To overcome the lack of publicly available document datasets, we used a generated invoice database for which we designed 9 different templates and generated 100 samples per template, with the documents annotated automatically during generation. We analysed the following invoice components: the dates block (invoice date and invoice due date), the address block, the amounts block (tax-free amount, tax amount, and total amount), and the lines block (lines table), investigating the impact of training our model on various block variants. In several experiments, we compared the results obtained when testing on templates containing variants not encountered during training with the results obtained after adding those variants to the training dataset. This allowed us to analyse the improvement gained by adding these previously unseen variants. The results show that the model generalises better and achieves strong performance when trained on a wide variety of cases. We conducted experiments on various models to highlight the model-agnostic character of the proposed approach. The methodology yields strong performance even with models that have significantly fewer parameters than recently published models with millions of parameters.
id doaj-art-ba76344e0f4c491c9a97728fb6e833ba
institution Kabale University
issn 1687-9732