Improving Text Recognition Accuracy for Serbian Legal Documents Using BERT

Producing a new high-quality text corpus is a major challenge because of the complexity and labor costs involved. High-quality datasets, considered a prerequisite for many supervised machine learning algorithms, are often only available in very limited quantities. This in turn limits the capabilities o...

Bibliographic Details
Main Authors: Miloš Bogdanović, Milena Frtunić Gligorijević, Jelena Kocić, Leonid Stoimenov
Format: Article
Language: English
Published: MDPI AG 2025-01-01
Series: Applied Sciences
Subjects: BERT; text recognition; optical character recognition; word similarity
Online Access: https://www.mdpi.com/2076-3417/15/2/615
author Miloš Bogdanović
Milena Frtunić Gligorijević
Jelena Kocić
Leonid Stoimenov
collection DOAJ
description Producing a new high-quality text corpus is a major challenge because of the complexity and labor costs involved. High-quality datasets, considered a prerequisite for many supervised machine learning algorithms, are often only available in very limited quantities. This in turn limits the capabilities of many advanced technologies when they are applied in a specific field of research and development. This is also the case for the Serbian language, which is considered low-resourced in terms of digitized language resources. In this paper, we address this issue for the Serbian language through a novel approach for generating high-quality text corpora by improving text recognition accuracy for scanned documents belonging to Serbian legal heritage. Our approach integrates three components: a BERT-based large language model built specifically for Serbian legal texts, a high-quality open-source optical character recognition (OCR) model, and a word-level similarity measure for Serbian Cyrillic, developed for this research and used to generate the necessary correction suggestions. The approach was evaluated manually on scanned legal documents sampled from three different epochs between 1970 and 2002, comprising more than 14,500 test cases. We demonstrate that our approach can correct up to 88% of the terms inaccurately extracted by the OCR model in the case of Serbian legal texts.
format Article
id doaj-art-9d708723e3544d5bb5b415e3634b2ed1
institution Kabale University
issn 2076-3417
language English
publishDate 2025-01-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj-art-9d708723e3544d5bb5b415e3634b2ed1 | 2025-01-24T13:20:04Z | eng | MDPI AG | Applied Sciences | 2076-3417 | 2025-01-01 | vol. 15, no. 2, art. 615 | DOI 10.3390/app15020615 | Improving Text Recognition Accuracy for Serbian Legal Documents Using BERT | Miloš Bogdanović, Milena Frtunić Gligorijević, Jelena Kocić, Leonid Stoimenov (all: Faculty of Electronic Engineering, University of Nis, 18000 Nis, Serbia) | https://www.mdpi.com/2076-3417/15/2/615 | BERT; text recognition; optical character recognition; word similarity
title Improving Text Recognition Accuracy for Serbian Legal Documents Using BERT
topic BERT
text recognition
optical character recognition
word similarity
url https://www.mdpi.com/2076-3417/15/2/615
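Illustrative note on the abstract: the description names three components (an OCR model, a BERT-based language model, and a word-level similarity measure for Serbian Cyrillic) but this record does not say how they are combined. The following is only a minimal sketch of one plausible way to rank correction suggestions, under stated assumptions: normalized Levenshtein distance stands in for the paper's own similarity measure, the language-model probabilities are hard-coded hypothetical values rather than output of a real model, and the names `word_similarity`, `rank_corrections`, and `weight` are invented for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def word_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT the measure developed in the paper."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def rank_corrections(ocr_word: str, lm_candidates: dict, weight: float = 0.5):
    """Rank candidate words for a suspicious OCR token.

    lm_candidates maps each candidate word to a probability that a masked
    language model might assign to it in context (hypothetical values here).
    The score is a weighted mix of string similarity and that probability.
    """
    scored = [
        (word, weight * word_similarity(ocr_word, word) + (1.0 - weight) * prob)
        for word, prob in lm_candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical case: the OCR output "закои" where "закон" (law) was printed.
    candidates = {"закон": 0.6, "пропис": 0.3, "закона": 0.1}
    for word, score in rank_corrections("закои", candidates):
        print(word, round(score, 3))
```

Mixing the two signals lets a candidate that is both close in spelling to the OCR output and likely in context outrank one that satisfies only one of the two; the equal weighting is an arbitrary choice for the sketch.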