Improving Text Recognition Accuracy for Serbian Legal Documents Using BERT
Producing a new high-quality text corpus is a big challenge due to the required complexity and labor expenses. High-quality datasets, considered a prerequisite for many supervised machine learning algorithms, are often only available in very limited quantities. This in turn limits the capabilities o...
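Main Authors: | Miloš Bogdanović, Milena Frtunić Gligorijević, Jelena Kocić, Leonid Stoimenov |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2025-01-01 |
Series: | Applied Sciences |
Subjects: | BERT; text recognition; optical character recognition; word similarity |
Online Access: | https://www.mdpi.com/2076-3417/15/2/615 |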
_version_ | 1832589278702469120 |
---|---|
author | Miloš Bogdanović; Milena Frtunić Gligorijević; Jelena Kocić; Leonid Stoimenov |
author_facet | Miloš Bogdanović; Milena Frtunić Gligorijević; Jelena Kocić; Leonid Stoimenov |
author_sort | Miloš Bogdanović |
collection | DOAJ |
description | Producing a new high-quality text corpus is a big challenge due to the required complexity and labor expenses. High-quality datasets, considered a prerequisite for many supervised machine learning algorithms, are often only available in very limited quantities. This in turn limits the capabilities of many advanced technologies when used in a specific field of research and development. This is also the case for the Serbian language, which is considered low-resourced in digitized language resources. In this paper, we address this issue for the Serbian language through a novel approach for generating high-quality text corpora by improving text recognition accuracy for scanned documents belonging to Serbian legal heritage. Our approach integrates three different components to provide high-quality results: a BERT-based large language model built specifically for Serbian legal texts, a high-quality open-source optical character recognition (OCR) model, and a word-level similarity measure for Serbian Cyrillic developed for this research and used for generating necessary correction suggestions. This approach was evaluated manually using scanned legal documents sampled from three different epochs between the years 1970 and 2002 with more than 14,500 test cases. We demonstrate that our approach can correct up to 88% of terms inaccurately extracted by the OCR model in the case of Serbian legal texts. |
format | Article |
id | doaj-art-9d708723e3544d5bb5b415e3634b2ed1 |
institution | Kabale University |
issn | 2076-3417 |
language | English |
publishDate | 2025-01-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj-art-9d708723e3544d5bb5b415e3634b2ed1; 2025-01-24T13:20:04Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2025-01-01; vol. 15, no. 2, art. 615; DOI 10.3390/app15020615; Improving Text Recognition Accuracy for Serbian Legal Documents Using BERT; Miloš Bogdanović, Milena Frtunić Gligorijević, Jelena Kocić, Leonid Stoimenov (all: Faculty of Electronic Engineering, University of Nis, 18000 Nis, Serbia); https://www.mdpi.com/2076-3417/15/2/615; BERT; text recognition; optical character recognition; word similarity |
title | Improving Text Recognition Accuracy for Serbian Legal Documents Using BERT |
topic | BERT; text recognition; optical character recognition; word similarity |
url | https://www.mdpi.com/2076-3417/15/2/615 |
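
The description above mentions a word-level similarity measure for Serbian Cyrillic that is used to generate correction suggestions for OCR output. The record does not define that measure, so the following is only a minimal sketch of the general idea, using a weighted edit distance as a stand-in. The confusable-letter table, the function names, the 0.6 threshold, and the candidate probabilities are all hypothetical; in the described approach the candidates would presumably come from the BERT-based model, which is represented here by a hand-written list.

```python
from typing import List, Tuple

# Hypothetical pairs of Serbian Cyrillic letters that an OCR engine might
# confuse. Illustrative assumption only; not taken from the article.
CONFUSABLE = {
    frozenset("лљ"), frozenset("нњ"), frozenset("ћђ"),
    frozenset("цџ"), frozenset("ин"), frozenset("пл"),
}


def substitution_cost(a: str, b: str) -> float:
    """Cost of replacing letter a with b; confusable pairs are cheaper."""
    if a == b:
        return 0.0
    return 0.5 if frozenset((a, b)) in CONFUSABLE else 1.0


def weighted_distance(s: str, t: str) -> float:
    """Levenshtein distance with the weighted substitution cost above."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                          # delete a
                curr[j - 1] + 1.0,                      # insert b
                prev[j - 1] + substitution_cost(a, b),  # substitute
            ))
        prev = curr
    return prev[-1]


def similarity(s: str, t: str) -> float:
    """Similarity in [0, 1]; 1.0 means the two words are identical."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - weighted_distance(s, t) / longest


def rank_candidates(
    ocr_word: str,
    candidates: List[Tuple[str, float]],
    min_similarity: float = 0.6,  # hypothetical cut-off
) -> List[Tuple[str, float, float]]:
    """Keep candidates close to the OCR token, most similar first.

    candidates holds (word, language_model_probability) pairs; ties in
    similarity are broken by the language-model probability.
    """
    scored = [(w, similarity(ocr_word, w), p) for w, p in candidates]
    kept = [item for item in scored if item[1] >= min_similarity]
    return sorted(kept, key=lambda item: (item[1], item[2]), reverse=True)


# Made-up example: the OCR token "чљан" and three made-up candidates.
suggestions = rank_candidates(
    "чљан", [("члан", 0.6), ("чин", 0.3), ("закон", 0.1)]
)
```

Filtering by similarity before looking at the language-model probability keeps contextually plausible but visually unrelated words from replacing the OCR token; other ways of combining the two scores are equally possible.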