A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language (Mendeley Data)

This study presents a dataset for named entity recognition in the Uzbek language. The dataset comprises 2000 sentences and 25,865 words, drawn from legal documents and hand-crafted sentences and annotated using the BIOES scheme. The study is complemented by the fact that the...
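
The BIOES scheme mentioned above marks each token as the Beginning, Inside, or End of a multi-token entity, as a Single-token entity, or as Outside any entity. The following is a minimal sketch of that tagging convention only; the function name `spans_to_bioes`, the English example sentence, and the labels `PER`, `ORG`, and `LOC` are illustrative assumptions and are not taken from the dataset or the authors' code.

```python
from typing import List, Tuple


def spans_to_bioes(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans to BIOES tags.

    Each span is (start, end_exclusive, label); spans are assumed
    not to overlap. Tokens outside every span are tagged "O".
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            # A one-token entity gets the S- (single) prefix.
            tags[start] = f"S-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"E-{label}"
    return tags


# Hypothetical example sentence and labels, for illustration only.
tokens = ["Vladimir", "Barakhnin", "works", "at", "Novosibirsk",
          "State", "University", "in", "Novosibirsk"]
spans = [(0, 2, "PER"), (4, 7, "ORG"), (8, 9, "LOC")]
tags = spans_to_bioes(len(tokens), spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Compared with plain BIO tagging, the explicit E- and S- tags mark entity boundaries directly, which is one common reason for choosing BIOES in sequence-labelling work.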

Bibliographic Details
Main Authors: Davlatyor Mengliev, Vladimir Barakhnin, Mukhriddin Eshkulov, Bahodir Ibragimov, Shohrux Madirimov
Format: Article
Language: English
Published: Elsevier 2025-02-01
Series: Data in Brief
Subjects: Named entity; Low-resource languages; Uzbek language; Language corpus; Linguistic research
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340924012113
_version_ 1832576491047616512
author Davlatyor Mengliev
Vladimir Barakhnin
Mukhriddin Eshkulov
Bahodir Ibragimov
Shohrux Madirimov
collection DOAJ
description This study presents a dataset for named entity recognition (NER) in the Uzbek language. The dataset comprises 2000 sentences and 25,865 words, drawn from legal documents and hand-crafted sentences and annotated using the BIOES scheme. The authors demonstrate the dataset's use by training a language model with the CNN + LSTM architecture, which achieves an F1 score of 90.8 %, precision of 93.9 %, and recall of 88.0 % on the test set. The proposed dataset and trained model contribute to the development of natural language processing for the Uzbek language. The authors also review existing work and provide a comparative analysis that helps identify the distinctive features and novelty of the proposed work. In conclusion, they propose directions for future development: further scaling of the dataset and the use of other neural network architectures.
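
The three reported metrics are related: F1 is the harmonic mean of precision and recall. The sketch below recomputes F1 from the reported precision and recall as a sanity check; the function name `f1_score` is an assumption for illustration, and the small gap to the reported 90.8 % is presumably due to rounding of the published figures.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the description above.
reported_precision = 0.939
reported_recall = 0.880

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 recomputed from reported precision and recall: {f1:.4f}")
```

The recomputed value is about 0.9085, which agrees with the reported F1 of 90.8 % to within one unit in the last reported digit.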
format Article
id doaj-art-8e3978a649fb46e9995f248253daaf1a
institution Kabale University
issn 2352-3409
language English
publishDate 2025-02-01
publisher Elsevier
record_format Article
series Data in Brief
spelling doaj-art-8e3978a649fb46e9995f248253daaf1a; 2025-01-31T05:11:40Z; eng; Elsevier; Data in Brief; 2352-3409; 2025-02-01; 58; 111249
A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language (Mendeley Data)
Davlatyor Mengliev (0): Urgench branch of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, 110, al-Khwarizmi str., 220100 Urgench city, Uzbekistan; Novosibirsk State University, 2, Pirogova str., Novosibirsk city 630090, Russia; Corresponding author at: Urgench branch of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, 110, al-Khwarizmi str., 220100 Urgench city, Uzbekistan.
Vladimir Barakhnin (1): Novosibirsk State University, 2, Pirogova str., Novosibirsk city 630090, Russia; Federal Research Center for Information and Computational Technologies, 6, Academician M.A. Lavrentiev avenue, Novosibirsk 630090, Russia
Mukhriddin Eshkulov (2): Jizzakh polytechnic institute, 4, Islom Karimov str., Jizzakh city 130100, Uzbekistan
Bahodir Ibragimov (3): Urgench State University, 14, Kh.Alimdjan str., Urgench city 220100, Uzbekistan
Shohrux Madirimov (4): Tashkent institute of textile and light industry, 5, Shoxdjaxon str., Tashkent city 100100, Uzbekistan
title A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language (Mendeley Data)
topic Named entity
Low-resource languages
Uzbek language
Language corpus
Linguistic research
url http://www.sciencedirect.com/science/article/pii/S2352340924012113