A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language (Mendeley Data)

In this study, the authors presented a dataset for named entity recognition in the Uzbek language. The dataset consists of 2000 sentences and 25,865 words, and the sources were legal documents and hand-crafted sentences annotated using the BIOES scheme. The study is complemented by the fact that the...

Bibliographic Details
Main Authors:	Davlatyor Mengliev, Vladimir Barakhnin, Mukhriddin Eshkulov, Bahodir Ibragimov, Shohrux Madirimov
Format:	Article
Language:	English
Published:	Elsevier 2025-02-01
Series:	Data in Brief
Subjects:	Named entity; Low-resource languages; Uzbek language; Language corpus; Linguistic research
Online Access:	http://www.sciencedirect.com/science/article/pii/S2352340924012113

author	Davlatyor Mengliev, Vladimir Barakhnin, Mukhriddin Eshkulov, Bahodir Ibragimov, Shohrux Madirimov
collection	DOAJ
description	This study presents a dataset for named entity recognition (NER) in the Uzbek language. The dataset consists of 2000 sentences and 25,865 words drawn from legal documents and hand-crafted sentences, annotated using the BIOES scheme. The authors demonstrate the dataset's use by training a model with a CNN + LSTM architecture, which achieves an F1 score of 90.8%, precision of 93.9%, and recall of 88.0% on the test set. The proposed dataset and trained model contribute to the development of natural language processing for Uzbek. The authors also review existing work and provide a comparative analysis that identifies the distinctive features and novelty of the proposed work. In conclusion, they outline possible directions for further development: scaling up the dataset and trying other neural network architectures.
format	Article
id	doaj-art-8e3978a649fb46e9995f248253daaf1a
institution	Kabale University
issn	2352-3409
language	English
publishDate	2025-02-01
publisher	Elsevier
record_format	Article
series	Data in Brief
spelling	doaj-art-8e3978a649fb46e9995f248253daaf1a; 2025-01-31T05:11:40Z; eng; Elsevier; Data in Brief; 2352-3409; 2025-02-01; 58; 111249
	Title: A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language (Mendeley Data)
	Authors: Davlatyor Mengliev [0], Vladimir Barakhnin [1], Mukhriddin Eshkulov [2], Bahodir Ibragimov [3], Shohrux Madirimov [4]
	[0] Urgench branch of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, 110, al-Khwarizmi str., 220100 Urgench city, Uzbekistan; Novosibirsk State University, 2, Pirogova str., Novosibirsk city 630090, Russia; Corresponding author at: Urgench branch of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, 110, al-Khwarizmi str., 220100 Urgench city, Uzbekistan.
	[1] Novosibirsk State University, 2, Pirogova str., Novosibirsk city 630090, Russia; Federal Research Center for Information and Computational Technologies, 6, Academician M.A. Lavrentiev avenue, Novosibirsk 630090, Russia
	[2] Jizzakh polytechnic institute, 4, Islom Karimov str., Jizzakh city 130100, Uzbekistan
	[3] Urgench State University, 14, Kh.Alimdjan str., Urgench city 220100, Uzbekistan
	[4] Tashkent institute of textile and light industry, 5, Shoxdjaxon str., Tashkent city 100100, Uzbekistan
	URL: http://www.sciencedirect.com/science/article/pii/S2352340924012113
	Keywords: Named entity; Low-resource languages; Uzbek language; Language corpus; Linguistic research
title	A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language (Mendeley Data)
topic	Named entity; Low-resource languages; Uzbek language; Language corpus; Linguistic research
url	http://www.sciencedirect.com/science/article/pii/S2352340924012113
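
The description above states that the dataset is annotated using the BIOES scheme. As a rough illustration of what such tags look like, the sketch below converts entity spans into BIOES tags (B = begin, I = inside, E = end, S = single-token entity, O = outside). The function name `spans_to_bioes`, the labels `ORG` and `LOC`, and the example sentence are assumptions made for illustration; they are not taken from the dataset or from the authors' code.

```python
from typing import List, Tuple


def spans_to_bioes(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans (start, end_exclusive, label) into BIOES tags.

    Hypothetical helper for illustration only; the record does not
    describe the authors' actual annotation tooling.
    """
    tags = ["O"] * len(tokens)  # every token starts outside any entity
    for start, end, label in spans:
        if end - start == 1:
            # A one-token entity gets the S- (single) prefix.
            tags[start] = f"S-{label}"
        else:
            tags[start] = f"B-{label}"        # first token of the entity
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"        # tokens strictly inside
            tags[end - 1] = f"E-{label}"      # last token of the entity
    return tags


# Illustrative sentence (not from the dataset); labels are assumed.
tokens = ["Toshkent", "davlat", "universiteti", "Toshkent", "shahrida", "joylashgan"]
spans = [(0, 3, "ORG"), (3, 4, "LOC")]
tags = spans_to_bioes(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Compared with plain BIO tagging, the extra E- and S- prefixes mark entity boundaries explicitly, which is one common reason to choose BIOES for sequence-labelling models.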


Similar Items