A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language (Mendeley Data)
This study presents a dataset for named entity recognition (NER) in the Uzbek language. The dataset consists of 2,000 sentences and 25,865 words drawn from legal documents and hand-crafted sentences, annotated using the BIOES scheme. The study is complemented by the fact that the...
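The BIOES scheme named in the summary tags each token as Beginning, Inside, or End of a multi-token entity, as a Single-token entity, or as Outside any entity. The following is a minimal sketch of that tagging convention only. The function name `spans_to_bioes`, the example tokens, and the `ORG` and `LOC` labels are illustrative assumptions; they are not taken from the dataset, whose entity types are not listed in this record.

```python
from typing import List, Tuple

# (start index, end index exclusive, entity label) -- hypothetical span format
Span = Tuple[int, int, str]


def spans_to_bioes(n_tokens: int, spans: List[Span]) -> List[str]:
    """Convert non-overlapping entity spans into per-token BIOES tags."""
    tags = ["O"] * n_tokens  # every token starts as Outside
    for start, end, label in spans:
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"span ({start}, {end}) is out of range")
        if end - start == 1:
            # Single-token entity
            tags[start] = f"S-{label}"
        else:
            tags[start] = f"B-{label}"        # Beginning
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"        # Inside
            tags[end - 1] = f"E-{label}"      # End
    return tags


# Illustrative sentence, not drawn from the dataset.
tokens = ["Tashkent", "State", "University", "is", "in", "Tashkent"]
spans = [(0, 3, "ORG"), (5, 6, "LOC")]
for token, tag in zip(tokens, spans_to_bioes(len(tokens), spans)):
    print(f"{token}\t{tag}")
```

Compared with plain BIO tagging, the explicit E- and S- tags mark entity boundaries on both sides, which is one reason the scheme is often paired with sequence models.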
Main Authors: | Davlatyor Mengliev, Vladimir Barakhnin, Mukhriddin Eshkulov, Bahodir Ibragimov, Shohrux Madirimov |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2025-02-01 |
Series: | Data in Brief |
Subjects: | Named entity; Low-resource languages; Uzbek language; Language corpus; Linguistic research |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2352340924012113 |
_version_ | 1832576491047616512 |
---|---|
author | Davlatyor Mengliev; Vladimir Barakhnin; Mukhriddin Eshkulov; Bahodir Ibragimov; Shohrux Madirimov |
collection | DOAJ |
description | This study presents a dataset for named entity recognition (NER) in the Uzbek language. The dataset consists of 2,000 sentences and 25,865 words drawn from legal documents and hand-crafted sentences, annotated using the BIOES scheme. The authors demonstrate the dataset by training a model with a CNN + LSTM architecture that achieves an F1 score of 90.8%, precision of 93.9%, and recall of 88.0% on the test set. The proposed dataset and trained model contribute to the development of natural language processing for Uzbek. The authors also review existing work and provide a comparative analysis to identify the distinctive features and novelty of their contribution. In conclusion, they propose directions for further development: scaling up the dataset and applying other neural network architectures. |
format | Article |
id | doaj-art-8e3978a649fb46e9995f248253daaf1a |
institution | Kabale University |
issn | 2352-3409 |
language | English |
publishDate | 2025-02-01 |
publisher | Elsevier |
record_format | Article |
series | Data in Brief |
spelling | doaj-art-8e3978a649fb46e9995f248253daaf1a; 2025-01-31T05:11:40Z; eng; Elsevier; Data in Brief; 2352-3409; 2025-02-01; 58; 111249; A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language (Mendeley Data); [0] Davlatyor Mengliev (Urgench branch of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, 110, al-Khwarizmi str., 220100 Urgench city, Uzbekistan, and Novosibirsk State University, 2, Pirogova str., Novosibirsk city 630090, Russia; corresponding author at the Urgench branch address); [1] Vladimir Barakhnin (Novosibirsk State University, 2, Pirogova str., Novosibirsk city 630090, Russia, and Federal Research Center for Information and Computational Technologies, 6, Academician M.A. Lavrentiev avenue, Novosibirsk 630090, Russia); [2] Mukhriddin Eshkulov (Jizzakh polytechnic institute, 4, Islom Karimov str., Jizzakh city 130100, Uzbekistan); [3] Bahodir Ibragimov (Urgench State University, 14, Kh.Alimdjan str., Urgench city 220100, Uzbekistan); [4] Shohrux Madirimov (Tashkent institute of textile and light industry, 5, Shoxdjaxon str., Tashkent city 100100, Uzbekistan); http://www.sciencedirect.com/science/article/pii/S2352340924012113; Named entity; Low-resource languages; Uzbek language; Language corpus; Linguistic research |
title | A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language (Mendeley Data) |
topic | Named entity; Low-resource languages; Uzbek language; Language corpus; Linguistic research |
url | http://www.sciencedirect.com/science/article/pii/S2352340924012113 |