C-BERT: A Mongolian reverse dictionary based on fused lexical semantic clustering and BERT

Bibliographic Details
Main Authors: Amuguleng Wang, Yilagui Qi, Dahu Baiyila
Format: Article
Language:English
Published: Elsevier 2025-01-01
Series:Alexandria Engineering Journal
Subjects:
Online Access:http://www.sciencedirect.com/science/article/pii/S1110016824011967
_version_ 1832595957655535616
author Amuguleng Wang
Yilagui Qi
Dahu Baiyila
author_facet Amuguleng Wang
Yilagui Qi
Dahu Baiyila
author_sort Amuguleng Wang
collection DOAJ
description A reverse dictionary is an electronic dictionary that accepts user-provided natural language descriptions and returns semantically matching words. Despite substantial research achievements in Mongolian lexicography, work on Mongolian reverse dictionaries has not yet emerged. To address this, we propose an innovative model, C-BERT, which combines lexical semantic clustering with BERT-based classification. First, the K-means algorithm is used to cluster preprocessed entries from well-known Mongolian dictionaries into 5000 clusters, forming a comprehensive training set. We then optimize the training set’s data distribution through random negative sampling and fine-tune the CINO-large model, yielding the C-BERT model. When a user submits a description, C-BERT matches it against the central words of the 5000 clusters and selects the top 125 clusters; it then matches target words within these clusters to recommend the top 100 semantically relevant candidates. Compared with seven baseline models, C-BERT demonstrates superior performance, particularly on datasets with human-generated descriptions, where its synonym accuracy@10/100 reaches 16.5% and 71%, respectively. Thanks to clustering, C-BERT also improves inference speed more than tenfold, significantly enhancing its practical utility. Accordingly, we have developed a user-friendly online application platform based on C-BERT for a broad range of users, available at http://mrdp.net/.
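The two-stage lookup described above (match a query to cluster centres, then rank words inside the selected clusters) can be sketched as follows. This is only an illustration, not the authors' code: stand-in random vectors replace the CINO-large/BERT embeddings, cosine similarity stands in for the fine-tuned classifier's scoring, and the sizes are scaled down from the paper's 5000 clusters, top 125 clusters, and top 100 candidates.

# Illustrative sketch of cluster-then-word reverse-dictionary lookup (assumptions noted above).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)

# Hypothetical dictionary: entry words and stand-in embeddings of their definitions.
words = [f"word_{i}" for i in range(2000)]
entry_vecs = rng.normal(size=(len(words), 128))

# Offline step: K-means over the entry embeddings (the paper clusters preprocessed
# entries from Mongolian dictionaries into 5000 clusters).
n_clusters = 200
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(entry_vecs)
centroids = km.cluster_centers_
members = {c: np.where(km.labels_ == c)[0] for c in range(n_clusters)}

def reverse_lookup(query_vec, top_clusters=20, top_words=10):
    """Match a query description to cluster centres, then rank the words
    inside the selected clusters and return the best candidates."""
    # Stage 1: choose the most similar clusters (paper: top 125 of 5000).
    c_sims = cosine_similarity(query_vec[None, :], centroids)[0]
    best = np.argsort(-c_sims)[:top_clusters]
    # Stage 2: score only the words belonging to those clusters (paper: top 100 returned).
    cand = np.concatenate([members[c] for c in best])
    w_sims = cosine_similarity(query_vec[None, :], entry_vecs[cand])[0]
    order = np.argsort(-w_sims)[:top_words]
    return [words[cand[i]] for i in order]

# Usage: in practice query_vec would be the encoded user description.
print(reverse_lookup(rng.normal(size=128)))

The first stage means only words in the selected clusters are scored rather than the whole vocabulary, which is the source of the more-than-tenfold inference speedup reported in the abstract.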
format Article
id doaj-art-4aa965c7613541e0829aa73e09c97900
institution Kabale University
issn 1110-0168
language English
publishDate 2025-01-01
publisher Elsevier
record_format Article
series Alexandria Engineering Journal
spelling doaj-art-4aa965c7613541e0829aa73e09c97900
2025-01-18T05:03:36Z
eng
Elsevier
Alexandria Engineering Journal
1110-0168
2025-01-01
Vol. 111, pp. 385-395
C-BERT: A Mongolian reverse dictionary based on fused lexical semantic clustering and BERT
Amuguleng Wang (School of Mongolian Studies, Inner Mongolia University, Hohhot, 010000, China)
Yilagui Qi (School of Mongolian Studies, Inner Mongolia University, Hohhot, 010000, China)
Dahu Baiyila (Corresponding author; School of Mongolian Studies, Inner Mongolia University, Hohhot, 010000, China)
http://www.sciencedirect.com/science/article/pii/S1110016824011967
Reverse dictionary; Mongolian language; Natural language processing; Conceptual search; Lexical semantic clustering; BERT classifier
spellingShingle Amuguleng Wang
Yilagui Qi
Dahu Baiyila
C-BERT: A Mongolian reverse dictionary based on fused lexical semantic clustering and BERT
Alexandria Engineering Journal
Reverse dictionary
Mongolian language
Natural language processing
Conceptual search
Lexical semantic clustering
BERT classifier
title C-BERT: A Mongolian reverse dictionary based on fused lexical semantic clustering and BERT
title_full C-BERT: A Mongolian reverse dictionary based on fused lexical semantic clustering and BERT
title_fullStr C-BERT: A Mongolian reverse dictionary based on fused lexical semantic clustering and BERT
title_full_unstemmed C-BERT: A Mongolian reverse dictionary based on fused lexical semantic clustering and BERT
title_short C-BERT: A Mongolian reverse dictionary based on fused lexical semantic clustering and BERT
title_sort c bert a mongolian reverse dictionary based on fused lexical semantic clustering and bert
topic Reverse dictionary
Mongolian language
Natural language processing
Conceptual search
Lexical semantic clustering
BERT classifier
url http://www.sciencedirect.com/science/article/pii/S1110016824011967
work_keys_str_mv AT amugulengwang cbertamongolianreversedictionarybasedonfusedlexicalsemanticclusteringandbert
AT yilaguiqi cbertamongolianreversedictionarybasedonfusedlexicalsemanticclusteringandbert
AT dahubaiyila cbertamongolianreversedictionarybasedonfusedlexicalsemanticclusteringandbert