C-BERT: A Mongolian reverse dictionary based on fused lexical semantic clustering and BERT
Main Authors: | , , |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2025-01-01 |
Series: | Alexandria Engineering Journal |
Subjects: | |
Online Access: | http://www.sciencedirect.com/science/article/pii/S1110016824011967 |
Summary: | A reverse dictionary is an electronic dictionary that accepts a user-provided natural language description and returns semantically matching words. Despite substantial research achievements in Mongolian lexicography, reverse dictionaries for Mongolian have not yet been discussed. To address this, we propose C-BERT, a model that combines lexical semantic clustering with BERT classification. First, the K-means algorithm is used to cluster preprocessed entries from well-known Mongolian dictionaries into 5000 clusters, which form the training set. We then optimize the training set's data distribution through random negative sampling and fine-tune the CINO-large model to obtain C-BERT. When a user submits a description, C-BERT matches it against the central words of the 5000 clusters and selects the top 125 clusters. It then matches the description against the target words within these clusters and recommends the top 100 semantically relevant candidates. C-BERT outperforms the seven baseline models, particularly on datasets with human-generated descriptions, where its synonym accuracy@10 and accuracy@100 reach 16.5% and 71%, respectively. Owing to clustering, inference is more than ten times faster, which significantly improves practical utility. We have also developed a user-friendly online application platform based on C-BERT, available at http://mrdp.net/. |
---|---|
ISSN: | 1110-0168 |
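
The two-stage lookup described in the summary (first select the best-matching clusters, then rank only the words inside them) can be illustrated with a minimal sketch. This is not the authors' implementation: C-BERT scores matches with a fine-tuned CINO-large model, whereas the sketch below substitutes cosine similarity over random toy vectors, and the function name `two_stage_search`, its parameters, and the toy sizes (50 clusters, 2000 words) are all assumptions made for illustration. The record's own figures are 5000 clusters, top 125 clusters, and top 100 candidates.

```python
import numpy as np


def normalize(x):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def two_stage_search(query, centers, word_vecs, word_cluster,
                     top_clusters, top_words):
    """Return indices of candidate words, best match first.

    Hypothetical helper: cosine similarity stands in for the
    fine-tuned classifier described in the summary.
    """
    q = normalize(query)
    # Stage 1: score the query against the cluster centers only.
    cluster_scores = normalize(centers) @ q
    keep = np.argsort(-cluster_scores)[:top_clusters]
    # Stage 2: score only the words that belong to the kept clusters.
    candidates = np.flatnonzero(np.isin(word_cluster, keep))
    word_scores = normalize(word_vecs[candidates]) @ q
    order = np.argsort(-word_scores)[:top_words]
    return candidates[order]


# Toy data (assumed): each word vector is its cluster center plus small noise.
rng = np.random.default_rng(0)
n_words, n_clusters, dim = 2000, 50, 16
centers = rng.normal(size=(n_clusters, dim))
word_cluster = rng.integers(0, n_clusters, size=n_words)
word_vecs = centers[word_cluster] + 0.1 * rng.normal(size=(n_words, dim))

# A "description" embedding that lies very close to word 123.
target = 123
query = word_vecs[target] + 0.01 * rng.normal(size=dim)

result = two_stage_search(query, centers, word_vecs, word_cluster,
                          top_clusters=5, top_words=10)
print(result)
```

In this toy setup, stage 1 compares the query with 50 centers and stage 2 with only the words in 5 clusters, rather than with all 2000 words. Fewer comparisons per query is one plausible reading of why the summary attributes the speedup to clustering.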