Unleashing the power of pinyin: promoting Chinese named entity recognition with multiple embedding and attention

Bibliographic Details
Main Authors: Jigui Zhao, Yurong Qian, Shuxiang Hou, Jiayin Chen, Kui Wang, Min Liu, Aizimaiti Xiaokaiti
Format: Article
Language: English
Published: Springer 2025-01-01
Series: Complex & Intelligent Systems
Online Access: https://doi.org/10.1007/s40747-024-01753-0
Description
Summary: Named Entity Recognition (NER) aims to identify entities with specific meanings, and their boundaries, in natural language text. Because Chinese and English belong to different language families, Chinese NER faces challenges such as ambiguous word boundaries and semantic diversity. Previous studies on Chinese NER have focused on character and lexical information and have neglected a feature unique to Chinese: pinyin. In this paper, we propose CPL-NER, which combines multiple features of Chinese characters into the embedding and enriches the semantic representation with pinyin and dictionary information. For Chinese NER, the pinyin of a character helps to resolve polyphony, while dictionary information helps to resolve word segmentation ambiguities. We also design the Pinyin-Lexicon Cross-Attention Mechanism (PLCA), which calculates attention scores between the different embeddings. This mechanism deeply integrates character, pinyin, and lexicon embeddings, generating character sequences enriched with semantic information. Finally, BiLSTM-CRF is employed for sequence modeling. This design captures the semantic features of Chinese text more comprehensively and improves the model's handling of polyphonic characters and word segmentation ambiguities, thereby improving the recognition of Chinese named entities. We conducted experiments on four standard Chinese NER benchmark datasets, and the results show that our method outperforms most baselines, demonstrating the effectiveness of the proposed model.
ISSN: 2199-4536, 2198-6053
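
Illustration: the summary says that PLCA calculates attention scores between character, pinyin, and lexicon embeddings, but this record does not give the formulation. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes standard scaled dot-product attention, character embeddings as queries, pinyin and lexicon embeddings as keys and values, and a simple additive fusion. The function names, the dimensions, and the random toy embeddings are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value rows, one output component at a time.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return attended


def fuse(char_emb, pinyin_emb, lexicon_emb):
    """Assumed fusion: character embedding plus what it attends to in each other view."""
    from_pinyin = cross_attention(char_emb, pinyin_emb, pinyin_emb)
    from_lexicon = cross_attention(char_emb, lexicon_emb, lexicon_emb)
    return [
        [c + p + l for c, p, l in zip(c_row, p_row, l_row)]
        for c_row, p_row, l_row in zip(char_emb, from_pinyin, from_lexicon)
    ]


def random_matrix(rows, dim, rng):
    """Toy stand-in for learned embeddings (hypothetical data)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(rows)]


rng = random.Random(0)
char_emb = random_matrix(4, 8, rng)     # 4 characters, 8-dimensional embeddings
pinyin_emb = random_matrix(4, 8, rng)   # one pinyin embedding per character
lexicon_emb = random_matrix(6, 8, rng)  # 6 matched dictionary words
fused = fuse(char_emb, pinyin_emb, lexicon_emb)
```

Under these assumptions, `fused` keeps one vector per character, which is the shape a downstream sequence model such as the BiLSTM-CRF named in the summary would consume.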