An Adaptive Methodology for Constructing Domain-Specific Sentiment Lexicons Based on Chinese Social Media Data
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/11008636/ |
| Summary: | Currently, many methods for automatically constructing domain-specific sentiment lexicons rely on knowledge bases and domain-specific corpora. However, these methods often lose accuracy because of data sparsity, and inferring the polarity of new domain-specific sentiment words from a limited set of labeled seed words is imprecise. Chinese social media texts are typically highly random and noisy and contain many informal sentiment words, which further increases the difficulty of constructing domain-specific sentiment lexicons. To address these challenges, we propose an adaptive framework for constructing domain-specific sentiment lexicons using Chinese social media data and apply it to develop a sentiment lexicon for public opinion during public health emergencies (PHEPO-SentiLex). We first fine-tune Bidirectional Encoder Representations from Transformers (BERT) via a multi-task framework on a domain-specific corpus and a small number of sentiment-annotated Weibo datasets, enabling the model to encode both domain semantics and sentiment-related contextual patterns into word embeddings through gradient sharing. The embeddings are subsequently used to calculate the Sentiment Attraction Degree (SAD) during seed word filtering, to compute cosine similarity during domain-specific sentiment word selection, and to construct the domain-specific corpus-sentiment word graph (SentiGraph). Next, we propose SentiGraph-GCN, a method for sentiment word polarity determination that integrates semantic, sentiment, co-occurrence frequency, and global structural information embedded in the corpus. Experimental results demonstrate that SentiGraph-GCN significantly outperforms existing methods in determining sentiment word polarity. Furthermore, PHEPO-SentiLex exhibits superior accuracy and stability in relevant scenarios compared to general-purpose sentiment lexicons. |
|---|---|
| ISSN: | 2169-3536 |
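
The summary mentions cosine similarity over word embeddings as the criterion for selecting domain-specific sentiment words. The record does not give the paper's actual procedure, so the following is only a minimal sketch of that general idea. The function names, the threshold, the rule of taking the best match over all seeds, and the toy 3-dimensional vectors are all illustrative assumptions, not the authors' implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_u * norm_v)

def select_candidates(seed_vectors, candidate_vectors, threshold=0.8):
    """Keep candidates whose best similarity to any seed meets the threshold.

    Hypothetical selection rule: the threshold and the max-over-seeds
    criterion are assumptions made for illustration.
    """
    selected = {}
    for word, vec in candidate_vectors.items():
        best = max(cosine_similarity(vec, s) for s in seed_vectors.values())
        if best >= threshold:
            selected[word] = best
    return selected

# Toy embeddings (made-up values, not taken from the paper).
seeds = {"seed_pos": [1.0, 0.0, 0.0], "seed_neg": [0.0, 1.0, 0.0]}
candidates = {
    "cand_a": [2.0, 0.0, 0.0],  # parallel to seed_pos
    "cand_b": [0.0, 0.0, 1.0],  # orthogonal to both seeds
    "cand_c": [0.0, 3.0, 4.0],  # partially aligned with seed_neg
}
selected = select_candidates(seeds, candidates, threshold=0.5)
print(selected)
```

In a real pipeline the vectors would come from the fine-tuned BERT model described in the summary rather than from hand-written lists, and the threshold would need to be tuned on held-out data.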