An Adaptive Methodology for Constructing Domain-Specific Sentiment Lexicons Based on Chinese Social Media Data

Bibliographic Details
Main Authors: Xue Xu, Haidong Liu, Lei Liu
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11008636/
Description
Summary: Currently, many methods for automatically constructing domain-specific sentiment lexicons rely on knowledge bases and domain-specific corpora. However, these methods often face accuracy challenges due to data sparsity, and inferring the polarity of new domain-specific sentiment words from a limited set of labeled seed words lacks precision. Chinese social media texts typically exhibit a high degree of randomness, noise, and informal sentiment words, which further increases the difficulty of constructing domain-specific sentiment lexicons. To address these challenges, we propose an adaptive framework for constructing domain-specific sentiment lexicons using Chinese social media data and apply it to develop a sentiment lexicon for public opinion during public health emergencies (PHEPO-SentiLex). We first fine-tune Bidirectional Encoder Representations from Transformers (BERT) via a multi-task framework on a domain-specific corpus and a small number of Weibo-annotated sentiment datasets, enabling the model to encode both domain semantics and sentiment-related contextual patterns into word embeddings through gradient sharing. The embeddings are subsequently used to calculate the Sentiment Attraction Degree (SAD) during seed word filtering, to compute cosine similarity during domain-specific sentiment word selection, and to construct the domain-specific corpus-sentiment word graph (SentiGraph). Next, we propose SentiGraph-GCN, a method for sentiment word polarity determination that integrates semantic, sentiment, co-occurrence frequency, and global structural information embedded in the corpus. Experimental results demonstrate that SentiGraph-GCN significantly outperforms existing methods in determining sentiment word polarity. Furthermore, PHEPO-SentiLex exhibits superior accuracy and stability in relevant scenarios compared to general-purpose sentiment lexicons.
ISSN: 2169-3536
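
Illustrative note: the summary mentions that cosine similarity over word embeddings is used when selecting domain-specific sentiment words. The record does not give the exact selection rule, so the following is only a minimal sketch of the general idea, not the authors' implementation. The function names (`cosine_similarity`, `select_candidates`), the "best match against any seed word" rule, the 0.8 threshold, and the toy two-dimensional vectors are all assumptions made for illustration; in practice the vectors would come from the fine-tuned BERT model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_candidates(candidates, seed_vectors, threshold=0.8):
    """Keep candidate words whose best similarity to any seed vector meets the threshold.

    `candidates` maps a word to its embedding. The max-over-seeds rule and the
    threshold value are hypothetical choices, not taken from the article.
    """
    selected = []
    for word, vector in candidates.items():
        best = max(cosine_similarity(vector, seed) for seed in seed_vectors)
        if best >= threshold:
            selected.append((word, best))
    return selected


# Toy embeddings standing in for real model output (assumed values).
seed_vectors = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "word_a": [2.0, 0.0],   # same direction as the first seed
    "word_b": [1.0, 1.0],   # halfway between the two seeds
    "word_c": [0.0, -3.0],  # opposite to the second seed
}
selected = select_candidates(candidates, seed_vectors)
```

With these toy values only `word_a` passes the threshold, since it points in the same direction as one of the seed vectors. Cosine similarity ignores vector length, which is why the longer `word_a` vector still scores as a perfect match.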