Enhancing Emotion Recognition in Speech Based on Self-Supervised Learning: Cross-Attention Fusion of Acoustic and Semantic Features
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/10938083/ |
| Summary: | Speech Emotion Recognition has gained considerable attention in speech processing and machine learning due to its potential applications in human-computer interaction, mental health monitoring, and customer service. However, state-of-the-art models for speech emotion recognition use many parameters, which increases computational complexity. In this paper, we introduce a novel deep-learning model to enhance the accuracy of emotional content detection in speech signals while maintaining a lightweight architecture compared to state-of-the-art models. The proposed model incorporates a feature encoder that significantly improves the emotional representation of acoustic features, and a cross-attention mechanism to fuse acoustic features, such as spectrograms, with semantic features extracted from a pre-trained self-supervised learning framework, enriching the emotional representation of speech. An extensive experimental study demonstrates that the proposed model achieves a weighted accuracy of 74.6% on the IEMOCAP dataset, competitive with state-of-the-art baselines. In addition, our proposed model achieves a latency of 24 milliseconds on moderate devices while containing up to three times fewer parameters. |
| ISSN: | 2169-3536 |
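
The summary above describes fusing acoustic features (spectrograms passed through a feature encoder) with semantic features from a pre-trained self-supervised model via cross-attention. The record does not give the paper's exact architecture, so the following is only a minimal PyTorch sketch of that general idea: the feature dimensions, the choice of acoustic frames as queries and semantic features as keys/values, the residual connection, the pooling, and the classifier head are all assumptions for illustration, not the authors' design.

```python
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Illustrative cross-attention fusion of acoustic and semantic features.

    Assumptions (not from the paper): acoustic_dim=128 spectrogram bins,
    semantic_dim=768 SSL hidden size, a shared d_model=256 space, and a
    simple linear classifier over 4 emotion classes.
    """

    def __init__(self, acoustic_dim=128, semantic_dim=768, d_model=256,
                 num_heads=4, num_emotions=4):
        super().__init__()
        # Stand-in for the paper's feature encoder: project spectrogram
        # frames into the shared embedding space.
        self.acoustic_proj = nn.Linear(acoustic_dim, d_model)
        # Project semantic features from a pre-trained self-supervised
        # model (e.g., hidden states of a wav2vec 2.0-style encoder).
        self.semantic_proj = nn.Linear(semantic_dim, d_model)
        # Cross-attention: acoustic frames query the semantic sequence.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, num_emotions)

    def forward(self, spectrogram, semantic_feats):
        # spectrogram:    (batch, T_acoustic, acoustic_dim)
        # semantic_feats: (batch, T_semantic, semantic_dim)
        q = self.acoustic_proj(spectrogram)
        kv = self.semantic_proj(semantic_feats)
        fused, _ = self.cross_attn(query=q, key=kv, value=kv)
        fused = self.norm(fused + q)      # residual connection
        pooled = fused.mean(dim=1)        # temporal average pooling
        return self.classifier(pooled)    # emotion logits


if __name__ == "__main__":
    model = CrossAttentionFusion()
    spec = torch.randn(2, 300, 128)       # dummy spectrogram frames
    sem = torch.randn(2, 150, 768)        # dummy SSL features
    print(model(spec, sem).shape)         # torch.Size([2, 4])
```

This sketch only shows where the two feature streams meet; the reported accuracy, latency, and parameter counts come from the authors' full model, which is available via the Online Access link above.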