Enhancing Emotion Recognition in Speech Based on Self-Supervised Learning: Cross-Attention Fusion of Acoustic and Semantic Features

Bibliographic Details
Main Authors: Bashar M. Deeb, Andrey V. Savchenko, Ilya Makarov
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/10938083/
Description
Summary: Speech Emotion Recognition has gained considerable attention in speech processing and machine learning due to its potential applications in human-computer interaction, mental health monitoring, and customer service. However, state-of-the-art models for speech emotion recognition rely on large numbers of parameters, which leads to high computational complexity. In this paper, we introduce a novel deep-learning model that enhances the accuracy of emotional content detection in speech signals while maintaining a lightweight architecture compared to state-of-the-art models. The proposed model incorporates a feature encoder that significantly improves the emotional representation of acoustic features, and a cross-attention mechanism that fuses acoustic features, such as spectrograms, with semantic features extracted from a pre-trained self-supervised learning framework, enriching the emotional representation of speech. An extensive experimental study demonstrates that the proposed model achieves a weighted accuracy of 74.6% on the IEMOCAP dataset, competitive with state-of-the-art baselines. In addition, our proposed model achieves a latency of 24 milliseconds on moderate devices while containing up to three times fewer parameters.
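
The summary describes the core of the architecture: acoustic spectrogram features are fused via cross-attention with semantic features from a pre-trained self-supervised model. The PyTorch snippet below is a minimal sketch of that kind of fusion, not the authors' implementation; all dimensions, the choice of acoustic frames as queries, the mean pooling, and the four-class output head are illustrative assumptions.

# Minimal sketch of cross-attention fusion of acoustic and semantic features.
# Layer sizes and the attention direction (acoustic queries attending to
# semantic keys/values) are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, acoustic_dim=128, semantic_dim=768, d_model=256,
                 n_heads=4, n_classes=4):
        super().__init__()
        # Project both feature streams into a shared embedding space.
        self.acoustic_proj = nn.Linear(acoustic_dim, d_model)
        self.semantic_proj = nn.Linear(semantic_dim, d_model)
        # Acoustic frames act as queries; semantic frames as keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, spectrogram, semantic_feats):
        # spectrogram:    (batch, T_acoustic, acoustic_dim), e.g. log-Mel frames
        # semantic_feats: (batch, T_semantic, semantic_dim), e.g. outputs of a
        #                 pre-trained self-supervised speech encoder
        q = self.acoustic_proj(spectrogram)
        kv = self.semantic_proj(semantic_feats)
        fused, _ = self.cross_attn(q, kv, kv)
        fused = self.norm(fused + q)      # residual connection over queries
        pooled = fused.mean(dim=1)        # temporal average pooling
        return self.classifier(pooled)    # emotion logits

# Usage example with random tensors standing in for real features.
model = CrossAttentionFusion()
logits = model(torch.randn(2, 300, 128), torch.randn(2, 150, 768))
print(logits.shape)  # torch.Size([2, 4])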
ISSN:2169-3536