Speech Emotion Recognition Model Based on Joint Modeling of Discrete and Dimensional Emotion Representation

Bibliographic Details
Main Authors: John Lorenzo Bautista, Hyun Soon Shin
Author Affiliation: Emotion Recognition IoT Research Section, Hyper-Connected Communication Research Laboratory, Electronics and Telecommunications Research Institute (ETRI), Daejeon 34129, Republic of Korea
Format: Article
Language: English
Published: MDPI AG, 2025-01-01
Series: Applied Sciences, Vol. 15, No. 2, Article 623
ISSN: 2076-3417
DOI: 10.3390/app15020623
Collection: DOAJ
Subjects: adaptive weight balancing scheme; affective computing; dimensional emotion representation; discrete emotion representation; joint model architecture; Speech Emotion Recognition (SER)
Online Access: https://www.mdpi.com/2076-3417/15/2/623

Description:
This paper introduces a novel joint model architecture for Speech Emotion Recognition (SER) that integrates both discrete and dimensional emotional representations, allowing for the simultaneous training of classification and regression tasks to improve the comprehensiveness and interpretability of emotion recognition. By employing a joint loss function that combines categorical and regression losses, the model ensures balanced optimization across tasks, with experiments exploring various weighting schemes using a tunable parameter to adjust task importance. Two adaptive weight balancing schemes, Dynamic Weighting and Joint Weighting, further enhance performance by dynamically adjusting task weights based on optimization progress and ensuring balanced emotion representation during backpropagation. The architecture employs parallel feature extraction through independent encoders, designed to capture unique features from multiple modalities, including Mel-frequency Cepstral Coefficients (MFCC), Short-term Features (STF), Mel-spectrograms, and raw audio signals. Additionally, pre-trained models such as Wav2Vec 2.0 and HuBERT are integrated to leverage their robust latent features. The inclusion of self-attention and co-attention mechanisms allows the model to capture relationships between input modalities and interdependencies among features, further improving its interpretability and integration capabilities. Experiments conducted on the IEMOCAP dataset using a leave-one-subject-out approach demonstrate the model’s effectiveness, with results showing a 1–2% accuracy improvement over classification-only models. The optimal configuration, incorporating the joint architecture, dynamic weighting, and parallel processing of multimodal features, achieves a weighted accuracy of 72.66%, an unweighted accuracy of 73.22%, and a mean Concordance Correlation Coefficient (CCC) of 0.3717. These results validate the effectiveness of the proposed joint model architecture and adaptive weight balancing schemes in improving SER performance.
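
Illustrative note (not part of the published record): the paper's implementation is not included here, but the joint objective the abstract describes can be sketched as a categorical loss and a dimensional-regression loss mixed by a tunable weight, with CCC as the dimensional metric. The minimal PyTorch-style sketch below is an assumption-laden illustration only: the fixed mixing weight alpha, the CCC-based regression term, the four-class / three-dimension shapes, and all names are hypothetical, not the authors' formulation.

# Sketch of a joint SER loss: alpha * cross-entropy (discrete emotions)
# + (1 - alpha) * mean(1 - CCC) over dimensional targets (e.g. valence/arousal/dominance).
import torch
import torch.nn as nn

def ccc(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Concordance Correlation Coefficient for one emotion dimension."""
    pred_mean, target_mean = pred.mean(), target.mean()
    pred_var, target_var = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - pred_mean) * (target - target_mean)).mean()
    return (2.0 * cov) / (pred_var + target_var + (pred_mean - target_mean) ** 2 + eps)

class JointSERLoss(nn.Module):
    """L = alpha * CE(logits, labels) + (1 - alpha) * mean(1 - CCC) over dimensions."""
    def __init__(self, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha
        self.ce = nn.CrossEntropyLoss()

    def forward(self, logits, labels, dim_pred, dim_target):
        cls_loss = self.ce(logits, labels)  # discrete emotion categories
        ccc_per_dim = torch.stack(          # one CCC per dimensional target
            [ccc(dim_pred[:, d], dim_target[:, d]) for d in range(dim_pred.shape[1])]
        )
        reg_loss = (1.0 - ccc_per_dim).mean()  # higher CCC -> lower loss
        return self.alpha * cls_loss + (1.0 - self.alpha) * reg_loss

# Usage with dummy tensors: batch of 8, 4 emotion classes, 3 dimensions.
loss_fn = JointSERLoss(alpha=0.6)
loss = loss_fn(torch.randn(8, 4), torch.randint(0, 4, (8,)),
               torch.randn(8, 3), torch.rand(8, 3))

The adaptive schemes named in the abstract (Dynamic Weighting and Joint Weighting) would replace the fixed alpha with a value updated from each task's optimization progress; their exact update rules are not given in this record.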