EmoSDS: Unified Emotionally Adaptive Spoken Dialogue System Using Self-Supervised Speech Representations


Bibliographic Details
Main Authors: Jaehwan Lee, Youngjun Sim, Jinyou Kim, Young-Joo Suh
Format: Article
Language: English
Published: MDPI AG 2025-03-01
Series: Future Internet
Subjects:
Online Access: https://www.mdpi.com/1999-5903/17/4/143
Description
Summary: In recent years, advances in artificial intelligence, speech, and natural language processing have enhanced spoken dialogue systems (SDSs), enabling natural, voice-based human–computer interaction. However, discrete, token-based LLMs in emotionally adaptive SDSs focus on lexical content while overlooking the paralinguistic cues essential to emotion expression. Existing methods compensate with external emotion predictors, but these introduce computational overhead and fail to fully integrate paralinguistic features with linguistic context. Moreover, the lack of high-quality emotional speech datasets limits models’ ability to learn expressive emotional cues. To address these challenges, we propose EmoSDS, a unified SDS framework that integrates speech and emotion recognition by leveraging self-supervised learning (SSL) features. Our three-stage training pipeline enables the LLM to learn both discrete linguistic content and continuous paralinguistic features, improving emotional expressiveness and response naturalness. Additionally, we construct EmoSC, a dataset combining GPT-generated dialogues with emotional voice conversion data, ensuring greater emotional diversity and a balanced sample distribution across emotion categories. Experimental results show that EmoSDS outperforms existing models in emotional alignment and response generation, achieving at least a 2.9% increase in text generation metrics and enhancing the LLM’s ability to interpret emotional and textual cues for more expressive and contextually appropriate responses.
ISSN: 1999-5903
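The abstract's central idea, feeding an LLM both discrete linguistic tokens and continuous paralinguistic SSL features, can be illustrated with a minimal sketch. This is not the paper's implementation: all dimensions, the random placeholder data, and the concatenate-then-project fusion are assumptions chosen only to show the shape of such a pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the paper.
T = 5        # number of speech frames
VOCAB = 100  # discrete speech-token vocabulary size
D_TOK = 16   # embedding size for discrete linguistic tokens
D_SSL = 32   # continuous SSL feature size (e.g. from a WavLM-style encoder)
D_LLM = 24   # LLM input embedding size

# Discrete linguistic content: one token id per frame, looked up in an
# embedding table (random here, learned in a real system).
token_ids = rng.integers(0, VOCAB, size=T)
token_emb_table = rng.standard_normal((VOCAB, D_TOK))
token_emb = token_emb_table[token_ids]          # shape (T, D_TOK)

# Continuous paralinguistic cues: one SSL feature vector per frame
# (random placeholders standing in for real encoder outputs).
ssl_feats = rng.standard_normal((T, D_SSL))     # shape (T, D_SSL)

# Fuse both streams per frame and project into the LLM embedding space.
fused = np.concatenate([token_emb, ssl_feats], axis=-1)  # (T, D_TOK + D_SSL)
proj = rng.standard_normal((D_TOK + D_SSL, D_LLM))
llm_inputs = fused @ proj                       # (T, D_LLM), fed to the LLM

print(llm_inputs.shape)  # (5, 24)
```

The point of the sketch is only that each frame's input carries both streams at once, so the model need not rely on a separate external emotion predictor.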