End-to-End Multi-Speaker FastSpeech2 With Hierarchical Decoder

Bibliographic Details
Main Authors: Majid Adibian, Hossein Zeinali
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11080147/
Description
Summary: Multi-speaker text-to-speech (TTS) systems play a crucial role in applications such as personalized voice assistants, audiobooks, and multilingual speech synthesis. These systems aim to generate high-quality, natural-sounding speech while preserving the distinct characteristics of different speakers. In this paper, we aim to improve the naturalness and speaker similarity of the FastSpeech2 model in multi-speaker TTS across closed-set and open-set speaker scenarios while preserving its high inference speed and lightweight architecture. Specifically, we introduce a hierarchical decoder structure and a speaker similarity loss function to enhance speaker fidelity in synthesized speech. Additionally, we investigate various methods for integrating speaker embeddings within the model and propose an end-to-end training strategy to mitigate error propagation, an inherent limitation of cascaded models. Experimental results demonstrate that our modified FastSpeech2 model significantly outperforms the baseline in both closed-set and open-set scenarios. The proposed model achieves an absolute improvement of 0.89 in Mean Opinion Score (MOS) and 0.44 in Speaker Similarity MOS (SMOS) while maintaining the high inference speed of FastSpeech2.
ISSN: 2169-3536