Document-Level Neural TTS Using Curriculum Learning and Attention Masking

Attention-based end-to-end text-to-speech (TTS) models have brought speech synthesis to a natural, human-like level. However, such models have difficulty generating attention when the input text is longer than the utterances seen in training, as with document-level text. In thi...

Bibliographic Details
Main Authors:	Sung-Woong Hwang, Joon-Hyuk Chang
Format:	Article
Language:	English
Published:	IEEE 2021-01-01
Series:	IEEE Access
Subjects:	Speech synthesis; document-level neural TTS; curriculum learning; attention masking; Tacotron2; MelGAN
Online Access:	https://ieeexplore.ieee.org/document/9312676/

author	Sung-Woong Hwang; Joon-Hyuk Chang
collection	DOAJ
description	Attention-based end-to-end text-to-speech (TTS) models have brought speech synthesis to a natural, human-like level. However, such models have difficulty generating attention when the input text is longer than the utterances seen in training, as with document-level text. In this paper, we propose a neural speech synthesis model that can synthesize more than 5 min of speech at once using training data consisting of short utterances of less than 10 s. This model can be used for tasks that need to synthesize document-level speech in a single pass, such as a singing voice synthesis (SVS) system or a book reading system. First, through curriculum learning, our model automatically increases the length of the training speech with each epoch while reducing the batch size, so that long sentences can be trained with a limited graphics processing unit (GPU) capacity. During synthesis, an attention-masking mechanism lets the model use only the context needed at the current time step and masks the rest of the document-level text. A Tacotron2-based speech synthesis model and a duration predictor were used in the experiments, and the results showed that the proposed method can synthesize document-level speech with much lower character error and attention error rates, and higher quality, than the existing model.
format	Article
id	doaj-art-e5e9d71aa603485cab3e50756fab497d
institution	Kabale University
issn	2169-3536
language	English
publishDate	2021-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj-art-e5e9d71aa603485cab3e50756fab497d; 2025-01-30T00:00:58Z; eng; IEEE; IEEE Access; 2169-3536; 2021-01-01; vol. 9; pp. 8954-8960; 10.1109/ACCESS.2020.3049073; 9312676; Document-Level Neural TTS Using Curriculum Learning and Attention Masking; Sung-Woong Hwang (https://orcid.org/0000-0001-6194-9752); Joon-Hyuk Chang (https://orcid.org/0000-0003-2610-2323); Department of Electronic Engineering, Hanyang University, Seoul, South Korea (both authors); https://ieeexplore.ieee.org/document/9312676/; Speech synthesis; document-level neural TTS; curriculum learning; attention masking; Tacotron2; MelGAN
title	Document-Level Neural TTS Using Curriculum Learning and Attention Masking
topic	Speech synthesis; document-level neural TTS; curriculum learning; attention masking; Tacotron2; MelGAN
url	https://ieeexplore.ieee.org/document/9312676/
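
Illustrative note: the description names two mechanisms, a curriculum that lengthens the training speech each epoch while shrinking the batch size, and an attention mask that keeps only the context near the current time step. The Python sketch below is one possible reading of those two ideas, not the authors' implementation. The function names, the growth factor, the per-batch audio budget, and the fixed window around a given center position are all assumptions made for illustration; the record does not state how the paper chooses any of them.

```python
import math


def curriculum_schedule(epoch, base_seconds=10.0, growth=1.5,
                        max_seconds=300.0, seconds_per_batch=320.0):
    """Return (length_cap_seconds, batch_size) for a 0-indexed epoch.

    All constants are hypothetical. The length cap grows geometrically
    per epoch, and the batch size shrinks so that the total audio in a
    batch stays within a fixed budget (a stand-in for GPU capacity).
    """
    length_cap = min(base_seconds * growth ** epoch, max_seconds)
    batch_size = max(1, int(seconds_per_batch // length_cap))
    return length_cap, batch_size


def masked_attention(scores, center, window):
    """Softmax over `scores`, restricted to positions near `center`.

    Positions farther than `window` from `center` receive weight 0.
    How `center` is obtained is left open here; the caller supplies it.
    """
    lo = max(0, center - window)
    hi = min(len(scores), center + window + 1)
    if lo >= hi:
        raise ValueError("window does not overlap the input")
    peak = max(scores[lo:hi])  # subtract the peak for numerical stability
    exps = [math.exp(s - peak) if lo <= i < hi else 0.0
            for i, s in enumerate(scores)]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    for epoch in (0, 1, 5, 20):
        print(epoch, curriculum_schedule(epoch))
    print(masked_attention([0.0] * 10, center=5, window=1))
```

Under these assumed constants the length cap starts at 10 s and saturates at 300 s, at which point the batch size has dropped to 1. In the masking function, only the positions inside the window share the attention weight, which is the property the description attributes to attention masking.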

Similar Items