Tibetan–Chinese speech-to-speech translation based on discrete units

Abstract: Speech-to-speech translation (S2ST) has evolved from cascade systems, which integrate Automatic Speech Recognition (ASR), Machine Translation (MT), and Text-to-Speech (TTS), to end-to-end models. This evolution has been driven by advances in model performance and the expansion of cross-lingual speech datasets. Although research on Tibetan speech translation remains scarce, this paper tackles the challenge of direct Tibetan-to-Chinese speech-to-speech translation within a multi-task learning framework, employing self-supervised learning (SSL) and sequence-to-sequence model training. Leveraging the HuBERT model to extract discrete units from target speech, we develop a speech-to-unit translation (S2UT) model with an encoder-decoder architecture, which then generates speech output through a unit-based vocoder. By employing SSL and using discrete representations as training targets, our approach effectively captures linguistic differences, facilitating direct translation between the two languages. We evaluate the HuBERT model under various configurations and select the optimal setup based on Phone-unit Normalized Mutual Information (PNMI) values. After fine-tuning the chosen HuBERT model on specific corpora, we introduce auxiliary tasks to improve translation performance, underscoring the pivotal role of multi-task learning in overall model efficacy. Experimental results validate the feasibility of Tibetan-to-Chinese S2ST, demonstrating promising translation quality and semantic content preservation despite limited data availability.
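The abstract describes discretizing target speech into HuBERT units before S2UT training. The record does not give the authors' configuration (checkpoint, feature layer, number of clusters), so the following is a minimal illustrative sketch, assuming torchaudio's pretrained HUBERT_BASE bundle and a k-means quantizer; the layer index, cluster count, and file names are all assumptions, not the paper's setup.

```python
# Hypothetical sketch: discretize speech into HuBERT units via k-means.
# Assumptions (not from the paper): torchaudio's HUBERT_BASE checkpoint,
# features from layer 6, and 100 k-means clusters.
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

def hubert_features(wav_path: str, layer: int = 6) -> torch.Tensor:
    """Return frame-level HuBERT features (num_frames, dim) for one file."""
    waveform, sr = torchaudio.load(wav_path)
    if sr != bundle.sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
    with torch.inference_mode():
        feats, _ = model.extract_features(waveform, num_layers=layer)
    return feats[-1].squeeze(0)  # features of the requested layer

# Fit a quantizer on features pooled from a (placeholder) file list,
# then map each frame to its nearest cluster id: the "discrete unit".
files = ["target_0001.wav", "target_0002.wav"]  # placeholder corpus
pooled = torch.cat([hubert_features(f) for f in files]).numpy()
kmeans = MiniBatchKMeans(n_clusters=100).fit(pooled)

units = kmeans.predict(hubert_features(files[0]).numpy())
# Collapsing consecutive repeated units is a common post-processing step
# in unit-based S2ST pipelines (again, an assumption about this paper).
deduped = [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
print(deduped[:20])
```

The deduplicated unit sequence would then serve as the decoder target for the S2UT model, with a unit-based vocoder mapping predicted units back to a waveform.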
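PNMI, used above to choose among HuBERT configurations, is commonly defined in the HuBERT literature as the mutual information between frame-level phone labels and discrete units, normalized by the phone entropy, so a value of 1.0 means the units fully determine the phones. A small self-contained sketch under that assumed definition (the frame-level phone alignment itself is taken as given):

```python
# Minimal PNMI sketch: PNMI = I(phone; unit) / H(phone), computed from
# co-occurrence counts of frame-aligned (phone, unit) pairs.
import numpy as np
from collections import Counter

def pnmi(phones: list[str], units: list[int]) -> float:
    assert len(phones) == len(units)
    n = len(phones)
    joint = Counter(zip(phones, units))
    phone_counts = Counter(phones)
    unit_counts = Counter(units)
    mi = 0.0
    for (ph, u), c in joint.items():
        # p(x,y) * log( p(x,y) / (p(x) p(y)) ), in nats
        mi += (c / n) * np.log(c * n / (phone_counts[ph] * unit_counts[u]))
    h_phone = -sum((c / n) * np.log(c / n) for c in phone_counts.values())
    return mi / h_phone

# Toy example: units 0/1 track phones a/b perfectly -> PNMI of exactly 1.
phones = ["a", "a", "b", "b", "a", "b"]
units  = [0, 0, 1, 1, 0, 1]
print(round(pnmi(phones, units), 3))  # 1.0 for this perfect alignment
```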
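The record credits multi-task learning with a pivotal role but does not name the auxiliary tasks; a common choice in the S2ST literature is auxiliary cross-entropy losses on source or target text/phones, combined with the primary unit-translation loss as a weighted sum. A hedged sketch of that generic pattern, not the authors' specific objective:

```python
# Hypothetical multi-task objective: weighted sum of the primary S2UT
# loss and auxiliary losses. Task names and weights are illustrative.
import torch

def multitask_loss(unit_loss: torch.Tensor,
                   aux_losses: dict[str, torch.Tensor],
                   weights: dict[str, float]) -> torch.Tensor:
    """Combine the unit-translation loss with weighted auxiliary losses."""
    total = unit_loss
    for name, loss in aux_losses.items():
        total = total + weights.get(name, 1.0) * loss
    return total

aux = {"asr_ce": torch.tensor(0.8), "mt_ce": torch.tensor(1.2)}
print(multitask_loss(torch.tensor(2.5), aux, {"asr_ce": 0.3, "mt_ce": 0.3}))
```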


Bibliographic Details
Main Authors: Zairan Gong, Xiaona Xu, Yue Zhao
Affiliation: Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE, Minzu University of China
Format: Article
Language: English
Published: Nature Portfolio 2025-01-01
Series: Scientific Reports
ISSN: 2045-2322
Online Access: https://doi.org/10.1038/s41598-025-85782-w