Tibetan–Chinese speech-to-speech translation based on discrete units

Abstract: Speech-to-speech translation (S2ST) has evolved from cascade systems, which integrate Automatic Speech Recognition (ASR), Machine Translation (MT), and Text-to-Speech (TTS), to end-to-end models. This evolution has been driven by advances in model performance and the expansion of cross-lingual speech datasets. Although research on Tibetan speech translation remains scarce, this paper tackles the challenge of direct Tibetan-to-Chinese speech-to-speech translation within a multi-task learning framework, employing self-supervised learning (SSL) and sequence-to-sequence model training. Leveraging the HuBERT model to extract discrete units from target speech, we develop a speech-to-unit translation (S2UT) model with an encoder-decoder architecture, which then generates speech output through a unit-based vocoder. By employing SSL and using discrete representations as training targets, our approach effectively captures linguistic differences, facilitating direct translation between the two languages. We evaluate the HuBERT model under various configurations and select the optimal setup based on Phone-unit Normalized Mutual Information (PNMI) values. After fine-tuning the chosen HuBERT model on specific corpora, we introduce auxiliary tasks to improve translation performance, underscoring the pivotal role of multi-task learning in overall model efficacy. Experimental results validate the feasibility of Tibetan-to-Chinese S2ST, demonstrating promising translation quality and semantic content preservation despite limited data availability.
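The abstract describes discretizing target speech into HuBERT units before S2UT training. The record does not give the authors' configuration (checkpoint, feature layer, number of clusters), so the following is a minimal illustrative sketch, assuming torchaudio's pretrained HUBERT_BASE bundle and a k-means quantizer; the layer index, cluster count, and file names are all assumptions, not the paper's setup.

```python
# Hypothetical sketch: discretize speech into HuBERT units via k-means.
# Assumptions (not from the paper): torchaudio's HUBERT_BASE checkpoint,
# features from layer 6, and 100 k-means clusters.
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

def hubert_features(wav_path: str, layer: int = 6) -> torch.Tensor:
    """Return frame-level HuBERT features (num_frames, dim) for one file."""
    waveform, sr = torchaudio.load(wav_path)
    if sr != bundle.sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
    with torch.inference_mode():
        feats, _ = model.extract_features(waveform, num_layers=layer)
    return feats[-1].squeeze(0)  # features of the requested layer

# Fit a quantizer on features pooled from a (placeholder) file list,
# then map each frame to its nearest cluster id: the "discrete unit".
files = ["target_0001.wav", "target_0002.wav"]  # placeholder corpus
pooled = torch.cat([hubert_features(f) for f in files]).numpy()
kmeans = MiniBatchKMeans(n_clusters=100).fit(pooled)

units = kmeans.predict(hubert_features(files[0]).numpy())
# Collapsing consecutive repeated units is a common post-processing step
# in unit-based S2ST pipelines (again, an assumption about this paper).
deduped = [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
print(deduped[:20])
```

The deduplicated unit sequence would then serve as the decoder target for the S2UT model, with a unit-based vocoder mapping predicted units back to a waveform.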
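PNMI, used above to choose among HuBERT configurations, is commonly defined in the HuBERT literature as the mutual information between frame-level phone labels and discrete units, normalized by the phone entropy, so a value of 1.0 means the units fully determine the phones. A small self-contained sketch under that assumed definition (the frame-level phone alignment itself is taken as given):

```python
# Minimal PNMI sketch: PNMI = I(phone; unit) / H(phone), computed from
# co-occurrence counts of frame-aligned (phone, unit) pairs.
import numpy as np
from collections import Counter

def pnmi(phones: list[str], units: list[int]) -> float:
    assert len(phones) == len(units)
    n = len(phones)
    joint = Counter(zip(phones, units))
    phone_counts = Counter(phones)
    unit_counts = Counter(units)
    mi = 0.0
    for (ph, u), c in joint.items():
        # p(x,y) * log( p(x,y) / (p(x) p(y)) ), in nats
        mi += (c / n) * np.log(c * n / (phone_counts[ph] * unit_counts[u]))
    h_phone = -sum((c / n) * np.log(c / n) for c in phone_counts.values())
    return mi / h_phone

# Toy example: units 0/1 track phones a/b perfectly -> PNMI of exactly 1.
phones = ["a", "a", "b", "b", "a", "b"]
units  = [0, 0, 1, 1, 0, 1]
print(round(pnmi(phones, units), 3))  # 1.0 for this perfect alignment
```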
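The record credits multi-task learning with a pivotal role but does not name the auxiliary tasks; a common choice in the S2ST literature is auxiliary cross-entropy losses on source or target text/phones, combined with the primary unit-translation loss as a weighted sum. A hedged sketch of that generic pattern, not the authors' specific objective:

```python
# Hypothetical multi-task objective: weighted sum of the primary S2UT
# loss and auxiliary losses. Task names and weights are illustrative.
import torch

def multitask_loss(unit_loss: torch.Tensor,
                   aux_losses: dict[str, torch.Tensor],
                   weights: dict[str, float]) -> torch.Tensor:
    """Combine the unit-translation loss with weighted auxiliary losses."""
    total = unit_loss
    for name, loss in aux_losses.items():
        total = total + weights.get(name, 1.0) * loss
    return total

aux = {"asr_ce": torch.tensor(0.8), "mt_ce": torch.tensor(1.2)}
print(multitask_loss(torch.tensor(2.5), aux, {"asr_ce": 0.3, "mt_ce": 0.3}))
```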


Bibliographic Details
Main Authors: Zairan Gong, Xiaona Xu, Yue Zhao
Affiliation: Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE, Minzu University of China
Format: Article
Language: English
Published: Nature Portfolio 2025-01-01
Series: Scientific Reports
ISSN: 2045-2322
Online Access: https://doi.org/10.1038/s41598-025-85782-w