End-to-end neural automatic speech recognition system for low resource languages

The rising popularity of end-to-end (E2E) automatic speech recognition (ASR) systems can be attributed to their ability to learn complex speech patterns directly from raw data, eliminating the need for intricate feature extraction pipelines and handcrafted language models. E2E-ASR systems have consi...

Bibliographic Details
Main Authors:	Sami Dhahbi, Nasir Saleem, Sami Bourouis, Mouhebeddine Berrima, Elena Verdú
Format:	Article
Language:	English
Published:	Elsevier 2025-03-01
Series:	Egyptian Informatics Journal
Subjects:	Speech recognition; Deep learning; Low-resource language; E2E learning; Data augmentation; Synthetic speech
Online Access:	http://www.sciencedirect.com/science/article/pii/S1110866525000088

_version_	1832583086331658240
author	Sami Dhahbi, Nasir Saleem, Sami Bourouis, Mouhebeddine Berrima, Elena Verdú
author_facet	Sami Dhahbi, Nasir Saleem, Sami Bourouis, Mouhebeddine Berrima, Elena Verdú
author_sort	Sami Dhahbi
collection	DOAJ
description	The rising popularity of end-to-end (E2E) automatic speech recognition (ASR) systems can be attributed to their ability to learn complex speech patterns directly from raw data, eliminating the need for intricate feature extraction pipelines and handcrafted language models. E2E-ASR systems have consistently outperformed traditional ASRs. However, training E2E-ASR systems for low-resource languages remains challenging due to the dependence on data from well-resourced languages. ASR is vital for promoting under-resourced languages, especially in developing human-to-human and human-to-machine communication systems. Using synthetic speech and data augmentation techniques can enhance E2E-ASR performance for low-resource languages, reducing word error rates (WERs) and character error rates (CERs). This study leverages a non-autoregressive neural text-to-speech (TTS) engine to generate high-quality speech, converting a series of phonemes into speech waveforms (mel-spectrograms). An on-the-fly data augmentation method is applied to these mel-spectrograms, treating them as images from which features are extracted to train a convolutional neural network (CNN) and a bidirectional long short-term memory (BLSTM)-based ASR. The E2E architecture of this system achieves optimal WER and CER performance. The proposed deep learning-based E2E-ASR, trained with synthetic speech and data augmentation, shows significant performance improvements, with a 20.75% reduction in WERs and a 10.34% reduction in CERs.
format	Article
id	doaj-art-508fec56bd624ffaa418dd7bd420a88b
institution	Kabale University
issn	1110-8665
language	English
publishDate	2025-03-01
publisher	Elsevier
record_format	Article
series	Egyptian Informatics Journal
spelling	doaj-art-508fec56bd624ffaa418dd7bd420a88b | 2025-01-29T05:00:20Z | eng | Elsevier | Egyptian Informatics Journal | 1110-8665 | 2025-03-01 | 29 | 100615 | End-to-end neural automatic speech recognition system for low resource languages | Sami Dhahbi (Applied College of Mahail Aseer, King Khalid University, Muhayil Aseer, 62529, Saudi Arabia) | Nasir Saleem (Department of Electrical Engineering, Faculty of Engineering and Technology, Gomal University, Dera Ismail Khan, Pakistan) | Sami Bourouis (Department of Information Technology, College of Computers and Information Technology, Taif University, Taif, 21944, Saudi Arabia) | Mouhebeddine Berrima (Unit of Scientific Research, Applied College, Qassim University, Buraydah, Saudi Arabia; corresponding author) | Elena Verdú (Universidad Internacional de La Rioja, Logroño, Spain) | http://www.sciencedirect.com/science/article/pii/S1110866525000088 | Speech recognition; Deep learning; Low-resource language; E2E learning; Data augmentation; Synthetic speech
title	End-to-end neural automatic speech recognition system for low resource languages
topic	Speech recognition; Deep learning; Low-resource language; E2E learning; Data augmentation; Synthetic speech
url	http://www.sciencedirect.com/science/article/pii/S1110866525000088
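
The description reports its results as reductions in word error rate (WER) and character error rate (CER). As a rough illustration of how these metrics are commonly computed, and not the authors' evaluation code, the following Python sketch divides the Levenshtein edit distance between a reference and a hypothesis by the reference length, once over words and once over characters. The function names edit_distance, wer, and cer and the example sentences are assumptions made here for illustration; text normalization (casing, punctuation) is left out.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example sentences, not data from the article.
    ref = "the cat sat"
    hyp = "the cat sit"
    print(f"WER = {wer(ref, hyp):.3f}")
    print(f"CER = {cer(ref, hyp):.3f}")
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, because the denominator is the reference length only.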

Similar Items