Arabic Speech Recognition Based on Encoder-Decoder Architecture of Transformer


Bibliographic Details
Main Authors: Mohanad Sameer, Ahmed Talib, Alla Hussein, Husniza Husni
Format: Article
Language:English
Published: Middle Technical University, 2023-03-01
Series:Journal of Techniques
Subjects: Sequence to Sequence ASR; Arabic ASR; Transformer-Speech Recognition; Arabic Speech to Text
Online Access:https://journal.mtu.edu.iq/index.php/MTU/article/view/749
collection DOAJ
description Recognizing and transcribing human speech has become an increasingly important task. Recently, researchers have shown growing interest in automatic speech recognition (ASR) using end-to-end models. Previous choices for Arabic ASR architectures have been time-delay neural networks, recurrent neural networks (RNNs), and long short-term memory (LSTM) networks. Previous end-to-end approaches have suffered from slow training and inference because of limited training parallelization, and they require a large amount of data to achieve acceptable results on Arabic speech. This research presents an Arabic speech recognition system based on a transformer encoder-decoder architecture with self-attention that transcribes Arabic audio segments into text and can be trained faster and more efficiently. The proposed model exceeds the performance of previous end-to-end approaches on the Common Voice dataset from Mozilla. We introduce a speech-transformer model trained over 110 epochs using only 112 hours of speech. Although Arabic is considered one of the languages that are difficult for speech recognition systems to interpret, we achieved a best word error rate (WER) of 3.2 compared to other systems whose training requires a very large amount of data. The proposed system was evaluated on the Common Voice 8.0 dataset without using a language model.
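The self-attention operation at the core of the transformer encoder-decoder described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the frame count, feature dimension, and identity Q/K/V projections are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (T_q, T_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V, weights

# Toy "audio frames": 4 frames of 8-dimensional acoustic features.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))

# In self-attention, Q, K, and V are all projections of the same sequence;
# identity projections are used here for brevity.
out, w = scaled_dot_product_attention(X, X, X)
print(out.shape)                          # (4, 8): one context vector per frame
print(np.allclose(w.sum(axis=1), 1.0))    # True: each attention row sums to 1
```

Because every frame attends to every other frame in a single matrix product, the whole sequence is processed in parallel, which is the parallelization advantage over RNN/LSTM recurrence that the abstract highlights.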
id doaj-art-45ef8eb4b7b24186a9cb668e48c3e326
institution Kabale University
issn 1818-653X
2708-8383
doi 10.51173/jt.v5i1.749
container Journal of Techniques, Vol. 5, No. 1, 2023-03-01, Middle Technical University
affiliation Mohanad Sameer: Technical College of Management - Baghdad, Middle Technical University, Baghdad, Iraq
affiliation Ahmed Talib: Technical College of Management - Baghdad, Middle Technical University, Baghdad, Iraq
affiliation Alla Hussein: Technical Institute / Kut, Middle Technical University, Baghdad, Iraq
affiliation Husniza Husni: Universiti Utara Malaysia, 06010 Sintok, Kedah, Malaysia
topic Sequence to Sequence ASR
Arabic ASR
Transformer-Speech Recognition
Arabic Speech to Text