Arabic Speech Recognition Based on Encoder-Decoder Architecture of Transformer


Bibliographic Details
Main Authors: Mohanad Sameer, Ahmed Talib, Alla Hussein, Husniza Husni
Format: Article
Language:English
Published: Middle Technical University 2023-03-01
Series:Journal of Techniques
Subjects:
Online Access:https://journal.mtu.edu.iq/index.php/MTU/article/view/749
Description
Summary:Recognizing and transcribing human speech has become an increasingly important task. Recently, researchers have been more interested in automatic speech recognition (ASR) using end-to-end models. Previous choices for Arabic ASR architectures have been time-delay neural networks, recurrent neural networks (RNN), and long short-term memory (LSTM). Previous end-to-end approaches have suffered from slow training and inference because of limited training parallelization, and they require a large amount of data to achieve acceptable results in recognizing Arabic speech. This research presents an Arabic speech recognition system based on a transformer encoder-decoder architecture with self-attention to transcribe Arabic audio speech segments into text, which can be trained faster and more efficiently. The proposed model exceeds the performance of previous end-to-end approaches on the Common Voice dataset from Mozilla. In this research, we introduce a speech-transformer model trained over 110 epochs using only 112 hours of speech. Although Arabic is considered one of the languages that are difficult for speech recognition systems to interpret, we achieved the best word error rate (WER) of 3.2 compared to other systems whose training requires a very large amount of data. The proposed system was evaluated on the Common Voice 8.0 dataset without using a language model.
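The encoder-decoder architecture the abstract describes can be sketched as a transformer that encodes audio feature frames and autoregressively decodes text tokens. This is a minimal illustrative sketch in PyTorch, not the authors' implementation: the feature type (log-mel filterbanks), model dimensions, vocabulary size, and positional-embedding scheme below are all assumptions.

```python
import torch
import torch.nn as nn

class SpeechTransformer(nn.Module):
    """Minimal encoder-decoder speech transformer (illustrative sketch;
    hyperparameters are assumptions, not the paper's configuration)."""
    def __init__(self, n_mels=80, vocab_size=100, d_model=256,
                 nhead=4, num_layers=3, max_len=512):
        super().__init__()
        # Project log-mel filterbank frames into the model dimension.
        self.input_proj = nn.Linear(n_mels, d_model)
        # Embeddings for the target text tokens (e.g. Arabic characters).
        self.tgt_embed = nn.Embedding(vocab_size, d_model)
        # Learned positional embeddings, shared by encoder and decoder.
        self.pos = nn.Embedding(max_len, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, mels, tgt_ids):
        # mels: (batch, frames, n_mels); tgt_ids: (batch, tgt_len)
        src = self.input_proj(mels)
        src = src + self.pos(torch.arange(src.size(1)))
        tgt = self.tgt_embed(tgt_ids)
        tgt = tgt + self.pos(torch.arange(tgt.size(1)))
        # Causal mask: each output position attends only to earlier tokens.
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.transformer(src, tgt, tgt_mask=mask)
        return self.out(h)  # logits over the token vocabulary

model = SpeechTransformer()
mels = torch.randn(2, 120, 80)        # 2 utterances, 120 feature frames
tgt = torch.randint(0, 100, (2, 15))  # 15 target tokens per utterance
logits = model(mels, tgt)
print(logits.shape)  # torch.Size([2, 15, 100])
```

Because every attention layer processes all time steps at once, the whole target sequence is trained in parallel, which is the parallelization advantage over RNN/LSTM decoders that the abstract refers to.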
ISSN:1818-653X
2708-8383