Arabic speech recognition using end‐to‐end deep learning

Bibliographic Details
Main Authors: Hamzah A. Alsayadi, Abdelaziz A. Abdelhamid, Islam Hegazy, Zaki T. Fayed
Format: Article
Language: English
Published: Wiley 2021-10-01
Series: IET Signal Processing
Online Access: https://doi.org/10.1049/sil2.12057
collection DOAJ
description Abstract Arabic automatic speech recognition (ASR) methods with diacritics integrate more readily with other systems than Arabic ASR methods without diacritics. In this work, the application of state‐of‐the‐art end‐to‐end deep learning approaches is investigated to build a robust diacritised Arabic ASR. These approaches use Mel‐Frequency Cepstral Coefficients and log Mel‐Scale Filter Bank energies as acoustic features. To the best of our knowledge, end‐to‐end deep learning approaches have not previously been applied to diacritised Arabic automatic speech recognition. To fill this gap, this work presents a new CTC‐based ASR, a CNN‐LSTM, and an attention‐based end‐to‐end approach for improving diacritised Arabic ASR. In addition, a word‐based language model is employed to achieve better results. The end‐to‐end approaches applied in this work are built on two state‐of‐the‐art frameworks, ESPnet and Espresso. Training and testing are performed on the Standard Arabic Single Speaker Corpus (SASSC), which contains 7 h of Modern Standard Arabic speech. Experimental results show that the CNN‐LSTM with attention outperforms both conventional ASR and the joint CTC‐attention ASR framework on Arabic speech recognition, achieving a word error rate lower than theirs by 5.24% and 2.62%, respectively.
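The results in the abstract are reported as word error rate (WER). As a minimal sketch of how that metric is commonly computed (this is not the authors' evaluation code, and the function name `word_error_rate` is a hypothetical helper introduced here), WER can be taken as the word-level edit distance between the reference and the recognised transcript, divided by the number of reference words. Tokens are compared as whole strings, so in this sketch a word with a missing or wrong diacritic counts as one substitution.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly,
    so diacritised and undiacritised forms of a word are different tokens.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For instance, scoring the hypothesis `"a x c"` against the reference `"a b c d"` involves one substitution and one deletion over four reference words, giving a WER of 0.5.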
format Article
id doaj-art-f18d8ed1b3f948f4831e4dba3acab250
institution OA Journals
issn 1751-9675; 1751-9683
citation IET Signal Processing, vol. 15, no. 8, pp. 521-534, 2021-10-01, https://doi.org/10.1049/sil2.12057
affiliations Hamzah A. Alsayadi: Computer Science Department, Faculty of Sciences, Ibb University, Ibb, Yemen
Abdelaziz A. Abdelhamid, Islam Hegazy, Zaki T. Fayed: Computer Science Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt