Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading
Lipreading involves recognizing spoken words by analyzing the movements of the lips and surrounding area using visual data. It is an emerging research topic with many potential applications, such as human–machine interaction and enhancing audio-based speech recognition. Recent deep learning approaches integrate visual features from the mouth region and lip contours. However, simple methods such as concatenation may not effectively optimize the feature vector. In this article, we propose extracting optimal visual features using 3D convolution blocks followed by a ResNet-18, while employing a graph neural network to extract geometric features from tracked lip landmarks. To fuse these complementary features, we introduce a cross-attention mechanism that combines visual and geometric information to obtain an optimal representation of lip movements for lipreading tasks. To validate our approach for Arabic, we introduce the first large-scale Lipreading in the Wild for Arabic (LRW-AR) dataset, consisting of 20,000 videos across 100 word classes, spoken by 36 speakers. Experimental results on both the LRW-AR and LRW datasets demonstrate the effectiveness of our approach, achieving accuracies of 85.85% and 89.41%, respectively.
Main Authors: | Samar Daou, Achraf Ben-Hamadou, Ahmed Rekik, Abdelaziz Kallel |
Format: | Article |
Language: | English |
Published: | MDPI AG, 2025-01-01 |
Series: | Technologies |
Subjects: | lipreading; deep learning; LRW-AR; graph neural networks; Transformer; Arabic language |
Online Access: | https://www.mdpi.com/2227-7080/13/1/26 |
_version_ | 1832587420652011520 |
author | Samar Daou; Achraf Ben-Hamadou; Ahmed Rekik; Abdelaziz Kallel |
author_facet | Samar Daou; Achraf Ben-Hamadou; Ahmed Rekik; Abdelaziz Kallel |
author_sort | Samar Daou |
collection | DOAJ |
description | Lipreading involves recognizing spoken words by analyzing the movements of the lips and surrounding area using visual data. It is an emerging research topic with many potential applications, such as human–machine interaction and enhancing audio-based speech recognition. Recent deep learning approaches integrate visual features from the mouth region and lip contours. However, simple methods such as concatenation may not effectively optimize the feature vector. In this article, we propose extracting optimal visual features using 3D convolution blocks followed by a ResNet-18, while employing a graph neural network to extract geometric features from tracked lip landmarks. To fuse these complementary features, we introduce a cross-attention mechanism that combines visual and geometric information to obtain an optimal representation of lip movements for lipreading tasks. To validate our approach for Arabic, we introduce the first large-scale Lipreading in the Wild for Arabic (LRW-AR) dataset, consisting of 20,000 videos across 100 word classes, spoken by 36 speakers. Experimental results on both the LRW-AR and LRW datasets demonstrate the effectiveness of our approach, achieving accuracies of 85.85% and 89.41%, respectively. |
format | Article |
id | doaj-art-433e1e16b4e34bd69d7a08713abb3d0d |
institution | Kabale University |
issn | 2227-7080 |
language | English |
publishDate | 2025-01-01 |
publisher | MDPI AG |
record_format | Article |
series | Technologies |
spelling | doaj-art-433e1e16b4e34bd69d7a08713abb3d0d (2025-01-24T13:50:47Z); eng; MDPI AG; Technologies; ISSN 2227-7080; 2025-01-01; vol. 13, iss. 1, art. 26; doi:10.3390/technologies13010026; Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading; Samar Daou, Achraf Ben-Hamadou, Ahmed Rekik, Abdelaziz Kallel (all: SMARTS Laboratory, Technopark of Sfax, Sakiet Ezzit, Sfax 3021, Tunisia); Lipreading involves recognizing spoken words by analyzing the movements of the lips and surrounding area using visual data. It is an emerging research topic with many potential applications, such as human–machine interaction and enhancing audio-based speech recognition. Recent deep learning approaches integrate visual features from the mouth region and lip contours. However, simple methods such as concatenation may not effectively optimize the feature vector. In this article, we propose extracting optimal visual features using 3D convolution blocks followed by a ResNet-18, while employing a graph neural network to extract geometric features from tracked lip landmarks. To fuse these complementary features, we introduce a cross-attention mechanism that combines visual and geometric information to obtain an optimal representation of lip movements for lipreading tasks. To validate our approach for Arabic, we introduce the first large-scale Lipreading in the Wild for Arabic (LRW-AR) dataset, consisting of 20,000 videos across 100 word classes, spoken by 36 speakers. Experimental results on both the LRW-AR and LRW datasets demonstrate the effectiveness of our approach, achieving accuracies of 85.85% and 89.41%, respectively.; https://www.mdpi.com/2227-7080/13/1/26; lipreading; deep learning; LRW-AR; graph neural networks; Transformer; Arabic language |
spellingShingle | Samar Daou; Achraf Ben-Hamadou; Ahmed Rekik; Abdelaziz Kallel; Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading; Technologies; lipreading; deep learning; LRW-AR; graph neural networks; Transformer; Arabic language |
title | Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading |
title_full | Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading |
title_fullStr | Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading |
title_full_unstemmed | Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading |
title_short | Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading |
title_sort | cross attention fusion of visual and geometric features for large vocabulary arabic lipreading |
topic | lipreading; deep learning; LRW-AR; graph neural networks; Transformer; Arabic language |
url | https://www.mdpi.com/2227-7080/13/1/26 |
work_keys_str_mv | AT samardaou crossattentionfusionofvisualandgeometricfeaturesforlargevocabularyarabiclipreading AT achrafbenhamadou crossattentionfusionofvisualandgeometricfeaturesforlargevocabularyarabiclipreading AT ahmedrekik crossattentionfusionofvisualandgeometricfeaturesforlargevocabularyarabiclipreading AT abdelazizkallel crossattentionfusionofvisualandgeometricfeaturesforlargevocabularyarabiclipreading |
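The abstract above describes fusing 3D-CNN/ResNet-18 visual features with GNN-derived geometric features through a cross-attention mechanism. The paper's exact fusion module is not reproduced in this record; the following is a minimal single-head cross-attention sketch in NumPy under the common convention that one stream supplies the queries and the other the keys and values. All dimension sizes, weight shapes, and initializations here are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(geo, vis, w_q, w_k, w_v):
    """Single-head cross-attention: geometric features (queries)
    attend over visual features (keys/values).

    geo: (T, d_geo) per-frame landmark/GNN embeddings
    vis: (T, d_vis) per-frame 3D-CNN/ResNet embeddings
    Returns fused features of shape (T, d_model).
    """
    q = geo @ w_q                             # (T, d_model)
    k = vis @ w_k                             # (T, d_model)
    v = vis @ w_v                             # (T, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, T) frame-to-frame affinities
    return softmax(scores, axis=-1) @ v       # weighted sum of visual values

# Toy usage: 29 frames (a typical LRW clip length), illustrative feature sizes.
rng = np.random.default_rng(0)
T, d_geo, d_vis, d_model = 29, 64, 512, 256
geo = rng.standard_normal((T, d_geo))
vis = rng.standard_normal((T, d_vis))
w_q = rng.standard_normal((d_geo, d_model)) * 0.05
w_k = rng.standard_normal((d_vis, d_model)) * 0.05
w_v = rng.standard_normal((d_vis, d_model)) * 0.05
fused = cross_attention(geo, vis, w_q, w_k, w_v)
print(fused.shape)  # (29, 256)
```

In a full pipeline these fused per-frame vectors would feed a temporal back-end (e.g. a Transformer) before word classification; which stream plays the query role is a design choice the sketch leaves open.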