Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading

Lipreading involves recognizing spoken words by analyzing the movements of the lips and surrounding area using visual data. It is an emerging research topic with many potential applications, such as human–machine interaction and enhancing audio-based speech recognition. Recent deep learning approaches integrate visual features from the mouth region and lip contours. However, simple methods such as concatenation may not effectively optimize the feature vector. In this article, we propose extracting optimal visual features using 3D convolution blocks followed by a ResNet-18, while employing a graph neural network to extract geometric features from tracked lip landmarks. To fuse these complementary features, we introduce a cross-attention mechanism that combines visual and geometric information to obtain an optimal representation of lip movements for lipreading tasks. To validate our approach for Arabic, we introduce the first large-scale Lipreading in the Wild for Arabic (LRW-AR) dataset, consisting of 20,000 videos across 100 word classes, spoken by 36 speakers. Experimental results on both the LRW-AR and LRW datasets demonstrate the effectiveness of our approach, achieving accuracies of 85.85% and 89.41%, respectively.
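The cross-attention fusion described in the abstract can be sketched as follows. This is an illustrative, single-head NumPy version, not the authors' implementation: the projection matrices `Wq`, `Wk`, `Wv` (learned in the actual model, random here), the feature dimension, and the direction of attention (visual queries over geometric keys and values) are all assumptions for the sake of the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(visual, geometric, seed=0):
    """Fuse two feature sequences with single-head cross-attention.

    visual:    (T, d) per-frame features, e.g. from a 3D-CNN + ResNet-18 branch
    geometric: (T, d) per-frame features, e.g. from a GNN over lip landmarks
    Returns a (T, d) representation of the geometric stream, weighted by
    its relevance to each visual frame.
    """
    d = visual.shape[-1]
    rng = np.random.default_rng(seed)
    # Random stand-ins for projection matrices that would be learned in practice.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = visual @ Wq, geometric @ Wk, geometric @ Wv
    scores = (Q @ K.T) / np.sqrt(d)      # (T, T) similarity logits
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # (T, d) fused features

# Toy example: 29 frames of 64-dim features per stream
rng = np.random.default_rng(1)
fused = cross_attention_fuse(rng.standard_normal((29, 64)),
                             rng.standard_normal((29, 64)))
print(fused.shape)  # (29, 64)
```

Queries from one modality attending over keys/values of the other lets each visual frame pull in the most relevant landmark geometry, rather than simply concatenating the two vectors.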

Bibliographic Details
Main Authors: Samar Daou, Achraf Ben-Hamadou, Ahmed Rekik, Abdelaziz Kallel
Format: Article
Language: English
Published: MDPI AG 2025-01-01
Series: Technologies
Subjects: lipreading; deep learning; LRW-AR; graph neural networks; Transformer; Arabic language
Online Access: https://www.mdpi.com/2227-7080/13/1/26
collection DOAJ
description Lipreading involves recognizing spoken words by analyzing the movements of the lips and surrounding area using visual data. It is an emerging research topic with many potential applications, such as human–machine interaction and enhancing audio-based speech recognition. Recent deep learning approaches integrate visual features from the mouth region and lip contours. However, simple methods such as concatenation may not effectively optimize the feature vector. In this article, we propose extracting optimal visual features using 3D convolution blocks followed by a ResNet-18, while employing a graph neural network to extract geometric features from tracked lip landmarks. To fuse these complementary features, we introduce a cross-attention mechanism that combines visual and geometric information to obtain an optimal representation of lip movements for lipreading tasks. To validate our approach for Arabic, we introduce the first large-scale Lipreading in the Wild for Arabic (LRW-AR) dataset, consisting of 20,000 videos across 100 word classes, spoken by 36 speakers. Experimental results on both the LRW-AR and LRW datasets demonstrate the effectiveness of our approach, achieving accuracies of 85.85% and 89.41%, respectively.
format Article
id doaj-art-433e1e16b4e34bd69d7a08713abb3d0d
institution Kabale University
issn 2227-7080
language English
publishDate 2025-01-01
publisher MDPI AG
record_format Article
series Technologies
doi 10.3390/technologies13010026
affiliation SMARTS Laboratory, Technopark of Sfax, Sakiet Ezzit, Sfax 3021, Tunisia (Daou, Ben-Hamadou, Rekik, Kallel)
topic lipreading
deep learning
LRW-AR
graph neural networks
Transformer
Arabic language