Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading
Lipreading involves recognizing spoken words by analyzing the movements of the lips and surrounding area using visual data. It is an emerging research topic with many potential applications, such as human–machine interaction and enhancing audio-based speech recognition. Recent deep learning approaches integrate visual features from the mouth region and lip contours. However, simple methods such as concatenation may not effectively optimize the feature vector. In this article, we propose extracting optimal visual features using 3D convolution blocks followed by a ResNet-18, while employing a graph neural network to extract geometric features from tracked lip landmarks. To fuse these complementary features, we introduce a cross-attention mechanism that combines visual and geometric information to obtain an optimal representation of lip movements for lipreading tasks. To validate our approach for Arabic, we introduce the first large-scale Lipreading in the Wild for Arabic (LRW-AR) dataset, consisting of 20,000 videos across 100 word classes, spoken by 36 speakers. Experimental results on both the LRW-AR and LRW datasets demonstrate the effectiveness of our approach, achieving accuracies of 85.85% and 89.41%, respectively.
Main Authors: | Samar Daou, Achraf Ben-Hamadou, Ahmed Rekik, Abdelaziz Kallel |
Format: | Article |
Language: | English |
Published: | MDPI AG, 2025-01-01 |
Series: | Technologies |
Subjects: | lipreading; deep learning; LRW-AR; graph neural networks; Transformer; Arabic language |
Online Access: | https://www.mdpi.com/2227-7080/13/1/26 |
_version_ | 1832587420652011520 |
author | Samar Daou; Achraf Ben-Hamadou; Ahmed Rekik; Abdelaziz Kallel |
author_facet | Samar Daou; Achraf Ben-Hamadou; Ahmed Rekik; Abdelaziz Kallel |
author_sort | Samar Daou |
collection | DOAJ |
description | Lipreading involves recognizing spoken words by analyzing the movements of the lips and surrounding area using visual data. It is an emerging research topic with many potential applications, such as human–machine interaction and enhancing audio-based speech recognition. Recent deep learning approaches integrate visual features from the mouth region and lip contours. However, simple methods such as concatenation may not effectively optimize the feature vector. In this article, we propose extracting optimal visual features using 3D convolution blocks followed by a ResNet-18, while employing a graph neural network to extract geometric features from tracked lip landmarks. To fuse these complementary features, we introduce a cross-attention mechanism that combines visual and geometric information to obtain an optimal representation of lip movements for lipreading tasks. To validate our approach for Arabic, we introduce the first large-scale Lipreading in the Wild for Arabic (LRW-AR) dataset, consisting of 20,000 videos across 100 word classes, spoken by 36 speakers. Experimental results on both the LRW-AR and LRW datasets demonstrate the effectiveness of our approach, achieving accuracies of 85.85% and 89.41%, respectively. |
format | Article |
id | doaj-art-433e1e16b4e34bd69d7a08713abb3d0d |
institution | Kabale University |
issn | 2227-7080 |
language | English |
publishDate | 2025-01-01 |
publisher | MDPI AG |
record_format | Article |
series | Technologies |
spelling | doaj-art-433e1e16b4e34bd69d7a08713abb3d0d (2025-01-24T13:50:47Z); eng; MDPI AG; Technologies; ISSN 2227-7080; 2025-01-01; vol. 13, iss. 1, art. 26; doi:10.3390/technologies13010026; Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading; Samar Daou, Achraf Ben-Hamadou, Ahmed Rekik, Abdelaziz Kallel (all: SMARTS Laboratory, Technopark of Sfax, Sakiet Ezzit, Sfax 3021, Tunisia); Lipreading involves recognizing spoken words by analyzing the movements of the lips and surrounding area using visual data. It is an emerging research topic with many potential applications, such as human–machine interaction and enhancing audio-based speech recognition. Recent deep learning approaches integrate visual features from the mouth region and lip contours. However, simple methods such as concatenation may not effectively optimize the feature vector. In this article, we propose extracting optimal visual features using 3D convolution blocks followed by a ResNet-18, while employing a graph neural network to extract geometric features from tracked lip landmarks. To fuse these complementary features, we introduce a cross-attention mechanism that combines visual and geometric information to obtain an optimal representation of lip movements for lipreading tasks. To validate our approach for Arabic, we introduce the first large-scale Lipreading in the Wild for Arabic (LRW-AR) dataset, consisting of 20,000 videos across 100 word classes, spoken by 36 speakers. Experimental results on both the LRW-AR and LRW datasets demonstrate the effectiveness of our approach, achieving accuracies of 85.85% and 89.41%, respectively.; https://www.mdpi.com/2227-7080/13/1/26; lipreading; deep learning; LRW-AR; graph neural networks; Transformer; Arabic language |
spellingShingle | Samar Daou; Achraf Ben-Hamadou; Ahmed Rekik; Abdelaziz Kallel; Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading; Technologies; lipreading; deep learning; LRW-AR; graph neural networks; Transformer; Arabic language |
title | Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading |
title_full | Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading |
title_fullStr | Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading |
title_full_unstemmed | Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading |
title_short | Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading |
title_sort | cross attention fusion of visual and geometric features for large vocabulary arabic lipreading |
topic | lipreading; deep learning; LRW-AR; graph neural networks; Transformer; Arabic language |
url | https://www.mdpi.com/2227-7080/13/1/26 |
work_keys_str_mv | AT samardaou crossattentionfusionofvisualandgeometricfeaturesforlargevocabularyarabiclipreading AT achrafbenhamadou crossattentionfusionofvisualandgeometricfeaturesforlargevocabularyarabiclipreading AT ahmedrekik crossattentionfusionofvisualandgeometricfeaturesforlargevocabularyarabiclipreading AT abdelazizkallel crossattentionfusionofvisualandgeometricfeaturesforlargevocabularyarabiclipreading |
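The abstract above describes fusing 3D-CNN/ResNet-18 visual features with GNN-derived geometric features through a cross-attention mechanism. The paper's exact fusion module is not reproduced in this record; the following is a minimal single-head cross-attention sketch in NumPy under the common convention that one stream supplies the queries and the other the keys and values. All dimension sizes, weight shapes, and initializations here are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(geo, vis, w_q, w_k, w_v):
    """Single-head cross-attention: geometric features (queries)
    attend over visual features (keys/values).

    geo: (T, d_geo) per-frame landmark/GNN embeddings
    vis: (T, d_vis) per-frame 3D-CNN/ResNet embeddings
    Returns fused features of shape (T, d_model).
    """
    q = geo @ w_q                             # (T, d_model)
    k = vis @ w_k                             # (T, d_model)
    v = vis @ w_v                             # (T, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, T) frame-to-frame affinities
    return softmax(scores, axis=-1) @ v       # weighted sum of visual values

# Toy usage: 29 frames (a typical LRW clip length), illustrative feature sizes.
rng = np.random.default_rng(0)
T, d_geo, d_vis, d_model = 29, 64, 512, 256
geo = rng.standard_normal((T, d_geo))
vis = rng.standard_normal((T, d_vis))
w_q = rng.standard_normal((d_geo, d_model)) * 0.05
w_k = rng.standard_normal((d_vis, d_model)) * 0.05
w_v = rng.standard_normal((d_vis, d_model)) * 0.05
fused = cross_attention(geo, vis, w_q, w_k, w_v)
print(fused.shape)  # (29, 256)
```

In a full pipeline these fused per-frame vectors would feed a temporal back-end (e.g. a Transformer) before word classification; which stream plays the query role is a design choice the sketch leaves open.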