GSR-Fusion: A Deep Multimodal Fusion Architecture for Robust Sign Language Recognition Using RGB, Skeleton, and Graph-Based Modalities

Sign Language Recognition (SLR) plays a critical role in bridging communication gaps between the deaf and hearing communities. This research introduces GSR-Fusion, a deep multimodal fusion architecture that combines RGB-based, skeleton-based, and graph-based modalities to enhance gesture recognition. Unlike traditional unimodal models, GSR-Fusion utilizes gesture initiation and termination detection, along with a cross-modality fusion approach using a merge-network technique, enabling it to capture both spatial-temporal and relational features from multiple data sources. The model incorporates ViViT for RGB feature extraction, Transformers for sequential pose modeling, and A3T-GCN for joint graph representation, which together form a comprehensive understanding of gestures. The study investigates five key experimental setups, covering single-hand static and dynamic gestures (one, two, and three strokes) as well as two-hand static and dynamic gestures from the Thai Finger Spelling dataset. Additionally, we compare our architecture with existing models on global datasets, including WLASL and MS-ASL, to evaluate its performance. The results show that GSR-Fusion outperforms state-of-the-art models on multiple datasets. On WLASL, it achieves 83.45% accuracy for 100 classes and 75.23% for 300 classes, surpassing models like SignBERT and Fusion3. Similarly, on MS-ASL, it attains 84.31% for 100 classes and 80.57% for 200 classes, outperforming both RGB-based and skeleton-based models. These results highlight the effectiveness of GSR-Fusion in recognizing complex gestures, demonstrating its ability to generalize across different sign languages and datasets. The research emphasizes the importance of multimodal fusion in advancing sign language recognition for real-world applications.
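The abstract sketches a three-branch design: ViViT embeds the RGB clip, a Transformer embeds the pose sequence, and A3T-GCN embeds the joint graph, after which the branch features are merged and classified. The paper's exact merge network is not reproduced in this record, so the following is only a minimal PyTorch sketch of one plausible reading (concatenation-based late fusion); every dimension, name, and the backbone stand-ins are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class MergeFusionHead(nn.Module):
    """Hypothetical merge-style fusion over three modality embeddings."""

    def __init__(self, rgb_dim=768, pose_dim=256, graph_dim=128,
                 hidden=256, num_classes=100):
        super().__init__()
        # Project each branch's clip-level embedding to a shared width.
        self.rgb_proj = nn.Linear(rgb_dim, hidden)
        self.pose_proj = nn.Linear(pose_dim, hidden)
        self.graph_proj = nn.Linear(graph_dim, hidden)
        # Merge by concatenation, then classify into sign classes.
        self.classifier = nn.Sequential(
            nn.Linear(3 * hidden, hidden),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, rgb_feat, pose_feat, graph_feat):
        merged = torch.cat(
            [self.rgb_proj(rgb_feat),
             self.pose_proj(pose_feat),
             self.graph_proj(graph_feat)],
            dim=-1,
        )
        return self.classifier(merged)

# Stand-ins for ViViT / pose-Transformer / A3T-GCN outputs on a batch of 8 clips.
head = MergeFusionHead()
logits = head(torch.randn(8, 768), torch.randn(8, 256), torch.randn(8, 128))
print(logits.shape)  # torch.Size([8, 100])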

Bibliographic Details
Main Authors: Wuttichai Vijitkunsawat (ORCID: 0000-0003-2157-7661; Department of Electronics and Telecommunication Engineering, Rajamangala University of Technology Krungthep, Bangkok, Thailand); Teeradaj Racharak (ORCID: 0000-0002-8823-2361; Advanced Institute of So-Go-Chi (Convergence Knowledge) Informatics, Tohoku University, Sendai, Miyagi, Japan)
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access, vol. 13, pp. 108235-108254
DOI: 10.1109/ACCESS.2025.3581683
ISSN: 2169-3536
Subjects: Thai finger spelling; sign language recognition; multi-modality; cross-modality; fusion architecture
Online Access: https://ieeexplore.ieee.org/document/11045351/