GSR-Fusion: A Deep Multimodal Fusion Architecture for Robust Sign Language Recognition Using RGB, Skeleton, and Graph-Based Modalities

Bibliographic Details
Main Authors: Wuttichai Vijitkunsawat, Teeradaj Racharak
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/11045351/
Description
Summary: Sign Language Recognition (SLR) plays a critical role in bridging communication gaps between the deaf and hearing communities. This research introduces GSR-Fusion, a deep multimodal fusion architecture that combines RGB-based, skeleton-based, and graph-based modalities to enhance gesture recognition. Unlike traditional unimodal models, GSR-Fusion utilizes gesture initiation and termination detection together with a cross-modality fusion approach based on a merge-network technique, enabling it to capture both spatial-temporal and relational features from multiple data sources. The model incorporates ViViT for RGB feature extraction, Transformers for sequential pose modeling, and A3T-GCN for joint graph representation, which together form a comprehensive understanding of gestures. The study investigates five key experimental setups, covering single-hand static and dynamic gestures (one, two, and three strokes) as well as two-hand static and dynamic gestures from the Thai Finger Spelling dataset. Additionally, we compare our architecture with existing models on global datasets, including WLASL and MS-ASL, to evaluate its performance. The results show that GSR-Fusion outperforms state-of-the-art models on multiple datasets. On WLASL, it achieves 83.45% accuracy for 100 classes and 75.23% for 300 classes, surpassing models such as SignBERT and Fusion3. Similarly, on MS-ASL, it attains 84.31% for 100 classes and 80.57% for 200 classes, outperforming both RGB-based and skeleton-based models. These results highlight the effectiveness of GSR-Fusion in recognizing complex gestures and demonstrate its ability to generalize across different sign languages and datasets. The research emphasizes the importance of multimodal fusion in advancing sign language recognition for real-world applications.
ISSN: 2169-3536
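
The abstract describes a merge-style fusion of three modality streams (ViViT for RGB, a Transformer for pose sequences, and A3T-GCN for joint graphs). The Python/PyTorch sketch below is a minimal illustration of that merge (concatenation) fusion pattern, not the authors' implementation: the encoders are placeholder linear layers, and all layer sizes, feature dimensions, and names are assumptions made for the example.

import torch
import torch.nn as nn

class MergeFusionClassifier(nn.Module):
    """Toy merge-style fusion: concatenate per-modality embeddings, then classify."""

    def __init__(self, dims=(512, 256, 256), num_classes=100):
        super().__init__()
        # Placeholder encoders standing in for ViViT (RGB), a Transformer (pose),
        # and A3T-GCN (joint graph); input sizes here are illustrative assumptions.
        self.rgb_enc = nn.Linear(1024, dims[0])
        self.pose_enc = nn.Linear(132, dims[1])
        self.graph_enc = nn.Linear(128, dims[2])
        self.classifier = nn.Sequential(
            nn.Linear(sum(dims), 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes),
        )

    def forward(self, rgb_feat, pose_feat, graph_feat):
        # "Merge" step: concatenate the three modality embeddings.
        z = torch.cat(
            [self.rgb_enc(rgb_feat),
             self.pose_enc(pose_feat),
             self.graph_enc(graph_feat)],
            dim=-1,
        )
        return self.classifier(z)

if __name__ == "__main__":
    model = MergeFusionClassifier(num_classes=100)   # e.g. a 100-class setting like WLASL-100
    rgb = torch.randn(4, 1024)    # pooled RGB clip features (assumed size)
    pose = torch.randn(4, 132)    # pooled keypoint-sequence features (assumed size)
    graph = torch.randn(4, 128)   # pooled graph-node features (assumed size)
    print(model(rgb, pose, graph).shape)  # torch.Size([4, 100])

Concatenating per-modality embeddings before a shared classifier is the simplest form of merge-style fusion; the actual GSR-Fusion model may use a more elaborate cross-modality mechanism than this sketch shows.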