GSR-Fusion: A Deep Multimodal Fusion Architecture for Robust Sign Language Recognition Using RGB, Skeleton, and Graph-Based Modalities
Sign Language Recognition (SLR) plays a critical role in bridging communication gaps between the deaf and hearing communities. This research introduces GSR-Fusion, a deep multimodal fusion architecture that combines RGB-based, skeleton-based, and graph-based modalities to enhance gesture recognition...
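To make the architecture concrete, here is a minimal, hypothetical sketch of the "merge"-style cross-modality fusion the abstract describes. Only the overall idea (three per-modality encoders whose outputs are merged before classification) comes from this record; the PyTorch module, the embedding dimensions, and the concatenate-then-classify design are illustrative assumptions, not the authors' implementation. A second sketch, for the gesture initiation/termination detection the abstract also mentions, follows the full record below.

```python
# Illustrative sketch of merge-style cross-modality fusion (assumed design).
# Upstream encoders (ViViT for RGB, a Transformer for pose sequences,
# A3T-GCN for the joint graph) are treated as black boxes that each emit
# one clip-level embedding; the dimensions below are placeholders.
import torch
import torch.nn as nn

class MergeFusionHead(nn.Module):
    def __init__(self, dims=(768, 256, 256), hidden=512, num_classes=100):
        super().__init__()
        # Concatenated modality features are projected to a shared space,
        # then classified over the sign vocabulary.
        self.proj = nn.Linear(sum(dims), hidden)
        self.classifier = nn.Sequential(
            nn.ReLU(), nn.Dropout(0.3), nn.Linear(hidden, num_classes)
        )

    def forward(self, rgb_feat, pose_feat, graph_feat):
        # Each input: a (batch, dim) embedding from its modality encoder.
        fused = torch.cat([rgb_feat, pose_feat, graph_feat], dim=-1)
        return self.classifier(self.proj(fused))

# Usage with random stand-in features for a batch of 4 clips:
head = MergeFusionHead(num_classes=100)
logits = head(torch.randn(4, 768), torch.randn(4, 256), torch.randn(4, 256))
print(logits.shape)  # torch.Size([4, 100])
```

Concatenation is the simplest possible merge; the paper may instead use weighted or attention-based merging, which this record does not specify.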
Saved in:
| Main Authors: | Wuttichai Vijitkunsawat, Teeradaj Racharak |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Thai finger spelling; sign language recognition; multi-modality; cross-modality; fusion architecture |
| Online Access: | https://ieeexplore.ieee.org/document/11045351/ |
| author | Wuttichai Vijitkunsawat; Teeradaj Racharak |
|---|---|
| collection | DOAJ |
| description | Sign Language Recognition (SLR) plays a critical role in bridging communication gaps between the deaf and hearing communities. This research introduces GSR-Fusion, a deep multimodal fusion architecture that combines RGB-based, skeleton-based, and graph-based modalities to enhance gesture recognition. Unlike traditional unimodal models, GSR-Fusion utilizes gesture initiation and termination detection, along with a cross-modality fusion approach using a merge (network) technique, enabling it to capture both spatial-temporal and relational features from multiple data sources. The model incorporates ViViT for RGB feature extraction, Transformers for sequential pose modeling, and A3T-GCN for joint graph representation, which together form a comprehensive understanding of gestures. The study investigates five key experimental setups, covering single-hand static and dynamic gestures (one, two, and three strokes) as well as two-hand static and dynamic gestures from the Thai Finger Spelling dataset. Additionally, we compare our architecture with existing models on global datasets, including WLASL and MS-ASL, to evaluate its performance. The results show that GSR-Fusion outperforms state-of-the-art models on multiple datasets. On WLASL, it achieves 83.45% accuracy for 100 classes and 75.23% for 300 classes, surpassing models like SignBERT and Fusion3. Similarly, on MS-ASL, it attains 84.31% for 100 classes and 80.57% for 200 classes, outperforming both RGB-based and skeleton-based models. These results highlight the effectiveness of GSR-Fusion in recognizing complex gestures, demonstrating its ability to generalize across different sign languages and datasets. The research emphasizes the importance of multimodal fusion in advancing sign language recognition for real-world applications. |
| format | Article |
| id | doaj-art-a930d76e5c2a48349adbc00a179e5f68 |
| institution | DOAJ |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| doi | 10.1109/ACCESS.2025.3581683 |
| citation | IEEE Access, vol. 13, pp. 108235-108254, 2025 |
| authors_orcid | Wuttichai Vijitkunsawat (0000-0003-2157-7661); Teeradaj Racharak (0000-0002-8823-2361) |
| affiliations | Department of Electronics and Telecommunication Engineering, Rajamangala University of Technology Krungthep, Bangkok, Thailand (Vijitkunsawat); Advanced Institute of So-Go-Chi (Convergence Knowledge) Informatics, Tohoku University, Sendai, Miyagi, Japan (Racharak) |
| title | GSR-Fusion: A Deep Multimodal Fusion Architecture for Robust Sign Language Recognition Using RGB, Skeleton, and Graph-Based Modalities |
| topic | Thai finger spelling; sign language recognition; multi-modality; cross-modality; fusion architecture |
| url | https://ieeexplore.ieee.org/document/11045351/ |
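The abstract also credits GSR-Fusion's gesture initiation and termination detection, but the record does not say how that detection works. The following is therefore only a plausible baseline, assuming pose keypoints are available: threshold the wrist's frame-to-frame motion energy with hysteresis. The function name, thresholds, and energy definition are all hypothetical.

```python
# Illustrative only: one simple way to detect gesture initiation/termination
# from pose keypoints, using wrist motion energy with hysteresis thresholds.
# Not the paper's method; thresholds and energy definition are assumptions.
import numpy as np

def detect_gesture_spans(wrist_xy, start_thresh=0.02, stop_thresh=0.01,
                         min_len=5):
    """wrist_xy: (T, 2) array of normalized wrist coordinates per frame.
    Returns (start_frame, end_frame) spans where motion energy rises above
    start_thresh and later falls below stop_thresh."""
    # Per-frame motion energy: displacement magnitude between frames.
    energy = np.linalg.norm(np.diff(wrist_xy, axis=0), axis=1)
    spans, start, active = [], 0, False
    for t, e in enumerate(energy):
        if not active and e > start_thresh:
            start, active = t, True
        elif active and e < stop_thresh:
            if t - start >= min_len:
                spans.append((start, t))
            active = False
    if active and len(energy) - start >= min_len:
        spans.append((start, len(energy)))
    return spans

# Synthetic check: still hand, then a burst of motion, then still again.
T = 60
xy = np.zeros((T, 2))
xy[20:40] = np.cumsum(np.full((20, 2), 0.03), axis=0)  # moving segment
print(detect_gesture_spans(xy))  # [(19, 40)] - one detected gesture
```

A real pipeline would likely smooth the energy signal and combine several keypoints (both wrists, fingertips) before thresholding; the spans found here would delimit the clips fed to the three modality encoders.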