A Unified Framework for Recognizing Dynamic Hand Actions and Estimating Hand Pose from First-Person RGB Videos
| Main Authors: | , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-06-01 |
| Series: | Algorithms |
| Subjects: | |
| Online Access: | https://www.mdpi.com/1999-4893/18/7/393 |
| Summary: | Recognizing hand actions and poses from first-person RGB videos is crucial for applications such as human–computer interaction, but recognition accuracy is often degraded by occlusion and blurring. In this study, we propose a unified framework for action recognition and hand pose estimation in first-person RGB videos. The framework consists of two main modules: a Hand Pose Estimation Module and an Action Recognition Module. In the Hand Pose Estimation Module, each video frame passes through a feature extractor and is then fed into a multi-layer transformer encoder; per-frame hand poses and object categories are predicted by multi-layer perceptrons with a dual residual network structure. These predictions are concatenated with each frame's feature vector for the subsequent action recognition task. In the Action Recognition Module, the per-frame feature vectors are aggregated by another multi-layer transformer encoder, which captures the temporal information of the hand across video frames and recovers its motion trajectory; the final output is the action category for the sequence of consecutive frames. We conducted experiments on two publicly available datasets, FPHA and H2O, and the results show that our method achieves significant improvements on both datasets, with action recognition accuracies of 94.82% and 87.92%, respectively. A minimal sketch of this two-stage pipeline is given below the record. |
|---|---|
| ISSN: | 1999-4893 |
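
The Summary above describes a two-stage architecture: a per-frame Hand Pose Estimation Module whose pose and object predictions are concatenated with frame features, followed by an Action Recognition Module that aggregates those vectors over time with a transformer encoder. The sketch below is a minimal, hypothetical PyTorch rendering of that description, not the authors' implementation; the backbone, layer counts, the 21-joint hand layout, the object/action class counts, and the mean-pooling aggregation are all assumptions.

```python
# Hypothetical sketch of the two-module pipeline described in the Summary.
# Backbone, layer counts, joint count, and class counts are assumptions.
import torch
import torch.nn as nn


class HandPoseModule(nn.Module):
    """Per-frame features -> transformer encoder -> MLP heads for hand pose
    and object category; returns features concatenated with both predictions."""

    def __init__(self, feat_dim=512, num_joints=21, num_objects=26, depth=4):
        super().__init__()
        # Stand-in feature extractor; the paper's backbone is not specified here.
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Simple MLP heads standing in for the dual-residual prediction branches.
        self.pose_head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, num_joints * 3)
        )
        self.obj_head = nn.Linear(feat_dim, num_objects)

    def forward(self, frames):                       # frames: (B, T, C, H, W)
        B, T = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))  # (B*T, feat_dim)
        feats = self.encoder(feats.view(B, T, -1))   # (B, T, feat_dim)
        pose = self.pose_head(feats)                 # (B, T, num_joints * 3)
        obj = self.obj_head(feats)                   # (B, T, num_objects)
        # Concatenate per-frame features with the predictions for the next module.
        return torch.cat([feats, pose, obj], dim=-1)


class ActionRecognitionModule(nn.Module):
    """Aggregates per-frame vectors over time with a transformer encoder and
    classifies the action of the whole clip."""

    def __init__(self, in_dim, d_model=512, num_actions=45, depth=4):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)       # project to a head-divisible width
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.classifier = nn.Linear(d_model, num_actions)  # class count is dataset-dependent

    def forward(self, tokens):                       # tokens: (B, T, in_dim)
        pooled = self.encoder(self.proj(tokens)).mean(dim=1)  # temporal aggregation
        return self.classifier(pooled)               # (B, num_actions)
```

A toy forward pass, with all sizes illustrative; the projection in the action module simply maps the concatenated per-frame vector back to a width divisible by the number of attention heads.

```python
frames = torch.randn(2, 16, 3, 224, 224)                 # 2 clips, 16 RGB frames each
pose_net = HandPoseModule()
action_net = ActionRecognitionModule(in_dim=512 + 21 * 3 + 26)
action_logits = action_net(pose_net(frames))             # shape: (2, num_actions)
```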