Toward accurate hand mesh estimation via masked image modeling
Introduction: With an enormous number of hand images generated over time, leveraging unlabeled images for pose estimation is an emerging yet challenging topic. While some semi-supervised and self-supervised methods have emerged, they are constrained by their reliance on high-quality keypoint detection...
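The pretraining recipe summarized in the abstract — masked image modeling with a teacher–student self-distillation target — can be sketched in a few lines. Everything below is an illustrative assumption, not the paper's actual architecture: the "encoder" is a single linear projection standing in for the vision transformer, and the 60% mask ratio and 0.996 EMA momentum are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patch tokens."""
    H, W, C = img.shape
    gh, gw = H // patch, W // patch
    p = img[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    return p.transpose(0, 2, 1, 3, 4).reshape(gh * gw, -1)

def random_mask(n_tokens, ratio, rng):
    """Boolean mask: True = token hidden from the student."""
    idx = rng.permutation(n_tokens)[: int(n_tokens * ratio)]
    mask = np.zeros(n_tokens, dtype=bool)
    mask[idx] = True
    return mask

def encode(tokens, W):
    """Stand-in for a ViT encoder: a single linear projection."""
    return tokens @ W

def ema_update(teacher_W, student_W, momentum=0.996):
    """Teacher weights track the student by exponential moving average."""
    return momentum * teacher_W + (1 - momentum) * student_W

# Toy setup: one 64x64 RGB image -> 16 patch tokens of dimension 768.
img = rng.normal(size=(64, 64, 3))
tokens = patchify(img)
mask = random_mask(len(tokens), 0.6, rng)

d_in, d_out = tokens.shape[1], 32
student_W = rng.normal(size=(d_in, d_out)) * 0.01
teacher_W = student_W.copy()

# Teacher sees the full image; the student's masked tokens are zeroed out.
targets = encode(tokens, teacher_W)
student_in = tokens.copy()
student_in[mask] = 0.0
preds = encode(student_in, student_W)

# Self-distillation loss only on masked positions; in real training the
# student is updated by gradient descent and the teacher by EMA.
loss = np.mean((preds[mask] - targets[mask]) ** 2)
teacher_W = ema_update(teacher_W, student_W)
```

The key design point carried over from the abstract is that the regression target for masked tokens comes from the teacher's features on the unmasked image, so no keypoint annotations are required during pretraining.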
Main Authors: Yanli Li, Congyi Wang, Huan Wang
Format: Article
Language: English
Published: Frontiers Media S.A., 2025-01-01
Series: Frontiers in Physics
Subjects: 3D hand mesh estimation; multi-granularity representation; self-supervised learning; masked image modeling; vision transformer
Online Access: https://www.frontiersin.org/articles/10.3389/fphy.2024.1515842/full
_version_ | 1832582973635952640 |
---|---|
author | Yanli Li; Congyi Wang; Huan Wang |
author_facet | Yanli Li; Congyi Wang; Huan Wang |
author_sort | Yanli Li |
collection | DOAJ |
description | Introduction: With an enormous number of hand images generated over time, leveraging unlabeled images for pose estimation is an emerging yet challenging topic. While some semi-supervised and self-supervised methods have emerged, they are constrained by their reliance on high-quality keypoint detection models or complicated network architectures. Methods: We propose a novel self-supervised pretraining strategy for 3D hand mesh regression. Our approach integrates a multi-granularity strategy with pseudo-keypoint alignment in a teacher–student framework, employing self-distillation and masked image modeling for comprehensive representation learning. We pair this with a robust pose estimation baseline, combining a standard vision transformer backbone with a pyramidal mesh alignment feedback head. Results: Extensive experiments demonstrate the competitive performance of our method, HandMIM, across diverse datasets, notably achieving an 8.00 mm Procrustes alignment vertex point error on the challenging HO3Dv2 test set, which features severe hand occlusions, surpassing many specially optimized architectures. |
format | Article |
id | doaj-art-bc71b096a18d4e849e34ceb7057a654c |
institution | Kabale University |
issn | 2296-424X |
language | English |
publishDate | 2025-01-01 |
publisher | Frontiers Media S.A. |
record_format | Article |
series | Frontiers in Physics |
spelling | doaj-art-bc71b096a18d4e849e34ceb7057a654c; 2025-01-29T05:21:26Z; eng; Frontiers Media S.A.; Frontiers in Physics; 2296-424X; 2025-01-01; 12; 10.3389/fphy.2024.1515842; 1515842; Toward accurate hand mesh estimation via masked image modeling; Yanli Li (0); Congyi Wang (1); Huan Wang (2); (0) Fuzhou Medical College of Nanchang University, Fuzhou, China; (1) Financial Technology Research Institute of the Industrial Bank, Fuzhou, China; (2) Industrial Technology Research Center, Guangdong Institute of Scientific and Technical Information, Guangzhou, China; Introduction: With an enormous number of hand images generated over time, leveraging unlabeled images for pose estimation is an emerging yet challenging topic. While some semi-supervised and self-supervised methods have emerged, they are constrained by their reliance on high-quality keypoint detection models or complicated network architectures. Methods: We propose a novel self-supervised pretraining strategy for 3D hand mesh regression. Our approach integrates a multi-granularity strategy with pseudo-keypoint alignment in a teacher–student framework, employing self-distillation and masked image modeling for comprehensive representation learning. We pair this with a robust pose estimation baseline, combining a standard vision transformer backbone with a pyramidal mesh alignment feedback head. Results: Extensive experiments demonstrate the competitive performance of our method, HandMIM, across diverse datasets, notably achieving an 8.00 mm Procrustes alignment vertex point error on the challenging HO3Dv2 test set, which features severe hand occlusions, surpassing many specially optimized architectures. https://www.frontiersin.org/articles/10.3389/fphy.2024.1515842/full; 3D hand mesh estimation; multi-granularity representation; self-supervised learning; masked image modeling; vision transformer |
spellingShingle | Yanli Li; Congyi Wang; Huan Wang; Toward accurate hand mesh estimation via masked image modeling; Frontiers in Physics; 3D hand mesh estimation; multi-granularity representation; self-supervised learning; masked image modeling; vision transformer |
title | Toward accurate hand mesh estimation via masked image modeling |
title_full | Toward accurate hand mesh estimation via masked image modeling |
title_fullStr | Toward accurate hand mesh estimation via masked image modeling |
title_full_unstemmed | Toward accurate hand mesh estimation via masked image modeling |
title_short | Toward accurate hand mesh estimation via masked image modeling |
title_sort | toward accurate hand mesh estimation via masked image modeling |
topic | 3D hand mesh estimation; multi-granularity representation; self-supervised learning; masked image modeling; vision transformer |
url | https://www.frontiersin.org/articles/10.3389/fphy.2024.1515842/full |
work_keys_str_mv | AT yanlili towardaccuratehandmeshestimationviamaskedimagemodeling AT congyiwang towardaccuratehandmeshestimationviamaskedimagemodeling AT huanwang towardaccuratehandmeshestimationviamaskedimagemodeling |