Toward accurate hand mesh estimation via masked image modeling

Introduction: With an enormous number of hand images generated over time, leveraging unlabeled images for pose estimation is an emerging yet challenging topic. While some semi-supervised and self-supervised methods have emerged, they are constrained by their reliance on high-quality keypoint detection...

Full description

Saved in:
Bibliographic Details
Main Authors: Yanli Li, Congyi Wang, Huan Wang
Format: Article
Language:English
Published: Frontiers Media S.A. 2025-01-01
Series:Frontiers in Physics
Subjects:
Online Access:https://www.frontiersin.org/articles/10.3389/fphy.2024.1515842/full
collection DOAJ
description Introduction: With an enormous number of hand images generated over time, leveraging unlabeled images for pose estimation is an emerging yet challenging topic. While some semi-supervised and self-supervised methods have emerged, they are constrained by their reliance on high-quality keypoint detection models or complicated network architectures. Methods: We propose a novel self-supervised pretraining strategy for 3D hand mesh regression. Our approach integrates a multi-granularity strategy with pseudo-keypoint alignment in a teacher–student framework, employing self-distillation and masked image modeling for comprehensive representation learning. We pair this with a robust pose estimation baseline, combining a standard vision transformer backbone with a pyramidal mesh alignment feedback head. Results: Extensive experiments demonstrate HandMIM's competitive performance across diverse datasets, notably achieving an 8.00 mm Procrustes-alignment vertex-point error on the challenging HO3Dv2 test set, which features severe hand occlusions, surpassing many specially optimized architectures.
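The metric reported in the abstract, Procrustes-alignment vertex-point error (PA-VPE), is conventionally computed as the mean per-vertex Euclidean distance after a similarity Procrustes (Kabsch–Umeyama) alignment of the predicted mesh to the ground truth. A minimal sketch, assuming meshes are given as (N, 3) arrays in millimetres; the function names are illustrative, not taken from the paper:

```python
import numpy as np

def procrustes_align(pred, gt):
    """Similarity-align pred (N, 3) to gt (N, 3): optimal scale, rotation, translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g
    # Kabsch: SVD of the cross-covariance gives the optimal rotation
    U, S, Vt = np.linalg.svd(P.T @ G)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])                    # reflection correction
    R = Vt.T @ D @ U.T
    s = (S * np.diag(D)).sum() / (P ** 2).sum()   # Umeyama scale factor
    return s * P @ R.T + mu_g

def pa_vpe(pred, gt):
    """Mean per-vertex Euclidean error after Procrustes alignment (PA-VPE)."""
    return float(np.linalg.norm(procrustes_align(pred, gt) - gt, axis=1).mean())
```

This is the standard formulation used on hand-mesh benchmarks such as HO3Dv2; the paper's exact evaluation code may differ in details (e.g., vertex subsets or unit conventions).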
id doaj-art-bc71b096a18d4e849e34ceb7057a654c
institution Kabale University
issn 2296-424X
affiliations Yanli Li: Fuzhou Medical College of Nanchang University, Fuzhou, China; Congyi Wang: Financial Technology Research Institute of the Industrial Bank, Fuzhou, China; Huan Wang: Industrial Technology Research Center, Guangdong Institute of Scientific and Technical Information, Guangzhou, China
doi 10.3389/fphy.2024.1515842
topic 3D hand mesh estimation
multi-granularity representation
self-supervised learning
masked image modeling
vision transformer