Toward accurate hand mesh estimation via masked image modeling
Introduction: With an enormous number of hand images generated over time, leveraging unlabeled images for pose estimation is an emerging yet challenging topic. While some semi-supervised and self-supervised methods have emerged, they are constrained by their reliance on high-quality keypoint detection...
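The pretraining recipe summarized in the abstract — masked image modeling with a teacher–student self-distillation target — can be sketched in a few lines. Everything below is an illustrative assumption, not the paper's actual architecture: the "encoder" is a single linear projection standing in for the vision transformer, and the 60% mask ratio and 0.996 EMA momentum are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patch tokens."""
    H, W, C = img.shape
    gh, gw = H // patch, W // patch
    p = img[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    return p.transpose(0, 2, 1, 3, 4).reshape(gh * gw, -1)

def random_mask(n_tokens, ratio, rng):
    """Boolean mask: True = token hidden from the student."""
    idx = rng.permutation(n_tokens)[: int(n_tokens * ratio)]
    mask = np.zeros(n_tokens, dtype=bool)
    mask[idx] = True
    return mask

def encode(tokens, W):
    """Stand-in for a ViT encoder: a single linear projection."""
    return tokens @ W

def ema_update(teacher_W, student_W, momentum=0.996):
    """Teacher weights track the student by exponential moving average."""
    return momentum * teacher_W + (1 - momentum) * student_W

# Toy setup: one 64x64 RGB image -> 16 patch tokens of dimension 768.
img = rng.normal(size=(64, 64, 3))
tokens = patchify(img)
mask = random_mask(len(tokens), 0.6, rng)

d_in, d_out = tokens.shape[1], 32
student_W = rng.normal(size=(d_in, d_out)) * 0.01
teacher_W = student_W.copy()

# Teacher sees the full image; the student's masked tokens are zeroed out.
targets = encode(tokens, teacher_W)
student_in = tokens.copy()
student_in[mask] = 0.0
preds = encode(student_in, student_W)

# Self-distillation loss only on masked positions; in real training the
# student is updated by gradient descent and the teacher by EMA.
loss = np.mean((preds[mask] - targets[mask]) ** 2)
teacher_W = ema_update(teacher_W, student_W)
```

The key design point carried over from the abstract is that the regression target for masked tokens comes from the teacher's features on the unmasked image, so no keypoint annotations are required during pretraining.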
Main Authors: Yanli Li, Congyi Wang, Huan Wang
Format: Article
Language: English
Published: Frontiers Media S.A., 2025-01-01
Series: Frontiers in Physics
Subjects: 3D hand mesh estimation; multi-granularity representation; self-supervised learning; masked image modeling; vision transformer
Online Access: https://www.frontiersin.org/articles/10.3389/fphy.2024.1515842/full
_version_ | 1832582973635952640 |
---|---|
author | Yanli Li; Congyi Wang; Huan Wang |
author_facet | Yanli Li; Congyi Wang; Huan Wang |
author_sort | Yanli Li |
collection | DOAJ |
description | Introduction: With an enormous number of hand images generated over time, leveraging unlabeled images for pose estimation is an emerging yet challenging topic. While some semi-supervised and self-supervised methods have emerged, they are constrained by their reliance on high-quality keypoint detection models or complicated network architectures. Methods: We propose a novel self-supervised pretraining strategy for 3D hand mesh regression. Our approach integrates a multi-granularity strategy with pseudo-keypoint alignment in a teacher–student framework, employing self-distillation and masked image modeling for comprehensive representation learning. We pair this with a robust pose estimation baseline, combining a standard vision transformer backbone with a pyramidal mesh alignment feedback head. Results: Extensive experiments demonstrate the competitive performance of our method, HandMIM, across diverse datasets, notably achieving an 8.00 mm Procrustes alignment vertex point error on the challenging HO3Dv2 test set, which features severe hand occlusions, surpassing many specially optimized architectures. |
format | Article |
id | doaj-art-bc71b096a18d4e849e34ceb7057a654c |
institution | Kabale University |
issn | 2296-424X |
language | English |
publishDate | 2025-01-01 |
publisher | Frontiers Media S.A. |
record_format | Article |
series | Frontiers in Physics |
spelling | doaj-art-bc71b096a18d4e849e34ceb7057a654c; 2025-01-29T05:21:26Z; eng; Frontiers Media S.A.; Frontiers in Physics; 2296-424X; 2025-01-01; 12; 10.3389/fphy.2024.1515842; 1515842; Toward accurate hand mesh estimation via masked image modeling; Yanli Li (0); Congyi Wang (1); Huan Wang (2); (0) Fuzhou Medical College of Nanchang University, Fuzhou, China; (1) Financial Technology Research Institute of the Industrial Bank, Fuzhou, China; (2) Industrial Technology Research Center, Guangdong Institute of Scientific and Technical Information, Guangzhou, China; Introduction: With an enormous number of hand images generated over time, leveraging unlabeled images for pose estimation is an emerging yet challenging topic. While some semi-supervised and self-supervised methods have emerged, they are constrained by their reliance on high-quality keypoint detection models or complicated network architectures. Methods: We propose a novel self-supervised pretraining strategy for 3D hand mesh regression. Our approach integrates a multi-granularity strategy with pseudo-keypoint alignment in a teacher–student framework, employing self-distillation and masked image modeling for comprehensive representation learning. We pair this with a robust pose estimation baseline, combining a standard vision transformer backbone with a pyramidal mesh alignment feedback head. Results: Extensive experiments demonstrate the competitive performance of our method, HandMIM, across diverse datasets, notably achieving an 8.00 mm Procrustes alignment vertex point error on the challenging HO3Dv2 test set, which features severe hand occlusions, surpassing many specially optimized architectures. https://www.frontiersin.org/articles/10.3389/fphy.2024.1515842/full; 3D hand mesh estimation; multi-granularity representation; self-supervised learning; masked image modeling; vision transformer |
spellingShingle | Yanli Li; Congyi Wang; Huan Wang; Toward accurate hand mesh estimation via masked image modeling; Frontiers in Physics; 3D hand mesh estimation; multi-granularity representation; self-supervised learning; masked image modeling; vision transformer |
title | Toward accurate hand mesh estimation via masked image modeling |
title_full | Toward accurate hand mesh estimation via masked image modeling |
title_fullStr | Toward accurate hand mesh estimation via masked image modeling |
title_full_unstemmed | Toward accurate hand mesh estimation via masked image modeling |
title_short | Toward accurate hand mesh estimation via masked image modeling |
title_sort | toward accurate hand mesh estimation via masked image modeling |
topic | 3D hand mesh estimation; multi-granularity representation; self-supervised learning; masked image modeling; vision transformer |
url | https://www.frontiersin.org/articles/10.3389/fphy.2024.1515842/full |
work_keys_str_mv | AT yanlili towardaccuratehandmeshestimationviamaskedimagemodeling AT congyiwang towardaccuratehandmeshestimationviamaskedimagemodeling AT huanwang towardaccuratehandmeshestimationviamaskedimagemodeling |