Toward accurate hand mesh estimation via masked image modeling


Bibliographic Details
Main Authors: Yanli Li, Congyi Wang, Huan Wang
Format: Article
Language: English
Published: Frontiers Media S.A. 2025-01-01
Series: Frontiers in Physics
Subjects:
Online Access: https://www.frontiersin.org/articles/10.3389/fphy.2024.1515842/full
Description
Summary: Introduction: With an enormous number of hand images generated over time, leveraging unlabeled images for pose estimation is an emerging yet challenging topic. While some semi-supervised and self-supervised methods have emerged, they are constrained by their reliance on high-quality keypoint detection models or complicated network architectures. Methods: We propose a novel self-supervised pretraining strategy for 3D hand mesh regression. Our approach integrates a multi-granularity strategy with pseudo-keypoint alignment in a teacher–student framework, employing self-distillation and masked image modeling for comprehensive representation learning. We pair this with a robust pose-estimation baseline that combines a standard vision transformer backbone with a pyramidal mesh alignment feedback head. Results: Extensive experiments demonstrate HandMIM's competitive performance across diverse datasets; notably, it achieves an 8.00 mm Procrustes-alignment vertex point error on the challenging HO3Dv2 test set, which features severe hand occlusions, surpassing many specially optimized architectures.
ISSN: 2296-424X
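The abstract combines masked image modeling with teacher–student self-distillation. The general idea of that combination can be illustrated with a toy NumPy sketch; this is not the paper's implementation (which uses a vision transformer and a mesh-regression head), and all names here (`LinearEncoder`, `mim_distillation_step`, the patch and token sizes) are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_patch_mask(num_patches, mask_ratio, rng):
    """Choose a random subset of patch indices to mask out."""
    n_mask = int(num_patches * mask_ratio)
    idx = rng.permutation(num_patches)[:n_mask]
    mask = np.zeros(num_patches, dtype=bool)
    mask[idx] = True
    return mask

class LinearEncoder:
    """Toy stand-in for a ViT backbone: one linear map from patch pixels to tokens."""
    def __init__(self, in_dim, out_dim, rng):
        self.W = rng.normal(scale=0.02, size=(in_dim, out_dim))
    def __call__(self, patches):
        return patches @ self.W

def mim_distillation_step(student, teacher, patches, mask, mask_token, ema=0.99):
    # Teacher sees the full image; its tokens are the regression targets.
    targets = teacher(patches)
    # Student sees a masked input: masked patches replaced by a mask token.
    visible = patches.copy()
    visible[mask] = mask_token
    preds = student(visible)
    # Loss only on masked positions -- the masked-image-modeling objective.
    loss = float(np.mean((preds[mask] - targets[mask]) ** 2))
    # Teacher weights track the student via an exponential moving average
    # (the self-distillation part).
    teacher.W = ema * teacher.W + (1 - ema) * student.W
    return loss

# One illustrative step on random "patches" (14x14 grid, 48 values each).
patches = rng.normal(size=(196, 48))
mask = random_patch_mask(196, mask_ratio=0.6, rng=rng)
student = LinearEncoder(48, 48, rng)
teacher = LinearEncoder(48, 48, rng)
mask_token = np.zeros(48)
loss = mim_distillation_step(student, teacher, patches, mask, mask_token)
```

In the paper's setting the targets would come from the teacher transformer's features rather than a linear map, and the alignment terms (multi-granularity, pseudo-keypoints) would be added on top of this basic loss.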