Text-guided RGB-P grasp generation
| Main Authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | PeerJ Inc., 2025-08-01 |
| Series: | PeerJ Computer Science |
| Subjects: | |
| Online Access: | https://peerj.com/articles/cs-3060.pdf |
| Summary: | In the field of robotics, object grasping is a complex and challenging task. Although state-of-the-art computer vision-based models have made significant progress in predicting grasps, the lack of semantic information from textual data makes them susceptible to ambiguities in object recognition. For example, when asked to grasp a specific object on a table with many objects, robots relying only on visual data can easily become confused and grasp the wrong object. To address this limitation, we propose a multimodal approach that seamlessly integrates 3D data (shape) and red-green-blue (RGB) images (color, texture) into a unified representation called red-green-blue and point cloud (RGB-P), while also incorporating semantic information from textual descriptions processed by a large language model (LLM) to enhance object disambiguation. This combination of data allows our model to accurately identify and grasp target objects based on natural language descriptions, overcoming the limitations of vision-only approaches. Our approach achieves superior performance, with an average precision (AP) of 53.2% on the GraspNet-1Billion dataset, significantly outperforming state-of-the-art methods. Additionally, we introduce an automated dataset creation pipeline that addresses the challenges of data collection and annotation. This pipeline leverages cutting-edge models: LLMs for text generation, Stable Diffusion for image synthesis, Depth Anything for depth estimation (using standard intrinsic parameters from the Kinect depth sensor to ensure geometric consistency), and GraspNet for grasp estimation. This automated process generates high-quality datasets with paired RGB-P data, textual descriptions, and potential grasp poses, significantly reducing manual effort and enabling large-scale data collection. |
| ISSN: | 2376-5992 |
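
The summary's dataset pipeline pairs RGB images with point clouds by back-projecting an estimated depth map (e.g., from Depth Anything) through standard Kinect-style pinhole intrinsics. The sketch below illustrates that one step only; it is not the authors' code, and the intrinsic values (`FX`, `FY`, `CX`, `CY`) and the helper `depth_to_rgbp` are illustrative assumptions, not parameters reported in the article.

```python
# Minimal sketch: back-project a metric depth map into a colored point cloud
# ("RGB-P") with an assumed pinhole camera model. Intrinsics below are typical
# Kinect-style values chosen for illustration, not the paper's configuration.
import numpy as np

# Assumed pinhole intrinsics (fx, fy, cx, cy) for a 640x480 Kinect-like sensor.
FX, FY = 525.0, 525.0
CX, CY = 319.5, 239.5

def depth_to_rgbp(depth_m: np.ndarray, rgb: np.ndarray) -> np.ndarray:
    """Convert a depth map (H, W, meters) and an RGB image (H, W, 3) into an
    (N, 6) array of [x, y, z, r, g, b] points, dropping invalid depths."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinate grid

    z = depth_m
    x = (u - CX) * z / FX  # back-project pixel columns to camera x
    y = (v - CY) * z / FY  # back-project pixel rows to camera y

    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3).astype(np.float32)

    valid = points[:, 2] > 0  # keep only pixels with positive depth
    return np.concatenate([points[valid], colors[valid]], axis=1)

# Usage with synthetic arrays standing in for Stable Diffusion / Depth Anything output:
depth = np.random.uniform(0.5, 2.0, size=(480, 640)).astype(np.float32)
image = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
rgbp = depth_to_rgbp(depth, image)
print(rgbp.shape)  # (N, 6) colored points ready for grasp estimation
```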