Combining Region-Guided Attention and Attribute Prediction for Thangka Image Captioning


Bibliographic Details
Main Authors: Fujun Zhang, Wendong Kang, Wenjin Hu
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10833628/
Description
Summary: To enhance the understanding of the core regions in Thangka images and improve the richness of generated content during decoding, we propose a Thangka image captioning method based on Region-Guided Feature Enhancement and Attribute Prediction (RGFEAP). The region-guided image feature enhancement encoder introduces a region-guided module with a distance-weighted strategy to strengthen the feature representation of the central sacred elements in Thangka images. Additionally, we design a Thangka feature enhancement encoder to further refine the regional feature vectors, which are then fused with global features extracted by CLIP through multi-scale convolutional fusion, injecting richer object-related information into the Thangka image features. Furthermore, to enhance the detailed representation capability for generating long-sequence captions of Thangka images, we design an attribute predictor. This predictor leverages feature maps from four different convolutional blocks within the region-guided module to incorporate more detailed information into the model. Experimental results on the Thangka dataset demonstrate that RGFEAP achieves significant improvements over the baseline model ClipCap, with BLEU-1, BLEU-4, CIDEr, and METEOR scores increasing by 14.0%, 17.7%, 185.7%, and 11.5%, respectively. On the COCO dataset in the natural domain, RGFEAP achieves performance comparable to other state-of-the-art models, demonstrating its strong adaptability.
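The summary's two key encoder ideas can be sketched in code: a distance-weighted strategy that emphasizes features near the image center (where sacred elements typically sit in Thangka composition), and multi-scale convolutional fusion of regional features with a CLIP global vector. This is a minimal illustrative sketch, not the authors' implementation; the module names, the inverse-distance weighting formula, the kernel sizes (1, 3, 5), and the additive fusion are all assumptions for demonstration.

```python
import torch
import torch.nn as nn


class DistanceWeightedRegionGuide(nn.Module):
    """Sketch: weight spatial features by proximity to the image center,
    emphasizing centrally placed elements (hypothetical formulation)."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # (B, C, H, W)
        _, _, h, w = feats.shape
        ys = torch.linspace(-1.0, 1.0, h, device=feats.device)
        xs = torch.linspace(-1.0, 1.0, w, device=feats.device)
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        dist = torch.sqrt(yy**2 + xx**2)      # distance to image center
        weight = 1.0 / (1.0 + dist)           # larger weight near the center
        return self.proj(feats) * weight      # broadcast over (H, W)


class MultiScaleFusion(nn.Module):
    """Sketch: fuse regional features with a CLIP global vector via
    parallel convolutions of different kernel sizes (assumed design)."""

    def __init__(self, channels: int, clip_dim: int):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in (1, 3, 5)
        )
        self.clip_proj = nn.Linear(clip_dim, channels)

    def forward(self, region_feats: torch.Tensor,
                clip_global: torch.Tensor) -> torch.Tensor:
        # region_feats: (B, C, H, W); clip_global: (B, clip_dim)
        multi = sum(branch(region_feats) for branch in self.branches)
        g = self.clip_proj(clip_global)[:, :, None, None]  # (B, C, 1, 1)
        return multi + g  # inject global context into every spatial location
```

The weighting keeps feature-map shape unchanged, so the module can be dropped between any two convolutional stages; the fusion step assumes the CLIP vector is broadcast-added after a channel projection, which is only one of several plausible fusion choices.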
ISSN:2169-3536