Combining Region-Guided Attention and Attribute Prediction for Thangka Image Captioning Method


Bibliographic Details
Main Authors: Fujun Zhang, Wendong Kang, Wenjin Hu
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects: Image captioning; Thangka images; region-guided; attribute prediction
Online Access: https://ieeexplore.ieee.org/document/10833628/
collection DOAJ
description To enhance the understanding of the core regions in Thangka images and improve the richness of generated content during decoding, we propose a Thangka image captioning method based on Region-Guided Feature Enhancement and Attribute Prediction (RGFEAP). The region-guided image feature enhancement encoder introduces a region-guided module with a distance-weighted strategy to enhance the feature representation of the central sacred elements in Thangka images. Additionally, we designed a Thangka feature enhancement encoder to further refine the regional feature vectors, which are then fused with global features extracted by CLIP through multi-scale convolutional fusion, injecting richer object-related information into the Thangka image features. Furthermore, to enhance the detailed representation capability for generating long-sequence captions of Thangka images, we designed an attribute predictor. This predictor leverages feature maps from four different convolutional blocks within the region-guided module to incorporate more detailed information into the model. Experimental results on the Thangka dataset demonstrate that RGFEAP achieves significant improvements compared to the baseline model ClipCap, with BLEU-1, BLEU-4, CIDEr, and METEOR scores increasing by 14.0%, 17.7%, 185.7%, and 11.5%, respectively. On the COCO dataset in the natural domain, RGFEAP achieves performance comparable to other state-of-the-art models, showcasing its strong adaptability.
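The distance-weighted strategy described above can be sketched as follows. This is a hypothetical reconstruction for illustration only: the function names, the Gaussian weight form, and the `sigma` value are assumptions, not taken from the paper. It assumes weights peak at the grid centre, where the central sacred elements of a Thangka typically sit, and decay towards the edges.

```python
import math

def distance_weights(h, w, sigma=0.5):
    """Hypothetical centre-peaked weights for an h x w feature grid.

    Distances are normalised by half the grid diagonal, so the weight
    is 1.0 at the centre cell and decays towards the corners.
    """
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    half_diag = math.hypot(cy, cx) or 1.0
    return [[math.exp(-((math.hypot(y - cy, x - cx) / half_diag) ** 2)
                      / (2.0 * sigma ** 2))
             for x in range(w)] for y in range(h)]

def enhance(features):
    """Scale each spatial feature vector by its centre-distance weight."""
    h, w = len(features), len(features[0])
    wts = distance_weights(h, w)
    return [[[c * wts[y][x] for c in features[y][x]]
             for x in range(w)] for y in range(h)]
```

On a 3x3 grid, for example, the centre cell keeps its features unchanged while corner cells are attenuated most strongly.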
id doaj-art-49a76f2ced8f4210a965f9b34367e04e
institution Kabale University
issn 2169-3536
doi 10.1109/ACCESS.2025.3526954
ieee_document 10833628
volume 13
pages 13440-13453
record_updated 2025-01-28T00:01:07Z
affiliation Key Laboratory of Linguistic and Cultural Computing of Ministry of Education, Chinese National Information Technology Research Institute, Northwest Minzu University, Lanzhou, Gansu, China (all three authors)
orcid Fujun Zhang: 0009-0004-9593-9930; Wenjin Hu: 0000-0002-3120-5231
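The multi-scale convolutional fusion mentioned in the description could look like the following minimal sketch. The 1-D averaging kernels, the scale set (1, 3, 5), and plain concatenation with the CLIP global feature are assumptions for exposition, not the paper's actual architecture:

```python
def conv1d(seq, kernel):
    """Valid-mode 1-D convolution over a feature vector (no padding)."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def multi_scale_fuse(regional, global_feat, scales=(1, 3, 5)):
    """Convolve the regional feature vector at several scales, then
    concatenate the responses with the CLIP global feature."""
    fused = []
    for k in scales:
        kernel = [1.0 / k] * k  # simple averaging kernel at each scale
        fused.extend(conv1d(regional, kernel))
    return fused + list(global_feat)
```

Each scale contributes a smoothed view of the regional features, so the fused vector carries both fine detail (scale 1) and broader context (scale 5) alongside the global CLIP embedding.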
topic Image captioning
Thangka images
region-guided
attribute prediction