Combining Region-Guided Attention and Attribute Prediction for Thangka Image Captioning Method


Bibliographic Details
Main Authors: Fujun Zhang, Wendong Kang, Wenjin Hu
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects: Image captioning; Thangka images; region-guided; attribute prediction
Online Access: https://ieeexplore.ieee.org/document/10833628/
collection DOAJ
description To enhance the understanding of the core regions in Thangka images and improve the richness of generated content during decoding, we propose a Thangka image captioning method based on Region-Guided Feature Enhancement and Attribute Prediction (RGFEAP). The region-guided image feature enhancement encoder introduces a region-guided module with a distance-weighted strategy to enhance the feature representation of the central sacred elements in Thangka images. Additionally, we designed a Thangka feature enhancement encoder to further refine the regional feature vectors, which are then fused with global features extracted by CLIP through multi-scale convolutional fusion, injecting richer object-related information into the Thangka image features. Furthermore, to enhance the detailed representation capability for generating long-sequence captions of Thangka images, we designed an attribute predictor. This predictor leverages feature maps from four different convolutional blocks within the region-guided module to incorporate more detailed information into the model. Experimental results on the Thangka dataset demonstrate that RGFEAP achieves significant improvements compared to the baseline model ClipCap, with BLEU-1, BLEU-4, CIDEr, and METEOR scores increasing by 14.0%, 17.7%, 185.7%, and 11.5%, respectively. On the COCO dataset in the natural domain, RGFEAP achieves performance comparable to other state-of-the-art models, showcasing its strong adaptability.
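The distance-weighted strategy described above can be sketched as follows. This is a hypothetical reconstruction for illustration only: the function names, the Gaussian weight form, and the `sigma` value are assumptions, not taken from the paper. It assumes weights peak at the grid centre, where the central sacred elements of a Thangka typically sit, and decay towards the edges.

```python
import math

def distance_weights(h, w, sigma=0.5):
    """Hypothetical centre-peaked weights for an h x w feature grid.

    Distances are normalised by half the grid diagonal, so the weight
    is 1.0 at the centre cell and decays towards the corners.
    """
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    half_diag = math.hypot(cy, cx) or 1.0
    return [[math.exp(-((math.hypot(y - cy, x - cx) / half_diag) ** 2)
                      / (2.0 * sigma ** 2))
             for x in range(w)] for y in range(h)]

def enhance(features):
    """Scale each spatial feature vector by its centre-distance weight."""
    h, w = len(features), len(features[0])
    wts = distance_weights(h, w)
    return [[[c * wts[y][x] for c in features[y][x]]
             for x in range(w)] for y in range(h)]
```

On a 3x3 grid, for example, the centre cell keeps its features unchanged while corner cells are attenuated most strongly.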
id doaj-art-49a76f2ced8f4210a965f9b34367e04e
institution Kabale University
issn 2169-3536
doi 10.1109/ACCESS.2025.3526954
ieee_document 10833628
volume 13
pages 13440-13453
record_updated 2025-01-28T00:01:07Z
affiliation Key Laboratory of Linguistic and Cultural Computing of Ministry of Education, Chinese National Information Technology Research Institute, Northwest Minzu University, Lanzhou, Gansu, China (all three authors)
orcid Fujun Zhang: 0009-0004-9593-9930; Wenjin Hu: 0000-0002-3120-5231
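The multi-scale convolutional fusion mentioned in the description could look like the following minimal sketch. The 1-D averaging kernels, the scale set (1, 3, 5), and plain concatenation with the CLIP global feature are assumptions for exposition, not the paper's actual architecture:

```python
def conv1d(seq, kernel):
    """Valid-mode 1-D convolution over a feature vector (no padding)."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def multi_scale_fuse(regional, global_feat, scales=(1, 3, 5)):
    """Convolve the regional feature vector at several scales, then
    concatenate the responses with the CLIP global feature."""
    fused = []
    for k in scales:
        kernel = [1.0 / k] * k  # simple averaging kernel at each scale
        fused.extend(conv1d(regional, kernel))
    return fused + list(global_feat)
```

Each scale contributes a smoothed view of the regional features, so the fused vector carries both fine detail (scale 1) and broader context (scale 5) alongside the global CLIP embedding.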
topic Image captioning
Thangka images
region-guided
attribute prediction