Answer Distillation Network With Bi-Text-Image Attention for Medical Visual Question Answering
Medical Visual Question Answering (Med-VQA) is a multimodal task that aims to obtain the correct answers based on medical images and questions. Med-VQA, as a classification task, is typically more challenging for algorithms to predict answers to open-ended questions than to closed-ended questions due to the larger number of answer categories for the former. Consequently, the accuracy of predictions for open-ended questions is generally lower than that for closed-ended questions. In this study, we design an answer distillation network with bi-text-image attention (BTIA-AD Net) to solve the above problem. We present an answer distillation network to refine the answers and convert an open-ended question into a multiple-choice question with a selection of candidate answers. To fully utilize the candidate answer information from the answer distillation network, we propose a bi-text-image attention fusion module composed of self-attention and guided attention to automatically fuse image features, question representations, and candidate answer information and achieve intra-modal and inter-modal semantic interaction. Extensive experiments validate the effectiveness of BTIA-AD Net. Results prove that our model can efficiently compress the answer space of open-ended tasks, improve the answer accuracy, and provide new state-of-the-art performance on the VQA-RAD dataset.
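The abstract describes the answer distillation step as compressing the open-ended answer space into a small set of candidates, effectively turning an open-ended question into a multiple-choice one. Below is a minimal sketch of that general idea, assuming a first-pass classifier over the full answer vocabulary and illustrative sizes (512-d fused features, a 458-answer vocabulary, k = 5); it is not the paper's exact procedure.

```python
# Hypothetical sketch of answer distillation: score every answer in the
# vocabulary, then keep only the top-k as candidates. All sizes assumed.
import torch

def distill_answers(fused_features: torch.Tensor,
                    classifier: torch.nn.Module,
                    k: int = 5) -> torch.Tensor:
    """Return indices of the k most plausible answers per example."""
    logits = classifier(fused_features)    # (batch, vocab_size)
    return logits.topk(k, dim=-1).indices  # (batch, k) candidate answer ids

# Example with assumed sizes: 458 answers in the vocabulary, 512-d features.
classifier = torch.nn.Linear(512, 458)
features = torch.randn(2, 512)
candidates = distill_answers(features, classifier, k=5)
print(candidates.shape)  # torch.Size([2, 5])
```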
Main Authors: | Hongfang Gong, Li Li |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
Subjects: | Medical visual question answering; multimodal fusion; VQA-RAD; multi-head attention |
Online Access: | https://ieeexplore.ieee.org/document/10848065/ |
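The bi-text-image attention fusion module described in the abstract pairs self-attention (intra-modal interaction) with guided attention (inter-modal interaction). The sketch below shows one such block built from standard multi-head attention; the block name, dimensions, and residual wiring are assumptions for illustration, not the authors' exact architecture.

```python
# Minimal sketch of a self-attention + guided-attention block, assuming a
# transformer-style layout. Not the paper's exact BTIA-AD Net module.
import torch
import torch.nn as nn

class GuidedAttentionBlock(nn.Module):
    """Self-attention on x, then guided attention where x queries g."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.guided_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        # Intra-modal interaction: the sequence attends to itself.
        h, _ = self.self_attn(x, x, x)
        x = self.norm1(x + h)
        # Inter-modal interaction: x is guided by (attends over) g.
        h, _ = self.guided_attn(x, g, g)
        return self.norm2(x + h)

# Example: question tokens and candidate-answer features, each guided by
# image region features (all sizes are illustrative).
q = torch.randn(2, 12, 512)   # question token features
a = torch.randn(2, 5, 512)    # candidate-answer features from distillation
v = torch.randn(2, 49, 512)   # image region features
block = GuidedAttentionBlock()
print(block(q, v).shape, block(a, v).shape)  # (2, 12, 512) (2, 5, 512)
```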
author | Hongfang Gong; Li Li |
collection | DOAJ |
format | Article |
id | doaj-art-b8a6e3e513b843499ed696cac6cbf4b4 |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
doi | 10.1109/ACCESS.2025.3532308 |
citation | IEEE Access, vol. 13, pp. 16455-16465, 2025 (IEEE document 10848065) |
orcid | Hongfang Gong: https://orcid.org/0000-0003-2618-9174; Li Li: https://orcid.org/0009-0004-7101-049X |
affiliation | School of Mathematics and Statistics, Changsha University of Science and Technology, Changsha, China (both authors) |
title | Answer Distillation Network With Bi-Text-Image Attention for Medical Visual Question Answering |
topic | Medical visual question answering; multimodal fusion; VQA-RAD; multi-head attention |
url | https://ieeexplore.ieee.org/document/10848065/ |