Efficient knowledge distillation and alignment for improved KB-VQA

Bibliographic Details
Main Authors: Xiaofei Qin, Ruiqi Pei, Changxiang He, Fan Li, Xuedian Zhang
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-025-07539-9
Description
Summary: Abstract Knowledge-based visual question answering (KB-VQA) often requires utilizing external knowledge to answer natural language questions about image content. Recent research has emphasized the importance of knowledge in answering questions by implicitly leveraging Large Language Models (LLMs). However, these methods suffer from the following issues: (1) They primarily focus on aligning image-text descriptions while neglecting alignment between image features and knowledge; relying solely on knowledge retrieved from databases or LLMs may introduce irrelevant information, whereas knowledge relevant to the visual content improves answer accuracy. (2) They often require long inference times and significant computational resources, with some even relying heavily on access to the GPT-3 API. We therefore propose an efficient approach, EKDA (Efficient Knowledge Distillation and Alignment), which, unlike other LLM-based methods, does not require extensive computational resources or complex pipelines. Knowledge is extracted through knowledge distillation, with the LLaMA model serving as the teacher. Additionally, we employ a Graph Neural Network (GNN) to align visual information with knowledge, capturing image-related knowledge and enhancing the model’s understanding of semantics. Furthermore, our approach achieves state-of-the-art accuracy on the OK-VQA dataset, surpassing baseline methods by 6.63%.
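
The record does not include the paper's code, but the two components named in the abstract (distilling answer knowledge from a large teacher model, and aligning image-region features with knowledge through a GNN) can be illustrated with a minimal sketch. Everything below, including the module name KnowledgeVisualGNN, the tensor dimensions, the loss weighting, and the temperature, is an assumption made for illustration and is not the authors' actual EKDA implementation.

# Hypothetical sketch of the two ideas described in the abstract:
# (1) soft-label knowledge distillation from a large teacher LM (e.g. LLaMA)
#     into a compact student, and
# (2) a single message-passing step over a bipartite graph that lets each
#     image-region node aggregate only the knowledge entries relevant to it.
import torch
import torch.nn as nn
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL distillation: the student mimics the teacher's softened answer distribution."""
    t = temperature
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)


class KnowledgeVisualGNN(nn.Module):
    """One attention-weighted message-passing layer from knowledge nodes to
    visual (region) nodes; a stand-in for the paper's alignment GNN."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # project visual nodes (queries)
        self.k = nn.Linear(dim, dim)   # project knowledge nodes (keys)
        self.v = nn.Linear(dim, dim)   # project knowledge nodes (messages)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, visual_nodes, knowledge_nodes):
        # visual_nodes:    (B, R, D) image-region features
        # knowledge_nodes: (B, K, D) embedded knowledge statements
        attn = torch.einsum("brd,bkd->brk",
                            self.q(visual_nodes), self.k(knowledge_nodes))
        attn = attn / visual_nodes.size(-1) ** 0.5
        weights = attn.softmax(dim=-1)                       # soft edge weights
        messages = torch.einsum("brk,bkd->brd",
                                weights, self.v(knowledge_nodes))
        B, R, D = visual_nodes.shape
        updated = self.update(messages.reshape(B * R, D),
                              visual_nodes.reshape(B * R, D))
        return updated.view(B, R, D)                         # knowledge-aware regions


if __name__ == "__main__":
    # Toy usage: random tensors stand in for detector features, distilled knowledge,
    # and teacher/student answer logits.
    B, R, K, D, num_answers = 2, 36, 8, 256, 100
    visual = torch.randn(B, R, D)
    knowledge = torch.randn(B, K, D)

    aligner = KnowledgeVisualGNN(D)
    fused = aligner(visual, knowledge)                       # (B, R, D)

    student_logits = torch.randn(B, num_answers, requires_grad=True)
    teacher_logits = torch.randn(B, num_answers)             # e.g. from the frozen teacher
    loss = distillation_loss(student_logits, teacher_logits)
    loss.backward()
    print(fused.shape, float(loss))

In this sketch the teacher is only queried for logits (it stays frozen), which is one common way a distillation setup avoids the repeated large-model inference the abstract criticizes; the paper's exact training objective may differ.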
ISSN: 2045-2322