Efficient knowledge distillation and alignment for improved KB-VQA
| Main Authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-07-01 |
| Series: | Scientific Reports |
| Online Access: | https://doi.org/10.1038/s41598-025-07539-9 |
| Summary: | Abstract: Knowledge-based visual question answering (KB-VQA) often requires utilizing external knowledge to answer natural language questions about image content. Recent research has emphasized the importance of knowledge in answering questions by implicitly leveraging Large Language Models (LLMs). However, these methods suffer from the following issues: (1) They primarily focus on aligning image-text descriptions while neglecting alignment between image features and knowledge. Relying solely on knowledge retrieval from databases or LLMs may introduce irrelevant information, whereas knowledge relevant to the visual content improves answer accuracy. (2) These methods often require long inference times and significant computational resources, with some relying heavily on access to the GPT-3 API. We therefore propose an efficient approach, EKDA (Efficient Knowledge Distillation and Alignment), which, unlike other methods utilizing LLMs, does not require extensive computational resources or complex processes. It leverages knowledge distillation, with the LLaMA model as the teacher, to extract knowledge. Additionally, we employ a Graph Neural Network (GNN) to align visual information with knowledge, capturing image-related knowledge and enhancing the model's semantic understanding. Our approach achieves state-of-the-art accuracy on the OK-VQA dataset, surpassing baseline methods by 6.63%. |
| ISSN: | 2045-2322 |
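
The summary above mentions two generic ingredients: knowledge distillation from a LLaMA teacher model and GNN-based alignment of visual features with knowledge. The Python sketch below illustrates those ingredients in their textbook form only, a temperature-scaled distillation loss plus one round of message passing over a joint visual/knowledge graph. It is not the authors' EKDA code; all module names, dimensions, adjacency choices, and losses are assumptions made for illustration.

```python
# Illustrative sketch only: generic distillation + graph-based alignment,
# NOT the EKDA implementation from the paper. Names and shapes are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: student matches temperature-scaled teacher."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # KL divergence scaled by T^2, as in standard distillation setups.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)


class SimpleAlignmentGNN(nn.Module):
    """One round of mean-aggregation message passing over a joint graph of
    visual-region nodes and knowledge nodes, projecting both node types
    into a shared space so they can be aligned."""

    def __init__(self, dim=256):
        super().__init__()
        self.message = nn.Linear(dim, dim)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, node_feats, adj):
        # node_feats: (N, dim); adj: (N, N), 1.0 where two nodes are connected.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        neighbor_msg = (adj @ self.message(node_feats)) / deg  # mean over neighbors
        return torch.relu(self.update(torch.cat([node_feats, neighbor_msg], dim=-1)))


def alignment_loss(visual_nodes, knowledge_nodes):
    """Pull matched visual/knowledge node pairs together in cosine space."""
    sim = F.cosine_similarity(visual_nodes, knowledge_nodes, dim=-1)
    return (1.0 - sim).mean()


if __name__ == "__main__":
    # Toy example: 4 visual nodes and 4 knowledge nodes with 256-d features.
    dim, n = 256, 8
    feats = torch.randn(n, dim)
    adj = (torch.rand(n, n) > 0.5).float()
    gnn = SimpleAlignmentGNN(dim)
    fused = gnn(feats, adj)
    vis, know = fused[:4], fused[4:]
    loss = alignment_loss(vis, know) + distillation_loss(
        torch.randn(2, 10), torch.randn(2, 10)
    )
    print(float(loss))
```

In practice such a joint objective (distillation term plus alignment term) would be weighted and trained end-to-end; the weighting and the graph construction here are placeholders, not values taken from the paper.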