A multimodal transformer-based visual question answering method integrating local and global information.
Addressing the limitations in current visual question answering (VQA) models face limitations in multimodal feature fusion capabilities and often lack adequate consideration of local information, this study proposes a multimodal Transformer VQA network based on local and global information integrati...
Saved in:
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: |
Public Library of Science (PLoS)
2025-01-01
|
| Series: | PLoS ONE |
| Online Access: | https://doi.org/10.1371/journal.pone.0324757 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|