A multimodal transformer-based visual question answering method integrating local and global information.