QUBVIS: query-based multi-modal summarization system using CLIP-based transformer and vision-language models