Weight-based multi-stream model for Multi-Modal Video Question Answering