Text this: CrossModalSync: joint temporal-spatial fusion for semantic scene segmentation in large-scale scenes