Text this: Hierarchical multi‐modal video summarization with dynamic sampling