Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization