Transformer-Based Person Detection in Paired RGB-T Aerial Images With VTSaR Dataset