Text this: Wi-Fi-Enabled Vision via Spatially-Variant Pose Estimation Based on Convolutional Transformer Network