Text this: Multi-level representation learning via ConvNeXt-based network for unaligned cross-view matching