Parameter-efficient weakly supervised referring video object segmentation via chain-of-thought reasoning