Semantics-aware human motion generation from audio instructions