STFormer: Spatio‐temporal former for hand–object interaction recognition from egocentric RGB video