Identifying representative sequences of protein families using submodular optimization