One size does not fit all in evaluating model selection scores for image classification