Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders