Are medical school preclinical tests biased for sex and race? A differential item functioning analysis

Bibliographic Details
Main Authors: Esther Dasari Dale, Mohammed A. A. Abulela, Hao Jia, Claudio Violato
Format: Article
Language: English
Published: BMC 2025-01-01
Series: BMC Medical Education
Subjects:
Online Access: https://doi.org/10.1186/s12909-024-06540-6
Description
Summary:
Background: A common practice in assessment development, fundamental for fairness and consequently for the validity of test score interpretations and uses, is to ascertain whether test items function equally across test-taker groups. Accordingly, we conducted a differential item functioning (DIF) analysis, a psychometric procedure for detecting potential item bias, for three preclinical medical school foundational courses, based on students’ sex and race.
Methods: The sample included 520, 519, and 344 medical students for anatomy, histology, and physiology, respectively, collected from 2018 to 2020. To conduct the DIF analysis, we used the Wald test based on the two-parameter logistic model, as implemented in the IRTPRO software.
Results: Up to one-fifth of the items in each of the three assessments functioned differentially by sex and/or race: 10 of 49 items (20%) in Anatomy, 6 of 40 items (15%) in Histology, and 5 of 45 items (11%) in Physiology showed statistically significant DIF. Measurement specialists and subject matter experts independently reviewed the flagged items to identify construct-irrelevant factors as potential sources of DIF, as demonstrated in Appendix A. Most of the identified items were poorly written or contained unclear images.
Conclusions: The validity of score-based inferences, particularly for group comparisons, requires test items to function equally across test-taker groups. In the present study, we found DIF for sex and race in some items across the three content areas. The present approach should be applied in other medical schools to assess the generalizability of these findings, and item-level DIF analysis should be routinely conducted as part of the psychometric analyses for basic science courses and other assessments.
Clinical trial number: Not applicable.
ISSN:1472-6920
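
Note: The Methods in the summary above name the Wald test based on the two-parameter logistic (2PL) model, as implemented in IRTPRO. As a rough illustrative sketch only (not taken from the article), the 2PL model and the general form of a Wald-type DIF test can be written as follows; the exact parameterization, anchor-item handling, and covariance estimation used by IRTPRO's Wald procedures may differ.

Under the 2PL model, the probability that examinee i with latent ability \theta_i answers item j correctly is

    P(X_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp\left[-a_j(\theta_i - b_j)\right]},

where a_j is the item's discrimination and b_j its difficulty. Writing \beta_j = (a_j, b_j)^\top and letting R and F denote the reference and focal groups (e.g., defined by sex or race), item j shows DIF when \beta_j^{R} \neq \beta_j^{F} after the two groups' \theta scales have been linked through anchor items. A Wald-type test of H_0: \beta_j^{R} = \beta_j^{F} uses the statistic

    W_j = \left(\hat{\beta}_j^{R} - \hat{\beta}_j^{F}\right)^\top \hat{\Sigma}_j^{-1} \left(\hat{\beta}_j^{R} - \hat{\beta}_j^{F}\right),

where \hat{\Sigma}_j is the estimated covariance matrix of the parameter-difference vector; under H_0, W_j is approximately \chi^2-distributed with 2 degrees of freedom (one per tested parameter).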