STAGER checklist: Standardized testing and assessment guidelines for evaluating generative artificial intelligence reliability


Bibliographic Details
Main Authors: Jinghong Chen, Lingxuan Zhu, Weiming Mou, Anqi Lin, Dongqiang Zeng, Chang Qi, Zaoqu Liu, Aimin Jiang, Bufu Tang, Wenjie Shi, Ulf D. Kahlert, Jianguo Zhou, Shipeng Guo, Xiaofan Lu, Xu Sun, Trunghieu Ngo, Zhongji Pu, Baolei Jia, Che Ok Jeon, Yongbin He, Haiyang Wu, Shuqin Gu, Wisit Cheungpasitporn, Haojie Huang, Weipu Mao, Shixiang Wang, Xin Chen, Loïc Cabannes, Gerald Sng Gui Ren, Iain S. Whitaker, Stephen Ali, Quan Cheng, Kai Miao, Shuofeng Yuan, Peng Luo
Format: Article
Language: English
Published: Wiley, 2024-09-01
Series: iMetaOmics
Online Access: https://doi.org/10.1002/imo2.7
Description
Summary: Generative artificial intelligence (AI) holds immense potential for medical applications, but the lack of a comprehensive evaluation framework and methodological deficiencies in existing studies hinder its effective implementation. Standardized assessment guidelines are crucial for ensuring reliable and consistent evaluation of generative AI in healthcare. Our objective is to develop robust, standardized guidelines tailored to evaluating generative AI performance in medical contexts. Through a rigorous literature review using Web of Science, the Cochrane Library, PubMed, and Google Scholar, we focused on research testing generative AI capabilities in medicine. Our multidisciplinary team of experts held discussion sessions to develop a comprehensive 32‐item checklist. This checklist encompasses the critical evaluation aspects of generative AI in medical applications, addressing key dimensions such as question collection, querying methodologies, and assessment techniques. The checklist and its broader assessment framework provide a holistic evaluation of AI systems, delineating a clear pathway from question gathering to result assessment. It guides researchers through potential challenges and pitfalls, enhancing research quality and reporting and aiding the evolution of generative AI in medicine and the life sciences. Our framework furnishes a standardized, systematic approach for testing generative AI's applicability in medicine. For a concise checklist, please refer to Table S or visit GenAIMed.org.
ISSN: 2996-9506
2996-9514