STAGER checklist: Standardized testing and assessment guidelines for evaluating generative artificial intelligence reliability

Bibliographic Details
Main Authors: Jinghong Chen, Lingxuan Zhu, Weiming Mou, Anqi Lin, Dongqiang Zeng, Chang Qi, Zaoqu Liu, Aimin Jiang, Bufu Tang, Wenjie Shi, Ulf D. Kahlert, Jianguo Zhou, Shipeng Guo, Xiaofan Lu, Xu Sun, Trunghieu Ngo, Zhongji Pu, Baolei Jia, Che Ok Jeon, Yongbin He, Haiyang Wu, Shuqin Gu, Wisit Cheungpasitporn, Haojie Huang, Weipu Mao, Shixiang Wang, Xin Chen, Loïc Cabannes, Gerald Sng Gui Ren, Iain S. Whitaker, Stephen Ali, Quan Cheng, Kai Miao, Shuofeng Yuan, Peng Luo
Format: Article
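Language: English
Published: Wiley 2024-09-01
Series: iMetaOmics
Subjects: generative AI; medical and life science contexts; reliability; standardized assessment guidelines
Online Access: https://doi.org/10.1002/imo2.7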
_version_ 1832575941152342016
author Jinghong Chen
Lingxuan Zhu
Weiming Mou
Anqi Lin
Dongqiang Zeng
Chang Qi
Zaoqu Liu
Aimin Jiang
Bufu Tang
Wenjie Shi
Ulf D. Kahlert
Jianguo Zhou
Shipeng Guo
Xiaofan Lu
Xu Sun
Trunghieu Ngo
Zhongji Pu
Baolei Jia
Che Ok Jeon
Yongbin He
Haiyang Wu
Shuqin Gu
Wisit Cheungpasitporn
Haojie Huang
Weipu Mao
Shixiang Wang
Xin Chen
Loïc Cabannes
Gerald Sng Gui Ren
Iain S. Whitaker
Stephen Ali
Quan Cheng
Kai Miao
Shuofeng Yuan
Peng Luo
author_sort Jinghong Chen
collection DOAJ
description Abstract Generative artificial intelligence (AI) holds immense potential for medical applications, but the lack of a comprehensive evaluation framework and methodological deficiencies in existing studies hinder its effective implementation. Standardized assessment guidelines are crucial for ensuring reliable and consistent evaluation of generative AI in healthcare. Our objective is to develop robust, standardized guidelines tailored for evaluating generative AI performance in medical contexts. Through a rigorous literature review utilizing the Web of Science, Cochrane Library, PubMed, and Google Scholar, we focused on research testing generative AI capabilities in medicine. Our multidisciplinary team of experts conducted discussion sessions to develop a comprehensive 32‐item checklist. This checklist encompasses critical evaluation aspects of generative AI in medical applications, addressing key dimensions such as question collection, querying methodologies, and assessment techniques. The checklist and its broader assessment framework provide a holistic evaluation of AI systems, delineating a clear pathway from question gathering to result assessment. It guides researchers through potential challenges and pitfalls, enhancing research quality and reporting and aiding the evolution of generative AI in medicine and life sciences. Our framework furnishes a standardized, systematic approach for testing generative AI's applicability in medicine. For a concise checklist, please refer to Table S or visit GenAIMed.org.
format Article
id doaj-art-d662ca60c5b34c5db1a774cdbef12048
institution Kabale University
issn 2996-9506
2996-9514
language English
publishDate 2024-09-01
publisher Wiley
record_format Article
series iMetaOmics
spelling doaj-art-d662ca60c5b34c5db1a774cdbef12048 2025-01-31T16:15:20Z eng Wiley iMetaOmics 2996-9506 2996-9514 2024-09-01 1 1 n/a n/a 10.1002/imo2.7
STAGER checklist: Standardized testing and assessment guidelines for evaluating generative artificial intelligence reliability
Jinghong Chen (0): Department of Oncology, Zhujiang Hospital, Southern Medical University, Guangzhou, China
Lingxuan Zhu (1): Department of Oncology, Zhujiang Hospital, Southern Medical University, Guangzhou, China
Weiming Mou (2): Department of Oncology, Zhujiang Hospital, Southern Medical University, Guangzhou, China
Anqi Lin (3): Department of Oncology, Zhujiang Hospital, Southern Medical University, Guangzhou, China
Dongqiang Zeng (4): Department of Oncology, Nanfang Hospital, Southern Medical University, Guangzhou, China
Chang Qi (5): Institute of Logic and Computation, TU Wien, Wien, Austria
Zaoqu Liu (6): Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
Aimin Jiang (7): Department of Urology, Changhai Hospital, Naval Medical University (Second Military Medical University), Shanghai, China
Bufu Tang (8): Department of Radiation Oncology, Zhongshan Hospital, Fudan University, Shanghai, China
Wenjie Shi (9): Molecular and Experimental Surgery, University Clinic for General‐, Visceral‐, Vascular‐ and Trans‐Plantation Surgery, Medical Faculty University Hospital Magdeburg, Otto‐von Guericke University, Magdeburg, Germany
Ulf D. Kahlert (10): Molecular and Experimental Surgery, University Clinic for General‐, Visceral‐, Vascular‐ and Trans‐Plantation Surgery, Medical Faculty University Hospital Magdeburg, Otto‐von Guericke University, Magdeburg, Germany
Jianguo Zhou (11): Department of Oncology, The Second Affiliated Hospital of Zunyi Medical University, Zunyi, China
Shipeng Guo (12): GZDLab, Chongqing, China
Xiaofan Lu (13): Department of Cancer and Functional Genomics, Institute of Genetics and Molecular and Cellular Biology, CNRS/INSERM/UNISTRA, Illkirch, France
Xu Sun (14): Linguistique Informatique, UFR‐Linguistique, Université Paris Cité, Paris, France
Trunghieu Ngo (15): Linguistique Informatique, UFR‐Linguistique, Université Paris Cité, Paris, France
Zhongji Pu (16): Xianghu Laboratory, Hangzhou, China
Baolei Jia (17): Xianghu Laboratory, Hangzhou, China
Che Ok Jeon (18): Department of Life Science, Chung‐Ang University, Seoul, Korea
Yongbin He (19): School of Sport Medicine and Rehabilitation, Beijing Sport University, Beijing, China
Haiyang Wu (20): Department of Graduate School, Tianjin Medical University, Tianjin, China
Shuqin Gu (21): Duke Human Vaccine Institute, Duke University Medical Center, Durham, North Carolina, USA
Wisit Cheungpasitporn (22): Department of Medicine, Mayo Clinic, Rochester, New York, USA
Haojie Huang (23): Department of Biochemistry and Molecular Biology, Mayo Clinic College of Medicine and Science, Rochester, New York, USA
Weipu Mao (24): Department of Urology, Zhongda Hospital, Southeast University, Nanjing, China
Shixiang Wang (25): Bioinformatics Platform, Department of Experimental Research, State Key Laboratory of Oncology in South China, Guangdong Key Laboratory of Nasopharyngeal Carcinoma Diagnosis and Therapy, Guangdong Provincial Clinical Research Center for Cancer, Sun Yat‐sen University Cancer Center, Guangzhou, China
Xin Chen (26): Department of Pulmonary and Critical Care Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China
Loïc Cabannes (27): Linguistique Informatique, UFR‐Linguistique, Université Paris Cité, Paris, France
Gerald Sng Gui Ren (28): Department of Endocrinology, Singapore General Hospital, Singapore, Singapore
Iain S. Whitaker (29): Reconstructive Surgery and Regenerative Medicine Research Centre, Institute of Life Sciences, Swansea University Medical School, Swansea, UK
Stephen Ali (30): Reconstructive Surgery and Regenerative Medicine Research Centre, Institute of Life Sciences, Swansea University Medical School, Swansea, UK
Quan Cheng (31): Department of Neurosurgery, Xiangya Hospital, Central South University, Changsha, China
Kai Miao (32): Cancer Centre and Institute of Translational Medicine, Faculty of Health Sciences, University of Macau, Macau, China
Shuofeng Yuan (33): Department of Infectious Disease and Microbiology, The University of Hong Kong‐Shenzhen Hospital, Shenzhen, China
Peng Luo (34): Department of Oncology, Zhujiang Hospital, Southern Medical University, Guangzhou, China
https://doi.org/10.1002/imo2.7
generative AI
medical and life science contexts
reliability
standardized assessment guidelines
title STAGER checklist: Standardized testing and assessment guidelines for evaluating generative artificial intelligence reliability
topic generative AI
medical and life science contexts
reliability
standardized assessment guidelines
url https://doi.org/10.1002/imo2.7