Performance Evaluation of Large Language Models in Cervical Cancer Management Based on a Standardized Questionnaire: Comparative Study

Bibliographic Details
Main Authors: Warisijiang Kuerbanjiang (https://orcid.org/0009-0008-2540-0613), Shengzhe Peng (https://orcid.org/0009-0002-7575-3608), Yiershatijiang Jiamaliding (https://orcid.org/0009-0004-9290-2091), Yuexiong Yi (https://orcid.org/0000-0002-5623-117X)
Format: Article
Language: English
Published: JMIR Publications 2025-02-01
Series: Journal of Medical Internet Research
ISSN: 1438-8871
DOI: 10.2196/63626
Collection: DOAJ
Institution: Kabale University
Record ID: doaj-art-31af61a9387c4eb38eb36d028e5cd4f3
Online Access: https://www.jmir.org/2025/1/e63626

Abstract

Background: Cervical cancer remains the fourth leading cause of cancer-related death among women globally, with a particularly severe burden in low-resource settings. A comprehensive approach, from screening through diagnosis and treatment, is essential for effective prevention and management. Large language models (LLMs) have emerged as potential tools to support health care, though their specific role in cervical cancer management remains underexplored.

Objective: This study aims to systematically evaluate the performance and interpretability of LLMs in cervical cancer management.

Methods: Models were selected from the AlpacaEval leaderboard (version 2.0), subject to the computational capacity of our hardware. The questions input to the models covered general knowledge, screening, diagnosis, and treatment, in accordance with clinical guidelines. The prompt was developed using the Context, Objective, Style, Tone, Audience, and Response (CO-STAR) framework. Responses were evaluated for accuracy, guideline compliance, clarity, and practicality, and graded A, B, C, or D, with corresponding scores of 3, 2, 1, and 0. The effective rate was calculated as the ratio of A and B responses to the total number of designed questions. Local Interpretable Model-Agnostic Explanations (LIME) was used to explain model outputs and strengthen physicians' trust in them within the medical context.

Results: Nine models were included in this study, and a set of 100 standardized questions covering general information, screening, diagnosis, and treatment was designed based on international and national guidelines. Seven models (ChatGPT-4.0 Turbo, Claude 2, Gemini Pro, Mistral-7B-v0.2, Starling-LM-7B alpha, HuatuoGPT, and BioMedLM 2.7B) provided stable responses. Among all the models included, ChatGPT-4.0 Turbo ranked first, with a mean score of 2.67 (95% CI 2.54-2.80; effective rate 94.00%) with a prompt and 2.52 (95% CI 2.37-2.67; effective rate 87.00%) without one, outperforming the other 8 models (P<.001). Regardless of prompting, QiZhenGPT consistently ranked among the lowest-performing models, with P<.01 in comparisons against all models except BioMedLM. Interpretability analysis showed that prompts improved alignment with human annotations for proprietary models (median intersection over union 0.43), whereas medical-specialized models showed limited improvement.

Conclusions: Proprietary LLMs, particularly ChatGPT-4.0 Turbo and Claude 2, show promise in clinical decision-making involving logical analysis, and prompting can improve the accuracy of some models in cervical cancer management to varying degrees. Medical-specialized models, such as HuatuoGPT and BioMedLM, did not perform as well as expected in this study, whereas proprietary models, particularly when augmented with prompts, demonstrated notable accuracy and interpretability in medical tasks such as cervical cancer management. Further research is needed to explore the practical application of LLMs in medical practice.
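For readers unfamiliar with CO-STAR, the sketch below illustrates how a prompt can be organized into Context, Objective, Style, Tone, Audience, and Response fields. The wording is hypothetical: the study's actual prompt is not reproduced in the abstract, so this only shows the shape of the framework.

```python
# Illustrative CO-STAR prompt template. The field wording below is
# invented for illustration and is not the study's actual prompt.
CO_STAR_PROMPT = """CONTEXT: You are assisting with questions on cervical cancer
management drawn from international and national clinical guidelines.
OBJECTIVE: Answer the question accurately and in line with current guidelines.
STYLE: Structured clinical writing with explicit recommendations.
TONE: Professional and evidence-based.
AUDIENCE: Gynecologists and oncology clinicians.
RESPONSE: Give a concise answer followed by the guideline rationale.

Question: {question}"""

print(CO_STAR_PROMPT.format(
    question="At what age should cervical cancer screening begin?"))
```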
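As a worked illustration of the grading arithmetic, here is a minimal Python sketch (not the authors' code) that maps letter grades to scores per the abstract (A=3, B=2, C=1, D=0) and computes the mean score, a normal-approximation 95% CI, and the effective rate. The grade distribution at the bottom is invented, not the study's data.

```python
import math
import statistics

# Letter grades map to numeric scores as described in the abstract.
GRADE_SCORES = {"A": 3, "B": 2, "C": 1, "D": 0}

def evaluate(grades: list[str]) -> dict:
    """Mean score, normal-approximation 95% CI, and effective rate
    (share of A or B responses) for one model's graded answers."""
    scores = [GRADE_SCORES[g] for g in grades]
    n = len(scores)
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(n)
    effective = sum(g in ("A", "B") for g in grades) / n
    return {
        "mean": round(mean, 2),
        "ci95": (round(mean - half_width, 2), round(mean + half_width, 2)),
        "effective_rate": f"{effective:.2%}",
    }

# Invented grade distribution over a 100-question set (not the study's data).
demo_grades = ["A"] * 72 + ["B"] * 22 + ["C"] * 4 + ["D"] * 2
print(evaluate(demo_grades))  # mean 2.64, effective rate 94.00%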
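The interpretability analysis compares LIME-highlighted evidence against human annotations using intersection over union (IoU). A minimal sketch of that metric, assuming both sides are reduced to token sets; the token sets below are invented for illustration:

```python
def token_iou(model_tokens: set[str], human_tokens: set[str]) -> float:
    """Intersection over union of two token sets; 1.0 means perfect overlap."""
    if not model_tokens and not human_tokens:
        return 1.0
    return len(model_tokens & human_tokens) / len(model_tokens | human_tokens)

# Hypothetical example: tokens LIME weighted highly vs. tokens a
# physician marked as the relevant evidence.
lime_tokens = {"hpv", "screening", "cytology", "colposcopy"}
physician_tokens = {"hpv", "screening", "colposcopy", "biopsy", "age"}
print(f"IoU = {token_iou(lime_tokens, physician_tokens):.2f}")  # 3/6 -> 0.50
```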