A structured evaluation of LLM-generated step-by-step instructions in cadaveric brachial plexus dissection
Abstract

Background: Large language models (LLMs), such as ChatGPT-4o, Grok 3.0, Gemini Advanced 2.0 Pro, and DeepSeek, have been tested in many medical domains in recent years, ranging from clinical decision support systems to patient information processes and even some intraoperative scenarios. However, despite this widespread use, how LLMs perform as step-by-step guides in environments requiring sensory-motor interaction, such as direct cadaver dissection, has not yet been systematically evaluated. This gap is particularly pronounced in anatomically complex areas with low error tolerance, such as brachial plexus dissection. This study aimed to comparatively analyze the performance of four LLMs in terms of the scientific quality, educational value, and readability of their responses to structured questions in a cadaver dissection setting.

Methods: A structured set of 28 questions on brachial plexus dissection was created. Four experienced anatomists blindly evaluated the models' responses using the modified DISCERN scale (mDISCERN) and the Global Quality Score (GQS). Readability was assessed using the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), Gunning Fog Index (GFI), and Coleman-Liau Index (CLI). Content validity was tested via the Content Validity Index (CVI), and inter-rater reliability was calculated using the Intraclass Correlation Coefficient (ICC) and Cohen's Kappa.

Results: ChatGPT-4o and Grok 3.0 received the highest scores for scientific accuracy and guidance structure (p < 0.01). DeepSeek showed high readability but limited content depth, while Gemini performed moderately across all parameters. Readability metrics were significantly correlated with quality scores.

Conclusion: This is one of the first studies to systematically examine how LLM-based systems perform in a training context with sensory challenges such as cadaver dissection. While LLMs cannot replace the ethical and educational value provided by real human donors, they may offer scalable, individualized support in settings with limited mentorship or cadaver availability. The study not only aims to support anatomy education in resource-limited environments, but also serves as a foundational reference for future AI-assisted cadaveric studies and intraoperative decision-support models in surgical anatomy.
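The readability indices named in the Methods are standard closed-form formulas over word, sentence, and syllable counts. As a hedged illustration only (not the authors' implementation), the two Flesch measures can be sketched as follows; the syllable counter is a naive vowel-group heuristic, not a dictionary-based one:

```python
import re

def count_syllables(word: str) -> int:
    """Naive English syllable estimate: count vowel groups, minimum 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text: str) -> dict:
    """Compute FRES and FKGL with their standard published constants."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # average words per sentence
    spw = syllables / len(words)        # average syllables per word
    return {
        # Flesch Reading Ease Score: higher means easier to read
        "FRES": 206.835 - 1.015 * wps - 84.6 * spw,
        # Flesch-Kincaid Grade Level: approximate US school grade
        "FKGL": 0.39 * wps + 11.8 * spw - 15.59,
    }
```

SMOG, GFI, and CLI follow the same pattern with different token statistics (polysyllable counts, character counts); production analyses typically use an established library rather than a hand-rolled tokenizer.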
| Main Authors: | Fulya Temizsoy Korkmaz, Fatma Ok, Burak Karip, Papatya Keleş |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMC, 2025-07-01 |
| Series: | BMC Medical Education |
| Subjects: | Large language models (LLMs); Cadaveric dissection; Brachial plexus; Artificial intelligence (AI) in anatomy; Dissection guidance; Readability analysis |
| Online Access: | https://doi.org/10.1186/s12909-025-07493-0 |
| _version_ | 1849768621884047360 |
|---|---|
| author | Fulya Temizsoy Korkmaz; Fatma Ok; Burak Karip; Papatya Keleş |
| author_facet | Fulya Temizsoy Korkmaz; Fatma Ok; Burak Karip; Papatya Keleş |
| author_sort | Fulya Temizsoy Korkmaz |
| collection | DOAJ |
| description | Abstract Background Large language models (LLMs), such as ChatGPT-4o, Grok 3.0, Gemini Advanced 2.0 Pro, and DeepSeek, have been tested in many medical domains in recent years, ranging from clinical decision support systems to patient information processes and even some intraoperative scenarios. However, despite this widespread use, how LLMs perform as step-by-step guides in environments requiring sensory-motor interaction, such as direct cadaver dissection, has not yet been systematically evaluated. This gap is particularly pronounced in anatomically complex areas with low error tolerance, such as brachial plexus dissection. This study aimed to comparatively analyze the performance of four large language models in terms of the scientific quality, educational value, and readability of their responses to structured questions in a cadaver dissection setting. Methods A structured set of 28 questions on brachial plexus dissection was created. Four experienced anatomists blindly evaluated the models' responses using the modified DISCERN scale (mDISCERN) and the Global Quality Score (GQS). Readability was assessed using the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), Gunning Fog Index (GFI), and Coleman-Liau Index (CLI). Content validity was tested via the Content Validity Index (CVI), and inter-rater reliability was calculated using the Intraclass Correlation Coefficient (ICC) and Cohen's Kappa. Results ChatGPT-4o and Grok 3.0 received the highest scores for scientific accuracy and guidance structure (p < 0.01). DeepSeek showed high readability but limited content depth, while Gemini performed moderately across all parameters. Readability metrics were significantly correlated with quality scores. Conclusion This is one of the first studies to systematically examine how LLM-based systems perform in a training context with sensory challenges such as cadaver dissection. While LLMs cannot replace the ethical and educational value provided by real human donors, they may offer scalable, individualized support in settings with limited mentorship or cadaver availability. Our study not only aims to support anatomy education in resource-limited environments, but also serves as a foundational reference for future AI-assisted cadaveric studies and intraoperative decision-support models in surgical anatomy. |
| format | Article |
| id | doaj-art-e019fed13c224817822cc4539360fe71 |
| institution | DOAJ |
| issn | 1472-6920 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | BMC |
| record_format | Article |
| series | BMC Medical Education |
| spelling | doaj-art-e019fed13c224817822cc4539360fe71 (2025-08-20T03:03:44Z); eng; BMC; BMC Medical Education; 1472-6920; 2025-07-01; 25; 1; 1–12; 10.1186/s12909-025-07493-0; A structured evaluation of LLM-generated step-by-step instructions in cadaveric brachial plexus dissection; Fulya Temizsoy Korkmaz, Fatma Ok, Burak Karip, Papatya Keleş (all: Department of Anatomy, Hamidiye Faculty of Medicine, University of Health Sciences); abstract and subject headings as in the description and topic fields; https://doi.org/10.1186/s12909-025-07493-0 |
| title | A structured evaluation of LLM-generated step-by-step instructions in cadaveric brachial plexus dissection |
| title_sort | structured evaluation of llm generated step by step instructions in cadaveric brachial plexus dissection |
| topic | Large language models (LLMs); Cadaveric dissection; Brachial plexus; Artificial intelligence (AI) in anatomy; Dissection guidance; Readability analysis |
| url | https://doi.org/10.1186/s12909-025-07493-0 |
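The record's inter-rater reliability statistics include Cohen's Kappa, which has the closed form kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement between two raters and p_e the agreement expected by chance from the marginal label frequencies. A minimal sketch for two raters, illustrative only and not the study's analysis code:

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters' categorical labels of equal length."""
    assert len(r1) == len(r2) and len(r1) > 0
    n = len(r1)
    # observed agreement: fraction of items both raters labeled identically
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    # chance agreement: sum over categories of the product of marginals
    p_e = sum((c1[c] / n) * (c2[c] / n) for c in set(c1) | set(c2))
    return (p_o - p_e) / (1 - p_e)
```

With more than two raters, studies typically report the ICC instead (as here) or a multi-rater generalization such as Fleiss' kappa; the pairwise form above is the simplest member of that family.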