A structured evaluation of LLM-generated step-by-step instructions in cadaveric brachial plexus dissection

Bibliographic Details
Main Authors: Fulya Temizsoy Korkmaz, Fatma Ok, Burak Karip, Papatya Keleş
Format: Article
Language:English
Published: BMC 2025-07-01
Series:BMC Medical Education
Online Access:https://doi.org/10.1186/s12909-025-07493-0
author Fulya Temizsoy Korkmaz
Fatma Ok
Burak Karip
Papatya Keleş
collection DOAJ
description Abstract
Background: Large language models (LLMs) such as ChatGPT-4o, Grok 3.0, Gemini Advanced 2.0 Pro, and DeepSeek have been tested in many medical domains in recent years, ranging from clinical decision support and patient information to some intraoperative scenarios. Despite this widespread use, however, how LLMs perform as step-by-step guides in environments requiring sensory-motor interaction, such as direct cadaver dissection, has not yet been systematically evaluated. This gap is particularly pronounced in anatomically complex areas with low error tolerance, such as brachial plexus dissection. This study aimed to comparatively analyze the performance of four LLMs in terms of the scientific quality, educational value, and readability of their responses to structured questions in a cadaver dissection environment.
Methods: A structured set of 28 questions on brachial plexus dissection was created. Four experienced anatomists blindly evaluated the models' responses using the modified DISCERN scale (mDISCERN) and the Global Quality Score (GQS). Readability was assessed using the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), Gunning Fog Index (GFI), and Coleman-Liau Index (CLI). Content validity was tested via the Content Validity Index (CVI), and inter-rater reliability was calculated using the intraclass correlation coefficient (ICC) and Cohen's kappa.
Results: ChatGPT-4o and Grok 3.0 received the highest scores for scientific accuracy and guidance structure (p < 0.01). DeepSeek showed high readability but limited content depth, while Gemini performed moderately across all parameters. Readability metrics were significantly correlated with quality scores.
Conclusion: This is one of the first studies to systematically examine how LLM-based systems perform in a training context with sensory challenges such as cadaver dissection. While LLMs cannot replace the ethical and educational value provided by real human donors, they may offer scalable, individualized support in settings with limited mentorship or cadaver availability. The study not only supports anatomy education in resource-limited environments but also serves as a foundational reference for future AI-assisted cadaveric studies and intraoperative decision-support models in surgical anatomy.
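The readability metrics named in the abstract (FRES, FKGL, and the others) are closed-form functions of sentence, word, and syllable counts. As a minimal illustrative sketch, the two Flesch formulas can be computed from raw text with a simple vowel-group syllable heuristic; this is not the study's own tooling, and real readability software uses more careful syllable and sentence detection.

```python
import re

def count_syllables(word: str) -> int:
    """Heuristic syllable count: number of consecutive-vowel groups, minimum 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text: str) -> dict:
    """Compute the two Flesch metrics from their standard published constants."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # average words per sentence
    spw = syllables / len(words)        # average syllables per word
    return {
        # Flesch Reading Ease Score: higher means easier to read
        "FRES": 206.835 - 1.015 * wps - 84.6 * spw,
        # Flesch-Kincaid Grade Level: approximate US school grade
        "FKGL": 0.39 * wps + 11.8 * spw - 15.59,
    }

scores = readability("The cat sat on the mat. The dog ran fast.")
```

For this all-monosyllable sample (10 words, 2 sentences), FRES evaluates to 117.16 and FKGL to -1.84, illustrating why very short, simple sentences score as extremely easy.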
format Article
id doaj-art-e019fed13c224817822cc4539360fe71
institution DOAJ
issn 1472-6920
language English
publishDate 2025-07-01
publisher BMC
record_format Article
series BMC Medical Education
spelling doaj-art-e019fed13c224817822cc4539360fe712025-08-20T03:03:44ZengBMCBMC Medical Education1472-69202025-07-0125111210.1186/s12909-025-07493-0
A structured evaluation of LLM-generated step-by-step instructions in cadaveric brachial plexus dissection
Fulya Temizsoy Korkmaz, Fatma Ok, Burak Karip, Papatya Keleş (all: Department of Anatomy, Hamidiye Faculty of Medicine, University of Health Sciences)
https://doi.org/10.1186/s12909-025-07493-0
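The record's abstract reports inter-rater reliability among the four anatomists via Cohen's kappa. As an illustrative sketch only (the study's actual rating data are not reproduced here, and the ratings below are hypothetical), chance-corrected agreement between two raters can be computed directly from its definition:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed proportion of items on which the raters agree
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters labeled independently
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum((ca[label] / n) * (cb[label] / n) for label in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical binary quality ratings from two raters over eight items
kappa = cohens_kappa([1, 1, 0, 1, 0, 1, 1, 0], [1, 1, 0, 0, 0, 1, 1, 1])
```

For these hypothetical ratings the raters agree on 6 of 8 items (p_o = 0.75) against an expected chance agreement of 0.53125, giving kappa = 7/15, roughly 0.47 (moderate agreement).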
title A structured evaluation of LLM-generated step-by-step instructions in cadaveric brachial plexus dissection
topic Large language models (LLMs)
Cadaveric dissection
Brachial plexus
Artificial intelligence (AI) in anatomy
Dissection guidance
Readability analysis
url https://doi.org/10.1186/s12909-025-07493-0