A structured evaluation of LLM-generated step-by-step instructions in cadaveric brachial plexus dissection
Abstract

Background: Large language models (LLMs), such as ChatGPT-4o, Grok 3.0, Gemini Advanced 2.0 Pro, and DeepSeek, have been tested in many medical domains in recent years, ranging from clinical decision support systems to patient information processes and even some intraoperative scenarios. However, despite this widespread use, how LLMs perform as step-by-step guides in environments requiring sensory-motor interaction, such as direct cadaver dissection, has not yet been systematically evaluated. This gap is particularly pronounced in anatomically complex areas with low error tolerance, such as brachial plexus dissection. This study aimed to comparatively analyze the performance of four LLMs in terms of the scientific quality, educational value, and readability of their responses to structured questions in a cadaver dissection setting.

Methods: A structured set of 28 questions on brachial plexus dissection was created. Four experienced anatomists blindly evaluated the models' responses using the modified DISCERN scale (mDISCERN) and the Global Quality Score (GQS). Readability was assessed using the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), Gunning Fog Index (GFI), and Coleman-Liau Index (CLI). Content validity was tested via the Content Validity Index (CVI), and inter-rater reliability was calculated using the Intraclass Correlation Coefficient (ICC) and Cohen's Kappa.

Results: ChatGPT-4o and Grok 3.0 received the highest scores for scientific accuracy and guidance structure (p < 0.01). DeepSeek showed high readability but limited content depth, while Gemini performed moderately across all parameters. Readability metrics were significantly correlated with quality scores.

Conclusion: This is one of the first studies to systematically examine how LLM-based systems perform in a training context with sensory challenges such as cadaver dissection. While LLMs cannot replace the ethical and educational value provided by real human donors, they may offer scalable, individualized support in settings with limited mentorship or cadaver availability. The study not only aims to support anatomy education in resource-limited environments, but also serves as a foundational reference for future AI-assisted cadaveric studies and intraoperative decision-support models in surgical anatomy.
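The readability indices named in the Methods are standard closed-form formulas over word, sentence, and syllable counts. As a hedged illustration only (not the authors' implementation), the two Flesch measures can be sketched as follows; the syllable counter is a naive vowel-group heuristic, not a dictionary-based one:

```python
import re

def count_syllables(word: str) -> int:
    """Naive English syllable estimate: count vowel groups, minimum 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text: str) -> dict:
    """Compute FRES and FKGL with their standard published constants."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # average words per sentence
    spw = syllables / len(words)        # average syllables per word
    return {
        # Flesch Reading Ease Score: higher means easier to read
        "FRES": 206.835 - 1.015 * wps - 84.6 * spw,
        # Flesch-Kincaid Grade Level: approximate US school grade
        "FKGL": 0.39 * wps + 11.8 * spw - 15.59,
    }
```

SMOG, GFI, and CLI follow the same pattern with different token statistics (polysyllable counts, character counts); production analyses typically use an established library rather than a hand-rolled tokenizer.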
| Main Authors: | Fulya Temizsoy Korkmaz, Fatma Ok, Burak Karip, Papatya Keleş |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMC, 2025-07-01 |
| Series: | BMC Medical Education |
| Subjects: | Large language models (LLMs); Cadaveric dissection; Brachial plexus; Artificial intelligence (AI) in anatomy; Dissection guidance; Readability analysis |
| Online Access: | https://doi.org/10.1186/s12909-025-07493-0 |
| _version_ | 1849768621884047360 |
|---|---|
| author | Fulya Temizsoy Korkmaz; Fatma Ok; Burak Karip; Papatya Keleş |
| author_facet | Fulya Temizsoy Korkmaz; Fatma Ok; Burak Karip; Papatya Keleş |
| author_sort | Fulya Temizsoy Korkmaz |
| collection | DOAJ |
| description | Abstract Background Large language models (LLMs), such as ChatGPT-4o, Grok 3.0, Gemini Advanced 2.0 Pro, and DeepSeek, have been tested in many medical domains in recent years, ranging from clinical decision support systems to patient information processes and even some intraoperative scenarios. However, despite this widespread use, how LLMs perform as step-by-step guides in environments requiring sensory-motor interaction, such as direct cadaver dissection, has not yet been systematically evaluated. This gap is particularly pronounced in anatomically complex areas with low error tolerance, such as brachial plexus dissection. This study aimed to comparatively analyze the performance of four large language models in terms of the scientific quality, educational value, and readability of their responses to structured questions in a cadaver dissection setting. Methods A structured set of 28 questions on brachial plexus dissection was created. Four experienced anatomists blindly evaluated the models' responses using the modified DISCERN scale (mDISCERN) and the Global Quality Score (GQS). Readability was assessed using the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), Gunning Fog Index (GFI), and Coleman-Liau Index (CLI). Content validity was tested via the Content Validity Index (CVI), and inter-rater reliability was calculated using the Intraclass Correlation Coefficient (ICC) and Cohen's Kappa. Results ChatGPT-4o and Grok 3.0 received the highest scores for scientific accuracy and guidance structure (p < 0.01). DeepSeek showed high readability but limited content depth, while Gemini performed moderately across all parameters. Readability metrics were significantly correlated with quality scores. Conclusion This is one of the first studies to systematically examine how LLM-based systems perform in a training context with sensory challenges such as cadaver dissection. While LLMs cannot replace the ethical and educational value provided by real human donors, they may offer scalable, individualized support in settings with limited mentorship or cadaver availability. Our study not only aims to support anatomy education in resource-limited environments, but also serves as a foundational reference for future AI-assisted cadaveric studies and intraoperative decision-support models in surgical anatomy. |
| format | Article |
| id | doaj-art-e019fed13c224817822cc4539360fe71 |
| institution | DOAJ |
| issn | 1472-6920 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | BMC |
| record_format | Article |
| series | BMC Medical Education |
| spelling | doaj-art-e019fed13c224817822cc4539360fe71 (2025-08-20T03:03:44Z); eng; BMC; BMC Medical Education; 1472-6920; 2025-07-01; 25; 1; 1–12; 10.1186/s12909-025-07493-0; A structured evaluation of LLM-generated step-by-step instructions in cadaveric brachial plexus dissection; Fulya Temizsoy Korkmaz, Fatma Ok, Burak Karip, Papatya Keleş (all: Department of Anatomy, Hamidiye Faculty of Medicine, University of Health Sciences); abstract and subject headings as in the description and topic fields; https://doi.org/10.1186/s12909-025-07493-0 |
| title | A structured evaluation of LLM-generated step-by-step instructions in cadaveric brachial plexus dissection |
| title_sort | structured evaluation of llm generated step by step instructions in cadaveric brachial plexus dissection |
| topic | Large language models (LLMs); Cadaveric dissection; Brachial plexus; Artificial intelligence (AI) in anatomy; Dissection guidance; Readability analysis |
| url | https://doi.org/10.1186/s12909-025-07493-0 |
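The record's inter-rater reliability statistics include Cohen's Kappa, which has the closed form kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement between two raters and p_e the agreement expected by chance from the marginal label frequencies. A minimal sketch for two raters, illustrative only and not the study's analysis code:

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters' categorical labels of equal length."""
    assert len(r1) == len(r2) and len(r1) > 0
    n = len(r1)
    # observed agreement: fraction of items both raters labeled identically
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    # chance agreement: sum over categories of the product of marginals
    p_e = sum((c1[c] / n) * (c2[c] / n) for c in set(c1) | set(c2))
    return (p_o - p_e) / (1 - p_e)
```

With more than two raters, studies typically report the ICC instead (as here) or a multi-rater generalization such as Fleiss' kappa; the pairwise form above is the simplest member of that family.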