Can ChatGPT-4 Think Like an Orthopaedic Surgeon? Testing Clinical Judgement and Diagnostic Ability in Bony and Soft Tissue Pathologies of the Foot and Ankle



Bibliographic Details
Main Authors: Hayden Hartman BS, Maritza Essis MD, MS, Wei Shao Tung BS, Sean Peden MD, Irvin Oh MD, Arianna Gianakos DO
Format: Article
Language:English
Published: SAGE Publishing 2025-03-01
Series:Foot & Ankle Orthopaedics
Online Access:https://doi.org/10.1177/2473011425S00048
Description
Summary: Submission Type: Achilles Tendon Rupture. Research Type: Level 4 – Case series.

Introduction/Purpose: ChatGPT-4, a chatbot capable of carrying on human-like conversation, has attracted attention after demonstrating the aptitude to pass professional licensure examinations and to perform at the level of a postgraduate year three orthopaedic surgery resident on Orthopaedic In-Service Training Examination question bank sets. The purpose of this study was to explore the diagnostic and decision-making capacities of ChatGPT-4 in clinical management, specifically assessing accuracy in the identification and treatment of soft tissue foot and ankle pathologies.

Methods: This study presented 16 soft tissue-related foot and ankle cases to ChatGPT-4, with each case response assessed by 3 fellowship-trained foot and ankle orthopaedic surgeons. Responses were graded on 5 criteria, each rated on a Likert scale, yielding a total score ranging from 5 (lowest) to 25 (highest possible). The criteria were: (1) stating the correct diagnosis; (2) stating the most appropriate procedure; (3) identifying alternative treatments; (4) providing comprehensive information beyond treatment; and (5) not mentioning nonexistent therapies. ChatGPT-4 was addressed as “Dr. GPT,” using role prompting to encourage step-by-step processing and to establish a peer dynamic so that the chatbot emulated the role of an orthopaedic surgeon.

Results: The average score across all criteria for all 16 cases was 4.47, with an average sum score of 22.4. The plantar fasciitis case received the highest average sum score, 24.7; the lowest was observed in the peroneal tendon tear case, with an average sum score of 16.3. Subgroup analyses of each of the 5 criteria using Friedman rank sum tests showed no statistically significant differences in surgeon grading. Criterion 5 (no mention of nonexistent treatment options) and criterion 1 (ChatGPT-4’s ability to correctly diagnose) received the highest subgroup scores, 4.88 and 4.77, respectively. The lowest score was observed for criterion 4 (4.05), which evaluated whether ChatGPT-4 provided comprehensive information beyond treatment options.

Conclusion: This study demonstrates that ChatGPT-4 effectively diagnosed and provided reliable treatment options for most of the soft tissue foot and ankle cases presented, with consistency among surgeon evaluators. Individual criterion assessment revealed that ChatGPT-4 was most effective at diagnosing and suggesting appropriate treatment, but limitations were seen in the chatbot’s ability to provide comprehensive information and alternative treatment options. Additionally, the chatbot did not suggest fabricated treatment options, a common concern in prior literature. This resource could be useful for clinicians seeking reliable patient education materials without fear of inconsistencies, though comprehensive information beyond treatment may be limited.
ISSN:2473-0114