ChatGPT and Google Gemini are Clinically Inadequate in Providing Recommendations on Management of Developmental Dysplasia of the Hip Compared to American Academy of Orthopaedic Surgeons Clinical Practice Guidelines

Background: Large language models, including Chat Generative Pre-trained Transformer (ChatGPT) and Google Gemini, have accelerated public access to information, but their accuracy in answering medical questions remains unknown. In pediatric orthopaedics, no study has utilized board-certified expert opin...

Bibliographic Details
Main Authors:	Patrick P. Nian, BA; Amith Umesh, BA; Ruth H. Jones, BS; Akshitha Adhiyaman, BS; Christopher J. Williams, BS; Christine M. Goodbody, MD; Jessica H. Heyer, MD; Shevaun M. Doyle, MD
Format:	Article
Language:	English
Published:	Elsevier 2025-02-01
Series:	Journal of the Pediatric Orthopaedic Society of North America
Subjects:	Developmental dysplasia of the hip; ChatGPT; Google Gemini; Clinical practice guideline; American Academy of Orthopaedic Surgeons
Online Access:	http://www.sciencedirect.com/science/article/pii/S2768276524009611

author	Patrick P. Nian, BA; Amith Umesh, BA; Ruth H. Jones, BS; Akshitha Adhiyaman, BS; Christopher J. Williams, BS; Christine M. Goodbody, MD; Jessica H. Heyer, MD; Shevaun M. Doyle, MD
collection	DOAJ
description	Background: Large language models, including Chat Generative Pre-trained Transformer (ChatGPT) and Google Gemini, have accelerated public access to information, but their accuracy in answering medical questions remains unknown. In pediatric orthopaedics, no study has utilized board-certified expert opinion to evaluate the accuracy of artificial intelligence (AI) chatbots compared to evidence-based recommendations, including the American Academy of Orthopaedic Surgeons clinical practice guidelines (AAOS CPGs). The aims of this study were to compare responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini with AAOS CPG recommendations on developmental dysplasia of the hip (DDH) regarding accuracy, supplementary and incomplete response patterns, and readability. Methods: ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were prompted with questions created from 9 evidence-based recommendations from the 2022 AAOS CPG on DDH. The answers to these questions were obtained on July 1, 2024. Responses were anonymized and independently evaluated by two pediatric orthopaedic attending surgeons. Supplementary responses were additionally evaluated on whether no, some, or many modifications were necessary. Readability metrics (response length, Flesch-Kincaid reading level, Flesch Reading Ease, Gunning Fog Index) were compared. Cohen's Kappa inter-rater reliability (κ) was calculated. Chi-square analyses and single-factor analysis of variance were utilized to compare categorical and continuous variables, respectively. Statistical significance was set at P < 0.05. Results: ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were accurate in 5/9, 6/9, and 6/9; supplementary in 8/9, 7/9, and 9/9; and incomplete in 7/9, 6/9, and 7/9 recommendations, respectively. Of 24 supplementary responses, 5 (20.8%), 16 (66.7%), and 3 (12.5%) required no, some, and many modifications, respectively. There were no significant differences among the three chatbots in accuracy (P = 0.853), supplementary responses (P = 0.325), necessary modifications (P = 0.661), or incomplete responses (P = 0.825). κ was highest for accuracy, at 0.17. Google Gemini was significantly more readable in Flesch-Kincaid reading level, Flesch Reading Ease, and Gunning Fog Index (all P < 0.05). Conclusions: In the setting of DDH, AI chatbots demonstrated limited accuracy, high supplementary and incomplete response patterns, and complex readability. Pediatric orthopaedic surgeons can counsel patients and their families to set appropriate expectations on the utility of these novel tools. Key Concepts: (1) Responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were inadequately accurate, frequently provided supplementary information that required modifications, and frequently lacked essential details from the AAOS CPGs on DDH. (2) Accurate, supplementary, and incomplete response patterns were not significantly different among the three chatbots. (3) Google Gemini provided responses that had the highest readability among the three chatbots. (4) Pediatric orthopaedic surgeons can play a role in counseling patients and their families on the limited utility of AI chatbots for patient education purposes. Level of Evidence: IV
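The P values reported in the description for the categorical comparisons can be cross-checked from the counts given there. The following is a minimal sketch in Python, assuming each comparison was a Pearson chi-square test of independence, without continuity correction, on a 3×2 table (chatbot × yes/no) with 9 recommendations per chatbot. The abstract names chi-square analyses but does not state these details, and the function name `chi_square_3x2` is illustrative only. With 2 degrees of freedom the chi-square survival function reduces to exp(-x/2), so no statistics library is needed.

```python
import math


def chi_square_3x2(yes_counts, n_per_group=9):
    """Pearson chi-square test on a 3x2 table (chatbot x yes/no), df = 2.

    Assumption: no continuity correction; every group has n_per_group items.
    """
    # Build the observed table: one (yes, no) row per chatbot.
    table = [(yes, n_per_group - yes) for yes in yes_counts]
    total = n_per_group * len(table)
    col_totals = [sum(row[j] for row in table) for j in range(2)]

    statistic = 0.0
    for row in table:
        for j, observed in enumerate(row):
            # Expected count = row total * column total / grand total.
            expected = n_per_group * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    # For df = 2 the chi-square survival function is exactly exp(-x / 2).
    p_value = math.exp(-statistic / 2)
    return statistic, p_value


# Counts from the abstract, ordered ChatGPT-4.0, ChatGPT-3.5, Google Gemini.
reported_counts = {
    "accuracy": [5, 6, 6],
    "supplementary": [8, 7, 9],
    "incomplete": [7, 6, 7],
}

for outcome, counts in reported_counts.items():
    statistic, p_value = chi_square_3x2(counts)
    print(f"{outcome}: chi2 = {statistic:.3f}, P = {p_value:.3f}")
```

Under these assumptions the three P values round to 0.853, 0.325, and 0.825, the figures reported for accuracy, supplementary responses, and incomplete responses.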
format	Article
id	doaj-art-878c8082f61c4f6692fd9f48d3f24481
institution	DOAJ
issn	2768-2765
language	English
publishDate	2025-02-01
publisher	Elsevier
record_format	Article
series	Journal of the Pediatric Orthopaedic Society of North America
spelling	doaj-art-878c8082f61c4f6692fd9f48d3f24481; 2025-08-20T02:55:06Z; eng; Elsevier; Journal of the Pediatric Orthopaedic Society of North America; 2768-2765; 2025-02-01; 10; 100135; 10.1016/j.jposna.2024.100135; ChatGPT and Google Gemini are Clinically Inadequate in Providing Recommendations on Management of Developmental Dysplasia of the Hip Compared to American Academy of Orthopaedic Surgeons Clinical Practice Guidelines; Patrick P. Nian, BA (Hospital for Special Surgery, New York City, NY, USA); Amith Umesh, BA (Hospital for Special Surgery, New York City, NY, USA); Ruth H. Jones, BS (Hospital for Special Surgery, New York City, NY, USA); Akshitha Adhiyaman, BS (Hospital for Special Surgery, New York City, NY, USA); Christopher J. Williams, BS (Hospital for Special Surgery, New York City, NY, USA); Christine M. Goodbody, MD (Children's Hospital of Philadelphia, Philadelphia, PA, USA); Jessica H. Heyer, MD (Hospital for Special Surgery, New York City, NY, USA); Shevaun M. Doyle, MD (Hospital for Special Surgery, New York City, NY, USA; Corresponding author: 535 East 70th Street; New York, NY 10021, USA.)
title	ChatGPT and Google Gemini are Clinically Inadequate in Providing Recommendations on Management of Developmental Dysplasia of the Hip Compared to American Academy of Orthopaedic Surgeons Clinical Practice Guidelines
topic	Developmental dysplasia of the hip; ChatGPT; Google Gemini; Clinical practice guideline; American Academy of Orthopaedic Surgeons
url	http://www.sciencedirect.com/science/article/pii/S2768276524009611

Similar Items