ChatGPT and Google Gemini are Clinically Inadequate in Providing Recommendations on Management of Developmental Dysplasia of the Hip Compared to American Academy of Orthopaedic Surgeons Clinical Practice Guidelines

Background: Large language models, including Chat Generative Pre-trained Transformer (ChatGPT) and Google Gemini, have accelerated public accessibility to information, but their accuracy in answering medical questions remains unknown. In pediatric orthopaedics, no study has utilized board-certified expert opin...

Bibliographic Details
Main Authors: Patrick P. Nian, BA, Amith Umesh, BA, Ruth H. Jones, BS, Akshitha Adhiyaman, BS, Christopher J. Williams, BS, Christine M. Goodbody, MD, Jessica H. Heyer, MD, Shevaun M. Doyle, MD
Format: Article
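Language: English
Published: Elsevier 2025-02-01
Series: Journal of the Pediatric Orthopaedic Society of North America
Subjects: Developmental dysplasia of the hip; ChatGPT; Google Gemini; Clinical practice guideline; American Academy of Orthopaedic Surgeons
Online Access: http://www.sciencedirect.com/science/article/pii/S2768276524009611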
_version_ 1850043952570302464
author Patrick P. Nian, BA
Amith Umesh, BA
Ruth H. Jones, BS
Akshitha Adhiyaman, BS
Christopher J. Williams, BS
Christine M. Goodbody, MD
Jessica H. Heyer, MD
Shevaun M. Doyle, MD
collection DOAJ
description Background: Large language models, including Chat Generative Pre-trained Transformer (ChatGPT) and Google Gemini, have accelerated public accessibility to information, but their accuracy in answering medical questions remains unknown. In pediatric orthopaedics, no study has utilized board-certified expert opinion to evaluate the accuracy of artificial intelligence (AI) chatbots compared to evidence-based recommendations, including the American Academy of Orthopaedic Surgeons clinical practice guidelines (AAOS CPGs). The aims of this study were to compare responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini with AAOS CPG recommendations on developmental dysplasia of the hip (DDH) regarding accuracy, supplementary and incomplete response patterns, and readability.
Methods: ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were prompted with questions created from 9 evidence-based recommendations from the 2022 AAOS CPG on DDH. The answers to these questions were obtained on July 1st, 2024. Responses were anonymized and independently evaluated by two pediatric orthopaedic attending surgeons. Supplementary responses were additionally evaluated on whether no, some, or many modifications were necessary. Readability metrics (response length, Flesch-Kincaid reading level, Flesch Reading Ease, Gunning Fog Index) were compared. Cohen's Kappa inter-rater reliability (κ) was calculated. Chi-square analyses and single-factor analysis of variance were utilized to compare categorical and continuous variables, respectively. Statistical significance was set at P < 0.05.
Results: ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were accurate in 5/9, 6/9, and 6/9; supplementary in 8/9, 7/9, and 9/9; and incomplete in 7/9, 6/9, and 7/9 recommendations, respectively. Of 24 supplementary responses, 5 (20.8%), 16 (66.7%), and 3 (12.5%) required no, some, and many modifications, respectively. There were no significant differences in accuracy (P = 0.853), supplementary responses (P = 0.325), necessary modifications (P = 0.661), or incomplete responses (P = 0.825). κ was highest for accuracy at 0.17. Google Gemini was significantly more readable in Flesch-Kincaid reading level, Flesch Reading Ease, and Gunning Fog Index (all P < 0.05).
Conclusions: In the setting of DDH, AI chatbots demonstrated limited accuracy, high supplementary and incomplete response patterns, and complex readability. Pediatric orthopaedic surgeons can counsel patients and their families to set appropriate expectations on the utility of these novel tools.
Key Concepts: (1) Responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were inadequately accurate, frequently provided supplementary information that required modifications, and frequently lacked essential details from the AAOS CPGs on DDH. (2) Accurate, supplementary, and incomplete response patterns were not significantly different among the three chatbots. (3) Google Gemini provided responses that had the highest readability among the three chatbots. (4) Pediatric orthopaedic surgeons can play a role in counseling patients and their families on the limited utility of AI chatbots for patient education purposes.
Level of Evidence: IV
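The categorical comparisons reported in the Results can be checked with a short sketch. It assumes a plain Pearson chi-square test on a 3x2 table (three chatbots, yes/no per recommendation) with no continuity correction; the abstract does not say which variant the authors used, so this is an illustration and not their analysis code. The names `COUNTS` and `chi_square_three_groups` are hypothetical. With 2 degrees of freedom the chi-square survival function reduces to exp(-x/2), so only the standard library is needed.

```python
import math

# Counts out of 9 recommendations, in the order
# ChatGPT-4.0, ChatGPT-3.5, Google Gemini (taken from the Results).
N_RECOMMENDATIONS = 9
COUNTS = {
    "accuracy": [5, 6, 6],
    "supplementary": [8, 7, 9],
    "incomplete": [7, 6, 7],
}


def chi_square_three_groups(yes_counts, n_per_group):
    """Pearson chi-square statistic and P value for a 3x2 table.

    Assumption: no continuity correction. The closed-form P value
    below is only valid for 2 degrees of freedom, i.e. three groups.
    """
    if len(yes_counts) != 3:
        raise ValueError("closed-form P value requires exactly three groups")
    table = [[yes, n_per_group - yes] for yes in yes_counts]
    grand_total = sum(sum(row) for row in table)
    col_totals = [sum(row[j] for row in table) for j in range(2)]
    statistic = 0.0
    for row in table:
        row_total = sum(row)
        for j, observed in enumerate(row):
            expected = row_total * col_totals[j] / grand_total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with df = 2.
    p_value = math.exp(-statistic / 2)
    return statistic, p_value


RESULTS = {
    name: chi_square_three_groups(counts, N_RECOMMENDATIONS)
    for name, counts in COUNTS.items()
}

for name, (statistic, p_value) in RESULTS.items():
    print(f"{name}: chi2 = {statistic:.3f}, P = {p_value:.3f}")

# Share of the 24 supplementary responses needing no, some, or many modifications.
SUPPLEMENTARY_TOTAL = sum(COUNTS["supplementary"])
MODIFICATION_SHARES = [round(100 * k / SUPPLEMENTARY_TOTAL, 1) for k in (5, 16, 3)]
print(SUPPLEMENTARY_TOTAL, MODIFICATION_SHARES)
```

Under these assumptions the three P values round to 0.853, 0.325, and 0.825, in line with the figures reported for accuracy, supplementary responses, and incomplete responses. Several expected cell counts are below 5, so the chi-square approximation is rough at this sample size.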
format Article
id doaj-art-878c8082f61c4f6692fd9f48d3f24481
institution DOAJ
issn 2768-2765
language English
publishDate 2025-02-01
publisher Elsevier
record_format Article
series Journal of the Pediatric Orthopaedic Society of North America
spelling doaj-art-878c8082f61c4f6692fd9f48d3f24481
2025-08-20T02:55:06Z
eng
Elsevier
Journal of the Pediatric Orthopaedic Society of North America
2768-2765
2025-02-01
10
100135
10.1016/j.jposna.2024.100135
ChatGPT and Google Gemini are Clinically Inadequate in Providing Recommendations on Management of Developmental Dysplasia of the Hip Compared to American Academy of Orthopaedic Surgeons Clinical Practice Guidelines
Patrick P. Nian, BA (Hospital for Special Surgery, New York City, NY, USA)
Amith Umesh, BA (Hospital for Special Surgery, New York City, NY, USA)
Ruth H. Jones, BS (Hospital for Special Surgery, New York City, NY, USA)
Akshitha Adhiyaman, BS (Hospital for Special Surgery, New York City, NY, USA)
Christopher J. Williams, BS (Hospital for Special Surgery, New York City, NY, USA)
Christine M. Goodbody, MD (Children's Hospital of Philadelphia, Philadelphia, PA, USA)
Jessica H. Heyer, MD (Hospital for Special Surgery, New York City, NY, USA)
Shevaun M. Doyle, MD (Hospital for Special Surgery, New York City, NY, USA; Corresponding author: 535 East 70th Street; New York, NY 10021, USA.)
http://www.sciencedirect.com/science/article/pii/S2768276524009611
Developmental dysplasia of the hip
ChatGPT
Google Gemini
Clinical practice guideline
American Academy of Orthopaedic Surgeons
title ChatGPT and Google Gemini are Clinically Inadequate in Providing Recommendations on Management of Developmental Dysplasia of the Hip Compared to American Academy of Orthopaedic Surgeons Clinical Practice Guidelines
topic Developmental dysplasia of the hip
ChatGPT
Google Gemini
Clinical practice guideline
American Academy of Orthopaedic Surgeons
url http://www.sciencedirect.com/science/article/pii/S2768276524009611