Comprehensiveness of Large Language Models in Patient Queries on Gingival and Endodontic Health

Aim: Given the increasing interest in using large language models (LLMs) for self-diagnosis, this study aimed to evaluate the comprehensiveness of two prominent LLMs, ChatGPT-3.5 and ChatGPT-4, in addressing common queries related to gingival and endodontic health across different language contexts and query types.

Methods: We assembled 33 common real-life questions related to gingival and endodontic healthcare, comprising 17 common-sense questions and 16 expert questions. Each question was presented to the LLMs in both English and Chinese. Three specialists rated the comprehensiveness of the responses on a five-point Likert scale, with higher scores indicating higher-quality responses.

Results: The LLMs performed significantly better in English, with an average score of 4.53, than in Chinese, with an average of 3.95 (Mann–Whitney U test, P < .05). Responses to common-sense questions scored higher than responses to expert questions, with averages of 4.46 and 4.02, respectively (Mann–Whitney U test, P < .05). ChatGPT-4 consistently outperformed ChatGPT-3.5, with average scores of 4.45 and 4.03, respectively (Mann–Whitney U test, P < .05).

Conclusions: ChatGPT-4 provides more comprehensive responses than ChatGPT-3.5 to queries on gingival and endodontic health. Both LLMs perform better in English and on common-sense questions. However, the performance discrepancies across language contexts and the presence of inaccurate responses indicate that further evaluation and a clearer understanding of their limitations are needed to avoid potential misunderstandings.

Clinical Relevance: This study reveals the performance differences between ChatGPT-3.5 and ChatGPT-4 in handling gingival and endodontic health questions across language contexts, offering insight into the comprehensiveness and limitations of LLMs in addressing common oral healthcare queries.
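The Methods compare two independent groups of five-point Likert ratings with the Mann–Whitney U test. As a minimal illustration (not the authors' code, and using hypothetical placeholder scores rather than study data), the sketch below computes the U statistic for two rating groups, handling tied values with average ranks:

```python
# Illustrative sketch only: a minimal Mann-Whitney U statistic for two
# independent groups of Likert ratings. The scores below are hypothetical
# placeholders, not data from the study.

def mann_whitney_u(group_a, group_b):
    """Return the U statistic for group_a vs. group_b (ties get average ranks)."""
    combined = sorted(group_a + group_b)
    # Assign each distinct value the average of its 1-based ranks (tie handling).
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        ranks[combined[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    rank_sum_a = sum(ranks[v] for v in group_a)
    n_a = len(group_a)
    # U1 = R1 - n1(n1 + 1) / 2
    return rank_sum_a - n_a * (n_a + 1) / 2

# Hypothetical five-point Likert ratings for two language contexts:
english = [5, 4, 5, 4, 5, 4, 5, 5]
chinese = [4, 3, 4, 4, 3, 4, 4, 3]
print(mann_whitney_u(english, chinese))  # -> 56.5
```

In practice one would obtain a P value from the U statistic (e.g. via `scipy.stats.mannwhitneyu`); the pure-Python version above only shows how the rank-based statistic itself is formed.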

Bibliographic Details
Main Authors: Qian Zhang, Zhengyu Wu, Jinlin Song, Shuicai Luo, Zhaowu Chai
Format: Article
Language: English
Published: Elsevier 2025-02-01
Series: International Dental Journal (ISSN 0020-6539), Vol. 75, Issue 1, pp. 151–157
Subjects: Artificial intelligence; Large language models; Oral healthcare; Gingival and endodontic health
Collection: DOAJ
Online Access: http://www.sciencedirect.com/science/article/pii/S0020653924001953
Author Affiliations:
Qian Zhang, Zhengyu Wu, Jinlin Song, and Zhaowu Chai: College of Stomatology, Chongqing Medical University, Chongqing, China; Chongqing Key Laboratory for Oral Diseases and Biomedical Sciences, Chongqing, China; Chongqing Municipal Key Laboratory of Oral Biomedical Engineering of Higher Education, Chongqing, China
Shuicai Luo: Quanzhou Institute of Equipment Manufacturing, Haixi Institute, Chinese Academy of Sciences, Quanzhou, China
Corresponding Author: Zhaowu Chai, Stomatological Hospital of Chongqing Medical University, 426 Songshibei Road, Chongqing 401147, China