Comprehensiveness of Large Language Models in Patient Queries on Gingival and Endodontic Health
Aim: Given the increasing interest in using large language models (LLMs) for self-diagnosis, this study aimed to evaluate the comprehensiveness of two prominent LLMs, ChatGPT-3.5 and ChatGPT-4, in addressing common queries related to gingival and endodontic health across different language contexts...
Main Authors: | Qian Zhang, Zhengyu Wu, Jinlin Song, Shuicai Luo, Zhaowu Chai |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2025-02-01 |
Series: | International Dental Journal |
Subjects: | Artificial intelligence; Large language models; Oral healthcare; Gingival and endodontic health |
Online Access: | http://www.sciencedirect.com/science/article/pii/S0020653924001953 |
author | Qian Zhang Zhengyu Wu Jinlin Song Shuicai Luo Zhaowu Chai |
collection | DOAJ |
description | Aim: Given the increasing interest in using large language models (LLMs) for self-diagnosis, this study aimed to evaluate the comprehensiveness of two prominent LLMs, ChatGPT-3.5 and ChatGPT-4, in addressing common queries related to gingival and endodontic health across different language contexts and query types. Methods: We assembled a set of 33 common real-life questions related to gingival and endodontic healthcare, including 17 common-sense questions and 16 expert questions. Each question was presented to the LLMs in both English and Chinese. Three specialists were invited to evaluate the comprehensiveness of the responses on a five-point Likert scale, where a higher score indicated greater quality responses. Results: LLMs performed significantly better in English, with an average score of 4.53, compared to 3.95 in Chinese (Mann–Whitney U test, P < .05). Responses to common sense questions received higher scores than those to expert questions, with averages of 4.46 and 4.02 (Mann–Whitney U test, P < .05). Among the LLMs, ChatGPT-4 consistently outperformed ChatGPT-3.5, achieving average scores of 4.45 and 4.03 (Mann–Whitney U test, P < .05). Conclusions: ChatGPT-4 provides more comprehensive responses than ChatGPT-3.5 for queries related to gingival and endodontic health. Both LLMs perform better in English and on common sense questions. However, the performance discrepancies across different language contexts and the presence of inaccurate responses suggest that further evaluation and understanding of their limitations are crucial to avoid potential misunderstandings. Clinical Relevance: This study revealed the performance differences of ChatGPT-3.5 and ChatGPT-4 in handling gingival and endodontic health issues across different language contexts, providing insights into the comprehensiveness and limitations of LLMs in addressing common oral healthcare queries. |
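The analysis described in the abstract — two groups of five-point Likert comprehensiveness ratings compared with a two-sided Mann–Whitney U test — can be sketched as follows. This is a minimal illustration using `scipy.stats.mannwhitneyu`; the score values below are hypothetical placeholders, not the study's data.

```python
# Sketch of the abstract's statistical comparison: ratings for
# English-language responses vs. Chinese-language responses on a
# 5-point Likert scale, compared with a two-sided Mann-Whitney U test.
# All score values are invented for illustration only.
from scipy.stats import mannwhitneyu

english_scores = [5, 5, 4, 5, 4, 5, 4, 5, 5, 4]  # hypothetical ratings
chinese_scores = [4, 3, 4, 4, 3, 4, 5, 3, 4, 4]  # hypothetical ratings

# mannwhitneyu returns the U statistic for the first sample and the P value.
stat, p_value = mannwhitneyu(english_scores, chinese_scores,
                             alternative="two-sided")
print(f"U = {stat}, P = {p_value:.4f}")
```

A nonparametric test is appropriate here because Likert ratings are ordinal; the study reports P < .05 for each of its three comparisons (language, question type, and model version).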
id | doaj-art-6200fb5448f54d7fa241ea0760f55a00 |
institution | Kabale University |
issn | 0020-6539 |
spelling | Elsevier, International Dental Journal, ISSN 0020-6539, 2025-02-01, vol. 75, no. 1, pp. 151-157. "Comprehensiveness of Large Language Models in Patient Queries on Gingival and Endodontic Health." Qian Zhang, Zhengyu Wu, Jinlin Song, and Zhaowu Chai: College of Stomatology, Chongqing Medical University, Chongqing, China; Chongqing Key Laboratory for Oral Diseases and Biomedical Sciences, Chongqing, China; Chongqing Municipal Key Laboratory of Oral Biomedical Engineering of Higher Education, Chongqing, China. Shuicai Luo: Quanzhou Institute of Equipment Manufacturing, Haixi Institute, Chinese Academy of Sciences, Quanzhou, China. Corresponding author: Zhaowu Chai, Stomatological Hospital of Chongqing Medical University, 426 Songshibei Road, Chongqing 401147, China. |
topic | Artificial intelligence; Large language models; Oral healthcare; Gingival and endodontic health |