Performance of Large Language Models on the Korean Dental Licensing Examination: A Comparative Study
Purpose: This study investigated the potential application of large language models (LLMs) in dental education and practice, with a focus on ChatGPT and Claude3-Opus. Using the Korean Dental Licensing Examination (KDLE) as a benchmark, we aimed to assess the capabilities of these models in the dental field.

Methods: This study evaluated three LLMs: GPT-3.5, GPT-4 (version: March 2024), and Claude3-Opus (version: March 2024). We used the KDLE questionnaire from 2019 to 2023 as inputs to the LLMs and then used the outputs from the LLMs as the corresponding answers. The total scores for individual subjects were obtained and compared. We also compared the performance of the LLMs with that of individuals who took the exams.

Results: Claude3-Opus performed best among the considered LLMs, except in 2019 when ChatGPT-4 performed best. Claude3-Opus and ChatGPT-4 surpassed the cut-off scores in all the years considered; this indicated that Claude3-Opus and ChatGPT-4 passed the KDLE, whereas ChatGPT-3.5 did not. However, all LLMs considered performed worse than humans, represented here by dental students in Korea. On average, the best-performing LLM annually achieved 85.4% of human performance.

Conclusion: Using the KDLE as a benchmark, our study demonstrates that although LLMs have not yet reached human-level performance in overall scores, both Claude3-Opus and ChatGPT-4 exceed the cut-off scores and perform exceptionally well in specific subjects.

Clinical Relevance: Our findings will aid in evaluating the feasibility of integrating LLMs into dentistry to improve the quality and availability of dental services by offering patient information that meets the basic competency standards of a dentist.
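The evaluation protocol described in the Methods (exam questions as prompts, model outputs scored against the answer key, then compared with a pass cut-off) can be sketched as follows. This is a minimal illustrative sketch, not the authors' actual pipeline: the function name `score_exam`, the example answers, and the 60% cut-off ratio are assumptions for illustration.

```python
# Hypothetical sketch of the scoring step: compare a model's answers to
# multiple-choice exam items against the answer key, then check the total
# against a pass cut-off. All names and numbers here are illustrative.

def score_exam(model_answers, answer_key, cutoff_ratio=0.6):
    """Return (score, max_score, passed) for one exam year."""
    if len(model_answers) != len(answer_key):
        raise ValueError("answer count mismatch")
    # One point per item whose chosen option matches the key.
    score = sum(1 for got, want in zip(model_answers, answer_key) if got == want)
    max_score = len(answer_key)
    # Pass if the score reaches the cut-off fraction of the maximum.
    passed = score >= cutoff_ratio * max_score
    return score, max_score, passed

# Illustrative five-item exam: the model answers 4 of 5 items correctly.
answers = ["A", "C", "B", "D", "E"]
key     = ["A", "C", "B", "D", "A"]
print(score_exam(answers, key))  # (4, 5, True)
```

Per-subject totals, as in the study, would simply apply the same scoring to each subject's subset of items before aggregating.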
Main Authors: | Woojun Kim, Bong Chul Kim, Han-Gyeol Yeom
---|---
Format: | Article
Language: | English
Published: | Elsevier, 2025-02-01
Series: | International Dental Journal
Subjects: | Large language model; Deep learning; Artificial intelligence; Dentist; Dental health service
Online Access: | http://www.sciencedirect.com/science/article/pii/S0020653924014928
_version_ | 1832592765414801408 |
---|---|
author | Woojun Kim; Bong Chul Kim; Han-Gyeol Yeom |
author_facet | Woojun Kim; Bong Chul Kim; Han-Gyeol Yeom |
author_sort | Woojun Kim |
collection | DOAJ |
description | Purpose: This study investigated the potential application of large language models (LLMs) in dental education and practice, with a focus on ChatGPT and Claude3-Opus. Using the Korean Dental Licensing Examination (KDLE) as a benchmark, we aimed to assess the capabilities of these models in the dental field. Methods: This study evaluated three LLMs: GPT-3.5, GPT-4 (version: March 2024), and Claude3-Opus (version: March 2024). We used the KDLE questionnaire from 2019 to 2023 as inputs to the LLMs and then used the outputs from the LLMs as the corresponding answers. The total scores for individual subjects were obtained and compared. We also compared the performance of LLMs with those of individuals who underwent the exams. Results: Claude3-Opus performed best among the considered LLMs, except in 2019 when ChatGPT-4 performed best. Claude3-Opus and ChatGPT-4 surpassed the cut-off scores in all the years considered; this indicated that Claude3-Opus and ChatGPT-4 passed the KDLE, whereas ChatGPT-3.5 did not. However, all LLMs considered performed worse than humans, represented here by dental students in Korea. On average, the best-performing LLM annually achieved 85.4% of human performance. Conclusion: Using the KDLE as a benchmark, our study demonstrates that although LLMs have not yet reached human-level performance in overall scores, both Claude3-Opus and ChatGPT-4 exceed the cut-off scores and perform exceptionally well in specific subjects. Clinical Relevance: Our findings will aid in evaluating the feasibility of integrating LLMs into dentistry to improve the quality and availability of dental services by offering patient information that meets the basic competency standards of a dentist. |
format | Article |
id | doaj-art-0b43b8645e1448cbab51982358809d43 |
institution | Kabale University |
issn | 0020-6539 |
language | English |
publishDate | 2025-02-01 |
publisher | Elsevier |
record_format | Article |
series | International Dental Journal |
spelling | doaj-art-0b43b8645e1448cbab51982358809d43; 2025-01-21T04:12:44Z; eng; Elsevier; International Dental Journal; ISSN 0020-6539; 2025-02-01; vol. 75, no. 1, pp. 176–184; Performance of Large Language Models on the Korean Dental Licensing Examination: A Comparative Study; Woojun Kim (The Robotics Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA); Bong Chul Kim (Department of Oral and Maxillofacial Surgery, Daejeon Dental Hospital, Wonkwang University College of Dentistry, Daejeon, Korea); Han-Gyeol Yeom (Department of Oral and Maxillofacial Radiology, Daejeon Dental Hospital, Wonkwang University College of Dentistry, Daejeon, Korea; Corresponding author: Department of Oral and Maxillofacial Radiology, Daejeon Dental Hospital, Wonkwang University College of Dentistry, 77, Dunsan-ro, Seo-gu, Daejeon 35233, South Korea). Purpose: This study investigated the potential application of large language models (LLMs) in dental education and practice, with a focus on ChatGPT and Claude3-Opus. Using the Korean Dental Licensing Examination (KDLE) as a benchmark, we aimed to assess the capabilities of these models in the dental field. Methods: This study evaluated three LLMs: GPT-3.5, GPT-4 (version: March 2024), and Claude3-Opus (version: March 2024). We used the KDLE questionnaire from 2019 to 2023 as inputs to the LLMs and then used the outputs from the LLMs as the corresponding answers. The total scores for individual subjects were obtained and compared. We also compared the performance of LLMs with those of individuals who underwent the exams. Results: Claude3-Opus performed best among the considered LLMs, except in 2019 when ChatGPT-4 performed best. Claude3-Opus and ChatGPT-4 surpassed the cut-off scores in all the years considered; this indicated that Claude3-Opus and ChatGPT-4 passed the KDLE, whereas ChatGPT-3.5 did not. However, all LLMs considered performed worse than humans, represented here by dental students in Korea. On average, the best-performing LLM annually achieved 85.4% of human performance. Conclusion: Using the KDLE as a benchmark, our study demonstrates that although LLMs have not yet reached human-level performance in overall scores, both Claude3-Opus and ChatGPT-4 exceed the cut-off scores and perform exceptionally well in specific subjects. Clinical Relevance: Our findings will aid in evaluating the feasibility of integrating LLMs into dentistry to improve the quality and availability of dental services by offering patient information that meets the basic competency standards of a dentist. http://www.sciencedirect.com/science/article/pii/S0020653924014928; Large language model; Deep learning; Artificial intelligence; Dentist; Dental health service |
spellingShingle | Woojun Kim; Bong Chul Kim; Han-Gyeol Yeom; Performance of Large Language Models on the Korean Dental Licensing Examination: A Comparative Study; International Dental Journal; Large language model; Deep learning; Artificial intelligence; Dentist; Dental health service |
title | Performance of Large Language Models on the Korean Dental Licensing Examination: A Comparative Study |
title_full | Performance of Large Language Models on the Korean Dental Licensing Examination: A Comparative Study |
title_fullStr | Performance of Large Language Models on the Korean Dental Licensing Examination: A Comparative Study |
title_full_unstemmed | Performance of Large Language Models on the Korean Dental Licensing Examination: A Comparative Study |
title_short | Performance of Large Language Models on the Korean Dental Licensing Examination: A Comparative Study |
title_sort | performance of large language models on the korean dental licensing examination a comparative study |
topic | Large language model; Deep learning; Artificial intelligence; Dentist; Dental health service |
url | http://www.sciencedirect.com/science/article/pii/S0020653924014928 |
work_keys_str_mv | AT woojunkim performanceoflargelanguagemodelsonthekoreandentallicensingexaminationacomparativestudy AT bongchulkim performanceoflargelanguagemodelsonthekoreandentallicensingexaminationacomparativestudy AT hangyeolyeom performanceoflargelanguagemodelsonthekoreandentallicensingexaminationacomparativestudy |