Large Language Models in Biochemistry Education: Comparative Evaluation of Performance
| Main Authors: | Olena Bolgova, Inna Shypilova, Volodymyr Mavrych |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | JMIR Publications, 2025-04-01 |
| Series: | JMIR Medical Education |
| Online Access: | https://mededu.jmir.org/2025/1/e67244 |
| _version_ | 1850182969651625984 |
|---|---|
| author | Olena Bolgova; Inna Shypilova; Volodymyr Mavrych |
| author_facet | Olena Bolgova; Inna Shypilova; Volodymyr Mavrych |
| author_sort | Olena Bolgova |
| collection | DOAJ |
| description |
Abstract
Background: Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs), have ushered in a new era of innovation across various fields, with medicine at the forefront of this technological revolution. Many studies have indicated that, at their current level of development, LLMs can pass various board examinations; however, their ability to answer specific subject-related questions requires validation.
Objective: The objective of this study was to conduct a comprehensive analysis comparing the performance of advanced LLM chatbots—Claude (Anthropic), GPT-4 (OpenAI), Gemini (Google), and Copilot (Microsoft)—against the academic results of medical students in the medical biochemistry course.
Methods: We used 200 USMLE (United States Medical Licensing Examination)–style multiple-choice questions (MCQs) selected from the course exam database. The questions spanned various complexity levels and were distributed across 23 distinct topics; questions containing tables or images were excluded from the study. In August 2024, the results of 5 successive attempts by Claude 3.5 Sonnet, GPT-4-1106, Gemini 1.5 Flash, and Copilot to answer this question set were evaluated for accuracy. Statistica 13.5.0.17 (TIBCO Software Inc) was used to compute basic descriptive statistics. Given the binary nature of the data, the chi-square test was used to compare results among the different chatbots, with a statistical significance level of P<.05.
Results: On average, the selected chatbots correctly answered 81.1% (SD 12.8%) of the questions, surpassing the students’ performance by 8.3% (P=…).
Conclusions: Our study suggests that different AI models may have unique strengths in specific medical fields, which could be leveraged for targeted support in biochemistry courses. This performance highlights the potential of AI in medical education and assessment. |
| format | Article |
| id | doaj-art-ee3d6e43f2ef41aab5b21a8e28edd6e4 |
| institution | OA Journals |
| issn | 2369-3762 |
| language | English |
| publishDate | 2025-04-01 |
| publisher | JMIR Publications |
| record_format | Article |
| series | JMIR Medical Education |
| spelling | doaj-art-ee3d6e43f2ef41aab5b21a8e28edd6e4; 2025-08-20T02:17:28Z; eng; JMIR Publications; JMIR Medical Education; ISSN 2369-3762; 2025-04-01; vol. 11; e67244; DOI 10.2196/67244; Large Language Models in Biochemistry Education: Comparative Evaluation of Performance; Olena Bolgova (http://orcid.org/0009-0002-9496-9754); Inna Shypilova (http://orcid.org/0009-0000-0707-6997); Volodymyr Mavrych (http://orcid.org/0009-0009-1159-4573); https://mededu.jmir.org/2025/1/e67244 |
| spellingShingle | Olena Bolgova; Inna Shypilova; Volodymyr Mavrych; Large Language Models in Biochemistry Education: Comparative Evaluation of Performance; JMIR Medical Education |
| title | Large Language Models in Biochemistry Education: Comparative Evaluation of Performance |
| title_full | Large Language Models in Biochemistry Education: Comparative Evaluation of Performance |
| title_fullStr | Large Language Models in Biochemistry Education: Comparative Evaluation of Performance |
| title_full_unstemmed | Large Language Models in Biochemistry Education: Comparative Evaluation of Performance |
| title_short | Large Language Models in Biochemistry Education: Comparative Evaluation of Performance |
| title_sort | large language models in biochemistry education comparative evaluation of performance |
| url | https://mededu.jmir.org/2025/1/e67244 |
| work_keys_str_mv | AT olenabolgova largelanguagemodelsinbiochemistryeducationcomparativeevaluationofperformance AT innashypilova largelanguagemodelsinbiochemistryeducationcomparativeevaluationofperformance AT volodymyrmavrych largelanguagemodelsinbiochemistryeducationcomparativeevaluationofperformance |
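
The Methods field above describes scoring each chatbot's answers to the 200 MCQs as correct or incorrect and comparing chatbots with a chi-square test on the resulting binary data. The study itself used Statistica; the sketch below illustrates the same kind of pairwise comparison in Python with SciPy. The `compare` helper and all correct-answer counts are hypothetical placeholders for illustration, not the accuracies reported in the article.

```python
# Minimal sketch of the pairwise chatbot comparison described in the Methods:
# each of the 200 MCQs is scored correct/incorrect, and a chi-square test on
# the 2x2 contingency table compares two chatbots' accuracy.
from scipy.stats import chi2_contingency

N_QUESTIONS = 200  # size of the USMLE-style MCQ set used in the study

# Hypothetical correct-answer counts out of 200 (illustration only,
# NOT the results reported in the article)
correct = {"Claude": 170, "GPT-4": 165, "Gemini": 150, "Copilot": 140}

def compare(bot_a: str, bot_b: str) -> float:
    """Return the chi-square P value for a 2x2 correct/incorrect table."""
    table = [
        [correct[bot_a], N_QUESTIONS - correct[bot_a]],
        [correct[bot_b], N_QUESTIONS - correct[bot_b]],
    ]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value

for a, b in [("Claude", "GPT-4"), ("Claude", "Gemini"), ("Gemini", "Copilot")]:
    p = compare(a, b)
    verdict = "significant" if p < .05 else "not significant"
    # dividing a count out of 200 by 2 converts it to a percentage
    print(f"{a} ({correct[a] / 2:.1f}%) vs {b} ({correct[b] / 2:.1f}%): "
          f"P={p:.3f}, {verdict} at P<.05")
```

With binary correct/incorrect outcomes, the 2×2 contingency table is the natural input for the chi-square test named in the abstract; if any expected cell count were small, the Fisher exact test (`scipy.stats.fisher_exact`) would be the usual substitute.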