Large Language Models in Biochemistry Education: Comparative Evaluation of Performance

Bibliographic Details
Main Authors: Olena Bolgova, Inna Shypilova, Volodymyr Mavrych
Format: Article
Language:English
Published: JMIR Publications 2025-04-01
Series:JMIR Medical Education
Online Access:https://mededu.jmir.org/2025/1/e67244
_version_ 1850182969651625984
author Olena Bolgova
Inna Shypilova
Volodymyr Mavrych
author_facet Olena Bolgova
Inna Shypilova
Volodymyr Mavrych
author_sort Olena Bolgova
collection DOAJ
description Abstract
Background: Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs), have started a new era of innovation across various fields, with medicine at the forefront of this technological revolution. Many studies have indicated that, at the current level of development, LLMs can pass different board exams. However, their ability to answer specific subject-related questions requires validation.
Objective: The objective of this study was to conduct a comprehensive analysis comparing the performance of advanced LLM chatbots, Claude (Anthropic), GPT-4 (OpenAI), Gemini (Google), and Copilot (Microsoft), against the academic results of medical students in a medical biochemistry course.
Methods: We used 200 USMLE (United States Medical Licensing Examination)–style multiple-choice questions (MCQs) selected from the course exam database. They encompassed various complexity levels and were distributed across 23 distinct topics; questions with tables and images were not included. The results of 5 successive attempts by Claude 3.5 Sonnet, GPT-4-1106, Gemini 1.5 Flash, and Copilot to answer this question set were evaluated for accuracy in August 2024. Statistica 13.5.0.17 (TIBCO Software Inc) was used for basic descriptive statistics. Given the binary nature of the data, the chi-square test was used to compare results among the different chatbots, with a statistical significance level of P<.05.
Results: On average, the selected chatbots correctly answered 81.1% (SD 12.8%) of the questions, surpassing the students' performance by 8.3% (P<.05).
Conclusions: Our study suggests that different AI models may have unique strengths in specific medical fields, which could be leveraged for targeted support in biochemistry courses. This performance highlights the potential of AI in medical education and assessment.
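For readers who want to reproduce the statistical comparison described in the Methods, a minimal sketch in Python could look like the following. The per-chatbot counts here are hypothetical placeholders, not the study's reported results, and the helper function name is illustrative.

    # A minimal sketch, assuming made-up counts, of the chi-square comparison
    # the Methods describe: each chatbot's 200 answers are binary
    # (correct/incorrect), so any two chatbots can be compared via a
    # 2x2 contingency table.
    from scipy.stats import chi2_contingency

    N_QUESTIONS = 200  # size of the MCQ set used in the study

    def compare_chatbots(correct_a: int, correct_b: int, n: int = N_QUESTIONS):
        """Chi-square test on a 2x2 table of correct/incorrect counts."""
        table = [
            [correct_a, n - correct_a],  # chatbot A: correct, incorrect
            [correct_b, n - correct_b],  # chatbot B: correct, incorrect
        ]
        # Yates' continuity correction is applied by default for 2x2 tables.
        chi2, p, dof, _expected = chi2_contingency(table)
        return chi2, p

    # Hypothetical counts, not the paper's reported figures:
    chi2, p = compare_chatbots(correct_a=185, correct_b=140)
    print(f"chi2={chi2:.2f}, P={p:.4g}, significant at P<.05: {p < .05}")

Pairwise 2x2 tables match the binary correct/incorrect structure of the data; comparing all 4 chatbots at once would instead pass a 4x2 table to the same function.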
format Article
id doaj-art-ee3d6e43f2ef41aab5b21a8e28edd6e4
institution OA Journals
issn 2369-3762
language English
publishDate 2025-04-01
publisher JMIR Publications
record_format Article
series JMIR Medical Education
spelling doaj-art-ee3d6e43f2ef41aab5b21a8e28edd6e4 | indexed 2025-08-20T02:17:28Z | eng | JMIR Publications | JMIR Medical Education | ISSN 2369-3762 | 2025-04-01 | Vol. 11 | e67244 | DOI 10.2196/67244 | Large Language Models in Biochemistry Education: Comparative Evaluation of Performance | Olena Bolgova (http://orcid.org/0009-0002-9496-9754), Inna Shypilova (http://orcid.org/0009-0000-0707-6997), Volodymyr Mavrych (http://orcid.org/0009-0009-1159-4573) | https://mededu.jmir.org/2025/1/e67244
spellingShingle Olena Bolgova
Inna Shypilova
Volodymyr Mavrych
Large Language Models in Biochemistry Education: Comparative Evaluation of Performance
JMIR Medical Education
title Large Language Models in Biochemistry Education: Comparative Evaluation of Performance
title_full Large Language Models in Biochemistry Education: Comparative Evaluation of Performance
title_fullStr Large Language Models in Biochemistry Education: Comparative Evaluation of Performance
title_full_unstemmed Large Language Models in Biochemistry Education: Comparative Evaluation of Performance
title_short Large Language Models in Biochemistry Education: Comparative Evaluation of Performance
title_sort large language models in biochemistry education comparative evaluation of performance
url https://mededu.jmir.org/2025/1/e67244
work_keys_str_mv AT olenabolgova largelanguagemodelsinbiochemistryeducationcomparativeevaluationofperformance
AT innashypilova largelanguagemodelsinbiochemistryeducationcomparativeevaluationofperformance
AT volodymyrmavrych largelanguagemodelsinbiochemistryeducationcomparativeevaluationofperformance