Evaluating AI-generated responses from different chatbots to soil science-related questions

The emergence of chatbots powered by large language models (LLMs), capable of providing human-like responses to various inquiries, has revolutionized fields like education and research, making artificial intelligence (AI) a major global topic. This study aimed to evaluate the performance of the most...

Bibliographic Details
Main Author: Javad Khanifar
Format: Article
Language: English
Published: Elsevier 2025-06-01
Series: Soil Advances
Subjects: Artificial intelligence (AI); Chatbot; ChatGPT; Claude; Gemini; Large language models (LLMs)
Online Access: http://www.sciencedirect.com/science/article/pii/S2950289625000028
author Javad Khanifar
collection DOAJ
description The emergence of chatbots powered by large language models (LLMs), capable of providing human-like responses to various inquiries, has revolutionized fields like education and research, making artificial intelligence (AI) a major global topic. This study aimed to evaluate the performance of the most recent LLMs—Claude 3.5 Sonnet, GPT-4o, GPT-4o mini, Gemini 1.5 Pro, and Gemini 1.5 Flash—in yielding correct answers to questions related to soil science, a fundamental discipline in agricultural, natural resources, and environmental sciences. For this purpose, 105 specialized multiple-choice questions covering all domains of soil science were selected from the Iranian national PhD entrance exam in soil science. The GPT-4o-based chatbot, also known as ChatGPT, was employed to translate questions from Persian into English to assess the impact of input language on its performance. The LLMs were compared using Cohen's Kappa coefficient and the Chi-Square test. The study results indicated that the overall performance of chatbots powered by Claude 3.5 Sonnet and GPT-4o was comparable, as both models correctly answered 64.80 % of the questions. Nevertheless, these chatbots had significantly higher answering accuracy than Gemini 1.5 Pro and cost-efficient LLMs, namely Gemini 1.5 Flash and GPT-4o mini (p < 0.05). This finding suggests that soil science questions could be categorized as complex tasks for chatbots. The GPT-4o model's performance in answering questions was not significantly dependent on the language used (p > 0.05), revealing that input language is not a limiting factor when applying ChatGPT to soil science. Overall, AI chatbots can, at best, achieve slightly above-moderate performance in answering soil science questions. The study highlights the importance of soil scientists' knowledge and experience in integrating AI chatbots into soil science research and education.
format Article
id doaj-art-37a0571ab328406ca7ea8c8968b11605
institution Kabale University
issn 2950-2896
language English
publishDate 2025-06-01
publisher Elsevier
record_format Article
series Soil Advances
spelling doaj-art-37a0571ab328406ca7ea8c8968b11605; 2025-02-05T04:32:57Z; eng; Elsevier; Soil Advances; 2950-2896; 2025-06-01; 3; 100034; Evaluating AI-generated responses from different chatbots to soil science-related questions; Javad Khanifar (Independent Researcher, Shush, Khuzestan, Iran); http://www.sciencedirect.com/science/article/pii/S2950289625000028; Artificial intelligence (AI); Chatbot; ChatGPT; Claude; Gemini; Large language models (LLMs)
title Evaluating AI-generated responses from different chatbots to soil science-related questions
topic Artificial intelligence (AI)
Chatbot
ChatGPT
Claude
Gemini
Large language models (LLMs)
url http://www.sciencedirect.com/science/article/pii/S2950289625000028