Evaluating AI-generated responses from different chatbots to soil science-related questions

The emergence of chatbots powered by large language models (LLMs), capable of providing human-like responses to various inquiries, has revolutionized fields like education and research, making artificial intelligence (AI) a major global topic. This study aimed to evaluate the performance of the most...

Bibliographic Details
Main Author:	Javad Khanifar
Format:	Article
Language:	English
Published:	Elsevier 2025-06-01
Series:	Soil Advances
Subjects:	Artificial intelligence (AI); Chatbot; ChatGPT; Claude; Gemini; Large language models (LLMs)
Online Access:	http://www.sciencedirect.com/science/article/pii/S2950289625000028

_version_	1832540310214803456
author	Javad Khanifar
collection	DOAJ
description	The emergence of chatbots powered by large language models (LLMs), capable of providing human-like responses to various inquiries, has revolutionized fields like education and research, making artificial intelligence (AI) a major global topic. This study aimed to evaluate the performance of the most recent LLMs (Claude 3.5 Sonnet, GPT-4o, GPT-4o mini, Gemini 1.5 Pro, and Gemini 1.5 Flash) in correctly answering questions related to soil science, a fundamental discipline in agricultural, natural resources, and environmental sciences. For this purpose, 105 specialized multiple-choice questions covering all domains of soil science were selected from the Iranian national PhD entrance exam in soil science. The GPT-4o-based chatbot, also known as ChatGPT, was employed to translate questions from Persian into English to assess the impact of input language on its performance. The LLMs were compared using Cohen's Kappa coefficient and the Chi-Square test. The study results indicated that the overall performance of chatbots powered by Claude 3.5 Sonnet and GPT-4o was comparable, as both models correctly answered 64.80% of the questions. These two chatbots were significantly more accurate than Gemini 1.5 Pro and the cost-efficient LLMs, namely Gemini 1.5 Flash and GPT-4o mini (p < 0.05). This finding suggests that soil science questions could be categorized as complex tasks for chatbots. The GPT-4o model's performance in answering questions was not significantly dependent on the language used (p > 0.05), revealing that input language is not a limiting factor when applying ChatGPT to soil science. Overall, AI chatbots can, at best, achieve slightly above-moderate performance in answering soil science questions. The study highlights the importance of soil scientists' knowledge and experience in integrating AI chatbots into soil science research and education.
format	Article
id	doaj-art-37a0571ab328406ca7ea8c8968b11605
institution	Kabale University
issn	2950-2896
language	English
publishDate	2025-06-01
publisher	Elsevier
record_format	Article
series	Soil Advances
spelling	doaj-art-37a0571ab328406ca7ea8c8968b11605; 2025-02-05T04:32:57Z; eng; Elsevier; Soil Advances; 2950-2896; 2025-06-01; 3; 100034; Evaluating AI-generated responses from different chatbots to soil science-related questions; Javad Khanifar (Independent Researcher, Shush, Khuzestan, Iran); http://www.sciencedirect.com/science/article/pii/S2950289625000028; Artificial intelligence (AI); Chatbot; ChatGPT; Claude; Gemini; Large language models (LLMs)
title	Evaluating AI-generated responses from different chatbots to soil science-related questions
topic	Artificial intelligence (AI); Chatbot; ChatGPT; Claude; Gemini; Large language models (LLMs)
url	http://www.sciencedirect.com/science/article/pii/S2950289625000028

Similar Items