Evaluating AI-generated responses from different chatbots to soil science-related questions

Bibliographic Details
Main Author:	Javad Khanifar
Format:	Article
Language:	English
Published:	Elsevier 2025-06-01
Series:	Soil Advances
Subjects:	Artificial intelligence (AI) Chatbot ChatGPT Claude Gemini Large language models (LLMs)
Online Access:	http://www.sciencedirect.com/science/article/pii/S2950289625000028

Description
Summary:	The emergence of chatbots powered by large language models (LLMs), capable of providing human-like responses to various inquiries, has revolutionized fields like education and research, making artificial intelligence (AI) a major global topic. This study aimed to evaluate the performance of the most recent LLMs—Claude 3.5 Sonnet, GPT-4o, GPT-4o mini, Gemini 1.5 Pro, and Gemini 1.5 Flash—in yielding correct answers to questions related to soil science, a fundamental discipline in agricultural, natural resources, and environmental sciences. For this purpose, 105 specialized multiple-choice questions covering all domains of soil science were selected from the Iranian national PhD entrance exam in soil science. The GPT-4o-based chatbot, also known as ChatGPT, was employed to translate questions from Persian into English to assess the impact of input language on its performance. The LLMs were compared using Cohen's Kappa coefficient and the Chi-Square test. The study results indicated that the overall performance of chatbots powered by Claude 3.5 Sonnet and GPT-4o was comparable, as both models correctly answered 64.80 % of the questions. Nevertheless, these chatbots had significantly higher answering accuracy than Gemini 1.5 Pro and cost-efficient LLMs, namely Gemini 1.5 Flash and GPT-4o mini (p < 0.05). This finding suggests that soil science questions could be categorized as complex tasks for chatbots. The GPT-4o model's performance in answering questions was not significantly dependent on the language used (p > 0.05), revealing that input language is not a limiting factor when applying ChatGPT to soil science. Overall, AI chatbots can, at best, achieve slightly above-moderate performance in answering soil science questions. The study highlights the importance of soil scientists' knowledge and experience in integrating AI chatbots into soil science research and education.
ISSN:	2950-2896
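
Illustration (not from the article): the summary states that the LLMs were compared using Cohen's Kappa coefficient and the Chi-Square test, but this record does not say how the agreement scores or contingency tables were constructed. The sketch below shows one plausible setup, assuming each model's answers are scored as correct (1) or incorrect (0) and that two models are compared on a 2x2 table of correct/incorrect counts without continuity correction. The function names and all numbers are hypothetical and are not the study's data.

```python
import math
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both sequences match.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from each sequence's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (observed - expected) / (1.0 - expected)


def chi_square_2x2(correct_1, n_1, correct_2, n_2):
    """Pearson chi-square test on a 2x2 correct/incorrect table.

    Returns (statistic, p_value) with 1 degree of freedom and no
    continuity correction (an assumption of this sketch).
    """
    table = [[correct_1, n_1 - correct_1], [correct_2, n_2 - correct_2]]
    rows = [n_1, n_2]
    cols = [correct_1 + correct_2, (n_1 - correct_1) + (n_2 - correct_2)]
    total = n_1 + n_2
    if min(rows) <= 0 or min(cols) <= 0:
        raise ValueError("every row and column total must be positive")
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            statistic += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical per-question scores (1 = correct, 0 = incorrect).
model_a = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
model_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
kappa = cohens_kappa(model_a, model_b)

# Hypothetical totals: 30 of 40 correct versus 20 of 40 correct.
statistic, p_value = chi_square_2x2(30, 40, 20, 40)
print(f"kappa={kappa:.2f}, chi2={statistic:.2f}, p={p_value:.3f}")
```

Because the same questions are posed to every model, a paired test could also be argued for; the independence layout is used here only because the summary names the Chi-Square test.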
