Evaluating AI-generated responses from different chatbots to soil science-related questions
Main Author: | |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier 2025-06-01 |
Series: | Soil Advances |
Subjects: | |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2950289625000028 |
Summary: | The emergence of chatbots powered by large language models (LLMs), capable of providing human-like responses to various inquiries, has revolutionized fields like education and research, making artificial intelligence (AI) a major global topic. This study aimed to evaluate the performance of the most recent LLMs—Claude 3.5 Sonnet, GPT-4o, GPT-4o mini, Gemini 1.5 Pro, and Gemini 1.5 Flash—in yielding correct answers to questions related to soil science, a fundamental discipline in agricultural, natural resources, and environmental sciences. For this purpose, 105 specialized multiple-choice questions covering all domains of soil science were selected from the Iranian national PhD entrance exam in soil science. The GPT-4o-based chatbot, also known as ChatGPT, was employed to translate questions from Persian into English to assess the impact of input language on its performance. The LLMs were compared using Cohen's Kappa coefficient and the Chi-Square test. The study results indicated that the overall performance of chatbots powered by Claude 3.5 Sonnet and GPT-4o was comparable, as both models correctly answered 64.80 % of the questions. Nevertheless, these chatbots had significantly higher answering accuracy than Gemini 1.5 Pro and cost-efficient LLMs, namely Gemini 1.5 Flash and GPT-4o mini (p < 0.05). This finding suggests that soil science questions could be categorized as complex tasks for chatbots. The GPT-4o model's performance in answering questions was not significantly dependent on the language used (p > 0.05), revealing that input language is not a limiting factor when applying ChatGPT to soil science. Overall, AI chatbots can, at best, achieve slightly above-moderate performance in answering soil science questions. The study highlights the importance of soil scientists' knowledge and experience in integrating AI chatbots into soil science research and education. |
ISSN: | 2950-2896 |
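
Note on the statistics named in the summary: the abstract states that the LLMs were compared using Cohen's Kappa coefficient and the Chi-Square test, but this record does not describe how those statistics were computed. The following is a minimal sketch of both measures, assuming each model's answers are scored as correct (1) or incorrect (0) per question. The function names and the toy inputs are hypothetical illustrations, not data or code from the article.

```python
from math import erfc, sqrt


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: share of items on which both raters give the same label.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's marginal label frequencies.
    p_e = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column total must be non-zero")
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 df.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


# Hypothetical toy inputs (NOT results from the study).
model_a = [1, 1, 1, 0, 1, 0, 1, 1]   # 1 = correct, 0 = incorrect
model_b = [1, 0, 1, 0, 1, 1, 0, 1]
kappa = cohen_kappa(model_a, model_b)

# Rows: two hypothetical models; columns: (correct, incorrect) counts.
stat, p_value = chi_square_2x2([[10, 20], [20, 10]])
```

Here kappa measures how far two models agree question by question beyond chance, while the chi-square test checks whether their proportions of correct answers differ. A p-value below 0.05 would correspond to the significance threshold quoted in the summary.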