Improving wheat yield prediction through variable selection using Support Vector Regression, Random Forest, and Extreme Gradient Boosting

Bibliographic Details
Main Authors: Juan Carlos Moreno Sánchez, Héctor Gabriel Acosta Mesa, Adrián Trueba Espinosa, Sergio Ruiz Castilla, Farid García Lamont
Format: Article
Language: English
Published: Elsevier 2025-03-01
Series: Smart Agricultural Technology
Online Access: http://www.sciencedirect.com/science/article/pii/S2772375525000255
Description
Summary: Plant breeding centers, in their pursuit of more productive and resilient wheat varieties, have generated vast data repositories that are fundamental to ensuring global food security. This study uses these data to develop a wheat grain yield (GY) prediction model with three machine learning techniques: Random Forest (RF), Support Vector Regression (SVR), and Extreme Gradient Boosting (XGBoost). The results demonstrate the potential of RF- and XGBoost-based models to predict wheat yield accurately. A major challenge of this research was identifying the most relevant variables for predicting wheat yield. Clustering, feature selection, and variable combination techniques showed that agronomic variables, particularly harvest index (HI) and biomass (BM), provide information complementary to the Normalized Difference Vegetation Index (NDVI). This combination, analyzed with the XGBoost model, achieved an RMSE of 28.5082 g/m² and an R² of 0.9156, which indicates that these indicators complement one another. Further analysis found that daily clustering and filtering of climatic variables, especially precipitation rate, were favorable in models of this type.
ISSN: 2772-3755
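
Note on the reported metrics: the summary evaluates the models by RMSE (in the units of grain yield, g/m²) and R². The sketch below shows how these two metrics are conventionally computed. It is not the authors' code; the function names and the sample yield values are hypothetical and chosen only for illustration.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error, in the same units as the inputs (e.g. g/m²)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


# Made-up grain yield values in g/m² (assumption: NOT data from the study).
observed_gy = [400.0, 450.0, 500.0, 550.0]
predicted_gy = [410.0, 440.0, 510.0, 540.0]

print("RMSE (g/m²):", rmse(observed_gy, predicted_gy))
print("R²:", r_squared(observed_gy, predicted_gy))
```

A lower RMSE means smaller typical prediction errors in yield units, while an R² closer to 1 means the model explains more of the variance in observed yield.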