Random Forests in Count Data Modelling: An Analysis of the Influence of Data Features and Overdispersion on Regression Performance

Machine learning algorithms, especially random forests (RFs), have become an integral part of modern scientific methodology and represent an efficient alternative to conventional parametric algorithms. This study aimed to assess the influence of data features and overdispersion on RF regressio...

Bibliographic Details
Main Authors: Ciza Arsène Mushagalusa, Adandé Belarmain Fandohan, Romain Glèlè Kakaï
Format: Article
Language: English
Published: Wiley 2022-01-01
Series: Journal of Probability and Statistics
Online Access: http://dx.doi.org/10.1155/2022/2833537
author Ciza Arsène Mushagalusa
Adandé Belarmain Fandohan
Romain Glèlè Kakaï
collection DOAJ
description Machine learning algorithms, especially random forests (RFs), have become an integral part of modern scientific methodology and represent an efficient alternative to conventional parametric algorithms. This study aimed to assess the influence of data features and overdispersion on RF regression performance. We assessed the effect of types of predictors (100, 75, 50, and 20% continuous, and 100% categorical), the number of predictors (p = 8, 16, and 24), and the sample size (N = 50, 250, and 1250) on RF parameter settings. We also compared RF performance to that of classical generalized linear models (Poisson, negative binomial, and zero-inflated Poisson) and the linear model applied to log-transformed data. Two real datasets were analysed to demonstrate the usefulness of RF for overdispersed data modelling. Goodness-of-fit statistics such as root mean square error (RMSE) and biases were used to determine RF accuracy and validity. Results revealed that the number of variables to be randomly selected for each split, the proportion of samples to train the model, the minimal number of samples within each terminal node, and RF regression performance are not influenced by the sample size, number, and type of predictors. However, the ratio of observations to the number of predictors affects the stability of the best RF parameters. RF performs well for all types of covariates and different levels of dispersion. The magnitude of dispersion does not significantly influence RF predictive validity. In contrast, its predictive accuracy is significantly influenced by the magnitude of dispersion in the response variable, conditional on the explanatory variables. RF performed almost as well as the models of the classical Poisson family in the presence of overdispersion. Given RF’s advantages, it is an appropriate statistical alternative for count data.
format Article
id doaj-art-d91b1f1616f14261b26a03e6765e81b6
institution Kabale University
issn 1687-9538
language English
publishDate 2022-01-01
publisher Wiley
record_format Article
series Journal of Probability and Statistics
spelling doaj-art-d91b1f1616f14261b26a03e6765e81b6; 2025-02-03T01:02:23Z; eng; Wiley; Journal of Probability and Statistics; 1687-9538; 2022-01-01; 2022; 10.1155/2022/2833537; Random Forests in Count Data Modelling: An Analysis of the Influence of Data Features and Overdispersion on Regression Performance; Ciza Arsène Mushagalusa (0); Adandé Belarmain Fandohan (1); Romain Glèlè Kakaï (2); (0) Laboratoire de Biomathématiques et d’Estimations Forestières; (1) Laboratoire de Biomathématiques et d’Estimations Forestières; (2) Laboratoire de Biomathématiques et d’Estimations Forestières; http://dx.doi.org/10.1155/2022/2833537
title Random Forests in Count Data Modelling: An Analysis of the Influence of Data Features and Overdispersion on Regression Performance
url http://dx.doi.org/10.1155/2022/2833537