ϵ-Confidence Approximately Correct (ϵ-CoAC) Learnability and Hyperparameter Selection in Linear Regression Modeling

Bibliographic Details
Main Authors: Soosan Beheshti, Mahdi Shamsi
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10840229/
Description
Summary:In a data-based learning process, the training data set is utilized to provide a hypothesis that generalizes to all data points from a domain set. The hypothesis is chosen from classes with potentially different complexities. Linear regression modeling is an important category of learning algorithms. The practical uncertainty of the label samples in the training data set has a major effect on the generalization ability of the learned model. Failing to choose a proper model or hypothesis class can lead to serious issues such as underfitting or overfitting. These issues have mostly been addressed by altering the modeling cost function or by utilizing cross-validation methods. Drawbacks of these methods include introducing new hyperparameters with their own challenges and uncertainties, a potential increase in computational complexity, or the requirement of large training data sets. On the other hand, the theory of probably approximately correct (PAC) learning aims at defining learnability in a probabilistic setting. Despite its theoretical value, PAC bounds cannot be utilized in practical regression learning applications where only the training data set is available. This work is motivated by practical issues in regression learning generalization and is inspired by the foundations of the theory of statistical learning. The proposed approach, denoted $\epsilon$-Confidence Approximately Correct ($\epsilon$-CoAC), utilizes the conventional Kullback-Leibler divergence (relative entropy) and defines new related typical sets to develop a unique method of probabilistic statistical learning for practical regression learning and generalization. $\epsilon$-CoAC learnability is able to validate the learning process as a function of the training data sample size as well as of the hypothesis class complexity order. Consequently, it enables the learner to automatically compare hypothesis classes of different complexity orders and to choose among them the optimum class with the minimum $\epsilon$ in the $\epsilon$-CoAC framework. $\epsilon$-CoAC learnability overcomes the issues of overfitting and underfitting. In addition, it shows advantages over the well-known cross-validation method in terms of accuracy and the data length required for convergence. Simulation results, for both synthetic and real data, confirm not only the strength and capability of $\epsilon$-CoAC in providing learning measurements as a function of data length and/or hypothesis complexity, but also the superiority of the method over existing approaches in hypothesis complexity and model selection.
ISSN:2169-3536
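
The abstract does not give the $\epsilon$-CoAC formula itself, so the following is only a minimal sketch of the model-comparison setting it describes: candidate polynomial hypothesis classes of increasing complexity order are fitted to a synthetic training set and ranked here by the k-fold cross-validation baseline that the paper compares against. The data-generating model, fold count, and helper names are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch (not the paper's epsilon-CoAC criterion): comparing
# linear-regression hypothesis classes of increasing complexity order using
# the k-fold cross-validation baseline mentioned in the abstract.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: cubic signal plus Gaussian label noise (assumed).
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
y = 1.5 * x - 2.0 * x**3 + rng.normal(scale=0.3, size=n)

def design(x, order):
    """Polynomial feature matrix [1, x, ..., x^order] as the hypothesis class."""
    return np.vander(x, N=order + 1, increasing=True)

def cv_mse(x, y, order, k=5):
    """Mean squared error on held-out folds for a given complexity order."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        A_tr, A_te = design(x[train], order), design(x[fold], order)
        w, *_ = np.linalg.lstsq(A_tr, y[train], rcond=None)
        errs.append(np.mean((y[fold] - A_te @ w) ** 2))
    return float(np.mean(errs))

# Sweep hypothesis classes of complexity order 1..10 and pick the best by CV.
scores = {m: cv_mse(x, y, m) for m in range(1, 11)}
best = min(scores, key=scores.get)
print("CV-selected complexity order:", best)

# Per the abstract, epsilon-CoAC would instead rank these classes with a
# Kullback-Leibler-divergence-based confidence measure computed from the
# training set alone, avoiding the data splitting and the data-length
# requirements of cross-validation.
```

This sketch only reproduces the baseline workflow; the contribution of the paper is replacing the held-out error above with the $\epsilon$-CoAC learnability measure, which the abstract characterizes as a function of both data length and hypothesis class complexity.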