Quantitative Assessment of Data Volume Requirements for Reliable Machine Learning Analysis

Bibliographic Details
Main Authors: Xukuan Xu, Jinghou Bi, Michael Moeckel, Hajo Wiemer, Steffen Ihlenfeldt
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11029268/
Description
Summary: Applying machine learning (ML) techniques in the context of limited data remains a challenge of practical importance. Questions on both the sufficiency of a given dataset for ML data analysis and data acquisition planning arise. Both aspects are quantitatively addressed in this work. The first one is treated as part of dataset diagnostics, which includes the analysis of learning curves, the quantification of confidence intervals (CI) during model evaluation and further dataset metrics. Regarding data generation, data sufficiency considerations must cover both model training and testing. Twenty case datasets are diagnosed, and serve as a reference for establishing an empirical rule for data sufficiency estimation. The results indicate that the relevant aspects are the ratio of data volume to the number of input features, the complexity of the involved data correlations and the applied definition of the small data regime. The presented framework allows for an improved pre-assessment of data sufficiency for a typical given dataset and facilitates a reliable implementation of ML in data volume-limited contexts.
ISSN: 2169-3536
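
Illustrative note: the summary refers to confidence intervals during model evaluation and to the ratio of data volume to the number of input features. The sketch below is not taken from the article. It assumes a classification setting and uses the standard Wilson score interval as a stand-in for whatever CI method the authors apply. The function names `wilson_interval` and `samples_per_feature` are hypothetical and chosen here for illustration only.

```python
import math


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a binomial proportion such as test accuracy.

    Assumption: each test prediction is an independent Bernoulli trial.
    z=1.96 corresponds to an approximately 95% two-sided interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= correct <= n:
        raise ValueError("correct must lie between 0 and n")
    p = correct / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


def samples_per_feature(n_samples, n_features):
    """Ratio of data volume to number of input features (hypothetical helper)."""
    if n_features <= 0:
        raise ValueError("n_features must be positive")
    return n_samples / n_features


if __name__ == "__main__":
    # Same observed accuracy (80%), evaluated on test sets of different size.
    for n in (50, 100, 1000):
        lo, hi = wilson_interval(int(0.8 * n), n)
        print(f"n={n:5d}  accuracy CI: [{lo:.3f}, {hi:.3f}]  width={hi - lo:.3f}")
    print("samples per feature:", samples_per_feature(200, 10))
```

The interval narrows as the test set grows, which is one way to see why data sufficiency has to account for testing as well as training.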