Loss-Aware Histogram Binning and Principal Component Analysis for Customer Fleet Analytics

We propose a method to estimate information loss when conducting histogram binning and principal component analysis (PCA) sequentially, as usually done in practice for fleet analytics. Coarser-grained histogram binning results in less data volume, fewer dimensions, but more information loss. Conside...

Bibliographic Details
Main Authors: Kunxiong Ling, Jan Thiele, Thomas Setzer
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Open Journal of Intelligent Transportation Systems
Subjects: Fleet analytics, histogram, information loss, Monte Carlo, principal component analysis
Online Access: https://ieeexplore.ieee.org/document/10437985/
collection DOAJ
description We propose a method to estimate information loss when histogram binning and principal component analysis (PCA) are conducted sequentially, as is usual in practice for fleet analytics. Coarser-grained histogram binning yields less data volume and fewer dimensions but more information loss. Retaining fewer principal components (PCs) likewise yields fewer data dimensions but more information loss. Although the information loss of each step is well understood, little guidance exists on the overall information loss when both steps are conducted sequentially. We use Monte Carlo simulations to regress information loss on the number of bins and PCs, given a few parameters of a dataset related to its scale and correlation structure. A sensitivity study shows that information loss can be approximated well given sufficiently large datasets. Using the number of bins, the number of PCs, and two correlation measures, we derive an empirical loss model with high accuracy. Furthermore, we demonstrate the benefits of estimating information losses and the representativeness of total loss in evaluating the accuracy of k-means clustering for a real-world customer fleet dataset. For preprocessing sensor data that are aggregated from a sufficient number of samples, are continuously distributed, and can be represented by Beta distributions, we recommend not coarsening the histogram binning before PCA.
format Article
id doaj-art-7ef471634399466284f827cd1b6be6bf
institution Kabale University
issn 2687-7813
language English
publishDate 2024-01-01
publisher IEEE
record_format Article
series IEEE Open Journal of Intelligent Transportation Systems
spelling doaj-art-7ef471634399466284f827cd1b6be6bf
2025-01-24T00:02:35Z
eng
IEEE
IEEE Open Journal of Intelligent Transportation Systems
2687-7813
2024-01-01
Volume 5, pages 160-173
10.1109/OJITS.2024.3366279
10437985
Loss-Aware Histogram Binning and Principal Component Analysis for Customer Fleet Analytics
Kunxiong Ling (https://orcid.org/0000-0003-0203-301X), Research and Innovation Center, BMW Group, Munich, Germany
Jan Thiele, Research and Innovation Center, BMW Group, Munich, Germany
Thomas Setzer (https://orcid.org/0000-0002-9241-2648), Ingolstadt School of Management, Catholic University of Eichstätt-Ingolstadt, Ingolstadt, Germany
https://ieeexplore.ieee.org/document/10437985/
Fleet analytics
histogram
information loss
Monte Carlo
principal component analysis
topic Fleet analytics
histogram
information loss
Monte Carlo
principal component analysis