Loss-Aware Histogram Binning and Principal Component Analysis for Customer Fleet Analytics

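We propose a method to estimate information loss when conducting histogram binning and principal component analysis (PCA) sequentially, as is usually done in practice for fleet analytics. Coarser-grained histogram binning results in less data volume and fewer dimensions, but more information loss. Considering fewer principal components (PCs) results in fewer data dimensions but increased information loss. Although the information loss of each step is well understood, little guidance exists on the overall information loss when conducting both steps sequentially. We use Monte Carlo simulations to regress information loss on the number of bins and PCs, given a few parameters of a dataset related to its scale and correlation structure. A sensitivity study shows that information loss can be approximated well given sufficiently large datasets. Using the number of bins, the number of PCs, and two correlation measures, we derive an empirical loss model with high accuracy. Furthermore, we demonstrate the benefits of estimating information losses and the representativeness of total loss in evaluating the accuracy of k-means clustering for a real-world customer fleet dataset. For preprocessing sensor data that are aggregated from a sufficient number of samples, continuously distributed, and representable by Beta distributions, we recommend not coarsening the histogram binning before PCA.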
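
The sequential pipeline named in the abstract (histogram binning followed by PCA) can be illustrated with a minimal sketch. This is not the authors' loss model: the record does not state how the paper defines information loss, so the sketch assumes a relative squared reconstruction error measured against the finest histogram. The synthetic Beta-distributed fleet, the parameter ranges, and all function names (`fine_histograms`, `coarsen`, `refine`, `total_loss`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def fine_histograms(n_vehicles=200, n_samples=500, n_bins=32, seed=0):
    """Simulate one normalized fine-grained histogram per vehicle.

    Assumption: each vehicle's sensor signal follows its own Beta
    distribution on [0, 1]; the shape ranges are arbitrary.
    """
    rng = np.random.default_rng(seed)
    a = rng.uniform(1.0, 5.0, size=n_vehicles)
    b = rng.uniform(1.0, 5.0, size=n_vehicles)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    hists = np.empty((n_vehicles, n_bins))
    for i in range(n_vehicles):
        samples = rng.beta(a[i], b[i], size=n_samples)
        counts, _ = np.histogram(samples, bins=edges)
        hists[i] = counts / n_samples  # relative frequencies, each row sums to 1
    return hists


def coarsen(hists, factor):
    """Merge every `factor` adjacent bins by summing them."""
    n, d = hists.shape
    if d % factor != 0:
        raise ValueError("number of bins must be divisible by factor")
    return hists.reshape(n, d // factor, factor).sum(axis=2)


def refine(coarse, factor):
    """Map coarse bins back to fine bins by spreading each value uniformly."""
    return np.repeat(coarse / factor, factor, axis=1)


def total_loss(hists, factor, k):
    """Assumed loss measure: share of the fine-histogram variance that is
    not recovered after coarsening by `factor` and keeping `k` PCs."""
    centered = hists - hists.mean(axis=0)   # deviations from the fleet average
    coarse = coarsen(centered, factor)      # still column-centered (linear map)
    _, _, vt = np.linalg.svd(coarse, full_matrices=False)
    comps = vt[:k]                          # top-k principal directions
    coarse_hat = coarse @ comps.T @ comps   # rank-k PCA reconstruction
    approx = refine(coarse_hat, factor)
    return float(np.sum((centered - approx) ** 2) / np.sum(centered ** 2))


if __name__ == "__main__":
    fleet = fine_histograms()
    for factor in (1, 2, 4, 8):
        for k in (1, 2, 4):
            print(f"merge factor {factor}, {k} PCs: loss = {total_loss(fleet, factor, k):.4f}")
```

Running the script prints the assumed loss over a small grid of merge factors and numbers of PCs. A Monte Carlo study in the spirit of the abstract would repeat this over many simulated datasets and fit a regression to the resulting losses; that step is omitted here.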

Bibliographic Details
Main Authors:	Kunxiong Ling, Jan Thiele, Thomas Setzer
Format:	Article
Language:	English
Published:	IEEE 2024-01-01
Series:	IEEE Open Journal of Intelligent Transportation Systems
Subjects:	Fleet analytics; histogram; information loss; Monte Carlo; principal component analysis
Online Access:	https://ieeexplore.ieee.org/document/10437985/

_version_	1832590338127036416
author	Kunxiong Ling, Jan Thiele, Thomas Setzer
author_facet	Kunxiong Ling, Jan Thiele, Thomas Setzer
author_sort	Kunxiong Ling
collection	DOAJ
description	We propose a method to estimate information loss when conducting histogram binning and principal component analysis (PCA) sequentially, as is usually done in practice for fleet analytics. Coarser-grained histogram binning results in less data volume and fewer dimensions, but more information loss. Considering fewer principal components (PCs) results in fewer data dimensions but increased information loss. Although the information loss of each step is well understood, little guidance exists on the overall information loss when conducting both steps sequentially. We use Monte Carlo simulations to regress information loss on the number of bins and PCs, given a few parameters of a dataset related to its scale and correlation structure. A sensitivity study shows that information loss can be approximated well given sufficiently large datasets. Using the number of bins, the number of PCs, and two correlation measures, we derive an empirical loss model with high accuracy. Furthermore, we demonstrate the benefits of estimating information losses and the representativeness of total loss in evaluating the accuracy of k-means clustering for a real-world customer fleet dataset. For preprocessing sensor data that are aggregated from a sufficient number of samples, continuously distributed, and representable by Beta distributions, we recommend not coarsening the histogram binning before PCA.
format	Article
id	doaj-art-7ef471634399466284f827cd1b6be6bf
institution	Kabale University
issn	2687-7813
language	English
publishDate	2024-01-01
publisher	IEEE
record_format	Article
series	IEEE Open Journal of Intelligent Transportation Systems
spelling	doaj-art-7ef471634399466284f827cd1b6be6bf; 2025-01-24T00:02:35Z; eng; IEEE; IEEE Open Journal of Intelligent Transportation Systems; ISSN 2687-7813; 2024-01-01; vol. 5, pp. 160-173; DOI 10.1109/OJITS.2024.3366279; IEEE document 10437985; Loss-Aware Histogram Binning and Principal Component Analysis for Customer Fleet Analytics; Kunxiong Ling (https://orcid.org/0000-0003-0203-301X), Research and Innovation Center, BMW Group, Munich, Germany; Jan Thiele, Research and Innovation Center, BMW Group, Munich, Germany; Thomas Setzer (https://orcid.org/0000-0002-9241-2648), Ingolstadt School of Management, Catholic University of Eichstätt-Ingolstadt, Ingolstadt, Germany; https://ieeexplore.ieee.org/document/10437985/; Fleet analytics; histogram; information loss; Monte Carlo; principal component analysis
title	Loss-Aware Histogram Binning and Principal Component Analysis for Customer Fleet Analytics
topic	Fleet analytics; histogram; information loss; Monte Carlo; principal component analysis
url	https://ieeexplore.ieee.org/document/10437985/
