Loss-Aware Histogram Binning and Principal Component Analysis for Customer Fleet Analytics
We propose a method to estimate information loss when conducting histogram binning and principal component analysis (PCA) sequentially, as is commonly done in practice for fleet analytics. Coarser-grained histogram binning results in less data volume and fewer dimensions, but more information loss. Conside...
Main Authors: | Kunxiong Ling, Jan Thiele, Thomas Setzer |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2024-01-01 |
Series: | IEEE Open Journal of Intelligent Transportation Systems |
Subjects: | Fleet analytics; histogram; information loss; Monte Carlo; principal component analysis |
Online Access: | https://ieeexplore.ieee.org/document/10437985/ |
_version_ | 1832590338127036416 |
---|---|
author | Kunxiong Ling, Jan Thiele, Thomas Setzer |
author_facet | Kunxiong Ling Jan Thiele Thomas Setzer |
author_sort | Kunxiong Ling |
collection | DOAJ |
description | We propose a method to estimate information loss when conducting histogram binning and principal component analysis (PCA) sequentially, as is commonly done in practice for fleet analytics. Coarser-grained histogram binning results in less data volume and fewer dimensions, but more information loss. Retaining fewer principal components (PCs) likewise results in fewer data dimensions but increased information loss. Although the information loss of each step is well understood, little guidance exists on the overall information loss when both steps are conducted sequentially. We use Monte Carlo simulations to regress information loss on the number of bins and PCs, given a few parameters of a dataset related to its scale and correlation structure. A sensitivity study shows that information loss can be approximated well given sufficiently large datasets. Using the number of bins, the number of PCs, and two correlation measures, we derive an empirical loss model with high accuracy. Furthermore, we demonstrate the benefits of estimating information losses and the representativeness of total loss in evaluating the accuracy of k-means clustering for a real-world customer fleet dataset. For preprocessing sensor data that are aggregated from a sufficient number of samples, are continuously distributed, and can be represented by Beta distributions, we recommend not coarsening the histogram binning before PCA. |
format | Article |
id | doaj-art-7ef471634399466284f827cd1b6be6bf |
institution | Kabale University |
issn | 2687-7813 |
language | English |
publishDate | 2024-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Open Journal of Intelligent Transportation Systems |
spelling | doaj-art-7ef471634399466284f827cd1b6be6bf; 2025-01-24T00:02:35Z; eng; IEEE; IEEE Open Journal of Intelligent Transportation Systems; 2687-7813; 2024-01-01; vol. 5, pp. 160-173; doi:10.1109/OJITS.2024.3366279; 10437985; Loss-Aware Histogram Binning and Principal Component Analysis for Customer Fleet Analytics; Kunxiong Ling (https://orcid.org/0000-0003-0203-301X), Research and Innovation Center, BMW Group, Munich, Germany; Jan Thiele, Research and Innovation Center, BMW Group, Munich, Germany; Thomas Setzer (https://orcid.org/0000-0002-9241-2648), Ingolstadt School of Management, Catholic University of Eichstätt-Ingolstadt, Ingolstadt, Germany; https://ieeexplore.ieee.org/document/10437985/; Fleet analytics; histogram; information loss; Monte Carlo; principal component analysis |
title | Loss-Aware Histogram Binning and Principal Component Analysis for Customer Fleet Analytics |
topic | Fleet analytics; histogram; information loss; Monte Carlo; principal component analysis |
url | https://ieeexplore.ieee.org/document/10437985/ |
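The description above outlines a two-step pipeline: histogram binning followed by PCA, with information lost at each step. The following is a minimal Python sketch of that pipeline on synthetic Beta-distributed data, not the authors' implementation. This record does not state how the article defines information loss, so the relative squared reconstruction error used here is an assumption, as are all function names, parameter ranges, and the choice to spread each coarse bin uniformly over its fine bins when reconstructing.

```python
import numpy as np


def simulate_histograms(n_vehicles=200, n_samples=2000, n_bins=32, seed=0):
    """Build one normalized histogram per vehicle from Beta-distributed samples.

    The Beta parameter ranges are arbitrary choices for illustration.
    """
    rng = np.random.default_rng(seed)
    a = rng.uniform(1.0, 5.0, size=n_vehicles)
    b = rng.uniform(1.0, 5.0, size=n_vehicles)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    hists = np.empty((n_vehicles, n_bins))
    for i in range(n_vehicles):
        samples = rng.beta(a[i], b[i], size=n_samples)
        counts, _ = np.histogram(samples, bins=edges)
        hists[i] = counts / n_samples
    return hists


def coarsen(hists, factor):
    """Merge every `factor` adjacent bins into one coarser bin."""
    n, m = hists.shape
    if m % factor != 0:
        raise ValueError("number of bins must be divisible by factor")
    return hists.reshape(n, m // factor, factor).sum(axis=2)


def pca_reconstruct(X, n_pcs):
    """Project X onto its first n_pcs principal components and map back."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean + (U[:, :n_pcs] * s[:n_pcs]) @ Vt[:n_pcs]


def total_loss(fine, factor, n_pcs):
    """Assumed loss proxy: squared error of the fine-grained histograms
    reconstructed from coarse bins and n_pcs PCs, relative to the total
    variance of the fine-grained histograms."""
    coarse = coarsen(fine, factor)
    approx_coarse = pca_reconstruct(coarse, n_pcs)
    # Spread each coarse bin evenly over the fine bins it replaced.
    approx_fine = np.repeat(approx_coarse / factor, factor, axis=1)
    centered = fine - fine.mean(axis=0)
    return np.sum((fine - approx_fine) ** 2) / np.sum(centered ** 2)


if __name__ == "__main__":
    fine = simulate_histograms()
    for factor in (1, 2, 4, 8):
        for n_pcs in (1, 2, 4):
            loss = total_loss(fine, factor, n_pcs)
            print(f"bins={32 // factor:2d}  PCs={n_pcs}  loss={loss:.4f}")
```

Sweeping the bin count and the number of PCs in this way gives a grid of loss values of the kind a regression model could be fitted to. The values themselves depend entirely on the assumed loss proxy and synthetic data, and should not be read as results from the article.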