Detecting Invalid Associations between Fare Machines and Metro Stations Using Smart Card Data

Bibliographic Details
Main Authors: Pengfei Zhang, Zhenliang Ma, Xiaoxiong Weng
Format: Article
Language: English
Published: Wiley 2021-01-01
Series: Journal of Advanced Transportation
Online Access: http://dx.doi.org/10.1155/2021/5283283
Description
Summary: Data quality is essential for the reliable use of data in analysis and applications. The large volume of automatically collected data inevitably suffers from quality issues, including missing and invalid data. This paper deals with an invalid data problem in the automated fare collection (AFC) database caused by erroneous associations between fare machines and metro stations, e.g., a fare machine located at Station A is wrongly associated with Station B in the AFC database. Such errors can lead to inappropriate fare charges in a distance-based fare system and can bias analyses used in planning and operations practice. We propose a tensor decomposition and isolation forest-based approach to detect and correct the invalidly associated fare machines in the system. The tensor decomposition extracts features of the passenger flows and travel times passing through fare machines. The isolation forest, coupled with a neural network (NN), takes these features as inputs to detect the wrongly associated fare machines and infer their correct stations. Case studies using data from a metro system show that the proposed approach achieves over 90% accuracy in detecting invalid associations when up to 35% of the associations are invalid. The inferred associations are 90% accurate even when the invalid association ratio reaches 40%. The proposed data-driven detection method is useful for checking and fixing data quality in large-scale data management.
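To make the detection step in the summary concrete, the following is a minimal sketch of isolation-forest anomaly scoring applied to per-machine feature vectors. It is not the authors' implementation. It assumes the features have already been extracted (the paper obtains them by tensor decomposition, which is not reproduced here), and the machine IDs, the two feature columns, and all numeric values are hypothetical. The neural-network step that infers the correct station is also omitted. The forest is written with the Python standard library only so the logic is visible; in practice a maintained library implementation would likely be preferable.

```python
import math
import random

def c_factor(n):
    """Expected path length of an unsuccessful binary-search-tree lookup
    over n points; used to normalise isolation depths."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # Euler-Mascheroni approximation
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def build_tree(sample, rng, depth, max_depth):
    """Recursively split the sample on a random feature at a random threshold."""
    if depth >= max_depth or len(sample) <= 1:
        return {"size": len(sample)}
    dim = rng.randrange(len(sample[0]))
    lo = min(row[dim] for row in sample)
    hi = max(row[dim] for row in sample)
    if lo == hi:  # nothing left to split on this feature
        return {"size": len(sample)}
    split = rng.uniform(lo, hi)
    left = [row for row in sample if row[dim] < split]
    right = [row for row in sample if row[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, rng, depth + 1, max_depth),
            "right": build_tree(right, rng, depth + 1, max_depth)}

def path_length(point, node, depth=0):
    """Depth at which the point reaches a leaf, plus the leaf-size correction."""
    if "size" in node:
        return depth + c_factor(node["size"])
    child = node["left"] if point[node["dim"]] < node["split"] else node["right"]
    return path_length(point, child, depth + 1)

def anomaly_scores(features, n_trees=100, sample_size=256, seed=0):
    """Return one score in (0, 1] per row; higher means easier to isolate."""
    if len(features) < 2:
        raise ValueError("need at least two feature vectors")
    rng = random.Random(seed)
    size = min(sample_size, len(features))
    max_depth = math.ceil(math.log2(size))
    trees = [build_tree(rng.sample(features, size), rng, 0, max_depth)
             for _ in range(n_trees)]
    norm = c_factor(size)
    scores = []
    for point in features:
        mean_depth = sum(path_length(point, t) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_depth / norm))
    return scores

# Hypothetical features for 21 fare machines recorded under one station:
# (mean travel time to a reference station in minutes, share of AM-peak entries).
machine_ids = [f"FM-{i:02d}" for i in range(21)]
features = [(10.0 + 0.1 * i, 0.30 + 0.005 * i) for i in range(20)]
features.append((42.0, 0.75))  # FM-20 behaves like a machine at another station

scores = anomaly_scores(features)
suspect = max(range(len(scores)), key=scores.__getitem__)
print("most anomalous machine:", machine_ids[suspect])
```

The sketch ranks machines by anomaly score instead of applying a fixed cut-off, because a suitable threshold depends on how many invalid associations are expected in the data.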
ISSN: 0197-6729
2042-3195