Temporal record linkage for heterogeneous big data records
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier 2025-06-01 |
| Series: | Egyptian Informatics Journal |
| Subjects: | |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S1110866525000350 |
| Summary: | Temporal Record Linkage (TRL), also called Temporal Entity Matching (TEM), is the process of identifying records or entities that refer to the same real-world object in different states of its lifetime. TRL is a well-known problem in data engineering contexts such as data analysis, data warehousing, data mining, and machine learning, where entities denoting the same real-world object must be identified over time. Unlike traditional record linkage, which treats differences between records of the same entity as contradictions, temporal record linkage treats such differences as normal entity growth over time. Existing frameworks for temporal record linkage, which are limited to No model, Decay, Disprob, Mixed, and Agreement First Dynamic Second (AFDS), achieve high accuracy but at a high computational cost. They also require the time dimension to be present in order to detect similar entities that refer to the same real-world object. In this research, we present a framework called Tracking Similar Entities in Heterogeneous Temporal Records (TSE-HTR) that tracks similar entities in heterogeneous, big, low-quality, temporal data regardless of whether the time dimension is present. It introduces data cleansing and state ranking modules to detect anomalies within similar entities, determine the final, accurate set of those entities, and explain the anomalies to users or domain experts in a comprehensible manner, which not only offers increased business intelligence but also opens opportunities for improved solutions. It presents to the user the records of the different states of the same real-world object, ranked according to quality measures such as completeness, validity, and accuracy. Performance evaluation of the proposed framework against existing frameworks on real, big data shows a great improvement in both effectiveness and efficiency. |
| ISSN: | 1110-8665 |