Bridging the gap: multi-granularity representation learning for text-based vehicle retrieval
Abstract: Text-based cross-modal vehicle retrieval has been widely applied in smart city contexts and other scenarios. Its objective is to identify semantically relevant target vehicles in videos from text descriptions, thereby facilitating the analysis of vehicle spatio-temporal trajectories. Current methodologies predominantly employ a two-tower architecture, in which single-granularity features are extracted independently from the visual and textual domains. However, given the intricate semantic relationships between videos and text, aligning the two modalities effectively with a single-granularity feature representation is challenging. To address this, we introduce a Multi-Granularity Representation Learning model, termed MGRL, tailored for text-based cross-modal vehicle retrieval. Specifically, the model parses information from the two modalities into three hierarchical levels of feature representation: coarse, medium, and fine granularity. A feature-adaptive fusion strategy is then devised to automatically determine the optimal pooling mechanism. Finally, a multi-granularity contrastive learning approach is implemented to ensure comprehensive semantic coverage, ranging from coarse to fine levels. Experimental results on public benchmarks show that our method achieves up to a 14.56% improvement in text-to-vehicle retrieval performance, as measured by the Mean Reciprocal Rank (MRR) metric, when compared against 10 state-of-the-art baselines and 6 ablation studies.
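The abstract reports results under the Mean Reciprocal Rank (MRR) metric. A minimal sketch of how MRR is computed for text-to-vehicle retrieval follows; the function name and the candidate/ground-truth ids are illustrative, not taken from the paper.

```python
# Mean Reciprocal Rank (MRR): for each text query, rank the candidate
# vehicles by similarity; MRR is the mean of 1/rank of the correct match.

def mean_reciprocal_rank(ranked_ids, ground_truth):
    """ranked_ids: one ranked list of candidate ids per query.
    ground_truth: the correct id for each query (same order)."""
    total = 0.0
    for ids, gt in zip(ranked_ids, ground_truth):
        if gt in ids:
            total += 1.0 / (ids.index(gt) + 1)  # ranks are 1-based
    return total / len(ground_truth)

# Three queries: correct vehicle ranked 1st, 3rd, and 2nd respectively.
queries = [["v1", "v2", "v3"], ["v5", "v6", "v4"], ["v9", "v7", "v8"]]
truth = ["v1", "v4", "v7"]
print(round(mean_reciprocal_rank(queries, truth), 3))  # (1 + 1/3 + 1/2) / 3 ≈ 0.611
```

A relative MRR improvement, as reported in the abstract, compares this value between two systems on the same query set.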
Main Authors: Xue Bo, Junjie Liu, Di Yang, Wentao Ma
Format: Article
Language: English
Published: Springer, 2024-11-01
Series: Complex & Intelligent Systems
Subjects: Cross-modal; Vehicle retrieval; Multi-granularity; Semantic association
Online Access: https://doi.org/10.1007/s40747-024-01614-w
_version_ | 1832571182046511104 |
author | Xue Bo; Junjie Liu; Di Yang; Wentao Ma |
author_sort | Xue Bo |
collection | DOAJ |
description | Abstract Text-based cross-modal vehicle retrieval has been widely applied in smart city contexts and other scenarios. The objective of this approach is to identify semantically relevant target vehicles in videos using text descriptions, thereby facilitating the analysis of vehicle spatio-temporal trajectories. Current methodologies predominantly employ a two-tower architecture, where single-granularity features from both visual and textual domains are extracted independently. However, due to the intricate semantic relationships between videos and text, aligning the two modalities effectively using single-granularity feature representation poses a challenge. To address this issue, we introduce a Multi-Granularity Representation Learning model, termed MGRL, tailored for text-based cross-modal vehicle retrieval. Specifically, the model parses information from the two modalities into three hierarchical levels of feature representation: coarse-granularity, medium-granularity, and fine-granularity. Subsequently, a feature adaptive fusion strategy is devised to automatically determine the optimal pooling mechanism. Finally, a multi-granularity contrastive learning approach is implemented to ensure comprehensive semantic coverage, ranging from coarse to fine levels. Experimental outcomes on public benchmarks show that our method achieves up to a 14.56% improvement in text-to-vehicle retrieval performance, as measured by the Mean Reciprocal Rank (MRR) metric, when compared against 10 state-of-the-art baselines and 6 ablation studies. |
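The multi-granularity contrastive learning described above can be illustrated with one symmetric InfoNCE-style term per granularity level, summed over the coarse, medium, and fine levels. This is a sketch under assumptions: the InfoNCE formulation, the temperature value, the equal weighting of levels, and all function names are illustrative stand-ins, since the record does not give the paper's exact loss.

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.
    video_emb, text_emb: (B, D) arrays; row i of each is a matching pair."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (B, B); matching pairs on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))        # cross-entropy toward the diagonal

    return (xent(logits) + xent(logits.T)) / 2  # video-to-text + text-to-video

def multi_granularity_loss(video_levels, text_levels):
    """One contrastive term per granularity level (coarse, medium, fine),
    summed with equal weight -- the weighting is an assumption."""
    return sum(info_nce(v, t) for v, t in zip(video_levels, text_levels))

# Toy usage: batch of 4 pairs, 3 granularity levels, 256-dim embeddings.
rng = np.random.default_rng(0)
video = [rng.standard_normal((4, 256)) for _ in range(3)]
text = [rng.standard_normal((4, 256)) for _ in range(3)]
loss = multi_granularity_loss(video, text)
```

Minimizing this loss pulls each video's embedding toward its paired description and away from the other descriptions in the batch, at every granularity level simultaneously.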
format | Article |
id | doaj-art-8e6967a0f2a14320a2d8f653b54351c1 |
institution | Kabale University |
issn | 2199-4536 2198-6053 |
language | English |
publishDate | 2024-11-01 |
publisher | Springer |
record_format | Article |
series | Complex & Intelligent Systems |
spelling | Complex & Intelligent Systems (Springer), 2024-11-01, doi:10.1007/s40747-024-01614-w. Authors and affiliations: Xue Bo (Jilin Provincial Institute of Education); Junjie Liu (Changchun University of Science and Technology); Di Yang (Changchun University of Science and Technology); Wentao Ma (Anhui Agricultural University). |
title | Bridging the gap: multi-granularity representation learning for text-based vehicle retrieval |
topic | Cross-modal; Vehicle retrieval; Multi-granularity; Semantic association |
url | https://doi.org/10.1007/s40747-024-01614-w |