CausMatch: Causal Matching Learning With Counterfactual Preference Framework for Cross-Modal Retrieval

Bibliographic Details
Main Authors: Chen Chen, Dan Wang
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/10843200/
Description
Summary: Cross-modal retrieval shows significant promise for multimedia analysis, and many sophisticated techniques harness attention mechanisms to establish cross-modal correspondence in matching tasks. However, most existing methods learn cross-modal attention from conventional likelihood objectives, so the fine-grained matching between regions and words contains numerous invalid local relationships and false global connections, which degrades alignment. In contrast, this paper proposes Causal Matching Learning (CausMatch), a counterfactual preference framework for cross-modal retrieval. The work seeks to ascertain the matching relation by incorporating a counterfactual causality preference; improving the quality of attention and providing a robust supervisory signal for the learning process are its pivotal objectives. Specifically, the study employs counterfactual intervention to examine how the learned visual and textual attention influences the network's predictions. Maximizing this positive influence incentivizes the network to learn the relationships and connections most conducive to fine-grained cross-modal retrieval. The effectiveness of the proposed CausMatch model is systematically substantiated through comprehensive experiments on two widely recognized benchmark datasets, MS-COCO and Flickr30K, where it outperforms existing state-of-the-art methods and demonstrates robust performance in cross-modal retrieval tasks.
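The counterfactual intervention described in the summary can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the authors' released code: the function names (attended_similarity, counterfactual_effect), the feature shapes, and the choice of a uniform distribution as the counterfactual attention are placeholders. It estimates the effect of the learned attention by comparing the matching score it produces against the score produced by an uninformative attention; negating that effect gives an auxiliary loss term that rewards only attention which genuinely improves matching.

import torch
import torch.nn.functional as F

def attended_similarity(region_feats, word_feats, attention):
    # Aggregate region features with the given attention weights, then score
    # the aggregated visual context against each word embedding.
    context = attention @ region_feats                      # (n_words, dim)
    return F.cosine_similarity(context, word_feats, dim=-1).mean()

def counterfactual_effect(region_feats, word_feats, attention):
    # Factual score: matching similarity computed with the learned attention.
    factual = attended_similarity(region_feats, word_feats, attention)
    # Counterfactual score: the learned attention is replaced by a uniform
    # distribution over regions (one possible choice of intervention).
    uniform = torch.full_like(attention, 1.0 / attention.size(-1))
    counterfactual = attended_similarity(region_feats, word_feats, uniform)
    # The difference estimates the causal effect of the learned attention;
    # maximizing it encourages attention that truly helps the matching.
    return factual - counterfactual

# Hypothetical usage with random features (36 regions, 12 words, 256-d):
regions = torch.randn(36, 256)
words = torch.randn(12, 256)
attn = torch.softmax(words @ regions.t(), dim=-1)           # (12, 36)
loss = -counterfactual_effect(regions, words, attn)         # maximize effect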
ISSN: 2169-3536