PTF-SimCM: A Simple Contrastive Model with Polysemous Text Fusion for Visual Similarity Metric
Image similarity measurement, also known as metric learning (ML) in computer vision, is a key step in many advanced image tasks. However, existing well-performing approaches to image similarity measurement focus on the image alone and ignore other modalities, even though images usually appear together with descriptive text. Furthermore, those methods require human supervision, yet most real-world images are unlabeled. To address these problems, we present a novel visual similarity metric model named PTF-SimCM. It adopts a self-supervised contrastive structure, similar to SimSiam, and incorporates a multimodal fusion module to exploit the textual modality associated with each image. We apply a cross-modal model to the text modality, rather than a standard unimodal text encoder, to improve the effectiveness of late fusion. In addition, the proposed model employs Sentence PIE-Net to handle the ambiguity of polysemous sentences. For simplicity and efficiency, our model learns an embedding space in which distances directly correspond to similarity. Experimental results on the MSCOCO, Flickr 30k, and Pascal Sentence datasets show that our model outperforms all the compared methods overall, indicating that it effectively addresses these issues and improves performance on unsupervised visual similarity measurement.
Main Authors: | Xinpan Yuan, Xinxin Mao, Wei Xia, Zhiqi Zhang, Shaojun Xie, Chengyuan Zhang |
---|---|
Format: | Article |
Language: | English |
Published: | Wiley, 2022-01-01 |
Series: | Complexity |
ISSN: | 1099-0526 |
Online Access: | http://dx.doi.org/10.1155/2022/2343707 |
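The abstract describes a SimSiam-style self-supervised objective computed over fused image-text embeddings. Below is a minimal sketch of that idea in PyTorch, not the authors' implementation: the linear encoder stub, the `fuse` layer, and all dimensions are hypothetical stand-ins, and the actual PTF-SimCM fusion module, cross-modal text encoder, and Sentence PIE-Net component are specified only in the paper itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def neg_cosine(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """SimSiam-style negative cosine similarity with stop-gradient on the target z."""
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

class FusedSimSiam(nn.Module):
    """Hypothetical sketch: two augmented views of an image are encoded, fused
    with a precomputed text embedding, and trained with the symmetric SimSiam loss."""

    def __init__(self, dim: int = 512, pred_dim: int = 128):
        super().__init__()
        # A real model would use a CNN backbone (e.g. a ResNet); a small MLP stub
        # keeps this sketch self-contained and runnable on feature vectors.
        self.encoder = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(inplace=True), nn.Linear(dim, dim)
        )
        # Late fusion: concatenate image and text embeddings, project back to dim.
        self.fuse = nn.Linear(2 * dim, dim)
        # SimSiam prediction head (bottleneck MLP).
        self.predictor = nn.Sequential(
            nn.Linear(dim, pred_dim), nn.ReLU(inplace=True), nn.Linear(pred_dim, dim)
        )

    def embed(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([self.encoder(img), txt], dim=-1))

    def forward(self, view1, view2, txt):
        z1, z2 = self.embed(view1, txt), self.embed(view2, txt)
        p1, p2 = self.predictor(z1), self.predictor(z2)
        # Symmetric loss: each predictor output chases the other (detached) embedding.
        return 0.5 * (neg_cosine(p1, z2) + neg_cosine(p2, z1))

# Usage: after training, distances in the learned embedding space serve directly
# as the similarity metric, e.g. cosine similarity between two fused embeddings.
model = FusedSimSiam()
v1, v2, t = torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512)
loss = model(v1, v2, t)                      # self-supervised training objective
sim = F.cosine_similarity(model.embed(v1, t), model.embed(v2, t), dim=-1)
```

The stop-gradient (`z.detach()`) is the core of the SimSiam design the abstract references: it lets the model avoid representation collapse without negative pairs or labels, which is what makes the approach usable on the unlabeled images the abstract targets.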