Noise Improves Multimodal Machine Translation: Rethinking the Role of Visual Context

Bibliographic Details
Main Authors: Xinyu Ma, Jun Rao, Xuebo Liu
Format: Article
Language: English
Published: MDPI AG, 2025-06-01
Series: Mathematics
Online Access: https://www.mdpi.com/2227-7390/13/11/1874
Description
Summary: Multimodal Machine Translation (MMT) has long been assumed to outperform traditional text-only MT by leveraging visual information. However, recent studies challenge this assumption, showing that MMT models perform similarly even when tested without images or with mismatched images. This raises fundamental questions about the actual utility of visual information in MMT, which this work aims to investigate. We first revisit commonly used image-must and image-free MMT approaches, identifying that suboptimal performance may stem from insufficiently robust baseline models. To further examine the role of visual information, we propose a novel visual type regularization method and introduce two probing tasks—Visual Contribution Probing and Modality Relationship Probing—to analyze whether and how visual features influence a strong MMT model. Surprisingly, our findings on a mainstream dataset indicate that the gains from visual information are marginal. We attribute this improvement primarily to a regularization effect, which can be replicated using random noise. Our results suggest that the MMT community should critically re-evaluate baseline models, evaluation metrics, and dataset design to advance multimodal learning meaningfully.
ISSN: 2227-7390
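The abstract's central experiment — swapping real image features for random noise and observing similar gains — can be sketched as follows. This is a minimal illustrative toy, not the authors' architecture: the additive fusion function, the dimensions, and all variable names here are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(text_states, visual_feat, w_proj, alpha=0.1):
    """Toy additive fusion: project a 'visual' vector into the text
    representation space and add it (scaled) to every token state.
    Purely illustrative; real MMT models use learned gating/attention."""
    v = visual_feat @ w_proj              # (vis_dim,) @ (vis_dim, text_dim)
    return text_states + alpha * v        # broadcast over the sequence

seq_len, text_dim, vis_dim = 5, 8, 16
h = rng.normal(size=(seq_len, text_dim))                  # stand-in encoder states
w = rng.normal(size=(vis_dim, text_dim)) / np.sqrt(vis_dim)

# "Image-must" path: a (stand-in) real image feature vector.
real_img = rng.normal(size=vis_dim)
out_img = fuse(h, real_img, w)

# Noise path: Gaussian noise in place of the image, as in the paper's
# claim that the gains act like a regularizer reproducible with noise.
noise = rng.normal(size=vis_dim)
out_noise = fuse(h, noise, w)

assert out_img.shape == out_noise.shape == h.shape
```

Under this sketch, a training run conditioned on `noise` perturbs the text states in the same way a real image feature does, which is one concrete way the "regularization effect" hypothesis can be probed.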