MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling
Main Authors: | , , , |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2025-01-01 |
Series: | IEEE Access |
Subjects: | |
Online Access: | https://ieeexplore.ieee.org/document/10836678/ |
Summary: | In stereo matching, Convolutional Neural Networks (CNNs), a class of deep learning models designed to process grid-like data such as images, have traditionally been the predominant architecture. Although Transformer-based stereo models have been studied recently, their performance still lags behind that of CNN-based stereo models because of the data scarcity inherent in the stereo matching task. In this paper, we propose a Masked Image Modeling Distilled Stereo matching model, termed MaDis-Stereo, that enhances locality inductive bias by applying Masked Image Modeling (MIM), a technique that improves model performance by randomly masking parts of the input image, to the training of a Transformer-based stereo model. Given randomly masked stereo images as inputs, our method performs both image reconstruction and depth prediction. While this strategy helps alleviate the data scarcity issue, the dual tasks of reconstructing masked tokens and subsequently performing stereo matching pose significant challenges, particularly for training stability. To address this, we propose an auxiliary network (teacher) that is updated via Exponential Moving Average (EMA), a method for updating model weights by averaging them with previous weights under a decaying factor. Specifically, the teacher's knowledge is distilled and transferred to the original stereo model (student) through pseudo supervisory signals, which improves training stability and overall model performance. Experimentally, our approach achieves state-of-the-art results on the ETH3D benchmark and competitive performance on the KITTI 2015 benchmark. Our findings highlight the potential for extending this approach to other vision tasks, such as object detection and semantic segmentation, that require sufficient locality inductive bias in Transformer-based architectures. Code is available at: https://github.com/ja053199/madis/ |
---|---|
ISSN: | 2169-3536 |
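
The two mechanisms named in the summary, random masking of input tokens and an EMA-updated teacher, can be illustrated with a minimal sketch. This is not the authors' implementation (that lives in the linked repository). The function names `random_token_mask` and `ema_update`, the mask ratio of 0.5, the decay of 0.9, and the toy weight dictionaries are all assumptions chosen for illustration.

```python
import random


def random_token_mask(num_tokens, mask_ratio, rng):
    """Return a list of booleans, True where a token (image patch) is masked.

    Hypothetical helper: the mask ratio used in the paper is not stated
    in the summary, so any value passed here is an assumption.
    """
    num_masked = int(round(num_tokens * mask_ratio))
    masked_ids = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked_ids for i in range(num_tokens)]


def ema_update(teacher, student, decay):
    """Update teacher weights in place as an exponential moving average.

    teacher <- decay * teacher + (1 - decay) * student
    Weights are plain floats here; a real model would hold tensors.
    """
    for name, student_value in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * student_value
    return teacher


# Mask half of 16 tokens with a fixed seed so the run is repeatable.
rng = random.Random(0)
mask = random_token_mask(num_tokens=16, mask_ratio=0.5, rng=rng)

# One EMA step: the teacher moves a small fraction toward the student.
student = {"w": 1.0, "b": -2.0}
teacher = {"w": 0.0, "b": 0.0}
ema_update(teacher, student, decay=0.9)
```

With a decay close to 1, the teacher changes slowly between steps, which is the usual reason an EMA teacher gives steadier pseudo supervisory signals than the student's own rapidly changing weights.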