MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling

Bibliographic Details
Main Authors: Jihye Ahn, Hyesong Choi, Soomin Kim, Dongbo Min
Format: Article
Language: English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects: Exponential moving average; masked image modeling; pseudo depth label; stereo depth estimation
Online Access:https://ieeexplore.ieee.org/document/10836678/
author Jihye Ahn
Hyesong Choi
Soomin Kim
Dongbo Min
collection DOAJ
description In stereo matching, Convolutional Neural Networks (CNNs), a class of deep learning models designed to process grid-like data such as images, have traditionally served as the predominant architectures. Although Transformer-based stereo models have been studied recently, their performance still lags behind that of CNN-based stereo models due to the inherent data scarcity of the stereo matching task. In this paper, we propose a Masked Image Modeling Distilled Stereo matching model, termed MaDis-Stereo, that enhances locality inductive bias by leveraging Masked Image Modeling (MIM), a technique that improves model performance by randomly masking parts of the input image, when training a Transformer-based stereo model. Given randomly masked stereo images as inputs, our method performs both image reconstruction and depth prediction. While this strategy helps resolve the data scarcity issue, the dual tasks of reconstructing masked tokens and subsequently performing stereo matching pose significant challenges, particularly in terms of training stability. To address this, we propose to use an auxiliary network (teacher) that is updated via Exponential Moving Average (EMA), a method for updating model weights by averaging previous weights with a decaying factor. Specifically, the teacher's knowledge is distilled and transferred to the original stereo model (student) by providing pseudo supervisory signals, enhancing training stability and overall model performance. Experimentally, our approach achieves state-of-the-art results on the ETH3D benchmark and competitive performance on the KITTI 2015 benchmark. Our findings highlight the potential for extending this approach to other vision tasks, such as object detection and semantic segmentation, that require sufficient locality inductive bias in Transformer-based architectures. Code is available at: https://github.com/ja053199/madis/
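The two mechanisms the abstract describes, random masking of input patches (MIM) and the EMA-updated teacher, can be sketched in a few lines. The snippet below is an illustrative PyTorch-style sketch only, not the authors' released implementation (see the linked repository for that); the function names, the 0.999 decay, and the 0.4 mask ratio are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, decay: float = 0.999) -> None:
    """EMA teacher update: each teacher weight becomes a decayed average of its
    previous value and the corresponding current student weight.
    (decay=0.999 is an illustrative value, not taken from the paper.)"""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)


def random_mask_tokens(tokens: torch.Tensor, mask_ratio: float = 0.4):
    """MIM-style masking: hide a random subset of patch tokens.
    `tokens` has shape (batch, num_patches, dim); returns the masked tokens
    and a boolean mask marking which positions were hidden.
    (mask_ratio=0.4 is an illustrative value, not taken from the paper.)"""
    b, n, _ = tokens.shape
    num_masked = int(n * mask_ratio)
    # Random permutation of patch indices per sample; mask the first `num_masked`.
    ids = torch.rand(b, n, device=tokens.device).argsort(dim=1)
    mask = torch.zeros(b, n, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, ids[:, :num_masked], True)
    masked_tokens = tokens.masked_fill(mask.unsqueeze(-1), 0.0)
    return masked_tokens, mask
```

In a training loop of the kind the abstract outlines, the student would receive masked stereo images and be supervised by reconstruction and disparity losses together with pseudo supervisory signals from the teacher, and `ema_update` would be called after each optimizer step so the teacher tracks a smoothed copy of the student.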
format Article
id doaj-art-926d39c6b6424fbc85580f038b68fbc2
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling. IEEE Access, vol. 13, pp. 8912-8923, 2025-01-01. DOI: 10.1109/ACCESS.2025.3528022 (IEEE article no. 10836678). Authors: Jihye Ahn (https://orcid.org/0009-0004-5721-1965), Hyesong Choi, Soomin Kim (https://orcid.org/0009-0006-3440-6537), and Dongbo Min (https://orcid.org/0000-0003-4825-5240), all with the Department of Computer Science and Engineering, Ewha Womans University, Seoul, South Korea.
title MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling
topic Exponential moving average
masked image modeling
pseudo depth label
stereo depth estimation
url https://ieeexplore.ieee.org/document/10836678/