MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling

In stereo matching, Convolutional Neural Networks (CNNs), a class of deep learning models designed to process grid-like data such as images, have traditionally served as the predominant architectures. Although Transformer-based stereo models have been studied recently, their performance still lags b...

Bibliographic Details
Main Authors:	Jihye Ahn, Hyesong Choi, Soomin Kim, Dongbo Min
Format:	Article
Language:	English
Published:	IEEE 2025-01-01
Series:	IEEE Access
Subjects:	Exponential moving average; masked image modeling; pseudo depth label; stereo depth estimation
Online Access:	https://ieeexplore.ieee.org/document/10836678/

_version_	1832592875050762240
author	Jihye Ahn, Hyesong Choi, Soomin Kim, Dongbo Min
author_facet	Jihye Ahn, Hyesong Choi, Soomin Kim, Dongbo Min
author_sort	Jihye Ahn
collection	DOAJ
description	In stereo matching, Convolutional Neural Networks (CNNs), a class of deep learning models designed to process grid-like data such as images, have traditionally served as the predominant architectures. Although Transformer-based stereo models have been studied recently, their performance still lags behind CNN-based stereo models due to the inherent data scarcity issue in the stereo matching task. In this paper, we propose a Masked Image Modeling Distilled Stereo matching model, termed MaDis-Stereo, that enhances locality inductive bias by leveraging Masked Image Modeling (MIM), a technique that improves model performance by randomly masking parts of the input image, in training a Transformer-based stereo model. Given randomly masked stereo images as inputs, our method attempts to conduct both image reconstruction and depth prediction tasks. While this strategy is beneficial to resolving the data scarcity issue, the dual tasks of reconstructing masked tokens and subsequently performing stereo matching pose significant challenges, particularly in terms of training stability. To address this, we propose to use an auxiliary network (teacher) that is updated via Exponential Moving Average (EMA), a method for updating model weights by averaging previous weights with a decaying factor. Specifically, the teacher knowledge is distilled and transferred to the original stereo model (student) by providing pseudo supervisory signals, enhancing training stability and overall model performance. Experimentally, our approach achieves state-of-the-art results on the ETH3D benchmark and competitive performance on the KITTI 2015 benchmark. Our findings highlight the potential for extending this approach to other vision tasks, such as object detection and semantic segmentation, that require sufficient locality inductive bias in Transformer-based architectures. Code is available at: https://github.com/ja053199/madis/
format	Article
id	doaj-art-926d39c6b6424fbc85580f038b68fbc2
institution	Kabale University
issn	2169-3536
language	English
publishDate	2025-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj-art-926d39c6b6424fbc85580f038b68fbc2 | 2025-01-21T00:02:01Z | eng | IEEE | IEEE Access | 2169-3536 | 2025-01-01 | vol. 13, pp. 8912-8923 | 10.1109/ACCESS.2025.3528022 | 10836678 | MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling | Jihye Ahn (https://orcid.org/0009-0004-5721-1965), Hyesong Choi, Soomin Kim (https://orcid.org/0009-0006-3440-6537), Dongbo Min (https://orcid.org/0000-0003-4825-5240) | Department of Computer Science and Engineering, Ewha Womans University, Seoul, South Korea (all four authors) | In stereo matching, Convolutional Neural Networks (CNNs), a class of deep learning models designed to process grid-like data such as images, have traditionally served as the predominant architectures. Although Transformer-based stereo models have been studied recently, their performance still lags behind CNN-based stereo models due to the inherent data scarcity issue in the stereo matching task. In this paper, we propose a Masked Image Modeling Distilled Stereo matching model, termed MaDis-Stereo, that enhances locality inductive bias by leveraging Masked Image Modeling (MIM), a technique that improves model performance by randomly masking parts of the input image, in training a Transformer-based stereo model. Given randomly masked stereo images as inputs, our method attempts to conduct both image reconstruction and depth prediction tasks. While this strategy is beneficial to resolving the data scarcity issue, the dual tasks of reconstructing masked tokens and subsequently performing stereo matching pose significant challenges, particularly in terms of training stability. To address this, we propose to use an auxiliary network (teacher) that is updated via Exponential Moving Average (EMA), a method for updating model weights by averaging previous weights with a decaying factor. Specifically, the teacher knowledge is distilled and transferred to the original stereo model (student) by providing pseudo supervisory signals, enhancing training stability and overall model performance. Experimentally, our approach achieves state-of-the-art results on the ETH3D benchmark and competitive performance on the KITTI 2015 benchmark. Our findings highlight the potential for extending this approach to other vision tasks, such as object detection and semantic segmentation, that require sufficient locality inductive bias in Transformer-based architectures. Code is available at: https://github.com/ja053199/madis/ | https://ieeexplore.ieee.org/document/10836678/ | Exponential moving average; masked image modeling; pseudo depth label; stereo depth estimation
spellingShingle	Jihye Ahn, Hyesong Choi, Soomin Kim, Dongbo Min | MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling | IEEE Access | Exponential moving average; masked image modeling; pseudo depth label; stereo depth estimation
title	MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling
title_full	MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling
title_fullStr	MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling
title_full_unstemmed	MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling
title_short	MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling
title_sort	madis stereo enhanced stereo matching via distilled masked image modeling
topic	Exponential moving average; masked image modeling; pseudo depth label; stereo depth estimation
url	https://ieeexplore.ieee.org/document/10836678/
work_keys_str_mv	AT jihyeahn madisstereoenhancedstereomatchingviadistilledmaskedimagemodeling AT hyesongchoi madisstereoenhancedstereomatchingviadistilledmaskedimagemodeling AT soominkim madisstereoenhancedstereomatchingviadistilledmaskedimagemodeling AT dongbomin madisstereoenhancedstereomatchingviadistilledmaskedimagemodeling
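
Illustration (not part of the catalog record or the authors' code): the description field mentions two mechanisms, an EMA-updated teacher and random masking of the input. The sketch below shows only the generic form of each in plain Python. The function names `ema_update` and `random_mask`, the decay of 0.9, the mask ratio of 0.25, and the toy weights are all assumptions chosen for illustration; the record does not state the values or implementation used in MaDis-Stereo.

```python
import random


def ema_update(teacher, student, decay=0.99):
    """Generic EMA step: teacher <- decay * teacher + (1 - decay) * student.

    Both arguments map parameter names to lists of floats (hypothetical layout).
    """
    return {
        name: [decay * t + (1.0 - decay) * s
               for t, s in zip(teacher[name], student[name])]
        for name in teacher
    }


def random_mask(num_tokens, mask_ratio, rng):
    """Pick which token indices to hide, uniformly at random without repeats."""
    num_masked = int(num_tokens * mask_ratio)
    return sorted(rng.sample(range(num_tokens), num_masked))


# Toy weights, assumed for illustration only.
teacher = {"w": [0.0, 1.0]}
student = {"w": [1.0, 3.0]}

# One EMA step moves the teacher a small fraction of the way toward the student.
teacher = ema_update(teacher, student, decay=0.9)

# Hide a quarter of 16 image tokens; a seeded generator keeps the run repeatable.
masked = random_mask(num_tokens=16, mask_ratio=0.25, rng=random.Random(0))
```

With a decay of 0.9 the teacher weights move one tenth of the way toward the student, to approximately 0.1 and 1.2. A decay closer to 1 makes the teacher change more slowly, which is the usual reason an EMA teacher gives steadier targets than the student itself.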

Similar Items