MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling
In stereo matching, Convolutional Neural Networks (CNNs), a class of deep learning models designed to process grid-like data such as images, have traditionally served as the predominant architectures. Although Transformer-based stereo models have been studied recently, their performance still lags b...
Main Authors: | Jihye Ahn, Hyesong Choi, Soomin Kim, Dongbo Min |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
Subjects: | Exponential moving average; masked image modeling; pseudo depth label; stereo depth estimation |
Online Access: | https://ieeexplore.ieee.org/document/10836678/ |
_version_ | 1832592875050762240 |
---|---|
author | Jihye Ahn, Hyesong Choi, Soomin Kim, Dongbo Min |
author_facet | Jihye Ahn, Hyesong Choi, Soomin Kim, Dongbo Min |
author_sort | Jihye Ahn |
collection | DOAJ |
description | In stereo matching, Convolutional Neural Networks (CNNs), a class of deep learning models designed to process grid-like data such as images, have traditionally served as the predominant architectures. Although Transformer-based stereo models have been studied recently, their performance still lags behind that of CNN-based stereo models because of the data scarcity inherent in the stereo matching task. In this paper, we propose a Masked Image Modeling Distilled Stereo matching model, termed MaDis-Stereo, that enhances locality inductive bias by applying Masked Image Modeling (MIM), a technique that improves model performance by randomly masking parts of the input image, when training a Transformer-based stereo model. Given randomly masked stereo images as inputs, our method performs both image reconstruction and depth prediction. While this strategy helps with the data scarcity issue, the dual tasks of reconstructing masked tokens and subsequently performing stereo matching pose significant challenges, particularly for training stability. To address this, we propose an auxiliary network (teacher) that is updated via Exponential Moving Average (EMA), a method for updating model weights by averaging previous weights with a decaying factor. Specifically, the teacher's knowledge is distilled and transferred to the original stereo model (student) through pseudo supervisory signals, which improves training stability and overall model performance. Experimentally, our approach achieves state-of-the-art results on the ETH3D benchmark and competitive performance on the KITTI 2015 benchmark. Our findings highlight the potential for extending this approach to other vision tasks, such as object detection and semantic segmentation, that require sufficient locality inductive bias in Transformer-based architectures. Code is available at: https://github.com/ja053199/madis/ |
format | Article |
id | doaj-art-926d39c6b6424fbc85580f038b68fbc2 |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-926d39c6b6424fbc85580f038b68fbc2; 2025-01-21T00:02:01Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13, pp. 8912-8923; DOI 10.1109/ACCESS.2025.3528022; document 10836678; MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling; Jihye Ahn (https://orcid.org/0009-0004-5721-1965), Hyesong Choi, Soomin Kim (https://orcid.org/0009-0006-3440-6537), Dongbo Min (https://orcid.org/0000-0003-4825-5240); all four authors: Department of Computer Science and Engineering, Ewha Womans University, Seoul, South Korea; https://ieeexplore.ieee.org/document/10836678/; Exponential moving average; masked image modeling; pseudo depth label; stereo depth estimation |
spellingShingle | Jihye Ahn Hyesong Choi Soomin Kim Dongbo Min MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling IEEE Access Exponential moving average masked image modeling pseudo depth label stereo depth estimation |
title | MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling |
title_sort | madis stereo enhanced stereo matching via distilled masked image modeling |
topic | Exponential moving average; masked image modeling; pseudo depth label; stereo depth estimation |
url | https://ieeexplore.ieee.org/document/10836678/ |
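
The description above mentions two mechanisms: random masking of input patches and a teacher network updated by an Exponential Moving Average of the student's weights. The following is a minimal pure-Python sketch of both ideas, not the authors' implementation. The function names `ema_update` and `random_patch_mask`, the decay of 0.9, the mask ratio of 0.5, and the toy weight dictionaries are all illustrative assumptions; the record does not state the values or code the paper uses.

```python
import random


def ema_update(teacher, student, decay=0.999):
    """Blend each teacher weight with the matching student weight.

    teacher <- decay * teacher + (1 - decay) * student
    The default decay is an assumption, not a value from the paper.
    """
    for name, student_value in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * student_value
    return teacher


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return one boolean per patch; True marks a patch chosen for masking."""
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [index in masked for index in range(num_patches)]


# Toy weights stand in for real network parameters (hypothetical values).
student = {"w": 1.0, "b": -2.0}
teacher = {"w": 0.0, "b": 0.0}
ema_update(teacher, student, decay=0.9)

# Mask half of 16 patches with a seeded generator for repeatability.
mask = random_patch_mask(16, 0.5, random.Random(0))
```

With a decay close to 1 the teacher changes slowly, so its predictions are smoother over training steps than the student's, which is the usual motivation for using such a teacher as a source of pseudo labels.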