Transformer-Based Model for Monocular Visual Odometry: A Video Understanding Approach

Estimating the camera's pose given images from a single camera is a traditional task in mobile robots and autonomous vehicles. This problem is called monocular visual odometry, and it often relies on geometric approaches that require considerable engineering effort for a specific scenario. Deep learning methods have been shown to generalize well after proper training and with a large amount of available data. Transformer-based architectures have dominated the state of the art in natural language processing and in computer vision tasks such as image and video understanding. In this work, we treat monocular visual odometry as a video understanding task and estimate the 6 degrees of freedom of the camera's pose. We contribute the TSformer-VO model, which uses spatio-temporal self-attention mechanisms to extract features from clips and estimate motions in an end-to-end manner. Our approach achieves competitive performance compared with state-of-the-art geometry-based and deep learning-based methods on the KITTI visual odometry dataset, outperforming the DeepVO implementation widely accepted in the visual odometry community. The code is publicly available at https://github.com/aofrancani/TSformer-VO.
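
The abstract describes clip-level motion estimation with spatio-temporal self-attention. As a rough illustration of that idea only (this is not the authors' implementation; the class names, layer sizes, and hyperparameters below are invented for the sketch, and the real TSformer-VO code is in the linked repository), here is a minimal PyTorch model that patch-embeds a short clip, applies divided temporal-then-spatial self-attention in each block, and regresses one 6-DoF motion per consecutive frame pair:

```python
import torch
import torch.nn as nn


class DividedSpaceTimeBlock(nn.Module):
    # One Transformer block with divided space-time attention, in the
    # spirit of TimeSformer: temporal self-attention across frames,
    # then spatial self-attention within each frame, then an MLP.
    def __init__(self, dim, heads, frames, patches):
        super().__init__()
        self.frames, self.patches = frames, patches
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_m = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):  # x: (B, T*P, D), tokens ordered frame-major
        B, _, D = x.shape
        T, P = self.frames, self.patches
        # Temporal attention: tokens at the same spatial location
        # attend to each other across the T frames.
        xt = x.reshape(B, T, P, D).transpose(1, 2).reshape(B * P, T, D)
        h = self.norm_t(xt)
        xt = xt + self.attn_t(h, h, h, need_weights=False)[0]
        x = xt.reshape(B, P, T, D).transpose(1, 2).reshape(B, T * P, D)
        # Spatial attention: the P tokens of one frame attend to each other.
        xs = x.reshape(B * T, P, D)
        h = self.norm_s(xs)
        xs = xs + self.attn_s(h, h, h, need_weights=False)[0]
        x = xs.reshape(B, T * P, D)
        return x + self.mlp(self.norm_m(x))  # position-wise MLP


class ClipPoseRegressor(nn.Module):
    # Hypothetical end-to-end regressor: patch-embeds a clip of T RGB
    # frames and predicts one 6-DoF motion (3 translation + 3 rotation
    # parameters) per consecutive frame pair.
    def __init__(self, frames=2, img=224, patch=32, dim=128, depth=4):
        super().__init__()
        self.frames = frames
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_patches = (img // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, frames * n_patches, dim))
        self.blocks = nn.ModuleList(
            DividedSpaceTimeBlock(dim, 4, frames, n_patches)
            for _ in range(depth))
        self.head = nn.Linear(dim, 6 * (frames - 1))

    def forward(self, clip):  # clip: (B, T, 3, H, W)
        B, T, C, H, W = clip.shape
        x = self.embed(clip.reshape(B * T, C, H, W))  # (B*T, D, h, w)
        x = x.flatten(2).transpose(1, 2)              # (B*T, P, D)
        x = x.reshape(B, T * x.shape[1], -1) + self.pos
        for blk in self.blocks:
            x = blk(x)
        motions = self.head(x.mean(dim=1))  # mean-pool tokens, regress
        return motions.reshape(B, self.frames - 1, 6)


clip = torch.randn(1, 2, 3, 224, 224)   # one clip of two 224x224 frames
print(ClipPoseRegressor()(clip).shape)  # torch.Size([1, 1, 6])
```

The point of the divided design is cost: attending over the T frames and over the P patches separately is much cheaper than joint attention over all T*P tokens at once, which is what makes end-to-end clip-level processing tractable.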

Bibliographic Details
Main Authors: Andre O. Francani (ORCID: 0000-0001-6576-1132), Marcos R. O. A. Maximo (ORCID: 0000-0003-2944-4476)
Affiliation: Autonomous Computational Systems Laboratory, Aeronautics Institute of Technology, São José dos Campos, São Paulo, Brazil
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access, vol. 13, pp. 13959-13971, article no. 10845764
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2025.3531667
Subjects: Deep learning; monocular visual odometry; transformer; video understanding
Online Access: https://ieeexplore.ieee.org/document/10845764/